article thumbnail

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

AWS Big Data

In addition to using native managed AWS services that BMS didn’t need to worry about upgrading, BMS was looking to offer an ETL service to non-technical business users that could visually compose data transformation workflows and seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine.

article thumbnail

BMW Cloud Efficiency Analytics powered by Amazon QuickSight and Amazon Athena

AWS Big Data

It seamlessly consolidates data from various data sources within AWS, including AWS Cost Explorer (and forecasting with Cost Explorer ), AWS Trusted Advisor , and AWS Compute Optimizer. Data providers and consumers are the two fundamental users of a CDH dataset. You might notice that this differs slightly from traditional ETL.

Insiders

Sign Up for our Newsletter

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

article thumbnail

Cloudera’s Open Data Lakehouse Supercharged with dbt Core(tm)

Cloudera

We’re excited to announce the general availability of the open source adapters for dbt for all the engines in CDP — Apache Hive , Apache Impala , and Apache Spark, with added support for Apache Livy and Cloudera Data Engineering. This variety can result in a lack of standardization, leading to data duplication and inconsistency.

article thumbnail

Modernize a legacy real-time analytics application with Amazon Managed Service for Apache Flink

AWS Big Data

Traditionally, such a legacy call center analytics platform would be built on a relational database that stores data from streaming sources. Data transformations through stored procedures and use of materialized views to curate datasets and generate insights is a known pattern with relational databases.

article thumbnail

Amazon EMR on EKS widens the performance gap: Run Apache Spark workloads 5.37 times faster and at 4.3 times lower cost

AWS Big Data

We also share a Spark benchmark solution that suits all Amazon EMR deployment options, so you can replicate the process in your environment for your own performance test cases. The solution uses the TPC-DS dataset and unmodified data schema and table relationships, but derives queries from TPC-DS to support the SparkSQL test cases.

Testing 72
article thumbnail

Cloudera DataFlow Designer: The Key to Agile Data Pipeline Development

Cloudera

Allows them to iteratively develop processing logic and test with as little overhead as possible. Plays nice with existing CI/CD processes to promote a data pipeline to production. Provides monitoring, alerting, and troubleshooting for production data pipelines.

Testing 82
article thumbnail

The Ultimate Guide to Modern Data Quality Management (DQM) For An Effective Data Quality Control Driven by The Right Metrics

datapine

This person (or group of individuals) ensures that the theory behind data quality is communicated to the development team. 2 – Data profiling. Data profiling is an essential process in the DQM lifecycle. This means there are no unintended data errors, and it corresponds to its appropriate designation (e.g.,