Data Leaders Brief

how-to-run-queries-periodically-in-apache-hive

Orchestrate Amazon EMR Serverless Spark jobs with Amazon MWAA, and data validation using Amazon Athena

AWS Big Data

DECEMBER 12, 2023

Many data engineers today use Apache Airflow to build, schedule, and monitor their data pipelines. Amazon Managed Workflows for Apache Airflow (Amazon MWAA) can help simplify the process of building, running, and managing data pipelines. You can use standard SQL to interact with data.

Data Processing

Data Processing Management Statistics Interactive

How the GoDaddy data platform achieved over 60% cost reduction and 50% performance boost by adopting Amazon EMR Serverless

AWS Big Data

MARCH 12, 2024

In this post, we discuss how we enhanced operational efficiency with Amazon EMR Serverless. Using best practices learned from the AWS FinHack program, we fine-tuned resource-intensive jobs, converted Pig and Hive jobs to Spark, and reduced our batch workload spend by 22.75% in 2022. … PB of data from its data center to EMR on EC2.

Cost-Benefit

Cost-Benefit Optimization Big Data Metrics

Webinars

The Path to Product Excellence: Avoiding Common Pitfalls and Enhancing Communication

Peak Performance: Continuous Testing & Evaluation of LLM-Based Applications

Manufacturing Sustainability Surge: Your Guide to Data-Driven Energy Optimization & Decarbonization

From Developer Experience to Product Experience: How a Shared Focus Fuels Product Success

Understanding User Needs and Satisfying Them

MORE WEBINARS

Trending Sources

Optimizing Cloudera Data Engineering Autoscaling Performance

Cloudera

SEPTEMBER 2, 2021

Normally on-premises, one of the key challenges was how to allocate resources within a finite set of resources (i.e., …). When building CDE, we integrated with Apache YuniKorn, which offers rich scheduling capabilities on Kubernetes. We tested the scaling capabilities of CDE with the following job runs to mimic a real-world scenario: …

Optimization

Optimization Testing Cost-Benefit Measurement

Migrate Hive data from CDH to CDP public cloud

Cloudera

JUNE 25, 2021

Many Cloudera customers are making the transition from being completely on-prem to cloud by either backing up their data in the cloud, or running multi-functional analytics on CDP Public cloud in AWS or Azure. The Replication Manager service facilitates both disaster recovery and data migration across different environments.

Data Lake

Data Lake Metadata Unstructured Data Management

Build a real-time GDPR-aligned Apache Iceberg data lake

AWS Big Data

FEBRUARY 24, 2023

AWS contributed the Apache Iceberg integration with the AWS Glue Data Catalog, which enables you to use open-source data computation engines like Apache Spark with Iceberg on AWS Glue. In 2022, Amazon Athena announced support of Iceberg, enabling transaction queries on S3 objects.

Data Lake

Data Lake Metadata Testing Data Warehouse

Fraud Detection With Cloudera Stream Processing Part 2: Real-Time Streaming Analytics

Cloudera

JULY 18, 2022

In part 1 of this blog we discussed how Cloudera DataFlow for the Public Cloud (CDF-PC), the universal data distribution service powered by Apache NiFi, can make it easy to acquire data from wherever it originates and move it efficiently to make it available to other applications in a streaming fashion. Data decays! Use case recap.

Analytics

Analytics Dashboards Statistics Visualization

Admission Control Architecture for Cloudera Data Platform

Cloudera

OCTOBER 8, 2021

Apache Impala is a massively parallel in-memory SQL engine supported by Cloudera designed for Analytics and ad hoc queries against data stored in Apache Hive, Apache HBase and Apache Kudu tables. Anatomy of Impala Query Execution. Introduction.

Data Processing

Data Processing Statistics Risk Optimization

Fine-Grained Authorization with Apache Kudu and Apache Ranger

Cloudera

FEBRUARY 11, 2021

… which made it possible to restrict access only to Apache Impala, where Apache Sentry policies could be applied, enabling a lot more use cases. How it works.

Metadata

Metadata Management IT Analytics

Cost Conscious Data Warehousing with Cloudera Data Platform

Cloudera

DECEMBER 10, 2020

Continuous resource consumption in the cloud (billable on-demand by a running clock) makes no sense today because a better option is available: resource consumption that starts when you need it and stops when you don’t. If not, before adopting a cloud data warehouse, consider the true costs of a cloud-native data warehouse.

Data Warehouse

Data Warehouse Metadata Cost-Benefit Optimization

Materialized Views in Hive for Iceberg Table Format

Cloudera

FEBRUARY 8, 2024

Apache Iceberg is a high-performance open table format for petabyte-scale analytic datasets. It brings the reliability and simplicity of SQL tables to big data while enabling engines like Hive, Impala, Spark, Trino, Flink, and Presto to work with the same tables at the same time. Such a query pattern is quite common in BI queries.

Snapshot

Snapshot Metadata Cost-Benefit Data Warehouse

Implement fine-grained access control in Amazon SageMaker Studio and Amazon EMR using Apache Ranger and Microsoft Active Directory

AWS Big Data

NOVEMBER 8, 2023

SageMaker Studio comes with built-in integration with Amazon EMR, enabling data scientists to interactively prepare data at petabyte scale using frameworks such as Apache Spark, Hive, and Presto right from SageMaker Studio notebooks.

Testing

Testing Modeling Management Machine Learning

How Zoom implemented streaming log ingestion and efficient GDPR deletes using Apache Hudi on Amazon EMR

AWS Big Data

MAY 16, 2023

Laws such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) require organizations to retain application logs for a specific period of time. In this post, we explore the architecture and the benefits it provides for Zoom and its users.

Data Lake

Data Lake Cost-Benefit Optimization Testing

An Introduction to Ranger RMS

Cloudera

OCTOBER 5, 2021

Cloudera Data Platform (CDP) has supported access controls on tables and columns, as well as on files and directories, via Apache Ranger since its first release. Unfortunately, in such instances you would have to create and maintain separate Ranger policies for both Hive and HDFS that correspond to each other. How does it help?

Management

Management Metadata IT Interactive

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

This data is then projected into analytics services such as data warehouses, search systems, stream processors, query editors, notebooks, and machine learning (ML) models through direct access, real-time, and batch workflows. … Apache Iceberg 1.2.0, and Delta Lake 2.3.0.

Data Lake

Data Lake Metadata Optimization Statistics

Preprocess and fine-tune LLMs quickly and cost-effectively using Amazon EMR Serverless and Amazon SageMaker

AWS Big Data

FEBRUARY 1, 2024

The customer was looking for how they could achieve this in the most cost-effective way on AWS. Solution overview: In the following sections, we first introduce the Common Crawl dataset and how to explore and filter the data we need.

Modeling

Modeling Metadata Data Processing Unstructured Data

How Amazon optimized its high-volume financial reconciliation process with Amazon EMR for higher scalability and performance

AWS Big Data

MARCH 28, 2024

This encouraged us to reinvent our reconciliation service powered by AWS services, including Amazon EMR and the Apache Spark distributed processing framework, which uses PySpark. Account reconciliation is an important step to ensure the completeness and accuracy of financial statements. We processed the data sequentially using Python.

Optimization

Optimization IT Big Data Data Processing

Build incremental data pipelines to load transactional data changes using AWS DMS, Delta 2.0, and Amazon EMR Serverless

AWS Big Data

MARCH 3, 2023

Amazon EMR Serverless is a serverless option in Amazon EMR that makes it easy for data analysts and engineers to run open-source big data analytics frameworks without configuring, managing, and scaling clusters or servers. Incremental data is generated in the PostgreSQL table by running custom SQL scripts.

Data Lake

Data Lake Dashboards Metrics Metadata

Auditing to external systems in CDP Private Cloud Base

Cloudera

MAY 26, 2021

… in the same time period. All data access is authorized via Attribute-Based Access Control or Role-Based Access Control with Apache Ranger as part of SDX. Insider Threats? 31% in two years?

Reporting

Reporting Management Measurement Testing

Topics to watch at the Strata Data Conference in New York 2019

O'Reilly on Data

SEPTEMBER 11, 2019

The Strata Data Conferences helped chronicle the birth of big data, as well as the emergence of data science, streaming, and machine learning (ML) as disruptive phenomena. Strata attracts the leading names in the fields of data management, data engineering, analytics, ML, and artificial intelligence (AI). The term “ML” is No.

IoT

IoT Big Data Data Warehouse Uncertainty
