Data Lake, Data Processing and Information

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

AWS Big Data

JUNE 10, 2024

For detailed information on managing your Apache Hive metastore using Lake Formation permissions, refer to Query your Apache Hive metastore with AWS Lake Formation permissions. In this post, we present a methodology for deploying a data mesh consisting of multiple Hive data warehouses across EMR clusters.

Data Lake

Data Lake Metadata Data Warehouse Data Processing

Migrate an existing data lake to a transactional data lake using Apache Iceberg

AWS Big Data

OCTOBER 3, 2023

A data lake is a centralized repository that you can use to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data and then run different types of analytics for better business insights.

Data Lake

Data Lake Metadata Snapshot Recreation/Entertainment

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JULY 20, 2023

With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.

Data Lake

Data Lake Analytics Snapshot Optimization

Webinars

Leading the Development of Profitable and Sustainable Products

How To Get Promoted In Product Management

MORE WEBINARS

Create an Apache Hudi-based near-real-time transactional data lake using AWS DMS, Amazon Kinesis, AWS Glue streaming ETL, and data visualization using Amazon QuickSight

AWS Big Data

AUGUST 3, 2023

Data analytics on operational data at near-real time is becoming a common need. Due to the exponential growth of data volume, it has become common practice to replace read replicas with data lakes to have better scalability and performance. For more information, see Changing the default settings for your data lake.

Data Lake

Data Lake Visualization Dashboards Insurance

Build a data lake with Apache Flink on Amazon EMR

AWS Big Data

JANUARY 27, 2023

For example, the Flink FileSystem connector has FileSystemTableFactory to read/write data in Hadoop Distributed File System (HDFS) or Amazon Simple Storage Service (Amazon S3), the Flink HBase connector has HBase2DynamicTableFactory to read/write data in HBase, and the Flink Kafka connector has KafkaDynamicTableFactory to read/write data in Kafka.

Data Lake

Data Lake Metadata Business Analysis Data-driven

Secure cloud fabric: Enhancing data management and AI development for the federal government

CIO Business Intelligence

DECEMBER 19, 2023

In recent years, government agencies have increasingly turned to cloud computing to manage vast amounts of data and streamline operations. While cloud technology has many benefits, it also poses security risks, especially when it comes to protecting sensitive information.

Data Lake

Data Lake Management Cost-Benefit Data Processing

Building and Evaluating GenAI Knowledge Management Systems using Ollama, Trulens and Cloudera

Cloudera

MAY 23, 2024

In modern enterprises, the exponential growth of data means organizational knowledge is distributed across multiple formats, ranging from structured data stores such as data warehouses to multi-format data stores like data lakes. This makes gathering information for decision making a challenge.

Management

Management Metrics Data Processing Data Lake

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging.

Data Lake

Data Lake Analytics Dashboards Metrics

Query your Apache Hive metastore with AWS Lake Formation permissions

AWS Big Data

JULY 20, 2023

The Hive metastore is a repository of metadata about the SQL tables, such as database names, table names, schema, serialization and deserialization information, data location, and partition details of each table. Therefore, organizations have come to host huge volumes of metadata of their structured datasets in the Hive metastore.

Data Lake

Data Lake Metadata Data Processing Big Data

Configure cross-Region table access with the AWS Glue Catalog and AWS Lake Formation

AWS Big Data

AUGUST 3, 2023

Today’s modern data lakes span multiple accounts, AWS Regions, and lines of business in organizations. It’s important that their data solution gives them the ability to share and access data securely and safely across Regions. A resource link is a Data Catalog object that is a link to a database or table.

Data Lake

Data Lake Metadata Management Data Processing

Empowering data-driven excellence: How the Bluestone Data Platform embraced data mesh for success

AWS Big Data

FEBRUARY 27, 2024

Each data producer within the organization has its own data lake in Apache Hudi format, ensuring data sovereignty and autonomy. This enables data-driven decision-making across the organization. Lake Formation – Lake Formation emerged as a cornerstone in Bluestone’s data governance strategy.

Data-driven

Data-driven Data Lake Data Quality Data Governance

Introducing AWS Glue crawler and create table support for Apache Iceberg format

AWS Big Data

AUGUST 16, 2023

Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback. Iceberg captures metadata information on the state of datasets as they evolve and change over time. In the navigation pane, under Data catalog , choose Settings.

Data Lake

Data Lake Metadata Snapshot Management

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

AWS Big Data

DECEMBER 13, 2023

For the past 5 years, BMS has used a custom framework called Enterprise Data Lake Services (EDLS) to create ETL jobs for business users. BMS’s EDLS platform hosts over 5,000 jobs and is growing at 15% YoY (year over year). The Python utility reads the metadata information, parses it, and configures the nodes automatically.

Metadata

Metadata Data Lake Visualization Data Transformation

How Gilead used Amazon Redshift to quickly and cost-effectively load third-party medical claims data

AWS Big Data

NOVEMBER 8, 2023

They recently needed to do a monthly load of 140 TB of uncompressed healthcare claims data in under 24 hours after receiving it to provide analysts and data scientists with up-to-date information on a patient’s healthcare journey. This data volume is expected to increase monthly and is fully refreshed each month.

Data Lake

Data Lake Data Warehouse Cost-Benefit Optimization

Implement alerts in Amazon OpenSearch Service with PagerDuty

AWS Big Data

JUNE 8, 2023

In today’s fast-paced digital world, businesses rely heavily on their data to make informed decisions. This data is often stored and analyzed using various tools, such as Amazon OpenSearch Service , a powerful search and analytics service offered by AWS. For Host , enter events.PagerDuty.com. For Method , choose POST.

Data Lake

Data Lake Dashboards Metrics Testing

TDC Digital leverages IBM Cloud for transparent billing and improved customer satisfaction

IBM Big Data Hub

MAY 19, 2023

With its flagship product, Mebook, TDC Digital offers a comprehensive ERP system that can cater to the specific information system requirements of different departments within a company. Furthermore, TDC Digital had not used any cloud storage solution and experienced latency and downtime while hosting the application in its data center.

Unstructured Data

Unstructured Data Data Processing Manufacturing Data Lake

Access Amazon Athena in your applications using the WebSocket API

AWS Big Data

MARCH 2, 2023

Many organizations are building data lakes to store and analyze large volumes of structured, semi-structured, and unstructured data. In addition, many teams are moving towards a data mesh architecture, which requires them to expose their data sets as easily consumable data products.

Data Lake

Data Lake Testing Interactive Unstructured Data

How smava makes loans transparent and affordable using Amazon Redshift Serverless

AWS Big Data

DECEMBER 21, 2023

To bring their customers the best deals and user experience, smava follows the modern data architecture principles with a data lake as a scalable, durable data store and purpose-built data stores for analytical processing and data consumption.

Data Lake

Data Lake Data Warehouse Data-driven B2B

Join a streaming data source with CDC data for real-time serverless data analytics using AWS Glue, AWS DMS, and Amazon DynamoDB

AWS Big Data

MAY 30, 2023

Customers have been using data warehousing solutions to perform their traditional analytics tasks. Recently, data lakes have gained lot of traction to become the foundation for analytical solutions, because they come with benefits such as scalability, fault tolerance, and support for structured, semi-structured, and unstructured datasets.

Data Lake

Data Lake Data Analytics Analytics Data Processing

Migrate a petabyte-scale data warehouse from Actian Vectorwise to Amazon Redshift

AWS Big Data

MAY 30, 2024

Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you securely access your data in operational databases, data lakes, or third-party datasets with minimal movement or copying of data.

Data Warehouse

Data Warehouse Data Lake Cost-Benefit Structured Data

CIOs weigh where to place AI bets — and how to de-risk them

CIO Business Intelligence

MARCH 18, 2024

Though a multicloud environment, the agency has most of its cloud implementations hosted on Microsoft Azure, with some on AWS and some on ServiceNow’s 311 citizen information platform. For a typical project that will likely involve a Snowflake data lake hosted currently on Azure, Menon stresses that quality of data is critical. “AI

Risk

Risk Cost-Benefit Data Processing Testing

Dairyland powers up for a generative AI edge

CIO Business Intelligence

APRIL 9, 2024

Beginning in 2021, the Minneapolis-based Microsoft partner helped Dairyland migrate from several custom legacy applications to a commercial implementation of Dynamics 365 and an Azure data lake, which set the stage for the power company’s early foray into AI, according to the systems integrator.

Digital Transformation

Digital Transformation Machine Learning Data Lake Software

What a quarter century of digital transformation at PayPal looks like

CIO Business Intelligence

OCTOBER 4, 2023

At the lowest layer is the infrastructure, made up of databases and data lakes. These applications live on innumerable servers, yet some technology is hosted in the public cloud. Data that unlocks value at both ends is key. This is, according to Shivananda, the brain of PayPal.

Digital Transformation

Digital Transformation Deep Learning Data Lake Risk

Set up advanced rules to validate quality of multiple datasets with AWS Glue Data Quality

AWS Big Data

JUNE 6, 2023

Data is the lifeblood of modern businesses. In today’s data-driven world, companies rely on data to make informed decisions, gain a competitive edge, and provide exceptional customer experiences. However, not all data is created equal. AWS Glue Data Quality measures and monitors the quality of your dataset.

Data Quality

Data Quality Data Lake Visualization Data-driven

10 Keys to a Secure Cloud Data Lakehouse

Cloudera

OCTOBER 25, 2022

The data lakehouse is gaining in popularity because it enables a single platform for all your enterprise data with the flexibility to run any analytic and machine learning (ML) use case. Cloud data lakehouses provide significant scaling, agility, and cost advantages compared to cloud data lakes and cloud data warehouses.

Data Processing

Data Processing Data Lake Cost-Benefit Risk

Enhance query performance using AWS Glue Data Catalog column-level statistics

AWS Big Data

NOVEMBER 22, 2023

Data lakes are designed for storing vast amounts of raw, unstructured, or semi-structured data at a low cost, and organizations share those datasets across multiple departments and teams. The queries on these large datasets read vast amounts of data and can perform complex join operations on multiple datasets.

Statistics

Statistics Data Lake Optimization Data-driven

Create your Private Data Warehousing Environment Using Azure Kubernetes Service

Cloudera

DECEMBER 2, 2021

The following sections provide additional details on other aspects of how this is implemented, as well as information on steps to take to set this up for yourself. In addition to AKS and the load balancers mentioned above, this includes VNET, Data Lake Storage, PostgreSQL Azure database, and more. Enter “0.0.0.0/0” Next Steps.

Data Lake

Data Lake Data Warehouse Data Processing Interactive

Accelerating revenue growth with real-time analytics: Poshmark’s journey

AWS Big Data

MARCH 20, 2023

IP translation – The IP addresses present in events will be translated to city, state, and zip, and enriched with other information to implement near-real-time, location-aware services encompassing security-related functions as well as personalization functions. Operational dashboards are hosted on Grafana integrated with Druid.

Analytics

Analytics Slice and Dice Data Processing Data Lake

Create an end-to-end data strategy for Customer 360 on AWS

AWS Big Data

MARCH 26, 2024

This view is used to identify patterns and trends in customer behavior, which can inform data-driven decisions to improve business outcomes. Although the identity processing solution helps build the unified customer profile, we recommend considering this as part of your data processing capabilities.

Data Strategy

Data Strategy Strategy Data Warehouse Prescriptive Analytics

Run Spark SQL on Amazon Athena Spark

AWS Big Data

OCTOBER 23, 2023

Modern applications store massive amounts of data on Amazon Simple Storage Service (Amazon S3) data lakes, providing cost-effective and highly durable storage, and allowing you to run analytics and machine learning (ML) from your data lake to generate insights on your data.

Data Lake

Data Lake Visualization Optimization Interactive

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

AWS Big Data

APRIL 26, 2024

Set up EMR Studio In this step, we demonstrate the actions needed from the data lake administrator to set up EMR Studio enabled for trusted identity propagation and with IAM Identity Center integration. For more information, refer to Granting table permissions using the named resource method. Select Named Data Catalog resources.

Analytics

Analytics Data Lake Management Enterprise

How Data Governance Protects Sensitive Data

erwin

APRIL 2, 2021

An assessment of the data breaches that crop up like weeds each year supports the conclusion that companies, absent data governance, wind up building security architectures strictly from a technical perspective. Do You Know Where Your Sensitive Data Is? Data is a valuable asset used to operate, manage and grow a business.

Data Governance

Data Governance Cost-Benefit Risk Metadata

10 Things AWS Can Do for Your SaaS Company

Smart Data Collective

FEBRUARY 20, 2022

Data storage databases. Your SaaS company can store and protect any amount of data using Amazon Simple Storage Service (S3), which is ideal for data lakes, cloud-native applications, and mobile apps. Hopefully, it was informative and helpful to you. Well, let’s find out. Artificial intelligence (AI).

Cost-Benefit

Cost-Benefit Data Lake Software Machine Learning

Get Your Analytics Insights Instantly – Without Abandoning Central IT

Cloudera

JANUARY 21, 2021

Cloudera’s Data Warehouse service allows raw data to be stored in the cloud storage of your choice (S3, ADLSg2). It will be stored in your own namespace, and not force you to move data into someone else’s proprietary file formats or hosted storage. Proprietary file formats mean no one else is invited in! Separate compute.

Data Lake

Data Lake Data Warehouse IT Analytics

Build incremental data pipelines to load transactional data changes using AWS DMS, Delta 2.0, and Amazon EMR Serverless

AWS Big Data

MARCH 3, 2023

Building data lakes from continuously changing transactional data of databases and keeping data lakes up to date is a complex task and can be an operational challenge. You can then apply transformations and store data in Delta format for managing inserts, updates, and deletes. Choose Create stack.

Data Lake

Data Lake Dashboards Metrics Metadata

Periscope Data Expands to Israel, Empowering Data Teams with Powerful Tools

Sisense

DECEMBER 11, 2019

The challenge is to do it right, and a crucial way to achieve it is with decisions based on data and analysis that drive measurable business results. This was the key learning from the Sisense event heralding the launch of Periscope Data in Tel Aviv, Israel — the beating heart of the startup nation. What VCs want from startups.

Data Lake

Data Lake Big Data Sales Data-driven

Announcing the 2020 Data Impact Award Winners

Cloudera

NOVEMBER 18, 2020

The technological linchpin of its digital transformation has been its Enterprise Data Architecture & Governance platform. It hosts over 150 big data analytics sandboxes across the region with over 200 users utilizing the sandbox for data discovery. Data Security & Governance.

Internet Publishing and Broadcasting

Internet Publishing and Broadcasting Data-driven Broadcasting Digital Transformation

How Cargotec uses metadata replication to enable cross-account data sharing

AWS Big Data

JUNE 7, 2023

Cargotec captures terabytes of IoT telemetry data from their machinery operated by numerous customers across the globe. This data needs to be ingested into a data lake, transformed, and made available for analytics, machine learning (ML), and visualization. The job runs in the target account.

Metadata

Metadata Data Lake Machine Learning Big Data

Analyze Amazon S3 storage costs using AWS Cost and Usage Reports, Amazon S3 Inventory, and Amazon Athena

AWS Big Data

FEBRUARY 2, 2023

Since its launch in 2006, Amazon Simple Storage Service (Amazon S3) has experienced major growth, supporting multiple use cases such as hosting websites, creating data lakes, serving as object storage for consumer applications, storing logs, and archiving data. This could be your data lake or application S3 bucket.

Reporting

Reporting Data Lake Management Optimization

Automate deployment of an Amazon QuickSight analysis connecting to an Amazon Redshift data warehouse with an AWS CloudFormation template

AWS Big Data

FEBRUARY 16, 2023

For more information about creating visuals, see Adding visuals to Amazon QuickSight analyses. Create an Amazon Redshift data source in AWS CloudFormation In this step, we add the AWS::QuickSight::DataSource section of the CloudFormation template. On the All analyses tab of the QuickSight start page, choose New analysis.

Data Warehouse

Data Warehouse Sales Visualization Data Processing

Break data silos and stream your CDC data with Amazon Redshift streaming and Amazon MSK

AWS Big Data

DECEMBER 13, 2023

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. This solution uses Amazon Aurora MySQL hosting the example database salesdb.

Data Warehouse

Data Warehouse Snapshot Data Processing Management

Power enterprise-grade Data Vaults with Amazon Redshift – Part 2

AWS Big Data

NOVEMBER 16, 2023

Amazon Redshift is a popular cloud data warehouse, offering a fully managed cloud-based service that seamlessly integrates with an organization’s Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional workflows, and much more—all while providing up to 7.9x

Enterprise

Enterprise Data Warehouse Snapshot Cost-Benefit

5 ways to maximize your cloud investment

CIO Business Intelligence

JANUARY 10, 2024

For example, if a cloud vendor hosts a data lake that requires operational technology data to synchronize and feed back into a decision algorithm on the production line, we measure latency. CIOs should then use that information to measure cloud value relative to business operations.

Cost-Benefit

Cost-Benefit Measurement Optimization Metrics

Build efficient ETL pipelines with AWS Step Functions distributed map and redrive feature

AWS Big Data

DECEMBER 18, 2023

Solution overview One of the common functionalities involved in data pipelines is extracting data from multiple data sources and exporting it to a data lake or synchronizing the data to another database. For more information, refer to IAM Policies for invoking AWS Glue job from Step Functions.

Metadata

Metadata Visualization Data Lake Data-driven

Design a data mesh pattern for Amazon EMR-based data lakes using AWS Lake Formation with Hive metastore federation

Migrate an existing data lake to a transactional data lake using Apache Iceberg

Webinars

Trending Sources

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

Webinars

Create an Apache Hudi-based near-real-time transactional data lake using AWS DMS, Amazon Kinesis, AWS Glue streaming ETL, and data visualization using Amazon QuickSight

Build a data lake with Apache Flink on Amazon EMR

Secure cloud fabric: Enhancing data management and AI development for the federal government

Building and Evaluating GenAI Knowledge Management Systems using Ollama, Trulens and Cloudera

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

Query your Apache Hive metastore with AWS Lake Formation permissions

Configure cross-Region table access with the AWS Glue Catalog and AWS Lake Formation

Empowering data-driven excellence: How the Bluestone Data Platform embraced data mesh for success

Introducing AWS Glue crawler and create table support for Apache Iceberg format

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

How Gilead used Amazon Redshift to quickly and cost-effectively load third-party medical claims data

Implement alerts in Amazon OpenSearch Service with PagerDuty

TDC Digital leverages IBM Cloud for transparent billing and improved customer satisfaction

Access Amazon Athena in your applications using the WebSocket API

How smava makes loans transparent and affordable using Amazon Redshift Serverless

Join a streaming data source with CDC data for real-time serverless data analytics using AWS Glue, AWS DMS, and Amazon DynamoDB

Migrate a petabyte-scale data warehouse from Actian Vectorwise to Amazon Redshift

CIOs weigh where to place AI bets — and how to de-risk them

Dairyland powers up for a generative AI edge

What a quarter century of digital transformation at PayPal looks like

Set up advanced rules to validate quality of multiple datasets with AWS Glue Data Quality

10 Keys to a Secure Cloud Data Lakehouse

Enhance query performance using AWS Glue Data Catalog column-level statistics

Create your Private Data Warehousing Environment Using Azure Kubernetes Service

Accelerating revenue growth with real-time analytics: Poshmark’s journey

Create an end-to-end data strategy for Customer 360 on AWS

Run Spark SQL on Amazon Athena Spark

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

How Data Governance Protects Sensitive Data

10 Things AWS Can Do for Your SaaS Company

Get Your Analytics Insights Instantly – Without Abandoning Central IT

Build incremental data pipelines to load transactional data changes using AWS DMS, Delta 2.0, and Amazon EMR Serverless

Periscope Data Expands to Israel, Empowering Data Teams with Powerful Tools

Announcing the 2020 Data Impact Award Winners

How Cargotec uses metadata replication to enable cross-account data sharing

Analyze Amazon S3 storage costs using AWS Cost and Usage Reports, Amazon S3 Inventory, and Amazon Athena

Automate deployment of an Amazon QuickSight analysis connecting to an Amazon Redshift data warehouse with an AWS CloudFormation template

Break data silos and stream your CDC data with Amazon Redshift streaming and Amazon MSK

Power enterprise-grade Data Vaults with Amazon Redshift – Part 2

5 ways to maximize your cloud investment

Build efficient ETL pipelines with AWS Step Functions distributed map and redrive feature

Stay Connected