Data Lake, Data Processing and Management


Migrate an existing data lake to a transactional data lake using Apache Iceberg

AWS Big Data

OCTOBER 3, 2023

A data lake is a centralized repository that you can use to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data and then run different types of analytics for better business insights. We will use AWS Region us-east-1.

Data Lake

Data Lake Metadata Snapshot Recreation/Entertainment

Important Considerations When Migrating to a Data Lake

Smart Data Collective

MARCH 30, 2022

Azure Data Lake Storage Gen2 is based on Azure Blob storage and offers a suite of big data analytics features. If you don’t understand the concept, you might want to check out our previous article on the difference between data lakes and data warehouses. Determine your preparedness. Authentication.

Data Lake

Data Lake Cost-Benefit Data Warehouse Big Data

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JULY 20, 2023

With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.

Data Lake

Data Lake Analytics Snapshot Optimization

Build a serverless transactional data lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena

AWS Big Data

MARCH 10, 2023

Since the deluge of big data over a decade ago, many organizations have learned to build applications to process and analyze petabytes of data. Data lakes have served as a central repository to store structured and unstructured data at any scale and in various formats.

Data Lake

Data Lake Sales Data Warehouse Snapshot

Building and Evaluating GenAI Knowledge Management Systems using Ollama, Trulens and Cloudera

Cloudera

MAY 23, 2024

In modern enterprises, the exponential growth of data means organizational knowledge is distributed across multiple formats, ranging from structured data stores such as data warehouses to multi-format data stores like data lakes. Trulens), but this can be much more complex at an enterprise-level to manage.

Management

Management Metrics Data Processing Data Lake

Secure cloud fabric: Enhancing data management and AI development for the federal government

CIO Business Intelligence

DECEMBER 19, 2023

In recent years, government agencies have increasingly turned to cloud computing to manage vast amounts of data and streamline operations. To address these challenges, agencies are turning to a secure cloud fabric that can ensure the confidentiality, integrity, and availability of their data in the cloud.

Data Lake

Data Lake Management Cost-Benefit Data Processing

Create an Apache Hudi-based near-real-time transactional data lake using AWS DMS, Amazon Kinesis, AWS Glue streaming ETL, and data visualization using Amazon QuickSight

AWS Big Data

AUGUST 3, 2023

Data analytics on operational data at near-real time is becoming a common need. Due to the exponential growth of data volume, it has become common practice to replace read replicas with data lakes to have better scalability and performance. Apache Hudi connector for AWS Glue For this post, we use AWS Glue 4.0,

Data Lake

Data Lake Visualization Dashboards Insurance

Top 15 data management platforms

CIO Business Intelligence

JUNE 9, 2022

A data management platform (DMP) is a group of tools designed to help organizations collect and manage data from a wide array of sources and to create reports that help explain what is happening in those data streams. Deploying a DMP can be a great way for companies to navigate a business world dominated by data.

Management

Management Advertising Data Lake Sales

Top 15 data management platforms available today

CIO Business Intelligence

SEPTEMBER 22, 2023

Data management platform definition: A data management platform (DMP) is a suite of tools that helps organizations to collect and manage data from a wide array of first-, second-, and third-party sources and to create reports and build customer profiles as part of targeted personalization campaigns.

Management

Management Advertising Data Lake Sales

Habib Bank manages data at scale with Cloudera Data Platform

Cloudera

NOVEMBER 17, 2022

We needed a solution to manage our data at scale, to provide greater experiences to our customers. With Cloudera Data Platform, we aim to unlock value faster and offer consistent data security and governance to meet this goal. HBL aims to double its banked customers by 2025.

Management

Management Data Lake Consulting Unstructured Data

Build a data lake with Apache Flink on Amazon EMR

AWS Big Data

JANUARY 27, 2023

The AWS Glue Data Catalog provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos. Apache Flink is a widely used data processing engine for scalable streaming ETL, analytics, and event-driven applications. Apache Hudi also has its own catalog management.

Data Lake

Data Lake Metadata Business Analysis Data-driven

Data Management Requirements for the Enterprise Data Lake

In(tegrate) the Clouds

MAY 1, 2016

SnapLogic published Eight Data Management Requirements for the Enterprise Data Lake. They are: Storage and Data Formats. The company also recently hosted a webinar on Democratizing the Data Lake with Constellation Research and published 2 whitepapers from Mark Madsen. Ingest and Delivery.

Data Lake

Data Lake Enterprise Management Metadata

Governing data in relational databases using Amazon DataZone

AWS Big Data

MAY 7, 2024

Data governance is a key enabler for teams adopting a data-driven culture and operational model to drive innovation with data. Amazon DataZone allows you to simply and securely govern end-to-end data assets stored in your Amazon Redshift data warehouses or data lakes cataloged with the AWS Glue data catalog.

Metadata

Metadata Data Lake Data Processing Data-driven

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

Organizations often need to manage a high volume of data that is growing at an extraordinary rate. At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. We think of this concept as inside-out data movement.

Data Lake

Data Lake Analytics Dashboards Metrics

Query your Apache Hive metastore with AWS Lake Formation permissions

AWS Big Data

JULY 20, 2023

The Hive metastore is a repository of metadata about the SQL tables, such as database names, table names, schema, serialization and deserialization information, data location, and partition details of each table. Therefore, organizations have come to host huge volumes of metadata of their structured datasets in the Hive metastore.

Data Lake

Data Lake Metadata Data Processing Big Data

Stitch Fix seamless migration: Transitioning from self-managed Kafka to Amazon MSK

AWS Big Data

SEPTEMBER 22, 2023

At Stitch Fix, we have been powered by data science since its foundation and rely on many modern data lake and data processing technologies. In our infrastructure, Apache Kafka has emerged as a powerful tool for managing event streams and facilitating real-time data processing.

Management

Management Metrics Cost-Benefit Data Lake

Eight Top DataOps Trends for 2022

DataKitchen

NOVEMBER 29, 2021

Data Gets Meshier. 2022 will bring further momentum behind modular enterprise architectures like data mesh. The data mesh addresses the problems characteristic of large, complex, monolithic data architectures by dividing the system into discrete domains managed by smaller, cross-functional teams.

Testing

Testing Data Lake Data Architecture Manufacturing

Configure cross-Region table access with the AWS Glue Catalog and AWS Lake Formation

AWS Big Data

AUGUST 3, 2023

Today’s modern data lakes span multiple accounts, AWS Regions, and lines of business in organizations. It’s important that their data solution gives them the ability to share and access data securely and safely across Regions. For example, we are using a data lake administrator role called LF-Admin.

Data Lake

Data Lake Metadata Management Data Processing

BMC on BMC: How the company enables IT observability with BMC Helix and AIOps

CIO Business Intelligence

DECEMBER 7, 2023

The organization has 500 applications for business services, 80,000 VMs, 3,000 hosts, and more than 100,000 containers. BMC needed a solution to transform this large volume of data and enable observability to understand thousands of events as a single scenario.

IT Data Lake Business Services Data Processing

Empowering data-driven excellence: How the Bluestone Data Platform embraced data mesh for success

AWS Big Data

FEBRUARY 27, 2024

In this post, we explore how Bluestone uses AWS services, notably the cloud data warehousing service Amazon Redshift, to implement a cutting-edge data mesh architecture, revolutionizing the way they manage, access, and utilize their data assets. This enables data-driven decision-making across the organization.

Data-driven

Data-driven Data Lake Data Quality Data Governance

Migrate a petabyte-scale data warehouse from Actian Vectorwise to Amazon Redshift

AWS Big Data

MAY 30, 2024

Amazon Redshift is a fast, scalable, and fully managed cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. The system had an integration with legacy backend services that were all hosted on premises.

Data Warehouse

Data Warehouse Data Lake Cost-Benefit Structured Data

AWS Glue crawlers support cross-account crawling to support data mesh architecture

AWS Big Data

MARCH 27, 2023

Data lakes have come a long way, and there’s been tremendous innovation in this space. Today’s modern data lakes are cloud native, work with multiple data types, and make this data easily available to diverse stakeholders across the business.

Data Lake

Data Lake Data-driven Management Data Architecture

Introducing AWS Glue crawler and create table support for Apache Iceberg format

AWS Big Data

AUGUST 16, 2023

Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback. AWS Glue crawlers will extract schema information and update the location of Iceberg metadata and schema updates in the Data Catalog. Choose Next. Choose Create.

Data Lake

Data Lake Metadata Snapshot Management

Modernize your ETL platform with AWS Glue Studio: A case study from BMS

AWS Big Data

DECEMBER 13, 2023

In addition to using native managed AWS services that BMS didn’t need to worry about upgrading, BMS was looking to offer an ETL service to non-technical business users that could visually compose data transformation workflows and seamlessly run them on the AWS Glue Apache Spark-based serverless data integration engine.

Metadata

Metadata Data Lake Visualization Data Transformation

DS Smith sets a single-cloud agenda for sustainability

CIO Business Intelligence

DECEMBER 6, 2023

Its digital transformation began with an application modernization phase, in which Dickson and her IT teams determined which applications should be hosted in the public cloud and which should remain on a private cloud. Here, Dickson sees data generated from its industrial machines being very productive.

Manufacturing

Manufacturing Data Lake Digital Transformation Machine Learning

How smava makes loans transparent and affordable using Amazon Redshift Serverless

AWS Big Data

DECEMBER 21, 2023

To speed up the self-service analytics and foster innovation based on data, a solution was needed to provide ways to allow any team to create data products on their own in a decentralized manner. To create and manage the data products, smava uses Amazon Redshift, a cloud data warehouse.

Data Lake

Data Lake Data Warehouse Data-driven B2B

How Gilead used Amazon Redshift to quickly and cost-effectively load third-party medical claims data

AWS Big Data

NOVEMBER 8, 2023

Because Gilead is expanding into biologics and large molecule therapies, and has an ambitious goal of launching 10 innovative therapies by 2030, there is heavy emphasis on using data with AI and machine learning (ML) to accelerate the drug discovery pipeline. This data volume is expected to increase monthly and is fully refreshed each month.

Data Lake

Data Lake Data Warehouse Cost-Benefit Optimization

Preparing the foundations for Generative AI

CIO Business Intelligence

FEBRUARY 20, 2024

All data is held in a lake-centric hub, and protected by a strong, universal security model, with data loss prevention and protection for sensitive data, and features for auditing and forensic investigation already built-in. If this all seems challenging, Avanade can help.

Cost-Benefit

Cost-Benefit Data Lake Data Warehouse Data Integration

Migrate Hive data from CDH to CDP public cloud

Cloudera

JUNE 25, 2021

Many Cloudera customers are making the transition from being completely on-prem to cloud by either backing up their data in the cloud, or running multi-functional analytics on CDP Public cloud in AWS or Azure. The Replication Manager service facilitates both disaster recovery and data migration across different environments.

Data Lake

Data Lake Metadata Unstructured Data Management

CIOs weigh where to place AI bets — and how to de-risk them

CIO Business Intelligence

MARCH 18, 2024

“We need to be ready to respond to our CEO to solve problems with AI,” says Srini Gudipati, CIO of Covanta, a company that specializes in sustainable materials management, including large-scale recycling. Our data team uses gen AI on Amazon cloud to explore sustainability metrics. AI tools rely on the data in use in these solutions.

Risk

Risk Cost-Benefit Data Processing Testing

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics, Part 3: Visualization and trend analysis using Amazon QuickSight

AWS Big Data

MARCH 29, 2024

Typically, you have multiple accounts to manage and run resources for your data pipeline. To make the dataset and analysis visible for you, complete the following steps: On the QuickSight console, navigate to the user menu and choose Manage QuickSight. In the navigation pane, choose Manage assets. Choose SHARE.

Metrics

Metrics Visualization Dashboards Interactive

Your New Cloud for AI May Be Inside a Colo

CIO Business Intelligence

MAY 23, 2022

Many companies whose AI model training infrastructure is not proximal to their data lake incur steeper costs as the data sets grow larger and AI models become more complex. Companies such as Cyxtera, Digital Realty and Equinix, among others, offer hosting, managing and operations services for AI infrastructure.

Experimentation

Experimentation Cost-Benefit Data Lake Data Science

The essential check list for effective data democratization

CIO Business Intelligence

JANUARY 20, 2023

Of course, cost is a big consideration, says Orlandini, as well as deciding where to host the data, and having it available in a fiscally responsible way. An organization might also question if the data should be maintained on-premises due to security concerns in the public cloud. “They have data swamps,” he says.

Data Lake

Data Lake Data-driven Finance Data Architecture

Implement alerts in Amazon OpenSearch Service with PagerDuty

AWS Big Data

JUNE 8, 2023

With automated alerting with a third-party service like PagerDuty, an incident management platform, combined with the robust and powerful alerting plugin provided by OpenSearch Service, businesses can proactively manage and respond to critical events. For instructions, refer to Creating and managing Amazon OpenSearch Service domains.

Data Lake

Data Lake Dashboards Metrics Testing

Dairyland powers up for a generative AI edge

CIO Business Intelligence

APRIL 9, 2024

Melby pushed machine learning models into production very early at Dairyland, improving the cooperative’s weather forecasting capabilities and creating load management applications that “bent the curve” to best manage the company’s power load on peak days, the CIO says.

Digital Transformation

Digital Transformation Machine Learning Data Lake Software

The disruptive potential of open data lakehouse architectures and IBM watsonx.data

IBM Big Data Hub

JUNE 15, 2023

It is comprised of commodity cloud object storage, open data and open table formats, and high-performance open-source query engines. To help organizations scale AI workloads, we recently announced IBM watsonx.data, a data store built on an open data lakehouse architecture and part of the watsonx AI and data platform.

Data Warehouse

Data Warehouse Data Lake Optimization Data-driven

BusinessObjects in the Cloud – No Big Rush and No Big Deal

Paul Blogs on BI

SEPTEMBER 8, 2021

Well firstly, if the main data warehouses, repositories, or application databases that BusinessObjects accesses are on premise, it makes no sense to move BusinessObjects to the cloud until you move its data sources to the cloud. You also have the option of hosting with a third party.

Data Warehouse

Data Warehouse Data Processing Data Lake Testing

Access Amazon Athena in your applications using the WebSocket API

AWS Big Data

MARCH 2, 2023

Many organizations are building data lakes to store and analyze large volumes of structured, semi-structured, and unstructured data. In addition, many teams are moving towards a data mesh architecture, which requires them to expose their data sets as easily consumable data products.

Data Lake

Data Lake Testing Interactive Unstructured Data

Enhance query performance using AWS Glue Data Catalog column-level statistics

AWS Big Data

NOVEMBER 22, 2023

Data lakes are designed for storing vast amounts of raw, unstructured, or semi-structured data at a low cost, and organizations share those datasets across multiple departments and teams. The queries on these large datasets read vast amounts of data and can perform complex join operations on multiple datasets.

Statistics

Statistics Data Lake Optimization Data-driven

Join a streaming data source with CDC data for real-time serverless data analytics using AWS Glue, AWS DMS, and Amazon DynamoDB

AWS Big Data

MAY 30, 2023

Customers have been using data warehousing solutions to perform their traditional analytics tasks. Recently, data lakes have gained a lot of traction to become the foundation for analytical solutions, because they come with benefits such as scalability, fault tolerance, and support for structured, semi-structured, and unstructured datasets.

Data Lake

Data Lake Data Analytics Analytics Data Processing

TDC Digital leverages IBM Cloud for transparent billing and improved customer satisfaction

IBM Big Data Hub

MAY 19, 2023

Furthermore, TDC Digital had not used any cloud storage solution and experienced latency and downtime while hosting the application in its data center. TDC Digital is excited about its plans to host its IT infrastructure in IBM data centers, offering better scalability, performance and security.

Unstructured Data

Unstructured Data Data Processing Manufacturing Data Lake

10 Keys to a Secure Cloud Data Lakehouse

Cloudera

OCTOBER 25, 2022

The data lakehouse is gaining in popularity because it enables a single platform for all your enterprise data with the flexibility to run any analytic and machine learning (ML) use case. Cloud data lakehouses provide significant scaling, agility, and cost advantages compared to cloud data lakes and cloud data warehouses.

Data Processing

Data Processing Data Lake Cost-Benefit Risk

Use your corporate identities for analytics with Amazon EMR and AWS IAM Identity Center

AWS Big Data

APRIL 26, 2024

To enable your workforce users for analytics with fine-grained data access controls and audit data access, you might have to create multiple AWS Identity and Access Management (IAM) roles with different data permissions and map the workforce users to one of those roles. AWS CloudTrail captures user data access activities.

Analytics

Analytics Data Lake Management Enterprise

What a quarter century of digital transformation at PayPal looks like

CIO Business Intelligence

OCTOBER 4, 2023

PayPal, like many other large companies, suffers attacks every second, and we can only manage this volume of threats through an architecture with reinforced security layers and solid technology, such as AI.” At the lowest layer is the infrastructure, made up of databases and data lakes. Stability is another objective.

Digital Transformation

Digital Transformation Deep Learning Data Lake Risk
