Data Analytics, Data Lake, Data Warehouse and Metadata

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

MARCH 2, 2023

Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback.

Data Lake

Data Lake Data Processing Metadata Snapshot

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.

Data Lake

Data Lake Metadata Optimization Statistics

How Ruparupa gained updated insights with an Amazon S3 data lake, AWS Glue, Apache Hudi, and Amazon QuickSight

AWS Big Data

FEBRUARY 22, 2023

In this post, we show how Ruparupa implemented an incrementally updated data lake to get insights into their business using Amazon Simple Storage Service (Amazon S3), AWS Glue, Apache Hudi, and Amazon QuickSight. An AWS Glue ETL job, using the Apache Hudi connector, updates the S3 data lake hourly with incremental data.

Data Lake

Data Lake Dashboards Cost-Benefit Metadata

Empower your Jira data in a data lake with Amazon AppFlow and AWS Glue

AWS Big Data

AUGUST 1, 2023

Although Jira Cloud provides reporting capability, loading this data into a data lake will facilitate enrichment with other business data, as well as support the use of business intelligence (BI) tools and artificial intelligence (AI) and machine learning (ML) applications. For InitialRunFlag, choose Setup.

Data Lake

Data Lake Data Transformation Cost-Benefit Data-driven

What is an open data lakehouse and why you should care?

IBM Big Data Hub

JANUARY 17, 2023

A data lakehouse is an emerging data management architecture that converges data warehouse and data lake capabilities, driven by a need to improve efficiency and obtain critical insights faster. Let’s start with why data lakehouses are becoming increasingly important.

Data Lake

Data Lake Metadata Data Warehouse Data Governance

Data governance in the age of generative AI

AWS Big Data

FEBRUARY 29, 2024

First, many LLM use cases rely on enterprise knowledge that needs to be drawn from unstructured data such as documents, transcripts, and images, in addition to structured data from data warehouses. As part of the transformation, the objects need to be treated to ensure data privacy (for example, PII redaction).

Data Governance

Data Governance Unstructured Data Metadata Data Lake

What is a Data Mesh?

DataKitchen

AUGUST 3, 2021

First-generation – expensive, proprietary enterprise data warehouse and business intelligence platforms maintained by a specialized team drowning in technical debt. Second-generation – gigantic, complex data lake maintained by a specialized team drowning in technical debt. See the pattern?

Data Architecture

Data Architecture Data Lake Cost-Benefit Data Warehouse

The Future of the Data Lakehouse – Open

Cloudera

JUNE 18, 2022

Cloudera customers run some of the biggest data lakes on earth. These lakes power mission critical large scale data analytics, business intelligence (BI), and machine learning use cases, including enterprise data warehouses. On data warehouses and data lakes.

Data Lake

Data Lake Data Warehouse Machine Learning Cost-Benefit

The Future of the Data Lakehouse – Open

CIO Business Intelligence

JUNE 23, 2022

Cloudera customers run some of the biggest data lakes on earth. These lakes power mission critical large scale data analytics, business intelligence (BI), and machine learning use cases, including enterprise data warehouses. On data warehouses and data lakes.

Data Lake

Data Lake Data Warehouse Machine Learning Cost-Benefit

How smava makes loans transparent and affordable using Amazon Redshift Serverless

AWS Big Data

DECEMBER 21, 2023

To speed up self-service analytics and foster innovation based on data, a solution was needed to allow any team to create data products on their own in a decentralized manner. To create and manage the data products, smava uses Amazon Redshift, a cloud data warehouse.

Data Lake

Data Lake Data Warehouse Data-driven B2B

Use Amazon Athena with Spark SQL for your open-source transactional table formats

AWS Big Data

JANUARY 24, 2024

AWS-powered data lakes, supported by the unmatched availability of Amazon Simple Storage Service (Amazon S3), can handle the scale, agility, and flexibility required to combine different data and analytics approaches. It will never remove files that are still required by a non-expired snapshot.

Snapshot

Snapshot Data Lake Metadata Optimization

Build a multi-Region and highly resilient modern data architecture using AWS Glue and AWS Lake Formation

AWS Big Data

JANUARY 24, 2023

This solution only replicates metadata in the Data Catalog, not the actual underlying data. To have a redundant data lake using Lake Formation and AWS Glue in an additional Region, we recommend replicating the Amazon S3-based storage using S3 replication, S3 sync, aws-s3-copy-sync-using-batch or S3 Batch replication process.

Data Architecture

Data Architecture Metadata Data Lake Snapshot

Create an end-to-end data strategy for Customer 360 on AWS

AWS Big Data

MARCH 26, 2024

Profile aggregation – When you’ve uniquely identified a customer, you can build applications in Managed Service for Apache Flink to consolidate all their metadata, from name to interaction history. Then, you transform this data into a concise format. Let’s find out what role each of these components plays in the context of C360.

Data Strategy

Data Strategy Strategy Data Warehouse Prescriptive Analytics

Unlock data across organizational boundaries using Amazon DataZone – now generally available

AWS Big Data

OCTOBER 4, 2023

An Amazon DataZone domain contains an associated business data catalog for search and discovery, a set of metadata definitions to decorate the data assets that are used for discovery purposes, and data projects with integrated analytics and ML tools for users and groups to consume and publish data assets.

Metadata

Metadata Data Lake Publishing Data Governance

Implement a serverless CDC process with Apache Iceberg using Amazon DynamoDB and Amazon Athena

AWS Big Data

AUGUST 16, 2023

Apache Iceberg is an open table format for very large analytic datasets. Iceberg manages large collections of files as tables, and it supports modern analytical data lake operations such as record-level insert, update, delete, and time travel queries. We fetch the metadata of the users_xxxxxx table from Athena.

Data Lake

Data Lake Metadata Testing Snapshot

How HR&A uses Amazon Redshift spatial analytics on Amazon Redshift Serverless to measure digital equity in states across the US

AWS Big Data

DECEMBER 5, 2023

A combination of Amazon Redshift Spectrum and COPY commands is used to ingest the survey data stored as CSV files. For the files with unknown structures, AWS Glue crawlers are used to extract metadata and create table definitions in the Data Catalog. She helps customers architect data analytics solutions at scale on AWS.

Measurement

Measurement Dashboards Data Warehouse Analytics

A hybrid approach in healthcare data warehousing with Amazon Redshift

AWS Big Data

FEBRUARY 21, 2023

Data warehouses play a vital role in healthcare decision-making and serve as a repository of historical data. A healthcare data warehouse can be a single source of truth for clinical quality control systems. What is a dimensional data model? What is a data vault?

Data Warehouse

Data Warehouse Data Lake Cost-Benefit Modeling

Accelerate HiveQL with Oozie to Spark SQL migration on Amazon EMR

AWS Big Data

APRIL 19, 2023

Many customers run big data workloads such as extract, transform, and load (ETL) on Apache Hive to create a data warehouse on Hadoop. We split the solution into two primary components: generating Spark job metadata and running the SQL on Amazon EMR. The script generates a metadata JSON file for each step.

Metadata

Metadata Testing Data Lake Consulting

Augmented data management: Data fabric versus data mesh

IBM Big Data Hub

APRIL 27, 2022

Gartner defines a data fabric as “a design concept that serves as an integrated layer of data and connecting processes. The data fabric architectural approach can simplify data access in an organization and facilitate self-service data consumption at scale.

Management

Management Metadata Data Architecture Data Lake

Data architecture strategy for data quality

IBM Big Data Hub

JANUARY 5, 2023

The right data architecture can help your organization improve data quality because it provides the framework that determines how data is collected, transported, stored, secured, used and shared for business intelligence and data science use cases. Perform data quality monitoring based on pre-configured rules.

Data Quality

Data Quality Data Architecture Strategy Data Lake

Data Mesh 101: How Data Mesh Helps Organizations Be Data-Driven and Achieve Velocity

Ontotext

FEBRUARY 12, 2024

A centralized data engineering team focuses on building a governed self-serviced infrastructure, while domain teams use the services to build full-stack data products. However, data mesh is not about introducing new technologies. by building data products with domain owners.

Data-driven

Data-driven Data Lake Data Quality Business Objectives

The Madness of Data (and analytics) Governance

Andrew White

DECEMBER 9, 2019

The client had recently engaged with a well-known consulting company that had recommended a large data catalog effort to collect all enterprise metadata to help identify all data and business issues. Modern data (and analytics) governance does not necessarily need: Wall-to-wall discovery of your data and metadata.

Analytics

Analytics Data Lake Data Governance Metadata

Exploring real-time streaming for generative AI Applications

AWS Big Data

MARCH 25, 2024

Streaming jobs constantly ingest new data to synchronize across systems and can perform enrichment, transformations, joins, and aggregations across windows of time more efficiently. With a file system sink connector, Apache Flink jobs can deliver data to Amazon S3 in open format (such as JSON, Avro, Parquet, and more) files as data objects.

Data Lake

Data Lake Unstructured Data Management Modeling

Trends in Data Management and Analytics

TDAN

MARCH 19, 2019

Various databases, plus one or more data warehouses, have been the state-of-the-art data management infrastructure in companies for years. The emergence of various new concepts, technologies, and applications such as Hadoop, Tableau, R, Power BI, or Data Lakes indicates that changes are under way.

Management

Management Data Lake Data Warehouse Analytics

Overcome these six data consumption challenges for a more data-driven enterprise

IBM Big Data Hub

JUNE 8, 2022

A typical organization’s data landscape consists of a large number of data stores across workflows, business processes and business units, including but not limited to data warehouses, data marts, data lakes, ODS, cloud data stores, and CRM databases. The volume of data assets.

Data-driven

Data-driven Enterprise Data Governance Data Lake

Five benefits of a data catalog

IBM Big Data Hub

DECEMBER 16, 2022

For example, data catalogs have evolved to deliver governance capabilities like managing data quality and data privacy and compliance. It uses metadata and data management tools to organize all data assets within your organization. Technical metadata to describe schemas, indexes and other database objects.

Metadata

Metadata Data Quality Data-driven Data Governance

How SumUp made digital analytics more accessible using AWS Glue

AWS Big Data

JUNE 6, 2023

Unless, of course, the rest of their data also resides in the Google Cloud. In this post we showcase how we used AWS Glue to move siloed digital analytics data, with inconsistent arrival times, to AWS S3 (our Data Lake) and our central data warehouse (DWH), Snowflake.

Analytics

Analytics Data Lake Testing Optimization

Strategically Approaching Graph Technologies

Ontotext

FEBRUARY 26, 2024

Firstly, on the data maturity spectrum, the vast majority of organizations I’ve spoken with are stuck in the information stage. They have massive amounts of data they’re collecting and storing in their relational databases, document stores, data lakes, and data warehouses.

Technology

Technology Cost-Benefit Data-driven Metadata

The Very Group adopts a data catalog to better organize and leverage its online retail capabilities

CIO Business Intelligence

SEPTEMBER 6, 2022

As a result, Pimblett now runs the organization’s data warehouse, analytics, and business intelligence. Establishing a clear and unified approach to data. Very has come full circle as a business built on catalog data, but it took some introspection in order to figure out the best way to get there.

IT

IT Forecasting Data Lake Enterprise

Announcing the 2021 Data Impact Awards

Cloudera

MAY 12, 2021

Use cases could include but are not limited to: workload analysis and replication, migrating or bursting to cloud, data warehouse optimization, and more. Winners included: Connect the Data Lifecycle: Globe Telecom — Raising experience standards and helping customers live enhanced mobile lifestyles.

Digital Transformation

Digital Transformation Machine Learning Optimization Data Lake

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

AWS Big Data

JANUARY 8, 2024

This is the first post to a blog series that offers common architectural patterns in building real-time data streaming infrastructures using Kinesis Data Streams for a wide range of use cases. In this post, we will review the common architectural patterns of two use cases: Time Series Data Analysis and Event Driven Microservices.

Analytics

Analytics IoT Data-driven Snapshot

Build incremental data pipelines to load transactional data changes using AWS DMS, Delta 2.0, and Amazon EMR Serverless

AWS Big Data

MARCH 3, 2023

Building data lakes from continuously changing transactional data of databases and keeping data lakes up to date is a complex task and can be an operational challenge. You can then apply transformations and store data in Delta format for managing inserts, updates, and deletes.

Data Lake

Data Lake Dashboards Metrics Metadata

Week in the Life of an Analyst at Gartner US IT Symposium (virtual) 2021

Andrew White

OCTOBER 22, 2021

Lakehouse (data warehouse and data lake working together) 8. Data Literacy, training, coordination, collaboration 8. Data Management Infrastructure/Data Fabric 5. Data Integration tactics 4. Metadata Strategy 3. Figure 3: The Data and Analytics (infrastructure) Continuum.

IT

IT Data Lake Strategy Data Science

The Gartner 2021 Leadership Vision for Data & Analytics Leaders Webinar Q&A

Andrew White

JANUARY 11, 2021

Note: Delivery of data, analytics solutions and the sustainment of technology, data and services is a question. Data and Analytics Governance: What’s Broken, and What We Need To Do To Fix It. Link Data to Business Outcomes. Data lakes don’t offer this nor should they. Governance. Architecture.

Data Analytics

Data Analytics Analytics Data-driven Finance

What is going on in the world of data and analytics?

Andrew White

MARCH 22, 2019

to weave together the governance and management of master data, application data, and less-widely shared data, and just enough enterprise metadata management. Your Future Requires You to Define Your Real Master Data. This ties into the failure of data governance and MDM (see first item in this list).

Analytics

Analytics Metadata Data Governance Management

Achieve your AI goals with an open data lakehouse approach

IBM Big Data Hub

OCTOBER 4, 2023

Another IDC study showed that while 2/3 of respondents reported using AI-driven data analytics, most reported that less than half of the data under management is available for this type of analytics. from 2022 to 2026. New insights and relationships are found in this combination. All of this supports the use of AI.

Data Lake

Data Lake Metadata Cost-Benefit Data Warehouse

Building a Beautiful Data Lakehouse

CIO Business Intelligence

MARCH 9, 2022

Applying artificial intelligence (AI) to data analytics for deeper, better insights and automation is a growing enterprise IT priority. But the data repository options that have been around for a while tend to fall short in their ability to serve as the foundation for big data analytics powered by AI.

Data Lake

Data Lake Unstructured Data Data Warehouse Data Quality

Regeneron turns to IT to accelerate drug discovery

CIO Business Intelligence

NOVEMBER 4, 2022

Then the data is consumed by SaaS-based computational tools, but it still sits within our organization and sits within the controls of our cloud-based solutions.” Much of Regeneron’s data, of course, is confidential. For that reason, many of its data tools — and even its data lake — were built in-house using AWS. “We

Data Lake

Data Lake IT Experimentation Data-driven

Introducing watsonx: The future of AI for business

IBM Big Data Hub

MAY 9, 2023

With watsonx.data, businesses can quickly connect to data, get trusted insights and reduce data warehouse costs. A data store built on open lakehouse architecture, it runs both on premises and across multi-cloud environments. Savings may vary depending on configurations, workloads and vendors.

Data Warehouse

Data Warehouse Cost-Benefit Machine Learning Modeling

The Data Scientist’s Guide to the Data Catalog

Alation

JULY 19, 2022

As they attempt to put machine learning models into production, data science teams encounter many of the same hurdles that plagued data analytics teams in years past: Finding trusted, valuable data is time-consuming. Obstacles, such as user roles, permissions, and approval requests, prevent speedy data access.

Metadata

Metadata Data Quality Statistics Data Science

How data stores and governance impact your AI initiatives

IBM Big Data Hub

OCTOBER 12, 2023

Among the tasks necessary for internal and external compliance is the ability to report on the metadata of an AI model. Metadata includes details specific to an AI model such as: The AI model’s creation (when it was created, who created it, etc.)

Cost-Benefit

Cost-Benefit Metadata Data Governance Modeling

Tackling AI’s data challenges with IBM databases on AWS

IBM Big Data Hub

MARCH 14, 2024

The solution: IBM databases on AWS. To solve for these challenges, IBM’s portfolio of SaaS database solutions on Amazon Web Services (AWS) enables enterprises to scale applications, analytics and AI across the hybrid cloud landscape.

Cost-Benefit

Cost-Benefit Metadata Optimization Management

The Cloud Connection: How Governance Supports Security

Alation

APRIL 14, 2022

And, as organizations progress and grow, “data drift” starts to impact data usage, models, and your business. In today’s AI/ML-driven world of data analytics, explainability needs a repository just as much as those doing the explaining need access to metadata, e.g., information about the data being used.

Metadata

Metadata Data Governance Modeling Data-driven

Cloudera Named a Leader in the 2022 Gartner® Magic Quadrant™ for Cloud Database Management Systems (DBMS)

Cloudera

DECEMBER 16, 2022

We’re proud to be recognized for the data management and data analytics innovations we have delivered in the new Cloudera Data Platform (CDP). Cloudera has always been in the forefront of disruptive technical innovation in data platforms. Cloudera has long had the capabilities of a data lakehouse, if not the label.

Management

Management Metadata Machine Learning Data Lake
