Analytics, Data Lake, Metadata and Software

How Cloudinary transformed their petabyte scale streaming data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JUNE 10, 2024

In this blog post, we dive into different data aspects and how Cloudinary addresses the two concerns of vendor lock-in and cost-efficient data analytics by using Apache Iceberg, Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon EMR, and AWS Glue.

Data Lake

Data Lake Metadata Snapshot Analytics

Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake

AWS Big Data

APRIL 3, 2024

They understand that a one-size-fits-all approach no longer works, and recognize the value in adopting scalable, flexible tools and open data formats to support interoperability in a modern data architecture to accelerate the delivery of new solutions.

Data Lake

Data Lake Snapshot Metadata Data Architecture

Use Apache Iceberg in a data lake to support incremental data processing

AWS Big Data

MARCH 2, 2023

Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback.

Data Lake

Data Lake Data Processing Metadata Snapshot

Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics

AWS Big Data

JULY 20, 2023

With data becoming the driving force behind many industries today, having a modern data architecture is pivotal for organizations to be successful. In this post, we describe Orca’s journey building a transactional data lake using Amazon Simple Storage Service (Amazon S3), Apache Iceberg, and AWS Analytics.

Data Lake

Data Lake Analytics Snapshot Optimization

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.

Data Lake

Data Lake Metadata Optimization Statistics

Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes

AWS Big Data

MAY 24, 2023

When you build your transactional data lake using Apache Iceberg to solve your functional use cases, you need to focus on operational use cases for your S3 data lake to optimize the production environment. The snapshots that have expired show the latest snapshot ID as null.

Data Lake

Data Lake Snapshot Metadata Optimization

Build a real-time GDPR-aligned Apache Iceberg data lake

AWS Big Data

FEBRUARY 24, 2023

Data lakes are a popular choice for today’s organizations to store their data around their business activities. As a best practice of a data lake design, data should be immutable once stored. A data lake built on AWS uses Amazon Simple Storage Service (Amazon S3) as its primary storage environment.

Data Lake

Data Lake Metadata Testing Data Warehouse

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging.

Data Lake

Data Lake Analytics Dashboards Metrics

Empower your Jira data in a data lake with Amazon AppFlow and AWS Glue

AWS Big Data

AUGUST 1, 2023

In the world of software engineering and development, organizations use project management tools like Atlassian Jira Cloud. Companies often take a data lake approach to their analytics, bringing data from many different systems into one place to simplify how the analytics are done.

Data Lake

Data Lake Data Transformation Cost-Benefit Data-driven

Build a data lake with Apache Flink on Amazon EMR

AWS Big Data

JANUARY 27, 2023

With a unified data catalog, you can quickly search datasets and figure out data schema, data format, and location. The AWS Glue Data Catalog provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos. Refer to Catalogs for more information.

Data Lake

Data Lake Metadata Business Analysis Data-driven

How Cargotec uses metadata replication to enable cross-account data sharing

AWS Big Data

JUNE 7, 2023

Cargotec captures terabytes of IoT telemetry data from their machinery operated by numerous customers across the globe. This data needs to be ingested into a data lake, transformed, and made available for analytics, machine learning (ML), and visualization.

Metadata

Metadata Data Lake Machine Learning Big Data

Amazon Redshift announcements at AWS re:Invent 2023 to enable analytics on all your data

AWS Big Data

NOVEMBER 29, 2023

This cloud service was a significant leap from the traditional data warehousing solutions, which were expensive, not elastic, and required significant expertise to tune and operate. Amazon Redshift Serverless, generally available since 2021, allows you to run and scale analytics without having to provision and manage the data warehouse.

Data Warehouse

Data Warehouse Data Lake Analytics Machine Learning

Data Lakes: What Are They and Who Needs Them?

Jet Global

JULY 2, 2019

To address the flood of data and the needs of enterprise businesses to store, sort, and analyze that data, a new storage solution has evolved: the data lake. What’s in a Data Lake? All the while, your marketing team is relying on marketing automation or CRM software they find the most productive.

Data Lake

Data Lake Data Warehouse Big Data Machine Learning

Collibra Brings Effective Data Governance to Line-of-Business

David Menninger's Analyst Perspectives

SEPTEMBER 28, 2021

Collibra is a data governance software company that offers tools for metadata management and data cataloging. The software enables organizations to find data quickly, identify its source and assure its integrity.

Data Governance

Data Governance Metadata Software Management

Introducing Apache Hudi support with AWS Glue crawlers

AWS Big Data

NOVEMBER 22, 2023

Apache Hudi is an open table format that brings database and data warehouse capabilities to data lakes. Apache Hudi helps data engineers manage complex challenges, such as managing continuously evolving datasets with transactions while maintaining query performance.

Data Lake

Data Lake Snapshot Metadata Optimization

What is a Data Mesh?

DataKitchen

AUGUST 3, 2021

First-generation – expensive, proprietary enterprise data warehouse and business intelligence platforms maintained by a specialized team drowning in technical debt. Second-generation – gigantic, complex data lake maintained by a specialized team drowning in technical debt. See the pattern?

Data Architecture

Data Architecture Data Lake Cost-Benefit Data Warehouse

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

AWS Big Data

OCTOBER 3, 2023

In our previous post, Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes, we discussed how you can implement solutions to improve operational efficiencies of your Amazon Simple Storage Service (Amazon S3) data lake that is using the Apache Iceberg open table format and running on the Amazon EMR big data platform.

Optimization

Optimization Snapshot Data Lake Metadata

Modernize your data observability with Amazon OpenSearch Service zero-ETL integration with Amazon S3

AWS Big Data

JUNE 5, 2024

The integration is a new way for customers to query operational logs in Amazon S3 and Amazon S3-based data lakes without needing to switch between tools to analyze operational data. OpenSearch is an open source, distributed search and analytics suite derived from Elasticsearch 7.10.

Data Lake

Data Lake Cost-Benefit Dashboards Visualization

Data governance in the age of generative AI

AWS Big Data

FEBRUARY 29, 2024

To provide a response that includes the enterprise context, each user prompt needs to be augmented with a combination of insights from structured data from the data warehouse and unstructured data from the enterprise data lake.

Data Governance

Data Governance Unstructured Data Metadata Data Lake

Query your Apache Hive metastore with AWS Lake Formation permissions

AWS Big Data

JULY 20, 2023

The Hive metastore is a repository of metadata about the SQL tables, such as database names, table names, schema, serialization and deserialization information, data location, and partition details of each table. Apache Hive, Apache Spark, Presto, and Trino can all use a Hive Metastore to retrieve metadata to run queries.

Data Lake

Data Lake Metadata Data Processing Big Data

How BMO improved data security with Amazon Redshift and AWS Lake Formation

AWS Big Data

MARCH 1, 2024

As they continue to implement their Digital First strategy for speed, scale and the elimination of complexity, they are always seeking ways to innovate, modernize and also streamline data access control in the Cloud. BMO has accumulated sensitive financial data and needed to build an analytic environment that was secure and performant.

Data Lake

Data Lake Data Warehouse Management Risk

Build and manage your modern data stack using dbt and AWS Glue through dbt-glue, the new “trusted” dbt adapter

AWS Big Data

NOVEMBER 29, 2023

It enables data engineers, data scientists, and analytics engineers to define the business logic with SQL select statements and eliminates the need to write boilerplate data manipulation language (DML) and data definition language (DDL) expressions.

Data Lake

Data Lake Management Metrics Data Warehouse

The Madness of Data (and analytics) Governance

Andrew White

DECEMBER 9, 2019

The client had recently engaged with a well-known consulting company that had recommended a large data catalog effort to collect all enterprise metadata to help identify all data and business issues. Modern data (and analytics) governance does not necessarily need: Wall-to-wall discovery of your data and metadata.

Analytics

Analytics Data Lake Data Governance Metadata

How Morningstar used tag-based access controls in AWS Lake Formation to manage permissions for an Amazon Redshift data warehouse

AWS Big Data

APRIL 6, 2023

In this post, Morningstar’s Data Lake Team Leads discuss how they utilized tag-based access control in their data lake with AWS Lake Formation and enabled similar controls in Amazon Redshift. However, our consumers pushed us for better query performance and enhanced analytical capabilities.

Data Warehouse

Data Warehouse Data Lake Management Data-driven

Informatica’s new data management clouds target health, finance services

CIO Business Intelligence

MAY 24, 2022

Some of the accelerators included as part of the new platform are integrations with Salesforce, NPI data, National Patient Account Services, Workday, Oracle Fusion HCM Cloud, Orange HRM, Salesforce Health Cloud, MedPro, healthcare-focused cloud company Veeva, and HR vendor UltiPro. Analytics for faster decision making.

Finance

Finance Management Metadata Data Quality

Case study: Policy Enforcement Automation With Semantics

Ontotext

MAY 2, 2024

Data leaders today are faced with an almost impossible challenge, particularly those on “the create side of the house” who are tasked to deliver insights and analytics. Such inconsistencies bring lowered trust in the outcomes analytics and insights leaders try to get.

Metadata

Metadata Data Lake Data-driven Enterprise

Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

AWS Big Data

JANUARY 8, 2024

This is the first post in a blog series that offers common architectural patterns in building real-time data streaming infrastructures using Kinesis Data Streams for a wide range of use cases. In this post, we will review the common architectural patterns of two use cases: Time Series Data Analysis and Event Driven Microservices.

Analytics

Analytics IoT Data-driven Snapshot

Supercharge Your Data Lakehouse with Apache Iceberg in Cloudera Data Platform

Cloudera

JUNE 30, 2022

We are excited to announce the general availability of Apache Iceberg in Cloudera Data Platform (CDP). Iceberg is a 100% open table format, developed through the Apache Software Foundation, and helps users avoid vendor lock-in. This allows our customers the freedom to choose their preferred analytic tool.

Data Lake

Data Lake Data Architecture Metadata Data Warehouse

Enforce fine-grained access control on Open Table Formats via Amazon EMR integrated with AWS Lake Formation

AWS Big Data

JANUARY 17, 2024

With Amazon EMR 6.15, we launched AWS Lake Formation based fine-grained access controls (FGAC) on Open Table Formats (OTFs), including Apache Hudi, Apache Iceberg, and Delta Lake. Many large enterprise companies seek to use their transactional data lake to gain insights and improve decision-making.

Data Lake

Data Lake Snapshot Big Data Data-driven

How smava makes loans transparent and affordable using Amazon Redshift Serverless

AWS Big Data

DECEMBER 21, 2023

To speed up the self-service analytics and foster innovation based on data, a solution was needed to provide ways to allow any team to create data products on their own in a decentralized manner. To create and manage the data products, smava uses Amazon Redshift, a cloud data warehouse.

Data Lake

Data Lake Data Warehouse Data-driven B2B

How Amazon Finance Automation built a data mesh to support distributed data ownership and centralize governance

AWS Big Data

JULY 14, 2023

In this post, we discuss how the Amazon Finance Automation team used AWS Lake Formation and the AWS Glue Data Catalog to build a data mesh architecture that simplified data governance at scale and provided seamless data access for analytics, AI, and machine learning (ML) use cases.

Finance

Finance Metadata Big Data Recreation/Entertainment

Apache Ozone and Dense Data Nodes

Cloudera

APRIL 22, 2021

Today’s enterprise data analytics teams are constantly looking to get the best out of their platforms. Storage plays one of the most important roles in the data platform strategy: it provides the basis for all compute engines and applications to be built on top of it. Metadata in cluster is disjoint across components.

Data Lake

Data Lake Cost-Benefit Testing Metadata

What is a data architect? Skills, salaries, and how to become a data framework master

CIO Business Intelligence

OCTOBER 13, 2023

Solutions data architect: These individuals design and implement data solutions for specific business needs, including data warehouses, data marts, and data lakes. Application data architect: The application data architect designs and implements data models for specific software applications.

Data Architecture

Data Architecture Data Warehouse Statistics Visualization

How Cloudera Supports Zero Trust for Data

Cloudera

JUNE 7, 2023

It operates independently from compute and storage layers, offering integrated security and governance based on metadata. With persistent context across analytics and cloud environments, SDX simplifies data delivery and access with a unified multi-tenant model. Understanding your data is critical to protecting the data.

Metadata

Metadata Data Lake Optimization Modeling

Create an end-to-end data strategy for Customer 360 on AWS

AWS Big Data

MARCH 26, 2024

In this post, we discuss how you can use purpose-built AWS services to create an end-to-end data strategy for C360 to unify and govern customer data that address these challenges. We recommend building your data strategy around five pillars of C360, as shown in the following figure.

Data Strategy

Data Strategy Strategy Data Warehouse Prescriptive Analytics

Putting the Business Back Into Business Innovation

Timo Elliott

DECEMBER 14, 2022

SAP BTP brings together data and analytics, artificial intelligence, application development, automation, and integration in one, unified environment. You lose the roots: the metadata, the hierarchies, the security, the business context of the data. The analysts call this a data mesh or data fabric strategy.

Data Lake

Data Lake Recreation/Entertainment Metadata Data Warehouse

Usability and Connecting Threads: How Data Fabric Makes Sense Out of Disparate Data

Ontotext

AUGUST 4, 2023

A data fabric utilizes an integrated data layer over existing, discoverable, and inferenced metadata assets to support the design, deployment, and utilization of data across enterprises, including hybrid and multi-cloud platforms. It also helps capture and connect data based on business or domains.

Metadata

Metadata Data-driven Data Architecture Data Quality

Speed up queries with the cost-based optimizer in Amazon Athena

AWS Big Data

NOVEMBER 17, 2023

Amazon Athena is a serverless, interactive analytics service built on open source frameworks, supporting open table file formats. Athena provides a simplified, flexible way to analyze petabytes of data where it lives.

Optimization

Optimization Statistics Metadata Data Lake

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics, Part 3: Visualization and trend analysis using Amazon QuickSight

AWS Big Data

MARCH 29, 2024

Data Firehose uses an AWS Lambda function to transform data and ingest the transformed records into an Amazon Simple Storage Service (Amazon S3) bucket. An AWS Glue crawler scans data on the S3 bucket and populates table metadata on the AWS Glue Data Catalog.

Metrics

Metrics Visualization Dashboards Interactive

Data Mesh 101: How Data Mesh Helps Organizations Be Data-Driven and Achieve Velocity

Ontotext

FEBRUARY 12, 2024

Transferring ownership of data/datasets to domain-specific units that possess a deeper understanding of rules around the data empowers teams, improves data quality and trust, and greatly accelerates the building of data models and analytics. However, data mesh is not about introducing new technologies.

Data-driven

Data-driven Data Lake Data Quality Business Objectives

What is an Information Steward, and Why You Should Care

Grooper

MARCH 5, 2020

If your organization has any kind of data and analytics initiative, then chances are you have people – maybe even an entire department dedicated to managing and integrating data for (and between) software applications to achieve some sort of business outcome. Is a Power-User or a Data Scientist an Information Steward?

Data Lake

Data Lake Metadata Data Quality Software

Cloudera’s Open Data Lakehouse Supercharged with dbt Core(tm)

Cloudera

OCTOBER 7, 2022

dbt allows data teams to produce trusted data sets for reporting, ML modeling, and operational workflows using SQL, with a simple workflow that follows software engineering best practices like modularity, portability, and continuous integration/continuous development (CI/CD).

Data Warehouse

Data Warehouse Data Transformation Testing Data Lake

Top Opportunities for SAP Partners in 2023

Timo Elliott

NOVEMBER 30, 2022

According to IDC, such platforms are THE fastest growth area of the software market, and we’re seeing 50% growth for SAP BTP around the world. To turn it into an asset, you actually have to do something with the data, to change something in the way you do business. The problem is that we’ve been doing analytics wrong for thirty years.

Recreation/Entertainment

Recreation/Entertainment Metadata Data Warehouse Cost-Benefit

How to use foundation models and trusted governance to manage AI workflow risk

IBM Big Data Hub

OCTOBER 16, 2023

It includes processes that trace and document the origin of data, models and associated metadata and pipelines for audits. How to scale AI and ML with built-in governance: A fit-for-purpose data store built on an open lakehouse architecture allows you to scale AI and ML while providing built-in governance tools.

Risk

Risk Modeling Management Metadata
