Data Lake, Document and Metadata

Data Lake

Document

Metadata

Multicloud data lake analytics with Amazon Athena

AWS Big Data

MARCH 18, 2024

Many organizations operate data lakes spanning multiple cloud data stores. In these cases, you may want an integrated query layer to seamlessly run analytical queries across these diverse cloud stores and streamline your data analytics processes. The AWS Glue Data Catalog holds the metadata for Amazon S3 and GCS data.

Data Lake

Data Lake Analytics Cost-Benefit Management

Build a real-time GDPR-aligned Apache Iceberg data lake

AWS Big Data

FEBRUARY 24, 2023

Data lakes are a popular choice for today’s organizations to store their data around their business activities. As a best practice of a data lake design, data should be immutable once stored. A data lake built on AWS uses Amazon Simple Storage Service (Amazon S3) as its primary storage environment.

Data Lake

Data Lake Metadata Testing Data Warehouse

Join 52,000+

Insiders

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Webinars

Peak Performance: Continuous Testing & Evaluation of LLM-Based Applications

Manufacturing Sustainability Surge: Your Guide to Data-Driven Energy Optimization & Decarbonization

From Developer Experience to Product Experience: How a Shared Focus Fuels Product Success

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

MORE WEBINARS

Trending Sources

Use Amazon Athena with Spark SQL for your open-source transactional table formats

AWS Big Data

JANUARY 24, 2024

AWS-powered data lakes, supported by the unmatched availability of Amazon Simple Storage Service (Amazon S3), can handle the scale, agility, and flexibility required to combine different data and analytics approaches. It will never remove files that are still required by a non-expired snapshot.

Snapshot

Snapshot Data Lake Metadata Optimization

Webinars

Peak Performance: Continuous Testing & Evaluation of LLM-Based Applications

Manufacturing Sustainability Surge: Your Guide to Data-Driven Energy Optimization & Decarbonization

From Developer Experience to Product Experience: How a Shared Focus Fuels Product Success

Understanding User Needs and Satisfying Them

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You Need to Know

MORE WEBINARS

Data governance in the age of generative AI

AWS Big Data

FEBRUARY 29, 2024

Data governance is a critical building block across all these approaches, and we see two emerging areas of focus. First, many LLM use cases rely on enterprise knowledge that needs to be drawn from unstructured data such as documents, transcripts, and images, in addition to structured data from data warehouses.

Data Governance

Data Governance Unstructured Data Metadata Data Lake

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

AWS Big Data

MARCH 7, 2024

At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance. With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging.

Data Lake

Data Lake Analytics Dashboards Metrics

Configure cross-Region table access with the AWS Glue Catalog and AWS Lake Formation

AWS Big Data

AUGUST 3, 2023

Today’s modern data lakes span multiple accounts, AWS Regions, and lines of business in organizations. It’s important that their data solution gives them the ability to share and access data securely and safely across Regions. A resource link is a Data Catalog object that is a link to a database or table.

Data Lake

Data Lake Metadata Management Data Processing

Case study: Policy Enforcement Automation With Semantics

Ontotext

MAY 2, 2024

Storage-centric approach In the storage-centric approach, people try to address data silos by throwing everything in a data lake or a data warehouse. But, although, this helps somewhat in terms of architecture, soon these data lakes become unwieldy.

Metadata

Metadata Data Lake Data-driven Enterprise

Exploring real-time streaming for generative AI Applications

AWS Big Data

MARCH 25, 2024

A RAG-based generative AI application can only produce generic responses based on its training data and the relevant documents in the knowledge base. Streaming jobs constantly ingest new data to synchronize across systems and can perform enrichment, transformations, joins, and aggregations across windows of time more efficiently.

Data Lake

Data Lake Unstructured Data Management Modeling

Doing Cloud Migration and Data Governance Right the First Time

erwin

OCTOBER 8, 2020

But even with the “need for speed” to market, new applications must be modeled and documented for compliance, transparency and stakeholder literacy. With all these diverse metadata sources, it is difficult to understand the complicated web they form much less get a simple visual flow of data lineage and impact analysis.

Data Governance

Data Governance Metadata Testing Data Lake

Migrate Hive data from CDH to CDP public cloud

Cloudera

JUNE 25, 2021

The Replication Manager support matrix is documented in our public docs. This blog post outlines detailed step by step instructions to perform Hive Replication from an on-prem CDH cluster to a CDP Public Cloud Data Lake. CDP Data Lake cluster versions – CM 7.4.0, CDP Data Lake cluster versions – CM 7.4.0,

Data Lake

Data Lake Metadata Unstructured Data Management

Integrating Data Governance and Enterprise Architecture

erwin

SEPTEMBER 3, 2020

Data governance provides time-sensitive, current-state architecture information with a high level of quality. It documents your data assets from end to end for business understanding and clear data lineage with traceability. Automating Data Governance and Enterprise Architecture.

Data Governance

Data Governance Enterprise Risk Data Lake

What is a data architect? Skills, salaries, and how to become a data framework master

CIO Business Intelligence

OCTOBER 13, 2023

Data architect Armando Vázquez identifies eight common types of data architects: Enterprise data architect: These data architects oversee an organization’s overall data architecture, defining data architecture strategy and designing and implementing architectures.

Data Architecture

Data Architecture Data Warehouse Statistics Visualization

Cloud Data Science News – Beta 6

Data Science 101

DECEMBER 16, 2019

It now also supports PDF documents. Azure Data Factory Preserves Metadata during File Copy When performing a File copy between Amazon S3, Azure Blob, and Azure Data Lake Gen 2, the metadata will be copied as well. Not a huge update but still a nice feature. Azure Database for MySQL now supports MySQL 8.0

Data Science

Data Science Metadata Machine Learning Data Lake

How to use foundation models and trusted governance to manage AI workflow risk

IBM Big Data Hub

OCTOBER 16, 2023

It includes processes that trace and document the origin of data, models and associated metadata and pipelines for audits. How to scale AL and ML with built-in governance A fit-for-purpose data store built on an open lakehouse architecture allows you to scale AI and ML while providing built-in governance tools.

Risk

Risk Modeling Management Metadata

Data Mesh 101: How Data Mesh Can Be Used in an Organization

Ontotext

FEBRUARY 12, 2024

This calls for additional planning, documentation, and testing. A data mesh will likely require more engineers to get started, so a critical mass is needed for successful adoption. Each team should be accountable for providing their prepared data sets to downstream systems. For instance, JPMorgan Chase & Co.

Data Quality

Data Quality Data-driven Data Lake Data Governance

How gaming companies can use Amazon Redshift Serverless to build scalable analytical applications faster and easier

AWS Big Data

MARCH 7, 2023

A data hub contains data at multiple levels of granularity and is often not integrated. It differs from a data lake by offering data that is pre-validated and standardized, allowing for simpler consumption by users. Data hubs and data lakes can coexist in an organization, complementing each other.

Analytics

Analytics Data Warehouse Data Lake Metadata

What Is a Data Catalog?

Alation

FEBRUARY 13, 2020

Why do we need a data catalog? What does a data catalog do? These are all good questions and a logical place to start your data cataloging journey. Data catalogs have become the standard for metadata management in the age of big data and self-service analytics. Figure 1 – Data Catalog Metadata Subjects.

Metadata

Metadata Data Lake Recreation/Entertainment Big Data

Advancing AI: The emergence of a modern information lifecycle

CIO Business Intelligence

DECEMBER 4, 2023

The ways modern data is used, processed, and analyzed are continuously evolving as machine learning technology becomes better at these tasks. With constant advances in intelligent document processing, compute power, DevOps workflows, and AI, the content, context, and value of unstructured data is rapidly increasing.

Unstructured Data

Unstructured Data Data Lake Business Objectives Metadata

Data Governance Makes Data Security Less Scary

erwin

OCTOBER 31, 2019

What data do we have and where is it? Data is a critical asset used to operate, manage and grow a business. While sometimes at rest in databases, data lakes and data warehouses; a large percentage is federated and integrated across the enterprise, introducing governance, manageability and risk issues that must be managed.

Data Governance

Data Governance Metadata Risk Data Lake

Data Profiling: What It Is and How to Perfect It

Alation

APRIL 18, 2023

Gartner defines data profiling as: A technology for discovering and investigating data quality issues, such as duplication, lack of consistency, and lack of accuracy and completeness. The tools provide data statistics, such as degree of duplication and ratios of attribute values, both in tabular and graphical formats.

IT Data Quality Metadata Data Governance

How Data Governance Protects Sensitive Data

erwin

APRIL 2, 2021

And knowing the business purpose translates into actively governing personal data against potential privacy and security violations. Do You Know Where Your Sensitive Data Is? Data is a valuable asset used to operate, manage and grow a business.

Data Governance

Data Governance Cost-Benefit Risk Metadata

Data Preparation and Data Mapping: The Glue Between Data Management and Data Governance to Accelerate Insights and Reduce Risks

erwin

JANUARY 11, 2019

Organizations have spent a lot of time and money trying to harmonize data across diverse platforms , including cleansing, uploading metadata, converting code, defining business glossaries, tracking data transformations and so on. And there’s control of that landscape to facilitate insight and collaboration and limit risk.

Data Governance

Data Governance Risk Metadata Management

Power enterprise-grade Data Vaults with Amazon Redshift – Part 2

AWS Big Data

NOVEMBER 16, 2023

Amazon Redshift is a popular cloud data warehouse, offering a fully managed cloud-based service that seamlessly integrates with an organization’s Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional workflows, and much more—all while providing up to 7.9x

Enterprise

Enterprise Data Warehouse Snapshot Cost-Benefit

Choosing an open table format for your transactional data lake on AWS

AWS Big Data

JUNE 9, 2023

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale.

Data Lake

Data Lake Metadata Optimization Statistics

Data Cataloging in the Data Lake: Alation + Kylo

Alation

FEBRUARY 20, 2020

When it was no longer a hard requirement that a physical data model be created upon the ingestion of data, there was a resulting drop in richness of the description and consistency of the data stored in Hadoop. You did not have to understand or prepare the data to get it into Hadoop, so people rarely did.

Data Lake

Data Lake Metadata Structured Data Big Data

Introducing hybrid access mode for AWS Glue Data Catalog to secure access using AWS Lake Formation and IAM and Amazon S3 policies

AWS Big Data

SEPTEMBER 26, 2023

AWS Lake Formation helps you centrally govern, secure, and globally share data for analytics and machine learning. With Lake Formation, you can manage access control for your data lake data in Amazon Simple Storage Service (Amazon S3 ) and its metadata in AWS Glue Data Catalog in one place with familiar database-style features.

Data Lake

Data Lake Metadata Management Modeling

Themes and Conferences per Pacoid, Episode 8

Domino Data Lab

APRIL 3, 2019

Data governance shows up as the fourth-most-popular kind of solution that enterprise teams were adopting or evaluating during 2019. That’s a lot of priorities – especially when you group together closely related items such as data lineage and metadata management which rank nearby. in lieu of simply landing in a data lake.

Data Governance

Data Governance Machine Learning Metadata Big Data

Unstructured data management and governance using AWS AI/ML and analytics services

AWS Big Data

OCTOBER 25, 2023

Text, images, audio, and videos are common examples of unstructured data. Most companies produce and consume unstructured data such as documents, emails, web pages, engagement center phone calls, and social media. Therefore, there is a need to being able to analyze and extract value from the data economically and flexibly.

Unstructured Data

Unstructured Data Metadata Management Analytics

Top Graph Use Cases and Enterprise Applications (with Real World Examples)

Ontotext

MARCH 8, 2023

Rich metadata and semantic modeling continue to drive the matching of 50K training materials to specific curricula, leading new, data-driven, audience-based marketing efforts that demonstrate how the recommender service is achieving increased engagement and performance from over 2.3 million users.

Enterprise

Enterprise Knowledge Discovery Risk Data-driven

How BMO improved data security with Amazon Redshift and AWS Lake Formation

AWS Big Data

MARCH 1, 2024

One of the bank’s key challenges related to strict cybersecurity requirements is to implement field level encryption for personally identifiable information (PII), Payment Card Industry (PCI), and data that is classified as high privacy risk (HPR). Only users with required permissions are allowed to access data in clear text.

Data Lake

Data Lake Data Warehouse Management Risk

Breaking State and Local Data Silos with Modern Data Architectures

Cloudera

AUGUST 30, 2022

It outlines a scenario in which “recently married people might want to change their names on their driver’s licenses or other documentation. That should be easy, but when agencies don’t share data or applications, they don’t have a unified view of people. Forrester ).

Data Architecture

Data Architecture Data Lake Metadata Data Warehouse

Why Spreadsheets Are Your Secret Weapon for Efficient Data Governance

Alation

APRIL 6, 2023

Other forms of governance address specific sets or domains of data including information governance (for unstructured data), metadata governance (for data documentation), and domain-specific data (master, customer, product, etc.). Data catalogs and spreadsheets are related in many ways.

Data Governance

Data Governance Metadata Cost-Benefit Structured Data

The Data Scientist’s Guide to the Data Catalog

Alation

JULY 19, 2022

A data catalog can assist directly with every step, but model development. And even then, information from the data catalog can be transferred to a model connector , allowing data scientists to benefit from curated metadata within those platforms. How Data Catalogs Help Data Scientists Ask Better Questions.

Metadata

Metadata Data Quality Statistics Data Science

Five actionable steps to GDPR compliance (Right to be forgotten) with Amazon Redshift

AWS Big Data

JULY 28, 2023

To successfully respond to a data subject’s requests, organizations should have a clear strategy to determine how data is forgotten, flagged, anonymized, or deleted, and they should have clear guidelines in place for data audits. Data mapping involves identifying and documenting the flow of personal data in an organization.

Snapshot

Snapshot Metadata Measurement Data Warehouse

The Security Challenges of Data Warehousing in the Cloud

Cloudera

NOVEMBER 5, 2020

When you register an Environment in CDP, a Data Lake is automatically deployed for that environment. Data Lake security and governance is managed by a shared set of services running within a Data Lake cluster. Apache Atlas — metadata management and governance: lineage, analytics, attributes.

Data Lake

Data Lake Data Warehouse Metadata Optimization

The Cloud Connection: How Governance Supports Security

Alation

APRIL 14, 2022

In today’s AI/ML-driven world of data analytics, explainability needs a repository just as much as those doing the explaining need access to metadata, EG, information about the data being used. The Cloud Data Migration Challenge. Pushing data to a data lake and assuming it is ready for use is shortsighted.

Metadata

Metadata Data Governance Modeling Data-driven

Why We Started the Data Intelligence Project

Alation

JULY 7, 2022

In 2013 I joined American Family Insurance as a metadata analyst. I had always been fascinated by how people find, organize, and access information, so a metadata management role after school was a natural choice. The use cases for metadata are boundless, offering opportunities for innovation in every sector.

Metadata

Metadata Data-driven Insurance Statistics

Cloudera Data Platform extends Hybrid Cloud vision support by supporting Google Cloud

Cloudera

MARCH 31, 2021

Customers who have chosen Google Cloud as their cloud platform can now use CDP Public Cloud to create secure governed data lakes in their own cloud accounts and deliver security, compliance and metadata management across multiple compute clusters. Google Cloud Storage buckets – in the same subregion as your subnets .

Data Lake

Data Lake Metadata Enterprise Analytics

Get Your Analytics Insights Instantly – Without Abandoning Central IT

Cloudera

JANUARY 21, 2021

With CDW, as an integrated service of CDP, your line of business gets immediate resources needed for faster application launches and expedited data access, all while protecting the company’s multi-year investment in centralized data management, security, and governance. Proprietary file formats mean no one else is invited in!

Data Lake

Data Lake Data Warehouse IT Analytics

Clean up your Excel and CSV files without writing code using AWS Glue DataBrew

AWS Big Data

NOVEMBER 15, 2023

You can use its built-in transformations, recipes, as well as integrations with the AWS Glue Data Catalog and Amazon Simple Storage Service (Amazon S3) to preprocess the data in your landing zone, clean it up, and send it downstream for analytical processing. For Matching conditions , choose Match all conditions.

Metadata

Metadata Sales Data Lake Big Data

Non-JSON ingestion using Amazon Kinesis Data Streams, Amazon MSK, and Amazon Redshift Streaming Ingestion

AWS Big Data

OCTOBER 2, 2023

JSON data in Amazon Redshift Amazon Redshift enables storage, processing, and analytics on JSON data through the SUPER data type, PartiQL language, materialized views, and data lake queries. This approach requires an additional step of schema retrieval and decoding based on context.

Cost-Benefit

Cost-Benefit Metadata Structured Data Management

Federate Amazon QuickSight access with open-source identity provider Keycloak

AWS Big Data

JUNE 13, 2023

Download the SAML metadata file. In the navigation pane under Clients , import the SAML metadata file. Download the Keycloak IdP SAML metadata file from that URL location. For Metadata document , upload the Keycloak IdP SAML metadata XML file you downloaded and saved to your local machine earlier.

Metadata

Metadata Dashboards Business Intelligence Management

How data stores and governance impact your AI initiatives

IBM Big Data Hub

OCTOBER 12, 2023

Among the tasks necessary for internal and external compliance is the ability to report on the metadata of an AI model. Metadata includes details specific to an AI model such as: The AI model’s creation (when it was created, who created it, etc.) This allows for a high degree of transparency and auditability.

Cost-Benefit

Cost-Benefit Metadata Data Governance Modeling

Cloudera’s Open Data Lakehouse Supercharged with dbt Core(tm)

Cloudera

OCTOBER 7, 2022

Using these adapters, Cloudera customers can use dbt to collaborate, test, deploy, and document their data transformation and analytic pipelines on CDP Public Cloud, CDP One, and CDP Private Cloud. To date, dbt was only available on proprietary cloud data warehouses, with very little interoperability between different engines.

Data Warehouse

Data Warehouse Data Transformation Testing Data Lake

Multicloud data lake analytics with Amazon Athena

Build a real-time GDPR-aligned Apache Iceberg data lake

Webinars

Trending Sources

Use Amazon Athena with Spark SQL for your open-source transactional table formats

Webinars

Data governance in the age of generative AI

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

Configure cross-Region table access with the AWS Glue Catalog and AWS Lake Formation

Case study: Policy Enforcement Automation With Semantics

Exploring real-time streaming for generative AI Applications

Doing Cloud Migration and Data Governance Right the First Time

Migrate Hive data from CDH to CDP public cloud

Integrating Data Governance and Enterprise Architecture

What is a data architect? Skills, salaries, and how to become a data framework master

Cloud Data Science News – Beta 6

How to use foundation models and trusted governance to manage AI workflow risk

Data Mesh 101: How Data Mesh Can Be Used in an Organization

How gaming companies can use Amazon Redshift Serverless to build scalable analytical applications faster and easier

What Is a Data Catalog?

Advancing AI: The emergence of a modern information lifecycle

Data Governance Makes Data Security Less Scary

Data Profiling: What It Is and How to Perfect It

How Data Governance Protects Sensitive Data

Data Preparation and Data Mapping: The Glue Between Data Management and Data Governance to Accelerate Insights and Reduce Risks

Power enterprise-grade Data Vaults with Amazon Redshift – Part 2

Choosing an open table format for your transactional data lake on AWS

Data Cataloging in the Data Lake: Alation + Kylo

Introducing hybrid access mode for AWS Glue Data Catalog to secure access using AWS Lake Formation and IAM and Amazon S3 policies

Themes and Conferences per Pacoid, Episode 8

Unstructured data management and governance using AWS AI/ML and analytics services

Top Graph Use Cases and Enterprise Applications (with Real World Examples)

How BMO improved data security with Amazon Redshift and AWS Lake Formation

Breaking State and Local Data Silos with Modern Data Architectures

Why Spreadsheets Are Your Secret Weapon for Efficient Data Governance

The Data Scientist’s Guide to the Data Catalog

Five actionable steps to GDPR compliance (Right to be forgotten) with Amazon Redshift

The Security Challenges of Data Warehousing in the Cloud

The Cloud Connection: How Governance Supports Security

Why We Started the Data Intelligence Project

Cloudera Data Platform extends Hybrid Cloud vision support by supporting Google Cloud

Get Your Analytics Insights Instantly – Without Abandoning Central IT

Clean up your Excel and CSV files without writing code using AWS Glue DataBrew

Non-JSON ingestion using Amazon Kinesis Data Streams, Amazon MSK, and Amazon Redshift Streaming Ingestion

Federate Amazon QuickSight access with open-source identity provider Keycloak

How data stores and governance impact your AI initiatives

Cloudera’s Open Data Lakehouse Supercharged with dbt Core(tm)

Stay Connected