Blog, Data Processing and Metadata

Top 10 Data Lineage Podcasts, Blogs, and Magazines

Octopai

JANUARY 31, 2021

Our list of Top 10 Data Lineage Podcasts, Blogs, and Websites To Follow in 2021. The host is Tobias Macey, an engineer with many years of experience. The particular episode we recommend looks at how WeWork struggled with understanding their data lineage so they created a metadata repository to increase visibility. Agile Data.

Data Governance

Data Governance Data Processing Data Quality Metadata

Migrate an existing data lake to a transactional data lake using Apache Iceberg

AWS Big Data

OCTOBER 3, 2023

This means the data files in the data lake aren’t modified during the migration and all Apache Iceberg metadata files (manifests, manifest files, and table metadata files) are generated outside the purview of the data. In this method, the metadata are recreated in an isolated environment and colocated with the existing data files.

Data Lake

Data Lake Metadata Snapshot Recreation/Entertainment

5G network rollout using DevOps: Myth or reality?

IBM Big Data Hub

JUNE 12, 2023

Public cloud support: Many CSPs use hyperscalers like AWS to host their 5G network functions, which requires automated deployment and lifecycle management. Hybrid cloud support: Some network functions must be hosted on a private data center, but that also requires the ability to automatically place network functions dynamically.

Testing

Testing Data Processing Metadata Management

Streaming Ingestion for Apache Iceberg With Cloudera Stream Processing

Cloudera

MARCH 2, 2023

In this blog post, we are going to share with you how Cloudera Stream Processing (CSP) is integrated with Apache Iceberg and how you can use the SQL Stream Builder (SSB) interface in CSP to create stateful stream processing jobs using SQL. To provide the CM host we can copy the FQDN of the node where Cloudera Manager is running.

Snapshot

Snapshot Data Processing Metadata Management

Apache Ozone Powers Data Science in CDP Private Cloud

Cloudera

AUGUST 26, 2021

In this blog post, we will ingest a real world dataset into Ozone, create a Hive table on top of it and analyze the data to study the correlation between new vaccinations and new cases per country using a Spark ML Jupyter notebook in CML. Learn more about the impacts of global data sharing in this blog, The Ethics of Data Exchange.

Data Science

Data Science Forecasting Metadata Machine Learning

Mastering Ingress in the UI: Elevating your app visibility

IBM Big Data Hub

NOVEMBER 3, 2023

v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: ALB
  generation: 1
  name: echo-ingress
  namespace: echo-namespace
spec:
  rules:
  - host: techcorp.com // 1. Domain
    http:
      paths:
      - backend:
          service:
            name: echo-service
            port:
              number: 8080
        path: /echo
        pathType: Prefix
  tls:
  - hosts:
    - techcorp.com
    secretName: echo-secret // 3.

Data Processing

Data Processing Metadata Management Testing

A Reference Architecture for the Cloudera Private Cloud Base Data Platform

Cloudera

JULY 15, 2021

This blog post provides an overview of best practice for the design and deployment of clusters incorporating hardware and operating system configuration, along with guidance for networking and security as well as integration with existing enterprise infrastructure. Introduction and Rationale. Networking. Clocks must also be synchronized.

Data Processing

Data Processing Metadata Testing Management

Setting up and Getting Started with Cloudera’s New SQL AI Assistant

Cloudera

JANUARY 19, 2024

As described in our recent blog post, an SQL AI Assistant has been integrated into Hue with the capability to leverage the power of large language models (LLMs) for a number of SQL tasks. This blog post aims to help you understand what you can do to get started with generative AI assisted SQL using Hue image version 2023.0.16.0

Data Warehouse

Data Warehouse Data Processing Optimization Modeling

Build efficient, cross-Regional, I/O-intensive workloads with Dask on AWS

AWS Big Data

MAY 4, 2023

Amazon’s Open Data Sponsorship Program allows organizations to host free of charge on AWS. These datasets are distributed across the world and hosted for public use. Data scientists have access to the Jupyter notebook hosted on SageMaker. The OpenSearch Service domain stores metadata on the datasets connected at the Regions.

Data Processing

Data Processing Metadata Informatics Interactive

Upgrade Hortonworks Data Platform (HDP) to Cloudera Data Platform (CDP) Private Cloud Base

Cloudera

FEBRUARY 17, 2022

One of our previous blogs discussed the four paths to get from legacy platforms to CDP Private Cloud Base. In this blog and accompanying video, we deep dive into the mechanics of running an in-place upgrade from HDP3 to CDP Private Cloud Base. After Ambari has been upgraded, download the cluster blueprints with hosts.

Testing

Testing Data Processing Metadata Management

Data governance beyond SDX: Adding third party assets to Apache Atlas

Cloudera

MARCH 9, 2021

In this blog, we’ll highlight the key CDP aspects that provide data governance and lineage and show how they can be extended to incorporate metadata for non-CDP systems from across the enterprise. Extending Atlas’ metadata model. The example 1_typedef-server.json describes the server typedef used in this blog.

Data Governance

Data Governance Metadata Enterprise Data Processing

Migrate Hive data from CDH to CDP public cloud

Cloudera

JUNE 25, 2021

This blog post outlines detailed step by step instructions to perform Hive Replication from an on-prem CDH cluster to a CDP Public Cloud Data Lake. The Sentry service serves authorization metadata from the database backed storage; it does not handle actual privilege validation. This blog post is not a substitute for that.

Data Lake

Data Lake Metadata Unstructured Data Management

Do Large Language Models Dream of Knowledge Graphs – Impressions from Day 2 At SEMANTiCS 2023

Ontotext

OCTOBER 12, 2023

Both speakers talked about common metadata standards and adequate language resources as key enablers of efficient, interoperable, multilingual projects. Just like the typewriter in the hall hosting the Poster’s park, LLMs are yet another tool poised to change the way we work with language.

Modeling

Modeling Recreation/Entertainment Data Processing Metadata

The Top Three Entangled Trends in Data Architectures: Data Mesh, Data Fabric, and Hybrid Architectures

Cloudera

SEPTEMBER 29, 2022

The data product is not just the data itself, but a bunch of metadata that surrounds it — the simple stuff like schema is a given. It is also agnostic to where the different domains are hosted. There are tons of blogs/videos etc about data mesh. This team or domain expert will be responsible for the data produced by the team.

Data Architecture

Data Architecture Metadata Data Warehouse Sales

KGF 2023: Bikes To The Moon, Datastrophies, Abstract Art And A Knowledge Graph Forum To Embrace Them All

Ontotext

DECEMBER 1, 2023

Atanas Kiryakov presenting at KGF 2023 about “Where Shall an Enterprise Start Their Knowledge Graph Journey”. Only data integration through semantic metadata can drive business efficiency, as “it’s the glue that turns knowledge graphs into hubs of metadata and content”.

Metadata

Metadata Sales Consulting Enterprise

Data Governance Maturity and Tracking Progress

erwin

APRIL 16, 2021

erwin recently hosted the third in its six-part webinar series on the practice of data governance and how to proactively deal with its complexities. This webinar will discuss how to answer critical questions through data catalogs and business glossaries, powered by effective metadata management. erwin Data Intelligence.

Data Governance

Data Governance Metadata Cost-Benefit Digital Transformation

From Data Silos to Data Fabric with Knowledge Graphs

Ontotext

SEPTEMBER 15, 2020

This means the creation of reusable data services, machine-readable semantic metadata and APIs that ensure the integration and orchestration of data across the organization and with third-party external data. This means having the ability to define and relate all types of metadata.

Metadata

Metadata Knowledge Discovery Data Quality Strategy

HDFS Data Encryption at Rest on Cloudera Data Platform

Cloudera

APRIL 23, 2021

To prevent the management of these keys (which can run in the millions) from becoming a performance bottleneck, the encryption key itself is stored in the file metadata. Each file will have an EDEK which is stored in the file’s metadata. Select hosts for Active and Passive KTS servers. Data in the file is encrypted with DEK.

Data Processing

Data Processing Metadata Testing Management

Enhance query performance using AWS Glue Data Catalog column-level statistics

AWS Big Data

NOVEMBER 22, 2023

The workflow consists of the following high level steps: Cataloging the Amazon S3 Bucket: Utilize AWS Glue Crawler to crawl the designated Amazon S3 bucket, extracting metadata, and seamlessly storing it in the AWS Glue data catalog. We’ll query these tables using Amazon Athena and Amazon Redshift Spectrum. Keep the default option.

Statistics

Statistics Data Lake Optimization Data-driven

How Cargotec uses metadata replication to enable cross-account data sharing

AWS Big Data

JUNE 7, 2023

This is a guest blog post co-written with Sumesh M R from Cargotec and Tero Karttunen from Knowit Finland. In this blog, we discuss the technical challenges faced by Cargotec in replicating their AWS Glue metadata across AWS accounts, and how they navigated these challenges successfully to enable cross-account data sharing.

Metadata

Metadata Data Lake Machine Learning Big Data

How Data Governance Protects Sensitive Data

erwin

APRIL 2, 2021

Protecting what traditionally has been considered personally identifiable information (PII), such as people’s names, addresses, and government identification numbers, that a business collects and hosts is just the beginning of GDPR mandates.

Data Governance

Data Governance Cost-Benefit Risk Metadata

The Ultimate Guide to Modern Data Quality Management (DQM) For An Effective Data Quality Control Driven by The Right Metrics

datapine

SEPTEMBER 29, 2022

It involves reviewing data in detail, comparing and contrasting the data to its own metadata, running statistical models, and producing data quality reports. Many companies use so-called “legacy systems” for their databases that are decades old, and when the inevitable transition time comes, there’s a whole host of problems to deal with.

Data Quality

Data Quality Metrics Data-driven Management

Announcing Alation 4.0 with Alation Connect

Alation

FEBRUARY 20, 2020

What the mapping is of technical metadata to business descriptions. Alation Connect synchronizes metadata, sample data, and query logs into the Alation Data Catalog. All connections allow for Alation Data Catalog to automatically inventory & catalog queries and these engines may be hosted and operated on-premise or in the cloud.

Metadata

Metadata Enterprise Data Processing Data Architecture

Alation Accelerates Growth and Global Impact — and Welcomes 2 New Leaders

Alation

MAY 11, 2023

In this blog, I’ll detail how we’ve grown in EMEA specifically, sharing exciting updates and plans for the future. This multi-brand online retailer hosts thousands of products for sale on the internet and collects millions of bits and bytes of data across customer touchpoints each day. But first: mark your calendars!

B2B

B2B Finance Data Governance Marketing

Build event-driven data pipelines using AWS Controllers for Kubernetes and Amazon EMR on EKS

AWS Big Data

MARCH 30, 2023

Amazon Elastic Kubernetes Service (Amazon EKS) is becoming a popular choice among AWS customers to host long-running analytics and AI or machine learning (ML) workloads.
services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: sparkjob-demo-bucket
spec:
  name: sparkjob-demo-bucket
kubectl apply -f ack-yamls/s3.yaml
We use the s3.yaml

Data-driven

Data-driven Metadata Testing Management

Fivetran Modern Data Stack Conference 2023: Key Takeaways

Alation

APRIL 14, 2023

In this blog, I’ll share a quick high-level overview of the event, with an eye to core themes. In his talk, Mitesh revealed that Alation delivers useful information about data via metadata, and explored why context is key to building reliable data pipelines. Let’s dive in! Keen to learn more about Fivetran’s evolution?

Data Warehouse

Data Warehouse Data-driven Digital Transformation Metadata

What Is Alation Connected Sheets? Q&A with the Creators

Alation

NOVEMBER 28, 2022

And they rarely, if ever, host the most current data available. In the future, spreadsheet users will be able to curate and publish rich metadata about their spreadsheets back into the data catalog. A centralized repository of metadata on the spreadsheets will eliminate this confusion. Subscribe to Alation's Blog.

Metadata

Metadata Enterprise Cost-Benefit Finance

Sovereign Clouds: Partner Perspectives on Safeguarding Critical Customer Data

CIO Business Intelligence

APRIL 27, 2022

Rajgopal adds that all customer data, metadata, and escalation data are kept on Indian soil at all times in an ironclad environment. For more perspectives on Sovereign Cloud solutions, read the latest partner blogs from AU Cloud, NxtGen, ThinkOn and Tieto. These are questions and thoughts for all CIOs to ponder.

Digital Transformation

Digital Transformation Metadata Risk Management

Ontotext Invents the Universe So You Don’t Need To

Ontotext

NOVEMBER 22, 2020

Content Enrichment and Metadata Management. The value of metadata for content providers is well-established. When that metadata is connected within a knowledge graph, a powerful mechanism for content enrichment is unlocked. Ontotext Platform can be employed for a number of applications within an enterprise.

Metadata

Metadata Cost-Benefit Unstructured Data Technology

Announcing the 2021 Data Impact Awards

Cloudera

MAY 12, 2021

2020 saw us hosting our first ever fully digital Data Impact Awards ceremony, and it certainly was one of the highlights of our year. The post Announcing the 2021 Data Impact Awards appeared first on Cloudera Blog. SECURITY AND GOVERNANCE LEADERSHIP. Show us what is possible!

Digital Transformation

Digital Transformation Machine Learning Optimization Data Lake

Gartner D&A Summit Bake-Offs Explored Flooding Impact And Reasons for Optimism!

Rita Sallam

APRIL 2, 2023

We also gave the demo script and data set to all vendors in the Exhibit Hall to create demos for their booths and to submit for this blog. This blog highlights some notable findings and the videos from participating vendors. From there, participants were randomly selected and invited to present live at the Show Floor Showdown sessions.

Optimization

Optimization Machine Learning Insurance Risk

A Guide on How to Leverage DSPM with Your Security Stack and Enhance Data Security Posture

Laminar Security

NOVEMBER 16, 2023

To learn more about the differences between DSPM and CSPM, check out our blog post on DSPM vs CSPM, and why you need both for comprehensive cloud security. You can then use CSPM to check if your cloud resources that host or access this data are securely configured and comply with the relevant regulations or standards.

Risk

Risk Cost-Benefit Dashboards Management

Discover and Explore Data Faster with the CDP DDE Template

Cloudera

SEPTEMBER 1, 2020

With HDFS, Solr servers are essentially stateless, so host failures have minimal consequences. Coordinates distribution of data and metadata, also known as shards. For the examples presented in this blog, we assume you have a CDP account already. Includes a drag-n-drop style, GUI-based Search Dashboard Designer.

Snapshot

Snapshot Unstructured Data Dashboards Interactive

Extreme data center pressure? Burst to the cloud with CDP!

Cloudera

NOVEMBER 12, 2020

Inability to maintain context – This is the worst of them all because every time a data set or workload is re-used, you must recreate its context including security, metadata, and governance. This feature ensures workloads remain in context with all common data, including metadata management, data governance, and security policies.

Data Warehouse

Data Warehouse Reporting Risk Cost-Benefit

Boosting Object Storage Performance with Ozone Manager

Cloudera

JULY 19, 2023

It is a replicated, highly-available service that is responsible for managing the metadata for all objects stored in Ozone. In this blog post, we will highlight the work done recently to improve the performance of Ozone Manager to scale to exabytes of data. The hardware specifications are included at the end of this blog.

Management

Management Metadata Metrics Optimization

6 benefits of data lineage for financial services

IBM Big Data Hub

FEBRUARY 26, 2024

Download the Gartner® Market Guide for Active Metadata Management. 1. Efficient cloud migrations: McKinsey predicts that $8 out of every $10 for IT hosting will go toward the cloud by 2024. The post 6 benefits of data lineage for financial services appeared first on IBM Blog.

Cost-Benefit

Cost-Benefit Metadata Data Governance Reporting

GoDaddy: Customer-First Digital Transformation

Alation

FEBRUARY 13, 2020

Graves: As I mentioned, one of the key things for us is that we sell web products for our customers to build their own web presence – domains, hosting, shopping carts, and SSL certs. Subscribe to Alation's Blog. What role does data play in your customer-first culture? Are they stopping somewhere in setup? Thank you Sharon.

Digital Transformation

Digital Transformation Data-driven Business Intelligence Big Data

The importance of data ingestion and integration for enterprise AI

IBM Big Data Hub

JANUARY 9, 2024

Data ingestion must be done properly from the start, as mishandling it can lead to a host of new issues. 4 key components to ensure reliable data ingestion. Data quality and governance: Data quality means ensuring the security of data sources, maintaining holistic data and providing clear metadata.

Enterprise

Enterprise Data Integration Data Quality Contextual Data

AI governance is rapidly evolving — Here’s how government agencies must prepare

IBM Big Data Hub

APRIL 11, 2024

We recommend that these hackathons be extended in scope to address the challenges of AI governance, through these steps: Step 1: Three months before the pilots are presented, have a candidate governance leader host a keynote on AI ethics to hackathon participants. We find that most are disincentivized because they have quotas to meet.

Risk

Risk Consulting Modeling Data Processing

Simplify data loading into Type 2 slowly changing dimensions in Amazon Redshift

AWS Big Data

MARCH 9, 2023

SCD2 metadata – rec_eff_dt and rec_exp_dt indicate the state of the record. Register source tables in the AWS Glue Data Catalog: We use an AWS Glue crawler to infer metadata from delimited data files like the CSV files used in this post. When you’re creating the AWS Glue crawler, create a new database named rs-dimension-blog.
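
As a rough illustration of how effective and expiry dates can track record state in a Type 2 slowly changing dimension, here is a minimal in-memory sketch. It is not the method from the linked post. Only the column names rec_eff_dt and rec_exp_dt come from the excerpt; the `DimRow` class, the `apply_scd2` function, the `key` and `attrs` fields, and the use of 9999-12-31 as an open-ended expiry marker are assumptions made for this example.

```python
from dataclasses import dataclass, replace
from datetime import date

# Assumed convention: an open-ended expiry date marks the current version.
OPEN_END = date(9999, 12, 31)


@dataclass(frozen=True)
class DimRow:
    key: str          # business key (hypothetical field)
    attrs: tuple      # tracked attributes (hypothetical field)
    rec_eff_dt: date  # date this version became effective
    rec_exp_dt: date  # date this version expired, or OPEN_END if current


def apply_scd2(dim, incoming, load_dt):
    """Return a new list of rows after applying incoming {key: attrs}."""
    result = []
    seen = set()
    for row in dim:
        is_current = row.rec_exp_dt == OPEN_END
        if is_current and row.key in incoming:
            seen.add(row.key)
            if incoming[row.key] != row.attrs:
                # Changed record: close the old version, open a new one.
                result.append(replace(row, rec_exp_dt=load_dt))
                result.append(
                    DimRow(row.key, incoming[row.key], load_dt, OPEN_END)
                )
                continue
        # Unchanged or historical rows are carried over as they are.
        result.append(row)
    # Keys without a current version start a new one.
    for key, attrs in incoming.items():
        if key not in seen:
            result.append(DimRow(key, attrs, load_dt, OPEN_END))
    return result
```

Calling `apply_scd2(dim, {"c1": ("Bergen",)}, date(2023, 3, 9))` on a dimension whose current `c1` row holds `("Oslo",)` would expire that row on the load date and append a new current row. In a warehouse the same idea is usually expressed as an update followed by an insert, or as a merge statement.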

Slice and Dice

Slice and Dice Data Warehouse Metrics Metadata

How Backstage streamlines software development and increases efficiency

IBM Big Data Hub

APRIL 1, 2024

GitOps for repo data: Backstage allows developers and teams to express the metadata about their projects from YAML files. Rather than paying a cloud to host that proxy for you, you can move that proxy into Backstage and present it as a single product. This is like APIGEE or APIM, but “in-house.”

Software

Software Advertising Data Processing Metadata

New Features in Cloudera Streams Messaging Public Cloud 7.2.12

Cloudera

OCTOBER 25, 2021

Cruise Control will automatically rebalance the partition replicas on the cluster making use of the newly added brokers in the event of an up scale, or down scaling will move replicas off the hosts that are targeted to be decommissioned. an Atlas hook was provided that once configured allows for Kafka metadata to be collected.

Metrics

Metrics Data Processing Metadata Management

Habib Bank manages data at scale with Cloudera Data Platform

Cloudera

NOVEMBER 17, 2022

The platform’s capabilities in security, metadata, and governance will provide robust support to HBL’s focus on compliance and keeping data clean and safe in an increasingly complex regulatory and threat environment. The post Habib Bank manages data at scale with Cloudera Data Platform appeared first on Cloudera Blog.

Management

Management Data Lake Consulting Unstructured Data
