Data Cataloging in the Data Lake: Alation + Kylo

By Stephanie McReynolds

Published on February 20, 2020

We are living in a new era of data defined by two massively disruptive trends: one architectural and the other organizational. Architecturally, the introduction of Hadoop, an open source framework whose distributed file system is designed to store massive amounts of data, radically changed the cost model of data. Organizationally, the innovation of self-service analytics, pioneered by Tableau and Qlik, fundamentally transformed the user model for data analysis. A “big data” revolution has ensued.

Disruptive Trend #1: Hadoop

More than any other advancement in analytic systems over the last 10 years, Hadoop has disrupted data ecosystems. By dramatically lowering the cost of storing data for analysis, it ushered in an era of massive data collection and increased the volume of data that organizations store. Hadoop also removed the requirement to model or structure data before writing it to a physical store.

Once a physical data model was no longer required at ingestion, the data stored in Hadoop became less richly described and less consistent. You did not have to understand or prepare the data to get it into Hadoop, so people rarely did. The result, as many industry observers have put it, is that many data lakes become data swamps.

Disruptive Trend #2: Self-Service Analytics

At the same time, the industry saw a second disruptive trend: organizational disruption in the form of self-service analytics. New data visualization interfaces from Tableau and Qlik showed that business users can analyze their own data. With the right user interface design and access to structured data, business users can effectively discover their own insights.

Along the way, these non-technical users created descriptions of the data as a by-product of their analysis. By using Tableau’s features to re-label data columns and build derived metrics, Tableau users were effectively documenting and modeling data on their own, often without realizing it. Generating descriptions this way scales far better than relying on technical staff to model and describe the data in advance.

The Rise of the Data Catalog

Today’s data catalog is born of both of these disruptive trends. A data catalog automatically associates data with the rich descriptions created in self-service analytic and preparation tools. As a consequence, the catalog is a platform for human curation of business metadata – glossaries of shared terms, approved definitions of data, and policies for data usage. Data catalogs automate the process of mapping the inventory of data to the correct descriptions and the knowledge of how to use the data, providing an interface for analysts, data stewards and data engineers to collaborate on their shared resource.

At Alation we’re committed to building the richest data catalog that we can at the intersection of technical, behavioral and business metadata. We’re using new techniques from machine learning, AI and crowdsourced curation to power the creation of a single source of reference for thousands of users. And part of our innovation journey is to integrate with tools that help people prepare and analyze data.

Introducing Integration with Kylo

Last week, Teradata’s Think Big business unit released Kylo, an open source project for building data lakes on Hadoop, NiFi, and Spark. Data lakes are large storage repositories that hold information in its native format until it is needed. Rather than just enabling the creation of a data lake, Kylo helps transform the raw data in that lake into business-relevant information through features for self-service data ingestion and wrangling without coding.

As teams work with Kylo, it captures basic technical metadata in its core framework. Through open REST APIs, this metadata is shared with Alation and married to the descriptions that we naturally collect in our data catalog. The result is a rich perspective on the usage of data that drives analyst productivity in self-service environments. In sum, by integrating with Kylo, Alation helps Big Data projects show ROI faster by getting cleanly described data into the hands of the people who need to consume it.
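To make the idea of marrying technical metadata to descriptions concrete, here is a minimal sketch in Python. It is not the actual Kylo or Alation API: the record shapes, the `CatalogEntry` class, and the `merge_metadata` and `undescribed` functions are all hypothetical, and the sample data is invented for illustration. It only shows the general shape of the join, along with how a catalog might flag columns that nobody has described yet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CatalogEntry:
    """One column in the catalog: technical facts plus a human description."""
    table: str
    column: str
    data_type: str
    description: Optional[str] = None  # None means nobody has described it yet


def merge_metadata(technical, descriptions):
    """Join technical metadata records with curated descriptions.

    technical:    list of dicts with "table", "column", "type" keys
                  (hypothetical shape, standing in for a REST API response).
    descriptions: dict mapping (table, column) to a description string
                  (hypothetical shape, standing in for catalog curation).
    """
    entries = []
    for record in technical:
        key = (record["table"], record["column"])
        entries.append(
            CatalogEntry(
                table=record["table"],
                column=record["column"],
                data_type=record["type"],
                description=descriptions.get(key),
            )
        )
    return entries


def undescribed(entries):
    """Return fully qualified names of columns that still lack a description."""
    return [f"{e.table}.{e.column}" for e in entries if e.description is None]


# Illustrative data only; not real output from either product.
technical = [
    {"table": "orders", "column": "cust_id", "type": "bigint"},
    {"table": "orders", "column": "amt", "type": "decimal(10,2)"},
]
descriptions = {("orders", "amt"): "Order total in USD, including tax"}

catalog = merge_metadata(technical, descriptions)
gaps = undescribed(catalog)
```

In this sketch, `gaps` lists the columns that arrived with technical metadata but no business description, which is the kind of gap a data steward would want to close first.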

Kylo (code on GitHub) is based on experience from 150 data lake projects for Fortune 1000 companies. We’re looking forward to working closely with the Teradata Think Big team to introduce Alation into those projects and to develop plans with Teradata for services, training and support built on the integration of Kylo with Alation.
