Data Lineage: Case Studies of Data-Driven Businesses

Parth Shukla 29 Nov, 2022 • 5 min read

This article was published as a part of the Data Science Blogathon.

Introduction

Data lineage is the practice of tracing the path data takes: where it originates, how it moves between systems, and how it is transformed and used over time. Many businesses use it to understand the source of their data, the pathway it follows, and how it is consumed. That insight helps organizations plan future steps and use their data to improve product or service performance.

In this article, we will discuss three case studies of data-driven companies, Netflix, Slack, and Postman, that implemented data lineage and benefitted from it. We will also look at the process each company followed and the techniques it applied.

Data Lineage Case Studies

Data-driven businesses such as Netflix, Slack, UBS, Postman, and Airbnb are convinced of the benefits of data lineage and are now using it and reaping returns. Let us discuss the data lineage process at three of these companies and how they benefitted from it.

Case 1: Improved data infrastructure reliability and efficiency at Netflix

Netflix has been convinced of the benefits of data lineage and has implemented it. At the project’s inception, the team defined design goals to guide the architecture and development work toward a complete, accurate, reliable, and scalable lineage system that maps Netflix’s diverse data landscape. A few of these principles are:

Ensure data integrity
Enable seamless integration
Design a flexible data model

Based on a standard data model at the entity level, they built a generic relationship model that describes the dependencies between any pair of entities. With this approach, they can maintain a unified data model and repository that supports multiple use cases, such as data discovery, SLA service, and data efficiency.
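
To make the idea of a generic relationship model concrete, here is a minimal sketch in Python. The class names (`Entity`, `Relationship`, `LineageRepository`), the entity kinds, and the sample dataset names are all hypothetical and chosen only for illustration; this is not Netflix’s actual schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """Any node in the data landscape (hypothetical model)."""
    kind: str  # e.g. "table", "job", "dashboard"
    name: str


@dataclass(frozen=True)
class Relationship:
    """A directed dependency between any pair of entities."""
    source: Entity
    target: Entity
    relation: str  # e.g. "read_by", "writes_to"


@dataclass
class LineageRepository:
    """One store for all relationships, whatever the entity kinds."""
    relationships: set = field(default_factory=set)

    def add(self, source, target, relation):
        self.relationships.add(Relationship(source, target, relation))

    def dependents_of(self, entity):
        """Entities that directly depend on `entity`."""
        return {r.target for r in self.relationships if r.source == entity}


# Illustrative data: a table read by a job that writes another table.
repo = LineageRepository()
events = Entity("table", "raw_events")
job = Entity("job", "daily_rollup")
summary = Entity("table", "daily_summary")
repo.add(events, job, "read_by")
repo.add(job, summary, "writes_to")
```

Because every dependency is stored in the same shape, a single query such as `repo.dependents_of(events)` works regardless of whether the entities involved are tables, jobs, or dashboards.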

Case 2: Easy operational maintenance and better execution of data programs at Slack

Slack has been convinced of the benefits of data lineage and has invested in it as well. Slack states that as datasets become more complex and the number of contributors grows, it becomes more and more challenging to understand the relationships between different data sources.

To make their lineage data easier to use, they produce a flattened version of the lineage tables and store it in Hive. The flattened table lets people query lineage data in the data warehouse and makes queries easier to write and run for typical use cases.
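
One way to picture a flattened lineage table is as a list of every ancestor and descendant pair, so that a consumer can filter rows instead of walking the graph. The sketch below is an assumption about what flattening could mean, not Slack’s implementation; the function name `flatten_lineage`, the `depth` column, and the dataset names are hypothetical.

```python
from collections import defaultdict, deque


def flatten_lineage(edges):
    """Return one (ancestor, descendant, depth) row per reachable pair."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    rows = []
    for start in list(children):
        seen = {start}
        queue = deque([(start, 0)])
        # Breadth-first walk so each descendant gets its shortest depth.
        while queue:
            node, depth = queue.popleft()
            for child in children.get(node, []):
                if child not in seen:
                    seen.add(child)
                    rows.append((start, child, depth + 1))
                    queue.append((child, depth + 1))
    return sorted(rows)


# Illustrative direct edges: parent dataset -> child dataset.
edges = [("raw_events", "clean_events"), ("clean_events", "daily_summary")]
flat = flatten_lineage(edges)
```

With the rows precomputed, a question like "what is downstream of `raw_events`?" becomes a simple filter on the first column rather than a recursive query.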

With the help of data lineage, they have also worked on a notification system. They built notification tooling on their internal Data Portal that uses lineage information to notify the downstream consumers of a dataset. The portal includes a notify button that dataset owners can use.
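
As a rough illustration of how lineage can drive notifications, the sketch below collects the owners of everything downstream of a dataset. It is not Slack’s tooling: the function `downstream_owners`, the `consumers` and `owners` mappings, and the team names are hypothetical.

```python
from collections import deque


def downstream_owners(dataset, consumers, owners):
    """Collect the owners of every dataset downstream of `dataset`."""
    to_notify = set()
    seen = {dataset}
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for child in consumers.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
                if child in owners:
                    to_notify.add(owners[child])
    return to_notify


# Illustrative lineage (dataset -> direct consumers) and ownership data.
consumers = {
    "raw_events": ["clean_events"],
    "clean_events": ["daily_summary", "churn_model"],
}
owners = {
    "raw_events": "ingest",
    "clean_events": "data-eng",
    "daily_summary": "analytics",
    "churn_model": "ml-team",
}
recipients = downstream_owners("raw_events", consumers, owners)
```

The owner of the starting dataset is deliberately left out, on the assumption that they are the one sending the notification.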

Case 3: Moving beyond data discovery at Postman

Postman, too, has addressed a missing layer in its data stack. Postman’s data system started out pretty simple: a set of data tables, with information about those tables living in the heads of the early data team members. This worked while the company and its data were small, but it struggled to keep up once the company started to grow exponentially.

Postman currently has hundreds of team members distributed across four continents and more than 17 million users from 500,000 companies using their API platform.

Postman co-founder and CTO Ankit Sobti wanted to ensure that data was democratized. He said that it is a challenging task for a data engineering team to deliver insights from data at any given time of day, and he believed that everyone in the company should be able to access the data and gain insights. The problem became more pressing in 2020, when Postman moved fully online due to the COVID pandemic.

The data team decided to take on Postman’s data system as a project to address this issue. Their main goal was to make Postman’s data easier to access and understand, both for new hires within the data team and for people across the company with the help of data lineage.

They used data lineage to find out where data comes from and how it connects to other layers. Lineage helped them understand how datasets are connected and trace the bugs and errors that occur in the system day to day. It also helped them solve issues more quickly: instead of having to ask someone, team members could often resolve a problem just by looking at the lineage. They also plan to take further steps with data lineage to make their data management quicker and more accessible.
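
To show how lineage can shorten debugging, here is a small sketch that narrows a set of failing datasets down to the likely root causes, meaning those with no failing ancestor upstream. It is illustrative only and not Postman’s tooling; `root_causes`, the `parents` mapping, and the dataset names are hypothetical.

```python
def root_causes(failing, parents):
    """Return failing datasets that have no failing upstream ancestor."""

    def ancestors(node):
        # Depth-first walk over everything upstream of `node`.
        seen, stack = set(), list(parents.get(node, []))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(parents.get(current, []))
        return seen

    return {d for d in failing if not (ancestors(d) & failing)}


# Illustrative lineage: dataset -> the datasets it is built from.
parents = {
    "clean_events": ["raw_events"],
    "daily_summary": ["clean_events"],
    "revenue": ["orders"],
}
failing = {"clean_events", "daily_summary"}
suspects = root_causes(failing, parents)
```

Here `daily_summary` is failing only because it is built from `clean_events`, so the lineage points the investigation at `clean_events` first.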

When Data Lineage Is of Little Use for Some Organizations

Data lineage has proven to be a good fit for most organizations that work with data and data management. Still, there are some cases where it turns out to be of little use.

Some organizations store a large amount of data and work with many data sources and storage systems. Data lineage can be of little use to such an organization, because it may not provide reliable information for data of that kind.

Data lineage provides information about the data sources and the entire lifecycle of the data. Design lineage, the detailed technical view, gives an idea of where the data originates and where it is consumed. It helps architects understand how data flows are implemented. However, subject matter experts in the business who wish to audit the data processing can find it complex to navigate.

Business lineage provides a simplified view layered over the design lineage. A business lineage report may show only the significant systems, or it may leave out the systems and job structures and show only the transformations.

So data lineage is designed to show things quickly and easily, not to search for individual items. Suppose an organization works with a large amount of data, or with discrete data sources that change frequently. Lineage can show the flowchart or lifecycle of the data, but the organization may not be able to find the specific information it needs from it, because the results are reliable only for a small amount of data or for data that does not vary much. Hence, data lineage tends to be of little use to organizations working with large volumes of frequently changing data.
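
The difference between the two views can be sketched as a simple collapse of the detailed graph. The example below assumes each table or job in the design lineage is tagged with the system it belongs to; the function `business_view`, the `system_of` mapping, and all names in the sample data are hypothetical.

```python
def business_view(design_edges, system_of):
    """Collapse table/job-level edges into system-to-system edges."""
    collapsed = set()
    for source, target in design_edges:
        src_sys, dst_sys = system_of[source], system_of[target]
        if src_sys != dst_sys:  # hide hops inside a single system
            collapsed.add((src_sys, dst_sys))
    return sorted(collapsed)


# Illustrative design lineage: every table and job is a node.
design_edges = [
    ("crm.contacts", "etl.load_contacts"),
    ("etl.load_contacts", "etl.dedupe"),
    ("etl.dedupe", "warehouse.customers"),
    ("warehouse.customers", "bi.churn_dashboard"),
]
system_of = {
    "crm.contacts": "CRM",
    "etl.load_contacts": "ETL",
    "etl.dedupe": "ETL",
    "warehouse.customers": "Warehouse",
    "bi.churn_dashboard": "BI",
}
summary = business_view(design_edges, system_of)
```

Four detailed edges reduce to three system-level ones, which is closer to what a business reader auditing the flow would want to see.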

Conclusion

In this article, we discussed case studies of data-driven companies that implemented data lineage and benefitted from it. We saw how Netflix, Slack, and Postman applied the concept to their data systems with positive results. Knowing how these companies approach data lineage helps one understand how large data companies use it, and also helps one answer related questions in data engineering interviews.

Some Key Takeaways from this article are:

1. Today, many data-driven companies use data lineage for better data governance and data handling.

2. Companies can implement data lineage efficiently on top of their data sources, and it helps them quickly get a clearer picture of how their data is being used.

3. Data lineage can be of little use to companies that generate only a small amount of data, or to startups with lightweight databases.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.