Data in Context: Lineage Explorer in DataHub

Gabriel Lyons
Published in DataHub
3 min read · Jun 11, 2021


DataHub aims to empower users to discover, trust, and take action on data in their organizations. Understanding where a data product comes from and how it is being used is critical to these goals. To give data professionals these insights, we built the DataHub Lineage Explorer.

DataHub Lineage Explorer

With the Lineage Explorer, DataHub can trace the flow of data from its creation, through all its transformations, to the point where it is consumed as a data product. In this post, we’ll go into why we built this, how you can use it, and what is on the horizon for lineage metadata.

Why lineage is important for data professionals

Build trust in data

Lineage is critical to the refinement step of data discovery. You have found a data product by issuing a search query or perhaps by browsing a taxonomy. You have access to its title and description, and you can spot-check data in some rows or statistics on its columns. However, these signals can look correct even when the input data has issues. Examining lineage is therefore an important input for deciding whether you can trust a dataset.

Downstream and upstream lineage build different types of trust.

Looking at downstream lineage lets you validate the quality of a data product. If an executive dashboard consumes a dataset, that is a signal the dataset has already been vetted by someone else. Looking at upstream lineage tells you whether the sources of truth for your data product are trustworthy. Certified, reliable, and well-maintained upstream dependencies let you verify that the data product in question is built on a stable foundation. Combining downstream and upstream lineage validates that a data product is what it appears to be.

Act decisively with data

Lineage is crucial even for datasets you are familiar with, including data products you created or maintain. When a data quality issue arises in your data product, how can you identify the source? If you haven’t changed the logic that produces the chart or table, an upstream dependency must be the culprit. Lineage allows you to trace issues back to the source.
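As a rough illustration (this is not DataHub’s actual API, and the dataset names and edge map below are hypothetical), tracing an issue back to its source amounts to walking the lineage graph against the direction of data flow and collecting every transitive upstream dependency as a root-cause candidate:

```python
from collections import deque

# Hypothetical lineage: each dataset maps to its direct upstream dependencies.
upstreams = {
    "exec_dashboard": ["revenue_daily"],
    "revenue_daily": ["orders_clean", "fx_rates"],
    "orders_clean": ["orders_raw"],
    "fx_rates": [],
    "orders_raw": [],
}

def trace_upstream(dataset, upstreams):
    """Breadth-first walk returning every transitive upstream dependency."""
    seen, queue = set(), deque(upstreams.get(dataset, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(upstreams.get(node, []))
    return seen

# Every upstream of the dashboard is a candidate source of the issue.
print(sorted(trace_upstream("exec_dashboard", upstreams)))
# ['fx_rates', 'orders_clean', 'orders_raw', 'revenue_daily']
```

In practice the Lineage Explorer surfaces this same traversal visually, so you can inspect each upstream dependency rather than compute the set by hand.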

Sometimes, the source is a change in an upstream dependency’s contract. In other cases, it is a transient issue. Modern data stacks include pipelines, feature generation tools, streams, and other operational components. Issues with a downstream dataset can often be due to operational issues with these components. That makes capturing the timeliness, frequency, and success of your data transformations all the more important. Using operational lineage, owners can identify when infrastructure issues create problems in the data products they maintain.

Lineage also comes into play when updating a data product. You must take downstream dependencies into account when making breaking changes. Using DataHub lineage, you can get up-to-date information on who depends on the data product you are changing and work with the respective owners to ensure a smooth transition.
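Impact analysis is the same traversal run in the opposite direction: collect every transitive downstream consumer of the dataset you plan to change. The sketch below uses hypothetical edges and is illustrative only, not DataHub’s API:

```python
from collections import defaultdict, deque

# Hypothetical direct lineage edges: (upstream, downstream).
edges = [
    ("orders_raw", "orders_clean"),
    ("orders_clean", "revenue_daily"),
    ("fx_rates", "revenue_daily"),
    ("revenue_daily", "exec_dashboard"),
]

# Index the edges by upstream so we can walk with the flow of data.
downstreams = defaultdict(list)
for up, down in edges:
    downstreams[up].append(down)

def impacted_by(dataset):
    """Every dataset transitively downstream of `dataset`."""
    seen, queue = set(), deque(downstreams[dataset])
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(downstreams[node])
    return seen

# A breaking change to orders_clean would affect:
print(sorted(impacted_by("orders_clean")))
# ['exec_dashboard', 'revenue_daily']
```

The owners of each impacted dataset are the people to coordinate with before shipping the change.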


Connect with DataHub

Join us on Slack
Sign up for our Newsletter
Follow us on Twitter
