All About Data Lineage
- jaquasianicole
- Dec 6, 2023
- 3 min read
The data analysis process involves collecting, transforming, and modeling data to discover patterns for results. This process begins with identifying the business problem and collecting the necessary data. After defining the objective there should be a plan to collect and aggregate the relevant data (Hillier, 2023). Utilizing first-party data means sourcing from your company’s dataset, which will likely be uniform in structure. It is good practice to collect metadata at every step for lineage analysis.
Data lineage reveals the life cycle of the dataset by documenting the data flow from start to finish. Documenting the lineage of data means recording what transformations were applied, how the data changed, and why. This historical record is done to validate data accuracy and improve data quality. It is a discipline within metadata management that allows data consumers to understand the context of the dataset on a deeper level. Data lineage is a key documentation tool to establish a shared understanding within an organization by upholding transparency (Sacolick, 2021). Many organizations choose to implement this idea to maintain a data-driven culture and support analytics and machine learning efforts. It is also done as a data governance practice that aligns with regulatory requirements to track private and sensitive data. Some companies choose to conduct regular audits of data flow documentation as a proactive measure. This can be done manually by validating the metadata associated with data source descriptions, transformations, and field mappings. There are automated data lineage tools that provide lineage capture and validation features to streamline the process.
As mentioned, there are many benefits to practicing data lineage documentation however a small organization may choose to opt out. This could be due to limited resources, data simplicity where it's not needed, or because of legacy systems. It is up to the organization to determine whether or not they have a valid reason for not establishing data lineage because the decision depends on so many factors. This decision requires weighing the consequences associated with neglecting data lineage mapping. Organizations can experience inefficiencies due to obstacles during the integration process that could have been remedied beforehand. Poor data management can impact decision-making because analysts can misinterpret the data and come to incorrect conclusions. This is why it is important to promote a unified data management standard through helpful metadata. It could also be difficult for an organization to prove compliance with regulations, which can lead to fines or a damaged reputation. The data analysis process involves identifying the necessary datasets to complete a project, however, without proper documentation the common difficulties can be exasperated.
For those who recognize the benefits and decide to participate in data lineage documentation, there are specific moments to collect metadata. The first moment is during data ingestion, which is the instance of importing the data file from the source. Metadata captured at this stage should include details of the source and data ownership. Data lineage can be included at this point by describing where the data is going and its intended purpose. This can also be further by initiating the process of data profiling. As analysts first interact with a dataset they can examine and create summaries that detail characteristics such as accuracy and completeness (Pratt & Lewis, 2022).
Data integrity is the concept of validating the accuracy, completeness, consistency, and validity of a dataset. This is an important element in maintaining data quality because it ensures analysts are equipped with data free of error for accurate insights. The standard of data integrity is supported by practicing a thorough data lineage process. By identifying sources, tracking movement, and validating transformations organizations can uphold a high level of data quality. Data lineage should detail data processing to track operations performed and the impact on the data. For example, documenting the filters applied or the counts from a particular column for future comparison is considered part of the lineage record. It is also important to provide data lineage on query history, such as joins to validate the process and possibly optimize future queries.
References & Citations
Hillier, W. (2023, May 31). A Step-by-Step Guide to the Data Analysis Process [2023].
CareerFoundry. https://careerfoundry.com/en/blog/data-analytics/the-data-analysis-process-step-by-step/
Pratt, M. K., & Lewis, S. (2022). data profiling. Data Management.
Sacolick, I. (2021, April 5). Data lineage: What it is and why it’s important. InfoWorld.

Comments