Several years ago, I met a senior director from a large company. He mentioned that his company was facing data quality issues that eroded customer satisfaction, and that he had spent months investigating the potential causes and how to fix them. “What have you found?” I asked eagerly. “It is a tough issue. I did not find a single cause; on the contrary, many things went wrong,” he replied. He then cited a long list of factors that contributed to the data quality issues. Almost every department in the company was involved, and it was hard for him to decide where to begin. This is typical of data quality problems, which are tied directly to how an organization conducts its business and to the entire life cycle of the data itself.
Before data science became mainstream, data quality was mostly discussed in the context of reports delivered to internal or external clients. Nowadays, because machine learning requires a large amount of training data, the internal datasets within an organization are in high demand. In addition, analytics teams are always hungry for data and constantly search for data assets that can potentially add value, which has led to the quick adoption of new datasets and data sources that were not explored or used before. This trend has made data management and good data quality practices more important than ever.
The goal of this article is to give you a clear idea of how to build a data pipeline that creates and sustains good data quality from the beginning. In other words, data quality is not something that can be fundamentally improved by finding problems and fixing them. Instead, every organization should start by producing data with good quality in the first place.
First of all, what is data quality? Generally speaking, data is of high quality when it satisfies the requirements of its intended use for clients, decision-makers, downstream applications and processes. A good analogy is the quality of a manufactured product: good product quality is not itself the business outcome, but it drives customer satisfaction and affects the value and life cycle of the product. Similarly, the quality of the data is an important attribute that drives the value of the data and, hence, affects aspects of the business outcome such as regulatory compliance, customer satisfaction, or accuracy of decision making. The 5 main criteria used to measure data quality are listed below:
- Accuracy: the data should correctly describe whatever it represents.
- Relevancy: the data should meet the requirements for the intended use.
- Completeness: the data should not have missing values or missing records.
- Timeliness: the data should be up to date.
- Consistency: the data should be in the expected format and should yield the same results when cross-referenced.
The standard for good data quality can differ depending on the requirements and the nature of the data itself. For example, the core customer dataset of a company needs to meet very high standards for the above criteria, while there could be a higher tolerance for errors or incompleteness in a third-party data source.
For an organization to deliver data with good quality, it needs to manage and control each data store created in the pipeline from beginning to end. Many organizations focus only on the final data and invest in quality control right before it is delivered. This is not good enough. Too often, when an issue is found at the end, it is already too late: either it takes a long time to find out where the problem came from, or the issue becomes too costly and time-consuming to fix. However, if a company can manage the data quality of each dataset at the time it is received or created, the data quality is naturally guaranteed. There are 7 essential steps to making that happen:
1. Rigorous data profiling and control of incoming data
In most cases, bad data enters at the point where data is received. The data in an organization usually comes from sources outside the control of the company or department: it may be sent by another organization or, in many cases, collected by third-party software. Its quality therefore cannot be guaranteed, and rigorous quality control of incoming data is perhaps the most important of all data quality control tasks. A good data profiling tool comes in handy here; such a tool should be capable of examining the following aspects of the data:
- Data format and data patterns
- Data consistency on each record
- Data value distributions and anomalies
- Completeness of the data
It is also essential to automate the data profiling and data quality alerts so that the quality of incoming data is consistently controlled and managed whenever it is received. Never assume that incoming data is as good as expected without profiling and checks. Lastly, each piece of incoming data should be managed using the same standards and best practices, and a centralized catalog and KPI dashboard should be established to accurately record and monitor the quality of the data.
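As an illustration only, the profiling checks described above (format, completeness, value distribution) and a simple automated alert can be sketched in a few lines of Python. The field names, the format rules in `EXPECTED_PATTERNS`, and the 95% completeness threshold are all assumptions made for this example; a real profiling tool would cover far more.

```python
import re
from collections import Counter

# Hypothetical format rules: field names and patterns are assumptions
# made for this sketch, not rules taken from any real feed.
EXPECTED_PATTERNS = {
    "customer_id": re.compile(r"C\d{4}"),
    "signup_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def profile(records, distribution_field):
    """Profile a list of dict records: completeness, format, distribution."""
    missing = Counter()
    bad_format = Counter()
    distribution = Counter()
    for rec in records:
        for field, pattern in EXPECTED_PATTERNS.items():
            value = rec.get(field)
            if value in (None, ""):
                missing[field] += 1          # completeness check
            elif not pattern.fullmatch(str(value)):
                bad_format[field] += 1       # format / pattern check
        distribution[rec.get(distribution_field)] += 1
    total = len(records)
    completeness = (
        {f: 1 - missing[f] / total for f in EXPECTED_PATTERNS} if total else {}
    )
    return {
        "rows": total,
        "missing": dict(missing),
        "bad_format": dict(bad_format),
        "completeness": completeness,
        "distribution": dict(distribution),
    }


def quality_alerts(report, min_completeness=0.95):
    """Turn a profiling report into alert messages (threshold is an assumption)."""
    alerts = []
    for field, ratio in report["completeness"].items():
        if ratio < min_completeness:
            alerts.append(
                f"{field}: completeness {ratio:.0%} is below {min_completeness:.0%}"
            )
    for field, n in report["bad_format"].items():
        alerts.append(f"{field}: {n} value(s) do not match the expected format")
    return alerts
```

Running `profile` on every delivery and feeding the report to `quality_alerts` is one way to make sure no incoming file is accepted on trust; the same report could also be written to the centralized catalog mentioned above.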
2. Careful data pipeline design to avoid duplicate data
Duplicate data arises when all or part of a dataset is created from the same data source, using the same logic, but by different people or teams, likely for different downstream purposes. Once duplicate data exists, the copies are very likely to drift out of sync and produce different results, with cascading effects throughout multiple systems or databases. In the end, when a data issue arises, it becomes difficult and time-consuming to trace the root cause, not to mention fixing it.
To prevent this from happening, an organization needs a data pipeline that is clearly defined and carefully designed in areas including data assets, data modeling, business rules, and architecture. Effective communication is also needed to promote and enforce data sharing within the organization, which will improve overall efficiency and reduce potential data quality issues caused by data duplication. This gets into the core of data management, the details of which are beyond the scope of this article. At a high level, there are 3 areas that need to be established to prevent duplicate data from being created:
- A data governance program, which clearly defines the ownership of a dataset and effectively communicates and promotes dataset sharing to avoid any department silos.
- Centralized data assets management and data modeling, which are reviewed and audited regularly.
- Clear logical design of data pipelines at the enterprise level, which is shared across the organization.
With today’s rapid changes in technology platforms, solid data management and enterprise-level data governance are essential for future successful platform migrations.
3. Accurate gathering of data requirements
An important aspect of good data quality is satisfying the requirements: delivering data to clients and users that serves the purpose it is intended for. This is not as simple as it first sounds, because:
- It is not easy to properly present the data. Truly understanding what a client is looking for requires thorough data discovery, data analysis, and clear communication, often via data examples and visualizations.
- The requirement should capture all data conditions and scenarios. It is considered incomplete if any dependency or condition has not been reviewed and documented.
- Clear documentation of the requirements, with easy access and sharing, is another important aspect, which should be enforced by the Data Governance Committee.
The role of the business analyst is essential in requirement gathering. Business analysts understand both the clients and the current systems, which allows them to speak both sides’ languages. After gathering the requirements, they also perform impact analysis and help to come up with test plans to make sure the data produced meets the requirements.
4. Enforcement of data integrity
An important feature of relational databases is the ability to enforce data integrity using techniques such as foreign keys, check constraints, and triggers. As data volume grows, along with the number of data sources and deliverables, not all datasets can live in a single database system. The referential integrity of the data therefore needs to be enforced by applications and processes, which should be defined by data governance best practices and included in the implementation design. In today’s big data world, referential enforcement has become more and more difficult. Without a mindset of enforcing integrity from the start, referenced data can become out of date, incomplete, or delayed, which then leads to serious data quality issues.
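When two datasets live in different systems, the equivalent of a foreign key check has to run in application code. The following is a minimal sketch of such a check, assuming both datasets fit in memory as lists of dicts; the function names and the `orders`/`customers` fields in the usage note are hypothetical.

```python
def find_orphans(child_rows, parent_rows, foreign_key, parent_key):
    """Return child rows whose foreign key has no matching parent row."""
    parent_keys = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row.get(foreign_key) not in parent_keys]


def enforce_reference(child_rows, parent_rows, foreign_key, parent_key):
    """Raise instead of loading data that would break referential integrity."""
    orphans = find_orphans(child_rows, parent_rows, foreign_key, parent_key)
    if orphans:
        raise ValueError(
            f"{len(orphans)} row(s) reference a missing {parent_key}: "
            f"{[row.get(foreign_key) for row in orphans]}"
        )
    return child_rows
```

For example, `enforce_reference(orders, customers, "customer_id", "customer_id")` would stop a load in which an order points to a customer that has not arrived yet. Whether to reject, quarantine, or delay such rows is a design decision that belongs in the governance rules rather than in this sketch.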
5. Integration of data lineage traceability into the data pipelines
For a well-designed data pipeline, the time to troubleshoot a data issue should not increase with the complexity of the system or the volume of the data. Without data lineage traceability built into the pipeline, it could take hours or days to track down the cause when a data issue happens. Sometimes the investigation passes through multiple teams and requires data engineers to dig into the code.
Data lineage traceability has 2 aspects:
- Meta-data: the ability to trace through the relationships between datasets, data fields and the transformation logic in between.
- Data itself: the ability to trace a data issue quickly to the individual record(s) in an upstream data source.
Meta-data traceability is an essential part of effective data governance. It is enabled by clear documentation and modeling of each dataset from the beginning, including its fields and structure. When a data pipeline is designed and enforced by data governance, meta-data traceability should be established at the same time. Today, meta-data lineage tracking is a must-have capability for any data governance tool on the market, which makes it easier to store and trace through datasets and fields with a few clicks, instead of having data experts go through documents, databases, and even programs.
Data traceability is more difficult than meta-data traceability. Some common techniques that enable it are listed below:
- Trace by unique keys of each dataset: this first requires that each dataset have one unique key or a group of unique keys, which are then carried down to the downstream dataset through the pipeline. However, not every dataset can be traced by unique keys. For example, when a dataset is aggregated, the keys from the source get lost in the aggregated data.
- Build a unique sequence number, such as a transaction identifier or record identifier, when there are no obvious unique keys in the data itself.
- Build link tables when the relationships are many-to-many, rather than 1-to-1 or 1-to-many.
- Add a timestamp (or version) to each data record to indicate when it was added or changed.
- Log data changes in a log table, with the value before the change and the timestamp when the change happened.
Data traceability takes time to design and implement. It is, however, strategically critical for data architects and engineers to build it into the pipeline from the beginning; it is well worth the effort, considering the tremendous amount of time it saves when a data quality issue does happen. Furthermore, data traceability lays the foundation for better data quality reports and dashboards, which enable one to find data issues before the data is delivered to clients or internal users.
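Several of the record-level techniques listed above can be combined in a small sketch: a generated record identifier, a load timestamp, a link table that preserves source keys through an aggregation, and a change log. All function and field names here are hypothetical, and keeping everything in in-memory lists is an assumption made to keep the example short.

```python
from datetime import datetime, timezone


def ingest(raw_rows, source_name, start_id=1):
    """Attach a sequence-number record_id, source name and load timestamp."""
    loaded_at = datetime.now(timezone.utc)
    return [
        dict(row, record_id=record_id, source=source_name, loaded_at=loaded_at)
        for record_id, row in enumerate(raw_rows, start=start_id)
    ]


def aggregate_with_lineage(rows, group_field, value_field):
    """Sum value_field per group and keep a link table back to source records."""
    totals = {}
    links = []  # (group value, upstream record_id) pairs
    for row in rows:
        group = row[group_field]
        totals[group] = totals.get(group, 0) + row[value_field]
        links.append((group, row["record_id"]))
    return totals, links


def update_field(row, field, new_value, change_log):
    """Change a field and log the previous value with a timestamp."""
    change_log.append(
        {
            "record_id": row["record_id"],
            "field": field,
            "old_value": row[field],
            "changed_at": datetime.now(timezone.utc),
        }
    )
    row[field] = new_value
```

With the link table in place, a suspicious aggregated total can be traced back to the exact upstream records that produced it, which addresses the case where source keys are otherwise lost during aggregation.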
6. Automated regression testing as part of change management
Data quality issues often occur when a new dataset is introduced or an existing dataset is modified. For effective change management, test plans should be built around 2 themes: 1) confirming that the change meets the requirement; 2) ensuring that the change has no unintended impact on data in the pipelines that should not be changed. For mission-critical datasets, regression testing should be run on every deliverable whenever a change happens, and comparisons should be done for every field and every row of a dataset. With the rapid progress of big data technologies, system migrations happen every few years. Automated regression testing with thorough data comparisons is a must to make sure good data quality is maintained consistently.
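A field-by-field, row-by-row comparison between the output produced before a change and the output produced after it might look like the sketch below. It assumes each dataset is a list of dicts with a unique, sortable key; the function name and the report layout are illustrative choices, not a standard.

```python
def compare_datasets(baseline, candidate, key):
    """Compare every row and every field of two datasets keyed by `key`."""
    base = {row[key]: row for row in baseline}
    cand = {row[key]: row for row in candidate}

    # Rows that disappeared or appeared after the change.
    missing = sorted(base.keys() - cand.keys())
    added = sorted(cand.keys() - base.keys())

    # Field-level differences for rows present in both datasets.
    changed = []
    for k in sorted(base.keys() & cand.keys()):
        for field in sorted(set(base[k]) | set(cand[k])):
            old, new = base[k].get(field), cand[k].get(field)
            if old != new:
                changed.append((k, field, old, new))

    return {"missing": missing, "added": added, "changed": changed}


def assert_no_regression(baseline, candidate, key):
    """Fail loudly if the change altered data that should have stayed the same."""
    diff = compare_datasets(baseline, candidate, key)
    if any(diff.values()):
        raise AssertionError(f"Regression detected: {diff}")
```

Keying the comparison on a unique identifier lets the report name the exact rows and fields that differ, which is what makes the check usable in an automated test suite rather than a manual review.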
7. Capable data quality control teams
Lastly, 2 types of teams play critical roles in ensuring high data quality for an organization:
Quality Assurance: This team checks the quality of software and programs whenever changes happen. Rigorous change management performed by this team is essential to ensure data quality in an organization that undergoes fast transformations and changes with data-intensive applications.
Production Quality Control: Depending on the organization, this does not have to be a separate team. Sometimes it can be a function of the Quality Assurance or Business Analyst team. The team needs to have a good understanding of the business rules and business requirements, and be equipped with the tools and dashboards to detect abnormalities, outliers, broken trends and any other unusual scenarios that happen in production. The objective of this team is to identify any data quality issue and have it fixed before users and clients notice it. This team also needs to partner with customer service teams so that it can get direct feedback from customers and address their concerns quickly. With the advances of modern AI technologies, its efficiency can potentially be improved drastically. However, as stated at the beginning of this article, quality control at the end is necessary but not sufficient to ensure a company creates and sustains good data quality. The 6 steps stated above are also required.
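As one example of the kind of check a production quality control team might automate, the sketch below flags a daily metric (for instance, the number of rows delivered) that breaks from its recent trend. The 7-day window and the threshold of 3 standard deviations are assumptions chosen for illustration; suitable values depend on the data.

```python
from statistics import mean, stdev


def flag_anomalies(daily_values, window=7, threshold=3.0):
    """Return indexes of values that deviate sharply from the trailing window."""
    flagged = []
    for i in range(window, len(daily_values)):
        history = daily_values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # A perfectly flat history: any change at all is unusual.
            if daily_values[i] != mu:
                flagged.append(i)
        elif abs(daily_values[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged
```

A check like this only points to where something looks unusual. Deciding whether it is a real data quality issue still requires the understanding of business rules described above.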
Summary
In conclusion, good data quality requires disciplined data governance, rigorous management of incoming data, accurate requirement gathering, thorough regression testing for change management and careful design of data pipelines, in addition to data quality control programs for the data delivered both externally and internally. For all quality problems, it is much easier and less costly to prevent the data issue from happening in the first place than to rely on defensive systems and ad hoc fixes. Finally, by following the 7 steps in this article, good data quality can be not only guaranteed but also sustained.