In 2022, an organization with 1,000 employees used an average of 177 SaaS applications. Most of these applications store data relevant to their own function; to perform cross-organizational analysis, however, this data must be aggregated, enriched, and integrated. This vastly expands the scope of data quality initiatives compared with the past, when all data came from a handful of internal ERP or CRM applications that stored it in a structured manner. New AI and ML use cases often rely on synthetic data, which in turn depends on good-quality real-life data.
If we spent the past decade amassing more data, the current decade is more concerned with ensuring we have the right data.
Gartner estimates the cost of poor data quality at an average of $15M per organization. This is also the decade in which new ways of delivering data, such as data mesh, data products, data sharing, and marketplaces, are becoming mainstream.
Take the example of an orders table in a retail application. Sales taxes in the US differ widely across states, counties, and cities, and they change frequently. Your data quality subsystem should flag orders where it infers that an incorrect tax rate may have been applied. The sooner an organization can catch and rectify such problems, the lower the cost.
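As a concrete illustration, a check of this kind might compare the tax applied to each order against the expected rate for the order's jurisdiction and flag outliers. The sketch below is a minimal, hypothetical example in Python; the column names, rate table, and tolerance are assumptions for illustration, not a prescribed schema.

```python
import pandas as pd

# Hypothetical lookup of expected combined sales tax rates by jurisdiction.
# In practice this would come from a maintained tax-rate service or reference table.
EXPECTED_RATES = {
    ("NY", "New York City"): 0.08875,
    ("CA", "Los Angeles County"): 0.0950,
    ("OR", None): 0.0,  # Oregon has no state sales tax
}

def flag_suspect_taxes(orders: pd.DataFrame, tolerance: float = 0.002) -> pd.DataFrame:
    """Return orders whose applied tax rate deviates from the expected rate.

    Assumes columns: order_id, state, county, subtotal, tax_amount.
    """
    def expected_rate(row):
        # Fall back to a state-level rate if the county is not in the lookup.
        return EXPECTED_RATES.get(
            (row["state"], row["county"]),
            EXPECTED_RATES.get((row["state"], None)),
        )

    orders = orders.copy()
    orders["applied_rate"] = orders["tax_amount"] / orders["subtotal"]
    orders["expected_rate"] = orders.apply(expected_rate, axis=1)
    suspect = orders[
        orders["expected_rate"].notna()
        & ((orders["applied_rate"] - orders["expected_rate"]).abs() > tolerance)
    ]
    return suspect[["order_id", "state", "county", "applied_rate", "expected_rate"]]
```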
The title of this section is ironic, because so much of traditional data quality is rule-based. The rethink requires a shift from static, predefined rules to discovering the rules hidden inside the data itself. These rules are inferred from patterns in the data, and ML algorithms can use them to predict the reliability of new incoming data. When the inferred rules are combined with existing rules, a much richer data quality system emerges.
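To make the idea concrete, consider inferring an acceptable value range for a numeric column from historical data and then scoring new records against it. The sketch below shows one simple way to do this, assuming hypothetical historical and incoming data; a production system would use richer models, but the principle of learning the rule from the data rather than hand-writing it is the same.

```python
import pandas as pd

def infer_range_rule(historical: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Infer a plausible value range from historical data using the IQR rule.

    The 'rule' is discovered from the data itself rather than written by hand.
    """
    q1, q3 = historical.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def score_new_data(incoming: pd.Series, low: float, high: float) -> float:
    """Fraction of new values that conform to the inferred rule (1.0 = all conform)."""
    return incoming.between(low, high).mean()

# Hypothetical usage: learn the rule from last quarter's order totals,
# then score today's batch before it lands in the warehouse.
# low, high = infer_range_rule(historical_orders["order_total"])
# reliability = score_new_data(todays_orders["order_total"], low, high)
```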
We have realized the limitations of hand-crafting rules and policies in the dynamic, fast-changing world of data. The new frontier is to understand the “behavior” of data using ML models, dynamically detect anomalies, and recommend remediation steps. One example of discovering rules concerns the volume of data that typically enters a system: as a business grows, this volume increases at a fairly steady rate that can be predicted with ML techniques. If there is a sudden, unexplained drift from the expected range, the data quality product should alert the stakeholders. The faster this happens, the smaller the blast radius of the damage.
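One simple way to operationalize such a volume check, sketched below under assumed column names and a hypothetical alerting function, is to fit a trend to historical daily row counts and alert when a new day's volume falls outside a band around that trend. A real system would account for seasonality and use more robust models, but the shape of the check is the same.

```python
import numpy as np
import pandas as pd

def check_volume_drift(daily_counts: pd.Series, new_count: float,
                       z_threshold: float = 3.0) -> bool:
    """Return True if today's row count drifts from the trend implied by history.

    daily_counts: historical daily row counts indexed by date (assumed input).
    """
    x = np.arange(len(daily_counts))
    # Fit a simple linear growth trend to historical volumes.
    slope, intercept = np.polyfit(x, daily_counts.values, deg=1)
    residuals = daily_counts.values - (slope * x + intercept)
    sigma = residuals.std(ddof=1)

    expected_today = slope * len(daily_counts) + intercept
    return abs(new_count - expected_today) > z_threshold * sigma

# Hypothetical usage: alert stakeholders as soon as an unexplained drop or spike appears.
# if check_volume_drift(history["row_count"], todays_row_count):
#     alert_stakeholders("orders table volume outside expected range")
```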
The modern data quality approach is contextual and designed to deliver data outcomes faster and with higher reliability and trust.