We live in a complex world that is full of data, and it is getting fuller every day. In 2020, the world collectively created, captured, copied, and consumed an estimated 64.2 zettabytes of data, and by 2025 that figure is expected to nearly triple to 180 zettabytes.
Increasingly, companies depend on this data to create great experiences for customers and to drive revenue. At the same time, without a way to automate the detection of data quality issues, all of this data can quickly get out of hand, eroding trust and hurting the bottom line.
Data observability systems have emerged as crucial tools for data-driven companies, helping them leverage huge amounts of data without sacrificing quality or reliability. These systems collect signals about the data, such as freshness, nulls and blanks, and business rule violations, and use those signals to detect or even prevent errors. With an effective data observability system in place, data teams can answer many of the biggest questions around resource allocation, training, and hiring by focusing on the data that matters most, and they gain a way to measure their remediation efforts.
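To make this concrete, here is a minimal sketch of signal collection in Python. The table name (orders) and its columns (updated_at, amount) are hypothetical; a real observability system computes signals like these continuously, per table and per column, and persists them as time series.

import pandas as pd

def collect_signals(df: pd.DataFrame) -> dict:
    """Compute a few common data-health signals for one table."""
    now = pd.Timestamp.now(tz="UTC")
    return {
        # Freshness: hours since the table last received data.
        "hours_since_update": (now - df["updated_at"].max()).total_seconds() / 3600,
        # Completeness: share of nulls in a key column.
        "amount_null_rate": df["amount"].isna().mean(),
        # Business rule: order amounts should never be negative.
        "negative_amount_violations": int((df["amount"] < 0).sum()),
    }

orders = pd.DataFrame({
    "updated_at": pd.to_datetime(["2024-01-01 10:00", "2024-01-01 12:00"], utc=True),
    "amount": [25.0, -3.0],
})
print(collect_signals(orders))

Each signal on its own is cheap to compute; the value comes from tracking them over time, which is where the statistical work described below takes over.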
Data observability is often implemented by engineers, so it is easy to assume that it is primarily an engineering effort. But a truly effective data observability system requires effective data science. As data scientists, we can make data observability far better by applying the most appropriate statistical and machine learning tools to automate these complex processes; indeed, data science is responsible for some of the most important parts of a data observability system.
Data Science and Data Observability
While I was at Uber, I led the team that developed the internal Data Quality Monitor (DQM), which tracked the data health of critical platforms by applying forecasting and anomaly detection to observability signal metrics. We identified specific signals that reflect data health, such as column row counts and averages, which together give an overview of the general health of the data landscape at the company. From that experience, and from my experience at Bigeye, I have learned just how critical data science is to an effective observability system.
Let’s take anomaly detection as an example. Anomaly detection, coupled with powerful forecasting on time-series signals, lets the system synthesize data dynamics into a picture of the health of the data driving the business. With anomaly detection in place, data teams can proactively detect issues (even the “unknown unknowns” that can wreak havoc on data applications) and adapt to changes in the business, and leadership can home in on the root causes of business problems quickly. Without advanced anomaly detection, data observability systems are left reactive: they wait for something to go wrong and then try to build a check for it into the system after the fact. This is a losing battle; as the business and the data change, new issues will keep emerging.
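As a toy illustration (not Uber’s DQM or Bigeye’s actual implementation), the sketch below applies a naive forecast, a rolling mean with a rolling standard-deviation band, to a synthetic daily row-count signal and flags points that fall far outside the band. Production systems use much richer forecasting models that handle trend, seasonality, and holidays.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = pd.date_range("2024-01-01", periods=60, freq="D")
row_counts = pd.Series(10_000 + rng.normal(0, 200, size=60), index=days)
row_counts.iloc[-1] = 4_000  # inject an anomaly: rows suddenly drop

window = 14
forecast = row_counts.rolling(window).mean().shift(1)  # naive forecast
spread = row_counts.rolling(window).std().shift(1)     # expected noise
z = (row_counts - forecast) / spread                   # standardized error

anomalies = z[z.abs() > 3]  # flag points far outside the forecast band
print(anomalies)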
Let’s dig in further by looking at slow degradation: a data pipeline issue that appears to be low severity and is therefore easy to miss, but that can snowball over time into much more severe problems if left unchecked. The result is the terrible realization that an undetected issue has been eroding data pipelines for weeks or months. Teams then have to hunt down which datasets are affected and go through a complex remediation process to correct the eroded data. Unfortunately, many teams do not take appropriate and timely action, and this can even lead to legal problems when the underlying data feeds the company’s finances.
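The sketch below, again with synthetic data and an illustrative threshold, shows why slow degradation calls for different statistics: no single day’s null rate is anomalous on its own, but fitting a trend over a longer window surfaces the drift.

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
days = pd.date_range("2024-01-01", periods=90, freq="D")
# Null rate creeps from ~1% to ~10% over three months, plus daily noise.
null_rate = pd.Series(
    np.linspace(0.01, 0.10, 90) + rng.normal(0, 0.005, 90), index=days
)

x = np.arange(len(null_rate))
slope, intercept = np.polyfit(x, null_rate.to_numpy(), 1)
projected_drift = slope * len(null_rate)  # total change over the window

if projected_drift > 0.05:  # e.g., alert if nulls grew by >5 points
    print(f"Slow degradation: null rate drifted up {projected_drift:.1%} in 90 days")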
Slow degradation alerting is just one example of dynamic, data-driven anomaly detection. Other examples include reinforcement learning, anomaly exclusion, pattern recognition, and more. In short, data science is critical to most of what makes anomaly detection intelligent in the massive-data world we live in today.
The Observability Opportunity
Observability has played a huge role in making software more reliable, with companies like Datadog, Dynatrace, New Relic, and AppDynamics carving out a huge market for Application Performance Monitoring (APM). Now data observability is revolutionizing the data space, and this fast-expanding market needs data scientists to help build world-class data observability systems.
If you are currently a data scientist, or studying to dive into this career path, a plethora of interesting observability problems awaits you, and some of the new approaches could change the landscape of tech in the years to come.
Henry Li, senior data scientist, Bigeye