IT administrators have used failure metrics for decades to track the reliability and performance of their infrastructure, whether it be PC hardware, networks, or servers.
After all, most experts agree that to manage something well, you need to measure it.
Data engineers and DataOps teams have also adopted failure metrics to measure the reliability of their data and data pipelines, and the effectiveness of their troubleshooting efforts.
However, when it comes specifically to data, some metrics are more relevant and useful than others, especially in today’s cloud-heavy environments.
Ranking Metrics
This blog ranks the dozen most common failure metrics in use today, in order of relevance and importance for data engineers, starting with the most niche and least relevant ones and finishing with the most important ones that all DataOps teams should be tracking. After that, I’ll discuss how a continuous, multidimensional data observability platform like Acceldata can be invaluable in helping data engineers and data reliability engineers optimize these metrics.
12. Mean Time To Failure (MTTF)
Historically, this term measures the average lifespan of a non-repairable piece of hardware or device under normal operating conditions. MTTF is potentially useful for data engineers overseeing mission-critical data centers and on-premises data servers who want to plan their hardware refreshes around the predicted lifespans of hard disks or solid-state drives. Secondarily, it can apply to the network hubs, switches, and cards that move data from node to node.
Of course, responsibility for such hardware usually lies primarily with IT or network admins, reducing the importance of MTTF to data engineers. MTTF has also become increasingly irrelevant as many organizations move their data to hosted providers or cloud-native web services. It’s also generally less useful than Mean Time Between Failures (MTBF), which I discuss later.
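For the rare case where a data team does own this kind of hardware, the math behind MTTF is straightforward. Here is a minimal sketch in Python with made-up lifespan numbers, not data from any real fleet:

```python
# Hypothetical drive fleet: hours each (non-repairable) drive ran before it failed.
hours_to_failure = [41_000, 39_500, 44_200, 37_800]

# MTTF is simply the average lifespan across the failed units.
mttf_hours = sum(hours_to_failure) / len(hours_to_failure)
print(f"MTTF: {mttf_hours:,.0f} hours (~{mttf_hours / 24 / 365:.1f} years)")
```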
11. Mean Time To Detect (MTTD)
A metric popular in cybersecurity circles, MTTD can help measure the effectiveness of your monitoring and observability platforms and automated alerts. However, overemphasizing MTTD can backfire. For instance, monitoring systems tuned for the shortest possible MTTD tend to alert too quickly and too often, creating a tidal wave of alerts for minor issues or outright false positives. That can demoralize data engineers and create the serious problem of alert fatigue.
Also, the best continuous observability platforms use machine learning or advanced analytics to predict failures and bottlenecks before they happen. MTTD does not capture the superiority of data observability systems capable of such predictions.
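If you do want to track MTTD yourself, it reduces to averaging the gap between when a failure actually began and when your monitoring flagged it. The sketch below uses hypothetical incident records and field names; it is not how any particular observability platform computes the number:

```python
from datetime import datetime

# Hypothetical incident records: when the failure began vs. when monitoring detected it.
incidents = [
    {"started": datetime(2023, 5, 1, 9, 0),   "detected": datetime(2023, 5, 1, 9, 12)},
    {"started": datetime(2023, 5, 3, 14, 30), "detected": datetime(2023, 5, 3, 14, 35)},
    {"started": datetime(2023, 5, 7, 22, 5),  "detected": datetime(2023, 5, 7, 22, 50)},
]

# MTTD = average of (detection time - failure start time)
detection_gaps = [(i["detected"] - i["started"]).total_seconds() / 60 for i in incidents]
mttd_minutes = sum(detection_gaps) / len(detection_gaps)
print(f"MTTD: {mttd_minutes:.1f} minutes")
```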
10. Mean Time To Identify (MTTI)
Mostly interchangeable with MTTD above, MTTI shares the same advantages and disadvantages.
9. Mean Time To Verify (MTTV)
This usually denotes the last step in the resolution or recovery process. MTTV tracks the time from when a fix is deployed to when that fix is proven to have solved the issue. With today’s complex data pipelines and far-flung, heterogeneous data repositories, reducing MTTV can be a significant challenge when verification is done manually. It is potentially useful for data engineering managers, but few others.
8. Mean Time To Know (MTTK)
Measures the gap between when an alert is sent and when the cause of the underlying issue is discovered. This can be a good way to track the forensic skills of your DataOps team. Otherwise, MTTK is a fairly niche metric.
7. Mean Time To Acknowledge (MTTA)
Tracks the time from when a failure is detected to when work begins on an issue. Like MTTK (Mean Time To Know), this granular metric can help track and boost the responsiveness of on-call DataOps teams, and also help ensure that internal customers and users are notified in a timely fashion that their problems are being handled. MTTA works best when paired with MTTK or MTTR (Mean Time To Respond). This ensures that on-call data engineers don’t game the system by, for instance, responding to alerts instantly but starting their actual work at a more leisurely pace.
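Here is a rough illustration of that pairing, using hypothetical on-call records (the field names and numbers are invented for the example):

```python
from datetime import datetime

# Hypothetical on-call records: detection, acknowledgment, and when real work began.
pages = [
    {"detected": datetime(2023, 6, 2, 3, 0),
     "acknowledged": datetime(2023, 6, 2, 3, 1),
     "work_started": datetime(2023, 6, 2, 3, 40)},
    {"detected": datetime(2023, 6, 5, 11, 15),
     "acknowledged": datetime(2023, 6, 5, 11, 18),
     "work_started": datetime(2023, 6, 5, 11, 25)},
]

def mean_minutes(deltas):
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

mtta = mean_minutes([p["acknowledged"] - p["detected"] for p in pages])
ack_to_work = mean_minutes([p["work_started"] - p["acknowledged"] for p in pages])

print(f"MTTA: {mtta:.1f} min")
# A large acknowledgment-to-work gap hints at "instant ack, slow start" behavior.
print(f"Acknowledgment-to-work gap: {ack_to_work:.1f} min")
```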
6. Mean Time To Respond (MTTR)
This is the lesser version of MTTR; it measures how long it takes for your team to respond to a pager alert or email. The metric can be useful for tracking and motivating data engineering teams, but it is fairly granular and best used in conjunction with the better-known MTTR (Mean Time To Recover/Resolve/Repair). That way, you can track both how long it takes DataOps teams to respond to problems and how long it takes them to fix them.
5. Mean Time Between Service Incidents (MTBSI)
This is calculated by adding Mean Time Between Failures (MTBF) and MTRS/MTTR (Mean Time to Restore Service/Mean Time To Recovery). It is an important strategic metric that can be shared with your internal customers, as it captures both the reliability of your infrastructure and the responsiveness and skill of your DataOps team at properly diagnosing root causes.
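In practice the arithmetic is simply additive. The numbers below are illustrative only:

```python
# Illustrative numbers only: if pipelines run ~200 hours between failures (MTBF)
# and it takes ~4 hours on average to restore service (MTRS),
# then a new incident can be expected roughly every 204 hours.
mtbf_hours = 200
mtrs_hours = 4
mtbsi_hours = mtbf_hours + mtrs_hours  # MTBSI = MTBF + MTRS
print(f"MTBSI: {mtbsi_hours} hours")
```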
4. Mean Time to Restore Service (MTRS)
This is a useful business-centric metric for data engineers focused on performance and uptime for customers. It can apply to both on-premises data servers and infrastructure that is hosted or run on a public, multi-tenant service. In those contexts, it is synonymous with Mean Time To Recovery/Resolve/Repair (MTTR). However, its non-applicability to data quality issues knocks it down a few notches from MTTR.
3. Mean Time Between Failures (MTBF)
What a difference a preposition makes. Mean Time To Failure (MTTF) only applies to hardware that cannot be repaired, making it a fairly niche metric. Mean Time Between Failures (MTBF), meanwhile, can be applied to both repairable hardware and software, which, unless it has been hopelessly corrupted, can be restarted. For instance, MTBF would be a great metric for tracking data application and data server crashes. That flexibility makes MTBF a key metric that all data teams should employ, both to improve team performance and to improve relations with their business-side customers.
MTBF should NOT include the time to repair hardware or recover/restore service. To account for that, data engineers would use a KPI such as MTBSI (Mean Time Between Service Incidents), which would include MTBF and either MTTR (Mean Time To Recovery) or MTRS (Mean Time to Restore Service).
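Here is a minimal sketch of that calculation, assuming you have an outage log with failure-start and service-restored timestamps (the log format and numbers here are hypothetical):

```python
from datetime import datetime

# Hypothetical outage log for one data application: (failure start, service restored).
outages = [
    (datetime(2023, 4, 2, 8, 0),  datetime(2023, 4, 2, 10, 0)),
    (datetime(2023, 4, 20, 13, 0), datetime(2023, 4, 20, 13, 30)),
    (datetime(2023, 5, 9, 1, 0),  datetime(2023, 5, 9, 3, 0)),
]
window_start = datetime(2023, 4, 1)
window_end = datetime(2023, 6, 1)

total_hours = (window_end - window_start).total_seconds() / 3600
downtime_hours = sum((end - start).total_seconds() / 3600 for start, end in outages)

# MTBF counts only operating (up) time between failures; repair time is excluded.
mtbf_hours = (total_hours - downtime_hours) / len(outages)
print(f"MTBF: {mtbf_hours:.1f} hours")
```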
2. Mean Time To Recover/Resolve/Restore/Repair (MTTR)
The differences between each of these R words are subtle but salient in the data context. Are you tracking how long it takes to bring an interrupted data pipeline back online? Use Recover or Restore. Or do you need to measure how long it takes to locate and fix a data error or other data quality issue? Use Resolve or Repair.
MTTR includes the time to diagnose the symptoms or general problem, perform Root Cause Analysis (RCA) to locate the specific causes, and then fix it. It is pretty much synonymous with MTRS (Mean Time to Restore Service).
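One way to make that definition concrete is to treat each incident's repair time as the sum of those phases, then average across incidents. The phase names and durations below are invented for illustration only:

```python
from datetime import timedelta

# Hypothetical phase durations for one incident: diagnosis, root cause analysis, and the fix.
phases = {
    "diagnose_symptoms": timedelta(minutes=25),
    "root_cause_analysis": timedelta(hours=1, minutes=10),
    "apply_fix": timedelta(minutes=40),
}

# Time to repair for a single incident is the sum of all three phases;
# averaging these totals across incidents gives the team-level MTTR.
time_to_repair = sum(phases.values(), timedelta())
print(f"Time to repair this incident: {time_to_repair}")
```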
MTTR is probably the best-known failure metric in the ITOps and DevOps communities. It can be used to improve DataOps team performance and can also be shared with your internal users.
Perhaps surprisingly, though, I am only ranking it as the second-most-important metric for data engineers and other DataOps team members.
1. Mean Down Time (MDT)
Minimizing data downtime, whether caused by bottlenecks or unreliable data, is the closest thing there is to an overarching goal for data engineering. Zero downtime is the target, though in practice it is unachievable, especially when you include both scheduled and unscheduled downtime. Mean Down Time can also be expressed in reverse as an uptime percentage, with the goal typically being 99.999 percent availability, or "five nines" of high availability.
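For reference, here is the back-of-the-envelope arithmetic that turns those availability targets into an annual downtime budget:

```python
# Worked arithmetic: translating an availability target into a downtime budget.
minutes_per_year = 365 * 24 * 60  # 525,600 minutes

for nines, availability in [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]:
    allowed_downtime = minutes_per_year * (1 - availability)
    print(f"{nines} ({availability:.3%}): about {allowed_downtime:.1f} minutes of downtime per year")
```

Five nines works out to roughly five minutes of downtime per year, which is why most teams treat it as an aspiration rather than a guarantee.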