AWS outage – when the ‘technical debt’ bill comes due
Most press coverage of AWS’s recent outage has explained the event as the result of a “typo” made by one of its engineers while working on a billing subsystem. The ‘typo’ spin may simply be the media’s way of dramatizing the blunder and making it easy to explain. Certainly, Amazon’s own post-mortem noted that “one of the inputs to the command was entered incorrectly [read ‘typo’] and a larger set of servers was removed than intended.” I believe that Amazon is trying to shift attention away from how it has designed, protected and audited its systems, a systemic process that affects all of its operations. Instead, it has chosen to portray this as a one-off event that happened within one small subsystem because someone didn’t follow the approved playbook.
AWS didn’t follow one of the key principles advocated by Google’s team in managing technical debt: ‘Make it hard to do the wrong thing’
Google’s advice to make it hard for errors to happen follows established practice in safety-critical industries. For example, the controls in the cockpit of a jet airliner have been designed to an excruciating (and necessary) level of detail. If there’s a button or a switch that shouldn’t be pressed by mistake, it may be covered by a housing that has to be opened before the button inside can be pressed. Similarly, the systems engineers at AWS should already have had a built-in safety check that made it impossible for a simple ‘typo’ to cause a network outage. Was there a maximum-value check on the specific input field that caused this problem? If not, why not? Also, what other systems have similar capabilities and dependencies, and are all of those dependencies being assessed for similar problems?
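As a rough illustration of the kind of maximum-value check asked about above, here is a minimal Python sketch. It is not a description of AWS’s actual tooling: the function name, the 5% ceiling and the minimum-remaining threshold are all hypothetical values chosen only to show the idea of refusing a request that is out of proportion to the fleet.

```python
class UnsafeRemovalError(ValueError):
    """Raised when a removal request exceeds the configured safety limits."""


def check_removal(requested: int, fleet_size: int,
                  max_fraction: float = 0.05, min_remaining: int = 1) -> int:
    """Return `requested` if it is safe to remove that many servers.

    All limits here are illustrative assumptions, not real AWS settings.
    """
    if requested <= 0:
        raise UnsafeRemovalError("requested count must be positive")
    if requested > fleet_size:
        raise UnsafeRemovalError("cannot remove more servers than exist")

    # Hard ceiling: never remove more than a small fraction in one command.
    limit = int(fleet_size * max_fraction)
    if requested > limit:
        raise UnsafeRemovalError(
            f"{requested} exceeds the per-command limit of {limit}"
        )

    # Floor: never let the fleet drop below its minimum capacity.
    if fleet_size - requested < min_remaining:
        raise UnsafeRemovalError("removal would leave too little capacity")

    return requested


if __name__ == "__main__":
    print(check_removal(5, fleet_size=200))   # small request is accepted
    try:
        check_removal(50, fleet_size=200)     # a slip of the finger
    except UnsafeRemovalError as err:
        print("refused:", err)
```

With a guard like this, a mistyped input fails loudly instead of being executed, which is the software equivalent of the housing over the cockpit switch.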
Talk with your CIO
Google’s article on managing technical debt is a great starting point for a conversation with your CIO about the procedures they have established to prevent outages and mishaps. Managing technical debt may seem like a nuisance, and being told by your CIO that ‘we are following industry standards’ may be comforting and seem like a satisfactory answer. However, this recent AWS incident shows why CEOs need to be informed well enough to judge for themselves whether the best procedures are in place to manage such risks.
Technical debt implications for Machine Learning & Data Science
If, like many companies, you are using data science to drive business operations and to guide strategic decision-making, then Google’s 2015 paper ‘Hidden Technical Debt in Machine Learning Systems’ is a must-read. For example, the authors note that software engineering has many established procedures for managing dependency debt in code, but that the comparable data dependencies in machine learning systems are more costly and harder to detect, and the practices for managing them are not as well established. The Google team offers the following useful questions:
- How easily can an entirely new algorithmic approach be tested at full scale?
- What is the transitive closure of all data dependencies?
- How precisely can the impact of a new change to the system be measured?
- Does improving one model or signal degrade others?
- How quickly can new members of the team be brought up to speed?
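The second question, about the transitive closure of data dependencies, is the most mechanical of the five, so a small Python sketch may help make it concrete. The dependency graph below is entirely hypothetical (the model and data source names are invented for illustration); the point is only that every upstream source, direct or indirect, can be enumerated once the direct dependencies are written down.

```python
from typing import Dict, List, Set


def transitive_dependencies(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """Return everything `start` depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(graph.get(start, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already visited; also protects against cycles
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen


# Hypothetical example: each key lists the inputs it consumes directly.
DEPENDENCIES = {
    "churn_model": ["user_features", "billing_events"],
    "user_features": ["clickstream", "crm_export"],
    "billing_events": ["payments_db"],
    "crm_export": ["payments_db"],
}

if __name__ == "__main__":
    for name in sorted(transitive_dependencies(DEPENDENCIES, "churn_model")):
        print(name)
```

In this made-up example the model lists only two direct inputs but depends on five sources in total, which is the kind of hidden exposure the question is meant to surface.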