As a former tech trends forecaster, I’m always pondering what we could achieve and how a better understanding of history could help us navigate the decades ahead.
Companies large and small have overcome major data handling obstacles over the past 20 years. It may not seem so, but we’ve come a long way in that time. Remember, it was only 2003 when Nick Carr declared that “IT Doesn’t Matter” in a Harvard Business Review article.
Back then, many CIOs didn’t even seem to care much about “data”. Data in its rawer forms was too abstract and too messy for the C-suite. They were still thinking in terms of monolithic application suites and maintaining the status quo, more or less.
One thing that’s a lot different now than in 2002: Smaller organizations can have a bigger impact sooner, and companies can grow bigger faster. That’s due not only to the advent of cloud computing, but also to improved data sharing and distributed processing.
Below is a small sample of major enterprise data success stories we’ve seen over the last 20 years.
The adoption of APIs and the birth of Amazon Web Services
Back in 2002, when Amazon’s development teams were still focused only on their internal customers, Jeff Bezos sent a now-famous memo. The memo, known later as the “API Mandate”, included the following edicts (directly quoted here):
- All teams will henceforth expose their data and functionality through service interfaces.
- Teams must communicate with each other through these interfaces.
- There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team’s data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.
- It doesn’t matter what technology they use. HTTP, Corba, Pubsub, custom protocols — doesn’t matter.
- All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.
- Anyone who doesn’t do this will be fired.
- Thank you; have a nice day!
This memo established the cornerstone principles for what became a major trend in application programming interfaces, or APIs. One of the older names for APIs was web services, and we struggled with web services for quite a while. As the memo reflects, various protocols designed to accomplish what APIs do have been around for over 30 years. But Amazon provided the discipline, insight and push that finally led to the success and broad adoption of APIs.
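To make the mandate’s first edict concrete, here’s a minimal sketch of a team exposing its data through a service interface rather than letting other teams read its data store directly. It’s written in Python with Flask purely for illustration; the endpoint, data and team boundaries are hypothetical, not Amazon’s actual design.

```python
# Hypothetical sketch: a team exposes its inventory data only through a
# service interface, never through direct reads of its private data store.
from flask import Flask, abort, jsonify

app = Flask(__name__)

# The team's private data store -- other teams never touch this directly.
_INVENTORY = {"sku-123": {"name": "widget", "in_stock": 42}}


@app.route("/inventory/<sku>")
def get_inventory(sku):
    """Service interface call: the only way other teams read this data."""
    item = _INVENTORY.get(sku)
    if item is None:
        abort(404)
    return jsonify(item)


if __name__ == "__main__":
    app.run(port=8080)
```

The point isn’t the framework; it’s that the dictionary backing the service stays private, and every consumer, internal or external, goes through the same interface over the network.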
By 2006, Amazon had established a service-oriented architecture (SOA) and was able to offer the pioneering Amazon Web Services (AWS) to external customers for the first time. By 2020, AWS was generating over $40 billion in annual revenue for Amazon and its shareholders. By 2022, AWS had customers in 190 countries.
Distributed “big data” analytics for the masses
Back in the 1990s when Google was just getting started, few organizations were collecting and processing as much web data as Google was. And the servers and software to handle the processing were costly.
Given the storage and processing costs it was facing, Google in the early 2000s decided to optimize distributed data processing at scale, redesigning aspects of its architecture around commodity servers of its own design. It created the Google File System, inspired in part by IBM’s General Parallel File System (GPFS). GPFS had been used by wealthy, data-hungry and less price-sensitive companies offering credit scoring and other very large-scale business analytics services. The Google File System, by contrast, became a distributed file system for those on a budget.
Then Google added MapReduce to make processing across large clusters or server farms possible.
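MapReduce is worth pausing on, because the programming model is simple enough to sketch in a few lines. The toy example below runs in a single Python process; in the real systems, the same map and reduce functions would be shipped out to thousands of machines, with the framework handling the shuffle, scheduling and fault tolerance.

```python
# Toy, single-process illustration of the MapReduce programming model;
# not Google's or Hadoop's implementation, just the shape of the idea.
from collections import defaultdict


def map_phase(document):
    """Map: emit (word, 1) pairs for each word in a document."""
    for word in document.split():
        yield (word.lower(), 1)


def reduce_phase(word, counts):
    """Reduce: sum all the counts emitted for one word."""
    return word, sum(counts)


documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Shuffle: group intermediate pairs by key before reducing.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

word_counts = dict(reduce_phase(w, c) for w, c in grouped.items())
print(word_counts)  # {'the': 3, 'quick': 2, 'dog': 2, ...}
```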
Yahoo reverse engineered this approach and open sourced the result as Apache Hadoop. Within a few years, before adoption of public cloud computing had really taken off, Hadoop clusters were the primary means of big data processing, and companies like Google and Yahoo were running distributed commodity clusters of 10,000 or more nodes. Open source Hadoop sparked broader adoption: by 2013, most of the Fortune 50 were running Hadoop clusters.
Though many companies have since migrated their data to public clouds, the Hadoop Distributed File System taught us some valuable lessons. For one thing, we were overusing databases; file systems or blob/bucket storage can be sufficient, depending on the use case. Companies like Filebase have even emerged to offer decentralized file systems, in Filebase’s case one with a familiar S3-compatible interface. Nowadays, searching or otherwise manipulating data in storage buckets using something like BigQuery has become more and more common.
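Because these services speak the S3 protocol, the standard AWS SDK can usually point at them simply by overriding the endpoint. The sketch below uses Python’s boto3 against an S3-compatible endpoint; the endpoint URL, bucket name and credentials are placeholders, so treat it as a pattern rather than working configuration.

```python
# Hedged sketch: talking to an S3-compatible object store by overriding the
# endpoint in the standard AWS SDK. Endpoint, bucket and keys are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.filebase.com",  # assumed S3-compatible endpoint
    aws_access_key_id="YOUR_KEY",
    aws_secret_access_key="YOUR_SECRET",
)

# List objects in a bucket exactly as you would against AWS S3 itself.
response = s3.list_objects_v2(Bucket="example-bucket")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```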
Knowledge graphs for enterprise transformation
Back just after Hurricane Katrina hit in 2005, Parsa Mirhaji, a cardiologist and computer science PhD, was already making his database skills pay off: during the hurricane’s aftermath, he put to work a disaster response system he’d designed. Later, in the 2010s, he and a team at Montefiore/Einstein built an integrated, real-time, analytics-ready, continually updated semantic knowledge graph, which he ran in parallel with the hospital chain’s regulatory system. The graph has many different uses, given that advanced analytics and machine learning can be run on top of it, and some of those uses extend into the operational realm.
The source data is a mix of data types and services, some real-time and others batch (such as a knowledge graph that includes the structured SNOMED nomenclature, comprehensive medical reference content, etc.). The target data lake is a semantic integration graph with connectors to the electronic medical records database and to other data modeled in other ways.
The graph can output model-driven, contextualized feeds. At its heart is a minimalist patient-centric ontology.
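To give a flavor of what a patient-centric graph looks like in practice, here’s a small, hypothetical sketch using Python’s rdflib. The namespace, ontology terms and data are invented for illustration and are not Montefiore’s actual model; the SPARQL query at the end stands in for the kind of contextualized feed described above.

```python
# Hypothetical patient-centric graph in RDF; the namespace and terms below
# are invented for illustration, not Montefiore's actual ontology.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/clinic/")
g = Graph()

# A patient node linked to an encounter and a coded diagnosis.
g.add((EX.patient42, RDF.type, EX.Patient))
g.add((EX.patient42, EX.hasEncounter, EX.encounter7))
g.add((EX.encounter7, EX.hasDiagnosis, EX.sepsis))
g.add((EX.sepsis, EX.snomedCode, Literal("91302008")))  # illustrative code

# A contextualized feed: every patient with a given diagnosis code.
query = """
SELECT ?patient WHERE {
  ?patient a <http://example.org/clinic/Patient> ;
           <http://example.org/clinic/hasEncounter> ?enc .
  ?enc <http://example.org/clinic/hasDiagnosis> ?dx .
  ?dx <http://example.org/clinic/snomedCode> "91302008" .
}
"""
for row in g.query(query):
    print(row.patient)
```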
What Mirhaji and his team have built is a graph that can serve more and more patient- or cohort-specific analytics as it expands and evolves. Once it was able to provide the results of horizontal studies, for example, the hospital chain was able to move to a pay-for-performance business model and save costs in many different areas.
Many have heard the stories of the bigger companies I’ve mentioned, to the point where we’ve taken their accomplishments for granted. But smaller organizations like Montefiore Health are in some ways even more impressive, because they’ve innovated in industries with much smaller margins and far less budget. I’m looking forward to finding more of these rags-to-riches data stories over the coming decade, now that compute, networking and storage have all improved and driven down the cost of doing more with data.