If you are a start-up director in 2016, not a week goes by without someone talking to you about unicorns.
It is difficult to imagine what riding a unicorn feels like. Do you feel the wind of success rushing against your skin? Do you fear the arrows and traps of pitiless hunters? Do you feel a special thrill when you reach the end of the rainbow of profits?
I find it easier to imagine the state of mind of the directors of Cloudera, Hortonworks or MapR, who are now the leading Big Data unicorns.
Three years ago, Wikibon was already writing about Hadoop and the coming war between Hortonworks and Cloudera. Their challenge at the time was to establish themselves as leaders in Hadoop distribution; their strategic objective was to keep the traditional software and database players (IBM, Microsoft, Oracle and HP) out of this ecosystem.
Today, in 2016, there are few reliable statistics on the deployment rates of the various Hadoop distributions. A Dezire article mentions 53% for Cloudera, 16% for Hortonworks and 11% for MapR. In any case, it seems clear that Cloudera, Hortonworks and MapR share around 80% of the market between them. On that front, they have succeeded brilliantly.
But what comes next?
It is not necessarily a matter of battling it out for a few extra percentage points. Taking a few points of market share away from Hortonworks is not going to make Cloudera the next Oracle!
In the end, the formula is simple:
- If “Big Data on Hadoop is a success”: Cloudera and Hortonworks can become the next Oracle
- If “Big Data on Hadoop is a semi-success”: Cloudera and Hortonworks remain two companies among many others
The fear of Big Data cracking
Beneath the Hadoop distribution wars, a diffuse fear is beginning to take hold of the market, like a fine crack. What if Big Data on Hadoop did not work that well?
- What if the need for large volumes of data ultimately brought value only to a minority of companies, or only to very specific uses?
- What if these systems were too complex or too costly to become a widespread success?
- What if, in the end, it is the older technologies (e.g. Oracle relational databases), scaling up as they mature, that succeed again?
- Or what if a new technology coming out of Google's laboratories took over and set a new standard?
Indeed, many companies have invested heavily in Hadoop clusters. Most of the technology companies created since 2010 have bet on Hadoop and built part of their information systems around it. For them, Hadoop is natural. For the others, their enthusiastic investment risks turning into the fear of having created “yet another data warehouse”: a storage silo whose added value they will struggle to justify.
The overall risk is to see Hadoop reduced to the role of a cheap “backup” for large volumes of data, or of a “sandbox” for a few more adventurous teams.
Manifesto for demanding Big Data analytics applications
How can we ensure that Hadoop is used more and more for critical, added-value applications?
The answer does not belong to a single vendor or to a single type of application; a large part of it probably lies in implementing critical transactional applications on Hadoop.
From the point of view of “analytics” applications (those which analyse data in order to deliver incremental value), a few deliberate choices can push things systematically in the right direction:
- Visualisation alone is not enough. Letting users visualise data admittedly creates value, but not enough to justify the Big Data investment. Every Hadoop cluster must find its justification in an operational application, in other words one sitting directly behind a production system. An operational application must produce results usable at the finest level of granularity (e.g. a score attached to each client rather than a view of client segments); a minimal sketch of such a scoring job follows this list.
- Going end to end. Preparing or cleaning the data and making it available on the target platform is all very well, but it is not enough. Building a better predictive model is all very well, but you also need to be able to test it. A requirement for the months to come is to think about Big Data projects end to end: from the data source to the final destination.
- Dare to embrace complexity. In principle, your company has already tried every simple treatment of its data: thresholds, rules of three, formulas, and so on. To go after marginal value and overtake its competition, your business will need an algorithm.
- Get the key target users onto Hadoop. Let’s be realistic: even if there is a certain infatuation with R among analysts, not all users are going to learn Python, R or Spark! Many people are allergic to the command line! For many, Hadoop is still a crude ecosystem, made by developers for developers. A whole stage of democratising the ecosystem remains only half-finished.
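To make the first principle above concrete, here is a minimal sketch of what an “operational” output on a Hadoop cluster might look like: a PySpark job that reads prepared client records, applies a previously trained model, and writes one score per client to a table that a production system can consume, rather than stopping at a dashboard. The table names, columns and model path are purely illustrative assumptions, not a reference implementation.

```python
# Minimal sketch of a client-level scoring job on Hadoop (illustrative names).
# Assumes a Hive table of prepared client features and a pyspark.ml pipeline
# trained and saved beforehand at the given HDFS path.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from pyspark.ml import PipelineModel

spark = (SparkSession.builder
         .appName("client-scoring")
         .enableHiveSupport()
         .getOrCreate())

# 1. Read the prepared client features stored on the cluster (hypothetical table).
clients = spark.table("marketing.client_features")

# 2. Apply the trained model: the output is one score per client,
#    not an aggregate view of client segments.
model = PipelineModel.load("hdfs:///models/churn_pipeline")
positive_probability = udf(lambda v: float(v[1]), DoubleType())
scored = (model.transform(clients)
          .select("client_id",
                  positive_probability("probability").alias("churn_score")))

# 3. Publish the scores where an operational application can pick them up.
scored.write.mode("overwrite").saveAsTable("marketing.client_churn_scores")

spark.stop()
```

The point is less the model itself than the last step: the cluster only justifies itself once the score lands somewhere a production system actually uses it.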
We could call this body of principles a manifesto for demanding, added-value applications on Hadoop.
Sticking to it is not easy:
- an image will always speak more clearly than a data flow,
- going end to end means crossing plenty of barriers,
- it is not simple to explain complicated things,
- and the gulf between the objective and the technology remains wide.
Hadoop launched the idea that “global” Big Data platforms were possible. Of course, everything still needs to be built in terms of organisation, tools and practices so that everyone can benefit from them. The previous revolution of this kind was the birth of relational databases, and it took nearly 15 years for them to reach their full potential, for example through the democratisation of website creation. We are only at the start!
Cloudera, Hortonworks and the others will do all they can to stop Big Data from cracking. But let’s all play a part!