If you still can’t figure out what exactly Hadoop is, don’t worry, you’re not alone. But hopefully this multi-part blog series will help! Hadoop is very difficult to understand if, like most people, you’re gathering bits and pieces of info here and there. And, sorry Wikipedia, but your page isn’t very helpful if you aren’t a computer scientist. But when you deconstruct it layer by layer, the puzzle starts to make sense.
With that out of the way, off we go into the wild, wacky world of Hadoop in this first blog post, which is intended to provide a high-level overview. Data is being generated at breakneck speeds and in volumes that were unfathomable even a decade ago, and organizations are asking: where do we put it? And how do we use it once we’ve put it there? Hadoop is an ecosystem of tools for storing, analyzing, and doing all kinds of other cool things with “Big Data”. You may have heard of a petabyte, which is literally a million billion (10^15) bytes of information. And you may have heard of an exabyte, which is a thousand times bigger still (10^18). Hadoop is built to go beyond these and handle zettabytes (10^21) and yottabytes (10^24)–basically, you’d need scientific notation to write out the number of bytes without getting cramps in your hands.
Hadoop got its start in the early 2000s, when Doug Cutting and Michael J. Cafarella took principles from a couple of papers Google had published and ran with them to build a better open source search engine; Yahoo! soon hired Cutting and put serious muscle behind the project. You see, a search engine requires incredible computing power, and the standard computing model based on a single computer wasn’t nearly enough–they needed a way to combine the power of many computers working together. Hadoop did just that. Eventually they felt guilty and turned it over to the Apache Software Foundation as open source software (I’m totally kidding about the “guilty” part–these are great guys, no one accuses them of anything bad).
One extremely important point–aside from its ability to store and manage massive amounts of data–is that Hadoop is also distinguished by its ability to handle almost any type of data, whether structured relational data, semi-structured data, or completely unstructured data. So while a SQL database is fairly narrow in what it can manage, Hadoop takes everything from spreadsheets to videos of cats falling into toilets.
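To make that concrete, here’s a minimal sketch using Hadoop’s Java FileSystem API (the file paths are made up, and it assumes a configured cluster is on your classpath). HDFS, which we’ll meet properly below, just stores bytes–it doesn’t care whether those bytes are a spreadsheet or a cat video, so the call is identical either way.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutAnything {
    public static void main(String[] args) throws Exception {
        // Default configuration; assumes HADOOP_CONF_DIR points at a real cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // HDFS stores bytes, not schemas, so the same call works for any file type.
        // (These paths are hypothetical.)
        fs.copyFromLocalFile(new Path("/tmp/sales.csv"), new Path("/data/sales.csv"));
        fs.copyFromLocalFile(new Path("/tmp/cat_video.mp4"), new Path("/data/cat_video.mp4"));

        fs.close();
    }
}
```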
It’s probably not very important to know that Hadoop’s core components are written in the (highly confusing) Java programming language and are architected for Linux (so people using Windows, Mac, etc. need to run it on virtual machines). But if anyone ever asks you, there you have it. It is important to know, however, that the wider Hadoop ecosystem has adapted so that you can work with it using almost all of the major programming languages–particularly SQL and Python.
Am I my brother’s zookeeper? Fitting it all together
Hadoop itself has three core components:
- HDFS (the Hadoop Distributed File System) manages the storage, or figures out where to put the data when it is imported into Hadoop (remember, files get split into blocks and spread across what may be thousands of servers, so it’s quite a job). It’s kind of like those people at Disneyland who direct you to your parking spot.
- YARN (Yet Another Resource Negotiator) manages the computation resources and schedules jobs. So it tells the servers, ‘you do this’ and ‘you do that’.
- MapReduce is the programming model you use to tell the cluster what to do: you write a “map” step and a “reduce” step, and YARN spreads that work across the machines (there’s a small example just after this list).
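To see how the three pieces fit together, below is a lightly commented version of the classic “word count” job in Java (it follows the example in the Apache Hadoop MapReduce tutorial). The input and output paths live in HDFS, YARN decides which servers actually run the work, and the Mapper and Reducer classes are the MapReduce part.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: for every line of input, emit (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce step: add up the 1s for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output live in HDFS; YARN decides which machines run the work.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

You’d package this into a jar and launch it with something like `hadoop jar wordcount.jar WordCount /input /output` (those paths are just placeholders); from there, Hadoop handles distributing the work and collecting the results.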
And then there is a whole menagerie of applications that connect with Hadoop and do various cool things, like Pig, Hive, HBase, Sqoop, Flume and others. You may have noticed that some of these are named after animals (and Hadoop itself was named after a toy stuffed elephant). Well, every zoo needs a zookeeper. Ambari provides a general monitoring and management dashboard, while ZooKeeper, as the name suggests, offers a centralized way to keep all these systems running in concert.
Hadoop relies on three major computing principles to perform its magic: clustering, schema-on-read, and map + reduce. If you want to dive deeper, there are lots of sources out there that explain these. Or, if you want them explained in a much more down-to-earth and digestible manner, hold tight for my next blog!
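In the meantime, here’s the tiniest possible taste of the map + reduce idea, using nothing but ordinary Java streams (no Hadoop involved). The “map” step transforms each piece of data independently, and the “reduce” step combines the results–Hadoop applies the same pattern, just spread across thousands of machines.

```java
import java.util.List;

public class MapReduceToy {
    public static void main(String[] args) {
        // A stand-in for "Big Data": three tiny lines of text.
        List<String> lines = List.of("big data", "big elephants", "big yellow elephants");

        int totalWords = lines.stream()
                // "Map": turn each line into its word count, independently.
                .map(line -> line.split("\\s+").length)
                // "Reduce": combine all those counts into a single total.
                .reduce(0, Integer::sum);

        System.out.println("Total words: " + totalWords); // prints 7
    }
}
```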