Learning new technologies is a never-ending process. Even the most experienced developers know what it is like to keep up with an ever-changing technology landscape and its shifting trends, buzzwords, and languages. Big Data is no exception: new tools and frameworks seem to pop up every few months. So what’s a developer to do? How can we roll up our sleeves, get our hands dirty, and start digging into this exciting world without burying ourselves under all these new technologies?
This article describes one way of approaching Hadoop, based on my experiences over the past few years at work and as an instructor.
- The first step is to find a sandbox or development environment where you can play around with the technology without too much overhead or risk. For me, the best way has been to use my own laptop as a testbed, but I also know developers who prefer Amazon EC2 instances. This article assumes that we’re using our laptops and that we want to follow along and learn by example (i.e., we’re not going to go through all the details of how everything works under the hood).
- A local sandbox also means that we don’t need access to existing Hadoop infrastructure such as ZooKeeper, NameNodes, a Secondary NameNode, a JobHistory Server, TaskTrackers, DataNodes, etc. But we do need to be able to set up a Hadoop cluster of our own! There are several ways of achieving this. A few years ago I came across Cloudera’s Vagrant project and have been happily using it ever since. Now that Cloudera supports both Vagrant and Docker-based deployments for CDH5, it has become even more awesome. We can create a sandbox in just a couple of minutes that is compatible with the Hortonworks Sandbox.
- We’ll start by installing VirtualBox (the virtual machine software) and Vagrant on our laptops. If you don’t want to use Vagrant, you might consider two other options. The first is an Amazon EC2 instance with Hadoop pre-installed. Although there are a number of good AMIs out there, I didn’t find a Vagrant-compatible one that allowed me to set up a Hadoop cluster quickly. Sorry! EC2 also doesn’t let you change the amount of memory for an instance without jumping through some hoops, such as launching a new instance type or using startup scripts. The second is a bare-metal installation on your laptop (be aware of the 32-bit vs. 64-bit requirements). In my case, this was an HP Envy M6 laptop with 16GB RAM running Windows 8 x64. Whatever machine you use, make sure it has enough free disk space and CPU power.
- Although there is no fixed standard for deploying Hadoop clusters on Vagrant, most of the time you’ll need at least 3 machines. The Vagrantfile I’m using is based on one originally written by Karthik Srinivas (thanks for sharing your work!). So let’s fetch the source code and create our sandbox:
- Note that this will download an Ubuntu Precise image with VirtualBox Guest Additions installed by default from the Oracle public repo. When everything has been deployed successfully, we should see a bunch of running VMs in our VirtualBox UI:
- That’s it! We’re now ready to start working with Hadoop in a local sandbox. The only thing we still need is a good book or tutorial on Hadoop. There are many excellent resources out there, but my favorite over the past couple of years has been Sam R. Alapati’s Learning Apache Hadoop. This is also where the screenshots of directory contents come from.
- If you want to be more adventurous (and install some additional goodies), take a look at these Vagrantfiles for the Hortonworks Sandbox and the Cloudera QuickStart VM (based on CDH4). You can use them as-is or as a starting point to create your own sandbox. I’ve also created one for Hortonworks Data Platform (HDP) on Windows Server 2012 x64, which you can find here.
- I’d like to thank my colleagues Colin Cameron and Joe Crobak, who helped me get this article just right. If you have questions or comments about Hadoop on Vagrant, I encourage you to ping us.
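Before bringing up any VMs, it can help to confirm that the laptop meets the prerequisites mentioned above: Vagrant and VirtualBox installed, a 64-bit host, and enough free disk space. The following is only a minimal sketch of such a check. The 20 GB threshold and the function name `check_prerequisites` are my own assumptions for illustration, not requirements stated by Vagrant or Hadoop.

```python
import platform
import shutil

# Assumption: 20 GB free is a rough floor for a handful of VMs; adjust to taste.
MIN_FREE_GB = 20

# Machine names commonly reported by platform.machine() on 64-bit hosts.
SIXTY_FOUR_BIT = {"x86_64", "amd64", "arm64", "aarch64"}


def check_prerequisites(path=".", min_free_gb=MIN_FREE_GB):
    """Return a dict of yes/no checks for running a local Vagrant sandbox."""
    free_gb = shutil.disk_usage(path).free / 1024 ** 3
    return {
        # Both commands should be on PATH. On Windows the VirtualBox
        # installer may not add VBoxManage to PATH, so a False here can
        # simply mean PATH needs updating.
        "vagrant_on_path": shutil.which("vagrant") is not None,
        "virtualbox_on_path": shutil.which("VBoxManage") is not None,
        "is_64bit_host": platform.machine().lower() in SIXTY_FOUR_BIT,
        "enough_disk": free_gb >= min_free_gb,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'check this'}")
```

A failed check is a hint rather than a verdict. For example, VirtualBox may be installed but simply missing from `PATH`.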
Conclusion:
Vagrant gives you the ability to define your infrastructure in code. This makes it easy for developers to get started quickly on their laptops with a replica of a production environment. It also gives admins repeatable processes, e.g., by building reusable VM templates that can be spun up and down quickly across different projects or teams. Moreover, Vagrant is extremely popular for automating test cases both locally and in the cloud (e.g., running them on OpenStack bare metal). These are all reasons why I’ve been using Vagrant extensively over the past few years! And if you don’t yet know about Vagrant, what are you waiting for?
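To make "infrastructure in code" a little more concrete, here is a small sketch that writes out a minimal three-node Vagrantfile. It is not the Vagrantfile used in this article. The box name, node names, private IP addresses, and memory size are all placeholders I picked for illustration. The generated text uses only standard Vagrantfile settings (`config.vm.box`, `config.vm.define`, `vm.hostname`, `vm.network`, and the VirtualBox provider's `memory` option).

```python
# Assumptions (not taken from the article's Vagrantfile): the box name,
# node names, private IPs, and memory size below are illustrative only.
BOX = "hashicorp/precise64"
NODES = [
    ("master", "192.168.56.10"),
    ("worker1", "192.168.56.11"),
    ("worker2", "192.168.56.12"),
]
MEMORY_MB = 2048

# One `config.vm.define` block per machine in the cluster.
NODE_TEMPLATE = """\
  config.vm.define "{name}" do |node|
    node.vm.hostname = "{name}"
    node.vm.network "private_network", ip: "{ip}"
    node.vm.provider "virtualbox" do |vb|
      vb.memory = {memory}
    end
  end
"""


def render_vagrantfile(nodes=NODES, box=BOX, memory=MEMORY_MB):
    """Return the text of a Vagrantfile defining one VM per (name, ip) pair."""
    body = "".join(
        NODE_TEMPLATE.format(name=name, ip=ip, memory=memory)
        for name, ip in nodes
    )
    return (
        'Vagrant.configure("2") do |config|\n'
        f'  config.vm.box = "{box}"\n'
        f"{body}"
        "end\n"
    )


if __name__ == "__main__":
    # Redirect this output to a file named Vagrantfile to try it out.
    print(render_vagrantfile())
```

Saving the output as `Vagrantfile` in an empty directory and running `vagrant up` there should bring up three plain Ubuntu VMs. Installing and configuring Hadoop on them would still need a provisioning step, which this sketch leaves out.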