So when I started looking at Hadoop a while ago, I decided that the best way to learn it was to build a Hadoop cluster. That presented a number of problems. The first was, of course, what to build it on.
To build a meaningful cluster you're going to need at least five or six machines. There are various ways you can do this.
- You can do it using virtual machines, and in fact this is probably the easiest way. If you look around, any number of people will offer you pre-built Hadoop VMs to play with. But that breaks the first rule of learning: you're not doing the install, so you're not going to learn anything about how you install Hadoop and its inner workings. You can certainly build your own VMs, but that divorces you from the hardware :-(
- You can do it on a cloud service such as Amazon EC2 - but that can get expensive, and it still divorces you from the hardware :-(
- You can build it on a number of second-hand or scrounged PCs. This'll certainly work and you will definitely get your hands dirty with the hardware - probably very dirty as you clean out several years' worth of grime that always infests older PCs. There are other disadvantages to this approach that may not be immediately obvious: the cost of running 5 or 6 PCs, the heat they generate, the amount of desk space they take up, and the objections from your better half about the jet-engine-like noise from the fans as you start them all up. A colleague of mine who followed this approach used to start his cluster up remotely for demo purposes, but had to stop when his wife threatened to disassemble it if he wasn't present when it started.
Meet the Raspberry Pi, a credit-card sized computer that was launched about 18 months ago by the Raspberry Pi Foundation as an education tool. It's a complete computer with an ARM CPU, 512MB RAM, video, 10/100Mb Ethernet, USB ports and SD card storage on a single board the size of a credit card. And the killer bit - it costs $35 (about £25).
You see where I'm going with this :-D
Make no mistake about it - there are challenges to using the Raspberry Pi - it's very resource-limited. The CPU is a 700MHz ARM processor, the RAM is only 512MB and the network is only 100Mb. But overcoming challenges helps you learn - though you may lose some hair in the process ;-)
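If you want to see those limits for yourself once a Pi is up, a couple of standard Linux commands will show them. A minimal sketch, assuming Raspbian or another Linux - the exact `/proc/cpuinfo` field names vary between kernels, and `eth0` is just the usual interface name:

```shell
# CPU details - on a Pi this reports the ARM core.
grep -i -m1 -E 'model name|^model' /proc/cpuinfo

# Total memory in MB - on a model B expect a figure close to 512.
free -m | awk '/^Mem:/ {print $2 " MB RAM"}'

# Network link speed, if ethtool is installed (interface name assumed):
# ethtool eth0 | grep Speed
```

Keep those numbers in mind when you come to tune Hadoop's heap sizes later in the series.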
There's a great quote from Meet The Robinsons - "From failure you learn; from success, not so much". Implementing a Hadoop cluster on Raspberry Pis certainly provided me with some failures :-)
To get started I built a single node setup - the good news is the hardware only costs about £40: a Raspberry Pi model B, a 16GB SD card, a PSU and a network cable.
TIP: Only buy quality SD cards, and try to get a Class 10 card. I know a lot of people have had problems with SD cards corrupting on Raspberry Pis. So far this hasn't happened to me (touch wood).
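One quick way to sanity-check a card before trusting it with your data is a crude sequential write test with dd. A rough sketch only - the `/tmp/sdtest.bin` path and 100MB size are arbitrary choices, and `conv=fsync` assumes GNU dd as shipped with Raspbian:

```shell
# Crude sequential write test. A genuine Class 10 card should sustain
# roughly 10 MB/s; dd prints the measured rate on its final line.
dd if=/dev/zero of=/tmp/sdtest.bin bs=1M count=100 conv=fsync

# Tidy up the test file afterwards.
rm /tmp/sdtest.bin
```

Run it against the mounted SD card (not `/tmp` on another disk) to get a meaningful figure.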
Of course I did not do this in isolation - my starting point was the many great blog posts from people round the world who have installed Hadoop, including earlier versions of Hadoop on Raspberry Pis. For example Michael G. Noll, Toby Myer, Rasesh Mori, Y12 Studio, raspberrypicloud, and Sarah Secret.jp. Thanks for sharing, guys!
Part 2 will cover the single node install. Part 3 will cover the multi-node hardware & Part 4 will cover the multi-node install.