Here is the tutorial I used, by Steve Cook on YouTube.
Links to download the libraries for Java:
Here is the data:
Mahout (an Apache project) is designed to accomplish the following:
- Collaborative Filtering (recommendations)
- Classification (spam email or not)
- Clustering (Google News)
My background languages are Java, Objective-C, SQL, and HTML, but no Python! Not a problem. I turned my attention to good ole’ youtube.com, and while I was on the elliptical machine at 5:30am, I ran into some great videos from OneStopProgramming.
Summary steps to get started:
- install Python (the .exe installer)
- install Notepad++ if you don’t already have it
- create a simple .py script
- open a command prompt, change to the folder containing the .py file, and run it
- len("hello") = 5
- help(len) = gives info about the len function
- dir() = gives you all the names (variables) you’ve declared
- "H" in "Hello" = True
- "h" in "Hello" = False (the check is case-sensitive)
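To tie the steps above together, here is a minimal sketch of what that first script might look like. The file name hello.py and the variable names are my own assumptions for illustration, not something from the videos. It just exercises the built-ins listed above.

```python
# hello.py (hypothetical file name) - a first script exercising a few built-ins

greeting = "Hello"

# len() returns the number of characters in a string
length = len("hello")

# The "in" operator tests for a substring and is case-sensitive
has_upper_h = "H" in greeting
has_lower_h = "h" in greeting

print("length:", length)
print("'H' in greeting:", has_upper_h)
print("'h' in greeting:", has_lower_h)

# help(len) is most useful at the interactive prompt; in a script,
# printing the docstring shows the same summary text
print(len.__doc__)

# dir() with no arguments lists the names defined in the current scope;
# filter out the built-in double-underscore names to see our own
own_names = [name for name in dir() if not name.startswith("__")]
print("declared names:", own_names)
```

Save it, then from the command prompt, in the same folder, run it with `python hello.py` (assuming Python was added to your PATH during install).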
To begin playing around with what Hadoop does, I decided to go down the path of using the Hortonworks Sandbox. One of the first things the setup has you do is install Oracle VirtualBox, which is virtualization software that hosts a virtual machine. The Sandbox runs inside that virtual machine. One note: the browser address given in the tutorial is wrong. It should be http://127.0.0.1:8000 to open the Sandbox GUI.
I then proceeded to follow the “Hello World” tutorial, in which I was able to import some actual data from the NYSE and run some Hive and Pig queries. I have a substantial SQL background (helpful, but not essential), so it was a breeze.
I’m impressed by how easy to follow and well written the tutorial was. Great way to get started!