
Array_Position Custom UDF in Hive

Working with arrays in Hive is pretty slick. However, I’ve run into an issue: the published Hive UDFs include no function that returns the index of a value within an array when it contains the item you’re looking for. So I took it upon myself to write one. This code runs in Hive:


#!/usr/bin/python
# array_position.py
import sys

def get_position(item, item_list_string):
    try:
        # The array arrives from Hive as a string, so split it back into a list
        item_list = item_list_string.split(',')
        array_position = 0
        for position, value in enumerate(item_list):
            value = value.replace('[', '')
            value = value.replace(']', '')
            value = value.replace('"', '')
            if value == item:
                array_position = position + 1
        # Add all the output values to a list
        output = [str(array_position)]
        # Print output as tab-delimited string objects to stdout
        print('\t'.join(output))
    except Exception:
        # On a malformed record, emit 0 ("not found") so Hive still gets a row
        print('0')

def main():
    # Hive submits each record to stdin
    # The record/line is stripped of extra characters
    for line in sys.stdin:
        line = line.strip()
        item, item_list_string = line.split('\t')
        get_position(item, item_list_string)

if __name__ == "__main__":
    main()  # no args besides the Hive query fields streamed over stdin


The Hive code to invoke this script is as follows:


ADD FILE array_position.py;
SELECT TRANSFORM (a.item, a.item_list_string)
USING 'array_position.py'
AS array_position
FROM
  (SELECT item, item_list_string
   FROM your_table) a;
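
To sanity-check the script outside of Hive, you can stream a record to it over stdin the same way TRANSFORM does. This is only a hypothetical local test (not from the original post): the sample values are made up, and it assumes `python` on your path can run array_position.py and that you have Python 3.7+ for subprocess.run.


# test_array_position.py -- hypothetical local check, not part of the Hive job.
# Feeds one tab-delimited record over stdin, just as Hive streaming would.
import subprocess

sample = 's_234\t["s_123","s_234","s_890"]\n'
proc = subprocess.run(['python', 'array_position.py'],
                      input=sample, capture_output=True, text=True)
print(proc.stdout.strip())  # expected: 2 (1-based position of s_234)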

Running Python in Hive/Hadoop

One of the things I love about running Hive is the ability to run Python and leverage the power of parallel processing. Below I’m going to show a stripped-down example of how to integrate a Hive statement and Python to aggregate data and prepare it for modeling. Keep in mind, you can also use Hive and Python to transform data line by line, which is extremely handy for data transformation.

Use case: print out an array of products sold to a particular user. Again, this is a basic example, but you can build upon it to generate the products sold for every user, then use KNN to generate clusters of users, or perhaps Association Rules to generate baskets.

Here is the Python script, which has to be saved to a local path and added in Hive (via ADD FILE) so it ships to the Hadoop nodes:


#!/usr/bin/python
import sys

items_sold = []  # create global list variable

class Items:  # create class to store and access items added
    def __init__(self, x):
        self.x = x

    def set_x(self, x):
        self.x = x

    def get_x(self):
        return self.x

def print_results():  # print output in Hive
    result_set = [item.get_x() for item in items_sold]
    print(result_set)

# Hive submits each record to stdin
# Each record/line is stripped of extra characters and stored
for line in sys.stdin:
    line = line.strip()
    purchased_item = line.split('\t')[0]  # only one column is streamed in
    items_sold.append(Items(purchased_item))

print_results()

Here is the Hive statement:


ADD FILE blog_hive.py;
SELECT TRANSFORM (a.purchased_item)
USING 'blog_hive.py'
AS array_purchased
FROM (SELECT purchased_item FROM company_purchases WHERE user_id = 'u1') a;

The result in Hive will be similar to this: ['s_123', 's_234', 's890']
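
Building on the idea above of generating the products sold for every user, here is a hedged sketch of how the streaming script could aggregate per user. It assumes the Hive query streams two columns (user_id and purchased_item) and groups each user's rows together, e.g. by adding CLUSTER BY user_id to the inner query; the script and column names are illustrative, not part of the original post.


#!/usr/bin/python
# Hypothetical per-user variant of blog_hive.py. Assumes Hive streams
# "user_id \t purchased_item" with all of a user's rows arriving consecutively
# (e.g. the inner query uses CLUSTER BY user_id).
import sys

current_user = None
basket = []

def emit(user, items):
    if user is not None:
        print('\t'.join([user, str(items)]))  # one row per user: user_id \t [items]

for line in sys.stdin:
    user_id, purchased_item = line.strip().split('\t')
    if user_id != current_user:
        emit(current_user, basket)            # flush the previous user's basket
        current_user, basket = user_id, []
    basket.append(purchased_item)

emit(current_user, basket)                    # flush the final user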

Crowd-sourced Recommender Demo


This recommender demo illustrates how a website (online music, e-commerce, news) generates recommendations to increase engagement and conversions.

This is not production-ready, merely a POC of how it works.

* user selects favorite activities
* data is passed to the server and processed in Hadoop
* user can go to the results page and select an activity to get recommendations

At this point, an automated workflow has not been built, so there is a series of steps to create the new dataset. Here are the general steps:

1. user data feeds into a database via the website (this data is used to generate recommendations)
2. data is moved to and processed in Hadoop
3. data is moved to MySQL, where it is accessible using PHP
4. user selects an activity, and the crowd-sourced recommendations are displayed

Example: How Crowd-Sourcing Works (co-occurrence recommendations) Using Activities

All Users' Activity History
| User | Art Fair | Fishing | Shovel Snow | Wedding |
| Jon  | Yes      | Yes     | Yes         | No      |
| Jane | No       | Yes     | No          | Yes     |
| Jill | Yes      | Yes     | No          | Yes     |

A new user likes to go to weddings, and we need to recommend other activities to them:
* Find the users in the history matrix who also enjoyed Wedding: U = {Jane, Jill}
* Identify the other activities those same users (U) enjoyed, and rank them by count

Recommendation
| Activity | Rank | Count of Users (co-occurrence) |
| Fishing  | 1    | 2                              |
| Art Fair | 2    | 1                              |
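
For illustration, here is a minimal in-memory sketch of that co-occurrence counting in Python (not the Hadoop job itself); the history dict simply mirrors the example table above.


# Minimal in-memory sketch of the co-occurrence ranking above.
# The history dict mirrors the example table; it is illustrative only.
from collections import Counter

history = {
    'Jon':  {'Art Fair', 'Fishing', 'Shovel Snow'},
    'Jane': {'Fishing', 'Wedding'},
    'Jill': {'Art Fair', 'Fishing', 'Wedding'},
}

def recommend(activity):
    # Users who enjoyed the given activity
    matching_users = [u for u, acts in history.items() if activity in acts]
    # Count every other activity those users enjoyed, ranked by count
    counts = Counter(a for u in matching_users for a in history[u] if a != activity)
    return counts.most_common()

print(recommend('Wedding'))  # [('Fishing', 2), ('Art Fair', 1)]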

Predictive Algorithms on Million Song Dataset

I’ve had the opportunity within a Data Mining course in my graduate Software Engineering program to be part of a project in which we were to create a “recommendation engine”. The dataset we used was the Million Song Dataset, which contains 1M songs along with the play history of 380k users.

The goal was to provide a ranked list (1-10) of recommended songs based on the song currently played. We used three algorithms: Association Rules, Naive Bayes, and user-user co-occurrence. When tested, the results were mixed: Association Rules provided the top F1 scores, but also had the lowest number of recommendations (a large portion of songs had fewer than 10 songs recommended). Co-occurrence was close behind with the 2nd-best F1 score, provided the largest output of songs, and had the lowest computational requirements.
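
As a reminder of how the F1 comparison works, here is a small hypothetical example; the song IDs and counts are made up and only show the arithmetic, not our actual results.


# Hypothetical F1 calculation for one recommendation list, comparing the
# recommended songs against held-out songs the user actually played.
recommended = {'s_1', 's_2', 's_3', 's_4'}
actually_played = {'s_2', 's_4', 's_9'}

true_positives = len(recommended & actually_played)          # 2
precision = true_positives / float(len(recommended))         # 2 / 4 = 0.5
recall = true_positives / float(len(actually_played))        # 2 / 3 ~= 0.67
f1 = 2 * precision * recall / (precision + recall)           # ~= 0.57
print(round(f1, 2))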

Here is the full project on GitHub.

Getting Started with Hadoop

To begin playing around with what Hadoop does, I decided to go down the path of using the Hortonworks Sandbox.  One of the first things the setup has you do is install Oracle VirtualBox, virtualization software that hosts the Sandbox as a virtual machine.  One note: the browser IP is wrong in the tutorial; it should be http://127.0.0.1:8000 to open the Sandbox GUI.

I then proceeded to follow the “Hello World” tutorial, in which I was able to import some actual data from the NYSE and run some Hive and Pig queries.  I have a substantial SQL background (though it is not essential), so it was a breeze.

I’m impressed with how easy to follow and well written the tutorial was.  Great way to get started!