CNN: Transfer Learning vs build from scratch

When building a CNN (convolutional neural network), there are some things you’ll need and some things you should consider .  What you’ll need is access to GPU, and the next is you’ll need a lot of labeled images.  And when I say a lot, it could be minimum of 1,000 per class.  However, using transfer learning you may be able to get away with less.  Tensorflow and Theono backed packages such as Keras, provides the ability to use pre-trained models learning as the inputs to your newly created model, and with out a doubt, this helps model performance metrics.  Especially works if you training images are some what closely related to ImageNet dataset.  The main aspect to consider is just building the CNN from a transfer model or giving a shot at building it from scratch.

Regarding transfer learning, the reality is however, that most real-world applications of CNN for image recognition are not going to be that similar to ImageNet base of images.  Not all is lost as you can still use those pre-trained model to help you achieve higher model accuracy.  But what’s the cost?   I ran a test of some image recognition project.  And here are the considerations with using transer learning:

  1.  training time – this could substantially increase your processing time, depending on your model architecture
  2. size of model – instead of a model that is 50mb, now how about 300mb.  For some people in academics this is no big deal.  But I’m talking a web service or having this model work locally on a phone or simple CPU, smaller is better
  3. can only use RGB images when using ImageNet pre-trained model.  Bummer, b/c many times grayscale is all that is needed to perform well, and RGB requires more processing power and size of final model

To understand the trade offs between a CNN backed by transfer learning versus building CNN from scratch, I tested it out on a small dataset I’m working on.  Details on my dataset:

  • 2 classes; class 0: 250 labeled images, class 1: 1,000 labeled images (noticed classes are unbalanced?  It’s a real-world problem)
  • images do not closely resemble ImageNet (again, this is more real-world)

I’m running two models, one will be CNN from scratch, and the other will be leveraging transfer learning in which I’ll freeze the top 7 layers.

Both will use image augmentation, edge detection, and cross-validation to help with getting the most out of limited images in my training set.  Will be running up to 300 epochs, with patience of 10,  and callbacks to minimize log loss.  I sure I could spend more time on trying to make marginal improvements on both models, but in this case I wanted to time box this initial model building to help me decide which path I go.

Results of CNN from scratch (on the smaller, more difficult class: class 1)

Results of CNN with transfer learning (on the smaller, more difficult class: class 1)

No surprise, the F1 score is better on the model with transfer learning at 0.93 vs 0.91.  But add the expense of a model that is 10x as large.  You make the call on the path you choose.

Stream Data with StreamSets & Analyze with Spark

Tis the season of NFL football, and one way to capture excitement is Twitter data. I’ve tinkered around with Twitter’s Developer API before, but this time I wanted to use a streaming service I’ve heard good things about, called .

After I received the Tweets semi-raw data, I wanted to analyze the Tweet data using Spark. I choose Spark as the distributed nature of the RDD is great for using large amounts of data (and I’m not sure on how much I’ll be getting).

My idea was to do a count of tweets for a particular team/game and see if the volume of Tweets would predict whether that team actually wins or losses the game.

Data Collection Process


I have done a little work with the Twitter Developer API in the past, which I had used Python to parse the tweets as they arrived. I found this process very simple, but I was a little apprehensive brining StreamSets into the mix. However, having the knowledge of a scalable ETL and streaming program like StreamSets is good idea.

To use StreamSets, I did some google searches on “streaming twitter StreamSets”. I found a very well put together tutorial. It looked promising, so I felt confident enough to download the StreamSets application on my Mac and install it. I was a 145mb zip download, extracted as a java project.

After starting via the terminal, I was able to connect to it via localhost through my web browser, which I appreciate.

To connect to Twitter API via the StreamSets HTTP Client, I had to define Resource URL. Instead of getting all the tweets available, I decided to filter only tweets with “nfl” located in the tweet or hash tag. Also note, the Twitter API is random sampled real-time amount of Tweets. All other filtering and counting I was planning on doing in Spark later, but I’m sure some more of that ETL could have also been done in StreamSets.

As for the credentials to connect to Twitter, I had to enter four values: Consumer Key, Consumer Secret, Token, Token Secret. At this point as a test using the StreamSets UI, I connected the HTTP Client to save in Local FileSystem, then ran the pipeline.

I reviewed a few of lines of raw Tweet output in a text editor and online JSON viewer. I decided I didn’t need all the the JSON fields, so I created Field Removal into my pipeline between the HTTP Client and saving to the Local FileSystem. The fields I decided keep (I went with more rather than less as I didn’t know exactly what I’d need in Spark): Create Date/Time, userId, Tweet text, username, user location, user timezone, hashtags, retweet status, retweet count, location. After running, it looked good!

As I was in NFL week 13 (Thursday 11/30 – Sunday 12/03), I decided to run the Pipeline on the Thursday game as a test. I noticed plenty of data (around 5k tweets) relating to the NFL for those three hours of 7pm to 10pm. I thought this was a good proxy for plenty of data to capture for Sunday – when was my intended data to go after for analysis of the project.

Final pipeline diagram:

On Sunday, 12/3, I started the Pipeline at 11:59am and ran it until about 715pm that day. By running during that time, it would allow me have option of analysis on both the noon and 3pm games.

After I stopped the Pipeline I had 9 folders of data (one folder for each hour, which was default setting in StreamSets local file system settings (the first hour was only one min, and the last folder representing 7pm was also very few). The size of all the Sunday tweets was about 52mb.

Before diving into Spark, I wanted to get an idea on the amount of tweets in my data for data validation purposes. Using the terminal, I did a “wc -l filename” for the 12pm and 3pm hours. The total lines were 3,145 and 4,110. Since I have about 7 full hours, I would expect my data in Spark to have about 20k – 25k Tweets.

Spark Processing and Validation


I had the data in my local drive on the Cluster, so now I copied that data to HDFS for Spark to access. After starting the Spark-shell, I went to read in the data using the HDFS path + “/*”. However, after doing a count on the Tweets, it seems very low. It turns out, I needed additional “/*” added to access at the subdirectories. I did a count on the RDD, and came out to 20,202, which validated to the linux command I ran on my local in which I estimated 20k – 25k Tweets.

Moving on to what I was looking for, which at this point counting number of Tweets during a game for a particular team playing. I decided to break the dataset into two RDDs. The first would be mapping and getting just the “hour” of the Tweet. The second would be mapping to get the “text” of the Tweet.

The final data structure would need to combine the two RDDs, so I could count across specific hours and Tweet contained the team name. I decided on the tuple data structure. Then, I just filter the tuple by hours of a game and team name. For example, for the Vikings/Rams game (which started at noon), would be an hour representing noon, 1pm, 2pm and Tweet text containing “vikings” or “rams”.

I had to repeat this process for each team in the noon games, which there were seven games. At this point, I decided to create a JAR and submit the job via Spark-Submit. The input the shell-script to run the JAR on the cluster was to enter Input Data Location, Output Data Location, and team name. By doing this, it sped up the process of gathering the count of Tweets for each team as I just had to update the team name in the shell-script and running it right from the Cluster..

I was making the assumption noon game would run from noon – 3pm.

Other Programs


I used text editors for writing my code. On my Windows, it was Notepad++.

For creating the JAR file for Spark-Submit, I used the Cloudera VM, and run Eclipse IDE.

For moving JARs along with connecting to the cluster & Cloudera VM, I used Putty and WinSCP.  If using my mac, I would would just SCP.

For visualizing the results I used Excel. If I were using more variables in the dataset and looking for more of a dynamic visualization, I most likely would have used Tableau.



Of the 7 games played at noon, 4 of the 7 who were winners had more Tweets. I don’t think that is it significant to say the Tweet activity predicted the outcome, but interesting nonetheless.



I have used a Hadoop cluster many times over the past 3 years. From a Data Science perspective, it’s really not the greatest tool due to effort needed to move data and lack on statistical/visualizations tools built-in. Going forward, if I were to consult similar tools, I would look into something like Cloudera’s Data Science Workbench. However, I’m a firm believer in the knowledge to perform all functions through the command line, so this project further enhanced my skillset.

Chatbot using Azure’s NLP LUIS w/Demo

In the spirit of AI, I decided to look into various chatbot frameworks available and build a POC. I landed on Microsoft’s Azure Bot Service as my preferred choice. I’ve had a positive experience with Azure’s ML tools and Microsoft has done a tremendous job in past 3 years investing in cloud services. Also, their Bot Service they also integrates with a NLP service called LUIS (Language Understanding Intelligence Service) – which is also owned by Microsoft. ***Note demo is located at bottom of post***
Coding it was simple in which I chose NODE.js instead of C#. The template provided within the dashboard was a great start, and I only had to make a few updates to help integrate with LUIS on NLP for my domain of my choice. The entire dev and deployment is 100% serverless, which I love.
More about the NLP using LUIS, it’s job is to take a utterance(aka sentence), and determine the intent. That intent is used by the Bot Framework to reply a result. In order for your chatbot to provide some functionally, you need to train a model in LUIS with these utterances. The model requires two things: Intents (verb) and Entities (nouns).
I decided to create a fictitious chatbot for a company called “Jon’s Auto Repair”. The goal of the chatboat was to allow the customer to find out services offered and the cost. Much more could be built including scheduling services, and custom Q&A.
1) Intents – greeting, get_service, get_cost, cant_service
2) Entities- service, cost
Sample utterance entered in LUIS to train the model: “what do new brakes cost”. After entering, I tag “brakes” -> service entity, and “cost” -> cost entity.
The LUIS endpoint you can query with a string. For example, here is the result of the query: “my brakes are squeak can you fix”. Behind this scenes this is passed the endpoint here: LUIS endpoint
Here is part of JSON response showing the top scored intent:
"query": "my brakes squeak can you fix",
"topScoringIntent": {
"intent": "get_service",
"score": 0.52136153

Here are a few screenshots showing it working:


Here is the live demo hosted by Aruze, and it’s as simple and pasting the embed code provided:



Predict Specific Claims in Medicare Data



Whenever I’m faced with a machine learning task, my goal on day 1 is to build an initial model. The model will without a doubt need to be tuned in days or even weeks after, but it’s good to have a starting point. In the project below, I timeboxed a machine learning inital model of about 4 hours to see how far I get along with some initial results.

Problem Statement

A peer of mine in my Master’s program mentioned there is publicly available Medicare CMS data. I have very little knowledge of healthcare data, but thought I’d explore the data and see if there was an aspect that could be useful in buidling a model to make predictions.

The data:

  • 2008 claims outpatient data (used this, only 1 of 20 available samples, still about 1.1 million rows of claims data)
  • 2008 beneficiary data (used this)
  • 2008 claims inpatient data (did not use this due to initial time constraint)
  • 2008 presciption data (did not use this due to initial time constraint)

I identified one useful infomation to build a model on: Predict a medicare claim specific to ICD9 codes relating to diseases of circulatory system (this makes up about 11% of claims).

Steps included in this project

  • getting the data
  • exploring the data
    • identifying potential useful features
  • transforming the data
  • data exploration
  • dimensionality reduction
  • selectling initial models
  • evaluating initial models
  • summarizing

Useful imports and settings

In [15]:
import datetime
import numpy as np
import pandas as pd
from sklearn import linear_model, ensemble, decomposition
from sklearn.preprocessing import MinMaxScaler, Imputer
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve, recall_score
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Settings for plots
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 6.0)

pd.options.display.max_columns = None

def warn(*args, **kwargs):
warnings.warn = warn

Read in data, convert columns, join datasets

In [16]:
dateparse = lambda x: pd.datetime.strptime(x, '%Y%m%d')

df_outpatient_claims1 = pd.read_csv('DE1_0_2008_to_2010_Outpatient_Claims_Sample_1.csv')
df_beneficiary1 = pd.read_csv('DE1_0_2008_Beneficiary_Summary_File_Sample_1.csv'
                             ,parse_dates=[1], date_parser=dateparse)

# calc age from birthdate
date_2009 = pd.datetime.strptime('20090101', '%Y%m%d')
df_beneficiary1['AGE'] = (date_2009 - df_beneficiary1['BENE_BIRTH_DT']).astype('<m8[Y]')

# join data sets
df_joined = pd.merge(df_outpatient_claims1, df_beneficiary1, left_on='DESYNPUF_ID', right_on='DESYNPUF_ID', how='inner')

(790790, 108)
count 7.907900e+05 790790.000000 7.795370e+05 7.795370e+05 790790.000000 790790.000000 7.730000e+05 1.344330e+05 2.576660e+05 790790.000000 200.000000 790790.000000 790790.000000 0.0 2.086000e+03 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000
mean 5.425026e+14 1.014230 2.008925e+07 2.008929e+07 283.924569 10.239760 4.975733e+09 4.947815e+09 4.904993e+09 0.012898 5535.690000 2.825466 83.845876 NaN 2.008088e+07 1.577348 1.250675 25.668658 377.350066 11.681684 11.530626 2.681388 7.501712 1.653897 1.489232 1.668855 1.872031 1.729455 1.608191 1.348555 1.304843 1.708268 1.721176 1.911604 4657.439687 526.227770 198.064594 1943.823152 593.896534 66.611920 2240.245678 620.647972 34.333034 72.318409
std 2.858482e+11 0.118438 7.475135e+03 7.471623e+03 571.392794 234.668372 2.874373e+09 2.890483e+09 2.889353e+09 2.315506 3164.871486 15.596522 178.759708 NaN 2.580152e+02 0.493981 0.708571 15.140097 266.921315 1.769500 2.121507 4.914490 5.714692 0.475727 0.499884 0.470625 0.334056 0.444241 0.488155 0.476513 0.460341 0.454560 0.448421 0.283871 12205.547142 1289.921983 2646.498161 3849.941441 1055.746848 597.738709 2228.993683 580.428308 121.383274 13.006691
min 5.420123e+14 1.000000 2.007121e+07 2.008010e+07 -100.000000 0.000000 1.024080e+05 3.258650e+05 1.024080e+05 0.000000 61.000000 0.000000 0.000000 NaN 2.008010e+07 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 -3000.000000 0.000000 0.000000 -90.000000 0.000000 0.000000 0.000000 0.000000 0.000000 25.000000
25% 5.422523e+14 1.000000 2.008092e+07 2.008093e+07 40.000000 0.000000 2.521572e+09 2.474254e+09 2.443058e+09 0.000000 3771.750000 0.000000 0.000000 NaN 2.008070e+07 1.000000 1.000000 11.000000 141.000000 12.000000 12.000000 0.000000 0.000000 1.000000 1.000000 1.000000 2.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 2.000000 0.000000 0.000000 0.000000 240.000000 60.000000 0.000000 760.000000 220.000000 0.000000 67.000000
50% 5.425023e+14 1.000000 2.009043e+07 2.009043e+07 80.000000 0.000000 4.904972e+09 4.870758e+09 4.774818e+09 0.000000 4533.500000 0.000000 20.000000 NaN 2.008090e+07 2.000000 1.000000 25.000000 350.000000 12.000000 12.000000 0.000000 12.000000 2.000000 1.000000 2.000000 2.000000 2.000000 2.000000 1.000000 1.000000 2.000000 2.000000 2.000000 0.000000 0.000000 0.000000 810.000000 260.000000 0.000000 1570.000000 460.000000 0.000000 73.000000
75% 5.427523e+14 1.000000 2.009120e+07 2.009121e+07 200.000000 0.000000 7.501324e+09 7.485573e+09 7.491945e+09 0.000000 8745.250000 0.000000 70.000000 NaN 2.008110e+07 2.000000 1.000000 38.000000 570.000000 12.000000 12.000000 0.000000 12.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 3000.000000 1024.000000 0.000000 2030.000000 680.000000 0.000000 3080.000000 860.000000 10.000000 81.000000
max 5.429923e+14 2.000000 2.010123e+07 2.010123e+07 3300.000000 14000.000000 9.999886e+09 9.999615e+09 9.999470e+09 800.000000 9961.000000 200.000000 1100.000000 NaN 2.008120e+07 2.000000 5.000000 54.000000 999.000000 12.000000 12.000000 12.000000 12.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 164220.000000 53096.000000 68000.000000 50020.000000 12450.000000 14400.000000 21160.000000 5260.000000 2040.000000 100.000000

Identify Features from Beneficiary data (just grabbed them all to start)

In [17]:
            , 'SP_OSTEOPRS', 'SP_RA_OA', 'SP_STRKETIA']

# The name of the column for the output varaible.
target = 'ICD9_DGNS_CD_1'

Group Target ICD9 codes from Claims data (chose Circulatory System Diseases – which is 1 of 17 ICD9 groupings)

In [18]:
# 390 ‐ 459 Diseases of the circulatory system
df_joined_circ = df_joined.where(pd.to_numeric(df_joined['ICD9_DGNS_CD_1'], errors='coerce')>=390)
df_joined_circ = df_joined_circ.where(pd.to_numeric(df_joined_circ['ICD9_DGNS_CD_1'], errors='coerce')<=459)

# reduce data for only circulatory codes
df_null_target = df_joined_circ['ICD9_DGNS_CD_1'].notnull()
df_joined_cleaned = df_joined_circ.loc[df_null_target]
print('shape of data', df_joined_cleaned.shape)
shape of data (2144, 108)
In [19]:
plt.xlabel('code number')
plt.ylabel('count of specific code')
plt.title('ICD9 Codes: Diseases of the circulatory system')
print('ICD9 code 412 makes up about', df_joined_cleaned[target].value_counts()[0]/df_joined_cleaned.shape[0], 'of the data')
ICD9 code 412 makes up about 0.298041044776 of the data

Before running any model, more preprossing needed to convert text to numbers, and dealing with NaN

In [20]:
# replace Y and N
df_joined_cleaned['BENE_ESRD_IND'] = df_joined_cleaned['BENE_ESRD_IND'].astype(str)
df_joined_cleaned.BENE_ESRD_IND.replace(['Y', '0'], [1, 0], inplace=True)

# replace NaN with median, mean, most_frequent
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)[features])
df_joined_cleaned[features] = imp.transform(df_joined_cleaned[features])

Split data into train and test

In [21]:
from sklearn.cross_validation import train_test_split

x = df_joined_cleaned[features]
y = df_joined_cleaned[target]

# Divide the data into a training and a test set.
random_state = 0  # Fixed so that everybody has got the same split
test_set_fraction = 0.2
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_set_fraction, 

print('Size of training set: {}'.format(len(x_train)))
print('Size of test set: {}'.format(len(x_test)))
Size of training set: 1715
Size of test set: 429

Running two algorithms: Random Forest and Logistic Regression

In [22]:
rf = ensemble.RandomForestClassifier(random_state=15) # set seed, y_train)
print('random forest model score', rf.score(x_test,y_test))

lm = linear_model.LogisticRegression(),y_train)
print('logistic regression model score', lm.score(x_test,y_test))

y_pred = lm.predict(x_test)
random forest model score 0.20979020979
logistic regression model score 0.286713286713

Dimensionality Reduction with PCA

In [23]:
pca = decomposition.PCA(n_components=9)
print('original shape prior to PCA', x_train.shape)
x_train_new = pca.fit_transform(x_train)
x_test_new = pca.transform(x_test)
print('new shape after to PCA', x_train_new.shape)
original shape prior to PCA (1715, 19)
new shape after to PCA (1715, 9)

Run Algorithms again with reduced features thanks to PCA

In [24]:
rf = ensemble.RandomForestClassifier(random_state=15) # set seed, y_train)
print('random forest model score', rf.score(x_test_new, y_test))

lm = linear_model.LogisticRegression(), y_train)
print('logistic regression model score', lm.score(x_test_new, y_test))

print('After dimensionality reduction, our performance increased on Logistic Regression, and performed better than Random Forest')
random forest model score 0.198135198135
logistic regression model score 0.298368298368
After dimensionality reduction, our performance increased on Logistic Regression, and performed better than Random Forest

Plotting Results

In [25]:
# get class list for chart
def class_classification_report(cr, title='Classification report ', with_avg_total=False,

    lines = cr.split('\n')
    classes = []
    for line in lines[2 : (len(lines) - 3)]:
        t = line.split()
    return classes

y_pred = lm.predict(x_test_new)
y_label = class_classification_report((classification_report(y_test, y_pred)))
In [26]:
from sklearn.metrics import recall_score, f1_score, precision_score

print('Recall = ', recall_score(y_test, y_pred, average='weighted'))
print('Precision = ', precision_score(y_test, y_pred, average='weighted'))
print('F1 = ', f1_score(y_test, y_pred, average='weighted'))
Recall =  0.298368298368
Precision =  0.116132524663
F1 =  0.142505332535
In [27]:
# plot Precision, Recall, F1
n_groups = len(f1_score(y_test, y_pred, average=None))
index = np.arange(n_groups)

width = 0.5
fig, ax = plt.subplots()
rects1 =, f1_score(y_test, y_pred, average=None)
               , width, alpha=0.8, color='r')
rects2 = + width, recall_score(y_test, y_pred, average=None)
               , width, color='y')
rects3 = + width * 2, precision_score(y_test, y_pred, average=None)
               , width, color='b')
ax.legend((rects1[0], rects2[0], rects3[0]), ('F1', 'Recall', 'Precision'))
plt.xticks(index + width, np.sort(y_label), rotation=90)
plt.title('ICD9 Codes: Diseases of the circulatory system')

Interpreting Model Results

Of the 40 ICD9 codes representing Circulatory diseases, my model only produced predicted for 0422, 0430, and 412, which isn’t ideal, but those three codes make us 37% of my training data. Above, I plotted, recall, precision, and f1 scores. I like using f1 score as it’s really a balance of recall & precision (what portion of true-positives is your model getting and how good it is at predicting true-positives). At this point, much more investigating the data and tweaking the models is needed to improve performance. Gaining domain knowledge is this field would certaining help too!

The data is unbalanced, and if I would have just guessed code 412 for all instances, my Recall rate would have increase, but then my Precision and f1 would have dropped.

Final Thoughts

This was a “quick and dirty” model building exercise, which didn’t produce great results, but is a good starting point. Rarely are you going to get great results with limited amount of work.

Overall, there is a some opportunity here, but would take many more iterations of model tuning. I would recommend bringing in the Drug Prescription data source, along with a couple more years of claims data so health trends by patient could be leveraged.

Array_Position Custom UDF in Hive

Working with arrays in hive is pretty slick. However, I’ve run into an issue in which in the published Hive UDFs there is no function to return an index of a value within an array when it contains an item you’re looking for. So I took it upon myself to write it. This code runs on hive:
import sys

def get_position (item, item_list_string):
                # making a list as it would be a string coming from hive
		item_list = item_list_string.split(',') 
		array_position = 0
		for position, value in enumerate(item_list):
			value = value.replace('[', '')
			value= value.replace(']', '')
			value = value.replace('\"', '')
			if value == item:
				array_position = position + 1
		# Add all the output values to a list
		output = [str(array_position)]
		# Print output as tab delimited string objects to stdout
		print '\t'.join(map(str, output))
		m = 0

def main(argv):
	# Hive submits each record to stdin
	# The record/line is stripped of extra characters
	for line in sys.stdin:
		line = line.strip()
		item, item_list_string = line.split('\t')
		get_position(item, item_list_string)

if __name__ == "__main__":
	main(sys.argv[0]) # 0 as there are no args besides the hive query fields

The hive code to invoke this function is as follows:

SELECT TRANSFORM (a.item, a.item_list_string)
AS array_position
(select item, item_list_string
from your_table) a; 

Object Orientated Design for Data Science

Working with complex datasets, often custom code is needed for an intended solution. However, when designing custom code, the use object-oriented design practices promote code reusability and ease to update & extend funtionality. In this post, I’m going to look at the Titanic data (of the 2k passengers, which survived, etc), you can download here Titanic dataset.

This is a basic dataset, however the principles can be applied to more complex data sets. The idea is to perform different operations on each row of data depending on the passenger class (1st, 2nd, 3rd, or the ship’s crew). This could be accomplished by using a bunch of “if, else” statements, but again I’m looking for clean and reusable code here, and when working with complex data sets, it’s a much better approach.

For your reference, here is the top few rows of the data:


And here is the code (with comments):

def readData():
   # source data column mapping: Passenger,Class,Sex,Age,Survived
    hf = HandlerFactory() #create instance
    with open('titanic.csv',"r") as fl:
        next(fl) # skip header
        for lines in fl:
            lines = lines.strip('\n')
            fields = lines.split(',')
            passenger_class = fields[1]
            handler = hf.getHandler(passenger_class)
class HandlerFactory():
    def __init__(self):
        self.handlers = {} #dict for each handler class
    def register(self,handler):
        self.handlers[handler.getField()] = handler #add each class to dict
    def getHandler(self,fld):
        return self.handlers[fld] # returns the python class to handle specific passenger class

class FieldHander: # base python class to be inherited
    def __init__(self, field):
        self.field = field
    def getField(self):
        return self.field
    def setField(self, field):
        self.field = field    
class CrewHandler(FieldHander): # print passenger class and survived
    def __init__(self):
        FieldHander.__init__(self, "Crew")
    def apply(self, fields):
        print ([fields[1]])  

class FirstClassHandler(FieldHander): # print passenger class, gender, age, and survived
    def __init__(self):
        FieldHander.__init__(self, "1st")
    def apply(self, fields):
        print ([fields[1], fields[2], fields[3], fields[4]]) 

class SecondClassHandler(FieldHander): # print passenger class, gender, and survived
    def __init__(self):
        FieldHander.__init__(self, "2nd")
    def apply(self, fields):
        print ([fields[1], fields[2], fields[4]])

class ThirdClassHandler(FieldHander): # print passenger class and survived
    def __init__(self):
        FieldHander.__init__(self, "3rd")
    def apply(self, fields):
        print ([fields[1], fields[4]])

And the results (notice how each class of passenger has different fields chosen):
[‘3rd’, ‘No’]
[‘3rd’, ‘Yes’]
[‘2nd’, ‘Male’, ‘No’]
[‘1st’, ‘Male’, ‘Adult’, ‘No’]

Running Python in Hive/Hadoop

One of the things I love about running Hive is the ability to run Python and leverage the power of the parallel processing. Below I’m going to show a stripped down example of how to integrate a Hive statement & Python together to aggregate data to prepare it for modeling. Keep in mind, you can also use Hive & Python to transform data line by line as well, and it extremely handy for data transformation.

Use case: print out an array of products sold to a particular user. Again is a basic example, but you can build upon this and generate products sold for every user, then use KNN to generate clusters of users, or perhaps Association Rules to generate baskets.

Here is the Python script, which will have to be saved in local Hadoop path:

import sys

items_sold = []  # create global list variable

class Items:  # create class to store and access items added
    def __init__(self, x):
    	self.x = x

    def set_x(self, x):
        self.x = x
    def get_x(self):
        return self.x

def print_results():  # print output in Hive
	result_set = [item.get_x() for item in items_sold];
	print (result_set)

	# Hive submits each record to stdin
	# The record/line is stripped of extra characters and submitted
for line in sys.stdin:
	line = line.strip()
	purchased_item = line.split('\t')


Here is the hive statement:

add file; 
select TRANSFORM (a.purchased_item)
using ''
AS array_purchased
from (select purchased_item from company_purchases where user_id = 'u1') a;

Result in Hive will be similar to this: [‘s_123’, ‘s_234’, ‘s890’]

Probability of a Revenue Threshold

A retailer’s website purchases have an average order size of $100 and a standard deviation of $75. What is the probability of 10 orders generating over $1,250 in Revenue?

mean = $100.00
stdev = $75.00

avg_order_needed = $1250/10 = $125.00
standard_error = $75/sqrt(10) = $23.72
z-score = (125.00 – 100.00)/23.72 = 1.05

We are looking to solve for this shaded area under the curve.


Looking up on z-table for 1.05, the probability is 0.1469 or 14.7% of a obtaining $1,250 in Revenue from 10 random orders.