Predict Specific Claims in Medicare Data



Whenever I’m faced with a machine learning task, my goal on day 1 is to build an initial model. The model will without a doubt need to be tuned in days or even weeks after, but it’s good to have a starting point. In the project below, I timeboxed a machine learning inital model of about 4 hours to see how far I get along with some initial results.

Problem Statement

A peer of mine in my Master’s program mentioned there is publicly available Medicare CMS data. I have very little knowledge of healthcare data, but thought I’d explore the data and see if there was an aspect that could be useful in buidling a model to make predictions.

The data:

  • 2008 claims outpatient data (used this, only 1 of 20 available samples, still about 1.1 million rows of claims data)
  • 2008 beneficiary data (used this)
  • 2008 claims inpatient data (did not use this due to initial time constraint)
  • 2008 presciption data (did not use this due to initial time constraint)

I identified one useful infomation to build a model on: Predict a medicare claim specific to ICD9 codes relating to diseases of circulatory system (this makes up about 11% of claims).

Steps included in this project

  • getting the data
  • exploring the data
    • identifying potential useful features
  • transforming the data
  • data exploration
  • dimensionality reduction
  • selectling initial models
  • evaluating initial models
  • summarizing

Useful imports and settings

In [15]:
import datetime
import numpy as np
import pandas as pd
from sklearn import linear_model, ensemble, decomposition
from sklearn.preprocessing import MinMaxScaler, Imputer
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve, recall_score
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Settings for plots
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 6.0)

pd.options.display.max_columns = None

def warn(*args, **kwargs):
warnings.warn = warn

Read in data, convert columns, join datasets

In [16]:
dateparse = lambda x: pd.datetime.strptime(x, '%Y%m%d')

df_outpatient_claims1 = pd.read_csv('DE1_0_2008_to_2010_Outpatient_Claims_Sample_1.csv')
df_beneficiary1 = pd.read_csv('DE1_0_2008_Beneficiary_Summary_File_Sample_1.csv'
                             ,parse_dates=[1], date_parser=dateparse)

# calc age from birthdate
date_2009 = pd.datetime.strptime('20090101', '%Y%m%d')
df_beneficiary1['AGE'] = (date_2009 - df_beneficiary1['BENE_BIRTH_DT']).astype('<m8[Y]')

# join data sets
df_joined = pd.merge(df_outpatient_claims1, df_beneficiary1, left_on='DESYNPUF_ID', right_on='DESYNPUF_ID', how='inner')

(790790, 108)
count 7.907900e+05 790790.000000 7.795370e+05 7.795370e+05 790790.000000 790790.000000 7.730000e+05 1.344330e+05 2.576660e+05 790790.000000 200.000000 790790.000000 790790.000000 0.0 2.086000e+03 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000 790790.000000
mean 5.425026e+14 1.014230 2.008925e+07 2.008929e+07 283.924569 10.239760 4.975733e+09 4.947815e+09 4.904993e+09 0.012898 5535.690000 2.825466 83.845876 NaN 2.008088e+07 1.577348 1.250675 25.668658 377.350066 11.681684 11.530626 2.681388 7.501712 1.653897 1.489232 1.668855 1.872031 1.729455 1.608191 1.348555 1.304843 1.708268 1.721176 1.911604 4657.439687 526.227770 198.064594 1943.823152 593.896534 66.611920 2240.245678 620.647972 34.333034 72.318409
std 2.858482e+11 0.118438 7.475135e+03 7.471623e+03 571.392794 234.668372 2.874373e+09 2.890483e+09 2.889353e+09 2.315506 3164.871486 15.596522 178.759708 NaN 2.580152e+02 0.493981 0.708571 15.140097 266.921315 1.769500 2.121507 4.914490 5.714692 0.475727 0.499884 0.470625 0.334056 0.444241 0.488155 0.476513 0.460341 0.454560 0.448421 0.283871 12205.547142 1289.921983 2646.498161 3849.941441 1055.746848 597.738709 2228.993683 580.428308 121.383274 13.006691
min 5.420123e+14 1.000000 2.007121e+07 2.008010e+07 -100.000000 0.000000 1.024080e+05 3.258650e+05 1.024080e+05 0.000000 61.000000 0.000000 0.000000 NaN 2.008010e+07 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 -3000.000000 0.000000 0.000000 -90.000000 0.000000 0.000000 0.000000 0.000000 0.000000 25.000000
25% 5.422523e+14 1.000000 2.008092e+07 2.008093e+07 40.000000 0.000000 2.521572e+09 2.474254e+09 2.443058e+09 0.000000 3771.750000 0.000000 0.000000 NaN 2.008070e+07 1.000000 1.000000 11.000000 141.000000 12.000000 12.000000 0.000000 0.000000 1.000000 1.000000 1.000000 2.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 2.000000 0.000000 0.000000 0.000000 240.000000 60.000000 0.000000 760.000000 220.000000 0.000000 67.000000
50% 5.425023e+14 1.000000 2.009043e+07 2.009043e+07 80.000000 0.000000 4.904972e+09 4.870758e+09 4.774818e+09 0.000000 4533.500000 0.000000 20.000000 NaN 2.008090e+07 2.000000 1.000000 25.000000 350.000000 12.000000 12.000000 0.000000 12.000000 2.000000 1.000000 2.000000 2.000000 2.000000 2.000000 1.000000 1.000000 2.000000 2.000000 2.000000 0.000000 0.000000 0.000000 810.000000 260.000000 0.000000 1570.000000 460.000000 0.000000 73.000000
75% 5.427523e+14 1.000000 2.009120e+07 2.009121e+07 200.000000 0.000000 7.501324e+09 7.485573e+09 7.491945e+09 0.000000 8745.250000 0.000000 70.000000 NaN 2.008110e+07 2.000000 1.000000 38.000000 570.000000 12.000000 12.000000 0.000000 12.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 3000.000000 1024.000000 0.000000 2030.000000 680.000000 0.000000 3080.000000 860.000000 10.000000 81.000000
max 5.429923e+14 2.000000 2.010123e+07 2.010123e+07 3300.000000 14000.000000 9.999886e+09 9.999615e+09 9.999470e+09 800.000000 9961.000000 200.000000 1100.000000 NaN 2.008120e+07 2.000000 5.000000 54.000000 999.000000 12.000000 12.000000 12.000000 12.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 164220.000000 53096.000000 68000.000000 50020.000000 12450.000000 14400.000000 21160.000000 5260.000000 2040.000000 100.000000

Identify Features from Beneficiary data (just grabbed them all to start)

In [17]:
            , 'SP_OSTEOPRS', 'SP_RA_OA', 'SP_STRKETIA']

# The name of the column for the output varaible.
target = 'ICD9_DGNS_CD_1'

Group Target ICD9 codes from Claims data (chose Circulatory System Diseases – which is 1 of 17 ICD9 groupings)

In [18]:
# 390 ‐ 459 Diseases of the circulatory system
df_joined_circ = df_joined.where(pd.to_numeric(df_joined['ICD9_DGNS_CD_1'], errors='coerce')>=390)
df_joined_circ = df_joined_circ.where(pd.to_numeric(df_joined_circ['ICD9_DGNS_CD_1'], errors='coerce')<=459)

# reduce data for only circulatory codes
df_null_target = df_joined_circ['ICD9_DGNS_CD_1'].notnull()
df_joined_cleaned = df_joined_circ.loc[df_null_target]
print('shape of data', df_joined_cleaned.shape)
shape of data (2144, 108)
In [19]:
plt.xlabel('code number')
plt.ylabel('count of specific code')
plt.title('ICD9 Codes: Diseases of the circulatory system')
print('ICD9 code 412 makes up about', df_joined_cleaned[target].value_counts()[0]/df_joined_cleaned.shape[0], 'of the data')
ICD9 code 412 makes up about 0.298041044776 of the data

Before running any model, more preprossing needed to convert text to numbers, and dealing with NaN

In [20]:
# replace Y and N
df_joined_cleaned['BENE_ESRD_IND'] = df_joined_cleaned['BENE_ESRD_IND'].astype(str)
df_joined_cleaned.BENE_ESRD_IND.replace(['Y', '0'], [1, 0], inplace=True)

# replace NaN with median, mean, most_frequent
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)[features])
df_joined_cleaned[features] = imp.transform(df_joined_cleaned[features])

Split data into train and test

In [21]:
from sklearn.cross_validation import train_test_split

x = df_joined_cleaned[features]
y = df_joined_cleaned[target]

# Divide the data into a training and a test set.
random_state = 0  # Fixed so that everybody has got the same split
test_set_fraction = 0.2
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_set_fraction, 

print('Size of training set: {}'.format(len(x_train)))
print('Size of test set: {}'.format(len(x_test)))
Size of training set: 1715
Size of test set: 429

Running two algorithms: Random Forest and Logistic Regression

In [22]:
rf = ensemble.RandomForestClassifier(random_state=15) # set seed, y_train)
print('random forest model score', rf.score(x_test,y_test))

lm = linear_model.LogisticRegression(),y_train)
print('logistic regression model score', lm.score(x_test,y_test))

y_pred = lm.predict(x_test)
random forest model score 0.20979020979
logistic regression model score 0.286713286713

Dimensionality Reduction with PCA

In [23]:
pca = decomposition.PCA(n_components=9)
print('original shape prior to PCA', x_train.shape)
x_train_new = pca.fit_transform(x_train)
x_test_new = pca.transform(x_test)
print('new shape after to PCA', x_train_new.shape)
original shape prior to PCA (1715, 19)
new shape after to PCA (1715, 9)

Run Algorithms again with reduced features thanks to PCA

In [24]:
rf = ensemble.RandomForestClassifier(random_state=15) # set seed, y_train)
print('random forest model score', rf.score(x_test_new, y_test))

lm = linear_model.LogisticRegression(), y_train)
print('logistic regression model score', lm.score(x_test_new, y_test))

print('After dimensionality reduction, our performance increased on Logistic Regression, and performed better than Random Forest')
random forest model score 0.198135198135
logistic regression model score 0.298368298368
After dimensionality reduction, our performance increased on Logistic Regression, and performed better than Random Forest

Plotting Results

In [25]:
# get class list for chart
def class_classification_report(cr, title='Classification report ', with_avg_total=False,

    lines = cr.split('\n')
    classes = []
    for line in lines[2 : (len(lines) - 3)]:
        t = line.split()
    return classes

y_pred = lm.predict(x_test_new)
y_label = class_classification_report((classification_report(y_test, y_pred)))
In [26]:
from sklearn.metrics import recall_score, f1_score, precision_score

print('Recall = ', recall_score(y_test, y_pred, average='weighted'))
print('Precision = ', precision_score(y_test, y_pred, average='weighted'))
print('F1 = ', f1_score(y_test, y_pred, average='weighted'))
Recall =  0.298368298368
Precision =  0.116132524663
F1 =  0.142505332535
In [27]:
# plot Precision, Recall, F1
n_groups = len(f1_score(y_test, y_pred, average=None))
index = np.arange(n_groups)

width = 0.5
fig, ax = plt.subplots()
rects1 =, f1_score(y_test, y_pred, average=None)
               , width, alpha=0.8, color='r')
rects2 = + width, recall_score(y_test, y_pred, average=None)
               , width, color='y')
rects3 = + width * 2, precision_score(y_test, y_pred, average=None)
               , width, color='b')
ax.legend((rects1[0], rects2[0], rects3[0]), ('F1', 'Recall', 'Precision'))
plt.xticks(index + width, np.sort(y_label), rotation=90)
plt.title('ICD9 Codes: Diseases of the circulatory system')

Interpreting Model Results

Of the 40 ICD9 codes representing Circulatory diseases, my model only produced predicted for 0422, 0430, and 412, which isn’t ideal, but those three codes make us 37% of my training data. Above, I plotted, recall, precision, and f1 scores. I like using f1 score as it’s really a balance of recall & precision (what portion of true-positives is your model getting and how good it is at predicting true-positives). At this point, much more investigating the data and tweaking the models is needed to improve performance. Gaining domain knowledge is this field would certaining help too!

The data is unbalanced, and if I would have just guessed code 412 for all instances, my Recall rate would have increase, but then my Precision and f1 would have dropped.

Final Thoughts

This was a “quick and dirty” model building exercise, which didn’t produce great results, but is a good starting point. Rarely are you going to get great results with limited amount of work.

Overall, there is a some opportunity here, but would take many more iterations of model tuning. I would recommend bringing in the Drug Prescription data source, along with a couple more years of claims data so health trends by patient could be leveraged.

Array_Position Custom UDF in Hive

Working with arrays in hive is pretty slick. However, I’ve run into an issue in which in the published Hive UDFs there is no function to return an index of a value within an array when it contains an item you’re looking for. So I took it upon myself to write it. This code runs on hive:
import sys

def get_position (item, item_list_string):
                # making a list as it would be a string coming from hive
		item_list = item_list_string.split(',') 
		array_position = 0
		for position, value in enumerate(item_list):
			value = value.replace('[', '')
			value= value.replace(']', '')
			value = value.replace('\"', '')
			if value == item:
				array_position = position + 1
		# Add all the output values to a list
		output = [str(array_position)]
		# Print output as tab delimited string objects to stdout
		print '\t'.join(map(str, output))
		m = 0

def main(argv):
	# Hive submits each record to stdin
	# The record/line is stripped of extra characters
	for line in sys.stdin:
		line = line.strip()
		item, item_list_string = line.split('\t')
		get_position(item, item_list_string)

if __name__ == "__main__":
	main(sys.argv[0]) # 0 as there are no args besides the hive query fields

The hive code to invoke this function is as follows:

SELECT TRANSFORM (a.item, a.item_list_string)
AS array_position
(select item, item_list_string
from your_table) a; 

Object Orientated Design for Data Science

Working with complex datasets, often custom code is needed for an intended solution. However, when designing custom code, the use object-oriented design practices promote code reusability and ease to update & extend funtionality. In this post, I’m going to look at the Titanic data (of the 2k passengers, which survived, etc), you can download here Titanic dataset.

This is a basic dataset, however the principles can be applied to more complex data sets. The idea is to perform different operations on each row of data depending on the passenger class (1st, 2nd, 3rd, or the ship’s crew). This could be accomplished by using a bunch of “if, else” statements, but again I’m looking for clean and reusable code here, and when working with complex data sets, it’s a much better approach.

For your reference, here is the top few rows of the data:


And here is the code (with comments):

def readData():
   # source data column mapping: Passenger,Class,Sex,Age,Survived
    hf = HandlerFactory() #create instance
    with open('titanic.csv',"r") as fl:
        next(fl) # skip header
        for lines in fl:
            lines = lines.strip('\n')
            fields = lines.split(',')
            passenger_class = fields[1]
            handler = hf.getHandler(passenger_class)
class HandlerFactory():
    def __init__(self):
        self.handlers = {} #dict for each handler class
    def register(self,handler):
        self.handlers[handler.getField()] = handler #add each class to dict
    def getHandler(self,fld):
        return self.handlers[fld] # returns the python class to handle specific passenger class

class FieldHander: # base python class to be inherited
    def __init__(self, field):
        self.field = field
    def getField(self):
        return self.field
    def setField(self, field):
        self.field = field    
class CrewHandler(FieldHander): # print passenger class and survived
    def __init__(self):
        FieldHander.__init__(self, "Crew")
    def apply(self, fields):
        print ([fields[1]])  

class FirstClassHandler(FieldHander): # print passenger class, gender, age, and survived
    def __init__(self):
        FieldHander.__init__(self, "1st")
    def apply(self, fields):
        print ([fields[1], fields[2], fields[3], fields[4]]) 

class SecondClassHandler(FieldHander): # print passenger class, gender, and survived
    def __init__(self):
        FieldHander.__init__(self, "2nd")
    def apply(self, fields):
        print ([fields[1], fields[2], fields[4]])

class ThirdClassHandler(FieldHander): # print passenger class and survived
    def __init__(self):
        FieldHander.__init__(self, "3rd")
    def apply(self, fields):
        print ([fields[1], fields[4]])

And the results (notice how each class of passenger has different fields chosen):
[‘3rd’, ‘No’]
[‘3rd’, ‘Yes’]
[‘2nd’, ‘Male’, ‘No’]
[‘1st’, ‘Male’, ‘Adult’, ‘No’]

Running Python in Hive/Hadoop

One of the things I love about running Hive is the ability to run Python and leverage the power of the parallel processing. Below I’m going to show a stripped down example of how to integrate a Hive statement & Python together to aggregate data to prepare it for modeling. Keep in mind, you can also use Hive & Python to transform data line by line as well, and it extremely handy for data transformation.

Use case: print out an array of products sold to a particular user. Again is a basic example, but you can build upon this and generate products sold for every user, then use KNN to generate clusters of users, or perhaps Association Rules to generate baskets.

Here is the Python script, which will have to be saved in local Hadoop path:

import sys

items_sold = []  # create global list variable

class Items:  # create class to store and access items added
    def __init__(self, x):
    	self.x = x

    def set_x(self, x):
        self.x = x
    def get_x(self):
        return self.x

def print_results():  # print output in Hive
	result_set = [item.get_x() for item in items_sold];
	print (result_set)

	# Hive submits each record to stdin
	# The record/line is stripped of extra characters and submitted
for line in sys.stdin:
	line = line.strip()
	purchased_item = line.split('\t')


Here is the hive statement:

add file; 
select TRANSFORM (a.purchased_item)
using ''
AS array_purchased
from (select purchased_item from company_purchases where user_id = 'u1') a;

Result in Hive will be similar to this: [‘s_123’, ‘s_234’, ‘s890’]

Probability of a Revenue Threshold

A retailer’s website purchases have an average order size of $100 and a standard deviation of $75. What is the probability of 10 orders generating over $1,250 in Revenue?

mean = $100.00
stdev = $75.00

avg_order_needed = $1250/10 = $125.00
standard_error = $75/sqrt(10) = $23.72
z-score = (125.00 – 100.00)/23.72 = 1.05

We are looking to solve for this shaded area under the curve.


Looking up on z-table for 1.05, the probability is 0.1469 or 14.7% of a obtaining $1,250 in Revenue from 10 random orders.

Crowd-sourced Recommender Demo

Recommender Demo – click here!

This demo of a recommender is to illustrate an example of how a website (online music, e-commerce, news) generates recommendations to increase engagement and conversions.

This is not production ready, merely a POC of how it works.

* user selects favorite activities
* data is passed to server and processed in hadoop
* user can go to results page and select an activity to get recommendations

At this point, an auto-workflow has not been built, so there are a series of steps to create the new dataset. Here are the general steps:

1. user data feeds into database via website (which is used in generating recommendations)
2. data is moved and process in Hadoop
3. data is moved to MySQL, accessible using PHP
4. user selects an activity, and the crowd-sourced recommendations are displayed

Example: How Crowd-Sourcing Works (co-occurrence recommendations) Using Activities

All Users Activity History
| Activity | Art Fair | Fishing | Shovel Snow | Wedding |
| Jon          | Yes           | Yes         | Yes                      | No              |
| Jane        | No            | Yes         | No                        | Yes            |
| Jill            | Yes           | Yes         | No                        | Yes            |

A New User like to go to Weddings, and we need to recommend them other activities:
* Find Wedding in History Matrix who also enjoyed Wedding to it: U{Jane, Jill}
* Identify other activities same users (U) enjoyed, and rank by count

| Activity | Rank | Count of User (co-occurrence |
| Fishing  |  1         |  2                                                               |
| Art Fair |  2         | 1                                                                |

Predictive Algorithms on Million Song Dataset

I’ve had the opportunity within a Data Mining course in my graduate Software Engineering program to be part of a project in which we were to create a “recommendation engine”. The dataset we used was called the which there are 1M songs, along with play history of 380k users.

The goal was to provide a recommendation (ranked 1-10) of songs based on a current song played. We used three algorithms, Association Rules, Naive Bayes, and user-user co-occurance. When tested, the results were mixed, with Association Rules providing the top F1 scores, but also had the lowest # of recommendations (for a large portion of songs had less than 10 songs recommended). Co-occurance was close behind with the 2nd best F1 score, and provided the largest output of songs, as well as the lowest requirement of computational requirements.

Here is the full project on github.

Web Traffic using Linear Modeling

Wanted to illustrate a simple example to understand rate of change of web traffic over time using linear regression. My data is web traffic hits by day for past 8 months, here is top few rows:

date ,visits
10/11/14 ,37896
10/12/14 ,24098
10/13/14 ,35550
10/14/14 ,38610
10/15/14 ,35739
10/16/14 ,30316
…. through May 2015

First, I want to plot the data and add line of best fit:
plot(data$date, data$visits,pch=19,col="blue",main="Web Traffic", xlab="Date",ylab="Visits")
lm1 <- lm(data$visits ~ data$date) abline(lm1,col="red",lwd=3)


#(Intercept) data$date
#-2404.5259 148.9

To interpret this model, would be that we see 149 additional hits each day.

That model was great for absolute increase, but what if we want to average increase. To do so we can run the linear regression using log:

(Intercept) data$date
0.00000 1.00322

To interpret, would be a 0.3% increase in web traffic per day.

And other way we could look at change per day would be a generalized linear model with poisson.
plot(data$date, data$visits,pch=19,col="green",xlab="Date",ylab="Visits")
glm1 <- glm(data$visits ~ data$date, family="poisson") abline(lm1,col="red",lwd=3) # for linear model line lines(data$date,glm1$fitted,col="blue",lwd=3) # lm fit for possion


confint(glm1,level=0.95) # CI
#2.5 % 97.5 %
#(Intercept) -55.999943551 -45.190626728
#data$date 0.002976299 0.003632503

To interpret, 95% confident the increase web hits/day falls between range of 0.003 and 0.004, which is right inline with previous method of using linear regression log.

Convert Tab-Delimited to CSV

This is a very simple exercise, but necessary at times in Data Science.

f = open("input_data.txt") # input file tab delimited
f.readline() # skip the first line if needed for header removal
for line in f:
mystring = line.replace("\t", ",")
print('file created successfully')