Probability of a Revenue Threshold

A retailer’s website purchases have an average order size of $100 and a standard deviation of $75. What is the probability of 10 orders generating over $1,250 in Revenue?

mean = $100.00
stdev = $75.00

avg_order_needed = $1250/10 = $125.00
standard_error = $75/sqrt(10) = $23.72
z-score = (125.00 – 100.00)/23.72 = 1.05

We are looking to solve for the area under the normal curve to the right of z = 1.05.


Looking up 1.05 on a z-table, the upper-tail probability is 0.1469, so there is about a 14.7% chance of obtaining over $1,250 in Revenue from 10 random orders.
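If you would rather not use a z-table, the same calculation can be sketched in Python. This is a minimal sketch that assumes, like the steps above, that the average of 10 orders is approximately normal; the variable names are my own.

```python
import math

mean = 100.0            # average order size ($)
stdev = 75.0            # standard deviation of order size ($)
n_orders = 10
revenue_target = 1250.0

avg_order_needed = revenue_target / n_orders          # 125.0
standard_error = stdev / math.sqrt(n_orders)          # about 23.72
z_score = (avg_order_needed - mean) / standard_error  # about 1.05

# Upper-tail probability of the standard normal: P(Z > z_score)
prob_over_target = 0.5 * math.erfc(z_score / math.sqrt(2))

print(round(z_score, 2), round(prob_over_target, 3))
```

Because the z-score is not rounded to 1.05 before the lookup, the result comes out a touch lower than the table value (about 0.146 instead of 0.1469).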

Crowd-sourced Recommender Demo

Recommender Demo – click here!

This recommender demo illustrates how a website (online music, e-commerce, news) generates recommendations to increase engagement and conversions.

This is not production ready, merely a proof of concept (POC) of how it works.

* user selects favorite activities
* data is passed to the server and processed in Hadoop
* user can go to results page and select an activity to get recommendations

At this point, an automated workflow has not been built, so there is a series of manual steps to create the new dataset. Here are the general steps:

1. user data feeds into database via website (which is used in generating recommendations)
2. data is moved to and processed in Hadoop
3. data is moved to MySQL, accessible using PHP
4. user selects an activity, and the crowd-sourced recommendations are displayed

Example: How Crowd-Sourcing Works (co-occurrence recommendations) Using Activities

All Users Activity History
| User | Art Fair | Fishing | Shovel Snow | Wedding |
| Jon  | Yes      | Yes     | Yes         | No      |
| Jane | No       | Yes     | No          | Yes     |
| Jill | Yes      | Yes     | No          | Yes     |

A new user likes to go to weddings, and we need to recommend other activities to them:
* Find the users in the history matrix who also enjoyed Wedding: U = {Jane, Jill}
* Identify the other activities those same users (U) enjoyed, and rank them by count

| Activity | Rank | Count of Users (co-occurrence) |
| Fishing  | 1    | 2                              |
| Art Fair | 2    | 1                              |
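The two steps above can be sketched in a few lines of Python. This is only an illustration of the counting logic using the small history table from this example, not the Hadoop job behind the demo; the function name `recommend` is my own.

```python
from collections import Counter

# Activity history from the table above: user -> activities they enjoyed
history = {
    "Jon": {"Art Fair", "Fishing", "Shovel Snow"},
    "Jane": {"Fishing", "Wedding"},
    "Jill": {"Art Fair", "Fishing", "Wedding"},
}


def recommend(activity, history):
    """Rank other activities by how many fans of `activity` also enjoyed them."""
    # Step 1: users who enjoyed the selected activity (the set U)
    fans = [user for user, acts in history.items() if activity in acts]
    # Step 2: count the other activities those users enjoyed
    counts = Counter()
    for user in fans:
        counts.update(history[user] - {activity})
    # Sort by count (descending), then by name so ties are deterministic
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


recommendations = recommend("Wedding", history)
print(recommendations)  # [('Fishing', 2), ('Art Fair', 1)]
```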

Predictive Algorithms on Million Song Dataset

I’ve had the opportunity, within a Data Mining course in my graduate Software Engineering program, to be part of a project in which we were to create a “recommendation engine”. The dataset we used was the Million Song Dataset, which has 1M songs along with the play history of 380k users.

The goal was to provide a recommendation (ranked 1-10) of songs based on the song currently playing. We used three algorithms: Association Rules, Naive Bayes, and user-user co-occurrence. When tested, the results were mixed. Association Rules produced the top F1 scores but also the fewest recommendations (a large portion of songs had fewer than 10 songs recommended). Co-occurrence was close behind with the second-best F1 score, and it produced the largest output of songs with the lowest computational requirements.
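For readers unfamiliar with the F1 score, here is a rough Python sketch of how it can be computed for one ranked list of recommendations. This is one common definition (harmonic mean of precision and recall over the top k items), not necessarily the exact evaluation used in the project; the function name and the song IDs are made up for illustration.

```python
def f1_at_k(recommended, actual, k=10):
    """F1 score of the top-k recommended songs against the songs actually played."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(actual))
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)     # share of recommendations that were right
    recall = hits / len(set(actual))  # share of played songs that were recommended
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: only 4 songs recommended, 2 of them actually played
score = f1_at_k(["s1", "s2", "s3", "s4"], ["s2", "s4", "s9"])
print(round(score, 3))
```

In this definition a short list can still score well on precision, so an algorithm can reach a high F1 while returning fewer than 10 songs.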

Here is the full project on GitHub.

Web Traffic using Linear Modeling

I wanted to illustrate a simple example of using linear regression to understand the rate of change of web traffic over time. My data is web traffic hits by day for the past 8 months; here are the top few rows:

date ,visits
10/11/14 ,37896
10/12/14 ,24098
10/13/14 ,35550
10/14/14 ,38610
10/15/14 ,35739
10/16/14 ,30316
…. through May 2015

First, I want to plot the data and add a line of best fit:
plot(data$date, data$visits,pch=19,col="blue",main="Web Traffic", xlab="Date",ylab="Visits")
lm1 <- lm(data$visits ~ data$date)
abline(lm1,col="red",lwd=3)


#(Intercept) data$date
#-2404.5259 148.9

To interpret this model: on average, we see about 149 additional hits each day.

That model works well for the absolute increase, but what if we want the relative (percentage) increase per day? To get that, we can run the linear regression on the log of visits and exponentiate the coefficients:

(Intercept) data$date
0.00000 1.00322

To interpret: a slope of 1.00322 on this scale corresponds to roughly a 0.3% increase in web traffic per day.
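To make the "regress on the log, then exponentiate" idea concrete, here is a small Python sketch. The data is synthetic (made up to grow exactly 0.3% per day) since the real traffic file is not included here, and `fit_line` is a hand-rolled least-squares helper, not a library call.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic data (an assumption for illustration): 0.3% growth per day
days = list(range(240))
visits = [30000 * 1.003 ** d for d in days]

# Fit a line to log(visits), then exponentiate the slope
_, log_slope = fit_line(days, [math.log(v) for v in visits])
daily_growth = math.exp(log_slope) - 1

print(round(daily_growth * 100, 2))  # percent change per day
```

The exponentiated slope is the multiplicative change per day, so subtracting 1 gives the percentage growth.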

Another way we could look at the change per day is a generalized linear model with a Poisson family.
plot(data$date, data$visits,pch=19,col="green",xlab="Date",ylab="Visits")
glm1 <- glm(data$visits ~ data$date, family="poisson")
abline(lm1,col="red",lwd=3) # for linear model line
lines(data$date,glm1$fitted,col="blue",lwd=3) # fitted values from the Poisson model


confint(glm1,level=0.95) # CI
#2.5 % 97.5 %
#(Intercept) -55.999943551 -45.190626728
#data$date 0.002976299 0.003632503

To interpret: we are 95% confident that the daily growth coefficient (on the log scale) lies between about 0.0030 and 0.0036, i.e. roughly a 0.30% to 0.36% increase in web hits per day, which is right in line with the previous log-linear regression.

Convert Tab-Delimited to CSV

This is a very simple exercise, but necessary at times in Data Science.

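f = open("input_data.txt") # input file tab delimited
out = open("output_data.csv", "w") # output file (the name is an assumption)
f.readline() # skip the first line if needed for header removal
for line in f:
    mystring = line.replace("\t", ",")
    out.write(mystring) # write the converted line
f.close()
out.close()
print('file created successfully')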
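One caveat with a plain replace: if a field already contains a comma, the output columns will shift. A hedged alternative is Python's built-in csv module, which quotes such fields for you. The sketch below uses in-memory text so it runs on its own; the function name `tsv_to_csv` and the sample rows are made up.

```python
import csv
import io


def tsv_to_csv(tsv_file, csv_file, skip_header=False):
    """Copy tab-delimited rows to comma-delimited rows, quoting where needed."""
    reader = csv.reader(tsv_file, delimiter="\t")
    writer = csv.writer(csv_file, lineterminator="\n")
    if skip_header:
        next(reader, None)  # drop the first row if it is a header
    for row in reader:
        writer.writerow(row)


# Made-up sample: the second field of the data row contains a comma
source = io.StringIO("name\tcity\nAnn\tSt. Paul, MN\n")
target = io.StringIO()
tsv_to_csv(source, target)
print(target.getvalue())
```

With real files you would pass objects opened with `open(..., newline="")` instead of the `StringIO` buffers.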

Probability of Web Clicks in a Day

Below is a simplified example using R in which you estimate the probability that a day has a certain # of visits. The web visits are approximately normally distributed, and we want to know the probability of getting fewer than 50 visits/day.

# web traffic for last seven days
web_visits <- c(64, 34, 55, 47, 52, 59, 77)
visits_day <- mean(web_visits) # mean = 55.4
sd_visits_day <- sd(web_visits) # standard deviation = 13.5
goal_visits <- 50
# result: P(visits < 50) for a normal with the sample mean and sd
pnorm(goal_visits, mean=visits_day, sd=sd_visits_day) # .344 or 34.4% probability you'll have fewer than 50 web visits

source: Statistical Inference, Johns Hopkins University/Coursera by Brian Caffo
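For anyone working in Python rather than R, the same calculation can be sketched with the standard library. This assumes, as above, that daily visits are roughly normal; `statistics.stdev` is the sample standard deviation, matching R's sd().

```python
from statistics import NormalDist, mean, stdev

# web traffic for last seven days
web_visits = [64, 34, 55, 47, 52, 59, 77]
visits_day = mean(web_visits)      # about 55.4
sd_visits_day = stdev(web_visits)  # sample standard deviation, about 13.5
goal_visits = 50

# P(visits < 50) under a normal with the sample mean and sd
prob_below_goal = NormalDist(visits_day, sd_visits_day).cdf(goal_visits)

print(round(prob_below_goal, 3))
```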

Combine .txt Files in Python

Using HDFS (the Hadoop Distributed File System), I was able to save data from a query with the hope of using it for analysis. From there, I moved the files to my local machine using SCP. However, I was dealing with over 700 .txt files that needed to be combined. The file names are in this format: “000000_0” to “000770_0”. In Unix, the simple way is to use a command such as “cat * > new_file_name”, which will combine all files. But there could be times when you don’t want all files in a directory to be combined, or need some sort of logic applied. Using Python, here is my code to make it happen:

# open (or create) the output file "combined.csv"
out = open("combined.csv", "w")

# first file:
for line in open("000000_0"):
    out.write(line)

# files with 1 digit:
for num in range(1,10):
    f = open("00000"+str(num)+"_0")
    # f.readline() # uncomment to skip the header
    for line in f:
        out.write(line)
    f.close()

# files with 2 digits:
for num in range(10,100):
    f = open("0000"+str(num)+"_0")
    # f.readline() # uncomment to skip the header
    for line in f:
        out.write(line)
    f.close()

# files with 3 digits (range stops before 771, so 000770_0 is included):
for num in range(100,771):
    f = open("000"+str(num)+"_0")
    # f.readline() # uncomment to skip the header
    for line in f:
        out.write(line)
    f.close()

out.close()
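The three loops above differ only in how many zeros are prepended, so a more compact variant is to zero-pad the number with `str.zfill`. Here is a sketch of that idea; the function name `combine_files` and its parameters are my own, and the demo builds three tiny files in a temporary directory so it can run anywhere.

```python
import os
import tempfile


def combine_files(directory, first, last, out_name="combined.csv"):
    """Append files named like 000000_0 .. 000770_0 into one output file."""
    out_path = os.path.join(directory, out_name)
    with open(out_path, "w") as out:
        for num in range(first, last + 1):
            # zfill(6) pads the number to six digits: 7 -> "000007"
            part_path = os.path.join(directory, str(num).zfill(6) + "_0")
            with open(part_path) as part:
                for line in part:
                    out.write(line)
    return out_path


# Demo with three made-up files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for num in range(3):
        with open(os.path.join(tmp, str(num).zfill(6) + "_0"), "w") as f:
            f.write("row from file %d\n" % num)
    with open(combine_files(tmp, 0, 2)) as combined:
        combined_text = combined.read()

print(combined_text)
```

For the files described above, the call would be something like `combine_files(".", 0, 770)`, and any filtering logic can go inside the loop.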

Replace Null Values in CSV with Java

I’ve been working with the Weka Java machine learning algorithms, and with the large amounts of data extracted from databases via SQL, I’ve been running into the issue of having null values. In order for the data to be read by Weka, the nulls need to be replaced with a value, which in my case should be 0.

Here is a sample of one row of my data:  2725079062,2,77,,,,,,,,,,,,,,,4,2,2,,t
There are many ways to do this, here is a common approach:

 mylines=mylines.replaceAll(",,", ",0,");

However, what I was running into is that if there are several empty fields in a row (multiple “,,” back to back), not every gap gets filled, because the matches cannot overlap. Below is how I solved it:

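import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;

public class ConvertData {
    public static void main(String[] args) throws IOException {
        // file names are placeholders; the original paths were not preserved
        BufferedReader br = new BufferedReader(new FileReader("input.csv"));
        BufferedWriter bw = new BufferedWriter(new FileWriter("output.csv"));
        String line = "";
        while ((line = br.readLine()) != null) {
            // make an array out of line; -1 keeps trailing empty fields
            String[] values = line.split(",", -1);
            // initial string which will be the final output for the row
            String writableString = "";
            // use array list because can edit array and modify size easily
            ArrayList<String> al = new ArrayList<String>();
            for (String element : values) {
                if (element == null || element.length() == 0) {
                    al.add("0"); // replace the null (empty) value with 0
                } else {
                    al.add(element); // keep the existing value
                }
            }
            for (String s : al) { // add commas between each element of arraylist
                writableString += s + ",";
            }
            // remove last comma
            writableString = writableString.substring(0, writableString.length() - 1);
            // writes the line and a newline
            bw.write(writableString + "\n");
        }
        br.close();
        bw.close();
    }
}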

Final output would be like this:

*side note, remember to refresh the data folder when exporting
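The multiple-“,,” problem described above is easy to reproduce outside of Java. Here is a short Python sketch on a shortened, made-up row: Python's `str.replace`, like Java's `replaceAll`, only replaces non-overlapping matches, so a run of three commas leaves one gap unfilled, while the split-and-fill approach handles every empty field.

```python
row = "a,,,b,,c"  # shortened, made-up row with two and then one empty field

# Naive approach: non-overlapping matches leave a gap in the ",,," run
naive = row.replace(",,", ",0,")

# Split-and-fill approach: treat every empty field individually
filled = ",".join(field if field else "0" for field in row.split(","))

print(naive)   # a,0,,b,0,c
print(filled)  # a,0,0,b,0,c
```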