Predictive Algorithms on Million Song Dataset

As part of a Data Mining course in my graduate Software Engineering program, I had the opportunity to work on a project to build a “recommendation engine”. The dataset we used was the Million Song Dataset, which contains 1M songs along with the play history of 380k users.

The goal was to provide a recommendation of songs (ranked 1-10) based on the song currently being played. We used three algorithms: Association Rules, Naive Bayes, and user-user co-occurrence. When tested, the results were mixed. Association Rules gave the top F1 scores but also produced the fewest recommendations (a large portion of songs had fewer than 10 songs recommended). Co-occurrence was close behind with the second-best F1 score, produced the largest number of recommended songs, and had the lowest computational requirements.
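To make the co-occurrence idea concrete, here is a minimal sketch in R. It is one common reading of co-occurrence (count how many users played both songs, then rank by that count) and is not necessarily how the project implemented it. The play history, the user and song names, and the `recommend` helper are all made up for illustration.

```r
# Toy play history (hypothetical users and songs, not from the Million Song Dataset)
plays <- data.frame(
  user = c("u1", "u1", "u1", "u2", "u2", "u3", "u3", "u3"),
  song = c("A",  "B",  "C",  "A",  "B",  "A",  "C",  "D"),
  stringsAsFactors = FALSE
)

# Binary user x song matrix: 1 if the user played the song at least once
tab <- table(plays$user, plays$song)
m <- matrix(as.numeric(tab > 0), nrow = nrow(tab), dimnames = dimnames(tab))

# Song x song co-occurrence: number of users who played both songs
cooc <- crossprod(m)
diag(cooc) <- 0  # a song should not recommend itself

# Rank the other songs by co-occurrence with the current song (hypothetical helper)
recommend <- function(song, n = 10) {
  scores <- sort(cooc[song, ], decreasing = TRUE)
  head(names(scores[scores > 0]), n)
}

recommend("A")
```

Note that `recommend` returns fewer than `n` songs when a song co-occurs with only a few others, which is the same "fewer than 10 recommendations" effect mentioned above.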

Here is the full project on GitHub.

Web Traffic using Linear Modeling

I wanted to illustrate a simple example of using linear regression to understand the rate of change of web traffic over time. My data is daily web traffic hits for the past 8 months; here are the top few rows:

date ,visits
10/11/14 ,37896
10/12/14 ,24098
10/13/14 ,35550
10/14/14 ,38610
10/15/14 ,35739
10/16/14 ,30316
…. through May 2015
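For the regressions to treat the dates as a time axis, the date column needs to be a `Date` rather than plain text. A minimal sketch of loading rows in this layout, assuming month/day/two-digit-year strings (the rows are embedded as text here so the example is self-contained; a real file would be read by path instead):

```r
# First rows as shown above, embedded as text for a self-contained example
csv_text <- "date ,visits
10/11/14 ,37896
10/12/14 ,24098
10/13/14 ,35550
10/14/14 ,38610
10/15/14 ,35739
10/16/14 ,30316"

# strip.white drops the trailing space before each comma
data <- read.csv(text = csv_text, strip.white = TRUE)
names(data) <- c("date", "visits")

# Parse month/day/two-digit-year strings into Date objects
data$date <- as.Date(data$date, format = "%m/%d/%y")

str(data)
```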

First, I want to plot the data and add a line of best fit:
plot(data$date, data$visits,pch=19,col="blue",main="Web Traffic", xlab="Date",ylab="Visits")
lm1 <- lm(data$visits ~ data$date)
abline(lm1,col="red",lwd=3)


#(Intercept) data$date
#-2404.5259 148.9

To interpret this model: we see about 149 additional hits each day.

That model was great for the absolute increase, but what if we want the relative (percentage) increase? To find that, we can run the linear regression on the log of visits:

#(Intercept) data$date
#0.00000 1.00322

To interpret: the exponentiated slope of 1.00322 corresponds to a 0.3% increase in web traffic per day.
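The code for the log regression is not shown above, so here is a hedged sketch of how such a fit could be run and read. It uses a synthetic series with exactly 0.3% daily growth (not my actual traffic data), and the names `demo`, `lm2`, and `pct_per_day` are just for illustration.

```r
# Synthetic series (hypothetical, not the post's data): exact 0.3% daily growth
dates  <- seq(as.Date("2014-10-11"), by = "day", length.out = 200)
visits <- 30000 * 1.003^(seq_along(dates) - 1)
demo   <- data.frame(date = dates, visits = visits)

# Regress log(visits) on date; the slope is the daily change on the log scale
lm2 <- lm(log(visits) ~ date, data = demo)

# Exponentiate the slope to get the multiplicative change per day,
# then convert it to a percentage
growth      <- exp(coef(lm2))["date"]
pct_per_day <- (growth - 1) * 100
round(pct_per_day, 2)
```

Because the slope lives on the log scale, exponentiating it gives a per-day multiplier, and subtracting 1 turns that multiplier into a percentage change.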

Another way we could look at the change per day is a generalized linear model with a Poisson family.
plot(data$date, data$visits,pch=19,col="green",xlab="Date",ylab="Visits")
glm1 <- glm(data$visits ~ data$date, family="poisson")
abline(lm1,col="red",lwd=3) # linear model line
lines(data$date,fitted(glm1),col="blue",lwd=3) # Poisson GLM fit


confint(glm1,level=0.95) # CI
#2.5 % 97.5 %
#(Intercept) -55.999943551 -45.190626728
#data$date 0.002976299 0.003632503

To interpret: we are 95% confident that the daily increase in web hits falls between about 0.0030 and 0.0036 on the log scale (roughly 0.30% to 0.36% per day), which is right in line with the previous method of linear regression on the log of visits.
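Since the Poisson model uses a log link, the interval bounds can be exponentiated to read them as a percentage change per day. A small sketch, using the two bounds copied from the `confint()` output above (the names `ci_log` and `ci_pct` are just for illustration):

```r
# 95% CI bounds for the date coefficient, copied from the confint() output above
ci_log <- c(lower = 0.002976299, upper = 0.003632503)

# Exponentiate to get the per-day multiplier, then convert to a percentage
ci_pct <- (exp(ci_log) - 1) * 100
round(ci_pct, 2)
```

This works out to roughly 0.30% to 0.36% per day, which brackets the 0.3% estimate from the log regression.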