Replace Null Values in CSV with Java


I’ve been working with the Weka Java machine learning algorithms, and with the large amounts of data extracted from databases via SQL, I’ve been running into the issue of having null values.  In order for the data to be read by Weka, the nulls need to be replaced with a value, which in my case should be 0.

Here is a sample of one row of my data:  2725079062,2,77,,,,,,,,,,,,,,,4,2,2,,t
There are many ways to do this; here is a common approach:

 mylines=mylines.replaceAll(",,", ",0,");

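However, what I ran into is that when there are several “,,” in a row it doesn’t parse correctly: replaceAll never reuses a comma that was part of the previous match, so a run like “,,,” becomes “,0,,” and an empty field is left behind.  Below is how I solved it: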

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ConvertData {
    public static void main(String[] args) throws IOException {
        // try-with-resources closes both files even if an exception is thrown
        try (BufferedReader br = new BufferedReader(new FileReader(
                "data/address_yes_raw.csv"));
             BufferedWriter bw = new BufferedWriter(new FileWriter(
                "data/address_yes_raw_zero.csv"))) {
            String line;
            while ((line = br.readLine()) != null) {
                // make an array out of the line; the -1 limit keeps trailing empty fields
                String[] values = line.split(",", -1);
                // use an ArrayList because it can grow as elements are added
                List<String> al = new ArrayList<>();
                for (String element : values) {
                    if (element == null || element.length() == 0) {
                        al.add("0"); // empty field: substitute 0
                    } else {
                        al.add(element);
                    }
                }
                // put commas between the elements, with no trailing comma
                String writableString = String.join(",", al);
                // write the row followed by a newline
                bw.write(writableString + "\n");
            }
        }
    }
}

Final output would be like this:
2725079062,2,77,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,2,2,0,t
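
To see why the single replaceAll call falls short, here is a small sketch that runs it on a short made-up row. It also tries a one-line alternative using regex lookarounds. The class name ReplaceAllDemo and the helper fillEmpty are hypothetical names of mine, and this is just one possible variation, not the approach used above.

```java
public class ReplaceAllDemo {
    // Hypothetical helper: inserts "0" at every position that sits between
    // two delimiters (start of line or comma on the left, comma or end of
    // line on the right). Both lookarounds are zero-width, so no comma is
    // consumed and consecutive empty fields are all filled.
    static String fillEmpty(String line) {
        return line.replaceAll("(?<=^|,)(?=,|$)", "0");
    }

    public static void main(String[] args) {
        String row = "2,,,4"; // two empty fields between 2 and 4

        // The naive approach consumes both commas of each match,
        // so the second empty field is skipped.
        System.out.println(row.replaceAll(",,", ",0,")); // prints 2,0,,4

        // The lookaround version fills both.
        System.out.println(fillEmpty(row)); // prints 2,0,0,4
    }
}
```

Like the split-based version, this assumes plain comma-separated values with no quoted fields containing commas.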

*Side note: remember to refresh the data folder when exporting.

What the heck is Mahout?


Here is the tutorial I used, by Steve Cook on YouTube.

Links to download the Java libraries:

http://mahout.apache.org/general/downloads.html

http://www.slf4j.org/download.html

Here is the data:

MovieLens

https://code.google.com/p/guava-libraries/

Mahout (which is an Apache project) is built to accomplish the following:

  • Collaborative Filtering (recommendations)
  • Classification (spam email or not)
  • Clustering (Google news)
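
To give a feel for what collaborative filtering means, here is a minimal plain-Java sketch. It does not use Mahout's API at all. The class TinyRecommender, its methods, and the ratings are made up for illustration, and cosine similarity over co-rated items is just one simple choice of similarity measure.

```java
import java.util.HashMap;
import java.util.Map;

public class TinyRecommender {
    // Cosine similarity computed over the items both users rated
    // (returns 0 if they have no rated items in common).
    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
                normA += e.getValue() * e.getValue();
                normB += other * other;
            }
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Returns the item the user has not rated yet that has the highest
    // similarity-weighted average rating among other users, or null if none.
    static String recommend(String user, Map<String, Map<String, Double>> ratings) {
        Map<String, Double> mine = ratings.getOrDefault(user, new HashMap<>());
        Map<String, Double> weighted = new HashMap<>();
        Map<String, Double> simTotal = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> other : ratings.entrySet()) {
            if (other.getKey().equals(user)) continue;
            double sim = similarity(mine, other.getValue());
            if (sim <= 0) continue; // ignore users with nothing in common
            for (Map.Entry<String, Double> r : other.getValue().entrySet()) {
                if (mine.containsKey(r.getKey())) continue; // already rated
                weighted.merge(r.getKey(), sim * r.getValue(), Double::sum);
                simTotal.merge(r.getKey(), sim, Double::sum);
            }
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String item : weighted.keySet()) {
            double score = weighted.get(item) / simTotal.get(item);
            if (score > bestScore) {
                bestScore = score;
                best = item;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up ratings: user -> (item -> rating on a 1-5 scale)
        Map<String, Map<String, Double>> ratings = new HashMap<>();
        ratings.put("ann", Map.of("m1", 5.0, "m2", 1.0));
        ratings.put("bob", Map.of("m1", 5.0, "m2", 1.0, "m3", 5.0, "m4", 1.0));
        System.out.println(recommend("ann", ratings)); // prints m3
    }
}
```

Ann and Bob agree on the two movies they both rated, so Bob's favorite that Ann has not seen yet comes out as her recommendation. A library like Mahout is meant to do this kind of thing at a much larger scale.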

Getting Started with Python


My background languages are Java, Objective-C, SQL, and HTML... but no Python!! Not a problem. I turned my attention to good ole’ youtube.com, and while I was on the elliptical machine at 5:30am, I ran into some great videos from OneStopProgramming.

Summary steps to get started:

  • install Python using the .exe installer
  • install Notepad++ if you don’t already have it
  • create a simple .py script
  • open a command prompt, find the .py file and run it

Useful functions:

  • len("hello") = 5
  • help(len) = gives info about the len function
  • dir() = lists the names defined in the current scope, including the variables you’ve declared
  • "H" in "Hello" = True
  • "h" in "Hello" = False (the check is case-sensitive)
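
Coming from Java, it helped me to line these up against what I already know. Here is a rough sketch of the closest Java equivalents for the string ones. The class name PythonEquivalents is made up for illustration, and help() and dir() have no direct one-line counterpart, so they are left out.

```java
public class PythonEquivalents {
    public static void main(String[] args) {
        // Python: len("hello")
        System.out.println("hello".length());      // prints 5

        // Python: "H" in "Hello"
        System.out.println("Hello".contains("H")); // prints true

        // Python: "h" in "Hello" (case-sensitive in both languages)
        System.out.println("Hello".contains("h")); // prints false
    }
}
```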

 

Getting Started with Hadoop


To begin playing around with what Hadoop does, I decided to go down the path of using the Hortonworks Sandbox.  One of the first things the setup has you do is install Oracle VirtualBox, which is virtualization software for running virtual machines.  The Sandbox runs inside one of those virtual machines.  One note: the browser IP is wrong in the tutorial; it should be http://127.0.0.1:8000 to open the Sandbox GUI.

I then proceeded to follow the “Hello World” tutorial, with which I was able to import some actual data from the NYSE and run some Hive and Pig queries.  I have a substantial SQL background (though it isn’t essential), so it was a breeze.

I’m impressed by how easy to follow and well written the tutorial was.  Great way to get started!