Machine Learning for Hackers: Using F#

I decided I wanted to learn more about F# for my Road Alert project.  I started by watching this great video.  After reviewing it a couple of times, I realized that I could try to do Chapter 1 of Machine Learning for Hackers using F#.

Since I already had the data from this blog post, I just had to follow Luca’s example.  I wrote the following code in an F# project in Visual Studio 2012.

    open System.IO

    type UFOLibrary() =
        member this.GetDetailData() =
            // Verbatim string (@"...") so the backslashes in the path are taken literally
            let path = @"C:\Users\Jamie\Documents\Visual Studio 2012\Projects\MachineLearningWithFSharp_Solution\Tff.MachineLearningWithFSharp.Chapter01\ufo_awesome.txt"
            // "use" disposes the stream and reader when the method returns
            use fileStream = new FileStream(path,FileMode.Open,FileAccess.Read)
            use streamReader = new StreamReader(fileStream)
            let contents = streamReader.ReadToEnd()
            let usStates = [|"AL";"AK";"AZ";"AR";"CA";"CO";"CT";"DE";"DC";"FL";"GA";"HI";"ID";"IL";"IN";"IA";
                             "KS";"KY";"LA";"ME";"MD";"MA";"MI";"MN";"MS";"MO";"MT";"NE";"NV";"NH";"NJ";"NM";
                             "NY";"NC";"ND";"OH";"OK";"OR";"PA";"RI";"SC";"SD";"TN";"TX";"UT";"VT";"VA";"WA";
                             "WV";"WI";"WY"|]
            let cleanContents =
                contents.Split([|'\n'|])
                |> Seq.map(fun line -> line.Split([|'\t'|]))
                |> Seq.head
            // Return the columns of the first row
            cleanContents
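One thing to watch when splitting on '\n': if the file uses Windows line endings, each line keeps a trailing '\r', and a final newline produces an empty last row. A minimal sketch of an alternative, assuming it is acceptable to read the file lazily with File.ReadLines (the helper name readRows is hypothetical, not from the original code):

```fsharp
open System.IO

// Hypothetical helper: lazily reads a tab-delimited file into rows of columns.
// File.ReadLines strips "\n", "\r\n" and "\r" line endings, so no stray '\r'
// ends up in the last column.
let readRows (path: string) =
    File.ReadLines(path)
    |> Seq.map (fun line -> line.Split([| '\t' |]))
```

Because the sequence is lazy, the file is only opened when the rows are actually enumerated.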

I then added a C# console project to the solution and added the following code:

    static void Main(string[] args)
    {
        Console.WriteLine("Start");
        UFOLibrary ufoLibrary = new UFOLibrary();

        foreach (String currentString in ufoLibrary.GetDetailData())
        {
            Console.WriteLine(currentString);
        }
        Console.WriteLine("End");
        Console.ReadKey();
    }

 

Sure enough, when I hit F5:

[image: console output]

How cool is it to call F# code from a C# project and it just works?  I feel a whole new world of possibilities just opened to me.

I then went back to the book and saw that they used the head function in R, which returns the first few rows of data (six by default).  F#’s Seq.head returns only the first element, so I made the following change to my F# to get the first 10 rows:

    let cleanContents =
        contents.Split([|'\n'|])
        |> Seq.map(fun line -> line.Split([|'\t'|]))
        |> Seq.take(10)
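For comparison, here is a small sketch of how Seq.head, Seq.take, and Seq.truncate behave. The sample rows are made up for illustration. Seq.truncate is the closer match to a preview function, since it does not throw when the input has fewer rows than requested:

```fsharp
// Hypothetical sample rows standing in for the split lines of the file.
let rows = seq { for i in 1 .. 3 -> [| sprintf "row%d" i; "x" |] }

// Seq.head returns only the first element (and throws on an empty sequence).
let first = Seq.head rows

// Seq.truncate returns at most n elements and does not throw when fewer exist.
let preview = rows |> Seq.truncate 10 |> Seq.toList

// Seq.take 10 raises InvalidOperationException once enumerated here,
// because the sequence has only 3 elements.
let takeFails =
    try
        rows |> Seq.take 10 |> Seq.toList |> ignore
        false
    with :? System.InvalidOperationException -> true
```

With the full data file there are far more than 10 rows, so Seq.take works there; the difference only shows up on short inputs.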

 

I then had to remove the defective rows that had malformed data.  To do this, I went back to the F# code and changed it to this:

    let cleanContents =
        contents.Split([|'\n'|])
        |> Seq.map(fun line -> line.Split([|'\t'|]))

 

I then went back to the Console app to change it like this:

    Console.WriteLine("Start");
    UFOLibrary ufoLibrary = new UFOLibrary();
    IEnumerable<String[]> rows = ufoLibrary.GetDetailData();
    Console.WriteLine(String.Format("Number of rows: {0}", rows.Count()));
    Console.WriteLine("End");
    Console.ReadKey();

 

And I see this when I hit F5:

[image: console output]

So now I have a baseline of 61,394 rows.

My first step is to remove rows that do not have 6 columns.  To do that, I added this filter to my F# code:

    |> Seq.filter(fun values -> values |> Seq.length = 6)

and when I hit F5, I can see that the number of records has dropped:

[image: console output]

I then wanted to remove the bad date fields the way they did it in the book: all dates have to be 8 characters in length, no more, no less.

Going back to the F# code, I added this line:

    |> Seq.filter(fun values -> values.[0].Length = 8)

 

and sure enough, fewer records in my dataset:

[image: console output]

And finally, applying the same logic to the second column, which is also a date:

    |> Seq.filter(fun values -> values.[1].Length = 8)
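The column-count and date-length checks can also be folded into one predicate. This is only a sketch: the helper name hasWellFormedDates and the sample lines are made up for illustration, not taken from the data file.

```fsharp
// Hypothetical helper: true when a row has 6 columns and both date
// columns are exactly 8 characters long.
let hasWellFormedDates (values: string[]) =
    values.Length = 6
    && values.[0].Length = 8
    && values.[1].Length = 8

// Made-up sample lines: one good row, one short date, one with too few columns.
let sample =
    [ "19951009\t19951009\tIowa City, IA\t\t\tsome text"
      "0000\t19951009\tNowhere\t\t\tbad date"
      "too\tfew\tcolumns" ]

let kept =
    sample
    |> Seq.map (fun line -> line.Split([| '\t' |]))
    |> Seq.filter hasWellFormedDates
    |> Seq.length
```

Because && short-circuits, the date columns are never indexed on a row that has too few columns.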

 

[image: console output]

That raised my eyebrows: I assumed there would be some malformed data in the 2nd column independent of the 1st column, but I guess not.

I then wanted to convert the 1st two columns from strings into DateTimes.  Going back to Luca’s examples, I did this:

    |> Seq.map(fun values ->
        System.DateTime.Parse(values.[0]),
        System.DateTime.Parse(values.[1]),
        values.[2],
        values.[3],
        values.[4],
        values.[5])

Interestingly, I then went back to my Console application and got this

Error    1    Cannot implicitly convert type ‘System.Collections.Generic.IEnumerable<System.Tuple<System.DateTime,System.DateTime,string,string,string,string>>’ to ‘System.Collections.Generic.IEnumerable<string[]>’. An explicit conversion exists (are you missing a cast?)

So I then did this:

    var rows = ufoLibrary.GetDetailData();
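The compiler error above shows that an F# tuple surfaces in C# as System.Tuple, whose elements are read as Item1, Item2, and so on. If named properties are preferred on the C# side, an F# record is one option. The sketch below is an assumption on my part: the type name Sighting, the field names, and the helper toSighting are hypothetical and do not appear in the original code.

```fsharp
// Hypothetical record; the field names are assumptions for illustration.
type Sighting =
    { DateOccurred: System.DateTime
      DateReported: System.DateTime
      Location: string
      ShortDescription: string
      Duration: string
      LongDescription: string }

// Hypothetical helper: builds a record from two parsed dates and the raw columns.
let toSighting (occurred: System.DateTime) (reported: System.DateTime) (values: string[]) =
    { DateOccurred = occurred
      DateReported = reported
      Location = values.[2]
      ShortDescription = values.[3]
      Duration = values.[4]
      LongDescription = values.[5] }
```

From C#, each field of the record is then visible as a read-only property with the same name.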

so I could compile again.  When I ran it, I got this exception:

[image: screenshot of the exception]

 

So it looks like R can handle YYYYMMDD while DateTime.Parse() in .NET cannot.  So I went back to “The different ways to parse in .NET” and changed the parsing to this:

    // "MM" is the month; lowercase "mm" would be read as minutes
    System.DateTime.ParseExact(values.[0],"yyyyMMdd",System.Globalization.CultureInfo.InvariantCulture),
    System.DateTime.ParseExact(values.[1],"yyyyMMdd",System.Globalization.CultureInfo.InvariantCulture),
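One detail worth double-checking in a format string like this: in .NET custom date formats, uppercase MM is the month and lowercase mm is the minutes. A short sketch of the difference (the helper parseWith and the sample date are made up for illustration):

```fsharp
open System
open System.Globalization

// Hypothetical helper that parses text with an explicit format string.
let parseWith (format: string) (text: string) =
    DateTime.ParseExact(text, format, CultureInfo.InvariantCulture)

// "MM" is the month specifier, so the "10" is read as October.
let asMonth = parseWith "yyyyMMdd" "19951009"

// "mm" is the minutes specifier, so the "10" is read as ten minutes past
// the hour and the month is not taken from the string at all.
let asMinutes = parseWith "yyyymmdd" "19951009"
```

With the lowercase form, a value like “13” in the month position parses without complaint, so the format string itself silently hides bad months.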

When I ran it, I got this:

[image: screenshot]

Which I am not sure is progress.  Then it hit me that the data in the strings might be out of bounds, for example a month of “13”.  So I added the following filters to the dataset:

    |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(0,4)) > 1900)
    |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(0,4)) > 1900)
    |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(0,4)) < 2100)
    |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(0,4)) < 2100)
    |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(4,2)) > 0)
    |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(4,2)) > 0)
    |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(4,2)) <= 12)
    |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(4,2)) <= 12)
    |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(6,2)) > 0)
    |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(6,2)) > 0)
    |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(6,2)) <= 31)
    |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(6,2)) <= 31)
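The range checks above still let through impossible dates such as February 31, and Int32.Parse throws if a field contains non-digits. A possible alternative, sketched here under the assumption that the dates are in yyyyMMdd form, is to let DateTime.TryParseExact do the validation. The helper names tryParseDate and isValidRow are hypothetical, and the year bounds simply mirror the filters above.

```fsharp
open System
open System.Globalization

// Hypothetical helper: Some date when the text is a real calendar date in
// yyyyMMdd form with a year between 1901 and 2099, None otherwise
// (month 13, day 00, February 31, non-digits, ...).
let tryParseDate (text: string) =
    match DateTime.TryParseExact(text, "yyyyMMdd", CultureInfo.InvariantCulture, DateTimeStyles.None) with
    | true, date when date.Year > 1900 && date.Year < 2100 -> Some date
    | _ -> None

// True when both date columns of a row hold valid dates.
let isValidRow (values: string[]) =
    (tryParseDate values.[0]).IsSome && (tryParseDate values.[1]).IsSome
```

A single `|> Seq.filter isValidRow` would then stand in for the twelve filters, at the cost of parsing each date twice if the rows are converted afterwards.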

 

Sure enough, now when I run it:

[image: console output]

Which matches the book’s R example.

I then wanted to match what the book does in terms of cleaning the city,state field (column).  We are only interested in data from the United States that follows the “City,State” pattern.  The R example does some conditional logic to clean this data up, which I didn’t want to do in F#.

So I added a filter that splits the City,State column and checks that the state value is exactly 2 characters long.  Where the R example strips the white space, F# uses “Trim()”:

    |> Seq.filter(fun values -> values.[2].Split(',').[1].Trim().Length = 2)
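Note that indexing .[1] after Split throws an IndexOutOfRangeException for any location that contains no comma. A defensive variant could look like the sketch below. The helper trySplitLocation is hypothetical, and it is stricter than the filter above because it only accepts locations with exactly one comma.

```fsharp
// Hypothetical helper: splits "City, ST" and returns Some (city, state)
// only when there is exactly one comma and a two-character state part.
let trySplitLocation (location: string) =
    match location.Split(',') with
    | [| city; state |] when state.Trim().Length = 2 ->
        Some (city.Trim(), state.Trim().ToUpperInvariant())
    | _ -> None
```

Rows could then be kept with `Seq.filter(fun values -> (trySplitLocation values.[2]).IsSome)`.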

 

[image: console output]

Next, the book limits the location values to only the United States.  To do that, it creates a list of all 50 postal codes (lower case) to compare against the state portion of the location field.  To that end, I added a string array like so:

    let usStates = [|"AL";"AK";"AZ";"AR";"CA";"CO";"CT";"DE";"DC";"FL";"GA";"HI";"ID";"IL";"IN";"IA";
                     "KS";"KY";"LA";"ME";"MD";"MA";"MI";"MN";"MS";"MO";"MT";"NE";"NV";"NH";"NJ";"NM";
                     "NY";"NC";"ND";"OH";"OK";"OR";"PA";"RI";"SC";"SD";"TN";"TX";"UT";"VT";"VA";"WA";
                     "WV";"WI";"WY"|]

I then added this filter (took me about 45 minutes to figure out):

    |> Seq.filter(fun values -> Seq.exists(fun elem -> elem = values.[2].Split(',').[1].Trim().ToUpperInvariant()) usStates)
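The same membership test can be written with an F# Set, which reads a little more directly than Seq.exists. This is a sketch only: the set below is abbreviated to three made-up entries for illustration, and the helper name isUsState is hypothetical.

```fsharp
// Abbreviated set for illustration; the real list is the usStates array above.
let usStateSet = set [ "IA"; "NY"; "TX" ]

// Hypothetical helper: true when the part after the first comma is a known state code.
let isUsState (location: string) =
    let parts = location.Split(',')
    parts.Length > 1 && usStateSet.Contains(parts.[1].Trim().ToUpperInvariant())
```

The full array can be turned into a set with `Set.ofArray usStates`.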

 

[image: console output]

So now I am halfway done with Chapter 1: the data has been cleaned and is ready to be analyzed.  Here is the code that I have so far:

    member this.GetDetailData() =
        // Verbatim string (@"...") so the backslashes in the path are taken literally
        let path = @"C:\Users\Jamie\Documents\Visual Studio 2012\Projects\MachineLearningWithFSharp_Solution\Tff.MachineLearningWithFSharp.Chapter01\ufo_awesome.txt"
        // "use" disposes the stream and reader when the method returns
        use fileStream = new FileStream(path,FileMode.Open,FileAccess.Read)
        use streamReader = new StreamReader(fileStream)
        let contents = streamReader.ReadToEnd()
        let usStates = [|"AL";"AK";"AZ";"AR";"CA";"CO";"CT";"DE";"DC";"FL";"GA";"HI";"ID";"IL";"IN";"IA";
                         "KS";"KY";"LA";"ME";"MD";"MA";"MI";"MN";"MS";"MO";"MT";"NE";"NV";"NH";"NJ";"NM";
                         "NY";"NC";"ND";"OH";"OK";"OR";"PA";"RI";"SC";"SD";"TN";"TX";"UT";"VT";"VA";"WA";
                         "WV";"WI";"WY"|]
        let cleanContents =
            contents.Split([|'\n'|])
            |> Seq.map(fun line -> line.Split([|'\t'|]))
            // Keep only rows with 6 columns and 8-character dates
            |> Seq.filter(fun values -> values |> Seq.length = 6)
            |> Seq.filter(fun values -> values.[0].Length = 8)
            |> Seq.filter(fun values -> values.[1].Length = 8)
            // Year, month and day range checks on both date columns
            |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(0,4)) > 1900)
            |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(0,4)) > 1900)
            |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(0,4)) < 2100)
            |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(0,4)) < 2100)
            |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(4,2)) > 0)
            |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(4,2)) > 0)
            |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(4,2)) <= 12)
            |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(4,2)) <= 12)
            |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(6,2)) > 0)
            |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(6,2)) > 0)
            |> Seq.filter(fun values -> System.Int32.Parse(values.[0].Substring(6,2)) <= 31)
            |> Seq.filter(fun values -> System.Int32.Parse(values.[1].Substring(6,2)) <= 31)
            // Keep only "City,State" locations whose state is a US postal code
            |> Seq.filter(fun values -> values.[2].Split(',').[1].Trim().Length = 2)
            |> Seq.filter(fun values -> Seq.exists(fun elem -> elem = values.[2].Split(',').[1].Trim().ToUpperInvariant()) usStates)
            // "MM" is the month; lowercase "mm" would be read as minutes
            |> Seq.map(fun values ->
                System.DateTime.ParseExact(values.[0],"yyyyMMdd",System.Globalization.CultureInfo.InvariantCulture),
                System.DateTime.ParseExact(values.[1],"yyyyMMdd",System.Globalization.CultureInfo.InvariantCulture),
                values.[2].Split(',').[0].Trim(),
                values.[2].Split(',').[1].Trim().ToUpperInvariant(),
                values.[3],
                values.[4],
                values.[5])
        cleanContents

 

I now want to finish up the chapter where the analysis happens.  R uses a plotting library (ggplot).  Following Luca’s example of this:

[image: Luca’s example]

I went to the flying frogs libraries and, alas, there is no longer a free edition.

[image: screenshot]

So I am a bit stuck.  I’ll continue to work on it for next week’s blog…
