Multiple Linear Regression Using R and F#

Following up on my previous post, I decided to test calling R from F# to run a multiple linear regression.  I used the UFO sightings dataset from chapter 1 of Machine Learning for Hackers.

Step #1 was to open R from F#:

    #r @"C:\TFS\Tff.RDotNetExample_Solution\packages\R.NET.1.5.3\lib\net40\RDotNet.dll"
    #r @"C:\TFS\Tff.RDotNetExample_Solution\packages\R.NET.1.5.3\lib\net40\RDotNet.NativeLibrary.dll"

    open System.IO
    open RDotNet

    //open R
    let environmentPath = System.Environment.GetEnvironmentVariable("PATH")
    let binaryPath = @"C:\Program Files\R\R-3.0.1\bin\x64"
    System.Environment.SetEnvironmentVariable("PATH",environmentPath+System.IO.Path.PathSeparator.ToString()+binaryPath)

    let engine = RDotNet.REngine.CreateInstance("RDotNet")
    engine.Initialize()

 

Step #2 was to import the UFO dataset and clean it:

    //open dataset
    let path = @"C:\TFS\Tff.RDotNetExample_Solution\Tff.RDotNetExample\ufo_awesome.txt"
    let contents = File.ReadAllText(path)
    let usStates = [|"AL";"AK";"AZ";"AR";"CA";"CO";"CT";"DE";"DC";"FL";"GA";"HI";"ID";"IL";"IN";"IA";
                     "KS";"KY";"LA";"ME";"MD";"MA";"MI";"MN";"MS";"MO";"MT";"NE";"NV";"NH";"NJ";"NM";
                     "NY";"NC";"ND";"OH";"OK";"OR";"PA";"RI";"SC";"SD";"TN";"TX";"UT";"VT";"VA";"WA";
                     "WV";"WI";"WY"|]

    // Parse a yyyyMMdd string. Returns None unless it is a real calendar date
    // with a year between 1901 and 2099.
    let tryParseDate (text : string) =
        match System.DateTime.TryParseExact(text, "yyyyMMdd", System.Globalization.CultureInfo.InvariantCulture, System.Globalization.DateTimeStyles.None) with
        | true, date when date.Year > 1900 && date.Year < 2100 -> Some date
        | _ -> None

    // Split "City, ST" into (city, state). Returns None when there is no comma
    // or the second part is not one of usStates.
    let tryParseLocation (text : string) =
        let parts = text.Split(',')
        if parts.Length < 2 then None
        else
            let state = parts.[1].Trim().ToUpperInvariant()
            if usStates |> Array.exists (fun elem -> elem = state) then Some (parts.[0].Trim(), state)
            else None

    // Keep only rows with six tab-separated fields, two valid dates and a US state
    let cleanContents =
        contents.Split([|'\n'|])
        |> Seq.map(fun line -> line.Split([|'\t'|]))
        |> Seq.filter(fun values -> values.Length = 6)
        |> Seq.choose(fun values ->
            match tryParseDate values.[0], tryParseDate values.[1], tryParseLocation values.[2] with
            | Some sighted, Some reported, Some (city, state) ->
                Some (sighted, reported, city, state, values.[3], values.[4], values.[5])
            | _ -> None)

    // Year of sighting, state, and length of the report text
    let relevantContents =
        cleanContents
        |> Seq.map(fun (a,b,c,d,e,f,g) -> a.Year,d,g.Length)
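
One detail worth double-checking in date parsing like the above: in .NET custom format strings, "MM" is the month and "mm" is minutes. Only the year feeds the regression here, so a mix-up would not change the result, but it matters as soon as the month is used. A minimal sketch of the difference, using a made-up sample value rather than a row from the dataset:

```fsharp
open System
open System.Globalization

// Made-up value in the dataset's date layout (not taken from the file)
let sample = "20040815"

// "mm" is the minutes specifier, so no month is parsed and it falls back to January
let asMinutes = DateTime.ParseExact(sample, "yyyymmdd", CultureInfo.InvariantCulture)

// "MM" is the month specifier, which is what a yyyyMMdd date needs
let asMonth = DateTime.ParseExact(sample, "yyyyMMdd", CultureInfo.InvariantCulture)

printfn "yyyymmdd -> month %d, minute %d" asMinutes.Month asMinutes.Minute
printfn "yyyyMMdd -> month %d, minute %d" asMonth.Month asMonth.Minute
```

Both calls return the same year, which is why this kind of slip is easy to miss when only the year is used.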

 

Step #3 was to run the regression using the dataset.  You will notice that I made the length of the report the Y (dependent) variable.  I do not expect to find any causality, but it was good enough for a test.  Also, notice the Seq.map that projects each column of the larger seq<int * string * int> into its own R vector.

    let reportLength = engine.CreateIntegerVector(relevantContents |> Seq.map (fun (a,b,c) -> c))
    engine.SetSymbol("reportLength", reportLength)
    let year = engine.CreateIntegerVector(relevantContents |> Seq.map (fun (a,b,c) -> a))
    engine.SetSymbol("year", year)
    let state = engine.CreateCharacterVector(relevantContents |> Seq.map (fun (a,b,c) -> b))
    engine.SetSymbol("state", state)

    let calcExpression = "lm(formula = reportLength ~ year + state)"
    let testResult = engine.Evaluate(calcExpression).AsList()
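
Because relevantContents is a lazy sequence, each of the three Seq.map calls above walks it again, so the file cleaning runs three times. A minimal sketch of one alternative, assuming the rows fit comfortably in memory: materialize the sequence once and split it into columns with List.unzip3. The sampleRows values are made up for illustration and stand in for relevantContents.

```fsharp
// Hypothetical rows in the same (year, state, reportLength) shape as relevantContents
let sampleRows = seq [ (1995, "WA", 120); (2001, "CA", 87); (2008, "TX", 301) ]

// Enumerate the sequence once, then split the tuples into three parallel lists
let years, states, lengths = sampleRows |> List.ofSeq |> List.unzip3

printfn "%A" years
printfn "%A" states
printfn "%A" lengths
```

The resulting lists are ordinary sequences, so they can be handed to CreateIntegerVector and CreateCharacterVector in the same way as the Seq.map results.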

 

Sure enough, you can get the results of the regression.  The challenge is teasing out the interesting values from the data structure that is returned (testResult in this example).

> testResult.Item(0).AsCharacter();;
val it : CharacterVector =
  seq
    ["31775.5599180962"; "-15.2760355122386"; "37.8028841898059";
     "-91.2309146099364"; …]

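Item(0) is the coefficients component of the lm result: the first value is the intercept, the second is the slope for year, and the remaining values are the coefficients for the state dummy variables.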
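
One way to check what each value in Item(0) represents is to ask R for the coefficient names and pair them with the values. This is only a sketch: it assumes the engine and the reportLength, year and state symbols from the steps above are still live, and fit is a hypothetical R variable name introduced here. It uses only Evaluate, AsCharacter and AsNumeric from R.NET, plus R's own coef and names functions.

```fsharp
// Assumes `engine` and the three R symbols set in the earlier steps.
// "fit" is a hypothetical R variable name used only in this sketch.
engine.Evaluate("fit <- lm(formula = reportLength ~ year + state)") |> ignore

// Coefficient names such as "(Intercept)" and "year", and their estimates
let coefNames = engine.Evaluate("names(coef(fit))").AsCharacter()
let coefValues = engine.Evaluate("coef(fit)").AsNumeric()

// Print each name next to its estimate
Seq.zip coefNames coefValues
|> Seq.iter (fun (name, value) -> printfn "%s = %f" name value)
```

Keeping the fitted model in an R variable also leaves it available for further calls from F#, for example evaluating summary(fit).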
