Traffic Stop Disposition: Classification Using F# and KNN

I have already looked at the summary statistics of the traffic stop data I received from the town here.  My next step was to try a machine learning exercise with the data.  One of the more interesting questions I want to answer is what factors into whether a person gets a warning or a ticket (called the disposition).  Of all of the factors that may be involved, the dataset that I have is fairly limited:

[Screenshot: the available fields in the traffic stop dataset]

With dispositionId as the result variable, that leaves StopDateTime and Location (Latitude/Longitude) as inputs.  Fortunately, the DateTime can be decomposed into several input variables.  For this exercise, I wanted to use the following:

  • TimeOfDay
  • DayOfWeek
  • DayOfMonth
  • MonthOfYear
  • Location (Latitude:Longitude)

The resulting variable is the disposition.  To make the analysis easier, I limited the analysis set to a finalDisposition of either “verbal warning” or “citation”.  I decided to use K-Nearest Neighbors (KNN) because it is regarded as an easy machine learning algorithm to learn and the question does seem to be a classification problem.
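
For reference, here is a minimal sketch of that decomposition; the helper name and the explicit types are mine, not from the actual project, but it mirrors what the data pull further down does inline:

// hypothetical helper: decompose a stop into the input variables listed above
let toFeatures (stopDateTime: System.DateTime) (latitude: decimal) (longitude: decimal) =
    let timeOfDay   = stopDateTime.Hour
    let dayOfWeek   = int stopDateTime.DayOfWeek
    let dayOfMonth  = stopDateTime.Day
    let monthOfYear = stopDateTime.Month
    let location    = System.Math.Round(latitude, 3).ToString() + ":" + System.Math.Round(longitude, 3).ToString()
    timeOfDay, dayOfWeek, dayOfMonth, monthOfYear, location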

My first step was to decide whether to write the KNN algorithm myself or borrow it.  After looking at what kind of code would be needed to write my own and then looking at some other libraries, I decided to use Accord.Net.
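
For context, a hand-rolled version would not have been huge.  Here is a rough sketch of the core idea (majority vote of the k nearest training points); it is not the Accord.Net implementation:

// a minimal, hand-rolled KNN sketch: classify by majority vote of the k nearest training points
let classify k distance (trainingPairs: ((int*int*int*int) * int) seq) candidate =
    trainingPairs
    |> Seq.sortBy (fun (features, _) -> distance features candidate)   // closest first
    |> Seq.truncate k                                                  // keep the k nearest
    |> Seq.countBy snd                                                 // votes per class label
    |> Seq.maxBy snd
    |> fst

In the end it was less code to lean on a library that has already been through that exercise.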

My next step was to get the data via the web service I spun up here.

namespace ChickenSoftware.RoadAlert.Analysis

open FSharp.Data
open Microsoft.FSharp.Data.TypeProviders
open Accord.MachineLearning

type roadAlert2 = JsonProvider<"http://chickensoftware.com/roadalert/api/trafficstopsearch/Sample">
type MachineLearningEngine =
    static member RoadAlertDoc = roadAlert2.Load("http://chickensoftware.com/roadalert/api/trafficstopsearch")
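
As a quick sanity check (just a sketch; the property names are whatever the type provider infers from the JSON, and the ones below come from the code later in the post), you can dump a few records straight off the provided type:

// peek at the first few records that came back from the web service
MachineLearningEngine.RoadAlertDoc
|> Seq.truncate 5
|> Seq.iter (fun x -> printfn "%O %O %O" x.Id x.StopDateTime x.DispositionId)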

My next step was to filter the data to only verbal warnings (7) or citations (15).

static member BaseDataSet =
    MachineLearningEngine.RoadAlertDoc
        |> Seq.filter(fun x -> x.DispositionId = 7 || x.DispositionId = 15)
        |> Seq.map(fun x -> x.Id, x.StopDateTime, x.Latitude, x.Longitude, x.DispositionId)
        |> Seq.map(fun (a,b,c,d,e) -> a, b, System.Math.Round(c,3), System.Math.Round(d,3), e)
        |> Seq.map(fun (a,b,c,d,e) -> a, b, c.ToString() + ":" + d.ToString(), e)
        // remap the dispositionId to the 0-based class labels Accord.Net expects: 7 -> 0, 15 -> 1
        |> Seq.map(fun (a,b,c,d) -> a, b, c, match d with
                                             | 7 -> 0
                                             | 15 -> 1
                                             | _ -> 1)
        |> Seq.map(fun (a,b,c,d) -> a, b.Hour, b.DayOfWeek.GetHashCode(), b.Day, b.Month, c, d)
        |> Seq.toList

You will notice that I had to transform the dispositionIds from 7 and 15 to 0 and 1.  The reason is that the KNN implementation in Accord.Net assumes the class labels match the index positions in its array – that is, they run from 0 up to the number of classes minus one.  I had to dig into the source code of Accord.Net to figure that one out.
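
A slightly more general way to do that remapping (a sketch only, in case more dispositions get pulled in later) is to keep the disposition ids in a list and use each id's index as its class label:

// map arbitrary disposition ids onto the 0-based class labels Accord.Net expects
let dispositionIds = [7; 15]            // 7 = verbal warning, 15 = citation
let toClassLabel dispositionId =
    dispositionIds |> List.findIndex (fun id -> id = dispositionId)

// toClassLabel 7  -> 0
// toClassLabel 15 -> 1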

My next step was to divide the dataset in half: one half being the training sample and the other the validation sample:

static member TrainingSample =
    let midNumber = MachineLearningEngine.NumberOfRecords / 2
    MachineLearningEngine.BaseDataSet
        |> Seq.filter(fun (a,b,c,d,e,f,g) -> a < midNumber)
        |> Seq.toList

static member ValidationSample =
    let midNumber = MachineLearningEngine.NumberOfRecords / 2
    MachineLearningEngine.BaseDataSet
        |> Seq.filter(fun (a,b,c,d,e,f,g) -> a >= midNumber)   // >= so the record with Id = midNumber is not dropped
        |> Seq.toList
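
Splitting on the Id like this assumes the ids are spread evenly through the data.  A shuffled split would avoid that assumption – the sketch below (the member name is mine, and the seed is fixed so the halves are repeatable) shows one way to do it:

// alternative: shuffle before splitting so neither half is ordered by Id
static member ShuffledSamples =
    let rnd = System.Random(42)
    let shuffled = MachineLearningEngine.BaseDataSet |> List.sortBy (fun _ -> rnd.Next())
    let midNumber = List.length shuffled / 2
    let training   = shuffled |> Seq.truncate midNumber |> Seq.toList
    let validation = shuffled |> Seq.skip midNumber |> Seq.toList
    training, validation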

The next step was to actually run the KNN.  Before I could do that, though, I had to create the distance function.  Since this was my first time, I dropped the geocoordinates and focused only on the variables derived from the stop date and time.

static member RunKNN inputs outputs input =
    let distanceFunction (a:int,b:int,c:int,d:int) (e:int,f:int,g:int,h:int) =
        // rough scaling so each feature covers a similar range before squaring
        let b1 = b * 4
        let f1 = f * 4
        let d1 = d * 2
        let h1 = h * 2
        float((pown (a-e) 2) + (pown (b1-f1) 2) + (pown (c-g) 2) + (pown (d1-h1) 2))

    let distanceDelegate =
        System.Func<(int * int * int * int),(int * int * int * int),float>(distanceFunction)

    let knn = new KNearestNeighbors<int*int*int*int>(10, 2, inputs, outputs, distanceDelegate)
    knn.Compute(input)

You will notice I tried to normalize the values so that they all had the same basis.  They are not exact, but they are close.  You will also notice that I had to create a delegate from the distanceFunction (thanks to Mimo on SO).  This is because Accord.NET was written in C# with C# consumers in mind, and F# has a couple of places where the interop is not as seamless as one would hope.
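
A more systematic way to put the features on the same basis (a sketch only, not what the code above does) is to divide each difference by the feature's range before squaring:

// TimeOfDay runs 0-23, DayOfWeek 0-6, DayOfMonth 1-31, MonthOfYear 1-12
let scaledDistance (a:int,b:int,c:int,d:int) (e:int,f:int,g:int,h:int) =
    // divide each difference by the feature's range, then square, so no single feature dominates
    let part (x:int) (y:int) (range:float) = ((float x - float y) / range) ** 2.0
    (part a e 24.0) + (part b f 7.0) + (part c g 31.0) + (part d h 12.0)

Plugging something like this into the delegate would make the weighting explicit instead of the rough x4 / x2 factors above.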

In any event, once the KNN function was written, I wrote a function that took the validation sample, made a guess via KNN for each record, and then reported the results:

static member GetValidationsViaKNN =
    let inputs = MachineLearningEngine.TrainingInputClass
    let outputs = MachineLearningEngine.TrainingOutputClass
    let validations = MachineLearningEngine.ValidationClass

    validations
        |> Seq.map(fun (a,b,c,d,e) -> e, MachineLearningEngine.RunKNN inputs outputs (a,b,c,d))
        |> Seq.toList

static member GetSuccessPercentageOfValidations =
    let validations = MachineLearningEngine.GetValidationsViaKNN
    let matches = validations
                    |> Seq.map(fun (a,b) -> match (a = b) with
                                            | true -> 1
                                            | false -> 0)

    let recordCount = validations |> Seq.length
    let numberCorrect = matches |> Seq.sum
    let successPercentage = double(numberCorrect) / double(recordCount)
    recordCount, numberCorrect, successPercentage

I then hopped over to my UI console app and looked at the success percentage.

 

private static void GetSuccessPercentageOfValidations()
{
    var output = MachineLearningEngine.GetSuccessPercentageOfValidations;
    Console.WriteLine(output.Item1.ToString() + ":" + output.Item2.ToString() + ":" + output.Item3.ToString());
}


So there are 12,837 records in the validation sample and the classifier guessed the correct disposition 9,001 times – a success percentage of 70%.

So it looks like there is something there.  However, it is not clear that this is a good classifier without further tests – specifically checking how it compares to simply guessing the most common disposition for every record.  Also, I would assume that to make this a true ‘machine learning’ algorithm I would have to feed the results back into the distance function to see if I can alter it to get the success percentage higher.
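
A quick way to run that first check (a sketch against the same validation list) is to compute what the success percentage would be if we always guessed the most common disposition:

// baseline: the success percentage from always guessing the most common disposition
let baselinePercentage (validations: (int * int) list) =
    let mostCommon =
        validations
        |> Seq.countBy fst              // fst is the actual disposition class
        |> Seq.maxBy snd
        |> fst
    let correct = validations |> List.filter (fun (actual, _) -> actual = mostCommon) |> List.length
    float correct / float (List.length validations)

// e.g. baselinePercentage MachineLearningEngine.GetValidationsViaKNN

If the classifier's 70% is not meaningfully above that baseline, the KNN is not adding much.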

One quick note about methodology – I used unit tests pretty extensively to understand how the KNN works.  I created a series of tests with some sample data to see how the function reacted.

[TestMethod]
public void TestKNN_ReturnsExpected()
{
    Tuple<int, int, int, int>[] inputs = {
        new Tuple<int, int, int, int>(1, 0, 15, 1),
        new Tuple<int, int, int, int>(1, 0, 11, 1)};
    int[] outputs = { 1, 1 };

    var input = new Tuple<int, int, int, int>(1, 1, 1, 1);

    var output = MachineLearningEngine.RunKNN(inputs, outputs, input);

    // no assert here; the point was to step through and see what the classifier returned
}

This was a big help to get me up and running (walking, really)…
