Traffic Stop Disposition: Classification Using F# and KNN
January 14, 2014
I have already looked at the summary statistics of the traffic stop data I received from the town here. My next step was to try a machine learning exercise with the data. One of the more interesting questions I want to answer is what factors into whether a person gets a warning or a ticket (called the disposition). Of all of the factors that may be involved, the dataset that I have covers only a few:
With dispositionId as the result variable, the available inputs are StopDateTime and Location (Latitude/Longitude). Fortunately, DateTime can be decomposed into several input variables. For this exercise, I wanted to use the following:
- Location (Latitude:Longitude)
The resulting variable is the disposition. To make the analysis easier, I limited the analysis set to records whose finalDisposition is either “verbal warning” or “citation”. I decided to use K-Nearest Neighbor (KNN) because it is regarded as an easy machine learning algorithm to learn and the question does seem to be a classification problem.
My first step was to decide whether to write or borrow the KNN algorithm. After looking at what kind of code would be needed to write my own and then looking at some other libraries, I decided to use Accord.Net.
My next step was to get the data via the web service I spun up here.
After that, I filtered the data to only verbal warnings (7) or citations (15).
You will notice that I had to transform the dispositionIds from 7 and 15 to 1 and 0. The reason is that the KNN method in Accord.Net assumes that the class values match the index positions in an array, so they have to be zero-based and contiguous. I had to dig into the source code of Accord.Net to figure that one out.
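A minimal sketch of that filter-and-remap step, not the original code. The `Stop` record and its field names are hypothetical, and I am assuming that 7 maps to 1 and 15 maps to 0, in the order the ids are listed above.

```fsharp
// Hypothetical record standing in for one row of the traffic stop data.
type Stop = { StopDateTime: System.DateTime; DispositionId: int }

// Map the raw ids to zero-based class labels (assumed: 7 -> 1, 15 -> 0).
// Every other disposition returns None and is dropped from the analysis set.
let toLabel dispositionId =
    match dispositionId with
    | 7 -> Some 1   // verbal warning
    | 15 -> Some 0  // citation
    | _ -> None

// Keep only warnings and citations, pairing each timestamp with its label.
let labelStops (stops: Stop list) =
    stops
    |> List.choose (fun s ->
        toLabel s.DispositionId
        |> Option.map (fun label -> s.StopDateTime, label))
```

Using `List.choose` does the filtering and the remapping in one pass, so a record with an unexpected disposition can never reach the classifier with an out-of-range label.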
My next step was to divide the dataset in half: one half being the training sample and the other the validation sample:
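One way to sketch the split, assuming the labelled records are already in an array. The function name `splitInHalf` is made up for illustration.

```fsharp
// Split an array into two halves: the first is the training sample,
// the second is the validation sample. With an odd number of records
// the validation half gets the extra one.
let splitInHalf (data: 'a[]) =
    Array.splitAt (data.Length / 2) data
```

Note that this keeps the records in their original order. If the data arrive sorted by date, the training half holds the earlier stops and the validation half the later ones, so shuffling first may be worth considering.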
The next step was to actually run the KNN. Before I could do that though, I had to create the distance function. Since this was my first time, I dropped the geocoordinates and focused only on the variables derived from the stop time.
You will notice I tried to normalize the values so that they all had the same basis. They are not exact, but they are close. You will also notice that I had to create a delegate for the distance function (thanks to Mimo on SO). This is because Accord.NET was written in C# with C# consumers in mind, and F# has a couple of places where the interop is not as seamless as one would hope.
In any event, once the KNN function was written, I wrote a function that took the validation sample, made a guess via KNN for each record, and then reported the result:
I then hopped over to my UI console app and looked at the success percentage.
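A rough sketch of what a normalized distance function and its delegate wrapper could look like. The particular features (hour, day of week, day of month, month) and the divisors are my assumptions for illustration, and plain Euclidean distance is used as a stand-in. Only `System.Func` is a real .NET type here; the other names are hypothetical.

```fsharp
open System

// Turn a timestamp into features scaled to roughly the 0..1 range.
// The choice of features is an assumption made for illustration.
let toFeatures (dt: DateTime) : float[] =
    [| float dt.Hour / 24.0
       float (int dt.DayOfWeek) / 7.0
       float dt.Day / 31.0
       float dt.Month / 12.0 |]

// Plain Euclidean distance between two feature vectors of equal length.
let distance (a: float[]) (b: float[]) : float =
    Array.fold2 (fun acc x y -> acc + (x - y) * (x - y)) 0.0 a b
    |> sqrt

// C#-style APIs generally expect a delegate rather than a curried
// F# function, so the function is wrapped in a System.Func.
let distanceDelegate =
    Func<float[], float[], float>(fun a b -> distance a b)
```

Scaling every feature to a similar range keeps one variable (say, day of month, which runs up to 31) from dominating the distance just because its raw numbers are bigger.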
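To make the validation step concrete, here is a small hand-rolled KNN in plain F#. It is a stand-in for the Accord.NET classifier, not the code I actually ran, and the names `Labeled`, `classify`, and `successRate` are hypothetical.

```fsharp
// A feature vector paired with its class label (0 or 1).
type Labeled = float[] * int

// Predict a label by majority vote among the k nearest training points.
// On a tied vote, the label that appears first among the neighbors wins.
let classify k (distance: float[] -> float[] -> float) (training: Labeled[]) (point: float[]) =
    training
    |> Array.sortBy (fun (features, _) -> distance features point)
    |> Array.truncate k
    |> Array.countBy snd
    |> Array.maxBy snd
    |> fst

// Fraction of validation records whose predicted label matches the real one.
let successRate k distance (training: Labeled[]) (validation: Labeled[]) =
    let correct =
        validation
        |> Array.filter (fun (features, label) ->
            classify k distance training features = label)
        |> Array.length
    float correct / float validation.Length
```

Sorting the whole training set for every guess is slow on thousands of records, but it keeps the logic easy to follow and to unit test.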
So there are 12,837 records in the validation sample and the classifier guessed the correct disposition 9,001 times, a success percentage of 70%.
So it looks like there is something there. However, it is not clear that this is a good classifier without further tests, specifically seeing how it compares with simply guessing the most common disposition every time. Also, I would assume that to make this a true ‘machine learning’ algorithm I would have to feed the results back into the distance function to see if I can alter it to get the success percentage higher.
One quick note about methodology: I used unit tests pretty extensively to understand how the KNN works. I created a series of tests with some sample data to see how the function reacted.
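The most-common-case check could be sketched like this. I have not reported the split between warnings and citations above, so the snippet only shows how the baseline would be computed; `baselineRate` is a hypothetical name.

```fsharp
// Accuracy you would get by always guessing the most common label.
let baselineRate (labels: int[]) =
    let _, count = labels |> Array.countBy id |> Array.maxBy snd
    float count / float labels.Length

// The success rate reported above, for comparison (about 0.70).
let reportedRate = 9001.0 / 12837.0
```

If `baselineRate` on the validation labels comes out near or above `reportedRate`, the classifier is not adding much over a constant guess.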
This was a big help to get me up and running (walking, really..)…