Traffic Stop Analysis Using F#

Now that I have the traffic stop services up and running, it is time to actually do something with the data.  The data set is all traffic stops in my town for 2012 with some limited information: date/time of the stop, the geolocation of the stop, and the final disposition of the stop.  The data looks like this:

image

My 1st step was to look at the Date/Time and see if there are any patterns in DayOfMonth, MonthOfYear, and TimeOfDay.  To that end, I spun up an F# project and added my 1st method, which determines the total number of records in the dataset:

    open FSharp.Data

    // The type provider infers the record shape from the sample endpoint
    type roadAlert = JsonProvider<"http://chickensoftware.com/roadalert/api/trafficstopsearch/Sample">

    type AnalysisEngine =
        // As a static property, this reloads the document on every access
        static member RoadAlertDoc = roadAlert.Load("http://chickensoftware.com/roadalert/api/trafficstopsearch")

        static member NumberOfRecords =
            AnalysisEngine.RoadAlertDoc
                |> Seq.length

Since I am a TDDer more than a REPLer, I went and wrote a covering unit test.

    [TestMethod]
    public void NumberOfRecords_ReturnsExpected()
    {
        Int32 notExpected = 0;
        Int32 actual = AnalysisEngine.NumberOfRecords;
        Assert.AreNotEqual(notExpected, actual);
    }

A couple of things to note about this:

1) This is really an integration test, not a unit test.  I could have written the test like this:

    [TestMethod]
    public void NumberOfRecordsFor2012DataSet_ReturnsExpected()
    {
        Int32 expected = 27778;
        Int32 actual = AnalysisEngine.NumberOfRecords;
        Assert.AreEqual(expected, actual);
    }

But that means I am tying the test to the specific data sample (in its current state) – and I don’t want to do that.

2) I am finding that my F# code has many more functions than the code written by other people, especially data scientists.  I think it has to do with contrasting methodologies.  Instead of spending time in the REPL with a small piece of code to get it right and then adding the code into the larger code base, I am writing very small pieces of code in the class and then using unit tests to get them right.  The upshot is that there are lots of small, independently testable pieces of code.  I think this stems from my background of writing production apps for business problems rather than for academic papers.  Also, I use classes in source files versus script files because I plan to plug the code into larger .NET applications that will be written in C# and/or VB.NET.
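As an illustration of the "small, independently testable pieces" idea, here is a minimal sketch in which the counting logic takes its dates as a parameter, so it can be tested without calling the service.  The function name stopsByMonth and the sample dates are made up for this example and are not part of the AnalysisEngine class.

```fsharp
open System

// Hypothetical helper: counts stops per calendar month for any sequence of dates.
let stopsByMonth (stopDates: seq<DateTime>) =
    stopDates
    |> Seq.countBy (fun d -> d.Month)
    |> Seq.sortBy fst
    |> Seq.toList

// Made-up sample dates, for illustration only.
let sample =
    [ DateTime(2012, 1, 5); DateTime(2012, 3, 9); DateTime(2012, 1, 20) ]

let counts = stopsByMonth sample
printfn "%A" counts
```

Because the input is passed in, a test can supply a handful of known dates and assert exact counts instead of only checking that the result is not empty.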

In any event, once I had the total number of records, I went to see how they broke down by month:

    static member ActualTrafficStopsByMonth =
        AnalysisEngine.RoadAlertDoc
            |> Seq.map(fun x -> x.StopDateTime.Month)
            |> Seq.countBy id
            // countBy returns keys in first-seen order; sort so position matches month number
            |> Seq.sortBy fst
            |> Seq.toList

    [TestMethod]
    public void ActualTrafficStopsByMonth_ReturnsExpected()
    {
        Int32 notExpected = 0;
        var stops = AnalysisEngine.ActualTrafficStopsByMonth;
        Assert.AreNotEqual(notExpected, stops.Length);
    }

 

I then created a function that shows the expected number of stops by month.  Pattern matching with F# makes creating the month list a snap.  Note that this is a true unit test because I am not dependent on external data:

    // Each entry is: month, days in month, fraction of the year.
    // Note: this assumes a 365-day year (February = 28), although 2012 was a leap year.
    static member Months =
        let monthList = [1..12]
        Seq.map (fun x ->
                match x with
                    | 1 | 3 | 5 | 7 | 8 | 10 | 12 -> x, 31, 31./365.
                    | 2 -> x, 28, 28./365.
                    | 4 | 6 | 9 | 11 -> x, 30, 30./365.
                    | _ -> x, 0, 0.
            ) monthList
        |> Seq.toList

    // Spreads the total evenly across the year: each month gets its share of days
    static member ExpectedTrafficStopsByMonth numberOfStops =
        AnalysisEngine.Months
            |> Seq.map(fun (month, _, fraction) ->
                month, int(fraction*numberOfStops))
            |> Seq.toList

    [TestMethod]
    public void ExpectedTrafficStopsByMonth_ReturnsExpected()
    {
        var stops = AnalysisEngine.ExpectedTrafficStopsByMonth(27778);
        double expected = 2359;
        double actual = stops[0].Item2;

        Assert.AreEqual(expected, actual);
    }
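The Months table above hard-codes a 365-day year, while 2012 was in fact a leap year.  A minimal sketch of deriving the same table from the calendar is below; monthsFor and expectedByMonth are hypothetical names for this example and are not members of AnalysisEngine.

```fsharp
open System

// Builds (month, daysInMonth, fractionOfYear) for a given year from the calendar.
let monthsFor (year: int) =
    let daysInYear = if DateTime.IsLeapYear year then 366.0 else 365.0
    [ 1 .. 12 ]
    |> List.map (fun month ->
        let days = DateTime.DaysInMonth(year, month)
        month, days, float days / daysInYear)

// Expected stops per month if stops were spread evenly over every day of the year.
let expectedByMonth (year: int) (numberOfStops: float) =
    monthsFor year
    |> List.map (fun (month, _, fraction) -> month, int (fraction * numberOfStops))

let expected2012 = expectedByMonth 2012 27778.0
printfn "%A" expected2012
```

With 366 days, the January expectation for 27,778 stops comes out at 2352 instead of the 2359 asserted in the test above, so switching to this version would mean updating that test as well.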

With the actual and expected ready to go, I then put the two side by side:

    static member TrafficStopsByMonth =
        let numberOfStops = float(AnalysisEngine.NumberOfRecords)
        let monthlyExpected = AnalysisEngine.ExpectedTrafficStopsByMonth numberOfStops
        let monthlyActual = AnalysisEngine.ActualTrafficStopsByMonth
        // Each row is: month, expected, actual, difference, % difference
        Seq.zip monthlyExpected monthlyActual
            |> Seq.map(fun (x,y) -> fst x, snd x, snd y, snd y - snd x, (float(snd y) - float(snd x))/float(snd x))
            |> Seq.toList

    [TestMethod]
    public void TrafficStopsByMonth_ReturnsExpected()
    {
        var output = AnalysisEngine.TrafficStopsByMonth;
        Assert.IsNotNull(output);
    }
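One thing to keep in mind with Seq.zip is that it pairs items by position, so the comparison is only correct when both lists are in the same order and contain the same keys.  A minimal sketch of a keyed join that does not rely on ordering is below.  The function compareByKey and the numbers are made up for illustration, and treating a missing key as zero stops is an assumption.

```fsharp
// Joins expected and actual counts on their key instead of by position.
// Keys missing from the actual counts are treated as zero stops (an assumption).
let compareByKey (expected: (int * int) list) (actual: (int * int) list) =
    let actualByKey = Map.ofList actual
    expected
    |> List.map (fun (key, expectedCount) ->
        let actualCount =
            match Map.tryFind key actualByKey with
            | Some count -> count
            | None -> 0
        let difference = actualCount - expectedCount
        key, expectedCount, actualCount, difference, float difference / float expectedCount)

// Made-up numbers: the actual list is out of order and has no entry for key 2.
let rows = compareByKey [ (1, 100); (2, 50); (3, 200) ] [ (3, 150); (1, 147) ]
printfn "%A" rows
```

Each row has the same shape as the tuples produced by TrafficStopsByMonth: key, expected, actual, difference, and relative difference.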

All of my unit tests ran green

image

so now I am ready to roll.  I created a quick console UI

    static void Main(string[] args)
    {
        Console.WriteLine("Start");

        foreach (var tuple in AnalysisEngine.TrafficStopsByMonth)
        {
            Console.WriteLine(tuple.Item1 + ":" + tuple.Item2 + ":" + tuple.Item3 + ":" + tuple.Item4 + ":" + tuple.Item5);
        }

        Console.WriteLine("End");
        Console.ReadKey();
    }

image

That is the output.  Obviously, a UX person could put some real pizzazz in front of this data, but that is something to do another day.  If you didn't see it in the code above, each tuple is constructed as: Month, ExpectedStops, ActualStops, Difference, %Difference.  The really interesting thing is that September was 47% higher than expected, while December was 26% lower.  That kind of wide variation calls for more analysis.

I then did a similar analysis by DayOfMonth:

    static member ActualTrafficStopsByDay =
        AnalysisEngine.RoadAlertDoc
            |> Seq.map(fun x -> x.StopDateTime.Day)
            |> Seq.countBy id
            // Sort so that position matches the day number before zipping
            |> Seq.sortBy fst
            |> Seq.toList

    // Each entry is: day of month, occurrences per year, fraction of the year.
    // Assumes a 365-day year, so the 29th occurs 11 times (no February 29th).
    static member Days =
        let dayList = [1..31]
        Seq.map (fun x ->
                match x with
                    | d when d < 29 -> d, 12, 12./365.
                    | 29 | 30 -> x, 11, 11./365.
                    | 31 -> x, 7, 7./365.
                    | _ -> x, 0, 0.
            ) dayList
        |> Seq.toList

    static member ExpectedTrafficStopsByDay numberOfStops =
        AnalysisEngine.Days
            |> Seq.map(fun (day, _, fraction) ->
                day, int(fraction*numberOfStops))
            |> Seq.toList

    static member TrafficStopsByDay =
        let numberOfStops = float(AnalysisEngine.NumberOfRecords)
        let dailyExpected = AnalysisEngine.ExpectedTrafficStopsByDay numberOfStops
        let dailyActual = AnalysisEngine.ActualTrafficStopsByDay
        Seq.zip dailyExpected dailyActual
            |> Seq.map(fun (x,y) -> fst x, snd x, snd y, snd y - snd x, (float(snd y) - float(snd x))/float(snd x))
            |> Seq.toList

image

The interesting thing is that there are higher than expected traffic stops in the last half of the month (esp the 25th and 26th) and much lower in the 1st part of the month.
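The 12/11/7 occurrence counts in Days can also be derived from the calendar rather than typed in by hand.  The sketch below does that; occurrencesOfDay and daysFor are hypothetical names for this example.  For a leap year such as 2012 it gives 12 occurrences of the 29th, where the hand-written table assumes 11.

```fsharp
open System

// Counts how many months of the given year contain the given day of the month.
let occurrencesOfDay (year: int) (day: int) =
    [ 1 .. 12 ]
    |> List.filter (fun month -> DateTime.DaysInMonth(year, month) >= day)
    |> List.length

// (day, occurrences, fractionOfYear) for every day of the month in the given year.
let daysFor (year: int) =
    let daysInYear = if DateTime.IsLeapYear year then 366.0 else 365.0
    [ 1 .. 31 ]
    |> List.map (fun day ->
        let occurrences = occurrencesOfDay year day
        day, occurrences, float occurrences / daysInYear)

let days2012 = daysFor 2012
printfn "%A" days2012
```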

And by TimeOfDay:

    static member ActualTrafficStopsByHour =
        AnalysisEngine.RoadAlertDoc
            |> Seq.map(fun x -> x.StopDateTime.Hour)
            |> Seq.countBy id
            // Sort so that position matches the hour before zipping
            |> Seq.sortBy fst
            |> Seq.toList

    // Each entry is: hour, count, fraction of the day.
    // DateTime.Hour ranges from 0 to 23, so the list starts at 0.
    static member Hours =
        let hourList = [0..23]
        Seq.map (fun x ->
                    x, 1, 1./24.
            ) hourList
        |> Seq.toList

    static member ExpectedTrafficStopsByHour numberOfStops =
        AnalysisEngine.Hours
            |> Seq.map(fun (hour, _, fraction) ->
                hour, int(fraction*numberOfStops))
            |> Seq.toList

    static member TrafficStopsByHour =
        let numberOfStops = float(AnalysisEngine.NumberOfRecords)
        let hourlyExpected = AnalysisEngine.ExpectedTrafficStopsByHour numberOfStops
        let hourlyActual = AnalysisEngine.ActualTrafficStopsByHour
        Seq.zip hourlyExpected hourlyActual
            |> Seq.map(fun (x,y) -> fst x, snd x, snd y, snd y - snd x, (float(snd y) - float(snd x))/float(snd x))
            |> Seq.toList

image

 

The interesting thing here is that there is a much higher than expected number of traffic stops from 1-2 AM (61% and 123%), with significantly fewer between 8 PM and midnight.  Finally, I looked at GPS location for the stops.

    // Groups stops by "lat:long" rounded to 3 decimals, most frequent first
    static member ActualTrafficStopsByGPS =
        AnalysisEngine.RoadAlertDoc
            |> Seq.map(fun x -> System.Math.Round(x.Latitude,3).ToString() + ":" + System.Math.Round(x.Longitude,3).ToString())
            |> Seq.countBy id
            |> Seq.sortBy snd
            |> Seq.toList
            |> List.rev

    // AnalysisEngine.Variance is a helper defined elsewhere in the class (not shown here)
    static member GetVarianceOfTrafficStopsByGPS =
        let trafficStopList = AnalysisEngine.ActualTrafficStopsByGPS
                                |> Seq.map(fun x -> double(snd x))
                                |> Seq.toList
        AnalysisEngine.Variance(trafficStopList)

    static member GetAverageOfTrafficStopsByGPS =
        AnalysisEngine.ActualTrafficStopsByGPS
            |> Seq.map(fun x -> double(snd x))
            |> Seq.average

 

You can see that I rounded the Latitude and Longitude to 3 decimal places.  According to Wikipedia, the 4th decimal place is worth 10.24 m at 23°N and 7.87 m at 45°N, so I interpolated that at 35°N it is about 8.94 m.  With 1 m = 3.28 feet, that means that 4 decimals is within about 30 feet, 3 decimals is within about 300 feet, and 2 decimals is within about 3,000 feet.  300 feet seems like a good compromise, so I ran with that.
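The arithmetic behind those distances can be sketched as a straight-line interpolation between the two quoted figures.  The function name interpolate is made up for this example, and assuming the value changes linearly between 23 and 45 degrees is an approximation.

```fsharp
// Linear interpolation between two known (latitude, metres) points.
let interpolate (x0: float, y0: float) (x1: float, y1: float) (x: float) =
    y0 + (x - x0) / (x1 - x0) * (y1 - y0)

let feetPerMetre = 3.28

// Figures quoted above: metres covered by the 4th decimal place at 23 and 45 degrees north.
let metresAt4Decimals = interpolate (23.0, 10.24) (45.0, 7.87) 35.0
let feetAt4Decimals = metresAt4Decimals * feetPerMetre
// Each decimal place dropped makes the cell ten times larger.
let feetAt3Decimals = feetAt4Decimals * 10.0

printfn "%.2f m, %.1f ft, %.0f ft" metresAt4Decimals feetAt4Decimals feetAt3Decimals
```

This gives roughly 8.9 m, or a little under 30 feet, for 4 decimals and a little under 300 feet for 3 decimals, which matches the rounded figures in the text.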

So running the average and variance and the top GPS locations:

image

With an average of 11 stops per GPS location (less than 1 a month) and a variance of 725, there does not seem to be a strong relationship between GPS location and traffic stops.
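The Variance helper called by GetVarianceOfTrafficStopsByGPS is not shown in this post.  A minimal sketch of what such a helper could look like is below.  It computes the population variance (dividing by n), which is an assumption, since the original may divide by n - 1 instead.  The sample values are made up.

```fsharp
// Population variance: the mean of the squared deviations from the mean.
// Dividing by n rather than n - 1 is an assumption about the original helper.
let variance (values: float list) =
    let mean = List.average values
    values
    |> List.averageBy (fun v -> (v - mean) * (v - mean))

// Made-up sample, for illustration only.
let example = variance [ 2.0; 4.0; 4.0; 4.0; 5.0; 5.0; 7.0; 9.0 ]
printfn "%f" example
```

Taking the square root of the variance gives the standard deviation, which is in the same units as the stop counts and is easier to compare against the average.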

The upshot of all of this analysis is that, to avoid getting stopped, it seems to be less important where you are than when you are.  This is confirmed anecdotally too: the Town actually broadcasts when they will have heightened traffic surveillance on Twitter and the like.  Ignore open data at your own risk.

In any event, my next step is to run this data through a machine-learning algorithm to see if there is anything else to uncover.
