Traffic Stop Analysis Using F#
January 7, 2014
Now that I have the traffic stop services up and running, it is time to actually do something with the data. The data set is all traffic stops in my town for 2012 with some limited information: date/time of the stop, the geolocation of the stop, and the final disposition of the stop. The data looks like this:
My first step was to look at the date/time and see if there are any patterns in DayOfMonth, MonthOfYear, and TimeOfDay. To that end, I spun up an F# project and added my first method, which determines the total number of records in the dataset:
```fsharp
open FSharp.Data

type roadAlert = JsonProvider<"http://chickensoftware.com/roadalert/api/trafficstopsearch/Sample">

type AnalysisEngine =
    static member RoadAlertDoc = roadAlert.Load("http://chickensoftware.com/roadalert/api/trafficstopsearch")

    static member NumberOfRecords =
        AnalysisEngine.RoadAlertDoc
        |> Seq.length
```
Since I am a TDDer more than a REPLer, I went and wrote a covering unit test.
```csharp
[TestMethod]
public void NumberOfRecords_ReturnsExpected()
{
    Int32 notExpected = 0;
    Int32 actual = AnalysisEngine.NumberOfRecords;
    Assert.AreNotEqual(notExpected, actual);
}
```
A couple of things to note about this:
1) This is really an integration test, not a unit test. I could have written the test like this:
```csharp
[TestMethod]
public void NumberOfRecordsFor2012DataSet_ReturnsExpected()
{
    Int32 expected = 27778;
    Int32 actual = AnalysisEngine.NumberOfRecords;
    Assert.AreEqual(expected, actual);
}
```
But that means I am tying the test to the specific data sample (in its current state) – and I don’t want to do that.
2) I am finding that my F# code has many more functions than code written by other people – especially data scientists. I think it has to do with contrasting methodologies. Instead of spending time in the REPL getting a small piece of code right and then adding it to the larger code base, I write a very small piece of code in the class and then use unit tests to get it right. The upshot is lots of small, independently testable pieces of code – I think this stems from my background of writing production apps for business problems rather than academic papers. Also, I use classes in source files rather than script files because I plan to plug the code into larger .NET applications that will be written in C# and/or VB.NET.
In any event, once I had the total number of records, I went to see how they broke down by month:
```fsharp
static member ActualTrafficStopsByMonth =
    AnalysisEngine.RoadAlertDoc
    |> Seq.map (fun x -> x.StopDateTime.Month)
    |> Seq.countBy id
    // countBy yields keys in first-occurrence order, so sort by month
    // before zipping against the expected values later
    |> Seq.sortBy fst
    |> Seq.toList
```
```csharp
[TestMethod]
public void ActualTrafficStopsByMonth_ReturnsExpected()
{
    Int32 notExpected = 0;
    var stops = AnalysisEngine.ActualTrafficStopsByMonth;
    Assert.AreNotEqual(notExpected, stops.Length);
}
```
I then created a function that shows the expected number of stops by month. Pattern matching in F# makes creating the month list a snap. Note that this is a true unit test because it does not depend on external data:
```fsharp
static member Months =
    let monthList = [1..12]
    Seq.map (fun x ->
        match x with
        | 1 | 3 | 5 | 7 | 8 | 10 | 12 -> x, 31, 31./365.
        | 2 -> x, 28, 28./365.
        | 4 | 6 | 9 | 11 -> x, 30, 30./365.
        | _ -> x, 0, 0.
        ) monthList
    |> Seq.toList
```
```fsharp
static member ExpectedTrafficStopsByMonth numberOfStops =
    AnalysisEngine.Months
    |> Seq.map (fun (x, y, z) -> x, int (z * numberOfStops))
    |> Seq.toList
```
```csharp
[TestMethod]
public void ExpectedTrafficStopsByMonth_ReturnsExpected()
{
    var stops = AnalysisEngine.ExpectedTrafficStopsByMonth(27778);
    Int32 expected = 2359;
    Int32 actual = stops[0].Item2;
    Assert.AreEqual(expected, actual);
}
```
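One refinement worth noting: 2012 was a leap year, so the data set actually spans 366 days and February had 29. A leap-year-aware version of `Months` could derive the counts from `System.DateTime.DaysInMonth` rather than pattern matching. This is a sketch, not the code I used above:

```fsharp
// Sketch: month weights derived from the actual 2012 calendar (a leap year).
// DateTime.DaysInMonth(2012, 2) returns 29, and the year has 366 days.
let monthsFor2012 =
    let daysInYear = if System.DateTime.IsLeapYear 2012 then 366. else 365.
    [1..12]
    |> List.map (fun month ->
        let days = System.DateTime.DaysInMonth(2012, month)
        month, days, float days / daysInYear)
```

Against the 2012 data this shifts February's expected count up slightly and nudges every other month's down a touch, so the effect on the percentages is small but real.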
With the actual and expected ready to go, I then put the two side by side:
```fsharp
static member TrafficStopsByMonth =
    let numberOfStops = float AnalysisEngine.NumberOfRecords
    let monthlyExpected = AnalysisEngine.ExpectedTrafficStopsByMonth numberOfStops
    let monthlyActual = AnalysisEngine.ActualTrafficStopsByMonth
    Seq.zip monthlyExpected monthlyActual
    |> Seq.map (fun (x, y) -> fst x, snd x, snd y, snd y - snd x, (float (snd y) - float (snd x)) / float (snd x))
    |> Seq.toList
```
```csharp
[TestMethod]
public void TrafficStopsByMonth_ReturnsExpected()
{
    var output = AnalysisEngine.TrafficStopsByMonth;
    Assert.IsNotNull(output);
}
```
All of my unit tests ran green, so I was ready to roll. I created a quick console UI:
```csharp
static void Main(string[] args)
{
    Console.WriteLine("Start");
    foreach (var tuple in AnalysisEngine.TrafficStopsByMonth)
    {
        Console.WriteLine(tuple.Item1 + ":" + tuple.Item2 + ":" + tuple.Item3 + ":" + tuple.Item4 + ":" + tuple.Item5);
    }
    Console.WriteLine("End");
    Console.ReadKey();
}
```
Here is the output. Obviously, a UX person could put some real pizzazz in front of this data, but that is something for another day. If you didn't see it in the code above, the tuple is constructed as Month, ExpectedStops, ActualStops, Difference, %Difference. The really interesting thing is that September was 47% higher than expected while December was 26% lower. That kind of wide variation begs for more analysis.
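To pull those outliers out programmatically rather than by eyeball, the month tuples can be ranked by absolute percentage difference. A sketch against the tuple shape above (month, expected, actual, difference, %difference):

```fsharp
// Sketch: rank rows by how far the actual counts deviate from expected.
// Works over the (month, expected, actual, difference, pctDifference) tuples.
let rankByDeviation (rows: (int * int * int * int * float) list) =
    rows |> List.sortBy (fun (_, _, _, _, pct) -> -(abs pct))
```

Applied to `AnalysisEngine.TrafficStopsByMonth`, September (+47%) and December (-26%) would surface at the top of the list.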
I then did a similar analysis by DayOfMonth:
```fsharp
static member ActualTrafficStopsByDay =
    AnalysisEngine.RoadAlertDoc
    |> Seq.map (fun x -> x.StopDateTime.Day)
    |> Seq.countBy id
    // sort so the days line up with the expected list when zipped
    |> Seq.sortBy fst
    |> Seq.toList

static member Days =
    let dayList = [1..31]
    Seq.map (fun x ->
        match x with
        | x when x < 29 -> x, 12, 12./365.
        | 29 | 30 -> x, 11, 11./365.
        | 31 -> x, 7, 7./365.
        | _ -> x, 0, 0.
        ) dayList
    |> Seq.toList

static member ExpectedTrafficStopsByDay numberOfStops =
    AnalysisEngine.Days
    |> Seq.map (fun (x, y, z) -> x, int (z * numberOfStops))
    |> Seq.toList

static member TrafficStopsByDay =
    let numberOfStops = float AnalysisEngine.NumberOfRecords
    let dailyExpected = AnalysisEngine.ExpectedTrafficStopsByDay numberOfStops
    let dailyActual = AnalysisEngine.ActualTrafficStopsByDay
    Seq.zip dailyExpected dailyActual
    |> Seq.map (fun (x, y) -> fst x, snd x, snd y, snd y - snd x, (float (snd y) - float (snd x)) / float (snd x))
    |> Seq.toList
```
The interesting thing is that there are more traffic stops than expected in the last half of the month (especially the 25th and 26th) and far fewer in the first part of the month.
And by TimeOfDay:
```fsharp
static member ActualTrafficStopsByHour =
    AnalysisEngine.RoadAlertDoc
    |> Seq.map (fun x -> x.StopDateTime.Hour)
    |> Seq.countBy id
    // sort so the hours line up with the expected list when zipped
    |> Seq.sortBy fst
    |> Seq.toList

static member Hours =
    // DateTime.Hour ranges from 0 to 23, so the expected list must match
    let hourList = [0..23]
    Seq.map (fun x -> x, 1, 1./24.) hourList
    |> Seq.toList

static member ExpectedTrafficStopsByHour numberOfStops =
    AnalysisEngine.Hours
    |> Seq.map (fun (x, y, z) -> x, int (z * numberOfStops))
    |> Seq.toList

static member TrafficStopsByHour =
    let numberOfStops = float AnalysisEngine.NumberOfRecords
    let hourlyExpected = AnalysisEngine.ExpectedTrafficStopsByHour numberOfStops
    let hourlyActual = AnalysisEngine.ActualTrafficStopsByHour
    Seq.zip hourlyExpected hourlyActual
    |> Seq.map (fun (x, y) -> fst x, snd x, snd y, snd y - snd x, (float (snd y) - float (snd x)) / float (snd x))
    |> Seq.toList
```
The interesting thing here is that the number of traffic stops from 1 to 2 AM is much higher than expected (61% and 123% above), with significantly fewer between 8 PM and midnight. Finally, I looked at the GPS location of the stops.
```fsharp
static member ActualTrafficStopsByGPS =
    AnalysisEngine.RoadAlertDoc
    |> Seq.map (fun x -> System.Math.Round(x.Latitude, 3).ToString() + ":" + System.Math.Round(x.Longitude, 3).ToString())
    |> Seq.countBy id
    |> Seq.sortBy snd
    |> Seq.toList
    |> List.rev

static member GetVarianceOfTrafficStopsByGPS =
    let trafficStopList =
        AnalysisEngine.ActualTrafficStopsByGPS
        |> Seq.map (fun x -> double (snd x))
        |> Seq.toList
    AnalysisEngine.Variance(trafficStopList)

static member GetAverageOfTrafficStopsByGPS =
    AnalysisEngine.ActualTrafficStopsByGPS
    |> Seq.map (fun x -> double (snd x))
    |> Seq.average
```
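The `Variance` member that `GetVarianceOfTrafficStopsByGPS` calls is not shown above; a minimal implementation over a list of doubles might look like this (a sketch, assuming population rather than sample variance):

```fsharp
// Sketch: population variance = mean of squared deviations from the mean.
let variance (values: float list) =
    let mean = List.average values
    values
    |> List.map (fun v -> (v - mean) ** 2.)
    |> List.average
```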
You can see that I rounded the latitude and longitude to three decimal places. According to Wikipedia, the fourth decimal place of longitude spans 10.24 m at 23°N and 7.87 m at 45°N, so I interpolated roughly 8.94 m at 35°N. With 1 m = 3.28 feet, that means four decimals is within about 30 feet, three decimals within about 300 feet, and two decimals within about 3,000 feet. 300 feet seemed like a good compromise, so I ran with that.
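That arithmetic can be sanity-checked directly: one degree of longitude spans roughly 111.32 km times the cosine of the latitude, and the fourth decimal place is that length divided by 10,000. A quick sketch (the 111,320 m constant is the standard equatorial approximation, not a figure from my data):

```fsharp
// Sketch: approximate ground distance of the 4th decimal place of longitude.
// One degree of longitude is about 111,320 m * cos(latitude); divide by 10^4
// to get the span of the fourth decimal place.
let metersPerFourthDecimal latitudeDegrees =
    let radians = latitudeDegrees * System.Math.PI / 180.
    111320. * cos radians / 10000.
```

This gives about 10.25 m at 23°N, 7.87 m at 45°N, and roughly 9.1 m at 35°N, close to the 8.94 m interpolated above; three decimals is ten times that, on the order of 300 feet.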
So, running the average and variance along with the top GPS locations:
With an average of 11 stops per GPS location (less than one a month) and a variance of 725, there does not seem to be a strong relationship between GPS location and traffic stops.
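For context, those two figures can be combined: a variance of 725 corresponds to a standard deviation of about 27 stops per location, more than double the mean of 11, which is why the counts look too dispersed to support a location effect. The arithmetic, as a quick sketch using the reported numbers:

```fsharp
// Sketch: dispersion of stops-per-location from the reported figures.
let meanStops = 11.
let varianceStops = 725.
let stdDev = sqrt varianceStops                    // about 26.9 stops
let coefficientOfVariation = stdDev / meanStops    // about 2.4, highly dispersed
```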
The upshot of all of this analysis: to avoid getting stopped, where you are matters less than when you are. This is confirmed anecdotally too – the town actually broadcasts when it will have heightened traffic surveillance on Twitter and the like. Ignore open data at your own risk.
In any event, my next step is to run this data through a machine-learning algorithm to see if there is anything else to uncover.