Restaurant Classifier: Async For Faster Performance?
March 4, 2014
Going back to my restaurant classifier using F# from last week, I decided to speed things up some. Each request to the Yellow Pages API takes 1 second, so with the 5,682 records, I am looking at a little over 1.5 hours (about 94.7 minutes) to pull down the data when running serially.
I first thought about making my methods async so I changed the API call method to async and used the Http.AsyncRequest method like so (line 10 below):
I then made the covering function async also (line 11 below):
The problem is that invoking the covering function via an anonymous method did not work easily.
After screwing around with the syntax a bit, I went over to Stack Overflow where I found out two things:
- There is not an easy way to do it (I was hoping for a Seq.FilterAsync method)
- Tomas Petricek is above my pay-grade.
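For what it is worth, a filter of the kind I was hoping for can be sketched on top of Async.Parallel. This is only a minimal sketch, not what Stack Overflow suggested and not a built-in: `filterAsync` and `isChineseAsync` are hypothetical names, and the predicate just sleeps to stand in for the Yellow Pages call.

```fsharp
// Hypothetical helper: FSharp.Core does not ship a Seq.filterAsync.
let filterAsync (predicate: 'a -> Async<bool>) (items: seq<'a>) : Async<'a list> =
    async {
        let itemArray = Seq.toArray items
        // Start every predicate and wait for all of them; results keep input order.
        let! flags = itemArray |> Array.map predicate |> Async.Parallel
        return
            Array.zip itemArray flags
            |> Array.filter snd
            |> Array.map fst
            |> List.ofArray
    }

// Hypothetical async predicate standing in for the real API request.
let isChineseAsync (name: string) : Async<bool> =
    async {
        do! Async.Sleep 100
        return name.Contains("Wang")
    }

let found =
    [ "Wang's Kitchen"; "Joe's Diner" ]
    |> filterAsync isChineseAsync
    |> Async.RunSynchronously

printfn "%A" found
```

One caveat with this shape: it starts every request at once, which is probably too aggressive for several thousand records against a rate-limited API, so some batching would likely be needed.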
In any event, I decided to drop the async and just look at parallelism. Turns out that there is a parallel Seq module called PSeq; it is just not in the FSharp core library yet. I created a PSeq file in my project, moved it to the top of the file list (F# compiles files in order, so it has to come before the code that uses it) and dropped the code in. I then changed the method call to use PSeq to invoke the serial methods:
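To show the general shape of that change without pulling in the PSeq file, here is a rough sketch that swaps in Array.Parallel from FSharp.Core instead. It is not the code from my project: `isChineseRestaurant` is a hypothetical stand-in that blocks for a moment and applies a made-up rule, and two of the names in the sample list are invented.

```fsharp
open System.Threading

// Hypothetical stand-in for the serial Yellow Pages lookup: it blocks briefly
// to imitate a slow HTTP call, then applies a made-up rule.
let isChineseRestaurant (name: string) : bool =
    Thread.Sleep 100
    name.Contains("Wang") || name.Contains("Mongolian")

// Sample data; the last two names are invented for illustration.
let restaurants =
    [| "Wang's Kitchen"; "Crazy Fire Mongolian Grill"; "Joe's Diner"; "Burger Barn" |]

// Run the blocking predicate across several threads, then keep the matches.
// Array.Parallel.map preserves the input order.
let matches =
    restaurants
    |> Array.Parallel.map (fun name -> name, isChineseRestaurant name)
    |> Array.filter snd
    |> Array.map fst

printfn "%A" matches
```

The serial predicate stays untouched; only the way it is applied to the collection changes, which is the same idea as swapping Seq for PSeq.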
When I first invoked it and looked at Fiddler (OT: did anyone notice that Fiddler’s new logo looks a lot like the FSharp one? Probably just a coincidence), it was clear that things were running in parallel and that performance would improve. I have two cores on this workstation so my time should be cut roughly in half.
With the parallel method in my back pocket, I decided to see the ultimate result of the restaurant classification. I created a quick console app:
I then ran the search on YP.com using my 4 core laptop and got the following results:
Compared to my original classifier based on name:
So the results make sense. The YP serial search would take at least 94.7 minutes, the YP parallel search took 41 minutes, and the in-memory name search took 3 seconds. The YP searches found restaurants that the name search did not (Wang’s Kitchen, Crazy Fire Mongolian Grill, etc…): 275 to 221, or 24% more restaurants.
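The arithmetic behind those numbers is quick to check. The sketch below uses only figures already quoted in this post; the variable names are mine.

```fsharp
// Figures quoted in the post.
let records = 5682.0
let secondsPerRequest = 1.0
let parallelMinutes = 41.0
let ypCount, nameCount = 275.0, 221.0

// Estimated serial run time in minutes.
let serialMinutes = records * secondsPerRequest / 60.0
// How much faster the parallel run was than the serial estimate.
let speedup = serialMinutes / parallelMinutes
// Extra restaurants found by the YP search, relative to the name search.
let percentMore = (ypCount - nameCount) / nameCount * 100.0

printfn "serial: %.1f min, speedup: %.2fx, extra: %.1f%%" serialMinutes speedup percentMore
// prints "serial: 94.7 min, speedup: 2.31x, extra: 24.4%"
```

So the parallel run came in at a bit better than twice as fast as the serial estimate.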
I think that the next step is to look at the classifier and see how many restaurants are in both datasets and, for the ones that are missing from the YP one, why they are missing (did they even pay to be in the Yellow Pages?). Perhaps there is another YP category that can be considered. Also, it would be interesting to see which restaurants are in both the name search and the Yellow Pages but were not classified as Chinese, which would give the false positive rate. Finally, I did see some 500s in Fiddler that had “read time out” so there is room for improvement to account for the transient faults…
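One possible way to handle those transient faults is a small retry wrapper. This is an untested-against-YP sketch, not something I have wired into the classifier: `retryAsync` and `flakyRequest` are hypothetical names, and the fake request simply fails twice before succeeding.

```fsharp
// Hypothetical helper: run the work, and on failure wait and try again
// until the attempts run out, at which point the last exception is raised.
let rec retryAsync (attemptsLeft: int) (delayMs: int) (work: unit -> Async<'a>) : Async<'a> =
    async {
        let! outcome = work () |> Async.Catch
        match outcome with
        | Choice1Of2 value -> return value
        | Choice2Of2 _ when attemptsLeft > 1 ->
            do! Async.Sleep delayMs
            return! retryAsync (attemptsLeft - 1) delayMs work
        | Choice2Of2 ex -> return raise ex
    }

// Hypothetical flaky request: fails on the first two calls, then succeeds.
let callCount = ref 0
let flakyRequest () : Async<string> =
    async {
        callCount.Value <- callCount.Value + 1
        if callCount.Value < 3 then failwith "read time out"
        return "ok"
    }

let result = retryAsync 5 50 flakyRequest |> Async.RunSynchronously
printfn "%s after %d calls" result callCount.Value
```

A fixed delay keeps the sketch short; a growing delay between attempts might be kinder to the API.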