Parsing Wake County Tax Site With F#

Based on the response to my last post on Wake County school scores, I decided to look at each school’s revenue base. Instead of looking at free and reduced lunch as a correlating factor for school scores, I wanted to look at the aggregate home valuations of each school’s population.

To do that, I thought of the Wake County Tax Department’s web site, where you can look up an address and see the tax value of the property. Although they don’t have an API, their search result page has a predictable URI like this: http://services.wakegov.com/realestate/Account.asp?id=0000001 so by plugging in a 7-digit, zero-padded integer, I could theoretically look at all of the tax records for the county. Also, the HTML of the result page is standardized, so parsing it should be fairly straightforward.
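Since the id in that URI looks like a plain integer padded with zeros to seven digits, a small helper can build the address for any id. This is only a sketch: the name accountUri is made up for illustration, and the seven-digit padding is an assumption based on the one example URI above.

```fsharp
// Hypothetical helper: builds the account page URI for a given integer id.
// Assumes ids are always rendered as seven zero-padded digits, as in the example URI.
let accountUri (id: int) : string =
    sprintf "http://services.wakegov.com/realestate/Account.asp?id=%07d" id

// Print the URIs for a couple of sample ids.
printfn "%s" (accountUri 1)
printfn "%s" (accountUri 12345)
```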

So I fired up Visual Studio and opened up the F# REPL. The first thing I did was to bring in the Html type provider and wire up a standard page for the type.

#r "../packages/FSharp.Data.2.1.1/lib/net40/FSharp.Data.dll"
open FSharp.Data
type context = HtmlProvider<"../data/RealEstateSample.html">

I could then bring down all of the DOM elements for the page and find all of the <table> elements:

let uri = "http://services.wakegov.com/realestate/Account.asp?id=0000001"
let body = context.Load(uri).Html.Body()
let tables = body.Descendants("TABLE") |> Seq.toList
tables |> Seq.length

image

So there are 14 tables on the page.  After some manual inspection, the table that holds the address information is table number 7:

let addressTable = tables.[7]

image

My first thought was to parse the text for keywords that I could search on:

// The tax information lives in table number 11
let taxTable = tables.[11]
let baseText = taxTable.ToString()
// Jump to the label, then to the dollar sign that follows it
let marker = baseText.IndexOf("Total Value Assessed")
let remainingText = baseText.Substring(marker)
let marker' = remainingText.IndexOf("$")
let remainingText' = remainingText.Substring(marker')
// The value ends where the next tag begins
let marker'' = remainingText'.IndexOf("<")
let finalText = remainingText'.Substring(0, marker'')

I then thought, “Jamie, you are being stupid.” Since the DOM is structured consistently, I can just use the type provider and search on tags:

let addressTable = tables.[7]
let fonts = addressTable.Descendants("font") |> Seq.toList
let addressOne = fonts.[1].InnerText()
let addressTwo = fonts.[2].InnerText()
let addressThree = fonts.[3].InnerText()

and sure enough

image

And then going to table number 11, I can get the assessed value:

let taxTable = tables.[11]
let fonts' = taxTable.Descendants("font") |> Seq.toList
let assessedValue = fonts'.[3].InnerText()

and how cool is this?

image
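The assessed value comes back as display text, so it has to be turned into a number before it can be summed per school. Here is a minimal sketch of that step. The helper name parseAssessedValue is hypothetical, and I am assuming the text looks like "$250,000" (a dollar sign with comma separators); the real page may format it differently.

```fsharp
open System
open System.Globalization

// Hypothetical helper: turns display text such as "$250,000" into a decimal.
// Assumes a leading dollar sign and comma thousands separators.
// Returns None when the remaining text is not a number.
let parseAssessedValue (text: string) : decimal option =
    let cleaned = text.Trim().Replace("$", "").Replace(",", "")
    match Decimal.TryParse(cleaned, NumberStyles.Number, CultureInfo.InvariantCulture) with
    | true, value -> Some value
    | _ -> None

// Sum the values that parse and silently skip the ones that do not.
let total =
    [ "$100,000"; "$250,500"; "n/a" ]
    |> List.choose parseAssessedValue
    |> List.sum
```

Returning an option keeps a single odd record from stopping an aggregate over thousands of properties.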

So with the data elements in place, I need a way of saving the data. Fortunately, FSharp.Data also ships JSON support (the JsonValue type), so I could do this:

let valuation = JsonValue.Record [|
    "addressOne", JsonValue.String addressOne
    "addressTwo", JsonValue.String addressTwo
    "addressThree", JsonValue.String addressThree
    "assessedValue", JsonValue.String assessedValue |]
open System.IO
File.AppendAllText(@"C:\Data\dataTest.json", valuation.ToString())

And in the file:

image

So now I have the pieces to make requests to the Wake County site and put the values into a JSON file. I decided to push the data to the file after each request so that if there is a fault partway through the run, I would not lose everything. Here is the gist and here are the results:

image
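The save-after-every-request idea can be sketched roughly as follows. This is not the gist itself, just an illustration under assumptions: appendRecord and scrapeAll are hypothetical names, and fetch stands in for whatever function downloads and parses one account page and returns its JSON text.

```fsharp
open System
open System.IO

// Append one JSON record as its own line, so the file on disk is always
// current up to the last successful request.
let appendRecord (path: string) (json: string) : unit =
    File.AppendAllText(path, json + Environment.NewLine)

// Hypothetical driver: `fetch` is a stand-in for the download-and-parse step.
// An id whose fetch or write throws is recorded and the run carries on,
// so one bad page does not cost the records already saved.
let scrapeAll (path: string) (fetch: int -> string) (ids: seq<int>) : int list =
    let failed = ResizeArray<int>()
    for id in ids do
        try
            appendRecord path (fetch id)
        with _ ->
            failed.Add id
    List.ofSeq failed
```

The returned list of failed ids can be fed back in for a second pass once the run is done.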

I then decided to see how long it would take to download the first 1,000 ids:

#time
[1..1000] |> Seq.iter (fun id -> doValuation id)

and with Fiddler running

image

It took about 5 minutes for 1,000 ids,

image

so extrapolating to the max possible id (9,999,999), it would take about 833 hours, or roughly 35 days.

image
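As a check on that extrapolation, here is the arithmetic spelled out, assuming the time scales linearly from about 5 minutes per 1,000 ids. The value names are just for illustration.

```fsharp
// Linear extrapolation from the timing above: about 5 minutes per 1,000 ids.
let minutesPerThousand = 5.0
let maxId = 9999999.0

let totalMinutes = maxId / 1000.0 * minutesPerThousand
let totalHours = totalMinutes / 60.0
let totalDays = totalHours / 24.0

// prints "833 hours (about 34.7 days)"
printfn "%.0f hours (about %.1f days)" totalHours totalDays
```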

Two thoughts come to mind for the next step:

1) Use MBrace with some VMs on Azure to do the requests in parallel

2) Do a binary search to see the actual upper number for Wake County.
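The binary search in the second idea could look something like this. It is a sketch under a big assumption: that ids are assigned contiguously, so every id up to some maximum has a record and every id above it does not. If the county has gaps in its numbering, this will give the wrong answer. The names findUpperBound and exists are hypothetical; exists stands in for a request that reports whether an account page comes back for a given id.

```fsharp
// Hypothetical sketch: finds the largest id in [low, high] for which `exists` is true.
// Assumes `exists` is true for every id up to some maximum and false above it,
// and that `exists low` is true.
let findUpperBound (exists: int -> bool) (low: int) (high: int) : int =
    let rec search lo hi =
        // Invariant: `exists lo` is true, and every id above hi is false.
        if lo >= hi then lo
        else
            // Round the midpoint up so the range shrinks on every step.
            let mid = lo + (hi - lo + 1) / 2
            if exists mid then search mid hi
            else search lo (mid - 1)
    search low high

// Stand-in predicate for illustration: pretend the last valid id is 1,234,567.
let lastId = findUpperBound (fun id -> id <= 1234567) 1 9999999
```

Because the range halves on every probe, covering 1 to 9,999,999 takes at most 24 requests, which is a lot cheaper than walking every id.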

Tune in next week to see if that works.
