Parsing Wake County School System Attendance Assignment Site With F#

As a follow up to this post, I then turned my attention to parsing the Wake County Public School Assignment Site.  If you are not familiar, large schools districts in America have a concept of ‘nodes’ where a child is assigned to a school pyramid (elementary, middle, high schools) based on their home address.  This gives the school attendance tremendous power because a house’s value is directly tied to how “good” (real or perceived) their assigned school pyramid.  WCPSS has a site here where you can enter in your address and find out the school pyramid.

Since there is not a public Api or even a publically available dataset, I decided to see if I could screen scrape the site.  The first challenge is that you need to navigate through 2 pages to get to your answer.  Here is the Fiddler trace

image

The first mistake you will notice is that they are using php.  The second is that they are using the same uri and they are parameterizing the requests via the form value:

image

Finally, their third mistake is that the pages comes back in an non-consistent way, making the DOM traversal more challenging.

Undaunted, I fired up Visual Studio. Because there are 2 pages that need to be used, I imported both of them as a the model for the HtmlTypeProvider

image

I then pulled out the form query string and placed them into some values.  The code so far:

1 #r "../packages/FSharp.Data.2.1.1/lib/net40/FSharp.Data.dll" 2 3 open System.Net 4 open FSharp.Data 5 6 type context = HtmlProvider<"../data/HouseSearchSample01.html"> 7 type context' = HtmlProvider<"../data/HouseSearchSample02.html"> 8 9 let uri = "http://wwwgis2.wcpss.net/addressLookup/index.php" 10 let streetLookup = "StreetTemplateValue=STRATH&StreetName=Strathorn+Dr+Cary&StreetNumber=904&SubmitAddressSelectPage=CONTINUE&DefaultAction=SubmitAddressSelectPage" 11 let streetLookup' = "SelectAssignment%7C2014%7CCURRENT=2014-15&DefaultAction=SelectAssignment%7C2014%7CCURRENT&DefaultAction=SelectAssignment%7C2015%7CCURRENT&CatchmentCode=CA+0198.2&StreetName=Strathorn+Dr+Cary&StreetTemplateValue=STRATH&StreetNumber=904&StreetZipCode=27519" 12

Skipping the 1st page, I decided to make a request and see if I could get the school information out of the DOM.  It well enough but you can see the immediate problem –> the page’s structure varies so just tagging the n element of the table will not work

1 let webClient = new WebClient() 2 webClient.Headers.Add("Content-Type", "application/x-www-form-urlencoded") 3 let result = webClient.UploadString(uri,"POST",streetLookup') 4 let body = context'.Parse(result).Html.Body() 5 6 let tables = body.Descendants("TABLE") |> Seq.toList 7 let schoolTable = tables.[0] 8 let schoolRows = schoolTable.Descendants("TR") |> Seq.toList 9 let elementaryDatas = schoolRows.[0].Descendants("TD") |> Seq.toList 10 let elementarySchool = elementaryDatas.[1].InnerText() 11 let middleSchoolDatas = schoolRows.[1].Descendants("TD") |> Seq.toList 12 let middleSchool = middleSchoolDatas.[1].InnerText() 13 //Need to skip for the enrollement cap message 14 let highSchoolDatas = schoolRows.[3].Descendants("TD") |> Seq.toList 15 let highSchool = highSchoolDatas.[1].InnerText() 16

 

image

I decided to take the dog for a walk and that time away from the keyboard was very helpful because I realized that although the table is not consistent, I don’t need it to be for my purposes.  All I need are the schools names for a given address.  What I need to do it remove all of the noise and just find the rows of the table with useful data:

1 let webClient = new WebClient() 2 webClient.Headers.Add("Content-Type", "application/x-www-form-urlencoded") 3 let result = webClient.UploadString(uri,"POST",streetLookup') 4 let body = context'.Parse(result).Html.Body() 5 6 let tables = body.Descendants("TABLE") |> Seq.toList 7 let schoolTable = tables.[0] 8 let schoolRows = schoolTable.Descendants("TR") |> Seq.toList 9 let schoolData = schoolRows |> Seq.collect(fun r -> r.Descendants("TD")) |>Seq.toList 10 let schoolData' = schoolData |> Seq.map(fun d -> d.InnerText().Trim()) 11 let schoolData'' = schoolData' |> Seq.filter(fun s -> s <> System.String.Empty) 12 13 //Strip out noise 14 let removeNonEssentialData (s:string) = 15 let markerPosition = s.IndexOf('(') 16 match markerPosition with 17 | -1 -> s 18 | _ -> s.Substring(0,markerPosition).Trim() 19 20 let schoolData''' = schoolData'' |> Seq.map(fun s -> removeNonEssentialData(s)) 21 22 let unimportantPhrases = [|"Neighborhood Busing";"This school has an enrollment cap"|] 23 let containsUnimportantPhrase (s:string) = 24 unimportantPhrases |> Seq.exists(fun p -> s.Contains(p)) 25 26 let schoolData'''' = schoolData''' |> Seq.filter(fun s -> containsUnimportantPhrase(s) = false ) 27 28 schoolData''''

And Boom goes the dynamite:

image

So working backwards, I need to parse the 1st page to get the CatchmentCode for an address, build the second’s page form data and then parse the results.  Parsing the 1st page for the catachmentCode was very straight forward:

1 let result = webClient.UploadString(uri,"POST",streetLookup) 2 let body = context.Parse(result).Html.Body() 3 let inputs = body.Descendants("INPUT") |> Seq.toList

image

1 let catchmentCode = inputs' |> Seq.filter(fun (n,v) -> n = "CatchmentCode") 2 |> Seq.map(fun (n,v) -> v) 3 |> Seq.head 4 let streetName = inputs' |> Seq.filter(fun (n,v) -> n = "StreetName") 5 |> Seq.map(fun (n,v) -> v) 6 |> Seq.head 7 let streetTemplateValue = inputs' |> Seq.filter(fun (n,v) -> n = "StreetTemplateValue") 8 |> Seq.map(fun (n,v) -> v) 9 |> Seq.head 10 let streetNumber = inputs' |> Seq.filter(fun (n,v) -> n = "StreetNumber") 11 |> Seq.map(fun (n,v) -> v) 12 |> Seq.head 13 let streetZipCode = inputs' |> Seq.filter(fun (n,v) -> n = "StreetZipCode") 14 |> Seq.map(fun (n,v) -> v) 15 |> Seq.head

 

image

So the answer is there, just the code sucks.  I refactored it to a single function and

1 let getValueFromInput(nameToFind:string) = 2 inputs' |> Seq.filter(fun (n,v) -> n = nameToFind) 3 |> Seq.map(fun (n,v) -> v) 4 |> Seq.head 5 let catchmentCode = getValueFromInput("CatchmentCode") 6 let streetName = getValueFromInput("StreetName") 7 let streetTemplateValue = getValueFromInput("StreetTemplateValue") 8 let streetNumber =getValueFromInput("StreetNumber") 9 let streetZipCode = getValueFromInput("StreetZipCode")

With the page 1 out of the way, I was ready to start altering the form query string.  I pulled the values out of the string and set up like this:

1 let streetTemplateValue = "STRAT" 2 let street = "Strathorn" 3 let suffix = "Dr" 4 let city = "Cary" 5 let streetNumber = "904" 6 let streetName = street+"+"+suffix+"+"+city 7 let streetLookup = "StreetTemplateValue="+streetTemplateValue+"&StreetName="+streetName+"&StreetNumber="+streetNumber+"&SubmitAddressSelectPage=CONTINUE&DefaultAction=SubmitAddressSelectPage" 8

1 let streetLookup' = "SelectAssignment%7C2014%7CCURRENT=2014-15&DefaultAction=SelectAssignment%7C2014%7CCURRENT&DefaultAction=SelectAssignment%7C2015%7CCURRENT&CatchmentCode="+catchmentCode+"&StreetName="+streetName+"&StreetTemplateValue="+streetTemplateValue+"&StreetNumber="+streetNumber+"&StreetZipCode="+streetZipCode 2

So now it was just a matter of creating some data structures to pass into the 1st query string

1 type SearchCriteria = {streetTemplateValue:string;street:string;suffix:string;city:string;streetNumber:string;} 2 3 let searchCriteria = {streetTemplateValue="STRAT";street="Strathorn";suffix="Dr";city="Cary";streetNumber="904"} 4 //Page1 Query String 5 let streetName = searchCriteria.street+"+"+searchCriteria.suffix+"+"+searchCriteria.city 6 let streetLookup = "StreetTemplateValue="+searchCriteria.streetTemplateValue+"&StreetName="+streetName+"&StreetNumber="+searchCriteria.streetNumber+"&SubmitAddressSelectPage=CONTINUE&DefaultAction=SubmitAddressSelectPage" 7

and we now have the basis for a series of functions to do the school lookup.  You can see the gist here.

Advertisements

One Response to Parsing Wake County School System Attendance Assignment Site With F#

  1. Pingback: F# Weekly #9, 2015 | Sergey Tihon's Blog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: