Creating a crosswalk table between WCPSS School assignment results and school report card school list

As part of my Wake County School Score analysis, I needed to build a crosswalk table between the Wake County school site parsing that I did here and the school score result set.  The screen scraping put schools in this kind of format:


As an added wrinkle, there is some extra data for some of the schools:


while the score result set is in this format:


So I want to create a crosswalk table with this format:


Step one of this process is to get all of the distinct values from the school site data.  If the data were in a SQL Server database, getting that would be as simple as

“Select distinct schoolName from reallyBigDenormalizedTable”

But the data is not in SQL Server; it is in a NoSQL database, and the JSON is structured so that the school names are in an array nested inside the data structure.  After messing around with the query syntax to traverse the nested array, I gave up and decided to sample the database.

Step one was to get a record out via the index:

let getSchools (index:int) =
    try
        let endpointUrl = ""
        let client = new DocumentClient(new Uri(endpointUrl), authKey)
        let database = client.CreateDatabaseQuery().Where(fun db -> db.Id = "wakecounty").ToArray().FirstOrDefault()
        let collection = client.CreateDocumentCollectionQuery(database.CollectionsLink).Where(fun dc -> dc.Id = "houseassignment").ToArray().FirstOrDefault()
        let documentLink = collection.SelfLink
        let queryString = "SELECT * FROM houseassignment WHERE houseassignment.houseIndex = " + index.ToString()
        let query = client.CreateDocumentQuery(documentLink, queryString)
        match Seq.length query with
        | 0 -> None
        | _ ->
            let firstValue = query |> Seq.head
            let assignment = HouseAssignment.Parse(firstValue.ToString())
            Some assignment.Schools
    with
    | :? HttpRequestException as ex -> None

The next step was to create an array of random index numbers.  I found this really good extension method to System.Random to populate the array.  The next question was “how big does the sample size have to be to get most/all of the schools?”  I started seeding the array with different values and ran these functions:

let random = new System.Random(42)
let indexes = random.GetValues(1, 350000) |> Seq.take 10000 |> Seq.toArray
let allSchools = indexes |> Seq.map (fun i -> getSchools i) |> Seq.toArray

let getNumberOfSchools (trial:int) =
    let trialSchools = allSchools.[1..trial]
    let allSchools' = trialSchools |> Seq.filter (fun s -> s.IsSome)
    let allSchools'' = allSchools' |> Seq.collect (fun s -> s.Value)
    let uniqueSchools = allSchools'' |> Seq.distinct
    uniqueSchools |> Seq.length

let trialCount = [|1..9999|]

trialCount
|> Seq.map (fun t -> t, getNumberOfSchools t)
|> Seq.iter (fun (t, c) -> printfn "%A %A" t c)

The sample above uses 10,000 records, which is pretty good.  If you graph the results, you can see that the distinct school count maxes out at around 2,500 samples.
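The saturation behavior can be simulated without touching the database.  Here is a minimal sketch (the pool size of 170 schools is a made-up figure for illustration, not the actual WCPSS count):

```fsharp
// Simulate sampling with replacement from a fixed pool of school ids and
// track how the distinct count saturates as the sample grows.
let simulateDistinctCounts (poolSize: int) (sampleSize: int) (seed: int) =
    let random = System.Random(seed)
    let draws = Array.init sampleSize (fun _ -> random.Next(poolSize))
    // running distinct count after each additional draw
    draws
    |> Array.scan (fun (seen: Set<int>) d -> Set.add d seen) Set.empty
    |> Array.map Set.count

let counts = simulateDistinctCounts 170 10000 42
printfn "after 2,500 draws: %d distinct" counts.[2500]
printfn "after 10,000 draws: %d distinct" counts.[10000]
```

Graphing counts against the draw index gives the same flattening curve: most of the pool shows up early and the long tail takes disproportionately many samples.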


Unfortunately, there were 11 schools on the report card that were not in the 10,000-record sample.  Confronted with this reality, I did what any reasonable researcher would do… I dropped them.  My guess is that these schools are not part of a base school pyramid; rather, they are “application schools” like STEM or leadership academies.

In any event, with the list of schools in hand, I copied them into Excel and sorted them alphabetically.  I then put the school score list next to them and started matching.  Within 15 minutes, I had a credible crosswalk table.
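Once the matching is done, the crosswalk is just a lookup from one naming convention to the other.  A minimal sketch with hypothetical entries (the real table came out of the Excel exercise):

```fsharp
// Hypothetical sample entries: screen-scraped site name -> report card name
let crosswalk =
    [ "APEX ELEM", "Apex Elementary"
      "GREEN HOPE HS", "Green Hope High" ]
    |> Map.ofList

// Map.tryFind returns None for the schools that were dropped from the table
let lookupReportCardName siteName = crosswalk |> Map.tryFind siteName
```

Here `lookupReportCardName "APEX ELEM"` returns `Some "Apex Elementary"`, while a dropped application school falls through as `None`.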


You can see the gist here

Predictive Analytics With Microsoft Azure Machine Learning

(On vacation this week)

Over the Christmas holiday, I had some time to look at some of the books that have been sitting on my bookshelf.  One of these was Predictive Analytics with Microsoft Azure Machine Learning by Barga, Fontama, and Tok.


This book is a great introduction to both analytics and Azure ML.  I really appreciated how the authors started off with a couple of basic experiments to get your feet wet, then moved over to some theory about different ML techniques, and then finished out the rest of the book with some hands-on labs.

I worked through all of the labs (except one) in about six hours.  The labs follow a very nice step-by-step pattern with plenty of screenshots.  My only quibble with the book is that the most interesting lab, Building a Churn Model, relied on data from a third party.  When I went to the third party’s website to download the data, I found broken links and 404s.  I went to the book’s site at Apress, and it did not have the data either.  That was kinda frustrating and something that the authors should have considered.

In any event, if you have some time, working through Predictive Analytics with Microsoft Azure Machine Learning is well worth it and quite fun.

Aggregation of WCPSS Tax Records with School Assignment

For the next part of my WCPSS hit parade, I need a way of combining the screen scrape that I did of the Wake County tax records as described here and the screen scrape of the Wake County Public School assignments as found here.  Getting data from the DocumentDb is straightforward as long as you don’t ask too much of the query syntax.

I created two functions that pull the tax record and the school assignment via the index number:

let getAssignment (id:int) =
    let collection = client.CreateDocumentCollectionQuery(database.CollectionsLink).Where(fun dc -> dc.Id = "houseassignment").ToArray().FirstOrDefault()
    let documentLink = collection.SelfLink
    let queryString = "SELECT * FROM houseassignment WHERE houseassignment.houseIndex = " + id.ToString()
    let query = client.CreateDocumentQuery(documentLink, queryString)
    match query |> Seq.length with
    | 0 -> None
    | _ ->
        let assignmentValue = query |> Seq.head
        let assignment = HouseAssignment.Parse(assignmentValue.ToString())
        Some assignment

let getValuation (id:int) =
    let collection = client.CreateDocumentCollectionQuery(database.CollectionsLink).Where(fun dc -> dc.Id = "taxinformation").ToArray().FirstOrDefault()
    let documentLink = collection.SelfLink
    // use the id parameter rather than a hard-coded index
    let queryString = "SELECT * FROM taxinformation WHERE taxinformation.index = " + id.ToString()
    let query = client.CreateDocumentQuery(documentLink, queryString)
    match query |> Seq.length with
    | 0 -> None
    | _ ->
        let valuationValue = query |> Seq.head
        let valuation = HouseValuation.Parse(valuationValue.ToString())
        Some valuation

Note that option types are being used because there are many index values that do not have a corresponding record.  Also, there might be a situation where the assignment has a record but the valuation does not, and vice versa, so I created a function that only puts the records together when both exist:

let assignSchoolTaxBase (id:int) =
    let assignment = getAssignment id
    let valuation = getValuation id
    match assignment.IsSome, valuation.IsSome with
    | true, true ->
        assignment.Value.Schools
        |> Seq.map (fun s -> s, valuation.Value.AssessedValue)
        |> Some
    | _ -> None

And running this on the first record, we get the expected result.


Also, running it on an index where there is no record, we also get the expected result.


With the matching working, we need a way of bringing all of the school arrays together and then aggregating the tax valuations.  I decided to take a step-by-step approach to this, even though there might be a more terse way to write it.

#time
indexes
|> Seq.map (fun i -> assignSchoolTaxBase i)
|> Seq.filter (fun s -> s.IsSome)
|> Seq.collect (fun s -> s.Value)
|> Seq.groupBy (fun (s, av) -> s)
|> Seq.map (fun (s, ss) -> s, ss |> Seq.sumBy (fun (s, av) -> av))
|> Seq.toArray

When I run it on the first 10 records, the values come back as expected.
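The aggregation step can also be checked in isolation with in-memory data; the made-up pairs below stand in for what assignSchoolTaxBase produces after flattening:

```fsharp
// (school, assessedValue) pairs, already flattened across houses
let sample =
    [ "Apex Elementary", 250000
      "Apex High", 250000
      "Apex Elementary", 300000 ]

let totals =
    sample
    |> Seq.groupBy fst                      // group the pairs by school name
    |> Seq.map (fun (school, pairs) ->
        school, pairs |> Seq.sumBy snd)     // total valuation per school
    |> Seq.toList
// [("Apex Elementary", 550000); ("Apex High", 250000)]
```

Seq.groupBy keeps the keys in order of first appearance, which is why the two Apex Elementary valuations collapse into one row.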


So the last step is to run it on all 350,000 indexes (let indexes = [|1..350000|]).  The problem is that after a long period of time, things were not returning.  So this is where the power of Azure comes in: there is no problem so large that I can’t throw more cores at it.  I went to the management portal and increased the VM to 8 cores.


I then went into the code base and added PSeq for the database calls (which I assumed were taking the longest time):

#time
let indexes = [|1..350000|]
let assignedValues =
    indexes
    |> PSeq.map (fun i -> assignSchoolTaxBase i)
    |> Seq.toArray

let filePath = @"C:\Git\WakeCountySchoolScores\SchoolValuation.csv"

assignedValues
|> Seq.filter (fun s -> s.IsSome)
|> Seq.collect (fun s -> s.Value)
|> Seq.groupBy (fun (s, av) -> s)
|> Seq.map (fun (s, ss) -> s, ss |> Seq.sumBy (fun (s, av) -> av))
|> Seq.map (fun (s, v) -> s + "," + v.ToString() + Environment.NewLine)
|> Seq.iter (fun s -> File.AppendAllText(filePath, s))
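PSeq comes from the FSharp.Collections.ParallelSeq package.  The same fan-out idea can be sketched with the core library’s Array.Parallel, which needs no extra dependency (the Thread.Sleep and the even/odd filter below are stand-ins for the DocumentDb round trip and its option result):

```fsharp
// Fan an expensive per-index call out across cores with Array.Parallel.map.
let expensiveLookup (i: int) =
    System.Threading.Thread.Sleep 1        // stand-in for the network call
    if i % 2 = 0 then Some i else None     // stand-in for the option result

let results =
    [| 1 .. 100 |]
    |> Array.Parallel.map expensiveLookup
    |> Array.choose id                     // keep only the Some values
```

Array.Parallel.map preserves index order, so the downstream groupBy/sumBy pipeline does not care that the calls ran concurrently.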

and after 2 hours:


Combining Wake County Real Estate Lookup with Wake County School Assignment

As a follow-up to this post and this post, I want to combine looking up Wake County real estate valuations with the Wake County school assignment.  The matching value between the two datasets is the house address.

The first thing I did was to create a new script file in the project.  I then added a reference to the script that does the WCPSS lookup.  I then added a JSON type provider that will serve as the type of the Wake County real estate valuation data that was stored previously in a DocumentDb instance.

#r "../packages/FSharp.Data.2.1.1/lib/net40/FSharp.Data.dll"
#r "../packages/Microsoft.Azure.Documents.Client.0.9.2-preview/lib/net40/Microsoft.Azure.Documents.Client.dll"
#r "../packages/Newtonsoft.Json.4.5.11/lib/net40/Newtonsoft.Json.dll"

#load "SchoolAssignments.fsx"

open System
open System.IO
open FSharp.Data
open System.Linq
open SchoolAssignments
open Microsoft.Azure.Documents
open Microsoft.Azure.Documents.Client
open Microsoft.Azure.Documents.Linq

type HouseValuation = JsonProvider<"../data/HouseValuationSample.json">

The house valuation JSON looks like this:


{
  "index": 1,
  "addressOne": "1506 WAKE FOREST RD ",
  "addressTwo": "RALEIGH NC 27604-1331",
  "addressThree": " ",
  "assessedValue": "$34,848",
  "id": "c0e931de-68b8-452e-8365-66d3a4a93483",
  "_ts": 1423934277,
  "_self": "dbs/pmVVAA==/colls/pmVVALZMZAE=/docs/pmVVALZMZAEBAAAAAAAAAA==/",
  "_etag": "\"0000c100-0000-0000-0000-54df83450000\"",
  "_attachments": "attachments/"
}



The first method pulls the data from the DocumentDb and deserializes it into an instance of the type:

let getPropertyValue (id: int) =
    let endpointUrl = ""
    let authKey = ""
    let client = new DocumentClient(new Uri(endpointUrl), authKey)
    let database = client.CreateDatabaseQuery().Where(fun db -> db.Id = "wakecounty").ToArray().FirstOrDefault()
    let collection = client.CreateDocumentCollectionQuery(database.CollectionsLink).Where(fun dc -> dc.Id = "taxinformation").ToArray().FirstOrDefault()
    let documentLink = collection.SelfLink
    let queryString = "SELECT * FROM taxinformation WHERE taxinformation.index = " + id.ToString()
    let query = client.CreateDocumentQuery(documentLink, queryString)
    // return an option so it lines up with the functions that consume it
    match query |> Seq.length with
    | 0 -> None
    | _ ->
        let firstValue = query |> Seq.head
        Some (HouseValuation.Parse(firstValue.ToString()))

The next method uses the school lookup script to pull the data from the WCPSS site.  The only real gotcha was that the space delimiter (char 32) was not the only delimiter in the addresses.  The WCPSS site also mixes in the hard break (char 160).  It took me about an hour to figure out why an address was not breaking into an array of words when splitting on “ ” alone.  <sigh>

let createSchoolAssignmentSearchCriteria (houseValuation: option<HouseValuation.Root>) =
    match houseValuation.IsSome with
    | true ->
        let deliminators = [|(char)32; (char)160|]
        let addressOneTokens = houseValuation.Value.AddressOne.Split(deliminators)
        let streetNumber = addressOneTokens.[0]
        let streetTemplateValue = addressOneTokens.[1]
        let streetName = addressOneTokens.[1..] |> Array.reduce (fun acc t -> acc + "+" + t)
        let addressTwoTokens = houseValuation.Value.AddressTwo.Split(deliminators)
        let city = addressTwoTokens.[0]
        let streetName' = streetName + city
        Some {SearchCriteria.streetTemplateValue = streetTemplateValue;
              streetName = streetName';
              streetNumber = streetNumber;}
    | false -> None
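The two delimiters look identical on screen, which is what made the bug so hard to spot.  A quick way to see the difference (the address string here is made up):

```fsharp
// char 32 is an ordinary space; char 160 is a non-breaking space, which the
// WCPSS markup mixes in. Splitting on " " alone leaves tokens glued together.
let withHardSpace = "1506" + string (char 160) + "WAKE FOREST RD"

let spaceOnly = withHardSpace.Split([| (char)32 |])
let both      = withHardSpace.Split([| (char)32; (char)160 |])
// spaceOnly has 3 tokens ("1506" stays fused to "WAKE"); both has 4
```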

In any event, the last piece was to take the value and push it back up to another DocumentDb collection:

let writeSchoolAssignmentToDocumentDb (houseAssignment: option<HouseAssignment>) =
    match houseAssignment.IsSome with
    | true ->
        let endpointUrl = ""
        let authKey = ""
        let client = new DocumentClient(new Uri(endpointUrl), authKey)
        let database = client.CreateDatabaseQuery().Where(fun db -> db.Id = "wakecounty").ToArray().FirstOrDefault()
        let collection = client.CreateDocumentCollectionQuery(database.CollectionsLink).Where(fun dc -> dc.Id = "houseassignment").ToArray().FirstOrDefault()
        let documentLink = collection.SelfLink
        client.CreateDocumentAsync(documentLink, houseAssignment.Value) |> ignore
    | false -> ()

With that in place, the final function puts it all together:

let createHouseAssignment (id:int) =
    let houseValuation = getPropertyValue id
    let schools =
        houseValuation
        |> createSchoolAssignmentSearchCriteria
        |> createSearchCriteria'
        |> createPage2QueryString
        |> getSchoolData
    match schools.IsSome with
    | true -> Some {houseIndex = houseValuation.Value.Index; schools = schools.Value}
    | false -> None
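As an aside, the same shape of pipeline can be expressed with Option.bind, which threads the Nones through without any explicit matching.  A sketch with stand-in steps (parseStreetNumber and validate are hypothetical, not part of the real lookup):

```fsharp
// Each step can fail, so each returns an option; Option.bind short-circuits
// the rest of the pipeline as soon as any step returns None.
let parseStreetNumber (s: string) =
    match System.Int32.TryParse s with
    | true, n -> Some n
    | _ -> None

let validate n = if n > 0 then Some n else None

let lookupStreetNumber (address: string) =
    address.Split(' ')
    |> Array.tryHead                  // first token, if any
    |> Option.bind parseStreetNumber  // must be numeric
    |> Option.bind validate           // must be positive
```

Here `lookupStreetNumber "1506 WAKE FOREST RD"` yields `Some 1506`, while `lookupStreetNumber "WAKE FOREST RD"` falls through to `None` at the parse step.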

and now we have an end-to-end way of combining the content from two different sites:

//#time
//[1..100] |> Seq.iter(fun id -> generateHouseAssignment id)

gives this:


You can see the gist here