Predictive Analytics With Microsoft Azure Machine Learning

(On vacation this week)

Over the Christmas holiday, I had some time to look at some of the books that have been sitting on my bookshelf. One of these was Predictive Analytics With Microsoft Azure Machine Learning by Barga, Fontama, and Tok.

image

This book is a great introduction to both analytics and Azure ML. I really appreciated how the authors started off with a couple of basic experiments to get your feet wet, then moved on to some theory about different ML techniques, and then finished out the rest of the book with some hands-on labs.

I worked through all of the labs (except one) in about 6 hours. The labs follow a very nice step-by-step pattern with plenty of screenshots. My only quibble with the book is that the most interesting lab, Building a Churn Model, relied on data from a third party. When I went to the third party’s website to download the data, I found broken links and 404s. I went to the book’s site at Apress and it did not have the data either. That was kinda frustrating and something that the authors should have considered.

In any event, if you have some time, working through Predictive Analytics With Microsoft Azure Machine Learning is well worth the time and is quite fun.

Aggregation of WCPSS Tax Records with School Assignment

For the next part of my WCPSS hit parade, I need a way of combining the screen scrape that I did from the Wake County Tax Records as described here and the screen scrape of the Wake County Public School Assignments as found here. Getting data from the DocumentDb is straightforward as long as you don’t ask too much from the query syntax.

I created two functions that pull the tax record and the school assignment via the index number:

    // Look up the school assignment for a house index; None when no record exists
    let getAssignment (id:int) =
        let collection = client.CreateDocumentCollectionQuery(database.CollectionsLink).Where(fun dc -> dc.Id = "houseassignment").ToArray().FirstOrDefault()
        let documentLink = collection.SelfLink
        let queryString = "SELECT * FROM houseassignment WHERE houseassignment.houseIndex = " + id.ToString()
        let query = client.CreateDocumentQuery(documentLink,queryString)
        match query |> Seq.length with
        | 0 -> None
        | _ ->
            let assignmentValue = query |> Seq.head
            let assignment = HouseAssignment.Parse(assignmentValue.ToString())
            Some assignment

    // Look up the tax valuation for a house index; None when no record exists
    let getValuation (id:int) =
        let collection = client.CreateDocumentCollectionQuery(database.CollectionsLink).Where(fun dc -> dc.Id = "taxinformation").ToArray().FirstOrDefault()
        let documentLink = collection.SelfLink
        // Filter on the id that was passed in (rather than a hard-coded index of 1)
        let queryString = "SELECT * FROM taxinformation WHERE taxinformation.index = " + id.ToString()
        let query = client.CreateDocumentQuery(documentLink,queryString)
        match query |> Seq.length with
        | 0 -> None
        | _ ->
            let valuationValue = query |> Seq.head
            let valuation = HouseValuation.Parse(valuationValue.ToString())
            Some valuation

Note that option types are being used because there are many index values where there is not a corresponding record. Also, there might be a situation where the assignment has a record but the valuation does not, and vice versa, so I created a function that only puts the records together when both records exist:

    let assignSchoolTaxBase (id:int) =
        let assignment = getAssignment(id)
        let valuation = getValuation(id)
        // Only pair schools with a valuation when both records exist
        match assignment, valuation with
        | Some a, Some v ->
            a.Schools
            |> Seq.map(fun s -> s, v.AssessedValue)
            |> Some
        | _ -> None

And running this on the first record, we get the expected result.

image

Also, running it on an index where there is not a record, we also get the expected result:

image

With the matching working, we need a way of bringing all of the school arrays together and then aggregating the tax valuation. I decided to take a step-by-step approach to this, even though there might be a terser way to write it.

    #time
    indexes
    |> Seq.map(fun i -> assignSchoolTaxBase(i))
    |> Seq.filter(fun s -> s.IsSome)
    |> Seq.collect(fun s -> s.Value)
    |> Seq.groupBy(fun (s,av) -> s)
    |> Seq.map(fun (s,ss) -> s,ss |> Seq.sumBy(fun (s,av)-> av))
    |> Seq.toArray

When I run it on the 1st 10 records, the values come back as expected

image
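For what it's worth, here is a minimal sketch of one terser way to write the same aggregation. It runs on made-up in-memory data rather than DocumentDb: `sampleResults` and `aggregateBySchool` are hypothetical names, and the assessed values are assumed to already be decimals.

```fsharp
// Hypothetical sample: each element mimics the output of assignSchoolTaxBase,
// i.e. Some (school, assessedValue) pairs for a house, or None when a record is missing.
let sampleResults : seq<string * decimal> option list =
    [ Some (seq [ "Elementary A", 100000M; "Middle B", 100000M ])
      None
      Some (seq [ "Elementary A", 250000M; "Middle C", 250000M ]) ]

let aggregateBySchool (results: seq<seq<string * decimal> option>) =
    results
    |> Seq.choose id      // drop the None values and unwrap the Some values
    |> Seq.collect id     // flatten to a single sequence of (school, value) pairs
    |> Seq.groupBy fst    // group by school name
    |> Seq.map (fun (school, pairs) -> school, pairs |> Seq.sumBy snd)
    |> Seq.toArray

let totals = aggregateBySchool sampleResults
printfn "%A" totals
```

Seq.choose id does the work of the filter on IsSome plus the .Value access in one step, so there is no chance of calling .Value on a None.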

So the last step is to run it on all 350,000 indexes (let indexes = [|1..350000|]). The problem is that after a long period of time, things were not returning. So this is where the power of Azure comes in -> there is no problem so large I can’t throw more cores at it. I went to the management portal and increased the VM to 8 cores:

Capture

I then went into the code base and added PSeq for the database calls (which I assume were taking the longest time):

    #time
    let indexes = [|1..350000|]
    // Run the database lookups in parallel
    let assignedValues = indexes |> PSeq.map(fun i -> assignSchoolTaxBase(i)) |> Seq.toArray

    let filePath = @"C:\Git\WakeCountySchoolScores\SchoolValuation.csv"

    assignedValues
    |> Seq.filter(fun s -> s.IsSome)
    |> Seq.collect(fun s -> s.Value)
    |> Seq.groupBy(fun (s,av) -> s)
    |> Seq.map(fun (s,ss) -> s,ss |> Seq.sumBy(fun (s,av)-> av))
    |> Seq.map(fun (s,v) -> s + "," + v.ToString() + Environment.NewLine)
    |> Seq.iter(fun (s) -> File.AppendAllText(filePath, s))

and after 2 hours:

image
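As a rough illustration of the same shape without the PSeq dependency, the sketch below uses Array.Parallel.map from FSharp.Core. The `lookup` function is a hypothetical stand-in for assignSchoolTaxBase (it fakes the data instead of calling DocumentDb), and the file goes to the temp folder rather than the path used above.

```fsharp
open System.IO

// Hypothetical stand-in for assignSchoolTaxBase: pretend every third index has no record.
let lookup (id: int) : (string * decimal) list option =
    if id % 3 = 0 then None
    else Some [ (sprintf "School %d" (id % 2), decimal id * 1000M) ]

let indexes = [| 1 .. 12 |]

// Array.Parallel.map spreads the lookups across the available cores
// and keeps the results in the same order as the input.
let assignedValues = indexes |> Array.Parallel.map lookup

let lines =
    assignedValues
    |> Array.choose id
    |> Array.collect List.toArray
    |> Array.groupBy fst
    |> Array.map (fun (school, pairs) -> sprintf "%s,%M" school (pairs |> Array.sumBy snd))

// Write the file once instead of appending one line at a time.
let filePath = Path.Combine(Path.GetTempPath(), "SchoolValuation.csv")
File.WriteAllLines(filePath, lines)
```

Whether more cores actually help depends on where the time goes. If the lookups are mostly waiting on the network, the number of concurrent requests matters more than the core count.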

Combining Wake County Real Estate Lookup with Wake County School Assignment

As a follow up to this post and this post, I want to combine looking up Wake County Real Estate valuation with the Wake County School Assignment. The matching value between the two datasets is the house address.

The first thing I did was to create a new script file in the project. I then added a reference to the script that does the WCPSS lookup. I then added a Json provider that will serve as the type of the Wake County Real Estate Valuation data that was stored previously in a DocumentDb instance.

    #r "../packages/FSharp.Data.2.1.1/lib/net40/FSharp.Data.dll"
    #r "../packages/Microsoft.Azure.Documents.Client.0.9.2-preview/lib/net40/Microsoft.Azure.Documents.Client.dll"
    #r "../packages/Newtonsoft.Json.4.5.11/lib/net40/Newtonsoft.Json.dll"

    #load "SchoolAssignments.fsx"

    open System
    open System.IO
    open FSharp.Data
    open System.Linq
    open SchoolAssignments
    open Microsoft.Azure.Documents
    open Microsoft.Azure.Documents.Client
    open Microsoft.Azure.Documents.Linq

    type HouseValuation = JsonProvider<"../data/HouseValuationSample.json">

The house valuation json looks like this:

{
  "index": 1,
  "addressOne": "1506 WAKE FOREST RD ",
  "addressTwo": "RALEIGH NC 27604-1331",
  "addressThree": " ",
  "assessedValue": "$34,848",
  "id": "c0e931de-68b8-452e-8365-66d3a4a93483",
  "_rid": "pmVVALZMZAEBAAAAAAAAAA==",
  "_ts": 1423934277,
  "_self": "dbs/pmVVAA==/colls/pmVVALZMZAE=/docs/pmVVALZMZAEBAAAAAAAAAA==/",
  "_etag": "\"0000c100-0000-0000-0000-54df83450000\"",
  "_attachments": "attachments/"
}

The first method pulls the data from the DocumentDb and deserializes it into an instance of the type:

    // Pull the tax record for an index; returns None when no document matches
    let getPropertyValue(id: int)=
        let endpointUrl = ""
        let authKey = ""
        let client = new DocumentClient(new Uri(endpointUrl), authKey)
        let database = client.CreateDatabaseQuery().Where(fun db -> db.Id = "wakecounty" ).ToArray().FirstOrDefault()
        let collection = client.CreateDocumentCollectionQuery(database.CollectionsLink).Where(fun dc -> dc.Id = "taxinformation").ToArray().FirstOrDefault()
        let documentLink = collection.SelfLink
        let queryString = "SELECT * FROM taxinformation WHERE taxinformation.index = " + id.ToString()
        let query = client.CreateDocumentQuery(documentLink,queryString)
        // Wrap the result in an option because the functions downstream expect option<HouseValuation.Root>
        match query |> Seq.length with
        | 0 -> None
        | _ ->
            let firstValue = query |> Seq.head
            Some (HouseValuation.Parse(firstValue.ToString()))

The next method uses the School Lookup script to pull the data from the WCPSS site. The only real gotcha was that the space delimiter (char 32) was not the only character needed to split the address. The WCPSS site also added in a non-breaking space (char 160). It took me about an hour to figure out why “” was not breaking into an array of words via splitting on “ ”. <sigh>

    let createSchoolAssignmentSearchCriteria(houseValuation: option<HouseValuation.Root>) =
        match houseValuation with
        | Some valuation ->
            // Split on both the regular space (32) and the non-breaking space (160)
            let deliminators = [|char 32; char 160|]
            let addressOneTokens = valuation.AddressOne.Split(deliminators)
            let streetNumber = addressOneTokens.[0]
            let streetTemplateValue = addressOneTokens.[1]
            let streetName = addressOneTokens.[1..] |> Array.reduce(fun acc t -> acc + "+" + t)
            let addressTwoTokens = valuation.AddressTwo.Split(deliminators)
            let city = addressTwoTokens.[0]
            let streetName' = streetName + city
            Some {SearchCriteria.streetTemplateValue=streetTemplateValue;
                  streetName=streetName';
                  streetNumber=streetNumber;}
        | None -> None
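The char 160 gotcha is easy to reproduce in isolation. The snippet below is a small sketch with a made-up address string in which one of the gaps is a non-breaking space; it is not the actual value that came back from the site.

```fsharp
// Hypothetical address: the gap between "FOREST" and "RD" is a non-breaking
// space (char 160), which looks identical to a normal space on screen.
let address = "1506 WAKE FOREST" + string (char 160) + "RD"

let spaceOnly = address.Split([| ' ' |])
let bothSpaces = address.Split([| char 32; char 160 |])

printfn "%d tokens when splitting on the space only" spaceOnly.Length
printfn "%d tokens when splitting on both characters" bothSpaces.Length
```

Splitting on the space alone leaves "FOREST" and "RD" glued together in a single token, which is the symptom described above.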

In any event, the last piece was to take the value and push it back up to another DocumentDb collection:

    let writeSchoolAssignmentToDocumentDb(houseAssignment:option<HouseAssignment>) =
        match houseAssignment with
        | Some assignment ->
            let endpointUrl = ""
            let authKey = ""
            let client = new DocumentClient(new Uri(endpointUrl), authKey)
            let database = client.CreateDatabaseQuery().Where(fun db -> db.Id = "wakecounty" ).ToArray().FirstOrDefault()
            let collection = client.CreateDocumentCollectionQuery(database.CollectionsLink).Where(fun dc -> dc.Id = "houseassignment").ToArray().FirstOrDefault()
            let documentLink = collection.SelfLink
            // Fire and forget: the returned task is not awaited
            client.CreateDocumentAsync(documentLink, assignment) |> ignore
        | None -> ()

With that in place, the final function puts it all together:

    let createHouseAssignment(id:int)=
        let houseValuation = getPropertyValue(id)
        let schools =
            houseValuation
            |> createSchoolAssignmentSearchCriteria
            |> createSearchCriteria'
            |> createPage2QueryString
            |> getSchoolData
        match schools.IsSome with
        | true -> Some {houseIndex=houseValuation.Value.Index; schools=schools.Value}
        | false -> None

and now we have an end-to-end way of combining the content on two different sites:

    //#time
    //[1..100] |> Seq.iter(fun id -> generateHouseAssignment id)

gives this:

imageimage

You can see the gist here

Analytics in the Microsoft Stack

Disclaimer:  I really don’t know what I am talking about

I received an email from a coworker/friend yesterday with this in the body:

So, I have a friend who works for a major supermarket chain. In IT, they are straight out of the year 2000. They have tons and tons of data in SQL Server and I think Oracle. The industrial engineers (who do all of the planning) ask the IT group to run queries throughout the day, which takes hours to run. They use Excel for most of their processing. On the weekends, they run reporting queries which take hours and hours to run – all to get just basic information.

This got my wheels spinning about how I would approach the problem with the analytics toolset that I know is available. The supermarket chain has a couple of problems:

  • Lots of data that takes too long to munge through
  • The planners are dependent on IT group for processing the data

I would expect the official Microsoft answer is that they should implement SQL Server Analysis Services with Power BI. I would assume that if the group threw enough resources at this solution, it would work. I then thought of a couple of alternative paths:

The first thing that comes to mind is using HDInsight (Microsoft’s Hadoop product)  on Azure.  That way the queries can run in a distributed manner and they can provision machines as they need them -> and when they are not running their queries, they can de-allocate the machines.

The second thought is using AzureML to do their model generation.  However, depending on the size of the datasets, AzureML may not be able to scale.  I have only used Azure ML on smaller datasets.

The third thought was using R?  I don’t think R is the best answer here.  Everything I know about R is that it is designed for data exploration and analysis of datasets that comfortably fit into the local machine’s memory.  Performance on R is horrible and scaling R is a real challenge. 

What about F#? So this might be a good answer. If you use the Hive Type Provider, you can get the benefits of HDInsight to do the processing and then have the goodness of the language syntax and REPL for data exploration. Also, the group could look at MBrace for some kick-butt distributed processing that can scale on Azure. Finally, if they do come up with some kind of insight that lends itself to building analytics or models into an app, they can take the code out of the script file and stick it into a compilable assembly, all within Visual Studio.

What about Python? No idea, I don’t know enough about it.

What about Matlab, SAS, etc.? No idea. I stopped using those tools when R showed up.

What about Watson?  No idea.  I think I will have a better idea once I go to this.

Parsing Wake County School System Attendance Assignment Site With F#

As a follow up to this post, I then turned my attention to parsing the Wake County Public School Assignment Site. If you are not familiar, large school districts in America have a concept of ‘nodes’ where a child is assigned to a school pyramid (elementary, middle, high schools) based on their home address. This gives the school attendance assignment tremendous power because a house’s value is directly tied to how “good” (real or perceived) its assigned school pyramid is. WCPSS has a site here where you can enter your address and find out the school pyramid.

Since there is not a public API or even a publicly available dataset, I decided to see if I could screen scrape the site. The first challenge is that you need to navigate through 2 pages to get to your answer. Here is the Fiddler trace:

image

The first mistake you will notice is that they are using php.  The second is that they are using the same uri and they are parameterizing the requests via the form value:

image

Finally, their third mistake is that the pages come back in an inconsistent way, making the DOM traversal more challenging.

Undaunted, I fired up Visual Studio. Because there are 2 pages that need to be used, I imported both of them as the models for the HtmlTypeProvider:

image

I then pulled out the form query strings and placed them into some values. The code so far:

    #r "../packages/FSharp.Data.2.1.1/lib/net40/FSharp.Data.dll"

    open System.Net
    open FSharp.Data

    type context = HtmlProvider<"../data/HouseSearchSample01.html">
    type context' = HtmlProvider<"../data/HouseSearchSample02.html">

    let uri = "http://wwwgis2.wcpss.net/addressLookup/index.php"
    let streetLookup = "StreetTemplateValue=STRATH&StreetName=Strathorn+Dr+Cary&StreetNumber=904&SubmitAddressSelectPage=CONTINUE&DefaultAction=SubmitAddressSelectPage"
    let streetLookup' = "SelectAssignment%7C2014%7CCURRENT=2014-15&DefaultAction=SelectAssignment%7C2014%7CCURRENT&DefaultAction=SelectAssignment%7C2015%7CCURRENT&CatchmentCode=CA+0198.2&StreetName=Strathorn+Dr+Cary&StreetTemplateValue=STRATH&StreetNumber=904&StreetZipCode=27519"

Skipping the 1st page, I decided to make a request and see if I could get the school information out of the DOM. It worked well enough, but you can see the immediate problem -> the page’s structure varies, so just grabbing the nth element of the table will not work:

    let webClient = new WebClient()
    webClient.Headers.Add("Content-Type", "application/x-www-form-urlencoded")
    let result = webClient.UploadString(uri,"POST",streetLookup')
    let body = context'.Parse(result).Html.Body()

    let tables = body.Descendants("TABLE") |> Seq.toList
    let schoolTable = tables.[0]
    let schoolRows = schoolTable.Descendants("TR") |> Seq.toList
    let elementaryDatas = schoolRows.[0].Descendants("TD") |> Seq.toList
    let elementarySchool = elementaryDatas.[1].InnerText()
    let middleSchoolDatas = schoolRows.[1].Descendants("TD") |> Seq.toList
    let middleSchool = middleSchoolDatas.[1].InnerText()
    //Need to skip a row for the enrollment cap message
    let highSchoolDatas = schoolRows.[3].Descendants("TD") |> Seq.toList
    let highSchool = highSchoolDatas.[1].InnerText()

 

image

I decided to take the dog for a walk and that time away from the keyboard was very helpful because I realized that although the table is not consistent, I don’t need it to be for my purposes. All I need are the school names for a given address. What I need to do is remove all of the noise and just find the rows of the table with useful data:

    let webClient = new WebClient()
    webClient.Headers.Add("Content-Type", "application/x-www-form-urlencoded")
    let result = webClient.UploadString(uri,"POST",streetLookup')
    let body = context'.Parse(result).Html.Body()

    let tables = body.Descendants("TABLE") |> Seq.toList
    let schoolTable = tables.[0]
    let schoolRows = schoolTable.Descendants("TR") |> Seq.toList
    let schoolData = schoolRows |> Seq.collect(fun r -> r.Descendants("TD")) |>Seq.toList
    let schoolData' = schoolData |> Seq.map(fun d -> d.InnerText().Trim())
    let schoolData'' = schoolData' |> Seq.filter(fun s -> s <> System.String.Empty)

    //Strip out noise: drop everything from the first '(' onwards
    let removeNonEssentialData (s:string) =
        let markerPosition = s.IndexOf('(')
        match markerPosition with
        | -1 -> s
        | _ -> s.Substring(0,markerPosition).Trim()

    let schoolData''' = schoolData'' |> Seq.map(fun s -> removeNonEssentialData(s))

    let unimportantPhrases = [|"Neighborhood Busing";"This school has an enrollment cap"|]
    let containsUnimportantPhrase (s:string) =
        unimportantPhrases |> Seq.exists(fun p -> s.Contains(p))

    let schoolData'''' = schoolData''' |> Seq.filter(fun s -> containsUnimportantPhrase(s) = false )

    schoolData''''

And Boom goes the dynamite:

image
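The noise-stripping idea can be tried without hitting the site at all. Below is a minimal sketch that runs the same steps over a hard-coded list; the cell text and school names are invented for illustration, and `cleanCells` is a hypothetical name.

```fsharp
let unimportantPhrases = [ "Neighborhood Busing"; "This school has an enrollment cap" ]

// Drop everything from the first '(' onwards, e.g. "Example Middle (6-8)" -> "Example Middle"
let removeParenthetical (s: string) =
    match s.IndexOf('(') with
    | -1 -> s
    | i -> s.Substring(0, i).Trim()

let cleanCells (cells: seq<string>) =
    cells
    |> Seq.map (fun c -> c.Trim())
    |> Seq.filter (fun c -> c <> "")
    |> Seq.map removeParenthetical
    |> Seq.filter (fun c -> unimportantPhrases |> List.exists (fun p -> c.Contains(p)) |> not)
    |> Seq.toList

// Hypothetical cell text standing in for the scraped <TD> values
let sampleCells =
    [ " Example Elementary (K-5) "
      ""
      "Neighborhood Busing available"
      "Example Middle (6-8)"
      "This school has an enrollment cap (see notes)"
      "Example High" ]

let schools = cleanCells sampleCells
printfn "%A" schools
```

Because the filters only look at the text of each cell, the result does not depend on which row or column a school happens to land in.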

So working backwards, I need to parse the 1st page to get the CatchmentCode for an address, build the second page’s form data, and then parse the results. Parsing the 1st page for the CatchmentCode was very straightforward:

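    let result = webClient.UploadString(uri,"POST",streetLookup)
    let body = context.Parse(result).Html.Body()
    let inputs = body.Descendants("INPUT") |> Seq.toList
    // Assumed step (the original is only visible in the screenshot): pair each input's name with its value
    let inputs' = inputs |> Seq.map(fun i -> i.AttributeValue("name"), i.AttributeValue("value"))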

image

    let catchmentCode =
        inputs'
        |> Seq.filter(fun (n,v) -> n = "CatchmentCode")
        |> Seq.map(fun (n,v) -> v)
        |> Seq.head
    let streetName =
        inputs'
        |> Seq.filter(fun (n,v) -> n = "StreetName")
        |> Seq.map(fun (n,v) -> v)
        |> Seq.head
    let streetTemplateValue =
        inputs'
        |> Seq.filter(fun (n,v) -> n = "StreetTemplateValue")
        |> Seq.map(fun (n,v) -> v)
        |> Seq.head
    let streetNumber =
        inputs'
        |> Seq.filter(fun (n,v) -> n = "StreetNumber")
        |> Seq.map(fun (n,v) -> v)
        |> Seq.head
    let streetZipCode =
        inputs'
        |> Seq.filter(fun (n,v) -> n = "StreetZipCode")
        |> Seq.map(fun (n,v) -> v)
        |> Seq.head

 

image

So the answer is there, just the code sucks. I refactored it to a single function:

    let getValueFromInput(nameToFind:string) =
        inputs'
        |> Seq.filter(fun (n,v) -> n = nameToFind)
        |> Seq.map(fun (n,v) -> v)
        |> Seq.head
    let catchmentCode = getValueFromInput("CatchmentCode")
    let streetName = getValueFromInput("StreetName")
    let streetTemplateValue = getValueFromInput("StreetTemplateValue")
    let streetNumber = getValueFromInput("StreetNumber")
    let streetZipCode = getValueFromInput("StreetZipCode")
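One caveat with Seq.head is that it throws when the page does not contain the field being asked for. A possible variation, sketched below on a hard-coded list of (name, value) pairs, returns an option instead. `inputPairs` and `tryGetValue` are hypothetical names, and the pairs are illustrative values based on the sample query string.

```fsharp
// Hypothetical (name, value) pairs standing in for the INPUT fields on page 1
let inputPairs =
    [ "CatchmentCode", "CA 0198.2"
      "StreetName", "Strathorn Dr Cary"
      "StreetNumber", "904" ]

// Seq.tryFind returns None instead of throwing when the field is absent
let tryGetValue (name: string) (pairs: seq<string * string>) =
    pairs
    |> Seq.tryFind (fun (n, _) -> n = name)
    |> Option.map snd

let catchment = tryGetValue "CatchmentCode" inputPairs
let zipCode = tryGetValue "StreetZipCode" inputPairs
printfn "%A %A" catchment zipCode
```

That fits the rest of the pipeline, where a missing record is already represented as None.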

With page 1 out of the way, I was ready to start altering the form query string. I pulled the values out of the string and set them up like this:

    let streetTemplateValue = "STRAT"
    let street = "Strathorn"
    let suffix = "Dr"
    let city = "Cary"
    let streetNumber = "904"
    let streetName = street+"+"+suffix+"+"+city
    let streetLookup = "StreetTemplateValue="+streetTemplateValue+"&StreetName="+streetName+"&StreetNumber="+streetNumber+"&SubmitAddressSelectPage=CONTINUE&DefaultAction=SubmitAddressSelectPage"

    let streetLookup' = "SelectAssignment%7C2014%7CCURRENT=2014-15&DefaultAction=SelectAssignment%7C2014%7CCURRENT&DefaultAction=SelectAssignment%7C2015%7CCURRENT&CatchmentCode="+catchmentCode+"&StreetName="+streetName+"&StreetTemplateValue="+streetTemplateValue+"&StreetNumber="+streetNumber+"&StreetZipCode="+streetZipCode

So now it was just a matter of creating some data structures to pass into the 1st query string:

    type SearchCriteria = {streetTemplateValue:string;street:string;suffix:string;city:string;streetNumber:string;}

    let searchCriteria = {streetTemplateValue="STRAT";street="Strathorn";suffix="Dr";city="Cary";streetNumber="904"}
    //Page1 Query String
    let streetName = searchCriteria.street+"+"+searchCriteria.suffix+"+"+searchCriteria.city
    let streetLookup = "StreetTemplateValue="+searchCriteria.streetTemplateValue+"&StreetName="+streetName+"&StreetNumber="+searchCriteria.streetNumber+"&SubmitAddressSelectPage=CONTINUE&DefaultAction=SubmitAddressSelectPage"

and we now have the basis for a series of functions to do the school lookup.  You can see the gist here.
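As a rough sketch of where this could go next, the page 1 string can be wrapped in a function that takes the record. `createPage1QueryString` and `encode` are hypothetical names (not necessarily what is in the gist), and the encoding step is an assumption: it escapes each value and turns spaces into '+', matching the captured form data.

```fsharp
type SearchCriteria =
    { streetTemplateValue: string
      street: string
      suffix: string
      city: string
      streetNumber: string }

// Form-encode a value the way the captured requests do: spaces become '+'
let encode (value: string) =
    System.Uri.EscapeDataString(value).Replace("%20", "+")

let createPage1QueryString (c: SearchCriteria) =
    let streetName = String.concat " " [ c.street; c.suffix; c.city ]
    sprintf "StreetTemplateValue=%s&StreetName=%s&StreetNumber=%s&SubmitAddressSelectPage=CONTINUE&DefaultAction=SubmitAddressSelectPage"
        (encode c.streetTemplateValue) (encode streetName) (encode c.streetNumber)

let criteria =
    { streetTemplateValue = "STRAT"; street = "Strathorn"; suffix = "Dr"; city = "Cary"; streetNumber = "904" }

let page1 = createPage1QueryString criteria
printfn "%s" page1
```

Escaping the values means an address with a character like '&' or '#' in it would not break the form data.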

Parsing Wake County Tax Site With F#

Based on the response of my last post on Wake County School scores, I decided to look at each school’s revenue base.   Instead of looking at free and reduced lunch as a correlating factor for school scores, I wanted to look at the aggregate home valuations of each school’s population.

To do that, I thought of the Wake County Tax Department’s web site found here, where you can look up an address and see the tax value of the property. Although they don’t have an API, their web site’s search result page has a predictable uri like this: http://services.wakegov.com/realestate/Account.asp?id=0000001 so by plugging in a 7-digit integer, I could theoretically look at all of the tax records for the county. Also, the HTML of the result page is standardized so parsing it should be fairly straightforward.
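Generating those uris is a one-liner. This is just a sketch; `accountUri` is a hypothetical helper name, and it assumes the id is always zero-padded to 7 digits as in the example above.

```fsharp
// Build the account uri for a given index; %07d left-pads the number with zeros to 7 digits
let accountUri (id: int) =
    sprintf "http://services.wakegov.com/realestate/Account.asp?id=%07d" id

let uris = [ 1; 42; 1234567 ] |> List.map accountUri
uris |> List.iter (printfn "%s")
```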

So I fired up Visual Studio and opened up the F# REPL. The first thing I did was to bring in the Html type provider and wire up a standard page for the type.

    #r "../packages/FSharp.Data.2.1.1/lib/net40/FSharp.Data.dll"
    open FSharp.Data
    type context = HtmlProvider<"../data/RealEstateSample.html">

I could then bring down all of the DOM elements for the page and find all of the <Table> elements:

    let uri = "http://services.wakegov.com/realestate/Account.asp?id=0000001"
    let body = context.Load(uri).Html.Body()
    let tables = body.Descendants("TABLE") |> Seq.toList
    tables |> Seq.length

image

So there are 14 tables on the page.  After some manual inspection, the table that holds the address information is table number 7:

    let addressTable = tables.[7]

image

My first thought was to parse the text to see if there are key words that I can search on:

    // taxTable is the table holding the assessed value (table number 11 on this page)
    let taxTable = tables.[11]
    let baseText = taxTable.ToString()
    let marker = baseText.IndexOf("Total Value Assessed")
    let remainingText = baseText.Substring(marker)
    let marker' = remainingText.IndexOf("$")
    let remainingText' = remainingText.Substring(marker')
    let marker'' = remainingText'.IndexOf("<")
    let finalText = remainingText'.Substring(0,marker'')

I then thought, “Jamie you are being stupid”.  Since the DOM is structured consistently,  I can just use the type provider and search on tags:

    let addressTable = tables.[7]
    let fonts = addressTable.Descendants("font") |> Seq.toList
    let addressOne = fonts.[1].InnerText()
    let addressTwo = fonts.[2].InnerText()
    let addressThree = fonts.[3].InnerText()

and sure enough

image

And then going to table number 11, I can get the assessed value:

    let taxTable = tables.[11]
    let fonts' = taxTable.Descendants("font") |> Seq.toList
    let assessedValue = fonts'.[3].InnerText()

and how cool is this?

image
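The assessed value comes back as text (something like "$34,848"), so it has to be turned into a number before any summing can happen. Here is a minimal sketch of one way to do that; `tryParseAssessedValue` is a hypothetical helper, and it assumes the only decoration is a dollar sign and thousands separators.

```fsharp
open System
open System.Globalization

// Strip the currency decoration and parse what is left; None when the text is not a number
let tryParseAssessedValue (text: string) =
    let digits = text.Trim().Replace("$", "").Replace(",", "")
    match Decimal.TryParse(digits, NumberStyles.Number, CultureInfo.InvariantCulture) with
    | true, value -> Some value
    | _ -> None

let parsed = tryParseAssessedValue "$34,848"
let missing = tryParseAssessedValue " "
printfn "%A %A" parsed missing
```

Returning an option keeps blank or odd-looking values from blowing up a long-running scrape.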

So with the data elements in place, I need a way of saving the data. Fortunately, JSON support (JsonValue) is also in FSharp.Data so I could do this:

    open System.IO

    let valuation = JsonValue.Record [|
        "addressOne", JsonValue.String addressOne
        "addressTwo", JsonValue.String addressTwo
        "addressThree", JsonValue.String addressThree
        "assessedValue", JsonValue.String assessedValue |]
    File.AppendAllText(@"C:\Data\dataTest.json",valuation.ToString())

And in the file:

image

So now I have the pieces to make requests to the Wake County site and put the values into a json file. I decided to push the data to the file after each request so that if there is a fault partway through, I would not lose everything. So here is the gist and here are the results:

image

I then decided to see how long it would take to download the 1st 1,000 ints:

    #time
    [1..1000] |> Seq.iter(fun id -> doValuation id)

and with fiddler running

image

It took about 5 minutes for 1,000 ints

image

so extrapolating to the max possible (9,999,999), it would take roughly 833 hours (about 35 days).

image
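As a quick sanity check on the extrapolation, here is the arithmetic spelled out. It takes the "about 5 minutes per 1,000" figure at face value as an assumption; the exact timing in the screenshot may differ.

```fsharp
let minutesPerBatch = 5.0   // assumed: "about 5 minutes" taken as exact
let batchSize = 1000.0
let maxId = 9999999.0

// number of batches * minutes per batch, converted to hours
let totalHours = maxId / batchSize * minutesPerBatch / 60.0
let totalDays = totalHours / 24.0
printfn "%.0f hours (about %.1f days)" totalHours totalDays
```

At that rate the full range works out to roughly 833 hours, or about 35 days, on a single machine. Either way it is far too long to run serially.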

Two thoughts come to mind for the next step

1) Use MBrace with some VMs on Azure to do the requests in parallel

2) Do a binary search to see the actual upper number for Wake County.
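The second idea can be sketched as below. It assumes ids are handed out contiguously, so every id up to some upper bound has a record and nothing after it does; if there are gaps, the search can stop early. `findUpperBound` and `fakeExists` are hypothetical names, and `fakeExists` stands in for an actual request to the site.

```fsharp
// Find the largest id in [low, high] for which `exists` is true, assuming
// `exists` is true for every id up to some bound and false after it.
let rec findUpperBound (exists: int -> bool) (low: int) (high: int) =
    if low >= high then low
    else
        let mid = low + (high - low + 1) / 2   // round up so the range always shrinks
        if exists mid then findUpperBound exists mid high
        else findUpperBound exists low (mid - 1)

// Hypothetical stand-in for "does Account.asp?id=N return a real record?"
let fakeExists id = id <= 350000

let upper = findUpperBound fakeExists 1 9999999
printfn "%d" upper
```

Since each probe halves the range, covering all 9,999,999 candidates takes only a couple dozen requests.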

Tune in next week to see if that works.

Wake County School Report Cards Using R

Recently Wake County School Systems released a “school report card” that can be used to compare how well a school is doing relative to the other schools in the state. As expected, it made front-page news in our local newspaper here. The key theme was that schools that have kids from poorer families have worse results than schools with kids from more affluent families. Though this shouldn’t come as a surprise, the follow-up op-eds were equally predictable: more money for poorer schools, changing the rating scale, etc.

I thought it would be an interesting data set to analyze to see if the conclusion that the N&O came up with was, in fact, the only conclusion you can get out of the dataset. Since they did a simple crosstab analysis, perhaps there was some other analysis that could be done? I know that newspaper articles are written at a pretty low reading level; perhaps they are also written at a low analytical level? I went to the website to download the data here and quickly ran into two items:

1) The dataset is very limited -> there are only 3 credible variables in the dataset (county, free and reduced lunch percent, and the school score). It is almost as if the dataset was purposely limited to only support the conclusion.

2) The dataset is presented in a way that makes alternative analysis very hard. You have to install Tableau if you want to look at the data yourself. Parsing Tableau was a pain because, even with Fiddler, the results are not rendered as HTML with some tags but as images.

Side Note -> My guess is that Tableau is trying to be the Flash of the analytics space. I find it curious that companies/organizations think they are just “one tool away” from good analytics. Even in the age of Watson, it is never the tooling; it is always the analyst that determines the usefulness of a dataset. It would be much better if WCPSS embraced open data and had higher expectations of the people using the datasets.

In any event, with the 14-day trial of Tableau, I could download the data into Access. I then exported the data into a .txt file (where it should have been in the 1st place). I then pumped it into R Studio like so:

image

I then created 2 variables from the FreeAndReducedLunch and SchoolScores vectors. When I ran the correlation the 1st time, I got an NA, meaning that there is some malformed data.

image

I re-ran the correlation using only complete data and sure enough, there is a credible correlation -> the higher the percent of free and reduced lunch, the lower the score. The N&O is right.

image
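The analysis itself was done in R (where the usual way to skip incomplete rows is the use = "complete.obs" argument to cor), but the "complete data only" idea is easy to show in F# as well. The sketch below computes a Pearson correlation over just the pairs where both values are present. The numbers are made up purely for illustration and are not the WCPSS data, and `correlationCompleteCases` is a hypothetical name.

```fsharp
// Pearson correlation over the pairs where both values are present.
// Both lists must have the same length and at least two complete pairs.
let correlationCompleteCases (xs: float option list) (ys: float option list) =
    let pairs =
        List.zip xs ys
        |> List.choose (fun (x, y) ->
            match x, y with
            | Some a, Some b -> Some (a, b)
            | _ -> None)
    let meanX = pairs |> List.averageBy fst
    let meanY = pairs |> List.averageBy snd
    let cov = pairs |> List.sumBy (fun (x, y) -> (x - meanX) * (y - meanY))
    let varX = pairs |> List.sumBy (fun (x, _) -> (x - meanX) ** 2.0)
    let varY = pairs |> List.sumBy (fun (_, y) -> (y - meanY) ** 2.0)
    cov / sqrt (varX * varY)

// Made-up numbers (not the WCPSS data); None marks a missing value
let freeAndReducedLunch = [ Some 10.0; Some 20.0; None; Some 40.0; Some 50.0 ]
let schoolScores = [ Some 90.0; Some 80.0; Some 70.0; None; Some 50.0 ]

let r = correlationCompleteCases freeAndReducedLunch schoolScores
printfn "%.3f" r
```

In this toy data the three complete pairs fall on a straight line, so the correlation comes out at -1; real data will be noisier.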

I then added a filter to only look at Wake County and there is even a stronger correlation in Wake County than the state as a whole:

image

As I mentioned earlier, the dataset was set up for a pre-decided conclusion by limiting the number of independent variables and by the choice of Tableau as the reporting mechanism. I decided to augment the dataset with additional information. My son plays in TYO and I unsuccessfully tried to set up an orchestra at our local elementary school 8 years ago. I also thought of this article where some families tried to get more orchestras in Wake County schools. Fortunately, the list of schools with orchestra can be found here and it did not take very long to add a “HasAnStringsProgram” field to the dataset.

image

Running a correlation for just the WCPSS schools shows that there is no relationship  between a school having an orchestra and their performance grade. 

image

So the statement by the parents in the N&O like this

… that music students have higher graduation rates, grades and test scores …

might be true for all music but a specialized strings program does not seem to impact the school’s score –> at least with this data.
