Refactoring McCaffrey’s Regression to F#

James McCaffrey’s most recent MSDN article on multi-class regression is a great starting place for folks interested in the ins and outs of creating a regression.  You can find the article here.  He wrote the code in C# in a very imperative style, so the FSharp in me immediately wanted to rewrite it in F#.

Interestingly, Mathias Brandewinder also had the same idea and did a better (and more complete) job than me.  You can see his post here.

I decided to duck into McCaffrey’s code and see where I could rewrite part of the code.  My first step was to move his C# code to a more manageable format.

[image: the restructured solution]

I changed the project from a console app to a .dll and then split the two classes into their own files.  I then added some unit tests so that I could verify that my reworking was correct:

[TestClass]
public class CSLogisticMultiTests
{
    LogisticMulti _lc = null;
    double[][] _trainData;
    double[][] _testData;

    public CSLogisticMultiTests()
    {
        int numFeatures = 4;
        int numClasses = 3;
        int numRows = 1000;
        int seed = 42;
        var data = LogisticMultiProgram.MakeDummyData(numFeatures, numClasses, numRows, seed);
        LogisticMultiProgram.SplitTrainTest(data, 0.80, 7, out _trainData, out _testData);
        _lc = new LogisticMulti(numFeatures, numClasses);

        int maxEpochs = 100;
        double learnRate = 0.01;
        double decay = 0.10;
        _lc.Train(_trainData, maxEpochs, learnRate, decay);
    }

    [TestMethod]
    public void GetWeights_ReturnExpected()
    {
        double[][] bestWts = _lc.GetWeights();
        var expected = 13.939104508387803;
        var actual = bestWts[0][0];
        Assert.AreEqual(expected, actual);
    }

    [TestMethod]
    public void GetBiases_ReturnExpected()
    {
        double[] bestBiases = _lc.GetBiases();
        var expected = 11.795019237894717;
        var actual = bestBiases[0];
        Assert.AreEqual(expected, actual);
    }

    [TestMethod]
    public void GetTrainAccuracy_ReturnExpected()
    {
        var expected = 0.92125;
        var actual = _lc.Accuracy(_trainData);
        Assert.AreEqual(expected, actual);
    }

    [TestMethod]
    public void GetTestAccuracy_ReturnExpected()
    {
        var expected = 0.895;
        double actual = _lc.Accuracy(_testData);
        Assert.AreEqual(expected, actual);
    }
}

You will notice that the expected values are the exact values that McCaffrey shows in the output of his console app.  In any event, the tests were running all green:

[image: all tests passing]

I then went into the F# project and fired up the REPL.  I decided to start with the MakeDummyData method because it seemed beefy enough to demonstrate the differences between the two languages, it is fairly self-contained, and its data is already testable.  Here are the first nine lines of code:

Random rnd = new Random(seed);
double[][] wts = new double[numFeatures][];
for (int i = 0; i < numFeatures; ++i)
    wts[i] = new double[numClasses];
double hi = 10.0;
double lo = -10.0;
for (int i = 0; i < numFeatures; ++i)
    for (int j = 0; j < numClasses; ++j)
        wts[i][j] = (hi - lo) * rnd.NextDouble() + lo;

And here is the F# equivalent

let rnd = new Random(seed)
let hi = 10.0
let lo = -10.0
let wts = Array.create numFeatures (Array.create numClasses 1.)
let wts' = wts |> Array.map(fun row -> row |> Array.map(fun col -> (hi - lo) * rnd.NextDouble() + lo))

There is one obvious difference and one subtle difference.  The obvious difference is that the F# code does not do any looping to create and populate the array-of-arrays data structure; rather, it uses the higher-order Array.map function.  This reduces the idiomatic line count from 9 to 5 – roughly a 50% decrease (and a funny movie from the 1980s).  (Note that I use the words “idiomatic line count” because you can reduce both examples to a single line of code, but that makes it unworkable for humans.  Both examples show the typical way you would write code in the language.)  So with the fewer lines of code, which is more readable?  That is subjective.  A C#/Java/JavaScript/curly-brace dev would say the C#.  Everyone else in the world would say the F#.

The less obvious difference is that F# emphasizes immutability, so there are two values (wts and wts') where the C# has one variable that is mutated.  The implication is lost in such a small example, but if numFeatures were large, you would want to take advantage of multi-core processors, and the F# code is ready for parallelism.  The C# code would have to be reworked to use an immutable collection.
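As a minimal sketch (not part of McCaffrey’s code or the refactoring above), the parallel version might look something like this:

// Hypothetical sketch: swap Array.map for Array.Parallel.map on the outer array.
// System.Random is not thread-safe, so each row gets its own generator
// derived from the original seed.
let wtsParallel =
    wts
    |> Array.Parallel.mapi(fun i row ->
        let rowRnd = Random(seed + i)
        row |> Array.map(fun _ -> (hi - lo) * rowRnd.NextDouble() + lo))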

The next lines create and populate the biases variable.  The C# Code:

double[] biases = new double[numClasses];
for (int i = 0; i < numClasses; ++i)
    biases[i] = (hi - lo) * rnd.NextDouble() + lo;

And the F# Code 

let biases = Array.create numClasses 1.
let biases' = biases |> Array.map(fun row -> (hi - lo) * rnd.NextDouble() + lo)

Same deal as before.  No loops or mutation.  Fewer lines of code and better readability.

The last set of code is a ball of string so it is very hard to separate out.

   

double[][] result = new double[numRows][]; // allocate result
for (int i = 0; i < numRows; ++i)
    result[i] = new double[numFeatures + numClasses];

for (int i = 0; i < numRows; ++i) // create one row at a time
{
    double[] x = new double[numFeatures]; // generate random x-values
    for (int j = 0; j < numFeatures; ++j)
        x[j] = (hi - lo) * rnd.NextDouble() + lo;

    double[] y = new double[numClasses]; // computed outputs storage
    for (int j = 0; j < numClasses; ++j) // compute z-values
    {
        for (int f = 0; f < numFeatures; ++f)
            y[j] += x[f] * wts[f][j];
        y[j] += biases[j];
    }

    // determine loc. of max (no need for 1 / 1 + e^-z)
    int maxIndex = 0;
    double maxVal = y[0];
    for (int c = 0; c < numClasses; ++c)
    {
        if (y[c] > maxVal)
        {
            maxVal = y[c];
            maxIndex = c;
        }
    }

    for (int c = 0; c < numClasses; ++c) // convert y to 0s or 1s
        if (c == maxIndex)
            y[c] = 1.0;
        else
            y[c] = 0.0;

    int col = 0; // copy x and y into result
    for (int f = 0; f < numFeatures; ++f)
        result[i][col++] = x[f];
    for (int c = 0; c < numClasses; ++c)
        result[i][col++] = y[c];
}

Note the use of code comments, which is typically considered a code smell, even in demonstration code.

Here is the F# Code:

let x = Array.create numFeatures 1.
let x' = x |> Array.map(fun row -> (hi - lo) * rnd.NextDouble() + lo)

let xWts = Array.zip x' wts'
let xWts' = xWts |> Array.map(fun (x,wts) -> wts |> Array.sumBy(fun wt -> wt * x))

let y = Array.create numClasses 1.
let yWts = Array.zip y xWts'
let y' = yWts |> Array.map(fun (y,xwt) -> y + xwt)

let yBias = Array.zip y' biases'
let y'' = yBias |> Array.map(fun (y,bias) -> y + bias)

let maxVal = y'' |> Array.max

let y''' = y'' |> Array.map(fun y -> if y = maxVal then 1. else 0.)

let xy = Array.append x' y'''
let result = Array.create numRows xy

This is pretty much the same as before: no loops, immutability, and a 50% reduction in code.  Also, notice that using a more functional style breaks apart the ball of string.  Individual values are on their own lines to be individually evaluated and manipulated.  Also, the if..then statement goes to a single line.

So I had a lot of fun working through these examples.  The major differences were

  • Amount of Code and Code Readability
  • Immutability and ready for parallelism
I am not planning to refactor the rest of the project, but you can, as the project is found here.  I am curious if using an array of arrays is the best way to represent the matrix –> I guess it is standard for the curly-brace community?  I would think using Deedle would be better, but I don’t know enough about it (yet).

 

Two More Reasons To Use F#

On March 1st, James McCaffrey posted a blog article about why he doesn’t like FSharp, found here.  That it took three weeks for anyone to notice is revealing in and of itself, but the post is probably important because McCaffrey writes monthly in MSDN on machine learning/scientific computing, so he has a certain amount of visibility.  To his credit, McCaffrey did try to use F# in one of his articles when FSharp first came out –> unfortunately, he wrote the code in an imperative style, so he pretty much missed the point and benefit of using F#.  Interestingly, he also writes his C# without using the important OO concepts that would make his code much more usable to the larger community (especially polymorphic dispatch).

In any event, the responses from the FSharp community were pretty much what you would expect, with two very good responses here and here (and probably more to come).  I had posed a similar question to the FSharp Google group a while back, with even more reasons why people don’t use FSharp and some good responses about why they do.  Recently, Eric Sink also wrote a good article on FSharp adoption, found here.

For the last year, I have had the opportunity to work with a couple of startups in Raleigh, NC that are using FSharp, and I have a couple of observations that haven’t been mentioned so far (I think) in response to McCaffrey:

  • CTOs in FSharp shops don’t want you to learn FSharp.  They view using FSharp as a competitive advantage and hope that their .NET competitors continue to use C# exclusively.  Their rationale has less to do with the language itself (C# is a great language) than with the fact that folks who can’t go between the two languages (or see how learning FSharp makes you a better C# coder and vice-versa) are not the developers they want on their team.  The FSharp shops I know about have no problem attracting top-flight talent –> no recruiter, no posts on Dice, no resumes, no interviews.  Interestingly, the rock star .NET developers have already left their C# comfort zone.  A majority of these developers are web devs, so they have been using JavaScript for at least a couple of years.  For many, it was their first foray out of the C# bubble and they hated it.  But like most worthwhile things, they stuck with it and now are proficient and may even enjoy it.  In any event, McCaffrey also doesn’t like HTML/JavaScript/CSS (7:45 here), so I guess those developers are in the same boat.

  • You don’t want any of the 100,000 jobs on Stack Overflow that McCaffrey talks about.  My instinct is that those jobs are targeted at the 50% of C# developers who still don’t use LINQ and/or lambda expressions.  Those are the companies that view developers as a commodity.  This is not where you want to be because:
  1. They are wrong.  The world is not flat.  Never has been.  CMMI, six-sigma process improvement, and other such things do not work in software engineering.  The problem for those companies is that they have lots of architects that don’t write production code, project managers that are second-careering into technology, and off-shore development managers who have no idea about the domain they are managing.  All of these people have mortgages, colleges to pay for, etc…, so this self-protecting bureaucracy will be slow to die.  Therefore, they will continue to try to attract coders that don’t want to think outside their comfort zone.
  2. It sucks working there – because you are just a cog in their machine.  You will probably be maintaining post-back websites or fat-client applications.  Who needs Xamarin anyway?  And be happy with that 2% raise.  But they do have a startup culture.

In any event, I hope to meet McCaffrey at //Build later this month.  My guess is that since his mind is made up, nothing I say will change his opinion.  But it should be interesting to talk with him, and I really do enjoy his articles in MSDN – so we have that common ground.

WCPSS Scores and Property Tax Valuations Using R

With all of the data gathered and organized I was ready to do some analytics using R.  The first thing I did was to load the four major datasets into R.

[image: the four datasets loaded into RStudio]

  • NCScores is the original dataset that has the school score.  I had already done an analysis on it here.
  • SchoolValuation is the aggregate property values for each school as determined by scraping the Wake County Tax Website and Wake County School Assignment websites.  You can read how it was created here and here.
  • SchoolNameMatch is a crosswalk table between the school name as found in the NCScores dataframe and the SchoolValuation dataframe.  You can read how it was created here.
  • WakeCountySchoolInfo is an export from WCPSS that was tossed around at open data day.

Step one was to reduce the North Carolina Scores data to only Wake County

#Create Wake County Scores From NC State Scores
WakeCountyScores <- NCScores[NCScores$District == 'Wake County Schools',]

The next step was to add in the SchoolNameMatch so that we have the Tax Valuation School Name

#Join SchoolNameMatch to Wake County Scores
WakeCountyScores <- merge(x=WakeCountyScores, y=SchoolNameMatch, by.x="School", by.y="WCPSS")

Interestingly, R is smart enough that the common field is not duplicated; just the additional field(s) are added.

[image: the merged dataframe]

The next step was to add in the Wake County Property Values, remove the Property field as it is no longer needed, and convert the TaxBase field from string to numeric

#Join Property Values
WakeCountyScores <- merge(x=WakeCountyScores, y=SchoolValuation, by.x="Property", by.y="SchooName")

#Remove Property column
WakeCountyScores$Property = NULL

#Turn tax base to numeric
WakeCountyScores$TaxBase <- as.numeric(WakeCountyScores$TaxBase)

Eager to do an analysis, I pumped the data into a correlation

#Do a Correlation
cor(WakeCountyScores$TaxBase, WakeCountyScores$SchoolScore, use="complete")

[image: correlation result]

So clearly my expectations that property values track with FreeAndReducedLunch (.85 correlation) were not met.  I decided to use Practical Data Science with R Chapter 3 (Exploring Data)  as a guide to better understand the dataset.

#Practical Data Science With R, Chapter 3
summary(WakeCountyScores)
summary(WakeCountyScores$TaxBase)

[images: summary output for WakeCountyScores and for TaxBase]

So there is quite a range in tax base!  The next task was to use some graphs to explore the data.  I added in ggplot2

[image: loading the ggplot2 package]

and followed the book’s example for a histogram.  I started with score and it came out as expected.  I then tried a histogram on TaxBase and had to tinker with the binwidth to make a meaningful chart:

#Histograms
ggplot(WakeCountyScores) + geom_histogram(aes(x=SchoolScore), binwidth=5, fill="gray")
ggplot(WakeCountyScores) + geom_histogram(aes(x=TaxBase), binwidth=10000, fill="gray")
#Ooops
ggplot(WakeCountyScores) + geom_histogram(aes(x=TaxBase), binwidth=5000000, fill="gray")

 

[images: histograms of SchoolScore and TaxBase]

The book then moves to an example studying income, which is directly analogous to the TaxBase, so I followed it very closely.  The next graphs were some density graphs.  Note that the second one uses a logarithmic scale:

#Density
library(scales)
ggplot(WakeCountyScores) + geom_density(aes(x=TaxBase)) + scale_x_continuous(labels=dollar)
ggplot(WakeCountyScores) + geom_density(aes(x=TaxBase)) + scale_x_log10(labels=dollar) + annotation_logticks(sides="bt")

 

[images: density plots of TaxBase, linear and log scale]

So kinda interesting that most schools cluster in terms of their tax base, but because there is such a wide range with a majority clustered to the low end, the logarithmic curve is much more revealing.

The book then moved into showing the relationship between two variables.  In this case, SchoolScore as the Y variable and TaxBase as the X variable:

#Relationship between TaxBase and Scores
ggplot(WakeCountyScores, aes(x=TaxBase, y=SchoolScore)) + geom_point()
ggplot(WakeCountyScores, aes(x=TaxBase, y=SchoolScore)) + geom_point() + stat_smooth(method="lm")
ggplot(WakeCountyScores, aes(x=TaxBase, y=SchoolScore)) + geom_point() + geom_smooth()

[images: scatter plots of SchoolScore vs. TaxBase with linear and smoothed fits]

So what is interesting is that there does not seem to be a strong relationship between scores and tax base.  There looks to be roughly an equal number of schools below the fitted curve as above it.  Note that using a smoothing curve is much better than the linear fit in showing the relationship of scores to tax base.  You can see the dip in the lower quartile and the increase at the tail.  It makes sense that the higher tax base shows an increase in scores, but what’s up with that dip?

Finally, the same data is shown using a hex chart:

library(hexbin)
ggplot(WakeCountyScores, aes(x=TaxBase, y=SchoolScore)) + geom_hex(binwidth=c(100000000,5)) + geom_smooth(color="white", se=F)

[image: hex-bin chart of SchoolScore vs. TaxBase]

So taking a step back, it is clear that there is a weakness in this analysis.  Some schools have thousands of students, some schools have a couple hundred (high schools versus elementary schools).  Using the absolute dollars from the tax valuation is misleading.  What we really need is the tax base per student.  Going back to the SchoolInfo dataframe, I added it in and pulled a student count column.

WakeCountyScores <- merge(x=WakeCountyScores, y=WakeCountySchoolInfo, by.x="School", by.y="School.Name")
names(WakeCountyScores)[names(WakeCountyScores)=="School.Membership.2013.14..ADM..Mo2."] <- "StudentCount"
WakeCountyScores$StudentCount <- as.numeric(WakeCountyScores$StudentCount)

WakeCountyScores["TaxBasePerStudent"] <- WakeCountyScores$TaxBase/WakeCountyScores$StudentCount
summary(WakeCountyScores$TaxBasePerStudent)

Interestingly, the number of records in the base frame dropped from 166 to 152, which means that perhaps we need a second mapping table.  In any event, you can see that the average tax base per student is $6.5 million, with a max of $114 million.  Quite a range!

[image: summary of TaxBasePerStudent]

Going back to the point and hex graphs

ggplot(WakeCountyScores, aes(x=TaxBasePerStudent, y=SchoolScore)) + geom_point() + geom_smooth()
ggplot(WakeCountyScores, aes(x=TaxBasePerStudent, y=SchoolScore)) + geom_hex(binwidth=c(25000000,5)) + geom_smooth(color="white", se=F)

 

[images: point and hex-bin charts of SchoolScore vs. TaxBasePerStudent]

There is something interesting going on.  First, the initial conclusion that a higher tax base leads to a gradual increase in scores is wrong once you move from total tax base to tax base per student.

Also, note the significant drop in school scores once you move away from the lowest tax-base schools, the recovery, and then the drop again.  From a real estate perspective, these charts suggest that the marginal value of a really expensive or really inexpensive house in Wake County is not worth it (at least in terms of where you send your kids), and there is a sweet spot of value above a certain price point.

You can find the gist here and the repo is here.

Some lessons I learned in doing this exercise:

  • Some records got dropped between the scores dataframe and the info dataframe -> so there needs to be another mapping table
  • Make the tax base in millions
  • What’s up with that school with 114 million per student?
  • An interesting question is the allocation of dollars to each school compared to its tax base.  I wonder if that is on WCPSS somewhere.  Hmmm…
  • You can’t use the tick (') notation, which means you do a lot of overwriting of dataframes.  This can be a costly aspect of the language.  It is much better to assume immutability, even if you clutter up your data window.

As a final note, I was using the Console window b/c that is what the intro books do.  This is a huge mistake in R Studio.  It is much better to create a script and send the results to the console

[image: an R script in the RStudio editor]

That way you can make changes and run things again.  It is a cheap way of avoiding head-scratching bugs…

Creating a crosswalk table between WCPSS School assignment results and school report card school list

As part of my Wake County School Score analysis, I needed to build a cross-walk table between the Wake County School Site parsing that I did here and the school score result set.  The screen scraping put schools in this kind of format:

[image: school names as scraped from the school assignment site]

As an added wrinkle, there is some extra data for some of the schools:

 

while the score result set is in this format:

[image: school names as they appear in the score result set]

So I want to create a cross-walk table with this format:

[image: the target crosswalk table format]

Step one of this process is to get all of the distinct values from the school site data.  If the data were in a SQL Server database, getting that would be as simple as:

“Select distinct schoolName from reallyBigDenormalizedTable”

But the data is not in SQL Server; it is in a no-sql database, and the json is structured so that the school names are in an array inside the document.  After messing around with the query syntax to traverse the nested array, I gave up and decided to sample the database.

The first task toward that was to get a record out via its index:

let getSchools (index:int) =
    try
        let endpointUrl = "https://chickensoftware.documents.azure.com:443/"
        let client = new DocumentClient(new Uri(endpointUrl), authKey)
        let database = client.CreateDatabaseQuery().Where(fun db -> db.Id = "wakecounty").ToArray().FirstOrDefault()
        let collection = client.CreateDocumentCollectionQuery(database.CollectionsLink).Where(fun dc -> dc.Id = "houseassignment").ToArray().FirstOrDefault()
        let documentLink = collection.SelfLink
        let queryString = "SELECT * FROM houseassignment WHERE houseassignment.houseIndex = " + index.ToString()
        let query = client.CreateDocumentQuery(documentLink,queryString)
        match Seq.length query with
        | 0 -> None
        | _ ->
            let firstValue = query |> Seq.head
            let assignment = HouseAssignment.Parse(firstValue.ToString())
            Some assignment.Schools
    with
    | :? HttpRequestException as ex ->
        None
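The next step relies on a GetValues extension method to System.Random, which the post links to but does not show.  A minimal sketch of what it might look like, assuming it yields an endless stream of random integers in a given range:

// Hypothetical sketch of the GetValues extension the post refers to:
// an infinite sequence of random integers between minValue and maxValue.
type System.Random with
    member this.GetValues(minValue, maxValue) =
        Seq.initInfinite (fun _ -> this.Next(minValue, maxValue))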

The next step was to create an array of index numbers with random values in them.  I found a really good extension method to System.Random (along the lines of the sketch above) to populate the array.  The next question was “how big does the sample size have to be to get most/all of the schools?”  I started seeding the array with different values and ran these functions:

let random = new System.Random(42)
let indexes = random.GetValues(1,350000) |> Seq.take(10000) |> Seq.toArray
let allSchools = indexes |> Seq.map(fun i -> getSchools(i)) |> Seq.toArray
let getNumberOfSchools (trial:int) =
    let trialSchools = allSchools.[1..trial]
    let allSchools' = trialSchools |> Seq.filter(fun s -> s.IsSome)
    let allSchools'' = allSchools' |> Seq.collect(fun s -> s.Value)
    let uniqueSchools = allSchools'' |> Seq.distinct
    uniqueSchools |> Seq.length

let trialCount = [|1..9999|]

trialCount |> Seq.map(fun t -> t, getNumberOfSchools(t))
           |> Seq.iter(fun (t, c) -> printfn "%A %A" t c)

The sample above uses 10,000 records, which is pretty good.  If you graph it, you can see that the count of unique schools reaches its maximum at around 2,500 samples.
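The post doesn’t show the charting code; a minimal sketch of one way to do it, assuming FSharp.Charting is installed (the package path and version below are guesses):

// Hypothetical sketch: plot unique-school count against sample size with FSharp.Charting.
#load "../packages/FSharp.Charting.0.90.9/FSharp.Charting.fsx"
open FSharp.Charting

trialCount
|> Seq.map(fun t -> t, getNumberOfSchools(t))
|> Chart.Line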

[image: unique school count vs. number of samples]

Unfortunately, there were 11 schools on the report card that were not in the 10,000 set.  Confronted with this reality, I did what any reasonable researcher would do… I dropped them.  My guess is that these schools are not part of a base school pyramid; rather, they are “application schools” like STEM or leadership academies.

In any event, with the list of schools in hand, I copied them into Excel and sorted them alphabetically.  I then put the school score list next to them and started matching.  Within 15 minutes, I had a credible crosswalk table.

[image: the crosswalk table in Excel]

You can see the gist here

Predictive Analytics With Microsoft Azure Machine Learning

(On vacation this week)

Over the Christmas holiday, I had some time to look at some of the books that have been sitting on my bookshelf.  One of these was Predictive Analytics With Microsoft Azure Machine Learning by Barga, Fontama, and Tok.

[image: book cover]

This book is a great introduction to both analytics and Azure ML.  I really appreciated how the authors started off with a couple of basic experiments to get your feet wet, then moved over to some theory about different ML techniques, and then finished out the rest of the book with some hands-on labs.

I worked through all of the labs (except one) in about six hours.  The labs follow a very nice step-by-step pattern with plenty of screen shots.  My only quibble with the book is that the most interesting lab, Building a Churn Model, relied on data from a third party.  When I went to the third party’s website to download the data, the links were broken (404s).  I went to the book’s site at Apress and it did not have the data either.  That was kinda frustrating and something that the authors should have considered.

In any event, if you have some time, working through Predictive Analytics With Microsoft Azure Machine Learning is well worth the time and is quite fun.

Aggregation of WCPSS Tax Records with School Assignment

So for the next part of my WCPSS hit parade, I needed a way of combining the screen scrape that I did of the Wake County Tax Records as described here with the screen scrape of the Wake County Public School Assignments found here.  Getting data from the DocumentDb is straightforward as long as you don’t ask too much of the query syntax.

I created two functions that pull the tax record and the school assignment via the index number:

let getAssignment (id:int) =
    let collection = client.CreateDocumentCollectionQuery(database.CollectionsLink).Where(fun dc -> dc.Id = "houseassignment").ToArray().FirstOrDefault()
    let documentLink = collection.SelfLink
    let queryString = "SELECT * FROM houseassignment WHERE houseassignment.houseIndex = " + id.ToString()
    let query = client.CreateDocumentQuery(documentLink,queryString)
    match query |> Seq.length with
    | 0 -> None
    | _ ->
        let assignmentValue = query |> Seq.head
        let assignment = HouseAssignment.Parse(assignmentValue.ToString())
        Some assignment

let getValuation (id:int) =
    let collection = client.CreateDocumentCollectionQuery(database.CollectionsLink).Where(fun dc -> dc.Id = "taxinformation").ToArray().FirstOrDefault()
    let documentLink = collection.SelfLink
    // use the passed-in id (the original post hard-coded the index to 1 here)
    let queryString = "SELECT * FROM taxinformation WHERE taxinformation.index = " + id.ToString()
    let query = client.CreateDocumentQuery(documentLink,queryString)
    match query |> Seq.length with
    | 0 -> None
    | _ ->
        let valuationValue = query |> Seq.head
        let valuation = HouseValuation.Parse(valuationValue.ToString())
        Some valuation

Note that option types are being used because there are many index values where there is not a corresponding record.  Also, there might be a situation where the assignment has a record but the valuation does not, and vice-versa, so I created a function to only put the records together when both records exist:

let assignSchoolTaxBase (id:int) =
    let assignment = getAssignment(id)
    let valuation = getValuation(id)
    match assignment.IsSome, valuation.IsSome with
    | true, true -> assignment.Value.Schools
                    |> Seq.map(fun s -> s, valuation.Value.AssessedValue)
                    |> Some
    | _ -> None

And running this on the first record, we get the expected result.

[image: result for the first record]

Also, running it on an index where there is not a record, we also get the expected result:

[image: result for an index with no record]

With the matching working, we need a way of bringing all of the school arrays together and then aggregating the tax valuation.  I decided to take a step-by-step approach to this, even though there might be a more terse way to write it.

#time
indexes |> Seq.map(fun i -> assignSchoolTaxBase(i))
        |> Seq.filter(fun s -> s.IsSome)
        |> Seq.collect(fun s -> s.Value)
        |> Seq.groupBy(fun (s,av) -> s)
        |> Seq.map(fun (s,ss) -> s, ss |> Seq.sumBy(fun (s,av) -> av))
        |> Seq.toArray

When I run it on the first 10 records, the values come back as expected:

[image: aggregated tax base for the first 10 records]

So the last step is to run it on all 350,000 indexes (let indexes = [|1..350000|]).  The problem is that after a long period of time, things were not returning.  So this is where the power of Azure comes in –> there is no problem so large I can’t throw more cores at it.  I went to the management portal and increased the VM to 8 cores.

[image: the Azure management portal showing the resized VM]

I then went into the code base and added PSeq for the database calls (which I assumed were taking the longest time).
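PSeq is not part of FSharp.Core; I assume it comes from the FSharp.Collections.ParallelSeq package, referenced along these lines (the path and version below are guesses):

// Assumed reference for PSeq -- the package path/version here is a guess.
#r "../packages/FSharp.Collections.ParallelSeq.1.0.2/lib/net40/FSharp.Collections.ParallelSeq.dll"
open FSharp.Collections.ParallelSeq

With that reference in place, the parallelized version looks like this: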

#time
let indexes = [|1..350000|]
let assignedValues = indexes |> PSeq.map(fun i -> assignSchoolTaxBase(i)) |> Seq.toArray

let filePath = @"C:\Git\WakeCountySchoolScores\SchoolValuation.csv"

assignedValues
|> Seq.filter(fun s -> s.IsSome)
|> Seq.collect(fun s -> s.Value)
|> Seq.groupBy(fun (s,av) -> s)
|> Seq.map(fun (s,ss) -> s, ss |> Seq.sumBy(fun (s,av) -> av))
|> Seq.map(fun (s,v) -> s + "," + v.ToString() + Environment.NewLine)
|> Seq.iter(fun (s) -> File.AppendAllText(filePath, s))

and after 2 hours:

[image: the aggregated school valuations written to the csv]

Combining Wake County Real Estate Lookup with Wake County School Assignment

As a follow-up to this post and this post, I want to combine looking up the Wake County Real Estate valuation with the Wake County School Assignment.  The matching value between the two datasets is the house address.

The first thing I did was to create a new script file in the project.  I then added a reference to the script that does the WCPSS lookup.  I then added a Json provider that will serve as the type of the Wake County Real Estate Valuation data that was stored previously in a DocumentDb instance.

#r "../packages/FSharp.Data.2.1.1/lib/net40/FSharp.Data.dll"
#r "../packages/Microsoft.Azure.Documents.Client.0.9.2-preview/lib/net40/Microsoft.Azure.Documents.Client.dll"
#r "../packages/Newtonsoft.Json.4.5.11/lib/net40/Newtonsoft.Json.dll"

#load "SchoolAssignments.fsx"

open System
open System.IO
open FSharp.Data
open System.Linq
open SchoolAssignments
open Microsoft.Azure.Documents
open Microsoft.Azure.Documents.Client
open Microsoft.Azure.Documents.Linq

type HouseValuation = JsonProvider<"../data/HouseValuationSample.json">

The house valuation json looks like this:

{
  "index": 1,
  "addressOne": "1506 WAKE FOREST RD ",
  "addressTwo": "RALEIGH NC 27604-1331",
  "addressThree": " ",
  "assessedValue": "$34,848",
  "id": "c0e931de-68b8-452e-8365-66d3a4a93483",
  "_rid": "pmVVALZMZAEBAAAAAAAAAA==",
  "_ts": 1423934277,
  "_self": "dbs/pmVVAA==/colls/pmVVALZMZAE=/docs/pmVVALZMZAEBAAAAAAAAAA==/",
  "_etag": "\"0000c100-0000-0000-0000-54df83450000\"",
  "_attachments": "attachments/"
}

 

The first method pulls the data from the DocumentDb and deserializes it into an instance of the type (returning None when there is no matching document):

let getPropertyValue (id: int) =
    let endpointUrl = ""
    let authKey = ""
    let client = new DocumentClient(new Uri(endpointUrl), authKey)
    let database = client.CreateDatabaseQuery().Where(fun db -> db.Id = "wakecounty").ToArray().FirstOrDefault()
    let collection = client.CreateDocumentCollectionQuery(database.CollectionsLink).Where(fun dc -> dc.Id = "taxinformation").ToArray().FirstOrDefault()
    let documentLink = collection.SelfLink
    let queryString = "SELECT * FROM taxinformation WHERE taxinformation.index = " + id.ToString()
    let query = client.CreateDocumentQuery(documentLink,queryString)
    // return an option so the downstream functions can handle missing records
    match query |> Seq.length with
    | 0 -> None
    | _ ->
        let firstValue = query |> Seq.head
        Some (HouseValuation.Parse(firstValue.ToString()))

The next method uses the school lookup script to pull the data from the WCPSS site.  The only real gotcha was that the space delimiter (char 32) was not the only character needed to split the address.  The WCPSS site also adds in a hard break (char 160, a non-breaking space).  It took me about an hour to figure out why the address was not breaking into an array of words when splitting on “ ”.  <sigh>
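A contrived snippet showing the problem (the address literal below is made up for illustration):

// A space (char 32) and a non-breaking space (char 160) look identical on screen,
// but splitting on the space alone leaves the char-160 token glued together.
let address = "1506" + string (char 160) + "WAKE FOREST RD"
address.Split([|(char)32|])              // [|"1506 WAKE"; "FOREST"; "RD"|] -- first token still glued
address.Split([|(char)32; (char)160|])   // [|"1506"; "WAKE"; "FOREST"; "RD"|]

With both delimiters included, the search criteria function comes together: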

let createSchoolAssignmentSearchCriteria (houseValuation: option<HouseValuation.Root>) =
    match houseValuation.IsSome with
    | true -> let deliminators = [|(char)32; (char)160|]
              let addressOneTokens = houseValuation.Value.AddressOne.Split(deliminators)
              let streetNumber = addressOneTokens.[0]
              let streetTemplateValue = addressOneTokens.[1]
              let streetName = addressOneTokens.[1..] |> Array.reduce(fun acc t -> acc + "+" + t)
              let addressTwoTokens = houseValuation.Value.AddressTwo.Split(deliminators)
              let city = addressTwoTokens.[0]
              let streetName' = streetName + city
              Some {SearchCriteria.streetTemplateValue=streetTemplateValue;
                    streetName=streetName';
                    streetNumber=streetNumber;}
    | false -> None

In any event, the last piece was to take the value and push it back up to another DocumentDb collection:

let writeSchoolAssignmentToDocumentDb (houseAssignment:option<HouseAssignment>) =
    match houseAssignment.IsSome with
    | true ->
        let endpointUrl = ""
        let authKey = ""
        let client = new DocumentClient(new Uri(endpointUrl), authKey)
        let database = client.CreateDatabaseQuery().Where(fun db -> db.Id = "wakecounty").ToArray().FirstOrDefault()
        let collection = client.CreateDocumentCollectionQuery(database.CollectionsLink).Where(fun dc -> dc.Id = "houseassignment").ToArray().FirstOrDefault()
        let documentLink = collection.SelfLink
        client.CreateDocumentAsync(documentLink, houseAssignment.Value) |> ignore
    | false -> ()

With that in place, the final function puts it all together:

let createHouseAssignment (id:int) =
    let houseValuation = getPropertyValue(id)
    let schools = houseValuation
                  |> createSchoolAssignmentSearchCriteria
                  |> createSearchCriteria'
                  |> createPage2QueryString
                  |> getSchoolData
    match schools.IsSome with
    | true -> Some {houseIndex=houseValuation.Value.Index; schools=schools.Value}
    | false -> None
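The script below calls a generateHouseAssignment function that the post doesn’t show; my guess is that it simply chains the lookup with the DocumentDb write:

// Hypothetical: not shown in the post -- presumably it composes
// createHouseAssignment with writeSchoolAssignmentToDocumentDb.
let generateHouseAssignment (id:int) =
    id
    |> createHouseAssignment
    |> writeSchoolAssignmentToDocumentDb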

and now we have an end-to-end way of combining the content of two different sites:

//#time
//[1..100] |> Seq.iter(fun id -> generateHouseAssignment id)

gives this:

[images: the house valuation record and its school assignment]

You can see the gist here
