WCPSS Scores and Property Tax Valuations Using R

With all of the data gathered and organized, I was ready to do some analytics using R.  The first thing I did was to load the four major datasets into R.


  • NCScores is the original dataset that has the school scores.  I had already done an analysis on it here.
  • SchoolValuation is the aggregate property value for each school, as determined by scraping the Wake County Tax website and the Wake County School Assignment website.  You can read how it was created here and here.
  • SchoolNameMatch is a crosswalk table between the school name as found in the NCScores dataframe and the SchoolValuation dataframe.  You can read how it was created here.
  • WakeCountySchoolInfo is an export from WCPSS that was tossed around at Open Data Day.

Step one was to reduce the North Carolina scores data to only Wake County:

#Create Wake County Scores From NC State Scores
WakeCountyScores <- NCScores[NCScores$District == 'Wake County Schools',]

The next step was to add in the SchoolNameMatch so that we have the tax valuation school name:

#Join SchoolNameMatch to Wake County Scores
WakeCountyScores <- merge(x=WakeCountyScores, y=SchoolNameMatch, by.x="School", by.y="WCPSS")

Interestingly, R is smart enough that the common field is not duplicated; only the additional field(s) are added.

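Here is a minimal sketch of that merge() behavior using two toy dataframes (the data is made up for illustration):

#A minimal sketch of merge() key handling, using made-up toy data
scores <- data.frame(School = c("A", "B"), SchoolScore = c(80, 90))
matches <- data.frame(WCPSS = c("A", "B"), Property = c("A Elementary", "B Elementary"))
merged <- merge(x=scores, y=matches, by.x="School", by.y="WCPSS")
names(merged)  #"School" "SchoolScore" "Property" -- the join key appears only once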

The next step was to add in the Wake County property values, remove the Property field (it is no longer needed), and convert the TaxBase field from string to numeric:

#Join Property Values
WakeCountyScores <- merge(x=WakeCountyScores, y=SchoolValuation, by.x="Property", by.y="SchooName")

#Remove Property column
WakeCountyScores$Property <- NULL

#Turn tax base to numeric; if the strings came in as a factor (R's old
#default), go through as.character first so we get values, not level codes
WakeCountyScores$TaxBase <- as.numeric(as.character(WakeCountyScores$TaxBase))

Eager to do an analysis, I pumped the data into a correlation:

#Do a Correlation
cor(WakeCountyScores$TaxBase, WakeCountyScores$SchoolScore, use="complete")


So clearly my expectation that property values would track the scores the way FreeAndReducedLunch did (a .85 correlation) was not met.  I decided to use Practical Data Science with R, Chapter 3 (Exploring Data), as a guide to better understand the dataset.

#Practical Data Science With R, Chapter 3
summary(WakeCountyScores)
summary(WakeCountyScores$TaxBase)


So there is quite a range in tax base!  The next task was to use some graphs to explore the data.  I added in ggplot2 and followed the book's example for a histogram.  I started with score, and it came out as expected.  I then tried a histogram on TaxBase and had to tinker with the binwidth to make a meaningful chart:

#Histograms
ggplot(WakeCountyScores) + geom_histogram(aes(x=SchoolScore), binwidth=5, fill="gray")
ggplot(WakeCountyScores) + geom_histogram(aes(x=TaxBase), binwidth=10000, fill="gray")
#Ooops
ggplot(WakeCountyScores) + geom_histogram(aes(x=TaxBase), binwidth=5000000, fill="gray")

 


The book then moves to an example studying income, which is directly analogous to TaxBase, so I followed it very closely.  The next graphs were some density plots.  Note that the second one uses a logarithmic scale:

#Density
library(scales)
ggplot(WakeCountyScores) + geom_density(aes(x=TaxBase)) + scale_x_continuous(labels=dollar)
ggplot(WakeCountyScores) + geom_density(aes(x=TaxBase)) + scale_x_log10(labels=dollar) + annotation_logticks(sides="bt")

 


So it is kind of interesting that most schools cluster in terms of their tax base; because there is such a wide range, with the majority clustered at the low end, the logarithmic plot is much more revealing.

The book then moved on to showing the relationship between two variables, in this case SchoolScore as the Y variable and TaxBase as the X variable:

#Relationship between TaxBase and Scores
ggplot(WakeCountyScores, aes(x=TaxBase, y=SchoolScore)) + geom_point()
ggplot(WakeCountyScores, aes(x=TaxBase, y=SchoolScore)) + geom_point() + stat_smooth(method="lm")
ggplot(WakeCountyScores, aes(x=TaxBase, y=SchoolScore)) + geom_point() + geom_smooth()


So what is interesting is that there does not seem to be a strong relationship between scores and tax base.  There look to be roughly as many schools below the fitted curve as above it.  Note that the smoothing curve is much better than the linear fit at showing the relationship of scores to tax base: you can see the dip in the lower quartile and the increase at the tail.  It makes sense that the higher tax base shows an increase in scores, but what’s up with that dip?

Finally, the same data is shown using a hex chart:

library(hexbin)
ggplot(WakeCountyScores, aes(x=TaxBase, y=SchoolScore)) + geom_hex(binwidth=c(100000000,5)) + geom_smooth(color="white", se=F)


So taking a step back, it is clear that there is a weakness in this analysis.  Some schools have thousands of students and some have a couple hundred (high schools versus elementary schools), so using the absolute dollars from the tax valuation is misleading.  What we really need is tax base per student.  Going back to the WakeCountySchoolInfo dataframe, I merged it in and pulled out a student count column:

WakeCountyScores <- merge(x=WakeCountyScores, y=WakeCountySchoolInfo, by.x="School", by.y="School.Name")
names(WakeCountyScores)[names(WakeCountyScores)=="School.Membership.2013.14..ADM..Mo2."] <- "StudentCount"
#as.character guards against factor-to-numeric level-code surprises
WakeCountyScores$StudentCount <- as.numeric(as.character(WakeCountyScores$StudentCount))

WakeCountyScores["TaxBasePerStudent"] <- WakeCountyScores$TaxBase/WakeCountyScores$StudentCount
summary(WakeCountyScores$TaxBasePerStudent)

Interestingly, the number of records in the base frame dropped from 166 to 152, which means that perhaps we need a second mapping table.  In any event, you can see that the average tax base per student is $6.5 million, with a max of $114 million.  Quite a range!

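To pin down exactly which schools fell out of the merge (and therefore what a second mapping table would need to cover), a sketch like the following would do it.  WakeCountyScoresBefore is a hypothetical copy of the dataframe saved before the merge, not a variable from the analysis above:

#Hypothetical: save a copy before the merge, e.g.
#  WakeCountyScoresBefore <- WakeCountyScores
#then diff the school names afterwards
dropped <- setdiff(WakeCountyScoresBefore$School, WakeCountyScores$School)
dropped          #the school names that fell out of the merge
length(dropped)  #should be 14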

Going back to the point and hex graphs:

ggplot(WakeCountyScores, aes(x=TaxBasePerStudent, y=SchoolScore)) + geom_point() + geom_smooth()
ggplot(WakeCountyScores, aes(x=TaxBasePerStudent, y=SchoolScore)) + geom_hex(binwidth=c(25000000,5)) + geom_smooth(color="white", se=F)

 


There is something interesting going on here.  First, the initial conclusion that a higher tax base leads to a gradual increase in scores turns out to be wrong once you move from total tax base to tax base per student.

Also, note the significant drop in school scores once you move away from the lowest tax-base schools, then the recovery, and then the drop again.  From a real estate perspective, these charts suggest that the marginal value of a really expensive or really inexpensive house in Wake County is not worth it (at least in terms of where you send your kids), and that there is a sweet spot of value above a certain price point.

You can find the gist here and the repo is here.

Some lessons I learned in doing this exercise:

  • Some records got dropped between the scores dataframe and the info dataframe -> so there needs to be another mapping table.
  • Express the tax base in millions so the charts are easier to read (see the sketch after this list).
  • What’s up with that school with $114 million of tax base per student?
  • An interesting question is the allocation of dollars to each school compared to its tax base.  I wonder if that is on WCPSS somewhere.  Hummm…
  • You can’t use the tick (‘) notation, which means you do a lot of overwriting of dataframes.  This can be a costly feature of the language.  It is much better to assume immutability, even if you clutter up your data window.
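For the tax-base-in-millions item, the conversion itself is a one-liner.  TaxBaseMillions is a column name I made up for illustration:

#Sketch for the "tax base in millions" to-do; TaxBaseMillions is a made-up name
WakeCountyScores["TaxBaseMillions"] <- WakeCountyScores$TaxBase / 1e6
summary(WakeCountyScores$TaxBaseMillions)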

As a final note, I was using the Console window b/c that is what the intro books do.  This is a huge mistake in RStudio.  It is much better to create a script and send the results to the console, so that you can make changes and run things again.  It is a cheap way of avoiding head-scratching bugs…

Predictive Analytics With Microsoft Azure Machine Learning

(On vacation this week)

Over the Christmas holiday, I had some time to look at some of the books that have been sitting on my bookshelf.  One of these was Predictive Analytics with Microsoft Azure Machine Learning by Barga, Fontama, and Tok.


This book is a great introduction to both analytics and Azure ML.  I really appreciated how the authors started off with a couple of basic experiments to get your feet wet, then moved over to some theory about different ML techniques, and then finished out the rest of the book with some hands-on labs.

I worked through all of the labs (except one) in about six hours.  The labs follow a very nice step-by-step pattern with plenty of screenshots.  My only quibble with the book is that the most interesting lab, Building a Churn Model, relied on data from a third party.  When I went to the third party’s website to download the data, the page had broken links and 404s.  I went to the book’s site at Apress and it did not have the data either.  That was kind of frustrating and something that the authors should have considered.

In any event, if you have some time, working through Predictive Analytics With Microsoft Azure Machine Learning is well worth it and quite fun.

Introduction to (part of) IBM Watson

Recently, I joined the IBM Watson beta program (you can join too here) to see what it had to offer.  It looks like IBM is using the “Watson” name to cover a broad array of analytical and machine learning capabilities.  One area where Watson is used is doing statistical analysis without knowing any programming or statistics.  For example, I went into their portal and uploaded a new dataset that I had just gotten from the Town of Cary regarding traffic stops:


I then hit the “New Exploration” button just to see what would happen and voila, I have graphs!


 

So this is kind of interesting: they seem to use both model sweeping and parameter sweeping, and then natural-language questions to explore the dataset.  This is quite impressive, as it allows someone who knows nothing about statistics to ask questions and get answers.  I am not sure if there is a way to drill down into the models to tweak the questions, nor does there look to be a way to consume the results elsewhere; instead, it looks like a management dashboard.  It is a bit like the standard dataset preview taken to the nth degree.

I then went back and hit the “Create a Prediction” button:


I picked a random Y variable (“disposition”) with the default values and voila, graphs:


Interestingly, it does some sweeping and it picked up that the PrimaryKey is correlated with the date – which makes sense, since the date is part of the PK value 🙂


In any event, I think this is a cool entry into the machine learning space from IBM.  They really have done a good job of making data science accessible.  Now, if they could put their weight behind “Open Data” so there are lots of really cool datasets available to analyze, they would really position themselves well in an emerging market.  I can’t wait to dig in even more with Watson…

Using IBM’s Watson With F#

 

I think everyone is aware of IBM’s Watson from its appearance on Jeopardy.  Apparently, IBM has made the Watson API available to developers if you sign up here.  Well, there goes my Sunday morning!  I signed up, and after one email confirmation, I was in.
IBM has tied Watson to something called “Blue Mix”, which looks to be a full-service suite of applications, from deployment to hosting.  When I looked at the API documentation here, I decided to use the language translation service as a good “hello world” project.  Looking at the API help page, I was hoping to just make a request and get a response with an auth token in the header, like every other API in the world.  However, the documentation really leads you down a path of installing the Watson Explorer on your local machine, creating a Blue Mix project, etc.
Fortunately, the documentation has some pointers to other projects where people have made their own apps.  I used this one as a model and set up Fiddler like so:


The authorization token is the username and password, separated by a colon and encoded to Base64.
Sure enough, a 200:


Setting it up in #FSharp was a snap:
1 #r @"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.5\System.Net.Http.dll" 2 #r @"..\packages\Microsoft.AspNet.WebApi.Client.5.2.2\lib\net45\System.Net.Http.Formatting.dll" 3 4 open System 5 open System.Net.Http 6 open System.Net.Http.Headers 7 open System.Net.Http.Formatting 8 open System.Collections.Generic 9 10 11 let serviceName = "machine_translation" 12 let baseUrl = "http://wex-mt.mybluemix.net/resources/translate" 13 let userName = "youNameHere@aol.com" 14 let password = "yourPasswordHere" 15 let authKey = userName + ":" + password 16 17 let client = new HttpClient() 18 client.DefaultRequestHeaders.Authorization <- new AuthenticationHeaderValue("Basic",authKey) 19 20 let input = new Dictionary<string,string>() 21 input.Add("text","This is a test") 22 input.Add("sid","mt-enus-eses") 23 let content = new FormUrlEncodedContent(input) 24 25 let result = client.PostAsync(baseUrl,content).Result 26 let resultContent = result.Content.ReadAsStringAsync().Result

And sure enough


 

You can see the gist here.

So with that simple call under my belt, I decided to look at the API that everyone is talking about: the question/answer API.  I fired up Fiddler again and took a look at the docs.  After some tweaking of the URI, I got a successful request/response:


The answers to an empty question are kind of interesting, if not head-scratching:


So passing in a question:


So we are cooking with gas.  Back into FSI:

1 #r @"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.5\System.Net.Http.dll" 2 #r @"..\packages\Microsoft.AspNet.WebApi.Client.5.2.2\lib\net45\System.Net.Http.Formatting.dll" 3 4 open System 5 open System.Net.Http 6 open System.Net.Http.Headers 7 open System.Net.Http.Formatting 8 open System.Collections.Generic 9 10 11 let baseUrl = "http://wex-qa.mybluemix.net/resources/question" 12 let userName = "yourName@aol.com" 13 let password = "yourCreds" 14 let authKey = userName + ":" + password 15 16 let client = new HttpClient() 17 client.DefaultRequestHeaders.Authorization <- new AuthenticationHeaderValue("Basic",authKey) 18 19 let input = new Dictionary<string,string>() 20 input.Add("question","what time is it") 21 let content = new FormUrlEncodedContent(input) 22 23 let result = client.PostAsync(baseUrl,content).Result 24 let resultContent = result.Content.ReadAsStringAsync().Result

With the result like so:


And since it is JSON coming back, why not use the JSON type provider?

//JsonProvider comes from FSharp.Data; the package path assumes the same
//NuGet layout used elsewhere in this post
#r @"..\packages\FSharp.Data.2.0.14\lib\net40\FSharp.Data.dll"
open FSharp.Data

let client = new HttpClient()
client.DefaultRequestHeaders.Authorization <- new AuthenticationHeaderValue("Basic", authKey)

let input = new Dictionary<string,string>()
input.Add("question","How can I quit smoking")
let content = new FormUrlEncodedContent(input)

let result = client.PostAsync(baseUrl, content).Result
let resultContent = result.Content.ReadAsStringAsync().Result

type qaResponse = JsonProvider<".\QAResponseJson.json">
let qaAnswer = qaResponse.Parse(resultContent)

qaAnswer.Question.Answers
|> Seq.ofArray
|> Seq.iter(fun a -> printfn "(%s)" a.Text)

Here is Watson’s response:


You can see the gist here.

Smart Nerd Dinner

I think there is general agreement that the age of the ASP.NET wire-framing post-back web dev is over.  If you are going to write web applications in 2015 on the .NET stack, you have to be able to use JavaScript and associated JavaScript frameworks like Angular.  Similarly, the full-stack developer needs a much deeper understanding of the data that passes in and out of their application.  With the rise of analytics in applications, the developer needs different tools and approaches.  Just as you need to know JavaScript if you are going to be in the browser, you need to know F# if you are going to be building industrial-grade domain and data layers.

I decided to refactor an existing ASP.NET postback website to see how hard it would be to introduce F# to the project and apply some basic statistics to make the site smarter.  It was pretty easy and the payoffs were quite large.

If you are not familiar with it, Nerd Dinner is the canonical example of an MVC application, created to show Microsoft web devs how to create a website using the .NET stack.  The original project was put into a book by the Mount Rushmore of MSFT uber-devs:


The project was so successful that it actually was launched into a real website, and you can find the code on CodePlex here.

When you download the source code from the repository, you will notice a couple of things:

1) It is not a very big project, with only 1,100 lines of code


2) There are 191 FxCop violations


3) It does compile straight out of source control, but some of the unit tests fail


4) There is pretty low code coverage (21%)


Focusing on the code coverage issue, it makes sense that there is not much coverage because there is not much code that can be covered.  There are maybe 15 lines of “business logic,” and that is if the term is expanded to include input validation.  This is an example:


Also, there are maybe ten lines of code that do some basic filtering:


So step one in the quest to refactor Nerd Dinner to be a bit smarter was to rename the projects.  Since MVC is a UI framework, it made sense to call it that.  I then changed the namespaces to reflect the new structure:


The next step was to take the domain classes out of the UI and put them into the application.  First, I created another project:


I then took all of the interfaces that were in the UI and placed them into the application:

namespace NerdDinner.Models

open System
open System.Linq
open System.Linq.Expressions

type IRepository<'T> =
    abstract All : IQueryable<'T>
    abstract AllIncluding : [<ParamArray>] includeProperties:Expression<Func<'T, obj>>[] -> IQueryable<'T>
    abstract member Find: int -> 'T
    abstract member InsertOrUpdate: 'T -> unit
    abstract member Delete: int -> unit
    abstract member SubmitChanges: unit -> unit

type IDinnerRepository =
    inherit IRepository<Dinner>
    abstract member FindByLocation: float*float -> IQueryable<Dinner>
    abstract FindUpcomingDinners : unit -> IQueryable<Dinner>
    abstract FindDinnersByText : string -> IQueryable<Dinner>
    abstract member DeleteRsvp: 'T -> unit

I then took all of the data structures/models and placed them in the application:

namespace NerdDinner.Models

open System
open System.Web.Mvc
open System.Collections.Generic
open System.ComponentModel.DataAnnotations
open System.ComponentModel.DataAnnotations.Schema

type public LocationDetail (latitude,longitude,title,address) =
    let mutable latitude = latitude
    let mutable longitude = longitude
    let mutable title = title
    let mutable address = address

    member public this.Latitude
        with get() = latitude
        and set(value) = latitude <- value

    member public this.Longitude
        with get() = longitude
        and set(value) = longitude <- value

    member public this.Title
        with get() = title
        and set(value) = title <- value

    member public this.Address
        with get() = address
        and set(value) = address <- value

type public RSVP () =
    let mutable rsvpID = 0
    let mutable dinnerID = 0
    let mutable attendeeName = ""
    let mutable attendeeNameId = ""
    let mutable dinner = null

    member public self.RsvpID
        with get() = rsvpID
        and set(value) = rsvpID <- value

    member public self.DinnerID
        with get() = dinnerID
        and set(value) = dinnerID <- value

    member public self.AttendeeName
        with get() = attendeeName
        and set(value) = attendeeName <- value

    member public self.AttendeeNameId
        with get() = attendeeNameId
        and set(value) = attendeeNameId <- value

    member public self.Dinner
        with get() = dinner
        and set(value) = dinner <- value

and public Dinner () =
    let mutable dinnerID = 0
    let mutable title = ""
    let mutable eventDate = DateTime.MinValue
    let mutable description = ""
    let mutable hostedBy = ""
    let mutable contactPhone = ""
    let mutable address = ""
    let mutable country = ""
    let mutable latitude = 0.
    let mutable longitude = 0.
    let mutable hostedById = ""
    let mutable rsvps = List<RSVP>() :> ICollection<RSVP>

    [<HiddenInput(DisplayValue=false)>]
    member public self.DinnerID
        with get() = dinnerID
        and set(value) = dinnerID <- value

    [<Required(ErrorMessage="Title Is Required")>]
    [<StringLength(50,ErrorMessage="Title may not be longer than 50 characters")>]
    member public self.Title
        with get() = title
        and set(value) = title <- value

    [<Required(ErrorMessage="EventDate Is Required")>]
    [<Display(Name="Event Date")>]
    member public self.EventDate
        with get() = eventDate
        and set(value) = eventDate <- value

    [<Required(ErrorMessage="Description Is Required")>]
    [<StringLength(256,ErrorMessage="Description may not be longer than 256 characters")>]
    [<DataType(DataType.MultilineText)>]
    member public self.Description
        with get() = description
        and set(value) = description <- value

    [<StringLength(256,ErrorMessage="Hosted By may not be longer than 256 characters")>]
    [<Display(Name="Hosted By")>]
    member public self.HostedBy
        with get() = hostedBy
        and set(value) = hostedBy <- value

    [<Required(ErrorMessage="Contact Phone Is Required")>]
    [<StringLength(20,ErrorMessage="Contact Phone may not be longer than 20 characters")>]
    [<Display(Name="Contact Phone")>]
    member public self.ContactPhone
        with get() = contactPhone
        and set(value) = contactPhone <- value

    [<Required(ErrorMessage="Address Is Required")>]
    [<StringLength(50,ErrorMessage="Address may not be longer than 50 characters")>]
    [<Display(Name="Address")>]
    member public self.Address
        with get() = address
        and set(value) = address <- value

    [<UIHint("CountryDropDown")>]
    member public this.Country
        with get() = country
        and set(value) = country <- value

    [<HiddenInput(DisplayValue=false)>]
    member public self.Latitude
        with get() = latitude
        and set(value) = latitude <- value

    [<HiddenInput(DisplayValue=false)>]
    member public self.Longitude
        with get() = longitude
        and set(value) = longitude <- value

    [<HiddenInput(DisplayValue=false)>]
    member public self.HostedById
        with get() = hostedById
        and set(value) = hostedById <- value

    member public self.RSVPs
        with get() = rsvps
        and set(value) = rsvps <- value

    member public self.IsHostedBy (userName:string) =
        System.String.Equals(hostedBy,userName,System.StringComparison.Ordinal)

    member public self.IsUserRegistered(userName:string) =
        rsvps |> Seq.exists(fun r -> r.AttendeeName = userName)

    [<UIHint("Location Detail")>]
    [<NotMapped()>]
    member public self.Location
        with get() = new LocationDetail(self.Latitude,self.Longitude,self.Title,self.Address)
        and set(value:LocationDetail) =
            let latitude = value.Latitude
            let longitude = value.Longitude
            let title = value.Title
            let address = value.Address
            ()

Unlike C#, where there is a class per file, all of the related elements are placed into the same location.  Also, notice the absence of semicolons, curly braces, and other distracting characters.  Finally, you can see that because we are in the .NET Framework, all of the data annotations are the same.  Sure enough, after pointing the MVC UI to the application and hitting run, the application just works.


With the separation complete, it was time to make our app much smarter.  The first thing that I thought of was that when a person creates an account, they enter their first and last name.

 

This seems like an excellent opportunity to add some personalization to our site.  Going back to this analysis of names given to newborns in the United States: if I know your first name, I have a pretty good chance of guessing your age, gender, and state of birth.  For example, ‘Jose’ is probably a male born in his twenties in either Texas or California.  ‘James’ is probably a male in his 40s or 50s.

I added six pictures to the site, for young, middle-aged, and old males and females.


 

I then modified the logonStatus partial view like so:

@using NerdDinner.UI;

@if(Request.IsAuthenticated) {
    <text>Welcome <b>@(((NerdIdentity)HttpContext.Current.User.Identity).FriendlyName)</b>!
    [ @Html.ActionLink("Log Off", "LogOff", "Account") ]</text>
}
else {
    @:[ @Html.ActionLink("Log On", "LogOn", new { controller = "Account", returnUrl = HttpContext.Current.Request.RawUrl }) ]
}

@if (Session["adUri"] != null)
{
    <img alt="product placement" title="product placement" src="@Session["adUri"]" height="40" />
}

Then, in the LogOn controller, I created a session variable called adUri for the picture to reference:

public ActionResult LogOn(LogOnModel model, string returnUrl)
{
    if (ModelState.IsValid)
    {
        if (ValidateLogOn(model.UserName, model.Password))
        {
            // Make sure we have the username with the right capitalization
            // since we do case sensitive checks for OpenID Claimed Identifiers later.
            string userName = MembershipService.GetCanonicalUsername(model.UserName);

            FormsAuth.SignIn(userName, model.RememberMe);

            AdProvider adProvider = new AdProvider();
            String catagory = adProvider.GetCatagory(userName);
            Session["adUri"] = "/Content/images/" + catagory + ".png";

            // ... (the rest of the stock LogOn action continues unchanged)

And finally, I added an implementation of the AdProvider back in the application:

type AdProvider () =
    member this.GetCatagory personName: string =
        "middleAgedMale"

So running the app, we have a product placement for a middle-aged male:


So the last thing to do was to turn names into those categories.  I thought of a couple of different implementations: loading the entire census data set and searching it on demand, or using Azure ML and making an API request each time; I settled on just creating a lookup table that can be searched.  In any event, since I am using an interface, swapping out implementations is easy, and since I am using F#, creating implementations is easy.

I went back to my script file that analyzed the baby names from the US Census and created a new script.  I loaded the names into memory like before:

1 #r "C:/Git/NerdChickenChicken/04_mvc3_Working/packages/FSharp.Data.2.0.14/lib/net40/FSharp.Data.dll" 2 3 open FSharp.Data 4 5 type censusDataContext = CsvProvider<"https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/AK.TXT"> 6 type stateCodeContext = CsvProvider<"https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/states.csv"> 7 8 let stateCodes = stateCodeContext.Load("https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/states.csv"); 9 10 let fetchStateData (stateCode:string)= 11 let uri = System.String.Format("https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/{0}.TXT",stateCode) 12 censusDataContext.Load(uri) 13 14 let usaData = stateCodes.Rows 15 |> Seq.collect(fun r -> fetchStateData(r.Abbreviation).Rows) 16 |> Seq.toArray 17

I then created a function that returns the probability that a name is male:

let genderSearch name =
    let nameFilter = usaData
                     |> Seq.filter(fun r -> r.Mary = name)
                     |> Seq.groupBy(fun r -> r.F)
                     |> Seq.map(fun (n,a) -> n,a |> Seq.sumBy(fun (r) -> r.``14``))

    let nameSum = nameFilter |> Seq.sumBy(fun (n,c) -> c)
    nameFilter
    |> Seq.map(fun (n,c) -> n, c, float c/float nameSum)
    |> Seq.filter(fun (g,c,p) -> g = "M")
    |> Seq.map(fun (g,c,p) -> p)
    |> Seq.head

genderSearch "James"


I then created a function that calculates the years a name was popular (defined as the years whose counts sit more than one standard deviation above the average), along with the first and last such year:

let ageSearch name =
    let nameFilter = usaData
                     |> Seq.filter(fun r -> r.Mary = name)
                     |> Seq.groupBy(fun r -> r.``1910``)
                     |> Seq.map(fun (n,a) -> n,a |> Seq.sumBy(fun (r) -> r.``14``))
                     |> Seq.toArray
    let nameSum = nameFilter |> Seq.sumBy(fun (n,c) -> c)
    nameFilter
    |> Seq.map(fun (n,c) -> n, c, float c/float nameSum)
    |> Seq.toArray

let variance (source:float seq) =
    let mean = Seq.average source
    let deltas = Seq.map(fun x -> pown(x-mean) 2) source
    Seq.average deltas

let standardDeviation(values:float seq) =
    sqrt(variance(values))

let standardDeviation' name = ageSearch name
                              |> Seq.map(fun (y,c,p) -> float c)
                              |> standardDeviation

let average name = ageSearch name
                   |> Seq.map(fun (y,c,p) -> float c)
                   |> Seq.average

let attachmentPoint name = (average name) + (standardDeviation' name)

let popularYears name =
    let allYears = ageSearch name
    let attachmentPoint' = attachmentPoint name
    let filteredYears = allYears
                        |> Seq.filter(fun (y,c,p) -> float c > attachmentPoint')
                        |> Seq.sortBy(fun (y,c,p) -> y)
    filteredYears

let lastPopularYear name = popularYears name |> Seq.last
let firstPopularYear name = popularYears name |> Seq.head

lastPopularYear "James"


And then I created a function that takes in the probability of being male and the last year the name was popular, and assigns the name to a category:

let nameAssignment (malePercent, lastYearPopular) =
    match malePercent > 0.75, malePercent < 0.75, lastYearPopular < 1945, lastYearPopular > 1980 with
    | true, false, true, false -> "oldMale"
    | true, false, false, false -> "middleAgedMale"
    | true, false, false, true -> "youngMale"
    | false, true, true, false -> "oldFemale"
    | false, true, false, false -> "middleAgedFemale"
    | false, true, false, true -> "youngFeMale"
    | _,_,_,_ -> "unknown"

And then it was a matter of tying the functions together for each of the names in the master list:

let nameList = usaData
               |> Seq.map(fun r -> r.Mary)
               |> Seq.distinct

nameList
|> Seq.map(fun n -> n, genderSearch n)
|> Seq.map(fun (n,mp) -> n,mp, lastPopularYear n)
|> Seq.map(fun (n,mp,(y,c,p)) -> n, mp, y)

let nameList' = nameList
                |> Seq.map(fun n -> n, genderSearch n)
                |> Seq.map(fun (n,mp) -> n,mp, lastPopularYear n)
                |> Seq.map(fun (n,mp,(y,c,p)) -> n, mp, y)
                |> Seq.map(fun (n,mp,y) -> n,nameAssignment(mp,y))


And then I wrote the list out to a file:

open System.IO
let outFile = new StreamWriter(@"c:\data\nameList.csv")

nameList' |> Seq.iter(fun (n,c) -> outFile.WriteLine(sprintf "%s,%s" n c))
outFile.Flush()
outFile.Close()

Thanks to this Stack Overflow post for the file write (I wish the CSV type provider had this ability).  With the file created, I can use it as a lookup for my name function back in the MVC app, via the CSV type provider:

type nameMappingContext = CsvProvider<"C:/data/nameList.csv">

type AdProvider () =
    member this.GetCatagory personName: string =
        let nameList = nameMappingContext.Load("C:/data/nameList.csv")
        let foundName = nameList.Rows
                        |> Seq.filter(fun r -> r.Annie = personName)
                        |> Seq.map(fun r -> r.oldFemale)
                        |> Seq.toArray
        if foundName.Length > 0 then
            foundName.[0]
        else
            "middleAgedMale"

And now I have some (basic) personalization in Nerd Dinner.  (Emma is a young female name, so they get a picture of a campground.)


So this is rather crude.  There is no provision for nicknames, case sensitivity, etc.  But the site is on its way to becoming smarter…

The code can be found on github here.

Wake County Restaurant Inspection Data with Azure ML and F#

With Azure ML now available, I was thinking about some of the analysis I did last year and how I could do even more with the same data sets.  One such analysis that came to mind was the restaurant inspection data; you can see the prior analysis here.

I uploaded the restaurant data into Azure and thought of a simple question: can we predict inspection scores based on some easily available data?  This is an interesting dataset because there are some categorical data elements (zip code, restaurant type, etc.) and some continuous ones (priority foundation, etc.).

Here is the base dataset:


I created a new experiment, used a boosted decision tree regression and a neural network regression, and used a 70/30 train/test split.


After running the models and inspecting the model evaluation, I did not have a very good model:


I then decided to go back and pull some of the X variables out of the dataset and concentrate on only a couple of variables.  I added a Project Columns module, selected Restaurant Type and Zip Code as the X variables, and left Inspection Score as the Y variable.


With this done, I added a couple more models (a Bayesian Linear Regression and a Decision Forest Regression) and gave it a whirl:


Interestingly, adding these models did not give any better of a prediction, and dropping the variables to two made a less accurate model.  Without doing any more analysis, I picked the model with the lowest MAE (Boosted Decision Tree Regression):


I published it as a web service, and now I can consume it from a client app.  I used the code from the voting analysis found here as a template, and sure enough:

["27519","Restaurant","0","96.0897827148438"]

["27612","Restaurant","0","95.5728530883789"]

So restaurants in Cary, NC have a higher predicted inspection score than the ones found in northwest Raleigh.  However, before we start alerting the Cary Chamber of Commerce to create a marketing campaign (“Eat in Cary, we are safer”), note that the difference is within the MAE.

In any event, it would be easy to create a phone app: if you don’t know a restaurant’s score, you punch in the establishment type and the zip code and get a good idea of what the score will be.

This is an academic exercise b/c establishments have to show you their card and Yelp has their scores too, but it was a fun exercise nonetheless.  Happy eating.

Consuming Azure ML With F#

(This post is a continuation of this one)

So with a model that works well enough, I selected only that model and saved it.  I then created a new experiment and used the saved model with the base data, marking the project columns as the input and the score as the output (the green and blue circles, respectively):


After running it, I published it as a web service:


And voila, an endpoint ready to go.  I then took the auto-generated script and opened a new Visual Studio F# project to use it.  The problem was that this is the data structure the model needs:

FeatureVector = new Dictionary<string, string>()
{
    { "Precinct", "0" }, { "VRN", "0" }, { "VRstatus", "0" }, { "VRlastname", "0" },
    { "VRfirstname", "0" }, { "VRmiddlename", "0" }, { "VRnamesufx", "0" }, { "VRstreetnum", "0" },
    { "VRstreethalfcode", "0" }, { "VRstreetdir", "0" }, { "VRstreetname", "0" }, { "VRstreettype", "0" },
    { "VRstreetsuff", "0" }, { "VRstreetunit", "0" }, { "VRrescity", "0" }, { "VRstate", "0" },
    { "Zip Code", "0" }, { "VRfullresstreet", "0" }, { "VRrescsz", "0" }, { "VRmail1", "0" },
    { "VRmail2", "0" }, { "VRmail3", "0" }, { "VRmail4", "0" }, { "VRmailcsz", "0" },
    { "Race", "0" }, { "Party", "0" }, { "Gender", "0" }, { "Age", "0" },
    { "VRregdate", "0" }, { "VRmuni", "0" }, { "VRmunidistrict", "0" }, { "VRcongressional", "0" },
    { "VRsuperiorct", "0" }, { "VRjudicialdistrict", "0" }, { "VRncsenate", "0" }, { "VRnchouse", "0" },
    { "VRcountycomm", "0" }, { "VRschooldistrict", "0" }, { "11/6/2012", "0" }, { "Voted Ind", "0" },
},
GlobalParameters = new Dictionary<string, string>() { }

And since I am only using six of the columns, it made sense to reload the Wake County voter data with just the needed columns, so I went back to the original CSV and did that.  Interestingly, I could not set the original dataset as the publish input, so I added a Project Columns module that does nothing:


With that in place, I republished the service and opened Visual Studio.  I decided to start with a script.  I was struggling with the async calls when Tomas P helped me on Stack Overflow here.  I’ll say it again: the F# community is tops.  In any event, here is the initial script:

#r @"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.5\System.Net.Http.dll" #r @"..\packages\Microsoft.AspNet.WebApi.Client.5.2.2\lib\net45\System.Net.Http.Formatting.dll" open System open System.Net.Http open System.Net.Http.Headers open System.Net.Http.Formatting open System.Collections.Generic type scoreData = {FeatureVector:Dictionary<string,string>;GlobalParameters:Dictionary<string,string>} type scoreRequest = {Id:string; Instance:scoreData} let invokeService () = async { let apiKey = "" let uri = "https://ussouthcentral.services.azureml.net/workspaces/19a2e623b6a944a3a7f07c74b31c3b6d/services/f51945a42efa42a49f563a59561f5014/score" use client = new HttpClient() client.DefaultRequestHeaders.Authorization <- new AuthenticationHeaderValue("Bearer",apiKey) client.BaseAddress <- new Uri(uri) let input = new Dictionary<string,string>() input.Add("Zip Code","27519") input.Add("Race","W") input.Add("Party","UNA") input.Add("Gender","M") input.Add("Age","45") input.Add("Voted Ind","1") let instance = {FeatureVector=input; GlobalParameters=new Dictionary<string,string>()} let scoreRequest = {Id="score00001";Instance=instance} let! response = client.PostAsJsonAsync("",scoreRequest) |> Async.AwaitTask let! result = response.Content.ReadAsStringAsync() |> Async.AwaitTask if response.IsSuccessStatusCode then printfn "%s" result else printfn "FAILED: %s" result response |> ignore } invokeService() |> Async.RunSynchronously

 

Unfortunately, when I ran it, it failed.  Below is the Fiddler trace:


So it looks like the JSON serializer was appending the “@” symbol, which suggests it was serializing the records’ compiled backing fields (F# record fields get an “@” suffix internally).  I changed the records to types and voila:

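For reference, here is a minimal sketch of the kind of change that fixed it; this is my own illustration of moving from records to plain types with properties, not the exact code from the final script linked below:

//Hypothetical sketch: plain types expose properties with the expected names,
//whereas F# record fields compile to backing fields whose names end in "@",
//which is what appeared to be leaking into the JSON
type ScoreData (featureVector:Dictionary<string,string>, globalParameters:Dictionary<string,string>) =
    member this.FeatureVector = featureVector
    member this.GlobalParameters = globalParameters

type ScoreRequest (id:string, instance:ScoreData) =
    member this.Id = id
    member this.Instance = instance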

You can see the final script here.

So then I threw in some different numbers:

  • A millennial: ["27519","W","D","F","25","1","1","0.62500011920929"]
  • A senior citizen: ["27519","W","D","F","75","1","1","0.879632294178009"]

I wonder why social security never gets cut?

In any event, just to check the model:

  • A 15 year old: ["27519","W","D","F","15","1","0","0.00147285079583526"]