Correlation Between Recruit Rankings and Final Standings in Big Ten Football

Following up on my last post about screen scraping college football data in F#, I took the next step and analyzed the data that I scraped.  I am a big believer in a domain-specific language, so 'Rankings' means the ranking Rivals assigns to how well a school recruits players, and 'Standings' means the final position in the Big Ten after the games have been played.  Rankings are for recruiting; standings are for actually playing the games.

Going back to the code, the first thing I did was separate the standings call from the search for a given school, so that the XmlDocument is loaded once and then searched several times rather than being loaded for each search.  This improved performance dramatically:

    static member getAnnualConferenceStandings (year: int) =
        let url = "http://espn.go.com/college-football/conferences/standings/_/id/5/year/" + year.ToString() + "/big-ten-conference"
        let request = WebRequest.Create(Uri(url))
        use response = request.GetResponse()
        use stream = response.GetResponseStream()
        use reader = new IO.StreamReader(stream)
        let htmlString = reader.ReadToEnd()
        let divMarkerStartPosition = htmlString.IndexOf("my-teams-table")
        let tableStartPosition = htmlString.IndexOf("<table", divMarkerStartPosition)
        let tableEndPosition = htmlString.IndexOf("</table", tableStartPosition)
        let data = htmlString.Substring(tableStartPosition, tableEndPosition - tableStartPosition + 8)
        let xmlDocument = new XmlDocument()
        xmlDocument.LoadXml(data)
        xmlDocument

    static member getSchoolStanding (xmlDocument: XmlDocument, school) =
        let keyNode = xmlDocument.GetElementsByTagName("td")
                      |> Seq.cast<XmlNode>
                      |> Seq.find (fun node -> node.InnerText = school)
        let valueNode = keyNode.NextSibling
        (keyNode.InnerText, valueNode.InnerText)

    static member getConferenceStandings (year: int) =
        let xmlDocument = RankingProvider.getAnnualConferenceStandings(year)
        RankingProvider.schools
            |> Seq.map (fun school -> RankingProvider.getSchoolStanding(xmlDocument, school))
            |> Seq.sortBy snd
            |> Seq.toList
            |> List.rev
            |> Seq.mapi (fun index (school, _) -> school, index + 1)
            |> Seq.sortBy fst
            |> Seq.toList

Thanks to Valera Kolupaev for showing me how to use mapi in getConferenceStandings() to create a tuple from the list of schools and their position in the list.
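To illustrate the trick with a made-up list of schools: `mapi` hands you the zero-based index alongside each element, so turning an already-sorted list into (school, position) tuples is one line:

```fsharp
// Hypothetical, already-sorted standings list (best record first)
let standings = ["Ohio State"; "Michigan State"; "Nebraska"]

// mapi supplies the index; adding 1 turns it into a 1-based position
let positions =
    standings
    |> List.mapi (fun index school -> school, index + 1)
// [("Ohio State", 1); ("Michigan State", 2); ("Nebraska", 3)]
```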

I then went to the rankings call and added a way to pull out only the schools I am interested in.  That way I can compare individual schools, groups of schools, or the entire conference:

    static member getConferenceRankings (year) =
        RankingProvider.schools
            |> Seq.map (fun schoolName -> RankingProvider.getSchoolInSequence(year, schoolName))
            |> Seq.toList

    static member getSchoolInSequence (year, schoolName) =
        RankingProvider.getRecrutRankings(year)
            |> Seq.find (fun (school, _) -> school = schoolName)

After these two refactorings, my unit tests still ran green so I was ready to do the analysis.

image

I went out to my project of a couple of weeks ago for correlation and copied in the module.  The Correlation function takes in two lists of doubles.  The first list would be a school’s ranking and the second would be the standings:

  1. static member getCorrelationBetweenRankingsAndStandings(year, rankings, standings ) =
  2.     let ranks = Seq.map(fun (school,rank) -> rank) rankings
  3.     let stands = Seq.map(fun (school,standing) -> standing) standings
  4.     Calculations.Correlation(ranks,stands)
  5.  
  6. static member getCorrelation(year:int) =
  7.     let rankings = RankingProvider.getConferenceRankings year
  8.                     |> Seq.map(fun (school,rank) -> school,Convert.ToDouble(rank))
  9.     let standings = RankingProvider.getConferenceStandings(year+RankingProvider.yearDifferenceBetwenRankingsAndStandings)
  10.                     |> Seq.map(fun (school, standing) -> school, Convert.ToDouble(standing))
  11.     let correlation = RankingProvider.getCorrelationBetweenRankingsAndStandings(year,rankings, standings)
  12.     (year, correlation)

A couple of things to note:

1) This function assumes that both the rankings and the standings are the same length and are in order by school name.  A production application would check this as part of standard argument validation.

2) I used Convert.ToDouble() to change the Int32 of the ranking into the Double that the correlation function expects.  Having these .NET assemblies available at key points in the application really moved things along.
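A sketch of what that argument validation might look like (the function name and messages are mine): check that the lengths match and that the schools line up pairwise before stripping the names off:

```fsharp
// Fail fast if the two (school, value) lists are not the same length
// or are not aligned on the same schools in the same order
let validateAligned (rankings: (string * float) list) (standings: (string * float) list) =
    if List.length rankings <> List.length standings then
        invalidArg "standings" "rankings and standings must be the same length"
    List.iter2
        (fun (r, _) (s, _) ->
            if r <> s then invalidArg "standings" (sprintf "school mismatch: %s vs %s" r s))
        rankings standings
```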

In any event, all that was left was to list the Big Ten schools to analyze, the number of years to analyze, and the year difference between the recruit rankings and the standings from the games they played in.

As a first step, I did all original big ten schools with 7 years of recruiting and a 1,2,3,4 years difference (2002 ranking compared to 2003, 2004,2005,2006 standings ,etc…):

image

image

image

image

The averages are .3303, .2650, .5138, and .6065 for the one-, two-, three-, and four-year differences, respectively.

And so, yeah – there is a really strong correlation between a recruiting ranking and the outcome on the field.  Also, a class seems to have its biggest impact in its senior year – which makes sense.  I don't have a hypothesis on why the correlation drops in the sophomore year – perhaps the 'impact freshmen' leave after one year?

Also of interest, the correlation is not uniform across the conference.  If you only look at the schools that have an emphasis on academics, the correlation drops significantly – to a negative correlation!

image

image

image

image

The averages are .1485, -.1446, -.2817, and -.0381.

So here is another great reason to create the new Big Ten – sometimes a really good recruiting class does not do well on the field, and other times a poorly-ranked recruiting class does well.  This kind of unpredictability is both exciting and probably much more likely to bring in casual fans.

Based on this analysis, here is what is going to happen in the Big Ten next year:

  • Michigan State and Ohio State will be the leaders
  • Michigan and Penn State are in the best position to beat Michigan State and Ohio State

But you didn’t need a statistical analysis to tell you that.  The key surprises that this analysis does surface are that

  • Nebraska will have a significant improvement in the standings in 2014
  • Indiana will have a significant improvement in the standings in 2015 and 2016

As a final note, I got this after doing a bunch of requests to Yahoo:

image

image

 

So I wonder if I hit the page too many times and my IP was flagged as a bot?  I waited a day for the server to reset so I could finish my analysis.  Perhaps this is a case where I should get the data while the getting is good and bring their pages down locally?
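A minimal sketch of the 'get it while the getting is good' idea – download a page once, save it beside the app, and hit the local copy on every later run (the function name and cache path are mine, and WebClient is used just to keep the sketch short):

```fsharp
open System.IO
open System.Net

// Return the cached copy if we have one; otherwise download and cache it
let getCachedPage (url: string) (cachePath: string) =
    if File.Exists cachePath then
        File.ReadAllText cachePath
    else
        use client = new WebClient()
        let html = client.DownloadString url
        File.WriteAllText(cachePath, html)
        html
```

This would also make the analysis repeatable, since the parsing code runs against a frozen copy of the page instead of whatever Yahoo serves up (or refuses to serve) that day.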

 


Screen Scraping College Football Statistics

As a follow-up to my post on the correlation between academic rankings and football rankings in the Big Ten, I thought I would look at the relationship between two different kinds of football rankings: the recruiting ranking assigned by Rivals and the actual results on the field.  To that end, I went to collect the data programmatically, because I am doing a time-series analysis and I didn't want to do data entry.

My first stop was to find a free service that exposes this data on the web.  No luck – either the data came from a service that cost money or it was presented as a web page.  Since I had never screen-scraped using F# (and I am cheap), I chose option #2.

My first data point was the recruiting rankings page on Yahoo Sports.  When I inspected the source of the page, I caught a break – the data is actually stored as Json on the page.

 image

So firing up Visual Studio, I created a solution with one F# project and two C# projects:

image

I then wrote a unit test to check that something is being returned:

    [TestMethod]
    public void getRecrutRankings_ReturnsExpected()
    {
        var rankings = RankingProvider.getRecrutRankings("2012");
        Assert.AreNotEqual(0, rankings.Length);
    }

I then went over to the F# project.  I created the RankingProvider type and then added a function that pulls in the rankings for a given year:

  1. static member getRecrutRankings(year) =
  2.     let url = "http://sports.yahoo.com/footballrecruiting/football/recruiting/teamrank/"+year+"/BIG10/all";
  3.     let request = WebRequest.Create(Uri(url))
  4.     use response = request.GetResponse()
  5.     use stream = response.GetResponseStream()
  6.     use reader = new IO.StreamReader(stream)
  7.     let htmlString = reader.ReadToEnd()
  8.     let startPosition = htmlString.IndexOf("var rankingsTableData =")
  9.     let headerLength = 23
  10.     let endPosition = htmlString.IndexOf(";",startPosition)
  11.     let data = htmlString.Substring(startPosition+headerLength,endPosition-startPosition-headerLength).Trim()
  12.     let results = JsonConvert.DeserializeObject(data)
  13.     results :?> Newtonsoft.Json.Linq.JArray
  14.         |> Seq.map(fun x -> (x.Value<string>("name"), Int32.Parse(x.Value<string>("rank"))))
  15.         |> Seq.toList

 

A couple of things to note.

  • Lines 2 through 12 are language-agnostic.  You would write the exact same code in C#/VB.NET with a slightly different syntax.
  • Line 13 is where things get interesting.  I used the :?> operator to downcast the Json to a typed structure.  :?> wins as the weirdest symbol I have ever used in computer programming.  I guess I haven't been programming long enough?
  • Lines 14 and 15 are where you can see why F# is better than C#.  I created a function that takes the Json and pushes it into a tuple.  With no explicit iteration, the code is both easier to read and less likely to have bugs.
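A tiny standalone illustration of the :?> (downcast) operator, away from the Json plumbing – it casts from a base type down to a more specific one, and throws InvalidCastException if the object isn't actually that type:

```fsharp
// Box an int array up to obj, then downcast it back
let boxed : obj = box [| 1; 2; 3 |]
let unboxed = boxed :?> int[]       // succeeds: the obj really is an int[]

// The related :? operator tests the type without casting
let isIntArray = (boxed :? int[])   // true
```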

Hoping to press my luck, I went over to the other page (the one that holds the standings from the actual games) to see if it used Json.  No dice – so back to mid-2000s screen scraping.  I created a function that loads the table into an XML document and then searches it for a given school.

  1. static member getConferenceStanding(year, school) =
  2.     let url = "http://espn.go.com/college-football/conferences/standings/_/id/5/year/"+year+"/big-ten-conference";
  3.     let request = WebRequest.Create(Uri(url))
  4.     use response = request.GetResponse()
  5.     use stream = response.GetResponseStream()
  6.     use reader = new IO.StreamReader(stream)
  7.     let htmlString = reader.ReadToEnd()
  8.     let divMarkerStartPosition = htmlString.IndexOf("my-teams-table");
  9.     let tableStartPosition = htmlString.IndexOf("<table",divMarkerStartPosition);
  10.     let tableEndPosition = htmlString.IndexOf("</table",tableStartPosition);
  11.     let data = htmlString.Substring(tableStartPosition, tableEndPosition - tableStartPosition + 8)
  12.     let xmlDocument = new XmlDocument();
  13.     xmlDocument.LoadXml(data);
  14.     let keyNode = xmlDocument.GetElementsByTagName("td")
  15.                     |> Seq.cast<XmlNode>
  16.                     |> Seq.find (fun node -> node.InnerText = school)
  17.     let valueNode = keyNode.NextSibling
  18.     (keyNode.InnerText, valueNode.InnerText)

A couple of things to note:

  • Lines 2-7 are identical to the prior function, so they should be combined into a single function that can be independently testable.
  • Lines 8-13 are language-agnostic.  You would write the exact same code in C#/VB.NET with a slightly different syntax.
  • Lines 14-18 are where F# really shines.  Like the prior function, using functional programming techniques in F# saved me time, avoided bugs, and made the code much more intuitive.
  • I am making a web call for each function call – this should be optimized so the call is made once and the xmlDocument is passed in.  This would also make the function much more testable (even without a mocking framework).
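A sketch of that first refactoring – pulling the shared download boilerplate into one function (my name for it) so both scrapers, and any test double, can share it:

```fsharp
open System
open System.IO
open System.Net

// One place for the WebRequest boilerplate both scrapers repeat
let downloadHtml (url: string) =
    let request = WebRequest.Create(Uri(url))
    use response = request.GetResponse()
    use stream = response.GetResponseStream()
    use reader = new StreamReader(stream)
    reader.ReadToEnd()
```

The parsing functions then take the html (or the XmlDocument built from it) as a parameter, which is exactly what makes them testable without a network connection.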

Next up, I needed to call this function for each of the Big Ten Schools:

    static member getConferenceStandings (year) =
        let schools = [|"Nebraska"; "Michigan"; "Northwestern"; "Michigan State"; "Iowa";
                        "Minnesota"; "Ohio State"; "Penn State"; "Wisconsin"; "Purdue"; "Indiana"; "Illinois"|]
        schools
            |> Seq.map (fun school -> RankingProvider.getConferenceStanding(year, school))
            |> Seq.sortBy snd
            |> Seq.toList
            |> List.rev
 

This is purely F# and a pure joy to write (and it took the least amount of time).  Note that the sort is on the second element of the tuple and that the list is reversed: the second element is the win-loss record, so F# sorts ascending on the number of wins.  Since Seq does not have a rev function, I turned the sequence into a List, which does.
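The sort-then-reverse dance on a made-up standings list (note the caveat in the comment – this relies on the win-loss strings comparing sensibly, which holds for single-digit win totals):

```fsharp
// (school, win-loss record) tuples, as the scraper produces them
let standings = [("Iowa", "4-4"); ("Ohio State", "8-0"); ("Michigan", "6-2")]

// Sort ascending on the record string, then reverse to get best-first.
// Caution: string comparison – "10-2" would sort before "4-4".
let ordered =
    standings
    |> List.sortBy snd
    |> List.rev
// [("Ohio State", "8-0"); ("Michigan", "6-2"); ("Iowa", "4-4")]
```

Later F# versions also offer List.sortByDescending, which collapses the two steps into one.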

Some might ask “Why didn’t you use type-providers?”  My answer is “I tried, but I couldn’t get them to work.”  For example, here is the code that I used for the type provider when parsing the xmlDocument:

    xmlDocument.LoadXml(data)
    let document = XmlProvider<xmlDocument>

The problem is that the type provider expects a uri or a literal sample at compile time (I couldn't find an overload that takes an XmlDocument).  It looks like type providers are designed for sources that are ready to, well, provide (web services, databases, etc.) rather than jerry-rigged data (like screen scraping).

In any event, with these two functions ready, I went to the UI project and decided to see how the teams did on the field in 2012 compared to how they did in recruiting two years before:

    static void Main(string[] args)
    {
        Console.WriteLine("Start");

        Console.WriteLine("-------Rankings");
        var rankings = RankingProvider.getRecrutRankings("2010");
        foreach (var school in rankings)
        {
            Console.WriteLine(school.Item1 + ":" + school.Item2);
        }

        Console.WriteLine("-------Standings");
        var standings = RankingProvider.getConferenceStandings("2012");
        foreach (var school in standings)
        {
            Console.WriteLine(school.Item1 + ":" + school.Item2);
        }

        Console.WriteLine("End");
        Console.ReadKey();
    }

And the results:

image

I have no idea if a 2-year lag between recruiting and rankings is the right number – perhaps an analysis of the correct lag is a post for another day.  After all, between red-shirt freshmen, transfer rules, and attrition, there are plenty of variables that determine when a recruiting class has its biggest impact.  Also, the standings are a blend of recruiting classes, and since I am not evaluating individual players, I can't go to that level of detail.  Two years out seems reasonable, but as Bluto famously once said:

    static member getBlutoQuote() =
        "Seven years of college down the drain."


the average might be different.  In any event, I now have the data I want, so the next step is to analyze it to see if there is any correlation.  At first glance, there might be something – the top 4 schools in recruiting all finished in the top 4 in the standings – but the bottom 4 is more muddled, with only Illinois doing poorly in both recruiting and the standings.

More to come…

F# and Monopoly Simulation Redux

Now that I am 4 months into my F# adventure, I thought I would revisit the Monopoly simulation that I wrote in August.  There are some pretty big differences.

1) I am not using the 'if…then' construct at all – rather, I am using pattern matching.  For example, consider the original communityChest function:

    let communityChest x y =
        if y = 1 then
            0
        else if y = 2 then
            10
        else
            x

and the most recent one:

    let communityChest (tile, randomNumber) =
        match randomNumber with
            | 1 -> 0
            | 2 -> 10
            | _ -> tile

“Big deal”, you are saying to yourself (or at least I did).  But the power of pattern matching is put on display with the revised chance function.  The code is much more readable and understandable.

Original:

    let chance x y =
        if y = 1 then
            0
        else if y = 2 then
            10
        else if y = 3 then
            11
        else if y = 4 then
            39
        else if y = 5 then
            x - 3
        else if y = 6 then
            5
        else if y = 7 then
            24
        else if y = 8 then
            if x < 5 then
                5
            else if x < 15 then
                15
            else if x < 25 then
                25
            else if x < 35 then
                35
            else
                5
        else if y = 9 then
            if x < 12 then
                12
            else if x < 28 then
                28
            else
                12
        else
            x

 

Revised:

    let goToNearestRailroad tile =
        match tile with
            | 36|2 -> 5
            | 7 -> 15
            | 17|22 -> 25
            | 33 -> 35
            | _ -> failwith "not on chance"

    let goToNearestUtility tile =
        match tile with
            | 36|2|7 -> 12
            | 12|22|33 -> 28
            | _ -> failwith "not on chance"

    let chance (tile, randomNumber) =
        match randomNumber with
            | 1 -> 0
            | 2 -> 10
            | 3 -> 11
            | 4 -> 39
            | 5 -> tile - 3
            | 6 -> 5
            | 7 -> 24
            | 8 -> goToNearestRailroad tile
            | 9 -> goToNearestUtility tile
            | _ -> tile

 

As a side note, I ditched the x and y names because they are unreadable.  When I went back to the code after 3 months, I spent way too long trying to figure out what the heck 'x' was.  I know that scientific code uses cryptic names, but clean code does not.  I changed them and the code became much better.

I then took a look at the move() function.  The original:

    let move x y z =
        if x + y > 39 then
            x + y - 40
        else if x + y = 30 then
            10
        else if x + y = 2 then
            communityChest 2 z
        else if x + y = 7 then
            chance 7 z
        else if x + y = 17 then
            communityChest 17 z
        else if x + y = 22 then
            chance 22 z
        else if x + y = 33 then
            communityChest 33 z
        else if x + y = 36 then
            chance 36 z
        else
            x + y

 

and the revised:

  1. let getBoardMove (currentTile, dieTotal) =
  2.     let initialTile = currentTile + dieTotal
  3.     match initialTile with
  4.         | 2 -> communityChest (2, random.Next())
  5.         | 7 -> chance (7, random.Next())
  6.         | 17 -> communityChest (17, random.Next())
  7.         | 22 -> chance (22, random.Next())
  8.         | 30 -> 10
  9.         | 33 -> communityChest (33, random.Next())
  10.         | 36 -> chance (36, random.Next())
  11.         | 40|41|42|43|44|45|46|47|48|49|50|51 -> initialTile - 40
  12.         | _ -> initialTile

 

I am not happy with line 11 above – you cannot write '>40' or a range like '[40 .. 51]' directly on the left-hand side of a pattern match, although a `when` guard can express the same condition.
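For what it's worth, a `when` guard does let you express the 'passed Go' case without enumerating every tile – a sketch of line 11 rewritten that way (the function name is my own):

```fsharp
// Wrap a tile index back onto the 40-tile board after passing Go
let wrapBoard initialTile =
    match initialTile with
    | t when t >= 40 -> t - 40
    | t -> t
```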

So far, the biggest changes were making the values more understandable and replacing the if…then statements with pattern matching.  Both of these techniques make the code more readable and understandable.  The next big change came with the actual game play itself.  The original version:

    let simulation =
        let mutable startingTile = 0
        let mutable endingTile = 0
        let mutable doublesCount = 0
        let mutable inJail = false
        let mutable jailRolls = 0
        for diceRoll in 1 .. 10000 do
            let dieOneValue = random.Next(1,7)
            let dieTwoValue = random.Next(1,7)
            let cardDraw = random.Next(1,17)
            let numberOfMoves = dieOneValue + dieTwoValue

            if dieOneValue = dieTwoValue then
                doublesCount <- doublesCount + 1
            else
                doublesCount <- 0
            if inJail = true then
                if doublesCount > 1 then
                    inJail <- false
                    jailRolls <- 0
                    endingTile <- move 10 numberOfMoves cardDraw
                else
                    if jailRolls = 3 then
                        inJail <- false
                        jailRolls <- 0
                        endingTile <- move 10 numberOfMoves cardDraw
                    else
                        inJail <- true
                        jailRolls <- jailRolls + 1
            else
                if doublesCount = 3 then
                    inJail <- true
                    endingTile <- 10
                else
                    endingTile <- move startingTile numberOfMoves cardDraw

            printfn "die1: %A + die2: %A = %A FROM %A TO %A"
                dieOneValue dieTwoValue numberOfMoves startingTile endingTile
            startingTile <- endingTile
            tiles.[endingTile] <- tiles.[endingTile] + 1
You will notice that the word 'mutable' shows up five times.  Using mutable in F# is a code smell, so I refactored it out like so:

    let rec rollDice (currentTile, rollCount, doublesCount, inJail, jailRollCount) =
        let dieOneValue = random.Next(1,7)
        let dieTwoValue = random.Next(1,7)
        let dieTotal = dieOneValue + dieTwoValue
        let newRollCount = rollCount + 1

        let newDoublesCount =
            if dieOneValue = dieTwoValue then doublesCount + 1
            else 0

        let newTile = getTileMove(currentTile, dieTotal, newDoublesCount, inJail, jailRollCount)

        let newInJail =
            if newTile = 10 then true
            else false

        let newJailRollCount =
            if newInJail = inJail then jailRollCount + 1
            else 0

        let targetTuple = scorecard.[newTile]
        let newTuple = (fst targetTuple, snd targetTuple + 1)
        scorecard.[newTile] <- newTuple

        if rollCount < 10000 then
            rollDice (newTile, newRollCount, newDoublesCount, newInJail, newJailRollCount)
        else
            scorecard

No “mutable” (thanks to recursion) and only one assignment.  I also wanted to get rid of that one ‘<-’, and Tomas Petricek was kind enough to demonstrate the correct way to do this on Stack Overflow.  Finally, I had to throw in a supporting function to make the decision logic account for rolling doubles that may put you in jail or may get you out of jail depending on prior state (were you in jail when you rolled doubles, were you out of jail when you rolled doubles for the 3rd time, etc.).  I spent way too much time monkeying around with a series of nested if…then statements when it hit me that I should be using tuples and pattern matching:

    let getTileMove (currentTile, dieTotal, doublesCount, inJail, jailRollCount) =
        match (inJail, jailRollCount, doublesCount) with
            | (true, 3, _) -> getBoardMove(10, dieTotal)
            | (true, _, _) -> 10
            | (false, _, 3) -> 10
            | (false, _, _) -> getBoardMove(currentTile, dieTotal)

So here is the real power of F# on display.  I can think of hundreds of applications that I have seen in C#/VB.NET with high cyclomatic complexity and hidden bugs that have reared their heads at the most inopportune times because of complex business logic expressed as a series of switch…case and/or if…then statements.  Even putting each step into its own function only helps partially, because the complexity is still there – it is just labeled better.

By using tupled pattern matching, all of that complexity goes away and we have a succinct series of statements that actually reflect how the brain thinks about the problem.  By using F#, there are fewer lines of code (and therefore fewer unit tests to maintain) and you can write code that better represents how the wetware is approaching the problem.

The Big Ten and F#

I was talking to fellow TRINUGer David Green about football schools a couple of weeks ago.  He went to Northwestern and I went to Michigan, and we were discussing the relative merits of universities doing well in football.  Assuming Goro was counting: on one hand, it is great to have a sport that can bring in tons of money to the school to fund non-football sports and activities; on the second hand, it keeps alumni interested in their school; on the third hand, it can give locals a source of pride in the school; and on the last hand, it can take the focus away from the other parts of the academic institution.

I then was talking to a professor at Ohio State University – she cares absolutely zero about the football team.  I made the comment that the smartest kids in Ohio don't go to OSU.  They will go and root for their gladiators on Saturday, but when it comes down to their academic and subsequent professional success, they look elsewhere.  She agreed.

Putting those two conversations together put OSU and MSU's continued success in the Big Ten in context – as does the inevitable bellyaching that those teams get the short stick when compared to the SEC.  For example, OSU and MSU both would have been undefeated in the Ivy League in 2013 – does that mean they should be in the same conversation as Alabama and Auburn for the national championship?  I think the biggest problem that OSU and MSU have is that they are in the Big Ten – which historically has been about geography, academic success, and athletic competition (in that order).

Looking at the Big Ten schools, I pulled their most recent academic rankings from US News and World Report and their BCS rankings.  I then went over to MathIsFun to get the recipe for correlation:

image

I then went over to Visual Studio and created a solution like so:

image

Learning from my last project, I created my unit test first to verify that the calculation is correct:

    [TestMethod]
    public void FindCorrelationUsingStandardInput_ReturnsExpectedValue()
    {
        Double[] temperatures = new Double[12] { 14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2 };
        Double[] sales = new Double[12] { 215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408 };

        Double expected = .9575;
        Double actual = Calculations.Correlation(temperatures, sales);
        Assert.AreEqual(expected, actual);
    }

I then hopped over to my working code and started coding:

    type Calculations() =
        static member Correlation(x: IEnumerable<double>, y: IEnumerable<double>) =
            let meanX = Seq.average x
            let meanY = Seq.average y

            let a = Seq.map(fun x -> x - meanX) x
            let b = Seq.map(fun y -> y - meanY) y

            let ab = Seq.zip a b
            let abProduct = Seq.map(fun (a,b) -> a * b) ab

            let aSquare = Seq.map(fun a -> a * a) a
            let bSquare = Seq.map(fun b -> b * b) b

            let abSum = Seq.sum abProduct
            let aSquareSum = Seq.sum aSquare
            let bSquareSum = Seq.sum bSquare

            let sums = aSquareSum * bSquareSum
            let squareRootOfSums = sqrt(sums)

            abSum/squareRootOfSums
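As a quick sanity check on the recipe (restated here as a standalone function so it runs on its own), perfectly linear inputs should produce a correlation of exactly 1.0:

```fsharp
// Pearson correlation, same steps as the Calculations member
let correlation (x: float list) (y: float list) =
    let meanX, meanY = List.average x, List.average y
    let a = x |> List.map (fun v -> v - meanX)
    let b = y |> List.map (fun v -> v - meanY)
    let abSum = List.zip a b |> List.sumBy (fun (p, q) -> p * q)
    let denominator =
        sqrt ((a |> List.sumBy (fun v -> v * v)) * (b |> List.sumBy (fun v -> v * v)))
    abSum / denominator

let r = correlation [1.0; 2.0; 3.0] [10.0; 20.0; 30.0]   // 1.0
```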

What I noticed is that those intermediate variables make the code wordier than it needs to be – a mathematician might think the code is too verbose – but a developer might appreciate that each step is laid out.  In fact, I would argue that a better component design would be to break out each of the steps into its own function that can be independently testable (and perhaps reused by other functions):

    [TestMethod]
    public void GetMeanUsingStandardInputReturnsExpectedValue()
    {
        Double[] temperatures = new Double[12] { 14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2 };
        Double expected = 18.675;
        Double actual = Calculations.Mean(temperatures);
        Assert.AreEqual(expected, actual);
    }

    [TestMethod]
    public void GetBothMeansProductUsingStandardInputReturnsExpectedValue()
    {
        Double[] temperatures = new Double[12] { 14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2 };
        Double[] sales = new Double[12] { 215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408 };

        Double expected = 5325;
        Double actual = Calculations.MeanProduct(temperatures, sales);
        Assert.AreEqual(expected, actual);
    }

    [TestMethod]
    public void GetMeanSquareUsingStandardInputReturnsExpectedValue()
    {
        Double[] temperatures = new Double[12] { 14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2 };

        Double expected = 177;
        Double actual = Calculations.MeanSquared(temperatures);
        Assert.AreEqual(expected, actual);
    }

I’ll leave that implementation for another day, as it is already getting late.  In any event, I ran the unit test and got red (pink, really):

image

The spreadsheet rounded and my calculation does not, so I adjusted the unit test appropriately:

    [TestMethod]
    public void FindCorrelationUsingStandardInput_ReturnsExpectedValue()
    {
        Double[] temperatures = new Double[12] { 14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2 };
        Double[] sales = new Double[12] { 215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408 };

        Double correlation = Calculations.Correlation(temperatures, sales);
        Double expected = .9575;

        Double actual = Math.Round(correlation, 4);
        Assert.AreEqual(expected, actual);
    }

And now I am green:

image

So going back to the original question, I took the current Big Ten Schools and put their academic rankings and football rankings side by side:

image

 

I then made a revised Big Ten with a much higher academic ranking, based on schools that play in a power football conference but still maintain high academics.

image

Note that I left Penn State out of both of these lists because they have a NaN for their football ranking – but they certainly have a high enough academic score to be part of the revised Big Ten.

And then I put those values through the correlation function via a console UI:

    static void Main(string[] args)
    {
        Console.WriteLine("Start");

        Double[] academicRanking = new Double[12] { 12, 28, 41, 41, 52, 62, 68, 69, 73, 73, 75, 101 };
        Double[] footballRanking = new Double[12] { 65, 41, 82, 19, 7, 61, 105, 36, 4, 34, 63, 37 };

        Double originalCorrelation = Calculations.Correlation(academicRanking, footballRanking);
        Console.WriteLine("Original BigTen Correlation {0}", originalCorrelation);

        academicRanking = new Double[10] { 7, 12, 17, 18, 23, 23, 28, 30, 41, 41 };
        footballRanking = new Double[10] { 24, 65, 32, 26, 94, 84, 41, 58, 82, 19 };
        Double revisedCorrelation = Calculations.Correlation(academicRanking, footballRanking);
        Console.WriteLine("Revised BigTen Correlation {0}", revisedCorrelation);

        Console.WriteLine("End");
        Console.ReadKey();
    }

I get:

image

Just looking at the data seems to support this.  There is a negative correlation between academics and football success in the current Big Ten – the higher the academics, the lower the football ranking, and vice versa.  In the revised Big Ten, there is a positive correlation of roughly the same magnitude – higher academics go with higher (relative) football rankings.  Put another way, the revised Big Ten has a much stronger academic ranking and pretty much the same football ranking.
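The Correlation function these tests and the console app call isn't shown in the post.  Purely as an illustration of what it might look like – this is my own sketch of Pearson's r, not the actual Calculations.Correlation – the whole thing fits in a few lines of F#:

```fsharp
// Hypothetical sketch of Pearson's r: the covariance of the two series
// divided by the product of their standard deviations. (The n vs n-1
// choice cancels out of the ratio, so it doesn't matter here.)
let correlation (xs: float[]) (ys: float[]) =
    let meanX = Array.average xs
    let meanY = Array.average ys
    let cov = Array.map2 (fun x y -> (x - meanX) * (y - meanY)) xs ys |> Array.sum
    let sdX = xs |> Array.sumBy (fun x -> (x - meanX) ** 2.) |> sqrt
    let sdY = ys |> Array.sumBy (fun y -> (y - meanY) ** 2.) |> sqrt
    cov / (sdX * sdY)
```

Fed the temperature/sales arrays from the unit test above, a sketch like this should reproduce the .9575 the test expects.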

Looking at a map, this new conference is like a doughnut with Ohio, West Virginia, and Kentucky in the middle.  Perhaps they can have a football championship sponsored by Krispy Kreme?  In any event, OSU and MSU are much closer academically and football-wise to the Alabamas and Auburns than to the Northwesterns and Michigans of the world.  In terms of geographic proximity, Columbus, OH is closer to Tuscaloosa, AL than to Lincoln, NE.  So perhaps OSU and MSU fans would be better served in a conference that is more aligned with their universities’ priorities?  If they went undefeated or even had one loss, they would still be in the national championship discussion.


F# > C# when doing math

My friend/coworker Rob Seder sent me this Code Project link and said it might be an interesting exercise to duplicate what he had done in F#.  Interesting indeed!  Challenge accepted!

I first created a solution like so:

image

I then copied the Variance calculation from the post to the C# implementation:

  1. public class Calculations
  2. {
  3.     public static Double Variance(IEnumerable<Double> source)
  4.     {
  5.         int n = 0;
  6.         double mean = 0;
  7.         double M2 = 0;
  8.  
  9.         foreach (double x in source)
  10.         {
  11.             n = n + 1;
  12.         double delta = x - mean;
  13.         mean = mean + delta / n;
  14.         M2 += delta * (x - mean);
  15.     }
  16.     return M2 / (n - 1);
  17.     }
  18. }
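That loop, for what it's worth, is Welford's online algorithm for variance.  As a side note – and purely as my own sketch, not code from the original post – the same running update can be expressed as a fold in F#, which at least names the state being carried:

```fsharp
// Welford's online variance as a fold: thread (n, mean, M2) through the
// sequence, then divide M2 by (n - 1) for the sample variance.
let welfordVariance (source: seq<float>) =
    let n, _, m2 =
        source
        |> Seq.fold (fun (n, mean, m2) x ->
            let n' = n + 1
            let delta = x - mean
            let mean' = mean + delta / float n'
            (n', mean', m2 + delta * (x - mean'))) (0, 0.0, 0.0)
    m2 / float (n - 1)
```

For {1.0, 2.0, 3.0} this yields 1.0, matching the C# implementation above.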

I then created a couple of unit tests for the method and made sure that the results ran green:

  1. [TestClass]
  2. public class CSharpCalculationsTests
  3. {
  4.     [TestMethod]
  5.     public void VarianceOfSameNumberReturnsZero()
  6.     {
  7.         Collection<Double> source = new Collection<double>();
  8.         source.Add(2.0);
  9.         source.Add(2.0);
  10.         source.Add(2.0);
  11.  
  12.         double expected = 0;
  13.         double actual = Calculations.Variance(source);
  14.         Assert.AreEqual(expected, actual);
  15.     }
  16.  
  17.     [TestMethod]
  18.     public void VarianceOfOneAwayNumbersReturnsOne()
  19.     {
  20.         Collection<Double> source = new Collection<double>();
  21.         source.Add(1.0);
  22.         source.Add(2.0);
  23.         source.Add(3.0);
  24.  
  25.         double expected = 1;
  26.         double actual = Calculations.Variance(source);
  27.         Assert.AreEqual(expected, actual);
  28.     }    
  29. }

 

image

 

I then spun up the same unit tests against the F# implementation and went over to the F# project.  My first attempt went along these lines:

  1. namespace Tff.BasicStats.FSharp
  2.  
  3. open System
  4. open System.Collections.Generic
  5.  
  6. type Calculations() =
  7.     static member Variance (source:IEnumerable<double>) =
  8.         let mean = Seq.average(source)
  9.         let deltas = Seq.map(fun x -> x-mean) source
  10.         let deltasSum = Seq.sum deltas
  11.         let deltasLength = Seq.length deltas
  12.         deltasSum/(double)deltasLength

 

I then realized that I was writing procedural code in F# – I was not taking advantage of the expressive power that the language provides.  I also realized that looking at the C# code to understand how to calculate Variance was useless – I was getting lost in the loop and the poorly-named variables.  I went over to Wikipedia’s definition to see if that could help me understand Variance better, but I got lost in all of the formulas.  I then searched for Variance, and one of the first links was MathIsFun with this explanation.  This was more like it!  Cool dog pictures and a stupid-simple recipe for calculating Variance.  The steps are:

image

I hopped over to Visual Studio and wrote a one-for-one line of code to match the recipe:

  1. namespace Tff.BasicStats.FSharp
  2.  
  3. open System
  4. open System.Collections.Generic
  5.  
  6. type Calculations() =
  7.     static member Variance (source:IEnumerable<double>) =
  8.         let mean = Seq.average source
  9.         let deltas = Seq.map(fun x -> sqrt(x-mean)) source
  10.         Seq.average deltas

 

I ran the unit tests but they were running red!  I was getting a NaN. 

image

Hearing my cursing, my 7th grade son came over and said – “Dad, that is wrong.  You don’t take the square root of (x-mean), you square it.  Also, you can’t take the square root of a negative number, and any item in that list that is less than the average will return NaN.”  Let me repeat that – a 7th grader with no coding experience, but who knows about Variance from his math class, just read the code and found the problem.

I then changed the code to square the value like so:

  1. namespace Tff.BasicStats.FSharp
  2.  
  3. open System
  4. open System.Collections.Generic
  5.  
  6. type Calculations() =
  7.     static member Variance (source:IEnumerable<double>) =
  8.         let mean = Seq.average source
  9.         let deltas = Seq.map(fun x -> pown(x-mean) 2) source
  10.         Seq.average deltas

 

And now my unit test… runs…. Red!

image

Not understanding why, I turned to the REPL (F# Interactive Window).  I first entered my test set:

image

I then entered the calculation from each line against the test set:

image

Staring at the resulting array, it hit me that perhaps the original unit test’s expected value was wrong!  I went over to TutorVista and entered in my array.  Would you believe it?

image

The expected value on the code project site doesn’t match the recipe!  The way to make the unit test agree with it is:

  1. [TestMethod]
  2. public void VarianceOfOneAwayNumbersReturnsOne()
  3. {
  4.     Collection<Double> source = new Collection<double>();
  5.     source.Add(1.0);
  6.     source.Add(2.0);
  7.     source.Add(3.0);
  8.  
  9.     //double expected = .6666666667;
  10.     double expected = 2.0 / 3.0;
  11.     double actual = Calculations.Variance(source);
  12.     Assert.AreEqual(expected, actual);
  13. }    

 

(Note that expected was the easiest way I could come up with .6 repeating without getting all crazy on the formatting).  Now both my unit tests run green and one of the C# ones runs red. 
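For the record, the discrepancy between the two results has a precise name: dividing by n-1, as the C# loop does, yields the sample variance, while averaging the squared deltas yields the population variance – 1 versus 2/3 for {1, 2, 3}.  Both variants, sketched side by side in F# (my own illustration, not code from either post):

```fsharp
// Population variance: average of the squared deltas (divides by n).
let populationVariance (source: seq<float>) =
    let mean = Seq.average source
    source |> Seq.averageBy (fun x -> (x - mean) ** 2.)

// Sample variance: same squared deltas, but divided by (n - 1).
let sampleVariance (source: seq<float>) =
    let mean = Seq.average source
    let n = float (Seq.length source)
    (source |> Seq.sumBy (fun x -> (x - mean) ** 2.)) / (n - 1.)
```

Neither is “wrong” in the abstract – they just answer different questions – which is all the more reason the unit test’s expected value needs to come from an external source.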

image

I have no interest in trying to figure out how to fix that C# code – I care more about solving the problem than about the mechanics of how it gets solved.  The real power of F# is on display here.  The coolest parts of this exercise were:

  • One-for-one correspondence between the steps to solve a problem and the code
  • The code is much more readable to non developers
  • By concentrating on how to solve the problem in C#, the original developer lost sight of what he was trying to accomplish.  F# focuses you on the result, not the code.
  • Unit tests can be wrong – if you let your code’s result drive the expected value rather than an external source. 

F# and SignalR Stock Ticker: Part 2

Following up on my prior post found here about using F# to write the Stock Ticker example found on SignalR’s website, I went to implement the heart of the application – the stock ticker class.

The original C# class suffers from a violation of command/query separation and also takes on several responsibilities.  Breaking out the functionality, the class first creates a list of random stocks in the constructor. 

image

Then there is a timer that loops and periodically updates the current stock price. 

image

Finally, it broadcasts the new stock price to any connected clients.

image

Because the class depends on the clients for its creation and lifetime, it implements the singleton pattern – you access the class via its Instance property.  This is a very common pattern:

  1. //Singleton instance
  2. private readonly static Lazy<StockTicker> _instance =
  3.     new Lazy<StockTicker>(() =>
  4.         new StockTicker(GlobalHost.ConnectionManager.GetHubContext<StockTickerHub>().Clients));

  1. public static StockTicker Instance
  2. {
  3.     get
  4.     {
  5.         return _instance.Value;
  6.     }
  7. }

Attacking the class from an F# point of view, I first addressed the singleton pattern.  I checked out the Singleton pattern in Liu’s F# for C# Developers.  The sentence that caught my eye was “An F# value is immutable by default, and this guarantees there is only one instance.” (p149)  Liu then builds an example using a private class and shows how to reference it via an Instance member.  My take-away from the example is that you don’t need a Singleton pattern in F# – because everything is a singleton by default.  Another way to look at it is that the Singleton pattern is a well-accepted workaround for the limitations that mutability brings when using C#.
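To illustrate that point with my own sketch (not Liu’s example): a module-level let binding in F# is initialized exactly once, on first access, and can never be rebound – which is the same guarantee the Lazy&lt;StockTicker&gt; field is buying in the C# version.

```fsharp
module TickerHost =
    // Initialized exactly once, on first access, and immutable thereafter –
    // every caller sees the same object, with no Lazy<T> ceremony.
    // System.Random is a stand-in for the real StockTicker here.
    let instance = System.Random()
```

`TickerHost.instance` plays the role of `StockTicker.Instance`, and the compiler enforces the single-instance guarantee for free.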

I then jumped over to the updating stock prices – after all, how can you send out a list of new stock prices if you can’t mutate the list or the individual stocks within the list?  Quite easily, in fact.

The first thing I did was to create a StockTicker class that takes in a SignalR HubContext and a list of stocks.

  1. type StockTicker(clients: IHubConnectionContext, stocks: IEnumerable<Stock>) = class

I then added the logic to update the list and stocks.

  1. let rangePercent = 0.002
  2. let updateInterval = TimeSpan.FromMilliseconds(250.)
  3. let updateStockPrice (stock:Stock) =
  4.     let updateOrNotRandom = new Random()
  5.     let r = updateOrNotRandom.NextDouble();
  6.     match r with
  7.         | r when r > 0.1 -> stock
  8.         | _ ->
  9.  
  10.             let random = new Random(int(Math.Floor(stock.Price)))
  11.             let percentChange = random.NextDouble() * rangePercent
  12.             let pos = random.NextDouble() > 0.51
  13.             let change = Math.Round(stock.Price * decimal(percentChange),2)
  14.             let signedChange = if pos then change else -change
  15.             let newPrice = stock.Price + signedChange
  16.             new Stock(stock.Symbol, stock.DayOpen, newPrice)
  17. let updatedStocks = stocks
  18.                         |> Seq.map(fun stock -> updateStockPrice(stock))

Looking at the code, the word “update” in the prior sentence is wrong.  I am not updating anything.  I am replacing the list, and the stocks within it, with new instances carrying the new prices (when a change is determined).  Who needs a singleton?  F# doesn’t.

I then attempted to notify the clients like so:

  1. member x.Clients = clients
  2. member x.Stocks = stocks
  3. member x.BroadcastStockPrice (stock: Stock) =
  4.     x.Clients.All.updateStockPrice(stock)

 

But I got a red squiggly line of disapprobation (RSLD) on the updateStockPrice method.  The compiler complains:

Error    1    The field, constructor or member ‘updateStockPrice’ is not defined   

And reading the SignalR explanation here:

The updateStockPrice method that you are calling in BroadcastStockPrice doesn’t exist yet; you’ll add it later when you write code that runs on the client. You can refer to updateStockPrice here because Clients.All is dynamic, which means the expression will be evaluated at runtime.

So how does F# accommodate the dynamic nature of Clients.All?  I don’t know so off to StackOverflow I go….
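One route I have seen suggested – an assumption on my part, not the answer from that StackOverflow trip – is to sidestep dynamic entirely: cast Clients.All to IClientProxy and invoke the client-side method by its string name, since IClientProxy.Invoke is how SignalR dispatches a named method to connected clients anyway.

```fsharp
open Microsoft.AspNet.SignalR.Hubs

// Hypothetical workaround: Clients.All surfaces in F# as obj (dynamic has
// no F# equivalent), so downcast it to IClientProxy and name the
// client-side method as a string.
let broadcastStockPrice (clients: IHubConnectionContext) (stock: Stock) =
    let proxy = clients.All :?> IClientProxy
    proxy.Invoke("updateStockPrice", stock) |> ignore
```

The trade-off is losing even the illusion of compile-time checking on the method name – but that illusion was already gone in the C# version.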

In any event, I can then wire up a method that broadcasts the new stock prices like so:

  1. member x.BroadcastAllPrices =
  2.     x.Clients.All.updateAllStockPrices(updatedStocks)

And then write a method that calls this broadcast method every quarter second:

  1.  member x.Start =
  2.      async {
  3.          while true do
  4.              do! Async.Sleep (int updateInterval.TotalMilliseconds)
  5.              x.BroadcastAllPrices
  6.      } |> Async.StartImmediate

Note that I tried to figure out the Timer class and its associated event, but I couldn’t.  I stumbled upon this post, ditched the timer in favor of the code above, and since it works, I am all for it.  Figuring out events in F# is a battle for another day…
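For completeness, the timer-based version I gave up on might look something like this – a sketch using System.Threading.Timer, not code from the post:

```fsharp
open System
open System.Threading

// System.Threading.Timer invokes the callback on a thread-pool thread
// every `interval`. Keep a reference to the returned Timer (and dispose
// it when done) so it isn't garbage collected out from under you.
let startWithTimer (broadcast: unit -> unit) (interval: TimeSpan) =
    new Timer((fun _ -> broadcast ()), null, interval, interval)
```

The async-while-loop above is arguably more idiomatic F# anyway – no delegate, no state object, and the sleep interval reads inline.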