Screen Scraping College Football Statistics

As a follow-up to my post of the correlation of Academic Ranking and Football Rankings in the Big Ten, I thought I would look that the relationship between two different kinds of Football Rankings: the recruiting ranking assigned by Rivals and the actual results on the field.  To that end, I went to collect the data programmically because I am doing a time-series analysis and I didn’t want to do data-entry.

My first stop was to find a free service that exposes this data on the web.  No luck – either the data was a service that cost money or the data was presented as a web page.  Since I have never screen-scraped using F# (and I am cheap), I chose option #2.

My first data point was the recruiting ranking found here.  When I inspected the source of the page, I caught a break – the data is actually stored as Json on the page.

 image

So firing up Visual Studio, I created a solution with 1 F# project and 2 C# projects:

image

I then wrote a unit test to check that something is being returned:

  1. [TestMethod]
  2. public void getRecrutRankings_RetunsExpected()
  3. {
  4.     var rankings = RankingProvider.getRecrutRankings("2012");
  5.     Assert.AreNotEqual(0, rankings.Length);
  6. }

I then went over the F#.  I created the RankingProvider type and then add a function that pulls in the rankings for a given year:

  1. static member getRecrutRankings(year) =
  2.     let url = "http://sports.yahoo.com/footballrecruiting/football/recruiting/teamrank/"+year+"/BIG10/all";
  3.     let request = WebRequest.Create(Uri(url))
  4.     use response = request.GetResponse()
  5.     use stream = response.GetResponseStream()
  6.     use reader = new IO.StreamReader(stream)
  7.     let htmlString = reader.ReadToEnd()
  8.     let startPosition = htmlString.IndexOf("var rankingsTableData =")
  9.     let headerLength = 23
  10.     let endPosition = htmlString.IndexOf(";",startPosition)
  11.     let data = htmlString.Substring(startPosition+headerLength,endPosition-startPosition-headerLength).Trim()
  12.     let results = JsonConvert.DeserializeObject(data)
  13.     let castedResults = results :?> Newtonsoft.Json.Linq.JArray
  14.                                             |> Seq.map(fun x -> (x.Value("name").ToString(), Int32.Parse(x.Value("rank").ToString())))
  15.                                             |> Seq.toList

 

A couple of things to note.

  • Lines 2 through 12 are language-agnostic.  You would write the exact same code in C#/VB.NET with a slightly different syntax.
  • Line 13 is where things get interesting.  I used the :?> operator to cast the Json to a typed structure.  :?> wins as the weirdest symbol I have ever used in computer programming.   I guess I haven’t been programming long enough?
  • Lines 14 and 15 is where you can see why F# is better than C#.  I created a function that takes the Json and pushes it into a tuple.  With no iteration, the code is both easier to read and less likely to have bugs

Hoping to press my luck, I went over the the other page (the one that holds the standings from the actual games) to see if they used Json.  No dice – so back to mid-2000s screen scraping.  I created a function that loads the table into an XML document and then searches for a given school.

  1. static member getConferenceStanding(year, school) =
  2.     let url = "http://espn.go.com/college-football/conferences/standings/_/id/5/year/"+year+"/big-ten-conference";         
  3.     let request = WebRequest.Create(Uri(url))
  4.     use response = request.GetResponse()
  5.     use stream = response.GetResponseStream()
  6.     use reader = new IO.StreamReader(stream)
  7.     let htmlString = reader.ReadToEnd()
  8.     let divMarkerStartPosition = htmlString.IndexOf("my-teams-table");
  9.     let tableStartPosition = htmlString.IndexOf("<table",divMarkerStartPosition);
  10.     let tableEndPosition = htmlString.IndexOf("</table",tableStartPosition);
  11.     let data = htmlString.Substring(tableStartPosition, tableEndPosition- tableStartPosition+8)
  12.     let xmlDocument = new XmlDocument();
  13.     xmlDocument.LoadXml(data);
  14.     let keyNode = xmlDocument.GetElementsByTagName("td")
  15.                     |> Seq.cast<XmlNode>
  16.                     |> Seq.find (fun node -> node.InnerText = school)
  17.     let valueNode = keyNode.NextSibling
  18.     (keyNode.InnerText, valueNode.InnerText)

A couple of things to note:

  • Lines 2-7 are identical to the prior function so they should be combined into a single function that can be independently testable.
  • Lines 8-13 are language-agnostic.  You would write the exact same code in C#/VB.NET with a slightly different syntax.
  • Lines 14-18 is where F# really shines.  Like the prior function, by using functional programming techniques in F#, I saved myself time, avoid bugs, and made the code much more intuitive.
  • I am making a web call for each function call– this should be optimized so the call is made once and the xmlDocument is passed in.  This would also make the function much more testable (even without a mocking framework)

Next up, I needed to call this function for each of the Big Ten Schools:

  1. static member getConferenceStandings(year)=
  2.     let schools =[|"Nebraska";"Michigan";"Northwestern";"Michigan State";"Iowa";
  3.         "Minnesota";"Ohio State";"Penn State";"Wisconsin"; "Purdue"; "Indiana"; "Illinois"|]
  4.     Seq.map(fun school -> RankingProvider.getConferenceStanding(year,school)) schools
  5.         |> Seq.sortBy snd
  6.         |> Seq.toList
  7.         |> List.rev

 

This is purely F# and is a pure joy to write (and look the least amount of time).  Note that the sort is on the second element of the tuple and that the list is reversed because the second element is the wins-losses so F# is sorting ascending on the number of wins.  Since Seq does not have a rev function, I turned it into a List, which does have the rev function

Some might ask “Why didn’t you use type-providers?”  My answer is “I tried, but I couldn’t get them to work.”  For example, here is the code that I used for the type provider when parsing the xmlDocument:

  1. xmlDocument.LoadXml(data);
  2. let document = XmlProvider<xmlDocument>

The problem is that the type provider expects a uri (and I can’t find an overload to pass in the document).  It looks like type providers are more designed for providers that are ready to, well, provide (Web Services, Databases, etc..) versus jerry-rigged data (like screen scraping).

In any event, with these two functions, ready, I went to the UI project and decided to see how the teams did in 2012 on the field compared to how the teams did in recruiting 2 years before:

  1. static void Main(string[] args)
  2. {
  3.     Console.WriteLine("Start");
  4.  
  5.     Console.WriteLine("——-Rankings");
  6.     var rankings = RankingProvider.getRecrutRankings("2010");
  7.     foreach (var school in rankings)
  8.     {
  9.         Console.WriteLine(school.Item1 + ":" + school.Item2);
  10.     }
  11.  
  12.     Console.WriteLine("——-Standings");
  13.     var standings = RankingProvider.getConferenceStandings("2012");
  14.     foreach (var school in standings)
  15.     {
  16.         Console.WriteLine(school.Item1 + ":" + school.Item2);
  17.     }
  18.  
  19.     Console.WriteLine("End");
  20.     Console.ReadKey();
  21. }

And the results:

image

I have no idea if a 2-year lag between recruiting and rankings is the right number – perhaps an analysis of the correct lag will be done.  After all, between red-shirt freshmen, transfer rules, and attrition, there are plenty of variables the determine when a recruiting class has the biggest impact.  Also, the standings are a blend of recruiting classes and since I am not evaluating individual players, I can’t go to that level of detail.  2 years out seems reasonable, but as Bluto famiously once said

  1. static member getBlutoQuote() =
  2.     "Seven years of college down the drain.";

:

the average might be different.  In any event, I now have the data I want so the next step is to analyze it to see if there is any correlation.  At first glance, there might be something – the top 4 schools for recruiting all finished in the top 4 in the standings – but the bottom 4 is more muddled with only Illinois doing poorly in both recruiting and the standings.

More to come…

Advertisements

2 Responses to Screen Scraping College Football Statistics

  1. ovatsus says:

    Hi, the XmlProvider accepts a xml document instead of a url, but you have to pass it as a string by using the Parse method instead of the Load method. In any case, it doesn’t work very well with html. But you can easily use the JsonProvider for the other part. See here: https://gist.github.com/ovatsus/8123945

  2. Pingback: F# Weekly #52, 2013 – New Year Edition | Sergey Tihon's Blog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: