Consuming and Analyzing Census Data Using F#

As part of my Nerd Dinner refactoring, I wanted to add the ability to guess a person’s age and gender based on their name. I did a quick search on the internet, and the only place I found with an API is here, and it doesn’t have everything I am looking for. Fortunately, the US Census website has some flat files with the kind of data I need here.

I grabbed the data and pumped it into Azure Blob Storage here. You can swap out the state code to get each dataset. I then loaded in a list of state codes, found here, that match the file names.

I then fired up Visual Studio and created a new FSharp project. I added FSharp.Data so I could use a Type Provider to access the data. I don’t need to install the Azure Storage .dlls because the blobs are public and I only have to read the files.


Once NuGet was done with its magic, I opened up the script file, pointed to the newly installed FSharp.Data, and added a reference to the datasets on blob storage:

#r "../packages/FSharp.Data.2.0.9/lib/portable-net40+sl5+wp8+win8/FSharp.Data.dll"
open FSharp.Data

type censusDataContext = CsvProvider<"">
type stateCodeContext = CsvProvider<"">

(Note that I am going to add FSharp as a language to my Live Writer code snippet add-in at a later date.)

In any event, I then printed out all of the codes to see what it looks like:

let stateCodes = stateCodeContext.Load("")
stateCodes.Rows |> Seq.iter(fun r -> printfn "%A" r)


And by changing the lambda slightly like so,

stateCodes.Rows |> Seq.iter(fun r -> printfn "%A" r.Abbreviation)

I get all of the state codes.


I then tested the census data with the following code, and the results were as expected:

let arkansasData = censusDataContext.Load("")
arkansasData.Rows |> Seq.iter(fun r -> printfn "%A" r)


So then I wrote some code to load all of the state census data and give me the total number of rows:

let stateCodes = stateCodeContext.Load("")
let usaData = stateCodes.Rows
                |> Seq.collect(fun r -> censusDataContext.Load(System.String.Format("{0}.TXT",r.Abbreviation)).Rows)
                |> Seq.length


Since this is an I/O-bound operation, it made sense to load the data asynchronously, which sped things up considerably. You can see my question over on Stack Overflow here, and the resulting code takes about 50% of the time on my dual-processor machine:

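let stopwatch = System.Diagnostics.Stopwatch()
stopwatch.Start()

// Load one state's file asynchronously
let fetchStateDataAsync(stateCode:string)=
    async{
        let uri = System.String.Format("{0}.TXT",stateCode)
        let! stateData = censusDataContext.AsyncLoad(uri)
        return stateData.Rows
    }

// Start every download in parallel, then flatten the per-state rows.
// The rows are kept (not just counted) because the analysis below reuses usaData'.
let usaData' = stateCodes.Rows
                |> Seq.map(fun r -> fetchStateDataAsync(r.Abbreviation))
                |> Async.Parallel
                |> Async.RunSynchronously
                |> Seq.collect id
                |> Seq.toArray
stopwatch.Stop()
printfn "Rows: %A" usaData'.Length
printfn "Parallel: %A" stopwatch.Elapsed.Seconds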
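If you want to try the fan-out pattern without touching blob storage, here is a minimal sketch that swaps the real download for a fake one. The function fetchFakeAsync, the delay, and the row strings are all made up for illustration; only Async.Sleep, Async.Parallel, and Async.RunSynchronously are the real library calls.

```fsharp
// Hypothetical stand-in for a per-state download: waits briefly,
// then returns two made-up rows tagged with the state code.
let fetchFakeAsync (stateCode: string) =
    async {
        do! Async.Sleep 100
        return [ stateCode + "-row1"; stateCode + "-row2" ]
    }

// Start all of the "downloads" at once, wait for every one to finish,
// then flatten the per-state results into a single list.
let allRows =
    [ "AK"; "NC"; "TX" ]
    |> List.map fetchFakeAsync
    |> Async.Parallel
    |> Async.RunSynchronously
    |> Seq.collect id
    |> Seq.toList

printfn "%d rows" (List.length allRows)
```

Async.Parallel returns the results in the same order as the inputs, so the flattened list keeps the states in their original order even though the work overlaps.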


With the data in hand, it was time to analyze it and see what we can do. Since 23 seconds is a bit too long to wait for a page load, I will need to put the 5.5 million records into a format that can be easily searched. I think the questions we want to answer are:

Given a name, what is the gender?

Given a name, what is the age?

Given a name, what is their state of birth?

Since we have their current location, we can also input the name and location together and answer those questions. If we make the assumption that their location is the same as their birth state, we can narrow down the list even further.
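One way to make the records easy to search is to group them by name once and keep the result in a lookup table. This is only a sketch under my own assumptions: the NameRecord type, its field names, and the sample rows are hypothetical and do not come from the type provider or the real files.

```fsharp
// Hypothetical record mirroring one row of the census files;
// the field names are illustrative, not the type provider's.
type NameRecord =
    { State: string; Gender: string; Year: int; Name: string; Count: int }

// Made-up sample rows, for illustration only.
let sample =
    [ { State = "AK"; Gender = "F"; Year = 1910; Name = "Mary"; Count = 14 }
      { State = "AK"; Gender = "M"; Year = 1950; Name = "James"; Count = 30 }
      { State = "NC"; Gender = "M"; Year = 1950; Name = "James"; Count = 70 }
      { State = "NC"; Gender = "F"; Year = 1960; Name = "James"; Count = 5 } ]

// Build the index once: name -> every row recorded for that name.
let buildIndex (rows: seq<NameRecord>) : Map<string, NameRecord list> =
    rows
    |> Seq.groupBy (fun r -> r.Name)
    |> Seq.map (fun (name, rs) -> name, List.ofSeq rs)
    |> Map.ofSeq

// Look up a name, optionally narrowing to a single state.
let lookup (index: Map<string, NameRecord list>) (name: string) (state: string option) =
    match Map.tryFind name index with
    | None -> []
    | Some rs ->
        match state with
        | Some s -> rs |> List.filter (fun r -> r.State = s)
        | None -> rs

let index = buildIndex sample
let jamesAnywhere = lookup index "James" None
let jamesInNC = lookup index "James" (Some "NC")
```

Building the index costs one pass over the data, after which each lookup only touches the rows for a single name. Passing Some "NC" applies the "location equals birth state" assumption; passing None skips it.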

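In any event, I first added a GroupBy on the name. (The odd property names below, such as Mary, F, and ``14``, appear to come from the type provider treating the first data row of a header-less file as the header.)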

let nameSum = usaData' |> Seq.groupBy(fun r -> r.Mary) |> Seq.toArray


And then I summed up the counts for each name:

let nameSum = usaData'
                |> Seq.groupBy(fun r -> r.Mary)
                |> Seq.map(fun (n,a) -> n,a |> Seq.sumBy(fun (r) -> r.``14``))
                |> Seq.toArray


And then the total in the set:

let totalNames = nameSum |> Seq.sumBy(fun (n,c) -> c)


And then I calculated each name’s share of the total and sorted it descending:

let nameAverage = nameSum
                    |> Seq.map(fun (n,c) -> n,c,float c/ float totalNames)
                    |> Seq.sortBy(fun (n,c,a) -> -a - 1.)
                    |> Seq.toArray


So I feel really special that my parents gave me the most popular name in the US ever…

Getting back to the task at hand, I want to determine the probability that a person is male or female based on their name:

let nameSearch = usaData'
                    |> Seq.filter(fun r -> r.Mary = "James")
                    |> Seq.groupBy(fun r -> r.F)
                    |> Seq.map(fun (n,a) -> n,a |> Seq.sumBy(fun (r) -> r.``14``))
                    |> Seq.toArray


So 18196 parents thought it would be a good idea to name their daughter ‘James’. I created a quick function like so:

let nameSearch' name =
    // Total count per gender for the given name
    let nameFilter = usaData'
                        |> Seq.filter(fun r -> r.Mary = name)
                        |> Seq.groupBy(fun r -> r.F)
                        |> Seq.map(fun (n,a) -> n,a |> Seq.sumBy(fun (r) -> r.``14``))
    // Total count for the name across both genders
    let nameSum = nameFilter |> Seq.sumBy(fun (n,c) -> c)
    // Gender, count, and share of the total
    nameFilter
        |> Seq.map(fun (n,c) -> n, c, float c/float nameSum)
        |> Seq.toArray

nameSearch' "James"


So if I see the name “James”, there is a 99% chance it is a male. This can lead to a whole host of questions, like the variance of names, which names are closest to gender neutral, etc. Leaving those questions for another day, I now have something I can put into Nerd Dinner. Now, if there were only a way to handle nicknames and friendly names…

You can see the full code here.







