Azure ML and Wake County Election Data

I have spent the last couple of weeks using Azure ML, and I think it is one of the most exciting technologies for business developers and analysts since ODBC and FSharp type providers. If you remember, when ODBC came out, every relational database in the world became accessible and therefore usable and analyzable. When type providers came out, programming against, exploring, and analyzing data sources became much easier, and the reach expanded from RDBMS to all formats (notably JSON). So getting data was no longer a problem, but analyzing it still was.

Enter Azure ML. 

I downloaded the Wake County Voter History data from here. I took the Excel spreadsheet and converted it to a .csv locally. I then logged into Azure ML and imported the data:

image

I then created an experiment and added the dataset to the canvas:

image

I then looked at the basic statistics of the dataset:

image

(Note that I find the FSharp REPL a better way to explore the data, as I can just dot into each element I am interested in and view the results.)

In any event, the first question I want to answer is:

“Given a person’s ZipCode, Race, Party, Gender, and Age, can I predict whether they will vote in November?”

To that end, I first narrowed down the columns using a Project Columns module and picked only the columns I care about. I picked “11/6/2012” as the Y variable (the value to predict) because that was the last national election, and a national election is what we are going to have in November. I probably should have used 2010 because that was a national election without a presidential race, but that can be analyzed at a later date.

image

image

I then ran my experiment so the data would be available in the Project Columns step.

image

I then renamed the columns to make them a bit more readable by using a series of Metadata Editors. (It does not look like you can do all of the renames in one step. Equally annoying is that you have to add each module, run it, and then add the next.)

image

(one example)

image

I then added a Missing Values Scrubber for the voted column, so that instead of a null field, people who didn’t vote get an “N”:

image

The problem is that it doesn’t work. It looks like we can’t change the values on a per-column basis:

image

I asked the question on the forum, but in the interest of time I decided to change the voted column from a categorical column to an indicator so that I could do binary analysis. That also failed. So I went back to the original spreadsheet, added an indicator column, and renamed the column headers there so that I am not cluttering up my canvas with those metadata transforms. Finally, I realized I want only active voters, but there does not seem to be a filtering ability (remove rows only works for missing values), so I removed the inactive voters from the original dataset as well. I think the ability to scrub and munge data is an area for improvement, but since this is release 1, I understand.
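The same clean-up I did by hand in the spreadsheet could also be scripted before uploading. Below is a minimal sketch in Python using only the standard library. The column names (`status`, `11/6/2012`, and so on), the `ACTIVE` status value, and the three sample rows are made up for illustration and are not the actual layout of the Wake County file.

```python
import csv
import io

# Hypothetical sample in the spirit of the voter history file.
# Column names and values are illustrative assumptions, not the real layout.
RAW = """zip,race,party,gender,age,status,11/6/2012
27511,W,DEM,F,44,ACTIVE,V
27513,B,REP,M,61,ACTIVE,
27519,A,UNA,F,29,INACTIVE,V
"""


def prepare(raw_csv, vote_column="11/6/2012"):
    """Keep active voters, rename headers, and build a 0/1 Voted indicator."""
    rows = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        if row["status"] != "ACTIVE":
            continue  # drop inactive voters
        # A blank vote field means the person did not vote in that election.
        voted = 1 if row[vote_column].strip() else 0
        rows.append({
            "ZipCode": row["zip"],
            "Race": row["race"],
            "Party": row["party"],
            "Gender": row["gender"],
            "Age": int(row["age"]),
            "Voted": voted,
        })
    return rows


cleaned = prepare(RAW)
```

Doing the filtering, the renames, and the indicator in one pass keeps the canvas free of the extra transform modules, which is the same trade-off as editing the spreadsheet directly.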

After re-importing the data, I changed my experiment like so:

image

I then split the dataset into training, validation, and testing sets using a 60/20/20 split:

image

So the left output of the second split is 60% of the original dataset and the right output of the second split is 20% of the original dataset (that is, 75%/25% of the 80% that comes out of the first split). The other 20% from the first split is held back for testing.
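The arithmetic of chaining two splits is easy to check in code. This is a sketch in plain Python, not what Azure ML does internally; the function name `three_way_split`, the fixed seed, and the use of a simple shuffle are my own assumptions.

```python
import random


def three_way_split(rows, first=0.8, second=0.75, seed=42):
    """Split rows into training/validation/testing with two chained splits."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # First split: 80% goes on, 20% is held back for testing.
    cut1 = round(len(shuffled) * first)
    kept, testing = shuffled[:cut1], shuffled[cut1:]
    # Second split: 75% of the kept 80% is 60% of the original,
    # and the other 25% of the kept 80% is 20% of the original.
    cut2 = round(len(kept) * second)
    training, validation = kept[:cut2], kept[cut2:]
    return training, validation, testing


training, validation, testing = three_way_split(range(1000))
print(len(training), len(validation), len(testing))  # 600 200 200
```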

I then added an SVM with a train and a score module. Note that I am training with 60% of the original dataset and validating with 20%:

image

After it runs, there are two new columns in the dataset, Scored Labels and Scored Probabilities, so each row now has a score.

image
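To make the two scored columns concrete, here is a small Python sketch of how a label can be derived from a probability. It assumes a simple cutoff of 0.5, which is my assumption and not something the experiment confirms; the function name `score_rows` and the sample probabilities are made up.

```python
def score_rows(probabilities, threshold=0.5):
    """Attach a scored label to each scored probability (assumed 0.5 cutoff)."""
    return [
        {"Scored Probabilities": p, "Scored Labels": 1 if p >= threshold else 0}
        for p in probabilities
    ]


# Made-up probabilities, not output from the voter model.
scored = score_rows([0.82, 0.35, 0.5])
```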

With the model in place, I can then evaluate it using an Evaluate Model module:

image

And we can see an AUC of .666, which immediately made me think of this:

image

In any event, I added a Logistic Regression and a Boosted Decision Tree to the canvas and hooked them up to the training and validation sets:

image

And this is what we have:

image image

SVM: .666 AUC

Logistic Regression: .689 AUC

Boosted Decision Tree: .713 AUC
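For anyone unfamiliar with the metric: an AUC of .5 is what a coin flip would get and 1.0 is a perfect ranking, so these numbers are better than chance but not by a lot. One way to read AUC is as the probability that a randomly chosen voter gets a higher score than a randomly chosen non-voter. The Python sketch below computes it that way by brute force; the function name `auc` and the toy labels and scores are made up, and this is not how the Azure ML evaluation computes it internally.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive row gets a
    higher score than a randomly chosen negative row (ties count half).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative row")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example with made-up scores, not the voter data.
# Three of the four positive/negative pairs are ranked correctly.
example = auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])  # 0.75
```

The pairwise loop is quadratic, so it is only suitable for small illustrations like this one.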

So with the Boosted Decision Tree ahead, I added a Sweep Parameters module to see if I can tune it more. I am using AUC as the performance metric:

image

image

So the best AUC I am going to get is .7134, with the highlighted parameters. I then added one more model that uses those parameters, trains against the entire training dataset (80% of the total), and is then evaluated against the remaining 20%:
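The general idea of a parameter sweep can be sketched in a few lines of Python: try each combination, score it on the validation set, and keep the best. This is only the concept, under my own assumptions. I do not know how the sweep module searches internally, and the parameter names, the grid values, and the stand-in scoring function `fake_score` are all hypothetical.

```python
from itertools import product


def sweep(param_grid, score_fn):
    """Try every parameter combination and return the best one with its score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        # In a real run, score_fn would train on the 60% training set
        # and return the AUC measured on the 20% validation set.
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid and a stand-in scoring function, for illustration only.
grid = {"num_leaves": [8, 32, 128], "learning_rate": [0.05, 0.1, 0.2]}


def fake_score(params):
    # Placeholder that peaks at num_leaves=32, learning_rate=0.1 by construction.
    return 1.0 - abs(params["num_leaves"] - 32) / 128 - abs(params["learning_rate"] - 0.1)


best_params, best_score = sweep(grid, fake_score)
```

Once the best combination is known, the final model is retrained with those values on the full 80% and scored once against the held-back 20%, which keeps the test set out of the tuning loop.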

image

With the final answer of

image

With that in hand, I can create a new experiment that will be the basis of a real-time voting app.

2 Responses to Azure ML and Wake County Election Data

  1. Roope Astala says:

    Hi Jamie, great post! You should be able to do the column renames in one step by picking the columns you want to rename and giving Metadata Editor a comma-separated list of new column names.

    -Roope (MSFT)

  2. mayashenoi says:

    Quite interesting.
    Check out my new blog post – IBM Watson Analytics – Powerful Analytics for Everyone #IBM #IBMWatson http://ow.ly/BD7RE
