Azure ML and Wake County Election Data
September 16, 2014
I have spent the last couple of weeks using Azure ML and I think it is one of the most exciting technologies for business developers and analysts since ODBC and FSharp type providers. If you remember, when ODBC came out, every relational database in the world became accessible and therefore usable/analyzable. When type providers came out, programming, exploring, and analyzing data sources became much easier, and the reach expanded from RDBMS to all formats (notably JSON). So getting data was no longer a problem, but analyzing it still was.
Enter Azure ML.
I downloaded the Wake County Voter History data from here. I took the Excel spreadsheet and converted it to a .csv locally. I then logged into Azure ML and imported the data.
I then created an experiment and added the dataset to the canvas.
And looked at the basic statistics of the data set.
(Note that I find the FSharp REPL a better way to explore the data, as I can just dot into each element I am interested in and view the results.)
In any event, the first question I want to answer is:
“Given a person’s ZipCode, Race, Party, Gender, and Age, can I predict if they will vote in November?”
To that end, I first narrowed down the columns using a Project Columns module and picked only the columns I care about. I picked “11/6/2012” as the variable to predict because that was the last national election and that is what we are going to have in November. I probably should have used 2010 because that is a national election without a President on the ballot, but that can be analyzed at a later date.
I then ran my experiment so the data would be available in the Project Columns step.
I then renamed the columns to make them a bit more readable by using a series of Metadata Editors (it does not look like you can do all the renames in 1 step. Equally annoying is that you have to add each module, run it, then add the next.)
I then added a Missing Values Scrubber for the voted column. So instead of a null field, people who didn’t vote get an “N”.
The problem is that it doesn’t work –> looks like we can’t change the values per column.
I asked the question on the forum but, in the interest of time, I decided to change the voted column from a categorical column to an indicator. That way I can do binary analysis. That also failed. I went back to the original spreadsheet and added an Indicator column and then also renamed the column headers so I am not cluttering up my canvas with those metadata transforms. Finally, I realized I want only active voters, but there does not seem to be a filtering ability (remove rows only works for missing values), so I removed the inactive voters from the original dataset as well. I think the ability to scrub and munge data is an area for improvement, but since this is release 1, I understand.
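For readers who would rather script this cleanup than edit the spreadsheet by hand, here is a minimal Python sketch of the same three steps: keep active voters, fill a blank voted value with “N”, and add a 0/1 indicator. The column names (`status`, `voted`), the `A` code for active voters, and the sample rows are all made up for illustration; they are not the real headers or values in the Wake County file.

```python
import csv
import io

# Hypothetical sample data; the real Wake County file has different headers.
SAMPLE_CSV = """zip,race,party,gender,age,status,voted
27519,W,DEM,F,42,A,Y
27513,B,REP,M,35,A,
27511,W,UNA,F,67,I,Y
27502,A,DEM,M,29,A,
"""

def prepare(rows):
    """Keep active voters, fill blank 'voted' with 'N', add a 0/1 indicator."""
    cleaned = []
    for row in rows:
        if row["status"] != "A":  # "A" is an assumed code for an active voter
            continue
        voted = row["voted"].strip() or "N"  # a blank cell means did not vote
        out = dict(row)
        out["voted"] = voted
        out["voted_indicator"] = 1 if voted == "Y" else 0
        cleaned.append(out)
    return cleaned

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
cleaned = prepare(rows)
for r in cleaned:
    print(r["zip"], r["voted"], r["voted_indicator"])
```

Doing the cleanup before upload keeps the canvas free of one-off transform modules, which is the same trade-off made above with the spreadsheet.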
After re-importing the data, I changed my experiment like so
I then split the dataset into Training/Validation/Testing sets using a 60/20/20 split.
So the left point on the second split is 60% of the original dataset and the right point on the second split is 20% of the original dataset (that is, 75%/25% of the 80% that came out of the first split).
I then added an SVM with a train and a score module. Note that I am training with 60% of the original dataset and I am validating with 20%.
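The arithmetic behind the two chained splits can be checked with a few lines of Python. This is only a sketch of the idea, not what the Azure ML split module does internally; the function name, the seed, and the 1,000 stand-in rows are assumptions for illustration.

```python
import random

def two_stage_split(rows, first_fraction=0.8, second_fraction=0.75, seed=42):
    """Split off a test set first, then split the remainder into train/validation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)

    # First split: 80% goes on to the second split, 20% is held out for testing.
    cut1 = int(len(shuffled) * first_fraction)
    first, test = shuffled[:cut1], shuffled[cut1:]

    # Second split: 75% of the 80% is training (60% overall),
    # 25% of the 80% is validation (20% overall).
    cut2 = int(len(first) * second_fraction)
    train, validation = first[:cut2], first[cut2:]
    return train, validation, test

rows = list(range(1000))  # stand-in for the voter rows
train, validation, test = two_stage_split(rows)
print(len(train), len(validation), len(test))  # 600 200 200
```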
After it runs, there are 2 new columns in the dataset –> Scored labels and probabilities so each row now has a score.
With the model in place, I can then evaluate it using an Evaluate Model module.
And we can see an AUC of .666, which immediately made me think of this
In any event, I added a Logistic Regression and a Boosted Decision Tree to the canvas and hooked them up to the training and validation sets.
And this is what we have
SVM: .666 AUC
Regression: .689 AUC
Boosted Decision Tree: .713 AUC
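To put those numbers in context: AUC is the probability that a randomly chosen voter is scored higher than a randomly chosen non-voter, so .5 is a coin flip and 1.0 is a perfect ranking. The Python sketch below computes it directly from that definition. The labels and scores are toy values invented for the example, not output from the experiment, and this brute-force pairwise version is only meant to show the idea, not how Azure ML calculates it.

```python
def auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(positives) * len(negatives))

# Toy data: 1 = voted, 0 = did not vote, with made-up scored probabilities.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.5, 0.2]
print(auc(labels, scores))
```

In the toy data, 7 of the 9 voter/non-voter pairs are ranked correctly, so the result is 7/9.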
So with Boosted Decision Tree ahead, I added a Sweep Parameters module to see if I can tune it more. I am using AUC as the performance metric.
So the best AUC I am going to get is .7134 with the highlighted parameters. I then added one more model that uses those parameters, trains against the entire training dataset (80% of the total), and is evaluated against the remaining 20%.
With the final answer of
With that in hand, I can create a new experiment that will be the basis of a real-time voting app.