Kaggle and R

Following up on last week’s post about entering a Kaggle competition, I decided to explore the data further in R on my local desktop. The competition is about analyzing a large group of house claims to give each one a risk score.

I opened RStudio to take a look at the initial data:

    train <- read.csv("../Data/train.csv")
    head(train)
    summary(train)

    plot(train$Hazard)


A couple of things popped out. All of the X variables look to be categorical. Even the result, “Hazard”, is an integer, with most of the values falling between 1 and 9.
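One way to back up the "looks categorical" impression with numbers is to check each column's class and how many distinct values it holds. The sketch below is only an illustration: it runs on a tiny made-up data frame, and the column names `X1` and `X2` and every value in it are assumptions, not the competition data. With the real file, `train` would come from `read.csv` instead.

```r
# Toy stand-in for train.csv; column names X1/X2 and all values are made up
train <- data.frame(
  Id = 1:8,
  Hazard = c(1L, 1L, 1L, 2L, 3L, 5L, 9L, 20L),
  X1 = c("a", "b", "a", "c", "b", "a", "c", "a"),
  X2 = c(3L, 1L, 2L, 3L, 1L, 1L, 2L, 3L),
  stringsAsFactors = TRUE
)

# Class of every column: factors are categorical, integers may be coded categories
col.classes <- sapply(train, class)
print(col.classes)

# Distinct values per column; a small count hints at a categorical variable
distinct.counts <- sapply(train, function(col) length(unique(col)))
print(distinct.counts)

# Share of rows whose Hazard falls between 1 and 9
share.low <- mean(train$Hazard >= 1 & train$Hazard <= 9)
print(share.low)
```

A column with only a handful of distinct values is a good candidate for treating as a factor, even when R reads it in as an integer.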

With that in mind, I decided to split the dataset into two sections: the majority and the minority.

    train.low <- subset(train, Hazard < 9)
    train.high <- subset(train, Hazard >= 9)

    plot(train.low$Hazard)
    plot(train.high$Hazard)

The under-9 group looks like this:


And the 9-and-over group looks like this:


But I also wanted to look at the Hazard score from a distribution point of view:

    hazard.frame <- as.data.frame(table(train$Hazard))
    colnames(hazard.frame) <- c("hazard","freq")
    hist(hazard.frame$freq)
    plot(x=hazard.frame$hazard, y=hazard.frame$freq)
    # log() has no argument named y, so take the log of the counts and pass that as y
    plot(x=hazard.frame$hazard, y=log(hazard.frame$freq))

The histogram shows the counts bunched up on the left:




and the log plot really shows the distribution
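One detail worth knowing about the frequency table: `as.data.frame(table(...))` stores the scores as factor levels, so `plot` treats the x-axis as categories rather than numbers. A minimal sketch of converting them back before plotting the log counts follows. The `hazard` vector here is made up for illustration and the `log.freq` column name is my own choice.

```r
# Made-up Hazard values standing in for train$Hazard
hazard <- c(1, 1, 1, 1, 2, 2, 3, 5, 9, 20)

hazard.frame <- as.data.frame(table(hazard))
colnames(hazard.frame) <- c("hazard", "freq")

# table() keeps the scores as factor levels; convert them back to numbers
hazard.frame$hazard <- as.numeric(as.character(hazard.frame$hazard))

# Log of the counts against the numeric score
hazard.frame$log.freq <- log(hazard.frame$freq)
plot(x = hazard.frame$hazard, y = hazard.frame$log.freq,
     xlab = "Hazard", ylab = "log(frequency)")
```

Going through `as.character` first matters: calling `as.numeric` directly on a factor returns the internal level codes, not the original scores.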


So there is clearly a diminishing return going on. As of this writing, the leader is at 40%, which works out to about 20,400 of the 51,000 entries. If you could identify all of the ones correctly, you should already be 37% of the way there. To test this, I submitted a file of nothing but ones to Kaggle:
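The arithmetic above, and the all-ones submission, can be sketched in a few lines of R. This is a rough illustration under several assumptions: it follows the post's reading of the leaderboard score as a share of entries, it uses a made-up `hazard` vector rather than the real data, and the `Id`/`Hazard` column layout and the `all_ones.csv` file name are guesses at the submission format, not something confirmed by the competition.

```r
# Figures quoted in the post
entries <- 51000
leader.share <- 0.40
leader.count <- entries * leader.share
print(leader.count)

# Share of ones in a made-up Hazard vector (not the real data)
hazard <- c(1, 1, 1, 2, 3, 5, 9, 20)
share.ones <- mean(hazard == 1)
print(share.ones)

# All-ones submission; the Id/Hazard layout and file name are assumptions
test.ids <- 1:5
submission <- data.frame(Id = test.ids, Hazard = 1)
out.file <- file.path(tempdir(), "all_ones.csv")
write.csv(submission, out.file, row.names = FALSE)
```

On the real data, `mean(train$Hazard == 1)` would give the share of ones directly, and `test.ids` would come from the competition's test file.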


LOL. They must take points away for incorrect answers, because the score is the same as the “all 0” benchmark. Going back to the data, I know that if I can predict the ones correctly and make a reasonable guess at the rest, I might be OK. I went back and tuned my model enough to get out of the bottom 25%, and then let it be. I assume there is something obvious or industry-standard that I am missing, because there are so many people between my position and the top 25%.

