# R for the .NET Developer

I spent some time over the last week putting my ideas down for a new speaking topic: “R for the .NET Developer.” With Microsoft acquiring Revolution Analytics and making a concerted push into analytics tooling and platforms, it makes sense that .NET developers have some exposure to the most common language in the data science space – R.

I started the presentation using Prezi (thanks David Green) and set up the major points I wanted to cover:

• R Overview
• R Language Features
• R In Action
• R Lessons Learned

You can see the Prezi here.

I worked through and then borrowed from several different books:

I also used this great YouTube clip and this Pluralsight course. I then jumped into RStudio to work through some of the code ideas that the Prezi illustrates. The entire set of code is on GitHub here, but I wanted to show a couple of the cooler things that I did.

First, I implemented the automotive example in R from the Data Mining and Business Analytics book. This is pretty much a straight port of the author's exercise, with the exception that I convert some vectors to factors to demonstrate how and when to do it:

```
setwd("C:\\Git\\R4DotNet")

# y = x1 + x2 + x3 + E
# y is what you are trying to explain
# x1, x2, x3 are the variables that cause/influence y
# E is everything we are not measuring or using in the calculation

# Assumption: the line that loads the data is missing from this listing;
# the file name below is a guess.
fuel.efficiency <- read.csv("FuelEfficiency.csv")
summary(fuel.efficiency)

# MPG = Miles per gallon
# GPM = Gallons per 100 miles
# WT  = Weight of car in 1000 lbs
# DIS = Displacement in cubic inches
# NC  = Number of cylinders
# HP  = Horsepower
# ACC = Acceleration in seconds from 0-60
# ET  = Engine type, 0 = V, 1 = Straight

plot(GPM ~ WT, data = fuel.efficiency)
plot(GPM ~ DIS, data = fuel.efficiency)

# NC and ET are categories, not measurements, so treat them as factors
fuel.efficiency$NC <- factor(fuel.efficiency$NC)
fuel.efficiency$ET <- factor(fuel.efficiency$ET)
summary(fuel.efficiency)

plot(GPM ~ NC, data = fuel.efficiency)

model <- lm(GPM ~ ., data = fuel.efficiency)
summary(model)

# Multiple R-squared:  0.9804
# means that we can explain 98% of the variance in GPM with the variables we have, E = 2%
# That is pretty friggen good

# turning back to numeric so we can do cor across the data frame
# (go through as.character so we get the original values, not the factor level codes)
fuel.efficiency$NC <- as.integer(as.character(fuel.efficiency$NC))
fuel.efficiency$ET <- as.integer(as.character(fuel.efficiency$ET))
cor(fuel.efficiency)

# DIS -> WT = 0.9507647

library(leaps)
x <- fuel.efficiency[, 3:7]
y <- fuel.efficiency[, 2]
out <- summary(regsubsets(x, y, nbest = 2, nvmax = ncol(x)))
# Assumption: the line that builds tab is missing from this listing;
# this version combines the chosen variables with R-squared, adjusted R-squared, and Cp.
tab <- cbind(out$which, out$rsq, out$adjr2, out$cp)
tab

# trade-off between model size and model fit
# fit a model on weight alone to compare

model2 <- lm(GPM ~ WT, data = fuel.efficiency)
summary(model2)
```
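To make the factor point a bit more concrete, here is a minimal sketch on made-up numbers (the `nc` and `gpm` vectors are hypothetical toy data, not the fuel efficiency data set). It shows two things that tripped me up coming from .NET: calling `as.integer` on a factor gives you the internal level codes rather than the original values, and a factor in a model formula expands into dummy columns instead of a single slope.

```r
# Hypothetical toy data: cylinder counts stored as plain numbers
nc <- c(4, 6, 8, 4, 8, 6)

# As a factor, R treats each distinct value as a category (a "level")
nc.factor <- factor(nc)
levels(nc.factor)

# as.integer() on a factor returns the internal level codes (1, 2, 3, ...)
codes <- as.integer(nc.factor)

# Going through as.character() first recovers the original numbers
values <- as.integer(as.character(nc.factor))

# Made-up response values, only here so a formula can be built
gpm <- c(3.1, 4.2, 5.6, 3.0, 5.9, 4.4)

# A numeric predictor gets one column (plus the intercept);
# a factor with k levels gets k - 1 dummy columns (plus the intercept)
cols.numeric <- ncol(model.matrix(gpm ~ nc))
cols.factor  <- ncol(model.matrix(gpm ~ nc.factor))
```

The same distinction drives the plots: `plot` with a numeric predictor draws a scatter plot, while a factor predictor draws one box per level.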

Here are the plots (as continuous and as a factor):

Then, I implemented this K-Means example from Azure ML in R to show the difference between the two implementations. The Azure ML experiment is found here, and my code looks like this. Note that I did not do a regression:

```
# iris.data has no header row, so read it with header = FALSE
flowers <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", header = FALSE)
summary(flowers)

colnames(flowers) <- c("F1", "F2", "F3", "F4", "Label")
summary(flowers)

# 60% of the rows for training, the rest for testing
indexes <- sample(1:nrow(flowers), size = 0.6 * nrow(flowers))
flowers.train <- flowers[indexes, ]
flowers.test <- flowers[-indexes, ]

# cluster on the four numeric features with k = 5
fit <- kmeans(flowers.train[, 1:4], 5)
fit

# plot the first two features colored by cluster, then overlay the cluster centers
plot(flowers.train[c("F1", "F2")], col = fit$cluster)
points(fit$centers[, c("F1", "F2")], col = 1:5, pch = 8, cex = 2)
```
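The listing builds `flowers.test` but never uses it. One way to put the held-out rows to work, sketched here under a few assumptions, is to assign each test row to its nearest cluster center and cross-tabulate the result against the known labels. The sketch uses R's built-in `iris` data frame instead of the downloaded file, picks k = 3 to match the three species, and `nearest.center` is a hypothetical helper of my own, not something `kmeans` provides.

```r
set.seed(42)  # arbitrary seed so the random split and the k-means starts are repeatable

# Built-in iris data stands in for the downloaded file (assumption)
features <- iris[, 1:4]
train.idx <- sample(seq_len(nrow(iris)), size = 0.6 * nrow(iris))
train <- features[train.idx, ]
test  <- features[-train.idx, ]

# k = 3 is an assumption chosen to match the number of species
fit <- kmeans(train, centers = 3, nstart = 10)

# Hypothetical helper: index of the center closest (squared Euclidean) to one row
nearest.center <- function(row, centers) {
  which.min(colSums((t(centers) - row)^2))
}

# Assign every held-out row to its nearest center
test.cluster <- apply(as.matrix(test), 1, nearest.center, centers = fit$centers)

# Cross-tabulate assigned cluster against the known species
result <- table(cluster = test.cluster, species = iris$Species[-train.idx])
result
```

Cluster numbers are arbitrary, so the table is read by looking for one dominant species per row rather than expecting cluster 1 to line up with the first species.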

With a plot example like this:

So I think I am ready for the presentation. It is really true: the best way to learn about something is to teach it…