The Wright Brothers and Scrum

I recently read two books that, on the surface, have nothing to do with each other but actually teach very similar lessons.  The first is David McCullough’s The Wright Brothers; the second is Jeff Sutherland’s Scrum: The Art of Doing Twice the Work in Half the Time.


Although the human-interest side of the story was interesting to me, what really stood out was how the Wright brothers got their machine into the air.  If you are not familiar with the details of how they constructed the Wright Flyer, there are some pretty interesting points:

1) The Wright brothers were a small agile team that comprised no fewer than two (the brothers themselves) and no more than seven.  I did not realize how important Charlie Taylor and William Tate were to different pieces of the project.

2) The Wright brothers spent the first part of their journey doing research, combing all of the scientific literature and engaging with the thought leaders of the day in a very open-source style: they would freely share knowledge but retain the final product for themselves.  This was in direct contrast to other teams that operated in silos and secrecy.

3) The Wright Brothers believed in doing one thing at a time well.  They realized that there were two major problems with heavier-than-air flight: thrust and balance.  They separated these two concerns and tackled the balance problem first.  Once they figured out how to make a glider stable in flight, they then tackled how to add a motor to it.

4) The Wright Brothers made hundreds of small incremental changes, with each change able to stand on its own.  For example, they went out to Kitty Hawk in the summers of 1900, 1901, and 1902 with gliders before going out a fourth time in 1903 with their airplane.  Each time, the designs got bigger and closer to the final goal.

5) The Wright brothers were willing to challenge conventional and commonly accepted “facts” when their evidence did not support them.  The Wright brothers relied heavily on the calculations of Lilienthal and Chanute to measure lift and drag.  After several failed experiments, the Wright brothers ditched those and went with their own, painstakingly researched, measurement tables.

6) The Wright Brothers were in direct competition with Samuel Langley’s airplane.  In contrast to the Wright Brothers’ agile approach, Langley had a large team that operated in absolute secrecy while taking massive (at the time) amounts of public funds.  When Langley finally rolled out his “final product”, it failed miserably every time.

So what does the Wright Brothers’ methodology have to do with Scrum?  Everything.  If you look at the core tenets of Sutherland’s book, most of them can be found in how the Wrights conquered the air.  I went through the end of each chapter of Scrum and pulled out some of the take-away points that directly map to how Orville and Wilbur did things:

[table mapping Scrum take-away points to the Wright brothers’ practices]

The Wright brothers were doing Scrum a full hundred years before it became a thing.  As amazing as what they created was, how they did it is just as remarkable.  Interestingly for me, the “it’s the journey, not the destination” point really hit home.  As I write this blog, it is Friday night and I am on my front porch.  A neighbor stopped by to say “hello”.  When I told her I was working on a blog post related to my profession, she said “Oh, I am sorry you are not doing anything fun tonight.”  And I said “But this is fun.”  Internally I was thinking “I wonder why so many people think work is not fun?  Why are so many people socialized that way?  I hope my kids don’t wind up like that.”

The Counted Part 3: Law Enforcement Officers Killed In Line Of Duty

As a follow-up to this post, I decided to look at the other side of the gun: police officers killed in the line of duty.  Fortunately, the FBI collects this data here.  It looks like the FBI is a bit behind on its summary reports:

[screenshot: FBI summary reports page, with 2013 the most recent year available]

So taking the 2013 data as the closest data point to The Counted 2015 data, it took a couple of minutes to download the Excel spreadsheet and format it as a usable .csv:

[screenshots: the original Excel spreadsheet and the reformatted .csv]

After importing the data into RStudio, I did a quick summary on the data frame.  The most striking thing out of the gate is how few officers are killed.  There were 27 in 2013, compared to over 500 people killed by police officers in the first half of 2015:

officers.killed <- read.csv("./Data/table_1_leos_fk_region_geographic_division_and_state_2013.csv")
sum(officers.killed$OfficersKilled)

I then merged in the state population to compute a similar per-capita ratio and map:

officers.killed.2 <- merge(x=officers.killed,
                           y=state.population.3,
                           by.x="StateName",
                           by.y="NAME")

officers.killed.2$AdjustedPopulation <- officers.killed.2$POPESTIMATE2014/10000
officers.killed.2$KilledRatio <- officers.killed.2$OfficersKilled/officers.killed.2$AdjustedPopulation
officers.killed.2$AdjKilledRatio <- officers.killed.2$KilledRatio * 10
officers.killed.2$StateName <- tolower(officers.killed.2$StateName)

choropleth.3 <- merge(x=all.states,
                      y=officers.killed.2,
                      sort = FALSE,
                      by.x = "region",
                      by.y = "StateName",
                      all.x=TRUE)
choropleth.3 <- choropleth.3[order(choropleth.3$order), ]
summary(choropleth.3)

qplot(long, lat, data = choropleth.3, group = group, fill = AdjKilledRatio,
      geom = "polygon")

[choropleth: adjusted officers-killed ratio by state]

So Louisiana and West Virginia seem to have the highest number of officers killed per capita.  I am not surprised, given that I had no expectations about which states would have higher and lower numbers.  It seems like a case of “gee-whiz” data.

Since there are so few instances, I decided to forgo any more analysis of officers killed and instead combined this data with the data on people who were killed by police:

the.counted.state.5 <- merge(x=the.counted.state.4,
                             y=officers.killed.2,
                             by.x="StateName",
                             by.y="StateName")

names(the.counted.state.5)[names(the.counted.state.5)=="AdjKilledRatio.x"] <- "NonPoliceKillRatio"
names(the.counted.state.5)[names(the.counted.state.5)=="AdjKilledRatio.y"] <- "PoliceKillRatio"

the.counted.state.6 <- data.frame(the.counted.state.5$NonPoliceKillRatio,
                                  the.counted.state.5$PoliceKillRatio,
                                  log(the.counted.state.5$NonPoliceKillRatio),
                                  log(the.counted.state.5$PoliceKillRatio))

colnames(the.counted.state.6) <- c("NonPoliceKilledRatio","PoliceKilledRatio","LoggedNonPoliceKilledRatio","LoggedPoliceKilledRatio")

plot(the.counted.state.6)

The log transform certainly helps, and there seems to be a relationship between states where police are killed and states where people are killed by police (my hand-drawn red lines added):

[scatterplot matrix of the killed ratios, with hand-drawn red trend lines]

With that in mind, I created a couple of linear models:

non.police <- the.counted.state.6$LoggedNonPoliceKilledRatio
police <- the.counted.state.6$LoggedPoliceKilledRatio
police[police==-Inf] <- NA

model <- lm( non.police ~ police )
summary(model)

model.2 <- lm( police ~ non.police)
summary(model.2)

Since there are only two variables, the adjusted R-squared is the same for x~y and y~x.

[regression summary output for both models]
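
As a quick sanity check on that claim, here is a toy example (fabricated data, not the actual ratios): with a single predictor, R-squared is just the squared correlation between the two variables, and since n and the number of predictors are the same in both directions, the adjusted value matches as well.

set.seed(42)
x <- rnorm(50)
y <- 0.96 * x + rnorm(50, sd = 0.5)   # made-up data for illustration only

summary(lm(y ~ x))$adj.r.squared      # same value...
summary(lm(x ~ y))$adj.r.squared      # ...as this
cor(x, y)^2                           # and both track the squared correlation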

The first interesting thing is that the model has to account for the fact that many states had 0 police fatalities but at least 1 person killed by the police.  The next interesting thing is the value of the coefficient: in states where there was at least 1 police fatality and 1 person killed by the police, every police fatality increases the number of people killed by police by .96, with both figures on the log scale of the population-adjusted ratios.  So it shows that the police are better at killing than at getting killed, which makes sense.

The full gist is found here.

The Counted Part 2: Analysis Using R

Following up on this post from last week on analyzing The Counted using F# and R, I decided to look a bit closer at the data.  In last week’s post, I had a data frame of all of the people killed by law enforcement, collected by The Guardian, for the first half of 2015.  Although interesting to look at, I am not sure that the map tells us anything.  The data frame looks like this:

[preview of The Counted data frame]

My first thought was to look at killings by population by US state.  Step #1 was to sum up the number of rows by state code:

the.counted.state <- data.frame(table(the.counted$state))
colnames(the.counted.state) <- c("StateCode","NumberKilled")
summary(the.counted.state)

[summary of the killings-per-state data frame]

I then brought in the latest population figures by state from the US Census:

state.population <- read.csv("http://www.census.gov/popest/data/state/asrh/2014/files/SCPRC-EST2014-18+POP-RES.csv")
state.population

[preview of the US Census state population data]

And finally I brought in a crosswalk table of US state codes (which is what The Counted data uses) and US state names (which is what the US Census data uses):

state.crosswalk <- read.csv("http://www.fonz.net/blog/wp-content/uploads/2008/04/states.csv")
state.crosswalk

[preview of the state code crosswalk table]

 

I then merged all three data frames together using the state name and state code as the common key:

state.population.2 <- state.population[c(5,6)]
state.population.3 <- merge(x=state.population.2,
                            y=state.crosswalk,
                            by.x="NAME",
                            by.y="State")
#The Counted with population
the.counted.state <- merge(x=the.counted.state,
                           y=state.population.3,
                           by.x="StateCode",
                           by.y="Abbreviation")

I then tried to add a column that divided the total number of killed individuals by the number of people in the state:

the.counted.state.2 <- the.counted.state
the.counted.state.2$KilledRatio <- the.counted.state.2$NumberKilled/the.counted.state.2$POPESTIMATE2014

[preview of the data frame with the raw KilledRatio column]

The problem quickly became obvious: there were not enough people in the numerator to make a meaningful straight division.  To compensate, I divided the number of people in each state by 10,000.  I also increased the kill ratio by a factor of 10 so that we get a scale that runs roughly from 0 to 1 in steps of .1, which is easily digestible.  Finally, I renamed the variable “NAME” to “StateName” because my OCD couldn’t let such an affront to the naming gods go unpunished.

the.counted.state.3 <- the.counted.state
the.counted.state.3$AdjustedPopulation <- the.counted.state.2$POPESTIMATE2014/10000
the.counted.state.3$KilledRatio <- the.counted.state.3$NumberKilled/the.counted.state.3$AdjustedPopulation
the.counted.state.3$AdjKilledRatio <- the.counted.state.3$KilledRatio * 10

names(the.counted.state.3)[names(the.counted.state.3)=="NAME"] <- "StateName"
the.counted.state.3$StateName <- tolower(the.counted.state.3$StateName)

 

[preview of the data frame with the adjusted columns]
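
To make the scaling concrete, here is a small worked example with hypothetical numbers (not a real state): the net effect of dividing the population by 10,000 and then multiplying the ratio by 10 is a rate of killings per 100,000 residents.

population    <- 5000000                                  # hypothetical state population
number.killed <- 10                                       # hypothetical number of killings
adjusted.population <- population / 10000                 # 500
killed.ratio     <- number.killed / adjusted.population   # 0.02
adj.killed.ratio <- killed.ratio * 10                     # 0.2 killings per 100,000 residents
adj.killed.ratio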

With the data prepped, I created a choropleth to show the kill ratio by state using a gradient scale:

choropleth <- merge(x=all.states,
                    y=the.counted.state.3,
                    sort = FALSE,
                    by.x = "region",
                    by.y = "StateName",
                    all.x=TRUE)
choropleth <- choropleth[order(choropleth$order), ]
summary(choropleth)

qplot(long, lat, data = choropleth, group = group, fill = AdjKilledRatio,
      geom = "polygon")

[choropleth: adjusted kill ratio by state]

Note that I had to use all.x=TRUE to account for the fact that South Dakota and Vermont did not have any killings so far in 2015.  This is equivalent to a left outer join, for you SQL folks.  On a side note, what’s up with Oklahoma?
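
Here is a minimal sketch of what all.x=TRUE buys you, using toy data frames rather than the real ones: states with no match on the right-hand side still come through, with NA in the unmatched columns, exactly like a SQL left outer join.

left.side  <- data.frame(region = c("south dakota", "vermont", "oklahoma"))
right.side <- data.frame(StateName = c("oklahoma"), AdjKilledRatio = c(0.9))   # made-up value

merge(x = left.side, y = right.side,
      by.x = "region", by.y = "StateName", all.x = TRUE)
# south dakota and vermont come back with NA for AdjKilledRatio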

I then decided to bin the data into high, medium, and low categories.  Looking at the detail of AdjKilledRatio, there seem to be some natural breaks around 10% and 20%:

the.counted.state.4$AdjKilledRatio
summary(the.counted.state.4$AdjKilledRatio)

[summary of AdjKilledRatio]

So I binned it that way:

the.counted.state.4$KilledBin <- cut(the.counted.state.4$AdjKilledRatio,
                                     breaks=seq(0,1,.1))
summary(the.counted.state.4$KilledBin)

[summary of the ten-bin KilledBin factor]

The problem with my code is that this gives me 10 bins and I only really need 3.  Fortunately, this Stack Overflow post helped me re-write the binning into 3 factors.  Note the Inf on the high side and the labels.

the.counted.state.4$KilledBin <- cut(the.counted.state.4$AdjKilledRatio,
                                     breaks=c(seq(0,.2,.1),Inf),
                                     labels=c("low","med","high"))

And this gives me a pretty good distribution of bins:

[summary of the low/med/high KilledBin factor]
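
To see how the breaks and labels behave, here is a quick run of the same cut() call over three made-up ratios: anything in (0, .1] lands in “low”, (.1, .2] in “med”, and the Inf break sweeps everything above .2 into “high”.

sample.ratios <- c(0.05, 0.15, 0.35)   # fabricated values for illustration
cut(sample.ratios,
    breaks=c(seq(0,.2,.1),Inf),
    labels=c("low","med","high"))
# [1] low  med  high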

With things binned up, I added another choropleth and map:

choropleth.2 <- merge(x=all.states,
                      y=the.counted.state.4,
                      sort = FALSE,
                      by.x = "region",
                      by.y = "StateName",
                      all.x=TRUE)
choropleth.2 <- choropleth.2[order(choropleth.2$order), ]
summary(choropleth.2)

qplot(long,
      lat,
      data = choropleth.2,
      group = group,
      fill = KilledBin,
      geom = "polygon")

[choropleth: binned kill ratio by state]

If you squint, it almost looks like a map of the Civil War, no?

The Counted: Initial Analysis Using FSharp and R

(Note: this is post one of three.  Next week is a deeper dive into the data and the following week is an analysis of law enforcement officers killed in the line of duty)

Andrew Oliver hit me up on Twitter with a new dataset that he stumbled across.  The dataset is called “The Counted” and it is an attempt to count all of the deaths at the hands of police in America in 2015.  Apparently, this data is not collected systematically by the US government, which is kind of puzzling.  You can read about and download the data here.  A sample looks like:

[sample rows from The Counted dataset]

John asked what we could do with the dataset, especially when comparing it to other variables like socio-economic status.  Step #1 in my mind was to geo-locate the data.  Since this is a .csv, the very first thing was to remove the extra commas inside fields and replace them with semi-colons or blank spaces (for example, “US Marshals Service, Pennsylvania State Police, Allegheny County Sheriff’s Office” became “US Marshals Service; Pennsylvania State Police; Allegheny County Sheriff’s Office”, and “Corrections Department, 1400 E 4th Ave” became “Corrections Department 1400 E 4th Ave”).

Adding Geolocations

Drawing on the code I wrote using Texas A&M’s geocoding service, found here, I converted the JSON type provider script into a function that takes address info and returns a geolocation:

let getGeoCoordinates(streetAddress:string, city:string, state:string) =
    let apiKey = "xxxxx"
    let stringBuilder = new StringBuilder()
    stringBuilder.Append("https://geoservices.tamu.edu/Services/Geocode/WebService/GeocoderWebServiceHttpNonParsed_V04_01.aspx") |> ignore
    stringBuilder.Append("?streetAddress=") |> ignore
    stringBuilder.Append(streetAddress) |> ignore
    stringBuilder.Append("&city=") |> ignore
    stringBuilder.Append(city) |> ignore
    stringBuilder.Append("&state=") |> ignore
    stringBuilder.Append(state) |> ignore
    stringBuilder.Append("&apiKey=") |> ignore
    stringBuilder.Append(apiKey) |> ignore
    stringBuilder.Append("&version=4.01") |> ignore
    stringBuilder.Append("&format=json") |> ignore

    let searchUri = stringBuilder.ToString()
    let searchResult = GeoLocationServiceContext.Load(searchUri)

    let firstResult = searchResult.OutputGeocodes |> Seq.head
    firstResult.OutputGeocode.Latitude, firstResult.OutputGeocode.Longitude, firstResult.OutputGeocode.MatchScore

I then loaded in the dataset via the CSV type provider:

[<Literal>]
let theCountedSample = "..\Data\TheCounted.csv"
type TheCountedContext = CsvProvider<theCountedSample>
let theCountedData = TheCountedContext.Load(theCountedSample)

I then mapped the geocoding function over the imported dataset:

let theCountedGeoLocated = theCountedData.Rows
                           |> Seq.map(fun r -> r, getGeoCoordinates(r.Streetaddress, r.City, r.State))
                           |> Seq.toList
                           |> Seq.map(fun (r,(lat,lon,ms)) ->
                               String.Format("{0},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15}",
                                   r.Name,r.Age,r.Gender,r.Raceethnicity,r.Month,r.Day,r.Year,r.Streetaddress,
                                   r.City,r.State,r.Cause,r.Lawenforcementagency,r.Armed,lat,lon,ms))

And then finally exported the data:

let baseDirectory = __SOURCE_DIRECTORY__
let baseDirectory' = Directory.GetParent(baseDirectory)
let filePath = "Data\TheCountedWithGeo.csv"
let fullPath = Path.Combine(baseDirectory'.FullName, filePath)
File.WriteAllLines(fullPath,theCountedGeoLocated)

[preview of TheCountedWithGeo.csv]

The gist is here.  Using the CSV and JSON type providers made the analysis a snap; the majority of the code is just building up the string for the service call.  +1 for simplicity.

Analyzing The Results

After adding geolocations to the dataset, I opened RStudio and imported the data:

theCounted <- read.csv("./Data/TheCountedWithGeo.csv")
summary(theCounted)

 

[summary of the geolocated data frame]

This is good news: we have good confidence on all of the observations, so we don’t have to drop any records (making the counted un-counted, as it were).

I then googled how to create a US map and put some data points on it, and ran across this post.  I copied and pasted the code, changed the variable names, said “there is no way it is this easy” out loud, and hit CTRL+ENTER.

library(ggplot2)
library(maps)

all.states <- map_data("state")
plot <- ggplot()
plot <- plot + geom_polygon(data=all.states, aes(x=long, y=lat, group = group),
                            colour="grey", fill="white" )
plot <- plot + geom_point(data=theCounted, aes(x=lon, y=lat),
                          colour="#FF0040")
plot <- plot + guides(size=guide_legend(title="Homicides"))
plot

 

[map of the continental US with homicide locations plotted]

The gist is here.

Sandcastle Help File Builder and FSharp

If you are going to write and release a professional-grade .NET assembly, there are some things that need to be considered: logging, exception handling, and documentation.  For .NET components, Sandcastle Help File Builder is the go-to tool for generating documentation as either an old-school .chm file or a web deployment.

Consider an assembly that contains a Customer record type, an interface for a customer repository, and two implementations (in-memory and ADO.NET):

type Customer = {id:int; firstName:string; lastName:string}

type ICusomerRepository =
    abstract member GetCustomer : int -> Customer
    abstract member InsertCustomer: Customer -> int
    abstract member DeleteCustomer: int -> unit

type InMemoryCustomerRepository () =
    let customers = [
        {id=1; firstName = "First"; lastName = "Customer"}
        {id=2; firstName = "Second"; lastName = "Customer"}
        {id=3; firstName = "Third"; lastName = "Customer"}]
    let customers' = new List<Customer>(customers)

    interface ICusomerRepository with
        member this.GetCustomer(id:int) =
            customers' |> Seq.find(fun c -> c.id = id)
        member this.InsertCustomer(customer: Customer) =
            let nextId = customers'.Count
            let customer' = {customer with id=nextId}
            customers'.Add(customer')
            nextId
        member this.DeleteCustomer(id: int) =
            let customer = customers |> Seq.find(fun c -> c.id = id)
            customers'.Remove(customer) |> ignore

type SqlServerCustomerRepository (connectionString:string) =
    interface ICusomerRepository with
        member this.GetCustomer(id:int) =
            use connection = new SqlConnection(connectionString)
            let commandText = "Select * from customers where id = " + id.ToString()
            use command = new SqlCommand(commandText, connection)
            connection.Open()
            use reader = command.ExecuteReader()
            reader.Read() |> ignore
            {id=reader.[0] :?> int;
             firstName=reader.[1] :?> string;
             lastName =reader.[2] :?> string}

        member this.InsertCustomer(customer: Customer) =
            use connection = new SqlConnection(connectionString)
            let commandText = new StringBuilder()
            commandText.Append("Insert customers values") |> ignore
            commandText.Append(customer.firstName) |> ignore
            commandText.Append(",") |> ignore
            commandText.Append(customer.lastName) |> ignore
            use command = new SqlCommand(commandText.ToString(), connection)
            connection.Open()
            command.ExecuteNonQuery()

        member this.DeleteCustomer(id: int) =
            use connection = new SqlConnection(connectionString)
            let commandText = "Delete customers where id = " + id.ToString()
            use command = new SqlCommand(commandText, connection)
            connection.Open()
            command.ExecuteNonQuery() |> ignore

To generate the XML documentation file during the build, you need to check “XML documentation file” on the Build page of the project properties:

[screenshot: project Build properties with “XML documentation file” checked]

With the .XML file created during the build, you can then fire up Sandcastle and point it at the .XML file:

[screenshot: Sandcastle Help File Builder pointing at the generated .XML file]

With that, you get some nice component documentation based on your XML code comments.  Since I have not put any comments into my project yet, there is nothing in the docs:

[screenshot: generated help file with empty member documentation]

So therein lies the rub.  I started entering XML comments (bare minimum) like so:

/// <summary>
/// Interface for Customer Repository implementations.
/// </summary>
type ICusomerRepository =
    /// <summary>
    /// Get a single validated customer.
    /// </summary>
    ///<param name="param0">The customer Id</param>
    ///<returns>A validated Customer.</returns>
    abstract member GetCustomer : int -> Customer
    /// <summary>
    /// Insert a single validated customer.
    /// </summary>
    ///<param name="param0">A validated customer.</param>
    ///<returns>The Id of the customer, generated by the repository.</returns>
    abstract member InsertCustomer: Customer -> int
    /// <summary>
    /// Deletes a single customer from the repository.
    /// </summary>
    ///<param name="param0">The customer Id</param>
    abstract member DeleteCustomer: int -> unit

[screenshot: generated help file with the XML comments included]

And you can see what happens.  The code base goes from 5 lines of readable code to 21 lines of clutter to make the help file.

One of the tenets of good code is that it is clean: so we use SOLID principles, run FxCop, and the like.  Another tenet of good code is that it is uncluttered: so we use F#, use ROP instead of structured exception handling, and avoid boilerplate and templating.  The problem is that we still can’t get away from clutter if we want good documentation.  Option A is to just drop documentation, a laudable but unrealistic goal, especially in a corporate environment.  Option B I am not sure about.  I am wondering if I could create a separate file in the project just for the code comments.  That way the actual code stays uncluttered, you can work with it undistracted, and the XML still gets generated…

 

R for the .NET Developer

I spent some time over the last week putting my ideas down for a new speaking topic: “R for the .NET Developer.” With Microsoft acquiring Revolution Analytics and making a concerted push into analytics tooling and platforms, it makes sense for .NET developers to get some exposure to the most common language in the data science space: R.

I started the presentation using Prezi (thanks David Green) and set up the major points I wanted to cover:

  • R Overview
  • R Language Features
  • R In Action
  • R Lessons Learned

You can see the Prezi here.

I worked through and then borrowed from several different books:

[book cover images]

this great YouTube clip

[YouTube video thumbnail]

and this Pluralsight course

[Pluralsight course thumbnail]

I then jumped into RStudio to work through some of the code ideas that the Prezi illustrates. The entire set of code is found on GitHub here, but I wanted to show a couple of the cooler things that I did.

First, I implemented the automotive example in R from the Data Mining and Business Analytics book. This is pretty much a straight port of the author’s exercise, except that I convert some vectors to factors to demonstrate how/when to do it:

1 setwd("C:\\Git\\R4DotNet") 2 3 #y = x1 + x2 + x3 + E 4 #y is what you are trying explain 5 #x1, x2, x3 are the variables that cause/influence y 6 #E is things that we are not measuring/ using for calculations 7 8 fuel.efficiency <- read.csv("C:/Git/R4DotNet/Data/FuelEfficiency.csv") 9 summary(fuel.efficiency) 10 11 #MPG = Miles per gallon 12 #GPM = Gallons per 100 miles 13 #WT = Weight of car in 1000 lbs 14 #DIS = Displacment in cubic inches 15 #NC = number of cylinders 16 #HP = Horsepower 17 #ACC = Acceleration in seconds from 0-60 18 #ET = Engine Type 0 = V, 1 = Straight 19 20 plot(GPM~WT,data=fuel.efficiency) 21 plot(GPM~DIS,data=fuel.efficiency) 22 23 fuel.efficiency$NC <- factor(fuel.efficiency$NC) 24 fuel.efficiency$ET <- factor(fuel.efficiency$ET) 25 summary(fuel.efficiency) 26 27 plot(GPM~NC,data=fuel.efficiency) 28 29 model <- lm(GPM~.,data=fuel.efficiency) 30 summary(model) 31 32 # Multiple R-squared: 0.9804 33 # means that we can explain 98% of the GPM with the variables we have E = 2% 34 # That is pretty friggen good 35 36 # turning back to numeric so we can do cor accross data frame 37 fuel.efficiency$NC <- as.integer(fuel.efficiency$NC) 38 fuel.efficiency$ET <- as.integer(fuel.efficiency$ET) 39 cor(fuel.efficiency) 40 41 #DIS -> WT = 0.9507647 42 43 library(leaps) 44 x=fuel.efficiency[,3:7] 45 y=fuel.efficiency[,2] 46 out = summary(regsubsets(x,y,nbest=2,nvmax=ncol(x))) 47 tab=cbind(out$which,out$req,out$adjr2,out$cp) 48 tab 49 50 #trade off between model size and model fit 51 #just weight is 52 53 model2 = lm(GPM~WT,data=fuel.efficiency) 54 summary(model2)

Here are the plots (as continuous and as a factor):

[plots: GPM against the continuous predictors and against NC as a factor]
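
As a small illustration that is not part of the original exercise, the single-predictor model can be used to estimate fuel consumption for a hypothetical car; here WT = 3.0 stands for a 3,000 lb vehicle, since weight is recorded in thousands of pounds.

predict(model2, newdata = data.frame(WT = 3.0))   # expected GPM for a hypothetical 3,000 lb car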

Then, I implemented this k-means example from Azure ML to show the difference between the two implementations. The Azure ML experiment is found here, and my code looks like this (note that I did not do a regression):

flowers <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")
summary(flowers)

colnames(flowers) <- c("F1", "F2", "F3", "F4", "Label")
summary(flowers)

indexes = sample(1:nrow(flowers), size=0.6*nrow(flowers))
flowers.train <- flowers[-indexes,]
flowers.test <- flowers[indexes,]

fit <- kmeans(flowers.train[,1:4],5)
fit

plot(flowers.train[c("F1", "F2")], col=fit$cluster)
points(fit$centers[,c("F1", "F2")], col=1:3, pch=8, cex=2)

With a plot example like this:

[plot: k-means clusters of the iris data]
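
One quick follow-on check that is not in the original script: since the iris data carries species labels, a simple cross-tabulation shows how the five k-means clusters line up with the three actual classes.

table(fit$cluster, flowers.train$Label)   # rows are clusters, columns are species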

So I think I am ready for the presentation.  It is really true: the best way to learn something is to teach it…
