With all of the data gathered and organized I was ready to do some analytics using R. The first thing I did was to load the four major datasets into R.
- NCScores is the original dataset that has the school score. I had already done did an analysis on it here.
- SchoolValuation is the aggrgrate property values for each school as determined by scraping the Wake County Tax Website and Wake County School Assignment websites. You can read how it was created here and here.
- SchoolNameMatch is a crosswalk table between the school name as found in the NCScores dataframe and the School Valuation dataframe. You can read how it was created here
- WakeCountySchoolInfo is an export from WCPSS that was tossed around at open data day.
Step one was to reduce the North Carolina Scores data to only Wake County
1 #Create Wake County Scores From NC State Scores
2 WakeCountyScores <- NCScores[NCScores$District == 'Wake County Schools',]
The next step was to add in the SchoolNameMatch so that we have the Tax Valuation School Name
1 #Join SchoolNameMatch to Wake County Scores
2 WakeCountyScores <- merge(x=WakeCountyScores, y=SchoolNameMatch, by.x="School", by.y="WCPSS")
Interestingly, R is smart enough that the common field not duplicated, just the additional field(s) are added
The next step was to add in the Wake County Property Values, remove the Property field as it is no longer needed, and convert the TaxBase field from string to numeric
1 #Join Property Values
2 WakeCountyScores <- merge(x=WakeCountyScores, y=SchoolValuation, by.x="Property", by.y="SchooName")
4 #Remove Property column
5 WakeCountyScores$Property = NULL
7 #Turn tax base to numeric
8 WakeCountyScores$TaxBase <- as.numeric(WakeCountyScores$TaxBase)
Eager to do an analysis, I pumped the data into a correlation
1 #Do a Correlation
So clearly my expectations that property values track with FreeAndReducedLunch (.85 correlation) were not met. I decided to use Practical Data Science with R Chapter 3 (Exploring Data) as a guide to better understand the dataset.
1 #Practical Data Science With R, Chapter3
So there is quite a range in tax base! The next task was to use some graphs to explore the data. I added in ggplot2
and followed the books example for a histogram. I started with score and it comes out as expected. I then tried a historgram on TaxBase and had to tinker with the binwidth to make a meaningful chart:
2 ggplot(WakeCountyScores) + geom_histogram(aes(x=SchoolScore),binwidth=5,fill="gray")
3 ggplot(WakeCountyScores) + geom_histogram(aes(x=TaxBase),binwidth=10000,fill="gray")
5 ggplot(WakeCountyScores) + geom_histogram(aes(x=TaxBase),binwidth=5000000,fill="gray")
The book then moves to an example studying income, which is directly analogous to the TaxBase so I followed it very closely. The next graph were some density graphs. Note the second one is a logarithmic one:
3 ggplot(WakeCountyScores) + geom_density(aes(x=TaxBase)) + scale_x_continuous(labels=dollar)
4 ggplot(WakeCountyScores) + geom_density(aes(x=TaxBase)) + scale_x_log10(labels=dollar) + annotation_logticks(sides="bt")
So kinda interesting that most schools cluster in terms of their tax base, but because there is such a wide range with a majority clustered to the low end, the logarithmic curve is much more revealing.
The book then moved into showing the relationship between two variables. In this case, SchoolScore as the Y variable and TaxBase as the X variable:
1 #Relationship between TaxBase and Scores
2 ggplot(WakeCountyScores, aes(x=TaxBase, y=SchoolScore)) + geom_point()
3 ggplot(WakeCountyScores, aes(x=TaxBase, y=SchoolScore)) + geom_point() + stat_smooth(method="lm")
4 ggplot(WakeCountyScores, aes(x=TaxBase, y=SchoolScore)) + geom_point() + geom_smooth()
So what is interesting is that there does not seem to be a strong relationship between scores and tax base. There looks like an equal number of schools both below the score curve than above it. Note that using a smoothing curve is much better than the linear fit curve in showing the relationship of scores to tax base. You can see the dip in the lower quartile and the increase at the tail. It makes sense that the higher tax base shows an increase in scores, but what’s up with that dip?
Finally, the same data is shown using a hax chart
2 ggplot(WakeCountyScores, aes(x=TaxBase, y=SchoolScore)) + geom_hex(binwidth=c(100000000,5)) + geom_smooth(color="white",se=F)
So taking a step back, it is clear that there is a weakness in this analysis. Some schools have thousands of students, some schools have a couple hundred. (high schools versus elementary students). Using the absolute dollars from the tax valuation is misleading. What we really need is revenue per student. Going back to the SchoolInfo dataframe, I added it in and pulled a student count column.
1 WakeCountyScores <- merge(x=WakeCountyScores, y=WakeCountySchoolInfo, by.x="School", by.y="School.Name")
2 names(WakeCountyScores)[names(WakeCountyScores)=="School.Membership.2013.14..ADM..Mo2."] <- "StudentCount"
3 WakeCountyScores$StudentCount <- as.numeric(WakeCountyScores$StudentCount)
5 WakeCountyScores["TaxBasePerStudent"] <- WakeCountyScores$TaxBase/WakeCountyScores$StudentCount
Interestingly, the number of records in the base frame dropped from 166 to 152, which means that perhaps we need a second mapping table. In any event, you can see that the average tax base is $6.5 million with a max of $114 Million. Quite a range!
Going back to the point and hex graphs
1 ggplot(WakeCountyScores, aes(x=TaxBasePerStudent, y=SchoolScore)) + geom_point() + geom_smooth()
2 ggplot(WakeCountyScores, aes(x=TaxBasePerStudent, y=SchoolScore)) + geom_hex(binwidth=c(25000000,5)) + geom_smooth(color="white",se=F)
There is some interesting going on. First, the initial conclusion that a higher tax base leads to a gradual increase in scores is wrong once you move from total tax to tax per student.
Also, note the significant drop in school scores once you move away from the lowest tax base schools, the recovery, and then the drop again. From a real estate perspective, these charts suggest that the marginal value of a really expensive or really inexpensive house in Wake County is not worth it (at least in terms of where you send you kids), and there is a sweet spot of value above a certain price point.
You can find the gist here and the repo is here.
Some lessons I learned in doing this exercise:
- Some records got dropped between the scores dataframe and the info dataframe -> so there needs to be another mapping table
- Make the tax base in millions
- What’s up with that school with 114 million per student?
- An interesting question is location of dollars to school compared to tax base. I wonder if that is on WCPSS somewhere. Hummm…
- You can’t use the tick(‘) notation, which means you do a lot of overwriting of dataframes. This can be a costly and expensive feature of the language. It is much better to assume immutably, even if you clutter up your data window.
As a final note, I was using the Console window b/c that is what the intro books do. This is a huge mistake in R Studio. It is much better to create a script and send the results to the console
So that you can make changes and run things again. It is a cheap way of avoid head-scratching bugs…