Age and Sex Analysis Of Microsoft USA MVPs
December 25, 2016
A couple of weeks ago, this came across my Twitter feed:
I participated in this hackathon (well, helped run the F# one). My response was:
I was surprised that I got into this exchange with a Microsoft PM:
That last comment by me was inspired by Mark Twain: “Never wrestle with a pig. You just get dirty and the pig likes it.” But it did get me thinking about the composition of the US MVPs. I did an analysis a couple of years ago of the photos of the Microsoft MVPs (found here and here), so it made sense to follow up on that code and see if I was wrong about my “middle age white guy” hypothesis. I could get the photos from the MVP site and pass them into the Microsoft Cognitive Services API for facial analysis to get age/sex data. Using F# made the analysis a snap.
A nice thing about the Microsoft MVP website is that it is public and has photos of the MVPs. Here is one of the pages:
and when you look at the source of the page, each of those photos has a distinct uri:
I opened up Visual Studio and created a new F# project. I went into the script file and brought in the libraries to do some HTTP requests. I then created a couple of functions to pull down the HTML of each of the 19 pages and put it into one big string:
    open System
    open System.IO
    open System.Net

    // Download the HTML for one page of the US MVP search results.
    let getPageContents (pageNumber: int) =
        let uri = Uri("http://mvp.microsoft.com/en-us/search-mvp.aspx?lo=United+States&sl=0&browse=False&sc=s&ps=36&pn=" + pageNumber.ToString())
        let request = WebRequest.Create(uri)
        request.Method <- "GET"
        use response = request.GetResponse()
        use stream = response.GetResponseStream()
        use reader = new StreamReader(stream)
        reader.ReadToEnd()

    // Fetch all 19 pages and concatenate them into one string.
    let contents =
        [| 1..19 |]
        |> Array.map (fun i -> getPageContents i)
        |> Seq.reduce (fun x y -> x + y)

(OT: Since I did a map..reduce in the last two lines of that pipeline, does that mean I am working with “Big Data”?)
I then created a quick parser to find only the uris of the photos in all of the HTML.
    open System.Text.RegularExpressions

    // Extract every photo path from the HTML and turn it into an absolute uri.
    let getUrisFromPageContents (pageContents: string) =
        // Verbatim string so that \d reaches the regex engine unchanged.
        let pattern = @"/PublicProfile/Photo/\d+"
        Regex.Matches(pageContents, pattern)
        |> Seq.cast<Match>
        |> Seq.map (fun m -> m.Value)
        |> Seq.map (fun v -> "https://mvp.microsoft.com/en-us" + v + "?language=en-us")
        |> Seq.toArray

    let uris = getUrisFromPageContents contents
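To see what the pattern picks up, here is a minimal sketch that runs it against a small HTML fragment. The fragment and the photo ids 1111 and 2222 are invented for the example; only the pattern itself comes from the code above.

```fsharp
open System.Text.RegularExpressions

// Hypothetical HTML fragment; the ids 1111 and 2222 are made up for this example.
let sampleHtml =
    """<img src="/PublicProfile/Photo/1111" /><a href="/other/9">x</a><img src="/PublicProfile/Photo/2222" />"""

// Same pattern as in the post: the fixed path followed by one or more digits.
let photoPaths =
    Regex.Matches(sampleHtml, @"/PublicProfile/Photo/\d+")
    |> Seq.cast<Match>
    |> Seq.map (fun m -> m.Value)
    |> Seq.toList

// Prints the two photo paths; the unrelated /other/9 link is ignored.
printfn "%A" photoPaths
```

Only the two photo paths match, so any other links on the page are ignored.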
Sure enough, I got 684 uris for MVP photos. I then wrote another Web Request to pull down each of the photos and save them to disk:
    // Save each photo under a random GUID file name.
    let saveImage (uri: string) =
        use client = new WebClient()
        let id = Guid.NewGuid()
        let path = @"F:\Git\ChickenSoftware.ParseMvpPages.Solution\ChickenSoftware.ParseMvpPages\photos\" + id.ToString() + ".jpg"
        client.DownloadFile(Uri(uri), path)

    uris
    |> Seq.iter saveImage
And I now have all 684 photos on disk.
I did not bring down the names of the MVPs; instead I used a GUID for each file name to randomize the photos, though a name analysis would also be interesting. With the photos now local, I could then upload them to the Microsoft Cognitive Services API to do facial analysis. You can read about the details of the API here. I created a third web request to pass the photo up and get the results from the API:
    open System.Net.Http
    open System.Net.Http.Headers
    open System.Threading
    open System.Web

    // Post one photo to the face detection endpoint; return the JSON body on HTTP 200.
    let getOxfordResults (path: string) =
        let queryString = HttpUtility.ParseQueryString(String.Empty)
        queryString.Add("returnFaceId", "true")
        queryString.Add("returnFaceLandmarks", "false")
        queryString.Add("returnFaceAttributes", "age,gender")
        let uri = "https://api.projectoxford.ai/face/v1.0/detect?" + queryString.ToString()
        let bytes = File.ReadAllBytes(path)
        use client = new HttpClient()
        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", "xxxxxxxxxxx")
        use content = new ByteArrayContent(bytes)
        content.Headers.ContentType <- MediaTypeHeaderValue("application/octet-stream")
        let result = client.PostAsync(uri, content).Result
        // Pause after every call to stay under the API rate limit.
        Thread.Sleep(TimeSpan.FromSeconds(5.0))
        match result.StatusCode with
        | HttpStatusCode.OK -> Some(result.Content.ReadAsStringAsync().Result)
        | _ -> None
Notice that I put a 5 second sleep into the call. This is because Microsoft throttles the requests to 20 per minute. Also, since some of the photos do not have a face, I used the F# option type. The results come back from the Microsoft Cognitive Services API as JSON. To parse the results, I used the FSharp.Data JSON type provider:
    open FSharp.Data

    type FaceInfo = JsonProvider<Sample="[{\"faceId\":\"83045097-daa1-4f1c-8669-ed012e9b5975\",\"faceRectangle\":{\"top\":187,\"left\":209,\"width\":214,\"height\":214},\"faceAttributes\":{\"gender\":\"male\",\"age\":42.8}}]">

    // Return the age and gender of the first detected face, if there is one.
    let parseOxfordResuls results =
        match results with
        | Some r ->
            let face = FaceInfo.Parse(r)
            match Seq.length face with
            | 0 -> None
            | _ ->
                let header = face |> Seq.head
                Some(header.FaceAttributes.Age, header.FaceAttributes.Gender)
        | None -> None
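For readers who want to try the parsing step without the type provider, here is a rough sketch using System.Text.Json instead. This is a swapped-in technique, not what the post uses, and the function name tryFirstFace is made up for the example. It assumes the response has the same shape as the sample above: an array of faces, each with a faceAttributes object.

```fsharp
open System.Text.Json

// Hypothetical helper: return (age, gender) for the first face in a detect
// response, or None when the array is empty (no face found in the photo).
let tryFirstFace (json: string) =
    use doc = JsonDocument.Parse(json)
    let root = doc.RootElement
    if root.GetArrayLength() = 0 then
        None
    else
        let attributes = root.[0].GetProperty("faceAttributes")
        Some(attributes.GetProperty("age").GetDouble(), attributes.GetProperty("gender").GetString())

// The sample response from the post, and an empty response for a photo with no face.
let withFace = """[{"faceId":"83045097-daa1-4f1c-8669-ed012e9b5975","faceRectangle":{"top":187,"left":209,"width":214,"height":214},"faceAttributes":{"gender":"male","age":42.8}}]"""
let noFace = "[]"

printfn "%A" (tryFirstFace withFace)
printfn "%A" (tryFirstFace noFace)
```

The option type does the same job here as in the type provider version: a photo without a face becomes None rather than an exception.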
So now I can get estimated age and gender from the Microsoft Cognitive Services API. I was disappointed that the API does not estimate race. I assume they have the technology but, from a social-acceptance point of view, they don’t make it publicly available. Still, a look through the photos shows that a majority are white people. In any event, I went ahead and ran this and went out to work on my son’s stock car while the requests were spinning.
    #time
    let results =
        let path = @"F:\Git\ChickenSoftware.ParseMvpPages.Solution\ChickenSoftware.ParseMvpPages\photos"
        Directory.GetFiles(path)
        |> Array.map (fun f -> getOxfordResults f)
        |> Array.map (fun r -> parseOxfordResuls r)
When I came back, I had a nice array of optional tuples containing ages and genders.
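As a rough check on why there was time to go work on a car, the arithmetic behind the throttling can be sketched as follows. It assumes the figures quoted in the post (a limit of 20 requests per minute, a 5 second pause, 684 photos) and ignores network latency, so the run time is a lower bound.

```fsharp
// Assumed inputs, taken from the figures quoted in the post.
let requestsPerMinuteLimit = 20.0
let sleepSeconds = 5.0
let photoCount = 684

// Shortest pause that still respects the limit: 60 / 20 = 3 seconds.
let minimumDelaySeconds = 60.0 / requestsPerMinuteLimit

// Rate achieved with a 5 second pause: 60 / 5 = 12 requests per minute.
let actualRequestsPerMinute = 60.0 / sleepSeconds

// Time spent sleeping over the whole run: 684 * 5 s = 3420 s = 57 minutes.
let totalMinutes = float photoCount * sleepSeconds / 60.0

printfn "min delay %.1fs, rate %.0f/min, run time %.0f min" minimumDelaySeconds actualRequestsPerMinute totalMinutes
```

So the 5 second pause is conservative: it keeps the script at 12 requests per minute, comfortably under the limit, at the cost of nearly an hour of waiting.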
To analyze the data, I pulled in Math.NET. First, I took a look at age:
    open MathNet.Numerics.Statistics

    Seq.length results // 684

    // Keep only the photos where a face was found and convert the ages to floats.
    let ages =
        results
        |> Seq.filter (fun r -> r.IsSome)
        |> Seq.map (fun o -> fst o.Value)
        |> Seq.map (fun a -> float a)

    let stats = DescriptiveStatistics(ages)
    let count = stats.Count
    let largest = stats.Maximum
    let smallest = stats.Minimum
    let mean = stats.Mean
    let median = Statistics.Median(ages)
    let variance = stats.Variance
    let standardDeviation = stats.StandardDeviation
    let kurtosis = stats.Kurtosis
    let skewness = stats.Skewness
    let lowerQuartile = Statistics.LowerQuartile(ages)
    let upperQuartile = Statistics.UpperQuartile(ages)
Here are the results.
I got 620 valid photos of the 684 MVPs, a 91% hit rate, so I have enough observations for a meaningful analysis. It looks like Cognitive Services made at least one mistake with an age of 4.9 years; perhaps someone was using a meme for their photo? In any event, the mean is estimated at 41.95 and the median is 40.95, so the distribution is skewed slightly to the right, with the mean pulled above the median. (Note I mislabeled it on the screen shot above.)
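The hit rate and the mean versus median comparison can be reproduced by hand. The sketch below uses a short list of invented ages, not the MVP results, plus the two counts quoted above; it avoids Math.NET so it runs in a bare F# script.

```fsharp
// Invented ages for illustration only; these are not the MVP results.
let sampleAges = [ 28.0; 35.0; 38.0; 40.0; 41.0; 44.0; 52.0; 66.0 ]

let sampleMean = List.average sampleAges

// Median: the middle value of the sorted data, or the average of the two
// middle values when the number of observations is even.
let sampleMedian =
    let sorted = sampleAges |> List.sort |> List.toArray
    let n = sorted.Length
    if n % 2 = 1 then sorted.[n / 2]
    else (sorted.[n / 2 - 1] + sorted.[n / 2]) / 2.0

// Share of photos that produced a usable result, from the counts in the post.
let hitRate = 620.0 / 684.0

printfn "mean %.2f, median %.2f, hit rate %.1f%%" sampleMean sampleMedian (hitRate * 100.0)
```

Comparing the mean with the median is a quick check for asymmetry: the further apart they are, the more a few large or small values are pulling on the mean.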
I then wanted to see the distribution of the ages, so I brought in FSharp.Charting and ran a basic histogram:
    open FSharp.Charting

    let chart = Chart.Histogram(ages, Intervals = 10.0)
    Chart.Show(chart)
So the ages look very Gaussian.
I then decided to look at gender:
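    // Keep only the photos where a face was found and take the gender label.
    let gender =
        results
        |> Seq.filter (fun r -> r.IsSome)
        |> Seq.map (fun o -> snd o.Value)

    // Count each label and compute its share of the valid photos.
    gender
    |> Seq.countBy (fun v -> v)
    |> Seq.map (fun (g, c) -> g, c, float c / float count)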
With the results being:
So there are 12% females and 88% males. With an average age of 42 and 88% male, “middle age white guy” seems like an appropriate label and I stand by my original tweet. We certainly have work to do in 2017.
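The share calculation behind those percentages can be tried on a small list. The labels below are invented for illustration and are not the MVP data; the pipeline has the same shape as the countBy code above.

```fsharp
// Invented labels for illustration only; not the MVP data.
let sampleGenders = [ "male"; "female"; "male"; "male" ]

let total = List.length sampleGenders

// Count each label, then divide by the total to get its share.
let shares =
    sampleGenders
    |> List.countBy id
    |> List.map (fun (g, c) -> g, c, float c / float total)

// Each tuple holds the label, its count, and its share of the total.
printfn "%A" shares
```

Each tuple holds the label, its count, and its share of the total, which is the form the results above were reported in.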
You can find the gist here
Good luck changing women’s neurology so that the world will conform to your idealization of it..
What in particular would need to be changed in women’s neurology?
Them wanting to work in tech.
Hi there. Woman in tech here. It’s absolutely horrific that you assume that women’s neurology is to blame for this disparity here, and not ANY OTHER THING that might contribute.
Fantastic post on several levels. Thanks!
This is incendiary stuff entering into 2017 and the beginning of the age of Trump. There are a lot of folk who feel they have the green light to get good and angry about all this “diversity stuff”. Great post, wrong era.
The point being?
My question is: are these percentages representative of the IT industry as a whole? Example: if the IT workforce is 88% white old dudes, then the MVP community represents that. Shouldn’t the question be about getting more women into IT, so that they would then be better represented in the MVP group?
Good point. The MVP program is supposed to be a leader in the industry, so you would expect it to have a more diverse demographic than the rest of the industry. To their credit they are trying, I think? In terms of the general industry, I guess it is also a matter of just acting locally and trying to influence what you can around you. In my case that means being active in finding under-represented groups in IT and giving them the opportunity to be leaders here in Raleigh. Fortunately, there are plenty of talented people of all backgrounds here.
I’m all for diversity but I think John’s point is that the MVP program should have roughly the same diversity as the IT industry in general. If you measure MVP candidates on a weighted scale to promote diversity, you devalue the distinction of being an MVP. The solution isn’t to lower the bar for some MVP candidates. Instead, we need to make the IT industry more diverse from the ground up, and we shouldn’t be doing it for the sake of being diverse but because a diverse workforce is more effective, since we work in a diverse world with a diverse customer base. Once our industry is diverse, the MVP program will be diverse because quality people come equally in all genders, colors, ages, etc…
Chris, I’ll also offer that beyond leading to more commercial effectiveness, greater diversity in an industry as vital and powerful as IT also protects us from exploitative activity. We humans know from history that homogeneity in leadership often leads to exploitation of groups not well-represented as well as (deepened) systemic biases and destructive negative cycles (e.g., lack of role-models evolves into lack of hope and more distrust of the leaders).
We could fire men indiscriminately in order to make way for more women, for example. Sounds good?
Who is “we”?
The real question is why sexually repressed Eastern European men get to have an opinion at all…..
Stupid me hand counted and identified gender. Some men had long hair, some women short hair. Some had two in the picture. Some used avatars. I did my best “Cognitive Service” and after removing all anonymous pics/names I came up with 73.1% Men (203) and 26.9% Women (75).
I CANT MATH! Men 89.5% (643) and Women 10.5% (75). Azure Cognitive Service are pretty dead on.
they would be happy to hear that. 😎
Great post, thank you! Just a nitpicky comment – the age distribution is more Gamma (or log-normal) rather than Gaussian, mainly because it’s left-bounded and skewed.
All this crap is going down in the affirmative action injustice. Who and how is to decide the percentages?
Precisely. What percentage is sufficient? What is the proposition for changing that percentage?
Interesting… I wonder if the distribution accurately reflects the entire (3000+) population of MVPs. Your premise is based on the assumption that those who choose to make their profiles publicly visible are a truly representative sub-group. That would be another great question to research. What are the demographics of those who choose to keep their profiles private?
I think every good analysis leads to more questions.
Great post, thanks for sharing it. I like the way you used F# to get the data. It does not prove much; it just serves as fuel for more questioning.
As I am a lazy woman, I did my homework from a CSV received from my MVP leader (anonymous). Of course, after this incommensurable effort I had to take a month to rest and repair my brain and neurology… https://app.powerbi.com/view?r=eyJrIjoiYWUxODZkNGYtNzZiYy00NTM5LWFlMjgtMTkyYzBhYzhiODM0IiwidCI6Ijg5MmIzNDQ2LWQ1ZjAtNDg5ZS1hNjhkLTYwNWMxNjEzYWVhZCIsImMiOjh9