Age and Sex Analysis Of Microsoft USA MVPs

A couple of weeks ago, this came across my Twitter feed:

image

I participated in this hackathon (well, helped run the F# one).  My response was:

image

I was surprised that I got into this exchange with a Microsoft PM:

image

That last comment by me was inspired by a line often attributed to Mark Twain: “never wrestle with a pig.  You just get dirty and the pig likes it.”  But it did get me thinking about the composition of the US MVPs.  I did an analysis a couple of years ago of the photos of the Microsoft MVPs (found here and here), so it made sense to follow up on that code and see if I was wrong about my “middle age white guy” hypothesis.  I could get the photos from the MVP site and pass them into the Microsoft Cognitive Services API for facial analysis to get age/sex data.  Using F# made the analysis a snap.

A nice thing about the Microsoft MVP website is that it is public and has photos of the MVPs.  Here is one of the pages:

image

and when you look at the source of the page, each of those photos has a distinct uri:

image

I opened up Visual Studio and created a new F# project.  I went into the script file and brought in the libraries to do some HTTP requests.  I then created a couple of functions to pull down the HTML of each of the 19 pages and put it into one big string:

    open System
    open System.IO
    open System.Net

    // Download the HTML of one page of the US MVP search results.
    let getPageContents (pageNumber:int) =
        let uri = new Uri("http://mvp.microsoft.com/en-us/search-mvp.aspx?lo=United+States&sl=0&browse=False&sc=s&ps=36&pn=" + pageNumber.ToString())
        let request = WebRequest.Create(uri)
        request.Method <- "GET"
        use response = request.GetResponse()
        use stream = response.GetResponseStream()
        use reader = new StreamReader(stream)
        reader.ReadToEnd()

    // Fetch all 19 pages and concatenate them into a single string.
    let contents =
        [|1..19|]
        |> Array.map(fun i -> getPageContents i)
        |> Seq.reduce(fun x y -> x + y)

(OT: Since I did a map followed by a reduce in the last two lines, does that mean I am working with “Big Data”?)

I then created a quick parser to find only the uris of the photos in all of the HTML.

    open System.Text.RegularExpressions

    let getUrisFromPageContents (pageContents:string) =
        // Each photo is served from /PublicProfile/Photo/<numeric id>.
        let pattern = @"/PublicProfile/Photo/\d+"
        let matchCollection = Regex.Matches(pageContents, pattern)
        matchCollection
        |> Seq.cast<Match>
        |> Seq.map(fun (m:Match) -> m.Value)
        |> Seq.map(fun v -> "https://mvp.microsoft.com/en-us" + v + "?language=en-us")
        |> Seq.toArray

    let uris = getUrisFromPageContents contents
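As a quick sanity check of the pattern, here is a minimal sketch that runs the same regular expression over a small, made-up HTML fragment. The fragment and the numeric ids in it are hypothetical stand-ins, not content from the real MVP pages; the point is only that links without the `/Photo/` segment are ignored.

```fsharp
open System.Text.RegularExpressions

// Hypothetical fragment standing in for the real page HTML (ids are invented).
let sampleHtml =
    """<img src="/PublicProfile/Photo/4012345" /><img src="/PublicProfile/Photo/5009876" /><a href="/PublicProfile/4012345">profile</a>"""

// Same pattern as in the post: a fixed path followed by one or more digits.
let photoPaths =
    Regex.Matches(sampleHtml, @"/PublicProfile/Photo/\d+")
    |> Seq.cast<Match>
    |> Seq.map (fun m -> m.Value)
    |> Seq.toArray

// Only the two <img> sources match; the profile link has no /Photo/ segment.
photoPaths |> Array.iter (printfn "%s")
```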

Sure enough, I got 684 uris for MVP photos.  I then wrote another Web Request to pull down each of the photos and save them to disk:

    let saveImage (uri:string) =
        use client = new WebClient()
        // Name each file with a fresh GUID rather than the MVP's name.
        let id = Guid.NewGuid()
        let path = @"F:\Git\ChickenSoftware.ParseMvpPages.Solution\ChickenSoftware.ParseMvpPages\photos\" + id.ToString() + ".jpg"
        client.DownloadFile(Uri(uri), path)

    uris
    |> Seq.iter saveImage

And I now have all 684 photos on disk.

image

I did not bring down the names of the MVPs, instead using a GUID as each file name to randomize the photos, but a name analysis would also be interesting.  With the photos now local, I could upload them to the Microsoft Cognitive Services API for facial analysis.  You can read about the details of the API here.  I created a third web request to pass each photo up and get the results from the API:

    open System
    open System.IO
    open System.Net
    open System.Net.Http
    open System.Net.Http.Headers
    open System.Threading
    open System.Web

    let getOxfordResults (path:string) =
        // Ask only for the age and gender attributes.
        let queryString = HttpUtility.ParseQueryString(String.Empty)
        queryString.Add("returnFaceId","true")
        queryString.Add("returnFaceLandmarks","false")
        queryString.Add("returnFaceAttributes","age,gender")
        let uri = "https://api.projectoxford.ai/face/v1.0/detect?" + queryString.ToString()
        let bytes = File.ReadAllBytes(path)
        use client = new HttpClient()
        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key","xxxxxxxxxxx")
        use content = new ByteArrayContent(bytes)
        content.Headers.ContentType <- MediaTypeHeaderValue("application/octet-stream")
        let result = client.PostAsync(uri,content).Result
        // Pause between calls to stay under the API's rate limit.
        Thread.Sleep(TimeSpan.FromSeconds(5.0))
        match result.StatusCode with
        | HttpStatusCode.OK -> Some (result.Content.ReadAsStringAsync().Result)
        | _ -> None

Notice that I put a 5-second sleep into the call.  This is because Microsoft throttles requests to 20 per minute.  Also, since some of the photos do not have a face, I used the F# option type.  The results come back from the Microsoft Cognitive Services API as JSON.  To parse the results, I used the FSharp.Data JSON type provider:

    open FSharp.Data

    type FaceInfo = JsonProvider<Sample="[{\"faceId\":\"83045097-daa1-4f1c-8669-ed012e9b5975\",\"faceRectangle\":{\"top\":187,\"left\":209,\"width\":214,\"height\":214},\"faceAttributes\":{\"gender\":\"male\",\"age\":42.8}}]">

    let parseOxfordResuls results =
        match results with
        | Some r ->
            let face = FaceInfo.Parse(r)
            // An empty array means the API found no face in the photo.
            match Seq.length face with
            | 0 -> None
            | _ ->
                // Use only the first face when a photo contains several.
                let header = face |> Seq.head
                Some(header.FaceAttributes.Age, header.FaceAttributes.Gender)
        | None -> None
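As a back-of-the-envelope check on the five-second pause used in getOxfordResults, the sketch below works out the shortest pause that a limit of 20 requests per minute implies, and a rough lower bound on the total run time for 684 photos. It uses only figures quoted in the post and ignores the time each request itself takes, so the real run would be somewhat longer.

```fsharp
// Figures taken from the post.
let requestsPerMinute = 20.0
let photoCount = 684
let sleepSeconds = 5.0

// Smallest pause that keeps a steady stream of calls at the limit.
let minimumDelaySeconds = 60.0 / requestsPerMinute                 // 3.0

// Time spent sleeping alone, ignoring network latency.
let totalSleepMinutes = float photoCount * sleepSeconds / 60.0     // 57.0

printfn "Minimum delay: %.1f s, total sleep: %.1f min" minimumDelaySeconds totalSleepMinutes
```

So five seconds leaves a comfortable margin over the three-second minimum, at the cost of close to an hour of waiting.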

So now I can get estimated age and gender from the Microsoft Cognitive Services API.  I was disappointed that the API does not estimate race.  I assume they have the technology but, from a social-acceptance point of view, they don’t make it publicly available.  In any event, a look through the photos shows that a majority are white people.  I went ahead and ran this and went out to work on my son’s stock car while the requests were spinning.

    #time
    let results =
        let path = @"F:\Git\ChickenSoftware.ParseMvpPages.Solution\ChickenSoftware.ParseMvpPages\photos"
        // Send every saved photo to the API, then parse each response.
        Directory.GetFiles(path)
        |> Array.map(fun f -> getOxfordResults f)
        |> Array.map(fun r -> parseOxfordResuls r)

When I came back, I had a nice array of optional tuples containing ages and genders.

image

To analyze the data, I pulled in Math.NET.  First, I took a look at age:

    open MathNet.Numerics.Statistics

    Seq.length results //684

    // Keep only the photos where a face was found and pull out the age.
    let ages =
        results
        |> Seq.filter(fun r -> r.IsSome)
        |> Seq.map(fun o -> fst o.Value)
        |> Seq.map(fun a -> float a)

    let stats = new DescriptiveStatistics(ages)
    let count = stats.Count
    let largest = stats.Maximum
    let smallest = stats.Minimum
    let mean = stats.Mean
    let median = Statistics.Median(ages)
    let variance = stats.Variance
    let standardDeviation = stats.StandardDeviation
    let kurtosis = stats.Kurtosis
    let skewness = stats.Skewness
    let lowerQuartile = Statistics.LowerQuartile(ages)
    let upperQuartile = Statistics.UpperQuartile(ages)

Here are the results. 

image

I got 620 valid photos of the 684 MVPs, so a 91% hit rate, and I have enough observations for the summary statistics to be meaningful.  It looks like Cognitive Services made at least one mistake with an age of 4.9 years; perhaps someone was using a meme for their photo?  In any event, the mean is estimated at 41.95 and the median is 40.95.  With the mean sitting slightly above the median, that suggests a slight skew to the right.  (Note that I mislabeled it on the screenshot above.)
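The hit rate can be reproduced with a few lines of F#. This is a minimal sketch on a small, made-up results array (the four sample values are hypothetical, not real API output); it also shows Array.choose as one alternative to filtering on IsSome and then reading Value.

```fsharp
// Hypothetical parsed results: Some (age, gender) when a face was found, None otherwise.
let sampleResults : (float * string) option [] =
    [| Some (42.8, "male"); None; Some (35.1, "female"); Some (51.0, "male") |]

// Array.choose keeps the Some values and unwraps them in one pass.
let valid = sampleResults |> Array.choose id
let sampleHitRate = float valid.Length / float sampleResults.Length   // 0.75

// The counts reported in the post: 620 usable results out of 684 photos.
let postHitRate = 620.0 / 684.0
printfn "%.1f%%" (postHitRate * 100.0)   // prints 90.6%
```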

I then wanted to see the distribution of the ages, so I brought in FSharp.Charting and ran a basic histogram:

    open FSharp.Charting

    let chart = Chart.Histogram(ages,Intervals=10.0)
    Chart.Show(chart)

image

So the ages look very Gaussian.

I then decided to look at gender:

    let gender =
        results
        |> Seq.filter(fun r -> r.IsSome)
        |> Seq.map(fun o -> snd o.Value)

    // Count each gender and express it as a share of the valid photos.
    gender
    |> Seq.countBy(fun v -> v)
    |> Seq.map(fun (g,c) -> g, c, float c/float count)

With the results being:

image

So there are 12% females and 88% males.  With an average age of 42 years and 88% male, “middle age white guy” seems like an appropriate label and I stand by my original tweet – we certainly have work to do in 2017.

You can find the gist here.

29 Responses to Age and Sex Analysis Of Microsoft USA MVPs

  1. Pingback: F# Advent Calendar in English 2016 – Sergey Tihon's Blog

  2. Pingback: Dew Drop - December 26, 2016 (#2391) - Morning Dew

  3. Frank de Groot says:

    Good luck changing women’s neurology so that the world will conform to your idealization of it..

  4. Bob Salita says:

    Fantastic post on several levels. Thanks!

  5. List Walker says:

    This is incendiary stuff entering into 2017 and the beginning of the age of Trump. There are a lot of folk who feel they have the green light to get good and angry about all this “diversity stuff”. Great post, wrong era.

  6. My question is, Are these percentages representative of the IT industry as a whole? Example: if the IT workforce is 88% white old dudes, then the MVP community represents that. Shouldn’t the question be getting more women into IT and then maybe they would be more represented in the MVP group?

    • jamie dixon says:

      Good point. The MVP program is supposed to be a leader in the industry so you would expect it to have a more diverse demographic than the rest of the industry – to their credit they are trying, I think? In terms of the general industry, I guess it is also of just acting locally and trying to influence what you can around you – in my case being active in finding under-represented groups in IT and giving them the opportunity to be leaders here in Raleigh. Fortunately, there are plenty of talented people of all backgrounds here.

      • Chris says:

        I’m all for diversity but I think John’s point is that the MVP program should have roughly the same diversity as the IT industry in general. If you measure MVP candidates on a weighted scale to promote diversity, you devalue the distinction of being an MVP. The solution isn’t to lower the bar for some MVP candidates. Instead, we need to make the IT industry more diverse from the ground up and we shouldn’t be doing it for the sake of being diverse but because a diverse workforce is more effective since we work in a diverse world with a diverse customer base. Once our industry is diverse, the MVP program will be diverse because quality people come equally in all genders, colors, ages, etc…

      • Rick_Pack2 says:

        Chris, I’ll also offer that beyond leading to more commercial effectiveness, greater diversity in an industry as vital and powerful as IT also protects us from exploitative activity. We humans know from history that homogeneity in leadership often leads to exploitation of groups not well-represented as well as (deepened) systemic biases and destructive negative cycles (e.g., lack of role-models evolves into lack of hope and more distrust of the leaders).

    • We could fire men indiscriminate in order to make way for more women, for example. Sounds good?

  7. Jared says:

    Stupid me hand counted and identified gender. Some men had long hair, some women short hair. Some had two in the picture. Some used avatars. I did my best “Cognitive Service” and after removing all anonymous pics/names I came up with 73.1% Men (203) and 26.9% Women (75).

  8. Great post, thank you! Just a nitpicky comment – the age distribution is more Gamma (or log-normal) rather than Gaussian, mainly because it’s left-bounded and skewed.

  9. All this crap is going down in the affirmative action injustice. Who and how is to decide the percentages?

  10. Interesting… I wonder if the distribution accurately reflects the entire (3000+) population of MVPs. Your premise is based on the assumption that those that choose to make their profiles publicly visible are a true representative sub-grouping. That would be another great question to research. What are the demographics of those that choose to keep their profiles private?

  11. Pingback: F# Weekly #1, 2017 – New Year Edition – Sergey Tihon's Blog

  12. Yoann says:

    Great post, thanks for sharing it. I like the way you used F# in order to get data. It does not prove much; it just serves as fuel for more questioning.

  13. Pingback: Parse web pages with Power BI | Ambiguity vs Information

  14. Pingback: Analyzing the Office Servers and Services MVP community 2017 | Thoughtsofanidlemind's Blog

  15. As I am a lazy women I did my homework from a csv received by my mvp leader (anonymous) Of course after this incommensurable effort I had to take a month to rest and repair my brain and neurology…https://app.powerbi.com/view?r=eyJrIjoiYWUxODZkNGYtNzZiYy00NTM5LWFlMjgtMTkyYzBhYzhiODM0IiwidCI6Ijg5MmIzNDQ2LWQ1ZjAtNDg5ZS1hNjhkLTYwNWMxNjEzYWVhZCIsImMiOjh9
