Friday, August 7, 2015

Fortune Favours the Old

I had my birthday recently and, much like last year, got a joke present from a friend. This year though, it came with an explicit challenge to do something statistical with it. So for this blog post, my subject matter will be based on this box of fortune cookies:

My willing victim

So what stats can be pulled out of a box of fortune cookies? First of all, I suppose the box says there are approximately 25 cookies, but in reality it came with 38 fortunes. Ridiculous quality control, let me tell you!

Of course, most fortunes just floated around without cookies
Fortunately, each fortune has a set of numbers on the back. Numbers are good, so let's do stats with those and leave the yummy cookie bits for later.

Fortunes not necessarily to scale.
Each fortune has a series of 6 ascending, non-repeating integers on the back. Presumably these are lucky numbers for your next lottery, but given just this set of numbers we can't necessarily tell which lottery they might be meant for. But can we make an educated guess?

Quick history lesson: in World War II, the Allies were at least somewhat concerned with estimating how many tanks Germany was building in any given month. One way they had was conventional espionage, which suggested that the Germans were building approximately 1,400 tanks every month between June 1940 and September 1942 (a lot of tanks). Of course, spies sometimes lie (it's their job, after all), so the second way the Allies had to estimate tank production was using statistics on captured tanks.

Every tank had a whole bunch of parts, and every part had a serial number stamped into it during production. These serial numbers were unique for every tank, and in the case of the gearboxes in particular, fell in unbroken sequences. Based on the distribution of serial numbers, a relatively simple formula could give an estimate of the total number of tanks produced. For instance, if the Allies saw that the tanks they destroyed in a given month were tanks produced #25, #94, #141, and #198 of that month (and were confident they were destroying them randomly), they'd be much less worried than if they destroyed tanks #52, #306, #519, and #1058.

It actually turned out way more accurate than anyone hoped - statistical estimates for tank production between June 1940 and September 1942 were 246 tanks per month, and in reality the Germans produced 245. Yay stats!

So like the famous German tank problem, looking at a fortune cookie's string of numbers can give us an estimate of the total number of 'lucky numbers' that the fortune cookies might offer. In the above example, there are 6 numbers decently evenly spaced between 2 and 47. A frequentist statistical approach, therefore, suggests that the total number of possible numbers that could be on the backs of these fortune cookies is 53.83, with a 95% confidence interval of 47-77. Not terribly precise when looking at a single fortune. Another fortune might have a series of numbers with a likely maximum of 48, for instance, and if we look at the average of all 38 fortunes in the box, the average 'expected number' ends up being 49.4. And in fact, of all 38 fortunes, all numbers on the backs were between 1 and 49.
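For anyone who wants to check the arithmetic on that single fortune, here's a minimal sketch. It assumes the standard frequentist German tank estimator (sample maximum m, sample size k, estimate m + m/k - 1) and a one-sided upper bound from the continuous approximation; neither formula is spelled out above, but together they reproduce the 53.83 and the 47-77 range. The function names are just for illustration.

```python
def tank_estimate(sample_max: int, sample_size: int) -> float:
    """Frequentist estimate of the largest possible number: m + m/k - 1."""
    return sample_max + sample_max / sample_size - 1


def upper_bound(sample_max: int, sample_size: int, level: float = 0.95) -> float:
    """One-sided upper confidence bound on the population maximum.

    Uses the continuous approximation P(sample max <= m) ~ (m / N) ** k,
    solved for N at the requested confidence level.
    """
    return sample_max / (1 - level) ** (1 / sample_size)


# One fortune: 6 numbers, the largest of which is 47.
estimate = tank_estimate(47, 6)
upper = upper_bound(47, 6)

print(f"estimated pool size: {estimate:.2f}")
print(f"95% interval: 47 to {upper:.0f}")
```

The lower end of the interval is just the largest number actually seen, since the pool can't be any smaller than that.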

So we have six numbers, chosen between 1-49. Sounds like we're playing Lotto 6/49!

Here's what the distribution of all lucky numbers ended up being:

It kinda looks like number 37 comes up way more often than the rest, and numbers 9 and 13 are super under-represented. Is this a conspiracy, or random chance?

With 49 numbers to choose from, 6 different numbers on each fortune, and 38 fortunes to choose from, we'd expect an average of 4.65 of each number to show up. With an expected 4.65 of each number, we can create a Poisson distribution to see how often we'd expect any given number to turn up, and see if ours is indeed random. That'd give us something like this:

This suggests that the distribution of lucky numbers isn't actually all that lucky, and may be pretty much what you'd expect (R² value of 0.82, which ain't shabby). It matches particularly closely at the tails, so having a few numbers occur 10 times each isn't all that surprising really.
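If you'd like to rebuild the expected curve yourself, here's a rough sketch using only the counts quoted above (38 fortunes, 6 numbers each, 49 possible numbers). It treats the count for each number as Poisson with the same mean, which is the approximation used in this post; the variable and function names are mine.

```python
import math

fortunes = 38
numbers_per_fortune = 6
pool_size = 49

# Average number of times each lucky number should turn up: about 4.65.
expected = fortunes * numbers_per_fortune / pool_size


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of seeing exactly k occurrences when the mean is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# How many of the 49 numbers we'd expect to see exactly k times.
for k in range(13):
    print(k, round(pool_size * poisson_pmf(k, expected), 2))
```

Comparing that column against the observed histogram is one way to arrive at a goodness-of-fit number like the one quoted above.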

One last analysis for the fortune cookies. Fortunes tended to come in one of three categories: advice ("Counting time is not as important as making time count"), analysis ("You are deeply attached to your family and home"), and, most popularly, predictions ("You will soon find something lost long ago"). Is there any relation between the type of fortune on the front, and the sum of the numbers on the back?

Nope, nothing statistically significant anyway. The Analysis fortunes seem to generally have higher numbers on the back, but there are too few of them and they are too varied to be conclusive.

So there you go! Fortune cookies tend to have Lotto 6/49 numbers on the back that are fairly well randomly distributed. Not sure if that left any of you particularly surprised, but it's fun to know nonetheless!

Wednesday, July 15, 2015

Canadian Inequality 2: Revenge of the CCPA

Like last year, the Canadian Centre for Policy Alternatives has come out with a ranking of "The Best and Worst Places to be a Woman in Canada." Since I got so frustrated with the last one, I figured I'd do a quick follow-up on this one:

The Good

  • Cities are given an inequality score on each category, then the scores are averaged for each city to come up with a final score, which is then ranked. This is a vast improvement on last year, where cities were ranked within each category and then their ranks were averaged, even though the spread between cities varies quite drastically from category to category. This was one of my biggest issues with last year's report.
  • Within each category, sub-indicators are weighted according to their variance. This was used in other similar studies before, and is helpful at revealing actual differences between cities.
  • No weird ranking arithmetic errors in the appendices!

The Bad

  • Some news sources seem to be suggesting that Victoria has 'risen' in the rankings this year. Nope - this is essentially a completely different yet topically similar study, with a different sample size, new measures, and different methodologies. Consider this a re-do of the previous study, especially since almost all of the data ranges from 2007 to 2013.
  • Inequality ratios aren't capped. This is definitely a judgement call, and isn't wrong necessarily, but when the World Economic Forum did a similar study they capped ratios at 1.0 (perfect equality) so that doing better than equal in one area (for example, education) can't be used to offset poor scores in other areas. This actually would result in a change in the rankings of 20/25 cities.
  • Inequality vs "Worst Place to be a Woman." The report explicitly doesn't compare quality of life standards between cities (apart from one - see next point). Again, the authors say, "The report focuses primarily on the gap between men and women, rather than their overall levels of well-being," yet their title doesn't mention inequality at all, and instead sounds like a judgement explicitly of well-being. I personally think this is clouding some of the discussion surrounding the report, since it seems to prime people to be angry before even reading it. It's ridiculous to say Edmonton and Calgary are the worst places for women in Canada, but also say that they have the lowest levels of women in poverty and highest levels of highly-educated women.
  • One measure that's new this year is the percentage of women per city who've had a pap smear test in the last three years. Interesting measure to include, definitely has potential to be a decent indicator of health in women, but has almost nothing to do with comparative inequality between men and women. The health category has ratios of perceived stress and happiness between the sexes, life expectancy ratios, and a percentage of pap test-takers, and lumps them all together into a number that's supposed to represent the equality ratio. It doesn't belong in this sort of an analysis.
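To make the capping point from the list above concrete, here's a small sketch with made-up numbers. The three ratios are hypothetical, and a plain average stands in for the report's variance-weighted one, so treat it as an illustration of the idea rather than the CCPA's or the World Economic Forum's actual calculation.

```python
def category_score(ratios, cap=None):
    """Average a list of female-to-male ratios, optionally capping each one."""
    if cap is not None:
        ratios = [min(r, cap) for r in ratios]
    return sum(ratios) / len(ratios)


# Hypothetical city: women ahead on one indicator, behind on two others.
ratios = [1.20, 0.70, 0.80]

uncapped = category_score(ratios)         # the 1.20 offsets the weaker scores
capped = category_score(ratios, cap=1.0)  # 1.20 is treated as 1.0

print(f"uncapped: {uncapped:.3f}, capped: {capped:.3f}")
```

With the cap in place, doing better than equal in one area can no longer paper over gaps in the others, which is why the rankings can shift.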

The Ugly

  • This is a list of cities with clip art in the background. It is not an infographic. Stop saying you made a fancy infographic.

I guess I ended up with more negative bullet points than positive, but in reality they're mostly me being nit-picky. This study is a much better version of what was published last year, and (happily enough) tends to agree quite closely with what I did a few months ago. If the alarmist language in the title is what it takes to get gender inequality discussed openly, then that's fair enough, but I would personally be happier if this was approached a little bit more academically with a little less sensationalism. Either way, Edmonton has plenty to work on, and in the future I hope to see our score go up. Just, like, wait a couple of years, otherwise you'll be using the same data for next year's "updated" report again.

Thursday, June 25, 2015

City Council Analysis

Back after the 2013 Edmonton municipal election, I did a quick analysis to see if I could predict who some of the new mayor Don Iveson's friends on council would be. My thought was that councillors with similar platforms to the mayor's, and who are potentially more likely to agree with him on votes, were also likely to get voter support in the same neighborhoods due to their similarity. It seemed plausible enough so I did a correlation analysis.

It's now been a decent enough time since the last election that I've decided to check if my guess was accurate or not. Let's take a look!

Edmonton's Open Data has a log of the voting record for the 2013-2017 council, and it is fairly long. All told, there were 2,757 different motions that had been voted on so far. One issue with taking a look at all of these combined is that plenty of the votes are for procedural matters in council that get passed quickly and unanimously, and they're kinda boring from an analysis point of view. Of the 2,757 motions that have been voted on, 2,611 were unanimous one way or another. So let's ignore those, and focus on the remaining 146 contentious votes.

If we compare how each councillor and the mayor voted for every contentious vote they were present for, we can see how often any given pair end up voting the same way. The end results look like this:

One of the first things to notice is that mayor Iveson does seem to have a fair bit of support on council - 7 councillors tend to vote the same way he does more than 80% of the time, and that's enough for a majority on most votes. These same 7 councillors (Knack, Esslinger, Henderson, Walters, Sohi, McKeen, and Loken) all tend to agree very consistently with each other too (with the possible exception of councillors McKeen and Loken). I wouldn't go so far as to say that they act as a voting bloc, but there certainly is evidence that they get along very well professionally, to say the least.
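For the curious, the pairwise comparison can be sketched roughly like this. The record layout (one dictionary per motion, with absent members simply left out) and the placeholder names and votes are assumptions for illustration; they are not the actual Open Data format or anyone's real voting record.

```python
from itertools import combinations

# Hypothetical voting log: one dict per motion, absent members omitted.
motions = [
    {"Mayor": "Yes", "Councillor A": "Yes", "Councillor B": "No"},
    {"Mayor": "Yes", "Councillor A": "Yes", "Councillor B": "Yes"},  # unanimous
    {"Mayor": "No", "Councillor A": "No", "Councillor B": "Yes"},
    {"Mayor": "Yes", "Councillor A": "Yes"},  # unanimous among those present
    {"Mayor": "Yes", "Councillor A": "No", "Councillor B": "No"},
    {"Mayor": "Yes", "Councillor A": "Yes", "Councillor B": "No"},
]


def contentious(motions):
    """Keep only the motions where at least two different votes were cast."""
    return [votes for votes in motions if len(set(votes.values())) > 1]


def agreement_rates(motions):
    """Share of motions on which each pair voted the same way,
    counting only motions where both members were present."""
    together, agreed = {}, {}
    for votes in motions:
        for pair in combinations(sorted(votes), 2):
            together[pair] = together.get(pair, 0) + 1
            if votes[pair[0]] == votes[pair[1]]:
                agreed[pair] = agreed.get(pair, 0) + 1
    return {pair: agreed.get(pair, 0) / n for pair, n in together.items()}


contentious_votes = contentious(motions)
rates = agreement_rates(contentious_votes)
for pair, rate in sorted(rates.items()):
    print(pair, f"{rate:.0%}")
```

Dropping the unanimous motions first matters: leave them in and every pair looks like they agree almost all the time.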

Other interesting observations include that councillors Loken, Caterina, and Gibbons all vote together quite a bit, with councillors Caterina and Gibbons agreeing more with each other than with the mayor. Councillors Anderson and Nickel quite clearly do not see eye-to-eye with most of the rest of their council colleagues.

One way we can check out my previous analysis is to compare the frequency councillors agree with the mayor with the correlation values I had previously obtained. If we do that, we can generate a graph like this:

Back when I did the original analysis, plenty of people (including myself) were surprised at the fact that councillor Walters ranked so low on the list. It turns out they were surprised with good reason, as he is one of the most notable outliers on the graph. It looks as though the analysis was alright, but nothing to be proud of. It is perhaps better than a random guess, but not necessarily something that provides critical or accurate insight immediately following an election.

One last graph for you. Each member of council had a fellow councillor who they tended to agree with the most. If we pretend that this coincides with who influences who, we can draw a graph like this:

This shows that for seven councillors, the person they agree with the most is mayor Iveson. The remaining councillors tend to split off in a group where they agree with the Caterina/Gibbons group that I mentioned above, though the frequency with which they agree with either of those two councillors is significantly lower than how often the rest tend to agree with the mayor.

The results from this analysis could exist due to a large number of different reasons. It's possible, for example, that this is an example of mayor Iveson's abilities to gain support from his councillors, and it's equally possible that it shows his ability to listen and accommodate the views of his councillors. Either way, it is his job to be the leader of city council, and so far the data seems to suggest he's doing just that.

Monday, June 8, 2015

NHL Odds in a Best-of-7 Series

Last year, Andrew and I worked together to look at which NHL playoff game was the most critical to victory in an NHL series. He built such a lovely database of playoff series that I just couldn't pass up the opportunity to take another look at the problem.

Before looking at real-world results, though, let's take a look at what the most important game ought to be in a perfectly even scenario. It's relatively simple to take a look at how a best-of-seven playoff series will turn out, and a Markov chain for a series will look like this:

Here you can see how each team's odds of winning the series go up or down based on how each previous game has gone. For example, if a team is leading 2 games to 1, their odds of winning the series are 69%. One important thing to note is that in this case it doesn't matter how they got there - there are three ways for a given team to get to a series score of 2-1 (count the lines if you'd like!), and they all lead to the same probability of winning the series. Also, note the symmetry in the diagram, since this model assumes both teams are perfectly even.

At this point, asking which game is most important to win becomes a rather nuanced question. We may as well ignore any sudden death games as they're obviously critical, but which of the remaining games are the most important?

Turns out, perhaps unsurprisingly, that the answer is Game 5, but only if the series is tied. This game takes a team from a 50% chance of winning to 75%, cutting their opponent's chances in half. This game has the single biggest change in odds one way or another.
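The even-odds model is small enough to compute directly. Here's a minimal sketch that walks the same chain of series scores recursively, assuming each game is an independent coin flip with probability p for the team we're following; the function names are mine.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def series_win_probability(wins: int, losses: int, p: float = 0.5) -> float:
    """Chance of winning a best-of-seven from a given series score."""
    if wins == 4:
        return 1.0
    if losses == 4:
        return 0.0
    return (p * series_win_probability(wins + 1, losses, p)
            + (1 - p) * series_win_probability(wins, losses + 1, p))


def swing(wins: int, losses: int, p: float = 0.5) -> float:
    """Gap in series odds between winning and losing the next game."""
    return (series_win_probability(wins + 1, losses, p)
            - series_win_probability(wins, losses + 1, p))


print(series_win_probability(2, 1))  # 0.6875, the 69% quoted above
print(series_win_probability(3, 2))  # 0.75, after winning Game 5 from 2-2
print(swing(2, 2))                   # 0.5, i.e. 75% versus 25%
```

Because the result depends only on the current score, it doesn't matter which order the earlier wins and losses came in.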

But let's face it, teams in the playoffs aren't likely to be even, and there's a well-documented home-town advantage in hockey sitting at around 54.5% over the last few seasons. If we assume only a home-town advantage (but otherwise teams are even), how does that affect the playoff model?

Surprisingly, it theoretically doesn't really change the teams' chances at the outset. In fact, the effect is rather diluted by frequently changing who plays where. This is probably good news, as it suggests that seeding order in the playoffs (which depends on teams' previous performance and is somewhat under their control) matters more in playoff series than winning home advantage.

Some differences show up between this model and the previous one, though. If Team A wins the first two games in a row at home, they have a slightly lower chance of winning overall (because they had an advantage then anyway). If Team B wins both of the first two games or splits them, they have a slightly higher chance of winning overall, because it's relatively smoother sailing for them from then on. If Team A has tied the series up after game 4, they regain a slight advantage, because they have two home games against one. All in all these differences are rather minor.
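Here's a sketch of the home-advantage version. It assumes the 2-2-1-1-1 format (Team A hosts games 1, 2, 5, and 7), which matches the description above of Team A opening at home and having two of the last three games at home, and it uses the 54.5% home win rate as the only difference between the teams.

```python
from functools import lru_cache

HOME_WIN = 0.545             # home-team win rate quoted above
A_HOME_GAMES = {1, 2, 5, 7}  # assumed 2-2-1-1-1 format


@lru_cache(maxsize=None)
def team_a_win_probability(a_wins: int = 0, b_wins: int = 0) -> float:
    """Team A's chance of winning the series from a given score."""
    if a_wins == 4:
        return 1.0
    if b_wins == 4:
        return 0.0
    game = a_wins + b_wins + 1
    p = HOME_WIN if game in A_HOME_GAMES else 1 - HOME_WIN
    return (p * team_a_win_probability(a_wins + 1, b_wins)
            + (1 - p) * team_a_win_probability(a_wins, b_wins + 1))


for score in [(0, 0), (2, 0), (1, 1), (0, 2), (2, 2)]:
    print(score, round(team_a_win_probability(*score), 3))
```

Under these assumptions Team A starts only a little above 50%, and the scores mentioned above (2-0, 1-1, 0-2, 2-2) each shift by a point or two relative to the even model.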

But what's far more interesting than theoretical models are actual results. Let's take a look at all playoff series since 1942 (including 14 series of the 2015 playoffs so far):

Here, Team A is both seeded higher than Team B, and has the home advantage. This results in a remarkably different set of probabilities than the first two models shown.

If the question comes back to which game is the most influential, the answer once again is quite different than the previous models. The most critical non-sudden-death game for Team A is actually Game 4 - if Team A is winning then they increase their odds by 18%, but if they're losing at that point they increase their odds by 21% and regain the statistical lead. For Team B, the most influential game is Game 5, for the same reasons as in the 50/50 model previously discussed.

It's important to note that this model isn't necessarily applicable to the current Blackhawks v. Lightning Stanley Cup Final. Over half of all playoff series are from the first round of the playoffs, which until recently consisted of teams that were often extremely mismatched (as the top teams would play the 8th-ranked teams, etc.). It's not unreasonable to expect that the two teams who have made it to the Stanley Cup Final are more evenly matched than the average pairing in the first round, so I wouldn't necessarily recommend following along with the chart during this series.

Don't forget to follow along with my NHL Playoff 2015 model and cheer on the Blackhawks (who I had picked to win right at the outset of these playoffs!).

Monday, May 11, 2015

Reuniting the Alberta Right

After last week's Alberta election, several of Alberta's political pundits expressed frustration that the splitting of the vote on the right may have allowed for the NDP success that we saw on election night. Danielle Smith, for instance, said:

She has a bit of a point - despite all the hype of the NDP surge during the campaign, they did still manage to get a strong majority government with less than half of the popular vote, and the combined popular vote of the two 'right-of-centre' parties could easily have beaten them.

Overall, the Wildrose Party ended up with far more seats than the PCs, even though they got 53,000 fewer votes (all this sounds like a set-up for a discussion on proportional voting systems, but I'll save that for later). Though the PC dynasty is ended for now, they certainly aren't lacking in a core voter base, and I wouldn't say they're definitely out of the game just yet.

But to those who are lamenting the splitting of the right side of the political spectrum, what's the most efficient way to reunite these two parties? If the right is to take control again, would it be easier to have the PC supporters move over to the Wildrose, or vice versa?

Let's check. I looked at the results for each riding from last week's election, and checked what the results would have been for each seat if a certain percentage of PC support moved to the Wildrose, or vice versa. First of all, let's see what happens if we increase the number of PC voters who move over to the Wildrose:

What this is telling us is that if 23.1% of PC supporters in each riding had instead voted Wildrose, there would have been enough to completely eliminate the PC presence in the legislature. If 35.8% of PC supporters had moved to the Wildrose, it would have been enough to take seats from the NDP and result in a majority of seats. A full reunification of the right would have resulted in 59 total seats, with 26 remaining for the NDP. In both cases, the Liberal and Alberta Party MLAs won their seats with more votes than the combined PC/Wildrose vote, so those seats are considered immune to this reunification effort.

On the other hand, it would have taken 30.3% of Wildrose supporters flocking back to the PCs in order to result in no Wildrose MLAs elected, and a 31.4% defection rate in order for the right to take control of a majority government.

Which one of these scenarios is most likely is a more nuanced question. Because of how poorly distributed the PC vote was between ridings, it's much easier for the Wildrose to absorb all of the PC seats (23.1% of PC support is only 95,393 voters across the province, for instance) than it is for the PCs to absorb the Wildrose seats. If the goal is to reunite the right and regain control of the legislature, though, it may still be easier for the PCs to try to woo Wildrose voters - 31.4% of the Wildrose support is only 113,072 voters, and would have gotten the right back in power.
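The riding-by-riding check behind these numbers can be sketched as follows. The three ridings and their vote counts are invented purely for illustration (the real analysis used the results from all 87 ridings), and the party labels and function name are my own.

```python
from collections import Counter


def seat_winners(ridings, shift, source="PC", target="WRP"):
    """Move a fraction `shift` of the source party's votes to the target
    party in every riding, then return the winner of each riding."""
    winners = []
    for votes in ridings:
        adjusted = dict(votes)
        moved = adjusted.get(source, 0) * shift
        adjusted[source] = adjusted.get(source, 0) - moved
        adjusted[target] = adjusted.get(target, 0) + moved
        winners.append(max(adjusted, key=adjusted.get))
    return winners


# Hypothetical ridings, not real results.
ridings = [
    {"NDP": 8000, "PC": 5000, "WRP": 4500},
    {"NDP": 4000, "PC": 6000, "WRP": 5000},
    {"NDP": 3000, "PC": 4000, "WRP": 7000},
]

for shift in (0.0, 0.25, 1.0):
    print(f"{shift:.0%} of PC vote moves:", dict(Counter(seat_winners(ridings, shift))))
```

Sweeping the shift from 0% to 100% in small steps and counting seats at each step gives the kind of curve described above; swapping `source` and `target` gives the reverse scenario.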

Overall, this means that a swing one way or another of about 100,000 right-leaning voters could have made all the difference in stopping the NDP from getting elected. Considering that this represents less than 8% of all voters from the last election, the possibility of a resurgence of the Alberta right is certainly not out of the question. The NDP has four years in power now to make good on their promises from the last election and retain their support, otherwise they may be in a bit of trouble during the next election.

Wednesday, May 6, 2015

Math is Difficult

Math can be difficult, so it's a good thing that Elections Alberta posts its unofficial elections results in a nice, easy-to-copy-into-Excel format!

Now that the Alberta election is done, I figured I'd post a short post just showing visually where the party support bases were located. Nothing too flashy or stats-heavy this time. Hopefully more analysis will follow!

First of all, based on unofficial results, the voter turnout last night was 57.01%. Not great, but how does that look visually?

Northern Alberta seems to have suffered the most from bad turnout, with an interesting grouping of solid turnout in the center. Both Edmonton and Calgary had poor turnout in their northeast halves for some reason. Feel free to zoom and click on the map, it's actually a lot of fun (red is low turnout, and green is high).

How about the Liberal support:

The Liberals didn't even run a full slate of candidates, so it's not terribly surprising that most of the map is blank. They did well in the one riding that they actually won, though, and did respectably in Edmonton-Centre.

The PCs:

PC vote was surprisingly consistent across most rural areas; however, that meant it was mostly consistent and low. Edmonton center and north were particularly low for the PCs, but otherwise the variation across the rest of the ridings was fairly minimal.


The Wildrose:

Not terribly surprisingly, Wildrose support was concentrated in the southern rural parts of the province. As official opposition in the new government, they don't have any seats in urban ridings. This is fairly concerning, and hopefully won't create any further urban/rural divides in Alberta.

Finally, the NDP winners:

The NDP did very well in the cities and northwest rural ridings, but urban ridings south of Edmonton were more of a struggle for them. Interestingly enough, there is a substantial hole in NDP support in Calgary-Elbow, suggesting strategic anti-PC voting took precedence down there. I'm sure Greg Clark is appreciative.

There you go! Once the recount is done in Calgary-Glenmore (where it is currently tied between the NDP and PCs), I'll hopefully come back with more election analysis!

Tuesday, April 28, 2015

2012 Alberta Election Results Poll by Poll

There's only one week left until the provincial election!

I figure this is as good a time as any to remind everyone of the full results from 2012. I'm going to do it a little differently than most map sources have.

As an example, Wikipedia has this map of election results for each of the 87 electoral districts in Alberta:

This map suggests to me that two-thirds of rural Alberta voted strongly PC in 2012, rural Alberta south of Red Deer voted Wildrose, and the cities were a mix of mostly PC, Liberal, and NDP voters.

You know what's more interesting than that though? Poll by poll results. Each of the 87 electoral districts represents dozens of polls, and looking at these results can lead to a more detailed view of how people voted almost down to a neighborhood-by-neighborhood level. For instance, here's the full map of Alberta:

And here's Calgary and Edmonton:

If you're a fan of interactive maps, feel free to play with this!

For each poll, the colour represents the party with the most votes, and dark colours mean that the leading party had over 50% of the vote. Blue is for PC, red is Liberal, orange is NDP, and Wildrose is green.
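In case it helps to see the colouring rule spelled out, here's a minimal sketch. The party-to-colour mapping comes from the description above; the "light"/"dark" labels, the grey fallback for other parties, and the function name are my own assumptions, and the vote counts are invented.

```python
PARTY_COLOURS = {"PC": "blue", "Liberal": "red", "NDP": "orange", "Wildrose": "green"}


def poll_colour(votes):
    """Colour a poll by its leading party; dark if that party topped 50%."""
    total = sum(votes.values())
    if total == 0:
        return None  # no votes recorded at this poll
    leader = max(votes, key=votes.get)
    base = PARTY_COLOURS.get(leader, "grey")  # grey fallback is an assumption
    shade = "dark" if votes[leader] / total > 0.5 else "light"
    return f"{shade} {base}"


# Hypothetical polls, not real results.
print(poll_colour({"PC": 120, "Wildrose": 60, "NDP": 30}))  # dark blue
print(poll_colour({"PC": 80, "Wildrose": 90, "NDP": 40}))   # light green
```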

The results generally follow the pattern of the overall district results, though showing significantly more Wildrose rural support up north than the original map would have us believe. As a fun exercise for the reader, I encourage you to try to find the four polls that the Alberta Party won, and the one lonely poll that the Communist Party won (hint: Calgary-East).

Remember to go vote next Tuesday!