Friday, July 4, 2014

Optimizing your Coffee Fix

Canadians like their coffee. In fact, the average Canadian drinks 55% more coffee per day than the average American, and Canadians are ranked 9th in the world for overall coffee consumption. It's hardly a surprise, then, that coffee shops seem to be on every corner in Canadian cities.

Pretend for a moment that you're out and about in Edmonton one day, and absolutely need your coffee fix. So badly, in fact, that you're only willing to travel the absolute shortest possible distance to your nearest Tim Hortons or Starbucks (Canada's two most popular coffee shop chains). If you made a map of the city based on where the nearest Tim Hortons is, it would look something like this:

Similarly, a map based on where the nearest Starbucks is would look like this:

(If your browser doesn't like Google maps, check out images of the maps for Tim Hortons and Starbucks. Please note that the maps are only as accurate as Google's knowledge of the world is.)

These are called Voronoi diagrams, which split up the city based on where the closest relevant coffee shop is. Each region corresponds to a single coffee shop, and everywhere within that region is closer to that coffee shop than any other.
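A Voronoi diagram is exactly the partition you get from a nearest-site lookup: a point belongs to a shop's cell if that shop is the closest one. Here's a minimal sketch in Python, using made-up shop coordinates on a flat city grid (real maps would need proper geographic distances):

```python
import math

# Hypothetical coffee shop coordinates (km on a flat city grid).
shops = {
    "Tims A": (0.0, 0.0),
    "Tims B": (4.0, 0.0),
    "Tims C": (2.0, 3.0),
}

def nearest_shop(point, shops):
    """Return the shop closest to `point`. Every point that maps to
    the same shop lies inside that shop's Voronoi cell."""
    return min(shops, key=lambda name: math.dist(point, shops[name]))

print(nearest_shop((0.5, 0.5), shops))  # a point near the origin
```

Libraries like SciPy can compute the full cell boundaries, but the lookup above is all the definition really says.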

It turns out that Edmonton has about one Tim Hortons for every 10,000 people, and about half as many Starbucks. Not too surprisingly, coffee chains tend to be clustered quite a bit downtown and near the U of A campus, leaving the industrial parks to the east rather desolate and missing out on a good brew. The extremely even distribution of Tim Hortons locations in Sherwood Park seems a bit too good to be true, though.

Apart from helping you optimize your coffee purchasing, Voronoi diagrams have plenty of serious real-world uses. A diagram like this was famously used by John Snow in 1854 to show that London residents who lived closest to one particular well were dying of cholera, which led to the discovery that disease can spread through contaminated water.

In the case of coffee shops in Edmonton, they can be helpful in city planning or for businesses choosing where to establish new franchises. And, of course, if your caffeine priorities are straight, they can help you get the fastest fix.

Monday, May 19, 2014

Which Game Should You Win?

The other day I was thinking, “What game in a 7 game playoff series is most closely correlated to winning the series?” Fans obviously cheer hard for their team every game in the playoffs, but if they knew that winning a particular game gave their team the best chance to win the series, perhaps they’d pull out all the stops. So, with this interesting question in hand, I went to see what the data said.

Hockey Reference is a treasure trove of hockey statistics and historical data. To get some answers to my question, I pulled all of the playoff series going back to 1943 (the first year in which every round was decided by a best-of-7 series). That gave me a data-set of 598 series and 3,363 games (so much hockey!).
Now it was time to crunch the numbers. Since I now had this really cool data-set, I decided to calculate some interesting tidbits before answering my original question. The first of these tidbits was to see what percentage of the series ended in 4, 5, 6 or 7 games:
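The tally behind that chart is a one-liner with a counter; here's a sketch using a toy stand-in for the real 598-series data-set:

```python
from collections import Counter

# Toy stand-in for the Hockey Reference data: how many games each
# series lasted (the real data-set has 598 of these).
series_lengths = [4, 7, 5, 6, 6, 4, 7, 5, 6, 4]

counts = Counter(series_lengths)
for length in (4, 5, 6, 7):
    pct = 100.0 * counts[length] / len(series_lengths)
    print(f"{length}-game series: {pct:.1f}%")
```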

Another interesting tidbit is the idea of home ice advantage. Teams play 82 grueling games through the regular season, and once you've cleared the "make the playoffs" bar, the only remaining reward for doing well is home ice advantage. You would hope it's actually an advantage on the ice, not just a perk for your team's owner, who gets to host an extra game in his arena. Since 1943, the home team (defined as the team that hosted game 1) has won 64.5% of all playoff series. On top of that, of the 108 four-game sweeps, the home team won 81 (75%). So I think it's fair to say that home ice is an advantage.
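For the skeptics: 81 of 108 is far too lopsided to be luck. A quick back-of-the-envelope check, assuming a fair-coin null hypothesis (each team equally likely to win a sweep), shows the chance of seeing a split at least this extreme is tiny:

```python
from math import comb

# Figures quoted above: 108 four-game sweeps, 81 won by the home team.
sweeps, home_wins = 108, 81

# One-sided binomial tail: P(at least 81 home wins out of 108 coin flips).
p_value = sum(comb(sweeps, k) for k in range(home_wins, sweeps + 1)) / 2**sweeps
print(f"P(>= {home_wins} home wins by chance) = {p_value:.2e}")
```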

Alright, enough beating around the proverbial bush - time to answer the question that took me down this rabbit hole. As I said, I wanted to know which game in a 7-game series has the highest correlation with winning the series. The obvious answer is game 7, considering 100% of the teams that won game 7 won the series. This isn't very interesting, though, so let’s dive in. The following chart shows the percentage of teams that won the given game who also won the series:

Well, this is awkward. That chart isn't very interesting at all. Essentially the first 4 games are a toss-up (with Game 2 having a slight edge), and then the percentages climb for the elimination games, as you would expect. Clearly, if your team has a chance to end the series in a given game, you should cheer as hard as possible for that to happen. A little more digging, however, turned up something interesting. In series that went past Game 5, the team that won Game 5 went on to win the series 60.6% of the time. But in series that went to Game 7, the team that won Game 6 won the series only 47.9% of the time. So much for all that "momentum" talk.
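The statistic behind the chart is a conditional frequency: of the series in which a given game was actually played, how often did that game's winner take the series? A sketch with toy series records (made-up results, not the real data):

```python
# Toy series records: each list gives the winner (0 or 1) of each game,
# e.g. [0, 1, 0, 0, 0] would be a 5-game series won by team 0.
series = [
    [0, 0, 0, 0],
    [1, 0, 1, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 0],
]

def series_winner(s):
    # The series winner is the team with the majority of the game wins.
    return max((0, 1), key=s.count)

def game_winner_series_rate(series, game_index):
    """Of the series in which game `game_index` (0-based) was played,
    what fraction were won by the team that won that game?"""
    played = [s for s in series if len(s) > game_index]
    hits = sum(s[game_index] == series_winner(s) for s in played)
    return hits / len(played)

for g in range(4):
    rate = game_winner_series_rate(series, g)
    print(f"Game {g + 1} winner also won the series: {rate:.0%}")
```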

In the end it is the humble opinion of this writer that winning any game in the playoffs is probably your best bet. However, if you’re looking for the most statistically advantageous games to win, it appears that Game 2 and Game 5 are the ones to win. 

More analysis to come so stay tuned...

Thursday, April 24, 2014

Open Letter to the Canadian Centre for Policy Alternatives

To: Ms. Kate McInturff
Senior Researcher, Canadian Centre for Policy Alternatives
Dear Ms. McInturff,
It is with concern that I read your most recent publication from the CCPA, "The Best and Worst Places to be a Woman in Canada". Your analysis looked at five categories for Canada's top 20 cities, normalized and ranked the scores, then averaged each city's ranks to produce a gender inequality ranking across Canada.
Two weeks ago, I did an analysis in which I looked at six factors for Canada's top 20 cities, normalized them, averaged the scores, and ranked each city for zombie preparedness. My piece was intended as a joke, albeit one built on real statistics, and I am extremely discouraged that a similar level of statistical rigour appears to have been applied to your analysis as was applied to mine. Though I agree in general with the principles behind your analysis, I have several specific concerns:
First of all, I would like to express some confusion as to how your final ranking was determined. You mention that:

The scores for the indicators in each category (i.e. health, education) are averaged to produce a final score for that category. Each indicator is given equal statistical weight in the calculation of the score for each category. The cities are then ranked according to their score. The overall ranking of the cities is produced by averaging their ranks in each category.
At first glance, this seems pretty straightforward. When I took a look at Appendix B, though, the numbers didn't quite add up. Québec City is still very much in the lead (ranked 6th, 2nd, 3rd, 7th, and 8th; average: 5.2), but I'm confused about why Montréal (ranked 11th, 9th, 11th, 6th, and 7th; average: 8.8) is ranked above Sherbrooke (ranked 7th, 10th, 10th, 11th, and 2nd; average: 8.0). In fact, most of the cities ranked 3-9 appear somewhat shuffled:
This is pretty minor in the long run, but I am curious about how you got this final result.
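The averages quoted above are easy to reproduce; a quick sketch with the three cities' Appendix B ranks:

```python
# Category ranks for three cities, as quoted above from Appendix B.
ranks = {
    "Québec City": [6, 2, 3, 7, 8],
    "Montréal": [11, 9, 11, 6, 7],
    "Sherbrooke": [7, 10, 10, 11, 2],
}

for city, r in ranks.items():
    print(f"{city}: average rank {sum(r) / len(r):.1f}")
```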
Secondly, though the concept of using similar indicators as the Gender Inequality Index and the Global Gender Gap Report certainly seems valid, weighting seventeen indicators in five categories equally before ranking the categories and averaging them ignores much of the analytics required to do a study like this justice. (As an aside, stating that something is "well-supported by medical research" without citing any research makes it tough to follow up on...)
An easy example of even weighting potentially causing issues is in the Education category. The female to male completion ratios are calculated from the National Housing Survey for High School, Apprenticeships, College, and University, then all four were averaged together to come up with an inequality score, which is then ranked for the full category.
Of the four indicators, apprenticeship rates are consistently the lowest for women. However, the fraction of people who pursue the trades at all is also very low, and it varies more from city to city than the female enrollment rate does (its coefficient of variation is 1.75 times higher). As a result, weighting all four forms of education equally penalizes cities with few tradespeople in general, effectively giving the apprenticeship figure roughly twice the influence it should have.
If we instead weight each indicator by the size of the population it covers, it turns out that no city has lower total educational attainment for women than for men. The cities most helped by weighting all forms of education the same were those with the most people in the trades: Montreal, Quebec, and Sherbrooke. After accounting for this, these cities drop to ranks 17, 18, and 13 respectively for Education, a change that would likely affect the conclusions drawn about Quebec as a province.
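To illustrate the effect with made-up numbers (these ratios and enrolment shares are hypothetical, not from the report): a city can look below parity on the unweighted average purely because of a tiny apprenticeship cohort.

```python
# Hypothetical female-to-male completion ratios and shares of the
# educated population in each stream -- illustrative numbers only.
ratios = {"High school": 1.02, "Apprenticeship": 0.30,
          "College": 1.10, "University": 1.15}
shares = {"High school": 0.45, "Apprenticeship": 0.05,
          "College": 0.25, "University": 0.25}

# Equal weighting (the report's approach) vs. population weighting.
unweighted = sum(ratios.values()) / len(ratios)
weighted = sum(ratios[k] * shares[k] for k in ratios)

print(f"Unweighted average ratio: {unweighted:.2f}")
print(f"Population-weighted ratio: {weighted:.2f}")
```

With these toy numbers the equal-weight average sits below 1.0 (apparent female disadvantage), while the population-weighted figure sits above it.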
Thirdly, the system of averaging rankings, though fun in a zombie-analysis sort of way, treats all five categories identically while discarding the variation between cities within each category. Even without any changes, Education is a category with equal or nearly-equal outcomes between men and women. If all categories are treated equally and only the rank matters, then we are essentially saying that a city like Vancouver, ranked 15th in Education with a perfectly equal score of 1.00, is just as bad as a city like Oshawa, ranked 15th in Leadership with a score of 0.27.
Finally, I find the name of your paper to be alarming and misleading. Your report specifically (and in italics) reinforces that it examines "the gap between men and women, rather than overall levels of well-being." Saying that Edmonton is the worst place in Canada to be a woman sounds like a comment on well-being, especially after mentioning that it's a great place for median income, just not quite as awesome for women as it is for men. On the other hand, cities with dangerous crime rates might be considered great for women, as long as the crime is equally distributed. I guess "Gender Inequality Index of Canadian Cities" wasn't catchy enough.
It's time that gender equality statistics are taken more seriously than zombie statistics. I hope that future studies reflect this.
Michael Ross

Monday, April 7, 2014

Canadian Cities Most and Least Likely to Survive the Zombie Apocalypse

Last week I found this blog post, which ranks the US states based on how likely they are to survive a zombie apocalypse. As the post mentions, seeing as the zombie apocalypse is clearly unavoidable, it's important to plan ahead and learn where to be when it hits.

Canadian provinces are too huge, and there aren't enough of them, to do quite the same sort of analysis north of the border. On the other hand, Canada has plenty of cities, and seeing as two thirds of the country lives in one of the 20 biggest ones, that seems like a pretty good way of looking at things.

Instead of looking at 11 factors (ranging from number of veterans to number of triathletes), I looked at the following 6 factors:

Distance to Closest Military Base: Let's face it, when the zombies come to get ya, you'll be hoping the military is close by to help take care of things. Fortunately, Canada has a ton of army, navy, and air force bases dotted around the country, and the cities closest to them are definitely better placed to handle an undead uprising.

Average Temperature: I'm not an expert, but I imagine if you're dead and frozen solid, you're less likely to be a threat than if you're dead and flexible. Fortunately, Canadian cities have fairly low average daily high temperatures!

Population Density: Zombie math is pretty simple: too many people + too small a space = brains. If you're trapped and surrounded by a lot of future-zombies you've got way worse chances than if you've got some space around ya.

Obesity Rate: This one's pretty straightforward - obese people make easy zombie targets. It's related to (though strangely not strongly correlated to):

Physical Activity: Rule #1 in Zombieland is "Cardio" for a good reason. More people who can escape zombies make for fewer zombies, which really is just better for everyone else.

Gun Ownership: Zombies don't like guns for exactly the same reason zombie apocalypse survivors love guns. Gun ownership data is unfortunately only available on a province-by-province basis, but it's hard to argue that the more guns that are around in a province the better equipped people are to handle the undead. [Edit: I had previously presented this number as guns per population - I actually used licenses per population.]
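The scoring scheme behind the ranking can be sketched as a min-max normalization of each factor, inverting the ones where lower is better, then averaging. The city values below are made up for illustration; they're not the figures I actually used:

```python
# Hypothetical raw factor values for three cities (made-up numbers):
# gun licenses per capita, people per sq km, average daily high (C).
cities = {
    "Edmonton": {"guns": 0.35, "density": 1200, "temp": 9.0},
    "Toronto": {"guns": 0.15, "density": 4500, "temp": 12.5},
    "St. John's": {"guns": 0.55, "density": 1000, "temp": 8.0},
}
# For these factors a lower raw value means better zombie odds.
LOWER_IS_BETTER = {"density", "temp"}

def scores(cities):
    """Min-max normalize each factor to [0, 1] and average, so the
    overall score is out of 1.0."""
    names = list(cities)
    factors = next(iter(cities.values())).keys()
    total = {n: 0.0 for n in names}
    for f in factors:
        vals = [cities[n][f] for n in names]
        lo, hi = min(vals), max(vals)
        for n in names:
            s = (cities[n][f] - lo) / (hi - lo)
            total[n] += (1 - s) if f in LOWER_IS_BETTER else s
    return {n: total[n] / len(factors) for n in names}

for city, s in sorted(scores(cities).items(), key=lambda kv: -kv[1]):
    print(f"{city}: {s:.2f}")
```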

With all that said, here's the ranking of the best and worst Canadian cities to be in during a zombie apocalypse (overall score is out of 1.0):

Moral of the story:

  • Don't live in southern Ontario - it's a zombie playground. Ontarians don't have a lot of guns, southern Ontario is relatively warm, and there's really nothing special going on in terms of physical activity and obesity.
  • Do live in a provincial capital. They tend to have military bases, and more often than not are large with relatively low density (suburbia is way better for zombie defense than downtown, of course).
  • I'm proud of Edmonton. Good job, us.
  • Newfoundland has a lot of guns. This is probably worth following up on.
[Edit #2: Updated Toronto temperature data - I mistakenly used daily mean instead of daily high for Toronto.]

Thursday, February 27, 2014

SU Elections: Presidential Grammar

With one very tiny exception at the end, I'm not going to talk about the platforms of any candidates in this year's SU elections. I'm not a student anymore, and it's probably time that I leave things be.

That being said, I was reading some of the platforms for the presidential candidates, and I found the grammar too much to bear. For instance, this is a page from one candidate's platform I copied and commented on (click to zoom):

And here's one from another candidate:

Come on guys. Apostrophes are taught to children. Capitalization is usually for proper nouns. "High jacked" sounds like an adjective shopping list for bros at an Amsterdam gym.

Grammar aside, I have to take massive exception to this graph in one candidate's platform:

If I looked at that, and not the numbers, I'd think "Wow, international tuition is WAY higher than domestic tuition!"

(Aside: the domestic tuition in the source is actually only $5,269.20. That's sort of irrelevant though.)

What is going on in this graph? A quick mental estimate says that 19 thousand dollars is only about 3-4 times 5 thousand dollars (precise ratio: 3.55), and that ratio matches the relative heights of the circles in the graph. In other words, this graph could and ought to be presented like this:

Sure, this still looks bad, but not NEARLY as bad as the previous graph, because we're no longer implicitly pretending that the circles' areas are what's being compared. By using circle areas instead of bars, the original graph massively skews the comparison and subtly suggests that international tuition is about 13 times domestic tuition. This technique is covered in Chapter 6 of "How to Lie with Statistics", which is a wonderful read if you're into that kind of thing. If we were to be truly honest with this graph, it could look something like this:
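To put a number on the distortion: scaling a circle's diameter by the tuition ratio scales its area by the square of that ratio.

```python
# Tuition ratio computed above: international / domestic.
ratio = 3.55

# Scaling a circle's diameter by r scales its area by r squared,
# which is the exaggeration the eye actually perceives.
print(f"Implied visual exaggeration: {ratio ** 2:.1f}x")
```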

This is admittedly far less alarming, but also less likely to mislead people.

I've said my bit. Now go have a fun campaign, and I'll hopefully get back to you with my model predictions next week!

Tuesday, February 25, 2014

Winter Olympics Predictions

The winter Olympics are over, which means that my productivity is back on the rise and my sense of nationalism has returned to normal levels.

One of the things I enjoy trying to do from time to time is developing predictions of sporting events, such as the NHL Playoffs. So when I heard that people were trying to predict the medal counts for the 2014 Sochi Olympics, naturally I became intrigued and tracked some of their results.

I found four different published predictions:
  • Infostrada Sports: These guys used results from "Olympics, World Championships, and World Cups (or equivalent)" since the 2010 Vancouver Olympics to develop a likely scenario for who would win in each event. Their model had different weights for the results, time since the event, and nature of the event. They only ranked the top 15 countries on their medal table, and it was last updated three days before the opening ceremonies.
  • Wall Street Journal: The Journal interviewed experts, rated recent performances, and assigned probabilities to outcomes. They claim to have been accurate to "within a few medals" in the last two Olympics, but were actually just alright for the 2012 London games, and only good at predicting a few countries in Vancouver in 2010.
  • SportsMyriad: I think this is a blog? Either way it's a fun website if you like sports stats. No real idea where the stats came from (apart from the disclaimer "It'll change from injuries, form, whims, etc.").
  • Andreff & Andreff (2014): A working paper from the International Association of Sports Economists, and also posted to the Freakonomics blog, the authors correlated factors such as population, per-capita income, political regime, average snowfall, and number of ski resorts to try to determine the number of medals. This sort of approach has been used for summer games before (probably not with ski resorts as a major factor...), but apparently not for winter Olympics. These were the only guys to include upper and lower bounds on their predictions.
How did they all turn out? Sort of alright, I guess. Sort of.

The best prediction was by the Wall Street Journal, with a coefficient of determination of 0.77 for total medals, and 0.63 for golds (1 being perfect).

Notable exceptions were the Netherlands (getting double the expected medals - whoops) and South Korea (getting half their prediction), but otherwise things were pretty decent for the Wall Street Journal.

Next best was the SportsMyriad site, which came in only slightly behind at 0.75 for total medals, but less close at 0.58 for golds.

Andreff and Andreff were next up, with a coefficient of determination of 0.68 for total medals (their model didn't break medals down by colour). They were the only group to include upper and lower bounds, which proved a bit silly since only 35% of countries fell within them. They were also the most wrong about the Netherlands, very confidently predicting 5-7 medals; the Dutch won 24.

InfoStrada was the furthest off, with a coefficient of determination of 0.22 for total medal count. A direct comparison is a bit unfair, though, as they only listed their top 15 countries, and adding 10 lower-performing countries would likely have bumped that number up. Even comparing only the top 15 countries across all models, though, they still came last.
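For reference, the metric used in these comparisons is simple to compute. A sketch with made-up medal counts, assuming (as I did here) that R² is measured directly against the published predictions rather than against a fitted regression line:

```python
def r_squared(predicted, actual):
    """Coefficient of determination: 1 - SS_res / SS_tot,
    where residuals are taken against the predictions as-is."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for p, a in zip(predicted, actual))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up predicted vs. actual medal counts for five countries.
predicted = [30, 25, 20, 15, 10]
actual = [33, 24, 17, 16, 11]
print(f"R^2 = {r_squared(predicted, actual):.2f}")
```

Note that under this definition a sufficiently bad prediction can score below zero, i.e. worse than just guessing the average.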

In general, the Olympics are tough to predict, for loads of reasons. Even the best in a sport don't win every event they enter, and trying to predict the result of a single mogul run or figure skating performance is an exercise in futility. Team sports are rough to predict since full national teams rarely play each other with their Olympic line-ups between Games, and occasionally Olympic berths are won by teams or athletes who don't even end up competing. Socio-economic data is probably fine for a general picture of a country's winter-sport abilities, but it ignores the fact that sometimes people are just good at something despite their surroundings.

That being said, I admire the effort by these would-be predictors, and look forward to seeing how they do next time around!

Wednesday, January 22, 2014

LRT Station Names are Silly

When City Council added "Fort Edmonton Park" to the South Campus LRT station name, many people were baffled.

The distance between Fort Edmonton Park and South Campus is about three kilometers. This may not seem too extreme, until you consider what else is within 3 km of South Campus station. For instance:

  • Four other LRT stations (Southgate up to University Stations)
  • Snow Valley
  • Half of Hawrelak Park
  • Most of the fun parts of Whyte Ave
  • Arch-rivals Harry Ainlay AND Strathcona High School
  • The Zoo

Everything within 3 km of South Campus
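Distances like these are easy to sanity-check with the haversine formula. The Edmonton coordinates below are my rough guesses for the two spots, not surveyed values:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points,
    assuming a spherical Earth of radius 6371 km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

# Rough coordinates: South Campus station vs. Fort Edmonton Park.
print(f"{haversine_km(53.506, -113.530, 53.500, -113.577):.1f} km")
```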

This is obviously ridiculous. The name of an LRT station should be based on what you would reasonably expect is by that LRT station, not something that you could connect to three kilometers away. If we opened the rest of the LRT station names up to standards like this, we'd have this chunk of the city with potential naming rights:

Which is getting right up to 20-25% of the city. Ludicrous.

However, this gave me an idea for some possible LRT station name changes, if we're allowed to be within 3 km of any of them. How would you like to go to the:

  • Century Park/IKEA Station
  • Southgate/Derrick Golf Course Station
  • South Campus/Snow Valley Station
  • McKernan/Belgravia/Hawrelak Park Station
  • Health Sciences/Jubilee/Oliver Village Station
  • University/Valley Zoo Station
  • Grandin/Campus St Jean Station
  • Corona/NAIT Station
  • Bay/Enterprise Square/Lister Residence Station
  • Central/Bonnie Doon Station
  • Churchill/Mill Creek Station
  • Stadium/Grant MacEwan Station
  • Coliseum/Kingsway Mall Station
  • Belvedere/Concordia Station
  • Clareview/Londonderry Mall Station
Any other fun ones I might have missed? Let me know!