Well that was fun!

###
How did I do?

For more than a year now I've been

tracking Alberta election polls with the hope of developing a reasonably accurate prediction model. Overall, I'm happy to report that the party I predicted in the lead won in 80 out of 87 races, and my riding qualifiers broke out as follow:

- "Solid" lead: 65/65 (100%)
- "Likely" lead: 12/15 (80%)
- "Lean" lead: 2/5 (40%)
- "Toss-up" edge: 1/2 (50%)

I think this is a decent proof of concept, small "lean" sample size notwithstanding, and I want to talk a bit about what went right and what went wrong, and how I can improve if I want to keep doing this sort of thing.

First of all, the polls leading up to election day didn't turn out to be too accurate. Take a look at the province and regional splits:

Edmonton was remarkably accurate, Calgary was close, but the rest of the province and the top line results were off significantly. This is possibly a cause for concern, as it could suggest that my model was taking inaccurate data as inputs but then claiming credit for an accurate output, which it wasn't designed to do.

The NDP ended up under-performing relative to their polling numbers, and likely the only reason this didn't mess up too many election prediction models is because they under performed mostly in areas like rural Alberta, where they were predicted to lose anyway. If the polls had been that wrong about the NDP in Edmonton, say, the predictions could have been far worse.

Similarly, my model and others like me likely wouldn't have fared too well if the NDP had

*over*performed their polling rather than

*under*performed. The same amount of polling error as actually occurred, applied the other direction, could have had the NDP win the popular vote across the province.

My takeaway from this is that I need to adjust my topline polling tracker. Right now it runs under the implicit assumption that errors in individual polls will cancel each other out. This seemed reasonable given that polls are produced by different companies with different methods. That led to my full Alberta tracker having a low confidence interval for the NDP in particular, though, as several polls in a row provided the same result. If I instead make the assumption that at least part of the polling error is correlated between polls, perhaps due to something beyond their control, then the final result from election night would have still been a surprise, but far less of one. Certainly something I'll take into account next time.

###
Other Metrics

Overall on a riding-by-riding level, I had an error of 6.4% vote share. That's not superb, but also not far from what my testing beforehand suggested, and was factored into my uncertainty. Comparing my final projection to actual results on election night doesn't look too bad:

If we ignore the Alberta Party and the Liberals, this leads to an overall R-squared value of 0.79, which I consider respectable. It's handy to ignore the low parties because they don't have much of a spread, and will skew the coefficient of determination calculation.

Very fortunately for me, if I input the final actual regional results as though they were a poll result, my model does improve. This is a good hint that my model is behaving decently, especially so since this hasn't been the case with all other forecasters.

With the correct Calgary, Edmonton, and Rural results input as large polls, my model improved to 83/87 seats correctly predicted and an R-squared for party support per seat of 0.91. Very encouraging - too bad the polls weren't more correct!

Finally, I also provided an expected odds of winning each seat for each party. It's one thing to count a prediction as a success if you give it 100% odds of winning and it comes true, but how does one properly score oneself in the case of Calgary-Mountain View, where I gave the Liberals (10.8%), UCP (16.2%) and NDP (73%) different odds of winning, and only one (NDP) did?

In this case I've scored each riding using a

Brier score. A score of 0 means a perfect prediction (100% to the winner and 0% predicted for all losers), a score of 1.0 means a perfectly wrong prediction (100% to one of the losers), and because of the math, a score of 0.19 for a complete four-way coin toss (I only predicted the four parties represented in the debate).

Overall, I scored a

**0.027**, which is considerably better than just guessing. It's hard to get an intuitive sense of what that score really means, but it's mathematically the same as assigning an 83.5% chance of something happening and having it come true. Not a bad prediction, but there's room to be sharpened.

###
How did I stack up?

So like I said, there were a lot of us predicting the election this time around. I've tried to find as many as I can, and I apologize profoundly if I've missed anyone. I've only included forecasts that had either a vote breakdown per seat or anticipated odds of winning each seat for comparison purposes.

I've reported on three main measures (seat accuracy, R-squared per seat, and prediction Brier score), and I'll present as many of those for each forecaster as I was able to determine. Different forecasters win at different categories, so it's not necessarily a clear picture as to which one of us is the "best", so I'll mostly leave room here for interpretation:

I'm not claiming to be the second best, but it's important to note that being best in one measure doesn't necessarily mean best overall. There are also harder-to-evaluate measures in play here - for instance

VisualizedPolitics and

TooClosetoCall allow you to input poll values to see reactions for yourself, and both improved when given more accurate data (VisualizedPolitics also got to 83 seats accurately predicted, though still with a low R-squared value).

338Canada probably rightly

can claim to have been the strongest this time around, but I given the polling errors we were faced with I think it'll take several more elections to determine if anyone is really getting a significant edge consistently. This

isn't the first time we've compared ourselves to each other, and I think it's an important exercise in evaluating our own models and whether there's a need for more.