Thursday, November 1, 2012

Sneaky Statistics

Statistics are cool. Statistics are your friend. If you treat them right, they'll love you forever and never lie to you.

The problem with statistics is that sometimes they're confusing, and people very frequently think they understand them better than they do. Because of this, it's really easy sometimes for people to lie by tricking you with stats. Meanies...

Here's a cool example from Wikipedia of statistics playing tricks on you. Pretend that you're a doctor trying to figure out which of two treatments is better for curing kidney stones. You use both for a while and this is what you get:

               Treatment A       Treatment B
Small Stones   81/87 = 93%       234/270 = 87%
Large Stones   192/263 = 73%     55/80 = 69%
Total          273/350 = 78%     289/350 = 83%

At first glance this test may seem pretty fair - both treatments were used 350 times, so we can compare them, right? And it looks like Treatment B was better than Treatment A. Maybe we should use it? Sounds good!

But wait. When we break it down into small stones versus large stones, the story changes. In small stones, Treatment A is 6 percentage points better than Treatment B, and in large stones Treatment A is 4 percentage points better. That's crazy though - how can A be both better at treating small stones and better at treating large stones, but worse when the two are added together? Clearly evil forces are at work here.

Around this point it wouldn't be horrible for you to be confused about which treatment is actually better, and it turns out that this study was, in fact, not fair. Large stones had a lower rate of successful curing, and Treatment A was used more than three times as often as Treatment B for these stones (263 cases versus 80). Similarly, the easier small stones were much more often given Treatment B (270 cases versus 87). This creates such an unbalanced weighting between the treatments and stones that when it's all added up Treatment B looks better.
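To see the unbalanced weighting in action, here's a small sketch that recomputes the rates from the table above. The variable and function names are just made up for illustration; only the counts come from the table.

```python
# (successes, attempts) for each treatment and stone size, copied from the table.
results = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, attempts):
    return successes / attempts

def overall(treatment):
    """Add up successes and attempts across both stone sizes."""
    successes = sum(s for s, _ in results[treatment].values())
    attempts = sum(n for _, n in results[treatment].values())
    return successes, attempts

# Within each stone size, compare the two treatments directly.
for size in ("small", "large"):
    a = rate(*results["A"][size])
    b = rate(*results["B"][size])
    print(f"{size} stones: A {a:.0%}, B {b:.0%}")

# Now lump the stone sizes together.
total_a = rate(*overall("A"))
total_b = rate(*overall("B"))
print(f"all stones: A {total_a:.0%}, B {total_b:.0%}")

# Share of each treatment's cases that were the harder, large stones.
large_share_a = results["A"]["large"][1] / overall("A")[1]
large_share_b = results["B"]["large"][1] / overall("B")[1]
print(f"large stones as a share of cases: A {large_share_a:.0%}, B {large_share_b:.0%}")
```

Treatment A wins inside each stone size, but most of its cases are the hard ones, so its lumped-together rate gets dragged down.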

This highlights two cool concepts in statistics. The first is Simpson's paradox, where a trend that shows up in each of several separate groups is reversed when the groups are combined. Obviously this could offer juicy opportunities to people with an agenda - a drug company representing either Treatment A or B could make a case that their drug is better, simply based on how they add the numbers up in the study.

The second is the confounding (or lurking) variable - a variable that wasn't originally accounted for that has an effect on both the dependent and independent variables in the study. A good example is as follows: a statistician could do a week-by-week analysis of human behavior on a beach, keeping track of both drownings and slurpee consumption. They might make the observation that in weeks with high slurpee consumption, more people drown, and someone could then declare that drinking slurpees increases the chance of drowning. 

Boy that would suck. As a researcher, you could probably even justify this a little - perhaps drinking slurpees fills you up or makes you lethargic, increasing your chance of drowning. However, a more likely explanation comes from taking something new into account: the season. People just plain drink more slurpees in the summer than the winter (unless they're me). People also go swimming more at beaches during the summer, increasing the chance of drowning. In this example, the season would be a lurking variable - it correlates with both previously-considered variables, and explains the phenomenon.
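Here's a toy simulation of the slurpee story. Every number in it is invented (the length of summer, the weekly averages, the noise), and the point is only that slurpees and drownings are each generated from the season and never from each other. Even so, the week-by-week correlation comes out strong until you look inside one season at a time.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so reruns give the same numbers

weeks = []
for week in range(5 * 52):  # five hypothetical years of weekly counts
    summer = 20 <= week % 52 < 36  # assumed 16-week summer
    # Both quantities depend only on the season, never on each other.
    slurpees = max(0.0, rng.gauss(500 if summer else 100, 40))
    drownings = max(0.0, rng.gauss(8 if summer else 2, 1))
    weeks.append((summer, slurpees, drownings))

def correlation(rows):
    return pearson([s for _, s, _ in rows], [d for _, _, d in rows])

overall_r = correlation(weeks)
summer_r = correlation([w for w in weeks if w[0]])
winter_r = correlation([w for w in weeks if not w[0]])

print(f"all weeks:    r = {overall_r:.2f}")
print(f"summer weeks: r = {summer_r:.2f}")
print(f"winter weeks: r = {winter_r:.2f}")
```

Splitting by the lurking variable is the whole trick: once the season is held fixed, there's nothing left for the slurpees to explain.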

Similarly, in our kidney stone example the lurking variable is the size of the stone. Doctors used Treatment A disproportionately for large stones, and Treatment B disproportionately for small stones - at the same time, small stones were easier to cure than large stones. By not taking into account the effect of the stone size on the treatment distribution, we arrive at the paradox from before.
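One way to take stone size into account is to ask what each treatment's overall success rate would be if both had been given the same mix of stones. The sketch below uses the combined mix from the table (357 small, 343 large) as that common mix. That choice is an assumption made for illustration, and the names are made up.

```python
# Success rate of each treatment within each stone size, from the table.
rates = {
    "A": {"small": 81 / 87, "large": 192 / 263},
    "B": {"small": 234 / 270, "large": 55 / 80},
}

# Common case mix: every stone in the study, regardless of treatment.
cases = {"small": 87 + 270, "large": 263 + 80}
total_cases = sum(cases.values())

def standardized_rate(treatment):
    """Overall success rate if the treatment had been used on the common mix."""
    expected_successes = sum(rates[treatment][size] * cases[size] for size in cases)
    return expected_successes / total_cases

adjusted = {treatment: standardized_rate(treatment) for treatment in rates}
for treatment, value in adjusted.items():
    print(f"Treatment {treatment}: {value:.1%} on the common mix of stones")
```

With both treatments facing the same stones, Treatment A comes out ahead, which matches what the small-stone and large-stone rows were saying all along.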

Funnily enough, Simpson's paradox occurs fairly frequently - in fact, statisticians have estimated that among randomly generated tables like the one in the kidney stone example (two treatments, two outcomes, two groups), we'd expect about 1 in 60 to show some version of the paradox.
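You can get a rough feel for a number like that with a quick simulation. The post doesn't say how the estimate was derived, so the setup here is an assumption: the eight cells of each random table are drawn so that every possible set of cell proportions is equally likely, and a "paradox" means both groups favour one treatment while the combined table favours the other. Function names are made up.

```python
import random

def favours_first(success_1, failure_1, success_2, failure_2):
    """True if treatment 1 has the higher success rate (cross-multiplied comparison)."""
    return success_1 * failure_2 > success_2 * failure_1

def is_reversal(table):
    """table[g] = (success_1, failure_1, success_2, failure_2) for group g."""
    in_groups = [favours_first(*cells) for cells in table]
    pooled = [sum(cells[i] for cells in table) for i in range(4)]
    combined = favours_first(*pooled)
    # Both groups agree with each other, and the combined table disagrees with them.
    return in_groups[0] == in_groups[1] and combined != in_groups[0]

def estimate(trials, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Eight independent exponential weights, one per cell. Scaling them to sum
        # to 1 would give uniformly random proportions, and scaling does not change
        # any of the comparisons, so the raw weights are used directly.
        table = [tuple(rng.expovariate(1.0) for _ in range(4)) for _ in range(2)]
        hits += is_reversal(table)
    return hits / trials

estimated_share = estimate(100_000)
print(f"share of random tables with a reversal: {estimated_share:.4f}")

# Sanity check on the kidney stone table: (successes, failures) for A, then B.
kidney = [(81, 6, 234, 36), (192, 71, 55, 25)]
print("kidney stone table shows a reversal:", is_reversal(kidney))
```

A different way of generating "random" tables would give a different share, so treat the output as a ballpark rather than a confirmation.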

One famous example involved a sex discrimination lawsuit at Berkeley in 1973. The admission results from the six largest departments looked something like this:

Department   Men (admitted/applicants)   Women (admitted/applicants)
A            512/825 = 62%               89/108 = 82%
B            353/560 = 63%               17/25 = 68%
C            120/325 = 37%               202/593 = 34%
D            138/417 = 33%               131/375 = 35%
E            53/191 = 28%                94/393 = 24%
F            22/373 = 6%                 24/341 = 7%

When the total data was added up across all departments, though, the distribution was as follows:

        Applicants   Admitted
Men     8442         44%
Women   4321         35%

At first glance, it looks like a case of gender discrimination - the admission rate for men was nearly 10 percentage points higher than for women across the board, and some people who felt cheated took it to court.


Looking at those six departments in the first table, though, shows something interesting - Departments A, B, and D were the most popular with the men, and A and B were by far the least popular with the women. In all three, the women were more likely to be admitted than the men. On the other hand, Departments C and E were the most popular with the women, and there they lost to the men. Unfortunately, the departments most popular with women also had admission rates that were about half those of the departments the men preferred.
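Here's a sketch of that popularity effect. It uses the counts as they are commonly published for these six departments, which is an assumption on my part, and the names are made up. Keep in mind the pooled rates it prints cover only these six departments, so they won't match the campus-wide totals.

```python
# (admitted, applicants) per department, as commonly published for the six departments.
men = {"A": (512, 825), "B": (353, 560), "C": (120, 325),
       "D": (138, 417), "E": (53, 191), "F": (22, 373)}
women = {"A": (89, 108), "B": (17, 25), "C": (202, 593),
         "D": (131, 375), "E": (94, 393), "F": (24, 341)}

def pooled_rate(group):
    """Admission rate with all six departments lumped together."""
    admitted = sum(a for a, _ in group.values())
    applicants = sum(n for _, n in group.values())
    return admitted / applicants

def shares(group):
    """Fraction of the group's applications that went to each department."""
    total = sum(n for _, n in group.values())
    return {dept: n / total for dept, (_, n) in group.items()}

men_share = shares(men)
women_share = shares(women)

for dept in men:
    men_rate = men[dept][0] / men[dept][1]
    women_rate = women[dept][0] / women[dept][1]
    print(f"{dept}: admitted {men_rate:.0%} of men, {women_rate:.0%} of women; "
          f"drew {men_share[dept]:.0%} of the men and {women_share[dept]:.0%} of the women")

# Departments where the women's admission rate beat the men's.
women_ahead = [dept for dept in men
               if women[dept][0] / women[dept][1] > men[dept][0] / men[dept][1]]
print("women ahead in:", ", ".join(women_ahead))
print(f"pooled: men {pooled_rate(men):.0%}, women {pooled_rate(women):.0%}")
```

The department-by-department rates and the pooled rates point in opposite directions because the two groups spread their applications so differently.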


In fact, a study of these results suggested that there was a "small but statistically significant bias in favor of women" in the admission process when examining all departments in question, and the lawsuit failed. The lurking variable in this case was the character of the departments themselves - men tended to go into studies that were more math-intense (engineering, science, etc.), which happened to have more room to accept students.


It's really important to keep concepts like this in mind when examining statistics. For instance, one has to be extremely careful when directly comparing male and female earnings, and account for factors such as preference in employment - it's much better to compare across identical jobs than to compare aggregate numbers. Aggregate statistics in that case are only really good for highlighting disparity in employment distributions, not in earnings. Similarly, the Berkeley sex bias case, while not showing a bias against admitting women into studies, highlighted a lack of female participation in programs involving math that was more indicative of early societal pressures than active denial.


One final word of caution regarding Simpson's paradox: because it's relatively common, it's not impossible to make it appear as though it is taking place when in fact it isn't. Breaking up applicants by department makes sense because each department's admissions process is hypothetically independent of the others, but one could easily also break the applicants into groups based on eye colour, height, birth place, beer preference, favorite hockey team, or blog readership. Chances are that in any given group of people, there's a way of breaking the data up nearly arbitrarily that could result in such a paradox. So if you ever see crazy differences between aggregate results and group results, make sure to keep an eye out for any funny business!
