Last week we talked about why it was important to measure not only the **center** of a data set, but its **spread** as well, since data sets with very similar centers can look completely different in other important ways. For example, my retirement fund netted me a whopping $300 in 2010. I'm certain that, somewhere on Wall Street, there exists a fund manager who *also* cleared $300 last year--in other words, we had the same **average** returns on our investments--but our monthly statements probably look as though they were generated on separate planets. While my fund may have gained or lost a few bucks month to month, his probably gained or lost a few million a day. As a result, I live in a two-bedroom apartment in St. Louis Park, and he lives in a corporate penthouse in Manhattan. So we need some way of measuring how our data varies from the average, from the **mean**, to make sense of it all.

First, we looked at all of the differences individually. We subtracted the mean from each number, and we called these signed differences **deviations**. They were nice: simple to calculate, easy to understand, perfectly sensible. The problem with deviations was that we generated one deviation per data point. This isn't such a big deal in a data set of ten numbers, but what about ten thousand numbers--or ten million? Not so nice. What we'd really like is to boil things down to one number, one measurement that gives us a meaningful summary of our data's spread.

Our initial idea was simply to average the deviations, which seemed like it might be the thing we were looking for. Want to know how spread out some data points are? Let's just take their average deviation from the mean. Easy day. But then something strange happened: every time we tried to add up the deviations (the first step in finding the average, after all), we got a sum of zero. Not once, not twice, but every single time. Coincidence? Do I just choose really terrible examples? Nope. And maybe, but not because of this. Here's what's really going on...

Imagine you're watching a game of tug of war between two perfectly matched teams--let's call them East and West. The Easters are pulling, well, east, and the Westers...you get it. These are normal tug of warriors with awful team t-shirts and varying levels of skill. There are some studs pulling really hard, some weaklings who are barely hanging on (just for show), and some people in between. But because the teams are exactly even, the flag in the middle of the rope isn't moving in either direction; it's just hanging out. In other words, the flag is moving the same amount that it would if **no one were pulling at all.** On average, every person is moving the flag by exactly zero feet.

The same thing is happening with our deviations. When we have a set of numbers, each number is pulling on the mean. Those with big negative deviations are pulling down on it pretty hard (they're the Team West studs); those with tiny positive deviations are pulling up on it a little (Team East weaklings); and a bunch are pulling with an intensity that's somewhere in between. At the end of the day, though, the mean's not going anywhere. In fact, if any of the deviations could pull hard enough to move it, then we'd have to start all over with a different mean entirely! So, in essence, it's as if each number weren't pulling on the mean at all. And, since distance equals strength in this metaphor, that means it's as if each number, on average, is a distance of zero from the mean. The deviations cancel each other out. Every. Single. Time.
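If you'd like to watch the tug of war play out numerically, here's a quick sketch in Python (the data set is made up purely for illustration):

```python
# A made-up data set, just for illustration.
data = [4, 8, 15, 16, 23, 42]

# The mean: sum of the values divided by how many there are.
mean = sum(data) / len(data)  # 108 / 6 = 18.0

# One signed deviation per data point: value minus mean.
deviations = [x - mean for x in data]
# [-14.0, -10.0, -3.0, -2.0, 5.0, 24.0]

# The tug of war: the negative pulls and positive pulls balance exactly.
print(sum(deviations))  # 0.0
```

Swap in any numbers you like; the deviations will still sum to zero.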

Thus thwarted, we thought the next logical step would be just to ignore the signs and *then* take the average deviation. This takes care of our zero-sum problem in a jiffy, and in fact it's a really good solution. (I told you you guys were smart.) The issue now is a little bit beyond our reach, but I'll give you a quick justification. Ignoring the signs on the deviations means, mathematically, that we're taking the absolute value (remember that thing?). There's nothing really terrible about this, but the absolute value function is what's known as a **piecewise function**, which means that the outputs follow different rules depending on what inputs we feed into it. These piecewise functions can be a little slippery when we try to apply the tools of calculus, so it turns out a better way to handle the signed deviations is just to square everything in sight, *then* take the average, and finally take the square root to knock everything back down to size. The result of all these gymnastics is a number we call the **standard deviation**, and we finally have it--a single value that captures some information about how spread out a data set is with respect to its mean.

As you might imagine, this tug of war metaphor isn't really a mathematical proof of why the deviations must sum to zero, but you have all the tools you need to give a real, honest-to-goodness proof if you're so inclined (write out the formula for finding the mean of a set and go from there). It's my challenge to you. If you're still curious, but lazy, put your Googling skills to good use and find a proof online. On average, they're everywhere.
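The square-then-average-then-square-root recipe translates almost word for word into code. Here's a minimal sketch in Python, using the same made-up data as before and dividing by n (the "population" flavor of standard deviation; some courses divide by n - 1 instead, but that's a story for another day):

```python
import math

# A made-up data set, just for illustration.
data = [4, 8, 15, 16, 23, 42]

mean = sum(data) / len(data)  # 18.0

# Step 1: square every deviation (no more pesky signs).
squared_devs = [(x - mean) ** 2 for x in data]

# Step 2: take the average of the squares.
mean_squared_dev = sum(squared_devs) / len(data)

# Step 3: take the square root to knock everything back down to size.
std_dev = math.sqrt(mean_squared_dev)
print(std_dev)
```

Python's standard library will happily do all three steps for you--`statistics.pstdev(data)` returns the same number--but walking through the gymnastics by hand once is worth it.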