Sunday, October 11, 2009

Useless Metrics and Scatter Plots

I was reading Lifehacker Friday during lunch and saw a story about  DataMasher seems like a good idea: simplify data from various government reports, split it up by state, let users specify a relationship (i.e. alcohol consumption / average temperature), and drop the results onto a choropleth map.  I actually like the site, the choropleth maps are very nicely done and there are some fun data sets that I might normally be too lazy to dig up.  I'm also a big fan of things that are free.

The problem with the site is that it allows you to save these "mash-ups" and you can see just how poorly most people understand statistical measures.  Apparently after miles-per-gallon swept the Statistical Measurement Awards back in 1920, people decided that the best way to find the statistical relationship between any two variables is to divide one by the other.  Always.  The result is a bunch of super informative "mash-ups" like, oh,  # of Deaths Due to Motor Vehicle Accidents per 100k divided by Alcohol Consumption/ Binge Drinkers.

Motor accidents divided by binge drinkers displayed by state.  Hmm.  What is this supposed to tell me?  That drunk drivers in Tennessee are that much more effective killing machines than the ones here in Washington?  That people in Utah drink less?  No, it can't tell you anything, and every single mash up on the site is like this!  People chose two variables that are strongly correlated, divided one by the other, and created a completely useless metric.

It would be like creating a map that shows number of human eyeballs per state divided by number of humans per state.  In a perfect world, states would start with a value of 2 in the very center of the country but approach 1 the closer they got to the ocean due to the increasing number of pirates.  In reality though, every state will have a value of almost exactly 2 and my map will be an uninformative, one-color piece of suck.

There are extremely few useful charts in the world, and almost all of them are a minor variation on bar/column charts, line charts, and scatter plots.  This isn't a rule or anything, but it's a good starting point:  when comparing between different categories use bar/column charts, when comparing over time use line charts, and when comparing two variables use a scatter plot.

Scatterplots are perfect for almost all of the data sets people combined on  The only trouble is that 50 states is a lot of points and you can't label XY points in excel so you're out of luck if you want people to know which states are which.  The solution?

By far my favorite, favorite, favorite add on for Excel is, well, Sparklines, but a close second would be John Walkenbach's Chart Tools or, alternately, the XY Chart Labeler add in.  Labeling points in an XY chart is awesome.  If you haven't tried it, you don't know what you're missing.  You can use it to label points, create a fake scale on an axis... well, I'm sure there are a lot of other great things, but those are the only two I can think of right now.

The great thing about using this for states is that you can reduce the state down to the postal abbreviation, center the data label, and then delete the point marker.  This cleans up the chart quite a bit and is similar to the idea behind a stem-and-leaf plot: using the values as the visual element of the chart.

To add the data labels, first you need to install one of the chart utilities I mentioned above.  This example will use JWalk Chart Tools.  First, I like to make my points invisible, but you can do this in either order.  Double click on the points, go to the Patterns tab, and set both lines and points to 'None'.

After you hit okay, you'll have some invisible points.  If you ever lose these and want to edit them again, just use the arrow keys to move around the different chart elements and you'll eventually find them.  After that, hit Shift+F10 and the right-click menu for that object will pop up.  I use this more often for axes that I've deleted and want to rescale.

After you right click on the points, select JWalk Chart Tools from the bottom of the menu.  Select the Data Labels tab, choose the series you want to label from the list, and then drag the selection over the range of state postal code abbreviations.  Leave the box checked for Create Links to Label Cells.

Now you'll want to double click the data labels and center them so they aren't off to the right of your actual points.  You'll likely end up with a few points majorly overlapping.  The main point of a scatter plot is to show relationships, not exact values, so don't be afraid to jiggle the data to make it legible.  People are probably going to be more interested in the trend line than the individual points in any event.  Scatterplots always need trendlines.  You can add this by right clicking on your points and selecting Add Trendline.

That was a nice looking chart.  I'm still figuring out this whole blogging thing.  The default image hosting in Blogger is giving me fits.

One thing I didn't mention in all of this is that the original data I pulled from only had full state names not the postal abbreviations.  I'd like to think I know all of the postal codes, but I send maybe 5 letters in an entire year and all of those go to the same 3 states.  And even if I did remember them all correctly, it's a pain typing out 50 postal codes.  That's probably over 100 letters!

So, I wrote my own Excel formula that exchanges state names with postal abbreviations and vice versa.  That post will be later this week.


No comments:

Post a Comment

Creative Commons License
David @ Work by David Montgomery is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.