Sunday, May 18, 2008

losing the plot

In the previous post I mentioned the results of a couple of tests on memorisation of articles in a language class. A reader pointed out that the plots given separately could be combined in a bivariate plot. Too true. This was weighing on my mind even as I posted the simple pair of plots in the last post.

The results, I remind you, were:

Test 1: 2 6 6 7 8 8 9 9 13 14 14 15 15 16 16 17 18 19
Test 2: 18 20 20 20 20 18 20 19 19 18 17 20 20 19 20 20 19 19

Since the data are currently in Excel, I run them through the chart wizard to generate a scatter plot and come up with:

which is not at ALL what I want. Why does the y axis start at 16.5? Why is it broken down in increments of .5, when the number of correct answers was always an integer? I want a chart that shows the area that's blank because NOBODY got a second score below 17.

The Excel Chart Wizard does not offer the option of customising the y axis. Since I always expect the worst of Excel, I assume there is nothing to be done. In my hour of shame, I come up with the dodgy solution of, ahem, adding a dummy set of results at 0,0. This produces


which is an improvement, but I am, needless to say, deeply mortified by the false result at the lower lefthand corner. I suddenly think: But what if I double-click on the y axis! Sure enough this brings up a dialogue box which lets me set the y axis to my own specifications, and...

Ha!

There's just one slight problem. I know this tiny database, which means I know that 9 people got 20 on the second test, whereas the line at 20 shows only 7 results. The chart has fallen victim to overplotting; the two people who got 6 on Test 1 and 20 on Test 2 have been collapsed into a single dot, as have the two who got 15 followed by 20. Excel has come through once, but I can't believe there's a way to jitter the plot points. I retreat cravenly to inserting bullet points by hand in the basic grid:

This is clunky, no doubt about it, but since I've been doing it all by hand it's easy to see the two people at 6 who got 20 and the two at 15 who got 20:


I then realise that I can achieve a similar result in the charts feature by tampering, yet again, with the data: if I replace the pairs (6,20; 6,20) with (5.8,20; 6.1,20) and use the same dodge on 15 I come up with

Good. Good. (I mention all this because Excel is what most readers are likely to have in the home; it's easy to assume that feeding data into a chart will generate a chart that displays all the data.)

At this point, needless to say, I do not feel happy about a chart that depends on fudging the data. I now do what I should have done in the first place, which is to take it all into R. How much better it would all look, I think, if I used Hadley Wickham's ggplot2 package!

So I put the data into R. Vanilla R produces a plot which throws up a y axis that starts at 17 and moves by increments of .5 to 20, which means it is necessary to rifle through much PDF documentation (which is, of course, why I did not take this very simple task to R in the first place). ylim produces the right axis but doesn't look very nice, so I load ggplot2 and get this

which is very pretty but has yet another y axis starting at 17 and going up to 20 in increments of .5. There passed a weary time, each tongue was parched and glazed each eye, in other words ylim does not do the business in ggplot2, some other method of tinkering is called for, I spend much time rifling through the documentation of ggplot2 both in PDF and at geom_point
(I knew this would happen) trying to work out what to do. Wickham's work is inspired not only by Tufte but by Lee Wilkinson's Grammar of Design, which means that the documentation discusses the rationale underlying the package, which is, of course, both interesting and admirable but unhelpful if you just want to know how to do in ggplot2 what ylim does in vanilla R. Finger in the page. geom_point does make it easy to jitter, so I try that out and get


which is actually not what I want at all, because I only want to jitter the four points where there is overlapping. I think there is a way to fix this (I think it is possible to select horizontal jitter), but how late it is, how late.

At this point, naturally, I begin to wonder whether it is not somewhat infra dig to put all this low-level milling about on display; how much better just to relegate it all to the drafts folder! Wait till I have worked through ggplot2 properly and at some later date post a series of handsome plots, drawing on a more interesting range of data sets, with an air of effortless ease. Yes.

(I revert to my paltry little Excel chart. Wouldn't it be better to have gridlines that divided the area in four? Would it be better if the numbers on the x axis were closer to the points, i.e. at the top of the plot?

Well, maybe. It's clear that about half the participants got under half the answers right on Test 1, and everyone got better than 75% right on Test 2, so that's quite nice. And it does look somewhat like a Smeg refrigerator into the bargain.)

One problem with writing novels is that you often find that there is some software somewhere that looks as though it might do some specific thing that you need for some particular chapter, which may well never be needed again. So you find yourself simultaneously at the embarrassing amateur stage of, who knows, maybe 10 or 15 different programs. So what you would really love to have is the literary equivalent of a director of photography - a technical advisor whose job it is to answer questions like 'How do I fix the axes in ggplot2?' But this is really at odds with the whole Weltanschauung of the publishing industry. But enough, enough.

I then think, but maybe it would be nice to see the two sets of data in a line plot. I am somewhat demoralised by my adventures with ggplot2, so I run them through Excel and get


which is, of course, hideous.

But also enlightening.

Participants got 3 minutes to learn the genders of 20 words. Pre-technique, half remembered fewer than half. Post-technique, half remembered 100%; all remembered 80% or better. ONE PERSON, who started with a score of 19, failed to raise the score. In a word unknown to the immortal bard, blimey.

I don't know how well they would have performed if they had been tested again after half an hour, or 5 hours, or 5 days; this is, one would have thought, an obvious question, but it was one that was not answered in the class.

Meanwhile, behind the scenes... I draft an e-mail to Hadley Wickham, pleading for help. I then realise that my dear dear friend Rafe Donahue, despite his exasperation with the sort of person who is seduced by pretty plots, is still my dear dear dear dear friend. I send an e-mail to my dear friend...

And meanwhile, what to my wondering eyes should appear, but a newsletter from Linotype celebrating the birthday of Adrian Frutiger. I mooch around the Linotype website, checking out the Akira Says column: the most recent essay is on Frutiger, but there are also essays on dashes (hyphens, en-dashes, em-dashes), small caps...

And I am OUTRAGED

because the thing is, when you see a book into print, an 'expert' will be given a month or so to go through the text to introduce 'correct' dashes, capitalisation and so on, which the author can then spend up to 6 months trying to remove

but the thing is, let's be sane. Fine-tuning the dashes and caps is never going to achieve significant improvement in the reader's grasp and retention of the text. When I say 'significant' I'm not poaching on statistical preserves, I'm talking about the kind of improvement displayed in a pair of tests on memorisation of gender. Text A gets 50% of readers, we'll say, getting things wrong 50% of the time; improved Text B gets 100% of readers getting things right 80% of the time or better. You're not going to see that, because, um, text has an extremely limited capacity to convey information in the first place. Whereas, of course, if you start with two sets of numbers


Test 1: 2 6 6 7 8 8 9 9 13 14 14 15 15 16 16 17 18 19
Test 2: 18 20 20 20 20 18 20 19 19 18 17 20 20 19 20 20 19 19

and convert them to some kind of graphic display (as above), you can dramatically improve your chances of conveying a pattern of change. And if the graphic display has the allure of a Smeg, it will dramatically improve the chances that the sort of reader who has hitherto loathed graphs will suddenly be downloading R, braving PDF documentation, collecting data on self, friends and relations for the sheer entertainment of turning it all into graphs.

The point being, if publishers hired statisticians instead of copy-editors and designers, so that the author spent a few months going over the text with a statistical expert instead of the sort of person who knows his en-dashes, it would still be a lot of work, but it would be worth it.

Meanwhile it's a dark, gloomy day.

11 comments:

Andrew Gelman said...

The trouble is that computer programs think your data are continuous. For our own convenience we wrote a little R function to classify data as continuous, positive, nonnegative, binary, integer-valued, etc. But really something like Excel should do this automatically. You'll also see things like y-axes that go from -10% to 110% . . . very distracting!

See here for my more systematic thoughts on the subject.

Anonymous said...

The axis problems described with Excel have been fixed in Excel 2007. It is simple to set the range on the plot as well as the major tick size.

Hadley Wickham said...

How were you trying to set the axes in ggplot2? qplot(Test1, Test2, ylim=c(0, 20), xlim=c(0, 20)) should do the trick (this is documented in ?qplot). Generally, if you want to adjust the axes, what you really want to do is change the scales - so scale_continuous would have also pointed you in the right direction.

I realise the documentation isn't the best, but it's improving and any comments are always appreciated. I hope to have the ggplot2 book finished by the end of summer, and that should be a big help when learning (I hope)

Chris said...

Reason number 4552 not to use Excel.

Jon Peltier said...

Excel has no special knowledge of your data. It doesn't know you want it to start at zero, it doesn't know these are discrete values, it doesn't even know you don't want duplicate points to overlap. It plots exactly the data you tell it to, without making any assumptions.

A good habit to get into is to keep your data in one range, then if you find you need to modify values (i.e., jitter), you should make a range that inks to the original data, and modify the link formulas so that you can add or subtract minor values to accomplish the jittering, or to apply an offset to subsequent series to avoid overlapping.

Despite Kevin's comments, the "problems" with Excel's axes in 2003 were with your approach, and therefore did not need fixing in 2007. In fact, most experienced users find 2007 more cumbersome than prior versions.

Andrew Gelman said...

Jon,

Given that Word can spell- and grammar-check, it might not be too much to ask Excel, by default, to check to see if variables are integer-valued.

Andrew Gelman said...

P.S. Regarding your comments about copy-editing, I'll refer you to my comments on the pinch-hitter syndrome.

Helen DeWitt said...

Well, as I think was clear from the post, problems that initially looked like problems with Excel turned out to arise only from the user's ignorance. So someone might reasonably say it's a poor workman blames his tools.

I know Excel can do all kinds of extremely useful things. The problem is... if you look through the possible chart types (this is Excel 2004 for Mac) you can have a choice of cones, pyramids, or cylinders, you can have an anodized pie, a 3d anodized pie, a 3d glass pie, 3D plexi columns, 3D plexi area blocks... but you can't do a box plot. In previous incarnations of Excel you had the cones, pyramids and cylinders but you did not have the anodized pies etc. So the thing is, this sort of improvement seemed so frivolous, it left me assuming the worst if some functional aspect of a chart needed fixing. I don't think this is particularly rational, just the mark of a somewhat morose outlook.

HW, thank you SO much, this works perfectly. I know you invite comments on the documentation, but frankly I was not sufficiently convinced that it was the documentation that was at fault.

JP, I know Excel doesn't 'know' that I want to start the axis at 0, for instance; precisely because it doesn't 'know', it seemed odd that it did not offer the option of specifying this when one first set up the chart. Once one knows that one can adjust this after the fact there is, of course, no problem. I feel sure your suggestions are sensible ones.

AH, great post on pinch-hitters.

Anonymous said...

I remember back in my days of doing calculus homework that we would be assigned about 100 problems a night, say, 25 on differentiation of sums, 25 on the product rule, 25 on the quotient rule, and 25 on the chain rule.

I learned that if I could do essentially the first and the last one of each section, doing the ones in between really added no value.

But I would hear people say, "Hmmm, I have trouble with the chain rule one but the product rule is easy, so I'll just do the product rule problems."

So they would feel really happy that they could do 25 product rule problems and not understand a thing about the chain rule when what they should have been working on was chain rule problems: they only solved the problems they knew how to solve. It looked like they were making progress, although nothing was farther than the truth.

The tie-in?

I just opened my copy of Microsoft Word. I have a choice of 241 fonts. Of course, that includes some bold and some italic and some bold-italic versions, so I might be inflating that by a factor of four --- so call it 60. I work for a small company and all I do is write technical reports. Why do I need 60 fonts?

The answer is that I don't need 60 fonts (I probably need about 3) but that adding another font looks like progress, although it really is not, just like doing the math homework.

It is easy to add a font. So the software company adds a font. Of course, it does not do ligatures or hanging punctuation, but I have 60 fonts.

What does all this mean? I think it means that different tasks demand different tools. Excel is fine for tracking playing minutes for my sons' basketball team or tracking the proceeds for the church bake sale; Word is fine for the Christmas letter to Grandma or for the 7th grade term paper.

But serious work demands serious tools. (See the discussion of PowerPoint centimeters: PowerPoint centimeters.) For the easy stuff, it is simpler to use Excel and Word (and for a great majority of the population, this is fine!). But for the serious stuff, I use SAS or R or SPSS or InDesign or TeX or LaTeX, because it can do what I want it to do.

Time to do some more calculus...

Will Grizzly said...

Am I the only one who wants to hear more stories about your horrifying experiences with copy-editors?

For the record, I am a big fan of em-dashes, and I think that their correct usage can prevent a lot of unnecessary ambiguity in the text. On the other hand, I have no problem with people using dumb quotes, or the prime symbol in place of an apostrophe, as these errors almost never introduce ambiguity as to what the author was intending to say.

Helen DeWitt said...

Anon, yes, why do we hear so little about font rage?

WG, Mark Liberman of Language Log very kindly offered to let me write a guest post on the subject, with the result, naturally, that I have now started up a virtual drafts folder for Language Log in addition to the actual drafts folder of paperpools...