In the previous post I mentioned the results of a couple of tests on memorisation of articles in a language class. A reader pointed out that the plots given separately could be combined in a bivariate plot. Too true. This was weighing on my mind even as I posted the simple pair of plots in the last post.
The results, I remind you, were:
Test 1: 2 6 6 7 8 8 9 9 13 14 14 15 15 16 16 17 18 19
Test 2: 18 20 20 20 20 18 20 19 19 18 17 20 20 19 20 20 19 19
Since the data are currently in Excel, I run them through the chart wizard to generate a scatter plot and come up with:

which is not at ALL what I want. Why does the y axis start at 16.5? Why is it broken down in increments of .5, when the number of correct answers was always an integer? I want a chart that shows the area that's blank because NOBODY got a second score below 17.
The Excel Chart Wizard does not offer the option of customising the y axis. Since I always expect the worst of Excel, I assume there is nothing to be done. In my hour of shame, I come up with the dodgy solution of, ahem, adding a dummy set of results at 0,0. This produces

which is an improvement, but I am, needless to say, deeply mortified by the false result at the lower lefthand corner. I suddenly think: But what if I double-click on the y axis! Sure enough this brings up a dialogue box which lets me set the y axis to my own specifications, and...

Ha!
There's just one slight problem. I know this tiny database, which means I know that 9 people got 20 on the second test, whereas the line at 20 shows only 7 results. The chart has fallen victim to overplotting; the two people who got 6 on Test 1 and 20 on Test 2 have been collapsed into a single dot, as have the two who got 15 followed by 20. Excel has come through once, but I can't believe there's a way to jitter the plot points. I retreat cravenly to inserting bullet points by hand in the basic grid:

This is clunky, no doubt about it, but since I've been doing it all by hand it's easy to see the two people at 6 who got 20 and the two at 15 who got 20:

I then realise that I can achieve a similar result in the charts feature by tampering, yet again, with the data: if I replace the pairs (6,20; 6,20) with (5.8,20; 6.1,20) and use the same dodge on 15 I come up with

Good. Good. (I mention all this because Excel is what most readers are likely to have in the home; it's easy to assume that feeding data into a chart will generate a chart that displays all the data.)
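One way to check whether a chart can be showing every participant is to count how often each (Test 1, Test 2) pair occurs before plotting. What follows is a minimal sketch in base R, not the method used in the post; the names `scores`, `pair_counts` and `overplotted` are illustrative.

```r
# Scores as listed above; one row per participant.
scores <- data.frame(
  test1 = c(2, 6, 6, 7, 8, 8, 9, 9, 13, 14, 14, 15, 15, 16, 16, 17, 18, 19),
  test2 = c(18, 20, 20, 20, 20, 18, 20, 19, 19, 18, 17, 20, 20, 19, 20, 20, 19, 19)
)

# Count the participants sitting at each distinct (test1, test2) position.
pair_counts <- aggregate(n ~ test1 + test2, data = cbind(scores, n = 1), FUN = sum)

# Any position with more than one participant is drawn as a single dot.
overplotted <- pair_counts[pair_counts$n > 1, ]
print(overplotted)

# Participants versus dots actually visible on an unjittered scatter plot.
cat(nrow(scores), "participants,", nrow(pair_counts), "distinct points\n")
```

For these data the check reports two doubled positions, (6, 20) and (15, 20), which accounts for the two dots missing from the line at 20.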
At this point, needless to say, I do not feel happy about a chart that depends on fudging the data. I now do what I should have done in the first place, which is to take it all into R. How much better it would all look, I think, if I used Hadley Wickham's ggplot2 package!
So I put the data into R. Vanilla R produces a plot which throws up a y axis that starts at 17 and moves by increments of .5 to 20, which means it is necessary to rifle through much PDF documentation (which is, of course, why I did not take this very simple task to R in the first place). ylim produces the right axis but doesn't look very nice, so I load ggplot2 and get this

which is very pretty but has yet another y axis starting at 17 and going up to 20 in increments of .5. There passed a weary time, each tongue was parched and glazed each eye, in other words ylim does not do the business in ggplot2, some other method of tinkering is called for, I spend much time rifling through the documentation of ggplot2 both in PDF and at geom_point (I knew this would happen) trying to work out what to do. Wickham's work is inspired not only by Tufte but by Lee Wilkinson's Grammar of Graphics, which means that the documentation discusses the rationale underlying the package, which is, of course, both interesting and admirable but unhelpful if you just want to know how to do in ggplot2 what ylim does in vanilla R. Finger in the page. geom_point does make it easy to jitter, so I try that out and get

which is actually not what I want at all, because I only want to jitter the four points where there is overlapping. I think there is a way to fix this (I think it is possible to select horizontal jitter), but how late it is, how late.
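One possible way to move only the overlapping points, and only sideways, is to work out the offsets in the data and then plot the adjusted x values. This is a sketch under that assumption, written against ggplot2 as currently documented (the post does not say which version was in use); `x_plot`, `copy`, `total` and `spread` are illustrative names, and the 0.3 gap is an arbitrary choice.

```r
library(ggplot2)

scores <- data.frame(
  test1 = c(2, 6, 6, 7, 8, 8, 9, 9, 13, 14, 14, 15, 15, 16, 16, 17, 18, 19),
  test2 = c(18, 20, 20, 20, 20, 18, 20, 19, 19, 18, 17, 20, 20, 19, 20, 20, 19, 19)
)

# One key per (test1, test2) position.
key <- paste(scores$test1, scores$test2)

# Within each position: which copy is this row, and how many copies are there?
copy  <- ave(seq_along(key), key, FUN = seq_along)
total <- ave(seq_along(key), key, FUN = length)

# Spread the copies symmetrically around the true score.
# A point with no twin gets an offset of exactly 0.
spread <- 0.3  # assumed gap between overlapping dots, in score units
scores$x_plot <- scores$test1 + (copy - (total + 1) / 2) * spread

p <- ggplot(scores, aes(x = x_plot, y = test2)) +
  geom_point() +
  labs(x = "Test 1", y = "Test 2")
print(p)
```

Only the four overlapping points move, and the y values are untouched. If jittering every point horizontally were acceptable, `geom_point(position = position_jitter(width = 0.2, height = 0))` would do that without altering the data frame.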
At this point, naturally, I begin to wonder whether it is not somewhat infra dig to put all this low-level milling about on display; how much better just to relegate it all to the drafts folder! Wait till I have worked through ggplot2 properly and at some later date post a series of handsome plots, drawing on a more interesting range of data sets, with an air of effortless ease. Yes.
(I revert to my paltry little Excel chart. Wouldn't it be better to have gridlines that divided the area in four? Would it be better if the numbers on the x axis were closer to the points, i.e. at the top of the plot?

Well, maybe. It's clear that about half the participants got under half the answers right on Test 1, and everyone got better than 75% right on Test 2, so that's quite nice. And it does look somewhat like a Smeg refrigerator into the bargain.)
One problem with writing novels is that you often find that there is some software somewhere that looks as though it might do some specific thing that you need for some particular chapter, which may well never be needed again. So you find yourself simultaneously at the embarrassing amateur stage of, who knows, maybe 10 or 15 different programs. So what you would really love to have is the literary equivalent of a director of photography - a technical advisor whose job it is to answer questions like 'How do I fix the axes in ggplot2?' But this is really at odds with the whole Weltanschauung of the publishing industry. But enough, enough.
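For what it is worth, one answer to the axes question in current ggplot2 releases (which may differ from the version available when this was written) is to set limits and breaks on the scales. A hedged sketch, with the 0 to 20 range and the breaks every 5 chosen here for illustration:

```r
library(ggplot2)

scores <- data.frame(
  test1 = c(2, 6, 6, 7, 8, 8, 9, 9, 13, 14, 14, 15, 15, 16, 16, 17, 18, 19),
  test2 = c(18, 20, 20, 20, 20, 18, 20, 19, 19, 18, 17, 20, 20, 19, 20, 20, 19, 19)
)

p <- ggplot(scores, aes(x = test1, y = test2)) +
  geom_point() +
  # Pin both axes to the full range of possible scores, with whole-number breaks.
  scale_x_continuous(limits = c(0, 20), breaks = seq(0, 20, by = 5)) +
  scale_y_continuous(limits = c(0, 20), breaks = seq(0, 20, by = 5)) +
  labs(x = "Test 1", y = "Test 2")
print(p)

# The base R counterpart, for comparison: xlim and ylim set the ranges directly.
plot(scores$test1, scores$test2, xlim = c(0, 20), ylim = c(0, 20),
     xlab = "Test 1", ylab = "Test 2")
```

Scale limits discard any data falling outside them, which does not matter here because every score lies between 0 and 20; `coord_cartesian(ylim = c(0, 20))` zooms the view instead and leaves the data alone.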
I then think, but maybe it would be nice to see the two sets of data in a line plot. I am somewhat demoralised by my adventures with ggplot2, so I run them through Excel and get

which is, of course, hideous.
But also enlightening.
Participants got 3 minutes to learn the genders of 20 words. Pre-technique, nearly half (8 of 18) remembered fewer than half. Post-technique, half remembered 100%; all remembered 80% or better. ONE PERSON, who started with a score of 19, failed to raise the score. In a word unknown to the immortal bard, blimey.
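These figures can be checked directly against the two lists of scores. A short base R sketch; the variable names are illustrative, and `n_words` is the 20 words mentioned above.

```r
test1 <- c(2, 6, 6, 7, 8, 8, 9, 9, 13, 14, 14, 15, 15, 16, 16, 17, 18, 19)
test2 <- c(18, 20, 20, 20, 20, 18, 20, 19, 19, 18, 17, 20, 20, 19, 20, 20, 19, 19)
n_words <- 20  # words to memorise, per the text

# Participants scoring under half on the first test.
under_half <- sum(test1 < n_words / 2)

# Participants scoring 100% on the second test.
perfect <- sum(test2 == n_words)

# Lowest second score, as a share of the maximum.
worst_share <- min(test2) / n_words

# Participants whose score did not rise between the tests.
no_gain <- which(test2 <= test1)

cat(under_half, "of", length(test1), "scored under half on Test 1\n")
cat(perfect, "of", length(test2), "scored 100% on Test 2\n")
cat("lowest Test 2 score:", worst_share * 100, "%\n")
cat("participants with no gain:", length(no_gain), "\n")
```

On these data that gives 8 of 18 under half on Test 1 (a little short of half), 9 of 18 perfect on Test 2, a lowest second score of 85%, and a single participant, the one who went from 19 to 19, with no gain.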
I don't know how well they would have performed if they had been tested again after half an hour, or 5 hours, or 5 days; this is, one would have thought, an obvious question, but it was one that was not answered in the class.
Meanwhile, behind the scenes... I draft an e-mail to Hadley Wickham, pleading for help. I then realise that my dear dear friend Rafe Donahue, despite his exasperation with the sort of person who is seduced by pretty plots, is still my dear dear dear dear friend. I send an e-mail to my dear friend...
And meanwhile, what to my wondering eyes should appear, but a newsletter from Linotype celebrating the birthday of Adrian Frutiger. I mooch around the Linotype website, checking out the Akira Says column: the most recent essay is on Frutiger, but there are also essays on dashes (hyphens, en-dashes, em-dashes), small caps...
And I am OUTRAGED because the thing is, when you see a book into print, an 'expert' will be given a month or so to go through the text to introduce 'correct' dashes, capitalisation and so on, which the author can then spend up to 6 months trying to remove
but the thing is, let's be sane. Fine-tuning the dashes and caps is never going to achieve significant improvement in the reader's grasp and retention of the text. When I say 'significant' I'm not poaching on statistical preserves, I'm talking about the kind of improvement displayed in a pair of tests on memorisation of gender. Text A gets 50% of readers, we'll say, getting things wrong 50% of the time; improved Text B gets 100% of readers getting things right 80% of the time or better. You're not going to see that, because, um, text has an extremely limited capacity to convey information in the first place. Whereas, of course, if you start with two sets of numbers
Test 1: 2 6 6 7 8 8 9 9 13 14 14 15 15 16 16 17 18 19
Test 2: 18 20 20 20 20 18 20 19 19 18 17 20 20 19 20 20 19 19
and convert them to some kind of graphic display (as above), you can dramatically improve your chances of conveying a pattern of change. And if the graphic display has the allure of a Smeg, it will dramatically improve the chances that the sort of reader who has hitherto loathed graphs will suddenly be downloading R, braving PDF documentation, collecting data on self, friends and relations for the sheer entertainment of turning it all into graphs.
The point being, if publishers hired statisticians instead of copy-editors and designers, so that the author spent a few months going over the text with a statistical expert instead of the sort of person who knows his en-dashes, it would still be a lot of work, but it would be worth it.
Meanwhile it's a dark, gloomy day.