Got this e-mail from Rafe Donahue, a biostatistician at Vanderbilt University:
Ok, so one day I was looking at heart rate data. You go to the doctor and the technician takes your pulse. You sit still for 15 seconds and they count. Then they multiply by 4.
Or they can count for 30 and multiply by two.
Or they can just count for 60 and then there is no difficult math involved.
Heck, you could count for 2 minutes and *divide* by two.
When I was a little kid, I convinced myself, before I knew anything about statistics and probability, that you could count for 1 second and multiply by 60 and then get a pulse of 0 or 60 or 120 and then if you did a weighted average of a bunch of single seconds, it would all work out! A child prodigy I was; then I grew up and look what happened.
So I am looking at some heart rate data and I decide to draw the histogram and look: there are little spikes at 48 and 52 and 56 and 60 and 64 ... and smaller spikes at 50 and 54 and 58 and 62 ... and very few readings at odd numbers. So of course different places where the pulses are taken use different counting schema!
In fact, if you draw the histogram for the individual sites, you can see which ones did what! Goodness, who ever thought one would need to standardize _taking a pulse!_
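To see why the counting window leaves fingerprints in the histogram, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the function names `recorded_pulse` and `simulate_site`, the Normal(60, 6) distribution of true heart rates, and the idea that the count in a window is the expected number of beats rounded down or up at random. None of it comes from Rafe's data.

```python
import random
from collections import Counter

def recorded_pulse(true_bpm, seconds, rng):
    """Count beats over `seconds` and scale the count up to beats per minute."""
    expected_beats = true_bpm * seconds / 60.0
    # Random phase: the count is the floor or the ceiling of the expected value.
    beats = int(expected_beats + rng.random())
    return beats * (60 // seconds)

def simulate_site(seconds, n=2000, seed=0):
    """Histogram of recorded pulses at a hypothetical site using one counting window."""
    rng = random.Random(seed)
    # Assumed true heart rates: Normal with mean 60 and SD 6 (illustrative only).
    return Counter(recorded_pulse(rng.gauss(60, 6), seconds, rng) for _ in range(n))

if __name__ == "__main__":
    for seconds in (15, 30, 60):
        hist = simulate_site(seconds)
        print(seconds, sorted(hist.items()))
```

A 15-second site can only record multiples of 4, a 30-second site only multiples of 2, and a 60-second site any integer. Pool the three and the multiples of 4 collect readings from every site, the other even numbers from two, and the odd numbers from one, which is the spike pattern described above.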
Then someone I know sends me the attached picture. These are diastolic blood pressure readings from a clinical trial. These are the baseline values. There are something like 6000 readings total; it is a big trial. The guy who sent the plot added the smoothed density estimate.
At the end of the trial, the dbp values will be examined, probably by doing some t tests. And the assumptions will be that the data come from a Normal, or Gaussian, distribution. Ha!
So, what will be the impact of that digit preference? I'm not sure, but I know that if the rounding is not symmetric relative to the original distribution, there will be bias. In fact, we will probably be able to show that one can make a treatment difference arbitrarily big or small by choosing a suitable rounding scheme.
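A toy example of how rounding can distort a treatment difference, offered as a sketch rather than as anything from the trial. The readings are made up, `round_to` and `mean_difference` are hypothetical helper names, and the scheme (round every reading to the nearest 10) is just one assumed form of digit preference.

```python
from statistics import mean

def round_to(value, step):
    """Round to the nearest multiple of `step` (halves round up)."""
    return step * int((value + step / 2) // step)

def mean_difference(treated, control, step=None):
    """Difference in means, optionally after rounding every reading to `step`."""
    if step is not None:
        treated = [round_to(v, step) for v in treated]
        control = [round_to(v, step) for v in control]
    return mean(treated) - mean(control)

# Made-up readings (mmHg) that sit on either side of a rounding boundary.
control_a, treated_a = [84, 84, 84], [86, 86, 86]
# Made-up readings that fall into the same rounding bin.
control_b, treated_b = [86, 86, 86], [94, 94, 94]

print(mean_difference(treated_a, control_a))      # true difference: 2
print(mean_difference(treated_a, control_a, 10))  # after rounding to 10s: 10
print(mean_difference(treated_b, control_b))      # true difference: 8
print(mean_difference(treated_b, control_b, 10))  # after rounding to 10s: 0
```

The same rounding rule inflates a difference of 2 to 10 in the first pair and shrinks a difference of 8 to 0 in the second. With real, spread-out data the effect would be smaller than in this contrived case, but the direction depends on where the distributions sit relative to the rounding boundaries.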
Go figure. So much for Normal data.
Here's the graph:
I ask Rafe if I can post this on pp and he says Sure. He comments that it is real world clinical data, but it's better not to name the pharmaceutical company ("although there is no doubt that they all look like this"), adding:
We need to make sure that the point is that the data are funky; no one is _trying_ to use them to be deceitful. But when you actually look at the data, sometimes things look different from what you might expect. And the downstream implications are pretty much unknown.
Oh, and in other news, you can read some news about the lottery in TN. They switched from a physical machine with numbered balls to a computer system. Naturally someone screwed up the programming and no one noticed. Then someone noticed something was seemingly goofy but they didn't know what to really check; they didn't know how to do the probability computations and, to make it worse, they didn't know that they didn't know how to do the probability computations. Here is the link where they asked me about the probabilities: