Sunday, July 29, 2007

Bivariate Baseball Plot

Rafe Donahue, a biostatistician at Vanderbilt University, has sent me a link to an interactive website that uses the statistical graphics program R to produce a bivariate baseball plot. Devised in collaboration with Tatsuki Koyama, Jeffrey Horner and Cole Beck (as Rafe has pointed out in the comments), it works like this:

The user selects the team and year in which s/he is interested, then goes on to select from: Day of the Week, Opponent Team, Opponent League, Day/Night, Starting Pitcher, Opponent Starting Pitcher, Home/Away, Pitcher with Decision, Opponent Pitcher with Decision, Month, or First/Second Half. (I know readers have seen drop-down menus before, but they are not usually this much fun.)

R then produces a bivariate plot displaying the results:

As you'll have noticed from the menus, you can then print out your graphic as a PDF.

The Baseball Scoreplot blog explains how to read a baseball bivariate score plot, discusses known issues and analyses the graphic Rafe generated for the Astros, with Roger Clemens as starting pitcher:

The Astros’ opponents’ marginal distribution (on the left) shows how teams fare against teams that beat them: their average rpg is just over 3.5 rpg compared with nearly 4.5 rpg for the Astros. Where the Astros were held to 1 run 27 times, their opponents were held to 1 or fewer on 42 occasions. Note that Clemens started 2 games that were shutouts and started 11 games where the opponents were held to fewer than 2 runs. He also started a game where the opponents scored 9 runs.

The joint distribution reveals details of Clemens’ abysmal run support. The bottom-left corner of the distribution shows five games which Clemens started in which the Astros lost 1-0, a pitcher’s nightmare. So, of the 11 games that Clemens started and the opponents were held to one run, 5 of those games failed to produce a single Houston run. In fact, Clemens was the only Astros pitcher to start a game in which the team lost 1-0.
(Graphic available on blog.)
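
For anyone wondering what sits behind a plot like this, here is a minimal sketch in Python (not the R code Tatsuki wrote, which I have not seen) of how joint and marginal score distributions could be tabulated. The `games` list, the function names and every score in it are invented purely for illustration; they are not real Astros results.

```python
from collections import Counter

# Hypothetical (team_runs, opponent_runs) pairs, one per game.
# Invented for illustration only; NOT real Astros data.
games = [(0, 1), (0, 1), (3, 2), (5, 1), (4, 9), (1, 0), (3, 2), (2, 0)]


def joint_distribution(games):
    """Count how often each exact final score (team, opponent) occurred."""
    return Counter(games)


def marginal(games, index):
    """Count games by runs for one side: index 0 = team, 1 = opponent."""
    return Counter(score[index] for score in games)


joint = joint_distribution(games)
team_marginal = marginal(games, 0)
opp_marginal = marginal(games, 1)

# One cell of the joint distribution: games the team lost 1-0.
losses_1_0 = joint[(0, 1)]

# A slice of the opponents' marginal: games where they scored one run or fewer.
opp_held_to_one = sum(n for runs, n in opp_marginal.items() if runs <= 1)

print("1-0 losses:", losses_1_0)
print("opponents held to one run or fewer:", opp_held_to_one)
```

The joint counts correspond to the cells in the body of the plot and the two marginals to the distributions along its edges; filtering `games` by starting pitcher before counting would give the kind of per-pitcher breakdown quoted above.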

We never see this kind of thing in fiction.


rafe donahue said...

An opportunity to make sure that credit gets passed around: the real work-horses on this project are Tatsuki, Jeff, and Cole --- I'm just the idea guy, the one willing to make the trip to SLC and make the presentation sporting a tie.

Tatsuki did 99% of the coding to draw the plots. (There are even more options lurking behind the scenes; we constantly debate which ones we should let rank amateur data displayists control and which we should keep to satisfy our appetite for control and eventual world domination. [Insert Pinky and The Brain theme here.]) Cole did the majority of the database work, getting the data from infosheet into something that Tatsuki's code could use. And Jeff is responsible for the rapache interface thing (I really have no idea what it does) and the website functionality with the self-adjusting pulldown lists. I just sit back and think about how data should look; they did all the work; they should get more credit. Send them a candy bar.

(Oh, and a subtlety that we Irishmen like to bicker about: my 'Donahue' has an 'a' in the middle instead of an 'o'. I have no idea which way is right, except that that's how it is on my birth certificate...)

Helen DeWitt said...

Rafe. Argh. I do apologise. I have now fixed these omissions and errors in the post (so your comment will not make much sense to latecomers) and in the sidebar.