On the Sixth Day, God Created... Runs?

Dave Paisley

For quite a while now, I've been puzzling over this whole Bill James "Runs Created" stat. If you've been following my recent columns, you'll know it's cropped up each of the last two weeks. Now, I'm a big fan of Bill James in general -- I think he's done a huge amount to advance the quality of analysis of baseball, and forced a lot more critical thinking about the game (except in most GMs' offices, of course, and most notably in Baltimore and Seattle). So there's little doubt that James is a good guy. The one thing I've never quite got, though, is this Runs Created (RC) thing.

First, it's horrendously complicated. Oh, not that it involves differential calculus or anything, but it's impossible to do a few mental calculations to synthesize the required numbers. So it's not like you can be sitting at the ballpark, thumbing through a stat book or Baseball Weekly and roughly figure out Runs Created. You need a spreadsheet, and you need a lot of data, even for the simple version.

Second, while it's called Runs Created, it's just a made-up number. It's designed to be roughly comparable in magnitude to Runs Scored, but it tends to overpredict that by quite a wide margin (about 6%). When you get right down to it, RC is all about predicting runs scored due to batting skill. So runs that score on wild pitches, passed balls and fielding errors aren't really part of the model. In fact, a better comparison might be RBI, as an RBI is usually credited due to batting skill alone, and excludes many of the runs that score due to the fielding effects I just mentioned. Given that league RBI totals are about 6% lower than runs scored, it means that RC overpredicts the runs it's trying to predict by about 12%.
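To see where the "about 12%" comes from, here is a quick arithmetic sketch. It uses only the two 6% figures quoted above; the variable names are made up for illustration. The two gaps compound rather than simply add, so the ratio lands a shade under 13%, which is in line with the rounded figure in the text.

```python
# Ratios relative to actual runs scored, taken from the figures in the text.
rc_over_runs = 1.06    # RC runs about 6% above runs scored
rbi_over_runs = 0.94   # league RBI run about 6% below runs scored

# How far RC sits above RBI, the quantity it is arguably trying to predict.
rc_over_rbi = rc_over_runs / rbi_over_runs
overprediction_pct = (rc_over_rbi - 1) * 100

print(f"RC exceeds RBI by roughly {overprediction_pct:.1f}%")
```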

As many of you regulars will note, I tend to prefer something simpler, but still meaningful -- On Base Percentage plus Slugging Average (OBP+SLG), commonly known as OPS. OPS is pretty simple, yet provides most of the value of RC, or perhaps just as much.
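For anyone who wants to compute OPS from raw counting stats, here is a minimal sketch. It assumes the standard definitions of On Base Percentage (which also needs hit-by-pitch and sacrifice flies) and Slugging Average; the function names and the sample team line are hypothetical, not real data.

```python
def obp(h, bb, hbp, ab, sf):
    """On Base Percentage: times on base divided by plate appearances counted."""
    denom = ab + bb + hbp + sf
    return (h + bb + hbp) / denom if denom else 0.0


def slg(tb, ab):
    """Slugging Average: total bases per at-bat."""
    return tb / ab if ab else 0.0


def ops(h, bb, hbp, ab, sf, tb):
    """OPS is simply OBP plus SLG."""
    return obp(h, bb, hbp, ab, sf) + slg(tb, ab)


# Hypothetical team season line, for illustration only.
team_ops = ops(h=1500, bb=550, hbp=50, ab=5500, sf=50, tb=2400)
print(f"Team OPS: {team_ops:.3f}")
```

Both pieces are published in most stat books, which is why OPS can usually be done by adding two numbers in your head.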

To road test the two methods, I decided to do a comparison of correlations with 1998 team data. For Runs Created, I simply cranked out the numbers for each team and compared them to actual Runs Scored. For OPS, I took the team data for the 1996 and 1997 seasons and bashed out a linear regression to get the coefficients to use to predict 1998 team Runs Scored. I figured this was the fairest way to do it. Presumably, RC has been developed by testing against prior years' scoring, and it does undergo occasional tweaks to the formula. I figured if you're a baseball fan wanting to figure out the 1998 season numbers, you might estimate team OPS for 1998 and use the previous two seasons to estimate how many runs each team might score.
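The OPS half of that test can be sketched in a few lines. This is an illustration of an ordinary least-squares straight-line fit, not the actual spreadsheet behind the column; the function names are invented here and the training numbers are hypothetical stand-ins for the pooled 1996 and 1997 team data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_runs(team_ops, slope, intercept):
    """Predicted runs scored for a team with the given OPS."""
    return slope * team_ops + intercept


# Hypothetical (team OPS, runs scored) pairs, NOT real 1996-97 figures.
train_ops = [0.700, 0.740, 0.760, 0.800, 0.820]
train_runs = [680, 750, 790, 860, 900]

slope, intercept = fit_line(train_ops, train_runs)
predicted = predict_runs(0.780, slope, intercept)
print(f"Predicted runs for a .780 OPS team: {predicted:.0f}")
```

The point of fitting on earlier seasons and predicting a later one is that the coefficients never see the data they are judged against.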

Here are the results. The horizontal axis is actual Runs Scored, while the vertical axis shows the predicted runs from the two methods.

The blue diamonds are the RC predictions for each team, while the red triangles represent the OPS correlated predictions. Similarly, the blue line and red lines represent the best straight line fit through those data points. The green line is simply the 1:1 line where the prediction would exactly match runs scored. The closer you get to that line, the better.

It's pretty evident that both methods do a decent job of predicting, but the blue RC line is well above actual runs scored, the effect I mentioned above. The OPS-based prediction line is scarily close to right on, at least indicating that there isn't much difference in the characteristics of team run scoring from the 1996 and 1997 seasons to the 1998 season.

One final thing that might have occurred to you is that the RC data seems a little tighter. It is, slightly: the standard deviation on the predictions is 27 runs for the RC prediction, versus 30 for the OPS prediction. But that's not enough to say that there's any significant difference between them.
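One way to put numbers on "tighter" is to look at the prediction errors team by team. The sketch below assumes the standard deviation quoted above is the sample standard deviation of (predicted minus actual) runs, which is one reasonable reading but an assumption all the same; the function name and sample numbers are hypothetical.

```python
import math


def error_spread(predicted, actual):
    """Return (mean error, sample standard deviation of errors).

    The mean error captures systematic over- or underprediction;
    the standard deviation captures scatter around that bias.
    Needs at least two teams.
    """
    errors = [p - a for p, a in zip(predicted, actual)]
    n = len(errors)
    mean_error = sum(errors) / n
    variance = sum((e - mean_error) ** 2 for e in errors) / (n - 1)
    return mean_error, math.sqrt(variance)


# Hypothetical predictions and actual runs for three teams.
bias, spread = error_spread([810, 760, 905], [800, 750, 850])
print(f"bias: {bias:.1f} runs, spread: {spread:.1f} runs")
```

Separating the two numbers matters here: a method can sit well above the 1:1 line (large bias) and still have a small spread.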

This data all applies to aggregate team stats, which is only one of the situations where it might be useful. Next week I'll be taking a look at how these predictions work for individual players, including a more in-depth look at the relative merits of On Base Percentage and Slugging Average. As I step through this series of articles, I'd love to know what you think, so be sure to send me any interesting observations, and especially any areas you'd like me to take a look at.

RC definition (simple version):

A = H + BB - CS

B = TB + .52 * SB + .26 * BB

C = AB + BB

RC = A * B / C

H = Hits
BB = Walks
CS = Caught Stealing
TB = Total Bases
SB = Stolen Bases
AB = At-Bats
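
If you'd rather not build the spreadsheet, the simple version above translates directly into a short function. This is a minimal sketch of the A, B, C definition as listed; the labels in the comments are the usual informal descriptions of the three factors, and the sample team line is hypothetical, not real 1998 data.

```python
def runs_created(h, bb, cs, tb, sb, ab):
    """Simple-version Runs Created: RC = A * B / C."""
    a = h + bb - cs                  # A: getting on base
    b = tb + 0.52 * sb + 0.26 * bb   # B: advancing runners
    c = ab + bb                      # C: opportunities
    if c == 0:
        return 0.0                   # no plate appearances, no runs
    return a * b / c


# Hypothetical team season line, for illustration only.
example = runs_created(h=1500, bb=550, cs=50, tb=2400, sb=100, ab=5500)
print(f"Runs Created: {example:.0f}")
```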

about the author

With all this number crunching, Dave Paisley's computer is in severe danger of running out of 1's and 0's. Send him a few binary digits (or any other kind) at drdjp@strikethree.com.
