Behind the numbers


SI Staff | Feb 19, 2007


With football season over and news that pitchers and catchers have reported for spring training, the internet forums and blogs are abuzz with talk of how fans' favorite players and teams will do in the upcoming season. In a sense, February is "projection season" for the world of baseball fandom.

For your convenience, FanGraphs has compiled four different projection systems: the 2007 Bill James Handbook projections, Dan Szymborski's "ZiPS," Sean Smith's "CHONE," and the Marcel the Monkey Forecasting System. With all four in one place, you can easily see how the systems differ for any particular player. But first, let's explore some of the nuances of each system.

Tom Tango, the keeper of the Marcel the Monkey Forecasting System, has stated that the Marcels are "the minimum level of competence that you should expect from any forecaster." That's why they've been aptly named after a monkey. They simply use three years of weighted data, regression toward the mean and an age adjustment formula. You could, in fact, recalculate them yourself since the method is completely open source and available at tangotiger.net.

It should be noted that the Marcels only project players who have already played in the majors. Any player who hasn't played in the majors yet is assumed to have a league-average projection.
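For readers who want to see the three ingredients in motion, here is a minimal Python sketch of a Marcel-style projection for a single rate stat. It is not Tango's published code: the 5/4/3 weights, the 1,200 plate appearances of league-average ballast, the peak age of 29, and the per-year age steps are all assumptions chosen for illustration, and the names `Season` and `marcel_style_rate` are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Season:
    """One season of a rate stat: `rate` achieved over `pa` plate appearances."""
    rate: float
    pa: int


# Every constant here is an illustrative assumption, not a value from the article.
WEIGHTS = (5, 4, 3)                  # most recent season first
REGRESSION_PA = 1200                 # plate appearances of league-average ballast
PEAK_AGE = 29
YOUNG_STEP, OLD_STEP = 0.006, 0.003  # per-year age adjustment


def marcel_style_rate(seasons, league_rate, age):
    """Project a rate stat from up to three seasons, listed most recent first."""
    weighted_pa = sum(w * s.pa for w, s in zip(WEIGHTS, seasons))
    weighted_total = sum(w * s.pa * s.rate for w, s in zip(WEIGHTS, seasons))

    # Regression toward the mean: mix in a fixed dose of league-average play.
    regressed = (weighted_total + REGRESSION_PA * league_rate) / (
        weighted_pa + REGRESSION_PA
    )

    # Age adjustment: a small boost before the peak age, a small penalty after.
    if age < PEAK_AGE:
        factor = 1 + YOUNG_STEP * (PEAK_AGE - age)
    else:
        factor = 1 - OLD_STEP * (age - PEAK_AGE)
    return regressed * factor


# A hypothetical 29-year-old with three full seasons above a .330 league rate.
veteran = marcel_style_rate(
    [Season(0.400, 600), Season(0.380, 600), Season(0.360, 600)], 0.330, 29
)

# With no major-league seasons, only the league-average ballast remains.
rookie = marcel_style_rate([], 0.330, 29)
```

In this sketch the veteran's projection lands between his weighted average and the league rate, and a player with no big-league record falls back to the league rate before the age adjustment is applied.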

There are really two different projection systems at work in the Bill James Handbook: the batter projections and the pitcher projections. Bill James did the batter projections and wrote in the 2006 version of the Handbook: "What a player has done in the past, we predict he will do in the future, modified slightly by age, playing time and park effect."

As for the pitcher projections, James was not involved in any of them and in fact does not believe they can be done. They were instead done by the Baseball Info Solutions team consisting of John Dewan, Pat Quinn, and Damon Lichtenwalner.

To be brief, the pitching projections were based on the past eight years of a player's stats, with heavier emphasis on the past three years. Playing time was based on the pitcher's role in the last two months of the most recent season. Some age adjustment was used, and DIPS (Defense Independent Pitching Statistics) theory was taken into consideration. Minor-league stats were used with the help of Ron Shandler's Minor League Equivalency system. There are a lot of other minute details in the pitching system that you can read about in a rather entertaining and informative FAQ in the 2006 Bill James Handbook.

Sean Smith is the brains behind CHONE and often writes about his efforts on his site, Anaheim Angels All the Way. His batter projections are based on four years of weighted data, regression toward the mean, and custom age curves based on player type. Playing time is then adjusted to make things "reasonable."

In his pitching projections, he used batted-ball data (fly balls, groundballs, etc.) to predict batting average on balls in play and home runs. He used his own Major League Equivalency system for minor-league statistics. Playing time for pitchers is based on the five most likely starters and six most likely relievers, among whom innings pitched are then doled out.

Dan Szymborski of BaseballThinkFactory.org puts out the ZiPS projections annually. They're based on three or four years of weighted data, depending on a player's age, and he uses various "growth and decline" curves based on the type of player.

As Szymborski puts it: "I don't try to find particularly similar players but instead large groups with similar characteristics, such as K rate for pitchers, Speed Score for batters, BABIP [batting average on balls in play] for batters, handedness, and a lot of other stuff."

Pitching projections do take DIPS theory into account by not only regressing BABIP toward the mean but also by taking into account handedness, knuckleballs, and groundball-to-fly ball ratios.
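To make "regressing BABIP toward the mean" concrete, here is a minimal Python sketch of the general shrinkage idea. It is not the ZiPS method, which also weighs handedness, knuckleballs, and groundball-to-fly ball ratios; the function name `regress_toward_mean` and the `stabilizer` constant are hypothetical, and the numbers in the usage lines are made up.

```python
def regress_toward_mean(observed, sample_size, league_mean, stabilizer):
    """Shrink an observed rate toward the league mean.

    `stabilizer` is the number of league-average trials mixed in. It is a
    hypothetical tuning constant, not a value taken from any of the systems.
    """
    if sample_size < 0 or stabilizer <= 0:
        raise ValueError("sample_size must be >= 0 and stabilizer must be > 0")
    # The larger the sample, the more weight the observed rate earns.
    weight = sample_size / (sample_size + stabilizer)
    return weight * observed + (1 - weight) * league_mean


# Made-up pitcher: a .260 BABIP allowed over 500 balls in play, .300 league mean.
small_sample = regress_toward_mean(0.260, 500, 0.300, 2000)    # about .292
large_sample = regress_toward_mean(0.260, 8000, 0.300, 2000)   # about .268
```

The same observed .260 is pulled most of the way back to .300 when it rests on a small sample and only a little when it rests on a large one, which is why a pitcher's unusually low BABIP in one season tends to be treated with suspicion.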

It's worth noting that ZiPS does not attempt to project playing time. Of the four projection systems, it covers the most players, with 995 batters and 989 pitchers, many of whom have yet to play in the majors.

So with at least a decent idea of the workings behind each projection system and their differences, let's take a look at how similar they are by comparing how they project OPS and ERA for individual players. Using the Marcels as a baseline, since it's the simplest system of the four, and looking only at batters with at least 300 projected at-bats and pitchers with at least 100 projected innings, we see that the four systems are quite similar.

Despite their similarities as a whole, the four systems differ in a number of ways once you look at specific players. Here are the 10 batters on whom the systems agree least in terms of OPS:

It's interesting to see Ryan Howard atop the list. Everyone expects him to be very good, but there's certainly some question about how great he'll actually be. At least CHONE and ZiPS seem to think Carlos Delgado is in for a rather steep decline next year, while the other two think he'll keep chugging along at his .900+ OPS.

But the problem with comparing OPS across projection systems, especially for the fantasy baseball players out there, is that it takes the systems out of their own context, which is why it might be more helpful to rank the players by OPS and see how their rankings compare. Once again, here are the 10 players the systems disagree about most in terms of ranking by OPS.
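The difference between disagreeing on raw OPS and disagreeing on rank can be sketched in a few lines of Python. The article does not say how disagreement was measured, so the standard deviation across systems used here is an assumption, and every player name and OPS value in the table is invented purely for illustration.

```python
from statistics import pstdev

# Made-up OPS projections for made-up players; these are not real projections.
projections = {
    "Marcel":   {"Player A": 0.950, "Player B": 0.820, "Player C": 0.790, "Player D": 0.700},
    "Handbook": {"Player A": 1.050, "Player B": 0.800, "Player C": 0.810, "Player D": 0.710},
    "CHONE":    {"Player A": 0.980, "Player B": 0.790, "Player C": 0.815, "Player D": 0.705},
    "ZiPS":     {"Player A": 1.020, "Player B": 0.825, "Player C": 0.795, "Player D": 0.700},
}


def ranks(system):
    """Map each player to a rank within one system (1 = highest OPS)."""
    ordered = sorted(system, key=system.get, reverse=True)
    return {player: i for i, player in enumerate(ordered, start=1)}


def disagreement(values_by_system):
    """Standard deviation of each player's values across all systems."""
    # Only players projected by every system can be compared.
    players = set.intersection(*(set(v) for v in values_by_system.values()))
    return {
        p: pstdev([v[p] for v in values_by_system.values()]) for p in players
    }


ops_spread = disagreement(projections)
rank_spread = disagreement({name: ranks(s) for name, s in projections.items()})

# Players sorted from most to least disagreement, by each measure.
by_ops = sorted(ops_spread, key=ops_spread.get, reverse=True)
by_rank = sorted(rank_spread, key=rank_spread.get, reverse=True)
```

In this toy table, Player A draws the widest range of OPS projections yet ranks first in every system, so his rank disagreement is zero, while Players B and C have similar OPS projections but trade places in the rankings.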

A few of the same names show up on this list, like Dan Uggla, Andre Ethier, Luis Gonzalez, Esteban German, and Chris Duncan, but you'll notice it is devoid of Howard. Although the systems don't agree on his actual OPS, they all agree that he'll have one of the top OPSes in all of baseball:

Moving on to the pitchers and their projected ERA, you'll remember that overall there was a bit more disagreement between the systems for pitchers and ERA than there was with batters and OPS. Pitchers are quite a bit more unpredictable than batters. So let's see where the systems are in disagreement:

A few names stand out to me here, mainly Rich Hill and Curt Schilling. The Handbook takes the low road on both of them, with the Marcels not being quite so optimistic. With only one year of major-league playing time for Hill, there's going to be some heavy regression in the Marcel projection. The Handbook is also quite smitten with Clay Hensley. I can only imagine it has something to do with him playing in PETCO Park and his extreme groundball tendencies. Let's take a look at where the projections are in agreement:

It's interesting to see such agreement on Bronson Arroyo, considering he's had exactly one year with an ERA under 4.00. Obviously, C.C. Sabathia and Carlos Zambrano have much better long-term track records, and it shows in both of their projected ERAs. It's also fun to see that everyone agrees Kirk Saarloos and Gil Meche are nothing to write home about. But where are Johan Santana and Roger Clemens? Let's take a look at one last list, which shows the players the projection systems agree on in terms of ranking by ERA:

Santana is nearly a consensus number one. Jake Peavy and Roy Oswalt also make high appearances in each system, while Runelvys Hernandez, Tony Armas Jr., and Ramon Ortiz are not so highly thought of.

So whatever projection system you're inclined to use, regardless of whether it's actually better or worse than any of these four, it's probably similar as a whole, yet with its own nuances. A second opinion never hurts, and if you head over to FanGraphs, you'll find four other opinions at your disposal.
