I’ve had many baseball researchers ask me the “so-what” question about my Ph.D. work: how accurate are the new models you developed for SAFE (Spatial Aggregate Fielding Evaluation)? After taking some time away from the project (read: burning the USB drive dedicated to my thesis in effigy), I finally believe I have a way to answer that question. Best of all, the answer seems to suggest that the most sophisticated model we proposed beats out all the competition.

Backing up a step, my thesis, which is now a paper in JQAS, focused on incorporating time elements into the existing hierarchical Bayesian model for fielding evaluation developed by Shane Jensen, called SAFE. Much of the JQAS paper details the prediction performance of these new models. However, the prediction methods discussed in the paper, which we gauge using something we called predicted deviations, were exclusive to forecasting a single event; that is, estimating the probability that a single ball in play of a certain type is fielded by a player at a specific location and with a (nominal) velocity. All of the current (public) methods used to evaluate fielding (e.g. UZR, DRS, etc.) represent an aggregation over a set of discrete bins, where the continuous nature of the data is lost. This makes sense, as their purpose is to gauge, in a general sense, the actual outcomes a fielder produced on the field, not to mark their true ability.

That’s what led me to an impasse. While I could have used the MLE of a fielder’s out rate for a given ball in play^{1} to represent UZR in a prediction comparison, it doesn’t seem quite fair given the nature of UZR. Its “interest” is simply to estimate the number of runs above/below average that a fielder actually saved/cost their team at that position.

My new solution is to compare the SAFE models to UZR/DRS in the UZR/DRS realm: RMSE^{2} of total runs saved/cost through fielding a player’s position. Essentially, for every player that qualified^{3} during the span of data available^{4}, I used the posterior mean of every SAFE model developed in the paper^{5}, the SAFE value calculated using the MLE^{6}, UZR per 150 games, and DRS to predict UZR per 150 games for each pair of consecutive seasons. That is, I’m predicting “current season” UZR per 150 games, which I’m using as a proxy for observed outcome, by taking the previous season’s value of each comparable metric, including UZR itself. After that, I calculated the RMSE for each metric, listed in the table below. Note that I have sorted the table by model complexity, from simplest to most complex. Also, I refer to the SAFE models using shorthand: Original = 1-level Bayesian hierarchical model, CoT = Constant over Time, MAA = Moving Average Age, and Autoreg = Autoregressive Age.

| UZR/150 | DRS | MLE | Original | CoT | MAA | Autoreg |
|---|---|---|---|---|---|---|
| 11.24 | 11.66 | 10.54 | 10.75 | 10.40 | 10.51 | 9.71 |
| - | +3.7% | -6.2% | -4.4% | -7.5% | -6.5% | -13.6% |
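To make the comparison concrete, here is a minimal Python sketch of the RMSE calculation under the naive “previous season predicts current season” rule. The player values below are made up for illustration; the real calculation runs over every qualified player with consecutive seasons of data.

```python
import numpy as np

def rmse(predicted, observed):
    """Root Mean Squared Error between predicted and observed values."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return np.sqrt(np.mean((predicted - observed) ** 2))

# Hypothetical values: a metric's previous-season value serves as the
# naive prediction of each player's current-season UZR/150.
metric_prev = [5.2, -3.1, 0.8, 12.4]   # Metric(t-1), e.g. a SAFE posterior mean
uzr_current = [4.0, -1.5, 2.2, 9.8]    # UZR/150 in season t

print(round(rmse(metric_prev, uzr_current), 2))  # prints 1.78
```

A lower RMSE means the metric’s previous-season value tracked the next season’s UZR/150 more closely across players.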

As always, there are some caveats. I’m not correcting the SAFE estimates to reflect “150 games” as UZR does^{7}. Plus, as mentioned above, I’m only working with 2002-2008 data. Regardless, if either of these things introduces a bias, my guess is that it works against SAFE, so these numbers likely serve as a “worst case” for the SAFE models.

The takeaways: (1) all of the SAFE models outperform the “traditional” metrics and (2) only the Autoregressive Age model stands out as a “King of the Hill” when it comes to predicting future fielding performance. I’ll be expanding on this analysis in the coming weeks by making the aforementioned adjustments and changing up how I do the prediction (e.g. replacing the outcome variable with something else, looking at predicting non-consecutive years).

**UPDATE**: After some great advice from Tom Tango, I’ve revamped the metrics used for prediction. His point was very apt: no one would simply use the previous season’s value as the prediction for the current season. Instead, you would regress the quantity you’re trying to predict on the previous season’s value. To put it in technical terms, the predictor would come from applying the estimated coefficients of the following regression:

*UZR(t) = Intercept + Slope × Metric(t-1)*

where UZR(t) is the current season’s UZR/150 value and Metric(t-1) is the previous season’s value of any given metric. I did exactly this for all the metrics and recalculated RMSE. The results make things much less clear-cut:

| UZR/150 | DRS | MLE | Original | CoT | MAA | Autoreg |
|---|---|---|---|---|---|---|
| 9.53 | 9.97 | 10.34 | 10.64 | 10.32 | 10.43 | 9.34 |
| - | +4.6% | +8.5% | +11.6% | +8.3% | +9.4% | -2.0% |
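Tango’s adjustment can be sketched the same way. This hypothetical snippet (same made-up numbers as before) fits the simple regression with `numpy.polyfit` and uses the fitted line, rather than the raw previous-season value, as the prediction. Note that in-sample, the regressed predictions can never have a higher RMSE than the raw values, since ordinary least squares minimizes squared error over all lines, including the identity line.

```python
import numpy as np

def regressed_predictions(metric_prev, uzr_current):
    """Fit UZR(t) = Intercept + Slope * Metric(t-1) by ordinary least
    squares, then return the regressed prediction for each player."""
    x = np.asarray(metric_prev, dtype=float)
    y = np.asarray(uzr_current, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)  # coefficients, highest degree first
    return intercept + slope * x

# Hypothetical data: previous-season metric vs. current-season UZR/150.
metric_prev = np.array([5.2, -3.1, 0.8, 12.4])
uzr_current = np.array([4.0, -1.5, 2.2, 9.8])

preds = regressed_predictions(metric_prev, uzr_current)
naive_rmse = np.sqrt(np.mean((metric_prev - uzr_current) ** 2))
reg_rmse = np.sqrt(np.mean((preds - uzr_current) ** 2))
print(reg_rmse <= naive_rmse)  # prints True: OLS can't do worse in-sample
```

The interesting comparisons in the tables are therefore out-of-sample style: the regression is fit on past season pairs, so a metric only keeps its edge if its relationship to next-season UZR/150 is stable.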

As you can see, SAFE fares much worse post-regression, with the Autoregressive Age model only slightly edging out UZR/150 itself as a better predictor. Just to see how these predictions perform when predicting current-season UZR/150 for positions of higher importance, and thus higher sample size, I restricted the data to fielders playing CF or SS and updated the RMSE accordingly.

| UZR/150 | DRS | MLE | Original | CoT | MAA | Autoreg |
|---|---|---|---|---|---|---|
| 9.76 | 9.70 | 10.05 | 10.27 | 10.15 | 10.12 | 8.92 |
| - | -0.1% | +3.0% | +5.2% | +4.0% | +3.7% | -8.6% |

The non-Autoregressive Age SAFE models still predict with much less accuracy, but the Autoregressive Age model begins to separate from the pack as the leader in predicting these more crucial defensive positions.

My next post, as Tom Tango suggested, will look at how each model fares when trying to predict the aggregate of what I originally predicted in the paper: outs per BIP. To be continued.

1. I would assume this is just the average out rate, over a season or seasons, of the bin in which the ball in play would have landed. [↩]
2. RMSE, or Root Mean Squared Error, is a fairly common metric researchers use to evaluate and compare the prediction accuracy of different methods. [↩]
3. I made sure to give UZR the benefit of ample sample size by limiting the dataset to fielders who logged enough innings to qualify for the FanGraphs leaderboard (i.e. 900 innings). SAFE was able to stabilize without the need for so many “innings,” but that requirement came in the form of a total number of balls in play, making it difficult to relate the two. For more of an understanding behind this, refer to my paper. [↩]
4. I, unfortunately, only have the 2002-2008 data from Baseball Info Solutions. I would love to run this on a more complete dataset, but financial capabilities got in the way. [↩]
5. The posterior mean is the mean of the (estimated) posterior distribution of SAFE values for a given player. [↩]
6. This is equivalent to calculating SAFE with no shrinkage or hierarchical Bayesian modeling. See Shane Jensen’s original paper on SAFE for more details. [↩]
7. The SAFE estimates are originally aligned to the level of play we’d expect from the 15th most-used player at a given position. [↩]