Testing the Prediction Accuracy of Fielding Metrics

I’ve had many baseball researchers ask me the “so-what” question about my Ph.D. work: how accurate are the new models you developed for SAFE (Spatial Aggregate Fielding Evaluation)?  After taking some time away from the project (read: burning the USB drive holding my thesis in effigy), I finally believe I have a way to answer that question.  Best of all, the answer seems to suggest that the most sophisticated model we proposed beats out all the competition.

Backing up a step, my thesis, which is now a paper in JQAS, focused on incorporating time elements into SAFE, the hierarchical Bayesian model for fielding evaluation developed by Shane Jensen.  Much of the JQAS paper details the prediction performance of these new models.  However, the prediction methods discussed in the paper, which we gauge using something we called predicted deviations, were exclusive to forecasting a single event; that is, estimating the probability that a single ball in play of a certain type is fielded by a player at a specific location and given a (nominal) velocity.  All of the current (public) methods used to evaluate fielding (e.g. UZR, DRS) represent an aggregation over a set of discrete bins, where the continuous nature of the data is lost. This makes sense, as their purpose is to gauge, in a general sense, the actual outcomes a fielder produced on the field, not to mark his true ability.

That’s what led me to an impasse.  While I could have used the MLE of a fielder’s out rate for a given ball in play1 to represent UZR in a prediction comparison, it doesn’t seem quite fair given the nature of UZR.  Its “interest” is simply to estimate the number of runs above/below average that a fielder actually saved/cost his team at that position.

My new solution is to compare the SAFE models to UZR/DRS in the UZR/DRS realm: RMSE2 of total runs saved/cost through fielding a player’s position.  Essentially, for every player who qualified3 during the span of data available4, I used the posterior mean of every SAFE model developed in the paper5, the SAFE value calculated using the MLE6, UZR per 150 games and DRS to predict UZR per 150 games for each pair of consecutive seasons.  That is, I’m predicting “current season” UZR per 150 games, which I’m using as a proxy for observed outcome, by taking the previous season’s value of each comparable metric, including UZR itself.  After that, I calculated the RMSE for each metric, listed in the table below. Note that I have sorted the table by model complexity, from simplest to most complex. Also, I refer to the SAFE models using shorthand: Original = 1-level Bayesian hierarchical model, CoT = Constant over Time, MAA = Moving Average Age and Autoreg = Autoregressive Age.

RMSE of All SAFE Models and UZR/DRS
            UZR/150  DRS     MLE     Original  CoT     MAA     Autoreg
RMSE        11.24    11.66   10.54   10.75     10.40   10.51   9.71
vs. UZR/150 -        +3.7%   -6.2%   -4.4%     -7.5%   -6.5%   -13.6%
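To make the naive comparison concrete, here is a minimal sketch of the RMSE calculation. The helper name and the toy numbers are mine for illustration only, not values from the actual dataset:

```python
import math

# Hypothetical (previous-season metric value, current-season UZR/150) pairs.
pairs = [(5.2, 3.1), (-8.0, -4.5), (12.3, 9.8), (0.4, -1.2)]

def rmse(predictions_and_outcomes):
    """Root mean squared error of predicted vs. observed values."""
    sq_errors = [(pred - obs) ** 2 for pred, obs in predictions_and_outcomes]
    return math.sqrt(sum(sq_errors) / len(sq_errors))

# Naive scheme: last season's value IS the prediction for this season.
print(round(rmse(pairs), 2))  # → 2.52
```

The real calculation runs this over every qualified player's consecutive-season pairs, once per metric.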

As always, there are some caveats. I’m not correcting the SAFE estimates to reflect “150 games” as UZR is7. Plus, as mentioned above, I’m only working with 2002-2008 data. Regardless, if either of these two things creates any bias, my guess is that it works against SAFE, so these numbers likely serve as a “worst case” for the SAFE models.

The takeaways: (1) all of the SAFE models outperform the “traditional” metrics and (2) the Autoregressive Age model stands out as the “King of the Hill” when it comes to predicting future fielding performance. I’ll be expanding on this analysis in the coming weeks by making the aforementioned adjustments and changing up how I do the prediction (e.g. replacing the outcome variable with something else, looking at predicting non-consecutive years).

UPDATE: After some great advice from Tom Tango, I’ve revamped the metrics used for prediction. His point was very apt: no one would simply use the previous season’s value as the prediction for the current season. Instead, you would regress what you’re trying to predict on the previous season’s value. To put it in technical terms, the predictor comes from applying the estimated coefficients from the following regression:

UZR(t) = Intercept + Slope * Metric(t-1)

where UZR(t) is the current season’s UZR/150 value and Metric(t-1) is the previous season’s value of any given metric.  I did exactly this for all the metrics and recalculated the RMSE. The results muddy the picture considerably:

RMSE of All SAFE Models and UZR/DRS After Regressing on UZR(t)
            UZR/150  DRS     MLE     Original  CoT     MAA     Autoreg
RMSE        9.53     9.97    10.34   10.64     10.32   10.43   9.34
vs. UZR/150 -        +4.6%   +8.5%   +11.6%    +8.3%   +9.4%   -2.0%
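For concreteness, the regress-then-predict step can be sketched like this. The numbers are toy values of mine, not the real data, and I’m writing the one-predictor OLS fit out by hand:

```python
import math

prev = [5.2, -8.0, 12.3, 0.4]   # Metric(t-1) for each player-season pair
curr = [3.1, -4.5, 9.8, -1.2]   # observed UZR/150 in season t

# Ordinary least squares for a single predictor:
# slope = Sxy / Sxx, intercept = mean(y) - slope * mean(x).
n = len(prev)
mx, my = sum(prev) / n, sum(curr) / n
b1 = sum((x - mx) * (y - my) for x, y in zip(prev, curr)) / \
     sum((x - mx) ** 2 for x in prev)
b0 = my - b1 * mx

# Fitted values become the predictions, then score them with RMSE.
preds = [b0 + b1 * x for x in prev]
rmse = math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, curr)) / n)
```

Note that regressing first can only help a metric in-sample, since the naive scheme is the special case Intercept = 0, Slope = 1.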

As you can see, SAFE fares much worse post-regression, with the Autoregressive Age model only slightly edging out UZR/150 itself as a better predictor. Just to see how these predictions perform when predicting current season UZR/150 for positions of higher importance, and thus higher sample size, I pulled out fielders who were playing CF or SS and updated the RMSE accordingly.

RMSE of All SAFE Models and UZR/DRS After Regressing on UZR(t) for CF and SS only
            UZR/150  DRS     MLE     Original  CoT     MAA     Autoreg
RMSE        9.76     9.70    10.05   10.27     10.15   10.12   8.92
vs. UZR/150 -        -0.6%   +3.0%   +5.2%     +4.0%   +3.7%   -8.6%

The non-Autoregressive Age SAFE models still predict with much less accuracy, but the Autoregressive Age model begins to separate from the pack as the leader in predicting these more crucial defensive positions.

My next post, as Tom Tango suggested, will look at how each model fares when trying to predict the aggregate of what I originally predicted in the paper: outs per BIP. To be continued.

  1. I would assume this is just the average out rate, over a season or seasons, of the bin in which the ball in play would have landed.
  2. RMSE, or Root Mean Squared Error, is a fairly common metric that researchers use to evaluate prediction accuracy across different methods.
  3. I made sure to give UZR the benefit of ample sample size by limiting the dataset to fielders who logged enough innings to qualify for the FanGraphs leaderboard (i.e. 900 innings).  SAFE was able to stabilize without the need for so many “innings,” but that requirement came in the form of total number of balls in play, making it difficult to relate the two.  For more of an understanding behind this, refer to my paper.
  4. I, unfortunately, only have the 2002-2008 data from Baseball Info Solutions.  I would love to run this on a more complete dataset, but financial capabilities got in the way.
  5. The posterior mean is the mean value of the (estimated) posterior distribution of SAFE values for a given player.
  6. This is equivalent to calculating SAFE using no shrinkage or hierarchical Bayesian modeling.  See Shane Jensen’s original paper on SAFE for more details.
  7. The SAFE estimates are originally aligned to the level of play we’d expect from the 15th-most-used player at a given position.

Comparing Fedor to Bonds

Before you dismiss this as a nonsense post about two people who dominated their respective sports, know that I mean only to point out one concept these two had in common: a hitch/catch in their swing/punch.

Fedor uses a common Sambo technique known as “casting.”  His loopy, almost sloppy punching method comes from the idea of a catch in his swing, where he loads up his punch.  For Bonds, the “hitch” in his swing is where he generates much of his power.  Essentially, that hitch is where the bat has to catch up with his hands; a batter’s hands drop as he coils up to swing, before the bat starts on its path to the baseball.

Using my basic knowledge of physics and some help from the Google, let me explain why these two gentlemen use this technique.  The principle behind the hitch/catch is to generate more torque with your body, which, in turn, produces a faster punch/bat.  By loading a swing or punch far back, you allow a greater distance over which to accelerate the bat or fist.
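Here’s a back-of-the-envelope illustration of that last point, assuming (very crudely) constant acceleration from rest; the acceleration and path lengths are numbers I made up for illustration:

```python
import math

ACCEL = 300.0      # hand/bat acceleration in m/s^2 (hypothetical)
SHORT_PATH = 0.8   # meters: a compact swing/punch
LONG_PATH = 1.1    # meters: a swing/punch with a hitch/catch loading it back

def speed_at_contact(a, s):
    # From v^2 = 2*a*s for constant acceleration over distance s from rest.
    return math.sqrt(2 * a * s)

print(round(speed_at_contact(ACCEL, SHORT_PATH), 1))  # → 21.9 m/s
print(round(speed_at_contact(ACCEL, LONG_PATH), 1))   # → 25.7 m/s
```

Same acceleration, longer runway, noticeably more speed at contact.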

In most instruction circles, these loading methods (so to speak) are heavily frowned upon.  The interesting part is that both draw nearly the same critique:

  • They are rarely accurate.
  • They require very good and precise timing.

Guys like Adam Dunn and Marcus Davis have had some success with these methods, but ultimately come up short.  Perhaps this is a case of outliers: the technique works only for a very few.

For video evidence of this phenomenon, I’ve spliced together four video clips:

  1. Fedor’s knockout of Brett Rogers.
  2. Fedor’s flurry that led to a rear naked choke on Tim Sylvia.
  3. A Bonds swing from early in his career.
  4. Bonds’ record-breaking home run at-bat.

Focus on the hitch in Bonds’ swing and the catch in Fedor’s punch.  I know it’s hard what with all their greatness, but you’ll see it.

Players are using “advanced analysis” to improve? Seems like a stretch

A recent article in ESPN the Magazine titled Saviormetrics described how Brandon McCarthy, current Oakland A’s starting pitcher, used advanced baseball metrics to transform himself from a mediocre pitcher into a diamond in the rough.  I do believe it’s important for players to use the data available on them, in conjunction with video, to make improvements in their game, as I mentioned in a previous post, but this article seems to suggest McCarthy used some truly advanced analysis to improve.  In fact, one of the more absurd quotes in the article compares McCarthy to Beane:

What Billy Beane was to GMs, Brandon McCarthy is now to players.

That seems like a very far stretch if we’re to believe the article’s take on his recent development.  From the looks of it, Brandon saw that he was giving up too many homers, a fact he gathered from a high HR/IP, and relying too much on flyball outs, which he took from his low GB/FB ratio.  So he analyzed that information and concluded that he needed to add a groundball-inducing pitch to his arsenal: a two-seam fastball.  After adding it, he began to see a surprising amount of success and, ultimately, had his amazing 2011 season.
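To underline how basic those two numbers are, here’s all the “analysis” they require. The counts below are made up for illustration, not McCarthy’s actual stat line:

```python
def hr_per_ip(home_runs_allowed, innings_pitched):
    # Home runs allowed per inning pitched: high means homer-prone.
    return home_runs_allowed / innings_pitched

def gb_fb_ratio(ground_balls, fly_balls):
    # Ground balls per fly ball allowed: low means a flyball pitcher.
    return ground_balls / fly_balls

# Hypothetical season for a homer-prone flyball pitcher.
print(round(hr_per_ip(31, 170.0), 3))   # → 0.182
print(round(gb_fb_ratio(150, 220), 2))  # → 0.68
```

Two divisions. That’s the extent of the sabermetrics involved.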

That’s an acute analysis using some straightforward statistics.  That said, it seems like a stretch to me that a pitching coach couldn’t have given him the same advice using only the basics of scouting.  According to the article, it’s all thanks to these two statistics that he was reborn.  I’m not suggesting that he shouldn’t use those statistics to inform his development, or even saying that understanding those metrics didn’t help him find a path forward; my point is that he didn’t have to use anything outside of what was already known in the “clubhouse,” let alone the advanced analysis suggested here.

What Billy Beane did (with the help of Michael Lewis’ book) was reshape how front offices evaluate talent, whether in-house, on other teams or amateur.  Front offices had always used reasoning based on basic statistics and scouting when making their decisions; it wasn’t until that Oakland A’s front office built its team of “misfit toys” with no money that others adopted deep analysis of data.

In the end, I hope this article does encourage more players to look deeper into their own data, but that doesn’t mean GB/FB or HR/IP.  I’ll get excited when I see them fully utilizing PITCHf/x data or maybe even some of the catcher ERA stuff…