Testing the Prediction Accuracy of Fielding Metrics

I’ve had many baseball researchers ask me the “so-what” question with reference to my work during my Ph.D: how accurate are the new models you developed for SAFE (Spatial Aggregate Fielding Evaluation)?  After taking some time away from the project (read: burning my USB dedicated to my thesis in effigy), I finally believe I have a way to answer that question.  Best of all, the answer seems to suggest that the most sophisticated model we proposed beats out all the competition.

Backing up a step, my thesis, which is now a paper in JQAS, was focused on incorporating time elements into the existing, hierarchical Bayesian model developed for fielding evaluation by Shane Jensen called SAFE.  Much of this JQAS paper is spent detailing the prediction performance of these new models.  However, the prediction methods discussed in the paper, which we gauge using something we called predicted deviations, were exclusive to forecasting a single event; that is, estimating the probability of a single ball in play of a certain type being fielded by a player at a specific location and given a (nominal) velocity.  All of the current (public) methods used to evaluate fielding (e.g. UZR, DRS, etc.) represent an aggregation over a set of discrete bins, where the continuous nature of the data is lost. This makes sense, as their purpose is to gauge, in a general sense, what a fielder actually did on the field, not to measure their true ability.

That’s what led me to an impasse.  While I could have used the MLE of a fielder’s out rate for a given ball in play [1] to represent UZR in a prediction comparison, it doesn’t seem quite fair due to the nature of UZR.  Its “interest” is simply to estimate the number of runs above/below average that a fielder actually saved/cost their team at that position.

My new solution is to compare the SAFE models to UZR/DRS in the UZR/DRS realm: RMSE [2] of total runs saved/cost through fielding a player’s position.  Essentially, for every player that qualified [3] during the span of data available [4], I used the posterior mean of every SAFE model developed in the paper [5], the SAFE value calculated using the MLE [6], UZR per 150 games and DRS to predict UZR per 150 games for each pair of consecutive seasons.  That is, I’m predicting “current season” UZR per 150 games, which I’m using as a proxy for observed outcome, by taking the previous season’s value for all of the comparable metrics, including UZR itself.  After that, I calculated the RMSE for each metric; the results are listed in the table below. Note that I have sorted the table by model complexity, from simplest to most complex. Also, I refer to the SAFE models using shorthand (Original = 1-level Bayesian hierarchical model, CoT = Constant over Time, MAA = Moving Average Age and Autoreg = Autoregressive Age):

RMSE of All SAFE Models and UZR/DRS

Metric         UZR/150   DRS      MLE      Original   CoT      MAA      Autoreg
RMSE           11.24     11.66    10.54    10.75      10.40    10.51    9.71
vs. UZR/150    -         +3.7%    -6.2%    -4.4%      -7.5%    -6.5%    -13.6%
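
For readers who want to check the arithmetic, here is a minimal sketch of the two calculations behind the table: the RMSE itself, and the percentage row, which appears to be each metric’s RMSE relative to UZR/150. The function names are mine and purely illustrative; the RMSE values are copied from the table, not recomputed from the underlying data.

```python
import math

def rmse(predicted, observed):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(observed) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def pct_vs_baseline(value, baseline):
    """Percent difference of `value` relative to `baseline`."""
    return 100.0 * (value - baseline) / baseline

# RMSE values copied from the table above (previous-season value as predictor).
table_rmse = {
    "UZR/150": 11.24, "DRS": 11.66, "MLE": 10.54, "Original": 10.75,
    "CoT": 10.40, "MAA": 10.51, "Autoreg": 9.71,
}

# Percentage row, assuming it is measured against UZR/150.
relative = {
    name: round(pct_vs_baseline(value, table_rmse["UZR/150"]), 1)
    for name, value in table_rmse.items()
}

if __name__ == "__main__":
    for name, pct in relative.items():
        print(f"{name:>8}: {pct:+.1f}%")
```

A negative percentage means the metric predicted next season’s UZR/150 with less error than UZR/150 did itself.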

As always, there are some caveats. I’m not correcting the SAFE estimates to reflect “150 games” as UZR is [7]. Plus, as mentioned above, I’m only working with 2002-2008 data. Regardless, if either of these two things creates any bias, my guess would be that it works against SAFE, so these numbers likely serve as a “worst case” for the SAFE models.

The takeaways: (1) all of the SAFE models outperform the “traditional” metrics and (2) only the Autoregressive Age model stands out as a “King of the Hill” when it comes to predicting future fielding performance. I’ll be expanding on this analysis in the coming weeks by making the aforementioned adjustments and changing up how I do the prediction (e.g. replacing the outcome variable with something else, looking at predicting non-consecutive years).

UPDATE: After some great advice by Tom Tango, I’ve revamped the metrics used for prediction. His point was very apt: no one would simply use the previous season’s value as the prediction for the current season. Instead, you would likely regress what you’re trying to predict on the previous season’s value. To put it in technical terms, the predictor you would use would come from applying the estimated coefficients from the following regression:

UZR(t) = Intercept + Slope × Metric(t-1) + Error

where UZR(t) is the current season’s UZR/150 value, Metric(t-1) is the previous season’s value of any metric, and Intercept and Slope are the estimated coefficients.  I did exactly this, for all the metrics, and recalculated RMSE. The results make things much murkier:
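
To make the procedure concrete, here is a minimal sketch of that regression step using ordinary least squares. It is an illustration under my own assumptions, not the code behind the numbers in this post: the function names are hypothetical, the toy values are made up, and the fit is scored on the same player-seasons it was estimated on.

```python
import math

def fit_simple_ols(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def _rmse(predicted, observed):
    return math.sqrt(
        sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)
    )

def naive_rmse(metric_prev, uzr_curr):
    """RMSE when last season's metric is used directly as the prediction."""
    return _rmse(metric_prev, uzr_curr)

def regressed_rmse(metric_prev, uzr_curr):
    """RMSE when the prediction is intercept + slope * last season's metric."""
    intercept, slope = fit_simple_ols(metric_prev, uzr_curr)
    predictions = [intercept + slope * m for m in metric_prev]
    return _rmse(predictions, uzr_curr)

if __name__ == "__main__":
    # Made-up toy values, for illustration only (not real player data).
    metric_prev = [10.0, -5.0, 3.0, -8.0, 0.0]
    uzr_curr = [4.0, -1.0, 2.0, -6.0, 1.0]
    print("naive RMSE:    ", round(naive_rmse(metric_prev, uzr_curr), 2))
    print("regressed RMSE:", round(regressed_rmse(metric_prev, uzr_curr), 2))
```

When scored on the data it was fit to, the regressed predictor can never have a higher RMSE than the raw previous-season value, because the raw value is just the special case of intercept 0 and slope 1.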

RMSE of All SAFE Models and UZR/DRS After Regressing on UZR(t)

Metric         UZR/150   DRS      MLE      Original   CoT      MAA      Autoreg
RMSE           9.53      9.97     10.34    10.64      10.32    10.43    9.34
vs. UZR/150    -         +4.6%    +8.5%    +11.6%     +8.3%    +9.4%    -2.0%

As you can see, SAFE fares much worse post-regression, with the Autoregressive Age model only slightly edging out UZR/150 itself as a predictor. Just to see how these predictions perform when predicting current season UZR/150 for positions of higher importance, and thus higher sample size, I selected out fielders who were playing CF or SS and updated the RMSE accordingly.

RMSE of All SAFE Models and UZR/DRS After Regressing on UZR(t) for CF and SS only

Metric         UZR/150   DRS      MLE      Original   CoT      MAA      Autoreg
RMSE           9.76      9.70     10.05    10.27      10.15    10.12    8.92
vs. UZR/150    -         -0.1%    +3.0%    +5.2%      +4.0%    +3.7%    -8.6%

The non-Autoregressive Age SAFE models still predict with much less accuracy, but the Autoregressive Age model begins to separate from the pack as the leader in predicting these more crucial defensive positions.

My next post, as Tom Tango suggested, will be to see how each model fares when trying to predict the aggregate of what I originally predicted in the paper: outs per BIP. To be continued.

  1. I would assume this is just the average out rate over a season, or seasons, of the bin in which the ball in play would have landed.
  2. RMSE, or Root Mean Squared Error, is a fairly common metric that researchers use to evaluate prediction accuracy between a few different methods.
  3. I made sure to give UZR the benefit of having ample sample size by limiting the dataset to be considered to fielders who logged enough innings to qualify them for the FanGraphs leaderboard (i.e. 900 innings).  SAFE was able to stabilize without the need of so many “innings,” but that requirement came in the form of total number of balls in play, making it difficult to relate the two.  For more of an understanding behind this, refer to my paper.
  4. I, unfortunately, only have the data from 2002-2008 from Baseball Info Solutions.  I would love to run this on a more complete dataset, but financial capabilities got in the way.
  5. Posterior mean is the mean value of the (estimated) posterior distribution of SAFE values for a given player.
  6. This is the equivalent to calculating SAFE using no shrinkage or hierarchical Bayesian modeling.  See Shane Jensen’s original paper on SAFE for more details.
  7. The SAFE estimates are originally aligned to the level of play we’d expect from the 15th most used player at that given position.

Trying to understand how I cost Wharton $400,000

In a recent WSJ article, the Dean of Wharton, Thomas Robertson, had this interesting exchange with the author of the article about doctoral candidates in his school:

WSJ: Few business schools still have Ph.D. programs. How can talent continue to flow through the pipeline into academia?

Mr. Robertson: When we admit students, in the letter of admission, the expectation is that they will go into academia. The notion has even been floated that if they don’t go into academia, they should pay back the cost of the Ph.D. program. It costs about $400,000 to educate one Ph.D. student, because they don’t pay tuition and they get stipends [emphasis mine].

If [departments] are not placing at top schools, or if their students don’t go into academia, we will cut back the number of Ph.D. students that they’re allowed to admit.

Glossing over the notion of repaying Wharton for not going into academia (which is rather obnoxious), Robertson’s response about costs for educating a Ph.D student is curious; did Wharton really invest $400,000 into my doctoral education?  Am I really that important?  I decided to do a little investigating by finding as much info as I could about the price of tuition, fees, stipends, etc. to account for the $400,000 cost.

To give Dean Robertson the benefit of the doubt, we’ll start by considering tuition/fees for the 2011-2012 academic year as the baseline.  Also, let’s assume that all costs increase by roughly 4% every year (a more accurate number could be attained by doing a detailed analysis of this and national inflation levels).  Given that I am only expertly qualified to talk about the Statistics department, our final assumption will be that this student will receive their Ph.D in five years from Wharton in Statistics and come into the program with no previous credits.

With that out of the way, let’s begin with a summary of the known costs.

1. Tuition = $90,872.47

As per the Doctoral Programs’ website, full-time tuition (i.e. taking classes) for the current academic year of 2011-2012 is $26,660.00.  Taken over 3 years of classes with the 4% annual increase, which is possible for students without previous graduate studies, the total is $83,221.86.  For the other 2 years spent working on their dissertation, students are placed on dissertation status and need to only pay “reduced tuition,” which according to the 2011-2012 rate, is $3,334.00.  Skipping ahead to 3 years from now and accruing those costs during this student’s final 2 years, they need to tack on an additional $7,650.61.  This brings the total cost of tuition to $90,872.47, a bit under the $100k mark.
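
The tuition figure is easy to reproduce. Here is a small sketch, assuming the 4% increase compounds once per academic year, with 2011-2012 as year 0; the helper name is mine.

```python
def compounded_total(base_cost, annual_growth, year_offsets):
    """Sum a cost over the given years, growing it by annual_growth each year.

    year_offsets counts years after the 2011-2012 baseline (0 = baseline year).
    """
    return sum(base_cost * (1 + annual_growth) ** k for k in year_offsets)

GROWTH = 0.04  # assumed annual increase in all costs

# Years 1-3: full-time tuition while taking classes.
coursework_tuition = compounded_total(26660.00, GROWTH, range(0, 3))

# Years 4-5: reduced tuition while on dissertation status.
dissertation_tuition = compounded_total(3334.00, GROWTH, range(3, 5))

# Sum the two pieces after rounding each to the cent, as in the text.
total_tuition = round(coursework_tuition, 2) + round(dissertation_tuition, 2)

if __name__ == "__main__":
    print(f"coursework:   ${coursework_tuition:,.2f}")
    print(f"dissertation: ${dissertation_tuition:,.2f}")
    print(f"total:        ${total_tuition:,.2f}")
```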

2. Fees/Health Insurance = $28,365.63

To delve further into fees and health insurance, here’s a helpful document that breaks all of this down from previous years; some searching on program-specific sites will get you to updated numbers for this current academic year.  Those fees break down into a “general fee” that is as high as $2,318.00, if the student is doing at least 3 course units, and a technology fee of $668.00.  If a student goes into dissertation mode, then their general fee drops to $582.00, though their technology fee remains the same.  As for health insurance, an annual plan through Penn, which for all intents and purposes is an excellent deal, is $3,012.00 for a single student.

Adding that all up over our student’s 5 years at Wharton, the fees and health insurance end up costing $28,365.63.

3. Stipend = $159,402.90

Straight out of my W-2 from 2009 (2010 was different because I taught a class that summer, which did pay me slightly more), I earned $27,209.78.  This is one place for contention by Dean Robertson; I only know from experience what I (and my statistics cohort) was paid in terms of a stipend and have no source for what other departments inside of Wharton pay.  It could be that there are significant increases over this number, but because I do not know for certain, I will stick to my own.

Summing my stat stipend (after correcting for the 2 year difference to begin with, though from experience, I know that the 4% increase was much more like 2%-3%) over the 5 years, our stat Ph.D student would have been paid a total of $159,402.90.

Thus, the grand total of all known costs of our student to Wharton over their 5 year tenure is $278,641.00, or a little less than $56k a year.  And that leaves …

The Remaining “Unknown” Costs = $121,359.00

What are the origins of these unknown costs?  In my mind, there are only two possible costs I missed: the mentoring/advising by professors and the right to work on projects for those same professors.  The latter seems rather absurd given the type of labor that said students would be doing, so I would have to lean to the former as being the root cost.  So, as a doctoral candidate in statistics, over your 5 years, Wharton pays your professors nearly $25,000 a year to assist you through your Ph.D.
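
For anyone who wants to check the bookkeeping, here is a short sketch of the stipend total and the leftover gap. It assumes the 2009 stipend grows 4% a year, is first carried forward two years to 2011-2012, and is then paid for five years. The tuition and fee totals are taken as stated above rather than recomputed, and the variable names are mine.

```python
GROWTH = 0.04                 # assumed annual increase in all costs
STIPEND_2009 = 27209.78       # from the 2009 W-2
TUITION_TOTAL = 90872.47      # taken as stated in the tuition section
FEES_TOTAL = 28365.63         # taken as stated in the fees section
CLAIMED_COST = 400000.00      # Dean Robertson's figure
YEARS = 5

# Carry the 2009 stipend forward two years to the 2011-2012 baseline.
stipend_baseline = STIPEND_2009 * (1 + GROWTH) ** 2

# Pay it for five years, growing 4% each year.
stipend_total = round(
    sum(stipend_baseline * (1 + GROWTH) ** k for k in range(YEARS)), 2
)

known_total = round(TUITION_TOTAL + FEES_TOTAL + stipend_total, 2)
unknown_total = round(CLAIMED_COST - known_total, 2)
unknown_per_year = round(unknown_total / YEARS, 2)

if __name__ == "__main__":
    print(f"stipend over 5 years: ${stipend_total:,.2f}")
    print(f"known costs:          ${known_total:,.2f}")
    print(f"unexplained:          ${unknown_total:,.2f}")
    print(f"unexplained per year: ${unknown_per_year:,.2f}")
```

The per-year figure is where the “nearly $25,000” comes from: the unexplained amount spread evenly over five years.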

My suggestion to current students: milk them for that money as much as you can.

The restoring force in basketball (or the “poor get richer” phenomenon)

Random Walk Picture of Basketball Scoring, a recent paper in JQAS, is well worth a read; it’s one of those ideas that when I read about it, I thought “why had no one ever tried this before!?”

I wanted to share a quote from it that is incredibly intriguing and, as far as my brief searching can tell, has never been expressed before:

The [ubiquitous feature augmenting random scoring in a basketball game] is the existence of a weak linear restoring force, in which the leading team scores at a slightly lower rate (conversely, the losing team scores at a slightly higher rate).  This restoring force seems to be a natural human response to an unbalanced game – a team with a large lead may be tempted to coast, while a lagging team likely plays with greater urgency.

The authors revisit this feature later in the article, but only so much as to re-describe the phenomenon and relate their adjustment approach to a common physics model.  I do not intend to talk about the implications or existence of this feature, but rather, I want to focus on its cause.

The authors, who admittedly approach this feature’s origin naively, do not thoroughly explain why the phenomenon occurs, except to link it to a “natural human response” that has been referenced previously in economic competitions.  While I do not doubt that said response exists, I am skeptical that it is the most influential component of this phenomenon.
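
To get a feel for what a weak linear restoring force does to a game, here is a toy simulation. It is my own simplification and not the model from the paper: every scoring event is worth one point, and the chance that a team scores next drops linearly with its current lead. The function names and the strength of 0.01 are illustrative assumptions.

```python
import random

def simulate_final_margin(n_events, restoring, rng):
    """Final score margin (team A minus team B) after n_events scoring events.

    With restoring = 0 this is a plain unbiased random walk. With
    restoring > 0, the team in the lead is slightly less likely to score next.
    """
    margin = 0
    for _ in range(n_events):
        p_a_scores = 0.5 - restoring * margin
        p_a_scores = min(1.0, max(0.0, p_a_scores))  # keep it a valid probability
        margin += 1 if rng.random() < p_a_scores else -1
    return margin

def margin_variance(n_games, n_events, restoring, seed):
    """Variance of the final margin across many simulated games."""
    rng = random.Random(seed)
    margins = [
        simulate_final_margin(n_events, restoring, rng) for _ in range(n_games)
    ]
    mean = sum(margins) / n_games
    return sum((m - mean) ** 2 for m in margins) / n_games

if __name__ == "__main__":
    print("no restoring force:  ", margin_variance(2000, 100, 0.0, seed=1))
    print("weak restoring force:", margin_variance(2000, 100, 0.01, seed=1))
```

Even a small restoring term pulls final margins noticeably closer together than a pure random walk would, which is the “poor get richer” effect in miniature.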

Consider the strategy of a coach when his/her team is leading.  If the lead is great enough, a coach may choose to bench their starters in hopes of keeping them fresh (and, thus, more efficient).  This would almost certainly decrease the talent level of the team in the lead.  In an effort to keep themselves in the game, the opposing coach often takes the corresponding risk by keeping their starters on the court during that time to exploit the temporary drop in talent.  A perfect example of this is this year’s Sweet 16 game between IU and UK, where Zeller and Davis were both in foul trouble, yet Zeller stayed on the court with IU trailing when Davis went off.

A less obvious way that coaching strategy could lead to this “poor get richer” phenomenon can be found in the response to a team gaining a lead or going on a run that produces a lead: timeouts.  It’s almost exclusively the case that coaches take timeouts when there is very little time on the game clock or their opponent has been going on a scoring run.  In both of these scenarios, the coach taking the timeout is likely trailing the other team.  It’s also believable that a timeout “interrupts” the scoring ability of an opponent and increases a team’s chances of scoring in the next possession (note: what little research exists on the effect of timeouts has shown it to be significantly in favor of the team calling the timeout).  Thus, trailing teams, who call timeouts much more frequently, would see their probability of scoring increase compared to play before the timeout.

There are likely other explanations for this restoring force (e.g. players are more often fatigued after scoring than after failing to score).  Regardless, I believe very strongly that this phenomenon, in general, deserves further attention.  What causes it?  Are we sure it’s a weak force?  Is this exclusive to basketball, a sport with unlimited substitutions, or does this carry over into other sports like baseball, where roster depth plays an even bigger role via relievers, removals after substitutions, etc.?

Comparing Fedor to Bonds

Before you dismiss this as a nonsense post describing two people who dominated their respective sports, know that I mean only to point out one concept that these two had in common: a hitch/catch in their swing/punch.

Fedor uses a common Sambo technique known as “casting.”  His loopy, almost sloppy punching method comes from this idea of there being a catch in his swing, where he loads up his punch.  For Bonds, the “hitch” in his swing is where he generates much of his power.  Essentially, that hitch is where the bat has to catch up with his hands; a batter’s hands drop when they coil up to swing, before the bat starts on its path to the baseball.

Using my basic knowledge of physics and help from the Google, let me explain why these two gentlemen implement this technique.  The principle behind having said hitch/catch is to generate more torque with your body, which in turn, produces a faster punch/bat.  By loading a swing or punch far back, you allow a greater distance to accelerate the bat or fist.

In most instruction circles, these loading methods (so to speak) are heavily frowned upon.  The interesting part is that both of them draw nearly the same critiques:

  • They are rarely accurate.
  • They require very good and precise timing.

Guys like Adam Dunn and Marcus Davis have had some success with these methods, but come up short.  Perhaps, this is the case of outliers; it’s only effective for those very few.

For video evidence of this phenomenon, I’ve spliced together four video clips:

  1. Fedor’s knockout of Brett Rogers.
  2. Fedor’s flurry that led to a rear naked choke on Tim Sylvia.
  3. A Bonds swing from early on.
  4. Bonds’ record-breaking home run at-bat.

Focus on the hitch with Bonds and the catch with Fedor.  I know it’s hard what with all their greatness, but you’ll see it.

Amelia Bedelia

Here’s a fantastic quote from a recent Drew Magary article about children’s books; the entire article is worth a read, but this point in particular is spot on (at least from what I remember as a kid):

7. Do not buy any Amelia Bedelia books.
She’s awful. I hate her. She takes everything you say literally, so when her boss is like, “Make the bed,” she literally makes a little bed out of craft supplies. What a moron. Kids are too young and too stupid to understand the concept of figures of speech, so all the jokes go right over their heads. They’re the lucky ones. I actually understand these jokes, which only makes it worse. If this woman existed in the real world, she’d be arrested and sent to prison and then she’d die from swallowing bleach by accident and she would DESERVE it. Every time she gets fired in these books, I cheer. And when she gets rehired, I want to vote Republican. STOP BAILING OUT THESE PEOPLE.

Players are using “advanced analysis” to improve? Seems like a stretch

A recent article in ESPN the Magazine titled Saviormetrics described how Brandon McCarthy, current Oakland A’s starting pitcher, utilized advanced baseball metrics to reshape his game, going from a mediocre pitcher to a diamond in the rough.  I do believe it’s important for players to use the data available on them in conjunction with video to make improvements in their game, as I mentioned in a previous post, but this article seems to suggest McCarthy used some advanced analysis to improve his game.  In fact, one of the more absurd quotes in the article is when they compare McCarthy to Beane:

What Billy Beane was to GMs, Brandon McCarthy is now to players.

That seems like a very far stretch, if we’re to believe the article’s take on his recent development.  From the looks of it, Brandon saw that he was giving up too many homers, a fact he gathered from a high HR/IP, and relying too much on flyball outs, which he took from his low GB/FB ratio.  So, he analyzed that information and came to the conclusion that he needed to add a groundball-inducing pitch to his arsenal, a two-seam fastball.  After adding it, he began to see a surprising amount of success and, ultimately, his amazing 2011 season.

That’s an acute analysis using some straightforward statistics.  That said, it seems like a stretch to me that a pitching coach couldn’t give him the same advice using only the basics of scouting.  According to the article, it’s all thanks to these two statistics that he was reborn.  I’m not suggesting that he shouldn’t be using those statistics to inform how he can better improve, or even saying that by understanding those metrics he didn’t find a method of development; my point is that he didn’t have to use anything outside of what has already been known in the “clubhouse,” let alone the advanced analysis suggested here.

What Billy Beane did (with the help of Michael Lewis’ book) is reshape how front offices evaluate talent, whether it’s in-house, other teams’ or amateur.  Front offices had always used reasoning based on basic statistics and scouting when making their decisions; it wasn’t until that Oakland A’s front office built their team of “misfit toys” with no money that others adopted deep analysis of data.

In the end, I hope this article does encourage more players to look deeper into their own data, but that doesn’t refer to GB/FB or HR/IP.  I’ll get excited when I see them fully utilizing PITCHf/x data or maybe even some of the catcher ERA stuff…

The biggest problem facing sports research: incompetent researchers (and ESPN)

Kudos to Phil Birnbaum’s recent post rehashing Gladwell’s ridiculous argument for early NFL draft picks being no better than late draft picks.  Spoiler alert: it’s selection bias.

This isn’t anything new, unfortunately.  All the time we see people making audacious claims and backing them up with statistical analyses that are founded on fairly pedestrian statistical biases.  While many of these biases can be nuanced (think heteroscedasticity, not selection bias), that’s precisely why research requires so much thought and care.  A fundamental of research is checking and rechecking your assumptions to ensure that they’re all well-founded.

Why is this such a problem in sports research?  It’s not unlike the problem faced in climate science (see this paper for more): the output is completely tangible and, in terms of data mining, easy to understand.  Everyone can see someone dunk a basketball, though they do not necessarily see the pick set to make it possible.  Everyone in NYC knows that this has been a mild winter, but that says nothing about the combined temperature across the planet.  Because the average reader can quickly understand these basic data points, they believe that they are able to freely analyze them and their more advanced “relatives.”  Reading into even the most basic statistics can be damaging, which explains the struggles academics have trying to explore a topic with deep thought and a complete understanding of all previous literature.

What’s worse is that the problem isn’t getting better.  For example, the fans’ choice for best research paper (what does that even mean!?) in the upcoming SSAC, Sloan Sports Analytics Conference, will be chosen by SportsNation.  I’ll save my SSAC rant for next week, but thinking along the lines of an idiocracy, ESPN might move to change our peer-review system into a poll of which paper’s impact is the most awesome-est.

People do not play to their skill set

It never ceases to amaze me how people who can provide fair and accurate judgement on the game of others have no ability to reflect that same skill upon themselves.  It seems most obvious when playing pickup basketball.

I by no means have any basketball skills; my body is basically a scaled version of Kermit the Frog.  That said, I know that same body gives me some advantages down-low.  So, my game plan has always been simple: stay, shoot and help defend down-low in the paint.  If I have the ball outside of the paint with no clear lane, pass it immediately.

What’s so shocking is how often I see, say, the other team’s 5’5″ guard trying to do the same thing, or a big man on my team taking completely unnecessary shots from way outside the arc.  You see this stuff happen in college all the time (e.g. Quinn Cook) and even in the NBA (e.g. Antoine Walker), which is incredibly frustrating to watch.  It’s why I find it very refreshing to see players who can take a particular skill set and squeeze everything out of it.  My favorite example is the Big Z, Zydrunas Ilgauskas.

Take a look at his stats per shooting locations in the 2008-2009 season:

You’ll notice that he basically only takes two types of shots a game: shots at the rim or mid-range jumpers.  By delving further into his mid-range shots, they’re almost exclusively catch-and-shoots (notice the ridiculously high %As, or percentage of makes assisted).  What’s even more telling is that he’s not only taking them from the same distance, but nearly always the same place on the court: off of the left wing of the basket, 2 feet in front of the 3-point line.  Check out his shot chart from his best game that season (vs. the Bucks):

Zydrunas Ilgauskas shot chart in his best game of the 2008-09 season.

The clustering to the left of the rim is his sweet spot.  He’s clearly comfortable there, whether it be from practicing or some innate skill.  What you don’t see are shots taken randomly all over the court; he knows where he can best shoot because he understands his skill set.

Why does he have such an understanding?  Tape is often the key, but even a basic sense can be gathered by listening to others.  Or, in the case of pickup basketball, observing the reactions of your teammates whenever you do something stupid.

NBA in the red due to capital expenditure

In a recent article on TrueHoop, Adam Silver, COO of the NBA, is quoted as saying

“The league will not make money this year,” Silver says. And next year? “Maybe.”

I think there is a little truth and a little fibbing in this statement.  Yes, it’s easy to forget that even with the league’s popularity, and subsequent ticket sales, rising fast due to Jeremy Lin, the league will still take a huge hit to profits because of the lockout.  However, that cost could easily be construed as a capital expenditure.

Ultimately, the league (or owners, depending on your perspective) chose to take the loss in present revenue in hopes of a larger portion of future revenue; that is, by stalemating, the league invested in the chance that the players would accept a deal that allotted them a smaller chunk of BRI.  Even if that money “lost” went towards variable costs (e.g. wages, venue maintenance), it was still spent on a calculated risk for future benefits, which almost certainly qualifies it as a capital expenditure.

So, next time you hear Adam Silver (at a bar?) say that the league is losing money this year, just respond with “technically …”

Picturing Savannah

With my first blog post, I figured I would address the significance of the header of my site (at least in its current state).  That’s a picture from the neighborhood dock less than a mile away from where I grew up.  I usually don’t talk much about that dock, not because it means so little to me, but because an image like this does a much better job.  In so many ways, it’s a telling portrait of life in Savannah: so little movement in an entrancing and beautiful place.