The Way We Grade NFL Fantasy Projections Is Wrong

As we come to the close of the NFL season, it is a good time to think about big-picture lessons we can learn for daily fantasy sports. During the offseason, we spend our time working to improve our processes, tools, and approach to creating projections for our favorite chaotic and unpredictable pastime.

Ask anyone what the most important measure of fantasy projections is and they will answer: Accuracy.

But the way the industry grades projection accuracy is wrong.

It’s not wrong wrong, but it’s not what most people want. People throw around statistics like RMSE, R^2, and correlation that, while appropriate and valid in an academic sense, are hard for normal people to wrap their heads around. I have an engineering degree and I still have no idea how to compare these statistics!

These typical measures fall down when we use them to inform our weekly roster decisions.

For example, R^2 numbers can be inflated by the long tail — it’s easy to project the large number of backups and scrubs that will score under five fantasy points per week. But who cares? We weren’t ever going to play Kyle Juszczyk so it doesn’t matter if we correctly predicted he would score exactly 2.31 points.

We should think of each weekly lineup decision as a binary one.

For weekly leagues, did I make the right sit/start call? For DFS, did this player outperform expectations?

Precision and Recall

Thinking about projections as a tool to inform these binary decisions, we can look at an alternative grading method called Precision and Recall.

This method produces two percentages:

  • Precision (positive predictive value)
  • Recall (sensitivity)

In our context, Precision can answer an important question: When we predicted a player would exceed expectations, how often did that player actually beat them?

High Precision means the model might not highlight every good play, but the ones it does identify are solid (great for cash game plays).

Recall can tell us: Of all players who beat expectations this week, how many did we correctly identify?

If a projection model has high Recall, it is really good at unearthing sleeper picks (the model identified all of the possible good plays, but also some duds).

Precision and Recall offer a more relevant, understandable, and actionable way for fantasy players to evaluate the accuracy of projection models and to decide how to use them in lineup decisions.

Methodology

To show why this method is more helpful, let’s grade how well we projected players by looking at the FantasyLabs Plus/Minus metric for a sample week.

To calculate Precision and Recall, we need to tally the following:

  • True positives: Players we projected to do well and who did well
  • False positives: Players we projected to do well but who did poorly
  • False negatives: Players we projected to do poorly but who did well
  • True negatives: Players we projected to do poorly and who did poorly

For example:

Player           | Salary-based Expectation | Projection        | Actual            | Result
Josh Gordon      | 11.7 fpts                | 14.9 fpts (+3.2)  | 15.9 fpts (+4.2)  | True positive
Sterling Shepard | 11.3 fpts                | 14.6 fpts (+3.3)  | 2.7 fpts (-8.6)   | False positive
Keelan Cole      | 7.1 fpts                 | 6.9 fpts (-0.2)   | 18.9 fpts (+11.8) | False negative
Seth Roberts     | 8.3 fpts                 | 5.8 fpts (-2.5)   | 5.4 fpts (-2.9)   | True negative

(Repeat for all player projections)
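To make the tallying concrete, here is a minimal sketch in Python. The threshold logic mirrors the table above: a player is a "positive" pick when we project a positive Plus/Minus, and a "positive" outcome when the actual score beats the salary-based expectation. The function and data layout are illustrative assumptions, not FantasyLabs code.

```python
# Minimal sketch: bucket each player into a confusion-matrix cell.
# The (name, expectation, projection, actual) layout is an assumption
# for illustration, not the FantasyLabs data model.

def classify(expectation, projection, actual):
    predicted_positive = projection > expectation  # we projected a positive Plus/Minus
    actual_positive = actual > expectation         # the player beat expectations
    if predicted_positive and actual_positive:
        return "true_positive"
    if predicted_positive:
        return "false_positive"
    if actual_positive:
        return "false_negative"
    return "true_negative"

# The four players from the example table:
players = [
    ("Josh Gordon",      11.7, 14.9, 15.9),
    ("Sterling Shepard", 11.3, 14.6,  2.7),
    ("Keelan Cole",       7.1,  6.9, 18.9),
    ("Seth Roberts",      8.3,  5.8,  5.4),
]

for name, exp, proj, act in players:
    print(f"{name}: {classify(exp, proj, act)}")
```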

Next, we use these results to compute the two percentages.

Precision = # of True Positives / (# of True Positives + # of False Positives)
Recall = # of True Positives / (# of True Positives + # of False Negatives)
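In code, the two formulas are a couple of divisions over the tallies. As a sketch, the running back counts below match the sample week that follows (17 true positives out of 26 positive projections); the false-negative count of 11 is backed out from the reported 61% Recall, so treat it as approximate.

```python
# Sketch: turn the confusion-matrix tallies into the two percentages.

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Running back counts from the sample week below (fn is backed out
# from the reported 61% Recall, so it is approximate):
p, r = precision_recall(tp=17, fp=9, fn=11)
print(f"Precision: {p:.0%}, Recall: {r:.0%}")  # Precision: 65%, Recall: 61%
```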

Sample Results

I used this method to grade the Week 14 NFL DraftKings projections (by position):

                                     | QB   | RB   | WR   | TE
Precision                            | 48%  | 65%  | 48%  | 36%
Recall                               | 67%  | 61%  | 59%  | 42%
R^2 (Projected fpts vs. Actual fpts) | 0.02 | 0.38 | 0.27 | 0.07

With Precision and Recall, we can start to make some interesting observations about projection accuracy that would be impossible with only raw R^2 numbers.

In this particular week, the projections were good for picking cash game running backs. Of the 26 running backs projected to have a positive Plus/Minus, 17 of them did (Precision: 65%).

One weakness was at tight end: Of the 12 tight ends that had a positive Plus/Minus on game day, only five of our projected tight ends were among them (Recall: 42%). If you were looking to find a punt tight end play to win a GPP, you probably missed the mark.

If we looked strictly at R^2, we would assume that our wide receiver projections were way better than our quarterback projections this week (note: higher R^2 is better). But when making the binary “play or not” decision, both positions were similar (48% Precision).

Cool, But How Does This Help Me?

One way to think about these metrics is that Precision measures the quality of the predictions, while Recall measures the quantity. For building DFS lineups, we need to balance both metrics to fit our game selection. (Aside: several statistics, such as the F-score and Youden's index, combine Precision and Recall into a single accuracy score.)
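For the curious, here is a small sketch of those two combined scores. The F-score shown is the balanced F1 variant (the harmonic mean of Precision and Recall); Youden's index also needs true negatives, which Precision and Recall alone do not provide, so the count of 40 below is a made-up placeholder.

```python
# Sketch of two single-number accuracy summaries.

def f1_score(precision, recall):
    # Balanced F-score: harmonic mean of Precision and Recall.
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

def youdens_j(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)          # identical to Recall
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1  # ranges from -1 to 1

print(f"F1: {f1_score(0.65, 0.61):.2f}")      # ~0.63 for the sample RB week
print(f"J:  {youdens_j(17, 9, 11, 40):.2f}")  # tn=40 is a hypothetical count
```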

For constructing a single cash game lineup, we need only a handful of plays at each position, but those plays need to be high quality; even a single dud can be the difference between cashing and not. Given this goal, it becomes easier to compare different projection models.

          | Cash Model A | Cash Model B
Precision | 80%          | 50%
Recall    | 20%          | 75%

In this hypothetical, Model A is going to be much stronger for cash game lineups compared to Model B. Using Model A means selecting from a player pool that is smaller (low Recall) but where each player has a higher likelihood of exceeding expectations (high Precision).

For constructing multiple GPP lineups, the importance of the metrics flips. We need to open up the player pool to include players who have high upside but a wider range of outcomes. We should be willing to accept a lower Precision score if we can increase the Recall.

You might ask: Why don’t we aim for the extremes? Wouldn’t the best model for cash games be one with 100% Precision? Or a tournament model with 100% Recall?

Unfortunately, we have other constraints that we need to balance for a model to be practical. Imagine that you built a model that projected that Antonio Brown would be the only player to exceed expectations this week. If he goes off, great! Your model scored a perfect 100% Precision.

But it’s not actually useful. Even for a single lineup, we need a handful of options at each position both to fill out the minimum roster requirements and to balance salary costs.

In a similar vein, you could score 100% Recall if you simply projected every player to score 50 fantasy points every week. You will definitely have every good play in your player pool (because you included every player), but the Precision will be so low that you won’t be able to construct any meaningful lineups.

The benefit of Precision and Recall over other explanatory statistics is that, as we adjust model parameters and tweak projections, we can see how those changes move Precision and Recall and whether the results come more in line with our lineup construction needs.
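As a final sketch, here is what that feedback loop might look like with one hypothetical knob: the Plus/Minus margin a projection must clear before we call a player a "positive" pick. Raising it should push Precision up (a cash-style pool) at the cost of Recall (fewer sleepers captured). The four-player toy slate from the Methodology section is too small to show a smooth curve; in practice you would run this over a full week of projections.

```python
# Sketch: sweep a hypothetical confidence margin and watch how
# Precision and Recall trade off. Not FantasyLabs code; the margin
# parameter and data layout are assumptions for illustration.

def evaluate(players, margin):
    """players: iterable of (expectation, projection, actual) tuples."""
    tp = fp = fn = 0
    for exp, proj, act in players:
        predicted = (proj - exp) > margin  # projected Plus/Minus must clear the margin
        beat = act > exp
        if predicted and beat:
            tp += 1
        elif predicted:
            fp += 1
        elif beat:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy slate: the four example players from the Methodology section.
week = [(11.7, 14.9, 15.9), (11.3, 14.6, 2.7), (7.1, 6.9, 18.9), (8.3, 5.8, 5.4)]

for margin in (0.0, 1.0, 2.0, 3.0):
    p, r = evaluate(week, margin)
    print(f"margin={margin}: precision={p:.0%}, recall={r:.0%}")
```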

Photo Credit: Chuck Cook – USA TODAY Sports
