PGA Is Really Noisy — Strokes Gained Doesn’t Fix That

Last week, I explored the relationship between strokes gained and conventional data and showed how they are far more similar than people realize. The strong similarities alone should give pause to everyone claiming that SG has much better predictive power than conventional data. This week, we’ll quantify exactly how much difference in predictive power there is between properly adjusted conventional data and SG data.

As a crude example, we’ll take data from three 2015 tournaments (the Quicken Loans National, Bridgestone Invitational, and PGA Championship) and use it to predict scoring outcomes for the next tournament on the schedule, the Wyndham Championship.

I took the simple average of each player’s SG data and adjusted conventional data across the three prior tournaments and used each set of averages to fit a separate linear regression model (one with conventional data and one with SG data) predicting birdies, bogeys, and pars at the Wyndham. In practice, it’s better to average over a much longer historical period, but using three tournaments keeps the example easy to follow. Here are the predicted vs. actual plots for each of these metrics using SG data and conventional data (click to enlarge):

[Figure: predicted vs. actual birdies, bogeys, and pars at the Wyndham, using SG data and conventional data]
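
For anyone who wants to reproduce this kind of comparison, here’s a minimal sketch of the setup. It is not the original analysis’s code: the file names, column names, and the specific stat categories are placeholder assumptions, so swap in whatever your own per-player, per-tournament data looks like.

```python
# A rough sketch of the experiment described above -- not the author's exact
# code. File names, column names, and the specific conventional stats are
# hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# One row per player per tournament, containing SG categories, field-adjusted
# conventional stats, and the scoring outcomes we want to predict.
hist = pd.read_csv("prior_three_events.csv")   # Quicken Loans, Bridgestone, PGA Championship
wyndham = pd.read_csv("wyndham.csv")           # the event we're predicting

SG_COLS = ["sg_ott", "sg_app", "sg_arg", "sg_putt"]
CONV_COLS = ["adj_driving_dist", "adj_driving_acc", "adj_gir",
             "adj_scrambling", "adj_putts_per_gir"]
TARGETS = ["birdies", "bogeys", "pars"]

# Simple average of each player's stats over the three prior tournaments.
avgs = hist.groupby("player")[SG_COLS + CONV_COLS].mean()
df = wyndham.join(avgs, on="player", how="inner")

# One linear regression per target for each data source, scored against the
# actual Wyndham results (the prior events only supply the features).
for target in TARGETS:
    for label, cols in (("SG", SG_COLS), ("conventional", CONV_COLS)):
        model = LinearRegression().fit(df[cols], df[target])
        r2 = r2_score(df[target], model.predict(df[cols]))
        print(f"{target:8s} | {label:12s} | R^2 = {r2:.3f}")
```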

None of these plots looks particularly good, and that’s precisely the point. Have you seen the things we’re trying to predict? There’s an enormous amount of noise in PGA data. The cut alone introduces a massive dose of variance, as you can see from the stratification of the data points; it’s a distribution you just don’t see in any other sport. And there’s a huge amount of noise within each stratum, shown by how far each band stretches vertically.

As a general rule, when the data you’re trying to predict is this noisy, you should be skeptical that there will be a magical unicorn metric that makes sense of it all and gives you absolute clarity. In PGA, some people think that SG is that metric. Remember that properly adjusted conventional data still captures a large share of SG’s information (80% for SG:T2G, 67% for SG:P). If you truly believe SG will dominate conventional data in terms of predictive power, you’re saying that the missing 20-33% will yield much better predictions. In this particular example, conventional data ends up doing better than SG data, but not by much. The reason is straightforward: there isn’t much difference between the two data sets to begin with, and whatever difference exists gets drowned in an ocean of noise.

The astute among you are probably asking what those R-squared values look like when the historical sample size is increased, and that instinct is correct: the best way to combat noisy data is to gather as much of it as possible to mitigate the noise. The short version: yes, the R-squared values for both go up, but the difference between the two data sources remains negligible; a larger sample doesn’t change the underlying facts highlighted above. Sometimes quantity really is more important than quality, and that’s definitely true for golf. You don’t have the luxury of ignoring results wherever you can get them.
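
If you want to check this for yourself, here’s a sketch of the same comparison run over widening historical windows. As before, the file and column names (events.csv, the sg_* and adj_* columns, and birdies as the illustrative target) are placeholder assumptions rather than anything from the original analysis.

```python
# A sketch of the same comparison with a widening lookback window.
# events.csv is assumed to hold one row per player per tournament for the
# season, with a date column, SG columns, adjusted conventional columns,
# and the scoring targets. All names here are hypothetical.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

SG_COLS = ["sg_ott", "sg_app", "sg_arg", "sg_putt"]
CONV_COLS = ["adj_driving_dist", "adj_driving_acc", "adj_gir",
             "adj_scrambling", "adj_putts_per_gir"]

events = pd.read_csv("events.csv", parse_dates=["date"]).sort_values("date")
wyndham = events[events["event"] == "Wyndham Championship"]
prior = events[events["date"] < wyndham["date"].min()]
prior_names = prior["event"].drop_duplicates().tolist()   # chronological order

for lookback in (3, 6, 12, 24):                            # tournaments to average over
    window = prior[prior["event"].isin(prior_names[-lookback:])]
    avgs = window.groupby("player")[SG_COLS + CONV_COLS].mean()
    df = wyndham.join(avgs, on="player", how="inner")
    for label, cols in (("SG", SG_COLS), ("conventional", CONV_COLS)):
        model = LinearRegression().fit(df[cols], df["birdies"])
        r2 = r2_score(df["birdies"], model.predict(df[cols]))
        print(f"lookback={lookback:2d} | {label:12s} | birdie R^2 = {r2:.3f}")
```

The loop uses a single target (birdies) just to keep the output short; running it for bogeys and pars is the same pattern with a different column.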

Unfortunately for SG, that is exactly where it falls short: it simply doesn’t exist for non-PGA Tour tournaments. In the last part of this series, we’ll quantify that missing data and show how much active harm it can do to your DFS decision-making process.
