Chaos Theory, Statcast Data, and Daily Fantasy Baseball

“As a boy I was always interested in doing things with numbers, and was also fascinated by changes in the weather.”
— Edward Lorenz, the father of chaos theory

Last year, I wrote a piece about chaos theory and daily fantasy golf. Very few people read it, but in the simulation of this universe that’s the variable that ultimately determined the 2016 Presidential Election.

That was my first stab at a “butterfly effect” joke. It sucked.

Anyway, the piece on chaos theory and PGA was . . . chaotic and about golf. This piece will hopefully be more organized and about MLB.

Chaos Theory

Chaos theory is a branch of mathematics that studies dynamical systems whose behavior is extremely sensitive to initial conditions. Lorenz fortuitously discovered that sensitivity when he reran a weather simulation from values typed in off a printout, a small and seemingly inconsequential change to the input.

Because of that small event (the printout had rounded the numbers to three decimal places, and the rerun soon diverged from the original run) Lorenz eventually became the guy who realized that a butterfly flapping its wings in one part of the world can, in principle, help set off a hurricane in another part of the world a few weeks later.
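To make the rounding point concrete, here is a minimal sketch in Python. It does not use Lorenz's weather model. It swaps in the logistic map, a standard textbook example of a chaotic system, and the starting value, step count, and function name are all illustrative assumptions.

```python
def logistic_orbit(x0, steps, r=4.0):
    """Iterate x -> r * x * (1 - x) and return every value visited."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Same system, two starting points: one "full precision" value and the
# same value rounded to three decimal places (illustrative numbers).
full = logistic_orbit(0.506127, 50)
rounded = logistic_orbit(round(0.506127, 3), 50)

# Gap between the two runs at each step.
gaps = [abs(a - b) for a, b in zip(full, rounded)]

for step in (0, 5, 10, 20, 30):
    print(step, f"{gaps[step]:.6f}")
```

The two runs start about a ten-thousandth apart and end up bearing no resemblance to each other, which is the whole idea behind "initial conditions matter."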

What does this have to do with baseball?

MLB Spray Charts

You probably know what an MLB spray chart is. This is a spray chart:

Because MLB has great ball-tracking technology and data, we can do a number of comparisons based on the distance of batted balls. For instance, if we wanted to see how a player did over two separate periods of time, we could create a spray chart for that. If we wanted to do a cross-player comparison, we easily could. Or if we wanted to compare one batter’s performance at two ballparks — one actual and one theoretical — we could.

A cross-park comparison is what we see in the graphic above: On the left are the batted balls that Kendrys Morales hit in 2016 at Kauffman Stadium as a member of the Kansas City Royals, and on the right are the same batted balls with an underlay of Rogers Centre, where Morales will play his home games as a member of the Toronto Blue Jays in 2017.

Pulled from our Blue Jays DFS scouting report, this image hit me with two simultaneous thoughts the first time I saw it:

  1. First thought: That’s cool. At a glance, we can generally see that if those balls had been hit at Rogers instead of Kauffman then they likely would’ve had different outcomes. As we say in the scouting report, “There were numerous doubles and flyouts at Kauffman that have a good chance of turning into home runs in Toronto.” If I’m looking at this graphic like a Rorschach test, the impression I get is that Morales is moving to a stadium that theoretically will play better to his power. That’s the impression we wanted to convey with the image.
  2. Second thought: The premise of this graphic is false. Those balls would’ve never been hit at Rogers, because those pitches never would’ve been thrown. In a different place, with a different team, in different circumstances, facing different pitchers, Morales wouldn’t have seen the exact pitches that turned into those batted balls. The initial conditions would be different: #ButterflyEffect.

Clearly, the purpose of the spray chart is to suggest what’s possible, not to present what would’ve actually happened. That’s fair and fine. I’m not saying that spray charts (or the use of them) are bad. What I’m saying is that the data presented in the spray charts, batted ball distance in two different stadiums (“stadia,” if you will), is missing the context of initial conditions (especially in the Rogers chart).
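For what it's worth, the re-projection behind a cross-park graphic can be sketched in a few lines of Python. This is a rough sketch under loud assumptions: the fence profiles (`PARK_A`, `PARK_B`), the batted balls, and the helper names are all made up for illustration. They are not real park dimensions or real Morales data, and wall height, weather, and altitude are ignored. The sketch also bakes in exactly the premise questioned above, namely that the same batted balls would exist in the other park.

```python
from bisect import bisect_right

# Hypothetical fence profiles: (spray angle in degrees, fence distance in feet).
# -45 is the left-field line, 0 is dead center, +45 is the right-field line.
# These numbers are invented and are NOT real park dimensions.
PARK_A = [(-45, 330), (-22.5, 387), (0, 410), (22.5, 387), (45, 330)]
PARK_B = [(-45, 328), (-22.5, 375), (0, 400), (22.5, 375), (45, 328)]


def fence_distance(park, angle):
    """Linearly interpolate the fence distance at a given spray angle."""
    angles = [a for a, _ in park]
    if not angles[0] <= angle <= angles[-1]:
        raise ValueError("angle is in foul territory")
    # Index of the profile point at the right end of the bracketing segment.
    i = min(bisect_right(angles, angle), len(park) - 1)
    (a0, d0), (a1, d1) = park[i - 1], park[i]
    return d0 + (d1 - d0) * (angle - a0) / (a1 - a0)


def clears_fence(park, angle, distance):
    """True if a ball carrying `distance` feet reaches the fence (height ignored)."""
    return distance >= fence_distance(park, angle)


# Made-up batted balls: (spray angle, carry distance in feet).
balls = [(22.5, 380.0), (0.0, 405.0), (10.0, 300.0)]

# Balls that stay in the yard at park A but would clear the fence at park B.
flipped = [b for b in balls
           if clears_fence(PARK_B, *b) and not clears_fence(PARK_A, *b)]
print(flipped)
```

Everything the function knows about a batted ball is its angle and distance. The pitch, the pitcher, and the game situation that produced it are nowhere in the inputs, which is the missing context.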

Statcast Data & Adjusted Production

In comparison to other major team sports like football, basketball, and hockey, MLB is further down the analytical rabbit hole. Although the other sports have advanced, MLB got a decade head start and is now firmly in the era of sabermetrics.

Having said that, I can see one clear deficiency in the way that many people think about and handle some baseball data — specifically MLB’s Statcast data.

Many people act as if batted ball distance, exit velocity, launch angle, etc., all take place within a vacuum, as if they are immune to the vagaries of chaos. I’m sometimes guilty of this. I’ll look at batted ball data in our Player Models, I’ll see that Player X is hitting the ball farther than Player Y, and I’ll assume that Player X is the better hitter. Of course, if I took the time to use our Trend tool to research Players X and Y in a more holistic fashion (on the basis of more than just their batted ball data), then I might find that, given the full set of initial conditions, Player Y is likelier to produce a higher Plus/Minus.

The problem with a lot of Statcast data is that we view it through a decontextualized lens. The original context matters. What makes our PGA data so great (I think) is that everything is adjusted: Long-Term Adjusted Round Score, for instance. All of this adjusted data is viewable in our Models and researchable in the Labs Tools. The data points can be adjusted because we know the original circumstances under which they were produced. Right now, the problem with Statcast data is that it’s not adjusted.

Not all balls hit 265 feet into right field are the same. These other factors also matter.

  • Handedness of batter
  • Handedness of pitcher
  • Exit velocity
  • Launch angle
  • Pitch velocity
  • Pitch location
  • Batting order spot
  • Relative strength of surrounding hitters
  • Game situation (score, inning, outs, men on base, count)
  • Home plate umpire and size/location of strike zone (?)

I’m probably missing some factors. Over the course of a season, a lot of these factors might be smoothed away. For example, over 162 games it’s possible that pitch velocity has almost no effect on batted ball distance. I have no idea. But when all of these factors are taken into account together, they almost certainly have significance. By looking at unadjusted Statcast data, we’re missing the impact of the contextual factors that reveal to us the initial conditions of production.
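One way to picture what "adjusted" could mean is to compare each batted ball with the average ball hit in the same context. The Python sketch below is only that, a sketch: the sample data, the choice of context (batter and pitcher handedness only), and the function names are assumptions made for illustration, and this is not how any FantasyLabs metric is actually calculated.

```python
from collections import defaultdict
from statistics import mean

# Made-up batted balls: (batter, batter_hand, pitcher_hand, distance_ft).
BALLS = [
    ("X", "L", "R", 300.0),
    ("X", "L", "R", 280.0),
    ("Y", "L", "L", 270.0),
    ("Y", "L", "L", 250.0),
    ("Z", "L", "R", 260.0),
    ("Z", "L", "L", 230.0),
]


def raw_average(balls):
    """Unadjusted average distance per batter."""
    by_batter = defaultdict(list)
    for batter, _, _, dist in balls:
        by_batter[batter].append(dist)
    return {b: mean(d) for b, d in by_batter.items()}


def context_adjusted_average(balls):
    """Average distance above or below the baseline for each ball's context."""
    # Baseline: mean distance of all balls hit in the same handedness matchup.
    by_context = defaultdict(list)
    for _, bat_hand, pit_hand, dist in balls:
        by_context[(bat_hand, pit_hand)].append(dist)
    baseline = {c: mean(d) for c, d in by_context.items()}

    # Score each ball against its own context, then average per batter.
    by_batter = defaultdict(list)
    for batter, bat_hand, pit_hand, dist in balls:
        by_batter[batter].append(dist - baseline[(bat_hand, pit_hand)])
    return {b: mean(d) for b, d in by_batter.items()}


raw = raw_average(BALLS)
adjusted = context_adjusted_average(BALLS)
print(raw)
print(adjusted)
```

In this made-up sample, Player X out-distances Player Y on raw average, but the two come out level once each ball is measured against its context baseline. A real version would need far more context than handedness, and whether the adjustment changes anything is exactly the research question.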

MLB Chaos Theory

Here are some closing thoughts:

  1. It’s possible that once we do the research we’ll find that a lot of the factors I listed above are irrelevant and that the raw Statcast data is representative and maybe even predictive as is.
  2. We won’t know if that’s the case unless we do the research.
  3. When we do the research, I expect those other factors will be found to matter. Context matters. The original conditions always matter.
  4. Given how quickly the DFS market is becoming more efficient, it’s likely that adjusted Statcast data will provide an edge to its early adopters.

I don’t know when the future’s coming, but it’s coming. For now, the unadjusted data will have to do. Fortunately for FantasyLabs subscribers, life is better with raw Statcast data than without it.

The Labyrinthian: 2017.33, 128

This is the 128th installment of The Labyrinthian, a series dedicated to exploring random fields of knowledge in order to give you unordinary theoretical, philosophical, strategic, and/or often rambling guidance on daily fantasy sports. Consult the introductory piece to the series for further explanation. Previous installments of The Labyrinthian can be accessed via my author page.
