A team of researchers at the Penn State College of Information Sciences and Technology has developed a machine learning model that better measures baseball players’ and teams’ short- and long-term performance. The new method was evaluated against existing statistical analysis methods, collectively known as sabermetrics.
The research was presented in a paper titled “Using Machine Learning to Describe How Players Impact the Game in the MLB.”
Building on NLP and Computer Vision
The team’s approach relies on recent advances in natural language processing and computer vision, and it could have significant implications for how a player’s impact on the game is measured.
Connor Heaton, a doctoral candidate in the College of IST, says that the existing family of methods relies on the number of times a player or team achieves a discrete event, such as hitting a home run. These methods fail to consider the context of each action.
“Think about a scenario in which a player recorded a single in his last plate appearance,” said Heaton. “He could have hit a dribbler down the third base line, advancing a runner from first to second as he beat the throw to first, or hit a ball to deep left field and reached first base comfortably but lacked the speed to push for a double. Describing both situations as resulting in ‘a single’ is accurate but does not tell the whole story.”
The New Model
Heaton’s model relies on learning the meaning of in-game events, which is based on the impact they have on the game and their context. The model then views the game as a sequence of events to output numerical representations of how players impact the game.
“We often talk about baseball in terms of ‘this player had two singles and a double yesterday’ or ‘he went one for four,’” said Heaton. “A lot of the ways in which we talk about the game just summarize the events with one summary statistic. Our work is trying to take a more holistic picture of the game and to get a more nuanced, computational description of how players impact the game.”
The new method leverages sequential modeling techniques in NLP to enable computers to learn the meaning of different words. Heaton used this to teach his model the meaning of events in the baseball game, such as a batter hitting a single. The game was then modeled as a sequence of events.
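The paper’s actual architecture is not detailed here, but the distributional idea borrowed from NLP (an event’s meaning comes from the company it keeps) can be sketched in miniature. The event names, window size, and count-based vectors below are illustrative stand-ins, not the researchers’ model:

```python
from collections import Counter, defaultdict

# Hypothetical event tokens; the real vocabulary is far richer.
games = [
    ["single", "steal_2b", "strikeout", "double", "flyout"],
    ["walk", "single", "home_run", "strikeout"],
    ["strikeout", "single", "steal_2b", "double"],
]

def cooccurrence_vectors(sequences, window=2):
    """Represent each event by the distribution of events seen near it.

    This is a toy distributional-semantics sketch (raw counts, not a
    trained neural model): events occurring in similar contexts end up
    with similar vectors, which is the intuition behind the NLP
    techniques the researchers adapt to baseball.
    """
    ctx = defaultdict(Counter)
    for seq in sequences:
        for i, ev in enumerate(seq):
            for j in range(max(0, i - window), min(len(seq), i + window + 1)):
                if j != i:
                    ctx[ev][seq[j]] += 1
    return ctx

vectors = cooccurrence_vectors(games)
# "single" is characterized by what tends to happen around it:
print(vectors["single"].most_common(3))
```

In a trained model these count vectors would be replaced by learned embeddings, but the principle is the same: the representation of “single” is shaped by its surrounding events rather than treated as an isolated tally.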
“The impact of this work is the framework that is proposed for what I like to call ‘interrogating the game,’” Heaton said. “We’re viewing it as a sequence in this whole computational scaffolding to model a game.”
The model can describe a player’s influence on the game over the short term and, when combined with traditional methods, can predict the winner of a game with over 59% accuracy.
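How learned game representations might be combined with traditional statistics for win prediction can be illustrated with a toy linear score. Every feature name, value, and weight below is invented for illustration; the paper does not specify its classifier here:

```python
# Hypothetical sketch: concatenate a learned game embedding with classic
# box-score features, then score the difference between the two teams.

def combine_features(learned_vec, traditional_stats):
    """Concatenate a learned embedding with traditional features."""
    return learned_vec + traditional_stats

# Illustrative values: [embedding dims...] + [OBP, runs per game]
home = combine_features([0.12, -0.4], [0.310, 4.2])
away = combine_features([-0.05, 0.2], [0.290, 3.8])

# A fixed linear score stands in for a trained model; these weights
# are made up for the example.
weights = [1.0, -0.5, 3.0, 0.25]
diff = [h - a for h, a in zip(home, away)]
score = sum(w * d for w, d in zip(weights, diff))
prediction = "home" if score > 0 else "away"
```

A real system would learn the weights from historical outcomes; the point of the sketch is only that the learned and traditional feature sets enter the classifier side by side.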
Training the Model
The researchers trained their model by using data previously collected from systems installed at major league baseball stadiums. These systems track detailed information for each pitch, including player positioning, base occupancy, and pitch velocity. Two types of data were used. The first was pitch-by-pitch data, which helped analyze information like pitch type. The second was season-by-season data, used to investigate position-specific information.
Each pitch within the collected dataset carried three key identifiers: the specific game, the at-bat number within the game, and the pitch number within the at-bat. Together, these enabled the researchers to reconstruct the sequence of events that make up an MLB game.
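Reconstructing a per-game event sequence from those three identifiers amounts to a sort followed by a group-by. The field names and events below are assumptions for illustration:

```python
from itertools import groupby

# Toy pitch records keyed by the three identifiers the article describes.
pitches = [
    {"game_id": "G1", "at_bat": 2, "pitch_num": 1, "event": "ball"},
    {"game_id": "G1", "at_bat": 1, "pitch_num": 2, "event": "single"},
    {"game_id": "G1", "at_bat": 1, "pitch_num": 1, "event": "strike"},
]

# Sort by (game, at-bat, pitch) to recover the true in-game order,
# then group consecutive records by game.
pitches.sort(key=lambda p: (p["game_id"], p["at_bat"], p["pitch_num"]))
games = {
    gid: [p["event"] for p in grp]
    for gid, grp in groupby(pitches, key=lambda p: p["game_id"])
}
print(games["G1"])  # ['strike', 'single', 'ball']
```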
To describe what happened on each play, how it happened, and who was involved, the team identified 325 possible game changes that could occur when a pitch is thrown. These were then combined with the existing data, and missing player records were imputed.
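One plausible way such a discrete vocabulary arises is as the cross product of a few per-pitch state components. The components below are hypothetical (they yield 24 toy tokens, not the paper’s 325), but they show how an observed change can map to a single token id:

```python
from itertools import product

# Hypothetical components of a per-pitch state change; the real
# 325-entry vocabulary is derived from detailed tracking data.
pitch_results = ["ball", "strike", "in_play_out", "in_play_hit"]
base_changes = ["none", "runner_advance", "runner_out"]
score_changes = ["no_run", "run_scored"]

# Enumerate every combination and assign each a token id.
vocab = {
    "|".join(parts): idx
    for idx, parts in enumerate(product(pitch_results, base_changes, score_changes))
}

def encode(result, bases, score):
    """Map one pitch's observed change to its discrete token id."""
    return vocab[f"{result}|{bases}|{score}"]

token = encode("in_play_hit", "runner_advance", "run_scored")
```

Once every pitch is encoded this way, a game becomes a sequence of token ids, exactly the form that NLP-style sequence models consume.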
Prasenjit Mitra, professor of information sciences and technology, is a co-author of the paper.
“This work has the potential to significantly advance the state of the art in sabermetrics,” said Mitra. “To the best of our knowledge, ours is the first to capture and represent a nuanced state of the game and utilize this information as the context to evaluate the individual events that are counted by traditional statistics — for example, by automatically building a model that understands key moments and clutch events.”