
Data Engineering In the NBA and Basketball World

Reading Guide:

Estimated Read Time: 15 minutes

  1. How I Got Started // How I Got Introduced To Data Engineering in Basketball
  2. Data Engineering in the NBA // The NBA Is Also A Tech Company
  3. What Is Data Like In the NCAA // My Struggles with Data in the NCAA
  4. Conclusion // My Concluding Thoughts

How I Got Started

In the second semester of freshman year I was watching less basketball than I wanted to be watching. Since I knew the NBA stored a video of every play, I figured a good project for the upcoming summer would be a program that downloads NBA play-by-play videos, so I could get a replay of the last 5 minutes of any game while skipping needless time stoppages. Heading into this project I had ZERO experience with web scraping, data processing, or really any real Python programming beyond the very basics of loops and lists. The only thing I knew going in was that Python could automate something a human can do manually, and since I could manually download the videos and clip them together, I just needed to translate my steps into code with the help of Google and I would have a working program. Oh, I also did not know that this project was restricted by the NBA (more about that later).

Soon after I coded my first couple of lines, I realized I had a problem: how would I find the game I wanted to pull without pasting in a link to the game? Inserting a link killed the point of the automation, because the only thing I wanted to do manually was choose the game I wanted to watch. This was my soft introduction to data engineering, where I started interacting with how the front end organizes data (the front end is what the end user sees, while the back end is what the software engineer works with behind the scenes). My next issue with this project was server interactions. When I started, I thought I could just work with the code visible in a website's inspect-element panel; I did not know that back-end servers were a thing. At that point I was locked out of the NBA's servers and had no access to the videos in the play-by-play log.

After two weeks of Googling, I discovered a GitHub package called nba_api, and at first I did not think much of it. I read through its documentation, and unfortunately the package did not cover my use case, but looking at its code gave me a place to start. From that step forward, extracting the videos was smooth sailing, except for one slight hiccup. The NBA website does not permit web scrapers, and it enforces this rule with a filter that blocks a bot making more than 10 requests per second; however, I was in too deep and wanted to finish my project, so I worked around the filter by simply adding a delay between my requests (so if someone from the NBA is by some chance reading this, here is a bug that can be fixed).

Once I pulled the videos, I faced a problem that was really a data engineering problem (though I did not know the depth of it at the time): how was I going to store and reliably deal with every play from every night? I decided my best bet was to take an old laptop with a lot of storage and 8GB of RAM and put Ubuntu on it so I would have about 7GB of available RAM (I was basically inventing my own Great Value version of a server, but I did not know what a server was back then). The goal of this configuration was to have enough RAM and storage to process each game individually in a program that stitched the videos together and saved the game to a folder. This setup worked fine, but after a couple of days of using this "application" I grew sick of the computer I was using, since it was so slow and had the loudest fan of all time (even louder than a PS4 fan!!!). I wanted to be able to access the videos from any device, so I had my script upload them to Google Drive, and from then on I was enjoying the videos from the convenience of my phone. At that stage my next thought was "Wow, I could literally make money off this if I put the videos on a website with a paywall, like Patreon," so my next step was to have the program push the videos from Google Drive to Google Sites, which worked. In the end, though, I got a little paranoid about landing in serious legal trouble if I made the project public, since the NBA clearly says "This content may not be copied or distributed without written consent of the NBA." I sent an email to a representative at the NBA asking for permission, and of course they said no.
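
For the curious, here is a minimal sketch of that throttling trick. The URL and parameter names below are placeholders, not the real NBA endpoint, and again, scraping the real thing requires the NBA's written consent.

```python
import time
import requests

# Placeholder URL and parameter names for illustration only.
VIDEO_URL = "https://example.com/nba/video"

def download_clips(game_id, event_ids, delay_seconds=1.0):
    """Fetch one clip per play-by-play event, pausing between requests so the
    scraper stays well under any per-second request limit."""
    clips = []
    for event_id in event_ids:
        resp = requests.get(
            VIDEO_URL,
            params={"GameID": game_id, "GameEventID": event_id},
            headers={"User-Agent": "Mozilla/5.0"},  # stats servers often reject default agents
            timeout=10,
        )
        resp.raise_for_status()
        clips.append(resp.content)
        time.sleep(delay_seconds)  # the "loophole": spread requests out over time
    return clips
```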

When I created that project, it was done without any consideration for resumes, computer science principles, or any of that; it was simply a project with a clear goal: let me watch NBA game replays quickly. With my current experience, however, I can look back and see how I engaged with almost every stage of data engineering. I was pulling data, transforming it, and storing it, while making design decisions about how the data would be stored, and then I moved the data from storage into live use. I inadvertently applied the concept of a data lake when I stored every raw video in a specific folder, and the concept of a data warehouse when I stored the data ready for the end user on Google Drive. Of course I did not use the proper tooling used by data engineers, but I kind of created my own tools as necessary.

Two things happened during this project that opened my eyes to the complexity of data in the NBA. The first was that, while Googling solutions to problems that came up, I stumbled across someone's project where they tracked their own body movement while playing pickup basketball. The second was my accidental data lake (I did not know that was what I was building at the time): since I had saved a video of every play, I had what one would call a VERY MASSIVE sample size. Both events happened by chance, but it was clear what my next project was going to be. I wanted to track the movement of NBA players to generate tracking data. What was I going to do with it? I didn't know; I just wanted to track players in videos. So I did what carried me through my old project: I Googled my way into learning OpenCV in Python and created my first tracker. This tracker had a ton of issues. It got confused when teammates with the same skin color crossed each other, it could not track the ball, and the biggest issue of all was that its coordinates were slightly wrong because of the shifts and cuts in broadcast camera angles. I looked for many ways to deal with these issues, and after a lot of Googling I learned that this is actually an ongoing problem, and whoever solves it stands to make a lot of money, to the tune of $500,000 per NBA team. That same Googling also led me to learn more about how motion tracking is used to record and fully automate the NBA's data collection process.
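
To give a feel for how primitive that first tracker was, here is a stripped-down sketch of the same idea: threshold one team's jersey color and treat contour centroids as player positions. The clip name and HSV bounds are placeholders you would have to tune per broadcast.

```python
import cv2
import numpy as np

# "clip.mp4" and the HSV bounds are placeholders; tune them for the broadcast.
cap = cv2.VideoCapture("clip.mp4")
lower, upper = np.array([100, 120, 70]), np.array([130, 255, 255])  # blue-ish jerseys

positions = []   # (frame_index, x, y) in pixel coordinates, not court coordinates
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, lower, upper)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        if cv2.contourArea(c) < 200:           # ignore small specks of color
            continue
        m = cv2.moments(c)
        positions.append((frame_idx, m["m10"] / m["m00"], m["m01"] / m["m00"]))
    frame_idx += 1
cap.release()
```

You can already see the failure modes I mentioned: nothing here handles camera cuts, two same-colored players merging into one blob, or the ball.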

Data Engineering In the NBA

If you go on the NBA stats website, you can see data about literally anything. Every single thing that happens is now on the stat sheet. If a player dove onto the floor for a loose ball, that is on the stat sheet; if a player boxed out but never got the rebound, everyone can now see who boxed out whom, how many times, and whether their team got the rebound. But let us back up and ask how this data even gets collected. Is there someone recording every single observation for every player? Nope. There are actually 8 cameras recording every game, and they work together, applying machine learning to identify every player and record their position on the court. I am not sure exactly how each player is recognized, but I assume it is done through a combination of facial recognition algorithms and some Bayesian-style probability to infer the common-sense answer of who the player is. I would think that an algorithm built to detect the players on the court takes as inputs the players in the game, along with their faces and heights, and combines those features to determine who each player is. Once the system identifies a player, it has to associate their actions with them at a rate of 25 frames per second. That means for every second, 25 data points are added describing where the player is on the court and what is going on around them. This data is initially captured by a third-party company and then distributed to the NBA. Once the NBA has this data, their team of data engineers has to do magic.
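
Before getting into that magic, a quick back-of-the-envelope calculation shows how much data piles up per game, assuming 25 frames per second, ten players plus the ball, and only the 48 minutes on the game clock:

```python
# Back-of-the-envelope row count for one game of tracking data.
FPS = 25                   # tracking frames per second
TRACKED_OBJECTS = 10 + 1   # ten players on the floor plus the ball
GAME_SECONDS = 48 * 60     # regulation game clock only, ignoring stoppages

rows_per_game = FPS * TRACKED_OBJECTS * GAME_SECONDS
print(rows_per_game)       # 792000 position records, before any stoppage time
```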

The magic done by the NBA's engineers revolves around intense feature engineering and data transformations. First the data is cleaned of errors: the cameras might have lapses where they mistrack a player and cannot follow him for a couple of frames, or the ball data gets lost during actions like a crossover (at least those were the errors I faced in my own programs). After cleaning the data, the engineers have to go through the painstaking process of engineering features out of the raw events. This is done by creating labels for different events. For example, if the ball's proximity moves from a player on Team A to a player on Team B for more than a certain amount of time, then possession has switched; if the ball leaves a player's hand and rises in height, that is a shot; and if the ball then goes through the rim it is a make, otherwise it is a miss. One detail is missing from the process described above: how can an analyst standardize this code so it works every night without having to create a new process for each game? That is what pipelining means here, since the goal is to have data flow from the cameras straight to the individual teams and consumers without manual interference; ideally the only manual work left is data validation and maintenance.
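
As a toy illustration of that labeling step, a "shot" flag might look something like the sketch below. The column names (ball_z, dist_to_nearest_player) and thresholds are my own invention, not the NBA's actual schema.

```python
import pandas as pd

def label_shot_frames(ball: pd.DataFrame,
                      release_dist_ft: float = 3.0,
                      min_rise_ft: float = 0.1) -> pd.Series:
    """Flag a frame as a shot attempt when the ball has left a player's
    proximity and is gaining height. Columns are assumed, not official."""
    rising = ball["ball_z"].diff() > min_rise_ft                  # ball rising frame to frame
    released = ball["dist_to_nearest_player"] > release_dist_ft   # nobody holding it
    return rising & released
```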

Standardizing the process that turns camera data into useful insights starts with pulling reference data about the players, for example which team they are on. From there, basic labels can be defined, such as a pass, which might be defined as the ball changing proximity from player A to player B where both players are on the same team. After the typical labels are in place, an analyst needs to start identifying more advanced patterns, and that can be done with classifier neural networks trained to recognize patterns according to a labeled sample. The trained networks can then be applied to the raw camera data to label actions such as pick and rolls and isolation plays. Once that data is processed, it is sent to a data lake (just a place where all of this processed data is stored), where it is organized so it can be queried and sliced into different views. For example, when you go to the NBA Stats website you can see data presented by team, month, player, and many other categories; that is front-end design doing a good job of hiding all the complexity going on in the back end.
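
Here is a rough sketch of both steps, a rule-based pass label plus a small classifier trained on possessions an analyst has already labeled. The column names and feature table are assumptions for illustration, not anything the NBA has published.

```python
import pandas as pd
from sklearn.neural_network import MLPClassifier

def label_passes(frames: pd.DataFrame) -> pd.Series:
    """Rule-based label: the ball moves between two players on the same team.
    nearest_player_id / nearest_player_team are assumed column names."""
    same_team = frames["nearest_player_team"] == frames["nearest_player_team"].shift()
    new_holder = frames["nearest_player_id"] != frames["nearest_player_id"].shift()
    return same_team & new_holder

def train_play_type_model(X: pd.DataFrame, y: pd.Series) -> MLPClassifier:
    """Train a small neural network on hand-labeled possessions
    (pick and roll, isolation, ...) so it can tag new tracking data."""
    model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
    model.fit(X, y)
    return model
```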

However, motion tracking data is not the only data processed after a game; the NBA also standardizes play-by-play data. In its raw form, play-by-play data simply records events in a game such as a turnover, an FGM, a substitution, and all the other basic things you would expect. Even that data is pipelined and transformed to produce additional metrics like offensive rating and on-off stats.
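
As a concrete example of one of those derived metrics, offensive rating can be estimated from play-by-play aggregates using Dean Oliver's possession approximation:

```python
def estimate_possessions(fga: float, fta: float, orb: float, tov: float) -> float:
    """Dean Oliver's common possession estimate."""
    return fga + 0.44 * fta - orb + tov

def offensive_rating(pts: float, fga: float, fta: float, orb: float, tov: float) -> float:
    """Points scored per 100 estimated possessions."""
    return 100.0 * pts / estimate_possessions(fga, fta, orb, tov)

# Example: 110 points on 88 FGA, 22 FTA, 10 ORB, 13 TOV -> about 109 per 100 possessions
print(round(offensive_rating(110, 88, 22, 10, 13), 1))
```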

The entire process above covers only some of the data processing that happens at the league level, and says nothing about the processing done on a team-by-team basis. Each team has its own staff of data engineers and data scientists who analyze data for their own roster as well as for other teams. For example, the Utah Jazz had a staff member who created the most advanced publicly available RPM-style metric, EPM, found on Dunks and Threes. I do not know or understand the math that went into creating that metric, but at its base it is built off the play-by-play data supplied by the NBA. Other projects by individual teams include the San Antonio Spurs using raw tracking data to identify which play would lead to the largest increase in points based on player positioning, and the Raptors building a model aimed at identifying where defenders need to be to minimize the points given up on defense.

Even the section above does not go into enough detail and does not do justice to the work done in the NBA; it only describes the tip of the iceberg. It does not address the data used when evaluating rookies or pricing players, and it leaves out many other nuances that I do not even understand myself. While the NBA is not revolutionizing data processing and storage the way companies like Amazon, Microsoft, Google, and Airbnb are, the league is a top consumer of top-of-the-line products and has one of the most modernized and seamless data storage setups, one that a lot of companies could learn from (I'M LOOKING AT YOU, HEALTH INSURANCE COMPANIES).

What Is Data Like In the NCAA?

Last summer I learned a lot about data in basketball from my projects, from extensive reading of papers presented at the MIT Sloan Sports Analytics Conference, and from books written on the topic, so I decided to try my luck and shoot an e-mail to the Texas A&M Men's Basketball coaching staff. In the e-mail I wrote about how I found basketball analytics interesting and wanted to volunteer to help the team in any way I could, and I attached a portfolio of my previous work. I received a reply asking me to present a short project showing what I could offer. At the time I still had a shallow understanding of machine learning, so I decided to use Basketball Reference data to build a project clustering players. That shallow understanding led to a project that was not very insightful, and I do not think it showed the staff much value, but it was enough that they did not tell me to stop reaching out.
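
For anyone curious what that first attempt roughly looked like, here is a minimal k-means sketch. The file name and column list are placeholders for whatever per-game table you scrape from Basketball Reference.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# "players.csv" and the column names are placeholders for the scraped table.
stats = pd.read_csv("players.csv")
features = stats[["PTS", "TRB", "AST", "3PA", "FG%"]].fillna(0)

X = StandardScaler().fit_transform(features)                 # put columns on one scale
stats["cluster"] = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)
print(stats.groupby("cluster")[["PTS", "TRB", "AST"]].mean())  # rough "archetype" profiles
```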

After that interview, I was in on-and-off contact with the staff, during which I developed my shot tracker program (some readers here will know what I am talking about). That project was rejected, the code is just lying around, and I will probably improve it this summer; anyway, that is beyond the scope of this article. My first real assignment from the coaching staff was to build a model that evaluates transfer players, since our team relies heavily on transfers: four of our five starters are transfer players. I was assigned this project in October 2021, and as of February 2022 I am still not done with it, but the experience has taught me a lot about the state of data in college basketball.

For my project, the staff chose to share zero data with me (they pay a lot for that data and I am not a salaried employee, so I cannot argue with their decision), and this was when I had to take data pipelining seriously and became aware of the state of data in the NCAA. As basketball fans, when predicting the usefulness of a transfer player we can all think of the obvious variables that need to be considered, such as how the player fits and how they will handle the jump in intensity of play. Figuring out how a player fits requires data that captures as many aspects of the player as possible: their tendencies, their play style, their growth over time, the situation at their old team, and the situation at the new team, and for that you need a lot of data. That data is unfortunately not available at anything like the NBA's scale, and for someone wanting to build a machine learning model (which was what I wanted to do), I had nothing to pull other than Basketball Reference data. That data is bad, and bad data produces bad models.

As a result, the project I was assigned turned from a data science project into a data engineering project, because I had to work so hard on automating my data collection and storage. Among the things I essentially had to build from scratch is a database of play-by-play logs with the raw text parsed into useful data points (by no means is my database professional or production quality). The primary insights from that play-by-play database come from aggregation, where you can see how players and teams perform in certain game situations (i.e., are they good scorers in the half court, or do they get their points on the fast break?). The challenges I run into automating this data are honestly tedious and stupid, but someone (me) needs to collect useful data; a lot of the testing I do revolves around accounting for the many ways a player's name can be spelled. I also had to create my own Python package that scrapes data from Basketball Reference and transforms it into useful metrics. The data I was able to pull and clean from Basketball Reference was then used to train a low-complexity model (just so I could see how useful the metrics I gathered were), and I got a model that reported 80% accuracy (in reality it would run at maybe 60% in my estimate, since the confusion matrix looked quite ugly).
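
Here is the kind of name-normalization helper I mean; the suffix list and rules are just what my data needed, not a general solution.

```python
import re
import unicodedata

# Collapse the many spellings of a player's name into a single join key.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}

def normalize_name(name: str) -> str:
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()  # strip accents
    name = re.sub(r"[^\w\s]", "", name).lower()                                    # drop punctuation
    return " ".join(t for t in name.split() if t not in SUFFIXES)                  # drop Jr./III etc.

assert normalize_name("Luka Dončić") == normalize_name("Luka Doncic")
```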

One main task of data engineering is making sure your pipeline (basically the process of pulling data, making it useful, and then storing it) runs smoothly without errors. One day, while running a check on my process, I noticed that Basketball Reference had restructured their website. This caused major errors in my data: I could not pull around half of it without failures. That day erased two months of progress, because the data had shifted in a way I could not predict and I could not figure out how to pull it anymore. So I decided to be a glass-half-full type of guy and redesign the pipeline from scratch, applying the lessons from my mistakes over those two months. As an example, the way I designed my initial script, it took around 25 minutes to run, since it was just different tasks piled on top of each other. My current design uses intermediate staging points instead. I first separate the types of data, so team-based data, player-based data, and play-by-play data are all handled separately. I pull the data, lightly clean it, and save it to a data lake (a data lake is just a place on a server where the data is stored); I then pull that saved data and do feature engineering (feature engineering, in basketball terms, is looking at the available stats and working out how to make them more useful). I do that for every type of data, and hopefully once all my data pipelines are finalized I can start working on the machine learning side of this project.
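
The shape of the redesigned pipeline looks roughly like this; the paths and the scrape_team_stats/build_team_features functions are stand-ins for my own code, not a real package.

```python
from pathlib import Path
import pandas as pd

LAKE = Path("data_lake")  # the "data lake": just folders of parquet files

def scrape_team_stats(season: str) -> pd.DataFrame:
    """Placeholder for my actual scraper; returns a raw team table."""
    ...

def build_team_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Placeholder for the feature-engineering step."""
    ...

def run_team_stage(season: str) -> pd.DataFrame:
    """Pull and lightly clean once, park the result in the lake, then feature-engineer,
    so a failure in one stage never forces the whole 25-minute job to rerun."""
    raw_path = LAKE / "raw" / f"team_{season}.parquet"
    if not raw_path.exists():                  # stage 1: pull and lightly clean, once
        raw = scrape_team_stats(season).dropna(how="all")
        raw_path.parent.mkdir(parents=True, exist_ok=True)
        raw.to_parquet(raw_path)
    raw = pd.read_parquet(raw_path)            # stage 2: reload from the lake
    return build_team_features(raw)            # stage 3: feature engineering
```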

The entire description above really comes down to my frustration with the extreme lack of publicly available data on college basketball. For a sport with so much gambling going on, there sure is not enough public data out there; maybe I am overestimating the demand for college basketball data. This is not to say there are no useful data providers, such as the legendary KenPom and Evan Miyakawa (check out his website, it is amazing). There is also Shot Quality, who run a service in partnership with college teams analyzing play-by-play data and shot tracking to estimate shot quality, and who also have a paywalled section of their website designed for fans. Another company that works with college teams is HDI; from what I have seen they have a top-of-the-line service compared to everyone else, with very well built data backends, and their clients seem to have benefited after partnering with them (and no, I do not work there nor have any association with them; I am just stating my observation as a neutral bystander).

My goal in this section was to describe what working with data in the NCAA is like, and I hope it was useful.

Conclusion

Charles Barkley might read this and say, "OK, so what? Why do we even need all this data? If someone knows basketball, they do not need all those models to make decisions; they can just use their intuition." People who echo Charles's viewpoint are not wrong, and the analytics community tends to look down on them snobbily. They have a valid point: what is the value of an algorithm that creates defensive schemes if a coach can do the same? My opinion is that data processing and algorithms drive value by increasing productivity. For example, the NBA's play-by-play log that I scraped to create my game replays can also be used to build a queryable search engine for whatever film needs to be evaluated. Meanwhile, an algorithm designed to read and understand defensive schemes can be applied in a coaching environment to analyze every game involving an upcoming opponent, since the scouting department can only watch so much film, but a program built by data scientists can evaluate how the opposing team played in every game and how their tendencies changed under certain conditions. For example, if I am a scout for the OKC Thunder with a game coming up against the Warriors, and I want to scheme a defense to contain them, I would want to watch what the Warriors are up to, but I would also want to know how their schemes change when they face lineups with a solid perimeter defender or a 1-through-5 switchable lineup. I could never watch enough tape to answer any of that, but a program could generate a report that does, and that multiplies the output of the scouting department.

In terms of data engineering and data science, I think one thing we will see in the near future is skyrocketing demand for data engineers. For example, NASA still has not digitized over 1,000 documents from the Apollo missions, and NASA has ambitions to put a man on the Moon as soon as 2028 while also building a sustainable presence there; documents from the original Apollo missions would certainly be useful, so someone with the skill set to transform this data can deliver a lot of value. This does not just apply to NASA: many companies and organizations that were not built on technology as a service do not have the data infrastructure they need to start using data to drive decisions and increase productivity. The NBA was very clever and strategic in how it brought data into the organization, and it has reaped the rewards of that forward thinking; NBA teams get to use the best and most recent breakthroughs in ML to improve the quality of the league and its product.

Appreciate you sticking around to read this and have a good day!
