Maximizing Baseball Run Creation

For our second project at the Metis Data Science Bootcamp, we were tasked with creating a Linear Regression model to make a prediction based on existing data.

The Topic

With free rein in our topic selection, I chose something I’m already passionate and knowledgeable about: baseball. My initial idea was to find how college stats would translate to major league performance. While scraping data, however, I discovered that there wasn’t enough publicly available college data to build a model. I shifted gears and used the much more robust dataset of MLB players’ batted ball profiles instead, building a model that predicts wRC+ from those profiles.

Glossary

Before I talk about my process, a quick glossary of some of the more technical baseball terms that I will be talking about:

wRC+ (Weighted Runs Created Plus): Measures the number of runs a player creates compared to all other players (weighted), normalized so that the average player has a wRC+ of 100. That normalization is what the plus in the stat stands for.

Exit Velocity: The speed at which the ball leaves the bat.

Launch Angle: The angle at which the ball leaves the bat.

Barrels: The number of times a player hit the ball with the optimal exit velocity and launch angle. We’ll primarily be looking at this as a percentage: barrels per batted ball event, where a batted ball event is any ball put into play, regardless of whether it resulted in a hit.
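To make the barrel percentage concrete, here is a minimal sketch of the arithmetic. The function name and the sample numbers are hypothetical and only illustrate the definition above.

```python
def barrel_rate(barrels: int, batted_ball_events: int) -> float:
    """Return barrels as a percentage of batted ball events."""
    if batted_ball_events <= 0:
        raise ValueError("batted_ball_events must be positive")
    return 100.0 * barrels / batted_ball_events


# Hypothetical hitter: 40 barrels on 400 balls put into play.
print(barrel_rate(40, 400))  # 10.0
```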

The Process

Since I was using a Linear Regression, I looked for continuous data to use for my model. I used data from FanGraphs and Statcast data from MLB’s own Baseball Savant.

First, I looked at contact, which was split three ways in each of three categories:

Direction   Strength   Type
Pull        Soft       Groundball
Center      Medium     Line Drive
Opposite    Hard       Flyball

I also looked at non-contact outcomes, which consisted of Walks and Strikeouts. I found that the variables with the highest correlation to wRC+ were Hard-Hit Balls, Home Runs Per Flyball, and Walk Rate. Since Hard-Hit Percentage is highly influenced by Exit Velocity, Launch Angle, and Barrels, I tested those factors in my model as well. I also included Line Drive Rate, as line drives are usually considered the most desirable type of hit in baseball. Additionally, with some feature engineering, I found that strikeouts had a big impact on my model.
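Ranking candidate variables by their correlation with wRC+ can be sketched roughly as follows. This assumes Pearson correlation, which the post does not specify. The helper names and the toy numbers are hypothetical and are not taken from the FanGraphs or Statcast data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(var_x * var_y)


def rank_by_correlation(features, target):
    """Sort feature names by absolute correlation with the target, strongest first."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy values for illustration only (one entry per player).
features = {
    "hard_hit_pct": [30.0, 35.0, 40.0, 45.0],
    "walk_pct": [6.0, 10.0, 8.0, 12.0],
}
wrc_plus = [80, 100, 120, 140]
for name, r in rank_by_correlation(features, wrc_plus):
    print(name, round(r, 3))
```

Sorting on the absolute value keeps strongly negative relationships near the top of the list, which matters for a variable like strikeouts.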

With these variables, I found that my model was fairly good at predicting a player’s wRC+, with an R² score of .703 and an average error of 9.89 wRC+ points, which is just under 10% of the league-average score of 100.
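For readers unfamiliar with the two metrics, here is a minimal sketch of how they are computed from actual and predicted values. It assumes "average error" means mean absolute error, which the post does not state. The numbers are made up and are not the model's output.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(actual, predicted):
    """Average size of the prediction error, ignoring its sign."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up wRC+ values for four hypothetical players.
actual = [80, 100, 120, 140]
predicted = [90, 95, 125, 130]
print(r_squared(actual, predicted))            # 0.875
print(mean_absolute_error(actual, predicted))  # 7.5
```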

The Outcome

Unsurprisingly, my research showed me that the most important factors in determining a hitter’s ability to create runs are:

  • Exit Velocity
  • Launch Angle
  • Home Runs Per Flyball
  • Line Drive Percentage
  • Walk Percentage
  • Strikeout Percentage

In fact, the only mild surprise is that Strikeout Percentage turned out to be so important in my model, since the modern wisdom says that strikeouts don’t matter much for hitters as long as they hit the ball hard and walk a lot.

Further Study

With more time and resources at my disposal, I’d look at the data on a more granular level, possibly down to individual pitches, and try more advanced regression techniques.

My GitHub Repo for this project

Street Team Placement for WomenTechWomenYes

As our first project at the Metis Data Science Bootcamp, we were assigned a fictional client, WomenTechWomenYes, and asked to optimize the placement of their street teams in order to maximize attendance at their gala.

The Client:

WomenTechWomenYes (WTWY) has an annual gala at the beginning of the summer each year. As we are new and inclusive organization, we try to do double duty with the gala both to fill our event space with individuals passionate about increasing the participation of women in technology, and to concurrently build awareness and reach.

To this end we place street teams at entrances to subway stations. The street teams collect email addresses and those who sign up are sent free tickets to our gala.

Where we’d like to solicit your engagement is […] to help us optimize the placement of our street teams, such that we can gather the most signatures, ideally from those who will attend the gala and contribute to our cause.

WomenTechWomenYes Client Email

The Process

As suggested, we looked at data from the MTA in New York to find which subway stations had the highest number of entries, since those stations would have the most traffic and therefore the best chance of collecting signatures.

We also looked at US Census Data to determine which boroughs of New York had populations who would be most likely to contribute to the cause.

MTA Data

In analyzing the MTA data, we decided to focus on the number of entries to each station. Given more time, we would have widened the parameters to include both entries and exits, but with only four days to complete the work, we limited ourselves to entries only.

The first thing we did was standardize our column names and remove any duplicate entries that existed for each turnstile.

Standardizing column names
Removing duplicate data
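The two cleanup steps could look roughly like the sketch below, written with plain Python dictionaries rather than the project's actual code. The column names (STATION, SCP, DATE, TIME, ENTRIES) and the stray whitespace around them are assumptions about the turnstile export, and both helper names are hypothetical.

```python
def standardize_columns(rows):
    """Strip whitespace from column names and lowercase them."""
    return [{key.strip().lower(): value for key, value in row.items()} for row in rows]


def drop_duplicate_readings(rows, key_columns):
    """Keep the first row seen for each combination of key columns."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Illustrative rows only; the second row repeats the first reading.
raw_rows = [
    {"STATION": "EXAMPLE ST", "SCP": "00-00-01", "DATE": "day 1", "TIME": "04:00", "ENTRIES ": 1000},
    {"STATION": "EXAMPLE ST", "SCP": "00-00-01", "DATE": "day 1", "TIME": "04:00", "ENTRIES ": 1000},
    {"STATION": "EXAMPLE ST", "SCP": "00-00-01", "DATE": "day 1", "TIME": "08:00", "ENTRIES ": 1040},
]
rows = standardize_columns(raw_rows)
rows = drop_duplicate_readings(rows, ["station", "scp", "date", "time"])
print(len(rows))  # 2
```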

As the data we worked from listed lifetime entries for each turnstile, we had to use that data to find the number of entries for each block of time. After that, we found that some of the data seemed to fall outside of the correct scope, including some stations with a negative number of entries per day (impossible) and some with over a million per day (not technically impossible, but incredibly unlikely). We removed all of the negative entries as well as any that were more than 3 standard deviations from the mean so that our data wouldn’t be tainted with outliers.
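Turning the running totals into per-interval counts and then filtering them can be sketched as follows. The function names are hypothetical. Computing the mean and standard deviation after the negative values are dropped is an assumption, since the post does not say in which order the two filters were applied.

```python
from statistics import mean, stdev


def interval_entries(cumulative_counts):
    """Turn one turnstile's running total into per-interval entry counts."""
    return [later - earlier
            for earlier, later in zip(cumulative_counts, cumulative_counts[1:])]


def drop_outliers(values, max_sigma=3.0):
    """Drop negative values, then anything beyond max_sigma standard deviations."""
    non_negative = [v for v in values if v >= 0]
    if len(non_negative) < 2:
        return non_negative  # not enough data to estimate a spread
    center = mean(non_negative)
    spread = stdev(non_negative)
    if spread == 0:
        return non_negative  # all values identical, nothing to drop
    return [v for v in non_negative if abs(v - center) <= max_sigma * spread]


# A counter that goes backwards produces a negative difference.
print(interval_entries([10, 25, 45, 40]))  # [15, 20, -5]

# Twenty ordinary readings, one negative value, one implausibly large one.
counts = [100] * 20 + [-50, 1_000_000]
print(len(drop_outliers(counts)))  # 20
```

With only a handful of readings, no single value can sit more than 3 standard deviations from the mean, so a filter like this only bites on reasonably large samples.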

When plotting this, we ended up with this mess:

Out of which we pulled the top ten stations to get:

We also used the top ten station data to find the times of highest activity:
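Both rankings above come down to summing entries by a key (station or time block) and keeping the largest totals. A minimal sketch, assuming each record is a dictionary with hypothetical "station", "time", and "entries" fields and using made-up counts:

```python
from collections import Counter


def top_keys(records, key, n=10):
    """Sum entries per key and return the n largest (key, total) pairs."""
    totals = Counter()
    for record in records:
        totals[record[key]] += record["entries"]
    return totals.most_common(n)


# Made-up records for illustration only.
records = [
    {"station": "Station A", "time": "16:00", "entries": 5},
    {"station": "Station B", "time": "20:00", "entries": 7},
    {"station": "Station A", "time": "20:00", "entries": 4},
]
print(top_keys(records, "station", n=1))  # [('Station A', 9)]
print(top_keys(records, "time", n=1))     # [('20:00', 11)]
```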

We could have simply stopped here and recommended the top ten stations at 6pm. But this is where we decided to add…

Census Data

Once we had our MTA data sorted out and easily identifiable, we pulled in data from the US Census to help determine where to place the street teams by borough. Since the census data was pretty messy, a lot of time went into cleaning this up, including transposing it in order to have our borough values represented by rows instead of columns.
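The transposing step can be illustrated with a small sketch. It assumes the raw table is keyed by metric with one column per area. The helper name, the metric and area labels, and the values are placeholders, not census figures.

```python
def transpose(table):
    """Swap rows and columns of a nested dict: table[row][col] -> out[col][row]."""
    transposed = {}
    for row_key, row in table.items():
        for col_key, value in row.items():
            transposed.setdefault(col_key, {})[row_key] = value
    return transposed


# Placeholder layout: one row per metric, one column per area.
by_metric = {
    "metric_1": {"Area A": 1, "Area B": 2},
    "metric_2": {"Area A": 3, "Area B": 4},
}
by_area = transpose(by_metric)
print(by_area["Area A"])  # {'metric_1': 1, 'metric_2': 3}
```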

Once we had our data in a readable format, we looked at which boroughs had a higher concentration of the following things:

  1. Women per Square Mile
  2. Female-Owned Firms per Square Mile
    • We decided to focus on these two areas as locations with more female traffic as well as more businesses with female owners would be more likely to be interested in the mission of WTWY to increase the participation of women in technology.
  3. Median Annual Income
    • Areas with a higher median annual income were targeted as higher-income people are more likely to donate to the cause.
  4. Homes with Broadband
    • Areas with more broadband-equipped homes were targeted as broadband usage is often associated with a higher-tech base of people, which is what we were targeting.
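Comparing areas on these criteria amounts to converting raw counts into per-square-mile densities and then ranking on each metric. The sketch below shows one way to do that. The helper names and field names are hypothetical, and the numbers are placeholders rather than real census values.

```python
def add_density(areas, count_key, density_key, area_key="land_sq_mi"):
    """Add a per-square-mile version of a raw count to every area (in place)."""
    for stats in areas.values():
        stats[density_key] = stats[count_key] / stats[area_key]


def rank_areas(areas, metric):
    """Return area names ordered from highest to lowest on one metric."""
    return sorted(areas, key=lambda name: areas[name][metric], reverse=True)


# Placeholder numbers for illustration only, not real census figures.
areas = {
    "Area A": {"women": 800_000, "land_sq_mi": 20.0, "median_income": 75_000},
    "Area B": {"women": 1_200_000, "land_sq_mi": 100.0, "median_income": 60_000},
    "Area C": {"women": 1_300_000, "land_sq_mi": 70.0, "median_income": 55_000},
}
add_density(areas, "women", "women_per_sq_mi")
print(rank_areas(areas, "women_per_sq_mi"))  # ['Area A', 'Area C', 'Area B']
print(rank_areas(areas, "median_income"))    # ['Area A', 'Area B', 'Area C']
```

Dividing by land area keeps a large but sparsely populated area from outranking a small, dense one on raw counts alone.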

Unsurprisingly, we found that Manhattan ranked highest on all of our criteria, so we decided that WTWY should focus its primary efforts there. However, in the spirit of inclusiveness mentioned in the initial client email, we also decided to deploy teams to Queens, which was third in income and second in broadband usage, as well as Brooklyn, which was second in both women and female-owned firms per square mile.

The Recommendation

We advised WTWY to place their street teams in the following stations:

  • Manhattan
    • 34th St.–Penn Station
    • 23rd St.
    • Grand Central Station
    • 34th St.–Herald Square
    • Union Square
  • Queens
    • Flushing–Main Street
    • Jackson Heights–Roosevelt Avenue
  • Brooklyn
    • Atlantic Avenue–Barclays Center

To optimize the traffic, we recommended deploying the teams between 5 and 9 p.m. Tuesday through Friday.

Further Study

With more time, the analysis could be extended to include exits in addition to entries to better target certain stations. We could also analyze each individual station to craft more specific day and time recommendations.