Analysis of three datasets related to the effects that playing on synthetic turf versus natural turf can have on player movements and the factors that may contribute to lower extremity injuries.
The data provided for analysis are 250 complete player in-game histories from two consecutive NFL regular seasons. Three different files in .csv format are provided, documenting injuries, player-plays, and player movement during plays:
Injury Record: The injury record file in .csv format contains information on 105 lower-limb injuries that occurred during regular season games over the two seasons. Injuries can be linked to specific records in a player’s history using the PlayerKey, GameID, and PlayKey fields.
Play List: The play list file contains the details for the 267,005 player-plays that make up the dataset. Each play is indexed by the PlayerKey, GameID, and PlayKey fields. Details about the game and play include the player’s assigned roster position, stadium type, field type, weather, play type, position for the play, and position group.
Player Track Data: player level data that describes the location, orientation, speed, and direction of each player during a play recorded at 10 Hz (i.e. 10 observations recorded per second).
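The linkage between the three files described above can be sketched with a small join. This is only an illustration on made-up rows: the key formats and values below are assumptions, not records from the dataset.

```python
import pandas as pd

# Toy stand-ins for the injury record and the play list; all values are made up.
injuries = pd.DataFrame({
    "PlayerKey": [1, 2],
    "GameID": ["1-1", "2-3"],
    "PlayKey": ["1-1-5", None],   # the second injury has no recorded play
    "BodyPart": ["Knee", "Ankle"],
})
plays = pd.DataFrame({
    "PlayKey": ["1-1-4", "1-1-5", "2-3-1"],
    "FieldType": ["Natural", "Natural", "Synthetic"],
})

# A left join keeps every injury; indicator=True shows which ones found a play.
linked = injuries.merge(plays, on="PlayKey", how="left", indicator=True)
print(linked[["PlayKey", "BodyPart", "FieldType", "_merge"]])
```

Rows marked left_only are injuries that cannot be tied to a specific play.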
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
injury_df = pd.read_csv('InjuryRecord.csv')
play_list_df = pd.read_csv('Playlist.csv')
player_track_df = pd.read_csv('PlayerTrackData.csv')
injury_df columns: PlayerKey | GameID | PlayKey | BodyPart | Surface | DM_M1 | DM_M7 | DM_M28 | DM_M42
play_list_df columns: PlayerKey | GameID | PlayKey | RosterPosition | PlayerDay | PlayerGame | StadiumType | FieldType | Temperature | Weather | PlayType | PlayerGamePlay | Position | PositionGroup
player_track_df columns: PlayKey | time | event | x | y | dir | dis | o | s
In this stage the data is checked for accuracy and completeness prior to beginning the analysis.
play_list_df contains a large number of rows, but we are concerned primarily with the plays associated with injuries. Therefore, we will identify and drop the irrelevant data.
Only 76 PlayKeys in play_list_df are associated with plays in the injury record. These are the rows containing data on the plays where an injury occurred.
#Keeping only rows whose PlayKey appears in the Injury Record
play_list_df = play_list_df[play_list_df['PlayKey'].isin(injury_df['PlayKey'])].copy()
player_track_df contains a large number of rows, but we are concerned primarily with the plays associated with injuries. Therefore, we will identify and drop the irrelevant data.
Only 21905 rows in player_track_df are associated with plays in the injury record. These are the rows containing data on the plays where an injury occurred.
#Keeping only rows whose PlayKey appears in the Injury Record
player_track_df = player_track_df[player_track_df['PlayKey'].isin(injury_df['PlayKey'])].copy()
There are missing values in the “PlayKey” column.
PlayKey uniquely identifies a play made by a specific player during a certain game. Missing values mean that some recorded injuries are not associated with a specific play during that player’s game. For those injuries we will also have no information on the play from the Play List and Player Track Data.
Note also that only 76 unique PlayKeys in player_track_df and play_list_df are associated with plays in the injury record, so there is no associated data for the injury_df rows with missing PlayKeys.
It is therefore best to drop the rows that have no associated values in the other two DataFrames.
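A minimal sketch of this step, assuming the rows are dropped with dropna on the PlayKey column. The toy frame and its values are hypothetical and only stand in for injury_df.

```python
import pandas as pd

# Toy injury records; values are illustrative only.
injury_demo = pd.DataFrame({
    "PlayerKey": [101, 102, 103],
    "PlayKey": ["101-1-7", None, "103-2-12"],
    "Surface": ["Synthetic", "Natural", "Natural"],
})

# Keep only injuries that can be tied to a specific play.
injury_demo = injury_demo.dropna(subset=["PlayKey"])

# Confirm that no missing values remain anywhere in the frame.
remaining_missing = int(injury_demo.isna().sum().sum())
print(remaining_missing)
```

Restricting dropna to the PlayKey column avoids discarding rows that are only missing values in unrelated columns.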
#Dropping rows with missing values
#Checking if any missing values remain
#Checking for any missing values
The “StadiumType” and “Weather” have missing values.
play_list_df['StadiumType'] = play_list_df['StadiumType'].fillna("Unknown")
play_list_df['Weather'] = play_list_df['Weather'].fillna("Not Applicable")
#Checking if any missing values remain
#Checking for any missing values
#Identifying which columns contain missing values
We see that the “event” column has missing values.
The event refers to the play details as a function of time during the play. For each play, each subsequent value in this column is a moment in time during the play. Therefore, it is possible the name of the event only appears on the first instant of the play (time = 0.0) and remains blank in the subsequent rows associated with the same event. Nevertheless, we will be dropping the column as it mainly contains categorical data and we will still have all the numeric data available for analysis.
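The guess that event labels appear only on some rows of a play can be checked before the column is dropped. The sketch below uses a made-up frame (the event names are hypothetical) to measure how sparsely the column is filled and to show how labels could be carried forward within a play if they were needed later.

```python
import pandas as pd

# Toy tracking rows; PlayKey values and event names are hypothetical.
track_demo = pd.DataFrame({
    "PlayKey": ["a", "a", "a", "b", "b"],
    "time": [0.0, 0.1, 0.2, 0.0, 0.1],
    "event": ["ball_snap", None, "tackle", "ball_snap", None],
})

# Share of rows in each play that carry an event label.
labelled_share = track_demo.groupby("PlayKey")["event"].apply(lambda s: s.notna().mean())

# Forward-fill within each play so a label persists until the next event.
track_demo["event_filled"] = track_demo.groupby("PlayKey")["event"].ffill()
print(labelled_share)
print(track_demo)
```

Grouping by PlayKey before filling keeps a label from one play from leaking into the next.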
#Dropping the "event" column
player_track_df.drop('event', inplace=True, axis=1)
#Checking if any missing values remain
Outdoor 38
Outdoors 8
Indoor 7
Unknown 4
Indoors 3
Retractable Roof 3
Dome 2
Outddors 2
Closed Dome 1
Retr. Roof-Closed 1
Open 1
Indoor, Roof Closed 1
Retr. Roof - Closed 1
Retr. Roof - Open 1
Indoor, Open Roof 1
Oudoor 1
Domed, closed 1
Name: StadiumType, dtype: int64
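Generally, we are only concerned with whether the stadium is indoor or outdoor, to understand its exposure to the elements.
The inconsistent stadium type labels will be consolidated as follows:
Outdoor: Outdoors; Oudoor; Outddors; Retr. Roof - Open; Open
Indoor: Indoors; Retr. Roof-Closed; Retr. Roof - Closed; Roof Closed; Indoor, Roof Closed; Dome; Domed, closed; Closed Dome
The following are in an unknown state, so we will classify them all as Unknown: Retractable Roof; Indoor, Open Roof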
#Replacing the values using a single mapping
stadium_type_map = {
    'Outdoors': 'Outdoor', 'Oudoor': 'Outdoor', 'Outddors': 'Outdoor',
    'Retr. Roof - Open': 'Outdoor', 'Open': 'Outdoor',
    'Indoors': 'Indoor', 'Retr. Roof-Closed': 'Indoor', 'Retr. Roof - Closed': 'Indoor',
    'Roof Closed': 'Indoor', 'Indoor, Roof Closed': 'Indoor', 'Dome': 'Indoor',
    'Domed, closed': 'Indoor', 'Closed Dome': 'Indoor',
    'Retractable Roof': 'Unknown', 'Indoor, Open Roof': 'Unknown',
}
play_list_df['StadiumType'] = play_list_df['StadiumType'].replace(stadium_type_map)
Generally, we are only concerned about bad weather conditions, such as rain, affecting players. However, the weather classification appears to be subjective, and as seen above there are clearly some issues with how temperature data has been recorded under Weather. It may be possible to classify Weather as Hot or Cold based on the temperature value, but it is not possible to accurately determine whether it is rainy, cloudy, etc. from temperature alone.
The best course of action is to avoid using ‘Weather’ in the subsequent analysis.
#Dropping the weather column
Temperature values of -999 are clearly placeholders rather than real measurements, so we will replace them with the average of the remaining values.
#Averaging only the valid temperatures
valid_temp = play_list_df.loc[play_list_df['Temperature'] != -999, 'Temperature']
avg_temp = round(valid_temp.mean())
#Replacing the placeholder with the rounded average
play_list_df['Temperature'] = play_list_df['Temperature'].replace(-999, avg_temp)
In this stage, we will examine the data to identify any patterns, trends and relationships between the variables. It will help us analyze the data and extract insights that can be used to make decisions.
Data Visualization will give us a clear idea of what the data means by giving it visual context.
surface_count = injury_df['Surface']
plt.title("Number of Injury Occurrences by Surface", fontsize=15)
Text(0.5, 1.0, 'Number of Injury Occurrences by Surface')
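synthetic = injury_df['Surface'].value_counts()['Synthetic']
natural = injury_df['Surface'].value_counts()['Natural']
In this sample, 13.89% more injuries were recorded on synthetic surfaces than on natural surfaces. This compares raw counts only and does not account for how many plays took place on each surface.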
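The percentage can be reproduced as a relative difference between the two counts. The sketch below assumes 41 synthetic and 36 natural injuries, taken from the day_1 column of the by-surface table later in this analysis; the Series itself is a stand-in for injury_df['Surface'].

```python
import pandas as pd

# Stand-in for injury_df['Surface'], built from the assumed counts (41 vs. 36).
surface = pd.Series(["Synthetic"] * 41 + ["Natural"] * 36, name="Surface")

# Look counts up by label rather than by position, so the order cannot matter.
counts = surface.value_counts()
synthetic = counts["Synthetic"]
natural = counts["Natural"]

# Relative difference, expressed as a percentage of the natural-surface count.
pct_more = (synthetic - natural) / natural * 100
print(f"{pct_more:.2f}%")
```

Indexing by label is used because value_counts orders its result by frequency, which makes positional lookups fragile if the counts change.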
bodypart_count = injury_df['BodyPart']
plt.title("Number of Injury Occurrences by Body Part", fontsize=15)
Text(0.5, 1.0, 'Number of Injury Occurrences by Body Part')
Knee and Ankle injuries have the highest occurrences.
A typical NFL season is 18 weeks, or 126 days. Each team plays 17 games with one week off, so a player would play in one game a week. The number of days missed can be considered the duration of the injury, as it affects the player’s ability to participate in games during the season.
fig_dims = (8, 4)
fig, ax = plt.subplots(figsize=fig_dims)
#Counting Number of Days missed due to injury
day_1 = injury_df['DM_M1'].value_counts()[1]
days_7 = injury_df['DM_M7'].value_counts()[1]
days_28 = injury_df['DM_M28'].value_counts()[1]
days_42 = injury_df['DM_M42'].value_counts()[1]
#Creating a list with the values
days_missed = [day_1,days_7,days_28,days_42]
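days_missed_label = ['1 Day or More','7 Days or More','28 Days or More','42 Days or More']
#Plotting the counts (the plotting call is assumed here, as the original line is incomplete)
sns.barplot(x=days_missed_label, y=days_missed)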
plt.title("Number of Days Missed Due to Injury", fontsize=20)
Text(0.5, 1.0, 'Number of Days Missed Due to Injury')
Most injuries cost the player at least 1 day, and many cost at least 7 days; absences of 28 days or more are less common.
#Grouping by Surface
days_missed_surface = injury_df.groupby('Surface').agg(day_1=('DM_M1',np.count_nonzero),days_7=('DM_M7',np.count_nonzero),days_28=('DM_M28',np.count_nonzero),days_42=('DM_M42',np.count_nonzero))
Surface | day_1 | days_7 | days_28 | days_42 |
Natural | 36 | 27 | 11 | 9 |
Synthetic | 41 | 33 | 20 | 15 |
The counts above overlap: the columns are cumulative, so an injury that caused 42 or more missed days is also counted under 1, 7, and 28 days.
It would be useful to know what percent of players injured on each surface correspond to each injury duration, i.e. the days missed. We can modify the days_missed_surface DataFrame to help us visualize this relationship.
#Creating new columns with mutually exclusive values
days_missed_surface['day_1_new'] = days_missed_surface['day_1'] - days_missed_surface['days_7']
days_missed_surface['days_7_new'] = days_missed_surface['days_7'] - days_missed_surface['days_28']
days_missed_surface['days_28_new'] = days_missed_surface['days_28'] - days_missed_surface['days_42']
days_missed_surface['days_42_new'] = days_missed_surface['days_42']
#Deleting the original columns
days_missed_surface.drop(['day_1','days_7','days_28','days_42'], inplace=True, axis=1)
#Renaming new columns to reflect the mutually exclusive ranges
days_missed_surface.rename(columns={'day_1_new': '1-6 Days', 'days_7_new': '7-27 Days', 'days_28_new': '28-41 Days', 'days_42_new': '42 Days or More'}, inplace=True)
#Plotting the pie chart for each row
days_missed_surface.T.plot.pie(subplots=True, figsize=(10, 5), autopct="%.1f", wedgeprops={'linewidth': 3.0, 'edgecolor': 'white'}, labels=None,legend=False)
plt.legend(labels=days_missed_surface.columns, bbox_to_anchor=(1.05,0.5), loc="center right", fontsize=10,bbox_transform=plt.gcf().transFigure)
#days_missed_surface.T allows us to transpose the DataFrame and plot each row as a pie chart
# autopct="%.1f" shows the percentage to 1 decimal place
#wedgeprops={'linewidth': 3.0, 'edgecolor': 'white'} creates a space between pie segments
#bbox_to_anchor allows us to manually place the legend
#bbox_transform=plt.gcf().transFigure ensures that the legend does not overlap with the pie charts
plt.title("Days Missed Due to Injury vs Surface", fontsize=20)
Text(0.5, 1.0, 'Days Missed Due to Injury vs Surface')
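Based on this visualization, injuries on synthetic surfaces overall result in players missing more days than those on natural surfaces: 20 of the 41 synthetic-surface injuries (48.8%) cost 28 days or more, compared with 11 of the 36 natural-surface injuries (30.6%).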
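The shares behind the pie charts can also be tabulated directly. The sketch below copies the cumulative counts from the grouped table above, converts them into mutually exclusive ranges, and normalises each surface to 100%. The range labels are illustrative names, not ones used elsewhere in the analysis.

```python
import pandas as pd

# Cumulative counts copied from the grouped table above.
cumulative = pd.DataFrame(
    {"day_1": [36, 41], "days_7": [27, 33], "days_28": [11, 20], "days_42": [9, 15]},
    index=pd.Index(["Natural", "Synthetic"], name="Surface"),
)

# Convert "at least N days" counts into mutually exclusive ranges.
exclusive = pd.DataFrame({
    "1-6 days": cumulative["day_1"] - cumulative["days_7"],
    "7-27 days": cumulative["days_7"] - cumulative["days_28"],
    "28-41 days": cumulative["days_28"] - cumulative["days_42"],
    "42+ days": cumulative["days_42"],
})

# Row-normalise so that each surface sums to 100%.
share = exclusive.div(exclusive.sum(axis=1), axis=0).mul(100).round(1)
print(share)
```

With only 36 and 41 injuries per surface, each percentage point corresponds to well under one injury, so small differences between the two rows should be read cautiously.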
fig_dims = (14, 5)
fig, ax = plt.subplots(figsize=fig_dims)
roster_position_count = play_list_df['RosterPosition']
plt.title("Number of Injury Occurrences by Roster Position", fontsize=15)
Text(0.5, 1.0, 'Number of Injury Occurrences by Roster Position')
The most injuries are sustained by players in the Linebacker position, followed by those in the Wide Receiver and Safety positions.
fig_dims = (8, 4)
fig, ax = plt.subplots(figsize=fig_dims)
stadium_type_count = play_list_df['StadiumType']
plt.title("Number of Injury Occurrences by Stadium Type", fontsize=15)
Text(0.5, 1.0, 'Number of Injury Occurrences by Stadium Type')
The most injuries occur in outdoor stadiums. One plausible explanation is that these stadiums have the most exposure to potentially hazardous weather conditions, although raw counts alone cannot rule out that more plays simply take place outdoors.
fig_dims = (8, 4)
fig, ax = plt.subplots(figsize=fig_dims)
temp = play_list_df['Temperature']
plt.title("Number of Injury Occurrences by Temperature", fontsize=15)
Text(0.5, 1.0, 'Number of Injury Occurrences by Temperature')
The greatest number of injuries take place in games with temperatures of 67-68 degrees, followed by 88-89 degrees.
fig_dims = (14, 5)
fig, ax = plt.subplots(figsize=fig_dims)
position_count = play_list_df['Position']
plt.title("Number of Injury Occurrences by Position", fontsize=15)
Text(0.5, 1.0, 'Number of Injury Occurrences by Position')
The most injuries are sustained by players in the Wide Receiver (WR) position during a play followed by those in the Outside Linebacker (OLB) position.
fig_dims = (14, 5)
fig, ax = plt.subplots(figsize=fig_dims)
position_group_count = play_list_df['PositionGroup']
plt.title("Number of Injury Occurrences by Position Group", fontsize=15)
Text(0.5, 1.0, 'Number of Injury Occurrences by Position Group')
The most injuries are sustained by players in the Linebacker (LB) position group during a play followed by those in the Wide Receiver (WR) and Defensive Back (DB) position group.
fig_dims = (8, 4)
fig, ax = plt.subplots(figsize=fig_dims)
field_type_count = play_list_df['FieldType']
plt.title("Number of Injury Occurrences by Field Type", fontsize=15)
Text(0.5, 1.0, 'Number of Injury Occurrences by Field Type')
There are more injuries on Synthetic fields.
fig_dims = (14, 5)
fig, ax = plt.subplots(figsize=fig_dims)
roster_position_count = play_list_df['RosterPosition']
sns.countplot(x=roster_position_count, data=play_list_df,palette='viridis',hue='FieldType')
plt.title("Number of Injury Occurrences by Roster Position", fontsize=15)
Text(0.5, 1.0, 'Number of Injury Occurrences by Roster Position')
More injuries occur on natural surfaces for players in the Linebacker, Safety, Defensive Lineman and Offensive Lineman roster positions.
More injuries occur on synthetic surfaces for players in the Wide Receiver, Cornerback and Running Back roster positions.
Note: The Roster Position refers to the official position assigned to the player but it may vary in the actual play itself. The Position refers to the player’s position during the actual play so this information must be verified with the Position data.
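The check described in the note can be sketched with a cross-tabulation of the two columns. The rows below are made up for illustration (the position codes are hypothetical examples); on the real data the same call would be applied to play_list_df.

```python
import pandas as pd

# Toy rows; each one pairs a player's roster position with the position actually played.
demo = pd.DataFrame({
    "RosterPosition": ["Linebacker", "Linebacker", "Safety", "Wide Receiver"],
    "Position": ["OLB", "DE", "FS", "WR"],
})

# Rows are roster positions, columns are in-play positions, cells are play counts.
table = pd.crosstab(demo["RosterPosition"], demo["Position"])
print(table)
```

A roster position with counts spread across several columns indicates players who line up in different positions from play to play.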
fig_dims = (14, 5)
fig, ax = plt.subplots(figsize=fig_dims)
position_count = play_list_df['Position']
plt.title("Number of Injury Occurrences by Position", fontsize=15)
Text(0.5, 1.0, 'Number of Injury Occurrences by Position')
Players in the Wide Receiver (WR) position, who sustain the most injuries during a play, sustain more injuries on Synthetic surfaces than Natural ones.
fig_dims = (14, 5)
fig, ax = plt.subplots(figsize=fig_dims)
position_group_count = play_list_df['PositionGroup']
sns.countplot(x=position_group_count, data=play_list_df,palette='viridis',hue='FieldType')
plt.title("Number of Injury Occurrences by Position Group", fontsize=15)
Text(0.5, 1.0, 'Number of Injury Occurrences by Position Group')
Players in the Linebacker (LB) position group, who sustain the most injuries during a play, sustain more injuries on Natural surfaces than Synthetic ones.
fig_dims = (8, 4)
fig, ax = plt.subplots(figsize=fig_dims)
stadium_type_count = play_list_df['StadiumType']
sns.countplot(x=stadium_type_count, data=play_list_df,palette='viridis',hue='FieldType')
plt.title("Number of Injury Occurrences by Stadium Type", fontsize=15)
Text(0.5, 1.0, 'Number of Injury Occurrences by Stadium Type')
graph = sns.FacetGrid(play_list_df, col="FieldType", height=6)
#Setting the title for the FacetGrid
graph.fig.suptitle('Number of Injury Occurrences by Temperature', fontsize=20)
Text(0.5, 0.98, 'Number of Injury Occurrences by Temperature')
The player_track_df DataFrame gives us the most information about how the player’s body is moving during a play.
“Injuries occur during football games and practice due to the combination of high speeds and full contact. While overuse injuries can occur, traumatic injuries such as concussions are most common. The force applied to either bringing an opponent to the ground or resisting being brought to the ground makes football players prone to injury anywhere on their bodies, regardless of protective equipment.” - Football Injuries, University of Rochester Medical Center
Source: “Football Injuries.” UR Medicine, University of Rochester Medical Center, University of Rochester Medical Center, 2010,
Given the context above, it might be possible to identify the moment of injury by tracking the instant when the maximum change occurs in all the variables associated with movement.
We will create a new DataFrame with variables that reflect the player’s motion during the play.
#Creating a new DataFrame to hold the change in variables
delta_player_track_df = player_track_df['PlayKey'].to_frame()
delta_player_track_df['time'] = player_track_df['time']
#Adding new columns to reflect the absolute change in variables using DataFrame.diff()
#np.where ensures that in case the time is 0.0 the difference is not calculated as this indicates the row is associated with the next PlayKey. In that case the value will be set to 0
#np.where(condition,value if condition is true, value if condition is false)
delta_player_track_df['delta_dir'] = np.where(player_track_df['time'] == 0, 0, abs(player_track_df['dir'].diff()))
delta_player_track_df['delta_o'] = np.where(player_track_df['time'] == 0, 0, abs(player_track_df['o'].diff()))
delta_player_track_df['delta_s'] = np.where(player_track_df['time'] == 0, 0, abs(player_track_df['s'].diff()))
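One caveat applies to the angular columns. If dir and o are recorded in degrees on a 0-360 scale (an assumption here), a plain difference treats a turn from 350 to 10 degrees as a change of 340 rather than 20. The sketch below shows a wraparound-aware alternative on made-up values; angular_change is a hypothetical helper, not part of the analysis above.

```python
import pandas as pd

def angular_change(angles: pd.Series) -> pd.Series:
    """Smallest absolute change between consecutive angles, in degrees (0-180)."""
    raw = angles.diff()
    # Map each raw difference into the range [-180, 180) before taking the magnitude.
    wrapped = (raw + 180) % 360 - 180
    return wrapped.abs()

# Toy tracking rows; values are illustrative only.
demo = pd.DataFrame({
    "PlayKey": ["a", "a", "a", "b", "b"],
    "dir": [350.0, 10.0, 20.0, 90.0, 100.0],
})

# Differencing within each play leaves the first row of a play as NaN, set to 0 here.
demo["delta_dir"] = demo.groupby("PlayKey")["dir"].transform(angular_change).fillna(0)
print(demo)
```

Grouping by PlayKey also removes the need to rely on time == 0 to detect the start of a new play.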
#Checking number of rows
Now that we have a DataFrame containing all the necessary variables, we can determine which values are outside the norm for each one of these variables.
We are assuming that the injury is sustained as a result of impact, which would be reflected as a sudden change in the player’s normal motion. We will treat any instance where the player’s movement is well outside the norm on any of these variables as a high-risk instance where the injury could have occurred.
If we assume the data has a Gaussian distribution (as the majority of the player’s motion does not result in injury), we can calculate the z-score for each variable to determine which values are outliers.
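As a reminder of what the z-score measures, the sketch below computes it by hand on made-up values: each value's distance from the mean, in units of standard deviation. It uses the population standard deviation, which is also the default in scipy.stats.zscore.

```python
import numpy as np

# Nineteen ordinary values and one extreme one; the numbers are illustrative only.
values = np.array([1.0] * 19 + [30.0])

# z = (value - mean) / standard deviation, with ddof=0 (population standard deviation).
z = (values - values.mean()) / values.std()

# Flag values more than three standard deviations above the mean.
outliers = values[z > 3]
print(outliers)
```

Only large positive z-scores are flagged here, which suits variables that are absolute changes and therefore never negative.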
#Calculating z-scores for dataset as a whole
from scipy import stats
z_score_dir = stats.zscore(delta_player_track_df['delta_dir'])
z_score_o = stats.zscore(delta_player_track_df['delta_o'])
z_score_s = stats.zscore(delta_player_track_df['delta_s'])
fig_dims = (12, 4)
fig, ax = plt.subplots(figsize=fig_dims)
highlight = delta_player_track_df[z_score_dir>3]
ax.scatter(y='delta_dir',x='time', data=delta_player_track_df)
ax.scatter(y='delta_dir',x='time', data=highlight,facecolor="red")
ax.set_xlabel("Time (s)")
ax.set_ylabel("Change in Angle (Degrees)")
ax.set_title("Change in Angle of Player Motion vs. Time",fontsize=20)
Text(0.5, 1.0, 'Change in Angle of Player Motion vs. Time')
Highlighted points have a z-score > 3
The majority of plays seem to involve some abrupt change in the player’s direction of motion within the first 40 seconds of the play.
fig_dims = (12, 4)
fig, ax = plt.subplots(figsize=fig_dims)
highlight = delta_player_track_df[z_score_o>3]
ax.scatter(y='delta_o',x='time', data=delta_player_track_df)
ax.scatter(y='delta_o',x='time', data=highlight,facecolor="red")
ax.set_xlabel("Time (s)")
ax.set_ylabel("Change in Orientation (Degrees)")
ax.set_title("Change in Player Orientation vs. Time",fontsize=20)
Text(0.5, 1.0, 'Change in Player Orientation vs. Time')
Highlighted points have a z-score > 3
Consistent with the change in angle, the majority of plays seem to involve some abrupt change in the direction the player is facing within the first 40 seconds of the play.
fig_dims = (12, 4)
fig, ax = plt.subplots(figsize=fig_dims)
highlight = delta_player_track_df[z_score_s>3]
ax.scatter(y='delta_s',x='time', data=delta_player_track_df)
ax.scatter(y='delta_s',x='time', data=highlight,facecolor="red")
ax.set_xlabel("Time (s)")
ax.set_ylabel("Change in Speed (Yards per Second)")
ax.set_title("Change in Player Speed vs. Time",fontsize=20)
Text(0.5, 1.0, 'Change in Player Speed vs. Time')
Highlighted points have a z-score > 3
Generally the change in speed values are clustered closer together, indicating that the variation in speed is somewhat lower for most of these plays.
None of the variables follows a Gaussian distribution; however, the values do seem to cluster below certain thresholds, which can be helpful in visually identifying outliers.
Identifying points with a z-score > 3 does a reasonably good job of identifying outliers, so we will proceed with this method. However, we will calculate the z-scores within each PlayKey instead of across the data as a whole, so that each play is judged against its own baseline.
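The per-play calculation can also be written without an explicit loop. The sketch below is one possible alternative on made-up values, using groupby with transform; zscore_within is a hypothetical helper name.

```python
import pandas as pd

# Two toy plays with different baselines; values are illustrative only.
demo = pd.DataFrame({
    "PlayKey": ["a"] * 20 + ["b"] * 20,
    "delta_s": [0.1] * 19 + [5.0] + [1.0] * 19 + [1.2],
})

def zscore_within(s: pd.Series) -> pd.Series:
    """z-score of each value relative to its own play (population std, ddof=0)."""
    return (s - s.mean()) / s.std(ddof=0)

# transform returns a result aligned with the original rows, one z-score per row.
demo["z_s"] = demo.groupby("PlayKey")["delta_s"].transform(zscore_within)
high_risk = demo[demo["z_s"] > 3]
print(high_risk)
```

In this toy example the value 1.2 in play b is flagged even though it is small in absolute terms, because it is unusual relative to that play's own values. This is the behaviour to expect from per-play z-scores.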
#Creating a list to collect the high risk instances for each PlayKey
high_risk_frames = []
#Looping over each PlayKey and the rows that belong to it
for play_key, i_df in delta_player_track_df.groupby('PlayKey'):
    #Calculate z-scores for this PlayKey
    z_score_dir = stats.zscore(i_df['delta_dir'])
    z_score_o = stats.zscore(i_df['delta_o'])
    z_score_s = stats.zscore(i_df['delta_s'])
    #Keep rows where any z-score exceeds 3
    high_risk_frames.append(i_df[(z_score_dir>3) | (z_score_o>3) | (z_score_s>3)])
#Combining the per-PlayKey results into the high_risk_playkey DataFrame
high_risk_playkey = pd.concat(high_risk_frames, axis=0)
#Checking if the DataFrame contains all the PlayKey values
We now have a DataFrame with all the high risk instances for all PlayKey values. Although there are still a significant number of instances for each PlayKey we have narrowed down the values from our original data.
This allows us to answer the following questions:
#Creating a new DataFrame with the PlayKeys and associated surface/field type
surfaces = play_list_df[['PlayKey','FieldType']]
#Merging the two DataFrames to add the surface/field type values
high_risk_playkey = pd.merge(left=high_risk_playkey, right=surfaces, on='PlayKey')
surface_risk = high_risk_playkey.groupby('FieldType').agg(number_of_instances=('time','count'),min_time=('time','min'),max_time=('time','max'),avg_time=('time','mean'))
FieldType | number_of_instances | min_time | max_time | avg_time |
Natural | 545 | 0.1 | 37.1 | 16.0 |
Synthetic | 647 | 0.1 | 97.6 | 17.1 |
Injuries on synthetic surfaces result in players missing more days than those on natural surfaces.
Players in the Linebacker (LB) position group, who sustain the most injuries during a play, sustain more injuries on Natural surfaces than Synthetic ones.
We are assuming that the injury is sustained as a result of impact, which would be reflected as a sudden change in the player’s normal motion. We will assume any instance where the player’s movements are out of the ordinary across the variables in the Player Track Data indicates a high-risk instance where the injury could have occurred.