import pandas as pd
import numpy as np
import datetime as dt
Shotlog=pd.read_csv("../../Data/Week 6/Shotlog1.csv")
Player_Stats=pd.read_csv("../../Data/Week 6/Player_Stats1.csv")
Player_Shots=pd.read_csv("../../Data/Week 6/Player_Shots1.csv")
Player_Game=pd.read_csv("../../Data/Week 6/Player_Game1.csv")
Shotlog.head()
We can first calculate the probability of making the current shot conditional on having made the previous shot. $$Conditional \ Probability=\frac{Probability \ of \ Making \ Consecutive \ Shots}{Probability \ of \ Making \ Previous \ Shot}$$
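As a minimal sketch of the formula above, the ratio can be worked through on a short, made-up sequence of shot outcomes. The array `shots` is hypothetical and is not taken from the shot log.

```python
import numpy as np

# Hypothetical sequence of shot outcomes for one player (1 = made, 0 = missed).
shots = np.array([1, 1, 0, 1, 1, 1, 0, 0, 1, 1])

current = shots[1:]    # outcome of each shot that has a previous shot
previous = shots[:-1]  # outcome of the shot immediately before it

# Joint probability: both the previous and the current shot were made.
p_joint = np.mean((current == 1) & (previous == 1))
# Probability that the previous shot was made.
p_prev = np.mean(previous == 1)

conditional_prob = p_joint / p_prev
print(p_joint, p_prev, conditional_prob)
```

In this toy sequence six shots follow a made shot and four of those six are made, so the ratio of the two averages equals 4/6.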
We will need to create a variable that indicates a player made consecutive shots.
Shotlog['conse_shot_hit'] = np.where((Shotlog['current_shot_hit']==1)&(Shotlog['lag_shot_hit']==1), 1, 0)
Shotlog.head()
We can create a player-level dataframe. The average of the variable "conse_shot_hit" is the joint probability of making both the current and the previous shot. We will also calculate the average of "lag_shot_hit", which is the probability of making the previous shot.
Player_Prob=Shotlog.groupby(['shoot_player'])[['conse_shot_hit','lag_shot_hit']].mean()
Player_Prob=Player_Prob.reset_index()
Player_Prob.rename(columns={'lag_shot_hit':'average_lag_hit'}, inplace=True)
Player_Prob.head()
We can calculate the conditional probability by dividing the joint probability by the probability of making the previous shot.
Player_Prob['conditional_prob']=Player_Prob['conse_shot_hit']/Player_Prob['average_lag_hit']
Player_Prob.head()
We can merge the "Player_Prob" data frame with the "Player_Stats" data frame we created earlier to compare the conditional probability with the unconditional probability. If the two probabilities are the same, or almost the same, then we fail to find evidence that making the current shot depends on making the previous shot.
Player_Stats=pd.merge(Player_Prob, Player_Stats, on=['shoot_player'])
Player_Stats.head(10)
Let's first take a quick look at our "Player_Stats" data frame.
Player_Stats.info()
Note that when we created the "conditional_prob" variable, some observations may have missing values, since the "average_lag_hit" variable may be zero for some players. We will delete the observations with a missing conditional probability.
Player_Stats=Player_Stats[pd.notnull(Player_Stats["conditional_prob"])]
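To see why a zero denominator leads to a missing value, here is a small sketch on a made-up player-level table. The data frame `toy` and its values are hypothetical; only the column names follow the ones used above.

```python
import pandas as pd

# Hypothetical player-level table; player "C" never made a previous shot.
toy = pd.DataFrame({
    "shoot_player": ["A", "B", "C"],
    "conse_shot_hit": [0.20, 0.30, 0.0],
    "average_lag_hit": [0.40, 0.50, 0.0],
})

# 0 / 0 evaluates to NaN in pandas rather than raising an error.
toy["conditional_prob"] = toy["conse_shot_hit"] / toy["average_lag_hit"]

# Keep only rows with a defined conditional probability.
toy_clean = toy[pd.notnull(toy["conditional_prob"])]
print(toy_clean)
```

A player who never made a previous shot cannot have made consecutive shots either, so both averages are zero and the ratio is undefined for that row.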
We can first check which players have the highest conditional probability, i.e., which players are more likely to have a hot hand.
Let's sort the data by conditional probability.
Player_Stats.sort_values(by=['conditional_prob'], ascending=[False]).head(10)
Comparing the "conditional_prob" variable and the "average_hit" variable, some players have a slightly higher conditional probability but some also have a lower conditional probability.
We can sort the data by the value of difference between conditional and unconditional probabilities.
Player_Stats['diff_prob']=Player_Stats['conditional_prob']-Player_Stats['average_hit']
Player_Stats=pd.merge(Player_Stats, Player_Shots, on=['shoot_player'])
Player_Stats.sort_values(by=['diff_prob'], ascending=[False]).head(10)
We can see that Lamar Patterson has the highest difference between the two probabilities, at 30%. However, the sample size for Patterson is quite small. For Joe Young and Damjan Rudez, we have about 80 observations and the difference in the probabilities is about 20%.
More rigorously, we can use a t-test to test whether the players' probability of making a shot is statistically significantly different from their conditional probability.
We need to choose a significance level before we perform the test. If the test produces a p-value less than the chosen significance level, then we say that there is a statistically significant difference between the two probabilities; otherwise, we fail to find evidence to support that the two probabilities are statistically significantly different from each other.
The most commonly used significance level is 0.05.
import scipy.stats as sp
sp.ttest_ind(Player_Stats['conditional_prob'], Player_Stats['average_hit'])
The first number is the t-statistic and the second number is the p-value.
Note that the p-value for the t-test is about 0.10, which is higher than the conventional significance level of 0.05. Thus the conditional probability is not statistically significantly different from the average success rate. In other words, in the analysis of conditional probability, we fail to find evidence to support the "hot hand".
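The decision rule described above can be sketched as follows. The two arrays are randomly generated for illustration only and are not the values in "Player_Stats"; `alpha` and `decision` are names introduced just for this example.

```python
import numpy as np
import scipy.stats as sp

rng = np.random.default_rng(0)
# Made-up player-level success rates, for illustration only.
average_hit = rng.normal(loc=0.45, scale=0.05, size=200)
conditional_prob = average_hit + rng.normal(loc=0.0, scale=0.05, size=200)

alpha = 0.05  # significance level chosen before running the test

# The result unpacks into the t-statistic and the p-value, in that order.
t_stat, p_value = sp.ttest_ind(conditional_prob, average_hit)

if p_value < alpha:
    decision = "reject the null of equal means"
else:
    decision = "fail to reject the null of equal means"
print(t_stat, p_value, decision)
```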
We can calculate the autocorrelation coefficient by calculating the correlation coefficient between the “current_shot_hit” variable and the “lag_shot_hit” variable.
Note: in pandas, you could use "autocorr(lag=1)" on a series to calculate the first-order autocorrelation coefficient. This method is not very useful in our case, since we want to look at the autocorrelation coefficient within each game. With the built-in method, we would be treating the last shot of one game and the first shot of the subsequent game as a pair.
Shotlog['current_shot_hit'].corr(Shotlog['lag_shot_hit'])
As we can see, the autocorrelation coefficient is negative and the magnitude is very small and close to zero.
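The caveat in the note above can be illustrated on a made-up shot log. This section does not show how "lag_shot_hit" was actually built, so the within-game shift below is only one assumed way to do it, and the column names "game" and "shot_hit" are hypothetical.

```python
import pandas as pd

# Hypothetical shot log for one player across two games.
toy = pd.DataFrame({
    "shoot_player": ["A"] * 6,
    "game": [1, 1, 1, 2, 2, 2],
    "shot_hit": [1, 0, 1, 1, 1, 0],
})

# A plain shift pairs the first shot of game 2 with the last shot of game 1.
toy["lag_naive"] = toy["shot_hit"].shift(1)

# Shifting within each player-game group leaves the first shot of every
# game without a lagged value, so no pair crosses a game boundary.
toy["lag_within_game"] = toy.groupby(["shoot_player", "game"])["shot_hit"].shift(1)

# The naive correlation matches Series.autocorr(lag=1).
naive_corr = toy["shot_hit"].corr(toy["lag_naive"])
within_corr = toy["shot_hit"].corr(toy["lag_within_game"])
print(toy)
print(naive_corr, within_corr)
```

Rows with a missing lag are dropped from the correlation automatically, so the within-game version simply uses fewer pairs.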
Some players may have a "hot hand", and hence a strong correlation between the outcomes of adjacent shots, while others may not. We can therefore also calculate the autocorrelation coefficient for each player.
Shotlog.groupby('shoot_player')[['current_shot_hit','lag_shot_hit']].corr().head(10)
Autocorr_Hit=Shotlog.groupby('shoot_player')[['current_shot_hit','lag_shot_hit']].corr().unstack()
Autocorr_Hit.head()
Note that now each row represents a single player. But we still have duplicate information in the columns.
Lastly, we will also reset the index so that the player names would become a variable.
Autocorr_Hit=Shotlog.groupby('shoot_player')[['current_shot_hit','lag_shot_hit']].corr().unstack().iloc[:,1].reset_index()
Autocorr_Hit.head()
Notice that we still have two levels of variable names.
Autocorr_Hit.columns=Autocorr_Hit.columns.get_level_values(0)
Autocorr_Hit.head()
Let's rename the variable that captures the autocorrelation coefficient.
Autocorr_Hit.rename(columns={'current_shot_hit':'autocorr'}, inplace=True)
Autocorr_Hit.head()
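As an alternative sketch, not the approach used above, the same per-player coefficient can be obtained in one step by applying a two-column correlation within each group. The data frame `toy` is made up; it only reuses the column names from the shot log.

```python
import pandas as pd

# Hypothetical shot log with the same column names as Shotlog.
toy = pd.DataFrame({
    "shoot_player": ["A"] * 5 + ["B"] * 5,
    "current_shot_hit": [1, 0, 1, 1, 0, 0, 1, 1, 0, 1],
    "lag_shot_hit":     [0, 1, 0, 1, 1, 1, 0, 1, 1, 0],
})

# One correlation per player, computed directly on the two columns.
autocorr = (
    toy.groupby("shoot_player")[["current_shot_hit", "lag_shot_hit"]]
       .apply(lambda g: g["current_shot_hit"].corr(g["lag_shot_hit"]))
       .reset_index(name="autocorr")
)
print(autocorr)
```

This avoids the duplicated entries of the full correlation matrix and the two-level column names, at the cost of calling a Python function once per player.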
Player_Game_Shot=Player_Game.groupby(["shoot_player"])['shot_per_game'].mean().reset_index(name='avg_shot_game')
Player_Game_Shot.head()
Autocorr_Hit=pd.merge(Autocorr_Hit, Player_Game_Shot, on=['shoot_player'])
Autocorr_Hit.sort_values(by=['autocorr'], ascending=[False]).head(10)
We will merge the Player_Game_Shot dataframe with the Player_Shots dataframe, since both dataframes are measured at the player level and both contain information on the number of shots.
Player_Shots=pd.merge(Player_Shots, Player_Game_Shot, on=['shoot_player'])
Player_Shots.head()
Shotlog.to_csv("../../Data/Week 6/Shotlog2.csv", index=False)
Player_Stats.to_csv("../../Data/Week 6/Player_Stats2.csv", index=False)
Player_Shots.to_csv("../../Data/Week 6/Player_Shots2.csv", index=False)