In week 1, we took a brief look at the match statistics of the Indian Premier League in 2018 (IPL2018teams dataset). This week, we will look at player-level statistics. In particular, we are interested in whether player performance affects salaries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm
In our data repository, there is a data set “IPL18Player.csv” which contains performance statistics as well as salary information of cricket players in the Indian Premier League in 2018.
IPLPlayer=pd.read_csv("../../Data/Week 4/IPL18Player.csv")
IPLPlayer.head()
IPLPlayer.shape
IPLPlayer.info()
There are missing values in the salary variable. We will drop observations with missing values.
IPLPlayer=IPLPlayer.dropna()
IPLPlayer.shape
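Before dropping rows, it can help to see which columns contain missing values, because dropna() with no arguments removes every row that has a missing value in any column, not only in salary. The following is a minimal sketch on a made-up table; the player names and numbers are hypothetical and only the column names mirror the IPL data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "Salary": [100.0, np.nan, 250.0],
    "runs": [320, 45, np.nan],
})

# Count the missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes rows with a missing value in ANY column...
dropped_any = toy.dropna()
# ...while subset= restricts the check to the listed columns.
dropped_salary = toy.dropna(subset=["Salary"])

print(dropped_any.shape, dropped_salary.shape)
```

Comparing the two shapes shows how many rows are lost for reasons other than a missing salary.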
IPLPlayer['batsman']=np.where(IPLPlayer['innings']> 0, 1, 0)
IPLPlayer['batsman'].describe()
IPLPlayer['bowler']=np.where(IPLPlayer['matches_bowled']> 0, 1, 0)
IPLPlayer['bowler'].describe()
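The two indicators are not mutually exclusive: a player who both batted and bowled gets a 1 on each. A cross-tabulation makes the overlap visible. This is a small sketch with hypothetical values, not output from the IPL data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one pure batsman, one pure bowler,
# one all-rounder, and one player who did neither.
toy = pd.DataFrame({
    "innings": [10, 0, 8, 0],
    "matches_bowled": [0, 12, 9, 0],
})
toy["batsman"] = np.where(toy["innings"] > 0, 1, 0)
toy["bowler"] = np.where(toy["matches_bowled"] > 0, 1, 0)

# Rows are batsman (0/1), columns are bowler (0/1).
role_counts = pd.crosstab(toy["batsman"], toy["bowler"])
print(role_counts)
```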
The last type of player, which is not captured by either the batsman or the bowler indicator, is the wicket keeper. In the dataset, the variable "matches_keeper" gives the number of matches in which a player kept wicket.
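A wicket keeper indicator could be built in the same way as the batsman and bowler indicators. The sketch below assumes the column is named matches_keeper, as described in the text; the indicator name keeper and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column name follows the dataset.
toy = pd.DataFrame({"matches_keeper": [0, 14, 3, 0]})

# 1 if the player kept wicket in at least one match, 0 otherwise.
toy["keeper"] = np.where(toy["matches_keeper"] > 0, 1, 0)
print(toy["keeper"].describe())
```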
Notice that if a batsman has scored runs but has never been dismissed, his batting average is technically infinite. Similarly, if a player did not face any balls, his batting strike rate is undefined because of division by zero, and if a bowler did not take any wickets, his bowling average and bowling strike rate would be infinite.
We will not be able to run a regression when our variables have some infinite values.
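To see how such values arise in practice, the sketch below divides runs by dismissals on a hypothetical three-player table. In pandas, a positive number divided by zero gives inf, and zero divided by zero gives NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second player never batted,
# the third scored 30 runs without being dismissed.
toy = pd.DataFrame({
    "runs": [50, 0, 30],
    "outs": [2, 0, 0],
})

# Unadjusted batting average: runs per dismissal.
raw_average = toy["runs"] / toy["outs"]
print(raw_average)

# Count the entries that are inf or NaN and would break a regression.
n_not_finite = (~np.isfinite(raw_average)).sum()
print(n_not_finite)
```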
There are two alternatives we will consider to deal with this issue.
IPLPlayer['outs']=np.where(IPLPlayer['batsman']==1, IPLPlayer['innings']-IPLPlayer['not_outs'], 0)
IPLPlayer['outs'].describe()
Create batting average, batting strike rate, bowling average, and bowling strike rate variables. Add 1 to the number of outs, balls faced, and wickets taken when calculating these variables.
IPLPlayer['batting_average']=IPLPlayer['runs']/(IPLPlayer['outs']+1)
IPLPlayer['batting_strike']=IPLPlayer['runs']/(IPLPlayer['balls_faced']+1)*100
IPLPlayer['bowling_average']=IPLPlayer['runs_conceded']/(IPLPlayer['wickets']+1)
IPLPlayer['bowling_strike']=IPLPlayer['balls_bowled']/(IPLPlayer['wickets']+1)
IPLPlayer['batting_average'].describe()
IPLPlayer['batting_strike'].describe()
IPLPlayer['bowling_average'].describe()
IPLPlayer['bowling_strike'].describe()
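A quick way to confirm that adding 1 to the denominators removes the problem is to check that every value is finite. The sketch below repeats the adjustment on hypothetical numbers; it also shows the cost of the adjustment, since a player with 50 runs and 2 dismissals gets 50/3 instead of the conventional 25.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, including players with zero outs and zero balls faced.
toy = pd.DataFrame({
    "runs": [50, 0, 30],
    "outs": [2, 0, 0],
    "balls_faced": [40, 0, 20],
})

# Same add-one adjustment as used for the IPL data.
toy["batting_average"] = toy["runs"] / (toy["outs"] + 1)
toy["batting_strike"] = toy["runs"] / (toy["balls_faced"] + 1) * 100

# True only if no value is inf or NaN.
all_finite = np.isfinite(
    toy[["batting_average", "batting_strike"]].to_numpy()
).all()
print(toy)
print(all_finite)
```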
reg_IPL1=sm.ols(formula = 'Salary ~ batsman+ bowler+ batsman*bowler', data= IPLPlayer, missing="drop").fit()
print(reg_IPL1.summary())
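In the formula interface, a*b is shorthand for a + b + a:b, so the formula above, which also lists batsman and bowler separately, should fit the same model as 'Salary ~ batsman*bowler'. The sketch below checks this on hypothetical data. Because the model has one coefficient per combination of the two indicators, the intercept equals the mean salary of players with both indicators at 0, and the other coefficients are differences between group means.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Hypothetical toy data with two players in each batsman/bowler cell.
toy = pd.DataFrame({
    "batsman": [1, 1, 0, 0, 1, 1, 0, 0],
    "bowler":  [0, 0, 1, 1, 1, 1, 0, 0],
    "Salary":  [100, 120, 80, 90, 200, 220, 20, 30],
})

long_form = sm.ols("Salary ~ batsman + bowler + batsman*bowler", data=toy).fit()
short_form = sm.ols("Salary ~ batsman*bowler", data=toy).fit()

# Both formulas produce the same terms and the same estimates.
print(long_form.params)
same_terms = list(long_form.params.index) == list(short_form.params.index)
same_values = np.allclose(long_form.params.values, short_form.params.values)
print(same_terms, same_values)
```

In this toy example the group means are 25, 110, 85, and 210, so the intercept is 25 and the interaction coefficient is 210 - 25 - 85 - 60 = 40.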
We will first simply use the total number of runs, number of not outs, and number of balls faced to measure players’ performance.
reg_IPL2=sm.ols(formula = 'Salary ~ runs', data= IPLPlayer).fit()
print(reg_IPL2.summary())
reg_IPL3=sm.ols(formula = 'Salary ~ runs+not_outs', data= IPLPlayer).fit()
print(reg_IPL3.summary())
reg_IPL4=sm.ols(formula = 'Salary ~ runs+not_outs+balls_faced', data= IPLPlayer).fit()
print(reg_IPL4.summary())
In the next regressions, we will use the modified batting average and batting strike variables to measure player performance.
reg_IPL5=sm.ols(formula = 'Salary ~ batting_average', data= IPLPlayer).fit()
print(reg_IPL5.summary())
reg_IPL6=sm.ols(formula = 'Salary ~ batting_average+batting_strike', data= IPLPlayer).fit()
print(reg_IPL6.summary())
Again, we will first use number of runs conceded, number of balls bowled, and number of wickets taken to measure bowlers' performance.
reg_IPL7=sm.ols(formula = 'Salary ~ runs_conceded', data= IPLPlayer).fit()
print(reg_IPL7.summary())
reg_IPL8=sm.ols(formula = 'Salary ~ runs_conceded+balls_bowled', data= IPLPlayer).fit()
print(reg_IPL8.summary())
reg_IPL9=sm.ols(formula = 'Salary ~ runs_conceded+balls_bowled+wickets', data= IPLPlayer).fit()
print(reg_IPL9.summary())
In the next regression, we will use the modified bowling average and bowling strike variables to measure player performance.
reg_IPL10=sm.ols(formula = 'Salary ~ bowling_average+bowling_strike', data= IPLPlayer).fit()
print(reg_IPL10.summary())
We will first use the original variables in the regression: total number of runs, number of not outs, number of balls faced, number of runs conceded, number of balls bowled, and number of wickets.
reg_IPL11=sm.ols(formula = 'Salary ~ runs+not_outs+balls_faced+runs_conceded+balls_bowled+wickets', data= IPLPlayer).fit()
print(reg_IPL11.summary())
We will also use the modified batting average, batting strike, bowling average, and bowling strike variables to measure player performance.
reg_IPL12=sm.ols(formula = 'Salary ~ batting_average+batting_strike+bowling_average+bowling_strike', data= IPLPlayer).fit()
print(reg_IPL12.summary())
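With this many specifications, it can be easier to compare fit statistics side by side than to read each summary in turn. The sketch below collects the number of observations, R-squared, and adjusted R-squared from fitted models into one table. The data are simulated and the model labels are hypothetical; the same pattern could be applied to the fitted results above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Simulated toy data; the coefficients used to generate it are arbitrary.
rng = np.random.default_rng(0)
n = 30
toy = pd.DataFrame({
    "runs": rng.integers(0, 500, n),
    "wickets": rng.integers(0, 20, n),
})
toy["Salary"] = 50 + 0.5 * toy["runs"] + 10 * toy["wickets"] + rng.normal(0, 20, n)

# Fit two nested specifications.
models = {
    "runs only": sm.ols("Salary ~ runs", data=toy).fit(),
    "runs + wickets": sm.ols("Salary ~ runs + wickets", data=toy).fit(),
}

# One row per model, one column per fit statistic.
comparison = pd.DataFrame({
    name: {"nobs": fit.nobs, "r2": fit.rsquared, "adj_r2": fit.rsquared_adj}
    for name, fit in models.items()
}).T
print(comparison)
```

Adjusted R-squared is the more useful column when the models differ in the number of regressors, because plain R-squared never decreases when a variable is added.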