Regression Analysis
Let’s apply simple linear regression to a small dataset relating years of education to annual income (in thousands of dollars).
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
# the data: years of education and income in $1000s
data = {'Education': [11,12,13,15,8,10,11,12,17,11],
'Income': [25,27,30,41,18,23,26,24,48,26]}
data
{'Education': [11, 12, 13, 15, 8, 10, 11, 12, 17, 11],
'Income': [25, 27, 30, 41, 18, 23, 26, 24, 48, 26]}
import statsmodels.api as sm
# convert into a data frame
df = pd.DataFrame(data,columns=['Education','Income'])
# get predictor and response
X = df['Education']
Y = df['Income']
# Add the intercept to the design matrix
X = sm.add_constant(X)
model = sm.OLS(Y, X).fit()
predictions = model.predict(X)
print_model = model.summary()
print(print_model)
OLS Regression Results
==============================================================================
Dep. Variable: Income R-squared: 0.932
Model: OLS Adj. R-squared: 0.923
Method: Least Squares F-statistic: 108.9
Date: Sat, 28 May 2022 Prob (F-statistic): 6.18e-06
Time: 16:28:35 Log-Likelihood: -22.203
No. Observations: 10 AIC: 48.41
Df Residuals: 8 BIC: 49.01
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -12.1655 4.004 -3.038 0.016 -21.400 -2.931
Education 3.4138 0.327 10.434 0.000 2.659 4.168
==============================================================================
Omnibus: 2.110 Durbin-Watson: 2.085
Prob(Omnibus): 0.348 Jarque-Bera (JB): 1.025
Skew: -0.771 Prob(JB): 0.599
Kurtosis: 2.710 Cond. No. 62.6
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Interpretation of the output
The fitted regression line is:
\(\hat{y}_i=-12.166+3.414x_i.\)
Interpretation of coefficients:
\(Education: \hat{\beta}_1=3.414\) tells us that, on average, each additional year of education raises an individual’s income by \(\$3.414\) thousand. The estimated standard error of \(\hat{\beta}_1\) is \(se(\hat{\beta}_1)=0.327\).
\(const: \hat{\beta}_0=-12.166\). For this data the intercept has no direct interpretation, since \(X=0\) (zero years of education) lies outside the range of the observed data. The estimated standard error of \(\hat{\beta}_0\) is \(se(\hat{\beta}_0)=4.004\).
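These estimates and standard errors can be reproduced from the textbook closed-form formulas for simple linear regression. A minimal, self-contained check:

```python
import numpy as np

x = np.array([11, 12, 13, 15, 8, 10, 11, 12, 17, 11], dtype=float)
y = np.array([25, 27, 30, 41, 18, 23, 26, 24, 48, 26], dtype=float)

Sxy = np.sum((x - x.mean()) * (y - y.mean()))   # = 198.0
Sxx = np.sum((x - x.mean()) ** 2)               # = 58.0
b1 = Sxy / Sxx                                  # slope, approx 3.4138
b0 = y.mean() - b1 * x.mean()                   # intercept, approx -12.1655

# standard errors, built from the residual variance estimate s^2 = SSE / (n - 2)
n = len(x)
resid = y - (b0 + b1 * x)
s2 = np.sum(resid ** 2) / (n - 2)
se_b1 = np.sqrt(s2 / Sxx)                               # approx 0.327
se_b0 = np.sqrt(s2 * (1.0 / n + x.mean() ** 2 / Sxx))   # approx 4.004
```

The four values agree with the `coef` and `std err` columns of the summary table above.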
Hypothesis testing for slope parameter:
\(H_{0}:\beta_1=0\) vs \(H_{1}:\beta_1 \neq 0\)
Under \(H_{0}\), the observed value of test statistic is \(t_{0}=\frac{3.4138}{0.327}=10.434\).
The corresponding P-value is reported as 0.000 (i.e., less than 0.001). Since the P-value is less than \(\alpha=0.05\), we reject \(H_{0}\) at the \(\alpha=0.05\) level. We conclude that years spent in education is linearly associated with an individual’s income.
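The test statistic and its two-sided P-value can be checked directly with scipy.stats, using the \(n-2=8\) residual degrees of freedom from the summary:

```python
import scipy.stats as stats

b1, se_b1, df_resid = 3.4138, 0.327, 8       # values from the summary table
t0 = b1 / se_b1                              # approx 10.44
# two-sided P-value: twice the upper-tail probability of the t distribution
p_value = 2 * stats.t.sf(abs(t0), df_resid)  # approx 6e-06
```

With a single predictor the slope t test and the overall F test coincide, which is why this P-value matches the Prob (F-statistic) of 6.18e-06 shown in the summary.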
Hypothesis testing for intercept parameter:
\(H_{0}:\beta_0=0\) vs \(H_{1}:\beta_0 \neq 0\)
Under \(H_{0}\), the observed value of test statistic is \(t_{0}=\frac{-12.1655}{4.004}=-3.038\).
The corresponding P-value is 0.016 (the P>|t| column already reports the two-sided P-value). Since the P-value is less than \(\alpha=0.05\), we reject \(H_{0}\) at the \(\alpha=0.05\) level. We conclude that the intercept should be kept in the model.
Confidence interval:
A \(95\%\) confidence interval for \(\beta_1\) is \([2.659,4.168]\). We are \(95\%\) confident that the true value of \(\beta_1\) is between 2.659 and 4.168.
A \(95\%\) confidence interval for \(\beta_0\) is \([-21.400,-2.931]\). We are \(95\%\) confident that the true value of \(\beta_0\) is between -21.400 and -2.931.
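Both intervals follow from the usual formula, estimate \(\pm\ t_{0.025,8} \times se\). A quick check with scipy:

```python
import scipy.stats as stats

# critical value of the t distribution with 8 residual degrees of freedom
t_crit = stats.t.ppf(0.975, 8)   # approx 2.306

# coefficient +/- t_crit * standard error, using values from the summary table
ci_slope = (3.4138 - t_crit * 0.327, 3.4138 + t_crit * 0.327)      # approx (2.659, 4.168)
ci_const = (-12.1655 - t_crit * 4.004, -12.1655 + t_crit * 4.004)  # approx (-21.400, -2.931)
```

These match the [0.025, 0.975] columns of the summary table.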
R-squared = 0.932 tells us that about \(93\%\) of the variation in income is accounted for by the linear relationship with years spent in education.
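With a single predictor, R-squared equals the squared sample correlation between the two variables, which can be verified directly:

```python
import numpy as np

x = np.array([11, 12, 13, 15, 8, 10, 11, 12, 17, 11], dtype=float)
y = np.array([25, 27, 30, 41, 18, 23, 26, 24, 48, 26], dtype=float)

r = np.corrcoef(x, y)[0, 1]   # sample correlation of Education and Income
r_squared = r ** 2            # approx 0.932, matching the summary
```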
Session Info
import session_info
session_info.show()
-----
matplotlib          3.5.2
numpy               1.22.4
pandas              1.4.2
scipy               1.8.1
session_info        1.0.0
statsmodels         0.13.2
-----
Modules imported as dependencies:
PIL                 9.1.1
asttokens           NA
backcall            0.2.0
beta_ufunc          NA
binom_ufunc         NA
cffi                1.15.0
colorama            0.4.4
cycler              0.10.0
cython_runtime      NA
dateutil            2.8.2
debugpy             1.6.0
decorator           5.1.1
defusedxml          0.7.1
entrypoints         0.4
executing           0.8.3
hypergeom_ufunc     NA
ipykernel           6.13.0
ipython_genutils    0.2.0
jedi                0.18.1
joblib              1.1.0
kiwisolver          1.4.2
mpl_toolkits        NA
nbinom_ufunc        NA
packaging           21.3
parso               0.8.3
patsy               0.5.2
pexpect             4.8.0
pickleshare         0.7.5
pkg_resources       NA
prompt_toolkit      3.0.29
psutil              5.9.1
ptyprocess          0.7.0
pure_eval           0.2.2
pydev_ipython       NA
pydevconsole        NA
pydevd              2.8.0
pydevd_file_utils   NA
pydevd_plugins      NA
pydevd_tracing      NA
pygments            2.12.0
pyparsing           3.0.9
pytz                2022.1
six                 1.16.0
sphinxcontrib       NA
stack_data          0.2.0
tornado             6.1
traitlets           5.2.1.post0
wcwidth             0.2.5
zmq                 23.0.0
-----
IPython             8.4.0
jupyter_client      7.3.1
jupyter_core        4.10.0
notebook            6.4.11
-----
Python 3.8.12 (default, May 4 2022, 08:13:04) [GCC 9.4.0]
Linux-5.13.0-1023-azure-x86_64-with-glibc2.2.5
-----
Session information updated at 2022-05-28 16:28