Regression Analysis

Let’s apply regression analysis to this data.

import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats

# present data
data = {'Education': [11,12,13,15,8,10,11,12,17,11],
        'Income': [25,27,30,41,18,23,26,24,48,26]} 
data
{'Education': [11, 12, 13, 15, 8, 10, 11, 12, 17, 11],
 'Income': [25, 27, 30, 41, 18, 23, 26, 24, 48, 26]}
import statsmodels.api as sm

# convert into a data frame
df = pd.DataFrame(data,columns=['Education','Income']) 

# get predictor and response
X = df['Education'] 
Y = df['Income']

# Add the intercept to the design matrix
X = sm.add_constant(X) 

model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 

print_model = model.summary()
print(print_model)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 Income   R-squared:                       0.932
Model:                            OLS   Adj. R-squared:                  0.923
Method:                 Least Squares   F-statistic:                     108.9
Date:                Sat, 28 May 2022   Prob (F-statistic):           6.18e-06
Time:                        16:28:35   Log-Likelihood:                -22.203
No. Observations:                  10   AIC:                             48.41
Df Residuals:                       8   BIC:                             49.01
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -12.1655      4.004     -3.038      0.016     -21.400      -2.931
Education      3.4138      0.327     10.434      0.000       2.659       4.168
==============================================================================
Omnibus:                        2.110   Durbin-Watson:                   2.085
Prob(Omnibus):                  0.348   Jarque-Bera (JB):                1.025
Skew:                          -0.771   Prob(JB):                        0.599
Kurtosis:                       2.710   Cond. No.                         62.6
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
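
The coefficient estimates and R-squared in the table above can be checked by hand. For simple linear regression, the least-squares estimates have the closed form \(\hat{\beta}_1=S_{xy}/S_{xx}\) and \(\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}\). The following is a minimal sketch using only NumPy; the variable names (education, income, s_xx, and so on) are illustrative choices and are not part of the statsmodels output.

```python
import numpy as np

# Same data as in the example above
education = np.array([11, 12, 13, 15, 8, 10, 11, 12, 17, 11], dtype=float)
income = np.array([25, 27, 30, 41, 18, 23, 26, 24, 48, 26], dtype=float)

# Centered sums of squares and cross-products
x_bar, y_bar = education.mean(), income.mean()
s_xx = np.sum((education - x_bar) ** 2)
s_xy = np.sum((education - x_bar) * (income - y_bar))
s_yy = np.sum((income - y_bar) ** 2)

# Closed-form least-squares estimates of slope and intercept
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# R-squared for simple linear regression (squared sample correlation)
r_squared = s_xy ** 2 / (s_xx * s_yy)

print(f"slope     = {beta1_hat:.4f}")
print(f"intercept = {beta0_hat:.4f}")
print(f"R-squared = {r_squared:.3f}")
```

The printed values should agree with the coef column and the R-squared entry of the summary table up to rounding.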

Interpretation of the output

  • The fitted regression line is:

    \(\hat{y}_i=-12.166+3.414x_i.\)

  • Interpretation of coefficients:

    • Education: \(\hat{\beta}_1=3.414\) tells us that, on average, each additional year of education is associated with an increase of \(\$ 3.414\) thousand in an individual’s income. The estimated standard error of \(\hat{\beta}_1\) is \(se(\hat{\beta}_1)=0.327\).

    • const: \(\hat{\beta}_0=-12.166\). For this data, the intercept does not have a meaningful interpretation, since \(X=0\) does not make sense for this data. The estimated standard error of \(\hat{\beta}_0\) is \(se(\hat{\beta}_0)=4.004\).

  • Hypothesis testing for slope parameter:

    \(H_{0}:\beta_1=0\) vs \(H_{1}:\beta_1 \neq 0\)

  • Under \(H_{0}\), the observed value of the test statistic is \(t_{0}=\frac{3.4138}{0.327}=10.434\) (computed from the unrounded estimates).

  • The corresponding P-value is reported as 0.000, that is, it is smaller than 0.0005. Since the P-value is less than \(\alpha=0.05\), we reject \(H_{0}\) at the \(\alpha=0.05\) level. We conclude that years spent in education are linearly associated with the individual’s income.

  • Hypothesis testing for intercept parameter:

    \(H_{0}:\beta_0=0\) vs \(H_{1}:\beta_0 \neq 0\)

  • Under \(H_{0}\), the observed value of the test statistic is \(t_{0}=\frac{-12.1655}{4.004}=-3.038\).

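  • The corresponding P-value is 0.016; the P>|t| column of the output already reports the two-sided P-value, so it does not need to be doubled. Since the P-value is less than \(\alpha=0.05\), we reject \(H_{0}\) at the \(\alpha=0.05\) level. We conclude that the intercept should be in the model.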

  • Confidence interval:

    • A \(95\%\) confidence interval for \(\beta_1\) is \([2.659,4.168]\). We are \(95\%\) confident that the true value of \(\beta_1\) is between 2.659 and 4.168.

    • A \(95\%\) confidence interval for \(\beta_0\) is \([-21.400,-2.931]\). We are \(95\%\) confident that the true value of \(\beta_0\) is between -21.400 and -2.931.

  • R-squared = 0.932 tells us that about \(93\%\) of the variation in income is accounted for by a linear relationship with the variable years spent in education.
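
The standard errors, t statistics, P-values, and confidence intervals discussed above can be reproduced from the usual simple linear regression formulas, with \(n-2=8\) residual degrees of freedom. The sketch below is one way to do this with NumPy and scipy.stats; it assumes the same data as the example, and the variable names are illustrative rather than anything produced by statsmodels.

```python
import numpy as np
from scipy import stats

# Same data as in the example above
education = np.array([11, 12, 13, 15, 8, 10, 11, 12, 17, 11], dtype=float)
income = np.array([25, 27, 30, 41, 18, 23, 26, 24, 48, 26], dtype=float)

n = len(education)
x_bar, y_bar = education.mean(), income.mean()
s_xx = np.sum((education - x_bar) ** 2)

# Closed-form least-squares estimates
beta1_hat = np.sum((education - x_bar) * (income - y_bar)) / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Residual variance estimate with n - 2 degrees of freedom
residuals = income - (beta0_hat + beta1_hat * education)
df_resid = n - 2
mse = np.sum(residuals ** 2) / df_resid

# Standard errors of the slope and the intercept
se_beta1 = np.sqrt(mse / s_xx)
se_beta0 = np.sqrt(mse * (1.0 / n + x_bar ** 2 / s_xx))

# t statistics and two-sided P-values for H0: beta = 0
t_beta1 = beta1_hat / se_beta1
t_beta0 = beta0_hat / se_beta0
p_beta1 = 2 * stats.t.sf(abs(t_beta1), df_resid)
p_beta0 = 2 * stats.t.sf(abs(t_beta0), df_resid)

# 95% confidence intervals: estimate +/- t critical value * standard error
t_crit = stats.t.ppf(0.975, df_resid)
ci_beta1 = (beta1_hat - t_crit * se_beta1, beta1_hat + t_crit * se_beta1)
ci_beta0 = (beta0_hat - t_crit * se_beta0, beta0_hat + t_crit * se_beta0)

print(f"slope:     se={se_beta1:.3f}, t={t_beta1:.3f}, p={p_beta1:.3g}, "
      f"CI=[{ci_beta1[0]:.3f}, {ci_beta1[1]:.3f}]")
print(f"intercept: se={se_beta0:.3f}, t={t_beta0:.3f}, p={p_beta0:.3g}, "
      f"CI=[{ci_beta0[0]:.3f}, {ci_beta0[1]:.3f}]")
```

The two-sided P-values computed this way correspond to the P>|t| column of the summary table, which is why that column can be compared with \(\alpha\) directly.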

Session Info

import session_info
session_info.show()
Click to view session information
-----
matplotlib          3.5.2
numpy               1.22.4
pandas              1.4.2
scipy               1.8.1
session_info        1.0.0
statsmodels         0.13.2
-----
Click to view modules imported as dependencies
PIL                 9.1.1
asttokens           NA
backcall            0.2.0
beta_ufunc          NA
binom_ufunc         NA
cffi                1.15.0
colorama            0.4.4
cycler              0.10.0
cython_runtime      NA
dateutil            2.8.2
debugpy             1.6.0
decorator           5.1.1
defusedxml          0.7.1
entrypoints         0.4
executing           0.8.3
hypergeom_ufunc     NA
ipykernel           6.13.0
ipython_genutils    0.2.0
jedi                0.18.1
joblib              1.1.0
kiwisolver          1.4.2
mpl_toolkits        NA
nbinom_ufunc        NA
packaging           21.3
parso               0.8.3
patsy               0.5.2
pexpect             4.8.0
pickleshare         0.7.5
pkg_resources       NA
prompt_toolkit      3.0.29
psutil              5.9.1
ptyprocess          0.7.0
pure_eval           0.2.2
pydev_ipython       NA
pydevconsole        NA
pydevd              2.8.0
pydevd_file_utils   NA
pydevd_plugins      NA
pydevd_tracing      NA
pygments            2.12.0
pyparsing           3.0.9
pytz                2022.1
six                 1.16.0
sphinxcontrib       NA
stack_data          0.2.0
tornado             6.1
traitlets           5.2.1.post0
wcwidth             0.2.5
zmq                 23.0.0
-----
IPython             8.4.0
jupyter_client      7.3.1
jupyter_core        4.10.0
notebook            6.4.11
-----
Python 3.8.12 (default, May  4 2022, 08:13:04) [GCC 9.4.0]
Linux-5.13.0-1023-azure-x86_64-with-glibc2.2.5
-----
Session information updated at 2022-05-28 16:28