Regression Analysis

Let’s apply regression analysis to this data.

import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats

# Data: years of education and income (in thousands of dollars)
data = {'Education': [11,12,13,15,8,10,11,12,17,11],
        'Income': [25,27,30,41,18,23,26,24,48,26]} 
data
{'Education': [11, 12, 13, 15, 8, 10, 11, 12, 17, 11],
 'Income': [25, 27, 30, 41, 18, 23, 26, 24, 48, 26]}
import statsmodels.api as sm

# convert into a data frame
df = pd.DataFrame(data,columns=['Education','Income']) 

# get predictor and response
X = df['Education'] 
Y = df['Income']

# Add the intercept to the design matrix
X = sm.add_constant(X) 

model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 

print_model = model.summary()
print(print_model)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 Income   R-squared:                       0.932
Model:                            OLS   Adj. R-squared:                  0.923
Method:                 Least Squares   F-statistic:                     108.9
Date:                Sat, 28 May 2022   Prob (F-statistic):           6.18e-06
Time:                        16:28:35   Log-Likelihood:                -22.203
No. Observations:                  10   AIC:                             48.41
Df Residuals:                       8   BIC:                             49.01
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -12.1655      4.004     -3.038      0.016     -21.400      -2.931
Education      3.4138      0.327     10.434      0.000       2.659       4.168
==============================================================================
Omnibus:                        2.110   Durbin-Watson:                   2.085
Prob(Omnibus):                  0.348   Jarque-Bera (JB):                1.025
Skew:                          -0.771   Prob(JB):                        0.599
Kurtosis:                       2.710   Cond. No.                         62.6
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Interpretation of the output

  • The fitted regression line is:

    $\hat{y}_i = -12.166 + 3.414\,x_i$.

  • Interpretation of coefficients:

    • Education: $\hat{\beta}_1 = 3.414$ tells us that, on average, each additional year of education is associated with an increase of 3.414 thousand dollars in an individual’s income. The estimated standard error of $\hat{\beta}_1$ is $\mathrm{se}(\hat{\beta}_1) = 0.327$.

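    • const: $\hat{\beta}_0 = -12.166$. For this data, the intercept does not have a meaningful interpretation, since $X = 0$ does not make sense here (the observed values of education range from 8 to 17 years). The estimated standard error of $\hat{\beta}_0$ is $\mathrm{se}(\hat{\beta}_0) = 4.004$.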
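
The coefficient estimates and standard errors quoted above can be checked by hand with the usual closed-form formulas for simple linear regression. The following is a minimal sketch, not the statsmodels implementation; the variable names (`s_xx`, `s_xy`, `mse`, and so on) are chosen here for illustration.

```python
import numpy as np

# Data from the example: years of education and income (in $ thousands)
education = np.array([11, 12, 13, 15, 8, 10, 11, 12, 17, 11], dtype=float)
income = np.array([25, 27, 30, 41, 18, 23, 26, 24, 48, 26], dtype=float)

n = education.size
x_bar, y_bar = education.mean(), income.mean()

# Sums of squares and cross-products about the means
s_xx = np.sum((education - x_bar) ** 2)
s_xy = np.sum((education - x_bar) * (income - y_bar))

# Least-squares estimates of slope and intercept
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residual mean square with n - 2 degrees of freedom
residuals = income - (b0 + b1 * education)
mse = np.sum(residuals ** 2) / (n - 2)

# Standard errors of the two estimates
se_b1 = np.sqrt(mse / s_xx)
se_b0 = np.sqrt(mse * (1 / n + x_bar ** 2 / s_xx))

print(f"b0 = {b0:.4f} (se {se_b0:.3f})")
print(f"b1 = {b1:.4f} (se {se_b1:.3f})")
```

Up to rounding, the printed values should agree with the `coef` and `std err` columns of the summary table.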

  • Hypothesis testing for slope parameter:

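    $H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$

  • Under $H_0$, the observed value of the test statistic is $t_0 = \hat{\beta}_1 / \mathrm{se}(\hat{\beta}_1) = 3.4138 / 0.327 = 10.434$ (computed from the unrounded estimates).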

  • The corresponding P-value is reported as 0.000 (that is, less than 0.0005). Since the P-value is less than $\alpha = 0.05$, we reject $H_0$ at the $\alpha = 0.05$ level. We conclude that years spent in education is linearly associated with the individual’s income.

  • Hypothesis testing for intercept parameter:

    $H_0: \beta_0 = 0$ vs $H_1: \beta_0 \neq 0$

  • Under $H_0$, the observed value of the test statistic is $t_0 = \hat{\beta}_0 / \mathrm{se}(\hat{\beta}_0) = -12.1655 / 4.004 = -3.038$.

  • The corresponding P-value is 0.016 (the `P>|t|` column is already two-sided, so it should not be doubled). Since the P-value is less than $\alpha = 0.05$, we reject $H_0$ at the $\alpha = 0.05$ level. We conclude that the intercept should be in the model.
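
The two t tests above can be reproduced from the summary table. This is a minimal sketch assuming the rounded `coef` and `std err` values from the table and 8 residual degrees of freedom; the dictionary names are illustrative. The two-sided P-value is computed as $2\,P(T > |t_0|)$ using the survival function of the t distribution in `scipy.stats`.

```python
from scipy import stats

df_resid = 8  # n - 2 = 10 - 2, as reported under "Df Residuals"

# (coefficient, standard error) as printed in the summary table
estimates = {
    "const": (-12.1655, 4.004),
    "Education": (3.4138, 0.327),
}

results = {}
for name, (coef, se) in estimates.items():
    t0 = coef / se                               # observed test statistic
    p_value = 2 * stats.t.sf(abs(t0), df_resid)  # two-sided P-value
    results[name] = (t0, p_value)
    print(f"{name}: t0 = {t0:.3f}, two-sided p = {p_value:.2g}")
```

Because the standard errors used here are rounded to three decimals, the statistics can differ slightly in the last digits from the `t` column, which statsmodels computes from unrounded values.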

  • Confidence interval:

    • A 95% confidence interval for $\beta_1$ is $[2.659, 4.168]$. We are 95% confident that the true value of $\beta_1$ is between 2.659 and 4.168.

    • A 95% confidence interval for $\beta_0$ is $[-21.400, -2.931]$. We are 95% confident that the true value of $\beta_0$ is between -21.400 and -2.931.
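
These intervals have the form estimate ± $t_{0.975,\,8}$ × standard error. As a rough check, the sketch below rebuilds them from the rounded table values, so small differences in the third decimal are expected; the names `t_crit` and `intervals` are illustrative.

```python
from scipy import stats

df_resid = 8                              # n - 2 residual degrees of freedom
t_crit = stats.t.ppf(0.975, df_resid)     # critical value for a 95% interval

# (coefficient, standard error) as printed in the summary table
estimates = {
    "const": (-12.1655, 4.004),
    "Education": (3.4138, 0.327),
}

intervals = {}
for name, (coef, se) in estimates.items():
    half_width = t_crit * se
    intervals[name] = (coef - half_width, coef + half_width)
    low, high = intervals[name]
    print(f"{name}: [{low:.3f}, {high:.3f}]")
```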

  • R-squared = 0.932 tells us that about 93% of the variation in income is accounted for by a linear relationship with years spent in education.
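
R-squared can be verified directly as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this with NumPy and also checks that, for a single predictor, it equals the squared sample correlation between education and income. Variable names are illustrative.

```python
import numpy as np

# Data from the example: years of education and income (in $ thousands)
education = np.array([11, 12, 13, 15, 8, 10, 11, 12, 17, 11], dtype=float)
income = np.array([25, 27, 30, 41, 18, 23, 26, 24, 48, 26], dtype=float)

# Least-squares fit of income on education
x_c = education - education.mean()
b1 = np.sum(x_c * (income - income.mean())) / np.sum(x_c ** 2)
b0 = income.mean() - b1 * education.mean()
fitted = b0 + b1 * education

# Decompose the variation in income
ss_total = np.sum((income - income.mean()) ** 2)
ss_resid = np.sum((income - fitted) ** 2)
r_squared = 1 - ss_resid / ss_total

# With one predictor, R-squared is the squared Pearson correlation
r = np.corrcoef(education, income)[0, 1]

print(f"R-squared = {r_squared:.3f}, r^2 = {r ** 2:.3f}")
```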

Session Info

import session_info
session_info.show()
-----
matplotlib          3.5.2
numpy               1.22.4
pandas              1.4.2
scipy               1.8.1
session_info        1.0.0
statsmodels         0.13.2
-----
Modules imported as dependencies:
PIL                 9.1.1
asttokens           NA
backcall            0.2.0
beta_ufunc          NA
binom_ufunc         NA
cffi                1.15.0
colorama            0.4.4
cycler              0.10.0
cython_runtime      NA
dateutil            2.8.2
debugpy             1.6.0
decorator           5.1.1
defusedxml          0.7.1
entrypoints         0.4
executing           0.8.3
hypergeom_ufunc     NA
ipykernel           6.13.0
ipython_genutils    0.2.0
jedi                0.18.1
joblib              1.1.0
kiwisolver          1.4.2
mpl_toolkits        NA
nbinom_ufunc        NA
packaging           21.3
parso               0.8.3
patsy               0.5.2
pexpect             4.8.0
pickleshare         0.7.5
pkg_resources       NA
prompt_toolkit      3.0.29
psutil              5.9.1
ptyprocess          0.7.0
pure_eval           0.2.2
pydev_ipython       NA
pydevconsole        NA
pydevd              2.8.0
pydevd_file_utils   NA
pydevd_plugins      NA
pydevd_tracing      NA
pygments            2.12.0
pyparsing           3.0.9
pytz                2022.1
six                 1.16.0
sphinxcontrib       NA
stack_data          0.2.0
tornado             6.1
traitlets           5.2.1.post0
wcwidth             0.2.5
zmq                 23.0.0
-----
IPython             8.4.0
jupyter_client      7.3.1
jupyter_core        4.10.0
notebook            6.4.11
-----
Python 3.8.12 (default, May  4 2022, 08:13:04) [GCC 9.4.0]
Linux-5.13.0-1023-azure-x86_64-with-glibc2.2.5
-----
Session information updated at 2022-05-28 16:28