# Regression Analysis

Let’s apply regression analysis to this data.

import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats

# Present the data
data = {'Education': [11, 12, 13, 15, 8, 10, 11, 12, 17, 11],
        'Income': [25, 27, 30, 41, 18, 23, 26, 24, 48, 26]}
data

{'Education': [11, 12, 13, 15, 8, 10, 11, 12, 17, 11],
'Income': [25, 27, 30, 41, 18, 23, 26, 24, 48, 26]}

import statsmodels.api as sm

# convert into a data frame
df = pd.DataFrame(data,columns=['Education','Income'])

# get predictor and response
X = df['Education']
Y = df['Income']

# Add the intercept to the design matrix
# (sm.OLS does not include a constant term unless one is added explicitly)
X = sm.add_constant(X)

model = sm.OLS(Y, X).fit()
predictions = model.predict(X)

print_model = model.summary()
print(print_model)

                            OLS Regression Results
==============================================================================
Dep. Variable:                 Income   R-squared:                       0.932
Method:                 Least Squares   F-statistic:                     108.9
Date:                Sat, 28 May 2022   Prob (F-statistic):           6.18e-06
Time:                        16:28:35   Log-Likelihood:                -22.203
No. Observations:                  10   AIC:                             48.41
Df Residuals:                       8   BIC:                             49.01
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -12.1655      4.004     -3.038      0.016     -21.400      -2.931
Education      3.4138      0.327     10.434      0.000       2.659       4.168
==============================================================================
Omnibus:                        2.110   Durbin-Watson:                   2.085
Prob(Omnibus):                  0.348   Jarque-Bera (JB):                1.025
Skew:                          -0.771   Prob(JB):                        0.599
Kurtosis:                       2.710   Cond. No.                         62.6
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
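
As a cross-check on the table above, the coefficient estimates and their standard errors can be reproduced from the usual closed-form formulas for simple linear regression, $$\hat{\beta}_1=S_{xy}/S_{xx}$$ and $$\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}$$. The following is a minimal sketch using only NumPy; the variable names (`sxx`, `sxy`, `mse`, `se_b0`, `se_b1`) are illustrative choices and not part of the original notebook.

```python
import numpy as np

# Same data as in the notebook
x = np.array([11, 12, 13, 15, 8, 10, 11, 12, 17, 11], dtype=float)   # Education
y = np.array([25, 27, 30, 41, 18, 23, 26, 24, 48, 26], dtype=float)  # Income

n = x.size
x_bar, y_bar = x.mean(), y.mean()

# Corrected sum of squares of x and sum of cross-products of x and y
sxx = np.sum((x - x_bar) ** 2)
sxy = np.sum((x - x_bar) * (y - y_bar))

# Least-squares estimates of slope and intercept
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Residual variance estimate with n - 2 degrees of freedom
residuals = y - (b0 + b1 * x)
mse = np.sum(residuals ** 2) / (n - 2)

# Standard errors of the two estimates
se_b1 = np.sqrt(mse / sxx)
se_b0 = np.sqrt(mse * (1.0 / n + x_bar ** 2 / sxx))

print(f"b0 = {b0:.4f} (se {se_b0:.3f})")
print(f"b1 = {b1:.4f} (se {se_b1:.3f})")
```

The printed values should agree with the `coef` and `std err` columns of the summary table up to rounding.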


# Interpretation of the output

• The fitted regression line is:

$$\hat{y}_i=-12.166+3.414x_i.$$

• Interpretation of coefficients:

• Education: $$\hat{\beta}_1=3.414$$ tells us that, on average, each additional year of education is associated with a rise of $$3.414$$ thousand in an individual’s income. The estimated standard error of $$\hat{\beta}_1$$ is $$se(\hat{\beta}_1)=0.327$$.

• const: $$\hat{\beta}_0=-12.166$$. For this data, the intercept does not have a meaningful interpretation, since $$X=0$$ does not make sense here (the observed years of education range from 8 to 17). The estimated standard error of $$\hat{\beta}_0$$ is $$se(\hat{\beta}_0)=4.004$$.

• Hypothesis testing for slope parameter:

$$H_{0}:\beta_1=0$$ vs $$H_{1}:\beta_1 \neq 0$$

• Under $$H_{0}$$, the observed value of test statistic is $$t_{0}=\frac{3.4138}{0.327}=10.434$$.

• The corresponding P-value is reported as 0.000, that is, it is smaller than 0.0005. Since the P-value is less than $$\alpha=0.05$$, we reject $$H_{0}$$ at the $$\alpha=0.05$$ level. We conclude that years spent in education is linearly associated with the individual’s income.

• Hypothesis testing for intercept parameter:

$$H_{0}:\beta_0=0$$ vs $$H_{1}:\beta_0 \neq 0$$

• Under $$H_{0}$$, the observed value of test statistic is $$t_{0}=\frac{-12.1655}{4.004}=-3.038$$.

• The corresponding P-value is 0.016 (the `P>|t|` column already reports the two-sided P-value, so no doubling is needed). Since the P-value is less than $$\alpha=0.05$$, we reject $$H_{0}$$ at the $$\alpha=0.05$$ level. We conclude that the intercept should be in the model.

• Confidence interval:

• A $$95\%$$ confidence interval for $$\beta_1$$ is $$[2.659,4.168]$$. We are $$95\%$$ confident that the true value of $$\beta_1$$ is between 2.659 and 4.168.

• A $$95\%$$ confidence interval for $$\beta_0$$ is $$[-21.400,-2.931]$$. We are $$95\%$$ confident that the true value of $$\beta_0$$ is between -21.400 and -2.931.

• R-squared = 0.932 tells us that about $$93\%$$ of the variation in income is accounted for by a linear relationship with the variable years spent in education.
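
The test statistics, two-sided P-values, $$95\%$$ confidence intervals and R-squared discussed above can be recomputed by hand from the $$t$$ distribution with $$n-2=8$$ degrees of freedom. The sketch below assumes NumPy and SciPy are available and rebuilds the estimates from the raw data rather than reading them from the fitted `model`; the names `results`, `est`, `se` and `t_crit` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np
from scipy import stats

# Same data as in the notebook
x = np.array([11, 12, 13, 15, 8, 10, 11, 12, 17, 11], dtype=float)   # Education
y = np.array([25, 27, 30, 41, 18, 23, 26, 24, 48, 26], dtype=float)  # Income

n = x.size
df_resid = n - 2  # residual degrees of freedom in simple linear regression

# Least-squares estimates
sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()

# Residual and total sums of squares
sse = np.sum((y - (b0 + b1 * x)) ** 2)
sst = np.sum((y - y.mean()) ** 2)
mse = sse / df_resid

est = {"const": b0, "Education": b1}
se = {
    "const": np.sqrt(mse * (1.0 / n + x.mean() ** 2 / sxx)),
    "Education": np.sqrt(mse / sxx),
}

# Critical value for a two-sided 95% interval
t_crit = stats.t.ppf(0.975, df_resid)

results = {}
for name in est:
    t_stat = est[name] / se[name]
    # Two-sided P-value: probability of |T| >= |t_stat| under H0
    p_value = 2 * stats.t.sf(abs(t_stat), df_resid)
    ci = (est[name] - t_crit * se[name], est[name] + t_crit * se[name])
    results[name] = (t_stat, p_value, ci)
    print(f"{name}: t = {t_stat:.3f}, p = {p_value:.3g}, "
          f"95% CI = [{ci[0]:.3f}, {ci[1]:.3f}]")

# Share of the variation in y explained by the fitted line
r_squared = 1 - sse / sst
print(f"R-squared = {r_squared:.3f}")
```

Computing the P-value as `2 * stats.t.sf(abs(t_stat), df_resid)` makes explicit that the value in the `P>|t|` column is already two-sided.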

## Session Info

import session_info
session_info.show()

-----
matplotlib          3.5.2
numpy               1.22.4
pandas              1.4.2
scipy               1.8.1
session_info        1.0.0
statsmodels         0.13.2
-----

Modules imported as dependencies:
PIL                 9.1.1
asttokens           NA
backcall            0.2.0
beta_ufunc          NA
binom_ufunc         NA
cffi                1.15.0
colorama            0.4.4
cycler              0.10.0
cython_runtime      NA
dateutil            2.8.2
debugpy             1.6.0
decorator           5.1.1
defusedxml          0.7.1
entrypoints         0.4
executing           0.8.3
hypergeom_ufunc     NA
ipykernel           6.13.0
ipython_genutils    0.2.0
jedi                0.18.1
joblib              1.1.0
kiwisolver          1.4.2
mpl_toolkits        NA
nbinom_ufunc        NA
packaging           21.3
parso               0.8.3
patsy               0.5.2
pexpect             4.8.0
pickleshare         0.7.5
pkg_resources       NA
prompt_toolkit      3.0.29
psutil              5.9.1
ptyprocess          0.7.0
pure_eval           0.2.2
pydev_ipython       NA
pydevconsole        NA
pydevd              2.8.0
pydevd_file_utils   NA
pydevd_plugins      NA
pydevd_tracing      NA
pygments            2.12.0
pyparsing           3.0.9
pytz                2022.1
six                 1.16.0
sphinxcontrib       NA
stack_data          0.2.0
traitlets           5.2.1.post0
wcwidth             0.2.5
zmq                 23.0.0

-----
IPython             8.4.0
jupyter_client      7.3.1
jupyter_core        4.10.0
notebook            6.4.11
-----
Python 3.8.12 (default, May  4 2022, 08:13:04) [GCC 9.4.0]
Linux-5.13.0-1023-azure-x86_64-with-glibc2.2.5
-----
Session information updated at 2022-05-28 16:28