# Regression Analysis

Let's apply regression analysis to a small data set relating years of education to income.

In [1]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats

#present data
data = {'Education': [11,12,13,15,8,10,11,12,17,11],
        'Income': [25,27,30,41,18,23,26,24,48,26]} 
data

{'Education': [11, 12, 13, 15, 8, 10, 11, 12, 17, 11],
 'Income': [25, 27, 30, 41, 18, 23, 26, 24, 48, 26]}

In [3]:
import statsmodels.api as sm

# convert into a data frame
df = pd.DataFrame(data,columns=['Education','Income']) 

# get predictor and response
X = df['Education'] 
Y = df['Income']

# Add the intercept to the design matrix
X = sm.add_constant(X) 

model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 

print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:                 Income   R-squared:                       0.932
Model:                            OLS   Adj. R-squared:                  0.923
Method:                 Least Squares   F-statistic:                     108.9
Date:                Fri, 27 May 2022   Prob (F-statistic):           6.18e-06
Time:                        10:44:22   Log-Likelihood:                -22.203
No. Observations:                  10   AIC:                             48.41
Df Residuals:                       8   BIC:                             49.01
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -12.1655      4.004     -3.038      0.0

# Interpretation of the output

- The fitted regression line is:

  $\hat{y}_i=-12.166+3.414x_i.$
  
- Interpretation of coefficients:
  - $Education:\hat{\beta}_1=3.414$ tells us that, on average, each additional year of education is associated with an increase of $\$ 3.414$ thousand in an individual's income. The estimated standard error of $\hat{\beta}_1$ is $se(\hat{\beta}_1)=0.327$.
  - $const:\hat{\beta}_0=-12.166$. For this data, the intercept does not have a meaningful interpretation, since $X=0$ (zero years of education) does not make sense here: the observed values of education range from 8 to 17 years. The estimated standard error of $\hat{\beta}_0$ is $se(\hat{\beta}_0)=4.004$.
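
As a check on the coefficients and standard errors above, the closed-form least-squares formulas for simple linear regression can be evaluated directly. This is a minimal sketch that re-enters the data from the first cell; the variable names (`s_xx`, `s_xy`, `mse`, and so on) are chosen here for illustration and are not part of the statsmodels output.

```python
import numpy as np

# Data re-entered from the first cell
education = np.array([11, 12, 13, 15, 8, 10, 11, 12, 17, 11], dtype=float)
income = np.array([25, 27, 30, 41, 18, 23, 26, 24, 48, 26], dtype=float)

n = education.size
x_bar, y_bar = education.mean(), income.mean()

# Sum of squares of x and sum of cross-products, both about the means
s_xx = np.sum((education - x_bar) ** 2)
s_xy = np.sum((education - x_bar) * (income - y_bar))

# Least-squares estimates of slope and intercept
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Residual variance estimate with n - 2 degrees of freedom
residuals = income - (beta0_hat + beta1_hat * education)
mse = np.sum(residuals ** 2) / (n - 2)

# Standard errors of the two estimates
se_beta1 = np.sqrt(mse / s_xx)
se_beta0 = np.sqrt(mse * (1.0 / n + x_bar ** 2 / s_xx))

print(f"beta1_hat = {beta1_hat:.4f}, se = {se_beta1:.3f}")
print(f"beta0_hat = {beta0_hat:.4f}, se = {se_beta0:.3f}")
```

The printed values should agree with the `coef` and `std err` columns of the summary table up to rounding.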

- Hypothesis testing for slope parameter:

  $H_{0}:\beta_1=0$ vs
  $H_{1}:\beta_1 \neq 0$
  
- Under $H_{0}$, the observed value of the **test statistic** is $t_{0}=\frac{3.4138}{0.327}=10.434$.
- The corresponding P-value is 0.000 to three decimal places. Since the P-value is less than $\alpha=0.05$, we reject $H_{0}$ 
at the $\alpha=0.05$ level. We conclude that years spent in education is linearly associated with the individual's
income.

- Hypothesis testing for intercept parameter:

  $H_{0}:\beta_0=0$ vs
  $H_{1}:\beta_0 \neq 0$
  
- Under $H_{0}$, the observed value of the **test statistic** is $t_{0}=\frac{-12.1655}{4.004}=-3.038$.
- The corresponding two-sided P-value is 0.016. Since the P-value is less than $\alpha=0.05$, we reject $H_{0}$ 
at the $\alpha=0.05$ level. We conclude that the intercept should be in the model.
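
The two t-tests above can be reproduced from the rounded estimates and standard errors with `scipy.stats`. This is a sketch under the assumption that the tests are two-sided with $n-2=8$ residual degrees of freedom, which matches the `P>|t|` column that statsmodels reports; the dictionary `coefs` is only an illustrative container for the numbers quoted in the text. Because the inputs are rounded, the statistics may differ from the table in the last decimal place.

```python
from scipy import stats

n_obs = 10
df_resid = n_obs - 2  # residual degrees of freedom in simple linear regression

# (estimate, standard error) pairs as quoted in the interpretation above
coefs = {"Education": (3.4138, 0.327), "const": (-12.1655, 4.004)}

p_values = {}
for name, (estimate, std_err) in coefs.items():
    t_obs = estimate / std_err
    # Two-sided P-value: P(|T| >= |t_obs|) for T following a t distribution with df_resid df
    p_values[name] = 2 * stats.t.sf(abs(t_obs), df_resid)
    print(f"{name}: t = {t_obs:.3f}, two-sided p = {p_values[name]:.4f}")
```

Note that the factor of 2 is already included in the two-sided P-value, so it should be compared with $\alpha$ directly.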
     
- Confidence interval:

  - A $95\%$ confidence interval for $\beta_1$ is $[2.659,4.168]$. We are $95\%$ confident that the true value
  of $\beta_1$ is between 2.659 and 4.168.
  - A $95\%$ confidence interval for $\beta_0$ is $[-21.400,-2.931]$. We are $95\%$ confident that the true value
  of $\beta_0$ is between -21.400 and -2.931.
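
These intervals have the form $\hat{\beta}_j \pm t_{0.975,\,8}\, se(\hat{\beta}_j)$. The sketch below rebuilds them from the rounded values quoted above; `conf_int` is a hypothetical helper written for this illustration, not a statsmodels function, and small differences in the third decimal place are expected because the inputs are rounded.

```python
from scipy import stats

df_resid = 8  # n - 2 = 10 - 2
t_crit = stats.t.ppf(0.975, df_resid)  # upper 2.5% point of the t distribution with 8 df


def conf_int(estimate, std_err):
    """Return the (lower, upper) limits of a 95% t-based confidence interval."""
    half_width = t_crit * std_err
    return estimate - half_width, estimate + half_width


ci_beta1 = conf_int(3.4138, 0.327)
ci_beta0 = conf_int(-12.1655, 4.004)

print(f"t critical value: {t_crit:.3f}")
print(f"95% CI for beta1: [{ci_beta1[0]:.3f}, {ci_beta1[1]:.3f}]")
print(f"95% CI for beta0: [{ci_beta0[0]:.3f}, {ci_beta0[1]:.3f}]")
```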

- R-squared = 0.932 tells us that about $93\%$ of the variation in income is accounted for by a linear relationship with years spent in education.
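
R-squared can also be computed by hand as $1 - SS_{res}/SS_{tot}$. The following sketch assumes the same data as the first cell and uses `np.polyfit` for the line fit in place of statsmodels; in simple linear regression the result should equal the squared sample correlation between the two variables.

```python
import numpy as np

# Data re-entered from the first cell
education = np.array([11, 12, 13, 15, 8, 10, 11, 12, 17, 11], dtype=float)
income = np.array([25, 27, 30, 41, 18, 23, 26, 24, 48, 26], dtype=float)

# Degree-1 least-squares fit; polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(education, income, 1)
fitted = intercept + slope * education

# Total and residual sums of squares
ss_total = np.sum((income - income.mean()) ** 2)
ss_resid = np.sum((income - fitted) ** 2)
r_squared = 1 - ss_resid / ss_total

# Sample correlation between education and income
r = np.corrcoef(education, income)[0, 1]

print(f"R-squared = {r_squared:.3f}, r^2 = {r ** 2:.3f}")
```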

#### Session Info

In [None]:
import session_info
session_info.show()