Analysis of variance (ANOVA)

SST (Sum of Squares Total)

Also known as TSS (Total Sum of Squares).

Measures the total variability of the dataset.

\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}

SSR (Sum of Squares Regression)

Also known as ESS (Explained Sum of squares)

It is the sum of the squared differences between the predicted values and the mean of the dependent variable.

Measures the variability explained by your regression line.

\sum_{i=1}^{n}\left(\widehat{y}_{i}-\bar{y}\right)^{2}

SSE (Sum of Squares Error)

Also known as RSS (Residual Sum of Squares) -> Remaining/Unexplained

Measures the variability left unexplained by the regression.

Lower error => better explanatory power

\sum_{i=1}^{n} e_{i}^{2}

Connection

Sum of Squares Total = Sum of Squares Regression + Sum of Squares Error

SST = SSR + SSE

R-Squared

The R-squared shows how much of the total variability of the dataset is explained by your regression model. In other words, it measures how well your model fits your data.

It is incorrect to say your regression line fits the data, as the line is just the geometrical representation of the regression equation. It is also incorrect to say the data fits the model or the regression line, as you are trying to explain the data with a model, not vice versa.

R^2 = \frac{SSR}{SST}

  • It is a relative measure and takes values ranging from 0 to 1.
  • R^2 = 0 means your model explains none of the variability of the data.
  • R^2 = 1 means your model explains the entire variability of the data.
  • You will usually observe R^2 values ranging from 0.2 to 0.9.
  • A good R^2 value depends on the complexity of the topic and how many variables are believed to be in play.
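
To make the decomposition above concrete, here is a minimal NumPy sketch that computes SST, SSR, SSE, and R-squared from hypothetical arrays y (observed values) and y_hat (fitted values); the data and names are illustrative, not from the course.

import numpy as np

# Hypothetical observed values and fitted values from some regression line
y = np.array([3.0, 5.0, 7.0, 9.0, 11.5])
y_hat = np.array([3.2, 5.1, 6.8, 9.3, 11.1])

sst = np.sum((y - y.mean()) ** 2)      # total variability
ssr = np.sum((y_hat - y.mean()) ** 2)  # variability explained by the regression
sse = np.sum((y - y_hat) ** 2)         # unexplained variability (residuals)

r_squared = ssr / sst
print(sst, ssr, sse, r_squared)        # SST = SSR + SSE holds exactly for an OLS fit with an intercept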

Adjusted R-Squared

In terms of explanatory power, a multiple regression is never worse than a simple one: with each additional variable you add, the explanatory power can only increase or stay the same.

Adjusted R-Squared is used to measure how well your model fits the data, but it penalizes the excessive use of variables.

  • Adjusted R-Squared (\bar{R}^2) is almost always smaller than R-Squared (R^2).
    • The statement fails only in the extreme case of a small sample size combined with a high number of independent variables.
  • Adjusted R-Squared penalizes the excessive use of variables.

If adding a variable increases the R-squared but decreases the Adjusted R-squared, the variable can be omitted, since it holds no predictive power.

F-Statistic

The F statistic follows an F distribution.

The F-test is used for testing the overall significance of the model.

The lower the F-statistic, the closer the model is to being non-significant.

The F-Test:

H_{0}: \beta_{1}=\beta_{2}=\cdots=\beta_{k}=0

H_{1}: \text{at least one } \beta_{i} \neq 0

H_0: if all the betas are 0, then none of the independent variables matter and our model has no merit.
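
As an illustration, a fitted statsmodels OLS results object exposes the F-statistic and its p-value directly; a minimal sketch with made-up data (the fvalue and f_pvalue attributes are part of statsmodels):

import numpy as np
import statsmodels.api as sm

# Made-up data, purely for illustration
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

results = sm.OLS(y, sm.add_constant(x1)).fit()
print(results.fvalue)    # the F-statistic for the overall significance test
print(results.f_pvalue)  # its p-value; a small value rejects H0 (all betas are 0)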

Linear Regression Assumption

We should consider them before performing linear regression analysis.

  • Linearity
  • No endogeneity of regressor
  • Normality and homoscedasticity
  • No autocorrelation
  • No multicollinearity

We should not violate the assumptions.

If a regression assumption is violated, performing regression analysis will yield an incorrect result.

Linearity

y=\beta_{0}+\beta_{1} x_{1}+\beta_{2} x_{2}+\cdots+\beta_{k} x_{k}+\varepsilon

If the data points form a pattern that looks like a straight line then a linear regression model is suitable.

No endogeneity of regressor

\sigma_{X \varepsilon}=0: \forall x, \varepsilon

Omitted Variables Bias happens when you forget to include a relevant variable. This is reflected in the error term as the factor you forgot about is included in the error. In this way, the error is not random but includes a systematic part (the omitted variable).

Normality and homoscedasticity

\varepsilon \sim N\left(0, \sigma^{2}\right)

Normality

We assume the error term is normally distributed.

T-tests and F-tests work because we have assumed normality of the error term.

Zero mean

If the expected mean of the errors is not 0, then the line is not the best-fitting one.

Homoscedasticity

Homoscedasticity means having equal variance: the error terms should all have equal variance.

To prevent heteroscedasticity:

  • Look for Omitted Variables Bias
  • Look for outliers
  • Apply Transformation
    • Log Transformation

To apply a log transformation (a short sketch follows this list):

  • Take the natural log of the variable
  • Then create a regression between the log of Y and the independent Xs
    • semi-log model: \log\hat{y} = b_0 + b_1 x_1
  • Or, conversely, create a regression between Y and the log of the independent Xs
    • semi-log model: \hat{y} = b_0 + b_1 \log x_1
  • Or sometimes we need to change both scales to log.
    • log-log model: \log\hat{y} = b_0 + b_1 \log x_1
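
Here is the sketch referenced above: a minimal example of the three model forms, assuming NumPy arrays y and x1 with strictly positive values (the data is made up).

import numpy as np
import statsmodels.api as sm

# Made-up positive-valued data
x1 = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = np.array([3.0, 5.0, 9.0, 17.0, 33.0])

# Semi-log model: log(y) regressed on x1
semi_log_y = sm.OLS(np.log(y), sm.add_constant(x1)).fit()

# Semi-log model: y regressed on log(x1)
semi_log_x = sm.OLS(y, sm.add_constant(np.log(x1))).fit()

# Log-log model: log(y) regressed on log(x1)
log_log = sm.OLS(np.log(y), sm.add_constant(np.log(x1))).fit()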

No autocorrelation

Also known as No serial correlation.

\sigma_{\varepsilon_{i} \varepsilon_{j}}=0: \forall i \neq j

Errors are assumed to be uncorrelated.

Autocorrelation is not likely to be observed in cross-sectional data. You usually spot it at time series data, which is a subset of panel data. Sample data is not relevant for this question.

To detect autocorrelation (a short sketch follows this list):

  • Plot all the residuals on a graph and look for patterns
    • If there are no patterns to be seen => no autocorrelation
  • Durbin-Watson test (values fall between 0 and 4)
    • 2 => no autocorrelation
    • values below 1 or above 3 are a cause for alarm
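
The sketch referenced above: the Durbin-Watson statistic can be computed from the residuals of a fitted statsmodels model (made-up data; note that results.summary() also reports it).

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Made-up data, purely for illustration
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9, 12.1, 14.0, 16.2])

results = sm.OLS(y, sm.add_constant(x1)).fit()
print(durbin_watson(results.resid))  # ~2 => no autocorrelation; below 1 or above 3 is a cause for alarm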

To fix autocorrelation, use alternative models:

  • Autoregressive model
  • Moving average model
  • Autoregressive moving average model
  • Autoregressive integrated moving average model

No multicollinearity

\rho_{x_{i} x_{j}} \not\approx 1: \forall i, j ; i \neq j

We observe multicollinearity when two or more variables have a high correlation.

Prevention:

  • Find the correlation between each pair of independent variables

To fix multicollinearity:

  • Drop one of the two variables
  • Transform them into a single variable
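
A minimal pandas sketch of the prevention step, assuming a data frame with hypothetical independent variables 'x1', 'x2', and 'x3' (not from the course): a pairwise correlation close to 1 or -1 tells you which variable to drop or combine.

import pandas as pd

# Hypothetical independent variables
data = pd.DataFrame({
    'x1': [1.0, 2.0, 3.0, 4.0, 5.0],
    'x2': [2.1, 4.2, 5.9, 8.1, 9.8],  # strongly correlated with x1
    'x3': [5.0, 3.0, 6.0, 2.0, 7.0],
})

# Pairwise correlations between the independent variables
print(data.corr())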

Dummy Variable

A dummy variable is an imitation or a copy that acts as a substitute in regression analysis.

A dummy is a variable that is used to include categorical data into a regression model.

Usually we imitate the categories with numbers (e.g. Yes = 1, No = 0).

Dummy Variable In Python

Using the pandas library, we can back up the dataset and then map the categorical values to numbers.

data = raw_data.copy()

data['xxx'] = data['xxx'].map({'Yes':1,'No':0})

Regression Analysis

Regression analysis is one of the most common methods of prediction. It is used whenever we have a causal relationship between variables.

Fundamentals of regression analysis are used in supervised machine learning.

Linear Regression Model

A linear regression is a linear approximation of a causal relationship between two or more variables.

One of the most common ways to make inferences and predictions

Process:

  1. You get sample data
  2. design a model that explains the data
  3. then make predictions for the whole population based on the model you’ve developed

Simple Linear Regression Equation

The simple linear regression equation goes like this:

\hat{y} = b_0 + b_1 x_1

Variables explained:

\hat{y} - the estimated / predicted value (dependent variable)

b_0 - a constant (intercept)

b_1 - the slope; it quantifies the effect of the independent variable (x) on the dependent variable (y)

x_1 - the sample data for the independent variable (observed values)

Relation: x affects y

Correlation vs Regression

Correlation does not imply causation

| Correlation | Regression |
| --- | --- |
| A relationship (the variables affect each other, ↔) | One variable affects the other (→) |
| Movement together | Cause and effect |
| ρ(x,y) = ρ(y,x) | One way only |
| Represented by a single point | Represented by a line |
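
A tiny NumPy illustration of the symmetry in the table above (the arrays are made up): the correlation matrix is the same whichever way you order the variables.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# np.corrcoef returns a symmetric matrix: corr(x, y) == corr(y, x)
print(np.corrcoef(x, y))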

Linear Regression In Python (with StatsModels)

We mainly use pandas to create the data frame.

Import the relevant libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm  # needed below for sm.add_constant and sm.OLS

# We can override the default matplotlib styles with those of Seaborn
import seaborn as sns # seaborn is a package built on top of matplotlib
sns.set() # activate seaborn to override all the matplotlib graphics

Load the data from .csv file

# Load the data from a .csv in the same folder
data = pd.read_csv('xxx_yyy.csv')
# To check what's inside this data frame
data
# or this
data.head()
# check more detail
data.describe()

We can use .describe(include='all') for the categorical variables.

Declare the dependent and the independent variables

# Following the regression equation, our dependent variable (y)
y = data ['yyy']
# Similarly, our independent variable (x)
x1 = data ['xxx']

Explore the data first.

# Plot a scatter plot (first we put the horizontal axis, then the vertical axis)
plt.scatter(x1,y)
# Name the axes
plt.xlabel('xxx', fontsize = 20)
plt.ylabel('yyy', fontsize = 20)
# Show the plot
plt.show()

Do a Regression

Now we find the regression line to do the prediction.

# Add a constant. Essentially, we are adding a new column (equal in length to x), which consists only of 1s
x = sm.add_constant(x1)
# Fit the model, according to the OLS (ordinary least squares) method with a dependent variable y and an independent x
results = sm.OLS(y,x).fit()
# Print a nice summary of the regression. That's one of the strong points of statsmodels -> the summaries
results.summary()

^ results.summary() contains a few tables: a model summary, a coefficient table, and some additional tests.

  • The coefficient table shows the coefficient of x1 and the coefficient of the constant (intercept).
    • These are used in the code below, stored as coef_of_x1 and coef_of_const as an example.
  • The standard error (std err) shows the accuracy of the prediction.
    • The lower the standard error, the better the estimate.
  • The p-value determines whether the variable is significant.
    • A p-value < 0.05 means the variable is significant, i.e. the coefficient is most probably different from zero.
plt.scatter(x1,y)
yhat = x1*coef_of_x1 + coef_of_const
fig = plt.plot(x1,yhat, lw=4, c='orange', label ='regression line')
plt.xlabel('xxx', fontsize = 20)
plt.ylabel('yyy', fontsize = 20)
plt.show()
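
Rather than copying the coefficients from the summary table by hand, they can also be pulled from the fitted results object via results.params, which is indexed by 'const' and the column name (here the placeholder 'xxx'); this is a sketch of one way to fill the coef_of_const and coef_of_x1 placeholders used above.

# The fitted coefficients as a pandas Series, indexed by 'const' and the column name
coef_of_const = results.params['const']
coef_of_x1 = results.params['xxx']  # 'xxx' is the placeholder column name used above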

Do a Prediction

Let’s say we have some new records called new_data.

We can use the predict method to know the predictions.

# new_data must contain the same columns as x (including the constant added by sm.add_constant)
predictions = results.predict(new_data)

Then show them again as a dataframe.

predictionsdf = pd.DataFrame({'Predictions':predictions})
joined = new_data.join(predictionsdf)
joined.rename(index={0:'index0',1:'index1'})

Linear Regression In Python (with Sklearn)

Scikit-learn is built on NumPy, SciPy, and Matplotlib.

We need to translate our data into ndarrays using NumPy and then feed them to the algorithm.

Import the relevant libraries

# For these lessons we will need NumPy, pandas, matplotlib and seaborn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# and of course the actual regression (machine learning) module
from sklearn.linear_model import LinearRegression

Load the data from .csv file

# Load the data from a .csv in the same folder
data = pd.read_csv('xxx_yyy.csv')
# To check what's inside this data frame
data
# or this
data.head()
# check more detail
data.describe()

Declare the dependent and the independent variables

  • Our dependent variable (y) is called target or output
  • Our independent variable (x) is called feature or input.

^ This is supervised learning.

# Our dependent variable (y) is called target or output
y = data ['yyy']
# Our independent variable (x) is called feature or input
x = data ['xxx']

# Often it is useful to check the shapes of the features
x.shape

In order to feed x to sklearn, it should be a 2D array (a matrix).

Therefore, we must reshape it since we only get 1 feature in this case…

# Note that this will not be needed when we've got more than 1 feature (as the inputs will be a 2D array by default)
x_matrix = x.values.reshape(-1,1)

# Check the shape just in case
x_matrix.shape

Note:

What does numpy reshape(-1,1) mean?
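
A quick illustration of what reshape(-1, 1) does to a hypothetical 1D array:

import numpy as np

a = np.array([1, 2, 3, 4])  # shape (4,)  -> a 1D array
b = a.reshape(-1, 1)        # shape (4, 1) -> a column matrix; -1 lets NumPy infer the number of rows
print(a.shape, b.shape)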

Regression itself

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

reg = LinearRegression()

# The whole learning process boils down to fitting the regression
# Note that the first argument is the independent variable,
# while the second - the dependent (unlike with StatsModels)
reg.fit(x_matrix,y)

^ Above code will give an output like this:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

  • normalize : subtracts the mean and divides by the L2-norm of the inputs
  • copy_X : copies the inputs before fitting them. Usually set to True.
  • fit_intercept : adds a constant.
  • n_jobs : decides how many CPU cores to use.
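
If you want to inspect these hyperparameters programmatically, scikit-learn estimators expose get_params(); on the reg object above it returns something like the dictionary below (the exact keys depend on the sklearn version).

# Returns the estimator's hyperparameters as a dictionary,
# e.g. {'copy_X': True, 'fit_intercept': True, 'n_jobs': 1, 'normalize': False}
reg.get_params()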

Note:

https://en.wikipedia.org/wiki/Feature_scaling

Standardization is the process of subtracting the mean and dividing by the standard deviation

\frac{x-\mu}{\sigma}

which is a type of normalization, but the term is used with a different meaning here.

In this case, normalization means subtracting the mean and dividing by the L2-norm of the inputs.

fit_intercept=True is actually the same as x = sm.add_constant(x1) in the statsmodels example, but sklearn does it automatically.

Find R-squared

reg.score(x_matrix,y)

Find Coefficients

reg.coef_

Find Intercept

reg.intercept_

Making Predictions

Let's say we have a new dataset of inputs.

Create a fake dataset using a DataFrame

new_data = pd.DataFrame(data=[1111,1101], columns=['xxx'])
new_data

Now Making Predictions

reg.predict(new_data)

To show in a table:

new_data['Predicted_yyy'] = reg.predict(new_data)
new_data

Plot the Regression

Same as previous example.

plt.scatter(x,y)
yhat = reg.coef_*x_matrix + reg.intercept_
fig = plt.plot(x,yhat, lw=4, c='orange', label = 'regression line')
plt.xlabel('xxx', fontsize = 20)
plt.ylabel('yyy', fontsize = 20)
plt.show()

Multiple Linear Regression Model

We prefer a multiple linear regression model to a simple linear regression model because it is more realistic: things often depend on many factors.

Multiple Linear Regression Equation

The multiple linear regression equation goes like this:

\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k

Variables explained:

\hat{y} - the estimated / predicted value (dependent variable)

b_0 - a constant (intercept)

b_1 to b_k - the coefficients / slopes; each quantifies the effect of its independent variable (x) on the dependent variable (y)

x_1 to x_k - the independent variables (observed values)

Relation: x affects y

With more than one independent variable the model stops being two-dimensional (2D), and once we have more than three dimensions (3D) there is no visual way to represent the data.

It’s about the best fitting model. (Least SSE for OLS)

Multiple Linear Regression In Python (with StatsModels)

Import the relevant libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn
seaborn.set()

Load the data from .csv file

# Load the data from a .csv in the same folder
data = pd.read_csv('xxx_yyy.csv')
# To check what's inside this data frame
data
# or this
data.head()
# check more detail
data.describe()

Declare the dependent and the independent variables

# Following the regression equation, our dependent variable (y)
y = data ['yyy']
# Similarly, our independent variable (x)
# for example, we have 2 independent variables this time:
x1 = data [['xxx1', 'xxx2']]

To get a better understanding, run this code to inspect x1.

x1
#or
x1.head()

Do a Regression

Now we find the regression line to do the prediction, just like in the simple case.

# Add a constant. Essentially, we are adding a new column (equal in length to x), which consists only of 1s
x = sm.add_constant(x1)
# Fit the model, according to the OLS (ordinary least squares) method with a dependent variable y and an independent x
results = sm.OLS(y,x).fit()
# Print a nice summary of the regression. That's one of the strong points of statsmodels -> the summaries
results.summary()

^ results.summary() contains a few tables: a model summary, a coefficient table, and some additional tests.

  • The coefficient table shows the coefficients of the independent variables and the coefficient of the constant (intercept).
  • The standard error (std err) shows the accuracy of the prediction.
    • The lower the standard error, the better the estimate.
  • The p-value determines whether the variable is significant.
    • A p-value < 0.05 means the variable is significant, i.e. the coefficient is most probably different from zero.
    • The p-values tell us which variables can be removed from the equation.

Multiple Regression In Python (with Sklearn)

Import the relevant libraries

# For these lessons we will need NumPy, pandas, matplotlib and seaborn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# and of course the actual regression (machine learning) module
from sklearn.linear_model import LinearRegression

Load the data from .csv file

# Load the data from a .csv in the same folder
data = pd.read_csv('xxx_yyy.csv')
# To check what's inside this data frame
data
# or this
data.head()
# check more detail
data.describe()

Declare the dependent and the independent variables

  • Our dependent variable (y) is called target or output
  • Our independent variable (x) is called feature or input.

^ This is supervised learning.

# Our dependent variable (y) is called target or output
y = data ['yyy']
# Our independent variable (x) is called feature or input
# for example, we have 2 independent variables this time:
x = data [['xxx1','xxx2']]

# Often it is useful to check the shapes of the features
x.shape

This time x is already a 2D array (a matrix), since we have 2 features.

Therefore, we do not need to reshape it.

Standardization (Feature Scaling)

Standardization is the process of subtracting the mean and dividing by the standard deviation

\frac{x-\mu}{\sigma}

from sklearn.preprocessing import StandardScaler
# declare standard scaler object
scaler = StandardScaler()

# Preparing the mechanism...
# calculates and stores the mean and standard deviation of each feature
scaler.fit(x)

output will be something like : StandardScaler(copy=True, with_mean=True, with_std=True)

Then we can get the scaled data using scaler.transform(x).

# scaled data
x_scaled = scaler.transform(x)
x_scaled

Note: This will affect the coefficients.

When we perform feature scaling, we don’t care if a useless variable is there or not.
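
Note that any new observations you later predict on must be scaled with the same fitted scaler; a minimal sketch, assuming a hypothetical DataFrame new_data with the same feature columns as x.

# new_data is a hypothetical DataFrame with the same feature columns as x
new_data_scaled = scaler.transform(new_data)  # reuse the mean and std learned from the training features
# new_data_scaled can then be passed to reg.predict(...) once the regression below is fitted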

Regression itself

Same as the simple one.

reg = LinearRegression()

# The whole learning process boils down to fitting the regression
# Note that the first argument is the independent variable,
# while the second - the dependent (unlike with StatsModels)
reg.fit(x,y)

Finding Coefficients and Intercept

reg.coef_
reg.intercept_

Find R-squared

Note: The R-squared is a universal measure for evaluating and comparing how well linear regressions fare.

reg.score(x,y)

Find Adjusted R-squared

Adjusted R-Squared is used to measure how well your model fits the data, but it penalizes the excessive use of variables.

R^2_{adj.} = 1 - (1-R^2)\cdot\frac{n-1}{n-p-1}

where nn is the number of observations (samples) and pp is the number of predictors (features).

You can know nn and pp by checking x.shape.

r2 = reg.score(x,y)

n = x.shape[0]

p = x.shape[1]

adjusted_r2 = 1 - (1-r2) * (n-1)/(n-p-1)
adjusted_r2

You can also make it as a function:

def find_adj_r2(x,y):
    r2 = reg.score(x,y)
    n = x.shape[0]
    p = x.shape[1]
    adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
    return adjusted_r2

find_adj_r2(x,y)

Feature Selection with F-regression

  • Used to detect the variables which are unneeded in the model.
  • It simplifies models, which makes them much easier for data scientists to interpret.
  • Through this process we gain improved speed and often prevent a series of other unwanted issues arising from having too many features.
  • In StatsModels we use the p-values of x to determine whether the independent variables are relevant for the model; in sklearn we use F-regression to find the p-values of x using the F-statistic.
from sklearn.feature_selection import f_regression

f_regression(x,y) # return F-statistics and p-values

# We only need the p_values, therefore :
p_values = f_regression(x,y)[1]
p_values.round(3) # round to 3 decimal places

^ If the p-value > 0.05, the independent variable x is redundant.

We use p-values to determine whether a variable is redundant.

Note:

The F-regression does not take into account the interrelation of the features.

Creating a summary table

Note:

  • The intercept is also called the bias
  • The coefficients are also called weights
# Let's create a new data frame with the names of the features
reg_summary = pd.DataFrame(data = x.columns.values, columns=['Features'])
reg_summary
# Then we create and fill a second column, 
# called 'Coefficients' with the coefficients of the regression
reg_summary ['Coefficients'] = reg.coef_
# Finally, we add the p-values we just calculated
reg_summary ['p-values'] = p_values.round(3)
reg_summary

Now you have a good summary table to see which variable is redundant.

Template: “It seems that ‘xxx’ is not even significant, therefore we should remove it from the model.”

Practical Linear Regression Concepts

Splitting Train Data and Test Data in Python

Import the relevant libraries

import numpy as np
from sklearn.model_selection import train_test_split

Generate some data

a = np.arange(1,101) # generate an array from number 1 to 100
b = np.arange(501,601) # generate an array from number 501 to 600

Split the data

We use the method train_test_split(x) to split arrays or matrices into random train and test subsets.

The first returned array is the training set and the second is the test set.

a_train, a_test = train_test_split(a, test_size=0.2, shuffle=True)

Note that if the data is shuffled, each time we split it we will get different training and testing datasets.

This will affect the result a bit.

To avoid getting a different shuffle each time, we use random_state to make the split consistent.

a_train, a_test = train_test_split(a, test_size=0.2, random_state=42) 
# random_state=42 is commonly used

^ Now we get exactly the same shuffled split every time.

Note: We can also split more than one dataset at a time.

# The 2 arrays will be split into 4
# The order is train1, test1, train2, test2.
a_train, a_test, b_train, b_test = train_test_split(a, b, test_size=0.2, random_state=42)

Preprocessing Data

Check the descriptive statistics of all variables

# To include the categorical ones, you should specify this with an argument
raw_data.describe(include='all')

Determining the variables of interest

Sometimes we will drop some of the variables from the data because they are not useful.

# DataFrame.drop(columns, axis) returns new object with the indicated columns dropped
data = raw_data.drop('xxx', axis=1)
# axis 1 stands for columns, axis 0 stands for rows

Dealing with missing values

# check for missing values
data.isnull() # shows a dataframe with the information whether a data point is null
# Since True = the data point is missing,
# while False = the data point is not missing, we can sum them

# This will give us the total number of missing values feature-wise
data.isnull().sum()

A rule of thumb :

if you are removing < 5% of the observations you are free to just remove all observations that have missing values.

# dropping records with missing values
data_no_mv = data.dropna(axis=0)

data_no_mv.describe(include='all')

Be aware that a common way to label missing values is to assign a value such as 99.99.

(It is a bad idea to label values in such ways, as it is very hard for other users of the data to distinguish them from the true values, but some people still do it.)
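
If you run into such sentinel values, one way to handle them (a sketch, assuming the sentinel is exactly 99.99 and the DataFrame is called data, as above) is to convert them to NaN first so the usual missing-value tools see them.

import numpy as np

# Convert the 99.99 sentinel to a proper missing value, then count missing values per column
data = data.replace(99.99, np.nan)
data.isnull().sum()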

Showing Probability density function (PDF)

# A great step in the data exploration is to display the probability distribution function (PDF) of a variable
# The PDF will show us how that variable is distributed
# This makes it very easy to spot anomalies, such as outliers
# The PDF is often the basis on which we decide whether we want to transform a feature
sns.distplot(data_no_mv['xxx'])

Dealing with outliers

Outliers = observations that lie at an abnormal distance from the other observations in the data.

They will affect the regression dramatically and cause the coefficients to be inflated, as the regression will try to place the line closer to those values.

  • One simple way to deal with outliers is to remove the top 1% of observations.
    • DataFrame.quantile(0.99) method
# Declare a variable that will be equal to 
# the 99th percentile of the independent variable
q = data_no_mv['xxx'].quantile(0.99)

# Then we can create a new dataframe,
# with the condition that all xxx must be below the 99th percentile of 'xxx'
data_1 = data_no_mv[data_no_mv['xxx']<q]
# ^ you can replace q with a number for another purpose:
# e.g. all xxx must be below 6.5
# data_1 = data_no_mv[data_no_mv['xxx']<6.5]

# In this way we have essentially removed the top 1% of the data about 'xxx'
data_1.describe(include='all')

# now the PDF is with less outliers
sns.distplot(data_1['xxx'])

Then we repeat the step above to deal with the outliers of all the independent variables.

We need to reset the index at the end.

# When we remove observations, the original indexes are preserved
# If we remove observations with indexes 2 and 3, the indexes will go as: 0,1,4,5,6
# That's very problematic as we tend to forget about it (later you will see an example of such a problem)

# Finally, once we reset the index, a new column will be created containing the old index (just in case)
# We won't be needing it, thus 'drop=True' to completely forget about it
data_cleaned = data_4.reset_index(drop=True)

# Let's see what's left
data_cleaned.describe(include='all')

Checking OLS assumptions

The categorical variables will be included as dummies so we don’t need to check the assumptions for them.

f, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize =(15,3)) 
#sharey -> share 'y' as y
ax1.scatter(data_cleaned['x1'],data_cleaned['y'])
ax1.set_title('y and x1')
ax2.scatter(data_cleaned['x2'],data_cleaned['y'])
ax2.set_title('y and x2')
ax3.scatter(data_cleaned['x3'],data_cleaned['y'])
ax3.set_title('y and x3')

plt.show()

Note that the graphs will usually not look linear.

Just check the dependent variable again:

# From the subplots and the PDF of y, we can easily determine how 'y' is distributed
sns.distplot(data_cleaned['y'])

Let's say that this time 'y' is exponentially distributed, so we need to apply a log transformation.

Relaxing the assumption

# Let's transform 'y' with a log transformation
log_y = np.log(data_cleaned['y'])

# Then we add it to our data frame
data_cleaned['log_y'] = log_y
data_cleaned
# Let's check the three scatters once again
f, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize =(15,3))
ax1.scatter(data_cleaned['x1'],data_cleaned['log_y'])
ax1.set_title('Log y and x1')
ax2.scatter(data_cleaned['x2'],data_cleaned['log_y'])
ax2.set_title('Log y and x2')
ax3.scatter(data_cleaned['x3'],data_cleaned['log_y'])
ax3.set_title('Log y and x3')


plt.show()

# Now The relationships show a clear linear relationship
# This is some good linear regression material

# Alternatively we could have transformed each of the independent variables


# Since we will be using the log_y variable, we can drop the old 'y' one
data_cleaned = data_cleaned.drop(['y'],axis=1)

Check for Multicollinearity

We need to assume No multicollinearity.

data_cleaned.columns.values
  • One of the best ways to check for multicollinearity is through VIF (Variance Inflation Factor).
# sklearn does not have a built-in way to check for multicollinearity
# one of the main reasons is that this is an issue well covered in statistical frameworks and not in ML ones
# surely it is an issue nonetheless, thus we will try to deal with it

# Here's the relevant module
# full documentation: http://www.statsmodels.org/dev/_modules/statsmodels/stats/outliers_influence.html#variance_inflation_factor
from statsmodels.stats.outliers_influence import variance_inflation_factor

# To make this as easy as possible to use, we declare a variable where we put
# all features where we want to check for multicollinearity
# since our categorical data is not yet preprocessed, we will only take the numerical ones
variables = data_cleaned[['x1','x2','x3']]

# we create a new data frame which will include all the VIFs
# note that each variable has its own variance inflation factor as this measure is variable specific (not model specific)
vif = pd.DataFrame()

# here we make use of the variance_inflation_factor, which will basically output the respective VIFs
vif["VIF"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
# Finally, I like to include names so it is easier to explore the result
vif["Features"] = variables.columns

# Let's explore the result
vif

# Drop the variable with the highest VIF;
# doing so will also drive the VIFs of the other variables down
data_no_multicollinearity = data_cleaned.drop(['x1'],axis=1)
  • VIF = 1 means no multicollinearity
  • 1 < VIF < 5 is considered perfectly okay
  • VIF > 10 means the variable is too correlated with the other variables and needs to be removed

Create Dummy Variables Effectively

  • If we have N categories for a feature, we have to create N-1 dummies.
    • N dummies would introduce multicollinearity and drive the VIF to infinity.

pd.get_dummies(df, drop_first=True) spots all categorical variables and creates dummies automatically

# To include the categorical data in the regression, let's create dummies
# There is a very convenient method called 'get_dummies' which does that seamlessly
# It is extremely important that we drop one of the dummies, alternatively we will introduce multicollinearity
data_with_dummies = pd.get_dummies(data_no_multicollinearity, drop_first=True)
# ^ drop_first means only create N-1 dummies

To make our data frame more organized, we prefer to place the dependent variable at the beginning of the dataframe. Since each problem is different, this must be done manually. We can display all possible features using data_with_dummies.columns.values, then choose the desired order and store it in a variable cols.

cols = [<variable list you want>]
data_preprocessed = data_with_dummies[cols]
data_preprocessed.head()

Practical Linear Regression Model

Declare the inputs and targets

# The target (dependent variable) is 'log_y'
targets = data_preprocessed['log_y']

# The inputs are everything without the dependent variable, so we can simply drop it
inputs = data_preprocessed.drop(['log_y'],axis=1)

Scale the Data (Standardization) and Data Split

# Import the scaling module
from sklearn.preprocessing import StandardScaler

# Create a scaler object
scaler = StandardScaler()
# Fit the inputs (calculate the mean and standard deviation feature-wise)
scaler.fit(inputs)
# Scale the features and store them in a new variable (the actual scaling procedure)
inputs_scaled = scaler.transform(inputs)

Note that it is not usually recommended to standardize dummy variables.
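
A sketch of one way to scale only the continuous features while passing the dummies through untouched, using sklearn's ColumnTransformer; the column names 'x1' and 'x2' are placeholders for the continuous features, and this is an alternative to the plain StandardScaler used above.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Standardize only the continuous columns; leave the dummy columns as they are
ct = ColumnTransformer(
    [('scale', StandardScaler(), ['x1', 'x2'])],
    remainder='passthrough'
)
inputs_scaled = ct.fit_transform(inputs)

Keep in mind that ColumnTransformer outputs the transformed columns first, followed by the passthrough columns, so the column order changes.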

Then do a Train Test Split.

# Import the module for the split
from sklearn.model_selection import train_test_split

# Split the variables with an 80-20 split and some random state
x_train, x_test, y_train, y_test = train_test_split(inputs_scaled, targets, test_size=0.2, random_state=42)

Create Regression

# Create a linear regression object
reg = LinearRegression()
# Fit the regression with the scaled TRAIN inputs and targets
reg.fit(x_train,y_train)

# Let's check the outputs of the regression
# I'll store them in y_hat as this is the 'theoretical' name of the predictions
y_hat = reg.predict(x_train)

Scatter Plot Check

# The simplest way to compare the targets (y_train) and the predictions (y_hat) is to plot them on a scatter plot
# The closer the points to the 45-degree line, the better the prediction
plt.scatter(y_train, y_hat)
# Let's also name the axes
plt.xlabel('Targets (y_train)',size=18)
plt.ylabel('Predictions (y_hat)',size=18)
# Sometimes the plot will have different scales of the x-axis and the y-axis
# This is an issue as we won't be able to interpret the '45-degree line'
# We want the x-axis and the y-axis to be the same
plt.xlim(6,13)
plt.ylim(6,13)
plt.show()

Residual Plot Check

Residual = Differences between the target and predictions

# Another useful check of our model is a residual plot
# We can plot the PDF of the residuals and check for anomalies
sns.distplot(y_train - y_hat)

# Include a title
plt.title("Residuals PDF", size=18)

In the best-case scenario this plot should be normally distributed.

Given the definition of the residuals (y_train - y_hat), if there are many large negative residuals (far from the mean), it means the predictions (y_hat) are much higher than the targets (y_train), i.e. the model overestimates.

Find the R-squared of the model

The score represents the percentage of the total variability of the data that our model explains.

reg.score(x_train,y_train)

# Note that this is NOT the adjusted R-squared
# in other words... find the Adjusted R-squared to have the appropriate measure.

Finding the weights and bias

# Obtain the bias (intercept) of the regression
reg.intercept_

# Obtain the weights (coefficients) of the regression
reg.coef_

# Note that they are barely interpretable if at all
# Create a regression summary where we can compare them with one-another
reg_summary = pd.DataFrame(inputs.columns.values, columns=['Features'])
reg_summary['Weights'] = reg.coef_
reg_summary

The table will show positive weights and negative weights.

For continuous variables:

  • A positive weight shows that when the feature increases in value, log_y and 'y' also increase.
  • A negative weight shows that when the feature increases in value, log_y and 'y' decrease.

For dummy variables:

  • A positive weight shows that the respective category is more valuable than the benchmark
  • A negative weight shows that the respective category is less valuable than the benchmark
# Check the different categories in the 'x1' variable
data_cleaned['x1'].unique()
# Compare with the above reg_summary table features and
# we can see which 'x1' is actually the benchmark
# (The one feature which is dropped / not in the reg_summary table)

^Use data_cleaned['name_of_categorical_variable'].unique() and find which ones are missing to know the benchmark.

Testing

# Once we have trained and fine-tuned our model, we can proceed to testing it
# Testing is done on a dataset that the algorithm has never seen
# Luckily we have prepared such a dataset
# Our test inputs are 'x_test', while the outputs: 'y_test'
# We SHOULD NOT TRAIN THE MODEL ON THEM, we just feed them and find the predictions
# If the predictions are far off, we will know that our model overfitted
y_hat_test = reg.predict(x_test)

Create a scatter plot with the test targets and the test predictions

You can include the argument alpha, which introduces opacity to the graph and makes it read like a heatmap.

plt.scatter(y_test, y_hat_test, alpha=0.2)
plt.xlabel('Targets (y_test)',size=18)
plt.ylabel('Predictions (y_hat_test)',size=18)
plt.xlim(6,13)
plt.ylim(6,13)
plt.show()

Finally, let’s manually check these predictions

To obtain the actual y, we take the exponential of log_y.

We prefer to report y rather than its logarithm; since the exponential is the inverse of the natural log, exponentiating recovers y:

e^{\ln(y)} = y

\ln(e^{y}) = y

df_pf = pd.DataFrame(np.exp(y_hat_test), columns=['Prediction'])
df_pf.head()

We can also include the test targets in that data frame (so we can manually compare them)

df_pf['Target'] = np.exp(y_test)
df_pf

# Note that we have a lot of missing values
# There is no reason to have ANY missing values, though
# This suggests that something is wrong with the data frame / indexing

Remove old indexing

y_test

# After displaying y_test, we find what the issue is
# The old indexes are preserved (recall earlier in that code we made a note on that)
# The code was: data_cleaned = data_4.reset_index(drop=True)

# Therefore, to get a proper result, we must reset the index and drop the old indexing
y_test = y_test.reset_index(drop=True)

# Check the result
y_test.head()
# Let's overwrite the 'Target' column with the appropriate values
# Again, we need the exponential of the test log price
df_pf['Target'] = np.exp(y_test)
df_pf

Proceed to compare them using Residual

# Additionally, we can calculate the difference between the targets and the predictions
# Note that this is actually the residual (we already plotted the residuals)
df_pf['Residual'] = df_pf['Target'] - df_pf['Prediction']

# Since OLS is basically an algorithm which minimizes the total sum of squared errors (residuals),
# this comparison makes a lot of sense

Express the differences as percentages

# Finally, it makes sense to see how far off we are from the result percentage-wise
# Here, we take the absolute difference in %, so we can easily order the data frame
df_pf['Difference%'] = np.absolute(df_pf['Residual']/df_pf['Target']*100)
df_pf

# Exploring the descriptives here gives us additional insights
df_pf.describe()

Make the data more readable

# Sometimes it is useful to check these outputs manually
# To see all rows, we use the relevant pandas syntax
pd.options.display.max_rows = 999
# Moreover, to make the dataset clear, we can display the result with only 2 digits after the dot
pd.set_option('display.float_format', lambda x: '%.2f' % x)


# Finally, we sort by difference in % and manually check the model
df_pf.sort_values(by=['Difference%'])

To improve our model, we can:

  • Use a different set of variables
  • Remove a bigger part of the outliers
  • Use different kinds of transformation

Reference

The Data Science Course 2020: Complete Data Science Bootcamp