Linear Regression in Python
Analysis of variance (ANOVA)
SST (Sum of Squares Total)
Also known as TSS (Total Sum of Squares).
Measures the total variability of the dataset: $SST = \sum_i (y_i - \bar{y})^2$.
SSR (Sum of Squares Regression)
Also known as ESS (Explained Sum of squares)
It is the sum of the squared differences between the predicted values and the mean of the dependent variable: $SSR = \sum_i (\hat{y}_i - \bar{y})^2$.
Measures the variability explained by your regression line.
SSE (Sum of Squares Error)
Also known as RSS (Residual Sum of Squares) -> Remaining/Unexplained
Measures the variability left unexplained by the regression: $SSE = \sum_i (y_i - \hat{y}_i)^2$.
Lower error => better explanatory power
Connection: $SST = SSR + SSE$.
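As a quick illustration, here is a minimal NumPy sketch (with made-up sample arrays) showing how the three sums of squares relate for an OLS fit:

```python
import numpy as np

# made-up sample data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 4.5, 6.1, 8.0, 9.2])

b1, b0 = np.polyfit(x, y, 1)   # OLS slope and intercept
y_hat = b0 + b1 * x            # predictions from the fitted line

sst = np.sum((y - y.mean()) ** 2)       # total variability
ssr = np.sum((y_hat - y.mean()) ** 2)   # variability explained by the line
sse = np.sum((y - y_hat) ** 2)          # unexplained variability

print(sst, ssr + sse)   # equal (up to floating-point error) for an OLS fit with an intercept
print(ssr / sst)        # this ratio is the R-squared discussed below
```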
R-Squared
The R-squared ($R^2 = SSR / SST$) shows how much of the total variability of the dataset is explained by your regression model. This may be expressed as: how well your model fits your data.
It is incorrect to say your regression line fits the data, as the line is the geometrical representation of the regression equation. It is also incorrect to say the data fits the model or the regression line, as you are trying to explain the data with a model, not vice versa.
- It is a relative measure and takes values ranging from 0 to 1.
- $R^2 = 0$ means your model explains none of the variability of the data.
- $R^2 = 1$ means your model explains the entire variability of the data.
- In practice you will usually observe values ranging from 0.2 to 0.9.
- A good value depends on the complexity of the topic and how many variables are believed to be in play.
Adjusted R-Squared
A multiple regression is always at least as good as a simple one (in terms of R-squared), as with each additional variable the explanatory power can only increase or stay the same.
Adjusted R-squared is used to measure how well your model fits the data, but it penalizes the excessive use of variables.
- Adjusted R-squared ($\bar{R}^2$) is almost always smaller than R-squared ($R^2$).
- The statement is not true only in extreme cases of small sample sizes and a high number of independent variables.
- Adjusted R-Squared penalizes excessive use of variables.
If adding a variable increases the R-squared but decreases the Adjusted R-squared, it means that variable can be omitted, since it holds no predictive power.
F-Statistic
The F statistic follows an F distribution.
F-Test is used for testing the overall significance of the model.
The lower the F-statistic, the closer we are to a non-significant model.
The F-test:
$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$ — if all betas are 0, then none of the independent variables matter and our model has no merit.
Linear Regression Assumptions
We should consider them before performing linear regression analysis.
- Linearity
- No endogeneity of regressor
- Normality and homoscedasticity
- No autocorrelation
- No multicollinearity
We should not violate the assumptions.
If a regression assumption is violated, performing regression analysis will yield an incorrect result.
Linearity
If the data points form a pattern that looks like a straight line then a linear regression model is suitable.
No endogeneity of regressor
Omitted Variables Bias happens when you forget to include a relevant variable. This is reflected in the error term as the factor you forgot about is included in the error. In this way, the error is not random but includes a systematic part (the omitted variable).
Normality and homoscedasticity
Normality
We assume the error term is normally distributed.
T-tests and F-tests work because we have assumed normality of the error term.
Zero mean
If the expected value of the error is not 0, then the line is not the best-fitting one.
Homoscedasticity
Homoscedasticity means equal variance: the error terms should have the same variance across observations.
To prevent heteroscedasticity:
- Look for Omitted Variables Bias
- Look for outliers
- Apply Transformation
- Log Transformation
To apply a log transformation (a code sketch follows this list):
- Take the natural log of the variable.
- Then create a regression between the log of Y and the independent Xs.
- Semi-log model: $\log(\hat{y}) = b_0 + b_1 x_1$
- Or conversely, create a regression between Y and the log of the independent Xs.
- Semi-log model: $\hat{y} = b_0 + b_1 \log(x_1)$
- Or sometimes we need to change both scales to log.
- Log-log model: $\log(\hat{y}) = b_0 + b_1 \log(x_1)$
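A minimal sketch of those transformations with NumPy and statsmodels, assuming a DataFrame `data` with hypothetical, strictly positive columns 'y' and 'x1':

```python
import numpy as np
import statsmodels.api as sm

# 'data', 'y' and 'x1' are hypothetical names; both columns must be positive to take logs
log_y = np.log(data['y'])
log_x1 = np.log(data['x1'])

semi_log = sm.OLS(log_y, sm.add_constant(data['x1'])).fit()   # log(y) = b0 + b1*x1
log_log = sm.OLS(log_y, sm.add_constant(log_x1)).fit()        # log(y) = b0 + b1*log(x1)
```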
No autocorrelation
Also known as No serial correlation.
The errors are assumed to be uncorrelated with one another.
Autocorrelation is not likely to be observed in cross-sectional data; you usually spot it in time series data, which is a subset of panel data. Whether the data is a sample is not what matters for this question.
To Detect autocorrelation:
- plot all the residuals on a graph and look for patterns
- If there are no patterns to be seen => no autocorrelation
- Durbin-Watson test (the value falls between 0 and 4); see the sketch after this list
- 2 => no autocorrelation
- <1 or >3 causes an alarm
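For instance, statsmodels can compute the Durbin-Watson statistic from the residuals of a fitted model (a sketch, assuming `results` is a fitted OLS results object like the ones in the examples further below):

```python
from statsmodels.stats.stattools import durbin_watson

# 'results' is assumed to be a fitted statsmodels OLS results object
dw = durbin_watson(results.resid)
print(dw)   # ~2 => no autocorrelation; values below 1 or above 3 cause an alarm
```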
To fix autocorrelation, use alternative models:
- Autoregressive model
- Moving average model
- Autoregressive moving average model
- Autoregressive integrated moving average model
No multicollinearity
We observe multicollinearity when two or more variables have a high correlation.
Prevention:
- Find the correlation between each pair of independent variables (see the sketch after this list)
To fix multicollinearity:
- Drop one of the two variables
- Transform them into one
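A quick way to run that pairwise check with pandas (a sketch, assuming a hypothetical DataFrame `data` whose columns are the independent variables):

```python
# 'data' is a hypothetical DataFrame holding only the independent variables;
# values close to -1 or 1 signal multicollinearity
data.corr()
```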
Dummy Variable
Dummy Variable is an imitation or a copy that stands as a substitute in regression analysis.
A dummy is a variable that is used to include categorical data into a regression model.
Usually we imitate the categories with numbers (e.g. Yes = 1, No = 0).
Dummy Variable In Python
Using the pandas library, we can back up the dataset and map the values.
```python
data = raw_data.copy()
```
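A fuller sketch of that mapping, assuming the categorical column is called 'Attendance' (a hypothetical name) holding 'Yes'/'No' values:

```python
data = raw_data.copy()   # work on a copy so the raw data stays intact
# 'Attendance' is a hypothetical column name holding 'Yes'/'No' values
data['Attendance'] = data['Attendance'].map({'Yes': 1, 'No': 0})
```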
Regression Analysis
Regression analysis is one of the most common methods of prediction; it is used whenever we have a causal relationship between variables.
Fundamentals of regression analysis are used in supervised machine learning.
Linear Regression Model
A linear regression is a linear approximation of a causal relationship between two or more variables.
It is one of the most common ways to make inferences and predictions.
Process:
- You get sample data
- design a model that explains the data
- then make predictions for the whole population based on the model you’ve developed
Simple Linear Regression Equation
The simple linear regression equation goes like this:

$\hat{y} = b_0 + b_1 x_1$

Variables explained:
- $\hat{y}$ — the estimated / predicted value (dependent variable)
- $b_0$ — a constant (the intercept)
- $b_1$ — the slope; quantifies the effect of the independent variable $x_1$ on the dependent variable $\hat{y}$
- $x_1$ — the sample data for the independent variable (observed values)
Relation: $x_1$ affects $\hat{y}$.
Correlation vs Regression
Correlation does not imply causation
Correlation | Regression |
---|---|
A relationship: the variables move with each other | One variable affects the other |
Movement together | Cause and effect |
Symmetric: $\rho(x, y) = \rho(y, x)$ | One way only |
A single point | A line |
Linear Regression In Python (with StatsModels)
We mainly use pandas to create the data frame.
Import the relevant libraries
```python
import numpy as np
import pandas as pd
```
Load the data from .csv file
```python
# Load the data from a .csv in the same folder ('data.csv' is a placeholder filename)
data = pd.read_csv('data.csv')
```
We can use `.describe(include='all')` to include the categorical variables as well.
Declare the dependent and the independent variables
```python
y = data['y']     # Following the regression equation, our dependent variable (y); 'y'/'x1' are placeholder column names
x1 = data['x1']   # ... and our independent variable (x1)
```
Explore the data first.
```python
# Plot a scatter plot (first we put the horizontal axis, then the vertical axis)
plt.scatter(x1, y)
```
Do a Regression
Now we find the regression line to use for prediction.
```python
# Add a constant. Essentially, we are adding a new column (equal in length to x), which consists only of 1s
```
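The full regression step might look like this (a sketch, reusing the variable names from above):

```python
import statsmodels.api as sm

x = sm.add_constant(x1)        # add a column of 1s so the model has an intercept
results = sm.OLS(y, x).fit()   # ordinary least squares regression of y on x
results.summary()              # model summary, coefficient table and additional tests
```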
^ `results.summary()` contains a few tables: the model summary, the coefficient table and some additional tests.
- The coefficient table shows the coefficient of x1 and the coefficient of the constant (the intercept), which are used in the code below, stored as `coef_of_x1` and `coef_of_const` as an example.
- The standard error (std err) shows the accuracy of the prediction; the lower the standard error, the better the estimate.
- The p-value determines whether the variable is significant: a p-value < 0.05 means the variable is significant, i.e. its coefficient is most probably different from zero.
```python
plt.scatter(x1,y)
```
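A sketch of the full plot, drawing the fitted line from the coefficients reported in the summary table (stored here under the hypothetical names `coef_of_x1` and `coef_of_const`, as mentioned above):

```python
import matplotlib.pyplot as plt

plt.scatter(x1, y)                       # the observed data
yhat = coef_of_const + coef_of_x1 * x1   # the fitted regression line
plt.plot(x1, yhat, lw=2, c='orange', label='regression line')
plt.xlabel('x1')
plt.ylabel('y')
plt.show()
```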
Do a Prediction
Let's say we have some new records called `new_data`.
We can use the `predict` method to obtain the predictions.
```python
predictions = results.predict(new_data)
```
Then show them again as a DataFrame.
```python
predictionsdf = pd.DataFrame({'Predictions':predictions})
```
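One caveat worth noting: since the model was fitted on `x = sm.add_constant(x1)`, the new records also need a constant column before predicting. A sketch:

```python
import pandas as pd
import statsmodels.api as sm

new_data_const = sm.add_constant(new_data)      # match the design matrix used for fitting
predictions = results.predict(new_data_const)
pd.DataFrame({'Predictions': predictions})
```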
Linear Regression In Python (with Sklearn)
Scikit-learn is built on NumPy, SciPy and Matplotlib.
We need to translate our data into ndarrays using NumPy and then feed them to the algorithm.
Import the relevant libraries
```python
# For these lessons we will need NumPy, pandas, matplotlib and seaborn
```
Load the data from .csv file
```python
# Load the data from a .csv in the same folder
```
Declare the dependent and the independent variables
- Our dependent variable (y) is called target or output
- Our independent variable (x) is called feature or input.
^ This is supervised learning.
```python
# Our dependent variable (y) is called target or output
```
In order to feed x to sklearn, it should be a 2D array (a matrix).
Therefore, we must reshape it since we only get 1 feature in this case…
```python
# Note that this will not be needed when we've got more than 1 feature (as the inputs will be a 2D array by default)
```
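The reshape itself might look like this (a sketch, assuming `x` is the pandas Series of inputs declared above):

```python
# Turn the 1D series of inputs into a 2D array with a single column (a matrix)
x_matrix = x.values.reshape(-1, 1)
x_matrix.shape   # (number of observations, 1)
```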
Regression itself
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
```python
reg = LinearRegression()
reg.fit(x_matrix, y)   # fit the regression; the displayed result shows the default parameters
```
^ Above code will give an output like this:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
- normalize: subtracts the mean and divides by the L2-norm of the inputs
- copy_X: copies the inputs before fitting them; usually set to True
- fit_intercept: adds a constant (the intercept)
- n_jobs: decides how many CPUs to use
Note:
https://en.wikipedia.org/wiki/Feature_scaling
Standardization is the process of subtracting the mean and dividing by the standard deviation; it is a type of normalization, but here the terminology differs: in this case, normalization means subtracting the mean and dividing by the L2-norm of the inputs.
`fit_intercept=True` is the same as `x = sm.add_constant(x1)` in the statsmodels example, but sklearn does it automatically.
Find R-squared
```python
reg.score(x_matrix,y)
```
Find Coefficients
```python
reg.coef_
```
Find Intercept
```python
reg.intercept_
```
Making Predictions
Let's say we have a new data set of inputs.
Create the fake dataset using a DataFrame:
```python
new_data = pd.DataFrame(data=[1111,1101], columns=['xxx'])
```
Now Making Predictions
```python
reg.predict(new_data)
```
To show in a table:
```python
new_data['Predicted_yyy'] = reg.predict(new_data)
```
Plot the Regression
Same as previous example.
```python
plt.scatter(x,y)
```
Multiple Linear Regression Model
We prefer a multiple linear regression model to a simple linear regression model because it is more realistic: things often depend on many factors, not just one.
Multiple Linear Regression Equation
The multiple linear regression equation goes like this:

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$

Variables explained:
- $\hat{y}$ — the estimated / predicted value (dependent variable)
- $b_0$ — a constant (the intercept)
- $b_1$ to $b_k$ — the coefficients/slopes; each quantifies the effect of its independent variable on the dependent variable
- $x_1$ to $x_k$ — the independent variables (observed values)
Relation: $x_1, \dots, x_k$ affect $\hat{y}$.
With more than one independent variable the model stops being two-dimensional, and beyond three dimensions there is no visual way to represent the data.
It's about finding the best-fitting model (the lowest SSE, for OLS).
Multiple Linear Regression In Python (with StatsModels)
Import the relevant libraries
```python
import numpy as np
```
Load the data from .csv file
```python
# Load the data from a .csv in the same folder
```
Declare the dependent and the independent variables
```python
# Following the regression equation, our dependent variable (y)
```
To get a better understanding, run this code to inspect x1:
```python
x1
```
Do a Regression
Now we find the regression line to use for prediction, just like in the simple case.
```python
# Add a constant. Essentially, we are adding a new column (equal in length to x), which consists only of 1s
```
^ `results.summary()` contains a few tables: the model summary, the coefficient table and some additional tests.
- The coefficient table shows the coefficient of each independent variable and the coefficient of the constant (the intercept), stored in the code below as `coef_of_x1` and `coef_of_const` as an example.
- The standard error (std err) shows the accuracy of the prediction; the lower the standard error, the better the estimate.
- The p-value determines whether the variable is significant: a p-value < 0.05 means the variable is significant, i.e. its coefficient is most probably different from zero.
- With the p-values, we can tell which variables to remove from the equation.
Multiple Regression In Python (with Sklearn)
Import the relevant libraries
```python
# For these lessons we will need NumPy, pandas, matplotlib and seaborn
```
Load the data from .csv file
```python
# Load the data from a .csv in the same folder
```
Declare the dependent and the independent variables
- Our dependent variable (y) is called target or output
- Our independent variable (x) is called feature or input.
^ This is supervised learning.
```python
# Our dependent variable (y) is called target or output
```
This time x is already a 2D array (a matrix), so we do not need to reshape it, since we have 2 features in this case.
Standardization (Feature Scaling)
Standardization is the process of subtracting the mean and dividing by the standard deviation
```python
from sklearn.preprocessing import StandardScaler
```

```python
scaler = StandardScaler()   # declare a standard scaler object
scaler.fit(x)               # calculate the mean and standard deviation of each feature
```
The output will be something like: `StandardScaler(copy=True, with_mean=True, with_std=True)`
Then we can get the scaled data using `scaler.transform(x)`:
```python
x_scaled = scaler.transform(x)   # the scaled data ('x_scaled' is an example name)
```
Note: this will affect the coefficients.
When we perform feature scaling, we don’t care if a useless variable is there or not.
Regression itself
Same as the simple one.
```python
reg = LinearRegression()
```
Finding Coefficients and Intercept
```python
reg.coef_        # the coefficients (one per feature)
reg.intercept_   # the intercept
```
Find R-squared
Note: R-squared is a universal measure for evaluating and comparing how well linear regressions fare.
```python
reg.score(x,y)
```
Find Adjusted R-squared
Adjusted R-squared is used to measure how well your model fits the data, but it penalizes the excessive use of variables:

$\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$

where $n$ is the number of observations (samples) and $p$ is the number of predictors (features).
You can obtain $n$ and $p$ by checking `x.shape`.
```python
r2 = reg.score(x,y)
```
You can also make it as a function:
```python
def find_adj_r2(x,y):
```
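A minimal version of that helper (assuming `reg` is the already-fitted LinearRegression object and `x` is the 2D feature matrix):

```python
def find_adj_r2(x, y):
    # reg is assumed to be an already-fitted LinearRegression object
    r2 = reg.score(x, y)
    n = x.shape[0]   # number of observations
    p = x.shape[1]   # number of predictors
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

find_adj_r2(x, y)
```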
Feature Selection with F-regression
- Used to detect the variables which are unneeded in the model.
- It simplifies models, which makes them much easier for data scientists to interpret.
- Through this process we gain improved speed and often prevent a series of other unwanted issues arising from having too many features.
- In statsmodels we used the p-value of each x to determine whether the independent variables were relevant for the model; in sklearn we use F-regression to find the p-value of each x from its F-statistic.
```python
from sklearn.feature_selection import f_regression
```
^ If the p-value > 0.05, the independent variable x is redundant.
We use p-values to determine whether a variable is redundant.
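A sketch of that check; `f_regression` runs a simple linear regression of y on each feature separately and returns the F-statistics with their p-values:

```python
from sklearn.feature_selection import f_regression

f_statistics, p_values = f_regression(x, y)
p_values.round(3)   # one p-value per feature; large values flag candidates for removal
```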
Note:
F-regression does not take the interrelation of the features into account.
Creating a summary table
Note:
- The intercept is also called the bias
- The coefficients are also called weights
```python
# Let's create a new data frame with the names of the features
```

```python
# Then we create and fill a second column,
```
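A sketch of such a summary table (assuming `x` is a DataFrame of features, `reg` the fitted model, and `p_values` the array from `f_regression` above):

```python
import pandas as pd

reg_summary = pd.DataFrame(data=x.columns.values, columns=['Features'])
reg_summary['Coefficients'] = reg.coef_       # the weights
reg_summary['p-values'] = p_values.round(3)   # from f_regression
reg_summary
```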
Now you have a good summary table to see which variables are redundant.
Template: “It seems that ‘xxx’ is not even significant, therefore we should remove it from the model.”
Practical Linear Regression Concepts
Splitting Train Data and Test Data in Python
Import the relevant libraries
```python
import numpy as np
from sklearn.model_selection import train_test_split
```
Generate some data
```python
a = np.arange(1,101) # generate an array from number 1 to 100
```
Split the data
We use the method `train_test_split()` to split arrays or matrices into random train and test subsets.
The first returned array is the training set and the second is the test set.
```python
a_train, a_test = train_test_split(a, test_size=0.2, shuffle=True)
```
Note that when the data is shuffled, each time we split it we get different training and testing datasets, which will affect the results a bit.
To avoid getting a different shuffle each time, we use `random_state` to make the split reproducible.
```python
a_train, a_test = train_test_split(a, test_size=0.2, random_state=42)
```
^ Now we get exactly the same shuffled split every time.
Note: we can also split more than one array at a time.
```python
# The 2 arrays will be split into 4
```
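For example, splitting two arrays at once (a sketch with a second hypothetical array `b` of the same length as `a`):

```python
from sklearn.model_selection import train_test_split
import numpy as np

b = np.arange(501, 601)   # hypothetical second array, same length as a

# splitting two arrays of equal length yields four outputs, in this order
a_train, a_test, b_train, b_test = train_test_split(a, b, test_size=0.2, random_state=42)
```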
Preprocessing Data
Check the descriptive statistics of all variables
```python
# To include the categorical ones, you should specify this with an argument
data.describe(include='all')   # 'data' stands for the loaded DataFrame
```
Determining the variables of interest
Sometimes we will drop some of the variables from the data because they are not useful.
```python
# DataFrame.drop(labels, axis=1) returns a new object with the indicated columns dropped
```
Dealing with missing values
```python
data.isnull().sum()   # check for missing values in each column
```
A rule of thumb :
if you are removing < 5% of the observations you are free to just remove all observations that have missing values.
```python
data_no_mv = data.dropna(axis=0)   # drop records with missing values ('data_no_mv' is an example name)
```
Actually a common way to label missing values is by assigning 99.99. Be aware of that.
(It is a bad idea to label values in such ways as it is very hard for other users of the data to distinguish
them from the true values.) But some people still do it.
Showing Probability density function (PDF)
```python
# A great step in the data exploration is to display the probability distribution function (PDF) of a variable
```
Dealing with outliers
Outliers = observations that lie on abnormal distance from other observations in the data.
They will affect the regression dramatically and cause coefficients to be inflated, as the regression will try to place the line closer to those values.
- One way to deal with outliers seamlessly is to remove the top 1% of observations, using the `DataFrame.quantile(0.99)` method.
```python
# Declare a variable that will be equal to the 99th percentile of the variable of interest
```
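A sketch of that step (with a hypothetical column 'y' holding the variable whose top 1% we want to drop; `data_no_mv` is the frame without missing values from above):

```python
# 'y' is a hypothetical column name; keep only observations below its 99th percentile
q = data_no_mv['y'].quantile(0.99)
data_1 = data_no_mv[data_no_mv['y'] < q]   # 'data_1' is an example name
```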
Then we repeat the step above to deal with the outliers of the independent variables.
We need to reset the index at the end.
```python
# When we remove observations, the original indexes are preserved, so we reset them
data_cleaned = data_1.reset_index(drop=True)   # 'data_1' is the last filtered frame from above (example name)
```
Checking OLS assumptions
The categorical variables will be included as dummies so we don’t need to check the assumptions for them.
```python
f, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize=(15,3))
```
Note the graph usually will not be linear.
Just check the dependent variable again:
```python
# From the subplots and the PDF of y, we can easily determine how 'y' is distributed
```
Let's say this time 'y' is exponentially distributed; then we need to apply a log transformation.
Relaxing the assumption
```python
# Let's transform 'y' with a log transformation
data_cleaned['log_y'] = np.log(data_cleaned['y'])   # 'y' is a placeholder column name
```

```python
# Let's check the three scatters once again
```
Check for Multicollinearity
We need to assume No multicollinearity.
```python
data_cleaned.columns.values
```
- One of the best ways to check for multicollinearity is through VIF (Variance Inflation Factor).
```python
# sklearn does not have a built-in way to check for multicollinearity
```
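statsmodels does provide one, though. A sketch, assuming the continuous independent variables sit in hypothetical columns 'x1', 'x2' and 'x3' of `data_cleaned`:

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd

# 'x1', 'x2', 'x3' are hypothetical column names of the continuous features
variables = data_cleaned[['x1', 'x2', 'x3']]
vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
vif['Features'] = variables.columns
vif
```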
- VIF = 1 means no multicollinearity
- 1 < VIF < 5 is considered perfectly okay
- VIF > 10 means the variable is too correlated with the other variables and needs to be removed
Create Dummy Variables Effectively
- If we have N categories for a feature, we have to create N-1 dummies.
- This is because N dummies would introduce perfect multicollinearity, causing the VIF to go to infinity.
`pd.get_dummies(df, drop_first=True)` spots all categorical variables and creates dummies automatically.
```python
# To include the categorical data in the regression, let's create dummies
```
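A sketch of that call (assuming the cleaned DataFrame is called `data_cleaned`, as above):

```python
import pandas as pd

# drop_first=True creates N-1 dummies per categorical feature, avoiding the multicollinearity trap
data_with_dummies = pd.get_dummies(data_cleaned, drop_first=True)
data_with_dummies.head()
```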
To make our data frame more organized, we prefer to place the dependent variable at the beginning of the dataframe. Since each problem is different, that must be done manually. We can display all possible features using `data_with_dummies.columns.values`, and then choose the desired order and store it in a variable `cols`.
```python
cols = [<variable list you want>]
```
```python
data_preprocessed = data_with_dummies[cols]
```
Practical Linear Regression Model
Declare the inputs and targets
```python
targets = data_preprocessed['log_y']                # the target (dependent variable) is 'log_y' (placeholder name)
inputs = data_preprocessed.drop(['log_y'], axis=1)  # the inputs are everything else
```
Scale the Data (Standardization) and Data Split
```python
# Import the scaling module
```

```python
# Scale the features and store them in a new variable (the actual scaling procedure)
```
Note that it is not usually recommended to standardize dummy variables.
Then do a Train Test Split.
```python
# Import the module for the split
```
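Putting the scaling and the split together (a sketch; `inputs` and `targets` are the names declared above, and the seed value is arbitrary):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

scaler = StandardScaler()
scaler.fit(inputs)                         # learn the mean and std of each feature
inputs_scaled = scaler.transform(inputs)   # the actual scaling procedure

x_train, x_test, y_train, y_test = train_test_split(inputs_scaled, targets,
                                                    test_size=0.2, random_state=365)
```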
Create Regression
```python
# Create a linear regression object
```
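A sketch of the fit and of the training predictions used in the checks below:

```python
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(x_train, y_train)      # train on the training data only
y_hat = reg.predict(x_train)   # predictions on the training inputs, used in the plots below
```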
Scatter Plot Check
```python
# The simplest way to compare the targets (y_train) and the predictions (y_hat) is to plot them on a scatter plot
plt.scatter(y_train, y_hat)
```
Residual Plot Check
Residuals = the differences between the targets and the predictions.
```python
# Another useful check of our model is a residual plot: the distribution of (y_train - y_hat)
plt.hist(y_train - y_hat, bins=50)
```
In the best-case scenario this plot should be normally distributed.
Given the definition of the residuals (y_train - y_hat), if there are many large negative residuals (far away from the mean), it means that y_hat (the predictions) is much higher than y_train (the targets), i.e. the model overestimates.
Find the R-squared of the model
The score represents the percentage of the variability of the data that our model explains.
```python
reg.score(x_train,y_train)
```
Finding the weights and bias
```python
# Obtain the bias (intercept) of the regression
reg.intercept_
```

```python
# Create a regression summary where we can compare the weights with one another
reg_summary = pd.DataFrame({'Features': inputs.columns.values, 'Weights': reg.coef_})   # example names
```
The table will show positive weights and negative weights.
For continuous variables:
- A positive weight shows that when the feature increases in value, log_y and 'y' also increase.
- A negative weight shows that when the feature increases in value, log_y and 'y' decrease.
For dummy variables:
- A positive weight shows that respective category is more valuable than the benchmark
- A negative weight shows that respective category is less valuable than the benchmark
```python
# Check the different categories in the 'x1' variable
```
^ Use `data_cleaned['name_of_categorical_variable'].unique()` and find which categories are missing from the dummies to know the benchmark.
Testing
```python
# Once we have trained and fine-tuned our model, we can proceed to testing it
y_hat_test = reg.predict(x_test)   # predictions on the (scaled) test inputs
```
Create a scatter plot with the test targets and the test predictions
You can include the argument `alpha`, which will introduce opacity to the graph, making it look like a heatmap.
```python
plt.scatter(y_test, y_hat_test, alpha=0.2)
```
Finally, let's manually check these predictions.
To obtain the actual y, we take the exponential of log_y, since the exponential is the inverse of the log.
Normally we'd prefer the y values, not their logarithms.
```python
df_pf = pd.DataFrame(np.exp(y_hat_test), columns=['Prediction'])
```
We can also include the test targets in that data frame (so we can manually compare them)
```python
df_pf['Target'] = np.exp(y_test)
```
Remove old indexing
```python
y_test
```

```python
# Let's overwrite the 'Target' column with the appropriate values
df_pf['Target'] = np.exp(y_test.reset_index(drop=True))   # reset the old index so the rows align (assumes y_test is a Series)
```
Proceed to compare them using the residuals.
```python
# Additionally, we can calculate the difference between the targets and the predictions
df_pf['Residual'] = df_pf['Target'] - df_pf['Prediction']
```
Express the differences as percentages to see how far off we are:
```python
# Finally, it makes sense to see how far off we are from the result percentage-wise
df_pf['Difference%'] = np.absolute(df_pf['Residual'] / df_pf['Target'] * 100)
```
Make the data more readable
```python
# Sometimes it is useful to check these outputs manually, e.g. by sorting on the percentage difference
df_pf.sort_values(by=['Difference%'])
```
To improve our model, we can:
- Use a different set of variables
- Remove a bigger part of the outliers
- Use different kinds of transformation
Reference
The Data Science Course 2020: Complete Data Science Bootcamp