Feature Selection and Dimensionality Reduction
In a machine learning problem, many different features may have been proposed.
- How many of them should we include during model training?
- Generally, we want to use fewer features.
- More features in model training = more memory used by the program
- More features in model training = potentially longer model training time (higher computational complexity)
- More features do not necessarily produce better performance
- Because it is possible to have irrelevant and unimportant features
Feature Selection
The Purpose of Feature Selection is to:
- Improve prediction performance
- Provide faster and more cost-effective predictors
- Provide a better understanding of the underlying process that generated the data
- Thus easier to interpret the model
- Identify important features and exclude insignificant features
Filter-based Methods
Statistical approach
- Evaluate the relationship between each feature and the “label” using statistical measures
- E.g., information gain (Entropy), correlation coefficient scores (Correlation), Chi-squared test
- Select the features that have the strongest relationship with the “label”
- Features are ranked independently of the ML method used
- Do not need to specify the Machine Learning Model
Information gain: Entropy
The idea is to choose the feature that provides the most information gain.
- First, if a feature takes continuous values, compute the average of its values over all classes and apply it as a threshold to categorize the values.
- Compute the entropy for each feature:
$$H = -\sum_{i} p_i \log_2 p_i$$
where $p_i$ is the proportion of samples belonging to class $i$.
- Pick the lowest entropy (highest information gain)
- A large entropy value means more uncertainty (the class distribution is more mixed, i.e. closer to uniform)
- So we need to pick the feature with the biggest reduction in uncertainty (lowest entropy)
```python
import math
```
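For example, a minimal sketch of the entropy computation, where the `entropy` helper and the class counts are illustrative:

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Illustrative example: after thresholding a feature at its mean, one branch
# contains 9 positive and 5 negative samples.
print(entropy([9, 5]))   # ~0.94 bits: still quite uncertain
print(entropy([14, 0]))  # 0.0 bits: a pure split, i.e. the most informative case
```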
Drawback:
- Assumes that features are independent of one another (which is not a good assumption)
- Because some features might be closely related
Feature Importance: Hypothesis testing with Chi-square
The idea is to evaluate the likelihood of correlation between variables (e.g. between a feature and the label).
- Statistical approach
Use hypothesis testing with the chi-square statistic:
- Null hypothesis $H_0$: X and Y are statistically independent of each other
A hypothesis is an idea that can be tested.
- The null hypothesis $H_0$ is the hypothesis to be tested, the idea you want to reject.
- $H_0$ is assumed true until rejected (innocent until proven guilty)
- $H_1$ is the alternative hypothesis (everything else except $H_0$): X and Y are statistically dependent on each other
Test whether two variables X and Y are statistically dependent on each other:

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

Where
- $df$: degrees of freedom, where $df = (\text{number of rows} - 1) \times (\text{number of columns} - 1)$
- The degrees of freedom determine the threshold of the shaded region
- $O_i$: observed value (the given data)
- $E_i$: expected value
- $i$: index of our data
- $\alpha$: the top percentage of shaded area in the whole distribution (5% in this case)
The shaded zone is the rejection region. If the computed $\chi^2$ falls in the shaded zone, we reject the null hypothesis (X and Y are statistically independent of each other) and conclude that X and Y are statistically dependent on each other. This means we can use the feature to make predictions on our label: they share some kind of similarity.
If the computed $\chi^2$ falls within the blank part, we cannot reject the null hypothesis (X and Y are statistically independent of each other). The label and the feature are fairly independent of each other (they do not have high similarity), which means we cannot use the feature to predict the label.
Also note that the chi-squared distribution only takes non-negative values. Its graph is asymmetric and skewed to the right.
A small code snippet is available to use:
```python
from sklearn.feature_selection import SelectKBest
```
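A sketch of how `SelectKBest` with the chi-square score function can be used; the toy data and `k=2` are illustrative choices:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy data: 6 samples, 3 non-negative features (chi2 requires non-negative values).
X = np.array([[1, 0, 3],
              [2, 1, 3],
              [0, 0, 2],
              [3, 2, 3],
              [0, 1, 2],
              [2, 2, 3]])
y = np.array([1, 1, 0, 1, 0, 1])

# Keep the k=2 features with the strongest chi-square relationship to the label.
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)        # chi-square score per feature
print(selector.get_support())  # boolean mask of the selected features
```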
An example of manual calculation, which also provides a clearer explanation:
- Determine whether there is any relationship between “gender” (a feature) and “customer who closes the bank account” (the label).
- Our $H_0$: “gender” and “customer who closes the bank account” are independent
- Our $H_1$: “gender” and “customer who closes the bank account” are dependent
- Degrees of freedom = (number of rows - 1) × (number of columns - 1) = (2 - 1) × (2 - 1) = 1
- The degrees of freedom determine the threshold of the shaded region
- Observed values are the given data (e.g. Yes and Male, Yes and Female, No and Male, No and Female)
- Expected values are the predicted probability times the total number of records (e.g. P(Yes and Male) × Total, P(Yes and Female) × Total, P(No and Male) × Total, P(No and Female) × Total)
- The probability used for the expected value assumes independence, e.g. P(Yes and Male) = P(Yes) × P(Male)
Note that if the feature and the label are fully independent, the observed value $O_i$ will equal the expected value $E_i$.
In other words, $\chi^2$ will be equal to 0 if the feature and the label are fully independent. So the larger the $\chi^2$ value, the more dependent the feature and the label are.
Finding the expected value of (Yes and Male), using the given marginal counts:
$$E_{\text{Yes, Male}} = P(\text{Yes}) \times P(\text{Male}) \times \text{Total}$$
Here the result is rounded off because the unit of the expected value is number of persons.
Finding the expected value of (Yes and Female):
$$E_{\text{Yes, Female}} = P(\text{Yes}) \times P(\text{Female}) \times \text{Total}$$
Finding the expected value of (No and Male):
$$E_{\text{No, Male}} = P(\text{No}) \times P(\text{Male}) \times \text{Total}$$
Finding the expected value of (No and Female):
$$E_{\text{No, Female}} = P(\text{No}) \times P(\text{Female}) \times \text{Total}$$
Plugging the variables into the formula:
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} = 2.22$$
How do we find the threshold of the shaded region using the degrees of freedom?
- df is the degrees of freedom
- The value of $\alpha$ = 5%, i.e. p = 0.05, so with df = 1 the threshold from the chi-square table is 3.84 (see the sketch below for looking this up programmatically)
- Since 2.22 < 3.84, we cannot reject $H_0$, so we cannot assume they are dependent; there is no strong relationship
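A minimal sketch with SciPy for looking up the threshold, assuming df = 1 and α = 0.05 as in the example above:

```python
from scipy.stats import chi2

alpha = 0.05  # significance level: the top 5% shaded area
df = 1        # (number of rows - 1) x (number of columns - 1) = (2 - 1) x (2 - 1)

threshold = chi2.ppf(1 - alpha, df)  # critical value that starts the rejection region
print(round(threshold, 2))           # 3.84

chi_square_stat = 2.22               # value computed manually above
print(chi_square_stat > threshold)   # False -> cannot reject H0
```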
For each feature, we can carry out a chi-square test to see which features are important (have a strong relationship with the label).
An Example
Determine the more important feature (Outlook vs Wind).
First, construct the contingency table for Outlook (the first feature).

Observed values:

| | Go hiking = Yes | Go hiking = No | Total |
| --- | --- | --- | --- |
| Sunny | 2 | 3 | 5 |
| Overcast | 4 | 0 | 4 |
| Rain | 3 | 2 | 5 |
| Total | 9 | 5 | 14 |

Expected values (row total × column total ÷ grand total):

| | Go hiking = Yes | Go hiking = No |
| --- | --- | --- |
| Sunny | 5 × 9/14 ≈ 3.21 | 5 × 5/14 ≈ 1.79 |
| Overcast | 4 × 9/14 ≈ 2.57 | 4 × 5/14 ≈ 1.43 |
| Rain | 5 × 9/14 ≈ 3.21 | 5 × 5/14 ≈ 1.79 |
Then construct the contingency table for Wind (the second feature).

Observed values:

| | Go hiking = Yes | Go hiking = No | Total |
| --- | --- | --- | --- |
| Strong | 3 | 3 | 6 |
| Weak | 6 | 2 | 8 |
| Total | 9 | 5 | 14 |

Expected values (row total × column total ÷ grand total):

| | Go hiking = Yes | Go hiking = No |
| --- | --- | --- |
| Strong | 6 × 9/14 ≈ 3.86 | 6 × 5/14 ≈ 2.14 |
| Weak | 8 × 9/14 ≈ 5.14 | 8 × 5/14 ≈ 2.86 |
Plugging the variables into the formula:

$$\chi^2_{\text{Outlook}} = \sum_i \frac{(O_i - E_i)^2}{E_i} \approx 3.55, \qquad \chi^2_{\text{Wind}} = \sum_i \frac{(O_i - E_i)^2}{E_i} \approx 0.93$$

Note: degrees of freedom = (number of rows - 1) × (number of columns - 1), so df = 2 for Outlook and df = 1 for Wind.
Since Outlook has a higher chi-square score than Wind, Outlook is the more important feature.
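The same comparison can be reproduced with SciPy's `chi2_contingency`; a sketch using the observed tables above (passing `correction=False` disables Yates' continuity correction so the result matches the manual formula):

```python
from scipy.stats import chi2_contingency

# Observed counts: rows are feature values, columns are Go hiking = Yes / No.
outlook = [[2, 3],   # Sunny
           [4, 0],   # Overcast
           [3, 2]]   # Rain
wind = [[3, 3],      # Strong
        [6, 2]]      # Weak

chi2_outlook, p_outlook, df_outlook, _ = chi2_contingency(outlook, correction=False)
chi2_wind, p_wind, df_wind, _ = chi2_contingency(wind, correction=False)

print(chi2_outlook, df_outlook)  # ~3.55 with df = 2
print(chi2_wind, df_wind)        # ~0.93 with df = 1
```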
Wrapper Methods
Depends on the Machine Learning Model we are using
- Need to specify the Machine Learning Model
- Use performance measure embedded in the ML model to decide feature importance
- Use the Machine Learning model performance to judge whether the feature is important
- consider a particular ML model and use the model as evaluation criteria
- Searches for features which are best-suited for the ML model used in terms of measures such as accuracy
Forward selection
- Starts with an empty set of features (no features) and adds the best feature one at a time
Example: 3 features {A, B, C}
We first train with only 1 feature: pick A to train, then pick B to train, then pick C to train.
Choose the single feature with the highest accuracy and add it to the selected set.
Let's say B has the highest accuracy.
Now we train with 2 features: pick A and B to train, then pick B and C to train.
Choose the combination with the highest accuracy and add the new feature to the selected set.
The process goes on and on (see the sketch below).
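A sketch of forward selection with scikit-learn's `SequentialFeatureSelector`; the estimator, dataset, and number of features to keep are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Start from an empty feature set and greedily add the feature that improves
# cross-validated accuracy the most, until 2 features have been selected.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",  # use "backward" for backward elimination
)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected features
```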
Backward elimination
- Starts with all features at the beginning and removes the worst-performing feature one at a time
Combination of Forward selection and Backward elimination
- Select the best attribute and remove the worst
Wrapper Method Example
Using logistic regression as the ML model with recursive feature elimination (RFE):
```python
from sklearn.feature_selection import RFE
```
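A sketch of how RFE with logistic regression might be put together; the dataset and `n_features_to_select=5` are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Repeatedly fit the model and eliminate the weakest feature
# until only 5 features remain.
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=5)
rfe.fit(X, y)

print(rfe.support_)  # True for the features that were kept
print(rfe.ranking_)  # rank 1 = selected; higher ranks were eliminated earlier
```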
Filter-based Methods vs Wrapper Methods
- 2 methods in feature selection: filter vs wrapper
What is their main difference?
- filter-based: don't need to consider the model
- wrapper: need to consider the model
Filter methods do not incorporate a machine learning model to determine whether a feature is good or bad, whereas wrapper methods use a machine learning model and train it on the features to decide whether they are essential or not.
Which is more computationally efficient?
- filter-based method is more computationally efficient (do not involve training)
- Wrapper methods are computationally costly (involve training)
Filter-based methods are more computationally efficient because they do not involve training models. On the other hand, wrapper methods are computationally costly, and in the case of massive datasets they are not the most effective feature selection method to consider.
Which is more prone to over-fitting?
- wrapper method
As wrapper methods already train machine learning models with the candidate features, the selection can fit the quirks of the training data, which affects the true power of learning. The features from filter methods will not lead to overfitting in most cases.
Dimensionality reduction
Dimensionality reduction is different from feature selection.
- Transforms features into another domain
- The idea is to pack the most important information into as few derived features as possible
- The derived features can be totally different from the original features
- Creates new combinations of features
- Interpretation of the model becomes a bit more difficult
We transform the original set of features into another set of features. The idea is to pack the most important information into as few derived features as possible. We can then reduce the number of dimensions by dropping some of the derived features. But we don't lose all the information from the original features: the derived features are a linear combination of the original features.
Dimensionality reduction methods include:
- Singular value decomposition
- Linear discriminant analysis
- Principal component analysis
Principal component analysis (PCA)
In PCA, derived features are also called composite features or principal components. Moreover, these principal components are linearly independent from one another.
Principal Component Analysis (PCA) extracts the most important information. This in turn leads to compression, since the less important information is discarded. With fewer data points to consider, it becomes simpler to describe and analyze the dataset.
- Transforms features into a set of principal components (PCs)
- The extracted principal components are independent of each other
2 feature to 1 Principal component example (2d to 1d):
Another example (5d to 5d):
- Original data: 5 features
- After PCA, have 5 principal components
- The 5 principal components are Independent of each other
- 1st principal component (PC) : keep the max possible information
- Followed by 2nd PC, 3rd PC, 4th PC, 5th PC
- 5th PC: least significant, can be removed
Code Example
```python
from sklearn.decomposition import PCA
```
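A sketch of typical PCA usage; the dataset and `n_components=2` are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # 4 original features

# Project the 4 original features onto 2 principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(X_pca.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)  # information kept by each component
```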
Advantages of PCA:
- PCA keeps the significant information in the original data
- Each extracted principal component is independent of every other principal component
Drawbacks of PCA:
- PCA works only if the observed variables are linearly correlated. If there’s no correlation, PCA will fail to capture adequate variance with fewer components.
- PCA is lossy. Information is lost when we discard insignificant components.
- Scaling of the variables can yield different results, so the scaling that you use should be documented; scaling should not be adjusted to match prior knowledge of the data (see the sketch after this list)
- Since each principal component is a linear combination of the original features, visualizations are not easy to interpret or relate to the original features
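Because scaling changes the result, one common pattern (a sketch, not the only option) is to standardize the features before applying PCA:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize each feature to zero mean and unit variance before PCA,
# so features measured on larger scales do not dominate the components.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
X_scaled_pca = pipeline.fit_transform(X)
print(X_scaled_pca.shape)  # (150, 2)
```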
PCA can be seen as a trade-off between faster computation and less memory consumption versus information loss. It is considered one of the most useful tools for data analysis.
Feature selection VS Dimensionality reduction
- Similarity:
  - Both try to reduce the number of features
- Difference:
  - Feature selection: identifies important features and excludes insignificant features
  - Dimensionality reduction: creates a new set of features