In a machine learning problem, many different features may be proposed.

  • How many of them should we include during model training?
  • Generally, we want to use fewer features.
    • More features in model training = more memory used by the program
    • More features in model training = potentially longer model training time (higher computational complexity)
    • More features do not necessarily produce better performance
      • Because some features may be irrelevant or unimportant

Feature Selection

The Purpose of Feature Selection is to:

  • Improve prediction performance
  • Provide faster and more cost-effective predictors
  • Provide a better understanding of the underlying process that generated the data
    • Thus the model is easier to interpret
  • Identify important features and exclude insignificant features

Filter-based Methods

Statistical approach

  • Evaluate the relationship between each feature and the “label” using statistical measures
    • E.g., information gain (entropy), correlation coefficient scores (correlation), Chi-squared test
    • Select the features that have the strongest relationship with the “label”
    • Features are ranked independently of the ML method used
  • No need to specify the machine learning model

Information gain: Entropy

The idea is to choose the feature that provides the most information gain.

  • First, if the features are continuous, compute the average of each feature over all samples and apply it as a threshold to categorize the values.
  • Compute entropy for each feature:

$H(X)=\sum_{i=1}^{n}-P\left(x_{i}\right) \log_{2} P\left(x_{i}\right)$

  • Pick the feature with the lowest (conditional) entropy, i.e. the highest information gain
    • A large entropy value means more uncertainty (the distribution is closer to uniform)
    • So we pick the feature that gives the biggest reduction in uncertainty (lowest entropy)
import math
import numpy as np

def calc_entropy_for_one(column):
    # Compute the counts of each unique value in the column
    counts = np.bincount(column)
    # Divide by the length of the column to get the probabilities
    probabilities = counts / len(column)

    entropy = 0
    # Loop through the probabilities and add to the total entropy
    for prob in probabilities:
        if prob > 0:
            # use log from math and set base to 2
            entropy += prob * math.log(prob, 2)
    return -entropy

def calc_entropy_full(data, feature_name, target_name):
    # Find the unique values in the column "feature_name"
    values = data[feature_name].unique()

    # Weighted (conditional) entropy of the target, given this feature
    entropy = 0
    for i in values:
        subset = data[data[feature_name] == i]
        prob = subset.shape[0] / data.shape[0]
        entropy += prob * calc_entropy_for_one(subset[target_name])
    return entropy
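
A small usage sketch on a made-up DataFrame (the column names and values here are hypothetical, and the label is assumed to be encoded as small non-negative integers because np.bincount requires that):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Outlook":  [0, 0, 1, 2, 2, 1, 0, 2],   # 0 = Sunny, 1 = Overcast, 2 = Rain
    "GoHiking": [0, 1, 1, 1, 0, 1, 0, 1],   # 0 = No, 1 = Yes
})

print(calc_entropy_for_one(df["GoHiking"]))          # entropy of the label itself
print(calc_entropy_full(df, "Outlook", "GoHiking"))  # conditional entropy of the label given Outlook
# Information gain of Outlook = H(label) - H(label | Outlook);
# pick the feature with the largest gain, i.e. the lowest conditional entropy.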

Drawback:

  • Assumes features are independent of one another (which is not always a good assumption)
    • Because some features might be closely related

Feature Importance: Hypothesis testing with Chi-square

The idea is to evaluate the likelihood of correlation between a feature and the label.

  • Statistical approach

Use hypothesis testing with the chi-square statistic

  • Null hypothesis $H_0$: X and Y are statistically independent of each other

A hypothesis is an idea that can be tested.

  • The null hypothesis $H_0$ is the hypothesis to be tested, the idea you want to reject.
  • $H_0$ is true until rejected (innocent until proven guilty)
  • $H_1$ is the alternative hypothesis (everything except $H_0$), i.e. $H_1 = \overline{H_0}$

Test if two variables X and Y are statistically dependent on each other

$\chi_{c}^{2}=\sum_{i} \frac{\left(O_{i}-E_{i}\right)^{2}}{E_{i}}$

Where

  • $c$: degrees of freedom, where degrees of freedom = (number of rows - 1) x (number of columns - 1)
    • The degrees of freedom determine the threshold of the shaded region
  • $O$: observed value (the given data)
  • $E$: expected value
  • $i$: index over the cells of the data
[Figure: chi-square distribution with the shaded rejection region in the right tail]
  • $\alpha$ is the proportion of the distribution covered by the shaded tail area (5% in this case)

The shaded zone is the rejection region. If $\chi_{c}^{2}$ falls in it, we reject the null hypothesis $H_0$ (X and Y are statistically independent of each other), which means X and Y are statistically dependent on each other. In that case we can use the feature to make predictions on our label, since they share some kind of relationship.

If $\chi_{c}^{2}$ falls within the blank (unshaded) part, we cannot reject the null hypothesis $H_0$: the label and the feature are treated as independent of each other (no strong relationship). This means we cannot rely on the feature to predict the label.

Also note that the chi-squared distribution only takes non-negative values; its graph is asymmetric and skewed to the right.

A small code snippet that can be used:

import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# X: feature matrix (chi2 requires non-negative values), Y: labels, names: list of feature names
test = SelectKBest(score_func=chi2, k=5)  # keep the 5 best features
fit = test.fit(X, Y)
dfscores = pd.DataFrame(fit.scores_)  # fit.scores_ contains the chi-square score of every feature
# Higher score means more important: the statistic lands in the shaded (rejection) region, i.e. strong dependence.

# List the features ranked by their scores (scores are computed for all features, independent of k)
dfcolumns = pd.DataFrame(names)
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Specs', 'Score']
print(featureScores.nlargest(8, 'Score'))  # show the 8 highest-scoring features

An example of a manual calculation, which also provides a clearer explanation:

  • Determine if there is any relationship between “gender” (a feature) and “customer who closes the bank account” (the label).
    • Our $H_0$: “gender” and “customer who closes the bank account” are independent
    • $H_1 = \overline{H_0}$: “gender” and “customer who closes the bank account” are dependent
               Closed = Yes   Closed = No   Total
Male                38            178        216
Female              44            140        184
Total               82            318        400

$\chi_{c}^{2}=\sum_{i} \frac{\left(O_{i}-E_{i}\right)^{2}}{E_{i}}$

  • Degrees of freedom = (number of rows - 1) x (number of columns - 1) = $(2-1)(2-1) = 1$
    • The degrees of freedom determine the threshold of the shaded region
  • Observed values $O_i$ are the given data (e.g. Yes and Male, Yes and Female, No and Male, No and Female)
  • Expected values $E_i$ are the predicted probabilities times the total number of records (e.g. P(Yes and Male) x Total, P(Yes and Female) x Total, P(No and Male) x Total, P(No and Female) x Total)
    • The probability used for the expected value is $P(A\cap B) = P(A)P(B)$, where we assume independence

Note that if $A$ and $B$ are fully independent, $P(A\cap B)$ equals $P(A)P(B)$.

In other words, $O - E$ will equal 0 if the feature and the label are fully independent. So the larger the $\chi_{c}^{2}$ value, the more dependent the feature and the label are.

Finding the expected value of (Yes and Male), $E_1$:

$P(\text{Yes and Male}) = P(\text{Yes})P(\text{Male}) = \frac{82}{400}\times\frac{216}{400} = 0.1107$

$E_1 = P(\text{Yes and Male})\times\text{Total} = 0.1107 \times 400 = 44.28 \approx 44$

Here we round off because the expected value is measured in number of persons.

$O_1 = 38$ (given)

Finding the expected value of (Yes and Female), $E_2$:

$P(\text{Yes and Female}) = P(\text{Yes})P(\text{Female}) = \frac{82}{400}\times\frac{184}{400} = 0.0943$

$E_2 = P(\text{Yes and Female}) \times \text{Total} = 0.0943 \times 400 = 37.72 \approx 38$

$O_2 = 44$ (given)

Finding the expected value of (No and Male), $E_3$:

$P(\text{No and Male}) = P(\text{No})P(\text{Male}) = \frac{318}{400}\times\frac{216}{400} = 0.4293$

$E_3 = P(\text{No and Male}) \times \text{Total} = 0.4293 \times 400 = 171.72 \approx 172$

$O_3 = 178$ (given)

Finding the expected value of (No and Female), $E_4$:

$P(\text{No and Female}) = P(\text{No})P(\text{Female}) = \frac{318}{400}\times\frac{184}{400} = 0.3657$

$E_4 = P(\text{No and Female}) \times \text{Total} = 0.3657 \times 400 = 146.28 \approx 146$

$O_4 = 140$ (given)

Plugging the variables into the formula:

$\chi_{c}^{2}=\sum_{i} \frac{\left(O_{i}-E_{i}\right)^{2}}{E_{i}} = \frac{(38-44)^{2}}{44}+\frac{(44-38)^{2}}{38}+\frac{(178-172)^{2}}{172}+\frac{(140-146)^{2}}{146}$

$\chi_{1}^{2} \approx 2.22$

How do we find the threshold of the shaded region using the degrees of freedom?

[Chi-square distribution table: for df = 1 and p = 0.05, the critical value is 3.84]
  • df is the degrees of freedom
  • The value of $\alpha$ = 5%, i.e. p = 0.05, so the threshold is 3.84
  • Since 2.22 < 3.84, we cannot reject $H_0$, so we cannot conclude that gender and account closure are dependent; there is no strong relationship.

For each feature, we can carry out a chi-square test to see whether it is an important feature (i.e. has a strong relationship with the label).
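
As a quick check, here is a minimal sketch (assuming SciPy is available) that reruns the gender example with scipy.stats.chi2_contingency. SciPy uses the exact expected counts (44.28, 37.72, ...) rather than the rounded ones, so its statistic comes out slightly above the hand-computed 2.22, while still staying below the 3.84 threshold; correction=False turns off Yates' continuity correction so the plain Pearson formula above is applied.

import numpy as np
from scipy.stats import chi2, chi2_contingency

observed = np.array([[38, 178],   # Male:   closed = Yes, closed = No
                     [44, 140]])  # Female: closed = Yes, closed = No

stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
threshold = chi2.ppf(0.95, df=dof)  # critical value for alpha = 0.05 and df = 1 (about 3.84)

print(stat, p_value, dof)           # chi-square statistic, p-value, degrees of freedom
print(expected)                     # exact expected counts
print("Reject H0 (dependent)?", stat > threshold)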

An Example

Determine the more important feature (Outlook vs Wind).

[Table: 14-record hiking dataset with features Outlook and Wind and the label “Go hiking”]

First construct the table for Outlook (First feature):

               Go hiking = Yes   Go hiking = No   Total
Sunny                 2                 3           5
Overcast              4                 0           4
Rain                  3                 2           5
Total                 9                 5          14

Observed values:

$\text{Yes and Sunny} = 2$

$\text{No and Sunny} = 3$

$\text{Yes and Overcast} = 4$

$\text{No and Overcast} = 0$

$\text{Yes and Rain} = 3$

$\text{No and Rain} = 2$

Expected values:

$P(\text{Yes})P(\text{Sunny}) \times \text{Total} = \frac{9}{14}\times\frac{5}{14}\times 14 = 3.21$

$P(\text{No})P(\text{Sunny}) \times \text{Total} = \frac{5}{14}\times\frac{5}{14}\times 14 = 1.79$

$P(\text{Yes})P(\text{Overcast}) \times \text{Total} = \frac{9}{14}\times\frac{4}{14}\times 14 = 2.57$

$P(\text{No})P(\text{Overcast}) \times \text{Total} = \frac{5}{14}\times\frac{4}{14}\times 14 = 1.43$

$P(\text{Yes})P(\text{Rain}) \times \text{Total} = \frac{9}{14}\times\frac{5}{14}\times 14 = 3.21$

$P(\text{No})P(\text{Rain}) \times \text{Total} = \frac{5}{14}\times\frac{5}{14}\times 14 = 1.79$

Plugging the variables into the formula for Outlook (degrees of freedom $c = (3-1)(2-1) = 2$):

$\chi_{2}^{2}=\frac{(2-3.21)^{2}}{3.21}+\frac{(3-1.79)^{2}}{1.79}+\frac{(4-2.57)^{2}}{2.57}+\frac{(0-1.43)^{2}}{1.43}+\frac{(3-3.21)^{2}}{3.21}+\frac{(2-1.79)^{2}}{1.79} \approx 3.54$

Then construct the table for Wind (Second feature):

               Go hiking = Yes   Go hiking = No   Total
Strong                3                 3           6
Weak                  6                 2           8
Total                 9                 5          14

Observed values:

$\text{Yes and Strong} = 3$

$\text{No and Strong} = 3$

$\text{Yes and Weak} = 6$

$\text{No and Weak} = 2$

Expected values:

$P(\text{Yes})P(\text{Strong}) \times \text{Total} = \frac{9}{14}\times\frac{6}{14}\times 14 = 3.86$

$P(\text{No})P(\text{Strong}) \times \text{Total} = \frac{5}{14}\times\frac{6}{14}\times 14 = 2.14$

$P(\text{Yes})P(\text{Weak}) \times \text{Total} = \frac{9}{14}\times\frac{8}{14}\times 14 = 5.14$

$P(\text{No})P(\text{Weak}) \times \text{Total} = \frac{5}{14}\times\frac{8}{14}\times 14 = 2.86$

Plugging the variables into the formula for Wind:

Note the degrees of freedom $c$ = (number of rows - 1) x (number of columns - 1) = $(2-1)(2-1) = 1$

$\chi_{c}^{2}=\sum_{i} \frac{\left(O_{i}-E_{i}\right)^{2}}{E_{i}} = \frac{(3-3.86)^{2}}{3.86}+\frac{(3-2.14)^{2}}{2.14}+\frac{(6-5.14)^{2}}{5.14}+\frac{(2-2.86)^{2}}{2.86}$

$\chi_{1}^{2} = \frac{0.7396}{3.86}+\frac{0.7396}{2.14}+\frac{0.7396}{5.14}+\frac{0.7396}{2.86} = 0.9397$

Since Outlook has a higher chi-square score (about 3.54) than Wind (about 0.94), Outlook is the more important feature.
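
The same comparison can be reproduced with scipy.stats.chi2_contingency (a sketch under the assumption that raw chi-square scores are compared directly, as the example does, even though the two tables have different degrees of freedom):

import numpy as np
from scipy.stats import chi2_contingency

outlook = np.array([[2, 3],   # Sunny:    Yes, No
                    [4, 0],   # Overcast: Yes, No
                    [3, 2]])  # Rain:     Yes, No
wind = np.array([[3, 3],      # Strong:   Yes, No
                 [6, 2]])     # Weak:     Yes, No

chi2_outlook, _, dof_outlook, _ = chi2_contingency(outlook)            # roughly 3.55 with dof = 2
chi2_wind, _, dof_wind, _ = chi2_contingency(wind, correction=False)   # roughly 0.94 with dof = 1
print(chi2_outlook, dof_outlook)
print(chi2_wind, dof_wind)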

Wrapper Methods

Depends on the Machine Learning Model we are using

  • Need to specify the Machine Learning Model
  • Use performance measure embedded in the ML model to decide feature importance
    • Use the Machine Learning model performance to judge whether the feature is important
  • Considers a particular ML model and uses it as the evaluation criterion
    • Searches for features which are best-suited for the ML model used in terms of measures such as accuracy

Forward selection

  • Starts with an empty set of features (no features) and adds the best feature one at a time

Example: 3 Features {A B C}

We first train with only one feature: pick A to train, then pick B to train, then pick C to train.

Choose the feature with the highest accuracy and add it to the selected set.

Let's say B has the highest accuracy.

Now we train with two features: pick A and B to train, then pick B and C to train.

Again choose the combination with the highest accuracy and add the new feature to the selected set.

The process continues until a stopping criterion is met (for example, the desired number of features is reached); see the sketch below.
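
A minimal sketch of automating this with scikit-learn's SequentialFeatureSelector (the estimator, the number of features to keep, and the cross-validation setting here are assumptions, not fixed choices):

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
# direction='forward' starts from an empty set and greedily adds the best feature each round;
# direction='backward' would start from all features and drop the worst one each round instead.
sfs = SequentialFeatureSelector(model, n_features_to_select=2, direction='forward', cv=5)
sfs.fit(X, Y)
print(sfs.get_support())  # boolean mask of the selected features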

Backward elimination

  • Starts with all features at the beginning and removes the worst-performing feature one at a time

Combination of Forward selection and Backward elimination

  • Select the best attribute and remove the worst

Wrapper Method Example

Using logistic regression as the ML model with recursive feature elimination (RFE):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
rfe = RFE(model, n_features_to_select=5)  # keep only 5 features
fit = rfe.fit(X, Y)
print("Num Features: %s" % (fit.n_features_))
print("Selected Features: %s" % (fit.support_))
print("Feature Ranking: %s" % (fit.ranking_))

Filter-based Methods vs Wrapper Methods

  • 2 methods in feature selection: filter vs wrapper

What is their main difference?

  • Filter-based: no need to consider the model
  • Wrapper: need to consider the model

Filter methods do not incorporate a machine learning model to determine whether a feature is good or bad, whereas wrapper methods use a machine learning model and train it on the feature to decide whether it is essential or not.

Which is more computationally efficient?

  • Filter-based methods are more computationally efficient (they do not involve training)
  • Wrapper methods are computationally costly (they involve training)

Filter-based methods are more computationally efficient because they do not involve training any model. On the other hand, wrapper methods are computationally costly, and in the case of massive datasets they are not the most effective feature selection method to consider.

Which is more prone to over-fitting?

  • wrapper method

Wrapper methods train machine learning models on the candidate features, so the selection can adapt to quirks of that particular model and data. Features chosen by filter methods will not lead to overfitting in most cases.

Dimensionality reduction

Dimensionality reduction is different from feature selection.

  • Transform features into another domain
    • The idea is to pack the most important information into as few derived features as possible
  • Can be totally different from original features
    • Create new combinations of features
  • Interpretation of the model becomes more difficult

We transform the original set of features into another set of features. The idea is to pack the most important information into as few derived features as possible. We can reduce the number of dimensions by dropping some of the derived features, but we do not completely lose the information from the original features: each derived feature is a linear combination of the original features.

Dimensionality reduction methods include:

  • Singular value decomposition
  • Linear discriminant analysis
  • Principal component analysis

Principal component analysis (PCA)

In PCA, derived features are also called composite features or principal components. Moreover, these principal components are linearly independent from one another.

Principal Component Analysis (PCA) extracts the most important information. This in turn leads to compression since the less important information are discarded. With fewer data points to consider, it becomes simpler to describe and analyze the dataset.

  • Transform features into a set of Principal components (PC)
    • The extracted principal components are independent of each other

An example of reducing 2 features to 1 principal component (2D to 1D):

PCA in a nutshell. Source: Lavrenko and Sutton 2011, slide 13.

Another example (5d to 5d):

  • Original data: 5 features
  • After PCA, have 5 principal components
    • The 5 principal components are Independent of each other
    • 1st principal component (PC): keeps the maximum possible information
    • Followed by 2nd PC, 3rd PC, 4th PC, 5th PC
    • 5th PC: least significant, can be removed

Code Example

import pandas as pd
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
principalComponents = pca.fit_transform(X)
principalDf = pd.DataFrame(data=principalComponents, columns=['PC1', 'PC2', 'PC3'])
explained_variance = pca.explained_variance_ratio_  # fraction of the total variance kept by each PC
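
A hedged follow-up sketch: fit PCA with all components (a separate, assumed refit rather than the 3-component model above) to see how much of the original variance each principal component keeps and how many components are needed to retain, say, 95% of it.

import numpy as np

pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_keep = int(np.argmax(cumulative >= 0.95)) + 1  # smallest number of PCs retaining ~95% of the variance
print(cumulative)
print("Components needed for 95% of the variance:", n_keep)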

Advantage of PCA:

  • PCA keeps the most significant information in the original data
    • The extracted principal components are independent of each other

Drawbacks of PCA:

  • PCA works only if the observed variables are linearly correlated. If there’s no correlation, PCA will fail to capture adequate variance with fewer components.
  • PCA is lossy. Information is lost when we discard insignificant components.
  • Scaling of variables can yield different results. Hence, scaling that you use should be documented. Scaling should not be adjusted to match prior knowledge of data.
  • Since each principal component is a linear combination of the original features, visualizations are not easy to interpret or relate to the original features.

PCA can be seen as a trade-off between faster computation and lower memory consumption versus information loss. It is considered one of the most useful tools for data analysis.

Feature selection VS Dimensionality reduction

  • Similarity:
    • trying to reduce the number of features
  • Difference:
    • Feature selection: identifies important features and excludes insignificant features
    • Dimensionality reduction: creates a new set of features

Reference

https://devopedia.org/principal-component-analysis