Quick Review on K-NN, Decision tree and Naïve Bayes
K-nearest neighbor
Linear regression is a parametric approach because we have model parameters a, b, …
kNN is a non-parametric method: there are no model parameters.
Unlike logistic regression, kNN supports multi-class classification automatically.
- Key idea: Properties of an input are likely to be similar to those of points in the neighbourhood of that input
- Classify based on nearest points
- K-nearest neighbor is Supervised learning
- Used For Classification problems
- Make use of Geometry
kNN performs much better if all of the features are on the same scale, so feature scaling is usually applied first.
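As a hedged illustration (the toy data, feature ranges and labels below are made up for this note), features can be standardized before fitting kNN, e.g. with scikit-learn's StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Made-up data: feature 0 spans roughly 0-10, feature 1 spans thousands,
# so without scaling feature 1 would dominate the Euclidean distance.
X = np.array([[7, 4000], [7, 4200], [3, 9000], [4, 8800], [1, 5000]])
y = np.array(["bad", "bad", "good", "good", "bad"])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)      # zero mean, unit variance per feature

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_scaled, y)

query = scaler.transform([[6, 5000]])   # scale the query with the same scaler
print(knn.predict(query))
```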
Procedures of KNN
The classification is done like a vote.
- Compute the distance between the selected item and all others, using the Euclidean distance: d(x, y) = √((x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)²)
- Sort the distances (similarities) from the closest to the farthest
- Classify based on the class labels of the closest points (a minimal code sketch of these three steps follows below)
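Here is a minimal from-scratch sketch of the three steps; the training points and labels are illustrative placeholders, not the table used in the example below:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # 1. Euclidean distance between the query and every training point
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # 2. Sort from closest to farthest and keep the k nearest indices
    nearest = np.argsort(distances)[:k]
    # 3. Vote: the most common label among the k nearest neighbours wins
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Illustrative (assumed) training data
X_train = np.array([[7, 4], [7, 7], [3, 4], [1, 4]])
y_train = np.array(["bad", "bad", "good", "good"])
print(knn_predict(X_train, y_train, np.array([6, 5]), k=3))  # -> "bad"
```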
Example
Consider a set of data points, where (6,5) is the point whose class we want to predict.
First, we calculate the distance (similarity) to each data point using the Euclidean distance formula.
E.g. the distance between (7,4) and our query point (6,5) = √((7−6)² + (4−5)²) = √2 ≈ 1.41
Then we sort the distances (similarities) from the closest to the farthest,
and classify based on the class labels of the closest points.
- 1-NN: Using the closest 1 data point as the classifier
- predicted class label of (6,5) => 1 bad => bad
- 3-NN: Using the closest 3 data points as the classifier
- predicted class label of (6,5) => 2 bad : 1 good => bad
- 5-NN: Using the closest 5 data points as the classifier
- predicted class label of (6,5) => 3 bad : 2 good => bad
For binary classification it is better to pick k as an odd number so we won't encounter a 50:50 tie.
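As a small illustration of the voting, the snippet below tallies the votes for k = 1, 3, 5; the ordering of the sorted labels is assumed here (it is only chosen to match the vote counts above):

```python
from collections import Counter

# Hypothetical class labels of the training points, already sorted from the
# closest to the farthest neighbour of (6, 5).
sorted_labels = ["bad", "good", "bad", "good", "bad", "good", "bad"]

for k in (1, 3, 5):
    votes = Counter(sorted_labels[:k])
    print(k, dict(votes), "->", votes.most_common(1)[0][0])
```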
Summary of KNN
kNN is a memory-based approach: because it simply stores all the training data, the classifier immediately adapts as we collect new training data.
- lazy algorithm
- no model building
- simply remember all training data
- memory-intensive approach
Advantages
- No training is needed
- The training time for any value of K is the same.
- kNN works well with a small number of input variables
- kNN makes no assumptions about the functional form of the problem being solved
- Works for multi-class classification
Disadvantages
- Complexity grows with data size, so it is slow at classification time
- Requires large memory and heavy computation at test time
- Sensitive to local structure (random variations in local structure of training set can have undesirable impact)
Decision tree
- Key idea: Construct a tree-like structure by devising a set of if-else rules
- Each node is a test condition
- Each branch is outcome of test represented by corresponding node
- Leaf nodes contain the final decision
- Choose the feature that provides the most information gain and split the data
- Decision tree is Supervised learning
- Used For both Regression and Classification problems
- Make use of Entropy
- Based on probabilities and entropy (the uncertainty)
- Uncertainty drop = Information gain
Decision trees are flowchart-like structures that let you classify input data points or predict output values given inputs.
Procedures of Designing a Decision Tree
The idea is to choose the feature that provides the most information gain and split the data on it.
- First, if a feature has continuous values, compute the average of that feature over the data and use it as a threshold to categorize the values.
- Compute the entropy for each feature: H = −Σᵢ pᵢ log₂ pᵢ, where pᵢ is the proportion of class i
- Pick the lowest entropy (highest information gain) feature to start splitting
- A large entropy value means more uncertainty: the class distribution is more mixed (less biased towards one class)
- So we need to pick the feature with the biggest reduction in uncertainty (lowest entropy after the split); a short code sketch of this computation follows below
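A sketch of the entropy and information-gain computation under these definitions; the feature values and labels are illustrative placeholders:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy H = -sum(p * log2(p)) of a list of class labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def entropy_after_split(feature_values, labels):
    """Weighted average entropy of the subsets produced by splitting on one feature."""
    n = len(labels)
    total = 0.0
    for v in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == v]
        total += len(subset) / n * entropy(subset)
    return total

# Illustrative (assumed) data: one categorical feature and the class labels
sound  = ["bark", "bark", "meow", "meow", "bark"]
labels = ["dog",  "dog",  "cat",  "cat",  "dog"]

gain = entropy(labels) - entropy_after_split(sound, labels)
print("information gain for splitting on sound =", gain)
```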
Example
Construct the decision tree to distinguish dogs from cats.
Since the features are categorical values, we just need to compute the entropy for each feature:
For feature Sound:
Therefore Entropy of sound:
For feature Fur:
Therefore Entropy of fur:
For feature Color:
Therefore Entropy of color:
Then we can pick the lowest entropy (highest information gain) to start splitting
Suppose we choose sound to be the root. Then we have two branches:
For each branch, we calculate the entropy again; you would find that we should choose color, as color has a lower entropy compared with fur.
For feature Fur in branch 1:
Therefore Entropy of fur in branch 1:
For feature Color in branch 1:
Therefore Entropy of color in branch 1:
And so it goes on. Try to draw the tree on your own.
We can get a different tree if we start splitting at fur instead.
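For reference, a heavily hedged sketch of fitting such a tree with scikit-learn; the cat/dog rows below are hypothetical placeholders (the actual table from the example is not reproduced here), and criterion="entropy" makes the splits follow the information-gain rule above:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical cat/dog rows; one-hot encode the categorical features so the
# tree can split on them.
data = pd.DataFrame({
    "sound": ["bark", "bark", "meow", "meow", "bark", "meow"],
    "fur":   ["coarse", "fine", "fine", "coarse", "coarse", "fine"],
    "color": ["brown", "brown", "white", "white", "brown", "brown"],
    "label": ["dog", "dog", "cat", "cat", "dog", "cat"],
})
X = pd.get_dummies(data[["sound", "fur", "color"]])
y = data["label"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```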
Random Forest
More trees can further decrease errors
Ensure that the behavior of each individual tree is not “too correlated”
- need to make sure they are independent (Bagging and Feature randomness)
- Node splitting in a random forest model is based on a random subset of features considered at each split of each tree.
The Random forest algorithm introduced a robust, practical take on decision-tree learning that involves building a large number of specialized decision trees and then ensembling their outputs.
Bagging and Feature randomness
Random forest is based on the bagging concept: it considers a fraction of the samples and a fraction of the features when building each individual tree.
- Bagging: assume sample size is N
- Training data: take a random sample of size N with replacement
- Feature randomness
- Each tree picks a random subset of features at each split (see the sketch below)
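A hedged sketch of both ideas using scikit-learn's RandomForestClassifier on synthetic data (the dataset and parameter values are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data just to show the two sources of randomness
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees in the ensemble
    bootstrap=True,       # bagging: each tree sees a bootstrap sample of size N
    max_features="sqrt",  # feature randomness: each split considers a random subset
    random_state=0,
).fit(X, y)

print(forest.score(X, y))
```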
Boosting
In boosting, we work on the trees sequentially. From the first tree's result, we identify the data that was misclassified by the first tree, increase the weight of this data, and pass it to the second tree. The second tree can therefore work on the weaknesses of the first tree and improve on them. After that, the results of the second tree are tested, the misclassified data is identified again and supplied, with increased weight, to the third tree, and so on.
In this way the results can be improved, but of course it is computationally slower.
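This sequential reweighting is what AdaBoost does; a minimal, hedged sketch on synthetic data (parameter values are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Trees are built sequentially: after each tree, the misclassified samples get
# a larger weight so the next tree focuses on the previous tree's mistakes.
booster = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(booster.score(X, y))
```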
XGBoost
A gradient boosting machine, much like a random forest, is a machine-learning technique based on ensembling weak prediction models, generally decision trees.
It uses gradient boosting, a way to improve any machine-learning model by iteratively training new models that specialize in addressing the weak points of the previous models.
In this article I won't cover Extreme Gradient Boosting (XGBoost) in detail.
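For completeness, a minimal sketch of the gradient-boosting idea using scikit-learn's GradientBoostingClassifier (XGBoost's XGBClassifier exposes a similar fit/predict interface); the data and parameter values are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Each new tree is fit to the errors left by the current ensemble.
gbm = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, random_state=0
).fit(X, y)
print(gbm.score(X, y))
```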
Naïve Bayes
- Key idea: Based on computing the probability distribution of the unknown variable (the class) given the observed values of the other variables, and finding the most probable class
- Naïve Bayes is Supervised learning
- Used For Classification problems
- Make use of Probability (Statistical approach)
Naive Bayes is a type of machine-learning classifier based on applying Bayes' theorem while assuming that all the features in the input data are independent (which is a "naive" assumption).
Assumption: all the input features are conditionally independent of each other
Formula of Naive Bayes:
P(C | x₁, x₂, …, xₙ) ∝ P(C) · P(x₁ | C) · P(x₂ | C) · … · P(xₙ | C)
where C is the class (output), and x₁, …, xₙ are the attributes (input features)
Remember some old formulas:
Conditional probability P(A | B): the probability of A under the assumption that B took place, P(A | B) = P(A ∩ B) / P(B)
Bayes formula: P(A | B) = P(B | A) · P(A) / P(B)
Procedures of Naïve Bayes
Example
Construct the probability tables for the input features “Sound”, “Fur” and “Color”.
If sound is bark, fur is coarse and color is brown, what is the most probable class?
Likelihood of Cat = P(bark, coarse, brown | cat) =
Likelihood of Dog = P(bark, coarse, brown | dog) =
The likelihood only considers the observed data, but Naïve Bayes considers the prior probability too.
Therefore:
Probability of Cat = P(Cat) P(bark, coarse, brown | Cat) =
Probability of Dog = P(Dog) P(bark, coarse, brown | Dog) =
Normalizing the probabilities:
In fact, in Naïve Bayes we don't need to normalize (although there is nothing wrong with performing the normalization), since only the relative ordering of the class scores matters.
Cat:
Dog:
Therefore, the most probable class is Dog.
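A small, hedged sketch of this computation; the prior and likelihood values below are hypothetical placeholders standing in for the probability tables (the real counts from the example are not reproduced here):

```python
# Hypothetical probability-table values
priors = {"cat": 0.5, "dog": 0.5}
likelihoods = {
    "cat": {"bark": 0.1, "coarse": 0.4, "brown": 0.3},
    "dog": {"bark": 0.8, "coarse": 0.6, "brown": 0.5},
}

observation = ["bark", "coarse", "brown"]

# Naive Bayes score: P(class) * product of P(feature | class)
scores = {}
for cls, prior in priors.items():
    score = prior
    for feature in observation:
        score *= likelihoods[cls][feature]
    scores[cls] = score

print(scores)                                      # unnormalized posteriors
total = sum(scores.values())
print({c: s / total for c, s in scores.items()})   # normalized (optional)
print(max(scores, key=scores.get))                 # most probable class -> "dog"
```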
Summary of Naïve Bayes Classifier
- Works even if the dataset size is small
- Computationally efficient algorithm
- Can be applied in real time
Feature independence assumption
- If it holds, Naïve Bayes can give a good estimate
- In practical situations, it is almost impossible to have fully independent features
- Feature engineering: check for correlated features