Quick Review on K-NN, Decision tree and Naïve Bayes
K-nearest neighbor
Linear regression is a parametric approach because we have model parameters a, b, …
kNN is a non-parametric method: there are no model parameters.
Unlike logistic regression, kNN supports multi-class classification automatically.
- Key idea: Properties of an input are likely to be similar to those of points in the neighbourhood of that input
- Classify based on nearest points
- K-nearest neighbor is Supervised learning
- Used For Classification problems
- Make use of Geometry
kNN performs much better if all of the features are on the same scale, so feature scaling is usually applied first.
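As a hedged illustration (the toy data, feature ranges and labels below are made up for this note), features can be standardized before fitting kNN, e.g. with scikit-learn's StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Made-up data: feature 0 spans roughly 0-10, feature 1 spans thousands,
# so without scaling feature 1 would dominate the Euclidean distance.
X = np.array([[7, 4000], [7, 4200], [3, 9000], [4, 8800], [1, 5000]])
y = np.array(["bad", "bad", "good", "good", "bad"])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)      # zero mean, unit variance per feature

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_scaled, y)

query = scaler.transform([[6, 5000]])   # scale the query with the same scaler
print(knn.predict(query))
```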
Procedures of KNN
The classification is done like a vote.
- Compute the distance between the selected item and all others, using the Euclidean distance: d(x, y) = √((x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)²)
- Sort the distances (similarities) from the closest to the farthest
- Classify based on the class labels of the closest points (a minimal code sketch of these three steps follows below)
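Here is a minimal from-scratch sketch of the three steps; the training points and labels are illustrative placeholders, not the table used in the example below:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # 1. Euclidean distance between the query and every training point
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # 2. Sort from closest to farthest and keep the k nearest indices
    nearest = np.argsort(distances)[:k]
    # 3. Vote: the most common label among the k nearest neighbours wins
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Illustrative (assumed) training data
X_train = np.array([[7, 4], [7, 7], [3, 4], [1, 4]])
y_train = np.array(["bad", "bad", "good", "good"])
print(knn_predict(X_train, y_train, np.array([6, 5]), k=3))  # -> "bad"
```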
Example
Consider a set of data points, where (6,5) is the point whose class we want to predict.
First, we calculate the distance (similarity) to each data point using the Euclidean distance formula.
E.g. the distance between (7,4) and our query point (6,5) = √((7−6)² + (4−5)²) = √2 ≈ 1.41
Then we sort the distances (similarities) from the closest to the farthest,
and classify based on the class labels of the closest points.
- 1-NN: Using the closest 1 data point as the classifier
- predicted class label of (6,5) => 1 bad => bad
- 3-NN: Using the closest 3 data points as the classifier
- predicted class label of (6,5) => 2 bad : 1 good => bad
- 5-NN: Using the closest 5 data points as the classifier
- predicted class label of (6,5) => 3 bad : 2 good => bad
For binary classification it is better to pick k as an odd number so we won't encounter a 50:50 tie.
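As a small illustration of the voting, the snippet below tallies the votes for k = 1, 3, 5; the ordering of the sorted labels is assumed here (it is only chosen to match the vote counts above):

```python
from collections import Counter

# Hypothetical class labels of the training points, already sorted from the
# closest to the farthest neighbour of (6, 5).
sorted_labels = ["bad", "good", "bad", "good", "bad", "good", "bad"]

for k in (1, 3, 5):
    votes = Counter(sorted_labels[:k])
    print(k, dict(votes), "->", votes.most_common(1)[0][0])
```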
Summary of KNN
kNN is a memory-based approach: because it simply stores all the training data, the classifier immediately adapts as we collect new training data.
- lazy algorithm
- no model building
- simply remember all training data
- memory-intensive approach
Advantages
- No training is needed
- The training time for any value of K is the same.
- kNN works well with a small number of input variables
- kNN makes no assumptions about the functional form of the problem being solved
- Works for multi-class classification
Disadvantages
- Complexity grows with data size, so it is slow at classification time
- Requires large memory and heavy computation at test time
- Sensitive to local structure (random variations in local structure of training set can have undesirable impact)
Decision tree
- Key idea: Construct a tree-like structure by devising a set of if-else rules
- Each node is a test condition
- Each branch is outcome of test represented by corresponding node
- Leaf nodes contain the final decision
- Choose the feature that provides the most information gain and split the data
- Decision tree is Supervised learning
- Used For both Regression and Classification problems
- Make use of Entropy
- Based on probabilities and entropy (the uncertainty)
- Uncertainty drop = Information gain
Decision trees are flowchart-like structures that let you classify input data points or predict output values given inputs.
Procedures of Designing a Decision Tree
The idea is to choose the feature that provides the most information gain and split the data on it.
- First, if a feature has continuous values, compute the average of that feature over the data and use it as a threshold to categorize the values.
- Compute the entropy for each feature: H = −Σᵢ pᵢ log₂ pᵢ, where pᵢ is the proportion of class i
- Pick the lowest entropy (highest information gain) feature to start splitting
- A large entropy value means more uncertainty: the class distribution is more mixed (less biased towards one class)
- So we need to pick the feature with the biggest reduction in uncertainty (lowest entropy after the split); a short code sketch of this computation follows below
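A sketch of the entropy and information-gain computation under these definitions; the feature values and labels are illustrative placeholders:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy H = -sum(p * log2(p)) of a list of class labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def entropy_after_split(feature_values, labels):
    """Weighted average entropy of the subsets produced by splitting on one feature."""
    n = len(labels)
    total = 0.0
    for v in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == v]
        total += len(subset) / n * entropy(subset)
    return total

# Illustrative (assumed) data: one categorical feature and the class labels
sound  = ["bark", "bark", "meow", "meow", "bark"]
labels = ["dog",  "dog",  "cat",  "cat",  "dog"]

gain = entropy(labels) - entropy_after_split(sound, labels)
print("information gain for splitting on sound =", gain)
```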
Example
Construct the decision tree to distinguish dogs from cats.
Since the features are categorical values, we just need to compute the entropy for each feature:
For feature Sound:
Therefore Entropy of sound:
For feature Fur:
Therefore Entropy of fur:
For feature Color:
Therefore Entropy of color:
Then we can pick the lowest entropy (highest information gain) to start splitting
Suppose we choose sound to be the root. Then we have two branches:
For each branch, we calculate the entropy again; you would find that we should choose color, as color has a lower entropy compared with fur.
For feature Fur in branch 1:
Therefore Entropy of fur in branch 1:
For feature Color in branch 1:
Therefore Entropy of color in branch 1:
And so it goes on. Try to draw the tree on your own.
We can get a different tree if we start splitting at fur instead.
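For reference, a heavily hedged sketch of fitting such a tree with scikit-learn; the cat/dog rows below are hypothetical placeholders (the actual table from the example is not reproduced here), and criterion="entropy" makes the splits follow the information-gain rule above:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical cat/dog rows; one-hot encode the categorical features so the
# tree can split on them.
data = pd.DataFrame({
    "sound": ["bark", "bark", "meow", "meow", "bark", "meow"],
    "fur":   ["coarse", "fine", "fine", "coarse", "coarse", "fine"],
    "color": ["brown", "brown", "white", "white", "brown", "brown"],
    "label": ["dog", "dog", "cat", "cat", "dog", "cat"],
})
X = pd.get_dummies(data[["sound", "fur", "color"]])
y = data["label"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```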
Random Forest
More trees can further decrease errors
Ensure that the behavior of each individual tree is not “too correlated”
- need to make sure they are independent (Bagging and Feature randomness)
- Node splitting in a random forest model is based on a random subset of features considered at each split of each tree.
The Random forest algorithm introduced a robust, practical take on decision-tree learning that involves building a large number of specialized decision trees and then ensembling their outputs.
Bagging and Feature randomness
Random forest is based on the bagging concept: it considers a fraction of the samples and a fraction of the features when building each individual tree.
- Bagging: assume sample size is N
- Training data: take a random sample of size N with replacement
- Feature randomness
- Each tree picks a random subset of features at each split (see the sketch below)
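A hedged sketch of both ideas using scikit-learn's RandomForestClassifier on synthetic data (the dataset and parameter values are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data just to show the two sources of randomness
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees in the ensemble
    bootstrap=True,       # bagging: each tree sees a bootstrap sample of size N
    max_features="sqrt",  # feature randomness: each split considers a random subset
    random_state=0,
).fit(X, y)

print(forest.score(X, y))
```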
Boosting
In boosting, we work on the trees sequentially. From the first tree's result, we identify the data that was misclassified by the first tree, increase the weight of this data, and pass it to the second tree. The second tree can therefore work on the weaknesses of the first tree and improve on them. After that, the results of the second tree are tested, the misclassified data is identified again and supplied, with increased weight, to the third tree, and so on.
In this way the results can be improved, but of course it is computationally slower.
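This sequential reweighting is what AdaBoost does; a minimal, hedged sketch on synthetic data (parameter values are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Trees are built sequentially: after each tree, the misclassified samples get
# a larger weight so the next tree focuses on the previous tree's mistakes.
booster = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(booster.score(X, y))
```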
XGBoost
A gradient boosting machine, much like a random forest, is a machine-learning technique based on ensembling weak prediction models, generally decision trees.
It uses gradient boosting, a way to improve any machine-learning model by iteratively training new models that specialize in addressing the weak points of the previous models.
In this article I won't cover Extreme Gradient Boosting (XGBoost) in detail.
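For completeness, a minimal sketch of the gradient-boosting idea using scikit-learn's GradientBoostingClassifier (XGBoost's XGBClassifier exposes a similar fit/predict interface); the data and parameter values are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Each new tree is fit to the errors left by the current ensemble.
gbm = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, random_state=0
).fit(X, y)
print(gbm.score(X, y))
```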
Naïve Bayes
- Key idea: Based on computing the probability distribution of the unknown variable (the class) given the observed values of the other variables, and finding the most probable class
- Naïve Bayes is Supervised learning
- Used For Classification problems
- Make use of Probability (Statistical approach)
Naive Bayes is a type of machine-learning classifier based on applying Bayes' theorem while assuming that all the features in the input data are independent (which is a "naive" assumption).
Assumption: all the input features are conditionally independent of each other
Formula of Naive Bayes:
P(C | x₁, x₂, …, xₙ) ∝ P(C) · P(x₁ | C) · P(x₂ | C) · … · P(xₙ | C)
where C is the class (output), and x₁, …, xₙ are the attributes (input features)
Remember some old formulas:
Conditional probability P(A | B): the probability of A under the assumption that B took place, P(A | B) = P(A ∩ B) / P(B)
Bayes formula: P(A | B) = P(B | A) · P(A) / P(B)
Procedures of Naïve Bayes
Example
Construct the probability tables for the input features “Sound”, “Fur” and “Color”.
If sound is bark, fur is coarse and color is brown, what is the most probable class?
Likelihood of Cat = P(bark, coarse, brown | cat) =
Likelihood of Dog = P(bark, coarse, brown | dog) =
The likelihood only considers the observed data, but Naïve Bayes considers the prior probability too.
Therefore:
Probability of Cat = P(Cat) P(bark, coarse, brown | Cat) =
Probability of Dog = P(Dog) P(bark, coarse, brown | Dog) =
Normalizing the probabilities:
In fact, in Naïve Bayes we don't need to normalize (although there is nothing wrong with performing the normalization), since only the relative ordering of the class scores matters.
Cat:
Dog:
Therefore, the most probable class is Dog.
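A small, hedged sketch of this computation; the prior and likelihood values below are hypothetical placeholders standing in for the probability tables (the real counts from the example are not reproduced here):

```python
# Hypothetical probability-table values
priors = {"cat": 0.5, "dog": 0.5}
likelihoods = {
    "cat": {"bark": 0.1, "coarse": 0.4, "brown": 0.3},
    "dog": {"bark": 0.8, "coarse": 0.6, "brown": 0.5},
}

observation = ["bark", "coarse", "brown"]

# Naive Bayes score: P(class) * product of P(feature | class)
scores = {}
for cls, prior in priors.items():
    score = prior
    for feature in observation:
        score *= likelihoods[cls][feature]
    scores[cls] = score

print(scores)                                      # unnormalized posteriors
total = sum(scores.values())
print({c: s / total for c, s in scores.items()})   # normalized (optional)
print(max(scores, key=scores.get))                 # most probable class -> "dog"
```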
Summary of Naïve Bayes Classifier
- Works even if the dataset size is small
- Computationally efficient algorithm
- Can be applied in real time
Feature independence assumption
- If it holds, Naïve Bayes can give a good estimate
- In practical situations, it is almost impossible to have fully independent features
- Feature engineering: check for correlated features