K-nearest neighbor

Linear regression is a parametric approach because we have model parameters $a$, $b$, …

kNN is a non-parametric method: there are no model parameters.

Unlike logistic regression, kNN supports multi-class classification automatically.

  • Key idea: Properties of an input are likely to be similar to those of points in the neighbourhood of $x$
    • Classify $x$ based on its $k$ nearest points
  • K-nearest neighbor is Supervised learning
  • Used For Classification problems
  • Make use of Geometry

kNN performs much better if all of the features are on the same scale
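For illustration, a minimal scikit-learn sketch (assuming scikit-learn is available) on hypothetical data where the two features have very different scales; standardizing first keeps the large-scale feature from dominating the Euclidean distance:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: feature 0 in metres, feature 1 in grams (very different scales)
X = np.array([[1.2, 5100.0], [1.4, 4800.0], [0.3, 900.0], [0.4, 1100.0]])
y = np.array(["dog", "dog", "cat", "cat"])

# Standardize each feature to zero mean / unit variance, then run kNN on the scaled data
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict(np.array([[1.1, 4900.0]])))  # -> ['dog']
```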

Procedures of KNN

The classification is done like a vote.

  • Compute the distance between the selected item and all others, using Euclidean distance:

$$E(x, y)=\sqrt{\sum_{i=1}^{n}\left(x_{i}-y_{i}\right)^{2}}$$

  • Sort the distances (similarities) from the closest to the farthest
  • Classify $x$ based on the class labels of the $k$ closest points (see the sketch below)
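A minimal from-scratch sketch of this procedure (the function name and data layout are my own, not from the notes):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    """Classify a single point x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training point
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote over their class labels
    return Counter(y_train[nearest]).most_common(1)[0][0]
```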

Example

Consider the following dataset, where (6,5) is the point we want to classify.

img img

First we calculate the distance (Similarity) for each data point using the Euclidean distance formula

E.g. the distance between (7,4) and our query point (6,5) = $\sqrt{(7-6)^2 + (4-5)^2} = \sqrt{2} = 1.414214$

img

Then we sort the distances (similarities) from the closest to the farthest.

And classify $x$ based on the class labels of the $k$ closest points:

  • 1-NN: Use the closest 1 data point as the classifier
    • img
    • Predicted class label of (6,5) => 1 bad => bad
  • 3-NN: Use the closest 3 data points as the classifier
    • img
    • Predicted class label of (6,5) => 2 bad : 1 good => bad
  • 5-NN: Use the closest 5 data points as the classifier
    • Predicted class label of (6,5) => 3 bad : 2 good => bad

For binary classification it is better to pick an odd $k$ so we won't encounter a 50:50 tie.
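A tiny hypothetical illustration of why an even $k$ can produce a tie while an odd $k$ cannot (with two classes):

```python
import numpy as np
from collections import Counter

# Hypothetical 1-D training data with two classes
X = np.array([[0.0], [2.0], [4.0]])
y = np.array(["good", "bad", "bad"])
query = np.array([1.0])

dists = np.abs(X[:, 0] - query[0])  # distances: 1, 1, 3
order = np.argsort(dists)

print(Counter(y[order[:2]]))  # k=2: 1 'good' vs 1 'bad' -> a 50:50 tie
print(Counter(y[order[:3]]))  # k=3: 2 'bad' vs 1 'good' -> clear majority: 'bad'
```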

Summary of KNN

kNN is a memory-based approach: the classifier adapts immediately as we collect new training data, at the cost of storing the whole training set.

  • lazy algorithm
  • no model building
  • simply remember all training data
  • memory-intensive approach

Advantages

  • No training is needed
  • The training time for any value of K is the same.
  • kNN works well with a small number of input variables
  • kNN makes no assumptions about the functional form of the problem being solved
  • Works for multi-class classification

Disadvantages

  • Complexity grows with the data size, so it is slow at classification time
  • Large memory/high computations in testing
  • Sensitive to local structure (random variations in local structure of training set can have undesirable impact)

Decision tree

  • Key idea: Construct a tree-like structure by devising a set of if-else rules
    • Each node is a test condition
    • Each branch is an outcome of the test represented by the corresponding node
    • Leaf nodes contain the final decision
    • Choose the feature that provides the most information gain and split the data on it
  • Decision tree is Supervised learning
  • Used For both Regression and Classification problems
  • Make use of Entropy
    • Based on probabilities and entropy (the uncertainty)
    • Uncertainty drop = Information gain

Decision trees are flowchart-like structures that let you classify input data points or predict output values given inputs.

Procedures of Designing a Decision Tree

The idea is to choose the feature that provides the most information gain and split the data on it.

  • First, if the features are continuous values, compute the average of each feature over all classes, then apply it as a threshold to categorize the values.
  • Compute entropy for each feature:

$$H(X)=\sum_{i=1}^{n}-P\left(x_{i}\right) \log _{2} P\left(x_{i}\right)$$

  • Pick the lowest-entropy (highest information gain) feature to start splitting
    • A large entropy value means more uncertainty: the class distribution is more mixed (closer to uniform)
    • So we pick the feature that gives the biggest reduction in uncertainty, i.e. the lowest weighted entropy after the split (see the sketch below)
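A minimal sketch of this entropy computation in Python (the function name is mine):

```python
import math

def entropy(counts):
    """H(X) in bits for a tuple of class counts, e.g. (1, 3) = one dog, three cats."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

print(entropy((1, 3)))  # ~0.811: fairly uncertain
print(entropy((2, 2)))  # 1.0: maximum uncertainty for two classes
print(entropy((0, 4)))  # 0.0: a pure node, no uncertainty at all
```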

Example

Construct the decision tree to distinguish dogs from cats.

Since the features are categorical values, we just need to compute the entropy for each feature:

For feature Sound:

$$\text{Entropy(sound = meow)} = E(1,3) = -P(dog|meow)\log_2P(dog|meow)-P(cat|meow)\log_2P(cat|meow)$$

$$E(1,3)= -\left[\frac{1}{4} \times \log_2\left(\frac{1}{4}\right)+\frac{3}{4} \times \log_2\left(\frac{3}{4}\right)\right] = 0.811$$

$$\text{Entropy(sound = bark)} = E(3,1) = -P(dog|bark)\log_2P(dog|bark)-P(cat|bark)\log_2P(cat|bark)$$

$$E(3,1) = -\left[\frac{3}{4} \times \log_2\left(\frac{3}{4}\right)+\frac{1}{4} \times \log_2\left(\frac{1}{4}\right)\right] = 0.811$$

Therefore Entropy of sound:

$$\text{Entropy of sound} = P(Meow) \times \text{Entropy(sound = meow)} + P(Bark) \times \text{Entropy(sound = bark)}$$

$$= \frac{1}{2} \times 0.811 + \frac{1}{2} \times 0.811 = 0.811$$

For feature Fur:

$$\text{Entropy(fur = coarse)} = E(3,1) = -P(dog|coarse)\log_2P(dog|coarse)-P(cat|coarse)\log_2P(cat|coarse)$$

$$E(3,1) = -\left[\frac{3}{4} \times \log_2\left(\frac{3}{4}\right)+\frac{1}{4} \times \log_2\left(\frac{1}{4}\right)\right] = 0.811$$

$$\text{Entropy(fur = fine)} = E(1,3) = -P(dog|fine)\log_2P(dog|fine)-P(cat|fine)\log_2P(cat|fine)$$

$$E(1,3)= -\left[\frac{1}{4} \times \log_2\left(\frac{1}{4}\right)+\frac{3}{4} \times \log_2\left(\frac{3}{4}\right)\right] = 0.811$$

Therefore Entropy of fur:

$$\text{Entropy of fur} = P(Coarse) \times \text{Entropy(fur = coarse)} + P(Fine) \times \text{Entropy(fur = fine)}$$

$$= \frac{1}{2} \times 0.811 + \frac{1}{2} \times 0.811 = 0.811$$

For feature Color:

$$\text{Entropy(color = brown)} = E(2,2) = -P(dog|brown)\log_2P(dog|brown)-P(cat|brown)\log_2P(cat|brown)$$

$$E(2,2) = -\left[\frac{2}{4} \times \log_2\left(\frac{2}{4}\right)+\frac{2}{4} \times \log_2\left(\frac{2}{4}\right)\right] = 1$$

$$\text{Entropy(color = black)} = E(2,2) = -P(dog|black)\log_2P(dog|black)-P(cat|black)\log_2P(cat|black)$$

$$E(2,2) = -\left[\frac{2}{4} \times \log_2\left(\frac{2}{4}\right)+\frac{2}{4} \times \log_2\left(\frac{2}{4}\right)\right] = 1$$

Therefore Entropy of color:

$$\text{Entropy of color} = P(Brown) \times \text{Entropy(color = brown)} + P(Black) \times \text{Entropy(color = black)}$$

$$= \frac{1}{2} \times 1 + \frac{1}{2} \times 1 = 1$$

Then we pick the feature with the lowest entropy (highest information gain) to start splitting.
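As a quick sanity check of these three numbers, a short sketch (the (dog, cat) counts per feature value are the ones used in the calculations above; helper names are mine):

```python
import math

def entropy(counts):
    """Shannon entropy (bits) of a tuple of class counts, e.g. (dogs, cats)."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def weighted_entropy(groups):
    """Entropy after splitting on a feature: group entropies weighted by group size."""
    total = sum(sum(g) for g in groups)
    return sum(sum(g) / total * entropy(g) for g in groups)

print(weighted_entropy([(1, 3), (3, 1)]))  # sound: meow, bark   -> 0.811
print(weighted_entropy([(3, 1), (1, 3)]))  # fur:   coarse, fine -> 0.811
print(weighted_entropy([(2, 2), (2, 2)]))  # color: brown, black -> 1.0
```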

Suppose we choose sound to be the root. Then we have two branches:

img

For each branch, we calculate the entropies again. For branch 1 you would find that we should choose fur, since fur has a lower weighted entropy (0.5) than color (0.6887).

For feature Fur in branch 1:

$$\text{Entropy(fur = coarse)} = E(1,1) = -P(dog|coarse)\log_2P(dog|coarse)-P(cat|coarse)\log_2P(cat|coarse)$$

$$E(1,1) = -\left[\frac{1}{2} \times \log_2\left(\frac{1}{2}\right)+\frac{1}{2} \times \log_2\left(\frac{1}{2}\right)\right] = 1$$

$$\text{Entropy(fur = fine)} = E(0,2) = -P(dog|fine)\log_2P(dog|fine)-P(cat|fine)\log_2P(cat|fine)$$

$$E(0,2)= -\left[0 \times \log_2(0)+\frac{2}{2} \times \log_2\left(\frac{2}{2}\right)\right] = 0$$

Therefore Entropy of fur in branch 1:

$$\text{Entropy of fur} = P(Coarse) \times \text{Entropy(fur = coarse)} + P(Fine) \times \text{Entropy(fur = fine)}$$

$$= \frac{2}{4} \times 1 + \frac{2}{4} \times 0 = 0.5$$

For feature Color in branch 1:

$$\text{Entropy(color = brown)} = E(1,2) = -P(dog|brown)\log_2P(dog|brown)-P(cat|brown)\log_2P(cat|brown)$$

$$E(1,2) = -\left[\frac{1}{3} \times \log_2\left(\frac{1}{3}\right)+\frac{2}{3} \times \log_2\left(\frac{2}{3}\right)\right] = 0.9183$$

$$\text{Entropy(color = black)} = E(0,1) = -P(dog|black)\log_2P(dog|black)-P(cat|black)\log_2P(cat|black)$$

$$E(0,1) = -\left[0 \times \log_2(0)+\frac{1}{1} \times \log_2\left(\frac{1}{1}\right)\right] = 0$$

Therefore Entropy of color in branch 1:

$$\text{Entropy of color} = P(Brown) \times \text{Entropy(color = brown)} + P(Black) \times \text{Entropy(color = black)}$$

$$= \frac{3}{4} \times 0.9183 + \frac{1}{4} \times 0 = 0.6887$$

And so on for the remaining branches. Try to draw the full tree on your own.

We can get a different tree if we start splitting at fur instead.

Random Forest

More trees can further decrease the error.

Ensure that the behaviors of the individual trees are not “too correlated” with each other.

  • We need to make sure the trees are independent (via bagging and feature randomness)
  • Node splitting in a random forest model is based on a random subset of features for each tree.

The Random forest algorithm introduced a robust, practical take on decision-tree learning that involves building a large number of specialized decision trees and then ensembling their outputs.

Bagging and Feature randomness

Random forest is based on the bagging concept: it considers a fraction of the samples and a fraction of the features when building each individual tree, as sketched below.

  • Bagging: assume sample size is N
    • Training data: take a random sample of size N with replacement
  • Feature randomness
    • Each tree picks a random subset of the features at each split
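A minimal usage sketch with scikit-learn (assuming it is available; the dataset is just a stand-in). The two constructor arguments map directly onto the two ideas above:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)  # any classification dataset works here

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees in the ensemble
    bootstrap=True,       # bagging: each tree trains on N samples drawn with replacement
    max_features="sqrt",  # feature randomness: each split considers a random feature subset
    random_state=0,
)
forest.fit(X, y)
print(forest.score(X, y))
```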

Boosting

In boosting, we build the trees sequentially. From the first tree's results, we identify the data that was misclassified by the first tree, increase the weight of that data, and pass it to the second tree. The second tree can therefore work on the “weaknesses” of the first tree and improve on them. After that, the results of the second tree are tested, the misclassified samples are identified again, and they are supplied (with higher weight) to the third tree, and so on.

In this way the results can be improved, but of course it is computationally slower, since the trees cannot be built in parallel.
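The reweighting scheme described above is essentially what AdaBoost does; a minimal scikit-learn sketch (assuming it is available), again on a stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

X, y = load_iris(return_X_y=True)

# Trees are built sequentially; samples misclassified so far receive larger weights,
# so each new tree focuses on the "weaknesses" of the previous ones.
booster = AdaBoostClassifier(n_estimators=50, random_state=0)
booster.fit(X, y)
print(booster.score(X, y))
```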

XGBoost

A gradient boosting machine, much like a random forest, is a machine-learning technique based on ensembling weak prediction models, generally decision trees.

It uses gradient boosting, a way to improve any machine-learning model by iteratively training new models that specialize in addressing the weak points of the previous models.

In this article I won't cover Extreme Gradient Boosting (XGBoost) in detail.

Naïve Bayes

  • Key idea: Compute the probability distribution of the class given the observed values of the input features, and pick the most probable class
  • Naïve Bayes is Supervised learning
  • Used For Classification problems
  • Make use of Probability (Statistical approach)

Naive Bayes is a type of machine-learning classifier based on applying Bayes’ theorem while assuming that all the features in the input data are independent (which is a “naive” assumption).

Assumption: all the input features are conditionally independent of each other given the class.

Formula of Naive Bayes:

$$c=\underset{c_{j}}{\arg \max } \operatorname{Pr}\left(c_{j}\right) \prod_{i=1}^{|A|} \operatorname{Pr}\left(A_{i}=a_{i} \mid C=c_{j}\right)$$

where $c$ is the class (output) and the $A_i$ are the attributes (input features)

Remember some old formulas:

Conditional probability: $P(A|B)$ is the probability of $A$ under the assumption that $B$ took place:

$$P(A|B) = \frac{P(A\cap B)}{P(B)}$$

Bayes formula:

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

$$P(B|A) = \frac{P(B\cap A)}{P(A)}$$

$$P(B\cap A) = P(B|A)P(A) = P(A|B)P(B)$$

Procedures of Naïve Bayes

Example

Construct the probability tables for the input features “Sound”, “Fur” and “Color”.

If sound is bark, fur is coarse and color is brown, what is the most probable class?

Likelihood of Cat = P(bark, coarse, brown | cat) = $\frac{1}{4} \times \frac{1}{4} \times \frac{2}{4} = 0.03125$

Likelihood of Dog = P(bark, coarse, brown | dog) = $\frac{3}{4} \times \frac{3}{4} \times \frac{2}{4} = 0.28125$

The likelihood only considers the observed data, but Naïve Bayes also takes the prior probability into account.

Therefore:

Probability of Cat = P(Cat) $\times$ P(bark, coarse, brown | Cat) = $\frac{4}{8} \times 0.03125 = 0.015625$

Probability of Dog = P(Dog) $\times$ P(bark, coarse, brown | Dog) = $\frac{4}{8} \times 0.28125 = 0.140625$

Normalizing the probabilities:

In fact, in NB we don't need to normalize, since normalization does not change which class scores highest (although there is nothing wrong with doing so):

Cat: $\frac{0.015625}{0.015625 + 0.140625} = 0.1 = 10\%$

Dog: $\frac{0.140625}{0.015625 + 0.140625} = 0.9 = 90\%$

Therefore, the most probable class is Dog.
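A minimal sketch reproducing this arithmetic (the conditional probabilities and priors are the ones used above; variable names are mine):

```python
# P(feature value | class), as read off the probability tables in the example
likelihood = {
    "Dog": {"bark": 3/4, "coarse": 3/4, "brown": 2/4},
    "Cat": {"bark": 1/4, "coarse": 1/4, "brown": 2/4},
}
prior = {"Dog": 4/8, "Cat": 4/8}

# Naive Bayes score: prior times the product of the per-feature conditionals
scores = {}
for cls in prior:
    score = prior[cls]
    for p in likelihood[cls].values():
        score *= p
    scores[cls] = score

total = sum(scores.values())
for cls, s in scores.items():
    print(cls, s, round(s / total, 3))        # Dog 0.140625 0.9, Cat 0.015625 0.1
print("Most probable class:", max(scores, key=scores.get))  # Dog
```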

Summary of Naïve Bayes Classifier

  • Works even if the dataset is small
  • Computationally efficient algorithm
    • Can be applied in real time

Feature independence assumption

  • If it holds, NB can give a good estimate
  • In practical situations, it is almost impossible to have fully independent features
  • Feature engineering: check for correlated features