Data Mining

Data mining is defined as the process of discovering patterns in data.

It uses pattern recognition and machine learning techniques to identify trends within a sample data set.

The process must be automatic or (more usually) semi-automatic.

  • The patterns discovered must be meaningful
    • in that they lead to some advantage, usually an economic advantage.
  • Data mining can provide huge paybacks for companies that have made a significant investment in data warehousing.

Data Warehouse for Data Mining

The existence of a data warehouse is not a prerequisite for data mining.

In practice, the task of data mining, especially for some large companies, is made a lot easier by having access to a data warehouse.

  • The transformations applied to data when a warehouse is created
    • serve as a pre-processing step for data mining.

How do data mining and data warehousing work together?

  • Data warehousing can be used to analyze business needs by storing data in a meaningful form.
  • Using data mining, one can forecast business needs; the data warehouse can act as the source for this forecasting.

Example Applications of Data Mining

  • Retail / Marketing
  • Banking
  • Insurance
  • Medicine
  • Scientific and Engineering Applications

Software for Data Mining

Weka

  • Free, based on Java
  • Provides access to SQL databases using JDBC

Orange

  • Free, based on Python
  • Provides an add-on for text mining

Rattle GUI

  • Free, based on R

Apache Mahout

  • Free, based on Java
  • Works with Hadoop
  • For large scale data mining and big-data analytics

Microsoft Analysis Services

  • Works with Microsoft SQL Server

Microsoft Azure Machine Learning Service

  • Provides predictive analytics and machine learning services on Microsoft’s cloud

Oracle Data Mining

  • Embeds data mining within the Oracle database

Machine Learning

Key Concept

Instances

  • Things to be classified, associated, or clustered.
  • Also called “examples” or tuples in database terminology

It is important to clean up the instances, e.g., by removing records with null or missing values.

Attributes

  • Each instance is characterized by its values on a fixed, predefined set of features or attributes
  • Attributes are columns in database tables
  • Numeric attributes are either real numbers or integer values
    • e.g., height of a person.
  • Nominal (categorical) attributes take on values in a pre-defined and finite set of possibilities
    • e.g., month of year, gender.
  • Math operations can be applied to numeric attributes, but not to nominal attributes.
    • Only comparison operations can be applied to nominal attributes.

Machine Learning Algorithms

1-of-N transformation

Most machine learning algorithms (e.g., SVM and artificial neural networks) can only work on real numbers, not categorical (nominal) attributes.

  • We can convert categorical attributes into real-number vectors, as in the sketch below
    • e.g. Temperature can be {hot, mild, cool}
      • using a vector such that [1 0 0] means hot, [0 1 0] means mild, [0 0 1] means cool
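A minimal Python sketch of this 1-of-N (one-hot) transformation, using the Temperature attribute from the notes (the helper name one_hot is just for illustration):

```python
# 1-of-N (one-hot) encoding sketch for the {hot, mild, cool} attribute.
def one_hot(value, categories):
    """Return a 1-of-N vector with a 1 in the position of `value`."""
    return [1 if value == c else 0 for c in categories]

temperature_values = ["hot", "mild", "cool"]
print(one_hot("hot", temperature_values))   # [1, 0, 0]
print(one_hot("mild", temperature_values))  # [0, 1, 0]
print(one_hot("cool", temperature_values))  # [0, 0, 1]
```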

Unsupervised vs Supervised

Unsupervised Learning

Used to group things

e.g. Clustering, Association

  • inputs are not labelled
  • there is no output label
    • e.g. used to decide whether an image belongs to group 1, group 2, or even group 3

Supervised Learning

Used to classify

e.g. Classification, Regression

  • inputs are labelled
  • there is an output label
    • e.g. used to classify whether an image is label 1 (smile) or label 2 (not smile)

Classification

Classification is the process of learning a function that

  • maps a data item into one of several pre-defined classes,
  • determines the class label of an unknown sample,
  • produces class labels as output.

Classification is used for, or helps with, decision making.

We will go through some popular classifiers.

Naïve-Bayes Classifier

Category: Supervised Learning - Classification

To find the probability of the class given an instance - Bayes’ Rule

\operatorname{Pr}[H | E]=\frac{\operatorname{Pr}[E | H] \operatorname{Pr}[H]}{\operatorname{Pr}[E]}

  • \operatorname{Pr}[H | E] - Posterior
  • \operatorname{Pr}[E | H] - Likelihood
  • \operatorname{Pr}[H] - Prior
  • \operatorname{Pr}[E] - Evidence

where H is the hypothesis (class) and E is the evidence (attributes) \{E_0, E_1, \ldots, E_n\}

Assume that each feature E_i is conditionally independent of every other feature E_j (for j \neq i), given the hypothesis H.

The conditional distribution over the class variable H is

\operatorname{Pr}[H | E]=\frac{\prod_{i=0}^{n} \operatorname{Pr}\left[E_{i} | H\right] \operatorname{Pr}[H]}{\operatorname{Pr}[E]}

Note:

  1. Find the priors and likelihoods
  2. Normalize the probabilities (scale them so they sum to 1)

In more understandable words:

\operatorname{Pr}[\text{Wanted Output} | x]=\frac{\prod_{i=0}^{n} \operatorname{Pr}\left[\text{Input}_{i} | \text{Wanted Output}\right] \operatorname{Pr}[\text{Wanted Output}]}{\operatorname{Pr}[x]}

There is no need to calculate \operatorname{Pr}[x], as it cancels out after normalization.

Example: Naïve-Bayes Classifier

Example: What is the probability that the class is “yes”?

First, count the occurrences of each class label for each attribute value (attribute → output).

Then convert the counts into a table of probabilities for the weather example.

What are the normalized probabilities of the classes given a new instance x if:

  • Outlook = sunny,
  • Temp = cool,
  • Humidity = high and
  • Windy = true?

Plug in the formula:

\operatorname{Pr}[Y | x]=\frac{\operatorname{Pr}[\text{Outlook}=\text{sunny} | Y] \operatorname{Pr}[\text{Temp}=\text{cool} | Y] \operatorname{Pr}[\text{Humidity}=\text{high} | Y] \operatorname{Pr}[\text{Windy}=\text{true} | Y] \operatorname{Pr}[Y]}{\operatorname{Pr}[x]}

Find the priors and likelihoods for both Yes and No case:

\operatorname{Pr}[Y | x] = \frac{\frac{2}{9}\times\frac{3}{9}\times\frac{3}{9}\times\frac{3}{9}\times\frac{9}{14}}{\operatorname{Pr}[x]}

You also need the “No” case for normalization.

\operatorname{Pr}[N | x] = \frac{\frac{3}{5}\times\frac{1}{5}\times\frac{4}{5}\times\frac{3}{5}\times\frac{5}{14}}{\operatorname{Pr}[x]}

\text{Yes} = \frac{2}{9}\times\frac{3}{9}\times\frac{3}{9}\times\frac{3}{9}\times\frac{9}{14} = 0.0053

\text{No} = \frac{3}{5}\times\frac{1}{5}\times\frac{4}{5}\times\frac{3}{5}\times\frac{5}{14} = 0.0206

Normalize the probabilities (scale them so they sum to 1):

\operatorname{Pr}(Y | x) = \frac{0.0053}{0.0053 + 0.0206} = 0.205

\operatorname{Pr}(N | x) = \frac{0.0206}{0.0053 + 0.0206} = 0.795

Note: \operatorname{Pr}(Y | x) + \operatorname{Pr}(N | x) = 1

Based on the probabilities, 0.795 is higher. Therefore x is classified into the “No play” class.

Another Example.

What are the normalized probabilities of the classes given a new instance x if:

  • Outlook = rainy,
  • Temp = hot,
  • Humidity = normal and
  • Windy = true?

Plug in the formula:

\operatorname{Pr}[Y | x]=\frac{\operatorname{Pr}[\text{Outlook}=\text{rainy} | Y] \operatorname{Pr}[\text{Temp}=\text{hot} | Y] \operatorname{Pr}[\text{Humidity}=\text{normal} | Y] \operatorname{Pr}[\text{Windy}=\text{true} | Y] \operatorname{Pr}[Y]}{\operatorname{Pr}[x]}

\operatorname{Pr}[N | x]=\frac{\operatorname{Pr}[\text{Outlook}=\text{rainy} | N] \operatorname{Pr}[\text{Temp}=\text{hot} | N] \operatorname{Pr}[\text{Humidity}=\text{normal} | N] \operatorname{Pr}[\text{Windy}=\text{true} | N] \operatorname{Pr}[N]}{\operatorname{Pr}[x]}

\operatorname{Pr}[Y | x] = \frac{\frac{3}{9}\times\frac{2}{9}\times\frac{6}{9}\times\frac{3}{9}\times\frac{9}{14}}{\operatorname{Pr}[x]}

\operatorname{Pr}[N | x] = \frac{\frac{2}{5}\times\frac{2}{5}\times\frac{1}{5}\times\frac{3}{5}\times\frac{5}{14}}{\operatorname{Pr}[x]}

\text{Yes} = \frac{3}{9}\times\frac{2}{9}\times\frac{6}{9}\times\frac{3}{9}\times\frac{9}{14} = 0.011

\text{No} = \frac{2}{5}\times\frac{2}{5}\times\frac{1}{5}\times\frac{3}{5}\times\frac{5}{14} = 0.007

Normalize the probabilities (scale them so they sum to 1):

\operatorname{Pr}(Y | x) = \frac{0.011}{0.011 + 0.007} = 0.611

\operatorname{Pr}(N | x) = \frac{0.007}{0.011 + 0.007} = 0.389

Note: \operatorname{Pr}(Y | x) + \operatorname{Pr}(N | x) = 1

Based on the probabilities, 0.611 is higher. Therefore x is classified into the “Yes play” class.
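The same calculation can be reproduced with a short, self-contained Python sketch. The priors and likelihoods are the counts from the worked example above; the function and variable names are just for illustration.

```python
# Naive Bayes for the weather example; likelihoods/priors are the counts
# used in the notes above (9 "yes" and 5 "no" instances in total).
likelihoods = {
    "yes": {"outlook=rainy": 3/9, "temp=hot": 2/9,
            "humidity=normal": 6/9, "windy=true": 3/9},
    "no":  {"outlook=rainy": 2/5, "temp=hot": 2/5,
            "humidity=normal": 1/5, "windy=true": 3/5},
}
priors = {"yes": 9/14, "no": 5/14}

def classify(instance):
    """Return normalized posterior probabilities for each class."""
    scores = {}
    for cls in priors:
        score = priors[cls]
        for attribute_value in instance:
            score *= likelihoods[cls][attribute_value]
        scores[cls] = score
    total = sum(scores.values())                 # Pr[x] cancels out here
    return {cls: s / total for cls, s in scores.items()}

x = ["outlook=rainy", "temp=hot", "humidity=normal", "windy=true"]
print(classify(x))   # roughly {'yes': 0.61, 'no': 0.39}
```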

Decision Trees

Category: Supervised Learning - Classification

To make a Decision Tree:

  1. Choose an attribute as the root node first (pick the attribute that leads to the smallest tree, i.e., fewest branches)
  2. Make one branch for each possible value of that attribute
  3. Split the example set into subsets
  4. Repeat recursively until all instances at a node have the same classification

Note :

Smallest tree = fewest branches

Purest node = all data in a node belong to the same class

Entropy measures the “uncertainty” of a probability distribution.

\text{entropy}\left(p_{1}, p_{2}, \ldots, p_{n}\right)=-p_{1} \log _{2}\left(p_{1}\right)-p_{2} \log _{2}\left(p_{2}\right) - \ldots - p_{n} \log _{2}\left(p_{n}\right)

E(X)=-\sum_{x} p(x) \log _{2}(p(x))

Say there are 2 “yes” values and 3 “no” values under an attribute value:

E(2,3) = -p(\text{yes})\log_2(p(\text{yes})) - p(\text{no})\log_2(p(\text{no}))

Afterwards, with the entropy values, the Expected Information can be calculated.

E(T, X)=\sum_{c \in X} p(c) E(c)

Note

Training Set is denoted as T.

Attribute is denoted as X.

Entropy E(X) = E(c) in this case.

p(c) is the probability (proportion) of each attribute value c.

With Entropy and Expected Information, Information Gain can be found.

Information Gain = info before splitting - info after splitting

gain(T_{label},X) = entropy(T_{label}) - entropy(T,X)
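These formulas translate directly into code. Below is a minimal Python sketch (the function names are just for illustration) that reproduces the gain for Outlook computed in the example that follows:

```python
import math

# Entropy and information gain sketch; class counts are given as lists
# like [yes_count, no_count].
def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def expected_info(splits):
    """E(T, X): weighted average entropy after splitting on attribute X."""
    total = sum(sum(split) for split in splits)
    return sum(sum(split) / total * entropy(split) for split in splits)

def gain(label_counts, splits):
    """Information gain = entropy before splitting - entropy after splitting."""
    return entropy(label_counts) - expected_info(splits)

# Weather example: 9 "yes" and 5 "no" overall; Outlook splits the data into
# sunny [2, 3], overcast [4, 0], rainy [3, 2].
print(gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # about 0.247
```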

Example: Decision Trees

Example: Decide the root node of the Decision Tree.

Using the same table.

First, find the entropy for each possible value of each attribute.

Entropy for each value of Outlook:

\operatorname{Entropy}(\text{outlook = sunny}) = E(2,3) = -p(\text{yes})\log_2(p(\text{yes})) - p(\text{no})\log_2(p(\text{no}))

E(2,3) = -\frac{2}{5}\log_2(\frac{2}{5}) - \frac{3}{5}\log_2(\frac{3}{5}) = 0.971

\operatorname{Entropy}(\text{outlook = overcast}) = E(4,0) = -p(\text{yes})\log_2(p(\text{yes})) - p(\text{no})\log_2(p(\text{no}))

E(4,0) = -\frac{4}{4}\log_2(\frac{4}{4}) - \frac{0}{4}\log_2(\frac{0}{4}) = 0 - 0 = 0

\operatorname{Entropy}(\text{outlook = rainy}) = E(3,2) = -p(\text{yes})\log_2(p(\text{yes})) - p(\text{no})\log_2(p(\text{no}))

E(3,2) = -\frac{3}{5}\log_2(\frac{3}{5}) - \frac{2}{5}\log_2(\frac{2}{5}) = 0.971

Note:

To Convert log10 into log2:

\log_2 a = \frac{\log_{10} a}{\log_{10} 2}

\log_2 a = \log(2, a)

or use logarithm quotient rule

\log_2(\frac{x}{y}) = \log_2(x) - \log_2(y)

E(a,a) = 1

E(a,b) = E(b,a)

Entropy for each value of Temperature:

\operatorname{Entropy}(\text{temp = hot}) = E(2,2) = 1

\operatorname{Entropy}(\text{temp = mild}) = E(4,2) = -\frac{4}{6}\log_2(\frac{4}{6}) - \frac{2}{6}\log_2(\frac{2}{6}) = 0.918

\operatorname{Entropy}(\text{temp = cool}) = E(3,1) = -\frac{3}{4}\log_2(\frac{3}{4}) - \frac{1}{4}\log_2(\frac{1}{4}) = 0.811

Entropy for each value of Humidity:

\operatorname{Entropy}(\text{humidity = high}) = E(3,4) = -\frac{3}{7}\log_2(\frac{3}{7}) - \frac{4}{7}\log_2(\frac{4}{7}) = 0.985

\operatorname{Entropy}(\text{humidity = normal}) = E(6,1) = -\frac{6}{7}\log_2(\frac{6}{7}) - \frac{1}{7}\log_2(\frac{1}{7}) = 0.592

Entropy for each value of Windy:

\operatorname{Entropy}(\text{windy = true}) = E(3,3) = 1

\operatorname{Entropy}(\text{windy = false}) = E(6,2) = -\frac{6}{8}\log_2(\frac{6}{8}) - \frac{2}{8}\log_2(\frac{2}{8}) = 0.811

Let’s also find the probability of each value within an attribute:

Probabilities for each value of Outlook:

p(\text{sunny}) = \frac{5}{14}

p(\text{overcast}) = \frac{4}{14}

p(\text{rainy}) = \frac{5}{14}

(The values above will be used later.)

Probabilities for each value of Temperature:

p(\text{hot}) = \frac{4}{14}

p(\text{mild}) = \frac{6}{14}

p(\text{cool}) = \frac{4}{14}

(The values above will be used later.)

Probabilities for each value of Humidity:

p(\text{high}) = \frac{7}{14}

p(\text{normal}) = \frac{7}{14}

(The values above will be used later.)

Probabilities for each value of Windy:

p(\text{true}) = \frac{6}{14}

p(\text{false}) = \frac{8}{14}

(The values above will be used later.)

Then, with the entropy for each value, the Expected Information can be calculated.

E(T, X) = E(\text{training set}, \text{attribute}) = \sum_{c \in X} p(c) E(c)

\text{E(play, outlook)} = info([2,3],[4,0],[3,2])

= p(\text{sunny})E(\text{sunny}) + p(\text{overcast})E(\text{overcast}) + p(\text{rainy})E(\text{rainy})

= p(\text{sunny})E(2,3) + p(\text{overcast})E(4,0) + p(\text{rainy})E(3,2)

= \frac{5}{14}\times 0.971 + \frac{4}{14}\times 0 + \frac{5}{14}\times 0.971

= 0.693

\text{E(play, temperature)} = info([2,2],[4,2],[3,1])

= p(\text{hot})E(\text{hot}) + p(\text{mild})E(\text{mild}) + p(\text{cool})E(\text{cool})

= p(\text{hot})E(2,2) + p(\text{mild})E(4,2) + p(\text{cool})E(3,1)

= \frac{4}{14}\times 1 + \frac{6}{14}\times 0.918 + \frac{4}{14}\times 0.811

= 0.911

\text{E(play, humidity)} = info([3,4],[6,1])

= p(\text{high})E(3,4) + p(\text{normal})E(6,1)

= \frac{7}{14}\times 0.985 + \frac{7}{14}\times 0.592

= 0.788

\text{E(play, windy)} = info([3,3],[6,2])

= p(\text{true})E(3,3) + p(\text{false})E(6,2)

= \frac{6}{14}\times 1 + \frac{8}{14}\times 0.811

= 0.892

With Entropy and Expected Information, Information Gain can be found.

You need to calculate the information gains for all the attributes in the weather example.

gain(T_{label},X) = entropy(T_{label}) - entropy(T,X)

gain([yes,no],X) = gain([9,5],X) = entropy([9,5]) - entropy(T,X)

entropy([9,5]) = -\frac{9}{14}\log_2(\frac{9}{14}) - \frac{5}{14}\log_2(\frac{5}{14}) = 0.940

gain([9,5], \text{outlook}) = entropy([9,5]) - entropy(\text{play}, \text{outlook})

= 0.940 - 0.693 = 0.247

gain([9,5], \text{temp}) = 0.940 - 0.911 = 0.029

gain([9,5], \text{humidity}) = 0.940 - 0.788 = 0.152

gain([9,5], \text{windy}) = 0.940 - 0.892 = 0.048

Choose the attribute with the largest information gain as the node.

Outlook is the winner. Therefore Outlook is selected as the root node.

To continue splitting, repeat the above steps to determine the child nodes until:

  • all nodes are pure (all data in a node belong to the same class)

The final decision tree looks like this:

Support Vector Machines (SVM)

Category: Supervised Learning - Classification

Vector Machine

  • Might remove the influence of patterns that are far away from the decision boundary
    • their influence is usually small
  • May also select only a few important data points (called support vectors) and weight them differently.
  • With a Support Vector Machine, we aim to find a decision plane that maximizes the margin.

Support Vector Machine

The decision plane is determined by the two closest vectors (the support vectors): it lies midway between them, perpendicular to the line joining them.

In case the training data X are not linearly separable, we may use a non-linear function \phi(x) to map the data from the input space to a new high-dimensional space (called the feature space \phi) where the data become linearly separable.
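As a concrete (hypothetical) illustration, scikit-learn’s SVC can fit such a maximum-margin classifier with a non-linear RBF kernel; the tiny XOR-style dataset below is made up for the example:

```python
# SVM sketch with a non-linear RBF kernel (scikit-learn); the toy data is
# made up and not linearly separable in the input space.
from sklearn.svm import SVC

X = [[0, 0], [0, 1], [1, 0], [1, 1],
     [0.1, 0.1], [0.9, 0.9], [0.1, 0.9], [0.9, 0.1]]
y = [0, 1, 1, 0, 0, 0, 1, 1]

clf = SVC(kernel="rbf", C=1.0)      # the RBF kernel plays the role of phi(x)
clf.fit(X, y)

print(clf.predict([[0.2, 0.8]]))    # classify a point near the (0, 1) corner
print(clf.support_vectors_)         # the few "important" points kept by the SVM
```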

K-Nearest Neighbour (K-NN)

Category: Supervised Learning - Classification

  1. Assign K a value – preferably a small odd number
  2. Find the K closest points (nearest neighbours)
  3. Assign the new point to the majority class among those neighbours

TL;DR: Within a specific range (neighbourhood), the class of a new instance = the majority class in that range.
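A minimal Python sketch of the K-NN procedure above; the toy data and helper name are made up for illustration:

```python
import math
from collections import Counter

def knn_classify(query, data, labels, k=3):
    """Classify `query` by majority vote among its k nearest neighbours."""
    # Step 2: find the K closest points (Euclidean distance)
    nearest = sorted(range(len(data)),
                     key=lambda i: math.dist(query, data[i]))[:k]
    # Step 3: assign the new point to the majority class among those neighbours
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

data = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_classify([2, 2], data, labels, k=3))  # "A"
```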

Common Disadvantage using K-Nearest neighbour:

  • Classification will take a long time when the training data set is very large
    • because the algorithm has no separate training/learning stage; all the work happens at classification time

Neural Networks (NN)

Category: Supervised Learning - Classification

Feed-forward networks with several nonlinear hidden layers.

The output can be considered as the posterior probability of the class given the input, \operatorname{Pr}(\text{Class} = i | x), where x is the instance and Class ranges over the possible labels.

Deep Learning = Neural Networks (NN) with a lot of hidden layers in the middle.
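A hypothetical sketch of a small feed-forward network using scikit-learn’s MLPClassifier (the XOR-style toy data is made up; predict_proba plays the role of \operatorname{Pr}(\text{Class} = i | x)):

```python
# Feed-forward neural network sketch with two nonlinear hidden layers.
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]    # XOR-like labels: a single linear boundary cannot separate them

clf = MLPClassifier(hidden_layer_sizes=(8, 8), max_iter=5000, random_state=0)
clf.fit(X, y)

print(clf.predict([[0, 1]]))        # predicted class label
print(clf.predict_proba([[0, 1]]))  # estimated posterior probabilities per class
```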

Prediction

  • Implements real-valued mapping functions
  • Outcomes are numerical values

Linear regression

For modeling the relationship between a scalar dependent variable y and one or more explanatory variables denoted X.

Given an unknown multi-dimensional input x, the predicted output of a linear regression model is

f(\mathbf{x})=\mathbf{a}^{\mathrm{T}} \mathbf{x}+b

You assign weightings to the variables.

One real-life example is course assessment weightings.
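A minimal sketch of fitting f(x) = a^T x + b by least squares with NumPy; the course-weighting numbers are made up for illustration:

```python
import numpy as np

# x = [assignment score, exam score]; y = final course grade
X = np.array([[80.0, 60.0], [90.0, 70.0], [50.0, 40.0], [70.0, 90.0]])
y = 0.4 * X[:, 0] + 0.6 * X[:, 1]            # assumed true weighting: 40% / 60%

# Append a column of ones so the intercept b is learned as well
X_aug = np.hstack([X, np.ones((len(X), 1))])
coeffs, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
a, b = coeffs[:-1], coeffs[-1]

print(a)            # close to [0.4, 0.6]
print(b)            # close to 0
print(X @ a + b)    # predicted grades
```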

Cluster Analysis

  • Divide input data into a number of groups
  • Used for finding groups of examples that belong together
  • Based on unsupervised learning

To group a set of data with similar characteristics to form clusters

  1. The process of assigning samples to clusters is iterative.
  2. In each iteration, the samples in the clusters are redefined to better represent the structure of the data.
  3. After the clustering process, each cluster can be considered a summary of a large number of samples.
  4. Thus, clusters can help in making faster decisions. They also help to identify different groups in the data.

Cluster analysis is widely used in market research when working with multivariate data from surveys. Market researchers use cluster analysis to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers/potential customers.

K-Means Clustering

An unsupervised learning method

Labels are not required.

TL;DR: Keep updating the location of each cluster centre; some samples may change cluster as the centres move.
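A minimal sketch of the K-Means idea (Lloyd’s algorithm): repeatedly reassign samples to their nearest centre, then move each centre to the mean of its samples. The data and starting centres below are made up.

```python
import math

points = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
centres = [[1, 1], [2, 1]]               # k = 2, arbitrary initial centres

for _ in range(10):                      # a few iterations are enough here
    # Assignment step: label each point with the index of its nearest centre
    labels = [min(range(len(centres)), key=lambda c: math.dist(p, centres[c]))
              for p in points]
    # Update step: move each centre to the mean of the points assigned to it
    for c in range(len(centres)):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            centres[c] = [sum(dim) / len(members) for dim in zip(*members)]

print(labels)    # e.g. [0, 0, 0, 1, 1, 1]
print(centres)   # roughly [[1.33, 1.33], [8.33, 8.33]]
```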

Associative-Rule Learning

  • Finding association rules among non-numeric attributes
  • Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Itemset

  • A collection of one or more items
    • E.g., {Milk, Bread, Diaper} <- 3-itemset
  • An itemset containing k items is called a k-itemset

Support

  • Percentage of transactions containing an itemset

  • Indication of how frequently the items appear in the data

  • s = \frac{\text{Occurrence(X and Y)}}{\text{Total Transactions}}

Association Rule

  • X→Y, where X and Y are itemsets

Confidence of Association Rule

  • Verify the Confidence of Association Rule

  • Indicates how often the if-then statement is found to be true

  • c = \frac{\text{Occurrence(X and Y)}}{\text{Occurrence(X)}}

Is the Association Rule Good?

  • Check whether both Confidence and Support are higher than their thresholds (see the sketch below)
  • Support s > minsup
  • Confidence c > minconf
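A minimal Python sketch of these two measures. The five transactions are an assumption, chosen so that the counts match the worked example below:

```python
# Support and confidence sketch; the transaction list is assumed.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def occurrence(itemset):
    return sum(1 for t in transactions if itemset <= t)   # subset test

def support(X, Y):
    return occurrence(X | Y) / len(transactions)

def confidence(X, Y):
    return occurrence(X | Y) / occurrence(X)

X, Y = {"Milk", "Diaper"}, {"Beer"}
print(support(X, Y))      # 2/5 = 0.4   (40%)
print(confidence(X, Y))   # 2/3 ≈ 0.667 (66.67%)
```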

Example: Associative-Rule Learning

Find the support and confidence for the association rule {Milk, Diaper}→{Beer}. If both the thresholds for minsup and minconf are 60%, does this rule qualify? Why?

Support s = \frac{\text{Occurrence(Milk, Diaper and Beer)}}{\text{Total Transactions}} = \frac{2}{5} = 40\%, which is below 60%.

Confidence c = \frac{\text{Occurrence(Milk, Diaper and Beer)}}{\text{Occurrence(Milk, Diaper)}} = \frac{2}{3} = 66.67\%, which meets the threshold.

This rule does not qualify because its support fails to meet the threshold.