Data Mining

Data mining is defined as the process of discovering patterns in data.

It uses pattern recognition and machine learning techniques to identify trends within a sample data set.

The process must be automatic or (more usually) semi-automatic.

  • The patterns discovered must be meaningful
    • in that they lead to some advantage, usually an economic advantage.
  • Data mining can provide huge paybacks for companies that have made a significant investment in data warehousing.

Data Warehouse for Data Mining

The existence of a data warehouse is not a prerequisite for data mining.

In practice, the task of data mining, especially for some large companies, is made a lot easier by having access to a data warehouse.

  • The transformations applied to data when a warehouse is created
    • serve as a pre-processing step for data mining.

How do data mining and data warehousing work together?

  • Data warehousing can be used to analyze business needs by storing data in a meaningful form.
  • Using data mining, one can forecast business needs; the data warehouse can act as the source for this forecasting.

Example Applications of Data Mining

  • Retail / Marketing
  • Banking
  • Insurance
  • Medicine
  • Scientific and Engineering Applications

Software for Data Mining

Weka

  • Free, based on Java
  • Provides access to SQL databases using JDBC

Orange

  • Free, based on Python
  • Provides an add-on for text mining

Rattle GUI

  • Free, based on R

Apache Mahout

  • Free, based on Java
  • Works with Hadoop
  • For large scale data mining and big-data analytics

Microsoft Analysis Services

  • Works with Microsoft SQL Server

Microsoft Azure Machine Learning Service

  • Provides predictive analytics and machine learning services on Microsoft’s cloud

Oracle Data Mining

  • Embeds data mining within the Oracle database

Machine Learning

Key Concept

Instances

  • Things to be classified, associated, or clustered.
  • Also called “examples” or tuples in database terminology

It is important to clean up the instances, e.g., by removing records with null or missing values.

Attributes

  • Each instance is characterized by its values on a fixed, predefined set of features or attributes
  • Attributes are columns in database tables
  • Numeric attributes are either real numbers or integer values
    • e.g., height of a person.
  • Nominal (categorical) attributes take on values in a pre-defined and finite set of possibilities
    • e.g., month of year, gender.
  • Math operations can be applied to numeric attributes, but not to nominal attributes.
    • Only comparison operations can be applied to nominal attributes.

Machine Learning Algorithms

1-of-N transformation

Most machine learning algorithms (e.g., SVM and artificial neural networks) can only work on real numbers, not categorical (nominal) attributes.

  • We can convert categorical attributes into real-number vectors, as in the sketch below
    • e.g. Temperature can be {hot, mild, cool}
      • using a vector such that [1 0 0] means hot, [0 1 0] means mild, [0 0 1] means cool
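A minimal Python sketch of this 1-of-N (one-hot) transformation, using the Temperature attribute from the notes (the helper name one_hot is just for illustration):

```python
# 1-of-N (one-hot) encoding sketch for the {hot, mild, cool} attribute.
def one_hot(value, categories):
    """Return a 1-of-N vector with a 1 in the position of `value`."""
    return [1 if value == c else 0 for c in categories]

temperature_values = ["hot", "mild", "cool"]
print(one_hot("hot", temperature_values))   # [1, 0, 0]
print(one_hot("mild", temperature_values))  # [0, 1, 0]
print(one_hot("cool", temperature_values))  # [0, 0, 1]
```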

Unsupervised vs Supervised

Unsupervised Learning

Used to group things

e.g. Clustering, Association

  • inputs are not labelled
  • there is no output label
    • e.g. used to decide whether an image belongs to group 1, group 2, or even group 3

Supervised Learning

Used to classify

e.g. Classification, Regression

  • inputs are labelled
  • there is an output label
    • e.g. used to classify whether an image is label 1 (smile) or label 2 (not smile)

Classification

Classification is the process of learning a function that

  • maps a data item into one of several pre-defined classes,
  • determines the class label of an unknown sample,
  • produces class labels as output.

Classification is used for, or helps with, decision making.

We will go through some popular classifiers.

Naïve-Bayes Classifier

Category: Supervised Learning - Classification

To find the probability of the class given an instance - Bayes’ Rule

\operatorname{Pr}[H | E]=\frac{\operatorname{Pr}[E | H] \operatorname{Pr}[H]}{\operatorname{Pr}[E]}

  • \operatorname{Pr}[H | E] - Posterior
  • \operatorname{Pr}[E | H] - Likelihood
  • \operatorname{Pr}[H] - Prior
  • \operatorname{Pr}[E] - Evidence

where H is the hypothesis (class) and E is the evidence (attributes) \{E_0, E_1, \ldots, E_n\}

Assume that each feature E_i is conditionally independent of every other feature E_j (for j \neq i), given the hypothesis H.

The conditional distribution over the class variable H is

\operatorname{Pr}[H | E]=\frac{\prod_{i=0}^{n} \operatorname{Pr}\left[E_{i} | H\right] \operatorname{Pr}[H]}{\operatorname{Pr}[E]}

Note:

  1. Find the priors and likelihoods
  2. Normalize the probabilities (scale them so they sum to 1)

In more understandable words:

\operatorname{Pr}[\text{Wanted Output} | x]=\frac{\prod_{i=0}^{n} \operatorname{Pr}\left[\text{Input}_{i} | \text{Wanted Output}\right] \operatorname{Pr}[\text{Wanted Output}]}{\operatorname{Pr}[x]}

There is no need to calculate \operatorname{Pr}[x], as it cancels out after normalization.

Example: Naïve-Bayes Classifier

Example: What is the probability that the class is “yes”?

First, count the occurrences of each class label for each attribute value (attribute → output).

Then convert the counts into a table of probabilities for the weather example.

What are the normalized probabilities of the classes given a new instance x if:

  • Outlook = sunny,
  • Temp = cool,
  • Humidity = high and
  • Windy = true?

Plug in the formula:

\operatorname{Pr}[Y | x]=\frac{\operatorname{Pr}[\text{Outlook}=\text{sunny} | Y] \operatorname{Pr}[\text{Temp}=\text{cool} | Y] \operatorname{Pr}[\text{Humidity}=\text{high} | Y] \operatorname{Pr}[\text{Windy}=\text{true} | Y] \operatorname{Pr}[Y]}{\operatorname{Pr}[x]}

Find the priors and likelihoods for both Yes and No case:

\operatorname{Pr}[Y | x] = \frac{\frac{2}{9}\times\frac{3}{9}\times\frac{3}{9}\times\frac{3}{9}\times\frac{9}{14}}{\operatorname{Pr}[x]}

You also need the “No” case for normalization.

\operatorname{Pr}[N | x] = \frac{\frac{3}{5}\times\frac{1}{5}\times\frac{4}{5}\times\frac{3}{5}\times\frac{5}{14}}{\operatorname{Pr}[x]}

\text{Yes} = \frac{2}{9}\times\frac{3}{9}\times\frac{3}{9}\times\frac{3}{9}\times\frac{9}{14} = 0.0053

\text{No} = \frac{3}{5}\times\frac{1}{5}\times\frac{4}{5}\times\frac{3}{5}\times\frac{5}{14} = 0.0206

Normalize the probabilities (scale them so they sum to 1):

\operatorname{Pr}(Y | x) = \frac{0.0053}{0.0053 + 0.0206} = 0.205

\operatorname{Pr}(N | x) = \frac{0.0206}{0.0053 + 0.0206} = 0.795

Note: \operatorname{Pr}(Y | x) + \operatorname{Pr}(N | x) = 1

Based on the probabilities, 0.795 is higher. Therefore x is classified into the “No play” class.

Another Example.

What are the normalized probabilities of the classes given a new instance x if:

  • Outlook = rainy,
  • Temp = hot,
  • Humidity = normal and
  • Windy = true?

Plug in the formula:

\operatorname{Pr}[Y | x]=\frac{\operatorname{Pr}[\text{Outlook}=\text{rainy} | Y] \operatorname{Pr}[\text{Temp}=\text{hot} | Y] \operatorname{Pr}[\text{Humidity}=\text{normal} | Y] \operatorname{Pr}[\text{Windy}=\text{true} | Y] \operatorname{Pr}[Y]}{\operatorname{Pr}[x]}

\operatorname{Pr}[N | x]=\frac{\operatorname{Pr}[\text{Outlook}=\text{rainy} | N] \operatorname{Pr}[\text{Temp}=\text{hot} | N] \operatorname{Pr}[\text{Humidity}=\text{normal} | N] \operatorname{Pr}[\text{Windy}=\text{true} | N] \operatorname{Pr}[N]}{\operatorname{Pr}[x]}

\operatorname{Pr}[Y | x] = \frac{\frac{3}{9}\times\frac{2}{9}\times\frac{6}{9}\times\frac{3}{9}\times\frac{9}{14}}{\operatorname{Pr}[x]}

\operatorname{Pr}[N | x] = \frac{\frac{2}{5}\times\frac{2}{5}\times\frac{1}{5}\times\frac{3}{5}\times\frac{5}{14}}{\operatorname{Pr}[x]}

\text{Yes} = \frac{3}{9}\times\frac{2}{9}\times\frac{6}{9}\times\frac{3}{9}\times\frac{9}{14} = 0.011

\text{No} = \frac{2}{5}\times\frac{2}{5}\times\frac{1}{5}\times\frac{3}{5}\times\frac{5}{14} = 0.007

Normalize the probabilities (scale them so they sum to 1):

\operatorname{Pr}(Y | x) = \frac{0.011}{0.011 + 0.007} = 0.611

\operatorname{Pr}(N | x) = \frac{0.007}{0.011 + 0.007} = 0.389

Note: \operatorname{Pr}(Y | x) + \operatorname{Pr}(N | x) = 1

Based on the probabilities, 0.611 is higher. Therefore x is classified into the “Yes play” class.
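The same calculation can be reproduced with a short, self-contained Python sketch. The priors and likelihoods are the counts from the worked example above; the function and variable names are just for illustration.

```python
# Naive Bayes for the weather example; likelihoods/priors are the counts
# used in the notes above (9 "yes" and 5 "no" instances in total).
likelihoods = {
    "yes": {"outlook=rainy": 3/9, "temp=hot": 2/9,
            "humidity=normal": 6/9, "windy=true": 3/9},
    "no":  {"outlook=rainy": 2/5, "temp=hot": 2/5,
            "humidity=normal": 1/5, "windy=true": 3/5},
}
priors = {"yes": 9/14, "no": 5/14}

def classify(instance):
    """Return normalized posterior probabilities for each class."""
    scores = {}
    for cls in priors:
        score = priors[cls]
        for attribute_value in instance:
            score *= likelihoods[cls][attribute_value]
        scores[cls] = score
    total = sum(scores.values())                 # Pr[x] cancels out here
    return {cls: s / total for cls, s in scores.items()}

x = ["outlook=rainy", "temp=hot", "humidity=normal", "windy=true"]
print(classify(x))   # roughly {'yes': 0.61, 'no': 0.39}
```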

Decision Trees

Category: Supervised Learning - Classification

To make a Decision Tree:

  1. Choose an attribute as the root node first (pick the attribute that leads to the smallest tree, i.e., fewest branches)
  2. Make one branch for each possible value of that attribute
  3. Split the example set into subsets
  4. Repeat recursively until all instances at a node have the same classification

Note :

Smallest tree = fewest branches

Purest node = all data in a node belong to the same class

Entropy measures the “uncertainty” of a probability distribution.

\text{entropy}\left(p_{1}, p_{2}, \ldots, p_{n}\right)=-p_{1} \log _{2}\left(p_{1}\right)-p_{2} \log _{2}\left(p_{2}\right) - \ldots - p_{n} \log _{2}\left(p_{n}\right)

E(X)=-\sum_{x} p(x) \log _{2}(p(x))

Say there are 2 “yes” values and 3 “no” values under an attribute value:

E(2,3) = -p(\text{yes})\log_2(p(\text{yes})) - p(\text{no})\log_2(p(\text{no}))

Afterwards, with the entropy values, the Expected Information can be calculated.

E(T, X)=\sum_{c \in X} p(c) E(c)

Note

Training Set is denoted as T.

Attribute is denoted as X.

Entropy E(X) = E(c) in this case.

p(c) is the probability (proportion) of each attribute value c.

With Entropy and Expected Information, Information Gain can be found.

Information Gain = info before splitting - info after splitting

gain(T_{label},X) = entropy(T_{label}) - entropy(T,X)
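These formulas translate directly into code. Below is a minimal Python sketch (the function names are just for illustration) that reproduces the gain for Outlook computed in the example that follows:

```python
import math

# Entropy and information gain sketch; class counts are given as lists
# like [yes_count, no_count].
def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def expected_info(splits):
    """E(T, X): weighted average entropy after splitting on attribute X."""
    total = sum(sum(split) for split in splits)
    return sum(sum(split) / total * entropy(split) for split in splits)

def gain(label_counts, splits):
    """Information gain = entropy before splitting - entropy after splitting."""
    return entropy(label_counts) - expected_info(splits)

# Weather example: 9 "yes" and 5 "no" overall; Outlook splits the data into
# sunny [2, 3], overcast [4, 0], rainy [3, 2].
print(gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # about 0.247
```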

Example: Decision Trees

Example: Decide the root node of the Decision Tree.

Using the same table.

First, find the entropy for each possible value of each attribute.

Entropy for each value of Outlook:

\operatorname{Entropy}(\text{outlook = sunny}) = E(2,3) = -p(\text{yes})\log_2(p(\text{yes})) - p(\text{no})\log_2(p(\text{no}))

E(2,3) = -\frac{2}{5}\log_2(\frac{2}{5}) - \frac{3}{5}\log_2(\frac{3}{5}) = 0.971

\operatorname{Entropy}(\text{outlook = overcast}) = E(4,0) = -p(\text{yes})\log_2(p(\text{yes})) - p(\text{no})\log_2(p(\text{no}))

E(4,0) = -\frac{4}{4}\log_2(\frac{4}{4}) - \frac{0}{4}\log_2(\frac{0}{4}) = 0 - 0 = 0

\operatorname{Entropy}(\text{outlook = rainy}) = E(3,2) = -p(\text{yes})\log_2(p(\text{yes})) - p(\text{no})\log_2(p(\text{no}))

E(3,2) = -\frac{3}{5}\log_2(\frac{3}{5}) - \frac{2}{5}\log_2(\frac{2}{5}) = 0.971

Note:

To Convert log10 into log2:

\log_2 a = \frac{\log_{10} a}{\log_{10} 2}

\log_2 a = \log(2, a)

or use logarithm quotient rule

\log_2(\frac{x}{y}) = \log_2(x) - \log_2(y)

E(a,a) = 1

E(a,b) = E(b,a)

Entropy for each value of Temperature:

\operatorname{Entropy}(\text{temp = hot}) = E(2,2) = 1

\operatorname{Entropy}(\text{temp = mild}) = E(4,2) = -\frac{4}{6}\log_2(\frac{4}{6}) - \frac{2}{6}\log_2(\frac{2}{6}) = 0.918

\operatorname{Entropy}(\text{temp = cool}) = E(3,1) = -\frac{3}{4}\log_2(\frac{3}{4}) - \frac{1}{4}\log_2(\frac{1}{4}) = 0.811

Entropy for each value of Humidity:

\operatorname{Entropy}(\text{humidity = high}) = E(3,4) = -\frac{3}{7}\log_2(\frac{3}{7}) - \frac{4}{7}\log_2(\frac{4}{7}) = 0.985

\operatorname{Entropy}(\text{humidity = normal}) = E(6,1) = -\frac{6}{7}\log_2(\frac{6}{7}) - \frac{1}{7}\log_2(\frac{1}{7}) = 0.592

Entropy for each value of Windy:

\operatorname{Entropy}(\text{windy = true}) = E(3,3) = 1

\operatorname{Entropy}(\text{windy = false}) = E(6,2) = -\frac{6}{8}\log_2(\frac{6}{8}) - \frac{2}{8}\log_2(\frac{2}{8}) = 0.811

Let’s also find the probability of each value within an attribute:

Probabilities for each value of Outlook:

p(\text{sunny}) = \frac{5}{14}

p(\text{overcast}) = \frac{4}{14}

p(\text{rainy}) = \frac{5}{14}

(The values above will be used later.)

Probabilities for each value of Temperature:

p(\text{hot}) = \frac{4}{14}

p(\text{mild}) = \frac{6}{14}

p(\text{cool}) = \frac{4}{14}

(The values above will be used later.)

Probabilities for each value of Humidity:

p(\text{high}) = \frac{7}{14}

p(\text{normal}) = \frac{7}{14}

(The values above will be used later.)

Probabilities for each value of Windy:

p(\text{true}) = \frac{6}{14}

p(\text{false}) = \frac{8}{14}

(The values above will be used later.)

Then, with the entropy for each value, the Expected Information can be calculated.

E(T, X) = E(\text{training set}, \text{attribute}) = \sum_{c \in X} p(c) E(c)

\text{E(play, outlook)} = info([2,3],[4,0],[3,2])

= p(\text{sunny})E(\text{sunny}) + p(\text{overcast})E(\text{overcast}) + p(\text{rainy})E(\text{rainy})

= p(\text{sunny})E(2,3) + p(\text{overcast})E(4,0) + p(\text{rainy})E(3,2)

= \frac{5}{14}\times 0.971 + \frac{4}{14}\times 0 + \frac{5}{14}\times 0.971

= 0.693

\text{E(play, temperature)} = info([2,2],[4,2],[3,1])

= p(\text{hot})E(\text{hot}) + p(\text{mild})E(\text{mild}) + p(\text{cool})E(\text{cool})

= p(\text{hot})E(2,2) + p(\text{mild})E(4,2) + p(\text{cool})E(3,1)

= \frac{4}{14}\times 1 + \frac{6}{14}\times 0.918 + \frac{4}{14}\times 0.811

= 0.911

\text{E(play, humidity)} = info([3,4],[6,1])

= p(\text{high})E(3,4) + p(\text{normal})E(6,1)

= \frac{7}{14}\times 0.985 + \frac{7}{14}\times 0.592

= 0.788

\text{E(play, windy)} = info([3,3],[6,2])

= p(\text{true})E(3,3) + p(\text{false})E(6,2)

= \frac{6}{14}\times 1 + \frac{8}{14}\times 0.811

= 0.892

With Entropy and Expected Information, Information Gain can be found.

You need to calculate the information gains for all the attributes in the weather example.

gain(T_{label},X) = entropy(T_{label}) - entropy(T,X)

gain([yes,no],X) = gain([9,5],X) = entropy([9,5]) - entropy(T,X)

entropy([9,5]) = -\frac{9}{14}\log_2(\frac{9}{14}) - \frac{5}{14}\log_2(\frac{5}{14}) = 0.940

gain([9,5], \text{outlook}) = entropy([9,5]) - entropy(\text{play}, \text{outlook})

= 0.940 - 0.693 = 0.247

gain([9,5], \text{temp}) = 0.940 - 0.911 = 0.029

gain([9,5], \text{humidity}) = 0.940 - 0.788 = 0.152

gain([9,5], \text{windy}) = 0.940 - 0.892 = 0.048

Choose the attribute with the largest information gain as the node.

Outlook is the winner. Therefore Outlook is selected as the root node.

To continue splitting, repeat the above steps to determine the child nodes until:

  • all nodes are pure (all data in a node belong to the same class)

The final decision tree looks like this:

Support Vector Machines (SVM)

Category: Supervised Learning - Classification

Vector Machine

  • Might remove the influence of patterns that are far away from the decision boundary
    • their influence is usually small
  • May also select only a few important data points (called support vectors) and weight them differently.
  • With a Support Vector Machine, we aim to find a decision plane that maximizes the margin.

Support Vector Machine

The decision plane is determined by the two closest vectors (the support vectors): it lies midway between them, perpendicular to the line joining them.

In case the training data X are not linearly separable, we may use a non-linear function \phi(x) to map the data from the input space to a new high-dimensional space (called the feature space \phi) where the data become linearly separable.
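As a concrete (hypothetical) illustration, scikit-learn’s SVC can fit such a maximum-margin classifier with a non-linear RBF kernel; the tiny XOR-style dataset below is made up for the example:

```python
# SVM sketch with a non-linear RBF kernel (scikit-learn); the toy data is
# made up and not linearly separable in the input space.
from sklearn.svm import SVC

X = [[0, 0], [0, 1], [1, 0], [1, 1],
     [0.1, 0.1], [0.9, 0.9], [0.1, 0.9], [0.9, 0.1]]
y = [0, 1, 1, 0, 0, 0, 1, 1]

clf = SVC(kernel="rbf", C=1.0)      # the RBF kernel plays the role of phi(x)
clf.fit(X, y)

print(clf.predict([[0.2, 0.8]]))    # classify a point near the (0, 1) corner
print(clf.support_vectors_)         # the few "important" points kept by the SVM
```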

K-Nearest Neighbour (K-NN)

Category: Supervised Learning - Classification

  1. Assign K a value – preferably a small odd number
  2. Find the K closest points (nearest neighbours)
  3. Assign the new point to the majority class among those neighbours

TL;DR: Within a specific range (neighbourhood), the class of a new instance = the majority class in that range.
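A minimal Python sketch of the K-NN procedure above; the toy data and helper name are made up for illustration:

```python
import math
from collections import Counter

def knn_classify(query, data, labels, k=3):
    """Classify `query` by majority vote among its k nearest neighbours."""
    # Step 2: find the K closest points (Euclidean distance)
    nearest = sorted(range(len(data)),
                     key=lambda i: math.dist(query, data[i]))[:k]
    # Step 3: assign the new point to the majority class among those neighbours
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

data = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_classify([2, 2], data, labels, k=3))  # "A"
```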

Common Disadvantage using K-Nearest neighbour:

  • Classification will take a long time when the training data set is very large
    • because the algorithm has no separate training/learning stage; all the work happens at classification time

Neural Networks (NN)

Category: Supervised Learning - Classification

Feed-forward networks with several nonlinear hidden layers.

The output can be considered as the posterior probability of the class given the input, \operatorname{Pr}(\text{Class} = i | x), where x is the instance and Class ranges over the possible labels.

Deep Learning = Neural Networks (NN) with a lot of hidden layers in the middle.
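A hypothetical sketch of a small feed-forward network using scikit-learn’s MLPClassifier (the XOR-style toy data is made up; predict_proba plays the role of \operatorname{Pr}(\text{Class} = i | x)):

```python
# Feed-forward neural network sketch with two nonlinear hidden layers.
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]    # XOR-like labels: a single linear boundary cannot separate them

clf = MLPClassifier(hidden_layer_sizes=(8, 8), max_iter=5000, random_state=0)
clf.fit(X, y)

print(clf.predict([[0, 1]]))        # predicted class label
print(clf.predict_proba([[0, 1]]))  # estimated posterior probabilities per class
```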

Prediction

  • Implements real-valued mapping functions
  • Outcomes are numerical values

Linear regression

For modeling the relationship between a scalar dependent variable y and one or more explanatory variables denoted X.

Given an unknown multi-dimensional input x, the predicted output of a linear regression model is

f(\mathbf{x})=\mathbf{a}^{\mathrm{T}} \mathbf{x}+b

You assign weightings to the variables.

One real-life example is course assessment weightings.
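A minimal sketch of fitting f(x) = a^T x + b by least squares with NumPy; the course-weighting numbers are made up for illustration:

```python
import numpy as np

# x = [assignment score, exam score]; y = final course grade
X = np.array([[80.0, 60.0], [90.0, 70.0], [50.0, 40.0], [70.0, 90.0]])
y = 0.4 * X[:, 0] + 0.6 * X[:, 1]            # assumed true weighting: 40% / 60%

# Append a column of ones so the intercept b is learned as well
X_aug = np.hstack([X, np.ones((len(X), 1))])
coeffs, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
a, b = coeffs[:-1], coeffs[-1]

print(a)            # close to [0.4, 0.6]
print(b)            # close to 0
print(X @ a + b)    # predicted grades
```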

Cluster Analysis

  • Divide input data into a number of groups
  • Used for finding groups of examples that belong together
  • Based on unsupervised learning

To group a set of data with similar characteristics to form clusters

  1. The process of assigning samples to clusters is iterative.
  2. In each iteration, the samples in the clusters are redefined to better represent the structure of the data.
  3. After the clustering process, each cluster can be considered a summary of a large number of samples.
  4. Thus, clusters can help in making faster decisions. They also help to identify different groups in the data.

Cluster analysis is widely used in market research when working with multivariate data from surveys. Market researchers use cluster analysis to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers/potential customers.

K-Means Clustering

An unsupervised learning method

Labels are not required.

TL;DR: Keep updating the location of each cluster centre; some samples may change cluster as the centres move.
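A minimal sketch of the K-Means idea (Lloyd’s algorithm): repeatedly reassign samples to their nearest centre, then move each centre to the mean of its samples. The data and starting centres below are made up.

```python
import math

points = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
centres = [[1, 1], [2, 1]]               # k = 2, arbitrary initial centres

for _ in range(10):                      # a few iterations are enough here
    # Assignment step: label each point with the index of its nearest centre
    labels = [min(range(len(centres)), key=lambda c: math.dist(p, centres[c]))
              for p in points]
    # Update step: move each centre to the mean of the points assigned to it
    for c in range(len(centres)):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            centres[c] = [sum(dim) / len(members) for dim in zip(*members)]

print(labels)    # e.g. [0, 0, 0, 1, 1, 1]
print(centres)   # roughly [[1.33, 1.33], [8.33, 8.33]]
```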

Associative-Rule Learning

  • Finding association rules among non-numeric attributes
  • Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Itemset

  • A collection of one or more items
    • E.g., {Milk, Bread, Diaper} <- 3-itemset
  • An itemset containing k items is called a k-itemset

Support

  • Percentage of transactions containing an itemset

  • Indication of how frequently the items appear in the data

  • s = \frac{\text{Occurrence(X and Y)}}{\text{Total Transactions}}

Association Rule

  • X→Y, where X and Y are itemsets

Confidence of Association Rule

  • Verify the Confidence of Association Rule

  • Indicates how often the if-then statement is found to be true

  • c = \frac{\text{Occurrence(X and Y)}}{\text{Occurrence(X)}}

Is the Association Rule Good?

  • Check whether both Confidence and Support are higher than their thresholds (see the sketch below)
  • Support s > minsup
  • Confidence c > minconf
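A minimal Python sketch of these two measures. The five transactions are an assumption, chosen so that the counts match the worked example below:

```python
# Support and confidence sketch; the transaction list is assumed.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def occurrence(itemset):
    return sum(1 for t in transactions if itemset <= t)   # subset test

def support(X, Y):
    return occurrence(X | Y) / len(transactions)

def confidence(X, Y):
    return occurrence(X | Y) / occurrence(X)

X, Y = {"Milk", "Diaper"}, {"Beer"}
print(support(X, Y))      # 2/5 = 0.4   (40%)
print(confidence(X, Y))   # 2/3 ≈ 0.667 (66.67%)
```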

Example: Associative-Rule Learning

Find the support and confidence for the association rule {Milk, Diaper}→{Beer}. If both the thresholds for minsup and minconf are 60%, does this rule qualify? Why?

Support s = \frac{\text{Occurrence(Milk, Diaper and Beer)}}{\text{Total Transactions}} = \frac{2}{5} = 40\%, which is below 60%.

Confidence c = \frac{\text{Occurrence(Milk, Diaper and Beer)}}{\text{Occurrence(Milk, Diaper)}} = \frac{2}{3} = 66.67\%, which meets the threshold.

This rule does not qualify because its support fails to meet the threshold.