Probability for Data Science

What is Probability?

Probability is the likelihood of an event occurring. This event can be pretty much anything – getting heads, rolling a 4. We measure probability with numeric values between 0 and 1, which lets us compare the relative likelihood of events. Observe the general probability formula.

Probability quantifies how likely each event is on its own.

P(X) = \frac{\text{Preferred outcomes}}{\text{Sample space}}

Where X is the event and P(X) is its probability.

Example: the probability of rolling a 3 with a fair die is 1/6.

If two events A and B are independent:

P(A \text{ and } B) = P(A) \times P(B)

For any two events A and B (the addition rule):

P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)

Expected Values

  • Trial – Observing an event occur and recording the outcome
  • Experiment – A collection of one or multiple trials
  • Experimental Probability – The probability we assign an event, based on an experiment we conduct
    • It is different from the theoretical (true) probability!
  • Expected value – the specific outcome we expect to occur when we run an experiment many times

The expected value can be numerical, Boolean, categorical or other, depending on the type of the event we are interested in.

For instance, in an experiment of 20 coin flips, the expected value of a single trial is the more likely of the two outcomes, whereas the expected value of the experiment is the number of times we expect to get heads (or tails) over the 20 trials.

We can use expected values to make predictions about the future based on past data

Expected value for categorical variables

E(X) = n \times p = n \times P(X)

Expected value for numeric variables

E(X) = \sum_{i=1}^{n} x_i \times p_i = \sum_{i=1}^{n} x_i \times P(X_i)
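
As a quick illustration, here is a minimal Python sketch (assuming a fair six-sided die) that applies the numeric formula:

```python
# Expected value of a fair six-sided die: E(X) = sum of x_i * P(X = x_i).
outcomes = [1, 2, 3, 4, 5, 6]
probabilities = [1 / 6] * 6  # equiprobable outcomes

expected_value = sum(x * p for x, p in zip(outcomes, probabilities))
print(expected_value)  # 3.5 -- note this exact value is unattainable in one roll
```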

Probability Frequency Distribution

A collection of the probabilities for each possible outcome of an event.

We need the probability frequency distribution to try and predict future events when the expected value is unattainable.

Usually, the highest bars in the graph point to the expected value.

Frequency

Frequency is the number of times a given value or outcome appears in the sample space.

Complements

A’ = Not A

P(A') = 1 - P(A)

P(A') is the same as P(\overline{A}).

Factorials

Factorials express the product of all integers from 1 to n and we denote them with the “!” symbol.

n! = n \times (n-1) \times (n-2) \times \cdots \times 1

Key Points

0! = 1

Let n > 0, k > 0 and n > k.

  • (n+k)! = n! \times (n+1) \times \cdots \times (n+k)
  • (n-k)! = \frac{n!}{(n-k+1) \times \cdots \times n}
  • \frac{n!}{k!} = (k+1) \times \cdots \times n

Variations

Variations represent the number of different possible ways we can pick and arrange a number of elements of a given set.

Variations without repetition

  • V is the number of variations without repetition
  • n is the number of different elements available
  • p is the number of elements we are arranging.

V(n,p) = V^n_p = \frac{n!}{(n-p)!}

Formula Explained:

  • We have n-many options for the first element.
  • We only have (n-1)-many options for the second element, because we cannot repeat the value we chose first.
  • We have fewer options left for each additional element.

Variations with repetition

  • \overline{V} is the number of variations with repetition
  • n is the number of different elements available
  • p is the number of elements we are arranging.

\overline{V}(n,p) = \overline{V}^n_p = n^p

Formula Explained:

  • We have n-many options for the first element.
  • We still have n-many options for the second element because repetition is allowed.
  • We have n-many options for each of the p-many elements.

Permutations

Permutations represent the number of different possible ways we can arrange a number of elements.

P_n = n \times (n-1) \times (n-2) \times \cdots \times 2 \times 1 = n!

Characteristics of Permutations

  • Arranging all elements within the sample space
  • No repetition

Combinations

Combinations represent the number of different possible ways we can pick a number of elements.

C(n,p) = C^n_p = \frac{n!}{(n-p)!\,p!}

Characteristics of Combinations

  • Takes into account double-counting.
    • e.g. Selecting Johny, Kate and Marie is the same as selecting Marie, Kate and Johny
  • All the different permutations of a single combination are different variations
  • Combinations are symmetric
    • e.g. C^n_p = C^n_{n-p}, since selecting p elements is the same as omitting n-p elements
    • C(n,p) = C(n, n-p)
    • Apply symmetry to avoid calculating factorials of large numbers (simplify calculations)

Note:

C^n_p = \frac{V^n_p}{P_p} = \frac{n!/(n-p)!}{p!} = \frac{n!}{(n-p)!\,p!}
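
These counting formulas are easy to sanity-check in Python; a minimal sketch using the standard math module (math.perm and math.comb require Python 3.8+):

```python
import math

n, p = 10, 3

# Variations without repetition: V(n, p) = n! / (n - p)!
print(math.perm(n, p))    # 720

# Variations with repetition: n^p
print(n ** p)             # 1000

# Permutations of the p chosen elements: P_p = p!
print(math.factorial(p))  # 6

# Combinations: C(n, p) = n! / ((n - p)! p!)
print(math.comb(n, p))    # 120

# The note above: C = V / P
assert math.comb(n, p) == math.perm(n, p) // math.factorial(p)
```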

Combinations with separate sample spaces

When each element comes from its own, separate sample space, the total number of combinations equals the product of the sizes of the individual sample spaces.

C = n_1 \times n_2 \times n_3 \times \cdots \times n_p

Where C is the number of combinations, n_1 is the size of the first sample space, and n_p is the size of the last sample space.

Characteristics of Combinations with separate sample spaces:

  • The option we choose for any element does not affect the number of options for the other elements.
  • The order in which we pick the individual elements is arbitrary.
  • We need to know the size of the sample space for each individual element.

Combinations with Repetition

Combinations represent the number of different possible ways we can pick a number of elements. In special cases we can have repetition in combinations and for those we use a different formula.

  • \overline{C} is the number of combinations with repetition
  • n is the total number of elements in the sample space
  • p is the number of elements we need to select

\overline{C}(n,p) = \frac{(n+p-1)!}{(n-1)!\,p!}

Transform combinations with repetition to combinations without repetition

\overline{C}(n,p) = C(n+p-1, p)

Bayesian Notation

A set is a collection of elements, which hold certain values. Additionally, every event has a set of outcomes that satisfy it.

There always exists a set of outcomes that satisfies a given event.

The null-set (or empty set), denoted \emptyset, is a set which contains no values.

Intersection, Union and Mutually Exclusive Sets

Let me quickly go through this.

Intersection

A \cap B = A \text{ and } B

Union

A \cup B = A \text{ or } B

Note: A \cup B = A + B - A \cap B

Mutually Exclusive Sets

If A \cap B = \emptyset, then the two sets are mutually exclusive.

Independent and Dependent Events

If the likelihood of event A occurring (P(A)) is affected by event B occurring, then we say that A and B are dependent events. Alternatively, if it isn’t – the two events are independent.

We express the probability of event A occurring, given event B has occurred the following way:

P(A|B)

We call this the conditional probability.

Independent Case

P(A|B) still equals P(A), since the outcome of A does not depend on the outcome of B.

Dependent Case

As the outcome of A depends on the outcome of B, P(A|B) is not equal to P(A).

Conditional Probability

For any two events A and B, such that the likelihood of B occurring is greater than 0 (P(B) > 0), the conditional probability formula states the following.

P(A|B) = \frac{P(A \cap B)}{P(B)}

Note: P(A|B) is not, in general, equal to P(B|A).

Additive Law

The additive law calculates the probability of the union based on the probability of the individual sets it accounts for.

P(A \cup B) = P(A) + P(B) - P(A \cap B)

This holds true for any events A and B.

The Multiplication Rule

The multiplication rule calculates the probability of the intersection based on the conditional probability.

P(A \cap B) = P(A|B) \times P(B)

Bayes’ Law / Bayes’ Theorem

Bayes’ Law / Bayes’ Theorem helps us understand the relationship between two events by computing the different conditional probabilities.

P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}

Fact: Bayes’ Law is often used in medical or business analysis to determine which of two symptoms affects the other one more.
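
A minimal sketch of Bayes' Law in Python, with purely hypothetical numbers for a medical-test scenario (A = has the condition, B = test is positive):

```python
# Hypothetical inputs -- not real medical data.
p_a = 0.01          # P(A): prior probability of the condition
p_b_given_a = 0.95  # P(B|A): probability of a positive test given the condition
p_b = 0.06          # P(B): overall probability of a positive test

# Bayes' Law: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 4))  # 0.1583 -- very different from P(B|A) = 0.95
```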

Distributions

A distribution is a function that shows the possible values a random variable can take and how frequently they occur.

mean = average value

variance = how spread out the data is

Types of Distributions

Certain distributions share characteristics, so we separate them into types.

The well-defined types of distributions we often deal with have elegant statistics.

We distinguish between two big types of distributions based on the type of the possible values for the variable - discrete and continuous.

Discrete

  • Have a finite number of outcomes.
  • Use formulas we already talked about.
  • Can add up individual values to determine probability of an interval.
  • Can be expressed with a table, graph or a piece-wise function.
  • Expected Values might be unattainable.
  • Graph consists of bars lined up one after the other.

Continuous

  • Have infinitely many consecutive possible values.
  • Use new formulas for attaining the probability of specific values and intervals.
  • Cannot add up the individual values that make up an interval because there are infinitely many of them.
  • Can be expressed with a graph or a continuous function.
  • Graph consists of a smooth curve.

Notation

\text{Variable} \sim \text{Type}(\text{Characteristics})

Discrete Distributions

Discrete Distributions have finitely many different possible outcomes.

Entire probability distribution can be expressed with a table, a graph or a formula

The graph is called Probability density function (PDF).

  • CDF represents the sum of all the PDF values up to that point.

They possess several key characteristics which separate them from continuous ones.

Examples:

  • Discrete Uniform Distribution
  • Bernoulli Distribution
  • Binomial Distribution
  • Poisson Distribution

Discrete Uniform Distribution

A distribution where all the outcomes are equally likely is called a Uniform Distribution.

All outcomes are equally likely -> Equiprobable

Both the mean and the variance are uninterpretable

No predictive power

Notation:

X \sim U(a,b)

Key Characteristics:

  • All outcomes are equally likely.
  • All the bars on the graph are equally tall.
  • The expected value and variance have no predictive power

Example and uses:

  • Outcomes of rolling a single die.
  • Often used in shuffling algorithms due to its fairness.

Bernoulli Distribution

A distribution consisting of 1 single trial and only 2 possible outcomes – success or failure is called a Bernoulli Distribution.

Any event with only 2 outcomes can be transformed into a Bernoulli event

Notation:

X \sim \text{Bern}(p)

Conventionally, we assign a value of 1 to the event with probability p,

and a value of 0 to the event with probability 1-p.

Key Characteristics:

  • One trial.
  • Two possible outcomes.
  • E(X) = p
    • E(Bernoulli event) = which outcome we expect for a single trial
  • Var(X) = p \times (1-p)

Example and uses:

  • Guessing a single True/False question.
  • Often used when trying to determine what we expect to get out of a single trial of an experiment.

Binomial Distribution

A sequence of identical Bernoulli events is called Binomial and follows a Binomial Distribution.

\text{Bern}(p) = B(1,p)

Carrying out an experiment with only two possible outcomes several times in a row.

Notation:

Y \sim B(n,p)

Where n is the number of trials and p is the probability of success in each individual trial.

Key Characteristics:

  • Measures the frequency of occurrence of one of the possible outcomes over the n trials
  • P(Y = y) = C^n_y \times p^y \times (1-p)^{n-y}
  • E(Y) = n \times p
    • E(Binomial event) = the number of times we expect to get a specific outcome
  • Var(Y) = n \times p \times (1-p)

Example and uses:

  • Determining how many times we expect to get a heads if we flip a coin 10 times
  • Often used when trying to predict how likely an event is to occur over a series of trials
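
A minimal sketch of these formulas in Python for the coin-flip example (n = 10 fair flips):

```python
import math

n, p = 10, 0.5  # 10 trials, probability of heads 0.5

def binomial_pmf(y, n, p):
    # P(Y = y) = C(n, y) * p^y * (1 - p)^(n - y)
    return math.comb(n, y) * p ** y * (1 - p) ** (n - y)

print(round(binomial_pmf(5, n, p), 4))  # 0.2461 -- exactly 5 heads
print(n * p)                            # E(Y) = 5.0 expected heads
print(n * p * (1 - p))                  # Var(Y) = 2.5
```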

Poisson Distribution

When we want to know the likelihood of a certain event occurring over a given interval of time or distance we use a Poisson Distribution.

Notation:

Y \sim \text{Po}(\lambda)

where \lambda is the mean number of occurrences in the interval.

Key Characteristics:

  • Measures the frequency over an interval of time or distance. (Only non-negative values.)
  • P(Y = y) = \frac{\lambda^y e^{-\lambda}}{y!}
  • E(Y) = \lambda
  • Var(Y) = \lambda

Example and uses:

  • Used to determine how likely a specific outcome is, knowing how often the event usually occurs.
  • Often incorporated in marketing analysis to determine whether above average visits are out of the ordinary or not.
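
A minimal sketch of the Poisson formulas in Python, assuming a hypothetical rate of λ = 4 visits per minute:

```python
import math

lam = 4  # hypothetical mean number of occurrences per interval

def poisson_pmf(y, lam):
    # P(Y = y) = lambda^y * e^(-lambda) / y!
    return lam ** y * math.exp(-lam) / math.factorial(y)

print(round(poisson_pmf(7, lam), 4))  # 0.0595 -- exactly 7 visits in a minute
print(lam)                            # E(Y) = Var(Y) = lambda = 4
```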

Continuous Distributions

If the possible values a random variable can take are a sequence of infinitely many consecutive values, we are dealing with a continuous distribution.

The probability distribution would be a curve graph, called Probability distribution curve (PDC).

Examples:

  • Normal Distribution
  • Chi-Squared Distribution
  • Exponential Distribution
  • Logistic Distribution

Normal Distribution

Also known as Gaussian Distribution, Bell Curve

A Normal Distribution represents a distribution that most natural events follow.

Notation:

Y \sim N(\mu, \sigma^2)

Where \mu is the mean and \sigma^2 is the variance.

Note

68% of all outcomes fall within 1 standard deviation of the mean.

95% of all outcomes fall within 2 standard deviations of the mean.

99.7% of all outcomes fall within 3 standard deviations of the mean.

Note:

The dispersion of the distribution depends on the standard deviation \sigma.

A lower standard deviation results in lower dispersion (denser in the middle) and thinner tails.

A higher standard deviation flattens the graph, with fewer points in the middle and more in the tails (fatter tails).

Key Characteristics:

  • Its graph is a bell-shaped curve, symmetric, with thin tails
  • E(Y) = \mu
  • Var(Y) = \sigma^2
  • 68% of all its values should fall in the interval: (\mu-\sigma, \mu+\sigma)
  • mean = median = mode
  • no skew

Example and uses:

  • Often observed in the size of animals in the wilderness
  • Could be standardized to use the Z-table

Standard Normal Distribution

To standardize any normal distribution we need to transform it so that the expected value is 0 and the variance and standard deviation are 1.

E(Y) = 0

Var(Y) = 1

We move the graph to the left or right until the mean = 0.

Then we divide the whole graph by \sigma to get a standard deviation of 1.

z = \frac{y - \mu}{\sigma}

Importance of the Standard Normal Distribution:

  • The new variable Z (the Z-score) represents how many standard deviations away from the mean each corresponding value is.
  • We can transform any Normal Distribution into a Standard Normal Distribution using the transformation shown above.
  • Convenient to use because of a table of known values for its CDF, called the Z-score table, or simply the Z-table.
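
A minimal sketch of the standardization in Python (hypothetical exam scores with μ = 70 and σ = 10):

```python
# z = (y - mu) / sigma turns raw values into standard deviations from the mean.
mu, sigma = 70, 10
scores = [55, 70, 85]

z_scores = [(y - mu) / sigma for y in scores]
print(z_scores)  # [-1.5, 0.0, 1.5]
```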

Student's t Distribution

A small sample size approximation of a Normal Distribution

The Student's t distribution accommodates extreme values significantly better.

Compared to the Normal Distribution, the Student's t distribution has:

  • fatter tails
  • a larger number of values located far away from the mean

Notation

Y \sim t(k)

where k is the degrees of freedom.

Key characteristics:

  • A small sample size approximation of a Normal Distribution
  • Its graph is a bell-shaped curve, symmetric, but with fat tails
  • Accounts for extreme values better than the Normal Distribution
  • E(Y) = \mu

Example and uses:

  • Often used in analysis when examining a small sample of data that usually follows a Normal Distribution. (Insufficient data and unknown population variance.)
  • Hypothesis testing with limited data
  • CDF table (t-table)

We can obtain the Student's t statistic for a variable with a Normally distributed population using the formula:

t_{v,\alpha} = t_{n-1,\alpha} = \frac{\bar{x}-\mu}{s / \sqrt{n}}
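
A minimal sketch computing this t statistic in Python with a small hypothetical sample, testing against μ = 50:

```python
import math
import statistics

sample = [52, 48, 55, 51, 49, 53]  # hypothetical small sample
mu = 50                            # hypothesized population mean

x_bar = statistics.mean(sample)    # sample mean
s = statistics.stdev(sample)       # sample standard deviation (divides by n - 1)
t = (x_bar - mu) / (s / math.sqrt(len(sample)))

print(round(t, 3))  # 1.265, with n - 1 = 5 degrees of freedom
```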

Chi-Squared Distribution

Only consists of non-negative values. Its graph is Asymmetric and skewed to the right.

Used in Hypothesis Testing to help determine goodness of fit

Notation

Y \sim \chi^2(k)

where k is the degrees of freedom.

Key Characteristics:

  • Its graph is asymmetric and skewed to the right.
  • Used in Hypothesis Testing
  • E(Y) = k
  • Var(Y) = 2k
  • The square of a standard Normal variable follows a Chi-Squared distribution with 1 degree of freedom.

Example and uses:

  • Often used to test goodness of fit
  • There is a table of known values for its CDF, called the \chi^2-table; unlike the Z-table, it typically gives the critical values for the area in the right tail.

Exponential Distribution

The Exponential Distribution is usually observed in events which significantly change early on.

e.g. Online news articles generate hits

Notation:

Y \sim \text{Exp}(\lambda)

where λ\lambda is the rate parameter.

This determines how fast the CDF/PDF curve reaches the point of plateauing and how spread out the graph is.

Key Characteristics:

  • Both the PDF and the CDF plateau after a certain point.
  • E(Y) = \frac{1}{\lambda}
  • Var(Y) = \frac{1}{\lambda^2}
  • We often use the natural logarithm to transform the values of such distributions, since we do not have a table of known values like the Normal or Chi-Squared.
    • Apply \ln(Y) to get an approximately Normal distribution.

Example and uses:

  • Often used with dynamically changing variables, like online website traffic or radioactive decay.

Logistic Distribution

The Continuous Logistic Distribution is observed when trying to determine how continuous variable inputs can affect the probability of a binary outcome.

Useful in forecasting (prediction)

Useful for determining a cut-off point for a successful outcome.

Notation:

Y \sim \text{Logistic}(\mu, s)

where \mu is the location and s is the scale parameter.

Key Characteristics:

  • E(Y) = \mu
  • Var(Y) = \frac{s^2 \pi^2}{3}
  • The CDF picks up when we reach values near the mean.
  • The smaller the scale parameter, the quicker it reaches values close to 1.

Example and uses:

  • Often used in sports to anticipate how a player’s or team’s performance can determine the outcome of the match.
  • Forecasting competitive sports

Statistics

First Step

You need to determine whether the data you are dealing with is a population or a sample.

Population

Collection of all items of interest

Denoted as N

The numbers we obtain when using a population are called parameters.

Populations are hard to define and hard to observe in real life.

Sample

A subset of the population

Denoted as n

The numbers we obtain when using a sample are called statistics.

A sample is much easier to gather.

  • Less time consuming
  • Less costly (cheaper)

Statistical tests are usually based on sample data.

Samples are key to accurate statistical insights.

A sample must be both random and representative for an insight to be precise.

Randomness

A random sample is collected when each member of the sample is chosen from the population strictly by chance.

Representativeness

A representative sample is a subset of the population that accurately reflects the members of the entire population.

Types of Data

Categorical

Categorical data describes categories or groups.

E.g. Car brands, Yes No Questions

Numerical

Numerical data represents numbers.

It is further divided into two subsets: discrete and continuous.

Discrete

  • Can be counted in a finite manner

E.g. Number of objects (only takes integer values)

Continuous

  • Infinitely many values; impossible to count
  • Can only be measured approximately

E.g. Weight, Distance (All of these can vary by infinitely smaller amounts)

Levels of Measurement

Qualitative Variables

Nominal Variables

  • Not numbers and cannot be ordered

E.g. Four Seasons (winter, spring, summer, autumn)

Ordinal Variables

  • Followed by order

E.g. Rating your meal (from very bad, bad, neutral, good, very good)

Quantitative Variables

Interval Variables

  • Number
  • Does not have a true 0

E.g. Temperature (Celsius and Fahrenheit)

Ratio Variables

  • Number
  • Has a true 0 point

E.g. Length, Weight

Graphs and Tables

In data science, there are many graphs and tables that represent categorical variables.

Frequency distribution tables (categorical)

Note: Frequency distribution tables for numerical variables are different from the ones for categorical variables.

In Excel

In Excel, we can either hard code the frequencies or count them with a count function. This will come up later on.

Total formula: =SUM()

Bar charts

Called Clustered Column Charts in Excel.

In Excel

Bar charts are also called clustered column charts in Excel.

Choose your data, Insert -> Charts -> Clustered column or Bar chart.

Pie charts

Keyword: Market share

In Excel

Pie charts are created in the following way:

Choose your data, Insert -> Charts -> Pie chart

Pareto Diagrams

Basically a Bar chart shown in descending order of frequency, and a separate curve shows the cumulative frequency.

In Excel

  1. Order the data in your frequency distribution table in descending order.
  2. Create a bar chart.
  3. Add a column in your frequency distribution table that measures the cumulative frequency
  4. Select the plot area of the chart in Excel and Right click.
  5. Choose Select series
  6. Click Add
  7. Series name doesn’t matter. You can put ‘Line’
  8. For Series values choose the cells that refer to the cumulative frequency
  9. Click OK. You should see two side-by-side bars.
  10. Select the plot area of the chart and Right click.
  11. Choose Change Chart Type.
  12. Select Combo.
  13. Choose the type of representation from the dropdown list. Your initial categories should be ‘Clustered Column.’ Change the second series, that you called ‘Line’, to ‘Line’.

Frequency distribution tables (Numerical)

Frequency distribution tables for numerical variables are different than the ones for categorical. Usually, they are divided into intervals of equal (or unequal) length. The tables show the interval, the absolute frequency and sometimes it is useful to also include the relative (and cumulative) frequencies.

The interval width is calculated using the following formula:

\text{Interval width} = \frac{\text{Largest number} - \text{Smallest number}}{\text{Number of desired intervals}}

In Excel

  1. Decide on the number of intervals you would like to use.

  2. Find the interval width (using the formula above).

  3. Start your 1st interval at the lowest value in your dataset.

  4. Finish your 1st interval at the lowest value + the interval width. ( = start_interval_cell + interval_width_cell )

  5. Start your 2nd interval where the 1st stops (that’s a formula as well - just make the starting cell of interval 2 = the ending of interval 1)

  6. Continue in this way until you have created the desired number of intervals.

  7. Count the absolute frequencies using the following COUNTIF formula: =COUNTIF(dataset_range,">="&interval_start_cell) - COUNTIF(dataset_range,">"&interval_end_cell).

  8. In order to calculate the relative frequencies, use the following formula: = absolute_frequency_cell / number_of_observations

  9. In order to calculate the cumulative frequencies:

    i. The first cumulative frequency is equal to the relative frequency
    ii. Each consecutive cumulative frequency = previous cumulative frequency + the respective relative frequency

Histogram

Histograms are one of the most common ways to represent numerical data. Each bar has a width equal to the width of the interval. The bars touch because the intervals are continuous: where one ends, the next begins.

Histogram vs Bar

In Excel

  1. Choose your data
  2. Insert -> Charts -> Histogram
  3. To change the number of bins (intervals):
    1. Select the x-axis
    2. Click Chart Tools -> Format -> Axis options
    3. You can select the bin width (interval width), number of bins, etc.

Cross tables (Side by Side Chart)

Cross tables (or contingency tables) are used to represent categorical variables.

A common way to represent the data from a cross table is by using a side-by-side bar chart.

One set of categories is labeling the rows and another is labeling the columns. We then fill in the table with the applicable data. It is a good idea to calculate the totals. Sometimes, these tables are constructed with the relative frequencies as shown in the table below.

In Excel

  1. Choose your data
  2. Insert -> Charts -> Clustered Column
  3. Selecting more than one series ( groups of data ) will automatically prompt Excel to create a side-by-side bar (column) chart.

Scatter plots

When we want to represent two numerical variables on the same graph, we usually use a scatter plot. Scatter plots are useful especially later on, when we talk about regression analysis, as they help us detect patterns (linearity, homoscedasticity).

Scatter plots usually represent lots and lots of data. Typically, we are not interested in single observations, but rather in the structure of the dataset.

In Excel

  1. Choose the two datasets you want to plot.
  2. Insert -> Charts -> Scatter

Measures of Central Tendency

These are the mean, median and mode.

Mean

The mean is the simple average of the dataset.

Formula to scare away people:

\frac{\sum^N_{i=1} x_i}{N}

Formula actually used and memorized:

\frac{x_1 + x_2 + x_3 + \cdots + x_{N-1} + x_N}{N}

In Excel

=AVERAGE()

Median

The median is the midpoint of the ordered dataset. It is not as popular as the mean, but it is often used in academia and data science because it is not affected by outliers.

In an ordered dataset, the median is the number at position \frac{n+1}{2} (the middle number).

If the dataset has an even number of observations, the median is the average of the two middle numbers.

In Excel

=MEDIAN()

Mode

The mode is the value that occurs most often.

The mode is calculated simply by finding the value with the highest frequency.

A dataset can have 0 modes, 1 mode or multiple modes.

In Excel

=MODE.SNGL() => returns one mode

=MODE.MULT() => returns an array with the modes

Measures of Asymmetry

Formula to scare people away:

\frac{\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{3}}{\left(\frac{1}{n-1} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}\right)^{\frac{3}{2}}} = \frac{\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{3}}{\sqrt{\frac{1}{n-1} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}^{\,3}}

Skewness

Skewness is a measure of asymmetry that indicates whether the observations in a dataset are concentrated on one side.

Skewness tells us a lot about where the data is situated.

In Excel

=SKEW()
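
A minimal sketch implementing the skewness formula above in Python (note that Excel's SKEW applies a slightly different bias correction, so the numbers will not match exactly):

```python
import statistics

def skewness(data):
    # Third central moment divided by the cube of the (n - 1) standard deviation.
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return sum((x - mean) ** 3 for x in data) / n / s ** 3

print(skewness([1, 2, 2, 3, 3, 3, 4, 10]))  # ~1.47 -> right (positive) skew
```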

Positive / Right Skew

Mean > Median

Data points are concentrated on the left side, with the tail leaning to the right.

The skew is named after the side the tail leans to.

Zero / No Skew

Mean = Median = Mode

Graph is completely symmetrical

Negative / Left Skew

Mean < Median

The tail leans to the left.

Variability

Note: We will typically use different formulas when working with population data and sample data.

What is Variance?

Variance measures the dispersion of a set of data points around their mean.

  • Dispersion is non-negative

The average of the squared differences from the Mean.

Sample Variance

Formula

s^{2} = \frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}{n-1}

why sample variance is divided by n-1?

In Excel

=VAR.S()

Population Variance

Formula

\sigma^{2} = \frac{\sum_{i=1}^{N}\left(x_{i}-\mu\right)^{2}}{N}

In Excel

=VAR.P()

What is Standard Deviation?

The Standard Deviation is a measure of how spread out numbers are.

Sample Standard Deviation

Formula

S = \sqrt{\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}{n-1}}

In Excel

=STDEV.S()

Population Standard Deviation

Formula

\sigma = \sqrt{\frac{\sum_{i=1}^{N}\left(x_{i}-\mu\right)^{2}}{N}}

In Excel

=STDEV.P()
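
A minimal Python sketch contrasting the sample and population formulas, mirroring Excel's VAR.S/VAR.P and STDEV.S/STDEV.P (hypothetical data):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

print(statistics.variance(data))   # sample variance, divides by n - 1 -> 4.571...
print(statistics.pvariance(data))  # population variance, divides by N -> 4.0
print(statistics.stdev(data))      # sample standard deviation -> 2.138...
print(statistics.pstdev(data))     # population standard deviation -> 2.0
```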

Covariance and correlation

Covariance

When two variables are correlated, the main statistic to measure this correlation is called covariance.

Covariance is a measure of the joint variability of two variables.

Covariance gives a sense of direction.

  • A positive covariance means that the two variables move together.
  • A covariance of 0 means that the two variables are independent.
  • A negative covariance means that the two variables move in opposite directions.

Sample covariance formula

S_{xy} = \frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{n-1}

Population covariance formula

\sigma_{xy} = \frac{\sum_{i=1}^{N}\left(x_{i}-\mu_{x}\right)\left(y_{i}-\mu_{y}\right)}{N}

In Excel

Sample covariance: =COVARIANCE.S()

Population covariance: =COVARIANCE.P()

Correlation

[Correlation Definition, Examples](https://www.simplypsychology.org/correlation.html#:~:text=An example of positive correlation,a decrease in the other.)

Correlation is a measure of the joint variability of two variables.

Unlike covariance, correlation could be thought of as a standardized measure.

It takes on values between -1 and 1, thus it is easy for us to interpret the result.

  • 1 : perfect positive correlation
    • two variables in which both variables move in the same direction.
  • 0 : no linear relationship
    • the two variables have nothing in common (note: zero correlation does not necessarily imply independence)
  • -1 : perfect negative correlation
    • two variables in which an increase in one variable is associated with a decrease in the other.

correlation formula

\frac{\operatorname{Cov}(x, y)}{\operatorname{Stdev}(x) \times \operatorname{Stdev}(y)}

Sample correlation formula

r = \frac{s_{xy}}{s_{x} s_{y}}

Population correlation formula

\rho = \frac{\sigma_{xy}}{\sigma_{x} \sigma_{y}}

In Excel

=CORREL()
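
A minimal Python sketch of both statistics, mirroring Excel's COVARIANCE.S and CORREL (statistics.covariance and statistics.correlation require Python 3.10+; hypothetical data):

```python
import statistics

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

print(statistics.covariance(x, y))             # 1.5 -> positive, they move together
print(round(statistics.correlation(x, y), 3))  # 0.775 -> standardized to [-1, 1]
```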

Descriptive statistics

Descriptive statistics uses the data to provide descriptions of the population, either through numerical calculations or graphs or tables.

Inferential Statistics

Inferential statistics makes inferences and predictions about a population based on a sample of data taken from the population in question.

Basically: Use Probability + Distributions to predict population values based on sample data.

Central Limit Theorem

CLT simulator.

The Central Limit Theorem allows us to perform tests, solve problems and make inferences using the normal distribution, even when the population is not normally distributed.

Basically:

No matter the distribution of the population,

the sampling distribution of the mean will approximate a normal distribution.

Its mean is the same as the population mean.

Reasons to use the normal distribution:

  • They approximate a wide variety of random variables
  • Distributions of sample means with large enough sample sizes could be approximated to normal
  • All computable statistics are elegant
  • Decisions based on normal distribution insights have a good track record

Standard Error

The standard error is the standard deviation of the distribution formed by the sample means.

In other words, the standard error of a statistic is the standard deviation of its sampling distribution, or an estimate of that standard deviation.

\text{Standard deviation of the sampling distribution} = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}

The standard error shows variability.

  • Standard error decreases when sample size increases
    • bigger samples give a better approximation of the population.
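
A minimal simulation sketch in Python: sample means drawn from a non-normal (uniform) population cluster around the population mean, with a spread close to \sigma / \sqrt{n}:

```python
import random
import statistics

random.seed(42)
population = [random.uniform(0, 100) for _ in range(100_000)]  # not normal

n = 50
sample_means = [statistics.mean(random.sample(population, n)) for _ in range(1_000)]

print(statistics.mean(sample_means))             # close to the population mean (~50)
print(statistics.stdev(sample_means))            # empirical standard error
print(statistics.pstdev(population) / n ** 0.5)  # theoretical sigma / sqrt(n), ~4.08
```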

Point Estimate

Point estimation gives us a particular value as an estimate of the population parameter.

The point estimate is the midpoint of the corresponding confidence interval.

Confidence Interval

Interval estimation gives us a range of values which is likely to contain the population parameter. This interval is called a confidence interval.

Confidence intervals provide much more information and are preferred when making inferences.

We are (1-\alpha) \times 100\% confident that the population parameter will fall in the specified interval.

Common alphas are: 0.01, 0.05, 0.1.

\text{significance level} = \alpha = 1 - \text{confidence level}

  • A higher confidence level (1-\alpha) results in a wider CI
  • A higher \sigma results in a wider CI
  • A higher n results in a narrower CI

Example:

A 95% CI means there is only a 5% chance that the population parameter lies outside the range.

The distribution graph has a lower limit and an upper limit; a 95% confidence interval implies that we are 95% confident that the true population mean falls between them.

There is a 2.5% chance that it lies below the lower limit and a 2.5% chance that it lies above the upper limit, i.e., a 5% chance overall that our interval does not contain the true population mean. Using the Z-table, we can find these limits.
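
A minimal Python sketch of a 95% confidence interval with known population variance (hypothetical numbers: x̄ = 100, σ = 15, n = 36):

```python
import math
import statistics

x_bar, sigma, n = 100, 15, 36
alpha = 0.05

z = statistics.NormalDist().inv_cdf(1 - alpha / 2)  # critical value, ~1.96
margin = z * sigma / math.sqrt(n)

print((round(x_bar - margin, 2), round(x_bar + margin, 2)))  # (95.1, 104.9)
```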

T Statistic and Z Statistic

Population standard deviation goes with the Z statistic.

When population variance is unknown, sample standard deviation goes with the t statistic.

The margin of error decreases with a lower critical value (statistic) and a higher sample size.

Hypothesis Testing

A hypothesis is an idea that can be tested.

Steps in data-driven decision making:

  1. Formulate a hypothesis
  2. Find the right test
  3. Execute the test
  4. Make a decision based on the result

there are two hypotheses that are made.

  • Null hypothesis - H_0
    • H_0 is the hypothesis to be tested, the idea you want to reject.
    • It is the status quo: everything that was believed until now, which we are contesting with our test.
    • H_0 is true until rejected (innocent until proven guilty).
    • A null hypothesis is a hypothesis that says there is no statistically significant relationship between the two variables.
    • It is usually the hypothesis a researcher or experimenter will try to disprove or discredit.
  • Alternative hypothesis - H_1 or H_A
    • H_1 covers everything else (\overline{H_0}).
    • The alternative hypothesis is the change or innovation that is contesting the status quo.
    • Usually the alternative is our own opinion.
    • An alternative hypothesis is one that states there is a statistically significant relationship between two variables.

Example:

You want to check if your height is above average, compared to your classmates.

H_0 : My height is lower than or equal to the average height in the class.

H_1 : My height is higher than the average height in the class.

The null is the status quo (you are ‘shorter’), while the alternative is the ‘change or innovation’ (you are looking for this answer). Therefore, the null hypothesis of the test is ‘lower or equal’. You have to include ‘equal’ in the null hypothesis, as conventionally, the equality sign is always included in the null.

Testing and Decisions you can take

To Test:

  1. Calculate a statistic (e.g. \bar{x})
  2. Scale it (standardize it to get z)
  3. Check whether z falls in the rejection region.

When testing, there are two decisions that can be made:

If your test value does not fall into the rejection region

  • accept (more precisely, fail to reject) the null hypothesis
    • there isn't enough data to support the change or the innovation brought by the alternative.

If your test value falls into the rejection region

  • reject the null hypothesis.
  • there is enough statistical evidence that the status-quo is not representative of the truth.

Type I Error (False Positive)

Type 1 error is when you reject a null hypothesis which is true.

\text{Probability}: \alpha

Example:

You are taking a pregnancy test. The null hypothesis of this test is: I am not pregnant. In reality, you are not pregnant, but the test says you are.

It is a Type I Error. You rejected a null hypothesis that is true. It is also called a false positive, although in this case, it is counterintuitive. While usually no error is made (that's why hypothesis testing is so widespread), it is impossible to make both errors at the same time.

True Negative

Note: The probability of accepting a true null hypothesis (a true negative) is (1-\alpha).

Type II Error (False Negative)

Type 2 error is when you accept a null hypothesis which is false.

\text{Probability}: \beta

where \beta depends mainly on the sample size and the magnitude of the effect.

Power of the Test (True Positive)

Note: The probability of rejecting a false null hypothesis (a true positive) is (1-\beta), also known as the power of the test.

We can increase the power of the test by increasing the sample size.

Type I and Type II Explanation

Test for the Mean. Population (Variance Known)

Under H_0, the standardized test statistic follows the standard normal distribution:

Z \sim N(0, 1)

Z = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}

Note:

Z -> the standardized variable associated with the test, called the Z-score.

z -> the value from the table, referred to as the critical value.

Decision Rule:

We reject the null hypothesis H_0 if the absolute value of the Z-score (Z) is greater than the positive critical value (z).
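
A minimal Python sketch of this decision rule, with hypothetical numbers (H_0: \mu = 40, known \sigma = 5, a sample of n = 25 with mean 42.1):

```python
import math
import statistics

x_bar, mu_0, sigma, n = 42.1, 40, 5, 25
alpha = 0.05

z_score = (x_bar - mu_0) / (sigma / math.sqrt(n))        # 2.1
z_crit = statistics.NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96, two-tailed

print(abs(z_score) > z_crit)  # True -> reject the null hypothesis
```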

Test for the Mean. Population (Variance Unknown)

T = \frac{\bar{x}-\mu}{s / \sqrt{n}}

Decision Rule:

We accept the null hypothesis H_0 if the absolute value of the T-score (T) is less than the positive critical value (t).

We reject the null hypothesis H_0 if the absolute value of the T-score (T) is greater than the positive critical value (t).

P-Value

P-Value calculator

The p-value is the smallest level of significance at which we can still reject the null hypothesis, given the observed sample statistic.

  • Works with all distributions

Used to decide whether the variable is statistically significant or not.

The closer to 0 your p-value is, the more significant the result you've obtained.

To find the p-value, you need to know the t-score/z-score, and whether this is a one-tailed or a two-tailed test.

  • One-tailed p-value = 1 - (the number from the Z-table)
  • Two-tailed p-value = 2 \times (1 - the number from the Z-table)

Decision Rule:

If the p-value < the significance level (\alpha),

you should reject the null hypothesis H_0.

For example, with a p-value of 0.001 and a significance level of 1% (0.01), we reject the null hypothesis.
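
A minimal Python sketch converting a z-score into one- and two-tailed p-values via the standard normal CDF (using the hypothetical z = 2.1 from the Z-test sketch above):

```python
import statistics

z = 2.1
cdf = statistics.NormalDist().cdf(z)  # "the number from the Z-table"

p_one_tailed = 1 - cdf
p_two_tailed = 2 * (1 - cdf)

print(round(p_one_tailed, 4))  # 0.0179
print(round(p_two_tailed, 4))  # 0.0357 -> reject H0 at alpha = 0.05
```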

Test for the Mean. Dependent Samples (Multiple Populations)

T = \frac{\bar{d}-\mu_0}{\text{st. error}}

We accept the null hypothesis H_0 if the p-value > \alpha.

We reject the null hypothesis H_0 if the p-value < \alpha.

Test for the mean. Independent Samples (Known Variance)

Z = \frac{(\bar{x}_e - \bar{x}_m) - \mu_0}{\sqrt{\frac{\sigma_{e}^{2}}{n_e}+\frac{\sigma_{m}^{2}}{n_m}}}

where \mu_0 is the hypothesized difference between the two population means.

Test for the mean. Independent Samples (Unknown Variance)

s_{p}^{2} = \frac{\left(n_{x}-1\right) s_{x}^{2}+\left(n_{y}-1\right) s_{y}^{2}}{n_{x}+n_{y}-2}

Degrees of freedom for the t-statistic = combined sample size - number of samples (here n_x + n_y - 2).

As a rule of thumb, we reject the null hypothesis when the T-score is bigger than 2.

Researchers generally consider 2 sufficiently significant for a t-test.

Cheatsheet

When to use z / t statistic?

Confidence Interval cheatsheet

Reference

The Data Science Course 2020: Complete Data Science Bootcamp