Phishing

Phishing uses emails or malicious websites to solicit personal information from an individual or company by posing as a trustworthy organization or entity.

Social engineering
- Psychological manipulation of a person to get useful and sensitive info from them
Used by criminals
- Baiting: convincing the victim to reveal info by promising a reward or a gift
- Impersonation: pretending to be someone else
- Shoulder surfing: spying on other people’s machines from behind them, while they are typing

1998: the term “phishing” is first used to refer to an online fraud scheme

1999 to 2004: criminal organizations use the practice to target banks, and it becomes common worldwide

2005: anti-phishing practices appear, especially with regard to authentication levels

Signature: a URL/site is flagged as phishing => protects future victims

2006-2007: the term begins to cover combined phishing and malware attacks

2008: approaches that use computational intelligence appear

2009 to 2020: the trend moves to behavioural patterns, such as elements of the URL (protocol, path, query string, domain, subdomains)

Typical scenario of Phishing

Attackers send spoofed emails and set up deceptive websites to entice users to expose info. The spoofed emails usually purport to be from legitimate businesses and lead users to counterfeit websites that lure them into entering sensitive info
Software: HTTrack lets users duplicate entire websites for their own purposes

Example:

2014, iCloud leaks of celebrity photos

Cause: phishing emails that appeared to come from Apple/Google were sent to the victims

Indicators

The site looks visually like the original website
Email creates a sense of urgency to force user action
Fake HTTPS certificate and domain name

For example:

Visual lookalike: Facebook.com becomes Faceb00k.com

Fake domain name, such as paypal@notice-access-273.com

Destination address that does not match the context of the email

Fake preview link that redirects you to another site

https://www.google.com

Phishers also try to create a sense of urgency so that the user falls for the scam more easily.

Phishing detection

Traditional Phishing detection

List-based approach: make a list of known phishing sites

Make use of a blacklist or whitelist

Different lists are run by different companies. We can send visited URLs to a central service to be checked.

Example:

Safe Browsing: maintained by Google, operates in Chrome, Firefox, Safari
PhishTank: maintained by OpenDNS, operates in Opera
SmartScreen: Microsoft (IE, Edge)
Opera 9.1 uses live blacklists from Phishtank as well as whitelists from GeoTrust
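
The list-based check can be sketched as a simple set lookup. This is only an illustration: the list contents, the function names and the "unknown" verdict are assumptions, and real services such as PhishTank or Safe Browsing distribute their own data through their own interfaces.

```python
from urllib.parse import urlsplit

# Hypothetical lists for illustration only.
BLACKLIST = {"faceb00k.com", "notice-access-273.com"}
WHITELIST = {"facebook.com", "paypal.com"}


def normalize_host(url):
    """Return the lower-cased host name without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def list_based_verdict(url):
    """Look the host up in the whitelist first, then in the blacklist."""
    host = normalize_host(url)
    if host in WHITELIST:
        return "legitimate"
    if host in BLACKLIST:
        return "phishing"
    return "unknown"  # not on any list, e.g. not yet reported


verdict = list_based_verdict("http://www.faceb00k.com/login")
```

A site that nobody has reported yet falls into the "unknown" case, whichever list is used.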

PhishTank

PhishTank uses a community approach to publish a list.

Anyone can submit, verify, track and share phishing data. (Collaborative in nature)

Confirmation: voting to determine a verdict on the complaint (valid or invalid phishing)
- PhishTank does not specify the number of votes needed for a URL to be considered malicious
Availability: the platform tracks whether the phishing site is online or offline

Drawbacks of Traditional Phishing detection

Has delay
- Users need to report phishing websites manually
- Human effort introduces delay
- Zero-day phishing sites are not yet on any list
- This leaves a window of vulnerability

Machine Learning Phishing detection

Machine learning approach: Make data-driven decisions at scale

Phishing detection is a classification problem (Phish or Not Phish)

We can use a supervised or an unsupervised approach

Supervised: Labeled data => develop a model to make accurate predictions on unseen data
Unsupervised: Unlabeled data => find common characteristics => groups/clusters

Process of ML

Information Sources and Data capturing
- Normal URL/email and Phishing data (Select a tool to obtain them)
Data processing
- Data cleaning, deal with missing values
Feature engineering
- Find important features/attributes from the raw data
Feature Scaling and Selection
- Normalize and scale features to prevent ML algorithms from being biased towards features with large ranges.
Decide which model to use

Phishing data collection:

Legitimate URLs can be obtained from Alexa (a well-known ranking service)

Public datasets:

https://archive.ics.uci.edu/ml/datasets/phishing+websites#

from the University of California, Irvine Machine Learning Repository

4898 legitimate and 6157 phishing records

https://data.mendeley.com/datasets/h3cgnj8hft/1/

5000 phishing webpages (PhishTank, OpenPhish)

5000 legitimate webpages (Alexa, Common Crawl)

48 features

https://research.aalto.fi/en/datasets/phishstorm--phishing--legitimate-url-dataset(f49465b2-c68a-4182-9171-075f0ed797d5).html

96018 URLs: 48009 legitimate and 48009 phishing URLs

Features in Phishing detection

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.

Good feature = suitable representations of data

Types of features in Phishing detection:

URL-Based Features
- Lexical feature
- Domain/Host-Based Features
Page-Based Features
Content-Based Features

URL-Based Features

A very popular set of features used for phishing prediction

It has 2 types:

Lexical features (URL Features): describe textual properties of the URL (not the page content)
- Length of the entire URL, number of dots in the URL
Host-based features (Domain Features): describe characteristics of the web site
- where it is located, who manages it, when the site was set up

Some background info: URL anatomy

Nowadays, using HTTPS does not mean a site is more secure.

The domain name is something that has to be registered
The subdomain (third-level domain) name can be chosen by the operator of the network (easy to manipulate)

Some background info: Domain name: identifies the server that hosts the web page

The second-level domain and top-level domain form the registered domain name (hard to manipulate)
A phisher can control the subdomain name
A phisher can also control the path
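
The URL parts listed above can be pulled apart with Python's standard urllib.parse module. The sketch below is a minimal illustration: the function name split_url and the example URL are made up, and taking the last two host labels as the registered domain is a simplifying assumption that fails for suffixes such as .co.uk.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Break a URL into protocol, registered domain, subdomains, path, query."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    return {
        "protocol": parts.scheme,
        # Naive assumption: registered domain = last two labels.
        "registered_domain": ".".join(labels[-2:]),
        "subdomains": labels[:-2],
        "path": parts.path,
        "query": parts.query,
    }


info = split_url("https://login.secure.example.com/account/verify?id=42")
```

Everything in subdomains, path and query is under the control of whoever operates the site, which is why those parts feed most lexical features.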

URL Features (Lexical features)

Is IP address used as domain

Sometimes people do not register a domain name and simply use the IP address.

Legitimate websites seldom use an IP address as the domain, because it degrades user experience
URLs that use an IP address instead of a DNS name (hiding the actual URL and domain of the website) look suspicious
- example: http://63.17.167.23/pc/verification.htm?=https://www.paypal.com/
But there are still legitimate websites that use IP addresses for internal private devices such as routers, networked printers etc

A URL that uses an IP address as its domain is more likely to be a phishing link.
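
One possible way to compute this binary feature uses the standard ipaddress module. The function name uses_ip_as_domain is an assumption made for illustration.

```python
import ipaddress
from urllib.parse import urlsplit


def uses_ip_as_domain(url):
    """Return 1 if the host part of the URL is an IP address, else 0."""
    host = urlsplit(url).hostname
    if host is None:
        return 0
    try:
        ipaddress.ip_address(host)  # raises ValueError for DNS names
        return 1
    except ValueError:
        return 0


flag = uses_ip_as_domain(
    "http://63.17.167.23/pc/verification.htm?=https://www.paypal.com/"
)
```

The paypal.com address inside the query string does not affect the result, because only the host part is inspected.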

Number of separators

3 common separators: -, _, @
A username/password precedes @; the destination URL follows @
Legitimate websites seldom use a hyphen or underscore ( - or _ )

In part 1:

phishing records, hyphen: 8.33 occurrences per URL (higher than non-phishing)

In part 2:

separators are used more than in the first part

Hyphen: 15.37% of characters used in a malicious URL (7.6% in non-malicious)

URL size

Most URLs have between 25 and 50 characters
All URLs over 500 characters were found to be malicious

Number of sub-domains in URL

Phishers use sub-domains to lead users to believe, through careless observation, that the URL displayed in the browser is a legitimate domain

Example: facebook.edit.youraccount.com
Legitimate websites usually have no subdomain
There are some subdomain keywords often used by phishers
- we can create another feature that checks for these keywords in the subdomain name.

Example:

www.ABCBank.wxyz.com is a phishing link and does not belong to ABCBank.

www.ABCBank.wxyz.com points to wxyz.com rather than ABCBank.com. When interpreting a domain, the top level is .com and the domain name is wxyz.com. ABCBank is just a specific server within the site wxyz.com. If the site really belonged to ABCBank, the address would have the form xxx.ABCBank.com
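
The subdomain features can be sketched as follows. The keyword list is hypothetical (the notes do not give one), "www" is not counted as a subdomain by assumption, and the registered domain is again taken naively as the last two labels.

```python
from urllib.parse import urlsplit

# Hypothetical keyword list; a real one would come from observed phishing data.
SUSPICIOUS_KEYWORDS = {"login", "secure", "account", "verify", "update"}


def subdomain_features(url):
    """Count subdomains and flag suspicious keywords among them."""
    labels = (urlsplit(url).hostname or "").lower().split(".")
    # Assumption: the last two labels are the registered domain; "www" is ignored.
    subdomains = [s for s in labels[:-2] if s != "www"]
    return {
        "registered_domain": ".".join(labels[-2:]),
        "num_subdomains": len(subdomains),
        "has_keyword": int(any(s in SUSPICIOUS_KEYWORDS for s in subdomains)),
    }


bank = subdomain_features("http://www.ABCBank.wxyz.com/")
```

For the ABCBank example, the registered domain comes out as wxyz.com, with abcbank as the single subdomain.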

URL with variables

The path or query string values can be manipulated
A longer path or query string is more likely to indicate a phishing link
Legitimate websites tend to have a small number of path segments and query parameters

Homographic attack

Phishers make use of substitutions, such as misspelled words and homographs, that may slip past an inattentive end user
Example: faceb00k, dr0pbox, goggle
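
A very rough detector for digit-for-letter substitutions is sketched below. The lookalike table and the brand list are hypothetical, and this approach does not catch plain misspellings such as goggle, which would need something like an edit-distance comparison.

```python
# Hypothetical lookalike table and brand list, for illustration only.
LOOKALIKES = str.maketrans({"0": "o", "1": "l", "3": "e", "5": "s"})
KNOWN_BRANDS = {"facebook", "dropbox", "google"}


def homograph_features(name):
    """Return (binary flag, number of substituted characters) for a domain label."""
    name = name.lower()
    normalized = name.translate(LOOKALIKES)
    is_lookalike = normalized in KNOWN_BRANDS and normalized != name
    if not is_lookalike:
        return 0, 0
    substitutions = sum(1 for a, b in zip(name, normalized) if a != b)
    return 1, substitutions


result = homograph_features("faceb00k")
```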

Redirection

Both phishing and legitimate sites use redirection
Legitimate domain: increases the security level by redirecting http://a.com to https://a.com
- Usually only 1 or 2 redirections (they do not want to degrade user experience)
Phishing websites use redirection to avoid being detected
- Usually many redirections

Domain Features (Host-based features)

Observations from domain name registration

Many phishing sites are hosted on recently registered domains
- They live only for a short period of time
Legitimate domains are usually paid for several years in advance to provide stability

Since many sites are put up and taken down in a matter of hours, building a list with the traditional approach is of little use.

Compare WHOIS and Registrar of DNS record

WHOIS is an Internet service that provides info about a domain name/IP address
It returns registration details of a domain, such as creation date, updated date, expiry date, and registrar/DNS servers of the domain
By comparing with the DNS record, we can check whether the claimed identity matches the WHOIS database.
An empty registrar or name server field is suspicious

Chronological domain features

Domain age: time difference between the enquiry timestamp and the domain creation time
- Legitimate website >= 6 months
Domain registration length: time difference between the enquiry timestamp and the expiry date
- A registration length of only 1 year is suspicious
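
Given the creation and expiry dates (for example from a WHOIS lookup, which is not performed here), both chronological features reduce to date arithmetic. The function name, the 183-day approximation of 6 months and the example dates are assumptions.

```python
from datetime import date


def chronological_features(created, expires, queried):
    """Compute domain age and registration length in days, plus two flags."""
    age_days = (queried - created).days
    registration_days = (expires - queried).days
    return {
        "domain_age_days": age_days,
        "registration_length_days": registration_days,
        # Assumption: 6 months is approximated as 183 days.
        "young_domain": int(age_days < 183),
        "short_registration": int(registration_days <= 365),
    }


feats = chronological_features(
    created=date(2024, 1, 10), expires=date(2025, 1, 10), queried=date(2024, 3, 1)
)
```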

Page-Based Features

Page-Based Features (Popularity-based)

The popularity of a website is an indicator of its reliability
- Phishing websites live only for a short period of time, so they rarely become popular

Website Ranking

Find out the website ranking (e.g., from the Alexa database: Top 1 Million Sites)
Check whether a website is in Google’s index or not (i.e., listed by the Google search engine)
Number of links pointing to the page (Study: 98% of phishing pages have no links pointing to them)

PageRank

Tells how important a page is
- Relative importance of a page within a set of web pages
- Value between 0 and 1, where 1 means very important
- Phishing pages only have a low PageRank because of their short lifetime
- 95% of phishing pages have a PageRank of 0

Website Traffic

Web traffic is the amount of data sent and received by visitors to a website.
Useful measures, but hard to obtain for free (the data usually has to be paid for)
Number of Visits for the domain (daily, weekly or monthly)
- Average no of page views per visit
- Average visit duration

Content-Based Features

This information is quite computationally expensive to obtain

Content-based feature

Scan the target domain

Misspellings in the website, CSS formatting, HTML or JavaScript code
Such as:
- Favicon (graphic icon representing a website)
  - Is it loaded from a domain other than the one in the URL?
- Number of external request URLs (including images, videos, mp3, …)
  - Are they from a domain other than the one in the URL?
- Status bar customization
  - A fake URL is shown in the status bar to trick the user into clicking and being redirected
  - Is onMouseOver used to change the status bar?

Occurrence of suspicious keywords

Examine the content in the page

Other Features

Visual similarity-based approach

This information is very computationally expensive to obtain

Perform prediction based on a screen capture of suspicious pages
Check similarity between pages
Malicious pages do not always faithfully reproduce the look of the genuine page, which produces false negatives

Feature Engineering in Phishing detection

Some background knowledge of Feature Engineering

The features can be numerical or categorical data.

We need to convert all of them to numbers, since ML models only accept numerical data.

Numerical features include:

Age of domain, number of dots in URL, …
Length of the redirection chain

Note: numerical features usually have diverse ranges, so we may need to perform feature scaling (depending on the model).

For true/false features, we can use binarization to represent them.

Present or absent
- Is IP address used as domain {1, 0}
- Presence of homographic attack {1, 0}

Categorical features are harder to handle, such as:

Domain names

We usually encode categorical features.

Dealing with Numerical values: Feature Scaling

There are different types of scaling.

Min-Max Normalization

Scales values to lie between 0 and 1
We assume a linear relationship

$X_{\text {norm }}=\frac{X-X_{\min }}{X_{\max }-X_{\min }}$
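
As a minimal sketch of the formula above in plain Python (the function name and the sample URL lengths are made up for illustration):

```python
def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


url_lengths = [25, 50, 75, 525]
scaled = min_max_normalize(url_lengths)
```

A single very long URL compresses all the other values into a narrow band near 0, which is the main weakness of this scaling.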

Z Score Standardization

Scaling through the mean and standard deviation (less affected by outliers than min-max normalization)
This method is more statistically motivated
Most popular

$z=\frac{x-\mu}{\sigma}$
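
The same formula as a short plain-Python sketch, using the population standard deviation from the statistics module (the function name and sample values are assumptions):

```python
from statistics import mean, pstdev


def z_score(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: avoid division by zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


standardized = z_score([2, 4, 4, 4, 5, 5, 7, 9])
```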

Histogram Summarization

Using the histogram, divide numerical values into different levels
- Put values into intervals, for example 0-10, 10-20, 20-30, …
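
Binning against a list of interval edges can be sketched with the standard bisect module. The edges and the choice of half-open intervals (a value equal to an edge goes to the upper interval) are assumptions.

```python
from bisect import bisect_right


def to_level(value, edges):
    """Return the index of the interval that value falls into.

    With edges [10, 20, 30] the levels are:
    0 -> below 10, 1 -> [10, 20), 2 -> [20, 30), 3 -> 30 and above.
    """
    return bisect_right(edges, value)


edges = [10, 20, 30]
levels = [to_level(v, edges) for v in [3, 10, 25, 99]]
```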

Dealing with Categorical values: Encoding

There are different types of encoding.

Integer encoding

Each unique label is mapped to an integer
This naturally introduces a distance (ordering) assumption between labels

Example: Assign domain names to different labels: 1,2,3,4,5…

In Python, we can import OrdinalEncoder from the category_encoders package to perform integer encoding.
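
To show what integer encoding does without depending on an external package, here is a plain-Python sketch. It is not the category_encoders implementation; the function name and the sample domain names are made up.

```python
def integer_encode(labels):
    """Map each unique label to an integer (1, 2, 3, ...) in order of appearance."""
    mapping = {}
    codes = []
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping) + 1
        codes.append(mapping[label])
    return codes, mapping


codes, mapping = integer_encode(["a.com", "b.org", "a.com", "c.net"])
```

The codes imply that c.net (3) is "further" from a.com (1) than b.org (2) is, which is the distance assumption mentioned above.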

One-hot encoding

Each label is mapped to a binary vector
- Every pair of distinct labels is the same distance apart, so no ordering is implied
- Results in an increase in dimension
Most popular

Example: Map the colors R, G, B to 3 vectors

Color: {R, G, B}: 3 vectors
R vector: {1, 0, 0}
G vector: {0, 1, 0}
B vector: {0, 0, 1}

In Python, we can import OneHotEncoder from the category_encoders package to perform one-hot encoding.
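
The R, G, B example can be reproduced with a few lines of plain Python. This is a sketch of the idea rather than the category_encoders implementation, and the function name is an assumption.

```python
def one_hot_encode(labels, categories):
    """Map each label to a binary vector with a single 1 at its category index."""
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for label in labels:
        vec = [0] * len(categories)
        vec[index[label]] = 1
        vectors.append(vec)
    return vectors


vectors = one_hot_encode(["R", "G", "B"], categories=["R", "G", "B"])
```

The vector length equals the number of categories, so a feature with many distinct values (such as domain names) produces very wide vectors.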

Dealing with Missing Data

Some data might be missing, for example due to human error or privacy concerns.

There are generally 2 ways to handle this:

Drop data with missing values

This might decrease performance (reduction in training data size)

Imputation: estimate from context

Replace with the mean, median, mode, etc., depending on the context
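
Both options can be sketched in plain Python, with None marking a missing value. The function names and the sample domain ages are made up for illustration.

```python
from statistics import mean, median


def drop_missing(values):
    """Option 1: discard records with missing values."""
    return [v for v in values if v is not None]


def impute(values, strategy=mean):
    """Option 2: replace missing values with a statistic of the known ones."""
    fill = strategy(drop_missing(values))
    return [fill if v is None else v for v in values]


domain_age_days = [30, None, 400, 10, None]
dropped = drop_missing(domain_age_days)
filled = impute(domain_age_days, strategy=median)
```

Here the median (30) is a more typical fill value than the mean, which the single 400-day domain pulls upwards.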

Feature Engineering in Phishing detection

Features have to be converted to numerical values in machine learning-based phishing detection.

Checking how many “hyphens” in the URL

Can you give two ways to obtain the numerical feature of these characteristics?

Integer feature: count the number of hyphens in the URL
Binary feature: 1 if the number of hyphens is larger than a threshold T, otherwise 0
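
Both variants fit in one small function. The default threshold T = 3 and the example URL are assumptions chosen only for illustration.

```python
def hyphen_features(url, threshold=3):
    """Return (integer feature, binary feature) for hyphens in a URL."""
    count = url.count("-")
    return count, int(count > threshold)


features = hyphen_features("http://secure-login-update-account-info.example.com/")
```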

Check if there is a homographic attack

Can you give one example of the homographic attack and explain how you can obtain the numerical feature of these characteristics?

facebook -> faceb00k
Binary feature: presence (1) or absence (0)
Integer feature: number of occurrences (2 in this case)

Check the domain age

Can you give two ways to obtain the numerical feature of these characteristics?

Integer feature: calculate the difference between the enquiry date and the registration date
Histogram: choose thresholds and use a histogram to put the ages into intervals

Check the popularity of the website

Give the ways to define the popularity of a website. For each approach, explain how you can obtain the numerical feature of this characteristic.

PageRank: a number between 0 and 1
Google index: binary feature (1 = a record was found in Google search)
Alexa top one million sites: binary feature (1 = in the ranking)

Check the content in the phishing email

Explain how you can obtain the numerical feature of these characteristics.

Most phishing emails sound very urgent and often relate to logins, accounts, etc. You can identify some keywords (terms) and then compute the term frequency to characterize the content.
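
A minimal term-frequency sketch is shown below. The keyword list, the tokenization rule (lower-case runs of letters) and the sample email text are all assumptions.

```python
import re
from collections import Counter

# Hypothetical keyword list for illustration.
KEYWORDS = ["urgent", "login", "account", "verify", "suspended"]


def keyword_term_frequency(text, keywords=KEYWORDS):
    """Return, per keyword, its count divided by the total number of tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[k] / total for k in keywords]


email = "URGENT: your account is suspended. Login now to verify your account."
tf = keyword_term_frequency(email)
```

The result is a fixed-length numerical vector (one entry per keyword), which can be fed to a model like any other feature.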

More Details about Feature Engineering in Phishing detection

https://archive.ics.uci.edu/ml/datasets/Phishing+Websites#

ML vs DL methods in Phishing detection

ML approach:

Creating features through expert knowledge
The ML model then uses these features to recognize patterns embedded in the data for phishing detection

Deep learning approach:

Do not extract features
Learn representation from the URL’s character sequence directly to perform phishing detection

Deep Learning in Phishing detection

one possible structure:

Data cleaning
- Including: Remove http://, https://, www
Character embedding
- Note: we need to convert the URL into numbers.

Character embedding

Change “characters” to numbers
Example: one-hot encoding

One-hot encoding uses 0 and 1 to denote the presence or absence of the text and does not consider the relation between words
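
The cleaning step and a character-level one-hot encoding could look like the sketch below. The alphabet, the fixed length of 20 and the function names are assumptions; characters outside the alphabet and positions beyond the end of the URL are left as all-zero rows.

```python
import re
import string

# Hypothetical alphabet: lower-case letters, digits and common URL symbols.
ALPHABET = string.ascii_lowercase + string.digits + "-._/?=&%:@"
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}


def clean_url(url):
    """Lower-case the URL and strip a leading http://, https:// and www."""
    return re.sub(r"^(https?://)?(www\.)?", "", url.lower(), count=1)


def one_hot_chars(url, max_len=20):
    """Return a max_len x len(ALPHABET) matrix with one 1 per known character."""
    cleaned = clean_url(url)[:max_len]
    matrix = [[0] * len(ALPHABET) for _ in range(max_len)]
    for pos, ch in enumerate(cleaned):
        idx = CHAR_INDEX.get(ch)
        if idx is not None:
            matrix[pos][idx] = 1
    return matrix


matrix = one_hot_chars("https://www.faceb00k.com/login")
```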

Word embedding

Besides character embedding, we can use word embedding.

Words with similar meanings are represented using similar “numbers”
Word embeddings are often used in NLP (natural language processing)

One-hot encoding vs word embedding

Similarity:
- both methods convert text to numerical vectors
Difference:
- one-hot encoding uses 0 and 1 to denote the presence or absence of the text and does not consider the relation between words.
- word embedding uses a numerical vector to represent a text in such a way that words with similar meanings are closer to each other under the distance measure than words with dissimilar meanings.

1D Convolution layer (CNN)

Compute the output of neurons that are connected to local regions in the input
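
The "local region" idea can be shown with a plain-Python 1D convolution over a single channel. Stride 1, no padding and no bias are simplifying assumptions, and, as in most deep learning libraries, the kernel is applied without flipping (cross-correlation).

```python
def conv1d(signal, kernel):
    """Slide the kernel over the signal; each output sees only len(kernel) inputs."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


output = conv1d([1, 2, 3, 4, 5], kernel=[1, 0, -1])
```

Each output value depends on three neighbouring inputs only, and the same three weights are reused at every position.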


Fully Connected Layer (FC)

Compute the output of neurons based on all inputs
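
For contrast with a convolution, a fully connected layer without activation can be sketched as a weighted sum of all inputs per output neuron. The weights, biases and inputs below are arbitrary illustrative numbers.

```python
def fully_connected(inputs, weights, biases):
    """Each output neuron combines every input: sum(w * x) + b."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


outputs = fully_connected(
    inputs=[1, 2, 3],
    weights=[[1, 0, -1], [0.5, 0.5, 0.5]],  # one row of weights per output neuron
    biases=[0, 1],
)
```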

Feature Engineering in ML-based Phishing detection

Performance