Phishing

Phishing uses e-mails or malicious websites to solicit personal information from an individual or company by posing as a trustworthy organization or entity.

img

  • Social engineering

    • Psychological manipulation of a person to get useful and sensitive info from them
  • Used by criminals

    • Baiting: convincing the victim to reveal info by promising a reward or a gift
    • Impersonation: pretending to be someone else
    • Shoulder surfing: spying on other people’s machines from behind them, while they are typing
  • 1998: the term “phishing” is first used to refer to an online fraud scheme
  • 1999 to 2004: criminal organizations use this practice to target banks, and it becomes common worldwide
  • 2005: anti-phishing practices appear, especially with regard to authentication levels
    • Signature: flag a URL/site as phishing to protect future victims
  • 2006-2007: the term begins to describe combined phishing and malware attacks
  • 2008: approaches that use computational intelligence appear
  • 2009 to 2020: the trend moves to behavioural patterns, such as elements of the URL (protocol, path, query string, domain, subdomains)

Typical scenario of Phishing

  • Send out spoofed emails and put up deceptive websites to entice users to expose info. The spoofed emails usually purport to be from legitimate businesses and are intended to lead users to counterfeit websites that lure them into entering sensitive info
  • Software: HTTrack lets users duplicate entire websites for their own purposes (so it can also be used to clone a legitimate site)

Example:

2014, iCloud leaks of celebrity photos

  • Reason: phishing emails (seemed to be coming from Apple/Google) sent to the victims

Indicators

  • Visually appears like the original website
  • Email creates a sense of urgency to force user action
  • Fake HTTPS certificate and domain name

For example:

Visually similar name: Facebook.com becomes Faceb00k.com,

img

Fake domain name such as paypal@notice-access-273.com,

img

Destination address doesn’t match the context of the email,

img

Fake preview site that will redirect you to another site

img

https://www.google.com

They also try to create some kind of urgency to make the user fall for it easily.

Phishing detection

Traditional Phishing detection

List-based approach: make a list of known phishing sites

  • Make use of blacklist or whitelist

There are different lists run by different companies. We can send visited URLs to a central service to be checked.
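
As a rough illustration of the list-based idea, the sketch below checks a URL's host against a local blacklist and whitelist. The two sets and the function name `check_url` are made up for this example; real services such as those listed below maintain much larger, continuously updated lists behind their own APIs.

```python
from urllib.parse import urlparse

# Hypothetical lists, for illustration only.
BLACKLIST = {"faceb00k.com", "notice-access-273.com"}
WHITELIST = {"facebook.com", "paypal.com"}


def check_url(url: str) -> str:
    """Return 'phishing', 'legitimate', or 'unknown' based on list membership."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # treat www.example.com and example.com alike
    if host in BLACKLIST:
        return "phishing"
    if host in WHITELIST:
        return "legitimate"
    return "unknown"  # not reported yet: the list cannot decide
```

The `unknown` outcome is where a list-based approach has nothing to say, for example for a site that nobody has reported yet.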

Example:

  • Safe Browsing: maintained by Google, operates in Chrome, Firefox, Safari
  • PhishTank: maintained by OpenDNS, operates in Opera
  • SmartScreen: Microsoft (IE, Edge)
  • Opera 9.1 uses live blacklists from Phishtank as well as whitelists from GeoTrust

PhishTank

PhishTank uses a community-based method to publish a list.

Anyone can submit, verify, track and share phishing data. (Collaborative by nature)

  • Confirmation: voting to determine a verdict on the complaint (valid or invalid phishing)
    • PhishTank does not specify the number of votes needed for a URL to be considered malicious
  • Availability: platform looks at whether the phishing is online or offline

img

Drawbacks of Traditional Phishing detection

  • Has delay
    • Need users to report phishing websites manually
    • Human effort introduces delay
    • Cannot catch zero-day phishing (sites that have not been reported yet)
    • Has a window of vulnerability

Machine Learning Phishing detection

Machine learning approach: Make data-driven decisions at scale

  • Phishing detection is a classification problem (Phish or Not Phish)

We can use a supervised or an unsupervised approach

  • Supervised: Labeled data => develop a model to make accurate predictions on unseen data
  • Unsupervised: Unlabeled data => find common characteristics => groups/clusters

Process of ML

  • Information Sources and Data capturing
    • Normal URL/email and Phishing data (Select a tool to obtain them)
  • Data processing
    • Data cleaning, deal with missing values
  • Feature engineering
    • Find important features/attributes from the raw data
  • Feature Scaling and Selection
    • Normalize and scale features to prevent ML algorithms from being biased towards features with large ranges.
  • Decide which model to use

Phishing data collection:

  • We can get data from Alexa (a well-known ranking service)

Public datasets:

Features in Phishing detection

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.

Good feature = suitable representations of data

Types of features in Phishing detection:

  • URL-Based Features
    • Lexical feature
    • Domain/Host-Based Features
  • Page-Based Features
  • Content-Based Features

URL-Based Features

A very popular set of features used for phishing prediction

It has 2 types:

  • Lexical features (URL Features): describe textual properties of the URL (not the page content)
    • Length of the entire URL, number of dots in the URL
  • Host-based features (Domain Features): describe characteristics of the web site
    • where it is located, who manages it, when the site was set up

Some background info: URL anatomy

img

Nowadays, a URL that uses HTTPS is not necessarily more trustworthy.

  • domain name is something we need to register
  • subdomain (Third level domain) name can be decided by the operator of the network (easy to manipulate)

img

Some background info: Domain name: identifies the server that hosts the web page

  • The second-level domain and top-level domain together form the registered domain name (hard to manipulate)

  • The phisher can control the sub-domain name

  • The phisher can also control the path
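
The URL parts described above can be pulled apart with Python's standard `urllib.parse`. This is only a sketch: the function name `url_parts` is made up here, and the registered domain is assumed to be the last two labels of the host, which is wrong for suffixes such as `.co.uk`.

```python
from urllib.parse import urlparse, parse_qs


def url_parts(url: str) -> dict:
    """Split a URL into the parts discussed in the notes."""
    parsed = urlparse(url)
    host = parsed.hostname or ""  # hostname is returned in lower case
    labels = host.split(".") if host else []
    return {
        "protocol": parsed.scheme,
        "host": host,
        "top_level_domain": labels[-1] if labels else "",
        # Assumption: registered domain = last two labels of the host.
        "registered_domain": ".".join(labels[-2:]),
        "subdomains": labels[:-2],
        "path": parsed.path,
        "query": parse_qs(parsed.query),
    }
```

For `https://www.ABCBank.wxyz.com/login/index.html?user=1` the registered domain comes out as `wxyz.com`, while `www` and `abcbank` are only sub-domains chosen by the operator.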

img

URL Features (Lexical features)

Is IP address used as domain

Sometimes people do not register a domain name and just use the IP address.

  • Legitimate websites seldom use an IP address as the domain, because it degrades the user experience
  • URLs that use an IP address instead of a DNS name (hiding the actual URL and domain of the website) look suspicious
    • example: http://63.17.167.23/pc/verification.htm?=https://www.paypal.com/
  • But there are still legitimate uses of IP addresses, for internal private devices such as routers, networked printers etc

A URL that uses an IP address as its domain is more likely to be a phishing link.
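
A minimal sketch of this check as a binary feature, using the standard `ipaddress` module. The function name `uses_ip_as_domain` is an illustrative choice, not taken from any particular tool.

```python
import ipaddress
from urllib.parse import urlparse


def uses_ip_as_domain(url: str) -> int:
    """Return 1 if the host part of the URL is a literal IP address, else 0."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)  # raises ValueError for names such as paypal.com
        return 1
    except ValueError:
        return 0
```

Note that the query string of the example URL contains `www.paypal.com`, but only the host (`63.17.167.23`) is tested.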

Number of separators

  • 3 common separators: -, _, @
  • In a URL, the username/password precedes @ and the destination follows @
  • Legitimate websites seldom use a hyphen or underscore ( - or _ )

img

In part 1:

phishing records, hyphen: 8.33 occurrences per URL (higher than non-phishing)

In part 2:

separators are used more than in the first part

Hyphen: 15.37% of characters used in a malicious URL (7.6% in non-malicious)
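
Both kinds of statistic quoted above (occurrences per URL and share of characters) can be computed with plain string counting. This is a sketch with made-up names (`SEPARATORS`, `separator_features`); how a URL is split into "part 1" and "part 2" is not defined in these notes, so the function simply works on whatever string it is given.

```python
SEPARATORS = ("-", "_", "@")


def separator_features(text: str) -> dict:
    """Count each separator and its share of all characters in the text."""
    counts = {sep: text.count(sep) for sep in SEPARATORS}
    total = sum(counts.values())
    return {
        "counts": counts,
        "total": total,
        "share_of_characters": total / len(text) if text else 0.0,
    }
```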

URL size

  • Most URLs have between 25 and 50 characters
  • All URLs over 500 characters were found to be malicious

Number of sub-domains in URL

  • Phishers use sub-domains to lead users to believe, through careless observation, that the URL displayed in the browser is a legitimate domain
  • Example: facebook.edit.youraccount.com
  • Legitimate websites usually have no subdomain
  • There are some sub-domain keywords often used by phishers
    • We can create another feature that checks the sub-domain names against a list of such keywords.

Example:

www.ABCBank.wxyz.com is a phishing link and does not belong to ABCBank.

img

www.ABCBank.wxyz.com points to wxyz.com, rather than ABCBank.com. In interpreting a domain, the top level is .com. The domain name is wxyz.com. ABCBank is a specific server in the site wxyz.com. If the site really belonged to ABCBank, the URL would have the form xxx.ABCBank.com
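
The two sub-domain features (count and keyword match) could be sketched as follows. The keyword set is hypothetical, `www` is ignored as a sub-domain by choice, and the registered domain is again assumed to be the last two labels of the host.

```python
from urllib.parse import urlparse

# Hypothetical keyword list; a real one would come from observed phishing data.
SUSPICIOUS_KEYWORDS = {"abcbank", "facebook", "paypal", "login", "secure"}


def subdomain_features(url: str) -> dict:
    """Count sub-domains and flag known keywords appearing in them."""
    if "://" not in url:
        url = "http://" + url  # urlparse needs a scheme to find the host
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Assumption: the registered domain is the last two labels.
    subdomains = [label for label in labels[:-2] if label != "www"]
    return {
        "registered_domain": ".".join(labels[-2:]),
        "num_subdomains": len(subdomains),
        "keyword_in_subdomain": int(any(s in SUSPICIOUS_KEYWORDS for s in subdomains)),
    }
```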

URL with variables

  • the path or query string values can be manipulated
  • Longer path/query string is more likely a phishing link
  • Legitimate websites should have a small number of paths or query strings

Homographic attack

  • Phishers make use of substitutions, such as misspelled words and homographs, that may slip past an inattentive end user
  • Example: faceb00k, dr0pbox, goggle

Redirection

  • Both phishing and legitimate sites use redirection
  • Legitimate domain: increases the security level by redirecting http://a.com to https://a.com
    • Usually only has 1 or 2 redirections (they don't want to degrade the user experience)
  • Phishing website: uses redirection to avoid being detected
    • Usually has many redirections

Domain Features (Host-based features)

Observations from domain name registration

  • Many phishing sites are hosted on recently registered domains
    • These sites only live for a short period of time
  • Legitimate domains are regularly paid for several years to provide stability

Since many sites are put up and taken down in a matter of hours, building a list with the traditional approach is of little use.

Compare WHOIS and Registrar of DNS record

  • WHOIS is an Internet service that provides info about a domain name/IP address
  • It gives registration details of a domain such as creation date, updated date, expiry date, registrar/DNS servers of the domain
  • By comparing with the DNS record, we can check if the claimed identity is the same as in the WHOIS database.
  • An empty registrar or empty name servers is suspicious

Chronological domain features

  • Domain age: time difference between the enquired timestamp and domain creation time
    • Legitimate website >= 6 months
  • Domain registration length: time difference between the enquired timestamp and expiry date
    • Only 1 year means suspicious
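
A sketch of the two chronological features, assuming the creation and expiry dates have already been obtained (for example from a WHOIS lookup, which is not shown). The 183-day and 365-day cut-offs are one reading of the "6 months" and "1 year" rules of thumb above.

```python
from datetime import datetime


def chronological_features(created: datetime, expires: datetime, enquired: datetime) -> dict:
    """Domain age and remaining registration length, in days, plus two flags."""
    age_days = (enquired - created).days
    registration_days = (expires - enquired).days
    return {
        "domain_age_days": age_days,
        "registration_length_days": registration_days,
        # Assumed thresholds: younger than ~6 months, or at most ~1 year left.
        "young_domain": int(age_days < 183),
        "short_registration": int(registration_days <= 365),
    }
```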

Page-Based Features

Page-Based Features (Popularity-based)

  • The popularity of a website is an indicator of its reliability

    • Phishing websites only live for a short period of time, so they rarely become popular

Website Ranking

  • Find out website ranking (e.g., from Alexa database: Top 1 Million Site)
  • Examines whether a website is in Google’s index or not (i.e., listed by Google search engine)
  • Number of links pointing to the page (Study: 98% of phishing pages have no links pointing to them)

PageRank

  • Tells how important a page is
    • Relative importance of a page within a set of web pages
    • Value between 0 and 1; 1 means very important
    • Phishing pages only have a low PageRank because they are short-lived
    • 95% of phishing pages have a PageRank of 0

Website Traffic

  • Web traffic is the amount of data sent and received by visitors to a website.
  • Useful measures, but difficult to get from free services (the data usually has to be paid for)
  • Number of Visits for the domain (daily, weekly or monthly)
    • Average no of page views per visit
    • Average visit duration

Content-Based Features

It is quite computationally expensive to get this information

Content-based feature

Scan the target domain

  • Mis-spellings in website, CSS formatting, HTML or JavaScript code
  • Such as :
    • Favicon (graphic icon representing a website)
      • Is it loaded from a domain other than the one in URL?
    • Number of external request URLs (including images, videos, mp3, …)
      • Are they from a domain other than the one in the URL?
    • Status bar customization
      • Shows a fake URL in the status bar to trick the user into clicking a link that redirects elsewhere
      • Does the page use onMouseOver to change the status bar?
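
The checks in this list can be approximated with the standard `html.parser` module. This is only a sketch under simplifying assumptions: requested resources are taken to be tags with a `src` attribute, a favicon is any `link` tag whose `rel` contains `icon`, and the status bar check is a plain text search for `onmouseover`. The names `ResourceCollector` and `content_features` are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ResourceCollector(HTMLParser):
    """Collect URLs of requested resources (src attributes) and favicon links."""

    def __init__(self, page_url: str):
        super().__init__()
        self.page_url = page_url
        self.resources = []
        self.favicons = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("src"):
            # Resolve relative paths against the page URL.
            self.resources.append(urljoin(self.page_url, attrs["src"]))
        rel = (attrs.get("rel") or "").lower()
        if tag == "link" and "icon" in rel and attrs.get("href"):
            self.favicons.append(urljoin(self.page_url, attrs["href"]))


def content_features(page_url: str, html: str) -> dict:
    """Derive simple content-based features from a page's HTML."""
    page_host = urlparse(page_url).hostname
    collector = ResourceCollector(page_url)
    collector.feed(html)
    external = [u for u in collector.resources if urlparse(u).hostname != page_host]
    return {
        "num_requests": len(collector.resources),
        "num_external_requests": len(external),
        "external_favicon": int(
            any(urlparse(u).hostname != page_host for u in collector.favicons)
        ),
        "uses_onmouseover": int("onmouseover" in html.lower()),
    }
```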

Occurrence of suspicious keywords

  • Examine the content in the page

img

Other Features

Visual similarity-based approach

It is very computationally expensive to get this information

  • Perform prediction based on a screen capture of suspicious pages
  • Check similarity between pages
  • Some malicious pages do not always faithfully reproduce the look of the genuine page and produce false negatives

Feature Engineering in Phishing detection

Some background knowledge of Feature Engineering

The features can be numerical or categorical data.

  • We need to convert them all to numbers, since ML models only accept numerical data.

Numerical values features include:

  • Age of domain, number of dots in URL, …
  • Length of the redirection chain

Note: numerical features usually have diverse ranges, so we may need to perform feature scaling (depending on the model).

For True/False features, we can use binarization to represent them.

  • Present or absent
    • Is IP address used as domain {1, 0}
    • Presence of homographic attack {1, 0}

Categorical features are harder to handle, such as:

  • Domain names

We usually apply encoding to categorical features.

Dealing with Numerical values: Feature Scaling

We can use different types of scaling.

Min-Max Normalization

  • Scale values to lie between 0 and 1
  • We assume a linear relationship

$$X_{\text{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$
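
A plain-Python sketch of min-max normalization. The function name and the choice to return all zeros when every value is equal (where the formula would divide by zero) are assumptions made for this example.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]
```

For example, URL lengths of 10, 20 and 30 characters become 0.0, 0.5 and 1.0.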

Z Score Standardization

  • Scale using the mean and standard deviation (less affected by outliers than min-max)
  • This method has a stronger statistical basis
  • Most popular

$$z = \frac{x - \mu}{\sigma}$$
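
A plain-Python sketch of z-score standardization using the standard `statistics` module. It uses the population standard deviation (`pstdev`); using the sample standard deviation instead is an equally common choice.

```python
from statistics import mean, pstdev


def z_score_standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # all values identical
    return [(v - mu) / sigma for v in values]
```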

Histogram Summarization

  • Using the histogram, divide numerical values into different levels
    • Put values into intervals, for example 0-10, 10-20, 20-30…
img img
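
Putting a numerical value into an interval can be sketched with the standard `bisect` module. The interval edges are chosen by hand here; in practice they would be read off the histogram.

```python
from bisect import bisect_right


def to_level(value, edges):
    """Return the index of the interval that value falls into.

    With edges [10, 20, 30]: values below 10 give 0, [10, 20) gives 1,
    [20, 30) gives 2, and 30 or more gives 3.
    """
    return bisect_right(edges, value)
```

The same helper works for other features, for example domain age in days with edges such as `[30, 183, 365]` (an illustrative choice).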

Dealing with Categorical values: Encoding

We can use different types of encoding.

Integer encoding

  • Each unique label is mapped to an integer
  • Naturally introduces an ordering/distance assumption between labels (e.g. label 1 looks closer to 2 than to 5)

Example: Assign domain names into different labels: 1,2,3,4,5…

In Python, we can import OrdinalEncoder from the category_encoders package to perform integer encoding.

img
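
To show the idea without any extra package, here is a plain-Python sketch of integer encoding. It is not the category_encoders implementation; the function name and the choice to number labels from 1 in order of first appearance are assumptions.

```python
def integer_encode(labels):
    """Map each unique label to an integer, numbered from 1 in order of appearance."""
    mapping = {}
    encoded = []
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping) + 1
        encoded.append(mapping[label])
    return encoded, mapping
```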

One-hot encoding

  • Each label is mapped to a binary vector
    • So the distance between any two different labels is the same (no ordering is implied)
    • Results in an increase in dimension
  • Most popular

Example: Assign color R,G,B into 3 vectors

  • Color: {R, G, B}: 3 vectors
  • R vector: {1, 0, 0}
  • G vector: {0, 1, 0}
  • B vector: {0, 0, 1}

In Python, we can import OneHotEncoder from the category_encoders package to perform one-hot encoding.

img
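
The R, G, B example can be reproduced with a few lines of plain Python. This is a sketch of the idea only, not the category_encoders implementation; the function name is made up.

```python
def one_hot_encode(labels, categories):
    """Map each label to a binary vector with a single 1 at its category's position."""
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for label in labels:
        vector = [0] * len(categories)
        vector[index[label]] = 1
        vectors.append(vector)
    return vectors
```

The vector length equals the number of categories, which is why one-hot encoding increases the dimension.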

Dealing with Missing Data

Some data might be missing, for example due to human errors or privacy concerns.

There are generally 2 ways to solve:

Drop data with missing values

  • might decrease performance (reduction in training data size)

Imputation: estimate from context

  • Replace by the mean, median, mode etc, depending on the context
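
Both options can be sketched in plain Python, assuming missing values are represented by `None`. The function names are illustrative.

```python
from statistics import mean, median


def drop_missing(rows):
    """Keep only the rows that contain no missing value."""
    return [row for row in rows if None not in row]


def impute(values, strategy="mean"):
    """Replace missing values in one feature column by its mean or median."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]
```

The median is less sensitive to extreme values, which is one reason the choice depends on the context.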

Feature Engineering in Phishing detection

Features have to be converted to numerical values in machine learning-based phishing detection.

Checking how many “hyphens” in the URL

Can you give two ways to obtain the numerical feature of these characteristics?

  • Integer feature: counting number of hyphens in the URL
  • Binary feature: If number of hyphens is larger than T, =1, otherwise =0

Check if there is a homographic attack

Can you give one example of the homographic attack and explain how you can obtain the numerical feature of these characteristics?

  • facebook -> faceb00k
  • Binary feature: present (1) or absent (0)
  • Integer feature: number of occurrences (2 in this case)
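
A minimal sketch of both features for digit-for-letter substitutions. The substitution table and the brand list are hypothetical and far from complete, and the function name is made up.

```python
# Hypothetical look-alike table and brand list, for illustration only.
LOOKALIKES = str.maketrans({"0": "o", "1": "l", "3": "e", "5": "s"})
KNOWN_BRANDS = {"facebook", "dropbox", "google"}


def homograph_features(name: str) -> dict:
    """Binary and integer homograph features for a single domain label."""
    lowered = name.lower()
    normalized = lowered.translate(LOOKALIKES)
    # Count positions where a look-alike character was replaced.
    substitutions = sum(1 for a, b in zip(lowered, normalized) if a != b)
    is_attack = normalized in KNOWN_BRANDS and lowered not in KNOWN_BRANDS
    return {
        "present": int(is_attack),
        "substitutions": substitutions if is_attack else 0,
    }
```

This only catches character substitutions such as faceb00k. A misspelling such as goggle contains no look-alike characters and would need a different test, for example an edit-distance comparison against the brand list.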

Check the domain age

Can you give two ways to obtain the numerical feature of these characteristics?

  • Integer feature: calculate the difference between the enquiry date and the registration date
  • Histogram: choose thresholds and use the histogram to put the ages into intervals

Check the popularity of the website

Give the ways to define the popularity of a website. For each approach, explain how you can obtain the numerical feature of these characteristics.

  • PageRank: a number between 0 and 1
  • Google index: binary feature (1 = a record is found in Google search)
  • Alexa top one million sites: binary feature (1 = in the ranking)

Check the content in the phishing email

Explain how you can obtain the numerical feature of these characteristics.

  • Check the content of the phishing email. Most phishing emails sound very urgent and may be related to login, account etc. You can identify some keywords (terms) and then obtain the term frequency to characterize the content.
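
A sketch of the term-frequency idea. The keyword list is a hypothetical choice, and term frequency is taken here as the keyword count divided by the total number of words in the email.

```python
import re
from collections import Counter

# Hypothetical keyword list chosen for illustration.
KEYWORDS = ("urgent", "login", "account", "verify", "password")


def keyword_term_frequency(text: str, keywords=KEYWORDS) -> list:
    """Return one term frequency per keyword, in the order of the keyword list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty text
    return [counts[k] / total for k in keywords]
```

The resulting list has a fixed length, so it can be used directly as a numerical feature vector.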

More Details about Feature Engineering in Phishing detection

https://archive.ics.uci.edu/ml/datasets/Phishing+Websites#

ML vs DL methods in Phishing detection

ML approach:

  • Creating features through expert knowledge
  • The ML model then uses these features to recognize patterns embedded in the data for phishing detection

Deep learning approach:

  • Do not extract features
  • Learn representation from the URL’s character sequence directly to perform phishing detection

img

Deep Learning in Phishing detection

one possible structure:

img

  • Data cleaning
    • Including: Remove http://, https://, www
  • Character embedding
    • Note: we need to convert the URL into numbers.

Character embedding

  • Change “characters” to numbers
  • Example: one-hot encoding
img
  • One-hot encoding uses 0 and 1 to denote the presence/absence of a piece of text and does not consider the relation between words
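
The cleaning step and a one-hot character encoding could look like the sketch below. The alphabet, the fixed length `max_len`, and the decisions to truncate long URLs, pad short ones with all-zero rows and leave unknown characters as all-zero rows are assumptions made for this example.

```python
import string

# Assumed alphabet: lower-case letters, digits and common URL punctuation.
ALPHABET = string.ascii_lowercase + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def clean_url(url: str) -> str:
    """Lower-case the URL and remove a leading http://, https:// and www."""
    url = url.lower()
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
    if url.startswith("www."):
        url = url[4:]
    return url


def one_hot_characters(url: str, max_len: int = 20) -> list:
    """Return a max_len x len(ALPHABET) matrix with one row per character."""
    matrix = [[0] * len(ALPHABET) for _ in range(max_len)]
    for pos, ch in enumerate(clean_url(url)[:max_len]):
        if ch in CHAR_INDEX:
            matrix[pos][CHAR_INDEX[ch]] = 1
    return matrix
```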

Word embedding

Besides Character embedding, we can use a Word embedding.

  • Words with similar meaning would be represented using similar “numbers”
  • Word embeddings are often used in NLP (Natural language processing)
img

One-hot encoding vs word embedding

  • Similarity:
    • both methods convert text to numerical vectors
  • Difference:
    • One-hot encoding uses 0 and 1 to denote the presence/absence of a piece of text and does not consider the relation between words.
    • word-embedding uses a numerical vector to represent a text in such a way that words with similar meaning are closer to each other in the distance measure as compared to words with dissimilar meaning.

1D Convolution layer (CNN)

  • Compute the output of neurons that are connected to local regions in the input
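
The "local regions" idea can be shown with a tiny pure-Python 1D convolution. As in most deep learning libraries, the kernel is slid over the input without being flipped (strictly a cross-correlation). Stride 1, no padding and a single channel are simplifying assumptions.

```python
def conv1d(signal, kernel, bias=0.0):
    """Each output is a weighted sum over a window of len(kernel) neighbouring inputs."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k)) + bias
        for i in range(len(signal) - k + 1)
    ]
```

With a kernel of size 3, each output depends on only 3 neighbouring inputs, and the same kernel weights are reused at every position.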

img

Animated:

img

Fully Connected Layer (FC)

  • Compute the output of neurons based on all inputs
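
By contrast with a convolution, every output of a fully connected layer depends on every input. A minimal pure-Python sketch (no activation function, names chosen for illustration):

```python
def fully_connected(inputs, weights, biases):
    """One output per weight row: the dot product with all inputs, plus a bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]
```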

Performance

img