Feature Engineering in ML-based Phishing detection
Phishing
Uses e-mails or malicious websites to solicit personal information from an individual or company by posing as a trustworthy organization or entity
- Social engineering
- Psychological manipulation of a person to get useful and sensitive info from them
- Used by criminals
- Baiting: convincing the victim to reveal info by promising them a reward or a gift
- Impersonation: pretending to be someone else
- Shoulder surfing: spying on other people’s machines from behind them, while they are typing
- 1998: the term “phishing” is first used to refer to an online fraud scheme
- 1999 to 2004: criminal organizations use this practice to target banks, and it becomes common at a global level
- 2005: anti-phishing practices appear, especially with regard to authentication levels
- Signature: flagging a URL/site as phishing protects future victims
- 2006-2007: the term begins to describe combined phishing and malware attacks
- 2008: approaches that use computational intelligence appear
- 2009 to 2020: the trend is towards behavioural patterns, such as the elements of a URL (protocol, path, query string, domain, subdomains)
Typical scenario of Phishing
- Attackers send out spoofed emails and put up deceptive websites to entice users to expose info. The spoofed emails usually purport to be from legitimate businesses and are intended to lead users to counterfeit websites that lure them into entering sensitive info
- Software such as HTTrack lets users duplicate entire websites for their own purposes
Example:
2014, iCloud leaks of celebrity photos
- Cause: phishing emails (which appeared to come from Apple/Google) were sent to the victims
Indicators
- Visually appears like the original website
- Email creates a sense of urgency to force user action
- Fake HTTPS certificate and domain name
For example:
- Visually similar name: Facebook.com becomes Faceb00k.com
- Fake domain name such as paypal@notice-access-273.com
- Destination address doesn’t match the context of the email
- Fake preview site that will redirect you to another site
They also try to create a sense of urgency so that the user falls for it more easily.
Phishing detection
Traditional Phishing detection
List-based approach: make a list of known phishing sites
- Make use of a blacklist or whitelist
There are different lists run by different companies. We can send visited URLs to a central service to be checked.
Example:
- Safe Browsing: maintained by Google, operates in Chrome, Firefox, Safari
- PhishTank: maintained by OpenDNS, operates in Opera
- SmartScreen: Microsoft (IE, Edge)
- Opera 9.1 uses live blacklists from PhishTank as well as whitelists from GeoTrust
PhishTank
PhishTank uses a community method to publish a list.
Anyone can submit, verify, track and share phishing data. (A collaborative nature)
- Confirmation: voting to determine a verdict on the complaint (valid or invalid phishing)
- PhishTank does not specify the number of votes needed for a URL to be considered malicious
- Availability: the platform looks at whether the phishing site is online or offline
Drawbacks of Traditional Phishing detection
- Has delay
- Need users to report phishing websites manually
- Human effort introduces delay
- zero-day phishing
- Has a window of vulnerability
Machine Learning Phishing detection
Machine learning approach: make data-driven decisions at scale
- Phishing detection is a classification problem (Phish or Not Phish)
We can use a supervised or an unsupervised approach
- Supervised: labeled data => develop a model to make accurate predictions on unseen data
- Unsupervised: unlabeled data => find groups/clusters with common characteristics
Process of ML
- Information Sources and Data capturing
- Normal URL/email and Phishing data (Select a tool to obtain them)
- Data processing
- Data cleaning, deal with missing values
- Feature engineering
- Find important features/attributes from the raw data
- Feature Scaling and Selection
- Normalize and scale features to prevent ML algorithms from becoming biased.
- Decide which model to use
Phishing data collection:
- We can get legitimate sites from Alexa (a well-known ranking service)
Public datasets:
- https://archive.ics.uci.edu/ml/datasets/phishing+websites#
- from the Univ of California, Irvine Machine Learning Repository
- 4898 legitimate, 6157 are phishing
- https://data.mendeley.com/datasets/h3cgnj8hft/1/
- 5000 phishing webpages (PhishTank, OpenPhish)
- 5000 legitimate webpages (Alexa, Common Crawl)
- 48 features
- https://research.aalto.fi/en/datasets/phishstorm--phishing--legitimate-url-dataset(f49465b2-c68a-4182-9171-075f0ed797d5).html
- 96018 URLs: 48009 legitimate and 48009 phishing URLs
Features in Phishing detection
Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.
Good feature = suitable representations of data
Types of features in Phishing detection:
- URL-Based Features
- Lexical feature
- Domain/Host-Based Features
- Page-Based Features
- Content-Based Features
URL-Based Features
A very popular set of features used for phishing prediction
It has 2 types:
- Lexical features (URL Features): describe textual properties of the URL (not the page content)
- Length of the entire URL, number of dots in the URL
- Host-based features (Domain Features): describe characteristics of the web site
- Where it is located, who manages it, when the site was set up
Some background info: URL anatomy
Nowadays, using HTTPS does not mean a site is more secure.
- The domain name is something we need to register
- The subdomain (third-level domain) name can be decided by the operator of the network (easy to manipulate)
Some background info: Domain name: identifies the server that hosts the web page
- The second-level domain and top-level domain form the registered domain name (hard to manipulate)
- A phisher has control over the sub-domain name
- A phisher can also control the path
URL Features (Lexical features)
Is IP address used as domain
Sometimes people do not register a domain name, so they use the IP address only.
- Legitimate websites seldom use an IP address as the domain, because it degrades the user experience
- URLs that use an IP address instead of a DNS name (hiding the actual URL and domain of the website) look suspicious
- Example: http://63.17.167.23/pc/verification.htm?=https://www.paypal.com/
- But there are still legitimate websites that use IP addresses for internal private devices such as routers, networked printers etc
A link that uses an IP address as the domain is more likely to be a phishing link.
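The check above can be turned into a binary feature. The following is a minimal sketch using only the Python standard library; the function name `uses_ip_as_domain` is an illustrative choice, not something defined in these notes.

```python
import ipaddress
from urllib.parse import urlsplit

def uses_ip_as_domain(url: str) -> int:
    """Return 1 if the host part of the URL is a literal IP address, else 0."""
    # hostname drops the port, any user info and IPv6 brackets
    host = urlsplit(url).hostname
    if host is None:
        # No host could be parsed (for example, the scheme is missing)
        return 0
    try:
        ipaddress.ip_address(host)
        return 1
    except ValueError:
        return 0

print(uses_ip_as_domain("http://63.17.167.23/pc/verification.htm?=https://www.paypal.com/"))
print(uses_ip_as_domain("https://www.paypal.com/"))
```

Note that this sketch expects a full URL including the scheme; a bare string such as `63.17.167.23/login` would need `http://` prepended first.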
Number of separators
- 3 common separators: -, _, @
- The username/password precedes @, and the destination URL follows @
- Legitimate websites seldom use a hyphen or underscore (- or _)
In part 1: in phishing records, the hyphen has 8.33 occurrences per URL (higher than non-phishing)
In part 2: separators are used more than in the first part
Hyphen: 15.37% of characters used in a malicious URL (7.6% in non-malicious)
URL size
- Most URLs have between 25 and 50 characters
- All URLs over 500 characters were found to be malicious
Number of sub-domains in URL
- Phishers use sub-domains to lead users to believe, through careless observation, that the URL displayed in the browser is a legitimate domain
- Example: facebook.edit.youraccount.com
- Legitimate websites usually have no subdomain
- Some subdomain keywords are often used by phishers
- We can create another feature that checks the sub-domain names against a list of such keywords.
Example: www.ABCBank.wxyz.com is a phishing link and does not belong to the ABCBank. www.ABCBank.wxyz.com points to wxyz.com, rather than ABCBank.com. In interpreting a domain, the top level is .com. The domain name is wxyz.com. ABCBank is a specific server in the site wxyz.com. If it is ABCBank, the domain name should be xxx.ABCBank.com
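The way the domain in the example is interpreted can be sketched in code. This is a simplified illustration that assumes the registered domain is always the last two labels of the host name; that assumption fails for suffixes such as `.co.uk`, where a public-suffix list would be needed. The function names and the choice to ignore `www` are hypothetical.

```python
from urllib.parse import urlsplit

def split_host(url: str):
    """Split the host into (registered domain, list of sub-domain labels).

    Assumption: the registered domain is the last two labels (e.g. wxyz.com).
    """
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    registered = ".".join(labels[-2:])
    subdomains = labels[:-2]
    return registered, subdomains

def count_subdomains(url: str, ignore=("www",)) -> int:
    """Count sub-domain labels, skipping conventional ones such as www."""
    _, subdomains = split_host(url)
    return len([label for label in subdomains if label not in ignore])

print(split_host("http://www.ABCBank.wxyz.com/login"))
print(count_subdomains("http://facebook.edit.youraccount.com"))
```

The host name is lower-cased by `urlsplit`, so `ABCBank` appears as `abcbank` in the result.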
URL with variables
- the path or query string values can be manipulated
- A longer path/query string makes a link more likely to be a phishing link
- Legitimate websites usually have short paths and few query strings
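The URL size and the path/query observations can be captured as a handful of integer features. Below is a minimal sketch with the standard library; the feature names in the returned dictionary are illustrative choices.

```python
from urllib.parse import urlsplit, parse_qsl

def url_shape_features(url: str) -> dict:
    """Lexical size features of a URL: overall length, path and query sizes."""
    parts = urlsplit(url)
    return {
        "url_length": len(url),
        "path_length": len(parts.path),
        # Number of non-empty path segments, e.g. /a/b/login.php has 3
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
        "query_length": len(parts.query),
        "num_query_params": len(parse_qsl(parts.query, keep_blank_values=True)),
    }

print(url_shape_features("http://example.com/a/b/login.php?user=1&token=abc"))
```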
Homographic attack
- Phishers make use of substitutions, such as misspelled words and homographs, that may slip past an inattentive end user
- Example: faceb00k, dr0pbox, goggle
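One simple way to detect digit-for-letter substitutions is to map look-alike characters back to letters and compare the result with a list of protected brand names. The sketch below is only an illustration: the `LOOKALIKES` table and the `BRANDS` set are hypothetical and far from complete, and plain misspellings such as goggle are not caught (they would need something like an edit-distance check).

```python
# Hypothetical look-alike table: digit -> the letter it imitates
LOOKALIKES = str.maketrans({"0": "o", "1": "l", "3": "e", "5": "s"})
# Hypothetical list of brand names to protect
BRANDS = {"facebook", "dropbox", "google"}

def homograph_features(label: str):
    """Return (is_homograph, number_of_substituted_characters) for a domain label."""
    label = label.lower()
    normalized = label.translate(LOOKALIKES)
    if normalized in BRANDS and normalized != label:
        substitutions = sum(a != b for a, b in zip(label, normalized))
        return 1, substitutions
    return 0, 0

print(homograph_features("faceb00k"))
print(homograph_features("facebook"))
```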
Redirection
- Both phishing and legitimate sites use redirection
- Legitimate domain: increases the security level by redirecting http://a.com to https://a.com
- Usually has only 1 or 2 redirections (they don't want to degrade the user experience)
- Phishing website: uses redirection to avoid being detected
- Usually has many redirections
Domain Features (Host-based features)
Observations from domain name registration
- Many phishing sites are hosted on recently registered domains
- They only live for a short period of time
- Legitimate domains are regularly paid for several years in advance to provide stability
Since many sites are put up and taken down in a matter of hours, it is of little use to build a list using the traditional approach.
Compare WHOIS and Registrar of DNS record
- WHOIS is an Internet service that provides info about a domain name/IP address
- Registration details of a domain such as creation date, updated date, expiry date, registrar/DNS servers of the domain
- By comparing with the DNS record, we can check whether the claimed identity is the same as in the WHOIS database.
- An empty registrar or empty name servers are suspicious
Chronological domain features
- Domain age: time difference between the enquired timestamp and domain creation time
- Legitimate website >= 6 months
- Domain registration length: time difference between the enquired timestamp and expiry date
- Only 1 year means suspicious
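Both chronological features are simple date differences. The sketch below assumes the creation and expiry dates have already been obtained (for example from a WHOIS lookup, which is not shown); the function names and the 183-day approximation of "6 months" are assumptions made for illustration.

```python
from datetime import date

def domain_age_days(created: date, queried: date) -> int:
    """Days between domain creation and the time of the enquiry."""
    return (queried - created).days

def registration_length_days(expires: date, queried: date) -> int:
    """Days between the time of the enquiry and the expiry date."""
    return (expires - queried).days

def is_young_domain(created: date, queried: date, min_days: int = 183) -> int:
    """Binary feature: 1 if the domain is younger than about 6 months."""
    return 1 if domain_age_days(created, queried) < min_days else 0

print(domain_age_days(date(2023, 1, 1), date(2023, 3, 1)))
print(is_young_domain(date(2023, 1, 1), date(2023, 3, 1)))
```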
Page-Based Features
Page-Based Features (Popularity-based)
- The popularity of a website is an indicator of its reliability
- Phishing websites only live for a short period of time, so they rarely become popular
Website Ranking
- Find out the website ranking (e.g., from the Alexa database: Top 1 Million Sites)
- Examine whether a website is in Google’s index or not (i.e., listed by the Google search engine)
- Number of links pointing to the page (Study: 98% of phishing pages have no links pointing to them)
PageRank
- Tells how important a page is
- Relative importance of a page within a set of web pages
- Value between 0 and 1, where 1 means very important
- Phishing pages only have a low PageRank because they are short-lived
- 95% of phishing pages have a PageRank of 0
Website Traffic
- Web traffic is the amount of data sent and received by visitors to a website.
- Useful measures, but free services are hard to find (the data usually has to be paid for)
- Number of Visits for the domain (daily, weekly or monthly)
- Average no of page views per visit
- Average visit duration
Content-Based Features
It is quite computationally expensive to get this information
Content-based feature
Scan the target domain
- Mis-spellings in website, CSS formatting, HTML or JavaScript code
- Such as :
- Favicon (graphic icon representing a website)
- Is it loaded from a domain other than the one in URL?
- No of external request URLs (including images, videos, mp3, …)
- are they from a domain other than the one in URL?
- Status bar customization
- Shows a fake URL in the status bar to trick the user into clicking a link that redirects elsewhere
- Does the page use onMouseOver to change the status bar?
Occurrence of suspicious keywords
- Examine the content in the page
Other Features
Visual similarity-based approach
Very computationally expensive to get this information
- Perform prediction based on a screen capture of suspicious pages
- Check similarity between pages
- Malicious pages do not always faithfully reproduce the look of the genuine page, which produces false negatives
Feature Engineering in Phishing detection
Some background knowledge of Feature Engineering
The features can be numerical or categorical data.
- We need to convert all of them to numbers, since ML models only accept numerical data.
Numerical features include:
- Age of domain, number of dots in URL, …
- Length of the redirection chain
Note: numerical features usually have diverse ranges, so we may need to perform feature scaling (depending on the model).
For true/false features, we can use binarization to represent them.
- Present or absent
- Is IP address used as domain {1, 0}
- Presence of homographic attack {1, 0}
Categorical features are harder to handle, such as:
- Domain names
We usually apply encoding to categorical features.
Dealing with Numerical values: Feature Scaling
We can use different types of scaling.
Min-Max Normalization
- Scale the values to lie between 0 and 1
- We assume a linear relationship
Z Score Standardization
- Scale using the mean and standard deviation (less sensitive to outliers than min-max)
- This method has a more statistical interpretation
- Most popular
Histogram Summarization
- Using the histogram, divide numerical values into different levels
- Put values into intervals, for example 0-10, 10-20, 20-30…
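The three scaling approaches can be sketched in a few lines of plain Python. This is a minimal illustration, not a replacement for a library scaler: the function names are illustrative, the z-score uses the population standard deviation, and the bin width of 10 simply follows the 0-10, 10-20 example.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def histogram_bin(value, width=10):
    """Map a value to its interval index: 0 for [0, 10), 1 for [10, 20), ..."""
    return int(value // width)

print(min_max_scale([10, 20, 30]))
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
print(histogram_bin(25))
```

In practice the minimum, maximum, mean and standard deviation would be computed on the training data only and then reused for unseen data.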
Dealing with Categorical values: Encoding
We can use different types of encoding.
Integer encoding
- Each unique label is mapped to an integer
- Naturally introduces an ordering and distance assumption between labels
Example: Assign domain names into different labels: 1,2,3,4,5…
In Python, we can import the OrdinalEncoder class from the category_encoders package to perform integer encoding.
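To make the idea concrete without relying on any library, integer encoding can be sketched with a plain dictionary. The function names and the choice of 0 for labels not seen during fitting are assumptions made for this illustration.

```python
def fit_integer_encoding(labels):
    """Assign 1, 2, 3, ... to labels in order of first appearance."""
    mapping = {}
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping) + 1
    return mapping

def integer_encode(label, mapping, unknown=0):
    """Look up a label; unseen labels get the `unknown` code."""
    return mapping.get(label, unknown)

mapping = fit_integer_encoding(["a.com", "b.org", "a.com", "c.net"])
print(mapping)
print(integer_encode("b.org", mapping), integer_encode("z.io", mapping))
```

The codes 1, 2, 3 suggest that `a.com` is "closer" to `b.org` than to `c.net`, which is the distance assumption mentioned above.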
One-hot encoding
- Each label is mapped to a binary vector
- Every component is either 0 or 1, so all distinct labels are equally far apart and no ordering is implied
- Results in an increase in dimension
- Most popular
Example: Assign color R,G,B into 3 vectors
- Color: {R, G, B}: 3 vectors
- R vector: {1, 0, 0}
- G vector: {0, 1, 0}
- B vector: {0, 0, 1}
In Python, we can import the OneHotEncoder class from the category_encoders package to perform one-hot encoding.
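The R, G, B example can be reproduced with a short plain-Python sketch. The function name `one_hot` is an illustrative choice, and the category order is assumed to be fixed in advance.

```python
def one_hot(label, categories):
    """Return a binary vector with a single 1 at the position of `label`."""
    return [1 if label == category else 0 for category in categories]

COLORS = ["R", "G", "B"]
for color in COLORS:
    print(color, one_hot(color, COLORS))
```

A label that is not in `categories` yields an all-zero vector in this sketch, and the vector length grows with the number of categories, which is the increase in dimension noted above.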
Dealing with Missing Data
Some data might be missing, for example due to human error or privacy concerns.
There are generally 2 ways to deal with this:
Drop data with missing values
- Might decrease performance (reduction in training data size)
Imputation: estimate from context
- Replace by the mean, median, mode etc., depending on the context
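Imputation by mean or median can be sketched as follows. The sketch assumes missing values are represented as `None` and that at least one value is observed; the function name and the default strategy are illustrative.

```python
from statistics import mean, median

def impute(values, strategy="median"):
    """Replace None entries with the median (or mean) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if strategy == "median" else mean(observed)
    return [fill if v is None else v for v in values]

print(impute([10, None, 30, 50]))
print(impute([10, None, 30], strategy="mean"))
```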
Feature Engineering in Phishing detection
Features have to be converted to numerical values in machine learning-based phishing detection.
Checking how many hyphens are in the URL
Can you give two ways to obtain the numerical feature of these characteristics?
- Integer feature: count the number of hyphens in the URL
- Binary feature: 1 if the number of hyphens is larger than a threshold T, otherwise 0
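Both answers can be written as one-line functions. In this sketch the threshold value of 2 is an arbitrary placeholder for T, which would normally be chosen from the data.

```python
def hyphen_count(url: str) -> int:
    """Integer feature: number of hyphens in the URL."""
    return url.count("-")

def many_hyphens(url: str, threshold: int = 2) -> int:
    """Binary feature: 1 if the hyphen count is larger than the threshold."""
    return 1 if hyphen_count(url) > threshold else 0

url = "http://secure-login-paypal-update.example.com"
print(hyphen_count(url), many_hyphens(url))
```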
Check if there is a homographic attack
Can you give one example of the homographic attack and explain how you can obtain the numerical feature of these characteristics?
facebook -> faceb00k
- Binary feature: presence (1) or absence (0)
- Integer feature: number of occurrences (2 in this case)
Check the domain age
Can you give two ways to obtain the numerical feature of these characteristics?
- Integer feature: calculate the difference between the enquiry date and the registration date
- Histogram: set thresholds and use a histogram to put the ages into intervals
Check the popularity of the website
Give ways to define the popularity of a website. For each approach, explain how you can obtain the numerical feature of these characteristics.
- PageRank: a number between 0 and 1
- Google index: binary feature (1 = the record is found in Google search)
- Alexa one million pages: binary feature (1 = on the ranking)
Check the content in the phishing email
Explain how you can obtain the numerical feature of these characteristics.
- Check the content of the phishing email. Most phishing emails sound very urgent and may be related to login, account etc. You can identify some keywords (terms) and then compute their term frequency to characterize the content.
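A term-frequency feature for a fixed keyword list can be sketched as below. The keyword list `URGENCY_TERMS` is a hypothetical example, and the tokenizer is deliberately naive (lower-cased runs of ASCII letters).

```python
import re
from collections import Counter

# Hypothetical keyword list; a real list would be derived from labeled emails
URGENCY_TERMS = ("urgent", "verify", "account", "login", "suspended")

def keyword_term_frequency(text: str, terms=URGENCY_TERMS) -> dict:
    """Return, for each keyword, its count divided by the total number of tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return {term: counts[term] / total for term in terms}

email = "Urgent: verify your account now or your account is suspended"
print(keyword_term_frequency(email))
```

Each email then becomes a fixed-length numerical vector with one entry per keyword.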
More Details about Feature Engineering in Phishing detection
https://archive.ics.uci.edu/ml/datasets/Phishing+Websites#
ML vs DL methods in Phishing detection
ML approach:
- Creating features through expert knowledge
- The ML model then uses these features to recognize patterns embedded in the data for phishing detection
Deep learning approach:
- Do not extract features
- Learn representation from the URL’s character sequence directly to perform phishing detection
Deep Learning in Phishing detection
One possible structure:
- Data cleaning
- Including: remove http://, https://, www
- Character embedding
- Note that we need to convert our URL into numbers.
Character embedding
- Change “characters” to numbers
- Example: one-hot encoding
- One-hot encoding uses 0 and 1 to denote the presence/absence of the text, and does not consider the relation between words
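The cleaning and character-to-number steps could look like the following sketch. The alphabet, the reserved index 0 for padding and unknown characters, and the fixed length `max_len` are all assumptions made for illustration; the notes do not prescribe them.

```python
import re
import string

# Assumed alphabet: lowercase letters, digits and common URL punctuation
ALPHABET = string.ascii_lowercase + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
# Index 0 is reserved for padding and for characters outside the alphabet
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}

def clean_url(url: str) -> str:
    """Lower-case the URL and strip a leading http://, https:// and www."""
    url = re.sub(r"^https?://", "", url.strip().lower())
    return re.sub(r"^www\.", "", url)

def encode_url(url: str, max_len: int = 20):
    """Map each character to an integer index, truncating or padding to max_len."""
    ids = [CHAR_TO_INDEX.get(ch, 0) for ch in clean_url(url)[:max_len]]
    return ids + [0] * (max_len - len(ids))

def one_hot_chars(ids):
    """Turn a list of indices into one-hot rows of length len(ALPHABET) + 1."""
    size = len(ALPHABET) + 1
    return [[1 if i == idx else 0 for i in range(size)] for idx in ids]

print(clean_url("https://www.Example.com/a"))
print(encode_url("http://ab", max_len=4))
```

A deep learning model would typically feed the integer indices into a learned embedding layer rather than use the one-hot rows directly.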
Word embedding
Besides character embedding, we can use word embedding.
- Words with similar meanings are represented using similar “numbers”
- Word embeddings are often used in NLP (natural language processing)
One-hot encoding vs word embedding
- Similarity:
- both methods convert text to numerical vectors
- Difference:
- One-hot encoding uses 0 and 1 to denote the presence/absence of the text, and does not consider the relation between words.
- Word embedding uses a numerical vector to represent a text in such a way that words with similar meanings are closer to each other in the distance measure than words with dissimilar meanings.
1D Convolution layer (CNN)
- Compute the output of neurons that are connected to local regions in the input
Fully Connected Layer (FC)
- Compute the output of neurons based on all inputs
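The difference between the two layer types can be illustrated with a plain-Python sketch on a single-channel sequence. It is only meant to show which inputs each output depends on: there is no activation function, no training, and the function names are illustrative. As in most deep learning frameworks, the "convolution" here is computed without flipping the kernel.

```python
def conv1d(sequence, kernel, bias=0.0):
    """Each output depends only on a local window of len(kernel) inputs."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k)) + bias
        for i in range(len(sequence) - k + 1)
    ]

def fully_connected(inputs, weights, biases):
    """Each output depends on all inputs: one weight row and one bias per output."""
    return [
        sum(x * w for x, w in zip(inputs, row)) + b
        for row, b in zip(weights, biases)
    ]

print(conv1d([1, 2, 3, 4], [1, 0, -1]))
print(fully_connected([1, 2], [[1, 1], [0, 2]], [0, 1]))
```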