Feature Engineering in ML-based Malware detection
Malware
Malware: malicious software
- Code that performs malicious actions
- Can take the form of an executable, a script, or any other software
Malware is growing exponentially
- 230K+ computer users hit by malware in Q2 2019
- Nearly 1 million new malware samples created every day
- Do they share similarities that we can use to detect them?
Ways of spread
- Deceiving users into opening email attachments
- Delivery through USB drives or web pages
Forms of malware attack
- Modification of the file system
- create new files, edit, encrypt, delete
- Modification of the file directory
- create a new record, change an existing record
- Infiltration into running processes (add a piece of malicious code into a running process)
Examining the Malware
Most malware samples share some common features.
Example: 5 different types of malware:
- They have a similar logical order, but the code is morphed so that they look different
Code cloning/morphing
- Morph code using combinations of transposition, substitution, insertion, deletion
- Transposition: reorder blocks of code and insert jump instructions so that the code still executes in the original order
Example: NGVCK malware:
- 3 samples: they do the same thing, but with different opcode sequences
- A lot of junk code
Traditional Malware Detection Methods
Use signatures, heuristics and hand crafted rules
- Construct malware detection rules manually
- Example: determine that a malware sample contacts a particular domain/IP address -> use that domain/IP address as a signature and monitor the network traffic to identify all hosts contacting it
Signature-based method
- DB: stores all signatures
- Identify whether the signature can be found for a particular file
- High accuracy
- But:
- Unable to detect new malware
- Requires continuous updates of the signature DB
- Relies on human experts to create the signatures
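The lookup described above can be sketched in a few lines. This is a minimal illustration that assumes the signature is a whole-file SHA-256 digest, which is only one possible kind of signature; `KNOWN_SAMPLE`, `SIGNATURE_DB` and `is_known_malware` are hypothetical names introduced here.

```python
import hashlib

# Hypothetical stand-in for a known malware sample (not a real executable).
KNOWN_SAMPLE = b"\x4d\x5a\x90\x00 fake malicious payload"

# Hypothetical signature DB: SHA-256 digests of known samples.
SIGNATURE_DB = {hashlib.sha256(KNOWN_SAMPLE).hexdigest()}


def is_known_malware(file_bytes: bytes) -> bool:
    """Return True if the file's digest is stored in the signature DB."""
    return hashlib.sha256(file_bytes).hexdigest() in SIGNATURE_DB


# The exact known sample is found in the DB.
print(is_known_malware(KNOWN_SAMPLE))
# A variant with a single extra byte has a different digest and is missed.
print(is_known_malware(KNOWN_SAMPLE + b"\x90"))
```

The second call shows why this approach cannot detect new or slightly modified samples: any change to the file produces a signature that is not in the DB.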
Anomaly-based method
- Find profiles of normal program execution (System call, API, memory usage, etc)
- Find deviations
Obfuscation techniques used by Hackers
- Obfuscation: an attempt to hide the original intention of the code
- Used against signature-based detection
- After obfuscation, the signature of the sample is different
Question: what are three things malware writers can do so that their samples are not detected by a traditional signature-based detection system?
- Dead code insertion / garbage code insertion
- Code transposition
- Changing the looping structure, etc.
Example of Obfuscation techniques:
- Dead code insertion
- instructions that “do nothing”, e.g., using NOP (no operation)
- e.g.
add eax, 0
or
sub edx, 0
- e.g.
a = x*x
b = 3    <- dead code
c = x    <- dead code
d = a    <- dead code
e = 6    <- dead code
f = a+a
g = 6*f
return g
- Garbage code insertion
- instructions that “do something”, but unrelated to the program main function (garbage)
- Code transposition
- Changing the order of instructions (both versions below behave exactly the same)
- JUMP is often used to transpose the code
MOV AL, BL
JMP LOC
LOC: ADD AL, 05H
Same as:
MOV AL, BL
ADD AL, 05H
- Change the looping structure
- Instruction substitution
add eax, 1
Same as:
sub eax, -1
- Packer
- Uses compression to obfuscate the executable’s content
- Obfuscated content is stored within the new exe file (gives a new packed program)
- Once unpacking is done, the malware is loaded into memory and executed
- Cryptor
- Similar to packer, except uses encryption rather than compression to obfuscate the executable’s content
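To see why packers and cryptors defeat byte signatures, here is a small sketch. It assumes `zlib` compression as a stand-in for a packer and a single-byte XOR as a stand-in for a cryptor; real tools are far more elaborate. `SIGNATURE`, `original`, `pack` and `crypt` are hypothetical names.

```python
import zlib

SIGNATURE = b"MALICIOUS_MARKER"  # hypothetical byte signature
# Stand-in for an executable that contains the signature several times.
original = b"header" + SIGNATURE * 8 + b"footer"


def contains_signature(data: bytes) -> bool:
    """Naive byte-pattern scan, as a signature-based scanner might do."""
    return SIGNATURE in data


def pack(data: bytes) -> bytes:
    """'Packer': obfuscate the content by compressing it."""
    return zlib.compress(data)


def crypt(data: bytes, key: int = 0x5A) -> bytes:
    """'Cryptor': obfuscate the content with a toy XOR cipher."""
    return bytes(b ^ key for b in data)


packed = pack(original)
crypted = crypt(original)

print(contains_signature(original))  # scan of the plain content
print(contains_signature(packed))    # scan of the packed content
print(contains_signature(crypted))   # scan of the encrypted content

# The original content is fully recoverable at run time.
assert zlib.decompress(packed) == original
assert crypt(crypted) == original
```

The stored bytes change, but the content can be restored in memory before execution, which is exactly the behaviour described for packers and cryptors above.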
Machine Learning Malware Detection Methods
Why Machine Learning?
- Use malware samples to automatically infer rules/signatures
- There is a large number of malware samples
- Having human experts infer rules is time consuming: there is too much data for humans to process
- Obfuscated samples might still have hidden patterns that machine learning can learn
How should we formulate the problem?
Supervised Learning
- Use learned features to do classification
- Labeled data -> develop a model to make accurate predictions on unseen data
Unsupervised Learning
- Learn inherent latent patterns, relationships and similarities among the input data points
- Unlabeled data -> find common characteristics -> groups/clusters
Detection vs Classification Problem in Malware Detection
Detection problem in malware detection:
- Detect the presence of malware
- Output: 2 cases
- Class 1: Malware
- Class 2: Not Malware
Classification problem in malware detection:
- Identify the type of malware
- Output: N cases, one per malware family
- Output 1: family 1
- Output 2: family 2
- …
- Output N: family N
Some questions and answers regarding ML methods:
Feature Selection of Malware
Two main approaches in feature selection of Malware:
- Static analysis: examine the file without execution
- Dynamic analysis: running samples (in a controlled and isolated environment) to examine their behavior
- Emulators/Sandboxes: replicate the behavior of a system with higher accuracy, but require more resources
- Generate execution traces
Static analysis vs Dynamic analysis
Advantages of Static analysis:
- Allows malicious files to be detected prior to execution (no need to execute them)
- Fast identification
Disadvantages of Static analysis:
- Suffers from code obfuscation (fails to detect polymorphic malware)
- Encryption, compression
Advantages of Dynamic analysis:
- Can detect "new" malware
- Can detect previously unknown types of malware attacks
Disadvantages of Dynamic analysis:
- Dynamic features are complicated to extract
- Dynamic analysis is OS dependent
- Time consuming (the sample has to be executed to observe its behaviour)
- Malware might be able to detect the virtual environment and hide its intention
- Malware might behave differently under different conditions (the if-else cases)
Thus, we usually combine static features and dynamic features for detection.
Static Features
Static analysis: examine the file without execution
Malware is an executable or a binary.
Static analysis: analysis of the code of portable executable (PE) files
- E.g., .exe, .dll, .com, .drv, .sys
- File signature: PE files start with 4D 5A
Identify features from malware binaries
Disassemble / debug malware (reverse engineering)
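The file signature check mentioned above (4D 5A, the ASCII characters "MZ") can be sketched as follows. `has_mz_header` is a hypothetical helper name. Matching these two bytes is only a first filter: it does not prove that the rest of the file is a valid PE.

```python
def has_mz_header(data: bytes) -> bool:
    """Check whether the data starts with the bytes 4D 5A ("MZ")."""
    return data[:2] == b"\x4d\x5a"


def file_has_mz_header(path: str) -> bool:
    """Same check on a file, reading only its first two bytes."""
    with open(path, "rb") as f:
        return has_mz_header(f.read(2))


print(has_mz_header(b"MZ\x90\x00\x03"))   # starts with 4D 5A
print(has_mz_header(b"\x7fELF\x02\x01"))  # different magic bytes
```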
PE File
- PE Imports
- A PE can import code from other PEs by specifying the PE file name and the functions to import
- Example: if an exe wants to create a file on disk
- Uses an API CreateFile() which is in kernel32.dll
- To call the API, first load kernel32.dll into memory and then call CreateFile() function
- By inspecting which DLLs and functions are used -> learn about the functionality of the exe file
- Seeing network-related API functions (e.g., connect, socket, listen, send) from wsock32.dll
- indicates that the malware connects to the Internet and performs network activity
How are we going to convert them into numbers?
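For the PE imports specifically, one simple option is to turn the import table into a few counts and flags. The sketch below assumes the import table has already been extracted by a PE parser into a dict (DLL name -> imported functions); the table contents, the API sets and the feature names are all hypothetical.

```python
# Hypothetical import table, as a PE parser might report it.
imports = {
    "kernel32.dll": ["CreateFileA", "WriteFile", "CloseHandle"],
    "wsock32.dll": ["socket", "connect", "send"],
}

# Hypothetical (incomplete) sets of APIs of interest.
NETWORK_APIS = {"connect", "socket", "listen", "send"}
FILE_APIS = {"CreateFileA", "CreateFileW", "WriteFile", "DeleteFileA", "DeleteFileW"}


def import_features(table: dict) -> dict:
    """Convert an import table into a small numeric feature dict."""
    functions = {fn for fns in table.values() for fn in fns}
    return {
        "num_dlls": len(table),
        "num_functions": len(functions),
        "num_network_apis": len(functions & NETWORK_APIS),
        "num_file_apis": len(functions & FILE_APIS),
        "uses_wsock32": int(any(dll.lower() == "wsock32.dll" for dll in table)),
    }


features = import_features(imports)
print(features)
```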
Static features: byte
- Byte sequences
- Examine bytes from binary files using tools such as hexdump
To convert them into numerical values:
- N-grams: collection of bytes taken by a sliding window of n bytes (n is usually from 1 to 10)
- Represents how many times a specific combination of n bytes occurs in the binary file
Example: 2 symbols A, B
- AABABB…
2-grams: AA, AB, BA, AB, BB, …
3-grams: AAB, ABA, BAB, ABB, …
- Frequencies of the words
Change to numerical representation
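A minimal sketch of byte N-gram counting, using the AABABB example. The function name `byte_ngrams` and the fixed vocabulary `vocab` are assumptions made for illustration; in practice the vocabulary would be chosen from the training set.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int) -> Counter:
    """Count every window of n bytes, sliding one byte at a time."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


counts = byte_ngrams(b"AABABB", 2)
print(counts)

# Numerical representation: one count per N-gram of a fixed vocabulary.
vocab = [b"AA", b"AB", b"BA", b"BB"]  # hypothetical vocabulary
vector = [counts[gram] for gram in vocab]
print(vector)
```

Every file is mapped to a vector of the same length, so files of different sizes can be compared by a learning algorithm.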
Static features: Opcode
- Opcode sequences (Assembly code)
- Reverse engineering: retrieve low-level machine instructions as features
- low-level machine instructions in PE obtained through disassembly procedure
To convert them into numerical values:
- Opcode N-gram: same idea as the byte N-gram, but counting opcodes instead of bytes.
This can produce a very large number of distinct N-grams (features).
Another way is to:
- Divide opcodes into different categories, and then represent the frequencies or calculate the N-grams for these categories:
- Control flow, Mathematical instructions (arithmetic), memory access instruction
- Obtain occurrence frequencies for different categories
- Reduce feature dimension
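The category idea can be sketched as below. The opcode-to-category table is a small hypothetical mapping chosen for illustration (a real one would cover the whole instruction set), and the opcode list is assumed to come from a disassembler.

```python
from collections import Counter

# Hypothetical, incomplete opcode -> category mapping.
OPCODE_CATEGORY = {
    "jmp": "control_flow", "call": "control_flow", "ret": "control_flow",
    "add": "arithmetic", "sub": "arithmetic", "imul": "arithmetic",
    "mov": "memory_access", "push": "memory_access", "pop": "memory_access",
}
CATEGORIES = ["control_flow", "arithmetic", "memory_access", "other"]


def category_frequencies(opcodes: list) -> dict:
    """Relative frequency of each opcode category in a sequence."""
    counts = Counter(OPCODE_CATEGORY.get(op.lower(), "other") for op in opcodes)
    total = len(opcodes)
    return {c: (counts[c] / total if total else 0.0) for c in CATEGORIES}


opcodes = ["mov", "add", "jmp", "mov", "sub", "nop", "call", "ret"]
freqs = category_frequencies(opcodes)
print(freqs)
```

Whatever the number of distinct opcodes, the feature vector has only as many entries as there are categories, which is the dimension reduction mentioned above.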
Static features: API
- API and system calls
- Reverse engineering: to get a list of all calls that can potentially be executed
- if an exe wants to create a file on disk
- Uses an API CreateFile() which is in kernel32.dll
- To call the API, first load kernel32.dll into memory and then call CreateFile() function
- Network communication
- Seeing network-related API functions (e.g., connect, socket, listen, send) from wsock32.dll
- indicates that the malware connects to the Internet and performs network activity
- Provides a view on the interaction of the binaries with the operating system
To convert them to numerical values, we can use:
- Frequencies
- N-grams
- API control graph
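As an illustration of the graph option, here is a sketch that builds a weighted call-transition graph. It assumes the API calls are available as an ordered list (for example in the order they appear in the disassembly), which is a simplification of a real control graph; `call_graph` is a hypothetical helper name.

```python
from collections import defaultdict


def call_graph(api_sequence: list) -> dict:
    """Directed graph where graph[a][b] counts how often b directly follows a."""
    graph = defaultdict(lambda: defaultdict(int))
    for src, dst in zip(api_sequence, api_sequence[1:]):
        graph[src][dst] += 1
    # Convert to plain dicts for easier printing and comparison.
    return {src: dict(dsts) for src, dsts in graph.items()}


sequence = ["CreateFile", "WriteFile", "WriteFile", "CloseHandle"]
graph = call_graph(sequence)
print(graph)
```

The edge weights are the same counts as API 2-grams; storing them as a graph additionally allows graph-based features such as node degrees.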
If the data are too sparse, we can divide them into categories:
- Malicious file operation: create, copy, remove, delete and write files
- Malicious system operation: run, halt, delay, terminate, exception handling, debug
- Malicious process and thread operation: create, execute, terminate process/thread
- Malicious registry operation: create, modify, inquiry, delete registry items
- Malicious storage operation: storage allocation, protection, and access
- Malicious network operation: create network connections, access, DNS services, terminate
Example:
Dynamic Features
Dynamic analysis: running samples (in a controlled and isolated environment) to examine their behavior
- Executing and running malicious code to check its behaviour and processes
- Examining the dissimilarity between specified states:
- Initial state: before infection of the malware
- State after infection
- can find out the behaviors of the malware
- Purpose: determine functionality of the malware
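The comparison of the two states can be sketched as a snapshot diff. The sketch assumes each snapshot is a dict mapping a file path to a content hash; the paths and hash values below are made up for illustration.

```python
def diff_states(before: dict, after: dict) -> dict:
    """Compare two snapshots (path -> content hash) of the file system."""
    created = sorted(after.keys() - before.keys())
    deleted = sorted(before.keys() - after.keys())
    modified = sorted(
        path for path in before.keys() & after.keys()
        if before[path] != after[path]
    )
    return {"created": created, "deleted": deleted, "modified": modified}


# Hypothetical snapshots taken before and after running the sample.
before = {"C:/a.txt": "h1", "C:/b.txt": "h2", "C:/c.txt": "h3"}
after = {"C:/a.txt": "h1", "C:/b.txt": "h2x", "C:/d.exe": "h4"}

changes = diff_states(before, after)
print(changes)
```

The lengths of the three lists can be used directly as numerical features.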
People run the malware in an isolated environment (e.g., Cuckoo Sandbox, VMware)
- Set up the sandbox
- Run the monitoring/dynamic analysis tools before executing the malware specimen
- Execute the malware specimen
- Stop the monitoring tools after the malware binary has been executed for a specified time
- Analyze the results
- Collect the data from the monitoring tools
Monitoring activities
- File system monitoring:
- Obtain a list of all system files before the actual infection of the system
- Find out which files have been changed (or deleted)
- Processes and system services
- Detect whether new services have been started or whether currently running processes have been changed (e.g., to bypass an anti-virus program)
- Memory analysis: the malware may find ways to access individual processes through RAM
- System changes:
- Examine registry and log files -> find the purpose of the malicious file
- Monitor the registry keys that are accessed/modified and the registry data that is read/written
- Network monitoring
- Traffic to and from system
Cuckoo Sandbox
- Use some tools to observe the dynamic behavior of PEs, e.g., Cuckoo Sandbox
- Open source, written in Python
- The Cuckoo agent handles the communication with the host to perform the analysis
- Trace of calls performed by all processes
- Files being created, deleted and downloaded during execution
- Analyze network traffic
- Perform memory analysis
- Produce a report, e.g., in JSON/HTML format
- Malware might be able to detect the virtual environment and hide its intention
- Malware might behave differently under different conditions (the if-else cases)
- Need to run all the possible routes
We can use the extractions from Cuckoo report as features.
Dynamic Features: Network
- Network features
- How PE interacts with the network: contacted addresses, generated traffic
- Duration, UDP_requests, http_requests, smtp_requests, tcp_requests, host_contacted, DNS_requested
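A sketch of how such network features could be read from a sandbox report. The report layout used here (an `info` section with `duration` and a `network` section with one list per protocol) is an assumption about the JSON report and should be checked against the actual report produced by your sandbox version; the sample values are made up.

```python
def network_features(report: dict) -> dict:
    """Count network events in a report dict (assumed layout, see above)."""
    net = report.get("network", {})
    return {
        "duration": report.get("info", {}).get("duration", 0),
        "udp_requests": len(net.get("udp", [])),
        "http_requests": len(net.get("http", [])),
        "smtp_requests": len(net.get("smtp", [])),
        "tcp_requests": len(net.get("tcp", [])),
        "hosts_contacted": len(net.get("hosts", [])),
        "dns_requested": len(net.get("dns", [])),
    }


# Hypothetical, heavily trimmed report (normally loaded with json.load).
report = {
    "info": {"duration": 120},
    "network": {
        "udp": [{}, {}],
        "http": [{}],
        "tcp": [{}, {}, {}],
        "hosts": ["1.2.3.4", "5.6.7.8"],
        "dns": [{"request": "example.com"}],
    },
}

net_features = network_features(report)
print(net_features)
```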
Dynamic Features: Files and CPU usages
- Statistics of file system activities / Processes
- How many files are read/modified/deleted
- Dropped, Processes_generated
- CPU and memory usage features
- CPU_usage, mem_usage
Dynamic Features: API
- Unlike static API features: an API declared in the binary may never actually run (hiding the intent), whereas dynamic analysis records the calls that are really executed
- Dynamic analysis of APIs has better accuracy
- Statistics of APIs
- Files_accesses, files_written, files_deleted, files_read, executed_cmds, started_services, created_services
- API categories
- Crypto_API, file_API, network_API, process_API, register_API, resource_API, services_API, system_API
- To get numerical data, we just need to count how many times each API was executed
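Counting executed APIs can be sketched as follows. The layout assumed here (a `behavior` section with a list of `processes`, each holding a list of `calls` with `api` and `category` fields) is an assumption about the sandbox report and may differ between versions; the trace below is made up.

```python
from collections import Counter


def api_counts(report: dict):
    """Count executed API calls per API name and per API category."""
    per_api, per_category = Counter(), Counter()
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            per_api[call["api"]] += 1
            per_category[call["category"]] += 1
    return per_api, per_category


# Hypothetical, heavily trimmed behaviour section of a report.
report = {
    "behavior": {
        "processes": [
            {"calls": [
                {"api": "NtCreateFile", "category": "file"},
                {"api": "NtWriteFile", "category": "file"},
                {"api": "NtWriteFile", "category": "file"},
            ]},
            {"calls": [
                {"api": "connect", "category": "network"},
            ]},
        ]
    }
}

per_api, per_category = api_counts(report)
print(per_api)
print(per_category)
```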
Note: dynamic analysis is OS dependent
- The results of a dynamic analysis correspond to only one version of the OS.
- If we run the sample on Windows and on Linux, we might get different results
- Even if we run it on Windows 10 and Windows 7, we might get different results
- The malware might not run successfully
- The same malware might behave differently in different environments
Static analysis, in contrast, is OS independent
Malware ML Problems
Malware detection problem: binary Classification
- Supervised approach:
- 2 sets of training data: malware samples and legitimate samples (labeled)
- Testing: predict if a file is malware or legitimate
- Unsupervised approach:
- ML techniques are used to group the data into 2 groups
- 2 sets of training data: malware samples and legitimate samples (unlabeled)
- Testing: predict if a file is malware or legitimate
Malware classification problem (N families)
- Supervised approach:
- N sets of training data: N different malware families (labeled)
- Testing: predict which family a malware belongs to
- Unsupervised approach:
- ML techniques are used to group the data into N different families
- N sets of training data: N different malware families (unlabeled)
- Testing: predict which family a malware belongs to
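A minimal sketch of the supervised setting, using a nearest-centroid rule chosen only because it is short; the notes do not name a specific classifier. The two features per sample (e.g., number of network API calls and number of files written) and all values are toy data made up for illustration. The same code handles N families by using N distinct labels.

```python
import math


def centroid(rows: list) -> list:
    """Mean feature vector of a group of samples."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def train(samples: list, labels: list) -> dict:
    """Training: compute one centroid per label from the labeled data."""
    by_label = {}
    for x, y in zip(samples, labels):
        by_label.setdefault(y, []).append(x)
    return {y: centroid(rows) for y, rows in by_label.items()}


def predict(model: dict, x: list) -> str:
    """Testing: return the label whose centroid is closest to x."""
    return min(model, key=lambda y: math.dist(x, model[y]))


# Toy labeled training set (made-up values).
samples = [[9, 7], [8, 9], [10, 8], [1, 0], [0, 2], [2, 1]]
labels = ["malware", "malware", "malware", "legitimate", "legitimate", "legitimate"]

model = train(samples, labels)
print(predict(model, [7, 6]))
print(predict(model, [1, 2]))
```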
Deep learning approach
What will be the input data (raw data)?
- Malware visualization
Example 1: Read malware binary using a hex editor
- Read the binary as a vector of 8-bit values
- Organize the vector into a 2D array
- Visualize as a gray-scale image
Then go through CNN and FC layers.
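The byte-to-image step can be sketched as below. The image width is an assumed parameter (the notes do not specify how it is chosen), and padding the last row with zeros is one possible convention. The result is a list of rows of pixel values between 0 and 255.

```python
def bytes_to_image(data: bytes, width: int) -> list:
    """Reshape bytes into rows of `width` gray-scale pixels (0-255).

    The last row is padded with zeros if the data does not fill it.
    """
    padded = data + bytes(-len(data) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


image = bytes_to_image(bytes(range(10)), width=4)
for row in image:
    print(row)
```

Each byte becomes one pixel, so the 2D array can be saved as a gray-scale image or passed to a CNN as a single-channel input.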
Example 2: Dynamic API call sequences
- Extract API call sequences
- Using color mapping rules, the API categories and the number of times each category occurs per unit time are displayed as an image
API calls visualization
- Dynamic analysis (Cuckoo sandbox)
- Extraction of API call sequence
- Group APIs into 14 different categories:
- Networking, register, service, file, hardware and system, message, …password dumping, anti-debugging, …
Then feed to CNN + FC layers