Malware

Malware: malicious software

Code that performs malicious actions
Can take the form of an executable, a script, or any other software

Malware grows exponentially

230K+ computer users hit by malware in Q2 2019
Nearly 1 million new malware samples created every day
- Do they have any similarity so we can detect them?

Ways of spread

Deceive users and make them click email attachments
Delivered through USB drives/web pages

Forms of malware attack

Modification of the file system
- create new files, edit, encrypt, delete
Modification of the file directory
- create a new record, change an existing record
Infiltration into running processes (add a piece of malicious code into a running process)

Examining the Malware

Most malware shares some common features.

Example: 5 different types of malware:

They follow a similar logical order, but the code is morphed so that the samples look different

Code cloning/morphing

Morph code using combinations of transposition, substitution, insertion, deletion
Transposition: reorder the code and insert jump instructions so that it still executes in the original order

Example: NGVCK malware:

3 samples: do the same thing, but with different opcode sequences

A lot of junk code

Traditional Malware Detection Methods

Use signatures, heuristics and hand crafted rules

Construct malware detection rules manually
Example: once a malware sample is found to contact a particular domain/IP address -> use that domain/IP address as a signature and monitor the network traffic to identify all the hosts contacting that address

Signature-based method

DB: stores all signatures
- Identify whether the signature can be found for a particular file
High accuracy
- But:
  - Unable to detect new malware
  - Requires continuous updates of the signature DB
  - Relies on human experts to create the signatures
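
A minimal sketch of the lookup step, assuming the signature is simply the SHA-256 hash of the whole file (real signatures are often byte patterns or rules). The sample bytes and the names `SIGNATURE_DB`, `file_signature` and `is_known_malware` are made up for illustration. It also shows why a slightly changed sample is missed.

```python
import hashlib

def file_signature(data: bytes) -> str:
    """Return the SHA-256 hex digest used here as the file's signature."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical known-malware sample and signature DB (illustrative bytes only).
known_sample = b"\x4d\x5a\x90\x00 payload-v1"
SIGNATURE_DB = {file_signature(known_sample)}

def is_known_malware(data: bytes) -> bool:
    """Signature-based detection: exact lookup of the signature in the DB."""
    return file_signature(data) in SIGNATURE_DB

# A variant that differs by a single appended junk byte has a different hash.
variant = known_sample + b"\x90"
```

Here `is_known_malware(known_sample)` is true, while `is_known_malware(variant)` is false until the DB is updated with the new signature.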

Anomaly-based method

Find profiles of normal program execution (System call, API, memory usage, etc)
Find deviations
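
A rough sketch of the idea, assuming the "profile" is just the mean and standard deviation of one numeric feature (here a made-up system-call rate) and that a deviation of more than 3 standard deviations counts as an anomaly. Both the feature and the threshold are assumptions for illustration.

```python
from statistics import mean, stdev

def build_profile(normal_values):
    """Profile of normal execution: mean and standard deviation of a feature."""
    return mean(normal_values), stdev(normal_values)

def is_anomalous(value, profile, threshold=3.0):
    """Flag a value that deviates more than `threshold` std devs from normal."""
    mu, sigma = profile
    return abs(value - mu) > threshold * sigma

# Hypothetical system-call rates (calls per second) of a benign program.
normal_rates = [98, 102, 101, 97, 100, 103, 99]
profile = build_profile(normal_rates)
```

With this profile, a rate of 101 is considered normal and a rate of 250 is flagged as a deviation.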

Obfuscation techniques used by Hackers

Obfuscation: attempt to hide the original intentions
- Used against signature-based detection
- After obfuscation, the signature of the sample is different.

Can you give three examples of what malware writers do so that their samples cannot be detected by a traditional signature-based detection system?

Dead code insertion / garbage code insertion
Perform code transposition
Change the looping structure… etc

Example of Obfuscation techniques:

Dead code insertion
- instructions that “do nothing”, e.g., using NOP (no operation)
- e.g. add eax, 0 or sub edx, 0

a=x*x
b=3 <- dead code
c=x <- dead code
d=a <- dead code
e=6 <- dead code
f=a+a
g=6*f
return g

Garbage code insertion
- instructions that “do something”, but unrelated to the program main function (garbage)
Code transposition
- Changing the order of instructions while keeping the behaviour the same (both versions below are functionally the same code)
- JUMP is often used to transpose the code

MOV AL, BL
JMP LOC
LOC: ADD AL, 05H

Same as:

MOV AL, BL
ADD AL, 05H

Change the looping structure
Instruction substitution

add eax, 1

Same as:

sub eax, -1
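
To see that these techniques change the opcode sequence but not the result, here is a sketch of a toy interpreter. It is not a real x86 emulator: the instruction set (MOV register to register, ADD/SUB with an immediate, NOP, JMP, LABEL) and the tuple encoding are assumptions made for this example.

```python
def run(program, regs=None):
    """Interpret a tiny instruction list and return the final register values."""
    regs = dict(regs or {})
    labels = {ins[1]: i for i, ins in enumerate(program) if ins[0] == "LABEL"}
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        if op == "MOV":            # MOV dst, src (register to register)
            regs[args[0]] = regs[args[1]]
        elif op == "ADD":          # ADD dst, immediate
            regs[args[0]] += args[1]
        elif op == "SUB":          # SUB dst, immediate
            regs[args[0]] -= args[1]
        elif op == "JMP":          # jump to the position of a label
            pc = labels[args[0]]
            continue
        # NOP and LABEL do nothing
        pc += 1
    return regs

def opcodes(program):
    """Opcode sequence, i.e. what an opcode-based signature would look at."""
    return [ins[0] for ins in program]

original = [("MOV", "AL", "BL"), ("ADD", "AL", 5)]
dead_code = [("MOV", "AL", "BL"), ("NOP",), ("ADD", "AL", 0), ("ADD", "AL", 5)]
transposed = [
    ("JMP", "first"),
    ("LABEL", "second"), ("ADD", "AL", 5), ("JMP", "end"),
    ("LABEL", "first"), ("MOV", "AL", "BL"), ("JMP", "second"),
    ("LABEL", "end"),
]
substituted = [("MOV", "AL", "BL"), ("SUB", "AL", -5)]

start = {"AL": 0, "BL": 7}
results = [run(p, start) for p in (original, dead_code, transposed, substituted)]
```

All four programs end with the same register values, yet `opcodes(...)` is different for each of them, which is why an exact-match signature on the opcode sequence fails.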

Packer
Uses compression to obfuscate the executable’s content
- Obfuscated content is stored within the new exe file (gives a new packed program)
When unpacking is done, the malware is loaded into memory and its execution is triggered
Cryptor
- Similar to a packer, but uses encryption rather than compression to obfuscate the executable’s content
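
A small sketch of both ideas on plain bytes, assuming zlib compression stands in for a packer and a repeating-key XOR stands in for a cryptor (real packers and cryptors also add an unpacking stub to the executable, which is omitted here). The payload, the key and the function names are invented for illustration.

```python
import zlib

def pack(payload: bytes) -> bytes:
    """Packer sketch: store the payload in compressed form."""
    return zlib.compress(payload)

def unpack(packed: bytes) -> bytes:
    """Restore the original payload (done in memory at run time by a real stub)."""
    return zlib.decompress(packed)

def xor_crypt(data: bytes, key: bytes) -> bytes:
    """Cryptor sketch: XOR with a repeating key; applying it twice restores the data."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

signature = b"EVIL_ROUTINE"                      # bytes a scanner would look for
payload = b"header " + signature * 20 + b" trailer"
packed = pack(payload)
encrypted = xor_crypt(payload, b"k3y")
```

The stored bytes (`packed`, `encrypted`) differ from `payload`, so a byte signature taken from the original content typically no longer matches the file on disk, while `unpack(packed)` and `xor_crypt(encrypted, b"k3y")` give back the original payload.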

Machine Learning Malware Detection Methods

Why Machine Learning?

Use malware samples to automatically infer rules/signatures
There is a large number of malware samples
- Having human experts infer rules is time consuming: there is too much data for a human to process
- Obfuscated samples may still contain hidden patterns that machine learning can learn

How should we formulate the problem?

Supervised Learning

Use learned features to do classification
- Labeled data -> develop a model to make accurate predictions on unseen data

Unsupervised Learning

Learn inherent latent patterns, relationships and similarities among the input data points
- Unlabeled data -> find common characteristics -> groups/clusters

Detection vs Classification Problem in Malware Detection

Detection problem in Malware Detection :

Detect the presence of Malware
Output: 2 cases
- Class 1: Malware
- Class 2: Not Malware

Classification problem in Malware Detection :

Identify the type of Malware
Output: N cases, one for each of the N malware families
- Output 1: family 1
- Output 2: family 2
- …
- Output N: family N

Some questions and answers regarding ML methods:

Feature Selection of Malware

Two main approaches in feature selection of Malware:

Static analysis: examine the file without execution
Dynamic analysis: running samples (in a controlled and isolated environment) to examine their behavior
- Emulators/Sandboxes: replicate the behavior of a system with higher accuracy, but require more resources
  - Generate execution traces

Static analysis vs Dynamic analysis

Advantages of Static analysis:
- Allows malicious files to be detected prior to execution (do not need to execute)
- Fast identification
Disadvantages of Static analysis:
- Suffers from code obfuscation (fails to detect polymorphic malware)
  - Encryption, compression
Advantages of Dynamic analysis:
- Possible to detect “new” malware
- Possible to detect unforeseen types of malware attacks
Disadvantages of Dynamic analysis:
- Complicated to extract dynamic features
- Dynamic analysis is OS dependent
- Time consuming (the sample has to be executed to observe its behaviour)
- Malware might be able to detect the virtual environment and hide its intention
- Malware might behave differently under different conditions (the if-else cases)

Thus, we usually combine static features and dynamic features for detection.

Static Features

Static analysis: examine the file without execution

Malware is an executable or a binary.

Static analysis: analysis of the code and structure of portable executable (PE) files
- E.g., .exe, .dll, .com, .drv, .sys
- File signature: PE files start with the bytes 4D 5A (ASCII “MZ”)
Identify features from malware binaries
Disassemble / debug malware (reverse engineering)
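
The file signature check can be sketched as follows. It only tests the first two bytes (4D 5A); a fuller check would also validate the rest of the PE headers. The function name `looks_like_pe` is chosen for this example.

```python
def looks_like_pe(data: bytes) -> bool:
    """Check the 2-byte file signature 4D 5A ("MZ") at the start of the data."""
    return data[:2] == b"\x4d\x5a"
```

For a file on disk, the first bytes can be read with `open(path, "rb").read(2)` and passed to the function.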

PE File

PE Imports
- A PE can import code from other PEs by specifying the PE file name and the functions to import
- Example: if an exe wants to create a file on disk
  - Uses an API CreateFile() which is in kernel32.dll
  - To call the API, first load kernel32.dll into memory and then call CreateFile() function
By inspecting what DLL and functions are used -> know about the functionality of the exe file
- See network-related API functions (e.g., connect, socket, listen, send) from wsock32.dll
- Indicates that the malware connects to the Internet and performs network activity

How do we convert them into numbers?

Static features: byte

Byte sequences
- Examine bytes from binary files using tools such as hexdump

To convert them into numerical values:

N-grams: collections of bytes taken by a sliding window of n bytes (N is usually from 1 to 10)
- Represents how many times a specific combination of n bytes occurs in the binary file

Example: 2 symbols A,B

AABABB…

2-grams: AA, AB, BA, AB, BB, …

3-grams: AAB, ABA, BAB, ABB, …

Frequencies of the words (n-grams)

Change to numerical representation
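
A minimal sketch of n-gram extraction with a sliding window, followed by a count-based feature vector. It works on strings (like the A/B example) and on `bytes` alike. The fixed vocabulary passed to `to_vector` is an assumption; in practice it would be built from the training set.

```python
from collections import Counter

def ngrams(seq, n):
    """Slide a window of length n over seq and return every n-gram in order."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def ngram_counts(seq, n):
    """Count how many times each n-gram occurs."""
    return Counter(ngrams(seq, n))

def to_vector(counts, vocabulary):
    """Numerical representation: one count per n-gram in a fixed vocabulary."""
    return [counts[g] for g in vocabulary]

two_grams = ngrams("AABABB", 2)     # AA, AB, BA, AB, BB
three_grams = ngrams("AABABB", 3)   # AAB, ABA, BAB, ABB
vector = to_vector(ngram_counts("AABABB", 2), ["AA", "AB", "BA", "BB"])
```

For the string above, `vector` is `[1, 2, 1, 1]` because AB occurs twice and every other 2-gram once.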

Static features: Opcode

Opcode sequences (Assembly code)
- Reverse engineering: retrieve low-level machine instructions as features
- low-level machine instructions in PE obtained through disassembly procedure

To convert them into numerical values:

Opcode N-gram: same idea as byte N-gram, but counting opcodes.

This can lead to a very large number of distinct N-grams.

Another way is to:

Divide opcodes into different categories, and then compute the frequencies or calculate the N-grams over these categories:
- Control flow, Mathematical instructions (arithmetic), memory access instruction
- Obtain occurrence frequencies for different categories
- Reduce feature dimension
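
A sketch of the category approach, assuming the opcodes have already been extracted by a disassembler. The opcode-to-category map below is hypothetical and far from complete; it only illustrates how the feature dimension shrinks to the number of categories.

```python
from collections import Counter

# Hypothetical, incomplete opcode-to-category map (for illustration only).
OPCODE_CATEGORY = {
    "jmp": "control_flow", "call": "control_flow", "ret": "control_flow",
    "add": "arithmetic", "sub": "arithmetic", "mul": "arithmetic",
    "mov": "memory_access", "push": "memory_access", "pop": "memory_access",
}
OPCODE_CATEGORIES = ["control_flow", "arithmetic", "memory_access", "other"]

def category_frequencies(opcode_list):
    """Relative frequency of each opcode category, in a fixed category order."""
    counts = Counter(OPCODE_CATEGORY.get(op.lower(), "other") for op in opcode_list)
    total = len(opcode_list) or 1   # avoid division by zero for empty input
    return [counts[c] / total for c in OPCODE_CATEGORIES]

features = category_frequencies(["mov", "add", "jmp", "mov"])
```

The sample gets a 4-dimensional feature vector regardless of how many distinct opcodes it contains.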

Static features: API

API and system calls
- Reverse engineering: to get a list of all calls that can potentially be executed

Example: if an exe wants to create a file on disk
- Uses an API CreateFile() which is in kernel32.dll
- To call the API, first load kernel32.dll into memory and then call CreateFile() function

Network communication
- See network-related API functions (e.g., connect, socket, listen, send) from wsock32.dll
- Indicates that the malware connects to the Internet and performs network activity

API and system calls
- A list of all calls that can potentially be executed
- Provides a view on the interaction of the binaries with the operating system

To convert them to numerical values, we can use:

Frequencies
N-grams
API control graph

If the data are too sparse, we can divide them into categories:

Malicious file operation: create, copy, remove, delete and write files
Malicious system operation: run, halt, delay, terminate, exception handling, debug
Malicious process and thread operation: create, execute, terminate process/thread
Malicious registry operation: create, modify, query, delete registry items
Malicious storage operation: storage allocation, protection, and access
Malicious network operation: create network connections, access, DNS services, terminate
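
A sketch of the category-based representation, assuming a list of API names has already been extracted from the binary. The mapping from API name to category is hypothetical and only covers a handful of calls.

```python
from collections import Counter

# Hypothetical mapping from API names to the categories listed above.
API_CATEGORY = {
    "CreateFile": "file", "WriteFile": "file", "DeleteFile": "file",
    "CreateProcess": "process", "TerminateProcess": "process",
    "RegSetValue": "registry", "RegDeleteKey": "registry",
    "VirtualAlloc": "storage",
    "connect": "network", "socket": "network", "send": "network",
}
API_CATEGORIES = ["file", "system", "process", "registry", "storage", "network"]

def api_category_counts(api_calls):
    """Count calls per category; APIs missing from the map are ignored."""
    counts = Counter(API_CATEGORY[name] for name in api_calls if name in API_CATEGORY)
    return [counts[c] for c in API_CATEGORIES]

example = api_category_counts(["CreateFile", "WriteFile", "connect", "UnknownApi"])
```

Each sample is then described by six numbers instead of one sparse feature per API name.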

Example:

Dynamic Features

Dynamic analysis: running samples (in a controlled and isolated environment) to examine their behavior

Executing and running malicious code to check its behaviour and processes
- Examining the dissimilarity between specified states:
  - Initial state: before infection of the malware
  - State after infection
    - can find out the behaviors of the malware
Purpose: determine functionality of the malware

Malware is run in an isolated environment (e.g., Cuckoo Sandbox, VMware)

Set up the sandbox
Running the monitoring/dynamic analysis tools
- Before executing the malware specimen
Executing the malware specimen
Stopping the monitoring tools after the malware binary is executed for a specified time
Analyzing the results
- Collecting the data from the monitoring tools

Monitoring activities

File system monitoring:
- Obtain a list of all system files before the actual infection of the system
- Find out which files have been changed (or deleted)
Processes and system services
- Detect if new services have been started or if something changed to current running processes (e.g., bypass any anti-virus program)
Memory analysis: the malware may find ways to access individual processes through RAM
Systems changes:
- Examine registry and log files -> find purpose of the malicious file
- Monitor the registry keys that are accessed/modified and the registry data that is being read/written
Network monitoring
- Traffic to and from system
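
The file system monitoring step (state before vs. state after) can be sketched as a snapshot and a diff. This is a simplified illustration that hashes every file under a directory; the function names are invented here and real monitoring tools work differently (e.g. by hooking file operations).

```python
import hashlib
import os

def snapshot(root):
    """Map each file path under root (relative to root) to the SHA-256 of its content."""
    state = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                state[os.path.relpath(path, root)] = hashlib.sha256(fh.read()).hexdigest()
    return state

def diff_snapshots(before, after):
    """Return the files created, deleted and modified between two snapshots."""
    created = sorted(set(after) - set(before))
    deleted = sorted(set(before) - set(after))
    modified = sorted(p for p in set(before) & set(after) if before[p] != after[p])
    return {"created": created, "deleted": deleted, "modified": modified}
```

Usage: take `before = snapshot(root)` prior to executing the specimen, `after = snapshot(root)` once monitoring stops, then inspect `diff_snapshots(before, after)`.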

Cuckoo Sandbox

Use some tools to observe dynamic behavior of PEs, e.g., Cuckoo Sandbox
- Open source, written in Python
- Cuckoo agent handles the communication with the Host to perform analysis
  - Trace of calls performed by all processes
  - Files being created, deleted and downloaded during execution
  - Analyze network traffic
  - Perform memory analysis
  - Produce a report, e.g., JSON/HTML format

Malware might be able to detect the virtual environment and hide its intention
Malware might behave differently under different conditions (the if-else cases)
- Need to exercise all the possible execution paths

We can use the extractions from Cuckoo report as features.

Dynamic Features: Network

Network features
- How PE interacts with the network: contacted addresses, generated traffic
- Duration, UDP_requests, http_requests, smtp_requests, tcp_requests, host_contacted, DNS_requested
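
A sketch of turning a sandbox report into some of the network features named above. The key names (`network`, `udp`, `tcp`, `http`, `smtp`, `dns`, `hosts`) are assumed to follow a Cuckoo-style JSON report and should be checked against the report your version produces. The sample report is a hypothetical, heavily trimmed dictionary.

```python
def network_features(report: dict) -> dict:
    """Count network events per type; missing sections count as zero."""
    net = report.get("network", {})
    return {
        "UDP_requests": len(net.get("udp", [])),
        "tcp_requests": len(net.get("tcp", [])),
        "http_requests": len(net.get("http", [])),
        "smtp_requests": len(net.get("smtp", [])),
        "DNS_requested": len(net.get("dns", [])),
        "host_contacted": len(net.get("hosts", [])),
    }

# Hypothetical trimmed report (assumed layout, real reports are much larger).
sample_report = {
    "network": {
        "udp": [],
        "tcp": [{"dst": "192.0.2.1"}],
        "http": [],
        "dns": [{"request": "example.com"}],
        "hosts": ["192.0.2.1"],
    }
}
net_features = network_features(sample_report)
```

A real report would be loaded with `json.load` from the report file before being passed to `network_features`.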

Dynamic Features: Files and CPU usages

Statistics of file system activities / Processes
- How many files are read/modified/deleted
- Dropped, Processes_generated
CPU and memory usage features
- CPU_usage, mem_usage

Dynamic Features: API

Unlike static API features, an API that is declared in the binary might never actually run (to hide the intent).
- Dynamic analysis of APIs has better accuracy
Statistics of APIs
- Files_accesses, files_written, files_deleted, files_read, executed_cmds, started_services, created_services
API categories
- Crypto_API, file_API, network_API, process_API, register_API, resource_API, services_API, system_API
To get numerical data, we just need to count the number of times each API was executed

Note: Dynamic analysis is OS dependent

The results of a dynamic analysis correspond to only one version of the OS.

If we run the sample on Windows and on Linux, we might get different results

Even if we run it on Windows 10 and Windows 7, we might get different results

Malware might not run successfully

The same malware might behave differently in different environments

But static analysis is OS independent

Malware ML Problems

Malware detection problem: binary Classification

Supervised approach:
- 2 sets of training data: malware samples and legitimate samples (labeled)
- Testing: predict if a file is malware or legitimate
Unsupervised approach:
- ML techniques are used to group the data into 2 groups
- 2 sets of training data: malware samples and legitimate samples (unlabeled)
- Testing: predict if a file is malware or legitimate
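
As an illustration of the supervised setup, here is a sketch of a nearest-centroid classifier in plain Python. The choice of classifier, the two features (network API calls, files written) and the numbers are all assumptions for this example; any supervised model could take its place.

```python
import math

def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def train(samples, labels):
    """Training: compute one centroid per label from the labeled samples."""
    by_label = {}
    for x, y in zip(samples, labels):
        by_label.setdefault(y, []).append(x)
    return {y: centroid(rows) for y, rows in by_label.items()}

def predict(model, x):
    """Testing: return the label whose centroid is closest to x."""
    return min(model, key=lambda y: math.dist(model[y], x))

# Hypothetical feature vectors: [network API calls, files written].
train_X = [[9, 40], [8, 35], [1, 2], [0, 3]]
train_y = ["malware", "malware", "legitimate", "legitimate"]
model = train(train_X, train_y)
```

`predict(model, [7, 30])` falls on the malware side and `predict(model, [1, 1])` on the legitimate side. The same code handles N families by using N different labels.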

Malware classification problem (N families)

Supervised approach:
- N sets of training data: N different malware families (labeled)
- Testing: predict which family a malware belongs to
Unsupervised approach:
- ML techniques are used to group the data into N different families
- N sets of training data: N different malware families (unlabeled)
- Testing: predict which family a malware belongs to
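
For the unsupervised case, a sketch of k-means clustering with k = N in plain Python. K-means is just one possible clustering technique, the initialisation (first k points as centroids) is a simplification, and the 2-D feature vectors are made up.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points are used as the initial centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical unlabeled feature vectors from N = 3 families.
points = [[1, 1], [9, 9], [5, 0], [1, 2], [9, 8], [5, 1]]
assignment, centroids = kmeans(points, k=3)
```

The cluster indices are arbitrary: the algorithm groups similar samples, but mapping a cluster to a named family still needs some labeled samples or manual inspection.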

Deep learning approach

What will be the input data (raw data)?

Malware visualization

Example 1: Read malware binary using a hex editor

Read the bytes into a vector of 8-bit values
Organize the vector into a 2D array
Visualize as a gray-scale image

The image then goes through CNN and fully connected (FC) layers.
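
The first two steps can be sketched in plain Python as below. The image width is a free parameter here and the last row is zero-padded; both are assumptions of this example. Each value (0 to 255) is one gray-scale pixel.

```python
def bytes_to_image(data: bytes, width: int):
    """Reshape raw bytes into rows of `width` pixels, zero-padding the last row."""
    padded = data + bytes(-len(data) % width)   # pad up to a multiple of width
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]

image = bytes_to_image(b"\x4d\x5a\x90\x00\x03", width=4)
```

Here `image` is `[[77, 90, 144, 0], [3, 0, 0, 0]]`; the nested list can be converted to an array or image object by whichever library is used for the CNN.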

Example 2: Dynamic API call sequences

Extract API call sequences
Use color mapping rules: the API category and the number of times each category occurs per unit time are displayed as an image

API calls visualization

Dynamic analysis (Cuckoo sandbox)
- Extraction of API call sequence
- Group APIs into 14 different categories:
  - Networking, registry, service, file, hardware and system, message, …password dumping, anti-debugging, …
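
A sketch of how an API call trace could be turned into a 2D array before color mapping: rows are API categories, columns are time windows, and each cell is the number of calls of that category in that window. The event format `(timestamp, category)`, the window size and the three categories used are assumptions for illustration.

```python
def api_activity_matrix(events, categories, window, duration):
    """Count calls per category (rows) in each time window (columns)."""
    n_cols = -(-duration // window)              # ceiling division
    index = {c: i for i, c in enumerate(categories)}
    matrix = [[0] * n_cols for _ in categories]
    for t, category in events:
        if category in index and 0 <= t < duration:
            matrix[index[category]][int(t // window)] += 1
    return matrix

# Hypothetical trace: (seconds since start, API category).
events = [(0.2, "file"), (0.7, "file"), (1.5, "networking"),
          (2.1, "register"), (2.9, "networking")]
matrix = api_activity_matrix(events, ["networking", "register", "file"],
                             window=1, duration=3)
```

A color mapping rule could then assign one color per row and use the count as intensity, which gives the image that is fed to the network.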