
Malware: malicious software

  • code that performs malicious actions
  • takes the form of an executable, script, code or any other software

Malware grows exponentially

  • 230K+ computer users hit by malware in Q2 2019
  • Nearly 1 million new malware samples created every day
    • Do they have any similarity so we can detect them?

Ways malware spreads

  • Deceive users into clicking email attachments
  • Delivered through USB drives/web pages

Forms of malware attack

  • Modification of the file system
    • create new files, edit, encrypt, delete
  • Modification of the file directory
    • create a new record, change an existing record
  • Infiltration into running processes (add a piece of malicious code into a running process)

Examining the Malware

Most malware shares some common features.

Example: 5 different types of malware:


  • They follow a similar logical order, but the code is morphed so that the samples look different

Code cloning/morphing

  • Morph code using combinations of transposition, substitution, insertion and deletion
  • Transposition: reorder the instructions and insert jump instructions, so that the code still executes in the original order

Example: NGVCK malware:

  • 3 samples: they do the same thing, but with different opcode sequences
    • A lot of junk code

Traditional Malware Detection Methods

Use signatures, heuristics and hand-crafted rules

  • Construct malware detection rules manually
  • Example: determine that a malware sample contacts a particular domain/IP address -> use that domain/IP address as a signature and monitor the network traffic to identify all hosts contacting it

Signature-based method

  • DB: stores all signatures
    • Identify whether the signature can be found for a particular file
  • High accuracy on known malware
    • But:
      • Unable to detect new malware
      • Requires continuous updates of the signature DB
      • Relies on human experts to create the signatures
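
The lookup described above can be sketched as follows. This is a minimal illustration that assumes a signature is simply the SHA-256 hash of the whole file; real products also use byte patterns and rules. The sample bytes and the name SIGNATURE_DB are hypothetical.

```python
import hashlib

def file_signature(data: bytes) -> str:
    """Return the SHA-256 hex digest used here as the file's signature."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical signature database built from already-known malware samples.
known_malware = [b"known malicious payload v1", b"known malicious payload v2"]
SIGNATURE_DB = {file_signature(sample) for sample in known_malware}

def is_known_malware(data: bytes) -> bool:
    """Flag a file only if its signature is already in the database."""
    return file_signature(data) in SIGNATURE_DB

print(is_known_malware(b"known malicious payload v1"))  # a stored sample
print(is_known_malware(b"known malicious payload v3"))  # a new variant
```

The second call is not flagged even though the variant differs from a stored sample by a single byte, which is the "unable to detect new malware" weakness listed above.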

Anomaly-based method

  • Find profiles of normal program execution (system calls, APIs, memory usage, etc.)
  • Find deviations from these profiles
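
A rough sketch of the profile-and-deviation idea, assuming the profile is just the mean and standard deviation of one metric (system calls per second) and that "deviation" means more than 3 standard deviations from the mean. The metric, the numbers and the threshold are all assumptions made for illustration.

```python
from statistics import mean, stdev

def build_profile(normal_values):
    """Summarize normal behavior as (mean, standard deviation)."""
    return mean(normal_values), stdev(normal_values)

def is_anomalous(value, profile, threshold=3.0):
    """Flag a value that deviates more than `threshold` standard deviations."""
    mu, sigma = profile
    return abs(value - mu) > threshold * sigma

# Hypothetical system-call rates (calls per second) observed for a benign program.
normal_rates = [100, 104, 98, 101, 97, 103, 99, 102]
profile = build_profile(normal_rates)

print(is_anomalous(101, profile))  # close to the normal profile
print(is_anomalous(450, profile))  # far outside the normal profile
```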

Obfuscation techniques used by Hackers

  • Obfuscation: an attempt to hide the original intent
    • Used against signature-based detection
    • Because after obfuscation, the signature will be different.

Can you give three examples of what malware writers do so that their samples cannot be detected by a traditional signature-based detection system?

Dead code insertion / garbage code insertion
Perform code transposition
Change the looping structure… etc

Example of Obfuscation techniques:

  • Dead code insertion
    • instructions that “do nothing”, e.g., using NOP (no operation)
    • e.g. add eax, 0 or sub edx, 0
      b=3 <- dead code
      c=x <- dead code
      d=a <- dead code
      e=6 <- dead code
      return g
  • Garbage code insertion
    • instructions that “do something”, but unrelated to the program main function (garbage)
  • Code transposition
    • Changing the order of instructions (Both are exact same code)
    • JUMP is often used to transpose the code
  • Change the looping structure
  • Instruction substitution
    • Replace an instruction with an equivalent one, e.g. add eax, 1 gives the same result as sub eax, -1
  • Packer
    • Uses compression to obfuscate the executable’s content
    • Obfuscated content is stored within the new exe file (gives a new packed program)
    • When unpacking is done, the malware is loaded into memory and its execution is triggered
  • Cryptor
    • Similar to packer, except uses encryption rather than compression to obfuscate the executable’s content
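
To see why packers and cryptors defeat byte-pattern signatures, here is a small sketch. It assumes a toy "executable" made of ASCII bytes, a hypothetical byte-pattern signature, zlib as the packer's compression and a one-byte XOR as a stand-in for real encryption (it is not secure and only illustrates the effect).

```python
import zlib

SIGNATURE = b"MALICIOUS_PAYLOAD"                  # hypothetical byte-pattern signature
original = b"A" * 200 + SIGNATURE + b"B" * 200    # toy stand-in for an executable

def xor_bytes(data: bytes, key: int) -> bytes:
    """Toy 'encryption': XOR every byte with a one-byte key (not secure)."""
    return bytes(b ^ key for b in data)

packed = zlib.compress(original)       # packer: compression
crypted = xor_bytes(original, 0x5A)    # cryptor: (toy) encryption

print(SIGNATURE in original)   # pattern is visible in the plain file
print(SIGNATURE in packed)     # pattern is hidden by compression
print(SIGNATURE in crypted)    # pattern is hidden by encryption

# At run time the stub reverses the transformation and the payload reappears.
assert zlib.decompress(packed) == original
assert xor_bytes(crypted, 0x5A) == original
```

A scanner that only looks at the stored bytes misses the pattern in both transformed files, while the original payload is fully recoverable in memory.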

Machine Learning Malware Detection Methods

Why Machine Learning?

  • Use malware samples to automatically infer rules/signatures
  • There is a large number of malware samples
    • Inferring rules by hand is time consuming: there is too much data for human experts to process
    • Obfuscated samples may still share hidden patterns that machine learning can learn

How should we formulate the problem?

Supervised Learning

  • Learn from labeled data to do classification
    • Labeled data -> train a model -> make accurate predictions on unseen data

Unsupervised Learning

  • Learn inherent latent patterns, relationships and similarities among the input data points
    • Unlabeled data -> find common characteristics -> groups/clusters

Detection vs Classification Problem in Malware Detection

Detection problem in Malware Detection :

  • Detect the presence of Malware
  • Output: 2 cases
    • Class 1: Malware
    • Class 2: Not Malware

Classification problem in Malware Detection :

  • Identify the type of Malware
  • Output: one of N different malware families
    • Output 1: family 1
    • Output 2: family 2
    • Output N: family N

Some questions and answers regarding ML methods:



Feature Selection of Malware

Two main approaches in feature selection of Malware:

  • Static analysis: examine the file without execution
  • Dynamic analysis: running samples (in a controlled and isolated environment) to examine their behavior
    • Emulators/Sandboxes: replicate the behavior of a system with higher accuracy, but require more resources
      • Generate execution traces

Static analysis vs Dynamic analysis

  • Advantages of Static analysis:

    • Allows malicious files to be detected prior to execution (do not need to execute)
    • Fast identification
  • Disadvantages of Static analysis:

    • Suffers from code obfuscation (fails to detect polymorphic malware)
      • Encryption, compression
  • Advantages of Dynamic analysis:

    • Possible to detect new malware
    • Possible to detect previously unknown types of malware attacks
  • Disadvantages of Dynamic analysis:

    • Complicated to extract dynamic features
    • Dynamic analysis is OS dependent
    • Time consuming (the sample has to be executed to see its behaviour)
    • Malware might be able to detect the virtual environment and hide its intent
    • Malware might behave differently under different conditions (the if-else cases)

Thus, we usually combine static features and dynamic features for detection.

Static Features

Static analysis: examine the file without execution

Malware is an executable or a binary.

  • Static analysis: analysis of the code of portable executable (PE) files

    • E.g., .exe, .dll, .com, .drv, .sys
    • File signature: PE files start with the bytes 4D 5A (ASCII "MZ")
  • Identify features from malware binaries

  • Disassemble / debug malware (reverse engineering)
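
The file-signature check can be sketched as below. Beyond the 4D 5A magic mentioned above, the sketch also follows the 4-byte offset stored at position 0x3C to the "PE\0\0" signature, which is part of the PE layout but not covered in these notes. The hand-built header is hypothetical and is not a runnable executable.

```python
import struct

def looks_like_pe(data: bytes) -> bool:
    """Check the MZ magic (4D 5A) and the PE signature it points to."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # The 4-byte little-endian value at offset 0x3C gives the PE header offset.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    return data[pe_offset:pe_offset + 4] == b"PE\x00\x00"

# Hypothetical minimal header built by hand for illustration only.
header = bytearray(0x80)
header[0:2] = b"MZ"
struct.pack_into("<I", header, 0x3C, 0x40)
header[0x40:0x44] = b"PE\x00\x00"

print(looks_like_pe(bytes(header)))    # hand-built PE-like header
print(looks_like_pe(b"#!/bin/sh\n"))   # a shell script, not a PE file
```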

PE File

  • PE Imports
    • A PE can import code from other PEs by specifying the PE file name and the functions to import
    • Example: if an exe wants to create a file on disk
      • Uses an API CreateFile() which is in kernel32.dll
      • To call the API, first load kernel32.dll into memory and then call CreateFile() function
  • By inspecting which DLLs and functions are used -> learn about the functionality of the exe file
    • See network-related API functions (e.g., connect, socket, listen, send) from wsock32.dll
    • Indicates that the malware connects to the Internet and performs network activity
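
One simple way to turn imports into numbers is a binary "is this API imported" vector. The sketch below assumes the import table has already been extracted by some PE parser into a dictionary; the import table, the VOCABULARY list and the function name are hypothetical.

```python
# Hypothetical import table, as a PE parser might report it: DLL -> imported functions.
imports = {
    "kernel32.dll": ["CreateFile", "WriteFile", "ExitProcess"],
    "wsock32.dll": ["socket", "connect", "send"],
}

# Assumed vocabulary of API names to track; a real one would come from the training set.
VOCABULARY = ["CreateFile", "WriteFile", "socket", "connect", "listen", "send", "RegSetValue"]

def import_features(imports: dict) -> list:
    """Return a binary vector: 1 if the API in VOCABULARY is imported, else 0."""
    imported = {name for functions in imports.values() for name in functions}
    return [1 if api in imported else 0 for api in VOCABULARY]

features = import_features(imports)
print(features)  # [1, 1, 1, 1, 0, 1, 0]
```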

How do we convert them into numbers?

Static features: byte

  • Byte sequences
    • Examine bytes from binary files using tools such as hexdump

To convert them into numerical values:

  • N-grams: sequences of n consecutive bytes collected by a sliding window (n is usually from 1 to 10)
    • Each feature counts how many times a specific combination of n bytes occurs in the binary file


Example: 2 symbols A,B


2-grams: AA, AB, AB, BB, …

3-grams: AAB, ABA, BAB, ABB, …

  • Frequencies of the words

Change to numerical representation
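
A minimal sketch of byte n-gram counting with a sliding window. The toy input AABABB and the fixed 2-gram vocabulary are assumptions chosen to match the two-symbol example; a real binary would be read with open(path, "rb").

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int) -> Counter:
    """Count every n-byte window obtained by sliding one byte at a time."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

sample = b"AABABB"  # hypothetical toy "binary" over the two symbols A and B
print(byte_ngrams(sample, 2))
print(byte_ngrams(sample, 3))

# Fixed-length numeric vector over an assumed 2-gram vocabulary.
vocabulary = [b"AA", b"AB", b"BA", b"BB"]
counts = byte_ngrams(sample, 2)
vector = [counts[gram] for gram in vocabulary]
print(vector)  # [1, 2, 1, 1]
```

Fixing the vocabulary in advance keeps the vector the same length for every file, which is what most classifiers expect.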


Static features: Opcode

  • Opcode sequences (Assembly code)
    • Reverse engineering: retrieve low-level machine instructions as features
    • low-level machine instructions in PE obtained through disassembly procedure

To convert them into numerical values:

  • Opcode N-gram: same idea as the byte N-gram, but over opcodes instead of bytes.


This can produce a very large number of distinct N-grams.

Another way is to:

  • Divide opcodes into different categories, and then compute the frequencies or the N-grams over these categories:
    • Control flow, Mathematical instructions (arithmetic), memory access instruction
    • Obtain occurrence frequencies for different categories
    • Reduce feature dimension
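
The category approach can be sketched as follows. The mnemonic-to-category mapping is an assumed, deliberately small example, and the opcode list stands in for real disassembly output.

```python
from collections import Counter

# Assumed, deliberately small mapping from x86 mnemonics to coarse categories.
OPCODE_CATEGORY = {
    "jmp": "control_flow", "call": "control_flow", "ret": "control_flow", "je": "control_flow",
    "add": "arithmetic", "sub": "arithmetic", "mul": "arithmetic", "inc": "arithmetic",
    "mov": "memory_access", "push": "memory_access", "pop": "memory_access",
}
CATEGORIES = ["control_flow", "arithmetic", "memory_access", "other"]

def category_frequencies(opcodes):
    """Return the relative frequency of each category in an opcode sequence."""
    counts = Counter(OPCODE_CATEGORY.get(op, "other") for op in opcodes)
    total = len(opcodes) or 1
    return [counts[c] / total for c in CATEGORIES]

# Hypothetical disassembly output (mnemonics only).
opcodes = ["push", "mov", "add", "sub", "jmp", "nop", "mov", "ret"]
print(category_frequencies(opcodes))  # [0.25, 0.25, 0.375, 0.125]
```

However many distinct opcodes appear, the feature vector here always has four entries, which is the dimension reduction mentioned above.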


Static features: API

  • API and system calls
    • Reverse engineering: get a list of all calls that can potentially be executed
    • Provides a view on the interaction of the binaries with the operating system
  • File operations: if an exe wants to create a file on disk
    • Uses an API CreateFile() which is in kernel32.dll
    • To call the API, first load kernel32.dll into memory and then call CreateFile() function
  • Network communication
    • See network-related API functions (e.g., connect, socket, listen, send) from wsock32.dll
    • Indicates that the malware connects to the Internet and performs network activity

To convert them to numerical values, we can use:

  • Frequencies
  • N-grams
  • API control graph

If the data are too sparse, we can divide them into categories:

  • Malicious file operation: create, copy, remove, delete and write files
  • Malicious system operation: run, halt, delay, terminate, exception handling, debug
  • Malicious process and thread operation: create, execute, terminate process/thread
  • Malicious registry operation: create, modify, query, delete registry items
  • Malicious storage operation: storage allocation, protection, and access
  • Malicious network operation: create network connections, access, DNS services, terminate



Dynamic Features

Dynamic analysis: running samples (in a controlled and isolated environment) to examine their behavior

  • Executing and running malicious code to check its behaviour and processes
    • Examining the dissimilarity between specified states:
      • Initial state: before infection of the malware
      • State after infection
        • can find out the behaviors of the malware
  • Purpose: determine functionality of the malware

People run the malware in an isolated environment (e.g. Cuckoo Sandbox, VMware)

  • Set up the sandbox
  • Start the monitoring/dynamic analysis tools
    • Before executing the malware specimen
  • Execute the malware specimen
  • Stop the monitoring tools after the malware binary has run for a specified time
  • Analyze the results
    • Collect the data from the monitoring tools
Monitoring activities

  • File system monitoring:
    • Obtain a list of all system files before the actual infection of the system
    • Find out which files have been changed (or deleted)
  • Processes and system services
    • Detect if new services have been started or if currently running processes have been changed (e.g., to bypass an anti-virus program)
  • Memory analysis: the malware may find ways to access individual processes through RAM
  • System changes:
    • Examine registry and log files -> find the purpose of the malicious file
    • Monitor the registry keys that are accessed/modified and the registry data that is read/written
  • Network monitoring
    • Traffic to and from system
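
The file system monitoring step (compare the state before and after infection) can be sketched as a snapshot diff. The in-memory dictionaries below are hypothetical stand-ins for the sandbox disk; nothing is executed.

```python
import hashlib

def snapshot(files: dict) -> dict:
    """Map each path to the SHA-256 of its content (files: path -> bytes)."""
    return {path: hashlib.sha256(content).hexdigest() for path, content in files.items()}

def diff_snapshots(before: dict, after: dict) -> dict:
    """Compare two snapshots and list created, deleted and modified paths."""
    common = set(before) & set(after)
    return {
        "created": sorted(set(after) - set(before)),
        "deleted": sorted(set(before) - set(after)),
        "modified": sorted(p for p in common if before[p] != after[p]),
    }

# Hypothetical in-memory file systems standing in for the sandbox disk.
clean_state = {"C:/docs/a.txt": b"hello", "C:/docs/b.txt": b"world"}
infected_state = {"C:/docs/a.txt": b"\x8f\x02\x11", "C:/temp/dropper.exe": b"MZ..."}

changes = diff_snapshots(snapshot(clean_state), snapshot(infected_state))
print(changes)
```

The lengths of the three lists can be used directly as numeric features (files created, deleted, modified).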

Cuckoo Sandbox

  • Use some tools to observe dynamic behavior of PEs, e.g., Cuckoo Sandbox

    • Open-source written in Python

    • Cuckoo agent handles the communication with the Host to perform analysis

      • Trace of calls performed by all processes
      • Files being created, deleted and downloaded during execution
      • Analyze network traffic
      • Perform memory analysis
      • Produce a report, e.g., JSON/HTML format
  • Malware might be able to detect the virtual environment and hide its intent
  • Malware might behave differently under different conditions (the if-else cases)
    • Need to run all the possible routes

We can use the data extracted from the Cuckoo report as features.

Dynamic Features: Network

  • Network features
    • How PE interacts with the network: contacted addresses, generated traffic
    • Duration, UDP_requests, http_requests, smtp_requests, tcp_requests, host_contacted, DNS_requested

Dynamic Features: File and CPU usage

  • Statistics of file system activities / Processes
    • How many files are read/modified/deleted
    • Dropped, Processes_generated
  • CPU and memory usage features
    • CPU_usage, mem_usage
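
A sketch of turning a sandbox report into the network and file features listed above. The report below is hypothetical and simplified: its key names are assumptions and do not claim to match the real Cuckoo report schema, which should be checked against an actual report before use.

```python
import json

# Hypothetical, simplified report; the key names are assumptions and do not
# claim to match the real Cuckoo report.json schema.
report_text = """
{
  "duration": 120,
  "network": {"udp": [1, 2, 3], "tcp": [1], "http": [1, 2],
              "dns": ["a.example"], "hosts": ["10.0.0.5"]},
  "dropped": ["payload.bin"],
  "processes": ["sample.exe", "cmd.exe"]
}
"""

def report_to_features(report: dict) -> dict:
    """Turn list-valued report entries into counts usable as numeric features."""
    network = report.get("network", {})
    return {
        "duration": report.get("duration", 0),
        "udp_requests": len(network.get("udp", [])),
        "tcp_requests": len(network.get("tcp", [])),
        "http_requests": len(network.get("http", [])),
        "dns_requested": len(network.get("dns", [])),
        "host_contacted": len(network.get("hosts", [])),
        "dropped": len(report.get("dropped", [])),
        "processes_generated": len(report.get("processes", [])),
    }

features = report_to_features(json.loads(report_text))
print(features)
```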

Dynamic Features: API

  • Unlike static API features, an API declared in the static state may never actually run (this can be used to hide the intent).
    • Dynamic analysis of APIs is therefore more accurate
  • Statistics of APIs
    • Files_accesses, files_written, files_deleted, files_read, executed_cmds, started_services, created_services
  • API categories
    • Crypto_API, file_API, network_API, process_API, register_API, resource_API, services_API, system_API
  • To get numerical data, we just need to count the number of times each API was executed
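
The difference between declared and executed APIs can be illustrated with a short sketch. Both the import set and the sandbox trace are hypothetical.

```python
from collections import Counter

# Hypothetical data: APIs listed in the import table vs. calls seen in a sandbox trace.
static_imports = {"CreateFile", "WriteFile", "connect", "send", "RegSetValue"}
dynamic_trace = ["CreateFile", "WriteFile", "WriteFile", "connect", "send", "send", "send"]

executed_counts = Counter(dynamic_trace)               # number of times each API ran
never_executed = static_imports - set(dynamic_trace)   # declared but not observed

print(executed_counts["send"])   # 3
print(sorted(never_executed))    # ['RegSetValue']
```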

Note: Dynamic analysis is OS dependent

  • A dynamic analysis result corresponds to only 1 version of the OS.
    • If we run it on Windows and Linux, we might get different results
    • Even if we run it on Windows 10 and Windows 7, we might get different results
    • Malware might not run successfully
    • The same malware might behave differently in different environments

But static analysis is OS independent

Malware ML Problems

Malware detection problem: binary Classification

  • Supervised approach:
    • 2 sets of training data: malware samples and legitimate samples (labeled)
    • Testing: predict if a file is malware or legitimate
  • Unsupervised approach:
    • ML techniques are used to group the data into 2 groups
    • 2 sets of training data: malware samples and legitimate samples (unlabeled)
    • Testing: predict if a file is malware or legitimate

Malware classification problem (N families)

  • Supervised approach:
    • N sets of training data: N different malware families (labeled)
    • Testing: predict which family a malware belongs to
  • Unsupervised approach:
    • ML techniques are used to group the data into N different families
    • N sets of training data: N different malware families (unlabeled)
    • Testing: predict which family a malware belongs to
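
The supervised versions of both problems can be sketched with one small model. The notes do not prescribe a particular algorithm, so this sketch uses a nearest-centroid classifier as a stand-in; the feature vectors and labels are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def train(samples):
    """samples: list of (feature_vector, label). Returns label -> centroid."""
    by_label = {}
    for vector, label in samples:
        by_label.setdefault(label, []).append(vector)
    return {label: centroid(vectors) for label, vectors in by_label.items()}

def predict(model, vector):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(vector, model[label]))
    return min(model, key=distance)

# Hypothetical features: [network API calls, file API calls, registry API calls].
detection_data = [([0, 5, 1], "legitimate"), ([1, 4, 0], "legitimate"),
                  ([9, 20, 7], "malware"), ([8, 25, 6], "malware")]
family_data = [([9, 1, 0], "family_1"), ([8, 2, 1], "family_1"),
               ([1, 9, 0], "family_2"), ([0, 8, 1], "family_2"),
               ([1, 0, 9], "family_3"), ([2, 1, 8], "family_3")]

detector = train(detection_data)    # detection: 2 classes
classifier = train(family_data)     # classification: N = 3 families

print(predict(detector, [7, 22, 5]))     # malware
print(predict(classifier, [1, 1, 7]))    # family_3
```

The only difference between the two problems here is the label set: two labels for detection, N labels for family classification.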

Deep learning approach

What will be the input data (raw data)?

  • Malware visualization

Example 1: Read malware binary using a hex editor

  • Read the file as a vector of 8-bit values (one value per byte, 0-255)
  • Organize the vector into a 2D array
  • Visualize the array as a gray-scale image
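
The three steps can be sketched in plain Python. The image width and the zero-padding of the last row are assumptions; the 10-byte sample is hypothetical, and a real sample would be read with open(path, "rb").read().

```python
def bytes_to_image(data: bytes, width: int) -> list:
    """Reshape bytes into rows of `width` pixels (0-255), zero-padding the last row."""
    padded = data + bytes(-len(data) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]

# Hypothetical 10-byte "binary" used only to show the reshaping.
sample = bytes([0x4D, 0x5A, 0x90, 0x00, 0x03, 0x00, 0x00, 0x00, 0x04, 0xFF])
image = bytes_to_image(sample, width=4)
for row in image:
    print(row)
# [77, 90, 144, 0]
# [3, 0, 0, 0]
# [4, 255, 0, 0]
```

Each number is one gray-scale pixel (0 is black, 255 is white), so the nested list can be handed to any image or array library.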


Then go through CNN and FC layers.



Example 2: Dynamic API call sequences

  • Extract API call sequences
  • Using color mapping rules, the API categories and the number of times each category occurs per unit time are displayed as an image

API calls visualization

  • Dynamic analysis (Cuckoo sandbox)
    • Extraction of API call sequence
    • Group APIs into 14 different categories:
      • Networking, register, service, file, hardware and system, message, …password dumping, anti-debugging, …
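
Before any color mapping is applied, the data behind such an image is a matrix of call counts per time unit and category. The sketch below builds that matrix; it assumes only three of the categories, a hypothetical API-to-category mapping and a hypothetical trace with timestamps in seconds.

```python
# Assumed subset of the categories and an assumed API -> category mapping.
CATEGORIES = ["networking", "registry", "file"]
API_CATEGORY = {"connect": "networking", "send": "networking",
                "RegSetValue": "registry", "CreateFile": "file", "WriteFile": "file"}

def category_matrix(trace, bin_seconds, n_bins):
    """Rows are time bins, columns are categories, cells are call counts."""
    matrix = [[0] * len(CATEGORIES) for _ in range(n_bins)]
    for timestamp, api in trace:
        row = int(timestamp // bin_seconds)
        category = API_CATEGORY.get(api)
        if category is not None and row < n_bins:
            matrix[row][CATEGORIES.index(category)] += 1
    return matrix

# Hypothetical trace of (seconds since start, API name) pairs.
trace = [(0.1, "CreateFile"), (0.4, "WriteFile"), (1.2, "connect"),
         (1.5, "send"), (1.9, "send"), (2.3, "RegSetValue")]
matrix = category_matrix(trace, bin_seconds=1.0, n_bins=3)
print(matrix)  # [[0, 0, 2], [3, 0, 0], [0, 1, 0]]
```

A color mapping rule would then assign each category a color and use the count as its intensity.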



Then feed to CNN + FC layers


Example 3: Opcode
