Improving Antivirus Signature For Detection Ransomware Attacks With Machine Learning

Abstract. Cybercrime is difficult to separate from the development of malware. According to the Internet Security Threat report, crimes that exploit malware are highly prevalent. One of the most widely spread types of malware is ransomware. Ransomware infections have increased year over year since 2013, with 1,271 detections per day during 2017. In 2018 the attacks shifted, with 81 percent of attacks targeting enterprises and ransomware infections at 12 percent. To address this problem, this research proposes an antivirus signature based on DLL files and API calls for ransomware files. Detecting files based on antivirus signatures has significant theoretical and practical value. The experimental results show that detecting ransomware files based on DLL files and functional API calls with machine learning performs well compared with detecting files based on MD5 and hex dump. For testing and detecting ransomware files, this research uses machine learning algorithms such as KNN, SVM, Decision Trees, and Random Forest. The test results demonstrate successful detection of ransomware files, improved object detection, and a research method for antivirus signatures.


Introduction
The Internet has grown quickly. According to data from the International Telecommunication Union (ITU), 4.1 billion people were using the Internet in 2019. The global penetration rate increased from 16.8% in 2005 to 53.6% in 2019, and the number of Internet users grew by an average of 10% every year. In developed countries, 87% of people use the Internet (International Telecommunication Union, 2019).
The rapid growth of the Internet and of computer technology has been followed by cybercrime activity. The Symantec Global Intelligence Network reports 700,000 global adversaries and 98 million attack sensors worldwide.
The outcome of ransomware detection will be improved by applying machine learning algorithms to obtain the best model for deployment.

Methods
This section outlines the architecture of our system, gives a system overview, and explains the workflow for detecting ransomware files. The workflow builds an antivirus signature and improves it by applying machine learning techniques to classify and detect ransomware attacks. The antivirus signature detects ransomware files from their hash signatures (MD5 and hex dump), and the machine learning techniques complete the system by classifying and detecting ransomware files based on their DLL files and API calls (Cabaj and Mazurczyk, 2016).
To analyze the files and make it easier to collect their characteristics, this experiment uses the pefile module in Python. Extracting the Portable Executable header (PE header) of a ransomware file aims to find the DOS header, file header, and optional header, so that knowledge can be obtained from the characteristics of ransomware files. Some ransomware uses anti-reverse-engineering techniques, so this research also uses static analysis to extract its DLL files and functional API calls. The PE format is a portable file format that encapsulates the information required to manage executable code. The file includes references to dynamic libraries for linking imported and exported APIs. In Windows, the PE format is used for .EXE, .DLL, .SYS files, and so on (Sebastián et al., 2016).
The Import Address Table (IAT) serves as a lookup table when an application calls a function in a different module. Because compiled programs do not know the memory location of the libraries, an indirect jump is required wherever an API call is made. As the dynamic linker loads and binds the modules, it writes the actual addresses into the IAT slots so that each slot points to the memory location of the corresponding library function. Some viruses are able to hide from reverse engineering, so function calls in Python cannot extract their imports. Therefore, Pestudio and FileAlyzer are needed to extract their API calls.
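The import-table extraction described above can be sketched in Python as follows. This is a minimal illustration, not the authors' script: the function names and the stand-in object are hypothetical, and only the documented pefile attributes `DIRECTORY_ENTRY_IMPORT`, `dll`, `imports`, `name`, and `ordinal` are assumed. The stand-in object lets the sketch run without a real sample or the pefile package installed.

```python
from types import SimpleNamespace


def extract_imports(pe):
    """Return {dll_name: [api_name, ...]} from a parsed PE object.

    `pe` is expected to look like a pefile.PE instance. The attribute
    DIRECTORY_ENTRY_IMPORT is absent when the file has no import table.
    """
    imports = {}
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        dll = entry.dll.decode("ascii", errors="replace").lower()
        names = []
        for imp in entry.imports:
            if imp.name is not None:
                names.append(imp.name.decode("ascii", errors="replace"))
            else:
                # Imported by ordinal only: no symbolic API name is available.
                names.append(f"ordinal_{imp.ordinal}")
        imports.setdefault(dll, []).extend(names)
    return imports


def extract_imports_from_file(path):
    """Parse `path` with pefile (third-party package) and list its imports."""
    import pefile  # imported lazily so extract_imports works without it

    pe = pefile.PE(path)
    try:
        return extract_imports(pe)
    finally:
        pe.close()


# Hypothetical stand-in for a parsed file, shaped like pefile's import data.
fake_pe = SimpleNamespace(DIRECTORY_ENTRY_IMPORT=[
    SimpleNamespace(dll=b"KERNEL32.dll", imports=[
        SimpleNamespace(name=b"CreateFileW", ordinal=None),
        SimpleNamespace(name=None, ordinal=17),
    ]),
])
demo_imports = extract_imports(fake_pe)
```

For a packed or obfuscated sample the import table may be empty or misleading, which is consistent with the paper's use of Pestudio and FileAlyzer as additional tools.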
(Al Amro and Alkhalifah, 2015) After analyzing the ransomware files, this experiment labels each record to distinguish innocuous files from malware files. Binary files are used as controllers and runtime behaviors. In this step the binaries are extracted, features are constructed, and samples are classified into the malware and innocuous classes. Some of the analyzed ransomware behave similarly, calling the same APIs with certain arguments. After merging all field data from the ransomware files and the innocuous files, the data is transformed into a dataset that can be presented to the machine learning algorithms. The ransomware samples include Petya/NotPetya, Sigma, Tesla, and WannaCry. The ransomware files are merged with 180 innocuous files to distinguish the characteristics of ransomware and innocuous files. The machine learning algorithms are Support Vector Machine, Decision Trees, Random Forest, and KNN.
In the first experiment, we analyze the ransomware files to obtain the MD5 and hex dump of each file. The MD5 and hex dump are the basis for building the antivirus signature with ClamAV, an open source antivirus. In the second experiment, we analyze the ransomware files to obtain the DLL files and API calls of each file. These DLL file and API call data are used as input data for the machine learning techniques. Table 1 shows the sample files used in this research. Each binary file is extracted and checked for whether each feature string is present, and the result is represented as a vector. If the selected feature is present it is assigned a value of 1; otherwise it is assigned a value of 0.
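The first experiment above derives signatures from the MD5 and hex dump of each file. The sketch below shows one way such lines could be produced, assuming ClamAV's hash-based format (`MD5:FileSize:MalwareName`) and its body-based format (`MalwareName:TargetType:Offset:HexSignature`). The paper does not state which format or which byte ranges were used, so the helper names, the signature name `Demo.Sample`, the target type, the offset, and the dump length are all illustrative assumptions.

```python
import hashlib


def hex_dump(data: bytes, start: int = 0, length: int = 16) -> str:
    """Return `length` bytes starting at `start` as a lowercase hex string."""
    return data[start:start + length].hex()


def hdb_line(data: bytes, name: str) -> str:
    """Hash-based signature line: MD5:FileSize:MalwareName."""
    return f"{hashlib.md5(data).hexdigest()}:{len(data)}:{name}"


def ndb_line(data: bytes, name: str, start: int = 0, length: int = 16) -> str:
    """Body-based signature line: MalwareName:TargetType:Offset:HexSignature.

    TargetType 0 (any file) and offset "*" (any position) are illustrative
    choices, not values taken from the paper.
    """
    return f"{name}:0:*:{hex_dump(data, start, length)}"


# Placeholder bytes standing in for the contents of a sample file.
sample = b"hello"
hash_signature = hdb_line(sample, "Demo.Sample")
body_signature = ndb_line(sample, "Demo.Sample")
```

A hash signature matches only the exact file it was computed from, so a single changed byte defeats it. This is one reason the second experiment moves to DLL and API call features.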
Table 2 shows an example of the binary vector space used for detecting ransomware files. It is representative of the dataset used as input to the machine learning algorithms.
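The binary vector space represented by Table 2 can be sketched as follows: every distinct DLL or API name becomes one column, and each file gets a 1 where the feature is present and a 0 where it is absent. The file names, feature lists, and labels below are made up for illustration and are not taken from the paper's dataset.

```python
def build_vocabulary(samples):
    """Collect every distinct feature (DLL or API name) in sorted order."""
    return sorted({f for features in samples.values() for f in features})


def to_binary_vector(features, vocabulary):
    """Return 1 for each vocabulary feature present in the file, else 0."""
    present = set(features)
    return [1 if f in present else 0 for f in vocabulary]


# Hypothetical extracted features per file (DLL files and API calls).
samples = {
    "sample_a.exe": ["kernel32.dll", "advapi32.dll", "CryptEncrypt"],
    "sample_b.exe": ["kernel32.dll", "user32.dll"],
}
# Hypothetical labels: 1 = ransomware, 0 = innocuous.
labels = {"sample_a.exe": 1, "sample_b.exe": 0}

vocabulary = build_vocabulary(samples)
vectors = {name: to_binary_vector(feats, vocabulary)
           for name, feats in samples.items()}
```

Fixing the vocabulary once and reusing it for every file keeps the columns aligned, so the resulting rows can be stacked into a single matrix for the classifiers.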

C. Classification and Detection of Ransomware Files
The data is used to train machine learning algorithms. This experiment uses K-Nearest Neighbors, Decision Trees, Random Forest, and Support Vector Machine, together with cross-validation for choosing between models and selecting features. All of the algorithms used in this experiment are supervised learning algorithms. Supervised learning aims to build models that generalize to out-of-sample data, so a model evaluation procedure is needed to estimate how well a model is likely to perform on such data. This experiment uses performance estimates to choose between the available models. Each model is trained on the whole dataset and then evaluated by testing how well it performs on the same data. This results in an evaluation metric known as training accuracy.
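The difference between training accuracy and a cross-validated estimate can be illustrated with a small, dependency-free sketch. It uses a toy K-Nearest Neighbors classifier with Hamming distance on binary vectors as a stand-in. The paper does not name its library or parameters, so the data, the fold scheme, and every function name here are assumptions, and the other three algorithms are not shown.

```python
def hamming(a, b):
    """Number of positions where two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train_X, train_y, x, k=1):
    """Majority vote among the k training rows nearest to x."""
    order = sorted(range(len(train_X)), key=lambda i: hamming(train_X[i], x))
    nearest = order[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if votes * 2 > len(nearest) else 0


def accuracy(train_X, train_y, test_X, test_y, k=1):
    """Fraction of test rows whose predicted label matches the true label."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y
               for x, y in zip(test_X, test_y))
    return hits / len(test_X)


def cross_val_accuracy(X, y, folds=4, k=1):
    """Mean accuracy over folds; each row is held out exactly once."""
    scores = []
    for f in range(folds):
        test = [i for i in range(len(X)) if i % folds == f]
        train = [i for i in range(len(X)) if i % folds != f]
        scores.append(accuracy([X[i] for i in train], [y[i] for i in train],
                               [X[i] for i in test], [y[i] for i in test], k))
    return sum(scores) / len(scores)


# Toy binary feature vectors; 1 = ransomware, 0 = innocuous (made-up data).
X = [[1, 1, 1, 0], [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1],
     [1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 1, 1], [0, 0, 0, 0]]
y = [1, 0, 1, 0, 1, 0, 1, 0]

train_acc = accuracy(X, y, X, y)   # evaluated on the same data it was fit on
cv_acc = cross_val_accuracy(X, y)  # evaluated on held-out rows only
```

With k=1 and no duplicate rows, every training row is its own nearest neighbor, so the training accuracy is perfect regardless of how well the model generalizes. The cross-validated score is the one that estimates out-of-sample performance.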
For testing and detecting ransomware files, this research uses machine learning algorithms such as KNN, SVM, Decision Trees, and Random Forest. The results are shown in Table 3.