PHP Code Vulnerability Detection Using TF-IDF, AST Parsing, and Random Forest (Deteksi Kerentanan Kode PHP Menggunakan TF-IDF, AST Parsing, dan Random Forest)

Mohamad Erdda Habiby

doi:10.32627/internal.v8i1.1411

Authors

Mohamad Erdda Habiby, Universitas Bina Nusantara

Keywords:

Abstract Syntax Tree, Term Frequency-Inverse Document Frequency, Random Forest, Static Code Analysis

Abstract

Vulnerabilities in PHP code remain one of the primary threats to web application security, especially when no automated analysis is performed during development. This study presents a hybrid system that detects vulnerabilities in PHP code in real time by combining Term Frequency-Inverse Document Frequency (TF-IDF) features and Abstract Syntax Tree (AST) parsing with a Random Forest classifier. The source code is first preprocessed, cleaned, and tokenized, and TF-IDF features are extracted from the resulting tokens. AST-based parsing then identifies potentially dangerous constructs, such as calls to eval() or include() whose parameters are derived from user input. The two feature sets are combined and used to train a Random Forest model on a labeled dataset of vulnerable and secure code. Reliability testing was conducted on several PHP files from existing applications. The system identified vulnerabilities such as File Inclusion and SQL Injection with confidence scores ranging from 52% to 61%, regardless of the structure or complexity of the code analyzed. Although these confidence levels are not yet optimal, the system was able to provide early warnings of potential threats. In future work, it can be developed further as a simulation tool for static code analysis that supports secure coding practices in software development.
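The feature-combination step described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation. It assumes a hand-rolled TF-IDF with smoothed IDF, and it uses a regular expression as a simplified stand-in for real AST parsing, which would require an actual PHP parser. The sample snippets, `RISKY_CALLS`, and all function names are hypothetical. In a full pipeline, the combined rows would then be passed to a classifier such as Random Forest, which is not shown here.

```python
import math
import re
from collections import Counter

# Hypothetical toy snippets for illustration; not taken from the study's dataset.
SAMPLES = [
    '<?php include($_GET["page"]); ?>',
    '<?php eval($_POST["cmd"]); ?>',
    '<?php echo htmlspecialchars($name); ?>',
]

TOKEN_RE = re.compile(r"\$?[A-Za-z_][A-Za-z0-9_]*")
USER_INPUT = r"\$_(?:GET|POST|REQUEST|COOKIE)\b"
# The two calls named in the abstract; one structural flag is produced per call.
RISKY_CALLS = ("eval", "include")


def tokenize(code):
    """Split PHP source into lowercase identifier and variable tokens."""
    return [t.lower() for t in TOKEN_RE.findall(code)]


def tfidf_vectors(docs):
    """Return the vocabulary and one TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


def structural_flags(code):
    """Regex stand-in for an AST check (assumption, not real parsing):
    flag a risky call whose statement also mentions a user-input superglobal."""
    flags = []
    for name in RISKY_CALLS:
        pattern = rf"\b{name}\s*\(?[^;]*{USER_INPUT}"
        flags.append(1.0 if re.search(pattern, code) else 0.0)
    return flags


def hybrid_features(docs):
    """Concatenate TF-IDF values and structural flags into one row per document."""
    vocab, tfidf = tfidf_vectors(docs)
    rows = [vec + structural_flags(doc) for vec, doc in zip(tfidf, docs)]
    return vocab, rows


vocab, features = hybrid_features(SAMPLES)
for code, row in zip(SAMPLES, features):
    # Show only the structural flags (eval, include) at the end of each row.
    print(code, "->", row[-len(RISKY_CALLS):])
```

Each row holds the lexical TF-IDF values first and the structural flags last, so a tree-based classifier could split on either kind of feature. A token that appears in every snippet, such as `php`, receives a lower weight than a token found in only one, such as `include`.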
