Detecting PHP Code Vulnerabilities Using TF-IDF, AST Parsing, and Random Forest
DOI:
https://doi.org/10.32627/internal.v8i1.1411
Keywords:
Abstract Syntax Tree, Frequency-Inverse Document Frequency, Random Forest, Static Code Analysis
Abstract
Vulnerabilities in PHP code remain one of the primary threats to web application security, especially when development lacks automated analysis mechanisms. This study presents a hybrid system that detects vulnerabilities in PHP code in real time by combining Term Frequency-Inverse Document Frequency (TF-IDF) features and Abstract Syntax Tree (AST) parsing with the Random Forest classification algorithm. The process begins with preprocessing, cleaning, and tokenization, followed by feature extraction using TF-IDF. Next, AST-based parsing identifies potentially dangerous constructs, such as eval() or include() called with parameters derived from user input. Both feature sets are combined and used to train a Random Forest model on a labeled dataset that distinguishes between vulnerable and secure code. Reliability testing was conducted on several PHP files from existing applications. The results show that the system identifies vulnerabilities such as File Inclusion and SQL Injection with confidence scores ranging from 52% to 61%, regardless of the structure or complexity of the analyzed code. Although these confidence levels are not yet optimal, the system proved effective at providing early warnings of potential threats. In the future, it can be developed further as a simulation tool for static code analysis to support secure coding practices in software development.
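The feature-construction stage described in the abstract can be illustrated with a minimal sketch. It is not the study's implementation: it uses only the Python standard library, a simple regular expression stands in for real AST parsing, and the names `RISKY_CALLS`, `USER_INPUT`, `tokenize`, `tfidf_vectors`, `structural_flags`, and `combined_features` are hypothetical. The Random Forest step is left out; the combined vectors would be handed to a classifier such as scikit-learn's `RandomForestClassifier` together with the vulnerable/secure labels.

```python
import math
import re
from collections import Counter

# Assumption: constructs treated as risky when they receive user input.
# The abstract names eval() and include(); the others are added for illustration.
RISKY_CALLS = ("eval", "include", "include_once", "require", "require_once")
# Assumption: PHP superglobals treated as user-controlled input.
USER_INPUT = r"\$_(?:GET|POST|REQUEST|COOKIE)\b"
TOKEN = re.compile(r"\$?[A-Za-z_][A-Za-z0-9_]*")


def tokenize(php_source):
    """Lowercased identifier and variable tokens; a crude stand-in for a PHP lexer."""
    return [t.lower() for t in TOKEN.findall(php_source)]


def tfidf_vectors(documents):
    """Plain TF-IDF: tf = count / document length, idf = log(N / df)."""
    tokenized = [tokenize(d) for d in documents]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(documents)
    df = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


def structural_flags(php_source):
    """One 0/1 flag per risky call that receives user input in the same statement.

    Regex approximation only; walking a real AST would be far more reliable.
    """
    flags = []
    for name in RISKY_CALLS:
        pattern = re.compile(r"\b" + name + r"\b[^;]*" + USER_INPUT)
        flags.append(1 if pattern.search(php_source) else 0)
    return flags


def combined_features(documents):
    """Concatenate TF-IDF weights with the structural flags for each document."""
    vocab, vectors = tfidf_vectors(documents)
    rows = [vec + structural_flags(doc) for vec, doc in zip(vectors, documents)]
    return vocab, rows


samples = [
    "<?php include($_GET['page']); ?>",        # include() fed by user input
    "<?php echo htmlspecialchars($name); ?>",  # no risky construct
]
vocab, features = combined_features(samples)
```

Each row in `features` holds the TF-IDF weights followed by one flag per entry in `RISKY_CALLS`, so the first sample sets the `include` flag while the second sets none. Keeping the structural flags as separate columns lets a tree-based classifier split on them directly, independent of the token weights.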
License
Copyright (c) 2025 Mohamad Erdda Habiby

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.