An Evaluation of Word Embeddings on Vulnerability Prediction with Software Metrics

Authors

  • Sousuke Amasaki, Department of Systems Engineering, Okayama Prefectural University
  • Tomoyuki Yokogawa, Department of Systems Engineering, Okayama Prefectural University
  • Hirohisa Aman, Center for Information Technology, Ehime University

Keywords:

vulnerability prediction, word embeddings, empirical study

Abstract

CONTEXT: Software vulnerabilities pose a critical risk in a digital world, and developers dedicate enormous effort to removing vulnerable code from their software products. Vulnerability prediction uses software metrics to identify the modules most likely to be vulnerable. Recent studies ran empirical experiments that combined textual information with software metrics and found that the textual information did not improve predictive performance. However, those evaluations represented text only as Bag-of-Words (BoW), so semantic relations among words were never examined.

OBJECTIVE: To examine the performance of vulnerability prediction when the textual information captures semantic relations among words. Word2Vec was employed to capture these relations.

METHOD: We conducted a comparative study of BoW and two Word2Vec embeddings. For ease of evaluation, we replicated a recent study that employed BoW. The Word2Vec embeddings were obtained from pre-trained models based on Google News and Stack Overflow. The former was trained on a large corpus unrelated to software engineering (SE), while the latter was trained on a small but SE-related corpus.

RESULTS: The non-SE Word2Vec improved vulnerability prediction in terms of prediction stability. The SE-specific Word2Vec was less effective.

CONCLUSION: Practitioners should consider textual information with non-SE Word2Vec for better vulnerability prediction.
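The difference between the two text representations compared above can be illustrated with a minimal sketch. It is not the authors' pipeline: the tiny embedding table, the whitespace tokenizer, and the choice to average word vectors into one text vector are all assumptions made for illustration, and a real study would load pre-trained Word2Vec vectors instead.

```python
import math
from collections import Counter

# Hypothetical 2-dimensional embeddings, invented for this example only.
# Related words are deliberately placed close together.
TOY_EMBEDDINGS = {
    "fix": [1.0, 0.0],
    "patch": [0.8, 0.2],
    "overflow": [0.0, 1.0],
    "buffer": [0.2, 0.8],
}


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, simplistic tokenizer)."""
    return text.lower().split()


def bow_vector(text, vocabulary):
    """Bag-of-Words: one count per vocabulary word; word meaning is ignored."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def mean_embedding(text, embeddings, dim):
    """Average the vectors of known tokens (one assumed way to pool words)."""
    vectors = [embeddings[t] for t in tokenize(text) if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


text_a = "fix overflow"
text_b = "patch buffer"
vocabulary = sorted(TOY_EMBEDDINGS)

# The two texts share no words, so their BoW vectors are orthogonal.
bow_similarity = cosine(bow_vector(text_a, vocabulary),
                        bow_vector(text_b, vocabulary))

# Their averaged embeddings coincide because the words are related.
embedding_similarity = cosine(mean_embedding(text_a, TOY_EMBEDDINGS, 2),
                              mean_embedding(text_b, TOY_EMBEDDINGS, 2))
```

Under these toy values, BoW treats the two texts as unrelated while the averaged embeddings treat them as near-identical, which is the kind of semantic relation BoW cannot express.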

References

F. Lomio, E. Iannone, A. De Lucia, F. Palomba, and V. Lenarduzzi, “Just-in-time software vulnerability detection: Are we there yet?” The Journal of Systems & Software, vol. 188, p. 111283, 2022.

T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” in Proc. of Workshop at the International Conference on Learning Representations, 2013.

V. Efstathiou, C. Chatzilenas, and D. Spinellis, “Word embeddings for the software engineering domain,” in Proc. of Working Conference on Mining Software Repositories, ser. MSR ’18. ACM, 2018, pp. 38–41.

H. Perl, S. Dechand, M. Smith, D. Arp, F. Yamaguchi, K. Rieck, S. Fahl, and Y. Acar, “VCCFinder: Finding potential vulnerabilities in open-source projects to assist code audits,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. New York, NY, USA: ACM, 2015, pp. 426–437.

S. Chakraborty, R. Krishna, Y. Ding, and B. Ray, “Deep learning based vulnerability detection: Are we there yet?” IEEE Transactions on Software Engineering, vol. 48, no. 9, pp. 3280–3296, 2022.

Published

2023-09-11