Software Repositories and Machine Learning Research in Cyber Security: Conclusions, Acknowledgment

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Mounika Vanamala, Department of Computer Science, University of Wisconsin-Eau Claire, United States;

(2) Keith Bryant, Department of Computer Science, University of Wisconsin-Eau Claire, United States;

(3) Alex Caravella, Department of Computer Science, University of Wisconsin-Eau Claire, United States.

Conclusion

Given the significance of cyber security vulnerability controls during the software requirements phase, the CAPEC software vulnerability repository emerged as the most practical repository for this study. Its arrangement of attack patterns facilitates precise identification and straightforward referral back to CAPEC for recommended defense strategies. We defined and elaborated on topic modeling, as well as unsupervised and supervised ML methods, presenting recent research examples and the applicability of these approaches.

As our research continues, we will implement supervised machine learning. The CAPEC repository provides a prelabeled data set, which is a valuable asset for building a training set. Supervised ML has the added benefit of supporting metrics that can be used to fine-tune the ML process, enabling thorough evaluation and process improvement. A training set for the SRS document must either be created or located before supervised ML can be applied.

Given the absence of a comparable research framework employing supervised ML, our future work will assess and compare results from the Naïve Bayes and Random Forest (RF) ML methods. Naïve Bayes performs well statistically on both large and small data sets, making it suitable for the modest data set of SRS documents as well as the larger data set of CAPEC vulnerabilities. RF's capacity to counteract overfitting aligns well with the intricate data from CAPEC. The algorithm that returns the most accurate recommendations of CAPEC attack patterns for an SRS document will be used to build an automated tool for result processing and visualization.
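As an illustration of the supervised approach outlined above, the following is a minimal sketch of a multinomial Naïve Bayes text classifier written with the Python standard library only. It is not the authors' implementation. The training sentences, the test sentences, and the labels `sql_injection` and `cross_site_scripting` are hypothetical stand-ins for prelabeled attack pattern descriptions and SRS statements, and the names `MultinomialNaiveBayes`, `tokenize`, and `accuracy` are introduced here for illustration. The sketch covers only the Naïve Bayes half of the planned comparison; an RF model would be scored with the same `accuracy` function so that the two methods can be compared on equal terms.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a string and split it on whitespace."""
    return text.lower().split()


class MultinomialNaiveBayes:
    """Minimal multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, documents, labels):
        self.class_doc_counts = Counter(labels)   # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def log_scores(self, text):
        # Words never seen during training are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_doc_counts.items():
            score = math.log(doc_count / self.total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                # Smoothed log likelihood of the token given the class.
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


def accuracy(model, documents, labels):
    """Fraction of documents whose predicted label matches the true label."""
    correct = sum(model.predict(d) == y for d, y in zip(documents, labels))
    return correct / len(labels)


# Hypothetical prelabeled descriptions (assumption: not real CAPEC text).
train_texts = [
    "attacker injects sql statements through unvalidated input fields",
    "malicious sql query modifies database records via form input",
    "attacker injects script into web page viewed by other users",
    "malicious script executes in the browser of the victim",
]
train_labels = [
    "sql_injection",
    "sql_injection",
    "cross_site_scripting",
    "cross_site_scripting",
]

# Hypothetical SRS-style requirement statements with expected labels.
test_texts = [
    "the system shall store user input in a sql database",
    "the page shall display script content submitted by other users",
]
test_labels = ["sql_injection", "cross_site_scripting"]

model = MultinomialNaiveBayes().fit(train_texts, train_labels)
for requirement in test_texts:
    print(requirement, "->", model.predict(requirement))
print("accuracy:", accuracy(model, test_texts, test_labels))
```

Accuracy is used here as the simplest possible metric. With a realistically sized and imbalanced label set, per-class precision and recall would likely be more informative when tuning the process.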

Acknowledgment

Funding Information

Author’s Contributions

Keith Bryant and Alex Caravella: Acquisition, analysis, and interpretation of data; writing of content.

Keith Bryant, Alex Caravella, and Mounika Vanamala: Conception and design of the article, generation of intellectual content, and critical review of the article.

Mounika Vanamala: Contribution to the ideation of intellectual content, review of the article, and coordination for publication.

Ethics

This article is original and contains unpublished material. The corresponding author confirms that all of the other authors have read and approved the manuscript and that no ethical issues are involved.
