Safe AI for Code

Model Probing Techniques

Here we present our investigations from white-box analysis-based techniques of Code-LLMs.

Trojan Signatures in Code-LLMs (SeT LLM at ICLR '24). Fields et al. (2021) defined trojan signatures as discernible differences in parameters between trojaned and non-trojaned classes. While their study found such signatures in image models, our research on large language models (LLMs) for source code classification shows that trojan signatures do not generalize. We found that trojaned code models are stubborn, even when the models were poisoned under more explicit settings (finetuned with pre-trained weights frozen). We analyzed nine trojaned models for two binary classification tasks: clone and defect detection. To the best of our knowledge, this is the first work to examine weight-based trojan signature revelation techniques for large-language models of code and furthermore to demonstrate that detecting trojans only from the weights in such models is a hard problem.

Contributors: Aftab Hussain, Md Rafiqul Islam Rabin, Mohammad Amin Alipour

Measuring Poisoning Impact on CodeBERT. Large language models (LLMs) have transformed software development, but concerns about safety, especially hidden backdoors or trojans, have emerged. This paper focuses on analyzing model parameters to detect potential backdoor signals in code models, specifically examining attention weights, biases, activation values, and context embeddings in clean and poisoned CodeBERT models. Results indicate noticeable patterns in activation values and context embeddings of poisoned samples, contributing to efforts in white-box detection of backdoor signals in LLMs of code.

Contributors: Aftab Hussain, Md Rafiqul Islam Rabin, Navid Ayoobi, Mohammad Amin Alipour

Black Box Techniques

Here we present our research findings based on black-box approaches for trojaned Code-LLMs.

OSeql: Trigger detection in Code-LLMs. Large language models (LLMs) are now a key part of software development. They learn from huge code datasets, but verifying each data point is tough. There's a risk of injecting malicious data (trojans) during training, making models vulnerable. This sneaky move can compromise model integrity and mess with downstream tasks that are deployed by the models. Meet OSeql, our new defense techique for spotting trojan-triggering inputs in code. Our results? Almost 100% recall in detecting trojaned inputs and over 90% accuracy in identifying triggers. We put OSeql to the test on trojaned versions of CodeBERT, PLBART, CodeT5, BART, and RoBERTa.

Contributors: Aftab Hussain, Md Rafiqul Islam Rabin, Toufique Ahmed, Mohammad Amin Alipour, Bowen Xu

TrojanedCM: A Repository of Trojaned Code-LLMs and a Poisoning Framework. We introduce TrojanedCM, a repository offering diverse trojaned source code models, including popular architectures like CodeBERT and PLBART. Covering defect detection, clone detection, and text-to-code generation tasks, the repository allows researchers to test trojan detection and unlearning techniques. It includes poisoned models based on benchmark datasets and provides full access to model architecture and parameters. Additionally, a poisoning framework is provided for deploying various poisoning strategies on source code models.

Contributors: Aftab Hussain, Md Rafiqul Islam Rabin, Mohammad Amin Alipour

About

Here we present research works of the Safe AI for Code Project at the Software Engineering Research Group (SERG), led by Prof. Amin Alipour at the Department of Computer Science, University of Houston. The Safe AI for Code Project is supported by SRI and IARPA.