Machine-learning models that power next-gen code-completion tools like GitHub Copilot can help software developers write more functional code, without making it less secure.
That's the tentative result of a small, 58-person study conducted by a group of New York University computer scientists.
In a paper distributed via ArXiv, Gustavo Sandoval, Hammond Pearce, Teo Nys, Ramesh Karri, Brendan Dolan-Gavitt, and Siddharth Garg recount how they put the security of source code created with the help of large language models (LLMs) to the test.
LLMs like the OpenAI GPT family have been trained on massive amounts of public text data, or public source code in the case of OpenAI's Codex, a GPT descendant and the foundation of GitHub's Copilot. As such, they may reproduce errors made in the past by human programmers, illustrating the maxim "garbage in, garbage out." There was a fear that these tools would regurgitate and suggest bad code to developers, who would insert the stuff into their projects.
What's more, code security can be contextual: code that's secure in isolation may be insecure when executed in a particular sequence with other software. So, these auto-complete tools may offer code suggestions that on their own are fine, but connected with other code, are now vulnerable to attack or just plain broken. That said, it turns out these tools may not actually make humans any worse at programming.
In some sense, the researchers were putting out their own fire. About a year ago, two of the same computer scientists contributed to a paper titled "Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions." That work found about 40 percent of the output from Copilot included potentially exploitable weaknesses (CWEs).
"The difference between the two papers is that 'Asleep at the Keyboard' was looking at fully automated code generation (no human in the loop), and we did not have human users to compare against, so we couldn't say anything about how the security of Copilot's compared to the security of human-written code," said Brendan Dolan-Gavitt, co-author on both papers and assistant professor in the computer science and engineering department at NYU Tandon, in an email to The Register.
"The user study paper tries to directly tackle those missing pieces, by having half of the users get assistance from Codex (the model that powers Copilot) and having the other half write the code themselves. However, it is also narrower than 'Asleep at the Keyboard': we only looked at one task and one language (writing a linked list in C)."
In the latest report, "Security Implications of Large Language Model Code Assistants: A User Study," a slightly varied set of NYU researchers acknowledge that previous work fails to model the usage of LLM-based tools like Copilot realistically.
"First, these studies assume that the entire code is automatically generated by the LLM (we will call this the autopilot mode)," the boffins explain in their paper.
"In practice, code completion LLMs assist developers with suggestions that they will choose to accept, edit or reject. This means that while programmers prone to automation bias might naively accept buggy completions, other developers might produce less buggy code by using the time saved to fix bugs."
Second, they observe that while LLMs have been shown to produce buggy code, humans do so too. The bugs in LLM training data came from people.
So rather than assess the bugginess of LLM-generated code on its own, they set out to compare how the code produced by human developers assisted by machine-learning models differs from code produced by programmers working on their own.
The NYU computer scientists recruited 58 survey participants - undergraduate and graduate students in software development courses - and divided them up into a Control group, who would work without suggestions, and an Assisted group, who had access to a custom suggestion system built using the OpenAI Codex API. They also used the Codex model to create 30 solutions to the given programming problems as a point of comparison. This Autopilot group functioned mainly as a second control group.
Both the Assisted and Control groups were allowed to consult web resources, such as Google and Stack Overflow, but not to ask others for help. Work was done in Visual Studio Code within a web-based container built with open source Anubis.
The participants were asked to complete a shopping list program using the C programming language because "it is easy for developers to inadvertently express vulnerable design patterns in C" and because the C compiler toolchain used doesn't check for errors to the same degree toolchains for modern languages, such as Go and Rust, do.
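To make that concrete, here is a minimal sketch of the kind of linked-list code such a task involves. It is not the study's assignment or any participant's submission: the `item` struct, the `ITEM_NAME_LEN` limit, and the `list_add` and `list_free` helpers are all hypothetical names chosen for illustration. The comments mark spots where a C programmer, or an accepted suggestion, could plausibly slip.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define ITEM_NAME_LEN 32 /* hypothetical limit, not taken from the study */

typedef struct item {
    char name[ITEM_NAME_LEN];
    unsigned quantity;
    struct item *next;
} item;

/* Prepend an item to the list.
 * Returns 0 on success, -1 on bad input or allocation failure. */
int list_add(item **head, const char *name, unsigned quantity)
{
    assert(head != NULL);

    /* Pitfall: calling strcpy() without this length check would write
     * past name[] for long input (CWE-787, out-of-bounds write). */
    if (name == NULL || strlen(name) >= ITEM_NAME_LEN)
        return -1;

    item *node = malloc(sizeof *node);
    /* Pitfall: using node without this check dereferences NULL when
     * allocation fails (CWE-476). */
    if (node == NULL)
        return -1;

    strcpy(node->name, name); /* safe: length was checked above */
    node->quantity = quantity;
    node->next = *head;
    *head = node;
    return 0;
}

/* Free every node and reset the caller's head pointer. */
void list_free(item **head)
{
    assert(head != NULL);
    while (*head != NULL) {
        /* Pitfall: reading (*head)->next after free(*head) would be a
         * use-after-free (CWE-416), so save the pointer first. */
        item *next = (*head)->next;
        free(*head);
        *head = next;
    }
    /* *head is NULL here, so the caller is not left with a dangling pointer. */
}
```

None of these checks is enforced by the compiler, which is the point the researchers make about C: the unsafe variants compile without complaint.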
When the researchers manually analyzed the code produced by the Control and Assisted groups, they found that, contrary to prior work, AI code suggestions didn't make things worse overall.
"[W]e found no evidence to suggest that Codex assistance increases security bug incidence," the paper stated, while noting that the study's small sample size means further study is warranted. "On the contrary, there is some evidence that suggests that CWEs/LoC [lines of code] decrease with Codex assistance."
"It's hard to conclude this with much statistical confidence," said Siddharth Garg, a cybersecurity researcher and associate professor in the engineering department at NYU Tandon, in a phone interview with The Register.
Nonetheless, he said, "The data suggests Copilot users were not a lot worse off."
Dolan-Gavitt is similarly cautious about the findings.
"Current analysis of our user study results has not found any statistically significant differences - we are still analyzing this, including qualitatively, so I wouldn't draw strong conclusions from this, particularly since it was a small study (58 users total) and the users were all students rather than professional developers," he said.
"Still, we can say that with these users, on this task, the security impact of having AI assistance was probably not large: if it had a very large impact, we would have observed a larger difference between the two groups. We're doing a bit more statistical analysis to make that precise right now."
Beyond that, some other insights emerged. One is that Assisted group participants were more productive, generating more lines of code and completing a greater fraction of the functions in the assignment.
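That productivity gap is one reason the paper reports CWEs per line of code rather than raw bug counts: a group that writes more code has more room for bugs. A rough sketch of such a density metric follows. The function names are hypothetical, and averaging per-submission densities is an assumption made here for illustration, not necessarily how the paper aggregates its results.

```c
#include <assert.h>
#include <stddef.h>

/* CWEs per line of code for one submission.
 * Returns 0.0 for an empty submission to avoid dividing by zero. */
double cwe_density(size_t cwe_count, size_t lines_of_code)
{
    if (lines_of_code == 0)
        return 0.0;
    return (double)cwe_count / (double)lines_of_code;
}

/* Mean of the per-submission densities across n submissions.
 * cwes[i] and loc[i] describe submission i. Returns 0.0 when n is 0. */
double mean_cwe_density(const size_t *cwes, const size_t *loc, size_t n)
{
    if (n == 0)
        return 0.0;
    assert(cwes != NULL && loc != NULL);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += cwe_density(cwes[i], loc[i]);
    return sum / (double)n;
}
```

Under this measure, a submission with one flagged weakness in four lines scores 0.25, while one with the same single weakness in 400 lines scores 0.0025, so the longer program is not penalized simply for being longer.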
"Users in the Assisted group passed more functional tests and produced more functional code," said Garg, adding that results of this sort may help companies looking at assistive coding tools decide whether to deploy them.
Another is that the researchers were able to distinguish the output produced by the Control, Assisted, and Autopilot groups, which may allay concerns about AI-powered cheating in educational settings.
The boffins also found that AI tools need to be considered in the context of user error. "Users provide prompts which may include bugs, accept buggy prompts which end up in the 'completed' programs as well as accept bugs which are later removed," the paper says. "In some cases, users also end up with more bugs than were suggested by the model!"
Expect further work along these lines. ®