CWEval: Rigorous Evaluation for Secure Code Generation

CWEval is an evaluation framework for secure code generation. Instead of using separate benchmarks for functionality and security evaluation in previous works, CWEval is the first simultaneous outcome-driven evaluation for both of them on the same set of programming tasks, truely reflecting the tension between functionality and security of LLM code generation.

Based on risky coding scenarios covered by CWE-related documentations (e.g. CodeQL, MITRE CWE List) (1), we design CWEval coding tasks which consist of three components:

High-quality coding task specifications (2): with both well-defined function signature and natural language instruction, our specifications are definite and clear for LLMs to follow, and they maintain the security semantics of the corresponding CWE types.
Reference implementations (3): insecure reference implementations confirm the existence of the vulnerabilities, and secure reference implementations ensure the existence of implementations that are both functional and secure.
Test Oracles (4): functionality test oracles examine whether a given implementation fulfills the basic functionality of the coding task, and security test oracles check if it is secure with adversarial inputs. Instead of static analysis, we capture various dynamic properties to verify security. For example, given the adversarial/malicious inputs, a secure implementation should:
- return expected outputs (as shown in the example above);
- return within certain time limits (e.g. ReDoS (CWE-1333));
- have no unexpected side-effect on the environment (e.g. SQL Injection (CWE-089) where an adversarial input can maliciously modify the database);
- have no illegal memory access (e.g. Buffer Overflow (CWE-121) in C/C++, which can be reported by the address sanitizer);

As the initial proof-of-concept version, CWEval-bench has 119 high-quality risky coding tasks covering 31 CWE types and spanning 5 popular programming languages:

Core set: 25 Python ones, 23 JavaScript ones, 21 C++ ones, 20 C ones, 19 Golang ones;
Language-specific set: 11 C-memory-vulnerability ones;

We are actively working on the continuous development of CWEval! Welcome to contribute more coding tasks, suggest improvements or report issues!

:
Simultaneous Evaluation on both Functionality and Security
of LLM-generated Code

🏆 Leaderboard 🏆

About

BibTeX

:Simultaneous Evaluation on both Functionality and Securityof LLM-generated Code

🏆 Leaderboard 🏆

About

BibTeX

:
Simultaneous Evaluation on both Functionality and Security
of LLM-generated Code