REPOCOD

Leaderboard

Full
Rank	Model	Pass@1
1	GPT-4o + RAG (Sparse-Retrieval)	27.35
2	DeepSeek-V2.5 + RAG (Current-File)	27.04
3	Codestral-22B + RAG (Current-File)	20.00
4	Claude 3.5 Sonnet + RAG (Current-File)	19.80
5	GPT-4o-Mini + RAG (Current-File)	18.67
6	OpenCodeInterpreter-33B + RAG (Current-File)	18.27
7	DeepSeekCoder-33B + RAG (Dense-Retrieval)	17.14
7	Qwen2.5-Coder-7B + RAG (Dense-Retrieval)	17.14
9	DeepSeekCoder-6.7B + RAG (Sparse-Retrieval)	14.08
10	OpenCodeInterpreter-6.7B + RAG (Current-File)	13.16
11	CodeLlama-13B + RAG (Dense-Retrieval)	12.76
12	CodeLlama-7B + RAG (Sparse-Retrieval)	10.71

Lite
1	DeepSeeK-V3 + RAG (Dense-Retrieval)	10.00
2	GPT-4o + RAG (Sparse-Retrieval)	9.00
3	GPT-4o-Mini + RAG (Sparse-Retrieval)	8.00
4	DeepSeek-V2.5 + RAG (Dense-Retrieval)	7.50
5	OpenCodeInterpreter-33B + RAG (Dense-Retrieval)	6.00
7	Claude 3.5 Sonnet + OpenHands	5.50
6	OpenAI o3-mini + RAG (Sparse-Retrieval)	5.50
8	OpenCodeInterpreter-6.7B + RAG (Dense-Retrieval)	5.00
9	Codestral-22B + RAG (Dense-Retrieval)	4.50
9	DeepSeek-R1 + RAG (Sparse-Retrieval)	4.50
11	OpenAI o1 + RAG (Sparse-Retrieval)	4.00
11	DeepSeekCoder-33B + RAG (Sparse-Retrieval)	4.00
11	GPT-4.5 Preview + RAG (Sparse-Retrieval)	4.00
14	Claude 3.7 Sonnet + RAG (Sparse-Retrieval)	3.50
15	Claude 3.5 Sonnet + RAG (Dense-Retrieval)	3.00
16	CodeLlama-34B + RAG (Sparse-Retrieval)	2.50
16	DeepSeekCoder-6.7B + RAG (Sparse-Retrieval)	2.50
18	CodeLlama-7B + RAG (Dense-Retrieval)	1.50
18	QWEN-2.5-Coder-7B + RAG (Dense-Retrieval)	1.50

Note: All results for REPOCOD-FULL are evaluated under three retrieval settings: Sparse Retrieval, Dense Retrieval, and Current File. See our paper for details.

Why REPOCOD?

REPOCOD is designed to go beyond conventional benchmarks with repository-level tasks, complex function generation, and rigorous correctness evaluation.

Compared to SWE-Bench

Task Type: REPOCOD focuses on general code generation; SWE-Bench resolves GitHub issues via pull requests.
Test Coverage: REPOCOD includes 2.6× more tests per task — 313.5 vs. SWE-Bench's 120.8.

Compared to HumanEval, MBPP, CoderEval, ClassEval

Scale: 980 tasks from 11 real-world Python projects.
Task Type: Full-function generation, not toy examples or snippets.
Context: Many tasks require repository-level understanding (e.g., dependencies, imports).
Validation: Correctness verified via hundreds of developer-written test cases.
Complexity: Longest average canonical solution (331.6 tokens) and highest cyclomatic complexity (9.00) across existing repository level code generation benchmarks.

For benchmarking and evaluation, please visit our GitHub repository for more details.

Abstract

Large language models (LLMs) have achieved high accuracy, i.e., more than 90 pass@1, in solving Python coding problems in HumanEval and MBPP. Thus, a natural question is, whether LLMs achieve comparable code completion performance compared to human developers? Unfortunately, one cannot answer this question using existing manual crafted or simple (e.g., single-line) code generation benchmarks, since such tasks fail to represent real-world software development tasks. In addition, existing benchmarks often use poor code correctness metrics, providing misleading conclusions.

To address these challenges, we create REPOCOD, a code generation benchmark with 980 problems collected from 11 popular real-world projects, with more than 58% of them requiring file-level or repository-level context information. In addition, REPOCOD has the longest average canonical solution length (331.6 tokens) and the highest average cyclomatic complexity (9.00) compared to existing benchmarks. Each task in REPOCOD includes 313.5 developerwritten test cases on average for better correctness evaluation. In our evaluations of ten LLMs, none of the models achieve more than 30 pass@1 on REPOCOD, indicating the necessity of building stronger LLMs that can help developers in real-world software development.

Data Collection Pipeline

We employ a three-stage data collection pipeline to efficiently gather target functions from popular repositories: Repository Selection, Target Function Selection, and Relevant Test Case Collection. For more details, feel free to read our paper!

Dataset Statistics

REPOCOD Statistics — REPOCOD (Full) consists of 980 instances from 11 repositories across diverse domains, including data science, scientific computing, web, and software development. This table details statistics for each context complexity type—repository-level, file-level, and self-contained—including #NL (tokens in target descriptions), #GT (tokens in canonical solutions), Cyclo. (average cyclomatic complexity), and #Funcs. (number of target functions).

BibTeX

@inproceedings{liang2024languagemodelsreplaceprogrammers, title={Can Language Models Replace Programmers for Coding? REPOCOD Says 'Not Yet'}, author={Shanchao Liang and Yiran Hu and Nan Jiang and Lin Tan}, year={2024}, eprint={2410.21647}, booktitle = "Proceedings of the 63nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)", month = july, year = "2025", note = {To appear}, publisher = "Association for Computational Linguistics", }

Can Language Models Replace Programmers for Coding? 🐟 RepoCod Says 'Not Yet'