SWE-Bench Pro


Code and data for the following work:

SWE-bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

HuggingFace: https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro

Public Leaderboard: https://scale.com/leaderboard/swe_bench_pro_public

Commercial (Private) Leaderboard: https://scale.com/leaderboard/swe_bench_pro_commercial

Overview

SWE-Bench Pro is a challenging benchmark for evaluating LLMs and agents on long-horizon software engineering tasks.
Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem.

The dataset is inspired by SWE-bench: https://github.com/SWE-bench/SWE-bench

To access SWE-bench Pro, copy and run the following code:

from datasets import load_dataset
swebench = load_dataset('ScaleAI/SWE-bench_Pro', split='test')
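
As a quick sanity check, you can inspect one loaded instance. The field names below (for example image_name, which is mentioned later in this README) are assumptions; verify them against the actual dataset schema.

# Minimal sketch: peek at one task instance after loading the dataset.
# Field names such as 'image_name' are assumptions; check the
# HuggingFace dataset card for the authoritative schema.
from datasets import load_dataset

swebench = load_dataset('ScaleAI/SWE-bench_Pro', split='test')
print(len(swebench))              # number of task instances
example = swebench[0]
print(sorted(example.keys()))     # list the available fields
print(example.get('image_name'))  # prebuilt Docker image for this task (see Setup)
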
Setup

SWE-bench Pro uses Docker for reproducible evaluations.
In addition, the evaluation script requires Modal to run evaluations at scale.

- Advertisement -

Follow the instructions in the Docker setup guide to install Docker on your machine.
If you’re setting up on Linux, we also recommend following the post-installation steps.

Run the following commands to store your Modal credentials:

pip install modal
modal setup # and follow the prompts to generate your token and secret

After running these steps, you should see a token ID and secret in ~/.modal.toml, e.g.:

token_id =
token_secret =
active = true
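
If you want to programmatically confirm the credentials were written, a quick check (Python 3.11+, which ships tomllib in the standard library) might look like this:

import tomllib
from pathlib import Path

# Read the config that `modal setup` wrote to your home directory;
# the token fields shown above should be present.
with open(Path.home() / '.modal.toml', 'rb') as f:
    config = tomllib.load(f)
print(config)
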

We provide prebuilt Docker images for each instance in the following Docker Hub repository:

https://hub.docker.com/r/jefzda/sweap-images

The image tags follow this format:

jefzda/sweap-images:{repo_base}.{repo_name}-{repo_base}__{repo_name}-{hash}

For example:

jefzda/sweap-images:gravitational.teleport-gravitational__teleport-82185f232ae8974258397e121b3bc2ed0c3729ed-v626ec2a48416b10a88641359a169d99e935ff03

(Update 9/23) You can also use the image_name field in the HuggingFace dataset.

Note that bash runs by default in our images, so you should not manually invoke bash when running them. See #6.
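
For illustration, here is a minimal sketch of pulling one of these images and starting it via Python's subprocess module; the tag is the example shown above, and everything else is an assumption about how you might drive the Docker CLI.

import subprocess

# Example tag taken from the naming scheme above.
image = ('jefzda/sweap-images:gravitational.teleport-gravitational__teleport-'
         '82185f232ae8974258397e121b3bc2ed0c3729ed-'
         'v626ec2a48416b10a88641359a169d99e935ff03')

# Pull the prebuilt image for this instance.
subprocess.run(['docker', 'pull', image], check=True)

# The image drops you into bash by default, so do not append
# 'bash' to the command line yourself.
subprocess.run(['docker', 'run', '--rm', '-it', image], check=True)
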

Usage

First, generate patch predictions using your harness of choice.
Then evaluate them on SWE-bench Pro with the following command:

python swe_bench_pro_eval_modal.py \
    --raw_sample_path=external_hf_v2.csv \
    --patch_path={OUTPUT}/gold_patches.json \
    --output_dir={OUTPUT}/ \
    --scripts_dir=run_scripts \
    --num_workers=100 \
    --dockerhub_username=jefzda
Replace gold_patches.json with your own patch JSON, and point raw_sample_path at the SWE-Bench Pro CSV.
Gold patches can be compiled from the HuggingFace dataset.
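
A minimal sketch of compiling gold patches into a JSON file follows. The column names instance_id and patch, and the output format (a dict mapping instance ID to patch), are assumptions; check the dataset card and the expectations of swe_bench_pro_eval_modal.py before use.

import json
from datasets import load_dataset

# Assumed column names ('instance_id', 'patch'); verify against the
# dataset card. The instance_id -> patch mapping is also an
# assumption about what the eval script expects.
swebench = load_dataset('ScaleAI/SWE-bench_Pro', split='test')
gold_patches = {row['instance_id']: row['patch'] for row in swebench}

with open('gold_patches.json', 'w') as f:
    json.dump(gold_patches, f, indent=2)
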
