SWE-Bench Pro


Code and data for the following work:

SWE-bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

HuggingFace: https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro

Public Leaderboard: https://scale.com/leaderboard/swe_bench_pro_public

Commercial (Private) Leaderboard: https://scale.com/leaderboard/swe_bench_pro_commercial

Overview

SWE-Bench Pro is a challenging benchmark for evaluating LLMs and agents on long-horizon software engineering tasks.
Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem.

The dataset is inspired by SWE-bench: https://github.com/SWE-bench/SWE-bench

To access SWE-bench Pro, copy and run the following code:

from datasets import load_dataset
swebench = load_dataset('ScaleAI/SWE-bench_Pro', split='test')
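
As a quick sanity check, you can inspect one loaded instance. The field names below (for example image_name, which is mentioned later in this README) are assumptions; verify them against the actual dataset schema.

# Minimal sketch: peek at one task instance after loading the dataset.
# Field names such as 'image_name' are assumptions; check the
# HuggingFace dataset card for the authoritative schema.
from datasets import load_dataset

swebench = load_dataset('ScaleAI/SWE-bench_Pro', split='test')
print(len(swebench))              # number of task instances
example = swebench[0]
print(sorted(example.keys()))     # list the available fields
print(example.get('image_name'))  # prebuilt Docker image for this task (see Setup)
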
Setup

SWE-bench Pro uses Docker for reproducible evaluations.
In addition, the evaluation script requires Modal to run evaluations at scale.

- Advertisement -

Follow the instructions in the Docker setup guide to install Docker on your machine.
If you’re setting up on Linux, we also recommend following the post-installation steps.

Run the following commands to store your Modal credentials:

pip install modal
modal setup # and follow the prompts to generate your token and secret

After running these steps, you should see a token ID and secret in ~/.modal.toml, e.g.:

token_id =
token_secret =
active = true
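
If you want to programmatically confirm the credentials were written, a quick check (Python 3.11+, which ships tomllib in the standard library) might look like this:

import tomllib
from pathlib import Path

# Read the config that `modal setup` wrote to your home directory;
# the token fields shown above should be present.
with open(Path.home() / '.modal.toml', 'rb') as f:
    config = tomllib.load(f)
print(config)
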

We provide prebuilt Docker images for each instance in the following Docker Hub repository:

https://hub.docker.com/r/jefzda/sweap-images

The image tags follow this format:

jefzda/sweap-images:{repo_base}.{repo_name}-{repo_base}__{repo_name}-{hash}

For example:

jefzda/sweap-images:gravitational.teleport-gravitational__teleport-82185f232ae8974258397e121b3bc2ed0c3729ed-v626ec2a48416b10a88641359a169d99e935ff03

(Update 9/23) You can also use the image_name field in the HuggingFace dataset.

Note that bash runs by default in our images, so you should not manually invoke bash when running them. See #6.
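
For illustration, here is a minimal sketch of pulling one of these images and starting it via Python's subprocess module; the tag is the example shown above, and everything else is an assumption about how you might drive the Docker CLI.

import subprocess

# Example tag taken from the naming scheme above.
image = ('jefzda/sweap-images:gravitational.teleport-gravitational__teleport-'
         '82185f232ae8974258397e121b3bc2ed0c3729ed-'
         'v626ec2a48416b10a88641359a169d99e935ff03')

# Pull the prebuilt image for this instance.
subprocess.run(['docker', 'pull', image], check=True)

# The image drops you into bash by default, so do not append
# 'bash' to the command line yourself.
subprocess.run(['docker', 'run', '--rm', '-it', image], check=True)
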

Usage

First, generate patch predictions using your harness of choice.
Then evaluate them on SWE-bench Pro with the following command:

python swe_bench_pro_eval_modal.py \
    --raw_sample_path=external_hf_v2.csv \
    --patch_path={OUTPUT}/gold_patches.json \
    --output_dir={OUTPUT}/ \
    --scripts_dir=run_scripts \
    --num_workers=100 \
    --dockerhub_username=jefzda
Replace gold_patches.json with your own patch JSON, and point raw_sample_path at the SWE-Bench Pro CSV.
Gold patches can be compiled from the HuggingFace dataset.
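
A minimal sketch of compiling gold patches into a JSON file follows. The column names instance_id and patch, and the output format (a dict mapping instance ID to patch), are assumptions; check the dataset card and the expectations of swe_bench_pro_eval_modal.py before use.

import json
from datasets import load_dataset

# Assumed column names ('instance_id', 'patch'); verify against the
# dataset card. The instance_id -> patch mapping is also an
# assumption about what the eval script expects.
swebench = load_dataset('ScaleAI/SWE-bench_Pro', split='test')
gold_patches = {row['instance_id']: row['patch'] for row in swebench}

with open('gold_patches.json', 'w') as f:
    json.dump(gold_patches, f, indent=2)
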
