by greynewell
Run controlled experiments that compare an MCP server‑enabled LLM against a baseline LLM on real software‑engineering tasks such as bug‑fixing, exploit generation, and tool‑selection, producing quantitative performance metrics.
mcpbr provides a single‑command harness to benchmark MCP servers using established software‑engineering datasets (SWE‑bench, CyberGym, MCPToolBench++). It runs parallel evaluations—one with the MCP server and one baseline—so you can measure the exact impact of tool access on task success.
At a glance:
- Requires an Anthropic API key (ANTHROPIC_API_KEY).
- Create a configuration with mcpbr init or edit an existing YAML.
- Run evaluations with mcpbr run -c <config.yaml> [options].
- Other commands include mcpbr models, mcpbr benchmarks, mcpbr cleanup, and verbose logging options.
- Use -M to run only the MCP evaluation or -B for baseline only.
- Run mcpbr models to see which models are supported.
- Adjust timeout_seconds in the YAML, or use startup_timeout_ms / tool_timeout_ms for MCP‑specific limits.
- Export results with --output-junit <file>.xml and publish them with actions like EnricoMi/publish-unit-test-result-action.
# Install via pip
pip install mcpbr && mcpbr init && mcpbr run -c mcpbr.yaml -n 1 -v
# Or via npm
npm install -g mcpbr-cli && mcpbr init && mcpbr run -c mcpbr.yaml -n 1 -v
Benchmark your MCP server against real GitHub issues. One command, hard numbers.
Model Context Protocol Benchmark Runner
Stop guessing if your MCP server actually helps. Get hard numbers comparing tool-assisted vs. baseline agent performance on real GitHub issues.
Real metrics showing whether your MCP server improves agent performance on SWE-bench tasks. No vibes, just data.
MCP servers promise to make LLMs better at coding tasks. But how do you prove it?
mcpbr runs controlled experiments: same model, same tasks, same environment - the only variable is your MCP server. The result is hard numbers on whether tool access actually improves task success.
mcpbr supports multiple software engineering benchmarks through a flexible abstraction layer:
SWE-bench: Real GitHub issues requiring bug fixes and patches. The agent generates unified diffs that are evaluated by running pytest test suites.
CyberGym: Security vulnerabilities requiring Proof-of-Concept (PoC) exploits. The agent generates exploits that trigger crashes in vulnerable code.
MCPToolBench++: Large-scale MCP tool-use evaluation across 45+ categories. Tests agent capabilities in tool discovery, selection, invocation, and result interpretation.
# Run SWE-bench (default)
mcpbr run -c config.yaml
# Run CyberGym at level 2
mcpbr run -c config.yaml --benchmark cybergym --level 2
# Run MCPToolBench++
mcpbr run -c config.yaml --benchmark mcptoolbench
# List available benchmarks
mcpbr benchmarks
See the benchmarks guide for details on each benchmark and how to configure them.
This harness runs two parallel evaluations for each task: an MCP evaluation, where the agent has access to your MCP server's tools, and a baseline evaluation, where the same agent works without them.
By comparing the two, you can measure the effectiveness of your MCP server for different software engineering tasks. See the MCP integration guide for tips on testing your server.
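The headline improvement figure is the relative change in resolution rate between the two runs. A minimal sketch of that calculation, assuming the "+60.0%" in the sample output later in this README is computed as (mcp_rate - baseline_rate) / baseline_rate:

```python
# Minimal sketch (not part of mcpbr): derive the headline "improvement" figure
# from the two resolution rates reported in the summary.
def relative_improvement(mcp_resolved: int, baseline_resolved: int, total: int) -> str:
    mcp_rate = mcp_resolved / total
    baseline_rate = baseline_resolved / total
    # Relative change of the MCP run over the baseline run.
    return f"{(mcp_rate - baseline_rate) / baseline_rate:+.1%}"

print(relative_improvement(8, 5, 25))  # matches the sample output: +60.0%
```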
mcpbr includes built-in regression detection to catch performance degradations between MCP server versions:
A regression is detected when a task that passed in the baseline now fails in the current run. This helps you catch issues before deploying new versions of your MCP server.
# First, run a baseline evaluation and save results
mcpbr run -c config.yaml -o baseline.json
# Later, compare a new version against the baseline
mcpbr run -c config.yaml --baseline-results baseline.json --regression-threshold 0.1
# With notifications
mcpbr run -c config.yaml --baseline-results baseline.json \
--regression-threshold 0.1 \
--slack-webhook https://hooks.slack.com/services/YOUR/WEBHOOK/URL
======================================================================
REGRESSION DETECTION REPORT
======================================================================
Total tasks compared: 25
Regressions detected: 2
Improvements detected: 5
Regression rate: 8.0%
REGRESSIONS (previously passed, now failed):
----------------------------------------------------------------------
- django__django-11099
Error: Timeout
- sympy__sympy-18087
Error: Test suite failed
IMPROVEMENTS (previously failed, now passed):
----------------------------------------------------------------------
- astropy__astropy-12907
- pytest-dev__pytest-7373
- scikit-learn__scikit-learn-25570
- matplotlib__matplotlib-23913
- requests__requests-3362
======================================================================
For CI/CD integration, use --regression-threshold to fail the build when regressions exceed an acceptable rate:
# .github/workflows/test-mcp.yml
- name: Run mcpbr with regression detection
run: |
mcpbr run -c config.yaml \
--baseline-results baseline.json \
--regression-threshold 0.1 \
-o current.json
This will exit with code 1 if the regression rate exceeds 10%, failing the CI job.
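If you prefer to run the comparison yourself (for example in a custom CI step), the check reduces to diffing the resolved tasks in two results files. A minimal sketch, assuming the results.json schema shown later in this README (tasks[].instance_id, tasks[].mcp.resolved):

```python
# Minimal sketch of the regression check, mirroring --baseline-results /
# --regression-threshold. Assumes the results.json schema shown in this README.
import json

def load_resolved(path: str) -> dict[str, bool]:
    """Map instance_id -> whether the MCP run resolved the task."""
    with open(path) as f:
        data = json.load(f)
    return {t["instance_id"]: bool(t["mcp"].get("resolved")) for t in data["tasks"]}

baseline = load_resolved("baseline.json")
current = load_resolved("current.json")
shared = baseline.keys() & current.keys()

regressions = sorted(i for i in shared if baseline[i] and not current[i])
improvements = sorted(i for i in shared if not baseline[i] and current[i])
rate = len(regressions) / max(len(shared), 1)  # e.g., 2 of 25 tasks -> 8.0%

print(f"Regressions: {regressions}")
print(f"Improvements: {improvements}")
print(f"Regression rate: {rate:.1%}")
raise SystemExit(1 if rate > 0.1 else 0)  # analogous to --regression-threshold 0.1
```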
Full installation guide with detailed setup instructions.
Prerequisites:
- ANTHROPIC_API_KEY environment variable set
- Claude Code CLI (claude) installed
Supported models (aliases or full names):
- opus or claude-opus-4-5-20251101
- sonnet or claude-sonnet-4-5-20250929
- haiku or claude-haiku-4-5-20251001
Run mcpbr models to see the full list.
# Run with npx (no installation)
npx mcpbr-cli run -c config.yaml
# Or install globally
npm install -g mcpbr-cli
mcpbr run -c config.yaml
Note: The npm package requires Python 3.11+ and the mcpbr Python package (pip install mcpbr).
# Install from PyPI
pip install mcpbr
# Or install from source
git clone https://github.com/greynewell/mcpbr.git
cd mcpbr
pip install -e .
# Or with uv
uv pip install -e .
Note for Apple Silicon users: The harness automatically uses x86_64 Docker images via emulation. This may be slower than native ARM64 images but ensures compatibility with all SWE-bench tasks.
Get started in seconds with our example configurations:
# Set your API key
export ANTHROPIC_API_KEY="your-api-key"
# Run your first evaluation using an example config
mcpbr run -c examples/quick-start/getting-started.yaml -v
This runs 5 SWE-bench tasks with the filesystem server. Expected runtime: 15-30 minutes, cost: $2-5.
Explore the 25+ example configurations in the examples/ directory.
See the Examples README for the complete guide.
export ANTHROPIC_API_KEY="your-api-key"
mcpbr init
mcp_server:
command: "npx"
args:
- "-y"
- "@modelcontextprotocol/server-filesystem"
- "{workdir}"
env: {}
provider: "anthropic"
agent_harness: "claude-code"
model: "sonnet" # or full name: "claude-sonnet-4-5-20250929"
dataset: "SWE-bench/SWE-bench_Lite"
sample_size: 10
timeout_seconds: 300
max_concurrent: 4
mcpbr run --config config.yaml
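Before kicking off a long run, it can be worth sanity-checking the config. A minimal sketch of a hypothetical pre-flight check (not part of mcpbr; assumes PyYAML is installed):

```python
# Minimal sketch (hypothetical helper, not shipped with mcpbr): sanity-check a
# config file before starting a long evaluation run.
import yaml  # pip install pyyaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

args = cfg["mcp_server"].get("args", [])
# The MCP server args should reference the task repository via {workdir}.
assert any("{workdir}" in str(a) for a in args), "args should include the {workdir} placeholder"
assert cfg.get("timeout_seconds", 300) > 0, "timeout_seconds must be positive"
print(f"model={cfg.get('model')} dataset={cfg.get('dataset')} sample_size={cfg.get('sample_size')}")
```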
mcpbr includes a built-in Claude Code plugin that makes Claude an expert at running benchmarks correctly. When you clone this repository, Claude Code automatically detects the plugin and gains specialized knowledge about mcpbr.
When using Claude Code in this repository, you can simply say:
Claude will automatically validate your configuration before running anything, including checking that required {workdir} placeholders are in place.
The plugin includes three specialized skills:
- run-benchmark: Expert at running evaluations, with proper validation of mcpbr run commands
- generate-config: Generates valid mcpbr configuration files and ensures the {workdir} placeholder is included
- swe-bench-lite: Quick-start command for SWE-bench Lite
Just clone the repository and start asking Claude to run benchmarks:
git clone https://github.com/greynewell/mcpbr.git
cd mcpbr
# In Claude Code, simply say:
# "Run the SWE-bench Lite eval with 5 tasks"
The bundled plugin ensures Claude makes no silly mistakes and follows best practices automatically.
Full configuration reference with all options and examples.
The mcp_server section defines how to start your MCP server:
| Field | Description |
|---|---|
| command | Executable to run (e.g., npx, uvx, python) |
| args | Command arguments. Use {workdir} as a placeholder for the task repository path |
| env | Additional environment variables |
Anthropic Filesystem Server:
mcp_server:
command: "npx"
args: ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"]
Custom Python MCP Server:
mcp_server:
command: "python"
args: ["-m", "my_mcp_server", "--workspace", "{workdir}"]
env:
LOG_LEVEL: "debug"
Supermodel Codebase Analysis Server:
mcp_server:
command: "npx"
args: ["-y", "@supermodeltools/mcp-server"]
env:
SUPERMODEL_API_KEY: "${SUPERMODEL_API_KEY}"
mcpbr supports configurable timeouts for MCP server operations to handle different server types and workloads.
| Field | Description | Default |
|---|---|---|
| startup_timeout_ms | Timeout in milliseconds for MCP server startup | 60000 (60s) |
| tool_timeout_ms | Timeout in milliseconds for MCP tool execution | 900000 (15 min) |
These fields map to the MCP_TIMEOUT and MCP_TOOL_TIMEOUT environment variables used by Claude Code. See the Claude Code settings documentation for more details.
mcp_server:
command: "npx"
args: ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"]
startup_timeout_ms: 60000 # 60 seconds for server to start
tool_timeout_ms: 900000 # 15 minutes for long-running tools
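As noted above, these fields correspond to the MCP_TIMEOUT and MCP_TOOL_TIMEOUT environment variables that Claude Code reads. A minimal sketch of that mapping (mcpbr's actual internals may differ):

```python
# Minimal sketch of the mapping described above (mcpbr's internals may differ):
# the YAML timeout fields are passed to Claude Code as environment variables.
import os

def with_mcp_timeouts(mcp_server_cfg: dict) -> dict:
    env = dict(os.environ)
    if "startup_timeout_ms" in mcp_server_cfg:
        env["MCP_TIMEOUT"] = str(mcp_server_cfg["startup_timeout_ms"])
    if "tool_timeout_ms" in mcp_server_cfg:
        env["MCP_TOOL_TIMEOUT"] = str(mcp_server_cfg["tool_timeout_ms"])
    return env

env = with_mcp_timeouts({"startup_timeout_ms": 60000, "tool_timeout_ms": 900000})
print(env["MCP_TIMEOUT"], env["MCP_TOOL_TIMEOUT"])
```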
Different server types require different timeout settings based on their operational characteristics:
| Server Type | startup_timeout_ms | tool_timeout_ms | Notes |
|---|---|---|---|
| Fast (filesystem, git) | 10000 (10s) | 30000 (30s) | Local operations with minimal overhead |
| Medium (web search, APIs) | 30000 (30s) | 120000 (2m) | Network I/O with moderate latency |
| Slow (code analysis, databases) | 60000 (60s) | 900000 (15m) | Complex processing or large datasets |
When to adjust timeouts:
- Increase startup_timeout_ms if your server takes longer to initialize (e.g., loading large models, establishing database connections).
- Increase tool_timeout_ms if your tools perform long-running operations (e.g., codebase analysis, file processing, AI inference).

You can customize the prompt sent to the agent using the agent_prompt field:
agent_prompt: |
Fix the following bug in this repository:
{problem_statement}
Make the minimal changes necessary to fix the issue.
Focus on the root cause, not symptoms.
Use {problem_statement} as a placeholder for the SWE-bench issue text. You can also override the prompt via CLI with --prompt.
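The substitution itself is plain string templating. A minimal illustrative sketch (mcpbr performs this for you):

```python
# Minimal sketch of how the {problem_statement} placeholder gets filled in
# (illustrative only; mcpbr handles this substitution internally).
agent_prompt = (
    "Fix the following bug in this repository:\n"
    "{problem_statement}\n"
    "Make the minimal changes necessary to fix the issue."
)

problem_statement = "<issue text pulled from the SWE-bench task>"
prompt = agent_prompt.replace("{problem_statement}", problem_statement)
print(prompt)
```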
| Parameter | Default | Description |
|---|---|---|
| provider | anthropic | LLM provider |
| agent_harness | claude-code | Agent backend |
| benchmark | swe-bench | Benchmark to run (swe-bench, cybergym, or mcptoolbench) |
| agent_prompt | null | Custom prompt template (use {problem_statement} placeholder) |
| model | sonnet | Model alias or full ID |
| dataset | null | HuggingFace dataset (optional, benchmark provides default) |
| cybergym_level | 1 | CyberGym difficulty level (0-3, only for CyberGym benchmark) |
| sample_size | null | Number of tasks (null = full dataset) |
| timeout_seconds | 300 | Timeout per task |
| max_concurrent | 4 | Parallel task limit |
| max_iterations | 10 | Max agent iterations per task |
Full CLI documentation with all commands and options.
Get help for any command with --help or -h:
mcpbr --help
mcpbr run --help
mcpbr init --help
| Command | Description |
|---|---|
| mcpbr run | Run benchmark evaluation with configured MCP server |
| mcpbr init | Generate an example configuration file |
| mcpbr models | List supported models for evaluation |
| mcpbr providers | List available model providers |
| mcpbr harnesses | List available agent harnesses |
| mcpbr benchmarks | List available benchmarks (SWE-bench, CyberGym, MCPToolBench++) |
| mcpbr cleanup | Remove orphaned mcpbr Docker containers |
mcpbr run
Run SWE-bench evaluation with the configured MCP server.
| Option | Short | Description |
|---|---|---|
| --config PATH | -c | Path to YAML configuration file (required) |
| --model TEXT | -m | Override model from config |
| --benchmark TEXT | -b | Override benchmark from config (swe-bench, cybergym, or mcptoolbench) |
| --level INTEGER | | Override CyberGym difficulty level (0-3) |
| --sample INTEGER | -n | Override sample size from config |
| --mcp-only | -M | Run only MCP evaluation (skip baseline) |
| --baseline-only | -B | Run only baseline evaluation (skip MCP) |
| --no-prebuilt | | Disable pre-built SWE-bench images (build from scratch) |
| --output PATH | -o | Path to save JSON results |
| --report PATH | -r | Path to save Markdown report |
| --output-junit PATH | | Path to save JUnit XML report (for CI/CD integration) |
| --verbose | -v | Verbose output (-v summary, -vv detailed) |
| --log-file PATH | -l | Path to write raw JSON log output (single file) |
| --log-dir PATH | | Directory to write per-instance JSON log files |
| --task TEXT | -t | Run specific task(s) by instance_id (repeatable) |
| --prompt TEXT | | Override agent prompt (use {problem_statement} placeholder) |
| --baseline-results PATH | | Path to baseline results JSON for regression detection |
| --regression-threshold FLOAT | | Maximum acceptable regression rate (0-1). Exit with code 1 if exceeded. |
| --slack-webhook URL | | Slack webhook URL for regression notifications |
| --discord-webhook URL | | Discord webhook URL for regression notifications |
| --email-to EMAIL | | Email address for regression notifications |
| --email-from EMAIL | | Sender email address for notifications |
| --smtp-host HOST | | SMTP server hostname for email notifications |
| --smtp-port PORT | | SMTP server port (default: 587) |
| --smtp-user USER | | SMTP username for authentication |
| --smtp-password PASS | | SMTP password for authentication |
| --help | -h | Show help message |
# Full evaluation (MCP + baseline)
mcpbr run -c config.yaml
# Run only MCP evaluation
mcpbr run -c config.yaml -M
# Run only baseline evaluation
mcpbr run -c config.yaml -B
# Override model
mcpbr run -c config.yaml -m claude-3-5-sonnet-20241022
# Override sample size
mcpbr run -c config.yaml -n 50
# Save results and report
mcpbr run -c config.yaml -o results.json -r report.md
# Save JUnit XML for CI/CD
mcpbr run -c config.yaml --output-junit junit.xml
# Run specific tasks
mcpbr run -c config.yaml -t astropy__astropy-12907 -t django__django-11099
# Verbose output with per-instance logs
mcpbr run -c config.yaml -v --log-dir logs/
# Very verbose output
mcpbr run -c config.yaml -vv
# Run CyberGym benchmark
mcpbr run -c config.yaml --benchmark cybergym --level 2
# Run CyberGym with specific tasks
mcpbr run -c config.yaml --benchmark cybergym --level 3 -n 5
# Regression detection - compare against baseline
mcpbr run -c config.yaml --baseline-results baseline.json
# Regression detection with threshold (exit 1 if exceeded)
mcpbr run -c config.yaml --baseline-results baseline.json --regression-threshold 0.1
# Regression detection with Slack notifications
mcpbr run -c config.yaml --baseline-results baseline.json --slack-webhook https://hooks.slack.com/...
# Regression detection with Discord notifications
mcpbr run -c config.yaml --baseline-results baseline.json --discord-webhook https://discord.com/api/webhooks/...
# Regression detection with email notifications
mcpbr run -c config.yaml --baseline-results baseline.json \
--email-to team@example.com --email-from mcpbr@example.com \
--smtp-host smtp.gmail.com --smtp-port 587 \
--smtp-user user@gmail.com --smtp-password "app-password"
mcpbr init
Generate an example configuration file.
| Option | Short | Description |
|---|---|---|
| --output PATH | -o | Path to write example config (default: mcpbr.yaml) |
| --help | -h | Show help message |
mcpbr init
mcpbr init -o my-config.yaml
mcpbr models
List supported Anthropic models for evaluation.
mcpbr cleanup
Remove orphaned mcpbr Docker containers that were not properly cleaned up.
| Option | Short | Description |
|---|---|---|
| --dry-run | | Show containers that would be removed without removing them |
| --force | -f | Skip confirmation prompt |
| --help | -h | Show help message |
# Preview containers to remove
mcpbr cleanup --dry-run
# Remove containers with confirmation
mcpbr cleanup
# Remove containers without confirmation
mcpbr cleanup -f
Here's what a typical evaluation looks like:
$ mcpbr run -c config.yaml -v -o results.json --log-dir my-logs
mcpbr Evaluation
Config: config.yaml
Provider: anthropic
Model: sonnet
Agent Harness: claude-code
Dataset: SWE-bench/SWE-bench_Lite
Sample size: 10
Run MCP: True, Run Baseline: True
Pre-built images: True
Log dir: my-logs
Loading dataset: SWE-bench/SWE-bench_Lite
Evaluating 10 tasks
Provider: anthropic, Harness: claude-code
14:23:15 [MCP] Starting mcp run for astropy-12907:mcp
14:23:22 astropy-12907:mcp > TodoWrite
14:23:22 astropy-12907:mcp < Todos have been modified successfully...
14:23:26 astropy-12907:mcp > Glob
14:23:26 astropy-12907:mcp > Grep
14:23:27 astropy-12907:mcp < $WORKDIR/astropy/modeling/separable.py
14:23:27 astropy-12907:mcp < Found 5 files: astropy/modeling/tests/test_separable.py...
...
14:27:43 astropy-12907:mcp * done turns=31 tokens=115/6,542
14:28:30 [BASELINE] Starting baseline run for astropy-12907:baseline
...
Understanding evaluation results - detailed guide to interpreting output.
The harness displays real-time progress with verbose mode (-v) and a final summary table:
Evaluation Results
Summary
+-----------------+-----------+----------+
| Metric | MCP Agent | Baseline |
+-----------------+-----------+----------+
| Resolved | 8/25 | 5/25 |
| Resolution Rate | 32.0% | 20.0% |
+-----------------+-----------+----------+
Improvement: +60.0%
Per-Task Results
+------------------------+------+----------+-------+
| Instance ID | MCP | Baseline | Error |
+------------------------+------+----------+-------+
| astropy__astropy-12907 | PASS | PASS | |
| django__django-11099 | PASS | FAIL | |
| sympy__sympy-18087 | FAIL | FAIL | |
+------------------------+------+----------+-------+
Results saved to results.json
JSON results (--output)
{
"metadata": {
"timestamp": "2026-01-17T07:23:39.871437+00:00",
"config": {
"model": "sonnet",
"provider": "anthropic",
"agent_harness": "claude-code",
"dataset": "SWE-bench/SWE-bench_Lite",
"sample_size": 25,
"timeout_seconds": 600,
"max_iterations": 30
},
"mcp_server": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"]
}
},
"summary": {
"mcp": {"resolved": 8, "total": 25, "rate": 0.32},
"baseline": {"resolved": 5, "total": 25, "rate": 0.20},
"improvement": "+60.0%"
},
"tasks": [
{
"instance_id": "astropy__astropy-12907",
"mcp": {
"patch_generated": true,
"tokens": {"input": 115, "output": 6542},
"iterations": 30,
"tool_calls": 72,
"tool_usage": {
"TodoWrite": 4, "Task": 1, "Glob": 4,
"Grep": 11, "Bash": 27, "Read": 22,
"Write": 2, "Edit": 1
},
"resolved": true,
"patch_applied": true,
"fail_to_pass": {"passed": 2, "total": 2},
"pass_to_pass": {"passed": 10, "total": 10}
},
"baseline": {
"patch_generated": true,
"tokens": {"input": 63, "output": 7615},
"iterations": 30,
"tool_calls": 57,
"tool_usage": {
"TodoWrite": 4, "Glob": 3, "Grep": 4,
"Read": 14, "Bash": 26, "Write": 4, "Edit": 1
},
"resolved": true,
"patch_applied": true
}
}
]
}
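Because the JSON schema is stable, downstream analysis is straightforward. A minimal sketch that prints a per-task MCP vs. baseline comparison, assuming the schema shown above:

```python
# Minimal sketch (assumes the results.json schema shown above): print a quick
# per-task MCP vs. baseline comparison from a saved results file.
import json

with open("results.json") as f:
    results = json.load(f)

summary = results["summary"]
print(f"MCP {summary['mcp']['rate']:.0%} vs baseline {summary['baseline']['rate']:.0%} "
      f"(improvement {summary['improvement']})")

for task in results["tasks"]:
    mcp = "PASS" if task["mcp"]["resolved"] else "FAIL"
    base = "PASS" if task["baseline"]["resolved"] else "FAIL"
    print(f"{task['instance_id']:<40} mcp={mcp} baseline={base} "
          f"tool_calls={task['mcp']['tool_calls']}")
```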
Markdown report (--report)
Generates a human-readable Markdown report summarizing the evaluation.
Per-instance logs (--log-dir)
Creates a directory with detailed JSON log files for each task run. Filenames include timestamps to prevent overwrites:
my-logs/
astropy__astropy-12907_mcp_20260117_143052.json
astropy__astropy-12907_baseline_20260117_143156.json
django__django-11099_mcp_20260117_144023.json
django__django-11099_baseline_20260117_144512.json
Each log file contains the full stream of events from the agent CLI:
{
"instance_id": "astropy__astropy-12907",
"run_type": "mcp",
"events": [
{
"type": "system",
"subtype": "init",
"cwd": "/workspace",
"tools": ["Task", "Bash", "Glob", "Grep", "Read", "Edit", "Write", "TodoWrite"],
"model": "claude-sonnet-4-5-20250929",
"claude_code_version": "2.1.12"
},
{
"type": "assistant",
"message": {
"content": [{"type": "text", "text": "I'll help you fix this bug..."}]
}
},
{
"type": "assistant",
"message": {
"content": [{"type": "tool_use", "name": "Grep", "input": {"pattern": "separability"}}]
}
},
{
"type": "result",
"num_turns": 31,
"usage": {"input_tokens": 115, "output_tokens": 6542}
}
]
}
This is useful for debugging failed runs or analyzing agent behavior in detail.
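For example, you can tally which tools the agent leaned on from a single log file. A minimal sketch, assuming the event format shown above (the filename is from the earlier example):

```python
# Minimal sketch (assumes the event format shown above): tally tool_use events
# from one per-instance log to see which tools the agent relied on.
import json
from collections import Counter

with open("my-logs/astropy__astropy-12907_mcp_20260117_143052.json") as f:
    log = json.load(f)

tool_counts = Counter(
    block["name"]
    for event in log["events"]
    if event.get("type") == "assistant"
    for block in event["message"]["content"]
    if block.get("type") == "tool_use"
)
print(log["instance_id"], log["run_type"], dict(tool_counts))
```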
JUnit XML report (--output-junit)
The harness can generate JUnit XML reports for integration with CI/CD systems like GitHub Actions, GitLab CI, and Jenkins. Each task is represented as a test case, with resolved/unresolved tasks mapped to pass/fail states.
mcpbr run -c config.yaml --output-junit junit.xml
The JUnit XML report includes one test case per task, so resolved and unresolved tasks show up as passes and failures in your CI dashboard.
GitHub Actions:
name: MCP Benchmark
on: [push, pull_request]
jobs:
benchmark:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.11'
- name: Install mcpbr
run: pip install mcpbr
- name: Run benchmark
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
run: |
mcpbr run -c config.yaml --output-junit junit.xml
- name: Publish Test Results
uses: EnricoMi/publish-unit-test-result-action@v2
if: always()
with:
files: junit.xml
GitLab CI:
benchmark:
image: python:3.11
services:
- docker:dind
script:
- pip install mcpbr
- mcpbr run -c config.yaml --output-junit junit.xml
artifacts:
reports:
junit: junit.xml
Jenkins:
pipeline {
agent any
stages {
stage('Benchmark') {
steps {
sh 'pip install mcpbr'
sh 'mcpbr run -c config.yaml --output-junit junit.xml'
}
}
}
post {
always {
junit 'junit.xml'
}
}
}
The JUnit XML format enables native test result visualization in your CI/CD dashboard, making it easy to track benchmark performance over time and identify regressions.
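If you want to consume the report programmatically rather than through a dashboard, here is a minimal sketch that computes the resolution rate from junit.xml, assuming one testcase element per task with a failure (or error) child for unresolved tasks:

```python
# Minimal sketch (assumes one <testcase> per task, with <failure>/<error>
# children marking unresolved tasks): compute the resolution rate from the
# generated junit.xml in a custom CI step.
import xml.etree.ElementTree as ET

root = ET.parse("junit.xml").getroot()
cases = list(root.iter("testcase"))
passed = sum(1 for tc in cases if tc.find("failure") is None and tc.find("error") is None)
print(f"Resolved {passed}/{len(cases)} tasks ({passed / max(len(cases), 1):.0%})")
```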
Architecture deep dive - learn how mcpbr works internally.
The harness uses pre-built SWE-bench Docker images from Epoch AI's registry when available. These images ship with the repository checked out at the correct commit and all task dependencies pre-installed.
The agent (Claude Code CLI) runs inside the container, which means Bash commands execute against a fully installed environment (imports such as from astropy import ... just work).
If a pre-built image is not available for a task, the harness falls back to cloning the repository and attempting to install dependencies (less reliable).
mcpbr/
├── src/mcpbr/
│ ├── cli.py # Command-line interface
│ ├── config.py # Configuration models
│ ├── models.py # Supported model registry
│ ├── providers.py # LLM provider abstractions (extensible)
│ ├── harnesses.py # Agent harness implementations (extensible)
│ ├── benchmarks/ # Benchmark abstraction layer
│ │ ├── __init__.py # Registry and factory
│ │ ├── base.py # Benchmark protocol
│ │ ├── swebench.py # SWE-bench implementation
│ │ ├── cybergym.py # CyberGym implementation
│ │ └── mcptoolbench.py # MCPToolBench++ implementation
│ ├── harness.py # Main orchestrator
│ ├── agent.py # Baseline agent implementation
│ ├── docker_env.py # Docker environment management + in-container execution
│ ├── evaluation.py # Patch application and testing
│ ├── log_formatter.py # Log formatting and per-instance logging
│ └── reporting.py # Output formatting
├── tests/
│ ├── test_*.py # Unit tests
│ ├── test_benchmarks.py # Benchmark tests
│ └── test_integration.py # Integration tests
├── Dockerfile # Fallback image for task environments
└── config/
└── example.yaml # Example configuration
The architecture uses Protocol-based abstractions for providers, harnesses, and benchmarks, making it easy to add support for additional LLM providers, agent backends, or software engineering benchmarks in the future. See the API reference and benchmarks guide for more details.
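For a sense of what that looks like in practice, here is a hypothetical sketch of a benchmark Protocol; the real interface lives in src/mcpbr/benchmarks/base.py and may differ:

```python
# Hypothetical sketch of a Protocol-based benchmark abstraction (the real
# interface in src/mcpbr/benchmarks/base.py may differ): a new benchmark only
# needs to describe how to load tasks and how to evaluate agent output.
from typing import Any, Iterable, Protocol

class Benchmark(Protocol):
    name: str

    def load_tasks(self, dataset: str | None, sample_size: int | None) -> Iterable[dict[str, Any]]:
        """Yield task dicts (instance_id, problem statement, repo info, ...)."""
        ...

    def evaluate(self, task: dict[str, Any], agent_output: str) -> dict[str, Any]:
        """Score the agent's output, e.g. apply a patch and run the test suite."""
        ...
```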
┌─────────────────────────────────────────────────────────────────┐
│ Host Machine │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ mcpbr Harness (Python) │ │
│ │ - Loads SWE-bench tasks from HuggingFace │ │
│ │ - Pulls pre-built Docker images │ │
│ │ - Orchestrates agent runs │ │
│ │ - Collects results and generates reports │ │
│ └─────────────────────────┬─────────────────────────────────┘ │
│ │ docker exec │
│ ┌─────────────────────────▼─────────────────────────────────┐ │
│ │ Docker Container (per task) │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ Pre-built SWE-bench Image │ │ │
│ │ │ - Repository at correct commit │ │ │
│ │ │ - All dependencies installed (astropy, django...) │ │ │
│ │ │ - Node.js + Claude CLI (installed at startup) │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ Agent (Claude Code CLI) runs HERE: │ │
│ │ - Makes API calls to Anthropic │ │
│ │ - Executes Bash commands (with working imports!) │ │
│ │ - Reads/writes files │ │
│ │ - Generates patches │ │
│ │ │ │
│ │ Evaluation runs HERE: │ │
│ │ - Applies patch via git │ │
│ │ - Runs pytest with task's test suite │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
FAQ - Quick answers to common questions
Full troubleshooting guide - Detailed solutions to common issues
Ensure Docker is running:
docker info
If the harness can't pull a pre-built image for a task, it will fall back to building from scratch. You can also manually pull images:
docker pull ghcr.io/epoch-research/swe-bench.eval.x86_64.astropy__astropy-12907
On ARM64 Macs, x86_64 Docker images run via emulation, which is slower. This is normal. If you're experiencing issues, ensure you have Rosetta 2 installed:
softwareupdate --install-rosetta
Test your MCP server independently:
npx -y @modelcontextprotocol/server-filesystem /tmp/test
Ensure your Anthropic API key is set:
export ANTHROPIC_API_KEY="sk-ant-..."
Increase the timeout in your config:
timeout_seconds: 600
Ensure the Claude Code CLI is installed and in your PATH:
which claude # Should return the path to the CLI
# Install dev dependencies
pip install -e ".[dev]"
# Run unit tests
pytest -m "not integration"
# Run integration tests (requires API keys and Docker)
pytest -m integration
# Run all tests
pytest
# Lint
ruff check src/
We're building the de facto standard for MCP server benchmarking! Our v1.0 Roadmap includes 200+ features across 11 strategic categories:
🎯 Good First Issues | 🙋 Help Wanted | 📋 View Roadmap
Phase 1: Foundation (v0.3.0)
Phase 2: Benchmarks (v0.4.0)
Phase 3: Developer Experience (v0.5.0)
Phase 4: Platform Expansion (v0.6.0)
Phase 5: MCP Testing Suite (v1.0.0)
We welcome contributions! Check out our 30+ good first issues perfect for newcomers:
See the contributing guide to get started!
New to mcpbr or want to optimize your workflow? Check out the Best Practices Guide for:
Please see CONTRIBUTING.md or the contributing guide for guidelines on how to contribute.
All contributors are expected to follow our Community Guidelines.
MIT - see LICENSE for details.
Built by Grey Newell