Methodology

How scores are computed today, what's being measured, and where the current approach stops short.

Status: documented rationales, pre-benchmark weights

Today every score is derived from static signals — file existence and content-length checks on the cloned tree. No agent is actually run. Per-model rationales are derived from each agent's published documentation — see the Sources links under every model below. The weight values themselves are still pre-benchmark; they aren't yet calibrated against measured agent success. The combination is enough to produce meaningfully different rankings and to show how the UX of per-model scoring feels, but it should not be read as a benchmark.

The plan to replace pre-benchmark weights with measured ones is part of the v1.0.0 production cut on the roadmap (tasks/1.0.0/03-benchmark-harness.md). Until then, treat the numbers as a directional signal, not a verdict.

Score formula

per-model score = Σ(signal.pass × model.weight[signal]) / Σ(model.weight) × 100
overall         = mean(per-model scores)
improvement     = (1 - signal.pass) × model.weight[signal] / Σ(model.weight) × 100   (points unlocked by closing a gap)

signal.pass is a float in [0, 1] — partial credit is allowed (e.g. a thin README gets 0.3, a long one gets 1.0).
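
As a rough illustration, the three formulas above can be sketched in a few lines of Python. The function names, the `model_a`/`model_b` profiles, and every number in the toy data are hypothetical, chosen only to make the arithmetic easy to follow; this is not the project's actual implementation.

```python
from statistics import mean


def model_score(passes, weights):
    """Weighted pass rate for one model, on a 0-100 scale."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    earned = sum(passes.get(sig, 0.0) * w for sig, w in weights.items())
    return earned / total * 100


def overall_score(passes, profiles):
    """Unweighted mean of the per-model scores (profiles must be non-empty)."""
    return mean(model_score(passes, w) for w in profiles.values())


def improvement(signal, passes, weights):
    """Points a model's score gains if `signal` goes from its current pass to 1.0."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    gap = 1 - passes.get(signal, 0.0)
    return gap * weights.get(signal, 0.0) / total * 100


# Toy data (hypothetical): three signals, two made-up weight profiles.
passes = {"tests": 1.0, "readme": 0.3, "ci": 0.0}
profiles = {
    "model_a": {"tests": 1.0, "readme": 0.5, "ci": 0.5},
    "model_b": {"tests": 0.5, "readme": 1.0, "ci": 0.5},
}

print(round(model_score(passes, profiles["model_a"]), 2))        # 57.5
print(round(model_score(passes, profiles["model_b"]), 2))        # 40.0
print(round(overall_score(passes, profiles), 2))                 # 48.75
print(round(improvement("ci", passes, profiles["model_a"]), 2))  # 25.0
```

Because each score is normalized by the model's own weight total, a signal weighted 0.00 for a model neither helps nor hurts that model's score.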

Signals (16)

  • AGENTS.md / CLAUDE.md
    agents_md
    Presence of an agent-oriented instructions file, with substantive content.
    Improve: Add an AGENTS.md covering project goals, layout, setup commands, and conventions. Aim for 800+ chars of real guidance (not boilerplate).
  • Cursor rules (.cursor/rules)
    cursor_rules
    Cursor's canonical instruction surface — `.cursor/rules/*.mdc` (modern) or `.cursorrules` (legacy).
    Improve: Add `.cursor/rules/*.mdc` files describing how Cursor should work in this repo (architecture, conventions, naming). The legacy `.cursorrules` file is still read but is deprecated.
  • GEMINI.md
    gemini_md
    Gemini CLI's canonical hierarchical instructions file — read at every prompt.
    Improve: Add a GEMINI.md at the repo root covering project goals, layout, setup commands, and conventions. Aim for 800+ chars of real guidance (not boilerplate).
  • .openhands/setup.sh
    openhands_setup
    OpenHands runs `.openhands/setup.sh` at session start to bootstrap the repo's dev environment.
    Improve: Add a `.openhands/setup.sh` that installs dependencies and prepares the project so OpenHands can run tests and lints out of the box.
  • .aider.conf.yml
    aider_conf
    Aider reads `.aider.conf.yml` (or `.yaml`) for repo-level config — model, lint command, test command.
    Improve: Add a `.aider.conf.yml` at the repo root pinning Aider's `test-cmd` and `lint-cmd`. Linting runs after edits by default; also set `auto-test: true` so the test command runs after edits.
  • README
    readme
    Non-trivial README so the agent can learn the project quickly.
    Improve: Expand your README to cover what the project does, how to install, the common commands, and the high-level layout.
  • Test suite
    tests
    Detectable tests — agents rely on feedback loops.
    Improve: Add a tests/ (or test/, __tests__/, spec/) directory with runnable tests. Document how to run them in AGENTS.md.
  • CI configuration
    ci
    Defined pipeline the agent can reason about / emulate locally.
    Improve: Add a CI workflow (e.g. .github/workflows/ci.yml or .gitlab-ci.yml) that runs tests + linter on every PR.
  • Linter / formatter config
    linter
    Agents get immediate feedback on style rather than ambiguous drift.
    Improve: Configure a linter/formatter (ESLint+Prettier, Biome, Ruff, rustfmt+clippy, golangci-lint) and commit the config.
  • Dependency manifest
    deps_manifest
    Machine-readable dependency list so the agent can reproduce the env.
    Improve: Commit a proper manifest (package.json, pyproject.toml, Cargo.toml, go.mod, etc.) plus a lockfile.
  • Reproducible dev env
    dev_env
    One-command setup the agent can run (Makefile / devcontainer / Nix / Docker).
    Improve: Add a Makefile or devcontainer or Dockerfile so the agent can set up the project in one command.
  • Type configuration
    type_config
    Static types help agents reason about call sites without running code.
    Improve: Add a type config (tsconfig.json for JS/TS, mypy.ini or pyrightconfig.json for Python). Rust/Go/JVM/Scala/Swift/C#/OCaml/Haskell/Zig are typed by default.
  • License file
    license
    Clarity on what an agent is allowed to do with the code.
    Improve: Add a LICENSE (or COPYING) file — MIT, Apache-2.0, BSD, GPL, etc. — at the repo root.
  • CONTRIBUTING guide
    contributing
    Explicit contribution workflow an agent can follow.
    Improve: Add CONTRIBUTING.md describing branch naming, commit style, test commands, and the PR process.
  • Pre-commit / git hooks
    pre_commit
    Catches problems locally before the agent wastes a CI cycle.
    Improve: Set up pre-commit (.pre-commit-config.yaml), husky, or lefthook to run format+lint on every commit.
  • Manageable size
    size
    Very large repos strain an agent's context window.
    Improve: If possible, split into smaller modules or carve out a focused entry path. Document where to start in AGENTS.md.

Models & weight profiles (8)

  • Claude Code
    Loads CLAUDE.md at the start of every conversation per Anthropic's memory docs, so AGENTS.md / CLAUDE.md and a fast test loop carry the most weight.
    Weights
    ci               0.50
    size             0.50
    tests            1.00
    readme           0.70
    linter           0.60
    dev_env          0.90
    license          0.30
    gemini_md        0.00
    aider_conf       0.00
    agents_md        1.00
    cursor_rules     0.00
    pre_commit       0.40
    type_config      0.60
    contributing     0.40
    deps_manifest    0.70
    openhands_setup  0.00
  • Cursor
    Per Cursor's Rules docs, reads `.cursor/rules/*.mdc` and AGENTS.md as the canonical repo-side input. Type config and a clean README still aid the codebase index but aren't the docs-cited signal.
    Weights
    ci               0.40
    size             0.40
    tests            0.70
    linter           0.80
    readme           1.00
    dev_env          0.50
    gemini_md        0.00
    license          0.30
    aider_conf       0.00
    agents_md        0.80
    pre_commit       0.30
    type_config      1.00
    contributing     0.30
    cursor_rules     1.00
    deps_manifest    0.80
    openhands_setup  0.00
  • Devin
    Operates from a sandboxed Ubuntu VM and runs an 8-step machine setup (deps, secrets, language versions, lint/test commands) per Cognition's repo-setup docs. CI config files alone aren't what the docs ask for — a runnable dev environment is.
    Weights
    ci               0.70
    size             0.60
    tests            0.90
    linter           0.50
    readme           0.70
    dev_env          1.00
    license          0.30
    gemini_md        0.00
    aider_conf       0.00
    agents_md        0.60
    cursor_rules     0.00
    pre_commit       0.50
    type_config      0.50
    contributing     0.50
    deps_manifest    0.90
    openhands_setup  0.00
  • GPT-5 Codex
    Reads AGENTS.md before doing any work per OpenAI's Codex docs — the strictest AGENTS.md adherent of any agent here. Hierarchical (per-directory) AGENTS.md and AGENTS.override.md are first-class.
    Weights
    ci               0.70
    size             0.50
    tests            0.80
    linter           0.60
    readme           0.80
    dev_env          0.70
    license          0.30
    gemini_md        0.00
    aider_conf       0.00
    agents_md        0.90
    cursor_rules     0.00
    pre_commit       0.40
    type_config      0.70
    contributing     0.40
    deps_manifest    0.70
    openhands_setup  0.00
  • Gemini CLI
    Reads hierarchical `GEMINI.md` (global → workspace → component-level) at every prompt per Gemini CLI's docs. The long-context advantage favors repos that split context per directory over repos that are merely docs-heavy.
    Weights
    ci               0.60
    size             0.50
    tests            0.90
    linter           0.70
    readme           0.90
    dev_env          0.70
    license          0.30
    aider_conf       0.00
    agents_md        0.70
    gemini_md        1.00
    cursor_rules     0.00
    pre_commit       0.40
    type_config      0.90
    contributing     0.40
    deps_manifest    0.80
    openhands_setup  0.00
  • Aider
    Auto-lints on every edit by default; runs the configured test command after edits when `--test-cmd` is set and `--auto-test` is enabled (per Aider's lint/test docs). A green linter and a declared test command translate directly into successful commits.
    Weights
    ci               0.30
    size             0.40
    tests            1.00
    linter           1.00
    readme           0.60
    dev_env          0.50
    license          0.20
    gemini_md        0.00
    agents_md        0.80
    aider_conf       0.80
    cursor_rules     0.00
    pre_commit       0.30
    type_config      0.50
    contributing     0.30
    deps_manifest    0.70
    openhands_setup  0.00
  • OpenHands
    Runs in a sandboxed container and executes `.openhands/setup.sh` at session start per OpenHands' repo-customization docs. Root AGENTS.md is now the preferred always-on instruction surface (microagents are deprecated in favor of it).
    Weights
    ci               1.00
    size             0.70
    tests            0.90
    linter           0.60
    readme           0.70
    dev_env          1.00
    license          0.40
    gemini_md        0.00
    aider_conf       0.00
    agents_md        0.50
    cursor_rules     0.00
    pre_commit       0.60
    type_config      0.50
    contributing     0.70
    deps_manifest    1.00
    openhands_setup  1.00
  • Pi
    Minimal terminal coding harness. Loads `AGENTS.md` (or `CLAUDE.md`) at startup — global, parent dirs, then cwd — per the Pi coding-agent README. Sandboxing is deferred to user-installed extensions.
    Weights
    ci               0.40
    size             0.50
    tests            0.90
    linter           0.80
    readme           0.70
    dev_env          0.60
    license          0.20
    gemini_md        0.00
    aider_conf       0.00
    agents_md        1.00
    cursor_rules     0.00
    pre_commit       0.40
    type_config      0.60
    contributing     0.30
    deps_manifest    0.70
    openhands_setup  0.00
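
The same gap is worth a different number of points under each profile, because every profile is normalized by its own weight total. The sketch below copies the Claude Code and OpenHands weights from the lists above and compares what a completely missing `ci` signal costs under each. The function name `points_unlocked` is hypothetical.

```python
# Weights copied from the Claude Code and OpenHands profiles above.
CLAUDE_CODE = {
    "ci": 0.50, "size": 0.50, "tests": 1.00, "readme": 0.70,
    "linter": 0.60, "dev_env": 0.90, "license": 0.30, "gemini_md": 0.00,
    "aider_conf": 0.00, "agents_md": 1.00, "cursor_rules": 0.00,
    "pre_commit": 0.40, "type_config": 0.60, "contributing": 0.40,
    "deps_manifest": 0.70, "openhands_setup": 0.00,
}
OPENHANDS = {
    "ci": 1.00, "size": 0.70, "tests": 0.90, "linter": 0.60,
    "readme": 0.70, "dev_env": 1.00, "license": 0.40, "gemini_md": 0.00,
    "aider_conf": 0.00, "agents_md": 0.50, "cursor_rules": 0.00,
    "pre_commit": 0.60, "type_config": 0.50, "contributing": 0.70,
    "deps_manifest": 1.00, "openhands_setup": 1.00,
}


def points_unlocked(signal, current_pass, weights):
    """Score gain for one model if `signal` rises from `current_pass` to 1.0."""
    total = sum(weights.values())
    return (1 - current_pass) * weights[signal] / total * 100


# Weight totals: 7.6 for Claude Code, 9.6 for OpenHands.
print(round(sum(CLAUDE_CODE.values()), 2))                  # 7.6
print(round(sum(OPENHANDS.values()), 2))                    # 9.6

# A repo with no CI config at all (pass = 0.0):
print(round(points_unlocked("ci", 0.0, CLAUDE_CODE), 1))    # 6.6
print(round(points_unlocked("ci", 0.0, OPENHANDS), 1))      # 10.4
```

Under these pre-benchmark weights, adding CI moves the OpenHands score noticeably more than the Claude Code score, which is the kind of per-model difference the profiles are meant to surface.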

What isn't measured yet

  • Whether tests actually pass (we only detect their presence).
  • Whether the linter actually runs cleanly.
  • Whether the dev-env artifact (Makefile, Dockerfile) works end-to-end.
  • Commit-history signals (churn, commit frequency, contributor count). We clone with `--depth 1 --single-branch`, which fetches the whole working tree at HEAD of the default branch but no history. History signals describe repo health more than agent behavior, so they sit outside the score for now.
  • How agents actually perform on the repo — that's the v1.0.0 benchmark harness.
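
For reference, the shallow clone mentioned in the commit-history item could be driven from Python roughly as follows. The text above only names the two flags; the `git clone` invocation around them, the function names, and the argument layout are assumptions.

```python
import subprocess


def shallow_clone_argv(url, dest):
    """Build the argv for a history-free clone of the default branch."""
    return ["git", "clone", "--depth", "1", "--single-branch", url, dest]


def shallow_clone(url, dest):
    """Run the clone; raises CalledProcessError if git exits non-zero."""
    subprocess.run(shallow_clone_argv(url, dest), check=True)
```

In a clone made this way, `git rev-list --count HEAD` reports a single commit, which is why churn and contributor counts cannot be derived from it.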