Methodology
How scores are computed today, what's being measured, and where the current approach stops short.
Status: documented rationales, pre-benchmark weights
Today every score is derived from static signals — file existence and content-length checks on the cloned tree. No agent is actually run. Per-model rationales are derived from each agent's published documentation — see the Sources links under every model below. The weight values themselves are still pre-benchmark; they aren't yet calibrated against measured agent success. The combination is enough to produce meaningfully different rankings and to show how the UX of per-model scoring feels, but it should not be read as a benchmark.
The plan to replace pre-benchmark weights with measured ones is part of the v1.0.0 production cut on the roadmap (tasks/1.0.0/03-benchmark-harness.md). Until then, treat the numbers as a directional signal, not a verdict.
Score formula
per-model score = Σ(signal.pass × model.weight[signal]) / Σ(model.weight) × 100
overall = mean(per-model scores)
improvement: closing a gap unlocks (1 - signal.pass) × model.weight[signal] / Σ(model.weight) × 100 points
signal.pass is a float in [0, 1] — partial credit is allowed (e.g. a thin README gets 0.3, a long one gets 1.0).
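The formula can be sketched in a few lines of Python. This is a minimal illustration rather than the project's actual code: the function names (`model_score`, `overall_score`, `improvement`) and the toy two-signal profiles are assumptions made for the example, and signals missing from `passes` are assumed to count as 0.

```python
from statistics import mean


def model_score(passes: dict, weights: dict) -> float:
    """Weighted per-model score on a 0-100 scale."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    earned = sum(passes.get(sig, 0.0) * w for sig, w in weights.items())
    return earned / total * 100


def overall_score(passes: dict, profiles: dict) -> float:
    """Unweighted mean of the per-model scores."""
    return mean(model_score(passes, w) for w in profiles.values())


def improvement(passes: dict, weights: dict, signal: str) -> float:
    """Points unlocked for one model by raising `signal` to a full pass."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    gap = 1 - passes.get(signal, 0.0)
    return gap * weights.get(signal, 0.0) / total * 100


# Toy data (hypothetical): a thin README (0.3) and a detected test suite (1.0).
passes = {"readme": 0.3, "tests": 1.0}
profiles = {
    "model_a": {"readme": 1.0, "tests": 1.0},
    "model_b": {"readme": 0.0, "tests": 2.0},
}

print(round(model_score(passes, profiles["model_a"]), 1))            # 65.0
print(round(overall_score(passes, profiles), 1))                     # 82.5
print(round(improvement(passes, profiles["model_a"], "readme"), 1))  # 35.0
```

Because each model divides by its own weight total, a signal weighted 0.00 for a model neither helps nor hurts that model's score.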
Signals (16)
- AGENTS.md / CLAUDE.md (`agents_md`): Presence of an agent-oriented instructions file, with substantive content.
  Improve: Add an AGENTS.md covering project goals, layout, setup commands, and conventions. Aim for 800+ chars of real guidance (not boilerplate).
- Cursor rules (.cursor/rules) (`cursor_rules`): Cursor's canonical instruction surface: `.cursor/rules/*.mdc` (modern) or `.cursorrules` (legacy).
  Improve: Add `.cursor/rules/*.mdc` files describing how Cursor should work in this repo (architecture, conventions, naming). The legacy `.cursorrules` file is still read but is deprecated.
- GEMINI.md (`gemini_md`): Gemini CLI's canonical hierarchical instructions file, read at every prompt.
  Improve: Add a GEMINI.md at the repo root covering project goals, layout, setup commands, and conventions. Aim for 800+ chars of real guidance (not boilerplate).
- .openhands/setup.sh (`openhands_setup`): OpenHands runs `.openhands/setup.sh` at session start to bootstrap the repo's dev environment.
  Improve: Add a `.openhands/setup.sh` that installs dependencies and prepares the project so OpenHands can run tests and lints out of the box.
- .aider.conf.yml (`aider_conf`): Aider reads `.aider.conf.yml` (or `.yaml`) for repo-level config: model, lint command, test command.
  Improve: Add a `.aider.conf.yml` at the repo root pinning Aider's `test-cmd` and `lint-cmd` so it auto-runs them after edits.
- README (`readme`): Non-trivial README so the agent can learn the project quickly.
  Improve: Expand your README to cover what the project does, how to install, the common commands, and the high-level layout.
- Test suite (`tests`): Detectable tests; agents rely on feedback loops.
  Improve: Add a tests/ (or test/, __tests__/, spec/) directory with runnable tests. Document how to run them in AGENTS.md.
- CI configuration (`ci`): Defined pipeline the agent can reason about or emulate locally.
  Improve: Add a CI workflow (e.g. .github/workflows/ci.yml or .gitlab-ci.yml) that runs tests + linter on every PR.
- Linter / formatter config (`linter`): Agents get immediate feedback on style rather than ambiguous drift.
  Improve: Configure a linter/formatter (ESLint+Prettier, Biome, Ruff, rustfmt+clippy, golangci-lint) and commit the config.
- Dependency manifest (`deps_manifest`): Machine-readable dependency list so the agent can reproduce the env.
  Improve: Commit a proper manifest (package.json, pyproject.toml, Cargo.toml, go.mod, etc.) plus a lockfile.
- Reproducible dev env (`dev_env`): One-command setup the agent can run (Makefile / devcontainer / Nix / Docker).
  Improve: Add a Makefile or devcontainer or Dockerfile so the agent can set up the project in one command.
- Type configuration (`type_config`): Static types help agents reason about call sites without running code.
  Improve: Add a type config (tsconfig.json for JS/TS, mypy.ini or pyrightconfig.json for Python). Rust/Go/JVM/Scala/Swift/C#/OCaml/Haskell/Zig are typed by default.
- License file (`license`): Clarity on what an agent is allowed to do with the code.
  Improve: Add a LICENSE (or COPYING) file at the repo root, e.g. MIT, Apache-2.0, BSD, or GPL.
- CONTRIBUTING guide (`contributing`): Explicit contribution workflow an agent can follow.
  Improve: Add CONTRIBUTING.md describing branch naming, commit style, test commands, and the PR process.
- Pre-commit / git hooks (`pre_commit`): Catches problems locally before the agent wastes a CI cycle.
  Improve: Set up pre-commit (.pre-commit-config.yaml), husky, or lefthook to run format+lint on every commit.
- Manageable size (`size`): Very large repos strain an agent's context window.
  Improve: If possible, split into smaller modules or carve out a focused entry path. Document where to start in AGENTS.md.
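As a rough illustration of what a static "existence plus content length" check could look like for the `agents_md` signal, here is a small sketch. The helper names and the exact cut-offs are assumptions: the 800-character threshold and the 0.3 partial credit are borrowed from the guidance on this page, not taken from the scorer's real thresholds.

```python
from pathlib import Path


def length_credit(text: str, full_at: int = 800, thin_credit: float = 0.3) -> float:
    """Map content length to a pass value in [0, 1] (illustrative cut-offs)."""
    n = len(text.strip())
    if n == 0:
        return 0.0
    if n >= full_at:
        return 1.0
    return thin_credit


def agents_md_pass(repo_root: Path) -> float:
    """Best credit across AGENTS.md and CLAUDE.md at the repo root."""
    credits = [0.0]  # no file at all means no credit
    for name in ("AGENTS.md", "CLAUDE.md"):
        path = repo_root / name
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace")
            credits.append(length_credit(text))
    return max(credits)
```

Nothing here executes repo code or runs an agent; the check only reads files from the cloned tree, which is why it cannot tell real guidance from long boilerplate.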
Models & weight profiles (8)
- Claude Code: Loads CLAUDE.md at the start of every conversation per Anthropic's memory docs, so AGENTS.md / CLAUDE.md and a fast test loop carry the most weight.
  Sources: code.claude.com/memory
  Weights: ci 0.50, size 0.50, tests 1.00, readme 0.70, linter 0.60, dev_env 0.90, license 0.30, gemini_md 0.00, aider_conf 0.00, agents_md 1.00, cursor_rules 0.00, pre_commit 0.40, type_config 0.60, contributing 0.40, deps_manifest 0.70, openhands_setup 0.00
- Cursor: Per Cursor's Rules docs, reads `.cursor/rules/*.mdc` and AGENTS.md as the canonical repo-side input. Type config and a clean README still aid the codebase index but aren't the docs-cited signal.
  Sources: cursor.com/rules
  Weights: ci 0.40, size 0.40, tests 0.70, linter 0.80, readme 1.00, dev_env 0.50, gemini_md 0.00, license 0.30, aider_conf 0.00, agents_md 0.80, pre_commit 0.30, type_config 1.00, contributing 0.30, cursor_rules 1.00, deps_manifest 0.80, openhands_setup 0.00
- Devin: Operates from a sandboxed Ubuntu VM and runs an 8-step machine setup (deps, secrets, language versions, lint/test commands) per Cognition's repo-setup docs. CI config files alone aren't what the docs ask for; a runnable dev environment is.
  Sources: docs.devin.ai/repo-setup
  Weights: ci 0.70, size 0.60, tests 0.90, linter 0.50, readme 0.70, dev_env 1.00, license 0.30, gemini_md 0.00, aider_conf 0.00, agents_md 0.60, cursor_rules 0.00, pre_commit 0.50, type_config 0.50, contributing 0.50, deps_manifest 0.90, openhands_setup 0.00
- GPT-5 Codex: Reads AGENTS.md before doing any work per OpenAI's Codex docs; it is the strictest AGENTS.md adherent of any agent here. Hierarchical (per-directory) AGENTS.md and AGENTS.override.md are first-class.
  Sources: developers.openai.com/agents-md
  Weights: ci 0.70, size 0.50, tests 0.80, linter 0.60, readme 0.80, dev_env 0.70, license 0.30, gemini_md 0.00, aider_conf 0.00, agents_md 0.90, cursor_rules 0.00, pre_commit 0.40, type_config 0.70, contributing 0.40, deps_manifest 0.70, openhands_setup 0.00
- Gemini CLI: Reads hierarchical `GEMINI.md` (global → workspace → component-level) at every prompt per Gemini CLI's docs. The long-context advantage favors repos that split context per directory rather than repos that are simply docs-heavy.
  Sources: geminicli.com/gemini-md
  Weights: ci 0.60, size 0.50, tests 0.90, linter 0.70, readme 0.90, dev_env 0.70, license 0.30, aider_conf 0.00, agents_md 0.70, gemini_md 1.00, cursor_rules 0.00, pre_commit 0.40, type_config 0.90, contributing 0.40, deps_manifest 0.80, openhands_setup 0.00
- Aider: Auto-lints on every edit by default; runs the configured test command after edits when `--test-cmd` is set (per Aider's lint/test docs). A green linter and a declared test command translate directly into successful commits.
  Sources: aider.chat/lint-test.html
  Weights: ci 0.30, size 0.40, tests 1.00, linter 1.00, readme 0.60, dev_env 0.50, license 0.20, gemini_md 0.00, agents_md 0.80, aider_conf 0.80, cursor_rules 0.00, pre_commit 0.30, type_config 0.50, contributing 0.30, deps_manifest 0.70, openhands_setup 0.00
- OpenHands: Runs in a sandboxed container and executes `.openhands/setup.sh` at session start per OpenHands' repo-customization docs. Root AGENTS.md is now the preferred always-on instruction surface (microagents are deprecated in favor of it).
  Weights: ci 1.00, size 0.70, tests 0.90, linter 0.60, readme 0.70, dev_env 1.00, license 0.40, gemini_md 0.00, aider_conf 0.00, agents_md 0.50, cursor_rules 0.00, pre_commit 0.60, type_config 0.50, contributing 0.70, deps_manifest 1.00, openhands_setup 1.00
- Pi: Minimal terminal coding harness. Loads `AGENTS.md` (or `CLAUDE.md`) at startup (global, parent dirs, then cwd) per the Pi coding-agent README. Sandboxing is deferred to user-installed extensions.
  Sources: github.com/README.md
  Weights: ci 0.40, size 0.50, tests 0.90, linter 0.80, readme 0.70, dev_env 0.60, license 0.20, gemini_md 0.00, aider_conf 0.00, agents_md 1.00, cursor_rules 0.00, pre_commit 0.40, type_config 0.60, contributing 0.30, deps_manifest 0.70, openhands_setup 0.00
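To see how different profiles change what a repo should fix first, the sketch below applies the improvement formula to the Claude Code and Aider weights listed above. The repo is hypothetical (every signal passes except `agents_md` and `linter`), and `top_gaps` is an illustrative helper name, not part of the tool.

```python
# Weights copied from the Claude Code and Aider listings above.
# Zero-weight signals are omitted; they add nothing to either sum.
CLAUDE_CODE = {
    "ci": 0.50, "size": 0.50, "tests": 1.00, "readme": 0.70, "linter": 0.60,
    "dev_env": 0.90, "license": 0.30, "agents_md": 1.00, "pre_commit": 0.40,
    "type_config": 0.60, "contributing": 0.40, "deps_manifest": 0.70,
}
AIDER = {
    "ci": 0.30, "size": 0.40, "tests": 1.00, "linter": 1.00, "readme": 0.60,
    "dev_env": 0.50, "license": 0.20, "agents_md": 0.80, "aider_conf": 0.80,
    "pre_commit": 0.30, "type_config": 0.50, "contributing": 0.30,
    "deps_manifest": 0.70,
}


def top_gaps(passes: dict, weights: dict, n: int = 3) -> list:
    """Rank signals by the points a full pass would unlock for one model."""
    total = sum(weights.values())
    gains = {
        sig: (1 - passes.get(sig, 0.0)) * w / total * 100
        for sig, w in weights.items()
    }
    ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    return [(sig, round(gain, 1)) for sig, gain in ranked[:n] if gain > 0]


# Hypothetical repo: all signals pass except AGENTS.md and the linter config.
passes = dict.fromkeys(set(CLAUDE_CODE) | set(AIDER), 1.0)
passes.update(agents_md=0.0, linter=0.0)

print(top_gaps(passes, CLAUDE_CODE))  # [('agents_md', 13.2), ('linter', 7.9)]
print(top_gaps(passes, AIDER))        # [('linter', 13.5), ('agents_md', 10.8)]
```

With these pre-benchmark weights, the same two gaps come out in opposite order for the two agents, which is the kind of directional difference the per-model scores are meant to surface.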
What isn't measured yet
- Whether tests actually pass (we only detect their presence).
- Whether the linter actually runs cleanly.
- Whether the dev-env artifact (Makefile, Dockerfile) works end-to-end.
- Commit-history signals: churn, commit frequency, contributor count. We clone with `--depth 1 --single-branch`, which fetches the whole working tree at HEAD of the default branch, but no history. These describe repo health more than agent behavior, so they sit outside the score for now.
- How agents actually perform on the repo. That's the v1.0.0 benchmark harness.
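The shallow clone described above could be driven from Python roughly as follows. This is a sketch under assumptions: the function names and the idea of shelling out through `subprocess` are illustrative, and only the two `git clone` flags come from this page.

```python
import subprocess
from pathlib import Path


def shallow_clone_cmd(repo_url: str, dest: Path) -> list[str]:
    """Build a git command that fetches only the tip of the default branch."""
    return ["git", "clone", "--depth", "1", "--single-branch", repo_url, str(dest)]


def shallow_clone(repo_url: str, dest: Path) -> None:
    """Run the clone; raises CalledProcessError if git exits non-zero."""
    subprocess.run(shallow_clone_cmd(repo_url, dest), check=True)
```

A clone made this way contains a single commit, so churn, commit frequency, and contributor counts cannot be derived from it without fetching more history.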