WP-Bench: The WordPress-Specific AI Benchmark Developers Have Been Missing
AI coding assistants are getting decent at generic tasks—write a function, refactor a module, explain an algorithm. But WordPress development has its own texture: hooks, capability checks, escaping rules, REST endpoints, wp-env, WP-CLI, and a long tail of APIs that don’t look like typical framework code.
That gap is what WP-Bench targets. It’s an official WordPress AI benchmark designed to evaluate how well language models actually understand WordPress development—not just PHP syntax, but the patterns and constraints that matter in real plugins and themes.
The project lives on GitHub at WordPress/wp-bench, and it’s positioned as both a practical tool for teams choosing models and a nudge to AI providers to treat WordPress as a first-class ecosystem during evaluation.
What WP-Bench is trying to measure (and why it's different)
Most popular benchmarks grade models on broad programming competence. That’s useful, but it doesn’t answer WordPress-flavored questions like: Do they consistently escape output? Do they understand how WP_Query interacts with the main loop? Do they reach for the right hook? Do they avoid security foot-guns around admin-ajax.php and capability checks?
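To make that texture concrete, here is a small, illustrative PHP sketch (not a benchmark task; the function, action, and option names are made up) showing the conventions in play for a logged-in admin-ajax handler: the right hook, a nonce check, a capability check, sanitization on input, and escaping on output. A generic coding benchmark rarely penalizes a model for skipping any of these steps; a WordPress-specific one can.
<?php
// Illustrative only: a hypothetical admin-ajax handler showing the conventions
// a WordPress-aware benchmark can check. Function, action, and option names
// are made up for this example.

add_action( 'wp_ajax_wpb_save_note', 'wpb_save_note' ); // logged-in users only

function wpb_save_note() {
    // CSRF protection: verify the nonce sent with the request.
    check_ajax_referer( 'wpb_save_note', 'nonce' );

    // Capability check: being logged in is not the same as being allowed.
    if ( ! current_user_can( 'edit_posts' ) ) {
        wp_send_json_error( 'Insufficient permissions.', 403 );
    }

    // Sanitize on the way in...
    $note = sanitize_text_field( wp_unslash( $_POST['note'] ?? '' ) );
    update_option( 'wpb_note', $note );

    // ...and escape on the way out.
    wp_send_json_success( esc_html( $note ) );
}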
WP-Bench frames WordPress competence as two complementary dimensions:
- Knowledge: multiple-choice questions on WordPress concepts, core APIs, hooks, security patterns, and coding standards—explicitly including modern additions like the Abilities API and Interactivity API (introduced to cover areas models often struggle with).
- Execution: code-generation tasks that get graded by a real WordPress runtime. The output isn’t judged by a human rubric alone—WordPress runs it, checks it, and reports back.
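For a feel of the second dimension, here is a hypothetical example (not drawn from the actual suite) of the kind of code an execution task might ask for: a REST route with a proper permission callback, something a runtime grader can exercise end to end rather than eyeball.
<?php
// Hypothetical execution-style task output (not from the actual suite):
// register a REST route whose behavior and permission callback a grader can test.

add_action( 'rest_api_init', function () {
    register_rest_route(
        'wpb/v1',
        '/notes/(?P<id>\d+)',
        array(
            'methods'             => 'GET',
            'callback'            => function ( WP_REST_Request $request ) {
                return rest_ensure_response( array( 'id' => (int) $request['id'] ) );
            },
            // A runtime grader can hit the endpoint and assert on this behavior.
            'permission_callback' => function () {
                return current_user_can( 'read' );
            },
        )
    );
} );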
How the grading pipeline works
The key design choice in WP-Bench is that it uses WordPress as the evaluator. Instead of scoring generated code with heuristics only, the benchmark runs model output through a WordPress runtime and combines static checks with runtime assertions.
In practice, the flow looks like this:
- The harness sends a prompt to a model asking for WordPress code.
- The generated code is passed into a WordPress runtime via WP-CLI (the WordPress command-line interface used for administration and automation).
- The runtime performs static analysis (syntax, coding standards, and security-related checks).
- The code executes inside a sandboxed environment where test assertions verify behavior.
- The harness collects results and writes them out as JSON, including scores and feedback.
This matters because WordPress work is often less about producing “some code” and more about producing code that behaves correctly under WordPress conventions: hooks fire at the right time, data flows through the right APIs, and outputs follow escaping and capability patterns.
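WP-Bench ships its own grader plugin, so the exact assertion format may differ, but the flavor is familiar from WordPress's PHPUnit-based test tooling. Here is a rough sketch of the kind of checks a runtime can make and static analysis cannot; all names below are hypothetical.
<?php
// A sketch of runtime-style grading assertions, in the flavor of WordPress's
// PHPUnit-based test suite (WP_UnitTestCase). WP-Bench's actual grader plugin
// and assertion format may differ; names below are hypothetical.

class WPB_Example_Grading_Test extends WP_UnitTestCase {

    public function test_generated_code_behaves_like_wordpress_code() {
        // Load the model-generated snippet into the running WordPress.
        require __DIR__ . '/generated-snippet.php'; // hypothetical path

        // Did it attach to the hook the task asked for?
        $this->assertNotFalse( has_action( 'init', 'wpb_register_book_type' ) );

        // Does the behavior hold once WordPress actually fires the hook?
        do_action( 'init' );
        $this->assertTrue( post_type_exists( 'wpb_book' ) );

        // Is output escaped? Store hostile input, render, inspect the markup.
        update_option( 'wpb_note', '<script>alert(1)</script>' );
        ob_start();
        wpb_render_note(); // hypothetical render function from the task
        $this->assertStringNotContainsString( '<script>', ob_get_clean() );
    }
}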
Quick start: running WP-Bench locally
WP-Bench is split into a Python-based harness and a WordPress runtime used for grading. The project README walks through a minimal setup that looks like this.
1) Install the harness
python3 -m venv .venv && source .venv/bin/activate
pip install -e ./python
2) Add your model provider keys
Create a .env file and add API keys for the providers you want to test. (The benchmark supports multiple providers; the exact set depends on your setup.)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
3) Start the WordPress grading runtime
cd runtime
npm install
npm start
4) Run a benchmark suite
cd ..
wp-bench run --config wp-bench.example.yaml
By default, WP-Bench writes summary results to output/results.json, and per-test logs to output/results.jsonl.
Multi-model runs (model comparisons in one pass)
A common workflow is to benchmark the models you’re already using (or considering) against the same WordPress suite. WP-Bench supports this by letting you list multiple models in a single config.
models:
- name: gpt-4o
- name: gpt-4o-mini
- name: claude-sonnet-4-20250514
- name: claude-opus-4-5-20251101
- name: gemini/gemini-2.5-pro
- name: gemini/gemini-2.5-flash
Model naming follows LiteLLM conventions, which is handy if you already route requests through LiteLLM or want consistent identifiers across providers.
Configuration basics (what you’ll actually tweak)
WP-Bench ships with an example YAML config you can copy and customize. The key knobs are dataset source, suite selection, the grader backend, and run controls like limits and concurrency.
dataset:
  source: local            # 'local' or 'huggingface'
  name: wp-core-v1         # suite name
models:
- name: gpt-4o
grader:
  kind: docker
  wp_env_dir: ./runtime    # path to wp-env project
run:
  suite: wp-core-v1
  limit: 10                # limit tests (null = all)
  concurrency: 4
output:
  path: output/results.json
  jsonl_path: output/results.jsonl
Handy CLI options
wp-bench run --config wp-bench.yaml # run with config file
wp-bench run --model-name gpt-4o --limit 5 # quick single-model test
wp-bench dry-run --config wp-bench.yaml # validate config without calling models
Inside the repo: what's where?
WP-Bench is structured as a small system rather than a single script. That’s a good sign if you care about reproducibility and more rigorous evaluation.
.
├── python/ # Benchmark harness (pip installable)
├── runtime/ # WordPress grader plugin + wp-env config
├── datasets/ # Test suites (local JSON + Hugging Face builder)
├── notebooks/ # Results visualization and reporting
└── output/ # Benchmark results (gitignored)
Test suites: knowledge vs execution
Suites live under datasets/suites/<suite-name>/ and are split into two directories:
- execution/: code-generation tasks with assertions (stored as JSON; typically one file per category).
- knowledge/: multiple-choice questions about WordPress concepts (also JSON; typically one file per category).
The default suite is wp-core-v1, covering WordPress core APIs, hooks, database operations, and security patterns.
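As a reference point for what "database operations and security patterns" means in WordPress terms, the canonical pattern is a parameterized query via $wpdb->prepare() plus escaping on output. This is a general illustration of the kind of pattern those categories exercise, not a task pulled from the suite.
<?php
// General illustration of the database + security patterns those categories
// exercise; not an actual wp-core-v1 task.

global $wpdb;

// Placeholders (%s, %d) keep untrusted input out of the SQL string.
$status = sanitize_key( $_GET['status'] ?? 'publish' );
$rows   = $wpdb->get_results(
    $wpdb->prepare(
        "SELECT ID, post_title FROM {$wpdb->posts} WHERE post_status = %s LIMIT %d",
        $status,
        10
    )
);

foreach ( $rows as $row ) {
    echo esc_html( $row->post_title ) . "\n"; // escape on output, always
}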
Using a dataset from Hugging Face
If you prefer pulling datasets from Hugging Face rather than using local suite JSON files, the config supports that as well.
dataset:
  source: huggingface
  name: WordPress/wp-bench-v1
Current limitations (and why they're important to understand)
WP-Bench is explicitly described as an early release. That’s not a warning sign—it’s just reality for benchmarks that aim to be both WordPress-specific and hard to game.
- Dataset size is still small. A benchmark is only as good as its test coverage, and WordPress has a huge API surface. More cases are needed across core, plugin architecture, and real-world edge conditions.
- It skews toward newer WordPress features. The suite leans into areas around WordPress 6.9 (including Abilities API and Interactivity API). This is partly intentional—models often fail hardest on newer APIs—but it can bias results because those APIs may post-date model training cutoffs.
- Some older concepts saturate quickly. Early results showed models scoring very high on older WordPress topics, which makes those questions less useful for distinguishing capability. The challenge is building tasks that are genuinely discriminating, not just obscure.
Why this matters if you build WordPress plugins (or internal tooling)
If you’ve experimented with AI inside WordPress projects, you’ve probably noticed the failure modes are rarely “can’t write PHP.” They’re more like: wrong hook selection, missing nonce checks, sloppy escaping, misunderstanding multisite, or mixing old and new editor paradigms in the same answer.
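The "wrong hook" case is a good example of why these failures slip past generic benchmarks: the code runs, it just isn't WordPress. An illustrative contrast (not from the benchmark):
<?php
// Illustrative contrast for the "wrong hook" failure mode; not a benchmark task.

// Common assistant output: runs, but fires on every request and prints markup
// far too early, ignoring the enqueue system entirely.
add_action( 'init', function () {
    echo '<script src="/wp-content/plugins/wpb-example/app.js"></script>';
} );

// What WordPress expects: register on the dedicated hook and let core print it.
add_action( 'wp_enqueue_scripts', function () {
    wp_enqueue_script(
        'wpb-example',
        plugins_url( 'app.js', __FILE__ ),
        array(),  // dependencies
        '1.0.0',  // version, used for cache busting
        true      // load in the footer
    );
} );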
A WordPress-specific benchmark gives teams a more grounded way to answer practical questions, such as:
- Which model is most reliable for generating WordPress code that passes runtime checks?
- Which models understand WordPress security best practices well enough to be used in codegen workflows?
- Are newer WordPress APIs being handled accurately, or are assistants hallucinating patterns that look plausible but don’t exist?
WP-Bench is also a signal to AI providers
One of the more strategic goals is making WordPress performance something AI labs track deliberately. If a benchmark becomes part of pre-release evaluation, it creates pressure to improve WordPress-specific reasoning and code generation—rather than treating WordPress as “just PHP.”
There’s also ongoing work toward an open source leaderboard that tracks model performance on WordPress tasks. A public leaderboard can make tradeoffs visible (speed vs correctness, knowledge vs execution), and it provides a shared reference point when the community talks to AI providers.
Resources
- WP-Bench repository: https://github.com/WordPress/wp-bench
- AI Building Blocks for WordPress: https://make.wordpress.org/ai/2025/07/17/ai-building-blocks/
- WordPress Slack channel: #core-ai — https://wordpress.slack.com/archives/C08TJ8BPULS