# Benchmarks and evidence

This page ties together the size, runtime, and LLM-generation benchmarks so you can defend AINL with numbers, not vibes.
## Benchmark highlights (March 2026)
Source of truth: repository-root `BENCHMARK.md` (regenerate with `make benchmark`). Numbers below use tiktoken `cl100k_base`, as in the Mode Comparison and legacy-inclusive sections.
Quick size snapshot:
| Lens | What it means | Headline ratios (tk) | Artifact coverage |
|------|---------------|----------------------|-------------------|
| Strict-valid | `canonical_strict_valid` | ~3.2× full_multitarget_core / ~362× full_multitarget / ~0.76× minimal_emit | 19/19 (all viable) |
| Public mixed, viable subset | Representative required-target workloads; excludes curated low-emit / legacy rows | ~1.0× core / ~322× full / ~0.73× minimal | 72/85 viable |
| Compatibility only, viable | Non-strict headline companion profile | ~0.84× core / ~318× full / ~0.71× minimal | 53/66 viable |
| Legacy-inclusive | All paths in the profile (honest aggregate drag from tiny shells) | See the legacy-inclusive block in BENCHMARK.md | 85/85 (public_mixed) |
Note: `full_multitarget_core` (six compiler-backed emitters) is the line comparable to pre-hybrid headline ratios. `full_multitarget` adds the langgraph and temporal wrappers (`tooling/emit_targets.py`); they embed the IR, so that column is much larger. `minimal_emit` stays practical for typical artifacts (hybrid targets require `needs_langgraph` / `needs_temporal` flags in the IR, which default to false).
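The headline ratios above are plain token counts: each source and each emitted bundle is encoded with tiktoken's `cl100k_base` encoding, and the ratio is emitted tokens divided by source tokens. A minimal sketch of that arithmetic (the helper names and the character-count fallback are illustrative, not repo code):

```python
try:
    import tiktoken
    _enc = tiktoken.get_encoding("cl100k_base")

    def count_tokens(text: str) -> int:
        """Token count under the cl100k_base encoding."""
        return len(_enc.encode(text))
except ImportError:  # fallback so the sketch runs without tiktoken installed
    def count_tokens(text: str) -> int:
        return max(1, len(text) // 4)  # rough ~4-chars-per-token heuristic

def token_ratio(source_text: str, emitted_text: str) -> float:
    """Emitted-bundle tokens divided by source tokens (the 'tk' ratios)."""
    return count_tokens(emitted_text) / count_tokens(source_text)

# Identical texts give a ratio of exactly 1.0 under either backend.
print(token_ratio("flow hello", "flow hello"))  # 1.0
```

A ratio below 1.0 (as in the minimal_emit column) means the emitted bundle is smaller than the AINL source it came from.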
Transparency (mirrors BENCHMARK.md blockquotes):

- tiktoken `cl100k_base` — markdown tables foreground tokenizer counts; the JSON still stores the active CLI `--metric` (default `tiktoken`) for thresholds and economics.
- Viable subset — for `public_mixed` / `compatibility_only`, rules live in `tooling/artifact_profiles.json` plus emit heuristics; legacy-inclusive tables are always below the fold.
- `minimal_emit` fallback stub — a tiny python_api async stub (~20–30 tk) when no selected target emits code.
- Emitter compaction (Mar 2026) — `prisma` and `react_ts` benchmark stubs were shortened (~50–70% tk reduction on those lines in examples).
- Hybrid wrapper emitters in `full_multitarget` — `langgraph` / `temporal` sizes come from `scripts/emit_langgraph.py` and `scripts/emit_temporal.py`; not part of `minimal_emit` unless `emit_capabilities` gains matching flags later.
- `--strict-mode` — `scripts/benchmark_size.py` with `--profile-name=canonical_strict_valid` runs the compiler in strict reachability mode; see the strict callout in BENCHMARK.md when enabled.
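The viable-subset gating above amounts to partitioning per-artifact records before aggregating. The sketch below illustrates the idea only; the field names (`viable`, `legacy`) are hypothetical, not the actual `tooling/artifact_profiles.json` schema:

```python
def split_viable(artifacts: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition artifacts into the headline 'viable' subset and the
    legacy-inclusive remainder (illustrative field names)."""
    viable = [a for a in artifacts if a.get("viable") and not a.get("legacy")]
    rest = [a for a in artifacts if a not in viable]
    return viable, rest

profile = [
    {"name": "checkout_flow", "viable": True, "legacy": False},
    {"name": "tiny_shell", "viable": False, "legacy": True},
]
viable, rest = split_viable(profile)
print(len(viable), len(rest))  # 1 1
```

Headline tables aggregate only the first list; the legacy-inclusive tables fold the second back in, which is why those ratios drag.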
## Comparative methodology (LangGraph, Temporal, others)
Head-to-head commands, suggested table rows, and scope boundaries (what we do and do not claim) live in `competitive/VERSUS_LANGGRAPH_TEMPORAL_BENCHMARKS.md`. Use it with the tables in `BENCHMARK.md` so published numbers stay reproducible.
## Competitive Context
These benchmarks power the head-to-head comparison tables:
- Comparison tables (committed data only)
- Versus LangGraph & Temporal: benchmark methodology
- Full competitive overview
## Why these benchmarks matter
AINL is compile-once, run-many: you pay LLM tokens (or human time) to author a program once, then the runtime executes the graph deterministically—no prompt loop on every invocation. The runtime benchmarks measure that second phase: wall-clock latency, RSS deltas, optional execution reliability, and (with tiktoken) source-token economics. The size benchmarks quantify how much emitted surface area you get per AINL artifact (profile- and mode-scoped), including mean compile time over three timed compiles so you can see compiler cost separately from emit size. Together, they show a different cost structure from “LLM re-generates orchestration code every time.”
Human-written baselines (`--compare-baselines`) anchor claims against real Python stacks (pure async vs LangGraph-style) using the same metrics where possible.
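The runtime-phase measurements described above can be pictured with the standard library alone. This is a sketch of the shape of the harness, not the actual `scripts/benchmark_runtime.py` implementation; the workload, run count, and heap-peak stand-in for RSS delta are all illustrative:

```python
import statistics
import time
import tracemalloc

def measure(workload, runs: int = 3) -> dict:
    """Time a workload over several runs and track peak Python heap growth
    (a stand-in for the peak RSS delta the real harness reports)."""
    latencies = []
    tracemalloc.start()
    for _ in range(runs):
        t0 = time.perf_counter()
        workload()
        latencies.append((time.perf_counter() - t0) * 1000.0)  # ms
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "mean_ms": statistics.mean(latencies),
        "stdev_ms": statistics.stdev(latencies) if runs > 1 else 0.0,
        "peak_bytes": peak,
    }

stats = measure(lambda: sum(range(10_000)))
print(sorted(stats))  # ['mean_ms', 'peak_bytes', 'stdev_ms']
```

Because the graph is already compiled, nothing in this loop touches an LLM: the measured cost is pure execution, which is the point of the compile-once, run-many framing.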
## Where to read results
- Ecosystem import examples: trees under `examples/ecosystem/` (Clawflows- and Agency-Agents-style Markdown → AINL) are kept fresh via weekly auto-sync from the upstream Clawflows and Agency-Agents repos — see `.github/workflows/sync-ecosystem.yml` and `docs/ECOSYSTEM_OPENCLAW.md`. OpenClaw and ZeroClaw both consume these paths (CLI, MCP, OpenClaw skill — `docs/OPENCLAW_INTEGRATION.md` — or ZeroClaw skill — `docs/ZEROCLAW_INTEGRATION.md`).
- Human-readable size report (start here): repository-root `BENCHMARK.md` — transparency notes at the top; viable vs legacy-inclusive sections; per-artifact Notes column. On ainativelang.com, the same content is published as the Benchmarks page.
- Machine-readable size JSON: `tooling/benchmark_size.json` (schema 3.5+).
- Runtime JSON: `tooling/benchmark_runtime_results.json` (generated; tracked in git as the CI baseline when committed).
- LLM eval / multi-model bench: `docs/OLLAMA_EVAL.md` — local Ollama plus optional Anthropic Claude via `ainl-ollama-benchmark --cloud-model …` (`pip install -e ".[anthropic]"`, `ANTHROPIC_API_KEY`).
## Key metrics (quick glossary)
| Area | What we measure |
|------|-----------------|
| Size | tiktoken (cl100k_base) on source and emitted bundles; aggregate ratios; optional cost estimates; Compile ms (mean×3) per artifact |
| Runtime | Post-compile execution latency, peak RSS Δ, adapter/trace stats; optional reliability batches; scalability probe on a golden workflow |
| Economics | Estimated USD per run/generation from shared pricing tables in `tooling/bench_metrics.py` (assumption-driven where adapters do not report usage) |
| Reliability | Success rate + timing σ for extra compile or execution repetitions (workloads remain deterministic; reliability catches flakes and env drift) |
| LLM bench | Pass rate, viability gate, errors, retries, wall time — comparable columns for Ollama vs cloud |
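The Economics row reduces to token counts times per-token prices. The sketch below shows the shape of that estimate; the pricing numbers and function name are illustrative assumptions, not the actual tables in `tooling/bench_metrics.py`:

```python
# Illustrative per-1M-token prices in USD — NOT the repo's pricing tables.
PRICING = {"gpt-4o": {"input": 2.50, "output": 10.00}}

def estimated_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD for one generation; assumption-driven when the
    adapter does not report real usage."""
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

print(round(estimated_usd("gpt-4o", 1_000, 500), 6))  # 0.0075
```

This is why the cost estimates are labeled assumption-driven: when an adapter reports no usage, token counts themselves come from the tokenizer pass rather than from billed usage.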
## Local commands
Full local run (updates BENCHMARK.md + default JSON paths):

```bash
pip install -e ".[dev,benchmark]"
make benchmark
```
The `[benchmark]` extra includes `aiohttp`, `langgraph`, and `temporalio` so hybrid emit smoke tests and baseline comparisons can run without skip warnings when those stacks are used.
`make benchmark` invokes `scripts/benchmark_size.py --mode wide`, which measures `full_multitarget_core` (six compiler emitters), `full_multitarget` (plus langgraph/temporal wrappers), and `minimal_emit`. Use `--mode both` for the older two-mode slice only.
If `make` resolves the wrong interpreter (the Makefile tries `.venv-py310`, then `.venv-ainl`, then `.venv`), pass an explicit `PYTHON=` (e.g. `make benchmark-ci PYTHON=./.venv-py310/bin/python`) after running `pip install -e ".[benchmark]"` in that venv.
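The per-mode aggregate ratios that `--mode wide` reports can be thought of as totals over the measured artifacts rather than a mean of per-artifact ratios. A sketch under an assumed, illustrative row schema (`source_tk`, per-mode `emit_tk`) — not the real `tooling/benchmark_size.json` layout:

```python
def aggregate_ratio(rows: list[dict], mode: str) -> float:
    """Aggregate emitted/source token ratio for one emit mode
    (illustrative schema: each row carries 'source_tk' and per-mode 'emit_tk')."""
    src = sum(r["source_tk"] for r in rows)
    out = sum(r["emit_tk"][mode] for r in rows)
    return out / src

rows = [
    {"source_tk": 100, "emit_tk": {"minimal_emit": 70, "full_multitarget_core": 320}},
    {"source_tk": 200, "emit_tk": {"minimal_emit": 150, "full_multitarget_core": 640}},
]
print(aggregate_ratio(rows, "full_multitarget_core"))  # 3.2
```

Summing before dividing keeps large artifacts from being drowned out by tiny shells, which is the same reason the legacy-inclusive tables report a visibly different aggregate.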
CI-style (JSON only, smaller runtime sampling; matches the spirit of the benchmark-regression workflow):
```bash
make benchmark-ci
```
Size-only or runtime-only:

```bash
python scripts/benchmark_size.py --compare-baselines --cost-model gpt-4o
python scripts/benchmark_runtime.py --compare-baselines --reliability-runs 5 --cost-model gpt-4o
```
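Reliability batches such as `--reliability-runs 5` reduce to a success rate plus a timing spread over repetitions. A standard-library sketch with illustrative inputs (the tuple shape is an assumption, not the harness's record format):

```python
import statistics

def reliability_summary(runs: list[tuple[bool, float]]) -> dict:
    """Summarise (succeeded, latency_ms) repetitions into the success rate
    and timing sigma that reliability rows report (illustrative)."""
    ok = [ms for succeeded, ms in runs if succeeded]
    return {
        "success_rate": len(ok) / len(runs),
        "sigma_ms": statistics.stdev(ok) if len(ok) > 1 else 0.0,
    }

runs = [(True, 10.0), (True, 12.0), (True, 11.0), (False, 0.0), (True, 11.0)]
print(reliability_summary(runs)["success_rate"])  # 0.8
```

Since the workloads themselves are deterministic, a success rate below 1.0 or a large sigma signals flakes or environment drift rather than workload variance.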
Regression helper (compare two JSON reports):

```bash
python scripts/compare_benchmark_json.py \
  --old-json tooling/benchmark_size.json \
  --new-json tooling/benchmark_size_ci.json \
  --threshold 0.10
```
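The threshold gate behind that helper boils down to flagging metrics whose relative growth exceeds a cutoff. The sketch below assumes flat `{metric: value}` reports for illustration; the real `compare_benchmark_json.py` schema is richer:

```python
def regressions(old: dict, new: dict, threshold: float = 0.10) -> dict:
    """Return metrics whose relative growth exceeds the threshold
    (illustrative logic over assumed flat {metric: value} reports)."""
    flagged = {}
    for key, old_val in old.items():
        new_val = new.get(key)
        if new_val is None or old_val == 0:
            continue  # skip missing metrics and avoid dividing by zero
        growth = (new_val - old_val) / old_val
        if growth > threshold:
            flagged[key] = round(growth, 3)
    return flagged

old = {"minimal_emit_tk": 1000, "compile_ms": 50}
new = {"minimal_emit_tk": 1150, "compile_ms": 51}
print(regressions(old, new))  # {'minimal_emit_tk': 0.15}
```

A 15% size growth trips the default 0.10 threshold while the 2% compile-time drift passes, which matches how the CI gate separates real regressions from noise.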
## CI
Pull requests and pushes run the benchmark-regression job (see `.github/workflows/ci.yml`): benchmarks execute on Ubuntu (Python 3.10), JSON artifacts are uploaded, and `compare_benchmark_json.py` gates regressions against the baseline commit when baseline files exist in git. If `tooling/benchmark_size_ci.json` / `tooling/benchmark_runtime_ci.json` are present on the baseline SHA, the workflow compares against those (CI slice vs CI slice); otherwise it falls back to the full `tooling/benchmark_size.json` / `tooling/benchmark_runtime_results.json` when present. Details: BENCHMARK.md § CI regression baselines.
## See also
- `scripts/benchmark_size.py`, `scripts/benchmark_runtime.py`, `tooling/bench_metrics.py`
- `docs/TEST_PROFILES.md` — pytest profile matrix
- `docs/architecture/COMPILE_ONCE_RUN_MANY.md` — architectural framing
- `docs/OPENCLAW_INTEGRATION.md` — OpenClaw skill, `ainl install-openclaw`, and links to `examples/ecosystem/` (auto-sync, MCP importer tools)
- `docs/ZEROCLAW_INTEGRATION.md` — ZeroClaw skill, `ainl install-zeroclaw`, and links to `examples/ecosystem/` (auto-sync, MCP importer tools)
