# Benchmarks and evidence

This page ties together the size, runtime, and LLM-generation benchmarks so you can defend AINL with numbers, not vibes.
## Benchmark highlights (March 2026)
Source of truth: repository-root `BENCHMARK.md` (regenerate with `make benchmark`). Numbers below use tiktoken `cl100k_base`, as in the Mode Comparison and legacy-inclusive sections.
Quick size snapshot:
| Lens | What it means | Headline ratios (tk) | Artifact coverage |
|------|---------------|----------------------|-------------------|
| Strict-valid | `canonical_strict_valid` | ~3.2× full_multitarget_core / ~362× full_multitarget / ~0.76× minimal_emit | 19/19 (all viable) |
| Public mixed, viable subset | Representative required-target workloads; excludes curated low-emit / legacy rows | ~1.0× core / ~322× full / ~0.73× minimal | 72/85 viable |
| Compatibility only, viable | Non-strict headline companion profile | ~0.84× core / ~318× full / ~0.71× minimal | 53/66 viable |
| Legacy-inclusive | All paths in the profile (honest aggregate drag from tiny shells) | See the legacy-inclusive block in BENCHMARK.md | 85/85 (public_mixed) |
Note: `full_multitarget_core` (six compiler-backed emitters) is the line comparable to pre-hybrid headline ratios. `full_multitarget` adds the langgraph and temporal wrappers (`tooling/emit_targets.py`); they embed the IR, so that column is much larger. `minimal_emit` stays practical for typical artifacts (hybrid targets require `needs_langgraph` / `needs_temporal` flags in the IR, which default to false).
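The headline ratios above are plain token counts: each source and each emitted bundle is encoded with tiktoken's `cl100k_base` encoding, and the ratio is emitted tokens divided by source tokens. A minimal sketch of that arithmetic (the helper names and the character-count fallback are illustrative, not repo code):

```python
try:
    import tiktoken
    _enc = tiktoken.get_encoding("cl100k_base")

    def count_tokens(text: str) -> int:
        """Token count under the cl100k_base encoding."""
        return len(_enc.encode(text))
except ImportError:  # fallback so the sketch runs without tiktoken installed
    def count_tokens(text: str) -> int:
        return max(1, len(text) // 4)  # rough ~4-chars-per-token heuristic

def token_ratio(source_text: str, emitted_text: str) -> float:
    """Emitted-bundle tokens divided by source tokens (the 'tk' ratios)."""
    return count_tokens(emitted_text) / count_tokens(source_text)

# Identical texts give a ratio of exactly 1.0 under either backend.
print(token_ratio("flow hello", "flow hello"))  # 1.0
```

A ratio below 1.0 (as in the minimal_emit column) means the emitted bundle is smaller than the AINL source it came from.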
Transparency (mirrors BENCHMARK.md blockquotes):

- tiktoken `cl100k_base` — markdown tables foreground tokenizer counts; the JSON still stores the active CLI `--metric` (default `tiktoken`) for thresholds and economics.
- Viable subset — for `public_mixed` / `compatibility_only`, rules live in `tooling/artifact_profiles.json` plus emit heuristics; legacy-inclusive tables are always below the fold.
- `minimal_emit` fallback stub — a tiny python_api async stub (~20–30 tk) when no selected target emits code.
- Emitter compaction (Mar 2026) — `prisma` and `react_ts` benchmark stubs were shortened (~50–70% tk reduction on those lines in examples).
- Hybrid wrapper emitters in `full_multitarget` — `langgraph` / `temporal` sizes come from `scripts/emit_langgraph.py` and `scripts/emit_temporal.py`; not part of `minimal_emit` unless `emit_capabilities` gains matching flags later.
- `--strict-mode` — `scripts/benchmark_size.py` with `--profile-name=canonical_strict_valid` runs the compiler in strict reachability mode; see the strict callout in BENCHMARK.md when enabled.
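The viable-subset gating above amounts to partitioning per-artifact records before aggregating. The sketch below illustrates the idea only; the field names (`viable`, `legacy`) are hypothetical, not the actual `tooling/artifact_profiles.json` schema:

```python
def split_viable(artifacts: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition artifacts into the headline 'viable' subset and the
    legacy-inclusive remainder (illustrative field names)."""
    viable = [a for a in artifacts if a.get("viable") and not a.get("legacy")]
    rest = [a for a in artifacts if a not in viable]
    return viable, rest

profile = [
    {"name": "checkout_flow", "viable": True, "legacy": False},
    {"name": "tiny_shell", "viable": False, "legacy": True},
]
viable, rest = split_viable(profile)
print(len(viable), len(rest))  # 1 1
```

Headline tables aggregate only the first list; the legacy-inclusive tables fold the second back in, which is why those ratios drag.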
## Comparative methodology (LangGraph, Temporal, others)
Head-to-head commands, suggested table rows, and scope boundaries (what we do and do not claim) live in `competitive/VERSUS_LANGGRAPH_TEMPORAL_BENCHMARKS.md`. Use it with the tables in `BENCHMARK.md` so published numbers stay reproducible.
## Competitive Context
These benchmarks power the head-to-head comparison tables:
- Comparison tables (committed data only)
- Versus LangGraph & Temporal: benchmark methodology
- Full competitive overview
## Why these benchmarks matter
AINL is compile-once, run-many: you pay LLM tokens (or human time) to author a program once, then the runtime executes the graph deterministically—no prompt loop on every invocation. The runtime benchmarks measure that second phase: wall-clock latency, RSS deltas, optional execution reliability, and (with tiktoken) source-token economics. The size benchmarks quantify how much emitted surface area you get per AINL artifact (profile- and mode-scoped), including mean compile time over three timed compiles so you can see compiler cost separately from emit size. Together, they show a different cost structure from “LLM re-generates orchestration code every time.”
Human-written baselines (`--compare-baselines`) anchor claims against real Python stacks (pure async vs LangGraph-style) using the same metrics where possible.
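The runtime-phase measurements described above can be pictured with the standard library alone. This is a sketch of the shape of the harness, not the actual `scripts/benchmark_runtime.py` implementation; the workload, run count, and heap-peak stand-in for RSS delta are all illustrative:

```python
import statistics
import time
import tracemalloc

def measure(workload, runs: int = 3) -> dict:
    """Time a workload over several runs and track peak Python heap growth
    (a stand-in for the peak RSS delta the real harness reports)."""
    latencies = []
    tracemalloc.start()
    for _ in range(runs):
        t0 = time.perf_counter()
        workload()
        latencies.append((time.perf_counter() - t0) * 1000.0)  # ms
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "mean_ms": statistics.mean(latencies),
        "stdev_ms": statistics.stdev(latencies) if runs > 1 else 0.0,
        "peak_bytes": peak,
    }

stats = measure(lambda: sum(range(10_000)))
print(sorted(stats))  # ['mean_ms', 'peak_bytes', 'stdev_ms']
```

Because the graph is already compiled, nothing in this loop touches an LLM: the measured cost is pure execution, which is the point of the compile-once, run-many framing.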
## Where to read results
- Ecosystem import examples: trees under `examples/ecosystem/` (Clawflows- and Agency-Agents-style Markdown → AINL) are kept fresh via weekly auto-sync from the upstream Clawflows and Agency-Agents repos — see `.github/workflows/sync-ecosystem.yml` and `docs/ECOSYSTEM_OPENCLAW.md`. OpenClaw and ZeroClaw both consume these paths (CLI, MCP, OpenClaw skill — `docs/OPENCLAW_INTEGRATION.md` — or ZeroClaw skill — `docs/ZEROCLAW_INTEGRATION.md`).
- Human-readable size report (start here): repository-root `BENCHMARK.md` — transparency notes at the top; viable vs legacy-inclusive sections; per-artifact Notes column. On ainativelang.com, the same content is published as the Benchmarks page.
- Machine-readable size JSON: `tooling/benchmark_size.json` (schema 3.5+).
- Runtime JSON: `tooling/benchmark_runtime_results.json` (generated; tracked in git as the CI baseline when committed).
- LLM eval / multi-model bench: `docs/OLLAMA_EVAL.md` — local Ollama plus optional Anthropic Claude via `ainl-ollama-benchmark --cloud-model …` (`pip install -e ".[anthropic]"`, `ANTHROPIC_API_KEY`).
## Key metrics (quick glossary)
| Area | What we measure |
|------|-----------------|
| Size | tiktoken (cl100k_base) on source and emitted bundles; aggregate ratios; optional cost estimates; Compile ms (mean×3) per artifact |
| Runtime | Post-compile execution latency, peak RSS Δ, adapter/trace stats; optional reliability batches; scalability probe on a golden workflow |
| Economics | Estimated USD per run/generation from shared pricing tables in `tooling/bench_metrics.py` (assumption-driven where adapters do not report usage) |
| Reliability | Success rate + timing σ for extra compile or execution repetitions (workloads remain deterministic; reliability catches flakes and env drift) |
| LLM bench | Pass rate, viability gate, errors, retries, wall time — comparable columns for Ollama vs cloud |
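The Economics row reduces to token counts times per-token prices. The sketch below shows the shape of that estimate; the pricing numbers and function name are illustrative assumptions, not the actual tables in `tooling/bench_metrics.py`:

```python
# Illustrative per-1M-token prices in USD — NOT the repo's pricing tables.
PRICING = {"gpt-4o": {"input": 2.50, "output": 10.00}}

def estimated_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD for one generation; assumption-driven when the
    adapter does not report real usage."""
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

print(round(estimated_usd("gpt-4o", 1_000, 500), 6))  # 0.0075
```

This is why the cost estimates are labeled assumption-driven: when an adapter reports no usage, token counts themselves come from the tokenizer pass rather than from billed usage.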
## Local commands
Full local run (updates BENCHMARK.md + default JSON paths):

```bash
pip install -e ".[dev,benchmark]"
make benchmark
```
The `[benchmark]` extra includes `aiohttp`, `langgraph`, and `temporalio` so hybrid emit smoke tests and baseline comparisons can run without skip warnings when those stacks are used.
`make benchmark` invokes `scripts/benchmark_size.py --mode wide`, which measures `full_multitarget_core` (six compiler emitters), `full_multitarget` (plus langgraph/temporal wrappers), and `minimal_emit`. Use `--mode both` for the older two-mode slice only.
If `make` resolves the wrong interpreter (the Makefile tries `.venv-py310`, then `.venv-ainl`, then `.venv`), pass an explicit `PYTHON=` (e.g. `make benchmark-ci PYTHON=./.venv-py310/bin/python`) after running `pip install -e ".[benchmark]"` in that venv.
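The per-mode aggregate ratios that `--mode wide` reports can be thought of as totals over the measured artifacts rather than a mean of per-artifact ratios. A sketch under an assumed, illustrative row schema (`source_tk`, per-mode `emit_tk`) — not the real `tooling/benchmark_size.json` layout:

```python
def aggregate_ratio(rows: list[dict], mode: str) -> float:
    """Aggregate emitted/source token ratio for one emit mode
    (illustrative schema: each row carries 'source_tk' and per-mode 'emit_tk')."""
    src = sum(r["source_tk"] for r in rows)
    out = sum(r["emit_tk"][mode] for r in rows)
    return out / src

rows = [
    {"source_tk": 100, "emit_tk": {"minimal_emit": 70, "full_multitarget_core": 320}},
    {"source_tk": 200, "emit_tk": {"minimal_emit": 150, "full_multitarget_core": 640}},
]
print(aggregate_ratio(rows, "full_multitarget_core"))  # 3.2
```

Summing before dividing keeps large artifacts from being drowned out by tiny shells, which is the same reason the legacy-inclusive tables report a visibly different aggregate.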
CI-style (JSON only, smaller runtime sampling; matches the spirit of the benchmark-regression workflow):
```bash
make benchmark-ci
```
Size-only or runtime-only:

```bash
python scripts/benchmark_size.py --compare-baselines --cost-model gpt-4o
python scripts/benchmark_runtime.py --compare-baselines --reliability-runs 5 --cost-model gpt-4o
```
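Reliability batches such as `--reliability-runs 5` reduce to a success rate plus a timing spread over repetitions. A standard-library sketch with illustrative inputs (the tuple shape is an assumption, not the harness's record format):

```python
import statistics

def reliability_summary(runs: list[tuple[bool, float]]) -> dict:
    """Summarise (succeeded, latency_ms) repetitions into the success rate
    and timing sigma that reliability rows report (illustrative)."""
    ok = [ms for succeeded, ms in runs if succeeded]
    return {
        "success_rate": len(ok) / len(runs),
        "sigma_ms": statistics.stdev(ok) if len(ok) > 1 else 0.0,
    }

runs = [(True, 10.0), (True, 12.0), (True, 11.0), (False, 0.0), (True, 11.0)]
print(reliability_summary(runs)["success_rate"])  # 0.8
```

Since the workloads themselves are deterministic, a success rate below 1.0 or a large sigma signals flakes or environment drift rather than workload variance.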
Regression helper (compare two JSON reports):

```bash
python scripts/compare_benchmark_json.py \
  --old-json tooling/benchmark_size.json \
  --new-json tooling/benchmark_size_ci.json \
  --threshold 0.10
```
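The threshold gate behind that helper boils down to flagging metrics whose relative growth exceeds a cutoff. The sketch below assumes flat `{metric: value}` reports for illustration; the real `compare_benchmark_json.py` schema is richer:

```python
def regressions(old: dict, new: dict, threshold: float = 0.10) -> dict:
    """Return metrics whose relative growth exceeds the threshold
    (illustrative logic over assumed flat {metric: value} reports)."""
    flagged = {}
    for key, old_val in old.items():
        new_val = new.get(key)
        if new_val is None or old_val == 0:
            continue  # skip missing metrics and avoid dividing by zero
        growth = (new_val - old_val) / old_val
        if growth > threshold:
            flagged[key] = round(growth, 3)
    return flagged

old = {"minimal_emit_tk": 1000, "compile_ms": 50}
new = {"minimal_emit_tk": 1150, "compile_ms": 51}
print(regressions(old, new))  # {'minimal_emit_tk': 0.15}
```

A 15% size growth trips the default 0.10 threshold while the 2% compile-time drift passes, which matches how the CI gate separates real regressions from noise.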
## CI
Pull requests and pushes run the benchmark-regression job (see `.github/workflows/ci.yml`): benchmarks execute on Ubuntu (Python 3.10), JSON artifacts are uploaded, and `compare_benchmark_json.py` gates regressions against the baseline commit when baseline files exist in git. If `tooling/benchmark_size_ci.json` / `tooling/benchmark_runtime_ci.json` are present on the baseline SHA, the workflow compares against those (CI slice vs CI slice); otherwise it falls back to the full `tooling/benchmark_size.json` / `tooling/benchmark_runtime_results.json` when present. Details: BENCHMARK.md § CI regression baselines.
## See also
- `scripts/benchmark_size.py`, `scripts/benchmark_runtime.py`, `tooling/bench_metrics.py`
- `docs/TEST_PROFILES.md` — pytest profile matrix
- `docs/architecture/COMPILE_ONCE_RUN_MANY.md` — architectural framing
- `docs/OPENCLAW_INTEGRATION.md` — OpenClaw skill, `ainl install-openclaw`, and links to `examples/ecosystem/` (auto-sync, MCP importer tools)
- `docs/ZEROCLAW_INTEGRATION.md` — ZeroClaw skill, `ainl install-zeroclaw`, and links to `examples/ecosystem/` (auto-sync, MCP importer tools)
