# Versus LangGraph / Temporal: benchmark methodology
AINL’s public size and runtime benchmarks are reproducible from this repository. They are not a substitute for your production SLOs — but they give a repeatable way to compare authoring compactness, emit expansion, and post-compile execution cost.
This document names commands and artifacts so you can run head-to-head style comparisons (e.g. AINL source vs emitted LangGraph module vs hand-written Python baseline) without changing core code.
## Size / token economics (authoring + emit footprint)
- Human-readable: repository root `BENCHMARK.md` (tiktoken cl100k_base, viable vs legacy-inclusive).
- Regenerate: `make benchmark` or `make benchmark-ci` (see `docs/benchmarks.md`).
- Scripts: `scripts/benchmark_size.py` — profiles in `tooling/artifact_profiles.json`; modes include `full_multitarget` (includes langgraph + temporal wrapper emitters) and `minimal_emit` (planner-selected targets only).
- Hybrid in IR: declare `S hybrid langgraph` / `S hybrid temporal` (see `docs/HYBRID_GUIDE.md`) so `minimal_emit` can include wrapper targets when you want apples-to-apples "deploy bundle" sizing.
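The mode comparison above can be sketched in a few lines. The exact schema of `tooling/artifact_profiles.json` is not documented here, so the sample below fabricates a minimal per-mode shape with token totals; treat it as an illustration of the delta you would compute, not the real file format.

```python
import json

# ASSUMPTION: a minimal stand-in for tooling/artifact_profiles.json;
# the real schema may differ.
sample = json.loads("""
{
  "full_multitarget": {"total_tokens": 4200},
  "minimal_emit": {"total_tokens": 1500}
}
""")

def emit_overhead(profiles, full="full_multitarget", minimal="minimal_emit"):
    """Token overhead of emitting all wrapper targets versus only the
    planner-selected ones: absolute count and ratio."""
    f = profiles[full]["total_tokens"]
    m = profiles[minimal]["total_tokens"]
    return f - m, f / m

delta, ratio = emit_overhead(sample)
print(f"full_multitarget adds {delta} tokens ({ratio:.1f}x minimal_emit)")
```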
Suggested comparison rows (fill from JSON / tables):
| Row | What you compare |
|-----|------------------|
| A | AINL strict-valid source (tk) |
| B | Same program `--emit langgraph` output (tk) |
| C | Same program `--emit temporal` output (tk) — sum of emitted files |
| D | Hand-written LangGraph / Temporal tutorial equivalent (tk) — your baseline |
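Once the rows are filled in, the interesting numbers are ratios rather than raw counts. A small sketch, with placeholder token counts (in practice you would take them from `BENCHMARK.md` or count with tiktoken's cl100k_base encoding):

```python
# PLACEHOLDER counts for rows A-D; substitute your measured values.
tokens = {
    "A": 320,   # AINL strict-valid source
    "B": 1100,  # emitted LangGraph module
    "C": 1650,  # emitted Temporal files (summed)
    "D": 900,   # hand-written tutorial baseline
}

def expansion(tokens, row, base="A"):
    """Emit expansion of `row` relative to the authored AINL source."""
    return tokens[row] / tokens[base]

for row in ("B", "C", "D"):
    print(f"{row}/A expansion: {expansion(tokens, row):.2f}x")
```

A D/A ratio below B/A or C/A would indicate the hand-written baseline is more compact than the emitted module, which is the kind of honest row this table is meant to surface.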
## Runtime (latency, RSS, optional reliability)
- Results file: `tooling/benchmark_runtime_results.json` (regenerated by `make benchmark` when runtime bench is enabled in your env).
- Script: `scripts/benchmark_runtime.py` — see `docs/benchmarks.md` for flags and CI notes.
- Interpretation: measures deterministic graph execution after compile — not LLM inference. Use this to separate orchestration runtime from model cost.
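The "deterministic execution after compile" measurement can be approximated with a plain percentile harness; this is a sketch, not the logic of `scripts/benchmark_runtime.py`, and the lambda is a stand-in workload where your real target would be the compiled module's entrypoint:

```python
import statistics
import time

def bench(fn, runs=50):
    """Time a deterministic callable and report p50/p95 latency in ms.
    No LLM calls: this isolates orchestration cost from model cost."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000)
    qs = statistics.quantiles(samples, n=20)  # 19 cut points at 5% steps
    return {"p50": statistics.median(samples), "p95": qs[18]}

# ASSUMPTION: stand-in workload; replace with the emitted graph's run function.
result = bench(lambda: sum(i * i for i in range(10_000)))
print(result)
```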
## LLM generation benchmarks
- `docs/OLLAMA_EVAL.md` — local Ollama and optional cloud models; use the same temperature and prompts when comparing success rate of "generate valid AINL" vs "generate valid Python graph code."
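When comparing success rates from small trial counts, report an interval rather than a point estimate. A sketch with hypothetical tallies (46/50 vs 38/50 are invented numbers, not measured results):

```python
import math

def success_rate(outcomes):
    """Fraction of generation attempts that passed strict validation.
    `outcomes` is a list of booleans, one per generated program."""
    return sum(outcomes) / len(outcomes)

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval, so small-sample rates aren't overstated."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - margin, centre + margin

# HYPOTHETICAL tallies, identical prompts and temperature for both formats.
ainl = [True] * 46 + [False] * 4
python_graph = [True] * 38 + [False] * 12
print(success_rate(ainl), wilson_interval(46, 50))
print(success_rate(python_graph), wilson_interval(38, 50))
```

Overlapping intervals mean the run count is too low to claim a difference.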
## Migration / emit speed

Emit is local and CPU-bound; a rough wall-clock check:

    time python3 scripts/validate_ainl.py --strict examples/hybrid/temporal_durable_ainl/monitoring_durable.ainl --emit temporal -o /tmp/ainl_temporal_out
    time python3 scripts/validate_ainl.py --strict examples/hybrid/langgraph_outer_ainl_core/monitoring_escalation.ainl --emit langgraph -o /tmp/monitoring_langgraph.py
Record means on a quiet machine; commit methodology, not one-off magic numbers.
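If you prefer a committed-to-the-repo harness over ad-hoc `time` invocations, a minimal Python wrapper covers the "means on a quiet machine" point. The placeholder command below is a stand-in; substitute the `validate_ainl.py --emit` invocations shown above:

```python
import statistics
import subprocess
import sys
import time

def time_command(cmd, runs=5):
    """Wall-clock a local, CPU-bound command; return mean seconds over runs.
    Commit this methodology, not a single one-off timing."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples)

# PLACEHOLDER command; replace with the emit invocation you are measuring.
mean_s = time_command([sys.executable, "-c", "pass"])
print(f"mean: {mean_s:.3f}s over 5 runs")
```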
## Honest boundaries
- LangGraph and Temporal excel at their runtime guarantees and ecosystems; AINL’s claim is portable authoring + strict compile + multi-target emit, not “faster worker RPCs than Temporal.”
- n8n / CrewAI / prompt-loop frameworks serve different personas; compare on determinism, audit, token recurring cost, and compile guarantees for operational graphs, not on visual DSL features.
## See also
- COMPARISON_TABLE — comparison tables with committed numbers, JSON source paths, and TBD rows for pending runs.
- FROM_LANGGRAPH_TO_AINL
- AINL_AND_TEMPORAL
- OPENCLAW_PRODUCTION_SAVINGS
