Monitoring & Observability
ℹ️ TWO SYNTAX STYLES: This document shows two AINL syntax styles:
- Compact syntax (works now): Python-like, recommended for new code. See examples/compact/ and AGENTS.md for the full reference.
- Graph block syntax (graph { node ... }): DESIGN PREVIEW, does NOT compile. These blocks are labeled "Design Preview" below.
Use compact syntax for real projects:
ainl validate file --strict
Track running AINL graphs: health, performance, costs, and errors.
📊 The Three Pillars
- Health – Is the graph alive and processing?
- Metrics – Latency, token usage, success rates
- Traces – Per-execution audit logs for debugging
1️⃣ Health Checking
AINL graphs can expose a health endpoint for liveness/readiness probes.
Enable Health Server
ainl run monitor.ainl --health-port 9090
This starts an HTTP server with the following endpoints:
- GET /health/live – Returns 200 if the process is alive
- GET /health/ready – Returns 200 if the graph validated and the adapters are connected
- GET /health/metrics – Prometheus-format metrics (optional)
Kubernetes example:
livenessProbe:
  httpGet:
    path: /health/live
    port: 9090
  initialDelaySeconds: 10
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /health/ready
    port: 9090
  initialDelaySeconds: 5
  periodSeconds: 10
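Outside Kubernetes, the same two endpoints can be polled from a script. The following is a minimal sketch that assumes only what is stated above (HTTP 200 means healthy). The function names probe and check_health, and the injectable opener parameter, are illustrative and not part of AINL.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 2.0, opener=urllib.request.urlopen) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP error statuses all count as unhealthy.
        return False


def check_health(base_url: str, opener=urllib.request.urlopen) -> dict:
    """Probe the liveness and readiness endpoints listed in this section."""
    return {
        "live": probe(f"{base_url}/health/live", opener=opener),
        "ready": probe(f"{base_url}/health/ready", opener=opener),
    }
```

With the server started as shown earlier, check_health("http://localhost:9090") would return a dictionary with one boolean per probe. The opener parameter exists only so the check can be exercised without a running server.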
2️⃣ Metrics Collection
AINL exposes counters, gauges, and histograms in Prometheus format at /metrics, where Prometheus scrapes them.
Available Metrics
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| ainl_executions_total | counter | graph, result | Total graph executions (result=success/failure) |
| ainl_execution_duration_seconds | histogram | graph | Execution time distribution |
| ainl_tokens_used_total | counter | graph, kind (llm/orchestration) | Token consumption |
| ainl_cost_usd_total | counter | graph, adapter | LLM API costs |
| ainl_nodes_total | counter | graph, node, status | Nodes completed/failed |
Prometheus Config
scrape_configs:
  - job_name: 'ainl'
    static_configs:
      - targets: ['localhost:9090']
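To sanity-check the endpoint before Prometheus is wired up, the scraped text can be parsed by hand. The sketch below assumes the endpoint returns the standard Prometheus text exposition format and uses the ainl_executions_total metric and its graph and result labels from the table above. The parser is deliberately simplified (it does not handle a closing brace inside a label value), and the function names are illustrative.

```python
import re
from collections import defaultdict

# Matches sample lines such as:
#   ainl_executions_total{graph="monitor",result="success"} 42
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str):
    """Yield (metric_name, labels, value) for every sample line in the text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match:
            labels = dict(LABEL_RE.findall(match.group("labels") or ""))
            yield match.group("name"), labels, float(match.group("value"))


def success_rate_by_graph(text: str) -> dict:
    """Return successes / total executions for each graph label."""
    totals = defaultdict(float)
    successes = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name != "ainl_executions_total":
            continue
        graph = labels.get("graph", "")
        totals[graph] += value
        if labels.get("result") == "success":
            successes[graph] += value
    return {g: successes[g] / totals[g] for g in totals if totals[g] > 0}
```

This computes a lifetime ratio from raw counter values. For a windowed view, use rate() in Prometheus as in the alerting queries later in this document.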
Grafana Dashboard
Import pre-built dashboard JSON from docs/monitoring/grafana-dashboard.json (coming soon).
Key panels:
- Graph success rate (SLA)
- P99 latency per node
- Token spend over time
- Cost per execution
3️⃣ Execution Traces (Audit Logs)
Execution traces are the most important feature for compliance: every node execution is recorded as an auditable event.
Enable Tracing
ainl run monitor.ainl \
--trace-jsonl /var/log/ainl/traces/$(date +%s).jsonl \
--trace-retention-days 90
Each line of the trace file is one JSON event (pretty-printed here for readability):
{
  "timestamp": "2025-03-30T14:22:15.123Z",
  "graph": "monitor",
  "execution_id": "abc-123-def",
  "node": "classify",
  "phase": "completed",
  "duration_ms": 1245,
  "tokens_used": 245,
  "cost_usd": 0.000612,
  "input": {"prompt": "Classify: DB timeout..."},
  "output": {"result": "CRITICAL"},
  "error": null,
  "traceparent": "00-...-..."
}
The traceparent field is OpenTelemetry compatible.
Trace Schema
See docs/reference/trace-schema.json for full spec.
Key fields:
- execution_id: UUID linking all events from the same run
- phase: started | completed | failed
- parent_id: For nested calls (when a node invokes a sub-graph)
- attributes: Arbitrary key-value pairs for custom data
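Because execution_id links every event of a run, a trace file can be folded into one summary per execution. The sketch below assumes the field names from the sample event (execution_id, phase, duration_ms, cost_usd). The summary layout and the function name are illustrative and not part of the trace schema.

```python
import json
from collections import defaultdict


def summarize_executions(lines):
    """Aggregate JSONL trace events into one summary per execution_id."""
    summaries = defaultdict(
        lambda: {"nodes": 0, "duration_ms": 0, "cost_usd": 0.0, "failed": False}
    )
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        summary = summaries[event["execution_id"]]
        phase = event.get("phase")
        if phase == "completed":
            # Only completed events are assumed to carry duration and cost.
            summary["nodes"] += 1
            summary["duration_ms"] += event.get("duration_ms") or 0
            summary["cost_usd"] += event.get("cost_usd") or 0.0
        elif phase == "failed":
            summary["failed"] = True
    return dict(summaries)
```

Passing an open file handle for a .jsonl trace file works directly, since the function only iterates over lines.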
📈 Alerting Strategies
High Error Rate
# Alert if >5% of executions fail in last 5 min
# (sum() drops the result label so failures are divided by all executions)
sum(rate(ainl_executions_total{result="failure"}[5m])) /
sum(rate(ainl_executions_total[5m])) > 0.05
Cost Spike
# Alert if hourly spend > $100
sum(rate(ainl_cost_usd_total[1h])) * 3600 > 100
Latency SLA Breach
# Alert if P95 latency > 5s
histogram_quantile(0.95, rate(ainl_execution_duration_seconds_bucket[5m])) > 5
Node Failure Pattern
# Alert if specific node fails >10 times in 2 min
increase(ainl_nodes_total{node="send_slack",status="failed"}[2m]) > 10
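The arithmetic behind the error-rate and cost alerts can be checked by hand from two counter readings. This is a simplified sketch: it assumes the counter did not reset inside the window, whereas PromQL's rate() also handles resets and extrapolates to the window edges. All function names are illustrative.

```python
def per_second_rate(earlier: float, later: float, window_seconds: float) -> float:
    """Average per-second increase of a counter between two readings."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return (later - earlier) / window_seconds


def error_ratio(failure_rate: float, total_rate: float) -> float:
    """Share of executions that failed; 0.0 when nothing ran in the window."""
    return failure_rate / total_rate if total_rate > 0 else 0.0


def hourly_spend(cost_rate_per_second: float) -> float:
    """Project a per-second cost rate to dollars per hour (the * 3600 step)."""
    return cost_rate_per_second * 3600
```

For example, 30 failures out of 400 executions in a 5-minute window gives an error ratio of 0.075, which is above the 0.05 threshold. A cost counter that rises from 10 to 130 USD over one hour projects to 120 USD per hour, which is above the 100 USD threshold.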
🔄 Centralized Logging
Ship traces to a central system (Loki, Datadog, Splunk).
Fluentd Example
<source>
  @type tail
  path /var/log/ainl/traces/*.jsonl
  pos_file /var/log/ainl/traces/fluentd.pos
  tag ainl.traces
  <parse>
    @type json
  </parse>
</source>
<match ainl.traces>
  @type elasticsearch
  host localhost
  port 9200
  index_name ainl-traces
  <inject>
    time_key time
    time_type string
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </inject>
</match>
🧩 Health Envelope (Advanced)
Combine health metrics with traces for full observability.
AINL can emit a health envelope every N executions:
# ainl.yaml
monitoring:
  health_envelope:
    enabled: true
    every_n_executions: 100
    send_to: "https://monitor.ainativelang.com/health"
    api_key: ${HEALTH_ENVELOPE_KEY}
Envelope includes:
- Last N execution summaries
- Token usage statistics
- Error rate
- Custom metrics from nodes (if configured)
Enterprise hosted runtimes provide this out of the box.
🐛 Debugging with Traces
Find Slow Node
# List the 10 slowest node completion events
jq -r 'select(.phase=="completed") | "\(.node)\t\(.duration_ms)"' traces.jsonl | \
  sort -k2 -n -r | head -10
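Where jq is not available, the same query can be written in a few lines of Python. This sketch assumes the node, phase, and duration_ms fields from the sample trace event. The function name slowest_nodes is illustrative.

```python
import heapq
import json


def slowest_nodes(lines, limit=10):
    """Return the `limit` slowest completed node events as (node, duration_ms) pairs."""
    completed = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("phase") == "completed" and event.get("duration_ms") is not None:
            completed.append((event["node"], event["duration_ms"]))
    # nlargest returns the pairs sorted from slowest to fastest.
    return heapq.nlargest(limit, completed, key=lambda pair: pair[1])
```

Calling slowest_nodes on an open handle to traces.jsonl yields the same ranking as the shell pipeline, without depending on how sort splits fields.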
Replay a Failing Execution
# Find execution ID from logs
grep -E '"execution_id": ?"abc-123' traces.jsonl | head -1
# Replay in debugger
ainl replay monitor.ainl --execution-id abc-123 --debug
Correlate with Application Logs
If your app logs a request ID, pass it as an attribute (graph block syntax, Design Preview):
node classify: LLM("classify") {
  attributes: {
    request_id: "${env.REQUEST_ID}"
  }
}
The trace events and your app logs then share the same request_id.
📊 Dashboard Templates
Basic Grafana Dashboard (JSON)
Import from docs/monitoring/grafana-basic.json:
- Top row: Overall health (executions/sec, error rate)
- Middle row: Cost and token usage
- Bottom row: Top 5 slowest nodes
Enterprise Dashboard (Coming Soon)
- Multi-tenant views
- SLA compliance tracking
- Cost allocation per team/department
- Automated compliance reports
✅ Production Checklist
- [ ] Health endpoints exposed (--health-port)
- [ ] Traces stored in durable storage (not just local disk)
- [ ] Prometheus scraping metrics endpoint
- [ ] Alerts configured for error rate, cost, latency
- [ ] Grafana dashboards created and shared with team
- [ ] Log retention policy meets compliance needs (usually 90 days minimum)
- [ ] Trace data scrubbed of PII before storage (use --trace-filter)
🎯 Best Practices
- Never run without traces in production (--trace-jsonl is mandatory)
- Rotate logs – use logrotate or cloud storage lifecycle
- Filter PII – configure trace.fields_to_redact in config
- Aggregate metrics – send to central Prometheus, not per-host
- Set budgets – Alert on cost thresholds, not just usage
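As an illustration of the PII filtering advice, the sketch below masks selected fields of a trace event before it is shipped anywhere. It is a standalone post-processing step written under assumptions: this document does not specify how trace.fields_to_redact or --trace-filter behave, so the dotted-path convention and the redact function here are hypothetical.

```python
import copy

REDACTED = "[REDACTED]"


def redact(event: dict, fields_to_redact) -> dict:
    """Return a copy of `event` with the given dotted field paths masked.

    The dotted-path format (for example "input.prompt") is an assumption of
    this sketch, not documented AINL behavior.
    """
    cleaned = copy.deepcopy(event)  # never mutate the caller's event
    for path in fields_to_redact:
        *parents, leaf = path.split(".")
        target = cleaned
        for key in parents:
            target = target.get(key) if isinstance(target, dict) else None
            if target is None:
                break
        if isinstance(target, dict) and leaf in target:
            target[leaf] = REDACTED
    return cleaned
```

Paths that do not exist in an event are skipped silently, so one redaction list can be applied to events from different nodes.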
🔗 Related
- STRICT_VALIDATION_AND_COMPLIANCE.md – Compliance patterns
- Self-Monitoring Guide – AINL monitoring itself
- Trace Schema Reference – Full JSON spec
Monitor everything, sleep soundly! →
