Retry with Exponential Backoff Pattern
> ℹ️ TWO SYNTAX STYLES: This document shows two AINL syntax styles: > 1. Compact syntax (works now) — Python-like, recommended for new code. > See examples/compact/ and AGENTS.md for the full reference. > 2. Graph blo
Retry with Exponential Backoff Pattern
ℹ️ TWO SYNTAX STYLES: This document shows two AINL syntax styles:
- Compact syntax (works now) — Python-like, recommended for new code. See
examples/compact/andAGENTS.mdfor the full reference.- Graph block syntax (
graph { node ... }) — DESIGN PREVIEW, does NOT compile. These blocks are labeled "Design Preview" below.Use compact syntax for real projects:
ainl validatefile--strict
Handle transient failures in external API calls with intelligent retry logic.
🎯 Use Case
You're calling an external API (HTTP, database, LLM) that occasionally fails due to:
- Network blips
- Rate limiting (429)
- Temporary service unavailability (5xx)
You want to:
- Retry automatically instead of failing the whole graph
- Wait longer between retries (backoff)
- Give up after reasonable attempts
- Log each retry attempt for observability
📐 Pattern Structure
graph TD
A[Call API] --> B{Success?};
B -->|Yes| C[Return result];
B -->|No| D{Retryable?};
D -->|Yes| E[Wait (backoff)];
E --> A;
D -->|No| F[Fail graph];
🏗️ Implementation
Real AINL Syntax (v1.3.3 — this compiles)
# retry_api.ainl — Retry HTTP call with backoff
# ainl validate retry_api.ainl --strict
S app core noop
L_call:
R core.GET ctx "url" ->url
R http.POST url {} ->resp
R core.GET resp "status" ->status
If (core.eq status 200) ->L_ok ->L_retry
L_retry:
# Check attempt count
R core.GET ctx "attempt" ->attempt
X max_attempts 3
If (core.gt attempt max_attempts) ->L_fail ->L_wait
L_wait:
# Exponential backoff: sleep 1s * 2^attempt
R core.mul 1000 (core.pow 2 attempt) ->delay_ms
R core.sleep delay_ms ->_
Set attempt (core.add attempt 1)
Call L_call
L_ok:
J resp
L_fail:
Err "Max retries exceeded"
Design Preview Syntax (AINL 2.0 — does NOT compile yet)
graph RetryApiCall {
input: Request = { url: string, payload: object }
node call: HTTP("external-api") {
method: POST
url: input.url
body: input.payload
timeout: 10s
# Note: retry logic is in wrapper below
}
node with_retry: retry(call) {
max_attempts: 3
backoff: "exponential" # 1s, 2s, 4s
retryable_codes: [429, 500, 502, 503] # HTTP status codes
on_retry: log_retry
}
node log_retry: WriteFile("retry-log") {
path: "./logs/retries.jsonl"
content: '{"attempt":{{retry.attempt}},"error":"{{retry.error}}","ts":"{{now}}"}'
mode: append
}
output: with_retry.result
}
Key elements:
with_retrywrapper node (built-in AINL construct)max_attempts: total tries = 1 initial + 2 retries = 3backoff: exponential means wait 1s, then 2s, then 4sretryable_codes: only retry on these HTTP status codeson_retry: optional action before each retry (here: log)
🔧 Retry Policies
Backoff Strategies
| Strategy | Formula | Example (3 retries) |
|----------|---------|---------------------|
| Fixed | wait = base_delay | 1s, 1s, 1s |
| Exponential | wait = base_delay * 2^attempt | 1s, 2s, 4s |
| Linear | wait = base_delay * attempt | 1s, 2s, 3s |
Recommendation: Exponential with jitter to avoid thundering herd.
node with_retry: retry(call) {
backoff: "exponential"
base_delay: 1s
jitter: "random" # ±25% randomness
}
🧪 Testing Retry Logic
Simulate Failures
from ainl.testing import MockAdapter, FailingMockAdapter
# Make first 2 calls fail, 3rd succeed
failing_mock = FailingMockAdapter(
failures=2,
success_response='{"ok": true}',
failure_exception=RuntimeError("Service temporarily unavailable")
)
graph.get_node("call").adapter = failing_mock
result = graph.run(input_data)
assert result == {"ok": true} # Eventually succeeds
# Verify retry count
assert failing_mock.call_count == 3
Test Non-Retryable Errors
# 400 Bad Request should not retry
error_mock = MockAdapter(
status_code=400,
response='{"error":"bad input"}'
)
graph.get_node("call").adapter = error_mock
with pytest.raises(GraphExecutionError):
graph.run(input_data)
assert error_mock.call_count == 1 # No retries
📊 Observability
Metrics to Track
- Retry count per node (how often are we retrying?)
- Failure reasons (rate limit vs 5xx vs network)
- Cumulative latency (including retry waits)
- Success rate after retries (eventual success rate)
Prometheus queries:
# Retries per minute
rate(ainl_retry_total[1m])
# Failure breakdown
sum by (error_type) (rate(ainl_node_failures_total[5m]))
⚙️ Production Tips
1. Set Reasonable Limits
- Max attempts: 3-5 (more = higher latency, more load)
- Base delay: 1s for interactive APIs, 5s for batch jobs
- Total timeout: Ensure
timeouton node > sum(backoff delays)
2. Different Strategies per Error Type
node with_retry: retry(call) {
backoff: "exponential"
retry_rules:
- codes: [429] # Rate limit
max_attempts: 5
base_delay: 2s
- codes: [500, 502, 503] # Server errors
max_attempts: 3
base_delay: 1s
- codes: [408, 524] # Timeouts
max_attempts: 2
base_delay: 3s
}
3. Circuit Breaker Pattern
After N consecutive failures, stop calling for a cool-down period:
node call_with_circuit: circuit_breaker(call) {
failure_threshold: 5
cool_down: 60s
on_open: log_circuit_open
}
🔄 Variations
Parallel Retry (Fan-out)
If you have multiple independent API calls, retry each separately:
graph ParallelFetch {
node fetch_a: retry(HTTP("a")) { max_attempts: 3 }
node fetch_b: retry(HTTP("b")) { max_attempts: 3 }
node fetch_c: retry(HTTP("c")) { max_attempts: 3 }
# These run in parallel
output: { a: fetch_a.result, b: fetch_b.result, c: fetch_c.result }
}
🐛 Common Pitfalls
| Pitfall | Why Bad | Fix |
|---------|---------|-----|
| Infinite retries | Hang forever, no failure notice | Always set max_attempts |
| No jitter on backoff | All retries happen at same time → thundering herd | Add jitter: "random" |
| Retrying non-idempotent ops | Duplicate creates, double charges | Only idempotent ops should retry |
| Too many retries | High latency, API quota exhaustion | Keep max_attempts ≤ 5 |
📈 When to Use This Pattern
✅ Do use:
- External API calls (weather, payment, email)
- Database connections ( transient network issues)
- LLM API calls (rate limits, occasional 5xx)
- Any operation with known transient failure modes
❌ Don't use:
- Permanent failures (validation errors, 400s) – retry won't help
- Non-idempotent operations (create, delete) unless idempotency key
- User-facing interactive requests (retries make latency unpredictable)
📚 See Also
- Testing Guide – Test your retry logic thoroughly
- Monitoring Guide – Track retry metrics in production
- Adapters Guide – Configure adapter-level retries as fallback
Build resilient systems that handle failure gracefully.
