Monitoring Operations for AINL
Components
- MetricsCollector: in-memory histogram/counter metrics.
- CostTracker: SQLite DB (~/.ainl/costs.db) storing run costs and budget.
- BudgetPolicy: enforces monthly limits and alerts.
- HealthStatus: liveness/readiness probes.
- Dashboard: web UI at http://localhost:8080
Setup
- Ensure the intelligence/monitor package is importable.
- Run python -m intelligence.monitor.dashboard.server to start the dashboard.
- Optionally configure Telegram alerts:
  export TELEGRAM_BOT_TOKEN=your_bot_token
  export TELEGRAM_CHAT_ID=your_chat_id
- Set the budget in ~/.ainl/costs.db (table budget):
  INSERT INTO budget (monthly_limit_usd, alert_threshold_pct, throttle_threshold_pct) VALUES (20.0, 0.8, 0.95);
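The same budget row can also be written from Python with the standard sqlite3 module. This is a minimal sketch: the column names come from the INSERT statement above, while the set_budget helper and the CREATE TABLE IF NOT EXISTS statement (including the column types) are assumptions for illustration. The real schema is owned by CostTracker and may differ.

```python
import os
import sqlite3

# Path taken from this document; the directory must already exist.
DEFAULT_DB = os.path.expanduser("~/.ainl/costs.db")


def set_budget(db_path=DEFAULT_DB, monthly_limit_usd=20.0,
               alert_threshold_pct=0.8, throttle_threshold_pct=0.95):
    """Insert one budget row and return the newest row (hypothetical helper)."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            # Assumed schema: only the column names are taken from the docs.
            conn.execute(
                "CREATE TABLE IF NOT EXISTS budget ("
                "monthly_limit_usd REAL, "
                "alert_threshold_pct REAL, "
                "throttle_threshold_pct REAL)"
            )
            conn.execute(
                "INSERT INTO budget (monthly_limit_usd, alert_threshold_pct, "
                "throttle_threshold_pct) VALUES (?, ?, ?)",
                (monthly_limit_usd, alert_threshold_pct, throttle_threshold_pct),
            )
        return conn.execute(
            "SELECT monthly_limit_usd, alert_threshold_pct, "
            "throttle_threshold_pct FROM budget ORDER BY rowid DESC LIMIT 1"
        ).fetchone()
    finally:
        conn.close()
```

The thresholds are stored as fractions of the monthly limit, so 0.8 means an alert at 80% of the limit.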
Integrating with Workflows
Before each LLM call, check the budget:
from intelligence.monitor.budget_policy import BudgetPolicy
policy = BudgetPolicy()
result = policy.check_and_enforce(run_id)
if result == "throttled":
    # Reduce max_tokens or skip
    pass
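How the two thresholds could translate into a decision is sketched below. This illustrates the idea and is not the BudgetPolicy implementation: the function name classify_spend and the "ok" and "alert" return values are assumptions. Only "throttled" and the threshold fractions (0.8 and 0.95 of the monthly limit) appear in this document.

```python
def classify_spend(spent_usd, monthly_limit_usd,
                   alert_threshold_pct=0.8, throttle_threshold_pct=0.95):
    """Map month-to-date spend to a budget state (hypothetical helper)."""
    if monthly_limit_usd <= 0:
        raise ValueError("monthly_limit_usd must be positive")
    usage = spent_usd / monthly_limit_usd  # fraction of the limit used
    if usage >= throttle_threshold_pct:
        return "throttled"
    if usage >= alert_threshold_pct:
        return "alert"  # assumed state name, not confirmed by the docs
    return "ok"         # assumed state name, not confirmed by the docs
```

The throttle threshold is checked first so that a run past both thresholds is reported as throttled rather than merely alerted.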
After the LLM adapter returns usage, record the cost:
from intelligence.monitor.cost_tracker import CostTracker
ct = CostTracker()
ct.add_cost(run_id, provider, model, usage.prompt_tokens, usage.completion_tokens, adapter.estimate_cost(...))
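The arguments of adapter.estimate_cost are not shown here. Purely as an illustration of what such an estimate might compute, a per-token calculation could look like the sketch below. The function name, its parameters, and the per-1,000-token pricing model are all assumptions, and callers must supply their provider's actual rates.

```python
def estimate_cost_usd(prompt_tokens, completion_tokens,
                      prompt_usd_per_1k, completion_usd_per_1k):
    """Estimate the cost of one call in USD (hypothetical helper).

    Prices are passed in per 1,000 tokens; no real provider rates
    are built in.
    """
    prompt_cost = (prompt_tokens / 1000.0) * prompt_usd_per_1k
    completion_cost = (completion_tokens / 1000.0) * completion_usd_per_1k
    return prompt_cost + completion_cost
```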
Metrics
Prometheus-compatible exposition at /api/metrics. Key metrics:
- ainl_workflow_runs_total
- ainl_workflow_success
- ainl_node_calls_total{node_type}
- ainl_llm_tokens_total{provider,model}
- ainl_llm_cost_total{provider,model}
- ainl_budget_spent_usd
- ainl_budget_usage_pct
Alerts
- Telegram: configured via env vars; sent when the alert threshold is exceeded or throttling starts.
- Email: stub for SMTP configuration.
- Webhook: generic POST to URL with payload.
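A generic webhook alert of the kind listed above can be sketched with the standard library alone. The helper names and the payload fields are assumptions; the payload format actually sent by the monitor is not specified in this document.

```python
import json
import urllib.request


def build_webhook_request(url, payload):
    """Build a JSON POST request for an alert (hypothetical helper)."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_webhook(url, payload, timeout=5.0):
    """Send the alert and return the HTTP status code.

    Raises urllib.error.URLError (or its subclass HTTPError) on failure.
    """
    request = build_webhook_request(url, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Separating request construction from sending keeps the payload easy to inspect in tests without any network access.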
Troubleshooting
- DB locked? Use connection pooling or ensure a single writer.
- High memory? MetricsCollector holds all metrics in memory; consider adding a TTL.
- Adapter down? The ready health check reflects adapter connectivity via custom logic.
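One way to ensure a single writer, as suggested for the locked-database case, is to funnel all access through one shared connection guarded by a lock. The SingleWriter class below is a hypothetical sketch, not part of CostTracker. It also sets the sqlite3 timeout so that a briefly locked database is retried instead of failing immediately.

```python
import sqlite3
import threading


class SingleWriter:
    """Serialize all access to one SQLite connection (hypothetical helper)."""

    def __init__(self, db_path, busy_timeout_s=5.0):
        self._lock = threading.Lock()
        # check_same_thread=False allows use from several threads;
        # the lock below guarantees only one of them runs at a time.
        self._conn = sqlite3.connect(
            db_path, timeout=busy_timeout_s, check_same_thread=False
        )

    def write(self, sql, params=()):
        with self._lock:
            with self._conn:  # commits on success, rolls back on error
                self._conn.execute(sql, params)

    def read(self, sql, params=()):
        with self._lock:
            return self._conn.execute(sql, params).fetchall()

    def close(self):
        self._conn.close()
```

This only coordinates threads inside one process. Separate processes writing to the same file still rely on SQLite's own locking and the timeout.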
Automation
Add to the existing self_monitor.py:
- Daily: reconcile costs with provider statements.
- Hourly: check health endpoints and restart the dashboard if it is dead.
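The hourly check could be sketched as follows. The function names are hypothetical, the restart action is passed in by the caller, and the health endpoint path is not specified in this document, so the URL is a parameter. The opener argument exists only so the probe can be replaced in tests.

```python
import urllib.error
import urllib.request


def is_healthy(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the URL answers with a 2xx status (hypothetical helper)."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


def check_and_restart(url, restart, opener=urllib.request.urlopen):
    """Call restart() if the probe fails; return True when a restart ran."""
    if is_healthy(url, opener=opener):
        return False
    restart()
    return True
```

In self_monitor.py, restart might launch python -m intelligence.monitor.dashboard.server through subprocess; scheduling the hourly call is left to whatever mechanism that script already uses.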
