Failure Runbooks
These runbooks describe the implemented v0 operator surface. Metrics listed in
operator-metrics.md as “not currently emitted” are intentionally not used
here.
Health Endpoint Contract
The admin health endpoints return 503 when the underlying subsystem is in a page-worthy state, not just 200 with counters the operator has to interpret. The contract is:
- /healthz → 503 when storage is unhealthy or DuckLake is required but unavailable.
- /api/admin/health/storage → mirrors /healthz (same is_ready predicate).
- /api/admin/health/maintenance → 503 when any job has consecutive failures ≥ 3 (matches canardstack_maintenance_consecutive_failures{job} >= 3).
- /api/admin/health/queries → 503 when storage is not ready (queries depend on DuckDB). Body includes admission state.
- /api/admin/health/ingest → always 200; check queue gauges for backpressure and the raw_spool object for restart replay backlog. Body includes the freshness-budget projection.
Where a step below says “check /api/admin/health/…”, the expected paging status is the HTTP code, not just the body.
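A monitoring script can act on this contract by paging on the status code alone. The sketch below is illustrative only: the classify helper and its return shape are hypothetical, and collecting the status codes with an HTTP client is left out so the logic can be read on its own. Only the endpoint paths come from the contract above.

```python
# Endpoint paths are taken from the contract above; the function name and
# return shape are hypothetical illustrations.
HEALTH_ENDPOINTS = (
    "/healthz",
    "/api/admin/health/storage",
    "/api/admin/health/maintenance",
    "/api/admin/health/queries",
    "/api/admin/health/ingest",
)


def classify(status_by_endpoint):
    """Return the endpoints that should page, based on HTTP status alone.

    A 503 is treated as page-worthy. Endpoints with no collected status are
    reported separately so the caller can decide how to treat an unreachable
    admin surface.
    """
    paging = [e for e in HEALTH_ENDPOINTS if status_by_endpoint.get(e) == 503]
    missing = [e for e in HEALTH_ENDPOINTS if e not in status_by_endpoint]
    return {"page": paging, "missing": missing}
```

Because /api/admin/health/ingest always returns 200, it never appears in the page list; its body and the queue gauges still have to be inspected separately.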
1. Seals Are Stuck
Symptoms
- canardstack_maintenance_consecutive_failures{job="seal"} rises.
- canardstack_observed_freshness_lag_seconds rises.
- Query freshness gets worse even though ingest is accepting data.
- Maintenance logs show seal/checkpoint failures.
Metrics
- canardstack_maintenance_runs_total{job="seal",status="error"}.
- canardstack_maintenance_failures_total{job="seal"}.
- canardstack_maintenance_consecutive_failures{job="seal"}.
- canardstack_ingest_inflight_bytes{storage_signal}.
- canardstack_observed_freshness_lag_seconds.
- canardstack_ingest_to_query_lag_seconds{table}.
- canardstack_projected_visibility_seconds.
- canardstack_seal_ewma_bytes_per_second.
- canardstack_storage_physical_bytes{table="all"}.
Immediate Mitigation
- Check /api/admin/health/storage and /api/admin/health/maintenance.
- Trigger POST /api/admin/maintenance/seal.
- If the seal still fails, reduce upstream exporter concurrency or batch volume.
- Watch canardstack_ingest_requests_total{status=~"429|503"} and queue gauges.
- If storage health is unsafe, keep returning 503 for ingest until the dependency recovers.
Safe Degradation
- Keep existing queries available if they do not increase storage pressure.
- Prefer 429 over accepting more data into full queues.
- Heavy query admission should degrade before the seal admission is starved. If query load still contends with the seal admission, lower CANARDSTACK_QUERY_CONCURRENCY while preserving at least one heavy slot after seal and cheap-query reservations.
Escalation
- Inspect seal errors for object storage auth, network, or catalog lock contention.
- Add storage capacity if disk is the immediate risk.
- If the seal cannot be recovered, snapshot diagnostics and prepare to restore from the last known good catalog backup.
2. DuckDB Query OOM Takes Down The Query Role
Symptoms
- Query role restarts.
- In-flight Grafana or compatibility queries fail with 503.
- System logs show DuckDB allocation failure or process OOM kill.
- Ingest may remain healthy if process roles are isolated; otherwise all service endpoints restart.
Metrics
- canardstack_query_requests_total{status="503"}.
- canardstack_query_timeouts_total.
- canardstack_query_rejections_total.
- /api/admin/health/queries for active, limit, and admission counts.
- canardstack_query_admission_rejections_total.
- canardstack_query_admission_reductions_total.
Immediate Mitigation
All query admission knobs are env vars applied at boot; there is no hot-reload endpoint. Steps 2 and 3 require an operator-driven restart.
- Restart the query role if the supervisor has not already done so.
- Lower query memory by 50%: set CANARDSTACK_QUERY_MEMORY_LIMIT (e.g. 256MiB) and restart.
- Lower global query concurrency, keeping it greater than CANARDSTACK_SEAL_ADMISSION_CAPACITY + CANARDSTACK_CHEAP_QUERY_ADMISSION_CAPACITY; with defaults, use CANARDSTACK_QUERY_CONCURRENCY=3 or higher.
- Lower heavy degraded capacity only if needed: CANARDSTACK_HEAVY_QUERY_DEGRADED_CAPACITY=1.
- Reduce compatibility query traffic from Grafana or other clients.
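The concurrency constraint in the steps above can be checked before restarting. This is a minimal sketch, assuming that the heavy-query slots are whatever remains after the seal and cheap-query reservations are subtracted from the global concurrency; the function names are hypothetical, and the capacities in the usage note are illustrative values rather than confirmed defaults.

```python
def heavy_slots(query_concurrency, seal_capacity, cheap_capacity):
    """Slots left for heavy queries after seal and cheap-query reservations."""
    return query_concurrency - seal_capacity - cheap_capacity


def check_query_concurrency(query_concurrency, seal_capacity, cheap_capacity):
    """Raise ValueError if the proposed concurrency leaves no heavy slot.

    Assumption of this sketch: CANARDSTACK_QUERY_CONCURRENCY must be strictly
    greater than the sum of the two reserved capacities.
    """
    remaining = heavy_slots(query_concurrency, seal_capacity, cheap_capacity)
    if remaining < 1:
        raise ValueError(
            f"CANARDSTACK_QUERY_CONCURRENCY={query_concurrency} leaves "
            f"{remaining} heavy slot(s); it must exceed "
            f"{seal_capacity + cheap_capacity}"
        )
    return remaining
```

For example, with illustrative reservations of 1 seal slot and 1 cheap-query slot, check_query_concurrency(3, 1, 1) returns 1, while a concurrency of 2 raises ValueError.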
Safe Degradation
- Keep ingest running.
- Return 503 for query APIs while the query role restarts.
- Keep admin health endpoints available if they do not need DuckDB.
Escalation
- Move query execution to a separate process role if not already isolated.
- Add stricter time-range caps for the failing query class. (Per-query-shape rejection rules are not implemented in v0; tightening must happen at the reverse proxy / Grafana datasource until a server-side allow-list lands.)
- Add preflight estimates or query templates that avoid full scans.
3. Object Storage 5xx Storm Causes Seal Backlog
Symptoms
- DuckDB Arrow appends, DuckLake commits, or seal/checkpoint cycles fail with object storage 5xx.
- Freshness lag grows.
- Maintenance retries increase.
- Ingest may begin returning 503 if storage health is unsafe.
Metrics
- canardstack_maintenance_failures_total.
- canardstack_maintenance_consecutive_failures.
- canardstack_ducklake_checkpoint_runs_total{status="error"}.
- canardstack_ducklake_checkpoint_supported.
- canardstack_ingest_requests_total{status="503"}.
- canardstack_ingest_inflight_bytes{storage_signal}.
- canardstack_ingest_to_query_lag_seconds{table}.
- canardstack_projected_visibility_seconds.
Immediate Mitigation
- Confirm whether the object storage incident is regional or credential-related.
- Reduce upstream exporter concurrency.
- Trigger POST /api/admin/maintenance/seal after the storage incident clears.
- Let admission control return 429 while queue memory is full but storage is healthy.
- Return 503 if accepting data would endanger memory, catalog, or disk.
Safe Degradation
- Serve queries from already committed data.
- Surface stale freshness watermarks in Grafana.
- Prioritize seal recovery. Snapshot expiration and old-file cleanup are safe retention controls; physical file compaction remains disabled until proven stable.
Escalation
- Fail over to a configured alternate bucket only if that path was tested.
- Increase catalog capacity only as a temporary measure.
- Preserve logs and metrics for DuckLake/storage vendor support.
4. Ingest Overload Triggers Sustained 429 Or 503
Symptoms
- Exporters report retryable failures.
- 429 or 503 rate remains high for more than 10 minutes.
- Ingest queues stay above 85%.
- CPU, memory, catalog, or storage is saturated.
Metrics
- canardstack_ingest_requests_total{status}.
- canardstack_raw_spool_records_total{status="full"}.
- canardstack_raw_spool_pending_records.
- canardstack_raw_spool_pending_bytes.
- canardstack_ingest_inflight_bytes{storage_signal}.
- canardstack_observed_freshness_lag_seconds.
- canardstack_ingest_freshness_budget_rejections_total.
- canardstack_projected_visibility_seconds.
- canardstack_ingest_records_total{request_kind}.
- canardstack_http_connection_errors_total.
Immediate Mitigation
- Identify the limiting resource: CPU, memory, catalog, object storage, freshness debt, or maintenance backlog.
- Increase exporter batch interval or reduce exporter concurrency if controlled by the operator.
- Temporarily drop lower-priority signals upstream if configured in the exporter or Collector.
- Increase process memory or add a larger instance if CPU/memory bound.
- If freshness_budget_exceeded rises, prioritize seal recovery before increasing ingest limits.
- Keep returning retryable failures until queues return below 70%.
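The last step describes hysteresis: rejections continue until queue fill drops below a lower bound, which is not the level at which the overload was noticed. The sketch below illustrates that pattern only. The RejectionGate class is hypothetical, and using the 85% symptom level as the trip point is an assumption of this sketch, not a description of the server's admission logic.

```python
class RejectionGate:
    """Hysteresis on queue fill: trip above `high`, release only below `low`."""

    def __init__(self, high=0.85, low=0.70):
        # Assumed thresholds: 0.85 from the symptom list, 0.70 from the
        # recovery step. Neither is confirmed as a server-side setting.
        if not 0.0 <= low < high <= 1.0:
            raise ValueError("expected 0 <= low < high <= 1")
        self.high = high
        self.low = low
        self.rejecting = False

    def update(self, queue_fill):
        """Feed the latest queue fill ratio (0.0 to 1.0); return True to reject."""
        if self.rejecting:
            if queue_fill < self.low:
                self.rejecting = False
        elif queue_fill > self.high:
            self.rejecting = True
        return self.rejecting
```

The gap between the two thresholds keeps the system from flapping between accepting and rejecting while queues hover near a single limit.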
Safe Degradation
- Prefer 429 when the system is healthy but full. raw_spool_full means the local raw spool hit its configured byte budget before transform. freshness_budget_exceeded means the request was rejected before raw-spool append to protect query visibility.
- Use 503 when dependencies are unhealthy or the raw spool is unavailable.
- Do not accept data that would exceed memory bounds.
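The choice between 429 and 503 described above can be written as a small decision function. This is a sketch under stated assumptions: the function name, the boolean inputs, the order of the checks, and the dependency_unhealthy and accepted labels are hypothetical; the raw_spool_full, raw_spool_unavailable, and freshness_budget_exceeded reasons and the 202 accept status are the ones this runbook mentions.

```python
def ingest_response(dependencies_healthy, raw_spool_available,
                    raw_spool_full, freshness_budget_exceeded):
    """Map ingest state to (status, reason) following the rules above.

    The ordering is an assumption of this sketch: unhealthy states are
    evaluated first so that a full but broken system reports 503.
    """
    if not dependencies_healthy:
        return 503, "dependency_unhealthy"  # hypothetical reason label
    if not raw_spool_available:
        return 503, "raw_spool_unavailable"
    if freshness_budget_exceeded:
        # Rejected before raw-spool append to protect query visibility.
        return 429, "freshness_budget_exceeded"
    if raw_spool_full:
        # The spool hit its configured byte budget before transform.
        return 429, "raw_spool_full"
    return 202, "accepted"  # hypothetical label for the normal path
```

Checking health before fullness means exporters see 503 when retrying will not help until a dependency recovers, and 429 when the system is healthy and only needs the sender to slow down.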
Escalation
- Split ingest/query/maintenance roles.
- Increase batch sizes if inserts are too small.
- Lower retention temporarily only if storage pressure is part of the overload.
5. Restart Replay Backlog Is Draining
Symptoms
- A process restarts after it has already returned 202 responses.
- /api/admin/health/ingest shows raw_spool.pending_records > 0.
- canardstack_raw_spool_replayed_records_total{status="ok"} increases during startup.
- Queues may temporarily rise as replayed records re-enter normal admission.
Metrics
- canardstack_raw_spool_pending_records.
- canardstack_raw_spool_pending_bytes.
- canardstack_raw_spool_replayed_records_total{status}.
- canardstack_raw_spool_checkpointed_records_total{reason="storage_committed"}.
- canardstack_ingest_inflight_bytes{storage_signal}.
- canardstack_storage_logical_rows{table}.
Immediate Mitigation
- Check /api/admin/health/ingest; raw_spool.pending_records should fall after replay and seal.
- Check /metrics for replay failures. Any status="failed" increase is page-worthy.
- Trigger POST /api/admin/maintenance/seal if the scheduler is disabled or lagging.
- Watch canardstack_storage_logical_rows{table} and canardstack_raw_spool_checkpointed_records_total.
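The checks above reduce to two questions: did any replay fail, and is the pending count falling? The sketch below summarizes a window of samples an operator has already collected. The function name and the drained, draining, stalled, and unknown labels are hypothetical; only the rule that any failed-replay increase is page-worthy comes from the steps above.

```python
def replay_status(pending_samples, failed_before, failed_after):
    """Summarize a replay drain from sampled values.

    pending_samples: raw_spool.pending_records readings, oldest first.
    failed_before / failed_after: readings of the replayed-records counter
    with status="failed" at the start and end of the window.
    """
    if failed_after > failed_before:
        return "page"  # any increase in failed replays is page-worthy
    if not pending_samples:
        return "unknown"
    if pending_samples[-1] == 0:
        return "drained"
    if len(pending_samples) < 2:
        return "unknown"  # one non-zero reading cannot show a trend
    if pending_samples[-1] < pending_samples[0]:
        return "draining"
    return "stalled"  # hypothetical label: consider triggering a seal
```

A stalled result is the case where triggering the seal manually is worth trying, since the scheduler may be disabled or lagging.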
Safe Degradation
- Keep accepting ingest only if spool and queue pressure remain below configured bounds.
- Use 429 raw_spool_full when the raw spool is full.
- Use 503 raw_spool_unavailable when the spool cannot be opened, written, or append-synced.
6. Retention Cleanup Fails And Storage Usage Keeps Growing
Symptoms
- Physical storage bytes keep increasing after the retention horizon.
- Retention job fails or times out.
- Old files remain after expected cleanup.
Metrics
- canardstack_storage_physical_bytes{table="all"}.
- canardstack_storage_logical_rows{table}.
- canardstack_maintenance_failures_total{job="retention"}.
- canardstack_maintenance_consecutive_failures{job="retention"}.
Immediate Mitigation
- Check /api/admin/health/storage and compare active DuckLake data-file counts with logical rows.
- If active file counts are high but logical history should be preserved, run POST /api/admin/maintenance/checkpoint/run. This triggers DuckLake physical maintenance without applying telemetry retention deletes.
- If old telemetry should be removed, run POST /api/admin/maintenance/retention/dry-run, check the returned table counts, then run POST /api/admin/maintenance/retention/run.
- If storage is near full, reduce upstream ingest or add storage before accepting more data.
- Shorten retention only after confirming the run succeeds and physical bytes fall.
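The dry-run-then-run sequence for retention can be scripted so the destructive call never happens without a review step. This is a minimal sketch: the run_retention helper and its post and approve hooks are hypothetical, and the shape of the dry-run response body is deliberately not assumed. Only the two endpoint paths come from the steps above.

```python
def run_retention(post, approve):
    """Dry-run retention, let the caller inspect the result, then run it.

    post: callable taking a path and returning the parsed response body.
    approve: callable taking the dry-run body and returning True to proceed.
    Both are hypothetical hooks supplied by the operator's tooling.
    """
    preview = post("/api/admin/maintenance/retention/dry-run")
    if not approve(preview):
        # Stop here: nothing has been deleted yet.
        return {"ran": False, "preview": preview}
    result = post("/api/admin/maintenance/retention/run")
    return {"ran": True, "preview": preview, "result": result}
```

Passing the HTTP call in as a function keeps the sequence testable without a live server and lets the approve hook be an interactive prompt or an automated check on the returned table counts.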
Safe Degradation
- Keep ingest running if storage headroom is sufficient.
- If bucket or disk quota is near exhaustion, return 503 for ingest before writes fail unpredictably.
- Keep queries bounded to retained dates.
Escalation
- Add storage capacity.
- Investigate whether partition deletes are producing delete files instead of reclaiming physical files.
- If needed, migrate to physical day tables behind stable views.