Operations
Failure Modes
Read
Reads route through the gateway, but a gateway outage does not take your queries dark. The Python and Go SDKs fall through to Turbopuffer direct when the gateway is unreachable, so Turbopuffer-compatible queries keep serving rather than failing, minus the document cache, search history, and Layer’s query enhancements (see Client fall-through below). Layer-only read paths (document fetch, warm jobs, pipeline and UDF status, snapshots, and search history) fail fast, because they depend on gateway-owned cache, queue, history, and consistency state.
The document cache is stateless and can scale to zero with no disruption: document fetches fall through to origin (Turbopuffer, or S3 for snapshots) on a miss or cache outage, so a cache failure degrades latency, not availability.
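The read path above behaves like a cache-aside lookup in which a cache outage is handled exactly like a miss. The following is a minimal Python sketch of that idea, not the gateway's implementation: fetch_document, CacheUnavailable, and the dict and callable stand-ins for the cache and origin are all hypothetical names chosen for illustration.

```python
class CacheUnavailable(Exception):
    """Hypothetical error standing in for any document-cache outage."""


def fetch_document(doc_id, cache_get, origin_get):
    """Return (document, source); cache errors and misses both fall through to origin."""
    try:
        doc = cache_get(doc_id)
    except CacheUnavailable:
        doc = None  # treat an outage exactly like a miss
    if doc is not None:
        return doc, "cache"
    # Origin is the durable store (Turbopuffer, or S3 for snapshots).
    return origin_get(doc_id), "origin"


# Stand-ins for illustration only.
origin = {"doc-1": {"title": "hello"}}
warm_cache_get = {"doc-1": {"title": "hello"}}.get


def down_cache_get(doc_id):
    raise CacheUnavailable("cache scaled to zero")
```

With these stand-ins, a call through down_cache_get still returns the document, only with "origin" as the source. That is the latency-not-availability trade the paragraph describes.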
Write
Writes also fall through to Turbopuffer direct when the gateway is unreachable (again, see Client fall-through); the durable upstream still accepts the row, but the write skips document-cache warming and pipeline staging until the gateway returns.
Pipeline stop-writes
The primary failure mode for writes through a healthy gateway is Aerospike stop-writes during a multi-stage pipeline job: staged documents stay warm in the cache but carry no vector data yet, and once that data exceeds the Aerospike drive allocation the cache rejects further writes.
The pipeline does not stall. Each stage persists its chunk bodies to S3 before it touches the cache, and pipeline state lives in PostgreSQL, so the Aerospike write is best-effort: on stop-writes the gateway logs the skipped write and the stage still completes. Downstream chunk reads degrade to the S3 backing for as long as the cache is rejecting writes.
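The ordering described above (durable write first, cache write second and best-effort) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the gateway's code: StopWrites, FullCache, persist_chunk, run_stage, and the log message are hypothetical, and a plain dict stands in for S3.

```python
import logging

logger = logging.getLogger("pipeline_sketch")


class StopWrites(Exception):
    """Hypothetical error for a cache that is rejecting writes."""


class FullCache:
    """Stand-in cache that rejects every write, as a cache under stop-writes would."""

    def put(self, key, body):
        raise StopWrites(key)


def persist_chunk(key, body, s3, cache):
    """Write the chunk durably first, then warm the cache on a best-effort basis."""
    s3[key] = body  # durable write; a failure here should fail the stage
    try:
        cache.put(key, body)
        return "cached"
    except StopWrites:
        # The cache write is optional: log it and carry on.
        logger.warning("cache write skipped for %s (best-effort)", key)
        return "s3_only"


def run_stage(chunks, s3, cache):
    """Complete the stage regardless of cache health; report where each chunk landed."""
    return {key: persist_chunk(key, body, s3, cache) for key, body in chunks.items()}
```

Running run_stage against FullCache leaves every chunk in the S3 stand-in and marks it "s3_only", so the stage finishes and later reads can be served from the durable copy.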
Recovery is automatic. The Helm document cache restarts on stop-writes by default (documentCache.autoRestartOnStopWrites: true) and clears its Aerospike backing file on pod start (documentCache.storage.resetOnStart: true); the gateway reconnects in the background and refills the cache from S3 on demand. No pipeline work is lost, provided S3 and PostgreSQL stay healthy: they are the durable recovery boundary.
Operator signals:
- layer_aerospike_op_duration_seconds{status="aerospike_stop_writes"}: the stop-writes condition itself, the same series the dashboard charts.
- hevlayer_cache_cold_responses_total: reads being served from S3 backing instead of the cache while it recovers.
- hevlayer_document_cache_cold_starts_total and hevlayer_document_cache_cold_start_seconds: the demand-triggered reconnect-and-refill cycle after the cache restarts.
- Gateway warn logs "Aerospike chunk write failed (best-effort)" and "Aerospike chunk read failed; falling back to S3 backing".
Client fall-through
When the gateway is unreachable, the SDKs retry the call against Turbopuffer
directly for operations that need no Layer state — simple vector queries,
writes, and raw Turbopuffer-compatible methods (schema, metadata, namespace
listing). These calls succeed without the document cache, search history, or
Layer’s query enhancements, and set the perf fallback field to
turbopuffer_direct. Fall-through requires Turbopuffer credentials
(TURBOPUFFER_API_KEY, or WithTurbopufferAPIKey / turbopuffer_api_key);
without them the original gateway error propagates unchanged.
Fall-through is on by default. Disable it with fallback_to_turbopuffer=False
on AsyncHevlayer or WithFallbackToTurbopuffer(false) on the Go client. For
the exact list of which operations fall through and which fail fast, see
Client fall-through in the API introduction.
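The decision the SDKs make can be summarized as a small dispatch rule. The Python sketch below illustrates that rule only and is not the SDK's internals: call_with_fallthrough, GatewayUnreachable, and the needs_layer_state flag are hypothetical names, and the returned dict merely mimics the fallback field described above. TURBOPUFFER_API_KEY and the turbopuffer_direct value are taken from the text.

```python
import os


class GatewayUnreachable(Exception):
    """Hypothetical error for a gateway that cannot be reached."""


def call_with_fallthrough(gateway_call, direct_call, *, needs_layer_state,
                          fallback_enabled=True, api_key=None):
    """Try the gateway; if it is unreachable, retry direct when that is allowed."""
    api_key = api_key or os.environ.get("TURBOPUFFER_API_KEY")
    try:
        return {"result": gateway_call(), "fallback": None}
    except GatewayUnreachable:
        # Layer-only operations, disabled fall-through, or missing credentials:
        # the original gateway error propagates unchanged.
        if needs_layer_state or not fallback_enabled or not api_key:
            raise
        return {"result": direct_call(api_key), "fallback": "turbopuffer_direct"}
```

Re-raising inside the except block keeps the original gateway error intact, which matches the behavior described for calls made without credentials.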