See it in action on the hev-shop demo store.

API

Query & Fetch

Stable reads

Layer uses the same query syntax as upstream but defaults to stable reads. Every response carries an x-layer-stable-as-of watermark: the point the upstream index is known to be caught up to. A query issued right after an upsert never returns partially-indexed rows and never 429s under write pressure, so derived views like facets and counts stay in sync with your index.

HTTP/1.1 200 OK
x-layer-stable-as-of: 1715600400000

{"rows":[{"id":"asin-B08N5WRWNW","$dist":0.42,"title":"..."}]}

This is achieved by:

  1. Queries run at consistency=eventual upstream, so they never block on indexing.
  2. A control loop polls each registered namespace’s index.status and records the latest status plus, when stable, a watermark equal to poll_start - safety_margin. Cold or updating namespaces use the fast polling interval; stable namespaces back off to the stable interval until the next write re-arms the fast tier.
  3. Per-query decision:
    • Updating → inject a hidden _hevlayer_upserted_at <= watermark predicate so the read never sees partially-indexed rows.
    • Stable or Unknown → run without the predicate. The upstream index is caught up (or no contrary evidence exists).
  4. On a 429 to an unfiltered query, Layer retries once with the watermark filter forced on.

Responses report x-layer-stable-as-of (epoch ms) when the watcher has a watermark for the namespace. It is omitted on a cold-start gateway that has not yet observed a stable poll. Paginated single-query responses return the next page token in x-layer-next-cursor; pass that value back as cursor in the next request body.

Stable-read behavior is set per namespace with the consistency field on the Index CRD. Two gateway tunables control the watcher:

VariableDefaultPurpose
CONSISTENCY_POLL_INTERVAL_MS1000Fast cadence for cold and updating namespaces.
CONSISTENCY_STABLE_POLL_INTERVAL_MS60000Slow cadence for namespaces last observed stable. Set equal to the fast interval to restore one uniform cadence.
CONSISTENCY_SAFETY_MARGIN_MS500Cushion between poll time and watermark to cover in-flight upserts.

Query by id

Pass nearest_to_id in place of vector to rank by stored document vectors instead of a raw query vector — exactly one of the two is required. nearest_to_id takes an array of document ids: the gateway resolves each id’s vector (document cache first, Turbopuffer on miss with a cache backfill) and averages them component-wise into a single centroid, then ranks nearest neighbors to that centroid. Pass one id to rank by a single document; pass several to get “more like these” over a set of seeds.

response = await client.query_namespace("products", {
    "nearest_to_id": ["asin-B08N5WRWNW", "asin-B07PXGQC1Q"],
    "top_k": 10,
    "include_attributes": ["title", "category"],
})
response, err := client.QueryNamespace(ctx, "products", &hevlayer.QueryRequest{
    NearestToID:       []string{"asin-B08N5WRWNW", "asin-B07PXGQC1Q"},
    TopK:              10,
    IncludeAttributes: []string{"title", "category"},
})
const response = await client.queryNamespace("products", {
  nearest_to_id: ["asin-B08N5WRWNW", "asin-B07PXGQC1Q"],
  top_k: 10,
  include_attributes: ["title", "category"],
});
curl -X POST "$LAYER_GATEWAY_URL/v2/namespaces/products/query" \
  -H "Authorization: Bearer $LAYER_GATEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "nearest_to_id": ["asin-B08N5WRWNW", "asin-B07PXGQC1Q"],
    "top_k": 10,
    "include_attributes": ["title", "category"]
  }'
OutcomeStatus
Every id resolved (cache or origin)200, ranked results
Any id has no stored vector anywhere404 (names the missing ids)
nearest_to_id empty, or both/neither of vector / nearest_to_id422

The centroid is an unweighted mean, so seed ids contribute equally regardless of how many you pass. All resolved vectors share the namespace’s dimensionality, so no reconciliation is needed across seeds. This fuses the seeds into one ranking; to run several independent rankings in a single request, see multi-query.

Multi-query

nearest_to_id fuses several seeds into a single ranking. To run several independent queries in one round trip, each with its own ranking, post a queries array. The response is a parallel results array: one ranked result set per query, in request order.

batch = await client.multi_query_turbopuffer_namespace("products", {
    "queries": [
        {"rank_by": ["vector", "ANN", [0.1, 0.2, 0.3]], "top_k": 10},
        {"rank_by": ["title", "BM25", "wireless earbuds"], "top_k": 10},
    ],
})
# batch.results[0].rows ranked by vector; batch.results[1].rows by text
batch, err := client.MultiQueryTurbopufferNamespace(ctx, "products",
    &hevlayer.TurbopufferMultiQueryRequest{
        Queries: []hevlayer.TurbopufferQueryRequest{
            {"rank_by": []any{"vector", "ANN", []float64{0.1, 0.2, 0.3}}, "top_k": 10},
            {"rank_by": []any{"title", "BM25", "wireless earbuds"}, "top_k": 10},
        },
    })
const batch = await client.multiQueryTurbopufferNamespace("products", {
  queries: [
    { rank_by: ["vector", "ANN", [0.1, 0.2, 0.3]], top_k: 10 },
    { rank_by: ["title", "BM25", "wireless earbuds"], top_k: 10 },
  ],
});
// batch.results[0].rows ranked by vector; batch.results[1].rows by text
curl -X POST "$LAYER_GATEWAY_URL/v2/namespaces/products/query?stainless_overload=multiQuery" \
  -H "Authorization: Bearer $LAYER_GATEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "queries": [
      {"rank_by": ["vector", "ANN", [0.1, 0.2, 0.3]], "top_k": 10},
      {"rank_by": ["title", "BM25", "wireless earbuds"], "top_k": 10}
    ]
  }'

All legs in a non-fused batch share one x-layer-stable-as-of value. A leg may use native rank_by, or the Layer vector / nearest_to_id single-query shape; nearest_to_id is resolved before the leg is sent upstream. Batches must contain 2 to 16 legs. cursor is rejected at the top level and per leg because pagination is single-query only.

When rerank_by is present, Layer treats the request as an upstream fused query and passes the body through unchanged.

Reach for multi-query when you genuinely need N rankings — distinct user queries batched into one round trip, or hybrid retrieval fused upstream with RRF. Reach for nearest_to_id when many seeds should collapse into one “more like these” ranking. To get typo-tolerant text search without building the fused query yourself, see hybrid text fusion.

Hybrid text fusion

BM25 misses typos and morphological variants; fuzzy matching alone loses the relevance signal BM25 provides. HybridText runs both in one request: the gateway tokenizes your input string, expands it into one BM25 leg plus one fuzzy leg per token, and Turbopuffer fuses the legs with reciprocal rank fusion (RRF). One expression in, typo-tolerant ranked results out.

HybridText is a Layer-only rank_by spelling on the existing query route — no new endpoint, no client changes beyond the expression. The gateway tokenizes with alyze, Turbopuffer’s own open-source tokenizer and the same code that segmented your text at index time, so query tokens match index terms by construction.

response = await client.query_namespace("support-tickets", {
    "rank_by": ["content", "HybridText", "conection timout kubernets"],
    "top_k": 10,
    "filters": ["tenant", "Eq", "t-42"],
    "include_attributes": ["content", "title"],
})
response, err := client.QueryNamespace(ctx, "support-tickets", &hevlayer.QueryRequest{
    RankBy:            []any{"content", "HybridText", "conection timout kubernets"},
    TopK:              10,
    Filters:           []any{"tenant", "Eq", "t-42"},
    IncludeAttributes: []string{"content", "title"},
})
const response = await client.queryNamespace("support-tickets", {
  rank_by: ["content", "HybridText", "conection timout kubernets"],
  top_k: 10,
  filters: ["tenant", "Eq", "t-42"],
  include_attributes: ["content", "title"],
});
curl -X POST "$LAYER_GATEWAY_URL/v2/namespaces/support-tickets/query" \
  -H "Authorization: Bearer $LAYER_GATEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "rank_by": ["content", "HybridText", "conection timout kubernets"],
    "top_k": 10,
    "filters": ["tenant", "Eq", "t-42"],
    "include_attributes": ["content", "title"]
  }'

An optional fourth tuple element tunes the expansion. Defaults:

["content", "HybridText", "conection timout kubernets", {
  "fuzziness": "auto",
  "rank_constant": 60,
  "per_leg_limit": null
}]
OptionDefaultMeaning
fuzziness"auto"Max edit distance for fuzzy legs. "auto" maps per token: 1 for tokens of 5 characters or fewer, 2 for longer tokens. Fixed 0, 1, or 2 applies to all tokens.
rank_constant60Turbopuffer’s RRF constant, passed through verbatim. Integer > 0.
per_leg_limitclamp(5 × top_k, 50, 200)How deep each leg retrieves before fusion. Integer > 0.
threadsIndex.spec.scan.threads, else 8Maximum concurrent upstream requests when the gateway scatter/gathers the expansion across a sharded namespace — the same fan-out control as scans. Clamped to active shards. No effect on unsharded namespaces, where the expansion is a single fused upstream call.

Tokenization

The input string becomes tokens under a fixed, documented policy:

  1. Split on Unicode (UAX #29) word boundaries and lowercase, using alyze — the code behind Turbopuffer’s production word_v4 tokenizer. Punctuation-only tokens never survive the split.
  2. Drop tokens shorter than 2 characters.
  3. Dedupe.
  4. Cap at 15 tokens (15 fuzzy legs + 1 BM25 leg = 16, the upstream subquery limit). Tokens cut by the cap are counted in tokens_dropped.

Stemming, stopword removal, and language detection are not applied. The input must yield at least one token; one token is fine (that is still two legs, the RRF minimum).

Response

Results are the upstream RRF-fused list. A hybrid block echoes the effective expansion so defaults are never invisible:

{
  "rows": [
    {
      "id": "ticket-4117",
      "$score": 0.0639,
      "content": "...",
      "title": "Connection timeout on Kubernetes ingress"
    }
  ],
  "hybrid": {
    "tokens": ["conection", "timout", "kubernets"],
    "tokens_dropped": 0,
    "fuzziness": "auto",
    "rank_constant": 60,
    "legs": 4,
    "per_leg_limit": 50
  }
}
FieldMeaning
$scoreUpstream RRF score. Comparable within a response, not across requests — do not threshold on it.
tokensTokens that produced fuzzy legs, post-policy.
tokens_droppedTokens removed by the 15-token cap (not by the length or punctuation rules).
legsTotal subqueries sent upstream (fuzzy legs + 1 BM25 leg).

The hybrid block appears only on HybridText responses. On sharded namespaces it also reports the effective threads fan-out width. Requests without a HybridText expression, including native multi-query + rerank_by bodies, keep their upstream-shaped responses byte-for-byte.

Semantics

  • One round trip. The expansion is a single upstream multi-query fused by rerank_by: ["RRF", ...]. Layer implements no fusion math and does not reorder results.
  • One consistency cut. Request-level filters are replicated to every leg, and the stable-read watermark predicate is injected into every leg from a single read — all legs see the same cut. Responses carry x-layer-stable-as-of as usual.
  • All-or-nothing. Upstream multi-query has no partial results; any leg failing fails the request.
  • Replay as a unit. The query logs to search history as one entry carrying the HybridText expression, so replaying it reproduces the whole expansion.

Validation

All return 422:

ConditionWhy
Input yields zero tokens under the policyNothing to expand.
cursor presentFused scores do not form the monotone bands pagination relies on.
HybridText inside a queries arrayThe expansion is one multi-query deep by construction.
fuzziness not in "auto" | 0 | 1 | 2; rank_constant ≤ 0; per_leg_limit ≤ 0; threads < 1Out of range.

To let the gateway pick between hybrid text and semantic retrieval per query, see query routing.

Query routing

Real search boxes receive both "timout" and "why do pods lose their connection during deploys". The first wants hybrid text fusion; the second wants semantic retrieval — lexical legs add noise on long conversational input, and ANN underperforms on short identifier-shaped tokens. Auto is a Layer-only rank_by spelling that makes that call per query, so the branch doesn’t live ad hoc in your application code.

The gateway never embeds. The route is chosen from the shape of the input alone, and when the chosen route needs a query vector the request didn’t include, the response is the routing decision instead of results — your application embeds and re-issues with the route forced. Short keyword traffic executes immediately and never pays for an embedding.

response = await client.query_namespace("support-tickets", {
    "rank_by": ["content", "Auto", user_input],
    "top_k": 10,
    "filters": ["tenant", "Eq", "t-42"],
})

if not response.routing.executed:
    vector = await embed(user_input)  # your model or API
    response = await client.query_namespace("support-tickets", {
        "rank_by": ["content", "Auto", user_input, {
            "route": response.routing.route,
            "vector": vector,
        }],
        "top_k": 10,
        "filters": ["tenant", "Eq", "t-42"],
    })
response, err := client.QueryNamespace(ctx, "support-tickets", &hevlayer.QueryRequest{
    RankBy:  []any{"content", "Auto", userInput},
    TopK:    10,
    Filters: []any{"tenant", "Eq", "t-42"},
})

if err == nil && !response.Routing.Executed {
    vector := embed(userInput) // your model or API
    response, err = client.QueryNamespace(ctx, "support-tickets", &hevlayer.QueryRequest{
        RankBy: []any{"content", "Auto", userInput, map[string]any{
            "route":  response.Routing.Route,
            "vector": vector,
        }},
        TopK:    10,
        Filters: []any{"tenant", "Eq", "t-42"},
    })
}
let response = await client.queryNamespace("support-tickets", {
  rank_by: ["content", "Auto", userInput],
  top_k: 10,
  filters: ["tenant", "Eq", "t-42"],
});

if (!response.routing.executed) {
  const vector = await embed(userInput); // your model or API
  response = await client.queryNamespace("support-tickets", {
    rank_by: ["content", "Auto", userInput, {
      route: response.routing.route,
      vector,
    }],
    top_k: 10,
    filters: ["tenant", "Eq", "t-42"],
  });
}
# First request: no vector. Executes lexically, or returns the decision.
curl -X POST "$LAYER_GATEWAY_URL/v2/namespaces/support-tickets/query" \
  -H "Authorization: Bearer $LAYER_GATEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "rank_by": ["content", "Auto", "why do pods lose their connection during deploys"],
    "top_k": 10,
    "filters": ["tenant", "Eq", "t-42"]
  }'

# Routed semantic without a vector, so the body is the decision, not rows:
# {"rows": [], "routing": {"route": "semantic", "policy": "v1", "tokens": 8, "executed": false}}

# Embed, then re-issue with the route forced:
curl -X POST "$LAYER_GATEWAY_URL/v2/namespaces/support-tickets/query" \
  -H "Authorization: Bearer $LAYER_GATEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "rank_by": ["content", "Auto", "why do pods lose their connection during deploys", {
      "route": "semantic",
      "vector": [0.0012, -0.043]
    }],
    "top_k": 10,
    "filters": ["tenant", "Eq", "t-42"]
  }'

Routing policy

The v1 policy reads the token count of the input under the same tokenizer policy as hybrid text fusion:

TokensRouteRuns
≤ 2hybrid_textThe hybrid text fusion expansion.
≥ 8semanticANN over the supplied query vector.
3 – 7fusedBoth, merged upstream by RRF.

Vector availability never changes which route is chosen — only whether it executes in this request. hybrid_text always executes; semantic and fused execute when the request supplies a vector and defer otherwise. The policy is versioned ("policy": "v1") so threshold changes are visible in search history.

Options

The optional fourth tuple element:

OptionDefaultMeaning
route"auto"Force "hybrid_text", "semantic", or "fused" instead of applying the policy. Used on re-issue after a deferral, and for A/B comparison of strategies on the same input.
vectorThe query vector for the semantic leg. Dimensionality must match the namespace.

When the chosen route expands hybrid-text legs, the hybrid defaults apply and the hybrid echo block appears alongside routing.

Response

Every Auto response carries a routing block:

{
  "rows": [{"id": "ticket-4117", "$score": 0.0639, "title": "..."}],
  "routing": {
    "route": "hybrid_text",
    "policy": "v1",
    "tokens": 1,
    "executed": true
  },
  "hybrid": {"tokens": ["timout"], "tokens_dropped": 0, "fuzziness": "auto", "rank_constant": 60, "legs": 2, "per_leg_limit": 50}
}
FieldMeaning
routeThe strategy chosen (or forced).
policyRouting policy version that made the decision. "forced" when route was supplied.
tokensToken count the policy read, post tokenizer policy.
executedfalse on a deferral: the route needs a vector the request didn’t supply. rows is empty; embed and re-issue with the route forced.

Routed queries follow the same semantics as their underlying strategy: one consistency cut across all legs, all-or-nothing leg failure, and a single search history entry carrying the Auto expression and the decision.

Validation

All return 422:

ConditionWhy
Forced "semantic" or "fused" without vectorForcing asserts you have the vector; only auto-routing defers.
Input yields zero tokens under the policyNothing to route.
vector dimensionality mismatchSame check as a plain vector query.
cursor present, or Auto inside a queries arrayInherited from hybrid text fusion.

Counting matches

To count how many rows match a full-text or vector query, use scan count mode with the fts or ann selector. Ranked counts share the single /scans endpoint with filter counts — fts is exact, ann is a radius scan flagged approximate, and both honor the exhaustive flag and the count deadline.

Fetch

Fetch is a Layer-only endpoint with no upstream equivalent. The NVMe cache is checked first; on miss or error the gateway falls through to Turbopuffer and backfills the cache best-effort.

Single fetch

doc = await client.fetch_document(
    "products",
    "asin-B08N5WRWNW",
    include_attributes=["title", "category"],
)
doc, err := client.FetchDocument(ctx, "products", "asin-B08N5WRWNW",
    &hevlayer.FetchDocumentParams{
        IncludeAttributes: []string{"title", "category"},
    })
const doc = await client.fetchDocument("products", "asin-B08N5WRWNW", {
  includeAttributes: ["title", "category"],
});
curl "$LAYER_GATEWAY_URL/v2/namespaces/products/documents/asin-B08N5WRWNW?include_attributes=title,category" \
  -H "Authorization: Bearer $LAYER_GATEWAY_API_KEY"
OutcomeStatusHeader
Cached hit200x-layer-cache: hit
Cache miss, upstream hit, cache backfilled200x-layer-cache: miss
Cache unavailable, upstream hit200x-layer-cache: miss-on-error
Missing from both layers404

Batch fetch

batch = await client.fetch_documents("products", {
    "ids": ["asin-1", "asin-2", "asin-3"],
    "include_attributes": ["title"],
})
batch, err := client.FetchDocuments(ctx, "products", &hevlayer.FetchDocumentsRequest{
    Ids:               []string{"asin-1", "asin-2", "asin-3"},
    IncludeAttributes: []string{"title"},
})
const batch = await client.fetchDocuments("products", {
  ids: ["asin-1", "asin-2", "asin-3"],
  include_attributes: ["title"],
});
curl -X POST "$LAYER_GATEWAY_URL/v2/namespaces/products/documents" \
  -H "Authorization: Bearer $LAYER_GATEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "ids": ["asin-1", "asin-2", "asin-3"],
    "include_attributes": ["title"]
  }'
{
  "documents": [
    {"id": "asin-1", "attributes": {"title": "..."}},
    {"id": "asin-3", "attributes": {"title": "..."}}
  ],
  "missing": ["asin-2"]
}

Batch fetch returns found documents and missing ids inline instead of a partial 404. documents preserves request order; ids the gateway could not find anywhere land in missing. Because order is preserved, batch fetch is a convenient way to reassemble a pipeline’s chunks back into their original document — request the chunk ids in sequence and concatenate the results.

Behavior matrix

Cache stateSingle fetchBatch fetch
Hitcachecache
Miss, upstream presentupstream + backfillupstream + backfill
Miss, upstream absent404inline missing
Cache unavailableupstream, miss-on-errorupstream, miss-on-error
esc