Query & Fetch
Stable reads
Layer uses the same query syntax as upstream but defaults to stable
reads. Every response carries an
x-layer-stable-as-of watermark: the point the upstream index is known
to be caught up to. A query issued right after an upsert never returns
partially-indexed rows and never 429s under write pressure, so derived
views like facets and counts
stay in sync with your index.
HTTP/1.1 200 OK
x-layer-stable-as-of: 1715600400000

{"rows":[{"id":"asin-B08N5WRWNW","$dist":0.42,"title":"..."}]}
This is achieved by:
- Queries run at consistency=eventual upstream, so they never block on indexing.
- A control loop polls each registered namespace’s index.status and records the latest status plus, when stable, a watermark equal to poll_start - safety_margin. Cold or updating namespaces use the fast polling interval; stable namespaces back off to the stable interval until the next write re-arms the fast tier.
- Per-query decision:
  - Updating → inject a hidden _hevlayer_upserted_at <= watermark predicate so the read never sees partially-indexed rows.
  - Stable or Unknown → run without the predicate. The upstream index is caught up (or no contrary evidence exists).
- On a 429 to an unfiltered query, Layer retries once with the watermark filter forced on.
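The watermark arithmetic and the per-query decision described above can be sketched as follows. This is a minimal illustration, not the gateway's code: the IndexStatus enum, the function names, and the ["attr", "Lte", value] filter shape are assumptions, and the behavior for an updating namespace with no watermark yet (run unfiltered) is a guess the text does not confirm.

```python
from enum import Enum
from typing import Optional


class IndexStatus(Enum):
    """Hypothetical names for the three states the text mentions."""
    UPDATING = "updating"
    STABLE = "stable"
    UNKNOWN = "unknown"


def watermark_for(poll_start_ms: int, safety_margin_ms: int = 500) -> int:
    """Watermark recorded on a stable poll: poll_start - safety_margin."""
    return poll_start_ms - safety_margin_ms


def watermark_filter(status: IndexStatus, watermark_ms: Optional[int]) -> Optional[list]:
    """Return the predicate to inject, or None to run the query unfiltered."""
    if status is IndexStatus.UPDATING and watermark_ms is not None:
        # Assumed filter shape, mirroring the ["tenant", "Eq", "t-42"] style.
        return ["_hevlayer_upserted_at", "Lte", watermark_ms]
    # Stable or Unknown: no predicate. (Updating without a watermark also
    # falls through here, which is an assumption of this sketch.)
    return None
```

With the default 500 ms margin, a stable poll that started at 1715600400500 records the watermark 1715600400000.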
Responses report x-layer-stable-as-of (epoch ms) when the watcher has a
watermark for the namespace. It is omitted on a cold-start gateway that
has not yet observed a stable poll. Paginated single-query responses
return the next page token in x-layer-next-cursor; pass that value back
as cursor in the next request body.
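A pagination loop over x-layer-next-cursor might look like the sketch below. The send callable is a hypothetical stand-in for whatever HTTP client you use, and the sketch assumes the header is simply absent on the last page, which the text does not state explicitly.

```python
from typing import Callable, Iterator, Mapping, Tuple

# Hypothetical transport: takes a request body, returns (rows, response headers).
Send = Callable[[dict], Tuple[list, Mapping[str, str]]]


def iter_pages(send: Send, body: dict) -> Iterator[list]:
    """Yield each page of rows, following x-layer-next-cursor until it is absent."""
    request = dict(body)
    while True:
        rows, headers = send(request)
        yield rows
        cursor = headers.get("x-layer-next-cursor")
        if not cursor:
            return
        # Pass the header value back as `cursor` in the next request body.
        request = {**body, "cursor": cursor}


def fake_send(request: dict):
    """Stand-in for the gateway: two pages, then no cursor header."""
    if "cursor" not in request:
        return [{"id": "a"}], {"x-layer-next-cursor": "page-2"}
    return [{"id": "b"}], {}


pages = list(iter_pages(fake_send, {"top_k": 1}))
```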
Stable-read behavior is set per namespace with the consistency field on
the Index CRD. Three gateway tunables control
the watcher:
| Variable | Default | Purpose |
|---|---|---|
| CONSISTENCY_POLL_INTERVAL_MS | 1000 | Fast cadence for cold and updating namespaces. |
| CONSISTENCY_STABLE_POLL_INTERVAL_MS | 60000 | Slow cadence for namespaces last observed stable. Set equal to the fast interval to restore one uniform cadence. |
| CONSISTENCY_SAFETY_MARGIN_MS | 500 | Cushion between poll time and watermark to cover in-flight upserts. |
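As a rough illustration of how the two polling cadences interact, the sketch below picks an interval from the last observed status. The function name and the string status values are hypothetical; only the variable names and defaults come from the table.

```python
import os
from typing import Mapping, Optional


def poll_interval_ms(last_status: Optional[str], env: Mapping[str, str] = os.environ) -> int:
    """Pick the polling cadence for a namespace from its last observed status.

    last_status is None for a cold namespace (never polled); "stable" and
    "updating" are hypothetical spellings of the statuses named in the text.
    """
    fast = int(env.get("CONSISTENCY_POLL_INTERVAL_MS", "1000"))
    slow = int(env.get("CONSISTENCY_STABLE_POLL_INTERVAL_MS", "60000"))
    return slow if last_status == "stable" else fast
```

Setting both variables to the same value makes the function return one uniform cadence, matching the note in the table.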
Query by id
Pass nearest_to_id in place of vector to rank by stored document
vectors instead of a raw query vector — exactly one of the two is
required. nearest_to_id takes an array of document ids: the gateway
resolves each id’s vector (document cache first, Turbopuffer on miss with a
cache backfill) and averages them component-wise into a single centroid,
then ranks nearest neighbors to that centroid. Pass one id to rank by a
single document; pass several to get “more like these” over a set of seeds.
response = await client.query_namespace("products", {
"nearest_to_id": ["asin-B08N5WRWNW", "asin-B07PXGQC1Q"],
"top_k": 10,
"include_attributes": ["title", "category"],
})

response, err := client.QueryNamespace(ctx, "products", &hevlayer.QueryRequest{
NearestToID: []string{"asin-B08N5WRWNW", "asin-B07PXGQC1Q"},
TopK: 10,
IncludeAttributes: []string{"title", "category"},
})

const response = await client.queryNamespace("products", {
nearest_to_id: ["asin-B08N5WRWNW", "asin-B07PXGQC1Q"],
top_k: 10,
include_attributes: ["title", "category"],
});

curl -X POST "$LAYER_GATEWAY_URL/v2/namespaces/products/query" \
-H "Authorization: Bearer $LAYER_GATEWAY_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"nearest_to_id": ["asin-B08N5WRWNW", "asin-B07PXGQC1Q"],
"top_k": 10,
"include_attributes": ["title", "category"]
}'

| Outcome | Status |
|---|---|
| Every id resolved (cache or origin) | 200, ranked results |
| Any id has no stored vector anywhere | 404 (names the missing ids) |
| nearest_to_id empty, or both/neither of vector / nearest_to_id | 422 |
The centroid is an unweighted mean, so seed ids contribute equally regardless of how many you pass. All resolved vectors share the namespace’s dimensionality, so no reconciliation is needed across seeds. This fuses the seeds into one ranking; to run several independent rankings in a single request, see multi-query.
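The component-wise averaging can be sketched in a few lines. This is an illustration of an unweighted mean, not the gateway's implementation; the function name and error messages are made up for the example.

```python
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Unweighted component-wise mean of the seed vectors."""
    if not vectors:
        raise ValueError("at least one seed vector is required")
    dims = len(vectors[0])
    if any(len(v) != dims for v in vectors):
        raise ValueError("all seed vectors must share one dimensionality")
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(dims)]
```

For example, seeds [1.0, 2.0] and [3.0, 4.0] average to [2.0, 3.0], and a single seed is returned unchanged.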
Multi-query
nearest_to_id fuses several seeds into a single ranking. To run
several independent queries in one round trip, each with its own
ranking, post a queries array. The response is a parallel results
array: one ranked result set per query, in request order.
batch = await client.multi_query_turbopuffer_namespace("products", {
"queries": [
{"rank_by": ["vector", "ANN", [0.1, 0.2, 0.3]], "top_k": 10},
{"rank_by": ["title", "BM25", "wireless earbuds"], "top_k": 10},
],
})
# batch.results[0].rows ranked by vector; batch.results[1].rows by text

batch, err := client.MultiQueryTurbopufferNamespace(ctx, "products",
&hevlayer.TurbopufferMultiQueryRequest{
Queries: []hevlayer.TurbopufferQueryRequest{
{"rank_by": []any{"vector", "ANN", []float64{0.1, 0.2, 0.3}}, "top_k": 10},
{"rank_by": []any{"title", "BM25", "wireless earbuds"}, "top_k": 10},
},
})

const batch = await client.multiQueryTurbopufferNamespace("products", {
queries: [
{ rank_by: ["vector", "ANN", [0.1, 0.2, 0.3]], top_k: 10 },
{ rank_by: ["title", "BM25", "wireless earbuds"], top_k: 10 },
],
});
// batch.results[0].rows ranked by vector; batch.results[1].rows by text

curl -X POST "$LAYER_GATEWAY_URL/v2/namespaces/products/query?stainless_overload=multiQuery" \
-H "Authorization: Bearer $LAYER_GATEWAY_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"queries": [
{"rank_by": ["vector", "ANN", [0.1, 0.2, 0.3]], "top_k": 10},
{"rank_by": ["title", "BM25", "wireless earbuds"], "top_k": 10}
]
}'

All legs in a non-fused batch share one x-layer-stable-as-of value. A
leg may use native rank_by, or the Layer vector / nearest_to_id
single-query shape; nearest_to_id is resolved before the leg is sent
upstream. Batches must contain 2 to 16 legs. cursor is rejected at the
top level and per leg because pagination is single-query only.
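The batch rules above (2 to 16 legs, no cursor anywhere, fused bodies passed through) could be checked client-side before sending. The sketch below is an assumption-laden illustration: validate_batch and its error strings are hypothetical, and it assumes fused bodies get no Layer-side checks because the text says they pass through unchanged.

```python
from typing import List

MIN_LEGS, MAX_LEGS = 2, 16


def validate_batch(body: dict) -> List[str]:
    """Return a list of validation errors; an empty list means the batch looks acceptable."""
    if "rerank_by" in body:
        # Fused upstream query: passed through unchanged, so this sketch
        # applies no Layer-side checks to it (an assumption).
        return []
    errors = []
    legs = body.get("queries", [])
    if not MIN_LEGS <= len(legs) <= MAX_LEGS:
        errors.append(f"expected {MIN_LEGS} to {MAX_LEGS} legs, got {len(legs)}")
    if "cursor" in body:
        errors.append("cursor is not allowed at the top level of a batch")
    for index, leg in enumerate(legs):
        if "cursor" in leg:
            errors.append(f"cursor is not allowed in leg {index}")
    return errors
```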
When rerank_by is present, Layer treats the request as an upstream fused
query and passes the body through unchanged.
Reach for multi-query when you genuinely need N rankings — distinct user
queries batched into one round trip, or hybrid retrieval fused upstream
with RRF. Reach for nearest_to_id when many seeds should collapse into
one “more like these” ranking. To get typo-tolerant text search without
building the fused query yourself, see
hybrid text fusion.
Hybrid text fusion
BM25 misses typos and morphological variants; fuzzy matching alone loses
the relevance signal BM25 provides. HybridText runs both in one
request: the gateway tokenizes your input string, expands it into one
BM25 leg plus one fuzzy leg per token, and Turbopuffer fuses the legs
with reciprocal rank fusion (RRF). One expression in, typo-tolerant
ranked results out.
HybridText is a Layer-only rank_by spelling on the existing query
route — no new endpoint, no client changes beyond the expression. The
gateway tokenizes with alyze,
Turbopuffer’s own open-source tokenizer and the same code that segmented
your text at index time, so query tokens match index terms by
construction.
response = await client.query_namespace("support-tickets", {
"rank_by": ["content", "HybridText", "conection timout kubernets"],
"top_k": 10,
"filters": ["tenant", "Eq", "t-42"],
"include_attributes": ["content", "title"],
})

response, err := client.QueryNamespace(ctx, "support-tickets", &hevlayer.QueryRequest{
RankBy: []any{"content", "HybridText", "conection timout kubernets"},
TopK: 10,
Filters: []any{"tenant", "Eq", "t-42"},
IncludeAttributes: []string{"content", "title"},
})

const response = await client.queryNamespace("support-tickets", {
rank_by: ["content", "HybridText", "conection timout kubernets"],
top_k: 10,
filters: ["tenant", "Eq", "t-42"],
include_attributes: ["content", "title"],
});

curl -X POST "$LAYER_GATEWAY_URL/v2/namespaces/support-tickets/query" \
-H "Authorization: Bearer $LAYER_GATEWAY_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"rank_by": ["content", "HybridText", "conection timout kubernets"],
"top_k": 10,
"filters": ["tenant", "Eq", "t-42"],
"include_attributes": ["content", "title"]
}'

An optional fourth tuple element tunes the expansion. Defaults:
["content", "HybridText", "conection timout kubernets", {
"fuzziness": "auto",
"rank_constant": 60,
"per_leg_limit": null
}]
| Option | Default | Meaning |
|---|---|---|
| fuzziness | "auto" | Max edit distance for fuzzy legs. "auto" maps per token: 1 for tokens of 5 characters or fewer, 2 for longer tokens. Fixed 0, 1, or 2 applies to all tokens. |
| rank_constant | 60 | Turbopuffer’s RRF constant, passed through verbatim. Integer > 0. |
| per_leg_limit | clamp(5 × top_k, 50, 200) | How deep each leg retrieves before fusion. Integer > 0. |
| threads | Index.spec.scan.threads, else 8 | Maximum concurrent upstream requests when the gateway scatter/gathers the expansion across a sharded namespace (the same fan-out control as scans). Clamped to active shards. No effect on unsharded namespaces, where the expansion is a single fused upstream call. |
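The two computed defaults in the table reduce to small functions. The sketch below is illustrative only: the function names are hypothetical, and it assumes "characters" means Python string length (code points), which the table does not specify.

```python
def auto_fuzziness(token: str) -> int:
    """Edit distance under "auto": 1 for tokens of 5 characters or fewer, else 2."""
    return 1 if len(token) <= 5 else 2


def default_per_leg_limit(top_k: int) -> int:
    """clamp(5 * top_k, 50, 200): how deep each leg retrieves before fusion."""
    return max(50, min(5 * top_k, 200))
```

With top_k set to 10, the per-leg depth is 50, which matches the per_leg_limit echoed in the sample response.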
Tokenization
The input string becomes tokens under a fixed, documented policy:
- Split on Unicode (UAX #29) word boundaries and lowercase, using alyze, the code behind Turbopuffer’s production word_v4 tokenizer. Punctuation-only tokens never survive the split.
- Drop tokens shorter than 2 characters.
- Dedupe.
- Cap at 15 tokens (15 fuzzy legs + 1 BM25 leg = 16, the upstream
subquery limit). Tokens cut by the cap are counted in
tokens_dropped.
Stemming, stopword removal, and language detection are not applied. The input must yield at least one token; one token is fine (that is still two legs, the RRF minimum).
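The policy steps can be approximated as below to see how an input maps to tokens and tokens_dropped. This is a rough sketch, not the real tokenizer: it substitutes a \w+ regular expression for UAX #29 segmentation, so it will split some inputs differently (apostrophes, for example), and the function name is hypothetical.

```python
import re
from typing import List, Tuple

MAX_TOKENS = 15  # 15 fuzzy legs + 1 BM25 leg = 16 subqueries


def tokenize(text: str) -> Tuple[List[str], int]:
    """Return (tokens, tokens_dropped) under an approximation of the policy."""
    # Approximation: \w+ runs stand in for UAX #29 word boundaries.
    words = re.findall(r"\w+", text.lower())
    seen = set()
    tokens = []
    for word in words:
        if len(word) < 2 or word in seen:
            continue  # drop short tokens and duplicates
        seen.add(word)
        tokens.append(word)
    kept = tokens[:MAX_TOKENS]
    # Only tokens cut by the cap count as dropped.
    return kept, len(tokens) - len(kept)
```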
Response
Results are the upstream RRF-fused list. A hybrid block echoes the
effective expansion so defaults are never invisible:
{
"rows": [
{
"id": "ticket-4117",
"$score": 0.0639,
"content": "...",
"title": "Connection timeout on Kubernetes ingress"
}
],
"hybrid": {
"tokens": ["conection", "timout", "kubernets"],
"tokens_dropped": 0,
"fuzziness": "auto",
"rank_constant": 60,
"legs": 4,
"per_leg_limit": 50
}
}
| Field | Meaning |
|---|---|
| $score | Upstream RRF score. Comparable within a response, not across requests; do not threshold on it. |
| tokens | Tokens that produced fuzzy legs, post-policy. |
| tokens_dropped | Tokens removed by the 15-token cap (not by the length or punctuation rules). |
| legs | Total subqueries sent upstream (fuzzy legs + 1 BM25 leg). |
The hybrid block appears only on HybridText responses. On sharded
namespaces it also reports the effective threads fan-out width.
Requests without a HybridText expression, including native
multi-query + rerank_by bodies, keep their upstream-shaped responses
byte-for-byte.
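For intuition about $score and rank_constant, the textbook reciprocal rank fusion formula sums 1 / (rank_constant + rank) over every leg in which a document appears. The sketch below implements that textbook form with 1-based ranks; it is not Layer code (Layer performs no fusion), and Turbopuffer's exact implementation may differ.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rrf(legs: Sequence[Sequence[str]], rank_constant: int = 60) -> List[Tuple[str, float]]:
    """Fuse ranked id lists with textbook RRF; returns (id, score) best first."""
    scores: Dict[str, float] = defaultdict(float)
    for leg in legs:
        for rank, doc_id in enumerate(leg, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Sort by descending score, breaking ties by id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


fused = rrf([["a", "b"], ["b", "c"]])
```

Because each score depends on how many legs ran and how deep they went, the values shift with the expansion, which is one reason scores are not comparable across requests.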
Semantics
- One round trip. The expansion is a single upstream multi-query fused by rerank_by: ["RRF", ...]. Layer implements no fusion math and does not reorder results.
- One consistency cut. Request-level filters are replicated to every leg, and the stable-read watermark predicate is injected into every leg from a single read, so all legs see the same cut. Responses carry x-layer-stable-as-of as usual.
- All-or-nothing. Upstream multi-query has no partial results; any leg failing fails the request.
- Replay as a unit. The query logs to search history as one entry carrying the HybridText expression, so replaying it reproduces the whole expansion.
Validation
All return 422:
| Condition | Why |
|---|---|
| Input yields zero tokens under the policy | Nothing to expand. |
| cursor present | Fused scores do not form the monotone bands pagination relies on. |
| HybridText inside a queries array | The expansion is one multi-query deep by construction. |
| fuzziness not one of "auto", 0, 1, 2; rank_constant ≤ 0; per_leg_limit ≤ 0; threads < 1 | Out of range. |
To let the gateway pick between hybrid text and semantic retrieval per query, see query routing.
Query routing
Real search boxes receive both "timout" and "why do pods lose their connection during deploys". The first wants
hybrid text fusion; the second wants semantic
retrieval — lexical legs add noise on long conversational input, and
ANN underperforms on short identifier-shaped tokens. Auto is a
Layer-only rank_by spelling that makes that call per query, so the
branch doesn’t live ad hoc in your application code.
The gateway never embeds. The route is chosen from the shape of the input alone, and when the chosen route needs a query vector the request didn’t include, the response is the routing decision instead of results — your application embeds and re-issues with the route forced. Short keyword traffic executes immediately and never pays for an embedding.
response = await client.query_namespace("support-tickets", {
    "rank_by": ["content", "Auto", user_input],
    "top_k": 10,
    "filters": ["tenant", "Eq", "t-42"],
})
if not response.routing.executed:
    # Deferred: the chosen route needs a vector. Embed, then re-issue with the route forced.
    vector = await embed(user_input)  # your model or API
    response = await client.query_namespace("support-tickets", {
        "rank_by": ["content", "Auto", user_input, {
            "route": response.routing.route,
            "vector": vector,
        }],
        "top_k": 10,
        "filters": ["tenant", "Eq", "t-42"],
    })

response, err := client.QueryNamespace(ctx, "support-tickets", &hevlayer.QueryRequest{
RankBy: []any{"content", "Auto", userInput},
TopK: 10,
Filters: []any{"tenant", "Eq", "t-42"},
})
if err == nil && !response.Routing.Executed {
vector := embed(userInput) // your model or API
response, err = client.QueryNamespace(ctx, "support-tickets", &hevlayer.QueryRequest{
RankBy: []any{"content", "Auto", userInput, map[string]any{
"route": response.Routing.Route,
"vector": vector,
}},
TopK: 10,
Filters: []any{"tenant", "Eq", "t-42"},
})
}

let response = await client.queryNamespace("support-tickets", {
rank_by: ["content", "Auto", userInput],
top_k: 10,
filters: ["tenant", "Eq", "t-42"],
});
if (!response.routing.executed) {
const vector = await embed(userInput); // your model or API
response = await client.queryNamespace("support-tickets", {
rank_by: ["content", "Auto", userInput, {
route: response.routing.route,
vector,
}],
top_k: 10,
filters: ["tenant", "Eq", "t-42"],
});
}

# First request: no vector. Executes lexically, or returns the decision.
curl -X POST "$LAYER_GATEWAY_URL/v2/namespaces/support-tickets/query" \
-H "Authorization: Bearer $LAYER_GATEWAY_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"rank_by": ["content", "Auto", "why do pods lose their connection during deploys"],
"top_k": 10,
"filters": ["tenant", "Eq", "t-42"]
}'
# Routed semantic without a vector, so the body is the decision, not rows:
# {"rows": [], "routing": {"route": "semantic", "policy": "v1", "tokens": 8, "executed": false}}
# Embed, then re-issue with the route forced:
curl -X POST "$LAYER_GATEWAY_URL/v2/namespaces/support-tickets/query" \
-H "Authorization: Bearer $LAYER_GATEWAY_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"rank_by": ["content", "Auto", "why do pods lose their connection during deploys", {
"route": "semantic",
"vector": [0.0012, -0.043]
}],
"top_k": 10,
"filters": ["tenant", "Eq", "t-42"]
}'

Routing policy
The v1 policy reads the token count of the input under the same tokenizer policy as hybrid text fusion:
| Tokens | Route | Runs |
|---|---|---|
| ≤ 2 | hybrid_text | The hybrid text fusion expansion. |
| ≥ 8 | semantic | ANN over the supplied query vector. |
| 3 – 7 | fused | Both, merged upstream by RRF. |
Vector availability never changes which route is chosen — only whether
it executes in this request. hybrid_text always executes; semantic
and fused execute when the request supplies a vector and defer
otherwise. The policy is versioned ("policy": "v1") so threshold
changes are visible in search history.
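The v1 thresholds and the execute-or-defer rule fit in a few lines. The sketch below mirrors the table and the routing block's field names; the function names are hypothetical, and raising ValueError for zero tokens stands in for the gateway's 422.

```python
def choose_route(token_count: int) -> str:
    """Apply the v1 thresholds from the routing table."""
    if token_count < 1:
        raise ValueError("input yields zero tokens: nothing to route")
    if token_count <= 2:
        return "hybrid_text"
    if token_count >= 8:
        return "semantic"
    return "fused"


def routing_decision(token_count: int, has_vector: bool) -> dict:
    """Build a routing block; vector availability affects only `executed`."""
    route = choose_route(token_count)
    executed = route == "hybrid_text" or has_vector
    return {"route": route, "policy": "v1", "tokens": token_count, "executed": executed}
```

An eight-token question sent without a vector yields the semantic route with executed set to false, the same decision body shown in the curl example.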
Options
The optional fourth tuple element:
| Option | Default | Meaning |
|---|---|---|
| route | "auto" | Force "hybrid_text", "semantic", or "fused" instead of applying the policy. Used on re-issue after a deferral, and for A/B comparison of strategies on the same input. |
| vector | none | The query vector for the semantic leg. Dimensionality must match the namespace. |
When the chosen route expands hybrid-text legs, the hybrid defaults
apply and the hybrid echo block appears alongside
routing.
Response
Every Auto response carries a routing block:
{
"rows": [{"id": "ticket-4117", "$score": 0.0639, "title": "..."}],
"routing": {
"route": "hybrid_text",
"policy": "v1",
"tokens": 1,
"executed": true
},
"hybrid": {"tokens": ["timout"], "tokens_dropped": 0, "fuzziness": "auto", "rank_constant": 60, "legs": 2, "per_leg_limit": 50}
}
| Field | Meaning |
|---|---|
| route | The strategy chosen (or forced). |
| policy | Routing policy version that made the decision. "forced" when route was supplied. |
| tokens | Token count the policy read, post tokenizer policy. |
| executed | false on a deferral: the route needs a vector the request didn’t supply. rows is empty; embed and re-issue with the route forced. |
Routed queries follow the same semantics as their underlying strategy:
one consistency cut across all legs, all-or-nothing leg failure, and a
single search history entry carrying the
Auto expression and the decision.
Validation
All return 422:
| Condition | Why |
|---|---|
| Forced "semantic" or "fused" without vector | Forcing asserts you have the vector; only auto-routing defers. |
| Input yields zero tokens under the policy | Nothing to route. |
| vector dimensionality mismatch | Same check as a plain vector query. |
| cursor present, or Auto inside a queries array | Inherited from hybrid text fusion. |
Counting matches
To count how many rows match a full-text or vector query, use
scan count mode with the fts or ann selector.
Ranked counts share the single /scans endpoint with filter counts —
fts is exact, ann is a radius scan flagged approximate, and both
honor the exhaustive flag and the count deadline.
Fetch
Fetch is a Layer-only endpoint with no upstream equivalent. The NVMe cache is checked first; on miss or error the gateway falls through to Turbopuffer and backfills the cache best-effort.
Single fetch
doc = await client.fetch_document(
"products",
"asin-B08N5WRWNW",
include_attributes=["title", "category"],
)

doc, err := client.FetchDocument(ctx, "products", "asin-B08N5WRWNW",
&hevlayer.FetchDocumentParams{
IncludeAttributes: []string{"title", "category"},
})

const doc = await client.fetchDocument("products", "asin-B08N5WRWNW", {
includeAttributes: ["title", "category"],
});

curl "$LAYER_GATEWAY_URL/v2/namespaces/products/documents/asin-B08N5WRWNW?include_attributes=title,category" \
-H "Authorization: Bearer $LAYER_GATEWAY_API_KEY"

| Outcome | Status | Header |
|---|---|---|
| Cached hit | 200 | x-layer-cache: hit |
| Cache miss, upstream hit, cache backfilled | 200 | x-layer-cache: miss |
| Cache unavailable, upstream hit | 200 | x-layer-cache: miss-on-error |
| Missing from both layers | 404 | — |
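The outcome table corresponds to a read-through pattern, sketched below with injected callables so it runs without a gateway. Everything here is illustrative: the function and parameter names are hypothetical, and attempting a best-effort backfill after a cache error is an assumption based on the sentence introducing Fetch.

```python
from typing import Callable, Optional, Tuple


def fetch_document(
    doc_id: str,
    cache_get: Callable[[str], Optional[dict]],
    cache_put: Callable[[str, dict], None],
    upstream_get: Callable[[str], Optional[dict]],
) -> Tuple[int, Optional[str], Optional[dict]]:
    """Return (status, x-layer-cache value, document) for a single fetch."""
    try:
        cached = cache_get(doc_id)
        cache_label = "miss"
    except Exception:
        cached = None
        cache_label = "miss-on-error"  # cache unavailable
    if cached is not None:
        return 200, "hit", cached

    doc = upstream_get(doc_id)
    if doc is None:
        return 404, None, None  # missing from both layers
    try:
        cache_put(doc_id, doc)  # best-effort backfill
    except Exception:
        pass  # a failed backfill never fails the read
    return 200, cache_label, doc
```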
Batch fetch
batch = await client.fetch_documents("products", {
"ids": ["asin-1", "asin-2", "asin-3"],
"include_attributes": ["title"],
})

batch, err := client.FetchDocuments(ctx, "products", &hevlayer.FetchDocumentsRequest{
Ids: []string{"asin-1", "asin-2", "asin-3"},
IncludeAttributes: []string{"title"},
})

const batch = await client.fetchDocuments("products", {
ids: ["asin-1", "asin-2", "asin-3"],
include_attributes: ["title"],
});

curl -X POST "$LAYER_GATEWAY_URL/v2/namespaces/products/documents" \
-H "Authorization: Bearer $LAYER_GATEWAY_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"ids": ["asin-1", "asin-2", "asin-3"],
"include_attributes": ["title"]
}'

{
"documents": [
{"id": "asin-1", "attributes": {"title": "..."}},
{"id": "asin-3", "attributes": {"title": "..."}}
],
"missing": ["asin-2"]
}
Batch fetch returns found documents and missing ids inline instead of a
partial 404. documents preserves request order; ids the gateway could
not find anywhere land in missing. Because order is preserved, batch
fetch is a convenient way to reassemble a pipeline’s
chunks back into their original document — request the chunk ids in
sequence and concatenate the results.
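Reassembling chunks from a batch fetch response might look like the sketch below. It works on the response shape shown above; the reassemble function and the "text" attribute name are hypothetical, and refusing to concatenate when any chunk is missing is a design choice of this example, not documented behavior.

```python
def reassemble(batch: dict, text_attribute: str = "text") -> str:
    """Concatenate chunk text in request order from a batch fetch response."""
    missing = batch.get("missing", [])
    if missing:
        # A gap would silently corrupt the document, so fail loudly instead.
        raise LookupError(f"cannot reassemble, missing chunks: {missing}")
    return "".join(doc["attributes"][text_attribute] for doc in batch["documents"])


sample = {
    "documents": [
        {"id": "chunk-0", "attributes": {"text": "Hello, "}},
        {"id": "chunk-1", "attributes": {"text": "world"}},
    ],
    "missing": [],
}
document = reassemble(sample)
```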
Behavior matrix
| Cache state | Single fetch | Batch fetch |
|---|---|---|
| Hit | cache | cache |
| Miss, upstream present | upstream + backfill | upstream + backfill |
| Miss, upstream absent | 404 | inline missing |
| Cache unavailable | upstream, miss-on-error | upstream, miss-on-error |