# Indexes SecantusDB has a real query planner with index acceleration across most of the shapes `pymongo` produces. This page documents what's wired up and how to verify it. ## Creating indexes Standard `pymongo` API: ```python coll.create_index("email", unique=True) coll.create_index([("a", 1), ("b", -1)], name="ab_compound") coll.create_index("status", partialFilterExpression={"active": True}) coll.create_index("createdAt", expireAfterSeconds=3600) ``` The `_id` index always exists — it's the document table itself, walked by WT-key order. ## What `find()` accelerates `find()` routes through the index entries table for: ### Single-field filters Bare equality (`{field: v}`), `$eq`, `$in`, and any combination of `$gt` / `$gte` / `$lt` / `$lte` against a single-field index. ```python coll.create_index("score") coll.find({"score": {"$gte": 80, "$lt": 100}}) # IXSCAN on score ``` When no single-field index covers `field`, a compound index whose **leading field** is `field` is used instead. Equality lookups become prefix scans (`enc(v) + COMPOUND_SEP`); range bounds are evaluated with a leading-field-only scan that uses `startswith(esc_X + esc_compound_sep)` to identify boundary rows. ### Multi-field bare-equality filters When the filter's fields are a leading prefix (set-wise) of an ASC compound index, an exact match (filter covers the whole index) or a prefix scan (strict leading prefix) runs. Filter field order doesn't matter — `{b: 20, a: 1}` finds the same `{a: 1, b: 1}` index as `{a: 1, b: 20}`. ### Compound prefix + trailing operator `{a: 5, b: 10, c: {$gt: 20}}` walks a compound `{a, b, c}` index by pinning the prefix from the equalities and applying the operator's bounds to the next column. Supports `$eq` / `$in` / `$gt` / `$gte` / `$lt` / `$lte` on the trailing field. ### Mixed-direction compound indexes Compound indexes accept any per-field direction (`{a: 1, b: -1}`, `{a: -1, b: -1}`, etc.). Each field is byte-encoded with `encode_value_directed(value, dir)` so the entries table sorts in the index's natural order. When the trailing field is DESC, the operator semantics flip (`$gt` becomes upper-exclusive in byte order). ### Partial indexes `partialFilterExpression` is honoured at write time (only matching docs get entries) and at query time: ```python coll.create_index("n", partialFilterExpression={"status": "active"}) # Uses the index — query implies the partial filter. coll.find({"status": "active", "n": 5}) # Falls back to scan — query doesn't imply the partial filter. coll.find({"n": 5}) ``` The picker requires every key/value in the partial filter to appear with the same bare value in the user filter. Partial-filter keys are stripped before matching against the index key spec, so a query like `{status: "active", n: 5}` against a partial `{n: 1}` index with filter `{status: "active"}` correctly uses the index. Conservative: operator-form clauses or document-level operators (`$or`, `$expr`, ...) in the partial filter aren't recognised as implied. ### Multikey fallback SecantusDB doesn't yet support per-element multikey indexing. Instead, indexes are flagged `multikey: True` (sticky — never cleared) at insert / update / `create_index` time when any indexed field on a doc is a list, and the picker skips multikey-flagged indexes — `find()` falls back to a full scan + `matches()` so array-element queries (`{tags: "python"}` against `{tags: ["python", "go"]}`) return the correct rows. ## Sort acceleration If the sort field matches an index's leading field, the post-sort step is skipped — `find()` walks the index in order: ```python coll.create_index([("createdAt", -1)]) list(coll.find().sort("createdAt", -1)) # forward walk of DESC index list(coll.find().sort("createdAt", 1)) # backward walk of DESC index ``` Single-field sort can use any matching index — single-field or compound, ASC or DESC. Multi-field sort still falls back to in-memory `sort_docs` (today only single-field sort is index-accelerated). ## Hints `hint` is honoured on both `find` and `aggregate`: | Hint value | Behaviour | | --- | --- | | Index name string | Walk that index | | Key-spec dict matching an index | Walk that index | | `"$natural"` | Force collection scan | | `"_id_"` / `{_id: 1}` | Walk doc table order | An unknown hint surfaces as a `BadValue` (code 2) error to the client. The hint can also align with the sort spec to skip the post-sort step. `aggregate` lifts a leading `$match` stage into the initial fetch's filter so a pipeline starting with `[{$match: {...}}]` benefits from the same index acceleration as `find`. ## TTL indexes `expireAfterSeconds` is honoured by `Storage.prune_ttl(db, coll, *, now=None)` which walks the collection, deletes docs whose indexed `datetime` field is older than `now - expireAfterSeconds`, and removes their index entries. The clock is injectable so tests can drive expiry deterministically. There is **no background sweeper** — real MongoDB prunes every 60s; SecantusDB requires the caller to invoke `prune_ttl` explicitly. This is the right ergonomics for a test harness: the test that wants TTL behaviour fires the prune itself. ```python from secantus import SecantusDBServer import datetime as dt with SecantusDBServer(port=0, storage_path=":memory:") as server: client = MongoClient(server.uri) coll = client["db"]["events"] coll.create_index("createdAt", expireAfterSeconds=60) coll.insert_one({"createdAt": dt.datetime(2026, 1, 1, tzinfo=dt.UTC)}) # Force expiry from the test: pruned = server.storage.prune_ttl( "db", "events", now=dt.datetime(2026, 1, 1, 0, 5, tzinfo=dt.UTC), ) assert pruned == 1 ``` Docs without the TTL field, with non-date values, or with values inside the window are left untouched. ## explain `explain` reports `IXSCAN` when an index would be used and `COLLSCAN` otherwise: ```python plan = coll.find({"n": 5}).explain()["queryPlanner"]["winningPlan"] # IXSCAN: {"stage": "FETCH", "filter": ..., # "inputStage": {"stage": "IXSCAN", "indexName": "n_1", # "keyPattern": {"n": 1}, "direction": "forward"}} # COLLSCAN: {"stage": "COLLSCAN", "filter": ...} ``` `Storage.explain_plan(...)` mirrors `find_matching`'s routing decisions without executing them and is exposed on the public storage API. ## Geospatial — `2d` and `2dsphere` Both index types ship and accelerate `$geoWithin` / `$geoIntersects` / `$near` / `$nearSphere` plus the `$geoNear` aggregation stage. See the dedicated [Geospatial](geospatial.md) page for the operator-by-operator reference, doc-side shapes accepted, distance-unit conventions across the GeoJSON / legacy-planar / legacy-spherical spec forms, and the worked deployment example. Quick shapes: ```python coll.create_index([("loc", "2dsphere")]) # GeoJSON, spherical coll.create_index([("loc", "2d")]) # legacy [x, y] pairs, planar coll.create_index([("loc", "2dsphere"), ("cat", 1)]) # compound geo + scalar ``` ## Natural iteration order Walking the doc table in WT-key order yields docs in MongoDB's natural sort order: numeric for int/float/Decimal128 (with cross-type collision preserved by the lexical-decimal encoding), chronological for `ObjectId`, lexical for strings, etc. `update_matching(multi=False)` and `find()` without `sort` walk in this order, matching `mongod`. ## Index acceleration paths summary | Filter shape | Path | | --- | --- | | `{f: v}` (bare-equality, single field) | Single-field index, prefix scan on compound, or partial-index match | | `{f: {$eq: v}}` / `{f: {$in: [...]}}` | Same as bare-equality | | `{f: {$gt/$gte/$lt/$lte: v}}` | Range scan | | `{f1: v1, f2: v2, ...}` (bare-eq) | Compound prefix exact / scan | | `{f1: v1, f2: {$gt: v}}` | Compound prefix + trailing operator | | `{f: array}` against a `multikey` index | Falls back to scan | | `{partial_filter_keys, ...index_keys}` | Partial-index path | | Sort `{f: ±1}` aligned with an index leading field | B-tree walk in (reversed) order, no post-sort | | `hint` | Forces a specific index / `$natural` | ## Acceleration summary across index types | Index type | Filter / sort shape | Path | |---|---|---| | Single-field B-tree | equality / `$in` / `$gt`/`$gte`/`$lt`/`$lte` / sort | IXSCAN | | Compound B-tree | bare-eq prefix; eq prefix + trailing operator on the next column; ASC/DESC mix; multi-field sort that matches or exactly inverts the key spec | IXSCAN, no post-sort | | Partial | when the user filter implies the partial expression | IXSCAN | | Multikey | equality / `$in` / range on the array column; whole-array equality goes through the canonical key entry | IXSCAN (sort-acceleration skipped — multikey doesn't preserve a single natural order) | | TTL | timestamp range + prune sweeper drives expiry | IXSCAN | | `2dsphere` | `$geoWithin` / `$geoIntersects` / `$near` / `$nearSphere` / `$geoNear` via S2 cell-covering scan | IXSCAN — see [Geospatial](geospatial.md) | | `2d` | same operators via quadtree-decomposed Z-order range scan over a bit-interleaved geohash | IXSCAN — see [Geospatial](geospatial.md) | | Compound geo + scalar | geo column drives the cell scan; trailing scalar(s) filtered at the verifier step | IXSCAN | ## What's still missing - **Per-index collation** — `createIndexes` stores the option on the index spec, but entries are written in BSON codepoint order (no collation-aware sort key). Queries that carry `collation` fall through to COLLSCAN by design. The per-query collation infrastructure does honour collation for `find` / `count` / `distinct` / `findAndModify` — it's just the index-side enforcement that's missing. - **TTL background sweeper** — `prune_ttl` is opt-in; no 60-second cadence sweeper. Real mongod runs one; for an in-process test surrogate the explicit-call ergonomics suit the audience better. - **Text / hashed indexes** — out of scope (no full-text engine; no practical workload pulling hashed shard-key behaviour into an in-process surrogate).