# Indexes

SecantusDB has a real query planner with index acceleration across most of
the shapes `pymongo` produces. This page documents what's wired up and how
to verify it.

## Creating indexes

Standard `pymongo` API:

```python
coll.create_index("email", unique=True)
coll.create_index([("a", 1), ("b", -1)], name="ab_compound")
coll.create_index("status", partialFilterExpression={"active": True})
coll.create_index("createdAt", expireAfterSeconds=3600)
```

The `_id` index always exists — it's the document table itself, walked by
WT-key order.

## What `find()` accelerates

`find()` routes through the index entries table for:

### Single-field filters

Bare equality (`{field: v}`), `$eq`, `$in`, and any combination of `$gt` /
`$gte` / `$lt` / `$lte` against a single-field index.

```python
coll.create_index("score")
coll.find({"score": {"$gte": 80, "$lt": 100}})  # IXSCAN on score
```

When no single-field index covers `field`, a compound index whose **leading
field** is `field` is used instead. Equality lookups become prefix scans
(`enc(v) + COMPOUND_SEP`); range bounds are evaluated with a
leading-field-only scan that uses `startswith(esc_X + esc_compound_sep)` to
identify boundary rows.

### Multi-field bare-equality filters

When the filter's fields are a leading prefix (set-wise) of an ASC compound
index, an exact match (filter covers the whole index) or a prefix scan
(strict leading prefix) runs.

Filter field order doesn't matter — `{b: 20, a: 1}` finds the same
`{a: 1, b: 1}` index as `{a: 1, b: 20}`.

### Compound prefix + trailing operator

`{a: 5, b: 10, c: {$gt: 20}}` walks a compound `{a, b, c}` index by pinning
the prefix from the equalities and applying the operator's bounds to the
next column. Supports `$eq` / `$in` / `$gt` / `$gte` / `$lt` / `$lte` on the
trailing field.

### Mixed-direction compound indexes

Compound indexes accept any per-field direction (`{a: 1, b: -1}`,
`{a: -1, b: -1}`, etc.). Each field is byte-encoded with
`encode_value_directed(value, dir)` so the entries table sorts in the
index's natural order. When the trailing field is DESC, the operator
semantics flip (`$gt` becomes upper-exclusive in byte order).

### Partial indexes

`partialFilterExpression` is honoured at write time (only matching docs get
entries) and at query time:

```python
coll.create_index("n", partialFilterExpression={"status": "active"})

# Uses the index — query implies the partial filter.
coll.find({"status": "active", "n": 5})

# Falls back to scan — query doesn't imply the partial filter.
coll.find({"n": 5})
```

The picker requires every key/value in the partial filter to appear with the
same bare value in the user filter. Partial-filter keys are stripped before
matching against the index key spec, so a query like `{status: "active",
n: 5}` against a partial `{n: 1}` index with filter `{status: "active"}`
correctly uses the index.

Conservative: operator-form clauses or document-level operators (`$or`,
`$expr`, ...) in the partial filter aren't recognised as implied.

### Multikey fallback

SecantusDB doesn't yet support per-element multikey indexing. Instead,
indexes are flagged `multikey: True` (sticky — never cleared) at insert /
update / `create_index` time when any indexed field on a doc is a list, and
the picker skips multikey-flagged indexes — `find()` falls back to a full
scan + `matches()` so array-element queries (`{tags: "python"}` against
`{tags: ["python", "go"]}`) return the correct rows.

## Sort acceleration

If the sort field matches an index's leading field, the post-sort step is
skipped — `find()` walks the index in order:

```python
coll.create_index([("createdAt", -1)])
list(coll.find().sort("createdAt", -1))   # forward walk of DESC index
list(coll.find().sort("createdAt", 1))    # backward walk of DESC index
```

Single-field sort can use any matching index — single-field or compound,
ASC or DESC. Multi-field sort still falls back to in-memory `sort_docs`
(today only single-field sort is index-accelerated).

## Hints

`hint` is honoured on both `find` and `aggregate`:

| Hint value | Behaviour |
| --- | --- |
| Index name string | Walk that index |
| Key-spec dict matching an index | Walk that index |
| `"$natural"` | Force collection scan |
| `"_id_"` / `{_id: 1}` | Walk doc table order |

An unknown hint surfaces as a `BadValue` (code 2) error to the client.
The hint can also align with the sort spec to skip the post-sort step.

`aggregate` lifts a leading `$match` stage into the initial fetch's filter
so a pipeline starting with `[{$match: {...}}]` benefits from the same
index acceleration as `find`.

## TTL indexes

`expireAfterSeconds` is honoured by `Storage.prune_ttl(db, coll, *, now=None)`
which walks the collection, deletes docs whose indexed `datetime` field is
older than `now - expireAfterSeconds`, and removes their index entries.

The clock is injectable so tests can drive expiry deterministically. There
is **no background sweeper** — real MongoDB prunes every 60s; SecantusDB
requires the caller to invoke `prune_ttl` explicitly. This is the right
ergonomics for a test harness: the test that wants TTL behaviour fires the
prune itself.

```python
from secantus import SecantusDBServer
import datetime as dt

with SecantusDBServer(port=0, storage_path=":memory:") as server:
    client = MongoClient(server.uri)
    coll = client["db"]["events"]
    coll.create_index("createdAt", expireAfterSeconds=60)
    coll.insert_one({"createdAt": dt.datetime(2026, 1, 1, tzinfo=dt.UTC)})

    # Force expiry from the test:
    pruned = server.storage.prune_ttl(
        "db", "events",
        now=dt.datetime(2026, 1, 1, 0, 5, tzinfo=dt.UTC),
    )
    assert pruned == 1
```

Docs without the TTL field, with non-date values, or with values inside the
window are left untouched.

## explain

`explain` reports `IXSCAN` when an index would be used and `COLLSCAN`
otherwise:

```python
plan = coll.find({"n": 5}).explain()["queryPlanner"]["winningPlan"]
# IXSCAN: {"stage": "FETCH", "filter": ...,
#          "inputStage": {"stage": "IXSCAN", "indexName": "n_1",
#                         "keyPattern": {"n": 1}, "direction": "forward"}}
# COLLSCAN: {"stage": "COLLSCAN", "filter": ...}
```

`Storage.explain_plan(...)` mirrors `find_matching`'s routing decisions
without executing them and is exposed on the public storage API.

## Geospatial — `2d` and `2dsphere`

Both index types ship and accelerate `$geoWithin` / `$geoIntersects`
/ `$near` / `$nearSphere` plus the `$geoNear` aggregation stage.
See the dedicated [Geospatial](geospatial.md) page for the
operator-by-operator reference, doc-side shapes accepted,
distance-unit conventions across the GeoJSON / legacy-planar /
legacy-spherical spec forms, and the worked deployment example.

Quick shapes:

```python
coll.create_index([("loc", "2dsphere")])               # GeoJSON, spherical
coll.create_index([("loc", "2d")])                     # legacy [x, y] pairs, planar
coll.create_index([("loc", "2dsphere"), ("cat", 1)])   # compound geo + scalar
```

## Natural iteration order

Walking the doc table in WT-key order yields docs in MongoDB's natural
sort order: numeric for int/float/Decimal128 (with cross-type collision
preserved by the lexical-decimal encoding), chronological for `ObjectId`,
lexical for strings, etc. `update_matching(multi=False)` and `find()`
without `sort` walk in this order, matching `mongod`.

## Index acceleration paths summary

| Filter shape | Path |
| --- | --- |
| `{f: v}` (bare-equality, single field) | Single-field index, prefix scan on compound, or partial-index match |
| `{f: {$eq: v}}` / `{f: {$in: [...]}}` | Same as bare-equality |
| `{f: {$gt/$gte/$lt/$lte: v}}` | Range scan |
| `{f1: v1, f2: v2, ...}` (bare-eq) | Compound prefix exact / scan |
| `{f1: v1, f2: {$gt: v}}` | Compound prefix + trailing operator |
| `{f: array}` against a `multikey` index | Falls back to scan |
| `{partial_filter_keys, ...index_keys}` | Partial-index path |
| Sort `{f: ±1}` aligned with an index leading field | B-tree walk in (reversed) order, no post-sort |
| `hint` | Forces a specific index / `$natural` |

## Acceleration summary across index types

| Index type | Filter / sort shape | Path |
|---|---|---|
| Single-field B-tree | equality / `$in` / `$gt`/`$gte`/`$lt`/`$lte` / sort | IXSCAN |
| Compound B-tree | bare-eq prefix; eq prefix + trailing operator on the next column; ASC/DESC mix; multi-field sort that matches or exactly inverts the key spec | IXSCAN, no post-sort |
| Partial | when the user filter implies the partial expression | IXSCAN |
| Multikey | equality / `$in` / range on the array column; whole-array equality goes through the canonical key entry | IXSCAN (sort-acceleration skipped — multikey doesn't preserve a single natural order) |
| TTL | timestamp range + prune sweeper drives expiry | IXSCAN |
| `2dsphere` | `$geoWithin` / `$geoIntersects` / `$near` / `$nearSphere` / `$geoNear` via S2 cell-covering scan | IXSCAN — see [Geospatial](geospatial.md) |
| `2d` | same operators via quadtree-decomposed Z-order range scan over a bit-interleaved geohash | IXSCAN — see [Geospatial](geospatial.md) |
| Compound geo + scalar | geo column drives the cell scan; trailing scalar(s) filtered at the verifier step | IXSCAN |

## What's still missing

- **Per-index collation** — `createIndexes` stores the option on
  the index spec, but entries are written in BSON codepoint order
  (no collation-aware sort key). Queries that carry `collation`
  fall through to COLLSCAN by design. The per-query collation
  infrastructure does honour collation for `find` / `count` /
  `distinct` / `findAndModify` — it's just the index-side enforcement
  that's missing.
- **TTL background sweeper** — `prune_ttl` is opt-in; no 60-second
  cadence sweeper. Real mongod runs one; for an in-process test
  surrogate the explicit-call ergonomics suit the audience better.
- **Text / hashed indexes** — out of scope (no full-text engine; no
  practical workload pulling hashed shard-key behaviour into an
  in-process surrogate).