Snapshots and persistence

A built catalog is an in-memory object. To ship it, reuse it across processes, or rebuild it lazily, you persist it as a snapshot — a portable directory of the entries plus their serialized indexes. Parsimony writes snapshots atomically, stamps each one with an integrity digest, and can load them back into a fully searchable Catalog. Snapshots live behind a small URL scheme: a local file:// path or a Hugging Face hf:// dataset.

Saving a catalog

Catalog.save is a coroutine. The catalog must be built first — saving a dirty catalog raises a plain ValueError, the same gate that guards search.

import asyncio
from parsimony.catalog import BM25Index, Catalog, Entity


async def main():
    catalog = Catalog("solo", indexes={"title": BM25Index()}, default_field="title")
    catalog.set_entities(
        [
            Entity(namespace="solo", code="A", title="alpha title"),
            Entity(namespace="solo", code="B", title="beta title"),
        ]
    )
    await catalog.build()
    await catalog.save("file:///tmp/parsimony/snapshot", builder="my-job")


asyncio.run(main())

The optional builder keyword is a free-form identifier of the script or job that produced the snapshot. It is recorded in the manifest's build.builder field and is otherwise inert.

Build before you save

save() calls the same dirty-state check as search(). After any mutation (set_entities, set_index, set_indexes, update_indexes, delete_many) the catalog is marked dirty and save() raises ValueError("Catalog entries or indexes changed — call await catalog.build() before it can be saved") until you re-run await catalog.build().
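The gate can be pictured as a single flag that every mutation sets and only build() clears. The miniature below is a sketch of that idea, not Parsimony's implementation: the GatedCatalog class, its synchronous methods, and the initial dirty state are all assumptions made for illustration.

```python
class GatedCatalog:
    """Hypothetical miniature of the dirty/build gate; not Parsimony's class."""

    def __init__(self):
        self._entities = []
        self._dirty = True  # assumption: a fresh catalog counts as not yet built

    def set_entities(self, entities):
        self._entities = list(entities)
        self._dirty = True  # any mutation invalidates the previous build

    def build(self):
        # The real build() is a coroutine that rebuilds indexes; here it only
        # clears the flag.
        self._dirty = False

    def _require_built(self, action):
        if self._dirty:
            raise ValueError(f"call build() before {action}")

    def save(self):
        self._require_built("save")
        return len(self._entities)  # stand-in for writing a snapshot


cat = GatedCatalog()
cat.set_entities(["A", "B"])
try:
    cat.save()
except ValueError as exc:
    print("blocked:", exc)

cat.build()
print(cat.save())  # -> 2
```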

BM25 / vector backends are an optional extra

The BM25Index example above builds and searches only with the standard extra installed (it pulls in rank-bm25). The snapshot machinery itself — atomic writes, the manifest, the integrity digest, URL dispatch — is pure parsimony-core. See Installation for the extras matrix.

Loading a catalog

Catalog.load is an async classmethod. It returns a non-dirty, immediately searchable catalog — there is no build() call needed after a load.

import asyncio
from parsimony.catalog import Catalog


async def main():
    catalog = await Catalog.load("file:///tmp/parsimony/snapshot")
    hits, diag = await catalog.search("alpha", limit=5)
    print(diag.mode, [m.code for m in hits])


asyncio.run(main())

A loaded catalog is reconstructed exactly from what was serialized: its default_field and index set come from the manifest, and the framework's default-index policy is forced off. That means calling build() on a loaded catalog will not re-derive metadata-key indexes — the serialized indexes are authoritative. (See Indexes for the default policy.)

The snapshot layout

A snapshot is a directory with three parts:

snapshot/
├── entries.parquet      # entities as rows; zstd-compressed when non-empty
├── indexes/             # one subdirectory per index field
│   ├── code/
│   └── title/
└── meta.json            # the CatalogMeta manifest

| Path | Contents |
| --- | --- |
| entries.parquet | PyArrow table with columns namespace, code, title, metadata_json. Each entity's metadata is JSON-encoded into metadata_json. Non-empty tables use zstd compression; an empty catalog writes the four-column schema with no rows. |
| indexes/<field>/ | One directory per configured index, written by that index's own save(). The field name is the logical search surface. |
| meta.json | The CatalogMeta manifest (see below). |

Only JSON-serializable metadata survives a round-trip

metadata is stored as a JSON string in the metadata_json column. Anything not JSON-encodable will fail to serialize, and non-JSON-native types are reconstructed as their JSON equivalents on load.
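The effect of a JSON round-trip is easy to see with the standard library alone. This sketch assumes the encoding behaves like Python's json module with default settings; Parsimony's own encoder may differ in details, and the metadata keys are made up for illustration.

```python
import json
from datetime import date

# Hypothetical entity metadata; the keys are illustrative only.
metadata = {"tags": ("a", "b"), "weights": {1: 0.5}, "active": True}

encoded = json.dumps(metadata)  # roughly what a metadata_json cell would hold
decoded = json.loads(encoded)

print(decoded["tags"])     # -> ['a', 'b']   (tuple came back as a list)
print(decoded["weights"])  # -> {'1': 0.5}   (int key came back as a str key)

# Types with no JSON equivalent fail at encode time instead of being dropped.
try:
    json.dumps({"as_of": date(2024, 1, 1)})
except TypeError as exc:
    print("not serializable:", exc)
```

Converting such values to strings or numbers before calling set_entities keeps the stored and reloaded metadata identical.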

The manifest

meta.json deserializes to a CatalogMeta model. Import it (and the nested BuildInfo) from parsimony.catalog.storage; neither is a top-level name.

from parsimony.catalog.storage import read_meta

meta = read_meta("/tmp/parsimony/snapshot")
print(meta.name, meta.entry_count, meta.index_fields)

CatalogMeta fields:

| Field | Type | Meaning |
| --- | --- | --- |
| schema_version | Literal[1] | Snapshot format version. Pinned to 1; loading any other value raises ValueError. |
| name | str | The catalog's normalized name. |
| namespaces | list[str] | Sorted distinct namespaces across the entries. |
| entry_count | int | Number of entities (>= 0). |
| index_fields | dict[str, str] | Maps each index field to its kind string: "bm25", "vector", "hybrid", or "dis_max". |
| default_field | str \| None | The broad-search surface, if one was configured. |
| build | BuildInfo | Provenance for this snapshot (see below). |

BuildInfo fields:

| Field | Type | Meaning |
| --- | --- | --- |
| built_at | datetime | UTC timestamp, defaulted at construction. |
| parsimony_version | str \| None | Optional library version stamp; not auto-populated. |
| builder | str \| None | The free-form identifier passed to save(..., builder=...). |
| content_sha256 | str | Integrity digest of every snapshot file except meta.json. Empty string when not computed. |
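To make the two field tables concrete, the sketch below parses a made-up manifest with the standard library and applies the two constraints stated above (schema_version pinned to 1, entry_count >= 0). It assumes the JSON keys mirror the field names one-to-one; the sample values, the timestamp format, and the check_manifest helper are illustrative and are not taken from a real snapshot or from Parsimony's code.

```python
import json

# Illustrative manifest only: key names mirror the field tables above, and the
# values (including the timestamp format) are invented for this sketch.
SAMPLE = """
{
  "schema_version": 1,
  "name": "solo",
  "namespaces": ["solo"],
  "entry_count": 2,
  "index_fields": {"title": "bm25"},
  "default_field": "title",
  "build": {"built_at": "2024-01-01T00:00:00Z", "parsimony_version": null,
            "builder": "my-job", "content_sha256": ""}
}
"""


def check_manifest(text: str) -> dict:
    """Hypothetical helper: parse a manifest and enforce the documented limits."""
    meta = json.loads(text)
    if meta.get("schema_version") != 1:
        raise ValueError(f"unsupported schema_version: {meta.get('schema_version')!r}")
    if meta["entry_count"] < 0:
        raise ValueError("entry_count must be >= 0")
    return meta


meta = check_manifest(SAMPLE)
print(meta["name"], meta["entry_count"], meta["index_fields"])
# -> solo 2 {'title': 'bm25'}
```

For real snapshots, read_meta returns a validated CatalogMeta, so a hand-rolled check like this is only useful for inspecting a manifest without importing the library.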

Atomic writes and integrity

Saving never leaves a half-written snapshot in place. The catalog is written to a sibling temp directory (<name>.<pid>.<uuid>.tmp next to the target), the manifest is written last, any existing target is removed, and the temp directory is renamed onto the target. If anything fails mid-write, the temp directory is cleaned up and the original target is untouched. Because the temp directory is a sibling of the target, the target's parent directory must be writable.
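The write-then-swap sequence described above can be sketched with the standard library. This is an illustration of the pattern under stated assumptions, not Parsimony's code: atomic_write_dir and the file names are hypothetical, and the caller is expected to pass meta.json last so the manifest is written after everything else.

```python
import os
import shutil
import tempfile
import uuid
from pathlib import Path
from typing import Mapping


def atomic_write_dir(target: Path, files: Mapping[str, bytes]) -> None:
    """Write files into a sibling temp directory, then swap it onto target."""
    target = Path(target)
    # Sibling of the target, so the final rename stays on one filesystem.
    tmp = target.with_name(f"{target.name}.{os.getpid()}.{uuid.uuid4().hex}.tmp")
    tmp.mkdir(parents=True)
    try:
        for relpath, data in files.items():  # insertion order: manifest last
            path = tmp / relpath
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(data)
        if target.exists():
            shutil.rmtree(target)  # drop the previous snapshot
        os.rename(tmp, target)     # publish the complete directory
    except BaseException:
        shutil.rmtree(tmp, ignore_errors=True)  # never leave a partial temp dir
        raise


root = Path(tempfile.mkdtemp())
atomic_write_dir(root / "snapshot", {"entries.bin": b"rows", "meta.json": b"{}"})
print(sorted(p.name for p in (root / "snapshot").iterdir()))
# -> ['entries.bin', 'meta.json']
```

Because every file lands in the temp directory first, a failure while writing leaves the existing target exactly as it was.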

The manifest's content_sha256 is computed over every file in the snapshot except meta.json itself: each file is hashed, the relpath:hexdigest lines are sorted and concatenated, and that is hashed again. On load, if content_sha256 is non-empty, the digest is recomputed and a mismatch raises ValueError:

Catalog snapshot integrity check failed for /tmp/parsimony/snapshot2:
  expected sha256: ...
  actual sha256:   ...
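The digest recipe (hash each file, sort the relpath:hexdigest lines, hash the result) can be sketched as follows. The per-file SHA-256, the newline joiner, and the UTF-8 encoding are assumptions, so the value produced here is not guaranteed to match the content_sha256 Parsimony writes; content_digest and index.bin are hypothetical names.

```python
import hashlib
import tempfile
from pathlib import Path


def content_digest(snapshot_dir: Path) -> str:
    """Digest of every file under snapshot_dir except the top-level meta.json."""
    root = Path(snapshot_dir)
    lines = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if rel == "meta.json":
            continue  # the manifest stores the digest, so it cannot be part of it
        lines.append(f"{rel}:{hashlib.sha256(path.read_bytes()).hexdigest()}")
    joined = "\n".join(sorted(lines))  # assumption: newline-joined, UTF-8
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


snap = Path(tempfile.mkdtemp())
(snap / "indexes" / "title").mkdir(parents=True)
(snap / "entries.parquet").write_bytes(b"rows")
(snap / "indexes" / "title" / "index.bin").write_bytes(b"postings")
(snap / "meta.json").write_text("{}")

before = content_digest(snap)
(snap / "meta.json").write_text('{"edited": true}')
print(content_digest(snap) == before)  # -> True: manifest edits are ignored
(snap / "entries.parquet").write_bytes(b"tampered")
print(content_digest(snap) == before)  # -> False: data changes are detected
```

Sorting the lines makes the digest independent of directory traversal order, which varies between filesystems.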

Integrity is opt-in by presence

The check runs only when build.content_sha256 is non-empty. save() always populates it, so snapshots Parsimony writes are always verified on load. A hand-written manifest that leaves the digest empty skips the check.

Which indexes are serializable

save() serializes only the four built-in index types — BM25Index, VectorIndex, HybridIndex, and DisMaxIndex. Any other object satisfying the CatalogIndex protocol is treated as runtime-only:

TypeError: Catalog index for field 'custom' is runtime-only and cannot be serialized

On load, indexes are reconstructed by the kind string recorded in meta.index_fields; an unrecognized kind raises ValueError. See Indexes for the index types and their kind values.

URL schemes

Both save() and load() dispatch on the URL scheme. Two schemes are supported; any other raises ValueError("Unsupported catalog URL scheme '<scheme>'. Supported: ['file', 'hf']").

| Scheme | Form | Behavior |
| --- | --- | --- |
| file:// (or a bare path) | file:///abs/path/snapshot, data/cat | Reads/writes a local directory. A bare path with no :// is treated as file. |
| hf:// | hf://<org>/<repo>[/<sub>...] | A Hugging Face dataset repo. Lazily imports huggingface_hub; not needed for file://. |

s3:// is not supported by snapshots

Only file and hf are recognized by save() / load(). Passing s3://... raises the unsupported-scheme ValueError regardless of any installed extra.

Parsing URLs

The pure parser lives in parsimony.catalog.urls. It splits a URL into a ParsedCatalogURL (scheme, root, sub):

from parsimony.catalog.urls import parse_catalog_url

p = parse_catalog_url("hf://org/repo/nested/bundle")
print(p.scheme, p.root, p.sub)   # -> hf org/repo nested/bundle

q = parse_catalog_url("file:///tmp/repo/bundle_a")
print(q.scheme, q.root, q.sub)   # -> file /tmp/repo/bundle_a (sub is empty)

r = parse_catalog_url("data/cat")
print(r.scheme, r.root, r.sub)   # -> file data/cat (bare path -> file)

The semantics differ between schemes:

  • file:// has no sub-path semantics. The entire path goes into root and sub stays empty; the snapshot directory is addressed directly. A bare absolute path is absolutized; a bare relative path is left as the raw string. Multi-bundle local layouts are composed by the caller.
  • hf:// decomposes into root = "<org>/<repo>" and sub = the remaining (possibly nested) path. A missing org or repo segment raises ValueError.
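The two rules above can be restated as a small stand-alone parser. This is a simplified sketch for reading the semantics, not the code in parsimony.catalog.urls: ParsedURL and parse_sketch are hypothetical names, path normalization is omitted, and raising on an unknown scheme at parse time (rather than at dispatch) is an assumption.

```python
from typing import NamedTuple


class ParsedURL(NamedTuple):
    """Hypothetical stand-in for ParsedCatalogURL."""
    scheme: str
    root: str
    sub: str


def parse_sketch(url: str) -> ParsedURL:
    if "://" not in url:
        return ParsedURL("file", url, "")  # bare path: treated as file
    scheme, _, rest = url.partition("://")
    if scheme == "file":
        return ParsedURL("file", rest, "")  # whole path is the root, no sub
    if scheme == "hf":
        parts = [p for p in rest.split("/") if p]
        if len(parts) < 2:
            raise ValueError(f"hf:// URL needs <org>/<repo>, got {url!r}")
        return ParsedURL("hf", "/".join(parts[:2]), "/".join(parts[2:]))
    raise ValueError(
        f"Unsupported catalog URL scheme {scheme!r}. Supported: ['file', 'hf']"
    )


print(parse_sketch("hf://org/repo/nested/bundle"))
# -> ParsedURL(scheme='hf', root='org/repo', sub='nested/bundle')
print(parse_sketch("file:///tmp/repo/bundle_a"))
# -> ParsedURL(scheme='file', root='/tmp/repo/bundle_a', sub='')
print(parse_sketch("data/cat"))
# -> ParsedURL(scheme='file', root='data/cat', sub='')
```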

For hf:// loads, huggingface_hub.snapshot_download pulls the dataset into the catalogs cache directory with repo_type="dataset"; a sub restricts the download to that bundle. Saving to hf:// stages the snapshot locally, then creates the repo (exist_ok=True) and uploads the folder.

Lazy load-or-build

When a catalog might be remote, missing, or cheap to rebuild, load_or_build_catalog resolves it through a three-step fallback. Import it from parsimony.catalog.search.

import asyncio
from pathlib import Path
from parsimony.catalog import BM25Index, Catalog, Entity
from parsimony.catalog.search import load_or_build_catalog


async def main(tmp: Path):
    async def build():
        c = Catalog("demo", indexes={"title": BM25Index()}, default_field="title")
        c.set_entities([Entity(namespace="demo", code="a", title="alpha widget")])
        await c.build()
        return c

    cache = tmp / "lazy-cache"
    # url is absent -> build() runs and the result is saved to the lazy cache
    cat = await load_or_build_catalog(f"file://{tmp}/missing", cache_path=cache, build=build)
    # second call hits the lazy cache; build() does not run again
    again = await load_or_build_catalog(f"file://{tmp}/missing", cache_path=cache, build=build)
    print(len(again), (cache / "meta.json").is_file())  # -> 1 True


asyncio.run(main(Path("/tmp/parsimony")))

The resolution order is:

  1. Remote URL — try Catalog.load(url). If it succeeds, return it.
  2. Local lazy cache — if the load failed because the catalog is missing, and <cache_path>/meta.json exists, load file://<cache_path>. If that cached snapshot is unreadable, log a warning and continue to step 3.
  3. Build callback — if build is supplied, call it, save the result to the lazy cache with builder="lazy", and return it. If build is None, raise CatalogNotFoundError.

Only errors that signal a missing catalog trigger the fallback: CatalogNotFoundError, FileNotFoundError, Hugging Face's RepositoryNotFoundError, or a ConnectorError whose message says "not found", "not present", or "does not exist". Any other load failure propagates unchanged.
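The three-step order can be sketched with plain callables standing in for the remote load, the lazy cache, and the build callback. Everything here is hypothetical scaffolding for illustration (the Missing exception, the load_or_build function and its parameters, the in-memory dict used as a cache); it mirrors the control flow described above, not Parsimony's internals.

```python
import asyncio
import logging

log = logging.getLogger("sketch")


class Missing(Exception):
    """Hypothetical stand-in for the family of 'missing catalog' errors."""


async def load_or_build(load_remote, cache_exists, load_cache,
                        build=None, save_cache=None):
    # Step 1: remote URL. Only a "missing" error falls through; others propagate.
    try:
        return await load_remote()
    except Missing:
        pass
    # Step 2: local lazy cache, if present; an unreadable cache is only a warning.
    if cache_exists():
        try:
            return await load_cache()
        except Exception as exc:
            log.warning("lazy cache unreadable, rebuilding: %s", exc)
    # Step 3: build callback, or give up.
    if build is None:
        raise Missing("catalog not found and no build callback supplied")
    catalog = await build()
    if save_cache is not None:
        await save_cache(catalog)
    return catalog


async def demo():
    cache, calls = {}, []

    async def load_remote():
        raise Missing("remote absent")

    async def load_cache():
        return cache["catalog"]

    async def build():
        calls.append("build")
        return ["alpha widget"]

    async def save_cache(catalog):
        cache["catalog"] = catalog

    kwargs = dict(load_remote=load_remote, cache_exists=lambda: "catalog" in cache,
                  load_cache=load_cache, build=build, save_cache=save_cache)
    first = await load_or_build(**kwargs)
    second = await load_or_build(**kwargs)  # served from the cache
    return first, second, calls


first, second, calls = asyncio.run(demo())
print(calls)  # -> ['build']
```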

CatalogNotFoundError is not a top-level name

Import it from parsimony.errors. It subclasses ConnectorError, so it carries the same agent-facing message convention as the rest of the typed errors — its default message ends with a "DO NOT retry." directive.

Caching hydrated catalogs

CatalogLRU keeps already-loaded Catalog instances in memory, keyed by URL, so repeat lookups for the same snapshot reuse one object instead of re-reading from disk. It is bounded (default size 4, must be >= 1) and asyncio.Lock-guarded.

import asyncio
from parsimony.catalog.search import CatalogLRU


async def main():
    lru = CatalogLRU(size=2)
    a = await lru.get_or_load("file:///tmp/parsimony/snapshot")
    b = await lru.get_or_load("file:///tmp/parsimony/snapshot")
    assert a is b  # same in-memory instance
    lru.clear()


asyncio.run(main())

When both cache_path and build are passed, get_or_load delegates to load_or_build_catalog; otherwise it calls Catalog.load(url) directly, mapping a missing catalog to CatalogNotFoundError. Entries beyond the configured size are evicted oldest-first.
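A bounded, lock-guarded cache of this shape can be sketched with an OrderedDict. TinyLRU and its loader parameter are hypothetical, and two behaviours are assumptions rather than documented facts: that a cache hit refreshes an entry's position, and that the lock is held for the duration of a load (which serializes concurrent loads but guarantees one load per URL).

```python
import asyncio
from collections import OrderedDict


class TinyLRU:
    """Hypothetical miniature of a URL-keyed, asyncio.Lock-guarded LRU."""

    def __init__(self, size=4):
        if size < 1:
            raise ValueError("size must be >= 1")
        self._size = size
        self._items = OrderedDict()
        self._lock = asyncio.Lock()

    async def get_or_load(self, url, loader):
        async with self._lock:
            if url in self._items:
                self._items.move_to_end(url)  # assumption: a hit refreshes recency
                return self._items[url]
            value = await loader(url)
            self._items[url] = value
            while len(self._items) > self._size:
                self._items.popitem(last=False)  # evict the oldest entry
            return value

    def clear(self):
        self._items.clear()


async def demo():
    loads = []

    async def loader(url):
        loads.append(url)
        return object()  # stand-in for a loaded catalog

    lru = TinyLRU(size=2)
    a = await lru.get_or_load("u1", loader)
    b = await lru.get_or_load("u1", loader)  # hit: same object, no reload
    await lru.get_or_load("u2", loader)
    await lru.get_or_load("u3", loader)      # over capacity: u1 is evicted
    c = await lru.get_or_load("u1", loader)  # miss: loaded again
    return a is b, a is c, loads


same, reused_after_eviction, loads = asyncio.run(demo())
print(same, reused_after_eviction, loads)
# -> True False ['u1', 'u2', 'u3', 'u1']
```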

See also

  • The Catalog — the lifecycle (build → search → save) these snapshots persist.
  • Building and searching — the dirty/build gate that save() shares.
  • Indexes — which index kinds are serializable and how kind drives reload.
  • Caching — the cache root that backs hf:// downloads and lazy caches.
  • Errors: CatalogNotFoundError and the agent-facing taxonomy.