Building and searching
A Catalog is only searchable after you build it. This page covers the
constructor, the entry and index mutators, the build gate, the search coroutine, and the
small query DSL that decides between structured (field: value) and broad (plain-text)
search.
All of the runnable examples here use BM25Index, which needs the optional
rank-bm25 backend at search time. Install it with the standard extra:
Imports
Catalog, Entity, CatalogMatch, and BM25Index are top-level
(from parsimony import Catalog), but they are also re-exported from
parsimony.catalog. For catalog-heavy code, importing everything from the submodule —
from parsimony.catalog import Catalog, Entity, BM25Index, ... — is the clearest
convention, and it is the only way to reach the names that are not top-level
(SearchDiagnostic, StructuredQuery, parse_query, and the query error types).
Constructing a catalog
from parsimony.catalog import Catalog, CatalogIndex

Catalog(
    name: str,
    *,
    indexes: dict[str, CatalogIndex] | None = None,
    default_field: str | None = None,
)
| Parameter | Default | Behavior |
|---|---|---|
| name | required | Normalized to lowercase snake_case via the namespace rule (^[a-z][a-z0-9_]*$). A name like "My Catalog" raises ValueError. |
| indexes | None | None enables the default index policy (see below). A dict gives you full control: only those indexes exist, and none are added silently. |
| default_field | None | The search-surface name used for broad (plain-text) search. If None, broad search falls back to "title" when a "title" index exists; otherwise broad search is disabled. |
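As a rough illustration of the name rule, the sketch below checks candidate names against the pattern quoted in the table. The helper is_valid_catalog_name is hypothetical and not part of parsimony, and it only covers the final pattern check; whatever normalization the library applies before that check is not modeled here.

```python
import re

# Pattern quoted in the parameter table above.
NAME_RULE = re.compile(r"^[a-z][a-z0-9_]*$")


def is_valid_catalog_name(name: str) -> bool:
    """Hypothetical helper: True if `name` already satisfies the namespace rule."""
    return NAME_RULE.fullmatch(name) is not None


print(is_valid_catalog_name("demo"))        # -> True
print(is_valid_catalog_name("test_cat"))    # -> True
print(is_valid_catalog_name("My Catalog"))  # -> False (uppercase and a space)
print(is_valid_catalog_name("9lives"))      # -> False (must start with a letter)
```

Running a check like this before constructing a catalog can surface a bad name earlier than the ValueError would.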
The keys of the indexes dict are logical search-surface names. They are what you type in
the DSL (FIELD: value) and what appears in error messages. By convention a key matches the
Entity field its index reads (code, title, or a metadata key), but a
composite index such as DisMaxIndex can expose one surface name while reading
several entity fields internally.
A freshly constructed catalog is dirty: you must call build() before searching or saving.
The default index policy
Passing indexes=None defers index selection to build time. At build(), the catalog
materializes a BM25Index for code, for title, and for every metadata
key observed across the current entries (sorted).
import asyncio
from parsimony.catalog import Catalog, Entity

async def main() -> None:
    catalog = Catalog("demo")  # indexes=None -> default policy
    catalog.set_entities(
        [Entity(namespace="demo", code="a", title="alpha", metadata={"region": "eu"})]
    )
    await catalog.build()
    print(sorted(catalog.indexes))  # -> ['code', 'region', 'title']

asyncio.run(main())
Calling any of set_index, set_indexes, or update_indexes permanently disables the
default policy — once you take manual control, build() will not re-derive metadata-key
indexes.
default_field and an explicit indexes dict
If you set default_field together with an explicit indexes dict that does not contain
it, the catalog raises BroadSearchConfigError at construction time. With the default
index policy active (indexes=None), that check is deferred to build(), where the same
error is raised if the field is still not covered.
Loading entries and managing indexes
Apart from get, which is a read-only lookup, these methods change the catalog in memory and
mark it dirty. None of them rebuilds indexes; that always happens in build().
| Method | Effect |
|---|---|
| set_entities(entries: list[Entity]) | Replace all entries. Entries are upserted by (namespace, code), so duplicate keys overwrite earlier ones rather than appending. |
| await delete_many(keys) | Remove entries by (namespace, code) pairs. Returns the count removed (0 if none matched). |
| set_index(field, index) | Replace one field index. Disables the default policy. |
| update_indexes(indexes) | Merge field indexes into the current set. Disables the default policy. |
| set_indexes(indexes) | Replace the entire index set. Disables the default policy. |
| await get(namespace, code) | Look up a single Entity by key, or None. This does not require a build. |
The entities and indexes properties return copies of the current entries and index map.
len(catalog) is the entry count.
import asyncio
from parsimony.catalog import BM25Index, Catalog, Entity

async def main() -> None:
    catalog = Catalog("series", indexes={"title": BM25Index()})
    catalog.set_entities(
        [
            Entity(namespace="series", code="A", title="alpha title"),
            Entity(namespace="series", code="B", title="beta title"),
        ]
    )
    await catalog.build()
    print(len(catalog))  # -> 2
    print((await catalog.get("series", "A")).title)  # -> alpha title

asyncio.run(main())
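The upsert and delete semantics from the table can be pictured with a plain dict keyed by (namespace, code). This is only a model of the documented behavior: the helper names and the dict-based entries are illustrative assumptions, not the catalog's actual storage.

```python
# Entries are modeled as plain dicts; the real catalog stores Entity objects.
Key = tuple[str, str]


def upsert(store: dict[Key, dict], entries: list[dict]) -> None:
    """Later entries with the same (namespace, code) overwrite earlier ones."""
    for entry in entries:
        store[(entry["namespace"], entry["code"])] = entry


def delete_many(store: dict[Key, dict], keys: list[Key]) -> int:
    """Remove matching keys and return how many were actually removed."""
    removed = 0
    for key in keys:
        if store.pop(key, None) is not None:
            removed += 1
    return removed


store: dict[Key, dict] = {}
upsert(
    store,
    [
        {"namespace": "series", "code": "A", "title": "alpha title"},
        {"namespace": "series", "code": "A", "title": "alpha title (revised)"},
        {"namespace": "series", "code": "B", "title": "beta title"},
    ],
)
print(len(store))                       # -> 2
print(store[("series", "A")]["title"])  # -> alpha title (revised)
print(delete_many(store, [("series", "B"), ("x", "y")]))  # -> 1
```

The duplicate "A" entry replaces the first one instead of adding a third row, and the unmatched key ("x", "y") does not count toward the removal total.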
The build gate
await catalog.build() validates the configuration and materializes every configured index
over the current entries. The rebuild is guarded by an asyncio.Lock, and a single shared
vector cache is threaded through all indexes in one build, so identical texts across fields
are embedded once.
Construction and every mutator (set_entities, set_index, set_indexes, update_indexes,
delete_many) mark the catalog dirty. While dirty, search() and save() raise a plain
ValueError.
Build before searching or saving
The build gate is a plain ValueError, not a custom error type. Catch it as ValueError
if you need to. get() is the only data method that skips the gate. Any change to entries
or indexes after a build requires another build().
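The gate described above is a dirty-flag pattern. The sketch below shows the idea with a hypothetical GatedStore class; it is synchronous for brevity, the error message is invented, and none of it is parsimony's actual implementation.

```python
class GatedStore:
    """Hypothetical stand-in for the dirty-flag pattern; not parsimony's Catalog."""

    def __init__(self) -> None:
        self._items: list[str] = []
        self._dirty = True  # a freshly constructed object starts dirty

    def set_items(self, items: list[str]) -> None:
        self._items = list(items)
        self._dirty = True  # every mutation invalidates the last build

    def build(self) -> None:
        # A real build would materialize indexes here.
        self._dirty = False

    def search(self, query: str) -> list[str]:
        if self._dirty:
            # Plain ValueError, mirroring the documented gate (message is made up).
            raise ValueError("call build() before searching")
        return [item for item in self._items if query in item]


store = GatedStore()
store.set_items(["alpha title", "beta title"])
try:
    store.search("alpha")
except ValueError as exc:
    print(f"blocked: {exc}")  # -> blocked: call build() before searching

store.build()
print(store.search("alpha"))  # -> ['alpha title']
```

Because the flag is reset by every mutator, a build followed by any further change puts the object back behind the gate.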
Searching
async def search(
    self,
    query: str,
    limit: int,
    *,
    namespaces: list[str] | None = None,
) -> tuple[list[CatalogMatch], SearchDiagnostic]
query and limit are both required and can be passed positionally or by keyword; namespaces is
keyword-only. The coroutine returns a tuple of the ranked matches and a diagnostic describing how
the query ran.
- limit caps the number of results. (Whole tie-groups can slightly exceed it; see Ranking and fusion.)
- namespaces, when given, post-filters the ranking to entries whose normalized namespace is in the allowed set.
search() first calls the build gate, then parses the query to choose between structured and
broad mode.
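To make the limit and namespaces behavior concrete, here is one plausible reading of it as a standalone function. Everything in it is an assumption for illustration: the function name, the tuple layout, filtering before truncation, and the tie rule (keep every row that scores at least as well as the last row inside the limit). The authoritative rules are on the Ranking and fusion page, and namespace normalization is skipped here.

```python
from typing import Optional

Row = tuple[str, str, float]  # (namespace, code, score), best score first


def filter_and_limit(
    ranked: list[Row], limit: int, namespaces: Optional[list[str]] = None
) -> list[Row]:
    """Post-filter by namespace, then cap at `limit` without splitting a tie."""
    if namespaces is not None:
        allowed = set(namespaces)
        ranked = [row for row in ranked if row[0] in allowed]
    if limit <= 0:
        return []
    if len(ranked) <= limit:
        return ranked
    cutoff = ranked[limit - 1][2]
    # Rows tied with the boundary score are kept, so the result may exceed `limit`.
    return [row for row in ranked if row[2] >= cutoff]


ranked = [
    ("series", "A", 3.0),
    ("other", "X", 2.5),
    ("series", "B", 2.0),
    ("series", "C", 2.0),
    ("series", "D", 1.0),
]
hits = filter_and_limit(ranked, limit=2, namespaces=["series"])
print([code for _, code, _ in hits])  # -> ['A', 'B', 'C']
```

With limit=2 the result has three rows because B and C share the boundary score, which is the kind of overshoot the parenthetical note describes.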
Broad search
If the query does not start with a field: prefix, it is a broad query against the
catalog's resolved default field. The query is scored against that one index and the
diagnostic reports mode="broad".
import asyncio
from parsimony.catalog import BM25Index, Catalog, Entity

async def main() -> None:
    catalog = Catalog("artifact", indexes={"title": BM25Index()})
    catalog.set_entities(
        [
            Entity(namespace="series", code="A", title="alpha title"),
            Entity(namespace="series", code="B", title="beta title"),
        ]
    )
    await catalog.build()
    hits, diag = await catalog.search("alpha", limit=1)
    print(diag.mode, hits[0].code)  # -> broad A

asyncio.run(main())
If no default field can be resolved (default_field is unset and there is no "title" index),
a plain-text query raises BroadSearchUnavailableError.
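The fallback rule for the broad-search field can be written down in a few lines. The function below is a hypothetical restatement of the documented behavior, not library code, and it does not model the separate BroadSearchConfigError check for an explicit default_field that no index covers.

```python
from typing import Optional


def resolve_default_field(
    default_field: Optional[str], index_names: set[str]
) -> Optional[str]:
    """Hypothetical helper mirroring the documented fallback for broad search."""
    if default_field is not None:
        return default_field  # an explicit setting wins
    if "title" in index_names:
        return "title"  # implicit fallback when a "title" index exists
    return None  # broad search is disabled


print(resolve_default_field(None, {"title", "code"}))            # -> title
print(resolve_default_field("description", {"description"}))     # -> description
print(resolve_default_field(None, {"ref_area", "icp_item"}))     # -> None
```

The last case corresponds to a catalog that only supports structured search, which is where a plain-text query would fail.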
The structured query DSL
A query is structured if and only if it matches the regex ^\s*\w+\s*: — that is, it begins
with a word followed by a colon. The grammar:
- && separates clauses, which are ANDed together (a result must satisfy every clause).
- Within one clause, FIELD: v1, v2 lists values separated by commas, which are ORed (any value matching contributes).
Each clause field must have a configured index, otherwise the parse raises
UnknownIndexedFieldError. Scoring within a clause keeps the maximum positive score per row
across the OR values; across clauses, the surviving rows (the intersection) have their clause
scores summed.
import asyncio
from parsimony.catalog import BM25Index, Catalog, Entity

async def main() -> None:
    catalog = Catalog("test_cat")
    catalog.set_indexes(
        {"title": BM25Index(), "ref_area": BM25Index(), "icp_item": BM25Index()}
    )
    catalog.set_entities(
        [
            Entity(namespace="ns", code="A", title="series a",
                   metadata={"ref_area": "Germany", "icp_item": "energy"}),
            Entity(namespace="ns", code="B", title="series b",
                   metadata={"ref_area": "Italy", "icp_item": "energy"}),
            Entity(namespace="ns", code="C", title="series c",
                   metadata={"ref_area": "Germany", "icp_item": "food"}),
        ]
    )
    await catalog.build()
    res, diag = await catalog.search("ref_area: Germany && icp_item: energy", limit=5)
    print(diag.mode)  # -> structured
    print({m.code for m in res})  # -> {'A'}

asyncio.run(main())
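The scoring rule (maximum positive score within a clause, sum across the intersecting clauses) can be sketched independently of any index. The function names and the hand-picked scores below are illustrative assumptions; real scores come from the configured indexes.

```python
def clause_scores(per_value: list[dict[str, float]]) -> dict[str, float]:
    """OR within a clause: keep each row's best positive score."""
    best: dict[str, float] = {}
    for scores in per_value:
        for row, score in scores.items():
            if score > 0 and score > best.get(row, 0.0):
                best[row] = score
    return best


def combine_clauses(clauses: list[dict[str, float]]) -> dict[str, float]:
    """AND across clauses: intersect the rows, then sum their clause scores."""
    if not clauses:
        return {}
    surviving = set(clauses[0])
    for clause in clauses[1:]:
        surviving &= set(clause)
    return {row: sum(clause[row] for clause in clauses) for row in surviving}


# "ref_area: Germany, Italy && icp_item: energy" with made-up per-value scores.
ref_area = clause_scores([{"A": 1.0, "C": 1.0}, {"B": 1.5, "A": 0.25}])
icp_item = clause_scores([{"A": 0.5, "B": 0.5}])
print(sorted(ref_area.items()))  # -> [('A', 1.0), ('B', 1.5), ('C', 1.0)]
print(sorted(combine_clauses([ref_area, icp_item]).items()))
# -> [('A', 1.5), ('B', 2.0)]
```

Row A keeps 1.0 rather than 0.25 inside the first clause, and row C drops out entirely because it has no score in the second clause.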
A bare field token is still broad
The structured trigger requires a colon. A query like ref_area (no colon) does not match
the regex and is treated as a broad query against the default field, not a structured one.
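The trigger is easy to test in isolation with the standard re module. The snippet applies the regex quoted above to a few sample queries; is_structured is a hypothetical helper name, and the real parser goes on to validate fields and values after this check.

```python
import re

# The structured-query trigger quoted in this section.
STRUCTURED_TRIGGER = re.compile(r"^\s*\w+\s*:")


def is_structured(query: str) -> bool:
    """True when the query begins with a single word followed by a colon."""
    return STRUCTURED_TRIGGER.match(query) is not None


print(is_structured("ref_area: Germany"))       # -> True
print(is_structured("  title : alpha"))         # -> True
print(is_structured("ref_area"))                # -> False (no colon)
print(is_structured("energy prices: outlook"))  # -> False (two words before the colon)
```

The last case shows that a colon later in free text does not turn a plain-text query into a structured one.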
The DSL parser is also available directly when you want to inspect or validate a query without running it:
from parsimony.catalog import parse_query, StructuredQuery
parsed = parse_query("ref_area: Germany, Italy && freq: M", known_fields={"ref_area", "freq"})
print(parsed.clauses)
# -> [('ref_area', ['Germany', 'Italy']), ('freq', ['M'])]
# A plain-text query parses to None (broad).
print(parse_query("inflation", known_fields={"ref_area"})) # -> None
parse_query(q, known_fields) returns None for a broad query, a StructuredQuery
(a frozen dataclass with a clauses: list[tuple[str, list[str]]] field) for a structured one,
raises ValueError for a malformed clause (empty field, no values), and
UnknownIndexedFieldError when a clause names a field not in known_fields.
Reading results
Each match is a CatalogMatch — a Pydantic model carrying the resolved entity fields plus the
final relevance score:
| Field | Type | Notes |
|---|---|---|
| namespace | str | Re-normalized to lowercase snake_case. |
| code | str | Trimmed, non-empty. |
| title | str | Trimmed, non-empty. |
| score | float | The fused/ranked score; higher is better. |
| metadata | dict[str, Any] | Shallow copy of the entity's metadata. |
The second tuple element is a SearchDiagnostic, whose mode is the literal "broad" or
"structured" and whose notes is a list of strings (empty by default).
import asyncio
from parsimony.catalog import BM25Index, Catalog, Entity

async def main() -> None:
    catalog = Catalog("demo", indexes={"title": BM25Index()})
    catalog.set_entities([Entity(namespace="demo", code="A", title="alpha title")])
    await catalog.build()
    matches, diag = await catalog.search("alpha", limit=5)
    top = matches[0]
    print(top.namespace, top.code, top.title)  # -> demo A alpha title
    print(diag.mode)  # -> broad

asyncio.run(main())
Query errors
All three query errors subclass ValueError, so a broad except ValueError catches them, or
you can match by type. Import them from parsimony.catalog.
| Error | When raised |
|---|---|
| UnknownIndexedFieldError | A structured clause references a field with no configured index. Raised during query parsing. |
| BroadSearchUnavailableError | A plain-text query is issued but no broad-search (default) field resolves. Raised at search time. |
| BroadSearchConfigError | default_field is set but no index covers it. Raised at construction (with an explicit indexes dict) or at build() (under the default policy). |
from parsimony.catalog import (
    BroadSearchConfigError,
    BroadSearchUnavailableError,
    UnknownIndexedFieldError,
)

for exc in (UnknownIndexedFieldError, BroadSearchUnavailableError, BroadSearchConfigError):
    assert issubclass(exc, ValueError)
Config-time vs query-time
BroadSearchConfigError is a configuration error (your default_field points at a
field with no index). BroadSearchUnavailableError is a query-time error (a plain-text
query against a catalog that only supports structured search). They are easy to confuse in
a broad except ValueError.
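One way to keep the two apart is to list the specific handlers before the generic one. The sketch below uses locally defined stand-in classes so that it runs without the library installed; in real code you would import the error types from parsimony.catalog instead, and the suggested reactions are only examples.

```python
class BroadSearchConfigError(ValueError):
    """Stand-in for the library error of the same name."""


class BroadSearchUnavailableError(ValueError):
    """Stand-in for the library error of the same name."""


def classify(exc: ValueError) -> str:
    """Show that handler order decides which branch sees the error."""
    try:
        raise exc
    except BroadSearchConfigError:
        return "configuration: default_field has no index"
    except BroadSearchUnavailableError:
        return "query-time: no broad-search field resolves"
    except ValueError:
        # Reached only by other ValueErrors, for example the build gate.
        return "other ValueError"


print(classify(BroadSearchConfigError("demo")))       # -> configuration: default_field has no index
print(classify(BroadSearchUnavailableError("demo")))  # -> query-time: no broad-search field resolves
print(classify(ValueError("demo")))                   # -> other ValueError
```

If the generic except ValueError came first, both specific errors would land in it and the distinction would be lost.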
Sparse fields and empty results
Indexes read entity fields through the catalog's field-extraction rules: entries that are
missing a metadata field, or that have empty-string values for it, contribute no postings to
that index. A broad search over a description index therefore returns only entries that
actually have a non-empty description, and an entirely empty index builds fine and returns
[]. See Entities for the exact extraction rules.
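A simplified model of that behavior: only entries with a non-empty value for the field produce a posting. The function and the dict-based entries are illustrative assumptions, and treating only non-empty strings as indexable is a simplification; the exact extraction rules are the ones described on the Entities page.

```python
def postings_for(entries: list[dict], field: str) -> dict[str, str]:
    """Map entry code -> text for entries that actually carry `field`."""
    postings: dict[str, str] = {}
    for entry in entries:
        value = entry.get("metadata", {}).get(field)
        # Missing keys and empty strings contribute nothing to the index.
        if isinstance(value, str) and value != "":
            postings[entry["code"]] = value
    return postings


entries = [
    {"code": "A", "metadata": {"description": "monthly energy prices"}},
    {"code": "B", "metadata": {"description": ""}},
    {"code": "C", "metadata": {}},
]
print(postings_for(entries, "description"))  # -> {'A': 'monthly energy prices'}
print(postings_for(entries, "region"))       # -> {}
```

Only entry A can ever match a search on description, and a field that no entry carries yields an empty but valid posting set.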
See also
- The Catalog: the catalog lifecycle at a glance
- Indexes: BM25Index, VectorIndex, HybridIndex, DisMaxIndex, and the selection policies
- Entities: the Entity model and how fields become searchable text
- Snapshots and persistence: save/load and the build gate on save