Entities¶
An Entity is the unit a catalog indexes and returns: a normalized,
discoverable identity made of a namespace, a code, a title, and an open
metadata dictionary. This page covers the Entity model and its sibling
CatalogMatch, the normalization helpers that enforce identity rules, the
field-extraction helpers that decide what indexes actually read, and the two ways
you turn a DataFrame into entities.
The Entity model¶
Entity is a Pydantic v2 model with exactly four fields. It is a top-level export,
but for catalog-heavy code the clearest convention is to import it from
parsimony.catalog.
from parsimony.catalog import Entity
e = Entity(
    namespace="fred",
    code="UNRATE",
    title="Unemployment Rate",
    metadata={"frequency": "M", "tags": ["labor", "rates"]},
)
| Field | Type | Required | Notes |
|---|---|---|---|
| namespace | str | yes | Lowercase snake_case identity scope. Normalized on construction. |
| code | str | yes | The entity's identifier within the namespace. Trimmed; otherwise preserved verbatim. |
| title | str | yes | Human-readable label. Trimmed; must be non-empty. |
| metadata | dict[str, Any] | no | Open key/value space; defaults to {}. |
The model is configured with extra="forbid". Any keyword that is not one of the
four declared fields raises a pydantic.ValidationError at construction. There is
no tags, description, or frequency field on the model: metadata is the only
place for anything beyond identity.
from pydantic import ValidationError
from parsimony.catalog import Entity
# tags / description are NOT model fields
try:
    Entity(namespace="fred", code="X", title="T", description="old")
except ValidationError:
    pass  # extra='forbid'
# put them in metadata instead
e = Entity(namespace="fred", code="X", title="T", metadata={"description": "ok", "tags": ["a"]})
assert "tags" not in Entity.model_fields
extra='forbid' is strict
Migrating older records that carried top-level tags or description keys will
raise ValidationError. Move those keys into metadata before constructing the
entity. Parsimony does not silently drop unknown fields.
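One way to do that migration can be sketched with plain dictionaries, so it runs without Parsimony installed. The helper name migrate_legacy_record is hypothetical and not part of the library. It assumes legacy records are flat dicts, keeps the declared fields, and folds every other top-level key into metadata. Letting existing metadata entries win over folded keys is a choice made for this sketch, not something Parsimony prescribes.

```python
from typing import Any

# The four fields the Entity model declares, per the table above.
ENTITY_FIELDS = ("namespace", "code", "title", "metadata")


def migrate_legacy_record(record: dict[str, Any]) -> dict[str, Any]:
    """Fold unknown top-level keys into metadata (hypothetical helper)."""
    extras = {k: v for k, v in record.items() if k not in ENTITY_FIELDS}
    # Existing metadata entries take precedence over folded legacy keys.
    metadata = {**extras, **dict(record.get("metadata") or {})}
    identity = {k: record[k] for k in ("namespace", "code", "title") if k in record}
    return {**identity, "metadata": metadata}


legacy = {
    "namespace": "fred",
    "code": "X",
    "title": "T",
    "description": "old",
    "tags": ["a"],
}
kwargs = migrate_legacy_record(legacy)
# kwargs now holds only declared fields and could be passed as Entity(**kwargs).
```

The result contains nothing that extra="forbid" would reject, so validation errors that remain point at real identity problems rather than leftover legacy keys.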
Field validators¶
Three field_validators run when an Entity is constructed. They delegate to the
standalone normalization helpers described below, so the same rules apply whether you
build an Entity directly or normalize a value by hand.
- namespace is passed through normalize_namespace: trimmed, then required to match ^[a-z][a-z0-9_]*$ (a lowercase letter first, followed by lowercase letters, digits, or underscores; never empty).
- code is passed through normalize_entity_code: trimmed and required to be non-empty. It is deliberately permissive otherwise, so connector-native identifiers survive unchanged.
- title is trimmed and required to be non-empty (ValueError: title must be non-empty).
CatalogMatch¶
CatalogMatch is the resolved search result returned by Catalog.search. It mirrors
the three string fields of Entity — with the same three validators — keeps the open
metadata dict, and adds a required score: float.
| Field | Type | Notes |
|---|---|---|
| namespace | str | Re-normalized via normalize_namespace. |
| code | str | Re-normalized via normalize_entity_code. |
| title | str | Trimmed, non-empty. |
| score | float | Final relevance score. Required. |
| metadata | dict[str, Any] | Defaults to {}. Like Entity, extra="forbid". |
You rarely build a CatalogMatch yourself — the catalog does it during ranking — but
the adapter is public. catalog_match_from_entity lives in parsimony.catalog.models
(not re-exported at the catalog top level), and its score argument is keyword-only.
from parsimony.catalog import Entity, CatalogMatch
from parsimony.catalog.models import catalog_match_from_entity
e = Entity(namespace="fred", code="UNRATE", title="Unemployment", metadata={"freq": "M"})
m = catalog_match_from_entity(e, score=0.87) # score is keyword-only
assert isinstance(m, CatalogMatch)
assert (m.namespace, m.code, m.title, m.score) == ("fred", "UNRATE", "Unemployment", 0.87)
assert m.metadata is not e.metadata # shallow copy via dict(entity.metadata)
Shallow copy
The adapter copies metadata with dict(entity.metadata). Mutating the match's
top-level metadata dict does not touch the entity's, but nested mutable values
(a list or dict inside metadata) are shared between the two.
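The sharing is ordinary dict(...) behavior and can be observed without Parsimony. The sketch below uses plain dictionaries as stand-ins for the entity's and the match's metadata. The copy.deepcopy call is one option if you need full isolation; it is a suggestion, not something the adapter does.

```python
import copy

entity_metadata = {"freq": "M", "tags": ["labor"]}

# What the adapter is described as doing: a shallow copy.
match_metadata = dict(entity_metadata)

match_metadata["freq"] = "Q"            # top-level change stays local
match_metadata["tags"].append("rates")  # nested list is shared with the original

# A deep copy isolates nested values too.
isolated = copy.deepcopy(entity_metadata)
isolated["tags"].append("extra")
```

After this runs, entity_metadata still has freq "M", but its tags list has gained "rates" because both dictionaries point at the same list object.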
Identity normalization helpers¶
These functions are the building blocks behind the validators. Import them from
parsimony.catalog or parsimony.entity — both expose them. The full set of
helpers is in parsimony.entity.
normalize_namespace(value) -> str¶
Trims, then enforces ^[a-z][a-z0-9_]*$. Raises ValueError("Value must be
non-empty") on a blank string and ValueError("Value must be lowercase snake_case
(letters, numbers, underscores)") on a pattern mismatch.
from parsimony.entity import normalize_namespace
assert normalize_namespace("fred") == "fred"
# normalize_namespace("Bad Code") -> ValueError (not snake_case)
# normalize_namespace("1bad") -> ValueError (starts with a digit)
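To make the rule concrete, here is a minimal stand-in written from the description above. The pattern and the two messages are taken from this page. The names sketch_normalize_namespace and is_rejected are hypothetical, and the real implementation may differ in details.

```python
import re

_NAMESPACE_RE = re.compile(r"[a-z][a-z0-9_]*")


def sketch_normalize_namespace(value: str) -> str:
    """Stand-in for normalize_namespace, based on the documented rules."""
    trimmed = value.strip()
    if not trimmed:
        raise ValueError("Value must be non-empty")
    # fullmatch anchors the pattern at both ends, like ^...$ in the docs.
    if not _NAMESPACE_RE.fullmatch(trimmed):
        raise ValueError(
            "Value must be lowercase snake_case (letters, numbers, underscores)"
        )
    return trimmed


def is_rejected(value: str) -> bool:
    """Return True when the sketch raises ValueError for this value."""
    try:
        sketch_normalize_namespace(value)
    except ValueError:
        return True
    return False
```

Because the first character must be a lowercase letter, a leading underscore is rejected by this sketch just like a leading digit.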
normalize_entity_code(value) -> str¶
Trims and requires non-empty (ValueError("code must be non-empty") when blank).
Intentionally loose otherwise so provider-native identifiers — uppercase, dots,
mixed punctuation — pass through unchanged.
from parsimony.entity import normalize_entity_code
assert normalize_entity_code("GDPC1") == "GDPC1"
assert normalize_entity_code(" B.U.Y.10Y ") == "B.U.Y.10Y"
namespace and code are not interchangeable
code preserves uppercase and dots; namespace rejects them. The two helpers
also emit different empty-string messages (Value must be non-empty versus
code must be non-empty) — do not assume a uniform error string.
code_token(value) -> str¶
Turns an arbitrary string into a safe, snake_case code token: lowercases, maps
-/space/. (and any other non-[a-z0-9_] character) to _, collapses repeated
underscores, strips edge underscores. Returns "unknown" if nothing survives, and
prefixes v_ when the result would start with a digit. Use it in a provider when you
must synthesize a code from a free-form label.
from parsimony.catalog import code_token
assert code_token("Real GDP (2017 $)") == "real_gdp_2017"
assert code_token("10Y") == "v_10y"
assert code_token("---") == "unknown"
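The steps listed above can be sketched in a few lines of standard-library Python. This is a reading of the documented behavior, not the library's source; sketch_code_token is a hypothetical name, and the order of the steps is assumed to match the prose.

```python
import re


def sketch_code_token(value: str) -> str:
    """Stand-in for code_token, following the documented steps in order."""
    token = value.lower()
    token = re.sub(r"[^a-z0-9_]", "_", token)  # map unsafe characters to "_"
    token = re.sub(r"_+", "_", token)          # collapse repeated underscores
    token = token.strip("_")                   # strip edge underscores
    if not token:
        return "unknown"
    if token[0].isdigit():
        return f"v_{token}"
    return token
```

Under these assumptions the sketch reproduces the three examples above, and a dotted identifier such as "B.U.Y.10Y" becomes "b_u_y_10y".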
entity_key(namespace, code) -> tuple[str, str]¶
The canonical in-memory key for a (namespace, code) pair, used internally by the
catalog's lookup table. It returns (normalize_namespace(namespace),
normalize_entity_code(code)), so it applies both rules at once.
from parsimony.catalog import entity_key
assert entity_key("fred", " UNRATE ") == ("fred", "UNRATE")
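Why a normalized key matters is easiest to see with a dictionary lookup. The sketch below is illustrative only: sketch_entity_key is a hypothetical stand-in built from the two rules described on this page, its namespace error message is invented, and the catalog's actual lookup table is internal and not shown here.

```python
import re

_NS = re.compile(r"[a-z][a-z0-9_]*")


def sketch_entity_key(namespace: str, code: str) -> tuple[str, str]:
    """Stand-in for entity_key: trim both parts, validate, return a tuple."""
    ns, cd = namespace.strip(), code.strip()
    if not _NS.fullmatch(ns):
        raise ValueError("invalid namespace")  # illustrative message
    if not cd:
        raise ValueError("code must be non-empty")
    return (ns, cd)


lookup: dict[tuple[str, str], str] = {}
lookup[sketch_entity_key("fred", "UNRATE")] = "Unemployment Rate"

# Stray whitespace resolves to the same key...
hit = lookup.get(sketch_entity_key(" fred ", " UNRATE "))
# ...but codes are preserved verbatim, so case still matters.
miss = lookup.get(sketch_entity_key("fred", "unrate"))
```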
What an index reads: field extraction¶
Indexes do not read raw Python objects off an entity — they read normalized lists of
strings. Three helpers in parsimony.entity define that contract. field_values and
field_text are also re-exported from parsimony.catalog; field_value is only on
parsimony.entity.
| Helper | Returns | Use |
|---|---|---|
| field_value(entity, field) | the raw value (Any) or None | low-level single-field accessor |
| field_values(entity, field) | list[str] of trimmed, non-empty values | the multi-value text an index builds on |
| field_text(entity, field) | the field_values list joined by single spaces | a single searchable string |
All three resolve field the same way: namespace, code, and title are
first-class; any other name is a metadata.get(field) lookup (so a missing key
yields None).
field_values then coerces by type:
| Value type | Result |
|---|---|
| None (missing key) | [] |
| str | [trimmed], or [] if blank after trimming |
| list / tuple / set | stringified, trimmed items; None/blank dropped |
| dict | ["key: value", ...] (literal ": " separator); items with None/blank values dropped |
| any other scalar | [str(value).strip()], or [] if blank |
from parsimony.catalog import Entity, field_values, field_text
e = Entity(
    namespace="fred",
    code="UNRATE",
    title="Unemployment Rate",
    metadata={"frequency": "M", "tags": ["labor", "rates"], "attrs": {"unit": "pct", "scale": 1}},
)
assert field_values(e, "title") == ["Unemployment Rate"]
assert field_values(e, "frequency") == ["M"]
assert field_values(e, "tags") == ["labor", "rates"]
assert field_values(e, "missing") == [] # absent metadata key
assert field_values(e, "attrs") == ["unit: pct", "scale: 1"]
assert field_text(e, "tags") == "labor rates"
Set order is non-deterministic
A set metadata value is iterated in arbitrary order, so the strings come back
unsorted. If you need a stable order, store a list.
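The coercion table can be read as a small dispatch on the value's type. The following is a sketch of that reading in plain Python, operating on the raw value rather than on an Entity. The names sketch_coerce and sketch_field_text are hypothetical, and edge cases not covered by the table (nested containers, for example) are handled here by simple stringification, which is an assumption.

```python
from typing import Any


def _clean(item: Any) -> str:
    """Stringify and trim; None becomes the empty string."""
    return "" if item is None else str(item).strip()


def sketch_coerce(value: Any) -> list[str]:
    """Stand-in for the coercion step of field_values (documented rules only)."""
    if value is None:
        return []
    if isinstance(value, str):
        items = [value.strip()]
    elif isinstance(value, (list, tuple, set)):
        items = [_clean(item) for item in value]
    elif isinstance(value, dict):
        # Entries whose value is None or blank are skipped entirely.
        items = [f"{key}: {_clean(val)}" for key, val in value.items() if _clean(val)]
    else:
        items = [str(value).strip()]
    return [item for item in items if item]


def sketch_field_text(value: Any) -> str:
    """Join the coerced values with single spaces."""
    return " ".join(sketch_coerce(value))
```

If a value starts life as a set, converting it with sorted(...) before storing it in metadata gives the stable order the note above recommends.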
Turning a DataFrame into entities¶
Connectors return raw DataFrames, not entities. There are two ways to build Entity
rows from a frame: the low-level entities_from_dataframe, and the higher-level
OutputConfig.build_entities, which an enumerator
uses to feed a catalog.
entities_from_dataframe¶
entities_from_dataframe (in parsimony.entity) takes explicit column roles and
returns list[Entity]. Rows are grouped by the key column, so repeated keys collapse
into one entity.
import pandas as pd
from parsimony.entity import entities_from_dataframe
df = pd.DataFrame(
    {
        "code": ["A", "A", "B"],
        "title": ["Alpha", "Alpha", "Beta"],
        "sector": ["Tech", "Tech", "Energy"],
    }
)
entities = entities_from_dataframe(
    df,
    namespace="demo",
    key_column="code",
    title_column="title",  # optional; falls back to the code if None or missing
    metadata_columns=["sector"],
)
assert {e.code: e.metadata["sector"] for e in entities} == {"A": "Tech", "B": "Energy"}
It raises a ValueError if a named key, title, or metadata column is absent from the
frame. Each metadata column must hold a single value per key group; if values vary
within one key, it raises a ValueError whose message says the column "is not entity
metadata" and points you at ColumnRole.DATA or a more specific key.
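The grouping and consistency rule can be illustrated without pandas. The sketch below works on a list of row dictionaries and is only a model of the documented behavior. The function name sketch_group_rows and the exact error wording are made up for illustration, and the library's own implementation is not shown here.

```python
from typing import Any


def sketch_group_rows(
    rows: list[dict[str, Any]],
    key_column: str,
    metadata_columns: list[str],
) -> dict[Any, dict[str, Any]]:
    """Collapse rows by key and insist that metadata is constant per key."""
    grouped: dict[Any, dict[str, Any]] = {}
    for row in rows:
        key = row[key_column]
        current = {col: row[col] for col in metadata_columns}
        # First row for a key defines its metadata; later rows must agree.
        previous = grouped.setdefault(key, current)
        for col in metadata_columns:
            if previous[col] != current[col]:
                raise ValueError(
                    f"column {col!r} varies within key {key!r}; "
                    "it is not entity metadata"
                )
    return grouped


rows = [
    {"code": "A", "sector": "Tech"},
    {"code": "A", "sector": "Tech"},
    {"code": "B", "sector": "Energy"},
]
grouped = sketch_group_rows(rows, "code", ["sector"])
```

Three rows collapse into two groups here. Adding a fourth row with code "A" and a different sector would make the sketch raise, which mirrors the situation the ValueError above is meant to flag.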
Per-row namespaces (__row__)¶
When entities in one frame belong to different namespaces, pass
namespace="__row__" together with a namespace_column. Each row's namespace is then
read (and normalized) from that column instead of a single static value.
import pandas as pd
from parsimony.entity import entities_from_dataframe
df = pd.DataFrame({"code": ["X"], "title": ["T"], "entity_namespace": ["fred"]})
entities = entities_from_dataframe(
    df,
    namespace="__row__",
    key_column="code",
    title_column="title",
    metadata_columns=[],
    namespace_column="entity_namespace",
)
assert entities[0].namespace == "fred"
OutputConfig.build_entities¶
The declarative path is OutputConfig.build_entities(df) (see
results and output schemas). It reads column roles from the
schema and delegates to entities_from_dataframe, so the same grouping and
metadata-consistency rules apply.
Requirements:
- Exactly one KEY column, and that column must declare a namespace=; otherwise it raises ValueError: KEY column must declare namespace=....
- At most one TITLE column (optional). When absent, the code is used as the title.
- METADATA columns are optional. A metadata column named "*" is a wildcard that claims every DataFrame column not already taken by the KEY, TITLE, or another explicit metadata column.
- A KEY namespace of "__row__" switches on per-row namespaces, read from a column named entity_namespace.
import pandas as pd
from parsimony import OutputConfig, Column, ColumnRole
df = pd.DataFrame(
    {
        "code": ["UNRATE"],
        "title": ["Unemployment Rate"],
        "frequency": ["M"],
        "description": ["Civilian unemployment rate"],
    }
)
schema = OutputConfig(
    columns=[
        Column(name="code", role=ColumnRole.KEY, namespace="fred"),
        Column(name="title", role=ColumnRole.TITLE),
        Column(name="frequency", role=ColumnRole.METADATA),
        Column(name="description", role=ColumnRole.METADATA),
    ]
)
entities = schema.build_entities(df)
assert entities[0].namespace == "fred"
assert entities[0].metadata == {"frequency": "M", "description": "Civilian unemployment rate"}
The wildcard form is convenient when you want every remaining column as metadata:
import pandas as pd
from parsimony import OutputConfig, Column, ColumnRole
df = pd.DataFrame({"code": ["A"], "name": ["Alpha"], "sector": ["Tech"], "region": ["US"]})
schema = OutputConfig(
    columns=[
        Column(name="code", role=ColumnRole.KEY, namespace="demo"),
        Column(name="name", role=ColumnRole.TITLE),
        Column(name="*", role=ColumnRole.METADATA),
    ]
)
entities = schema.build_entities(df)
assert entities[0].metadata == {"sector": "Tech", "region": "US"}
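How the wildcard picks its columns can be modeled as a set difference over the frame's column names. The sketch below is an assumption-laden illustration, not Parsimony's code: sketch_resolve_metadata is a hypothetical name, and listing explicit metadata columns before wildcard-claimed ones is an ordering chosen for this example.

```python
from typing import Optional


def sketch_resolve_metadata(
    frame_columns: list[str],
    key: str,
    title: Optional[str],
    declared_metadata: list[str],
) -> list[str]:
    """Expand a "*" metadata entry into the columns it would claim."""
    explicit = [name for name in declared_metadata if name != "*"]
    if "*" not in declared_metadata:
        return explicit
    # Columns already spoken for: the key, the title, explicit metadata.
    taken = {key, *explicit}
    if title is not None:
        taken.add(title)
    wildcard = [name for name in frame_columns if name not in taken]
    return explicit + wildcard


resolved = sketch_resolve_metadata(
    ["code", "name", "sector", "region"],
    key="code",
    title="name",
    declared_metadata=["*"],
)
```

With the frame from the example above, the wildcard claims sector and region, matching the metadata keys asserted there.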
Metadata must be constant within an entity key
Both builders group rows by the key and require each metadata column to hold a
single value per group. A column whose value differs across rows that share a key
(for example an isin that changes between two rows of the same code) raises a
ValueError — that column is observation DATA, not identity metadata, or your
key is too coarse.
Once you have a list[Entity], hand it to a catalog with set_entities, then build
and search. See building and searching.
See also¶
- The Catalog: the lifecycle that consumes entities and returns matches.
- Building and searching: set_entities, build, and the query DSL that returns CatalogMatch results.
- Results and output schemas: OutputConfig, Column, and ColumnRole, the schema that drives build_entities.
- Loaders and enumerators: enumerators emit the DataFrames that become catalog entities.