ADR 002 — pgvector Hybrid Search (Vector + FTS in PostgreSQL)¶

Accepted Date: 2025 Deciders: CEP AI Team

Context¶

We needed a retrieval backend that could:

Store 5,236 ANZSIC records with 768-dimensional embeddings
Perform fast approximate nearest-neighbour (ANN) search on the embeddings
Perform keyword / full-text search on the descriptions
Run on developer laptops and a corporate Mac without external services
Be replaced in the future if requirements change

Candidates evaluated:

Option	Vector search	FTS	Self-hosted	Notes
PostgreSQL + pgvector	✅ HNSW	✅ tsvector	✅	Already installed; SQL familiarity
Weaviate	✅ HNSW	✅ BM25	✅	Docker dependency; extra ops overhead
Pinecone	✅	❌ (no FTS)	❌ SaaS	Corporate data policy concerns; cost
Qdrant	✅	Partial	✅	Newer; less community docs
Elasticsearch	Partial	✅	✅	Heavy; no native pgvector interop

Decision¶

We chose PostgreSQL 15 + pgvector 0.8.0.

Reasons:

Single dependency — PostgreSQL was already running on developer machines. Adding pgvector required one CREATE EXTENSION vector command.
Both search types in one query engine — The FTS (tsvector + GIN index) and vector search (HNSW index via <=>) live in the same database. There is no data synchronisation, no dual writes, no consistency lag.
HNSW performance — At 5,236 rows, HNSW is fast enough that latency is dominated by the Vertex AI API calls, not the database. The HNSW index (m=16, ef_construction=64) returns results in < 10 ms.
Corporate data policy — ANZSIC data and query logs stay on-premises. No external SaaS dependency.
SQL familiarity — The entire team can inspect, debug, and extend the retrieval queries with standard SQL knowledge.

Database schema (key columns)¶

CREATE TABLE anzsic_codes (
    anzsic_code      TEXT PRIMARY KEY,
    anzsic_desc      TEXT,
    class_code       TEXT,
    class_desc       TEXT,
    group_code       TEXT,
    group_desc       TEXT,
    subdivision_desc TEXT,
    division_desc    TEXT,
    class_exclusions TEXT,
    enriched_text    TEXT,
    embedding        vector(768),   -- pgvector HNSW
    fts_vector       tsvector       -- GIN full-text
);

CREATE INDEX ON anzsic_codes USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

CREATE INDEX ON anzsic_codes USING gin (fts_vector);

Consequences¶

Positive:

Zero additional infrastructure beyond what already existed
SQL debugging: EXPLAIN ANALYZE works on vector queries
Both search systems share the same transaction semantics

Negative / trade-offs:

PostgreSQL is not purpose-built for vector search; Weaviate or Qdrant would be faster at 10M+ records
HNSW is an approximate index — exact nearest-neighbour is not guaranteed (acceptable for classification; we fuse with FTS anyway)
pgvector compilation required PG_SYSROOT override on macOS Sequoia (documented in ingest.py header)

When to revisit:

Migrate to a dedicated vector database (Weaviate/Qdrant) when the code catalogue grows beyond ~50,000 records or when sub-5ms p99 vector search latency becomes a requirement.

The DatabasePort abstraction makes this migration a one-file change.