# ADR 005 — Gemini for LLM Re-ranking (Stage 2)

**Status:** Accepted · **Date:** 2025 · **Deciders:** CEP AI Team

## Context
Stage 2 takes the top-20 Stage 1 candidates and asks an LLM to:
- Read each candidate's description, class, group, division, and exclusions
- Select the top-5 best matches ranked by relevance
- Provide a plain-English reason for each selection
This task requires a model that can:
- Follow a strict JSON output schema
- Reason about subtle distinctions (e.g. "own account" vs "employed" mechanics)
- Process ~20 candidate records (~3,000 tokens) in a single call
- Return results within a 5–10 second budget
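For concreteness, the JSON the re-ranker must emit has roughly this shape (a sketch only; the field names `matches`, `code`, `rank`, and `reason` are illustrative assumptions, not the exact production schema):

```python
# Illustrative Stage 2 output shape. Field names are assumptions
# for this sketch, not the documented schema.
EXPECTED_RESPONSE_EXAMPLE = {
    "matches": [
        {
            "code": "6920",   # ANZSIC code of the selected candidate
            "rank": 1,        # 1..5, best match first
            "reason": "Primarily provides services on own account.",
        },
        # ... up to five entries, ordered by relevance
    ]
}
```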
Models evaluated:
| Model | Context window | JSON mode | Region | Notes |
|---|---|---|---|---|
| Gemini 2.5 Flash | 1M tokens | ✅ native | australia-southeast1 | GCP auth reused |
| GPT-4o | 128K tokens | ✅ native | US/EU only | Separate API key; data residency concern |
| GPT-4o-mini | 128K tokens | ✅ | US/EU only | Cheaper but less accurate |
| Claude 3.5 Sonnet | 200K tokens | ✅ (tool use) | US/EU only | No AUS region |
| Gemini 2.0 Flash 001 | 1M tokens | ✅ | australia-southeast1 | Model unavailable in region (tested, failed) |
## Decision

We chose **Vertex AI Gemini 2.5 Flash** (`gemini-2.5-flash`).

Reasons:

- **Existing GCP auth reused** — the same `GCPAuthManager` and bearer token used for embeddings serves Gemini. No additional secrets management.
- **`australia-southeast1` region** — data stays in Australian infrastructure. (Note: `gemini-2.0-flash-001` was tested first but was unavailable in this region — see below.)
- **Native JSON mode** — `responseMimeType: application/json` guarantees syntactically valid JSON output, eliminating the need to strip markdown fences or handle malformed responses (see the sketch after this list).
- **1M token context window** — the CSV fallback strategy (injecting all 5,236 codes, ~63K tokens) fits comfortably within the context budget.
- **Speed** — Flash-tier models balance quality and latency better than Pro-tier for this structured re-ranking task.
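The JSON-mode and temperature settings map directly onto the Vertex AI `generateContent` REST call. The sketch below shows the shape of that request, assuming a bearer token from the shared `GCPAuthManager`; `PROJECT_ID`, the function name, and the prompt assembly are placeholders, not the production code:

```python
import requests

REGION = "australia-southeast1"
MODEL = "gemini-2.5-flash"
URL = (
    f"https://{REGION}-aiplatform.googleapis.com/v1/projects/PROJECT_ID"
    f"/locations/{REGION}/publishers/google/models/{MODEL}:generateContent"
)

def rerank(prompt: str, token: str) -> str:
    body = {
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        "generationConfig": {
            "responseMimeType": "application/json",  # native JSON mode
            "temperature": 0.1,                      # near-deterministic rankings
        },
    }
    resp = requests.post(
        URL, json=body,
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    # The first candidate's text part carries the JSON payload
    return resp.json()["candidates"][0]["content"]["parts"][0]["text"]
```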
### Model availability issue (resolved)
During development, `gemini-2.0-flash-001` returned HTTP 404 for all requests
to `australia-southeast1`. Investigation confirmed the model was not deployed
in that region at the time. We switched to `gemini-2.5-flash`, which is
available there. The model is configured via the `GCP_GEMINI_MODEL` environment
variable, so changing the model requires no code changes.
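Because the model ID comes from the environment, the 404 was resolved by configuration alone. A minimal sketch of that lookup (the default value here is an assumption for illustration):

```python
import os

# Model is selected via environment, so region/model changes need no code edits.
GEMINI_MODEL = os.environ.get("GCP_GEMINI_MODEL", "gemini-2.5-flash")
```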
### CSV fallback strategy
The standard re-ranking prompt contains only the 20 Stage 1 candidates (~2K tokens). On rare queries where Gemini returns an empty result array (i.e. no candidate matched), we retry with all 5,236 ANZSIC codes injected as a reference lookup (~63K tokens).
Why retry-on-empty rather than always injecting?
- The 63K-token prompt was initially always sent, but this caused token limit errors (148K > 128K context for an older model)
- Even with the new 1M context model, a 65K-token prompt degrades Gemini's attention on the 20 relevant candidates
- Testing showed > 95% of queries succeed on the first (compact) call
- The retry adds latency only for the rare case where it is genuinely needed
This strategy is implemented in `LLMReranker._call_llm()` and tested in
`prod/tests/unit/test_reranker.py::TestRerankerCsvFallback`.
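In outline, the control flow looks like the sketch below. The helper names and the injected `call_gemini` callable are assumptions for illustration; the real logic lives in `LLMReranker._call_llm()`:

```python
import json
from typing import Callable

def rerank_with_fallback(
    compact_prompt: str,
    csv_prompt: str,
    call_gemini: Callable[[str], str],
) -> list[dict]:
    """Retry-on-empty sketch: compact call first, full-CSV call only if needed."""
    # First attempt: only the 20 Stage 1 candidates (~2K tokens)
    result = json.loads(call_gemini(compact_prompt))
    if result.get("matches"):  # >95% of queries succeed here
        return result["matches"]
    # Rare case: empty result array, so retry with the prompt that
    # injects all 5,236 ANZSIC codes as a reference lookup (~63K tokens)
    result = json.loads(call_gemini(csv_prompt))
    return result.get("matches", [])
```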
## Consequences
Positive:
- GCP auth shared with embeddings — one token, two adapters
- Temperature 0.1 produces consistent, reproducible re-rankings across runs
- Provenance: the `llm_model` field in `ClassifyResponse` records which model produced the result — important for reproducibility audits
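A sketch of that provenance field, assuming `ClassifyResponse` is a Pydantic model (not confirmed by this ADR) and showing only a minimal subset of its fields:

```python
from pydantic import BaseModel

class ClassifyResponse(BaseModel):
    matches: list[dict]  # the re-ranked top-5 candidates
    llm_model: str       # e.g. "gemini-2.5-flash": which model produced the result
```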
Negative / trade-offs:
- Stage 2 adds 2–5 seconds of latency — Fast mode (Stage 1 only) is available when speed matters more than explanation quality
- Gemini API availability affects production reliability — Fast mode is the graceful degradation path
When to revisit:

The `LLMPort` abstraction makes swapping Gemini for another model a one-line
change in `container.py` (sketched below). Good candidates for future evaluation:
GPT-4o (if data residency requirements relax) or a fine-tuned smaller model
(if Gemini API costs become a concern at scale).
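A minimal sketch of that swap, assuming a Protocol-style port; the adapter and variable names are illustrative, not the real code:

```python
from typing import Protocol

class LLMPort(Protocol):
    def rerank(self, query: str, candidates: list[dict]) -> list[dict]: ...

class GeminiAdapter:
    def rerank(self, query: str, candidates: list[dict]) -> list[dict]:
        raise NotImplementedError  # Vertex AI call as sketched above

# In container.py, switching providers is a one-line change:
llm: LLMPort = GeminiAdapter()  # e.g. replace with an OpenAIAdapter() later
```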