Architecture Decisions (ADRs)
=============================

**What.** One record per major architectural decision. ADRs 001–009 cover the
original PHI-handling architecture. ADRs 010–015 cover the sandbox, KeyStore,
PDF orchestrator, reviewed snapshot baseline, parallel extraction, and
l-diversity decisions. Each record states what was decided, why, how it was
implemented, what alternatives were considered, and what consequences to
expect if the decision ages poorly.

**Why.** Commit messages and Claude-memory files capture decisions at the
moment of authorship but are invisible to a new contributor who starts with
``git clone``. An ADR page is the stable artifact: the next person who asks
"why did they choose rule+allowlist over Presidio?" can find the answer here
without spelunking ``git log``.

**How.** One section per decision. Section layout: **What** (the decision
stated plainly) · **Why** (the driving constraint or evidence) · **How** (the
implementation mechanism) · **Alternatives** (what was considered and
rejected, with the reason) · **Consequences** (what to watch for if the
decision needs revisiting).

.. contents:: On this page
   :local:
   :depth: 1

ADR-001 — Single-study, local-first runtime
-------------------------------------------

**What.** The pipeline processes one fixed study under
``data/raw/{STUDY_NAME}/`` per install. Multi-study, federated, HPC, and
cloud-deploy workflows are explicitly out of scope.

**Why.** Clinical-study privacy posture is per-study. Cross-study workflows
introduce cross-study re-identification risk that needs its own threat model;
that's a different project. Keeping this one narrow lets every defence be
tuned to one study's data dictionary and one IRB's expectations.

**How.** ``config.STUDY_NAME`` pins the single study; every path under
``data/raw/``, ``tmp/``, and ``output/`` is parameterised by it.
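The study-scoped path derivation described in **How** can be sketched as
follows. This is a minimal illustration only; the helper name ``study_paths``
and the ``"EXAMPLE_STUDY"`` value are hypothetical, not the real ``config.py``
API.

```python
from pathlib import Path

STUDY_NAME = "EXAMPLE_STUDY"  # hypothetical value; the real one is config.STUDY_NAME


def study_paths(study: str, root: Path = Path(".")) -> dict[str, Path]:
    """Derive every zone root from the single pinned study name."""
    return {
        "raw": root / "data" / "raw" / study,        # RED
        "staging": root / "tmp" / study,             # AMBER
        "output": root / "output" / study,           # GREEN
    }


paths = study_paths(STUDY_NAME)
```

Because every path is parameterised by one name, "supporting a second study"
reduces to a second install with a different ``STUDY_NAME``, exactly as the
Consequences note below describes.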
**Alternatives.** An upload-driven multi-tenant variant was considered at
project start; rejected because multi-tenant isolation would have dominated
the architecture while the real demand was single-study velocity.

**Consequences.** Supporting a second study today means a second repo clone
with a different ``STUDY_NAME``. If that becomes a burden, revisit by first
generalising ``config.py`` study resolution, not by hacking new paths into
the pipeline.

ADR-002 — HMAC-SHA256 pseudonymization with sidecar key (no vault)
------------------------------------------------------------------

**What.** Subject IDs are replaced with ``SUBJ_``-prefixed pseudonyms. The
32-byte key lives at ``~/.config/report_ai_portal/phi_key`` (mode 0600),
outside the repo tree.

**Why.** HMAC-SHA256 is non-reversible without the key (so pseudonyms leak no
information about the identifier) and deterministic with the key (so the same
subject maps to the same tag across every file in the bundle — longitudinal
linkage preserved). A sidecar key outside the repo tree means key compromise
requires host-level access; a repo-committed key would ride into every clone.

**How.** :func:`scripts.security.phi_scrub.pseudo_id` +
:func:`scripts.security.phi_scrub.bootstrap_key`; key permissions enforced by
:func:`scripts.security.phi_scrub.load_key` (hard-fail on a missing key or
wrong mode).

**Alternatives.**

* **AES-256-GCM key vault with a database of original→pseudonym mappings**
  (the ``vault.py`` approach, evaluated and rejected early in design).
  Rejected because a mapping database is a decryption surface — anyone with
  the key can invert every pseudonym. HMAC-only is strictly safer: deleting
  the key forfeits the ability to re-derive pseudonyms, so key
  compartmentalisation is simpler to reason about.
* **Plain SHA-256 hashing**. Rejected — hashing without a secret key is
  vulnerable to rainbow-table attacks when the input space is small (subject
  IDs are typically 4–5 digits).
* **Random tokens stored in a database**. Rejected — breaks longitudinal
  linkage across studies or re-runs; the same subject would get different
  tokens per run.

**Consequences.** Key rotation invalidates every prior pseudonym; full
re-ingestion from raw is required. This is the intentional behaviour but can
surprise new operators. Document it explicitly in the key-management runbook.

ADR-003 — SANT per-subject date jitter over HIPAA year-only
-----------------------------------------------------------

**What.** Event dates are shifted by a per-subject deterministic offset
∈ [−30, +30] days. The offset is the same for every date belonging to one
subject, so intra-subject intervals are preserved exactly.

**Why.** HIPAA Safe Harbor reduces dates to "year only" — that loses up to
364 days of resolution and breaks survival / incidence / person-time
analyses. SANT jitter preserves per-subject interval structure (enough for
every epi question the agent is expected to answer) while still obscuring the
absolute calendar date.

**How.** :func:`scripts.security.phi_scrub.date_offset_days` derives the
offset from ``HMAC-SHA256(key, subject_id)[:4]`` mod (2N+1) − N.
:func:`scripts.security.phi_scrub.shift_date` applies it while preserving the
source format (ISO / M-D-Y / D-M-Y).

**Alternatives.**

* **Year-only masking** (strict Safe Harbor). Rejected because it breaks
  survival analysis and time-to-event fits — core to the study research
  questions.
* **Global constant offset across all subjects**. Rejected because the offset
  would be a single secret with no per-subject variation — one leaked event
  date reveals every subject's calendar events.
* **Per-row random offset**. Rejected because it destroys the per-subject
  interval structure that the epi analyses depend on.

**Consequences.** SANT is a Limited Dataset technique in the HIPAA taxonomy
(not Safe Harbor).
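For concreteness, the offset derivation stated in **How** above can be
sketched as below. This is a minimal illustration, not the real
:func:`scripts.security.phi_scrub.date_offset_days`; details the ADR does not
state (the big-endian reading of the four digest bytes, the placeholder key)
are assumptions.

```python
import hashlib
import hmac

N = 30  # jitter half-window in days, per the ADR


def date_offset_days(key: bytes, subject_id: str, n: int = N) -> int:
    """Deterministic per-subject offset in [-n, +n]:
    HMAC-SHA256(key, subject_id)[:4] mod (2n+1), shifted down by n."""
    digest = hmac.new(key, subject_id.encode(), hashlib.sha256).digest()
    # Byte order is an assumption for illustration; the mod arithmetic
    # matches the formula quoted in the ADR.
    return int.from_bytes(digest[:4], "big") % (2 * n + 1) - n


key = b"\x00" * 32  # placeholder only; the real key lives in the sidecar file
offset = date_offset_days(key, "10423")
```

Determinism is the point: every date for subject ``10423`` shifts by the same
``offset``, so intra-subject intervals survive exactly while the calendar
anchor is obscured.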
The ``compliance_posture`` flag in ``phi_scrub.yaml`` makes this explicit —
``safe_harbor`` drops DOB entirely; ``limited_dataset`` requires an
IRB-approved Data Use Agreement file to exist before the scrub will shift DOB
along with other dates.

ADR-004 — Rule+allowlist over Microsoft Presidio
------------------------------------------------

**What.** The PHI gate and log redactor use a curated regex catalog
(:mod:`scripts.security.phi_patterns`) plus a clinical-phrase allowlist
(:mod:`scripts.security.phi_allowlist`). Presidio is NOT installed as a
runtime dependency.

**Why.** 2025 benchmarks (cited in :doc:`references`) found Presidio achieves
22.7% precision on mixed enterprise data and ~84% F1 on clinical notes;
curated rules on calibrated fields consistently outperform it. The
false-positive tax alone (legitimate clinical phrases flagged as PHI) would
make every agent response unusable.

**How.** ``phi_gate.py`` compiles ``phi_patterns.BLOCKING_PATTERNS`` +
``WARN_PATTERNS``; the clinical-phrase allowlist suppresses warn-tier hits
like "Treatment Completed" that match the generic name-like heuristic.

**Alternatives.**

* **Presidio-analyzer default**. Rejected per the benchmarks above.
* **John Snow Labs Clinical NER** (commercial, 98.6% F1). Rejected because it
  requires a JSL license key; the architecture prioritises local-first
  zero-egress.
* **Stanford Stanza with an i2b2-tuned model**. Considered; ~93% F1 and
  open-source. Deferred as an optional supplement for narrative residuals if
  the rule catalog proves insufficient.
* **Local Ollama prompt-engineered NER**. Rejected for the current runtime
  because the calibrated rule catalog plus whole-field narrative drops
  provide a smaller, auditable surface.

**Consequences.** Rule maintenance cost scales with new PHI classes
discovered during deployment.
This cost is offset by the catalog being data-driven (YAML + rule-class
helpers) rather than inline regex — adding a new PHI class is a YAML addition
plus a test case, not a code change.

ADR-005 — tmpfs staging as an operator opt-in (not default)
-----------------------------------------------------------

**What.** When ``REPORTALIN_TMPFS_STAGING=1`` is set AND Linux ``/dev/shm``
is writable, the extraction leg redirects staging to
``/dev/shm/report_ai_portal/{STUDY}/`` so raw extracted rows never hit
physical disk. The default is off.

**Why.** tmpfs-backed staging is the gold-standard defence against filesystem
forensics on the extraction host, but forcing it on by default would break
macOS and Windows operators without warning. Opt-in keeps the strong defence
available to Linux operators who ask for it while preserving portability for
everyone else.

**How.** :func:`scripts.utils.secure_staging.resolve_staging_root` gates
tmpfs redirection on the env flag AND the writability check, with graceful
fallback to the default ``tmp/{STUDY}/`` path when the env flag is set but
tmpfs isn't available.

**Alternatives.**

* **Encrypted-at-rest staging**. Considered; rejected as over-engineering —
  staging is short-lived, and the on-disk hardening (mode 0700 + umask 0077 +
  zero-fill teardown) already resists the realistic threat model (accidental
  disk imaging, forensic recovery). Adding AES at rest would add a
  key-management burden without closing a distinct attack surface.
* **Default-on tmpfs with a Windows/macOS fallback**. Considered; rejected
  because silent platform-dependent behaviour is a foot-gun — operators may
  assume tmpfs is active when it isn't. Explicit opt-in makes the posture
  auditable.

**Consequences.** Linux operators who forget the env flag run without tmpfs
hardening. Mitigated by documenting the flag prominently in ``.env.example``,
:doc:`../user_guide/configuration`, and the operational runbook.
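The gating logic can be sketched as follows. This is a simplified stand-in
for :func:`scripts.utils.secure_staging.resolve_staging_root`; the exact
checks in the real helper may differ.

```python
import os
from pathlib import Path


def resolve_staging_root(study: str) -> Path:
    """tmpfs staging only when explicitly requested AND actually usable.

    Both conditions must hold: the operator opted in via the env flag,
    and /dev/shm exists and is writable on this host.
    """
    if os.environ.get("REPORTALIN_TMPFS_STAGING") == "1":
        shm = Path("/dev/shm")
        if shm.is_dir() and os.access(shm, os.W_OK):
            return shm / "report_ai_portal" / study
    # Graceful fallback keeps macOS / Windows operators working unchanged.
    return Path("tmp") / study
```

Note that the fallback fires even when the flag is set but tmpfs is missing,
which is exactly the silent-degradation risk the Consequences paragraph above
asks operators to watch for.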
ADR-006 — External-API PDF extraction refused by default
--------------------------------------------------------

**What.** ``scripts/extraction/extract_pdf_data._resolve_pdf_provider``
refuses to initialise an Anthropic / Google Gemini client unless the operator
explicitly sets ``REPORTALIN_PDF_PHI_FREE=1``.

**Why.** PDFs are treated as PHI-bearing by default — annotated CRFs can
carry filled-in patient data, handwritten signatures, or example subject IDs
in annotations. Sending them to an external LLM API is a network egress of
PHI to a third party. The architecture forbids that without explicit operator
assertion.

**How.** A feature-flag check at provider resolution; refusal raises
``ValueError`` with a remediation message listing three alternatives (flip
the flag if the PDFs are PHI-free, use ``--pdf-source`` with pre-extracted
JSON, or skip the PDF leg entirely).

**Alternatives.**

* **Local-only PDF extraction** (pdfplumber primary + local-Ollama multimodal
  fallback). **Superseded by the PDF orchestrator:**
  ``scripts/extraction/pdf_pipeline.py`` ships pdfplumber as the always-on
  code path, paired with a redacted-text LLM call (capable cloud or local
  model) via ``_merge``, and a per-PDF snapshot baseline fallback at
  ``data/snapshots/{STUDY}/pdfs/``. No raw PDF bytes leave the host on the
  orchestrator path. The original ADR-006 external-API gate remains as the
  legacy fallback for operators who cannot run the orchestrator.
* **Keep the external-API default with an "I acknowledge PHI" banner**.
  Rejected — banners are not operator assertions. An env flag creates a
  durable audit trail.
* **Drop PDF extraction entirely**. Considered as the minimum safe posture;
  rejected because the annotated PDFs carry variable-definition annotations
  that the data dictionary alone does not cover.

**Consequences.** A new operator who needs PDF extraction must read the error
message, verify their PDFs are PHI-free, and flip the flag. This is
intentional friction.
ADR-007 — Four-tier architecture (RED / AMBER / GREEN / GREEN-PROTECT)
----------------------------------------------------------------------

**What.** Every path in the runtime belongs to exactly one of four named
zones: RED (raw), AMBER (staging), GREEN (trio_bundle), or GREEN-PROTECT
(agent boundary). The zones are enforced by assertions, not by convention.

**Why.** Convention-based zone separation rots; assertion-based zone
separation fails fast. New code that accidentally reads from ``raw/`` or
writes to ``output/`` before the scrub runs raises ``ZoneViolationError`` in
CI before landing.

**How.** :mod:`scripts.security.secure_env` exposes the pipeline-side guards
(``assert_not_raw``, ``assert_write_zone``, ``assert_output_zone``,
``assert_trio_bundle_zone``) — one per tier. The agent world has its own
chokepoint, :mod:`scripts.ai_assistant.file_access`, with
``validate_agent_read`` / ``validate_agent_write`` /
``validate_sandbox_write`` (layered on the same ``ZoneViolationError``), so
no agent tool can add a new file read without passing through the validator.
Every file-touching module starts with the applicable assertion.

**Alternatives.** None seriously considered — the zone model is explicitly
stated in HIPAA + ICMR + the RePORT India Common Protocol. The only question
was whether to enforce it as comments or as code; code won.

**Consequences.** Refactors that move file I/O from one module to another
must also move (or add) the zone assertion. Tests catch missing assertions
via the "zone" test category in ``tests/test_secure_env.py``.

ADR-008 — Agent boundary PHI + k-anon gate as defence-in-depth
--------------------------------------------------------------

**What.** Every LLM tool return string runs through
:func:`scripts.security.phi_gate.phi_gate_check`. Row-level returns
additionally run through :func:`scripts.security.kanon_gate.kanon_check`
with k=5.

**Why.** The offline scrub (Step 1.6) is the primary PHI defence; the
agent-boundary gate is the backstop.
If a new narrative-field shape slips past the catalog, or a quasi-identifier
equivalence class is smaller than k=5 for a specific query, the gate catches
the leak before the model ever sees it.

**How.** The :func:`scripts.ai_assistant.phi_safe.phi_safe_return` decorator;
:func:`scripts.ai_assistant.phi_safe.guard_rows_with_kanon` for row-level
surfacing.

**Alternatives.** "Trust the offline scrub, skip the runtime gate."
Rejected — single-layer defences against PHI leakage have a bad track record
across the industry. The k-anon gate is particularly important for the
Indo-VAP cohort size (tribal + district + outcome combinations can trivially
produce k<5 equivalence classes).

**Consequences.** Every tool author must decorate with ``@phi_safe_return``.
Missing decorations are caught by the ``test_agent_tools_zone_guard``
coverage tests (see :doc:`testing`).

ADR-009 — Counts-only audit reports (never raw values)
------------------------------------------------------

**What.** Every report under ``output/{STUDY}/audit/`` records counts of
actions per field per file but never the underlying values. The lineage
manifest records hashes, not contents.

**Why.** An auditor must be able to verify the scrub ran correctly without
access to the raw data. Putting raw values in the audit would turn every
audit report into a PHI leak surface of its own.

**How.** :func:`scripts.security.phi_scrub._emit_audit` emits a payload with
``scrubbed: [{scope, field, file, count}, ...]``. The lineage manifest
records ``{path, sha256, size_bytes, mtime_utc}`` entries.

**Alternatives.** A verbose audit with before/after value pairs was
considered for debugging; rejected — such an audit is a full PHI copy. If
debugging requires value inspection, the operator reads the (short-lived,
mode-0700) AMBER staging files directly, never the audit.

**Consequences.** Audit reports are thin on evidence by design. The counts
suffice for IRB acceptance; deeper debugging requires live staging
inspection.
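The counts-only payload shape can be sketched as below. This is a hedged
illustration of the contract, not the real
:func:`scripts.security.phi_scrub._emit_audit`; the ``emit_audit`` name and
the ``actions`` input shape are assumptions for the example.

```python
from collections import Counter


def emit_audit(actions: list[dict]) -> dict:
    """Collapse scrub actions to counts per (scope, field, file).

    Any 'value' key on the incoming action records is deliberately
    dropped: only counts survive into the audit payload.
    """
    counts = Counter((a["scope"], a["field"], a["file"]) for a in actions)
    return {"scrubbed": [
        {"scope": s, "field": fd, "file": fl, "count": n}
        for (s, fd, fl), n in sorted(counts.items())
    ]}


payload = emit_audit([
    {"scope": "subject", "field": "dob", "file": "visits.csv", "value": "1980-01-02"},
    {"scope": "subject", "field": "dob", "file": "visits.csv", "value": "1975-09-14"},
])
# Two dob scrubs in visits.csv become one entry with count=2; the raw
# date strings never reach the payload.
```

The key property is that the audit payload is a pure aggregation: the raw
values exist only in the caller's scope and are unreachable from the report.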
ADR-010 — Subprocess + rlimits sandbox for ``run_python_analysis``
------------------------------------------------------------------

**What.** LLM-generated Python from the ``run_python_analysis`` tool runs in
a fresh OS subprocess with ``RLIMIT_AS`` / ``RLIMIT_NPROC`` / ``RLIMIT_CPU``
clamps, a sanitised env, and read-only access to the trio bundle. The
original AST guards remain as defence-in-depth inside the child.

**Why.** AST guards alone don't stop CPython gadget escapes (hand-crafted
``__class__.__bases__[0]`` traversals, ``co_consts`` poisoning, etc.). An
OS-level isolation boundary defangs every in-process escape: even a
perfectly-jailbroken interpreter inside the subprocess cannot exceed the
rlimits, cannot read the parent's KeyStore, cannot write outside
``trio_bundle/``, and cannot fork beyond ``RLIMIT_NPROC``.

**How.** Implemented under :mod:`scripts.ai_assistant.sandbox.replicate`
(public API), :mod:`scripts.ai_assistant.sandbox.runner` (child entry point),
and :mod:`scripts.ai_assistant.sandbox.limits` (rlimits helpers). The
generated ``.py`` file is persisted to
``output/{STUDY}/agent/analysis/{ts}.py`` so the operator can copy and
reproduce it externally — no hidden code path.

**Alternatives.**

* **AST guards only**. Rejected — known CPython escape gadgets defeat
  AST-level filtering.
* **An ``nsjail`` / ``firejail`` profile**. Considered as a future
  high-assurance option for cloud deployments; not shipped because the
  deployment target is the operator's laptop and ``nsjail`` needs root plus
  Linux-specific kernel features.
* **WebAssembly** (Pyodide). Considered; rejected because pandas / numpy /
  scipy / statsmodels don't run there cleanly, and the capability surface
  ``run_python_analysis`` needs is exactly those libraries.

**Consequences.** Subprocess startup is ~75% of per-call latency on macOS —
acceptable for the analytical-question-per-second workload but would need
parquet + ``mmap`` if the call rate increased materially. Tracked as future
work in :doc:`sandbox`.
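The fork-then-clamp pattern can be sketched as below. This is a minimal
POSIX-only illustration under assumed limit values, not the real
:mod:`scripts.ai_assistant.sandbox.limits`; the helper names and the specific
numbers are hypothetical.

```python
import resource
import subprocess
import sys


def _clamp(limit: int, soft: int) -> None:
    """Lower the soft limit, never exceeding the current hard limit."""
    _, hard = resource.getrlimit(limit)
    cap = soft if hard == resource.RLIM_INFINITY else min(soft, hard)
    resource.setrlimit(limit, (cap, hard))


def _apply_limits() -> None:
    """Runs in the child between fork() and exec(): clamp address space,
    CPU seconds, and process count before the generated code starts."""
    _clamp(resource.RLIMIT_AS, 4 * 1024**3)   # ~4 GiB virtual memory (assumed value)
    _clamp(resource.RLIMIT_CPU, 30)           # 30 s CPU time (assumed value)
    if hasattr(resource, "RLIMIT_NPROC"):
        _clamp(resource.RLIMIT_NPROC, 64)     # cap fork count (assumed value)


def run_sandboxed(script_path: str, env: dict) -> subprocess.CompletedProcess:
    """Execute a persisted analysis script in a fresh, limited subprocess."""
    return subprocess.run(
        [sys.executable, "-I", script_path],  # -I: isolated mode, no user site
        preexec_fn=_apply_limits,             # POSIX only
        env=env,                              # sanitised: no API keys, no key paths
        capture_output=True, text=True, timeout=60,
    )
```

Because the limits are installed by the kernel before ``exec``, even a
fully-jailbroken interpreter inside the child cannot raise them back above
the hard caps.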
ADR-011 — KeyStore (in-memory API-key registry)
-----------------------------------------------

**What.** API keys never persist in the parent process's ``os.environ`` for
the lifetime of the app. The Streamlit wizard routes the key into an
in-memory ``KeyStore`` registry; the corresponding ``*_API_KEY`` env variable
is scrubbed from ``os.environ``. Keys are re-injected only into the
short-lived pipeline subprocess via ``KeyStore.env_for_subprocess``.

**Why.** ``os.environ`` is a process-wide global. A single
``logger.info(f"env={dict(os.environ)}")`` debug-print, an exception
traceback rendered with ``locals()``, or a third-party library helpfully
echoing config — any of these can leak ``ANTHROPIC_API_KEY`` into the log
file and from there into the audit envelope or the operator's terminal
scrollback. Removing the keys from ``os.environ`` once they're loaded into
the in-memory registry shrinks the leak surface materially.

**How.** Implemented as :mod:`scripts.ai_assistant.keystore`. Every LLM
client constructor (``ChatAnthropic``, ``ChatOpenAI``,
``ChatGoogleGenerativeAI``, ``ChatNVIDIA``, ``ChatOllama``) takes an explicit
``api_key=`` kwarg sourced from the KeyStore — no environment lookup at
construction time. Tested by ``tests/test_keystore.py``,
``tests/test_log_hygiene_keys.py``, and
``tests/test_no_keys_in_parent_environ.py``.

**Alternatives.**

* **OS keyring** (the ``keyring`` package). Considered for persistence across
  sessions; rejected because the operator already has the ``.env`` file as
  their persistence layer, and adding keyring would introduce
  platform-specific behaviour (macOS Keychain vs Linux libsecret vs Windows
  Credential Vault).
* **Encrypted on-disk vault**. Rejected — adds a master-key bootstrap problem
  on top of the existing PHI-key bootstrap problem.

**Consequences.** Operators using the CLI (``python main.py --pipeline``
directly, without the wizard) still rely on the env-var path.
The CLI ``main.py`` reads ``LLM_PROVIDER`` / ``ANTHROPIC_API_KEY`` from the
environment; this is intentional for back-compat with existing shell-script
invocations. The KeyStore posture applies only to the in-app
(Streamlit/CLI-REPL) lifetimes.

ADR-012 — Two-way PDF orchestrator (pdfplumber + redacted-text LLM merge)
-------------------------------------------------------------------------

**What.** PDF extraction has two co-existing paths. The default path
(``scripts/extraction/pdf_pipeline.py``, the wizard's "Load Study"
selection) extracts text locally with ``pdfplumber``, redacts the text via
``phi_patterns.BLOCKING_PATTERNS`` *before* any LLM call, sends only the
redacted text to a capable LLM, re-scrubs the response, and merges it with
the code candidate. Per-PDF fallback goes to the reviewed snapshot baseline
at ``data/snapshots/{STUDY}/pdfs/`` when the LLM tier is unavailable. **No
raw PDF bytes leave the host on this path.**

**Why.** ADR-006 refused external-API PDF extraction by default because raw
PDF bytes carry PHI risk. The two-part attestation gate in the legacy path is
a workable but operator-friction-heavy posture. The orchestrator path
replaces "ship raw bytes after attestation" with "redact text first, then
ship". The redaction catalog is the same one the agent-output PHI gate uses,
so the audit story is internally consistent.

**How.** Implemented under :mod:`scripts.extraction.pdf_pipeline`. Capability
gate via :func:`scripts.utils.llm_capabilities.is_capable_model` (Claude Opus
4.6+, Sonnet 4.6+, GPT-5+, Gemini 2.5 Pro, Llama 3.3 405B; Ollama excluded by
default). Provider gate via
:data:`scripts.extraction.pdf_pipeline.ORCHESTRATOR_SUPPORTED_PROVIDERS`
(anthropic + google only — where ``_extract_via_llm`` has client wiring).
Idempotent cache keyed on ``SHA-256(pdf_bytes || provider || model ||
phi_scrub.yaml hash)``.

**Alternatives.**

* **Local-only PDF extraction** (pdfplumber + local-Ollama multimodal).
  Considered; rejected as primary because Ollama models don't reliably
  produce the JSON-schema response on a 30-page CRF. The snapshot fallback
  covers this case.
* **Keep ADR-006's attestation gate as the only path**. Rejected because the
  operator friction effectively meant the PDF leg was skipped on most runs;
  the orchestrator unblocks the leg without weakening the egress posture.

**Consequences.** ADR-006 is now a *fallback* path, not the primary path. The
attestation gate remains in the legacy
``extract_pdf_data._resolve_pdf_provider``. The orchestrator uses the
``data/snapshots/{STUDY}/`` baseline tier described in ADR-013.

ADR-013 — Single reviewed snapshot baseline
-------------------------------------------

**What.** One human-reviewed copy of the trio bundle lives at
``data/snapshots/{STUDY}/``. It is a single per-study cleaned trio bundle and
mirrors ``output/{STUDY}/trio_bundle/``.

**Why.** The PDF orchestrator (ADR-012) needs a deterministic, reviewable
baseline to fall back on when the LLM tier fails, and the wizard needs a
clean way to recover from incomplete PDF extraction. Multiple named scratch
copies created ambiguity: operators could confuse local rollback data with
the reviewed baseline. The architecture now keeps only the reviewed baseline.

**How.** ``config.STUDY_SNAPSHOTS_DIR`` points at
``data/snapshots/{STUDY}/``. :mod:`scripts.utils.snapshots` saves and
restores that baseline. The wizard's **Use Existing Study** button and the
full pipeline's PDF-failure path both restore the baseline over
``output/{STUDY}/trio_bundle/``. **Crucially:** the LLM agent's read zone is
strictly ``trio_bundle/`` + ``agent/``. The snapshot path is intentionally
*outside* the LLM read zone, so a stale baseline can never be served directly
as live data — the agent cannot read it.

**Alternatives.**

* **Restore points under ``output/{STUDY}/agent/``**. Rejected — they are
  scratch state, sit inside the agent zone, and do not match the
  human-reviewed fallback contract.
* **Require operators to seed via an env var or a separate repo**. Rejected —
  ``data/snapshots/{STUDY}/`` keeps the fallback colocated with the study
  data and easy to review.

**Consequences.** Maintainers who follow the snapshot-baseline maintenance
protocol (see :doc:`operations`) must run a manual ``make snapshot`` after a
verified production run. The whole value of a snapshot is the human
verification step; auto-generation from ``--force`` runs without review is a
foot-gun the protocol explicitly forbids.

ADR-014 — Parallel extraction phase (3-worker ThreadPoolExecutor)
-----------------------------------------------------------------

**What.** ``main.py``'s extraction phase runs Dictionary / Datasets / PDFs in
parallel on a 3-worker ``concurrent.futures.ThreadPoolExecutor``. The cleanup
chain (PHI scrub / dataset cleanup / cleanup propagation), publish, and the
``variables.json`` build run sequentially after the join.

**Why.** The three legs are fully decoupled — different RED inputs, different
AMBER staging subdirs, no shared mutable state. The PDF leg's HTTP latency
(orchestrator LLM calls) is amortised against the dataset leg's Excel-parsing
CPU. The cleanup chain stays sequential because it has hard data dependencies
(the PHI scrub needs all dataset records; propagation needs the cleanup
audit; publish needs propagation; ``variables.json`` needs publish to have
landed).

**How.** Each leg is wrapped in a helper (``_run_dict_leg``,
``_run_dataset_leg``, ``_run_pdf_leg``) that returns a result dict; futures
are gathered via ``as_completed`` so a fast leg can short-circuit while a
slow one runs.

**Alternatives.**

* **``multiprocessing.Pool``** for true parallelism. Rejected — the PDF leg's
  HTTP calls release the GIL, so threads are sufficient; ``multiprocessing``
  would multiply the staging-tree write contention.
* **``asyncio``**. Rejected — the existing extraction code is sync pandas +
  openpyxl; converting to async would be a much bigger change.
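The submit-and-join shape described in **How** can be sketched as follows,
with the three legs stubbed out; the stub bodies are placeholders, not the
real extraction code.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


# Stub legs standing in for the real extraction helpers.
def _run_dict_leg() -> dict:
    return {"leg": "dictionary", "ok": True}


def _run_dataset_leg() -> dict:
    return {"leg": "datasets", "ok": True}


def _run_pdf_leg() -> dict:
    return {"leg": "pdfs", "ok": True}


def run_extraction_phase() -> dict:
    """Run the three decoupled legs in parallel, gathering each result
    as it finishes; a leg's exception re-raises at the join point."""
    results = {}
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {pool.submit(fn): fn.__name__ for fn in
                   (_run_dict_leg, _run_dataset_leg, _run_pdf_leg)}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results  # the sequential cleanup chain runs after this join
```

``as_completed`` is what lets a fast leg's result (and any failure) surface
immediately while the slow PDF leg is still waiting on HTTP.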
**Consequences.** The staging root is protected by a per-study process lock
before AMBER is purged, so two operator-triggered runs cannot race over the
same ``tmp/{STUDY}/`` workspace. ``VerboseLogger`` uses thread-local
indentation, keeping ``--verbose`` tree output readable while the three
extraction legs overlap.

ADR-015 — l-diversity (l=2) on row-returning tools
---------------------------------------------------

**What.** Every row-returning agent tool runs its result through
:func:`scripts.security.kanon_gate.guard_rows_with_kanon_and_ldiv`. The gate
enforces both k-anonymity (k=5; ADR-008 / Pillar 2.4) AND l-diversity (l=2
distinct sensitive-attribute values per equivalence class).

**Why.** k-anonymity alone is insufficient when every row in a small
equivalence class shares the same sensitive attribute (the "homogeneity
attack" of Machanavajjhala et al. 2007). If five rows match on all
quasi-identifiers AND all five share the diagnosis ``HIV+``, the k=5 check
passes but the cohort is just as re-identifying as a k=1 disclosure.
l-diversity guards against this by also requiring at least ``l`` distinct
values of the sensitive attribute.

**How.** Default sensitive attributes are configured in
``_DEFAULT_SENSITIVE_ATTRIBUTES`` of :mod:`scripts.security.kanon_gate`.
Tested by ``tests/security/test_kanon_l_diversity.py``.

**Alternatives.**

* **t-closeness**. Considered; deferred. t-closeness checks the
  *distribution* of the sensitive attribute against the population baseline,
  which needs population-level priors that aren't always available.
  l-diversity gives the bulk of the protection without that requirement;
  t-closeness can be added later as a third gate.
* **Differential privacy**. Out of scope for the row-returning-tool use case;
  it would change the contract from "exact rows when allowed" to "noisy
  aggregate".

**Consequences.** Some legitimate cohort queries return aggregate substitutes
(``"≥5 subjects, all sharing diagnosis X"``) instead of row-level data.
The agent's prompt explains this to the user when the gate fires.
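The combined check can be sketched as follows. This is a minimal stand-in for
:func:`scripts.security.kanon_gate.guard_rows_with_kanon_and_ldiv`; the
``guard_rows`` name, its signature, and the column names in the example are
hypothetical.

```python
from collections import defaultdict


def guard_rows(rows, quasi_ids, sensitive, k=5, l=2):
    """Pass rows only if every quasi-identifier equivalence class has
    at least k rows AND at least l distinct sensitive values.

    The second condition is what blocks the homogeneity attack: a class
    that satisfies k but shares one diagnosis still fails."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[q] for q in quasi_ids)].append(row[sensitive])
    for values in classes.values():
        if len(values) < k or len(set(values)) < l:
            return None  # caller substitutes an aggregate answer
    return rows


# Five rows match on all quasi-identifiers but share one diagnosis:
# k=5 passes, l=2 fails, so the gate withholds the rows.
homogeneous = [{"district": "D1", "age_band": "30-39", "dx": "HIV+"}] * 5
blocked = guard_rows(homogeneous, ["district", "age_band"], "dx")
```

When the gate returns ``None``, the tool falls back to the aggregate
substitute described in Consequences above rather than surfacing row-level
data.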