Architecture Decisions (ADRs)

What. One record per major architectural decision. ADRs 001–009 cover the original PHI-handling architecture. ADRs 010–015 cover the sandbox, KeyStore, PDF orchestrator, reviewed snapshot baseline, parallel extraction, and l-diversity decisions. Each record states what was decided, why, how it was implemented, what alternatives were considered, and what consequences to expect if the decision ages poorly.

Why. Commit messages and Claude-memory files capture decisions at the moment of authorship but are invisible to a new contributor who starts with git clone. An ADR page is the stable artifact: the next person who asks “why did they choose rule+allowlist over Presidio?” can find the answer here without spelunking git log.

How. One section per decision. Section layout: What (the decision stated plainly) · Why (the driving constraint or evidence) · How (the implementation mechanism) · Alternatives (what was considered and rejected, with the reason) · Consequences (what to watch for if the decision needs revisiting).

ADR-001 — Single-study, local-first runtime

What. The pipeline processes one fixed study under data/raw/{STUDY_NAME}/ per install. Multi-study, federated, HPC, and cloud-deploy workflows are explicitly out of scope.

Why. Clinical-study privacy posture is per-study. Cross-study workflows introduce cross-study re-identification risk that needs its own threat model; that’s a different project. Keeping this one narrow lets every defence be tuned to one study’s data dictionary and one IRB’s expectations.

How. config.STUDY_NAME pins the single study; every path under data/raw/, tmp/, and output/ is parameterised by it.

Alternatives. An upload-driven multi-tenant variant was considered at project start; rejected because multi-tenant isolation would have dominated the architecture while the real demand was single-study velocity.

Consequences. Supporting a second study today means a second repo clone with a different STUDY_NAME. If that becomes a burden, revisit by first generalising config.py study resolution, not by hacking new paths into the pipeline.

ADR-002 — HMAC-SHA256 pseudonymization with sidecar key (no vault)

What. Subject IDs are replaced with SUBJ_<hmac-sha256(key, id)[:12]>. The 32-byte key lives at ~/.config/report_ai_portal/phi_key (mode 0600) outside the repo tree.

Why. HMAC-SHA256 is non-reversible without the key (so pseudonyms leak no information about the identifier) and deterministic with the key (so the same subject maps to the same tag across every file in the bundle — longitudinal linkage preserved). A sidecar key outside the repo tree means key compromise requires host-level access; a repo-committed key would ride into every clone.

How. scripts.security.phi_scrub.pseudo_id() + scripts.security.phi_scrub.bootstrap_key(); key permissions enforced by scripts.security.phi_scrub.load_key() (hard-fail on missing or wrong mode).
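
A minimal sketch of the derivation using only the standard-library hmac module; the key value and subject ID are illustrative, and the real helper lives in scripts.security.phi_scrub:

```python
import hmac
import hashlib

def pseudo_id(key: bytes, subject_id: str) -> str:
    """Derive a stable pseudonym: same key + same ID -> same tag, every run."""
    digest = hmac.new(key, subject_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"SUBJ_{digest[:12]}"

key = b"\x00" * 32  # illustration only; the real key is read from the sidecar file
tag_a = pseudo_id(key, "10423")
tag_b = pseudo_id(key, "10423")
assert tag_a == tag_b                       # deterministic: longitudinal linkage preserved
assert len(tag_a) == len("SUBJ_") + 12
```

Without the key, inverting the tag requires breaking HMAC-SHA256; with the key, the mapping is reproducible across every file in the bundle.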

Alternatives.

  • AES-256-GCM keyvault with a database of original→pseudonym mappings (the vault.py approach, evaluated and rejected early in design). Rejected because a mapping database is a decryption surface — anyone with the key can invert every pseudonym. HMAC-only is strictly safer: deletion of the key forfeits the ability to re-derive pseudonyms, so key compartmentalisation is simpler to reason about.

  • Plain SHA-256 hashing. Rejected — hashing without a secret key is vulnerable to rainbow-table attacks when the input space is small (subject IDs are typically 4-5 digits).

  • Random tokens stored in a database. Rejected — breaks longitudinal linkage across studies or re-runs; same subject would get different tokens per run.

Consequences. Key rotation invalidates every prior pseudonym; full re-ingestion from raw is required. This is the intentional behaviour but can surprise new operators. Document explicitly in the key-management runbook.

ADR-003 — SANT per-subject date jitter over HIPAA year-only

What. Event dates are shifted by a per-subject deterministic offset ∈ [-30, +30] days. The offset is the same for every date belonging to one subject, so intra-subject intervals are preserved exactly.

Why. HIPAA Safe Harbor reduces dates to “year only” — that loses up to 364 days of resolution and breaks survival / incidence / person-time analyses. SANT jitter preserves per-subject interval structure (enough for every epi question the agent is expected to answer) while still obscuring the absolute calendar date.

How. scripts.security.phi_scrub.date_offset_days() derives the offset from HMAC-SHA256(key, subject_id)[:4] mod (2N+1) − N. scripts.security.phi_scrub.shift_date() applies it while preserving the source format (ISO / M-D-Y / D-M-Y).
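
A sketch of the offset derivation described above; the key is illustrative and the real helpers live in scripts.security.phi_scrub:

```python
import hmac
import hashlib
from datetime import date, timedelta

N = 30  # jitter half-window in days

def date_offset_days(key: bytes, subject_id: str, n: int = N) -> int:
    """Map HMAC-SHA256(key, subject_id)[:4] into [-n, +n] deterministically."""
    digest = hmac.new(key, subject_id.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") % (2 * n + 1) - n

def shift_date(key: bytes, subject_id: str, d: date) -> date:
    """Apply the subject's fixed offset; every date for one subject shifts together."""
    return d + timedelta(days=date_offset_days(key, subject_id))

key = b"\x00" * 32  # illustration only
a = shift_date(key, "10423", date(2024, 3, 1))
b = shift_date(key, "10423", date(2024, 3, 15))
assert (b - a).days == 14                       # intra-subject intervals preserved exactly
assert -30 <= date_offset_days(key, "10423") <= 30
```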

Alternatives.

  • Year-only masking (strict Safe Harbor). Rejected because it breaks survival analysis and time-to-event fits — core to the study research questions.

  • Global constant offset across all subjects. Rejected because the offset would be a single secret with no per-subject variation — one leaked event date reveals every subject’s calendar events.

  • Per-row random offset. Rejected because it destroys per-subject interval structure that the epi analyses depend on.

Consequences. SANT is a Limited Dataset technique in the HIPAA taxonomy (not Safe Harbor). The compliance_posture flag in phi_scrub.yaml makes this explicit — safe_harbor drops DOB entirely; limited_dataset requires an IRB-approved Data Use Agreement file to exist before the scrub will shift DOB along with other dates.

ADR-004 — Rule+allowlist over Microsoft Presidio

What. The PHI gate and log redactor use a curated regex catalog (scripts.security.phi_patterns) plus a clinical-phrase allowlist (scripts.security.phi_allowlist). Presidio is NOT installed as a runtime dependency.

Why. 2025 benchmarks (cited in References) found that Presidio achieves only 22.7% precision on mixed enterprise data and ~84% F1 on clinical notes; curated rules on calibrated fields consistently outperform it. The false-positive tax alone (legitimate clinical phrases flagged as PHI) would make every agent response unusable.

How. phi_gate.py compiles phi_patterns.BLOCKING_PATTERNS + WARN_PATTERNS; clinical-phrase allowlist suppresses warn-tier hits like “Treatment Completed” that match the generic name-like heuristic.
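
A toy sketch of the compile-then-allowlist flow; the patterns and allowlist entries here are illustrative stand-ins for the real catalog in scripts.security.phi_patterns and scripts.security.phi_allowlist:

```python
import re

# Hypothetical stand-ins for the real, much larger catalog.
BLOCKING_PATTERNS = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]    # SSN-shaped
WARN_PATTERNS = [re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")]  # generic name-like heuristic
ALLOWLIST = {"Treatment Completed"}                            # curated clinical phrases

def phi_gate_check(text: str) -> list:
    """Blocking hits always count; warn-tier hits count unless allowlisted."""
    hits = [m.group(0) for p in BLOCKING_PATTERNS for m in p.finditer(text)]
    for p in WARN_PATTERNS:
        for m in p.finditer(text):
            if m.group(0) not in ALLOWLIST:  # allowlist suppresses false positives
                hits.append(m.group(0))
    return hits

assert phi_gate_check("Status: Treatment Completed") == []
assert phi_gate_check("SSN 123-45-6789") == ["123-45-6789"]
```

The allowlist is what keeps the name-like heuristic usable: without it, common clinical phrases would trip the warn tier on nearly every response.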

Alternatives.

  • Presidio-analyzer default. Rejected per the benchmarks above.

  • John Snow Labs Clinical NER (commercial, 98.6% F1). Rejected because it’s commercial and requires a JSL license key; the architecture prioritises local-first zero-egress.

  • Stanford Stanza with i2b2-tuned model. Considered; ~93% F1, open-source. Deferred as an optional supplement for narrative residuals if the rule catalog proves insufficient.

  • Local Ollama prompt-engineered NER. Rejected for the current runtime because the calibrated rule catalog plus whole-field narrative drops provide a smaller, auditable surface.

Consequences. Rule maintenance cost scales with new PHI classes discovered during deployment. Offset by the catalog being data-driven (YAML + rule-class helpers) rather than inline regex — adding a new PHI class is a YAML addition plus a test case, not a code change.

ADR-005 — tmpfs staging as an operator opt-in (not default)

What. When REPORTALIN_TMPFS_STAGING=1 is set AND Linux /dev/shm is writable, the extraction leg redirects staging to /dev/shm/report_ai_portal/{STUDY}/ so raw extracted rows never hit physical disk. Default is off.

Why. tmpfs-backed staging is the gold-standard defence against filesystem forensics on the extraction host, but forcing it on by default would break macOS and Windows operators without warning. Opt-in keeps the strong defence available to Linux operators who ask for it while preserving portability for everyone else.

How. scripts.utils.secure_staging.resolve_staging_root() gates tmpfs redirection on the env flag AND the writability check. Graceful fallback to the default tmp/{STUDY}/ path when the env flag is set but the tmpfs isn’t available.
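
A sketch of the gating logic, under the assumption that the real resolver in scripts.utils.secure_staging follows the same flag-AND-writability shape:

```python
import os
from pathlib import Path

def resolve_staging_root(study: str) -> Path:
    """Opt-in tmpfs staging: env flag AND writable /dev/shm, else tmp/{STUDY}/."""
    if os.environ.get("REPORTALIN_TMPFS_STAGING") == "1":
        shm = Path("/dev/shm")
        if shm.is_dir() and os.access(shm, os.W_OK):
            return shm / "report_ai_portal" / study
    # Graceful fallback keeps macOS/Windows (and flag-less Linux) working.
    return Path("tmp") / study

root = resolve_staging_root("demo_study")
assert root.name == "demo_study"
```

Note both conditions must hold: setting the flag on a host without a writable /dev/shm silently falls back rather than failing the run.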

Alternatives.

  • Encrypted-at-rest staging. Considered; rejected as over-engineering — staging is short-lived, the on-disk hardening (mode 0700 + umask 0077 + zero-fill teardown) already resists the realistic threat model (accidental disk imaging, forensic recovery). Adding AES at rest would add a key-management burden without closing a different attack surface.

  • Default-on tmpfs with a Windows/macOS fallback. Considered; rejected because silent platform-dependent behaviour is a foot-gun — operators may assume tmpfs is active when it isn’t. Explicit opt-in makes the posture auditable.

Consequences. Linux operators who forget the env flag run without tmpfs hardening. Mitigated by documenting the flag prominently in .env.example, Configuration, and the operational runbook.

ADR-006 — External-API PDF extraction refused by default

What. scripts/extraction/extract_pdf_data._resolve_pdf_provider refuses to initialise an Anthropic / Google Gemini client unless the operator explicitly sets REPORTALIN_PDF_PHI_FREE=1.

Why. PDFs are treated as PHI-bearing by default — annotated CRFs can carry filled-in patient data, handwritten signatures, or example subject IDs in annotations. Sending them to an external LLM API is a network egress of PHI to a third party. The architecture forbids that without explicit operator assertion.

How. Feature-flag check at provider resolution; refusal raises ValueError with a remediation message listing three alternatives (flip the flag if PHI-free, use --pdf-source with pre-extracted JSON, skip the PDF leg entirely).
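
A minimal sketch of the refusal shape; the function name is a hypothetical stand-in for extract_pdf_data._resolve_pdf_provider, and the message wording is illustrative:

```python
import os

def resolve_pdf_provider_sketch(provider: str) -> str:
    """Refuse external-API PDF extraction unless the operator asserts PHI-free input."""
    if os.environ.get("REPORTALIN_PDF_PHI_FREE") != "1":
        raise ValueError(
            "PDFs are treated as PHI-bearing by default. Either: "
            "(1) set REPORTALIN_PDF_PHI_FREE=1 after verifying PDFs are PHI-free, "
            "(2) pass --pdf-source with pre-extracted JSON, or "
            "(3) skip the PDF leg."
        )
    return provider

os.environ.pop("REPORTALIN_PDF_PHI_FREE", None)  # demo: flag unset
try:
    resolve_pdf_provider_sketch("anthropic")
    raise AssertionError("expected refusal")
except ValueError as exc:
    assert "REPORTALIN_PDF_PHI_FREE" in str(exc)
```

The remediation message is the point: the error teaches the operator the three legal exits rather than just blocking.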

Alternatives.

  • Local-only PDF extraction (pdfplumber primary + local-Ollama multimodal fallback). Superseded by the PDF orchestrator: scripts/extraction/pdf_pipeline.py ships pdfplumber as the always-on code path, paired with a redacted-text LLM call (capable cloud or local model) via _merge, and a per-PDF snapshot baseline fallback at data/snapshots/{STUDY}/pdfs/. No raw PDF bytes leave the host on the orchestrator path. The original ADR-006 external-API gate remains as the legacy fallback for operators who cannot run the orchestrator.

  • Keep external-API default with an “I acknowledge PHI” banner. Rejected — banners are not operator assertions. An env flag creates a durable audit trail.

  • Drop PDF extraction entirely. Considered as the minimum safe posture; rejected because the annotated PDFs carry variable-definition annotations that the data dictionary alone does not cover.

Consequences. A new operator who needs PDF extraction must read the error message, verify their PDFs are PHI-free, and flip the flag. This is the intentional friction.

ADR-007 — Four-tier architecture (RED / AMBER / GREEN / GREEN-PROTECT)

What. Every path in the runtime belongs to exactly one of four named zones: RED (raw), AMBER (staging), GREEN (trio_bundle), or GREEN-PROTECT (agent boundary). The zones are enforced by assertions, not by convention.

Why. Convention-based zone separation rots; assertion-based zone separation fails fast. New code that accidentally reads from raw/ or writes to output/ before the scrub runs raises ZoneViolationError in CI before landing.

How. scripts.security.secure_env exposes the pipeline-side guards (assert_not_raw, assert_write_zone, assert_output_zone, assert_trio_bundle_zone) — one per tier. The agent world has its own chokepoint scripts.ai_assistant.file_access with validate_agent_read / validate_agent_write / validate_sandbox_write (layered on the same ZoneViolationError), so no agent tool can add a new file read without passing through the validator. Every file-touching module starts with the applicable assertion.
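
A toy sketch of one guard, assuming the real assertions in scripts.security.secure_env follow the same path-containment shape:

```python
from pathlib import Path

class ZoneViolationError(Exception):
    """Raised when code touches a path outside its permitted zone."""

RAW_ROOT = Path("data/raw").resolve()  # RED zone

def assert_not_raw(path: Path) -> None:
    """Fail fast if a pipeline stage tries to touch RED-zone (raw) data."""
    p = path.resolve()
    if RAW_ROOT == p or RAW_ROOT in p.parents:
        raise ZoneViolationError(f"RED-zone path not permitted here: {path}")

assert_not_raw(Path("output/demo/trio_bundle/visits.csv"))  # GREEN: passes
try:
    assert_not_raw(Path("data/raw/demo/visits.xlsx"))
    raise AssertionError("expected ZoneViolationError")
except ZoneViolationError:
    pass
```

Because the check resolves paths before comparing, relative-path tricks like data/raw/../raw/ do not bypass it.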

Alternatives. None seriously considered — the zone model is explicitly stated in HIPAA + ICMR + the RePORT India Common Protocol. The only question was whether to enforce it as comments or as code; code won.

Consequences. Refactors that move file I/O from one module to another must also move (or add) the zone assertion. Tests catch missing assertions via the “zone” test category in tests/test_secure_env.py.

ADR-008 — Agent boundary PHI + k-anon gate as defence-in-depth

What. Every LLM tool return string runs through scripts.security.phi_gate.phi_gate_check(). Row-level returns additionally run through scripts.security.kanon_gate.kanon_check() with k=5.

Why. The offline scrub (Step 1.6) is the primary PHI defence; the agent-boundary gate is the backstop. If a new narrative-field shape slips past the catalog or a quasi-identifier equivalence class is smaller than k=5 for a specific query, the gate catches the leak before the model ever sees it.

How. scripts.ai_assistant.phi_safe.phi_safe_return() decorator; scripts.ai_assistant.phi_safe.guard_rows_with_kanon() for row-level surfacing.
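
A sketch of the decorator pattern; the gate function here is a trivial stand-in for the real scripts.security.phi_gate.phi_gate_check:

```python
import functools

class PHILeakError(Exception):
    pass

def phi_gate_check(text: str) -> None:
    """Stand-in gate: the real check runs the full blocking/warn catalog."""
    if "123-45-6789" in text:  # illustrative blocking hit
        raise PHILeakError("blocking PHI pattern in tool return")

def phi_safe_return(fn):
    """Every tool return string passes the gate before the model ever sees it."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        out = fn(*args, **kwargs)
        phi_gate_check(out)
        return out
    return wrapper

@phi_safe_return
def lookup(_query: str) -> str:
    return "SUBJ_ab12cd34ef56: 3 visits recorded"

assert "SUBJ_" in lookup("visits")
```

The decorator shape is what makes coverage testable: a tool either carries @phi_safe_return or the coverage test fails.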

Alternatives. “Trust the offline scrub, skip the runtime gate”. Rejected — single-layer defences against PHI leakage have a bad track record across the industry. The k-anon gate is particularly important for the Indo-VAP cohort size (tribal + district + outcome combinations can trivially produce k<5 equivalence classes).

Consequences. Every tool author must decorate with @phi_safe_return. Missing decorations are caught by the test_agent_tools_zone_guard coverage tests (see Testing).

ADR-009 — Counts-only audit reports (never raw values)

What. Every report under output/{STUDY}/audit/ records counts of actions per field per file but never the underlying values. The lineage manifest records hashes, not contents.

Why. An auditor must be able to verify the scrub ran correctly without access to the raw data. Putting raw values in the audit would turn every audit report into a PHI leak surface of its own.

How. scripts.security.phi_scrub._emit_audit() emits a payload with scrubbed: [{scope, field, file, count}, ...]. The lineage manifest records {path, sha256, size_bytes, mtime_utc} entries.
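
A sketch of the counts-only aggregation; the scrub-log tuples are invented for illustration and the real emitter is scripts.security.phi_scrub._emit_audit:

```python
from collections import Counter

# Hypothetical scrub log: one (scope, field, file) tuple per action.
# The scrubbed values themselves are never recorded anywhere.
actions = [
    ("dataset", "subject_id", "visits.xlsx"),
    ("dataset", "subject_id", "visits.xlsx"),
    ("dataset", "dob", "visits.xlsx"),
]

payload = {
    "scrubbed": [
        {"scope": s, "field": f, "file": fl, "count": n}
        for (s, f, fl), n in sorted(Counter(actions).items())
    ]
}

assert all("value" not in entry for entry in payload["scrubbed"])
assert payload["scrubbed"][1]["count"] == 2  # two subject_id scrubs counted, not copied
```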

Alternatives. A verbose audit with before/after value pairs was considered for debugging; rejected — such an audit is a full PHI copy. If debugging requires value inspection, the operator reads the (short-lived, mode-0700) AMBER staging files directly, never the audit.

Consequences. Audit reports are thin on evidence by design. The counts suffice for IRB acceptance; deeper debugging requires live staging inspection.

ADR-010 — Subprocess + rlimits sandbox for run_python_analysis

What. LLM-generated Python from the run_python_analysis tool runs in a fresh OS subprocess with RLIMIT_AS / RLIMIT_NPROC / RLIMIT_CPU clamps + a sanitised env + read-only access to the trio bundle. The original AST guards remain as defense-in-depth inside the child.

Why. AST guards alone don’t stop CPython gadget escapes (hand-crafted __class__.__bases__[0] traversals, co_consts poisoning, etc.). An OS-level isolation boundary defangs every in-process escape: even a perfectly-jailbroken interpreter inside the subprocess cannot exceed the rlimits, cannot read the parent’s KeyStore, cannot write outside trio_bundle/, cannot fork beyond RLIMIT_NPROC.

How. Implemented under scripts.ai_assistant.sandbox.replicate (public API), scripts.ai_assistant.sandbox.runner (child entry point), and scripts.ai_assistant.sandbox.limits (rlimits helpers). The generated .py file is persisted to output/{STUDY}/agent/analysis/{ts}.py so the operator can copy + reproduce externally — no hidden code path.
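
A POSIX-only sketch of the clamp-then-exec pattern (limits and env values are illustrative; RLIMIT_NPROC is omitted here for brevity, and the real implementation lives under scripts.ai_assistant.sandbox):

```python
import resource
import subprocess
import sys

def run_sandboxed(code: str, mem_bytes: int = 1024 * 1024 * 1024,
                  cpu_seconds: int = 10) -> subprocess.CompletedProcess:
    """Run untrusted code in a fresh child with address-space and CPU clamps."""
    def clamp():
        # Runs in the child after fork, before exec -- parent is unaffected.
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=clamp,
        env={"PATH": "/usr/bin:/bin"},  # sanitised env: no API keys inherited
        capture_output=True, text=True, timeout=cpu_seconds + 5,
    )

result = run_sandboxed("print(1 + 1)")
assert result.stdout.strip() == "2"
```

Even a fully jailbroken interpreter inside the child is bounded by the rlimits and the sanitised environment; it cannot read keys the parent never passed in.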

Alternatives.

  • AST guards only. Rejected — known CPython escape gadgets defeat AST-level filtering.

  • ``nsjail`` / ``firejail`` profile. Considered as a future high-assurance option for cloud deployments; not shipped because the deployment target is the operator’s laptop and nsjail needs root + Linux-specific kernel features.

  • WebAssembly (Pyodide). Considered; rejected because pandas / numpy / scipy / statsmodels don’t run there cleanly, and the capability surface run_python_analysis needs is exactly those libraries.

Consequences. Subprocess startup is ~75% of per-call latency on macOS — acceptable for the analytical-question-per-second workload but would need parquet + mmap if the call rate increased materially. Tracked as future work in Sandbox: Subprocess-Isolated Code Execution.

ADR-011 — KeyStore (in-memory API-key registry)

What. API keys never persist in the parent process’s os.environ for the lifetime of the app. The Streamlit wizard routes the key into an in-memory KeyStore registry; the corresponding *_API_KEY env variable is scrubbed from os.environ. Keys are re-injected only into the short-lived pipeline subprocess via KeyStore.env_for_subprocess.

Why. os.environ is a process-wide global. A single logger.info(f"env={dict(os.environ)}") debug-print, an exception traceback rendered with locals(), or a third-party library helpfully echoing config — any of these can leak ANTHROPIC_API_KEY into the log file and from there into the audit envelope or the operator’s terminal scrollback. Removing the keys from os.environ once they’re loaded into the in-memory registry shrinks the leak surface materially.

How. Implemented as scripts.ai_assistant.keystore. Every LLM client constructor (ChatAnthropic, ChatOpenAI, ChatGoogleGenerativeAI, ChatNVIDIA, ChatOllama) takes an explicit api_key= kwarg sourced from the KeyStore — no environment lookup at construction time. Tested by tests/test_keystore.py, tests/test_log_hygiene_keys.py, tests/test_no_keys_in_parent_environ.py.
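
A minimal sketch of the absorb-then-reinject lifecycle; the class shape is illustrative and the real registry is scripts.ai_assistant.keystore:

```python
import os

class KeyStore:
    """In-memory registry; keys are removed from os.environ once loaded."""

    def __init__(self):
        self._keys = {}

    def absorb(self, var):
        val = os.environ.pop(var, None)  # scrub from the process-wide environment
        if val:
            self._keys[var] = val

    def get(self, var):
        return self._keys.get(var)       # clients take api_key= from here explicitly

    def env_for_subprocess(self):
        """Re-inject keys only into a short-lived child's environment."""
        return {**os.environ, **self._keys}

os.environ["ANTHROPIC_API_KEY"] = "sk-demo"   # illustration only
ks = KeyStore()
ks.absorb("ANTHROPIC_API_KEY")
assert "ANTHROPIC_API_KEY" not in os.environ              # parent env stays clean
assert ks.env_for_subprocess()["ANTHROPIC_API_KEY"] == "sk-demo"
```

After absorb(), an accidental dump of os.environ (logger, traceback, third-party echo) no longer contains the key.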

Alternatives.

  • OS keyring (keyring package). Considered for persistence across sessions; rejected because the operator already has the .env file as their persistence layer, and adding keyring would introduce platform-specific behaviour (macOS Keychain vs Linux libsecret vs Windows Credential Vault).

  • Encrypted on-disk vault. Rejected — adds a master-key bootstrap problem on top of the existing PHI-key bootstrap problem.

Consequences. Operators using the CLI (python main.py --pipeline directly without the wizard) still rely on the env-var path. The CLI main.py reads LLM_PROVIDER / ANTHROPIC_API_KEY from env; this is intentional for back-compat with existing shell-script invocations. The KeyStore posture only applies to the in-app (Streamlit/CLI-REPL) lifetimes.

ADR-012 — Two-way PDF orchestrator (pdfplumber + redacted-text LLM merge)

What. PDF extraction has two co-existing paths. The default path (scripts/extraction/pdf_pipeline.py, the wizard’s “Load Study” selection) extracts text locally with pdfplumber, redacts the text via phi_patterns.BLOCKING_PATTERNS before any LLM call, sends only the redacted text to a capable LLM, re-scrubs the response, and merges with the code candidate. Per-PDF fallback to the reviewed snapshot baseline at data/snapshots/{STUDY}/pdfs/ when the LLM tier is unavailable. No raw PDF bytes leave the host on this path.

Why. ADR-006 refused external-API PDF extraction by default because raw PDF bytes carry PHI risk. The two-part attestation gate in the legacy path is a workable but operator-friction-heavy posture. The orchestrator path replaces “ship raw bytes after attestation” with “redact text first, then ship”. The redaction catalog is the same one the agent-output PHI gate uses, so the audit story is internally consistent.

How. Implemented under scripts.extraction.pdf_pipeline. Capability gate via scripts.utils.llm_capabilities.is_capable_model() (Claude Opus 4.6+, Sonnet 4.6+, GPT-5+, Gemini 2.5 Pro, Llama 3.3 405B; Ollama excluded by default). Provider gate via scripts.extraction.pdf_pipeline.ORCHESTRATOR_SUPPORTED_PROVIDERS (anthropic + google only — where _extract_via_llm has client wiring). Idempotent cache keyed on SHA-256(pdf_bytes || provider || model || phi_scrub.yaml hash).
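
A sketch of the idempotent cache key; the exact combination scheme in pdf_pipeline may differ, but the invariant is the same — any change to the PDF bytes, provider, model, or scrub config yields a new key:

```python
import hashlib

def cache_key(pdf_bytes: bytes, provider: str, model: str, scrub_cfg: bytes) -> str:
    """SHA-256 over all inputs; hashing each part's digest avoids boundary ambiguity."""
    h = hashlib.sha256()
    for part in (pdf_bytes, provider.encode(), model.encode(), scrub_cfg):
        h.update(hashlib.sha256(part).digest())
    return h.hexdigest()

k1 = cache_key(b"%PDF-1.7 ...", "anthropic", "model-x", b"yaml-v1")
k2 = cache_key(b"%PDF-1.7 ...", "anthropic", "model-x", b"yaml-v2")
assert k1 != k2          # editing phi_scrub.yaml forces re-extraction
assert len(k1) == 64
```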

Alternatives.

  • Local-only PDF extraction (pdfplumber + local-Ollama multimodal). Considered; rejected as primary because Ollama models don’t reliably produce the JSON-schema response on a 30-page CRF. The snapshot fallback covers this case.

  • Keep ADR-006’s attestation gate as the only path. Rejected because operator friction effectively meant the PDF leg was skipped for most runs; the orchestrator unblocks the leg without weakening the egress posture.

Consequences. ADR-006 is now a fallback path, not the primary path. The attestation gate remains in the legacy extract_pdf_data._resolve_pdf_provider. The orchestrator uses the data/snapshots/{STUDY}/ baseline tier described in ADR-013.

ADR-013 — Single reviewed snapshot baseline

What. One human-reviewed copy of the trio bundle lives at data/snapshots/{STUDY}/. It is the single per-study cleaned baseline and mirrors the layout of output/{STUDY}/trio_bundle/.

Why. The PDF orchestrator (ADR-012) needs a deterministic, reviewable baseline to fall back on when the LLM tier fails, and the wizard needs a clean way to recover from incomplete PDF extraction. Multi-named scratch copies created ambiguity: operators could confuse local rollback data with the reviewed baseline. The architecture now keeps only the reviewed baseline.

How. config.STUDY_SNAPSHOTS_DIR points at data/snapshots/{STUDY}/. scripts.utils.snapshots saves and restores that baseline. The wizard’s Use Existing Study button and the full pipeline’s PDF-failure path both restore the baseline over output/{STUDY}/trio_bundle/.

Crucially: the LLM agent’s read zone is strictly trio_bundle/ + agent/. The snapshot path is intentionally outside the LLM read zone. A stale baseline can never be served directly as live data because the agent cannot read it.

Alternatives.

  • Restore points under ``output/{STUDY}/agent/``. Rejected — they are scratch state, sit inside the agent zone, and do not match the human-reviewed fallback contract.

  • Require operators to seed via env var or a separate repo. Rejected — data/snapshots/{STUDY}/ keeps the fallback colocated with study data and easy to review.

Consequences. Maintainers who follow the snapshot-baseline maintenance protocol (see Operations) must run a manual make snapshot after a verified production run. The whole value of a snapshot is the human verification step; auto-generation from --force runs without review is a foot-gun the protocol explicitly forbids.

ADR-014 — Parallel extraction phase (3-worker ThreadPoolExecutor)

What. main.py’s extraction phase runs Dictionary / Datasets / PDFs in parallel on a 3-worker concurrent.futures.ThreadPoolExecutor. The cleanup chain (PHI scrub / dataset cleanup / cleanup propagation), publish, and variables.json build run sequentially after the join.

Why. The three legs are fully decoupled — different RED inputs, different AMBER staging subdirs, no shared mutable state. The PDF leg’s HTTP latency (orchestrator LLM calls) is amortised against the dataset leg’s Excel-parsing CPU. Cleanup chain stays sequential because it has hard data dependencies (PHI scrub needs all dataset records; propagation needs the cleanup audit; publish needs propagation; variables.json needs publish to have landed).

How. Each leg is wrapped in a helper (_run_dict_leg, _run_dataset_leg, _run_pdf_leg) that returns a result dict; futures are gathered via as_completed so a fast leg can short-circuit while a slow one runs.
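
A sketch of the fan-out/join shape; the leg bodies here are trivial stand-ins for the real _run_dict_leg / _run_dataset_leg / _run_pdf_leg helpers in main.py:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def _run_dict_leg():     # stand-ins: each real leg reads its own RED input
    return {"leg": "dictionary", "ok": True}

def _run_dataset_leg():  # and writes its own AMBER staging subdir
    return {"leg": "datasets", "ok": True}

def _run_pdf_leg():
    return {"leg": "pdfs", "ok": True}

results = {}
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = {pool.submit(leg): leg.__name__
               for leg in (_run_dict_leg, _run_dataset_leg, _run_pdf_leg)}
    for fut in as_completed(futures):   # fast legs surface first
        results[futures[fut]] = fut.result()

assert len(results) == 3
assert all(r["ok"] for r in results.values())
# The cleanup chain (scrub -> propagation -> publish) runs sequentially after this join.
```

Threads suffice because the PDF leg's latency is HTTP-bound (GIL released) while the dataset leg is the CPU-heavy one.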

Alternatives.

  • ``multiprocessing.Pool`` for true parallelism. Rejected — the PDF leg’s HTTP calls release the GIL, so threads are sufficient; multiprocessing would multiply the staging-tree write contention.

  • ``asyncio``. Rejected — the existing extraction code is sync pandas + openpyxl; converting to async would be a much bigger change.

Consequences. The staging root is protected by a per-study process lock before AMBER is purged, so two operator-triggered runs cannot race over the same tmp/{STUDY}/ workspace. VerboseLogger uses thread-local indentation, keeping --verbose tree output readable while the three extraction legs overlap.

ADR-015 — l-diversity (l=2) on row-returning tools

What. Every row-returning agent tool runs the result through scripts.security.kanon_gate.guard_rows_with_kanon_and_ldiv(). The gate enforces both k-anonymity (k=5; ADR-008 / Pillar 2.4) AND l-diversity (l=2 distinct sensitive-attribute values per equivalence class).

Why. k-anonymity alone is insufficient when every row in a small equivalence class shares the same sensitive attribute (the “homogeneity attack” of Machanavajjhala et al. 2007). If five rows match on all quasi-identifiers AND all five share the diagnosis HIV+, the k=5 check passes but the cohort is just as re-identifying as a k=1 disclosure. l-diversity guards against this by also requiring at least l distinct values of the sensitive attribute.

How. Default sensitive attributes are configured in _DEFAULT_SENSITIVE_ATTRIBUTES of scripts.security.kanon_gate. Tested by tests/security/test_kanon_l_diversity.py.
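
A toy sketch of the combined k-anonymity + l-diversity check; the row data and attribute names are invented, and the real gate is scripts.security.kanon_gate.guard_rows_with_kanon_and_ldiv:

```python
from collections import defaultdict

K, L = 5, 2  # thresholds from ADR-008 and this ADR

def guard_rows(rows, quasi_ids, sensitive):
    """Suppress equivalence classes with fewer than K rows OR fewer than
    L distinct sensitive values (the homogeneity-attack case)."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[q] for q in quasi_ids)].append(row)
    kept = []
    for members in classes.values():
        if len(members) >= K and len({m[sensitive] for m in members}) >= L:
            kept.extend(members)
    return kept

rows = [{"district": "D1", "age_band": "30-39", "dx": "HIV+"} for _ in range(5)]
assert guard_rows(rows, ["district", "age_band"], "dx") == []   # k=5 passes, l=1 fails
rows[0]["dx"] = "TB"
assert len(guard_rows(rows, ["district", "age_band"], "dx")) == 5
```

The first assertion is exactly the homogeneity attack from the Why paragraph: five matching rows, one shared diagnosis, suppressed despite passing k=5.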

Alternatives.

  • t-closeness. Considered; deferred. t-closeness checks the distribution of the sensitive attribute against the population baseline, which needs population-level priors that aren’t always available. l-diversity gives the bulk of the protection without that requirement; t-closeness can be added later as a third gate.

  • Differential privacy. Out of scope for the row-returning-tool use case; would change the contract from “exact rows when allowed” to “noisy aggregate”.

Consequences. Some legitimate cohort queries return aggregate substitutes ("≥5 subjects, all sharing diagnosis X") instead of row-level data. The agent’s prompt explains this to the user when the gate fires.