PHI Architecture ================ The canonical developer-facing description of the full PHI-handling story — the four zones, the eight-action scrub catalog, the integrity chain, the log redactor, the PDF orchestrator's redact-then-call posture, and the agent-boundary three-gate stack. For the reviewer-only IRB/Auditor profile, see :doc:`../irb_auditor/phi_handling`; for the architectural decisions behind these mechanisms see :doc:`decisions`. The Four Tiers (plus audit and one out-of-zone tier) ---------------------------------------------------- The honest-broker model has three filesystem zones plus one agent boundary tier. The audit envelope is a separate counts-only filesystem surface that the agent cannot read. The fifth path (``data/snapshots/{STUDY}/``) is *not* a zone in the honest-broker sense — it's a human-reviewed baseline, intentionally outside every LLM-readable surface. .. list-table:: :header-rows: 1 :widths: 14 30 56 * - Zone - Path - PHI posture * - **RED** - ``data/raw/{STUDY}/`` - Raw clinical inputs. Presumed PHI-bearing. Read-only by the extraction subprocess; the agent and the LLM never touch this zone. * - **AMBER** - ``tmp/{STUDY}/`` - Per-run scratch. Mode ``0700`` under umask ``0077``. PHI is present here for the duration of one pipeline run; on success the entire tree is overwritten with random bytes + ``fsync``-ed + unlinked. On failure preserved for forensic inspection. * - **GREEN** - ``output/{STUDY}/trio_bundle/`` + ``output/{STUDY}/agent/`` - PHI-free published artifacts + agent's own state. :func:`scripts.ai_assistant.file_access.validate_agent_read` admits paths in this zone only. * - **GREEN-PROTECT** - Agent tool boundary (not a directory) - Every tool return is checked by the PHI regex gate and, for row-level results, k-anonymity + l-diversity before the LLM can answer. The audit envelope: * **``output/{STUDY}/audit/``** — counts-only IRB evidence: lineage manifest, scrub report, cleanup report, telemetry. Same ``output/`` root as GREEN but hard-rejected by the agent's read-zone validator. The fifth path: * **``data/snapshots/{STUDY}/``** — human-reviewed cleaned trio bundle baseline used by the PDF orchestrator's fallback and restored over ``trio_bundle/`` when fresh PDF extraction fails or **Use Existing Study** is selected. **The LLM cannot read it.** The path is outside the GREEN tree and outside the audit envelope, so a stale baseline can never be served directly as live data. Maintainer-curated by hand; see :doc:`operations`. Zone enforcement ~~~~~~~~~~~~~~~~ Two complementary chokepoints: * :mod:`scripts.security.secure_env` — pipeline-side directory-level early-reject. Functions: ``assert_not_raw``, ``assert_output_zone``, ``assert_write_zone``, ``assert_trio_bundle_zone``. Used at pipeline boundaries (e.g. before the publish-step rename, before ``--pdf-source`` copy). * :mod:`scripts.ai_assistant.file_access` — agent-runtime path validator. Functions: ``validate_agent_read``, ``validate_agent_write``, ``validate_sandbox_write``, ``is_agent_readable``. Resolves every path with ``os.path.realpath`` and verifies containment with ``os.path.commonpath``. Reads accept ``trio_bundle/`` ∪ ``agent/`` (plus ``config/study_knowledge.yaml`` via an explicit allowlist for the StudyKnowledge helper). Agent-tool writes accept ``agent/`` only; ``exec_python`` sandbox writes narrow further to ``agent/analysis/``. Audit, telemetry, staging, raw, and the snapshot baseline are hard-rejected with ``ZoneViolationError``. The Eight-Action Scrub Catalog (Step 1.6) ----------------------------------------- :func:`scripts.security.phi_scrub.run_scrub` is invoked between the parallel extraction phase and the dataset cleanup. It operates on ``tmp/{STUDY}/datasets/*.jsonl`` in place. Eight action classes, evaluated in strict priority order against ~200 Indo-VAP-calibrated rules in ``scripts/security/phi_scrub.yaml``: 1. **keep** — pass through (only for confirmed non-PHI columns) 2. **birthdate** — replace with ``birthyear`` only (HIPAA Safe Harbor §164.514(b)(2)(i)) 3. **drop** — null out 4. **cap** — clamp at a quantile (the "age > 89" rule) 5. **generalize** — bucket into ranges (e.g. age → 5-year bands) 6. **suppress_small_cell** — null when the cohort cell has fewer than the configured threshold 7. **date_jitter (SANT)** — per-subject deterministic shift via ``HMAC-SHA256(key, subject_id)[:4] mod (2*max_days+1) - max_days``. Within-subject visit intervals are preserved exactly; absolute dates are obscured. 8. **hmac_pseudonymize** — replace IDs with ``SUBJ_``. Non-reversible without the key, deterministic with it. The HMAC key lives at ``~/.config/report_ai_portal/phi_key`` (mode ``0600``, outside the repo, never committed). Path resolution: ``$XDG_CONFIG_HOME/report_ai_portal/phi_key`` if set, else ``~/.config/report_ai_portal/phi_key``. Posture flags ~~~~~~~~~~~~~ The scrub config supports two "compliance posture" modes: * **Default (Safe Harbor / NIST SP 800-188)** — ``birthdate`` ⇒ ``birthyear``, drop precise dates, jitter within-subject, etc. * **Limited Dataset (HIPAA §164.514(e))** — ``birthdate`` and precise dates retained because a Data Use Agreement is in place. Activated by ``compliance_posture: limited_dataset`` in ``phi_scrub.yaml`` AND a non-empty ``authorities/phi_limited_dataset.md`` attestation note. Both pillars must hold. A YAML edit alone or an attestation note alone is insufficient. The Agent-Boundary Three-Gate Stack ----------------------------------- Every tool return string passes through three gates before reaching the LLM: Gate 1 — PHI regex catalog (``phi_gate_check`` / ``guard_text``) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Module: :mod:`scripts.security.phi_gate` and :mod:`scripts.ai_assistant.phi_safe`. Pattern catalog: :mod:`scripts.security.phi_patterns`. Allowlist: :mod:`scripts.security.phi_allowlist`. Blocking patterns: Aadhaar (12-digit + Verhoeff check), PAN (``[A-Z]{5}[0-9]{4}[A-Z]``), email, phone (Indian mobile patterns + international), precise dates (``\d{1,2}[/-]\d{1,2}[/-]\d{2,4}``, ISO ``\d{4}-\d{2}-\d{2}``), MRN-shaped tokens. When a blocking pattern fires, the response is replaced with a redaction message; the LLM sees the redaction notice, not the raw text. The clinical-phrase allowlist exempts strings like "INH 5 mg/kg" or "VL 300 copies/mL" from numeric-id false positives. Gate 2 — k-anonymity (k=5) (``guard_rows_with_kanon``) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Module: :mod:`scripts.security.kanon_gate`. Function: :func:`scripts.security.kanon_gate.kanon_check` (used as a primitive by ``guard_rows_with_kanon_and_ldiv`` below). When a tool would surface row-level data, the gate computes the equivalence class of each row over the configured quasi-identifiers (``_DEFAULT_QUASI_IDENTIFIERS``: typically ``age_band``, ``sex``, ``district``). If any equivalence class has fewer than 5 members, the gate suppresses the response and returns an aggregate or an explicit "too-few-records" message. Gate 3 — l-diversity (l=2) (``guard_rows_with_kanon_and_ldiv``) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Function: :func:`scripts.security.kanon_gate.l_diversity_check` (used as a primitive by ``guard_rows_with_kanon_and_ldiv``). When a k-anon-passing equivalence class shares the same sensitive attribute (e.g. all 5 rows have ``hiv_status = positive``), the gate also fires. l=2 means the class must contain at least 2 distinct values of the sensitive attribute. See ADR-015 in :doc:`decisions` for the rationale. The PDF Orchestrator's Redact-Then-Call Posture ----------------------------------------------- ADR-012. The wizard's "Load Study" button selects this path. Per PDF: 1. **Code path always runs.** ``pdfplumber`` extracts text + a heuristic candidate via :func:`scripts.extraction.pdf_pipeline._candidate_from_text`. 2. **Capability + provider gate.** :func:`scripts.utils.llm_capabilities.is_capable_model` enforces the model allowlist; :data:`scripts.extraction.pdf_pipeline.ORCHESTRATOR_SUPPORTED_PROVIDERS` restricts to anthropic + google. 3. **Redact-then-call.** Extracted text is scrubbed via :func:`scripts.extraction.pdf_pipeline._redact_text_for_llm` (which uses ``phi_patterns.BLOCKING_PATTERNS``). A defensive ``_assert_no_raw_phi_in_payload`` re-checks and raises if any blocking pattern survives. Only the redacted text reaches the LLM. 4. **Re-scrub the response.** The LLM response is parsed and every string field is re-scrubbed through :func:`scripts.ai_assistant.phi_safe.guard_text` before merge. 5. **Merge.** :func:`scripts.extraction.pdf_pipeline._merge` reconciles LLM + code candidates. 6. **Per-PDF snapshot fallback.** When the LLM tier is unavailable, the orchestrator publishes ``data/snapshots/{STUDY}/pdfs/{stem}_variables.json`` instead. **Code-only output is never published** — heuristic-only metadata without LLM oversight is too unreliable for IRB-grade work. **No raw PDF bytes leave the host on the orchestrator path.** The legacy raw-PDF API path (:func:`scripts.extraction.extract_pdf_data._resolve_pdf_provider`) remains as the back-compat fallback gated by the two-part attestation (``REPORTALIN_PDF_PHI_FREE=1`` + non-empty ``authorities/phi_free_pdfs.md``). The Integrity Chain ------------------- Three artifacts cryptographically link the raw inputs to the published outputs: 1. **Per-row provenance dict** — every JSONL record in ``trio_bundle/datasets/`` carries a ``_provenance`` field with ``raw_sha256``, ``pipeline_version``, ``extraction_engine``, ``source_file``, ``sheet_name``, ``row_index``, ``study_name``, ``extraction_utc``. Traceability per row. 2. **Step-cache manifests** at ``output/{STUDY}/audit/manifests/{step}.json``. Each step (e.g. ``dataset_processing``) records the SHA-256 of every input file it consumed; the next run hashes inputs again and skips the step if all hashes match. Implementation: :mod:`scripts.utils.step_cache`. 3. **Lineage manifest** at ``output/{STUDY}/audit/lineage_manifest.json`` — Step 4 of the pipeline. Records: * Per-input hash: ``{path, sha256, size_bytes, mtime_utc}`` for every file under ``data/raw/{STUDY}/``. * Per-output hash: same shape for every file under ``trio_bundle/``. * Per-leg audit pointer: paths to ``phi_scrub_report.json``, ``dataset_cleanup_report.json``, etc. * **PHI-key fingerprint**: SHA-256 of the HMAC key bytes (so IRB reviewers can verify the same key was used as expected without ever seeing the key itself). * **Compliance posture**: ``default`` / ``limited_dataset`` / ``disabled`` / ``unknown``. * Pipeline version + emit timestamp. Implementation: :mod:`scripts.utils.lineage`. Every audit report is **counts-only** (per ADR-009). No row contents, no before/after pairs, no subject identifiers. The auditor reads counts; if values are needed for debugging, the operator inspects the live AMBER staging files (which only exist for the duration of the run). Log Hygiene ----------- :func:`scripts.utils.log_hygiene.install_phi_redactor` attaches a ``logging.Filter`` to the root logger. Every log line goes through the filter at format time. Patterns covered: * API keys (``sk-ant-…``, ``sk-…``) * Aadhaar, PAN, MRN-shaped tokens * Phone, email * Precise dates The redactor is installed once in ``main.py`` (``_install_log_redactor_best_effort``) and once in the AI Assistant entry points so both worlds emit scrubbed logs. KeyStore -------- ADR-011. API keys never persist in the parent process's ``os.environ``. The Streamlit wizard's step 1 routes the pasted key into :mod:`scripts.ai_assistant.keystore` (an in-memory ``KeyStore`` registry); the corresponding ``*_API_KEY`` env variable is scrubbed. Every LLM client takes ``api_key=`` as an explicit kwarg sourced from the KeyStore. Keys are re-injected only into the short-lived pipeline subprocess via ``KeyStore.env_for_subprocess``. Subprocess Sandbox ------------------ ADR-010. ``run_python_analysis`` runs in a fresh ``subprocess.run`` child with ``RLIMIT_AS`` / ``RLIMIT_NPROC`` / ``RLIMIT_CPU`` clamps, a sanitised env (no ``*_API_KEY`` from the parent KeyStore), and read-only access to ``trio_bundle/`` only. AST + import + dunder + builtin guards remain inside the child as defence-in-depth. See :doc:`sandbox` for the full layered story. Module Map ---------- .. list-table:: :header-rows: 1 :widths: 38 62 * - Module - Role * - :mod:`scripts.security.phi_scrub` - Eight-action scrub catalog driver. Reads ``scripts/security/phi_scrub.yaml``. * - :mod:`scripts.security.phi_patterns` - Shared regex catalog (``BLOCKING_PATTERNS``, ``WARN_PATTERNS``). Used by the agent-output gate, the PDF orchestrator's redaction step, and the log redactor. * - :mod:`scripts.security.phi_allowlist` - Clinical-phrase exemption (e.g. "INH 5 mg/kg" not flagged as a numeric ID). * - :mod:`scripts.security.phi_gate` - Agent-output PHI gate. ``phi_gate_check`` returns blocked / allowed. * - :mod:`scripts.security.kanon_gate` - k-anonymity (k=5) + l-diversity (l=2). ``kanon_check``, ``l_diversity_check``, ``guard_rows_with_kanon_and_ldiv``. * - :mod:`scripts.security.secure_env` - Pipeline-side directory-level zone guards. * - :mod:`scripts.ai_assistant.file_access` - Agent-runtime path validator (``validate_agent_read`` etc.). * - :mod:`scripts.ai_assistant.phi_safe` - Agent-side PHI helpers: ``phi_safe_return``, ``guard_text``, ``guard_user_prompt``, ``sanitise_untrusted_snippet``, ``redact_phi_in_text``, ``sanitise_traceback``. * - :mod:`scripts.ai_assistant.keystore` - In-memory API-key registry. * - :mod:`scripts.utils.log_hygiene` - Logging filter for API-key + PHI redaction. * - :mod:`scripts.utils.lineage` - Lineage manifest emitter (Step 4). * - :mod:`scripts.utils.secure_staging` - AMBER staging prep + secure-zero-fill teardown. * - :mod:`scripts.utils.step_cache` - Per-step hash manifests for skip semantics. * - :mod:`scripts.extraction.pdf_pipeline` - PDF orchestrator with redact-then-call. IRB Benchmark Cross-Reference ----------------------------- The active IRB/Auditor conformance profile lives at :doc:`../irb_auditor/conformance`. Pillar mapping: * **Pillar 1 — PHI scrub catalog**: ``phi_scrub.py`` + ``phi_scrub.yaml``, the 8 action classes documented above. * **Pillar 2 — Zone isolation + agent access**: ``file_access.py`` + ``secure_env.py`` + the three agent-output gates. * **Pillar 3 — Secure channel + integrity**: ``secure_staging.py`` + ``lineage.py`` + ``step_cache.py``. * **Pillar 4 — Extraction safety**: ``dataset_pipeline.py`` + ``pdf_pipeline.py`` + ``extract_pdf_data.py``. * **Pillar 5 — Governance + retention + breach**: ``phi_scrub.bootstrap_key`` + ``_cleanup_staging`` + audit envelope. When You Touch This Code ------------------------ Every diff that touches anything under ``scripts/security/``, ``scripts/ai_assistant/{file_access,phi_safe,keystore}.py``, or ``scripts/extraction/pdf_pipeline.py`` should: 1. Run ``make test-all`` locally — the 22 PHI-critical test modules covering scrub, staging, file access, PDF redaction, PHI gates, lineage, and log hygiene must all pass. 2. Run ``make doc-freshness`` — the lint compares live source-of- truth values (tool count, scrub-action count, version) against prose in this page and the Sphinx docs. 3. If you change the scrub catalog (the YAML), the ``phi_scrub.yaml`` SHA-256 changes — which invalidates the PDF orchestrator's idempotent cache by design (the cache key includes ``phi_scrub.yaml`` hash). Confirmed by ``tests/security/test_pdf_redaction_pipeline.py::test_cache_key_invariants``. 4. If you add a new pattern to ``BLOCKING_PATTERNS``, add a positive test (the pattern fires) AND a negative test (the clinical-phrase allowlist still passes legitimate strings). See Also -------- * :doc:`architecture` — full system architecture. * :doc:`decisions` — ADRs (especially 010-015 which cover the PHI, PDF, snapshot, and agent-boundary work). * :doc:`sandbox` — subprocess sandbox. * :doc:`operations` — snapshot-baseline maintenance protocol. * :doc:`../irb_auditor/index` — reviewer-only PHI handling and conformance profile.