PHI Architecture
The canonical developer-facing description of the full PHI-handling story — the four zones, the eight-action scrub catalog, the integrity chain, the log redactor, the PDF orchestrator’s redact-then-call posture, and the agent-boundary three-gate stack. For the reviewer-only IRB/Auditor profile, see PHI Handling; for the architectural decisions behind these mechanisms see Architecture Decisions (ADRs).
The Four Tiers (plus audit and one out-of-zone tier)
The honest-broker model has three filesystem zones plus one agent
boundary tier. The audit envelope is a separate counts-only filesystem
surface that the agent cannot read. The fifth path
(data/snapshots/{STUDY}/) is not a zone in the honest-broker
sense — it’s a human-reviewed baseline, intentionally outside every
LLM-readable surface.
| Zone | Path | PHI posture |
|---|---|---|
| RED | | Raw clinical inputs. Presumed PHI-bearing. Read-only by the extraction subprocess; the agent and the LLM never touch this zone. |
| AMBER | | Per-run scratch. Mode |
| GREEN | | PHI-free published artifacts + agent’s own state. |
| GREEN-PROTECT | Agent tool boundary (not a directory) | Every tool return is checked by the PHI regex gate and, for row-level results, k-anonymity + l-diversity before the LLM can answer. |
The audit envelope:
``output/{STUDY}/audit/``: counts-only IRB evidence (lineage manifest, scrub report, cleanup report, telemetry). It shares the
output/ root with GREEN but is hard-rejected by the agent’s read-zone validator.
The fifth path:
``data/snapshots/{STUDY}/``: human-reviewed, cleaned trio-bundle baseline used by the PDF orchestrator’s fallback and restored over
trio_bundle/ when fresh PDF extraction fails or Use Existing Study is selected. The LLM cannot read it. The path is outside the GREEN tree and outside the audit envelope, so a stale baseline can never be served directly as live data. Maintainer-curated by hand; see Operations.
Zone enforcement
Two complementary chokepoints:
scripts.security.secure_env: pipeline-side directory-level early-reject. Functions: assert_not_raw, assert_output_zone, assert_write_zone, assert_trio_bundle_zone. Used at pipeline boundaries (e.g. before the publish-step rename, before the --pdf-source copy).
scripts.ai_assistant.file_access: agent-runtime path validator. Functions: validate_agent_read, validate_agent_write, validate_sandbox_write, is_agent_readable. Resolves every path with os.path.realpath and verifies containment with os.path.commonpath. Reads accept trio_bundle/ ∪ agent/ (plus config/study_knowledge.yaml via an explicit allowlist for the StudyKnowledge helper). Agent-tool writes accept agent/ only; exec_python sandbox writes narrow further to agent/analysis/. Audit, telemetry, staging, raw, and the snapshot baseline are hard-rejected with ZoneViolationError.
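The realpath-plus-commonpath containment check described above can be sketched as follows. This is a minimal illustration, not the project's validator: the function names `is_contained` and `validate_read` and the directory layout under `study_root` are assumptions made for the example, and only `ZoneViolationError`, `os.path.realpath`, and `os.path.commonpath` are taken from the text.

```python
import os


class ZoneViolationError(PermissionError):
    """Raised when a path resolves outside every allowed zone."""


def is_contained(path: str, allowed_roots: list[str]) -> bool:
    """Return True if `path`, after symlink resolution, sits under an allowed root."""
    real = os.path.realpath(path)
    for root in allowed_roots:
        real_root = os.path.realpath(root)
        try:
            if os.path.commonpath([real, real_root]) == real_root:
                return True
        except ValueError:
            # commonpath raises for paths on different drives (Windows).
            continue
    return False


def validate_read(path: str, study_root: str) -> str:
    """Hypothetical read validator: accept trio_bundle/ and agent/ only."""
    allowed = [
        os.path.join(study_root, "trio_bundle"),
        os.path.join(study_root, "agent"),
    ]
    if not is_contained(path, allowed):
        raise ZoneViolationError(f"read outside agent zones: {path}")
    return os.path.realpath(path)
```

Resolving with `realpath` before comparing is what defeats both `..` traversal and symlinks that point from an allowed directory into a rejected one. A plain string-prefix comparison would accept both.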
The Eight-Action Scrub Catalog (Step 1.6)
scripts.security.phi_scrub.run_scrub() is invoked between the
parallel extraction phase and the dataset cleanup. It operates on
tmp/{STUDY}/datasets/*.jsonl in place. Eight action classes,
evaluated in strict priority order against ~200 Indo-VAP-calibrated
rules in scripts/security/phi_scrub.yaml:
keep: pass through (only for confirmed non-PHI columns)
birthdate: replace with birthyear only (HIPAA Safe Harbor §164.514(b)(2)(i))
drop: null out
cap: clamp at a quantile (the “age > 89” rule)
generalize: bucket into ranges (e.g. age → 5-year bands)
suppress_small_cell: null when the cohort cell has fewer than the configured threshold
date_jitter (SANT): per-subject deterministic shift via HMAC-SHA256(key, subject_id)[:4] mod (2*max_days+1) - max_days. Within-subject visit intervals are preserved exactly; absolute dates are obscured.
hmac_pseudonymize: replace IDs with SUBJ_<HMAC-SHA256(key, value)[:12]>. Non-reversible without the key, deterministic with it.
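The two HMAC-based actions can be illustrated with the standard library. This is a sketch under stated assumptions, not the code in phi_scrub: it assumes `[:4]` means the first four digest bytes read as an unsigned big-endian integer, and that `[:12]` means the first twelve hexadecimal characters of the digest. The function names are hypothetical.

```python
import hashlib
import hmac
from datetime import date, timedelta


def jitter_days(key: bytes, subject_id: str, max_days: int) -> int:
    """Deterministic per-subject offset in the range [-max_days, +max_days]."""
    digest = hmac.new(key, subject_id.encode("utf-8"), hashlib.sha256).digest()
    # Assumption: first 4 bytes interpreted as an unsigned big-endian integer.
    n = int.from_bytes(digest[:4], "big")
    return n % (2 * max_days + 1) - max_days


def shift_date(key: bytes, subject_id: str, d: date, max_days: int = 30) -> date:
    """Shift one date by the subject's offset; every visit of a subject moves equally."""
    return d + timedelta(days=jitter_days(key, subject_id, max_days))


def pseudonymize(key: bytes, value: str) -> str:
    """Replace an identifier with a keyed, truncated digest."""
    hexdigest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Assumption: truncation to the first 12 hex characters.
    return "SUBJ_" + hexdigest[:12]
```

Because the offset depends only on the key and the subject ID, two visits of the same subject shift by the same number of days, so the interval between them is unchanged.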
The HMAC key lives at ~/.config/report_ai_portal/phi_key (mode
0600, outside the repo, never committed). Path resolution:
$XDG_CONFIG_HOME/report_ai_portal/phi_key if set, else
~/.config/report_ai_portal/phi_key.
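The path-resolution rule reads naturally as a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and treating an empty `XDG_CONFIG_HOME` as unset is an assumption (the text only says "if set").

```python
import os
import stat
from pathlib import Path


def resolve_phi_key_path(environ=None) -> Path:
    """Return the key path, preferring $XDG_CONFIG_HOME when it is set."""
    env = os.environ if environ is None else environ
    xdg = env.get("XDG_CONFIG_HOME")
    # Assumption: an empty value falls back to ~/.config like an unset one.
    base = Path(xdg) if xdg else Path.home() / ".config"
    return base / "report_ai_portal" / "phi_key"


def key_file_is_private(path: Path) -> bool:
    """Check that the permission bits are exactly 0600 (POSIX only)."""
    return stat.S_IMODE(path.stat().st_mode) == 0o600
```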
Posture flags
The scrub config supports two “compliance posture” modes:
Default (Safe Harbor / NIST SP 800-188): birthdate ⇒ birthyear, drop precise dates, jitter within-subject, etc.
Limited Dataset (HIPAA §164.514(e)): birthdate and precise dates are retained because a Data Use Agreement is in place. Activated by compliance_posture: limited_dataset in phi_scrub.yaml AND a non-empty authorities/phi_limited_dataset.md attestation note.
Both conditions must hold to activate Limited Dataset. A YAML edit alone or an attestation note alone is insufficient.
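The two-part activation rule can be expressed as a small predicate. This is a hypothetical sketch: the function name is invented, a whitespace-only note is assumed to count as empty, and falling back to the default posture when only one condition holds is an assumption (the real code may raise or warn instead).

```python
from pathlib import Path


def resolve_posture(yaml_posture: str, attestation_path: Path) -> str:
    """Return 'limited_dataset' only if the YAML flag AND a non-empty note are present."""
    note_ok = (
        attestation_path.is_file()
        # Assumption: a note containing only whitespace counts as empty.
        and attestation_path.read_text(encoding="utf-8").strip() != ""
    )
    if yaml_posture == "limited_dataset" and note_ok:
        return "limited_dataset"
    # Assumption: anything short of both conditions keeps the stricter default.
    return "default"
```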
The Agent-Boundary Three-Gate Stack
Every tool return string passes through three gates before reaching the LLM:
Gate 1 — PHI regex catalog (phi_gate_check / guard_text)
Module: scripts.security.phi_gate and
scripts.ai_assistant.phi_safe. Pattern catalog:
scripts.security.phi_patterns. Allowlist:
scripts.security.phi_allowlist.
Blocking patterns: Aadhaar (12-digit + Verhoeff check), PAN
([A-Z]{5}[0-9]{4}[A-Z]), email, phone (Indian mobile patterns
+ international), precise dates (\d{1,2}[/-]\d{1,2}[/-]\d{2,4},
ISO \d{4}-\d{2}-\d{2}), MRN-shaped tokens. When a blocking
pattern fires, the response is replaced with a redaction
message; the LLM sees the redaction notice, not the raw text.
The clinical-phrase allowlist exempts strings like “INH 5 mg/kg” or “VL 300 copies/mL” from numeric-id false positives.
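A blocking-pattern gate of this kind can be sketched with `re`. The example is deliberately small and is not the phi_patterns catalog: it uses only the PAN and date expressions quoted above, adds `\b` word boundaries as an assumption, and the names `BLOCKING`, `REDACTION_NOTICE`, and `gate_text` are hypothetical. Aadhaar checksum validation, phone, email, MRN patterns, and the allowlist are omitted.

```python
import re

# Subset of the patterns quoted in the text; word boundaries are an assumption.
BLOCKING = {
    "pan": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    "date_dmy": re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"),
    "date_iso": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

REDACTION_NOTICE = "[response withheld: possible PHI detected]"


def gate_text(text: str) -> tuple[str, list[str]]:
    """Return (safe_text, names_of_patterns_that_fired).

    If any blocking pattern matches, the whole response is replaced by a
    notice, so the caller never forwards the raw text.
    """
    hits = [name for name, pattern in BLOCKING.items() if pattern.search(text)]
    if hits:
        return REDACTION_NOTICE, hits
    return text, []
```

Replacing the whole response, rather than masking only the matched span, matches the behavior described above: the LLM sees a redaction notice and none of the surrounding raw text.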
Gate 2 — k-anonymity (k=5) (guard_rows_with_kanon)
Module: scripts.security.kanon_gate. Function:
scripts.security.kanon_gate.kanon_check() (used as a
primitive by guard_rows_with_kanon_and_ldiv below).
When a tool would surface row-level data, the gate computes the
equivalence class of each row over the configured quasi-identifiers
(_DEFAULT_QUASI_IDENTIFIERS: typically age_band, sex,
district). If any equivalence class has fewer than 5 members, the
gate suppresses the response and returns an aggregate or an explicit
“too-few-records” message.
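The equivalence-class check can be sketched in a few lines. This is an illustration of k-anonymity in general, not the kanon_gate implementation: the function names and the shape of the returned dict are assumptions, and the quasi-identifier defaults are the ones the text calls typical.

```python
from collections import Counter


def kanon_violations(rows, quasi_identifiers, k=5):
    """Map each equivalence class with fewer than k rows to its size."""
    classes = Counter(
        tuple(row.get(q) for q in quasi_identifiers) for row in rows
    )
    return {key: size for key, size in classes.items() if size < k}


def guard_rows(rows, quasi_identifiers=("age_band", "sex", "district"), k=5):
    """Hypothetical gate: suppress the whole result if any class is too small."""
    if kanon_violations(rows, quasi_identifiers, k):
        return {"suppressed": True, "message": "too few records"}
    return {"suppressed": False, "rows": rows}
```

A single undersized class suppresses the entire response, because releasing the remaining rows would still let a reader infer which records were withheld.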
Gate 3 — l-diversity (l=2) (guard_rows_with_kanon_and_ldiv)
Function: scripts.security.kanon_gate.l_diversity_check() (used as a
primitive by guard_rows_with_kanon_and_ldiv).
When every row in a k-anon-passing equivalence class carries the same
sensitive-attribute value (e.g. all 5 rows have hiv_status = positive), the gate
also fires. l=2 means the class must contain at least 2 distinct
values of the sensitive attribute. See ADR-015 in Architecture Decisions (ADRs)
for the rationale.
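The distinct-value check works like this in miniature. The sketch is a generic l-diversity test rather than the project's `l_diversity_check()`; the function name, parameter names, and return shape are assumptions.

```python
from collections import defaultdict


def ldiv_violations(rows, quasi_identifiers, sensitive, min_distinct=2):
    """Map each equivalence class to its distinct-value count when below the minimum."""
    values = defaultdict(set)
    for row in rows:
        key = tuple(row.get(q) for q in quasi_identifiers)
        values[key].add(row.get(sensitive))
    return {key: len(v) for key, v in values.items() if len(v) < min_distinct}
```

A class of five rows that all share one sensitive value passes k=5 but still discloses that value for every member, which is the case this gate is meant to catch.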
The PDF Orchestrator’s Redact-Then-Call Posture
ADR-012. The wizard’s “Load Study” button selects this path. Per PDF:
Code path always runs. pdfplumber extracts text + a heuristic candidate via scripts.extraction.pdf_pipeline._candidate_from_text().
Capability + provider gate. scripts.utils.llm_capabilities.is_capable_model() enforces the model allowlist; scripts.extraction.pdf_pipeline.ORCHESTRATOR_SUPPORTED_PROVIDERS restricts to anthropic + google.
Redact-then-call. Extracted text is scrubbed via scripts.extraction.pdf_pipeline._redact_text_for_llm() (which uses phi_patterns.BLOCKING_PATTERNS). A defensive _assert_no_raw_phi_in_payload re-checks and raises if any blocking pattern survives. Only the redacted text reaches the LLM.
Re-scrub the response. The LLM response is parsed and every string field is re-scrubbed through scripts.ai_assistant.phi_safe.guard_text() before merge.
Merge. scripts.extraction.pdf_pipeline._merge() reconciles LLM + code candidates.
Per-PDF snapshot fallback. When the LLM tier is unavailable, the orchestrator publishes data/snapshots/{STUDY}/pdfs/{stem}_variables.json instead. Code-only output is never published: heuristic-only metadata without LLM oversight is too unreliable for IRB-grade work.
No raw PDF bytes leave the host on the orchestrator path. The
legacy raw-PDF API path
(scripts.extraction.extract_pdf_data._resolve_pdf_provider())
remains as the back-compat fallback gated by the two-part
attestation (REPORTALIN_PDF_PHI_FREE=1 + non-empty
authorities/phi_free_pdfs.md).
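The redact-then-assert step on the orchestrator path can be sketched as two small functions. This is a hedged illustration: the pattern list is a stand-in for phi_patterns.BLOCKING_PATTERNS (only the date and PAN expressions quoted earlier, with assumed word boundaries), and the function names are hypothetical rather than the private helpers named above.

```python
import re

# Stand-in for the shared catalog; the real list is larger.
PATTERNS = [
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"),
    re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
]


def redact_text(text: str) -> str:
    """Mask every blocking-pattern match in the extracted text."""
    for pattern in PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def assert_no_raw_phi(payload: str) -> None:
    """Defensive re-check: raise if anything matching a blocking pattern survived."""
    for pattern in PATTERNS:
        if pattern.search(payload):
            raise ValueError("blocking pattern survived redaction")


def build_llm_payload(extracted_text: str) -> str:
    """Redact, verify, and only then hand the text to the caller."""
    redacted = redact_text(extracted_text)
    assert_no_raw_phi(redacted)
    return redacted
```

The second pass looks redundant when both functions share one pattern list. Its value is as a tripwire: if the redaction step is later changed or bypassed, the call fails loudly instead of sending raw text.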
The Integrity Chain
Three artifacts cryptographically link the raw inputs to the published outputs:
Per-row provenance dict: every JSONL record in trio_bundle/datasets/ carries a _provenance field with raw_sha256, pipeline_version, extraction_engine, source_file, sheet_name, row_index, study_name, extraction_utc. Traceability per row.
Step-cache manifests at output/{STUDY}/audit/manifests/{step}.json. Each step (e.g. dataset_processing) records the SHA-256 of every input file it consumed; the next run hashes inputs again and skips the step if all hashes match. Implementation: scripts.utils.step_cache.
Lineage manifest at output/{STUDY}/audit/lineage_manifest.json (Step 4 of the pipeline). It records:
Per-input hash: {path, sha256, size_bytes, mtime_utc} for every file under data/raw/{STUDY}/.
Per-output hash: same shape for every file under trio_bundle/.
Per-leg audit pointer: paths to phi_scrub_report.json, dataset_cleanup_report.json, etc.
PHI-key fingerprint: SHA-256 of the HMAC key bytes (so IRB reviewers can verify the same key was used as expected without ever seeing the key itself).
Compliance posture: default / limited_dataset / disabled / unknown.
Pipeline version + emit timestamp.
Implementation: scripts.utils.lineage.
Every audit report is counts-only (per ADR-009). No row contents, no before/after pairs, no subject identifiers. The auditor reads counts; if values are needed for debugging, the operator inspects the live AMBER staging files (which only exist for the duration of the run).
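The hash-and-skip behavior of the step cache, and the key fingerprint recorded in the lineage manifest, can be sketched with `hashlib`. The manifest layout used here (a single `inputs` mapping of path to digest) and all function names are assumptions for illustration; the real manifests may carry more fields.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 so large inputs are not loaded at once."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def write_manifest(manifest_path: Path, inputs: list[Path]) -> None:
    """Record the digest of every input consumed by a step (hypothetical layout)."""
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    digests = {str(p): sha256_file(p) for p in inputs}
    manifest_path.write_text(json.dumps({"inputs": digests}, indent=2))


def inputs_unchanged(manifest_path: Path, inputs: list[Path]) -> bool:
    """True only if a manifest exists and every current digest matches it."""
    if not manifest_path.is_file():
        return False
    recorded = json.loads(manifest_path.read_text())["inputs"]
    return recorded == {str(p): sha256_file(p) for p in inputs}


def key_fingerprint(key_bytes: bytes) -> str:
    """SHA-256 of the key bytes: comparable across runs without revealing the key."""
    return hashlib.sha256(key_bytes).hexdigest()
```

A step would call `inputs_unchanged` first and skip its work when it returns True, then call `write_manifest` after a successful run.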
Log Hygiene
scripts.utils.log_hygiene.install_phi_redactor() attaches a
logging.Filter to the root logger. Every log line goes through
the filter at format time. Patterns covered:
API keys (sk-ant-…, sk-…)
Aadhaar, PAN, MRN-shaped tokens
Phone, email
Precise dates
The redactor is installed once in main.py (_install_log_redactor_best_effort)
and once in the AI Assistant entry points so both worlds emit
scrubbed logs.
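A redacting `logging.Filter` can be sketched as below. This is not the log_hygiene implementation: the two patterns are simplified stand-ins (the API-key expression in particular is an assumed shape), and the class and function names are hypothetical.

```python
import logging
import re

_PATTERNS = [
    # Assumption: API keys look like "sk-" followed by 8+ token characters.
    re.compile(r"sk-[A-Za-z0-9_-]{8,}"),
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
]


class RedactingFilter(logging.Filter):
    """Rewrite each record's message with sensitive spans masked."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # merge args first so they are scrubbed too
        for pattern in _PATTERNS:
            message = pattern.sub("[REDACTED]", message)
        record.msg, record.args = message, None
        return True  # never drop the record, only rewrite it


def install_redactor(handler: logging.Handler) -> None:
    """Attach the filter to a handler so every record it emits is scrubbed."""
    handler.addFilter(RedactingFilter())
```

This sketch attaches the filter to a handler. In the standard library, a filter attached to a logger is not consulted for records that propagate up from descendant loggers, whereas a handler-level filter sees every record the handler emits.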
KeyStore
ADR-011. API keys never persist in the parent process’s
os.environ. The Streamlit wizard’s step 1 routes the pasted key
into scripts.ai_assistant.keystore (an in-memory
KeyStore registry); the corresponding *_API_KEY env variable
is scrubbed. Every LLM client takes api_key= as an explicit
kwarg sourced from the KeyStore. Keys are re-injected only into
the short-lived pipeline subprocess via
KeyStore.env_for_subprocess.
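The flow described here (store the key in memory, scrub the environment variable, re-inject only for a child process) can be sketched as follows. Apart from the method name `env_for_subprocess`, everything in this example is hypothetical, including the class name and the `store`/`get` signatures.

```python
import os


class KeyStoreSketch:
    """In-memory API-key registry; keys never stay in the parent's environment."""

    def __init__(self):
        self._keys = {}

    def store(self, env_var: str, key: str, environ=None) -> None:
        """Remember the key and remove the matching variable from the environment."""
        env = os.environ if environ is None else environ
        self._keys[env_var] = key
        env.pop(env_var, None)

    def get(self, env_var: str) -> str:
        """Return the key for explicit api_key= injection into a client."""
        return self._keys[env_var]

    def env_for_subprocess(self, base_env=None) -> dict:
        """Build a copy of the environment with keys added, leaving the parent untouched."""
        env = dict(os.environ if base_env is None else base_env)
        env.update(self._keys)
        return env
```

The returned dict would be passed as `env=` to the short-lived pipeline subprocess, so the key exists in that child's environment only.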
Subprocess Sandbox
ADR-010. run_python_analysis runs in a fresh subprocess.run
child with RLIMIT_AS / RLIMIT_NPROC / RLIMIT_CPU clamps,
a sanitised env (no *_API_KEY from the parent KeyStore), and
read-only access to trio_bundle/ only. AST + import + dunder +
builtin guards remain inside the child as defence-in-depth. See
Sandbox: Subprocess-Isolated Code Execution for the full layered story.
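The resource clamps and the sanitised environment can be illustrated with `resource` and `subprocess` on a POSIX host. This is a sketch, not the sandbox itself: the limit values are arbitrary, the function names are hypothetical, and the read-only filesystem restriction and the in-child AST/import guards are not shown.

```python
import os
import resource
import subprocess
import sys


def _clamp(kind: int, wanted: int) -> None:
    """Lower one rlimit, never asking for more than the current hard limit."""
    _soft, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)
    resource.setrlimit(kind, (wanted, wanted))


def _apply_limits() -> None:
    """Runs in the child between fork and exec. Values are illustrative."""
    _clamp(resource.RLIMIT_AS, 2 << 30)   # address space: 2 GiB
    _clamp(resource.RLIMIT_CPU, 10)       # CPU seconds
    _clamp(resource.RLIMIT_NPROC, 64)     # processes/threads for this user


def sanitised_env(parent_env) -> dict:
    """Copy the environment without any *_API_KEY variables."""
    return {k: v for k, v in parent_env.items() if not k.endswith("_API_KEY")}


def run_sandboxed(code: str, timeout: float = 15.0) -> subprocess.CompletedProcess:
    """Execute a code string in a fresh, resource-limited Python child."""
    return subprocess.run(
        [sys.executable, "-I", "-c", code],
        env=sanitised_env(os.environ),
        preexec_fn=_apply_limits,  # POSIX only; not safe in multi-threaded parents
        capture_output=True,
        text=True,
        timeout=timeout,
    )
```

The limits are applied in the child only, so the parent process keeps its own quotas. `-I` starts the interpreter in isolated mode, which ignores `PYTHON*` variables and the user site directory.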
Module Map
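| Module | Role |
|---|---|
| scripts.security.phi_scrub | Eight-action scrub catalog driver. Reads |
| scripts.security.phi_patterns | Shared regex catalog ( |
| scripts.security.phi_allowlist | Clinical-phrase exemption (e.g. “INH 5 mg/kg” not flagged as a numeric ID). |
| scripts.security.phi_gate | Agent-output PHI gate. |
| scripts.security.kanon_gate | k-anonymity (k=5) + l-diversity (l=2). |
| scripts.security.secure_env | Pipeline-side directory-level zone guards. |
| scripts.ai_assistant.file_access | Agent-runtime path validator ( |
| scripts.ai_assistant.phi_safe | Agent-side PHI helpers: |
| scripts.ai_assistant.keystore | In-memory API-key registry. |
| scripts.utils.log_hygiene | Logging filter for API-key + PHI redaction. |
| scripts.utils.lineage | Lineage manifest emitter (Step 4). |
| secure_staging.py | AMBER staging prep + secure-zero-fill teardown. |
| scripts.utils.step_cache | Per-step hash manifests for skip semantics. |
| scripts.extraction.pdf_pipeline | PDF orchestrator with redact-then-call. |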
IRB Benchmark Cross-Reference
The active IRB/Auditor conformance profile lives at Conformance Evidence. Pillar mapping:
Pillar 1 (PHI scrub catalog): phi_scrub.py + phi_scrub.yaml, the 8 action classes documented above.
Pillar 2 (Zone isolation + agent access): file_access.py + secure_env.py + the three agent-output gates.
Pillar 3 (Secure channel + integrity): secure_staging.py + lineage.py + step_cache.py.
Pillar 4 (Extraction safety): dataset_pipeline.py + pdf_pipeline.py + extract_pdf_data.py.
Pillar 5 (Governance + retention + breach): phi_scrub.bootstrap_key + _cleanup_staging + audit envelope.
When You Touch This Code
Every diff that touches anything under scripts/security/,
scripts/ai_assistant/{file_access,phi_safe,keystore}.py, or
scripts/extraction/pdf_pipeline.py should:
Run make test-all locally: the 22 PHI-critical test modules covering scrub, staging, file access, PDF redaction, PHI gates, lineage, and log hygiene must all pass.
Run make doc-freshness: the lint compares live source-of-truth values (tool count, scrub-action count, version) against prose in this page and the Sphinx docs.
If you change the scrub catalog (the YAML), the phi_scrub.yaml SHA-256 changes, which invalidates the PDF orchestrator’s idempotent cache by design (the cache key includes the phi_scrub.yaml hash). Confirmed by tests/security/test_pdf_redaction_pipeline.py::test_cache_key_invariants.
If you add a new pattern to BLOCKING_PATTERNS, add a positive test (the pattern fires) AND a negative test (the clinical-phrase allowlist still passes legitimate strings).
See Also
Architecture — full system architecture.
Architecture Decisions (ADRs) — ADRs (especially 010-015 which cover the PHI, PDF, snapshot, and agent-boundary work).
Sandbox: Subprocess-Isolated Code Execution — subprocess sandbox.
Operations — snapshot-baseline maintenance protocol.
IRB/Auditor Profile — reviewer-only PHI handling and conformance profile.