Developer Guide
This guide is for people who change, review, operate, or release the code. User onboarding stays in the User Guide; IRB/auditor evidence stays in the IRB/Auditor Profile.
Start by Role
| Reader | Goal | Start here |
|---|---|---|
| Pipeline developer | Change extraction, PHI scrub, cleanup, publish, variables, or lineage behavior. | |
| PDF pipeline developer | Change PDF extraction, redaction, merge, or snapshot fallback. | |
| Agent/tool developer | Add or change assistant tools without breaking file-zone and PHI gates. | Agent Instructions (for AI Coding Assistants), then API Reference |
| Privacy or security reviewer | Inspect load-bearing controls, invariants, and tests. | PHI Architecture, Sandbox: Subprocess-Isolated Code Execution, then Testing |
| Maintainer | Run verification, restore reviewed snapshots, and prepare releases. | |
| Documentation contributor | Keep README and Sphinx organized by audience. | |
Contents
Architecture & Decisions
- Architecture
- PHI Architecture
- The Four Tiers (plus audit and one out-of-zone tier)
- The Eight-Action Scrub Catalog (Step 1.6)
- The Agent-Boundary Three-Gate Stack
- The PDF Orchestrator’s Redact-Then-Call Posture
- The Integrity Chain
- Log Hygiene
- KeyStore
- Subprocess Sandbox
- Module Map
- IRB Benchmark Cross-Reference
- When You Touch This Code
- See Also
- Architecture Decisions (ADRs)
- ADR-001 — Single-study, local-first runtime
- ADR-002 — HMAC-SHA256 pseudonymization with sidecar key (no vault)
- ADR-003 — SANT per-subject date jitter over HIPAA year-only
- ADR-004 — Rule+allowlist over Microsoft Presidio
- ADR-005 — tmpfs staging as an operator opt-in (not default)
- ADR-006 — External-API PDF extraction refused by default
- ADR-007 — Four-tier architecture (RED / AMBER / GREEN / GREEN-PROTECT)
- ADR-008 — Agent boundary PHI + k-anon gate as defence-in-depth
- ADR-009 — Counts-only audit reports (never raw values)
- ADR-010 — Subprocess + rlimits sandbox for run_python_analysis
- ADR-011 — KeyStore (in-memory API-key registry)
- ADR-012 — Two-way PDF orchestrator (pdfplumber + redacted-text LLM merge)
- ADR-013 — Single reviewed snapshot baseline
- ADR-014 — Parallel extraction phase (3-worker ThreadPoolExecutor)
- ADR-015 — l-diversity (l=2) on row-returning tools
- References
- Tech Stack
Pipeline Components
Reference & Status
Working Rules
- Preserve the raw → staging → published bundle → agent-boundary PHI model described in PHI Architecture.
- Keep implementation changes, tests, and documentation in the same PR when behavior changes.
- Run the smallest focused tests first, then the repo gates required by Testing.
- Keep README brief. Put durable detail in Sphinx and link to it.