# Project Status
This page is the current implementation snapshot for maintainers. It is not a changelog; historical PR numbers and patch labels belong in GitHub, not in the operational docs.
## Implemented

### Pipeline
- Single-study pipeline entry point in `main.py`.
- Parallel extraction legs for dictionary, datasets, and PDFs.
- Supported tabular inputs are `.xlsx` and `.csv` only.
- AMBER staging under `tmp/{STUDY}/` with mode `0700` and secure deletion on successful completion.
- Step 1.6 PHI scrub over staged datasets before publish.
- Dataset cleanup and cleanup propagation into dictionary and PDF metadata.
- Atomic publish into `output/{STUDY}/trio_bundle/`.
- `variables.json` build from the published trio bundle.
- Counts-only audit reports and lineage manifest under `output/{STUDY}/audit/`.
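The staging behavior above can be sketched as follows. This is a minimal illustration, not the project's actual API: `stage_study` and `secure_delete` are invented helper names, and the single zero-overwrite pass stands in for whatever secure-deletion routine the pipeline really uses.

```python
import shutil
from pathlib import Path


def stage_study(study: str, root: str = "tmp") -> Path:
    """Create an AMBER staging directory with owner-only permissions (0700)."""
    staging = Path(root) / study
    staging.mkdir(parents=True, exist_ok=False)
    staging.chmod(0o700)  # no group/other access while PHI is staged
    return staging


def secure_delete(staging: Path) -> None:
    """Zero-fill staged files before unlinking, then remove the tree."""
    for f in staging.rglob("*"):
        if f.is_file():
            size = f.stat().st_size
            with open(f, "r+b") as fh:
                fh.write(b"\x00" * size)  # one overwrite pass before unlink
    shutil.rmtree(staging)
```

On successful publish, `secure_delete` would run over `tmp/{STUDY}/` so no staged PHI survives the run.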
### PHI and Security Boundaries
- RED raw data is limited to extraction code.
- AMBER staging is never agent-readable.
- GREEN consists of `output/{STUDY}/trio_bundle/` plus `output/{STUDY}/agent/`.
- GREEN-PROTECT is the agent tool boundary: PHI regex gate, k-anonymity, and l-diversity before row-level answers are surfaced.
- `output/{STUDY}/audit/` is a counts-only audit envelope and is rejected by the agent read validator.
- `data/snapshots/{STUDY}/` is a reviewed baseline restored when PDF extraction fails or Use Existing Study is selected; it is outside the agent read surface.
- API keys route through the in-memory KeyStore in the Streamlit flow.
- `run_python_analysis` executes generated code in a constrained subprocess and persists reproducibility artifacts under `output/{STUDY}/agent/analysis/`.
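A minimal sketch of the k-anonymity and l-diversity checks that GREEN-PROTECT applies before surfacing row-level answers. The function names, default thresholds, and row format are assumptions for illustration, not the project's actual implementation.

```python
from collections import Counter


def passes_k_anonymity(rows, quasi_identifiers, k=5):
    """True only if every quasi-identifier combination occurs at least k times."""
    combos = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in combos.values())


def passes_l_diversity(rows, quasi_identifiers, sensitive, l=2):
    """True only if every quasi-identifier group contains at least l
    distinct values of the sensitive attribute."""
    groups = {}
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups.setdefault(key, set()).add(row[sensitive])
    return all(len(values) >= l for values in groups.values())
```

A row-level answer would only be released when both predicates hold; otherwise the tool falls back to counts or refuses.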
### AI Assistant
- LangChain/LangGraph ReAct agent constructed through `scripts/ai_assistant/agent_graph.py`.
- Twelve structured tools registered in `scripts/ai_assistant/agent_tools.ALL_TOOLS`.
- CLI and Streamlit interfaces.
- Grounded-answer prompt contract: resolve variables before analysis, use deterministic tools for statistical claims, separate computed facts from interpretation, and surface caveats plainly.
- Provider support through OpenAI, Anthropic, Google Gemini, Ollama, and NVIDIA AI Endpoints.
- Ollama qwen3 downgrade ladder for local memory pressure.
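The downgrade ladder can be sketched as trying progressively smaller local models until one loads. The exact `qwen3` tags listed here and the loader callback are assumptions for illustration, not the project's actual sequence.

```python
# Illustrative ladder: step down to the next smaller model when loading
# fails under memory pressure (e.g. out-of-memory on the local Ollama host).
QWEN3_LADDER = ["qwen3:14b", "qwen3:8b", "qwen3:4b", "qwen3:1.7b"]


def pick_model(ladder, try_load):
    """Return the first model tag that try_load() accepts, else raise."""
    last_error = None
    for tag in ladder:
        try:
            try_load(tag)
            return tag
        except MemoryError as exc:
            last_error = exc  # too large for this host; step down the ladder
    raise RuntimeError("no model in the ladder fits in memory") from last_error
```

In practice `try_load` would wrap the Ollama client call; here it is just a callback so the fallback logic stays testable.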
### PDF Extraction
- Default wizard path uses the two-way PDF orchestrator: `pdfplumber` text extraction, PHI redaction before any LLM call, re-scrubbed LLM response, merge with the code candidate, and per-PDF fallback to `data/snapshots/{STUDY}/pdfs/`.
- Legacy raw-PDF API path remains available for CLI compatibility, but is refused unless `REPORTALIN_PDF_PHI_FREE=1` and a non-empty `authorities/phi_free_pdfs.md` attestation are both present.
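The double gate on the legacy path can be sketched as a single predicate: the environment flag and a non-empty attestation file must both be present. `legacy_raw_pdf_allowed` is a hypothetical helper name; only the flag name and attestation path come from the behavior described above.

```python
import os
from pathlib import Path


def legacy_raw_pdf_allowed(
    env=os.environ,
    attestation=Path("authorities/phi_free_pdfs.md"),
) -> bool:
    """Gate for the legacy raw-PDF API path: refuse unless the env flag is
    set to "1" AND the attestation file exists and is non-empty."""
    flag_ok = env.get("REPORTALIN_PDF_PHI_FREE") == "1"
    file_ok = attestation.is_file() and attestation.stat().st_size > 0
    return flag_ok and file_ok
```

Either condition failing alone is enough to refuse, so an operator cannot bypass the attestation with the flag or vice versa.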
## Verification
Use the command output for the commit under review as the source of truth. Current gates:
```shell
make verify
make test-all
make docs-quality
make security
```
The CI workflow runs Ruff, mypy, and the full pytest suite on Python 3.11, 3.12, and 3.13. The docs-quality workflow runs doc-freshness, Sphinx build, linkcheck, and documentation metrics.
## IRB Conformance
The active IRB/Auditor conformance profile lives in Conformance Evidence. It maps each claim to:
- the applicable authority,
- the disk artifact an auditor can inspect,
- and the pytest assertion that fails if the claim regresses.
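One way to picture a conformance entry is as a record with exactly those fields, plus a guard that flags incomplete entries. Everything below is invented for illustration: the entry values, file paths, test node ID, and helper name are not taken from the actual Conformance Evidence page.

```python
# Hypothetical shape of one conformance-evidence entry; the four keys mirror
# the mapping described above, the values are placeholders.
CONFORMANCE = [
    {
        "claim": "AMBER staging is created with mode 0700",
        "authority": "authorities/phi_handling.md",          # placeholder path
        "artifact": "tmp/{STUDY}/",
        "test": "tests/test_staging.py::test_staging_mode",  # placeholder node ID
    },
]


def incomplete_entries(entries):
    """Return the claims of entries missing any of the four required fields."""
    required = {"claim", "authority", "artifact", "test"}
    return [e.get("claim") for e in entries if not required <= e.keys()]
```

A check like this could itself run under pytest, so a claim added without its authority, artifact, or regression test fails CI immediately.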
The reviewer-facing PHI handling narrative lives in PHI Handling.
## Known Follow-Ups
These are documented gaps or operator-owned extensions; none require the agent to read raw PHI.
- Study-team breach-response runbook.
- Study-team data-retention and destruction runbook.
- Optional district-population mapping table when a site needs population-threshold geography generalization beyond the current drop catalog.
- Optional `config/consent_scope.yaml` for an IEC-approved field allowlist layered above the scrub catalog.
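Since the consent-scope file is still a follow-up, no schema exists yet; the fragment below is one hypothetical shape for it, and every key and value is illustrative only.

```yaml
# Hypothetical config/consent_scope.yaml -- no schema is defined yet.
iec_approval: IEC-0000-PLACEHOLDER   # approval reference, placeholder value
allowlist:                           # fields an IEC has approved for release
  - variable: age_band
    scope: analysis
  - variable: district
    scope: aggregate_only            # never surfaced at row level
```

Any real schema would be layered above the scrub catalog, so a field absent from this allowlist stays scrubbed regardless of what the catalog alone would permit.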