# Project Status
This page is the current implementation snapshot for maintainers. It is not a changelog; historical PR numbers and patch labels belong in GitHub, not in the operational docs.
## Implemented

### Pipeline
- Single-study pipeline entry point in `main.py`.
- Parallel extraction legs for dictionary, datasets, and PDFs.
- Supported tabular inputs are `.xlsx` and `.csv` only.
- AMBER staging under `tmp/{STUDY}/` with mode `0700` and secure deletion on successful completion.
- Step 1.6 PHI scrub over staged datasets before publish.
- Dataset cleanup and cleanup propagation into dictionary and PDF metadata.
- Atomic publish into `output/{STUDY}/trio_bundle/`.
- `variables.json` build from the published trio bundle.
- Counts-only audit reports and lineage manifest under `output/{STUDY}/audit/`.
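The staging behavior above can be sketched as follows. This is a minimal illustration, not the project's actual API: `stage_study` and `secure_delete` are invented helper names, and the single zero-overwrite pass stands in for whatever secure-deletion routine the pipeline really uses.

```python
import shutil
from pathlib import Path


def stage_study(study: str, root: str = "tmp") -> Path:
    """Create an AMBER staging directory with owner-only permissions (0700)."""
    staging = Path(root) / study
    staging.mkdir(parents=True, exist_ok=False)
    staging.chmod(0o700)  # no group/other access while PHI is staged
    return staging


def secure_delete(staging: Path) -> None:
    """Zero-fill staged files before unlinking, then remove the tree."""
    for f in staging.rglob("*"):
        if f.is_file():
            size = f.stat().st_size
            with open(f, "r+b") as fh:
                fh.write(b"\x00" * size)  # one overwrite pass before unlink
    shutil.rmtree(staging)
```

On successful publish, `secure_delete` would run over `tmp/{STUDY}/` so no staged PHI survives the run.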
### PHI and Security Boundaries
- RED raw data is limited to extraction code.
- AMBER staging is never agent-readable.
- GREEN consists of `output/{STUDY}/trio_bundle/` plus `output/{STUDY}/agent/`.
- GREEN-PROTECT is the agent tool boundary: PHI regex gate, k-anonymity, and l-diversity before row-level answers are surfaced.
- `output/{STUDY}/audit/` is a counts-only audit envelope and is rejected by the agent read validator.
- `data/snapshots/{STUDY}/` is a reviewed baseline restored when PDF extraction fails or Use Existing Study is selected; it is outside the agent read surface.
- API keys route through the in-memory KeyStore in the Streamlit flow.
- `run_python_analysis` executes generated code in a constrained subprocess and persists reproducibility artifacts under `output/{STUDY}/agent/analysis/`.
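A minimal sketch of the k-anonymity and l-diversity checks that GREEN-PROTECT applies before surfacing row-level answers. The function names, default thresholds, and row format are assumptions for illustration, not the project's actual implementation.

```python
from collections import Counter


def passes_k_anonymity(rows, quasi_identifiers, k=5):
    """True only if every quasi-identifier combination occurs at least k times."""
    combos = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in combos.values())


def passes_l_diversity(rows, quasi_identifiers, sensitive, l=2):
    """True only if every quasi-identifier group contains at least l
    distinct values of the sensitive attribute."""
    groups = {}
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups.setdefault(key, set()).add(row[sensitive])
    return all(len(values) >= l for values in groups.values())
```

A row-level answer would only be released when both predicates hold; otherwise the tool falls back to counts or refuses.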
### AI Assistant
- LangChain/LangGraph ReAct agent constructed through `scripts/ai_assistant/agent_graph.py`.
- Twelve structured tools registered in `scripts/ai_assistant/agent_tools.ALL_TOOLS`.
- CLI and Streamlit interfaces.
- Grounded-answer prompt contract: resolve variables before analysis, use deterministic tools for statistical claims, separate computed facts from interpretation, and surface caveats plainly.
- Provider support through OpenAI, Anthropic, Google Gemini, Ollama, and NVIDIA AI Endpoints.
- Ollama qwen3 downgrade ladder for local memory pressure.
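The downgrade ladder can be sketched as trying progressively smaller local models until one loads. The exact `qwen3` tags listed here and the loader callback are assumptions for illustration, not the project's actual sequence.

```python
# Illustrative ladder: step down to the next smaller model when loading
# fails under memory pressure (e.g. out-of-memory on the local Ollama host).
QWEN3_LADDER = ["qwen3:14b", "qwen3:8b", "qwen3:4b", "qwen3:1.7b"]


def pick_model(ladder, try_load):
    """Return the first model tag that try_load() accepts, else raise."""
    last_error = None
    for tag in ladder:
        try:
            try_load(tag)
            return tag
        except MemoryError as exc:
            last_error = exc  # too large for this host; step down the ladder
    raise RuntimeError("no model in the ladder fits in memory") from last_error
```

In practice `try_load` would wrap the Ollama client call; here it is just a callback so the fallback logic stays testable.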
### PDF Extraction
- Default wizard path uses the two-way PDF orchestrator: `pdfplumber` text extraction, PHI redaction before any LLM call, re-scrubbed LLM response, merge with the code candidate, and per-PDF fallback to `data/snapshots/{STUDY}/pdfs/`.
- Legacy raw-PDF API path remains available for CLI compatibility, but is refused unless `REPORTALIN_PDF_PHI_FREE=1` and a non-empty `authorities/phi_free_pdfs.md` attestation are both present.
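The double gate on the legacy path can be sketched as a single predicate: the environment flag and a non-empty attestation file must both be present. `legacy_raw_pdf_allowed` is a hypothetical helper name; only the flag name and attestation path come from the behavior described above.

```python
import os
from pathlib import Path


def legacy_raw_pdf_allowed(
    env=os.environ,
    attestation=Path("authorities/phi_free_pdfs.md"),
) -> bool:
    """Gate for the legacy raw-PDF API path: refuse unless the env flag is
    set to "1" AND the attestation file exists and is non-empty."""
    flag_ok = env.get("REPORTALIN_PDF_PHI_FREE") == "1"
    file_ok = attestation.is_file() and attestation.stat().st_size > 0
    return flag_ok and file_ok
```

Either condition failing alone is enough to refuse, so an operator cannot bypass the attestation with the flag or vice versa.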
## Verification
Use the command output for the commit under review as the source of truth. Current gates:
```shell
make verify
make test-all
make docs-quality
make security
```
The CI workflow runs Ruff, mypy, and the full pytest suite on Python 3.11, 3.12, and 3.13. The docs-quality workflow runs doc-freshness, Sphinx build, linkcheck, and documentation metrics.
## IRB Conformance
The active IRB/Auditor conformance profile lives in Conformance Evidence. It maps each claim to:
- the applicable authority,
- the disk artifact an auditor can inspect,
- and the pytest assertion that fails if the claim regresses.
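One way to picture a conformance entry is as a record with exactly those fields, plus a guard that flags incomplete entries. Everything below is invented for illustration: the entry values, file paths, test node ID, and helper name are not taken from the actual Conformance Evidence page.

```python
# Hypothetical shape of one conformance-evidence entry; the four keys mirror
# the mapping described above, the values are placeholders.
CONFORMANCE = [
    {
        "claim": "AMBER staging is created with mode 0700",
        "authority": "authorities/phi_handling.md",          # placeholder path
        "artifact": "tmp/{STUDY}/",
        "test": "tests/test_staging.py::test_staging_mode",  # placeholder node ID
    },
]


def incomplete_entries(entries):
    """Return the claims of entries missing any of the four required fields."""
    required = {"claim", "authority", "artifact", "test"}
    return [e.get("claim") for e in entries if not required <= e.keys()]
```

A check like this could itself run under pytest, so a claim added without its authority, artifact, or regression test fails CI immediately.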
The reviewer-facing PHI handling narrative lives in PHI Handling.
## Known Follow-Ups
These are documented gaps or operator-owned extensions; none require the agent to read raw PHI.
- Study-team breach-response runbook.
- Study-team data-retention and destruction runbook.
- Optional district-population mapping table when a site needs population-threshold geography generalization beyond the current drop catalog.
- Optional `config/consent_scope.yaml` for an IEC-approved field allowlist layered above the scrub catalog.
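Since the consent-scope file is still a follow-up, no schema exists yet; the fragment below is one hypothetical shape for it, and every key and value is illustrative only.

```yaml
# Hypothetical config/consent_scope.yaml -- no schema is defined yet.
iec_approval: IEC-0000-PLACEHOLDER   # approval reference, placeholder value
allowlist:                           # fields an IEC has approved for release
  - variable: age_band
    scope: analysis
  - variable: district
    scope: aggregate_only            # never surfaced at row level
```

Any real schema would be layered above the scrub catalog, so a field absent from this allowlist stays scrubbed regardless of what the catalog alone would permit.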