Project Status
==============

This page is the current implementation snapshot for maintainers. It is not a
changelog; historical PR numbers and patch labels belong in GitHub, not in the
operational docs.

Implemented
-----------

Pipeline
~~~~~~~~

* Single-study pipeline entry point in ``main.py``.
* Parallel extraction legs for dictionary, datasets, and PDFs.
* Supported tabular inputs are ``.xlsx`` and ``.csv`` only.
* AMBER staging under ``tmp/{STUDY}/`` with mode ``0700`` and secure deletion
  on successful completion.
* Step 1.6 PHI scrub over staged datasets before publish.
* Dataset cleanup and cleanup propagation into dictionary and PDF metadata.
* Atomic publish into ``output/{STUDY}/trio_bundle/``.
* ``variables.json`` build from the published trio bundle.
* Counts-only audit reports and lineage manifest under
  ``output/{STUDY}/audit/``.

PHI And Security Boundaries
~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Access to RED raw data is limited to extraction code.
* AMBER staging is never agent-readable.
* GREEN consists of ``output/{STUDY}/trio_bundle/`` plus
  ``output/{STUDY}/agent/``.
* GREEN-PROTECT is the agent tool boundary: PHI regex gate, k-anonymity, and
  l-diversity before row-level answers are surfaced.
* ``output/{STUDY}/audit/`` is a counts-only audit envelope and is rejected
  by the agent read validator.
* ``data/snapshots/{STUDY}/`` is a reviewed baseline restored when PDF
  extraction fails or **Use Existing Study** is selected; it is outside the
  agent read surface.
* API keys route through the in-memory KeyStore in the Streamlit flow.
* ``run_python_analysis`` executes generated code in a constrained subprocess
  and persists reproducibility artifacts under
  ``output/{STUDY}/agent/analysis/``.

AI Assistant
~~~~~~~~~~~~

* LangChain/LangGraph ReAct agent constructed through
  ``scripts/ai_assistant/agent_graph.py``.
* Twelve structured tools registered in
  ``scripts/ai_assistant/agent_tools.ALL_TOOLS``.
* CLI and Streamlit interfaces.
* Grounded-answer prompt contract: resolve variables before analysis, use
  deterministic tools for statistical claims, separate computed facts from
  interpretation, and surface caveats plainly.
* Provider support through OpenAI, Anthropic, Google Gemini, Ollama, and
  NVIDIA AI Endpoints.
* Ollama qwen3 downgrade ladder for local memory pressure.

PDF Extraction
~~~~~~~~~~~~~~

* Default wizard path uses the two-way PDF orchestrator: ``pdfplumber`` text
  extraction, PHI redaction before any LLM call, re-scrubbed LLM response,
  merge with the code candidate, and per-PDF fallback to
  ``data/snapshots/{STUDY}/pdfs/``.
* Legacy raw-PDF API path remains available for CLI compatibility, but is
  refused unless ``REPORTALIN_PDF_PHI_FREE=1`` and a non-empty
  ``authorities/phi_free_pdfs.md`` attestation are both present.

Verification
------------

Use the command output for the commit under review as the source of truth.
Current gates:

.. code-block:: bash

   make verify
   make test-all
   make docs-quality
   make security

The CI workflow runs Ruff, mypy, and the full pytest suite on Python 3.11,
3.12, and 3.13. The docs-quality workflow runs doc-freshness, Sphinx build,
linkcheck, and documentation metrics.

IRB Conformance
---------------

The active IRB/Auditor conformance profile lives in
:doc:`../irb_auditor/conformance`. It maps each claim to:

* the applicable authority,
* the disk artifact an auditor can inspect,
* and the pytest assertion that fails if the claim regresses.

The reviewer-facing PHI handling narrative lives in
:doc:`../irb_auditor/phi_handling`.

Known Follow-Ups
----------------

These are documented gaps or operator-owned extensions; none require the
agent to read raw PHI.

* Study-team breach-response runbook.
* Study-team data-retention and destruction runbook.
* Optional district-population mapping table when a site needs
  population-threshold geography generalization beyond the current drop
  catalog.
* Optional ``config/consent_scope.yaml`` for an IEC-approved field allowlist layered above the scrub catalog.
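As an illustration of the GREEN-PROTECT boundary described under
*PHI And Security Boundaries*, the k-anonymity leg of the gate can be
sketched as below. This is a minimal, hypothetical sketch, not the
project's actual validator: the function name ``passes_k_anonymity``
and the sample quasi-identifier fields are invented for this example,
and the real boundary additionally applies the PHI regex gate and
l-diversity before any row-level answer is surfaced.

.. code-block:: python

   from collections import Counter


   def passes_k_anonymity(rows, quasi_identifiers, k=5):
       """Simplified k-anonymity check (illustrative only).

       Every observed combination of quasi-identifier values must
       appear in at least ``k`` rows before row-level output may
       surface; an empty input is rejected outright.
       """
       combos = Counter(
           tuple(row[q] for q in quasi_identifiers) for row in rows
       )
       return bool(combos) and all(n >= k for n in combos.values())


   # Hypothetical quasi-identifiers; the real drop catalog is
   # project-specific and lives outside this sketch.
   sample = [{"age_band": "40-49", "site": "A"}] * 5
   assert passes_k_anonymity(sample, ["age_band", "site"], k=5)

The design point the sketch captures is that the gate evaluates the
*result set* handed back by a tool, not the underlying GREEN data, so
a query whose answer would isolate fewer than ``k`` individuals is
refused rather than generalized.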