# Operations
Operational runbook for running, rebuilding, and verifying the RePORT AI Portal pipeline. Deployment controls and release gates live in Production Readiness.
## Prerequisites

| Requirement | Check |
|---|---|
| Python 3.11+ | `python --version` reports 3.11 or newer |
| uv package manager | `uv --version` succeeds |
| Dependencies synced | `uv sync --all-groups` has been run |
| LLM provider configured | `LLM_PROVIDER` env var or `config/config.yaml` is set |
| Study data in place | `data/raw/{STUDY}/` has the required subdirectories |
## Pipeline Run

### Full Pipeline (Recommended)

```
make pipeline
```

Runs all steps in order: dictionary → dataset extraction → AMBER scrub
(eight-action catalog, rule + allowlist) → atomic publish into the
`trio_bundle/` GREEN zone → variables-reference build. PDFs are
processed in a separate leg (`make pdf-extract`); they are gated
behind `REPORTALIN_PDF_PHI_FREE=1` because annotated CRFs are
PHI-bearing by default.
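A minimal sketch of the gate, assuming only the `REPORTALIN_PDF_PHI_FREE` flag described above; the wrapper itself is hypothetical, not part of the Makefile:

```python
import os
import subprocess
import sys

# Hypothetical wrapper: run the PDF leg only when the operator has asserted
# that the annotated CRFs for this study are PHI-free.
if os.environ.get("REPORTALIN_PDF_PHI_FREE") != "1":
    sys.exit("Refusing PDF extraction: annotated CRFs are PHI-bearing by default.")

subprocess.run(["make", "pdf-extract"], check=True)
```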
### Individual Steps

| Command | What it does |
|---|---|
|  | Load data dictionary |
| `make pdf-extract` | PDF extraction to JSONL (gated by `REPORTALIN_PDF_PHI_FREE=1`) |
|  | Dataset extraction into AMBER staging, run through the eight-action PHI scrub, then atomically promoted into the GREEN `trio_bundle/` |
|  | Re-assemble the trio bundle from already-scrubbed staging artifacts |
| `make snapshot` | Save the reviewed snapshot baseline at `data/snapshots/{STUDY_NAME}/` |
| `make chat` | Launch the Streamlit research-assistant UI (with setup wizard) |
| `make chat-cli` | Launch the CLI research-assistant (interactive REPL) |
## Serve

> **Note:** The AI Assistant chat interface is available via `make chat-cli`
> (interactive CLI REPL) or `make chat` (Streamlit web UI). See
> **AI Assistant** below.

```
make chat-cli   # CLI Research Assistant (interactive REPL)
make chat       # Streamlit Research Assistant UI (with setup wizard)
```
## Quickstart

```
make quickstart   # sync → pipeline
```
## Artifact Rebuild

When schemas, the data dictionary, or the eight-action PHI scrub catalog
(`scripts/security/phi_scrub.yaml`) change:

```
# Full rebuild
make nuke && make quickstart
```
## Cleanup

```
make clean   # Caches, sessions, stale logs (safe)
make nuke    # Full reset: venv, output, indexes (confirmation required)
```

Retention rules:

- `tests/fixtures/` — never touched by cleanup
- `data/raw/` — manual cleanup only (source of truth)
## Security Verification

### Dataset Promotion Protocol

After `make pipeline` produces clean JSONL:

1. Manually inspect `output/{STUDY}/trio_bundle/` for any unexpected residual content.
2. Spot-check dataset records: `head -20 output/{STUDY}/trio_bundle/datasets/*.jsonl`
3. Step 1.6 of the pipeline scrubs staged datasets in place via the 8-action honest-broker catalog (`scripts.security.phi_scrub`). If residual content is found in `trio_bundle/`, inspect `output/{STUDY}/audit/phi_scrub_report.json` for the action counts, add the offending field pattern to `scripts/security/phi_scrub.yaml` under the appropriate section (`drop_fields`, `cap_fields`, etc.), and re-run `make pipeline`.
4. Cross-check `output/{STUDY}/audit/lineage_manifest.json` — every raw input SHA-256 should have a corresponding trio artifact entry (see the sketch below).
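A minimal cross-check sketch, assuming the manifest lists raw inputs with `sha256` and `trio_artifacts` fields; the field names and the `Indo-VAP` study name are assumptions, so adjust to the actual manifest schema:

```python
import json
from pathlib import Path

STUDY = "Indo-VAP"  # example study name
manifest_path = Path(f"output/{STUDY}/audit/lineage_manifest.json")
manifest = json.loads(manifest_path.read_text())

# Assumed shape: {"raw_inputs": [{"sha256": "...", "trio_artifacts": [...]}, ...]}
missing = [
    entry["sha256"]
    for entry in manifest.get("raw_inputs", [])
    if not entry.get("trio_artifacts")
]
print("raw inputs without a trio artifact entry:", missing or "none")
```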
### Zone Enforcement

Verify zone guards are active:

```
uv run pytest tests/security/test_zone_guard.py -v
```
## Quality Checks

```
make test            # deterministic pytest subset
make test-all        # full pytest suite
make lint            # ruff
make typecheck       # mypy
make ci              # All quality checks
make doc-freshness   # Catch stale prose (vector-DB / "only zone" / wrong tool count / …)
make docs            # Build Sphinx HTML
uv run pip-audit     # Dependency security audit
```
## Debug and Troubleshooting

```
make debug   # Pipeline + serve with DEBUG logging
```

Common issues:

- **No LLM configured**: set the `LLM_PROVIDER` env var or `config/config.yaml`
- **Missing study data**: ensure `data/raw/{STUDY}/` has the required subdirectories
- **Dependency issues**: `uv lock --upgrade && uv sync --all-groups`
- **Stale artifacts**: `make nuke && make quickstart`
## Known Limitations

- Single-study mode only — one study directory under `data/raw/`
- LLM provider must be explicitly configured (no default)
## AI Assistant

The AI Assistant uses LangGraph with a ReAct agent pattern. It reads from the trio bundle and provides study-specific answers grounded in the data dictionary, PDF extractions, and dataset metadata.

```
make chat-cli   # CLI interactive REPL (or `python main.py --chat` directly)
make chat       # Streamlit web UI (or `python main.py --web` directly)
```
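A minimal sketch of the pattern, assuming LangGraph's prebuilt ReAct helper; the tool, model string, and study path are illustrative, not the portal's actual wiring:

```python
from pathlib import Path
from langgraph.prebuilt import create_react_agent

TRIO_BUNDLE = Path("output/Indo-VAP/trio_bundle")  # example study

def read_trio_file(relative_path: str) -> str:
    """Read a file from the GREEN-zone trio bundle; refuse anything outside it."""
    target = (TRIO_BUNDLE / relative_path).resolve()
    if not target.is_relative_to(TRIO_BUNDLE.resolve()):
        raise ValueError("read outside the trio bundle is not allowed")
    return target.read_text()

agent = create_react_agent("openai:gpt-4o-mini", tools=[read_trio_file])
reply = agent.invoke({"messages": [("user", "Which datasets record sputum collection?")]})
print(reply["messages"][-1].content)
```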
## Trio-Bundle Snapshot Maintenance

A snapshot baseline is the single cleaned-and-verified trio bundle
saved under `data/snapshots/{STUDY_NAME}/` after human review. It
mirrors the layout of the live `output/{STUDY}/trio_bundle/`:

```
data/snapshots/
└── {STUDY_NAME}/        # e.g. data/snapshots/Indo-VAP/ — must match
    ├── datasets/        #      config.STUDY_NAME exactly
    ├── dictionary/
    ├── pdfs/
    └── variables.json
```
**Purpose.** The reviewed baseline is the deterministic fallback source for the portal:

- **PDF orchestrator fallback.** When the wizard’s “Load Study” runs and the PDF orchestrator’s LLM tier is unavailable for a particular PDF (no API key, image-only PDF, capability gate fails, LLM call errors), the orchestrator reads `data/snapshots/{STUDY}/pdfs/{stem}_variables.json` instead of publishing a code-only heuristic guess (sketched below).
- **Failed or skipped PDF leg.** If the PDF extraction leg fails, is skipped, or creates no files during a full pipeline run, the pipeline restores `data/snapshots/{STUDY}/` over the live `output/{STUDY}/trio_bundle/`.
- **Use Existing Study.** The setup wizard’s “Use Existing Study” button restores the same reviewed baseline over the live trio bundle before enabling chat. The rest of `output/{STUDY}/` is left in place.
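A minimal sketch of that fallback order; the function and variable names are illustrative, not the orchestrator's actual API:

```python
import json
from pathlib import Path

SNAPSHOT_ROOT = Path("data/snapshots")  # config.STUDY_SNAPSHOTS_DIR in the real code

def pdf_variables_with_fallback(study: str, pdf_stem: str, llm_extract):
    """Prefer the LLM tier; fall back to the reviewed snapshot, never a heuristic guess."""
    try:
        return llm_extract(pdf_stem)  # missing API key, image-only PDF, or gate failure raises
    except Exception:
        fallback = SNAPSHOT_ROOT / study / "pdfs" / f"{pdf_stem}_variables.json"
        return json.loads(fallback.read_text())
```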
**Read posture.** The LLM agent must NOT read this directory. The
agent’s read zone is restricted to `output/{STUDY}/trio_bundle/`
and `output/{STUDY}/agent/` only (see
`scripts.ai_assistant.file_access.validate_agent_read()`).
Putting snapshots outside both zones is intentional — a stale
snapshot must never be served as live data.

The wizard and pipeline subprocess are the only legitimate readers.
Both use `config.STUDY_SNAPSHOTS_DIR` as the snapshot lookup root.
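A minimal sketch of that read-zone check, assuming `validate_agent_read()` takes a candidate path plus the study output root and raises on anything outside the two allowed zones; the actual signature may differ:

```python
from pathlib import Path

# Illustrative only: the real implementation lives in
# scripts.ai_assistant.file_access.validate_agent_read().
ALLOWED_ZONES = ("trio_bundle", "agent")

def validate_agent_read(path: str, study_output: Path) -> Path:
    """Allow reads only inside output/{STUDY}/trio_bundle/ or output/{STUDY}/agent/."""
    resolved = Path(path).resolve()
    for zone in ALLOWED_ZONES:
        zone_root = (study_output / zone).resolve()
        if resolved.is_relative_to(zone_root):
            return resolved
    # data/snapshots/ deliberately falls through to here and is rejected.
    raise PermissionError(f"agent read outside allowed zones: {resolved}")
```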
**Maintenance protocol.**

- **Snapshots are PHI-scrubbed.** Only files that have been through the full `phi_scrub` + `kanon_gate` chain belong here. Adding raw subject IDs or unscrubbed dates to a snapshot would defeat the entire purpose.
- **Update by promoting from a verified production run.** A maintainer runs `make snapshot` after manual review and, when the baseline is committed, cites the `lineage_manifest.json` hash in the commit message as the audit trail (see the sketch below).
- **Do not generate snapshots from `--force` runs without manual review.** The whole value of a snapshot is the human verification step.
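One way to produce the hash cited in the commit message, assuming the convention is to hash the manifest file itself (if a specific digest inside the manifest is cited instead, use that field):

```python
import hashlib
from pathlib import Path

manifest = Path("output/Indo-VAP/audit/lineage_manifest.json")  # example study
digest = hashlib.sha256(manifest.read_bytes()).hexdigest()
print(f"snapshot baseline promoted; lineage_manifest.json sha256: {digest}")
```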
The repo’s `.gitignore` allows `data/snapshots/` to be committed.
Files under this directory are study-team-reviewed artifacts and are
not agent-owned state.
Lifecycle:

```
make snapshot           # save current trio bundle as the reviewed baseline
make snapshot FORCE=1   # overwrite the reviewed baseline after review
make list-snapshots     # show whether the reviewed baseline exists
make restore-study      # restore the reviewed baseline into trio_bundle/
```

Bumping the baseline is a maintainer action with audit-trail implications. Restore points are intentionally not part of the architecture.