Architecture
============

This page describes the active technical architecture of the RePORT AI Portal
runtime. For the PHI-handling deep dive see :doc:`phi_architecture`; for the
decisions behind the choices see :doc:`decisions`; for the operator's view see
:doc:`../user_guide/data_pipeline`.

System Overview
---------------

RePORT AI Portal is a **two-world** system. The two worlds run in separate
processes and never share mutable state.

**World 1 — Deterministic Pipeline** (``main.py`` + ``scripts/extraction/`` +
``scripts/security/`` + ``scripts/utils/``). Reads raw clinical data from
``data/raw/{STUDY}/``, runs three extraction legs in parallel, scrubs PHI from
the dataset leg, mirrors dataset drops into the dictionary + PDF legs,
atomically publishes per-leg into ``output/{STUDY}/trio_bundle/``, builds a
consolidated ``variables.json``, and emits a lineage manifest.
``main.py --pipeline`` is the canonical entry point; ``make pipeline`` is the
Makefile alias; the wizard's "Load Study" button spawns this as a subprocess.

**World 2 — AI Assistant** (``scripts/ai_assistant/``). A LangGraph ReAct
agent with 12 tools that reads the published trio bundle and answers
researcher queries. Provider-agnostic via ``init_chat_model``; runs against
Anthropic / OpenAI / Google / NVIDIA / Ollama. Never accesses raw data. Every
tool return passes three independent gates (PHI regex catalog, k=5 anonymity,
l=2 diversity).

The two worlds communicate through the ``output/{STUDY}/`` tree only:

.. code-block:: text

   World 1 writes:                       World 2 reads:
   - trio_bundle/  (sanitised data)  →   - trio_bundle/  (LLM data surface)
   - audit/        (counts only)         (LLM hard-rejected for audit/)
   - agent/        (state subdirs)   →   - agent/        (LLM session memory)

The Streamlit wizard is the operator's entry point; it routes API keys through
the in-memory KeyStore, spawns the pipeline subprocess on demand, and then
hands off to the agent for chat.

The Five-Tier Zone Model
------------------------

See :doc:`phi_architecture` for the full discussion. Briefly:

.. list-table::
   :header-rows: 1
   :widths: 16 28 56

   * - Tier
     - Path
     - Posture
   * - **RED**
     - ``data/raw/{STUDY}/``
     - Raw, presumed PHI. Read by extraction subprocess only.
   * - **AMBER**
     - ``tmp/{STUDY}/``
     - Per-run scratch. Mode 0700. Securely deleted on success.
   * - **GREEN**
     - ``output/{STUDY}/{trio_bundle,agent}/``
     - LLM read zone. PHI-free.
   * - **GREEN-PROTECT**
     - Agent tool boundary
     - PHI regex + k-anonymity + l-diversity gates before answers.
   * - **AUDIT**
     - ``output/{STUDY}/audit/``
     - Counts-only IRB evidence. LLM-rejected.
   * - *out-of-zone*
     - ``data/snapshots/{STUDY}/``
     - Human-reviewed baseline (LLM-invisible). Restored over
       ``trio_bundle/`` when fresh PDF extraction fails or when **Use
       Existing Study** is selected.

Core Components
---------------

Configuration System
~~~~~~~~~~~~~~~~~~~~

:mod:`config` resolves every path and most behaviour flags from env vars plus
a YAML overlay (``config/config.yaml``). It is read by every module that
needs a path or knob. See :doc:`../user_guide/configuration` for the full
env-var table; key constants (their resolution is sketched after this list):

* ``STUDY_NAME`` — e.g. ``Indo-VAP``. Pins the single study.
* ``BASE_DIR`` — repo root, used as the anchor for all path derivation.
* ``TRIO_BUNDLE_DIR`` — the canonical published-bundle path.
* ``STUDY_SNAPSHOTS_DIR`` — the reviewed baseline path
  (``data/snapshots/{STUDY}/``).
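The exact resolution order is an implementation detail of :mod:`config`; the
sketch below only illustrates the documented shape (env vars plus the
``config/config.yaml`` overlay) under the assumption that environment
variables take precedence over the overlay. The helper names (``_overlay``,
``_resolve``), the lower-cased YAML keys, and the ``BASE_DIR`` fallback are
illustrative, not the shipped code.

.. code-block:: python

   # Hedged sketch of :mod:`config` resolution. Only STUDY_NAME, BASE_DIR,
   # TRIO_BUNDLE_DIR, and STUDY_SNAPSHOTS_DIR are documented constants; the
   # precedence (env var > YAML overlay > default) is an assumption.
   import os
   from pathlib import Path

   import yaml

   BASE_DIR = Path(os.environ.get("BASE_DIR", ".")).resolve()  # repo-root anchor

   def _overlay() -> dict:
       """Load config/config.yaml if present; a missing overlay is not an error."""
       path = BASE_DIR / "config" / "config.yaml"
       if not path.exists():
           return {}
       return yaml.safe_load(path.read_text()) or {}

   _CFG = _overlay()

   def _resolve(name: str, default: str) -> str:
       """Assumed precedence: env var, then YAML overlay (lower-cased key),
       then the built-in default."""
       return os.environ.get(name) or _CFG.get(name.lower()) or default

   STUDY_NAME = _resolve("STUDY_NAME", "Indo-VAP")   # pins the single study
   TRIO_BUNDLE_DIR = BASE_DIR / "output" / STUDY_NAME / "trio_bundle"
   STUDY_SNAPSHOTS_DIR = BASE_DIR / "data" / "snapshots" / STUDY_NAME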
Logging System
~~~~~~~~~~~~~~

:mod:`scripts.utils.logging_system` — root-logger setup with a verbose "tree"
mode for ``--verbose`` runs. :mod:`scripts.utils.log_hygiene` — a
``logging.Filter`` that scrubs API keys + PHI patterns from every log line.
The hygiene filters attach to the root logger, so World 1 and World 2 emit
scrubbed logs by default. ``VerboseLogger`` keeps indentation in thread-local
state, so ``--verbose`` tree output remains readable while the dictionary,
dataset, and PDF extraction legs run in parallel.

Pipeline Modules
----------------

The pipeline is structured as a sequence of step functions in ``main.py``,
each importing its operative module from ``scripts/extraction/``,
``scripts/security/``, or ``scripts/utils/``. Steps 0/1/1.5 run in parallel;
the cleanup chain (1.6/1.7/1.8) and Steps 2/3/4/5 are sequential.

Dictionary Loader
~~~~~~~~~~~~~~~~~

* **Module:** :func:`scripts.extraction.load_dictionary.load_study_dictionary`
* **Step:** Step 0
* **Reads:** ``data/raw/{STUDY}/data_dictionary/*.{xlsx,csv}``
* **Writes:** ``tmp/{STUDY}/dictionary/*.json``
* **PHI posture:** Carries no PHI by design (the dictionary defines
  variables, not subject values).

Dataset Extraction
~~~~~~~~~~~~~~~~~~

* **Module:** :func:`scripts.extraction.dataset_pipeline.process_datasets`
* **Step:** Step 1 (extract)
* **Reads:** ``data/raw/{STUDY}/datasets/*.{xlsx,csv}``
* **Writes:** ``tmp/{STUDY}/datasets/*.jsonl``
* **PHI posture:** Records carry full PHI here. Every record gets a
  ``_provenance`` dict (raw_sha256, pipeline_version, extraction_engine,
  source_file, sheet_name, row_index, study_name, extraction_utc).
* **Skip semantics:** Hash-based step cache at
  ``output/{STUDY}/audit/manifests/dataset_processing.json``.

PDF Extraction
~~~~~~~~~~~~~~

Two co-existing paths:

* **Orchestrator path** (default): :mod:`scripts.extraction.pdf_pipeline`.
  ``pdfplumber`` code path + redacted-text LLM merge + per-PDF fallback to
  ``data/snapshots/{STUDY}/pdfs/``. If the PDF leg fails, the full reviewed
  snapshot baseline is restored over ``trio_bundle/``. **No raw PDF bytes
  leave the host.** See :doc:`data_extraction_pdfs` for the per-step
  pipeline.
* **Legacy raw-PDF API path:** :mod:`scripts.extraction.extract_pdf_data`.
  Refused unless the operator opts in twice (``REPORTALIN_PDF_PHI_FREE=1``
  env flag + non-empty ``authorities/phi_free_pdfs.md`` attestation note).

Dispatch happens in
:func:`scripts.extraction.extract_pdf_data.extract_pdfs_to_jsonl` based on
``REPORTALIN_PDF_EXTRACTION_MODE``. The wizard always sets ``llm``
(orchestrator); the CLI default is unset (legacy gate).

PHI Scrub
~~~~~~~~~

* **Module:** :func:`scripts.security.phi_scrub.run_scrub`
* **Step:** Step 1.6 (BEFORE Step 1.7 cleanup, so no raw PHI reaches the
  audit envelope)
* **Reads/writes:** ``tmp/{STUDY}/datasets/*.jsonl`` in place
* **Audit:** ``output/{STUDY}/audit/phi_scrub_report.json`` (counts-only)
* **Eight action classes:** keep / birthdate / drop / cap / generalize /
  suppress_small_cell / date_jitter / hmac_pseudonymize. Configured in
  ``scripts/security/phi_scrub.yaml`` (~200 Indo-VAP-calibrated rules).
* **HMAC key:** ``~/.config/report_ai_portal/phi_key`` (mode 0600, outside
  the repo). The two keyed actions are sketched below.
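Why re-runs reproduce identical pseudonyms and offsets: both keyed actions
derive everything from an HMAC over the raw value, so the same key yields the
same output. A minimal sketch — the jitter window, pseudonym format, and
helper names are assumptions; the real rules live in
``scripts/security/phi_scrub.yaml`` and
:func:`scripts.security.phi_scrub.run_scrub`.

.. code-block:: python

   # Hedged sketch of the keyed PHI-scrub actions hmac_pseudonymize and
   # date_jitter. Window size and label prefix are illustrative.
   import hashlib
   import hmac
   from datetime import date, timedelta
   from pathlib import Path

   KEY_PATH = Path.home() / ".config" / "report_ai_portal" / "phi_key"  # mode 0600

   def _digest(key: bytes, value: str) -> bytes:
       """HMAC-SHA-256 over the raw value — deterministic given the key."""
       return hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()

   def hmac_pseudonymize(key: bytes, subject_id: str) -> str:
       """Same key + same ID -> same pseudonym on every re-run."""
       return "SUBJ-" + _digest(key, subject_id).hex()[:12].upper()

   def date_jitter(key: bytes, subject_id: str, d: date, window: int = 14) -> date:
       """Deterministic per-subject offset in [-window, +window] days, so
       intervals *within* one subject's record are preserved."""
       offset = int.from_bytes(_digest(key, subject_id)[:4], "big") % (2 * window + 1)
       return d + timedelta(days=offset - window)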
Dataset Cleanup
~~~~~~~~~~~~~~~

* **Module:** :func:`scripts.extraction.dataset_cleanup.clean_trio_datasets`
* **Step:** Step 1.7
* **Reads/writes:** ``tmp/{STUDY}/datasets/*.jsonl`` in place
* **Audit:** ``output/{STUDY}/audit/dataset_cleanup_report.json``
* Removes junk rows, merges duplicate records, and propagates Step 1's drop
  events into the cleanup record.

Cleanup Propagation
~~~~~~~~~~~~~~~~~~~

* **Module:** :func:`scripts.extraction.cleanup_propagation.run_propagation`
* **Step:** Step 1.8
* **Reads/writes:** ``tmp/{STUDY}/{dictionary,pdfs}/`` in place
* **Audit:** ``output/{STUDY}/audit/{dictionary,pdfs}_cleanup_report.json``
* Computes the propagation drop-set from the dataset audit and prunes
  matching rows/keys from the staged dictionary + PDF trees. Keeps the
  published trio bundle internally consistent.

Publish
~~~~~~~

* **Function:** ``_publish_staging`` in ``main.py``
* **Step:** Step 2
* **Atomic per-leg rename** ``tmp/{STUDY}/{leg}/`` →
  ``output/{STUDY}/trio_bundle/{leg}/``. A same-filesystem rename is a single
  inode swap; cross-filesystem (e.g. tmpfs staging + disk output) falls back
  to ``shutil.copytree`` + ``shutil.rmtree`` (sketched below).
* **Zone guard:** ``assert_output_zone(trio_dir)`` runs before the rename.
* **Pre-publish:** if the destination exists, ``secure_remove_tree``
  (zero-fill + fsync + unlink) so old bytes aren't recoverable.
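The publish semantics reduce to a few lines of Python. This is a hedged
sketch of the behaviour described above, not the ``_publish_staging`` source;
the zone-guard and secure-removal calls are stubbed with comments.

.. code-block:: python

   # Sketch of per-leg publish: guard, securely clear, rename, fall back.
   import os
   import shutil
   from pathlib import Path

   def publish_leg(staged: Path, published: Path) -> None:
       # Zone guard first: never publish outside output/{STUDY}/trio_bundle/.
       # assert_output_zone(published)        # scripts.security.secure_env
       if published.exists():
           # secure_remove_tree(published)    # zero-fill + fsync + unlink
           shutil.rmtree(published)           # plain removal, for the sketch
       published.parent.mkdir(parents=True, exist_ok=True)
       try:
           os.rename(staged, published)       # same-FS: single inode swap
       except OSError:                        # e.g. EXDEV across filesystems
           shutil.copytree(staged, published) # tmpfs staging -> disk output
           shutil.rmtree(staged)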
Variables Reference Builder
~~~~~~~~~~~~~~~~~~~~~~~~~~~

* **Module:** :func:`scripts.extraction.build_variables_reference.build_variables_reference`
* **Step:** Step 3 (AFTER publish; reads the populated ``trio_bundle/``, not
  staging)
* **Output:** ``output/{STUDY}/trio_bundle/variables.json`` — the
  consolidated variable schema the agent uses to validate variable names in
  queries.

Lineage Manifest
~~~~~~~~~~~~~~~~

* **Module:** :func:`scripts.utils.lineage.emit_lineage_manifest`
* **Step:** Step 4
* **Output:** ``output/{STUDY}/audit/lineage_manifest.json`` — pairs every
  raw input SHA-256 with every published trio artifact SHA-256, plus PHI-key
  fingerprint, compliance posture, and pipeline version. **The single
  artifact an IRB reviewer reads to verify the entire raw → scrub → publish
  chain.**

Output Signpost + AMBER cleanup
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* **Step:** Step 5
* Re-emits ``output/{STUDY}/README.md`` describing the layout.
* On success, ``secure_remove_tree`` over ``tmp/{STUDY}/``. On failure, AMBER
  is preserved for forensic inspection.

Supporting Services
-------------------

AI Assistant Agent Layer
~~~~~~~~~~~~~~~~~~~~~~~~

* :mod:`scripts.ai_assistant.agent_graph` — LangGraph ReAct agent; the only
  module that constructs an LLM client. Provider keys flow in via the
  explicit ``api_key=`` kwarg, sourced from the KeyStore (no ``os.environ``
  lookup).
* :mod:`scripts.ai_assistant.agent_tools` — 12 ``@tool``-decorated functions.
  ``ALL_TOOLS`` is the canonical list; the doc-freshness lint ties prose docs
  to this list.
* :mod:`scripts.ai_assistant.agent_prompts` — system prompt with a
  CONVERSATIONAL WORLD section that tells the LLM to answer greetings without
  tool calls.
* :mod:`scripts.ai_assistant.phi_safe` — agent-side PHI helpers:
  ``phi_safe_return``, ``guard_text``, ``guard_user_prompt``,
  ``sanitise_untrusted_snippet``, ``redact_phi_in_text``,
  ``sanitise_traceback``.
* :mod:`scripts.ai_assistant.file_access` — agent-runtime path validator (the
  canonical chokepoint for every tool's file I/O).
* :mod:`scripts.ai_assistant.keystore` — in-memory API-key registry.
* :mod:`scripts.ai_assistant.tool_cache` — per-tool memoisation.

Analytical Engine
~~~~~~~~~~~~~~~~~

:mod:`scripts.ai_assistant.analytical_engine` — deterministic epidemiology
helpers (logistic regression, survival, descriptive stats) called from the
``run_python_analysis`` tool. Pre-loaded DataFrames come from
``config.TRIO_DATASETS_DIR`` only (GREEN zone).

Subprocess Sandbox
~~~~~~~~~~~~~~~~~~

* :mod:`scripts.ai_assistant.sandbox.replicate` — public API
  (``run_in_subprocess``).
* :mod:`scripts.ai_assistant.sandbox.runner` — child-process entry point;
  carries the AST + import + dunder + builtin guards.
* :mod:`scripts.ai_assistant.sandbox.limits` — cross-platform rlimits.

Generated ``.py`` files are persisted to
``output/{STUDY}/agent/analysis/{ts}.py``. See :doc:`sandbox` for the full
layered story.

File-Access Validator
~~~~~~~~~~~~~~~~~~~~~

:mod:`scripts.ai_assistant.file_access` — the unified chokepoint that every
agent tool calls before any file I/O. Resolves with ``os.path.realpath`` and
verifies containment with ``os.path.commonpath``. Reads accept
``trio_bundle/`` ∪ ``agent/`` (plus ``config/study_knowledge.yaml`` via an
explicit allowlist). Writes accept ``agent/`` only. Sandbox writes narrow
further to ``agent/analysis/``. Audit, telemetry, staging, raw, and the
snapshot baseline are hard-rejected.
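The containment check reduces to ``realpath`` + ``commonpath``. A minimal
sketch in the spirit of ``validate_agent_read`` — the allowlist wiring,
function signature, and error message are simplified, not the shipped API.

.. code-block:: python

   # Sketch: resolve symlinks/.. first, then prove the resolved path sits
   # under an allowed root (or on the explicit file allowlist).
   import os

   class ZoneViolationError(PermissionError):
       """Raised on any read/write outside the agent's zone."""

   def validate_read(candidate: str, read_roots: list[str],
                     allowlist: frozenset[str] = frozenset()) -> str:
       real = os.path.realpath(candidate)     # collapses symlinks and ../
       if real in allowlist:                  # e.g. config/study_knowledge.yaml
           return real
       for root in read_roots:                # trio_bundle/ and agent/ only
           root_real = os.path.realpath(root)
           if os.path.commonpath([root_real, real]) == root_real:
               return real
       raise ZoneViolationError(f"read outside agent zone: {candidate}")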
Telemetry
~~~~~~~~~

:mod:`scripts.utils.telemetry` — agent event logger, attached as a LangChain
callback. Lands in ``output/{STUDY}/audit/telemetry/events.jsonl``
(LLM-rejected via ``validate_agent_read``). Non-string event payloads are
force-stringified + masked before write.

Web UI
~~~~~~

* :mod:`scripts.ai_assistant.web_ui` — Streamlit entry.
* :mod:`scripts.ai_assistant.ui.wizard` — three-step setup flow. Step 1 = LLM
  config (KeyStore routing). Step 2 = Data load (two-button: Use Existing
  Study + Load Study). Step 3 = Confirm + start chat.
* :mod:`scripts.ai_assistant.ui.chat` — chat surface.
* :mod:`scripts.ai_assistant.ui.streaming` — token stream + error expander
  (with traceback sanitiser).
* :mod:`scripts.ai_assistant.ui.conversations` — at-rest conversation
  persistence with PHI redaction.
* :mod:`scripts.ai_assistant.ui.providers` — provider catalog (Anthropic,
  OpenAI, Google, Ollama, NVIDIA).
* :mod:`scripts.ai_assistant.ui.model_policy` — capability-floor enforcement
  on UI selection.

Data Flow
---------

End-to-End Runtime Flow
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: text

   data/raw/{STUDY_NAME}/data_dictionary/ ──┐
                                            ├──→ load_dictionary  ────┐
   data/raw/{STUDY_NAME}/datasets/ ─────────┼──→ dataset_pipeline ────┤ Phase 1 PARALLEL
                                            │                         │ (3-worker pool;
   data/raw/{STUDY_NAME}/annotated_pdfs/ ───┴──→ pdf_pipeline ────────┘ join → cleanup)
                                                 (orchestrator: pdfplumber code
                                                  path + redacted-text LLM merge
                                                  + snapshot fallback at
                                                  data/snapshots/{STUDY}/pdfs/)
                 │  (all legs → staging)
                 ▼
   tmp/{STUDY_NAME}/{datasets,dictionary,pdfs}/
                 │  phi_scrub.run_scrub  (Step 1.6 — date jitter + ID
                 │                        pseudonymization on staged datasets;
                 │                        emits phi_scrub_report.json)
                 │  dataset_cleanup      (emits dataset audit)
                 │  cleanup_propagation  (prunes dict+pdf in staging,
                 │                        emits dict+pdf audits)
                 │  _publish_staging     (atomic per-leg rename)
                 ▼
   output/{STUDY_NAME}/trio_bundle/{datasets, dictionary, pdfs, variables.json}
                 │  build_variables_reference (Step 3 — reads the published
                 │                             trio bundle)
                 │  emit_lineage_manifest (Step 4 — raw SHA-256 ↔ trio SHA-256
                 │                         + PHI-key fingerprint)
                 ▼
   output/{STUDY_NAME}/audit/lineage_manifest.json
                 │  _emit_output_signpost (Step 5)
                 │  _cleanup_staging (success only — secure_remove_tree)
                 ▼
   World 2: AI Assistant reads trio_bundle/ + agent/ only

Source tree
~~~~~~~~~~~

Expected source tree:

.. code-block:: text

   data/raw/{STUDY_NAME}/
   ├── datasets/
   ├── annotated_pdfs/
   └── data_dictionary/

   data/snapshots/{STUDY_NAME}/   # reviewed baseline
   ├── datasets/                  # cleaned + verified,
   ├── dictionary/                # LLM-INVISIBLE; restored over live trio
   ├── pdfs/                      # ``pdfs/{stem}_variables.json`` as
   └── variables.json             # per-PDF fallback

Expected processed tree:

.. code-block:: text

   output/{STUDY_NAME}/
   ├── trio_bundle/                    # GREEN — LLM read zone
   │   ├── datasets/*.jsonl            # PHI-scrubbed
   │   ├── dictionary/*.json
   │   ├── pdfs/*_variables.json       # tier: merged | snapshot | empty
   │   └── variables.json              # consolidated schema
   ├── audit/                          # AUDIT — counts only; LLM hard-rejected
   │   ├── lineage_manifest.json
   │   ├── phi_scrub_report.json
   │   ├── dataset_cleanup_report.json
   │   ├── dictionary_cleanup_report.json
   │   ├── pdfs_cleanup_report.json
   │   └── telemetry/
   │       └── events.jsonl
   └── agent/                          # analysis / conversations

Transient staging root (not a durable artifact):

.. code-block:: text

   tmp/{STUDY_NAME}/
   ├── datasets/
   ├── dictionary/
   ├── pdfs/
   └── .pdf_cache/                     # idempotent LLM-response cache

Security Boundaries
-------------------

Zone Enforcement
~~~~~~~~~~~~~~~~

Two complementary chokepoints:

* **Pipeline-side directory guards** — :mod:`scripts.security.secure_env`.
  Functions: ``assert_not_raw``, ``assert_output_zone``,
  ``assert_write_zone``, ``assert_trio_bundle_zone``. Used at pipeline
  boundaries.
* **Agent-runtime path validator** — :mod:`scripts.ai_assistant.file_access`.
  Functions: ``validate_agent_read``, ``validate_agent_write``,
  ``validate_sandbox_write``, ``is_agent_readable``. Used by every agent tool
  before any file I/O.

Both raise ``ZoneViolationError`` (a ``PermissionError`` subclass) on any
zone violation. The agent's read zone is strictly ``trio_bundle/`` +
``agent/`` (plus the ``config/study_knowledge.yaml`` allowlist); audit,
telemetry, staging, raw, and the snapshot baseline are hard-rejected.

Three independent gates on every tool return
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See :doc:`phi_architecture`. PHI regex catalog + k=5 anonymity + l=2
diversity.
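Operationally, k=5 / l=2 mean: every quasi-identifier group in a tool result
must contain at least five records and at least two distinct sensitive
values, or the result is suppressed. A hedged sketch of that check — the
column names and function shape are hypothetical; the shipped gates sit
behind ``phi_safe_return`` in :mod:`scripts.ai_assistant.phi_safe`.

.. code-block:: python

   # Sketch of the k-anonymity + l-diversity thresholds over result rows.
   from collections import defaultdict

   K, L = 5, 2  # floors named in the section above

   def passes_gates(rows: list[dict], quasi_ids: list[str], sensitive: str) -> bool:
       groups: defaultdict[tuple, list] = defaultdict(list)
       for row in rows:
           # Group rows by their quasi-identifier combination.
           groups[tuple(row[q] for q in quasi_ids)].append(row[sensitive])
       for values in groups.values():
           if len(values) < K:          # group smaller than k -> suppress
               return False
           if len(set(values)) < L:     # fewer than l distinct sensitive values
               return False
       return True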
API keys never in os.environ
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ADR-011. The wizard routes pasted keys into the in-memory ``KeyStore``;
``*_API_KEY`` env vars are scrubbed from ``os.environ``. Keys are re-injected
only into the short-lived pipeline subprocess via
``KeyStore.env_for_subprocess``.

Subprocess sandbox for ``run_python_analysis``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ADR-010. ``RLIMIT_AS`` / ``RLIMIT_NPROC`` / ``RLIMIT_CPU`` clamps +
sanitised env + read-only ``trio_bundle/`` + AST guards inside the child.
See :doc:`sandbox`.

Design Principles
-----------------

Modularity
~~~~~~~~~~

Each pipeline step is a function in ``main.py`` that imports its operative
module from ``scripts/``. The step plus its module are the unit of audit: you
can verify Step 1.6 by reading :func:`scripts.security.phi_scrub.run_scrub`
and ``scripts/security/phi_scrub.yaml`` without needing to read any other
code.

Determinism where possible
~~~~~~~~~~~~~~~~~~~~~~~~~~

* HMAC date jitter is deterministic given the key, so re-runs produce
  identical pseudonyms + offsets.
* The step cache uses SHA-256 hashes of inputs, so re-runs skip unchanged
  steps.
* The lineage manifest + per-row provenance dict make every published byte
  traceable to a raw input hash + pipeline version.

Security-first boundaries
~~~~~~~~~~~~~~~~~~~~~~~~~

* Two zone-guard chokepoints (pipeline + agent).
* Three agent-output gates (PHI / k-anon / l-diversity).
* KeyStore + subprocess sandbox + log redactor as orthogonal defenses.
* Single reviewed snapshot baseline prevents an incomplete PDF run from
  becoming live data while keeping the baseline outside the LLM read zone.

Out-of-scope (explicit non-goals)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These are not part of the active local architecture contract described here:

* upload-driven multi-study workflows
* HPC / Slurm deployment surfaces
* distributed processing claims
* historical phase-based roadmap promises

If a feature in the codebase contradicts the architecture described on this
page, the **page is the source of truth** for current behaviour; the feature
should either be reconciled or marked as out-of-scope above. New ADRs in
:doc:`decisions` capture genuine architectural shifts.

See Also
--------

* :doc:`phi_architecture` — full PHI handling story.
* :doc:`decisions` — ADRs for the security, PDF, snapshot, and agent-boundary
  decisions.
* :doc:`sandbox` — subprocess sandbox.
* :doc:`data_extraction_pdfs` — PDF orchestrator deep dive.
* :doc:`operations` — operational playbook + snapshot-baseline maintenance.
* :doc:`agents` — instructions for AI coding assistants.