Architecture
This page describes the active technical architecture of the RePORT AI Portal runtime. For the PHI-handling deep dive see PHI Architecture; for the rationale behind the choices see Architecture Decisions (ADRs); for the operator's view see Data Pipeline.
System Overview
RePORT AI Portal is a two-world system. The two worlds run in separate processes and never share mutable state.
World 1 — Deterministic Pipeline (main.py + scripts/extraction/ +
scripts/security/ + scripts/utils/).
Reads raw clinical data from data/raw/{STUDY}/, runs three
extraction legs in parallel, scrubs PHI from the dataset leg,
mirrors dataset drops into the dictionary + PDF legs, atomically
publishes per-leg into output/{STUDY}/trio_bundle/, builds a
consolidated variables.json, emits a lineage manifest.
main.py --pipeline is the canonical entry point; make
pipeline is the Makefile alias; the wizard’s “Load Study” button
spawns this as a subprocess.
World 2 — AI Assistant (scripts/ai_assistant/).
A LangGraph ReAct agent with 12 tools that reads the published
trio bundle and answers researcher queries. Provider-agnostic via
init_chat_model; runs against Anthropic / OpenAI / Google /
NVIDIA / Ollama. Never accesses raw data. Three independent gates
on every tool return (PHI regex catalog, k=5 anonymity, l=2
diversity).
The two worlds communicate through the output/{STUDY}/ tree
only:
| World 1 writes | World 2 reads |
|---|---|
| trio_bundle/ (sanitised data) | trio_bundle/ (LLM data surface) |
| audit/ (counts only) | nothing (LLM hard-rejected for audit/) |
| agent/ (state subdirs) | agent/ (LLM session memory) |
The Streamlit wizard is the operator’s entry point; it routes API keys through the in-memory KeyStore, spawns the pipeline subprocess on demand, and then hands off to the agent for chat.
The Five-Tier Zone Model
See PHI Architecture for the full discussion. Briefly:
| Tier | Path | Posture |
|---|---|---|
| RED | data/raw/{STUDY}/ | Raw, presumed PHI. Read by extraction subprocess only. |
| AMBER | tmp/{STUDY}/ | Per-run scratch. Mode 0700. Securely deleted on success. |
| GREEN | output/{STUDY}/trio_bundle/ | LLM read zone. PHI-free. |
| GREEN-PROTECT | Agent tool boundary | PHI regex + k-anonymity + l-diversity gates before answers. |
| AUDIT | output/{STUDY}/audit/ | Counts-only IRB evidence. LLM-rejected. |
| out-of-zone | data/snapshots/{STUDY}/ | Human-reviewed baseline (LLM-invisible). Restored over trio_bundle/ on PDF-leg failure. |
Core Components
Configuration System
config resolves every path and most behaviour flags from env
vars + a YAML overlay (config/config.yaml). Read by every module
that needs a path or knob. See Configuration for
the full env-var table; key constants:
- STUDY_NAME — e.g. Indo-VAP. Pins the single study.
- BASE_DIR — repo root, used as the anchor for all path derivation.
- TRIO_BUNDLE_DIR — the canonical published-bundle path.
- STUDY_SNAPSHOTS_DIR — the reviewed baseline path (data/snapshots/{STUDY}/).
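The resolution order can be sketched as follows. `resolve_paths` and its signature are illustrative, not the real `config` API; the overlay is assumed to be the already-parsed contents of config/config.yaml:

```python
import os
from pathlib import Path

def resolve_paths(overlay: dict, environ: dict) -> dict:
    """Illustrative resolution order: environment variable wins over the
    YAML overlay, which wins over a hard-coded default."""
    def get(key: str, default: str) -> str:
        return environ.get(key, overlay.get(key, default))

    base = Path(get("BASE_DIR", ".")).resolve()
    study = get("STUDY_NAME", "Indo-VAP")
    return {
        "STUDY_NAME": study,
        "BASE_DIR": str(base),
        # Derived constants anchor on BASE_DIR, as the page describes.
        "TRIO_BUNDLE_DIR": str(base / "output" / study / "trio_bundle"),
        "STUDY_SNAPSHOTS_DIR": str(base / "data" / "snapshots" / study),
    }

cfg = resolve_paths({"STUDY_NAME": "Indo-VAP"}, {"BASE_DIR": "/srv/report"})
```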
Logging System
scripts.utils.logging_system — root-logger setup with a
verbose “tree” mode for --verbose runs.
scripts.utils.log_hygiene — logging.Filter that scrubs
API keys + PHI patterns from every log line. Both filters attach
to the root logger so World 1 and World 2 emit scrubbed
logs by default.
VerboseLogger keeps indentation in thread-local state, so
--verbose tree output remains readable while the dictionary,
dataset, and PDF extraction legs run in parallel.
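A minimal sketch of such a scrubbing filter, assuming one hypothetical key pattern (the real catalog in scripts.utils.log_hygiene is far larger and PHI-aware):

```python
import logging
import re

# Hypothetical secret pattern for illustration only.
_SECRET = re.compile(r"sk-[A-Za-z0-9]{8,}")

class RedactingFilter(logging.Filter):
    """Rewrites each record in place so every handler sees scrubbed text."""
    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message first, then redact; drop the unformatted args.
        record.msg = _SECRET.sub("[REDACTED]", record.getMessage())
        record.args = None
        return True  # never drop the record, only rewrite it

# Attach once at setup time (the page describes attaching to the root logger):
logging.getLogger().addFilter(RedactingFilter())
```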
Pipeline Modules
The pipeline is structured as a sequence of step functions in
main.py, each importing its operative module from
scripts/extraction/, scripts/security/, or
scripts/utils/. Steps 0/1/1.5 run in parallel; the cleanup
chain (1.6/1.7/1.8) and Steps 2/3/4/5 are sequential.
Dictionary Loader
Module: scripts.extraction.load_dictionary.load_study_dictionary()
Step: Step 0
Reads: data/raw/{STUDY}/data_dictionary/*.{xlsx,csv}
Writes: tmp/{STUDY}/dictionary/*.json
PHI posture: Carries no PHI by design (the dictionary defines variables, not subject values).
Dataset Extraction
Module: scripts.extraction.dataset_pipeline.process_datasets()
Step: Step 1 (extract)
Reads: data/raw/{STUDY}/datasets/*.{xlsx,csv}
Writes: tmp/{STUDY}/datasets/*.jsonl
PHI posture: Records carry full PHI here. Every record gets a _provenance dict (raw_sha256, pipeline_version, extraction_engine, source_file, sheet_name, row_index, study_name, extraction_utc).
Skip semantics: Hash-based step cache at output/{STUDY}/audit/manifests/dataset_processing.json.
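Building that per-record _provenance dict can be sketched as below. Field names follow the list above; the default pipeline_version and extraction_engine values are placeholders:

```python
import hashlib
from datetime import datetime, timezone

def make_provenance(raw_bytes: bytes, source_file: str, sheet_name: str,
                    row_index: int, study_name: str,
                    pipeline_version: str = "0.0.0",
                    extraction_engine: str = "openpyxl") -> dict:
    """Sketch of the per-record _provenance dict; defaults are placeholders."""
    return {
        "raw_sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "pipeline_version": pipeline_version,
        "extraction_engine": extraction_engine,
        "source_file": source_file,
        "sheet_name": sheet_name,
        "row_index": row_index,
        "study_name": study_name,
        # ISO-8601 UTC timestamp of extraction.
        "extraction_utc": datetime.now(timezone.utc).isoformat(),
    }
```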
PDF Extraction
Two co-existing paths:
Orchestrator path (default): scripts.extraction.pdf_pipeline, a pdfplumber code path + redacted-text LLM merge + per-PDF fallback to data/snapshots/{STUDY}/pdfs/. If the PDF leg fails, the full reviewed snapshot baseline is restored over trio_bundle/. No raw PDF bytes leave the host. See PDF Extraction for the per-step pipeline.
Legacy raw-PDF API path: scripts.extraction.extract_pdf_data. Refused unless the operator opts in twice (REPORTALIN_PDF_PHI_FREE=1 env flag + a non-empty authorities/phi_free_pdfs.md attestation note).
Dispatch happens in
scripts.extraction.extract_pdf_data.extract_pdfs_to_jsonl() based
on REPORTALIN_PDF_EXTRACTION_MODE. The wizard always sets
llm (orchestrator); the CLI default is unset (legacy gate).
PHI Scrub
Step: Step 1.6 (BEFORE Step 1.7 cleanup, so no raw PHI reaches the audit envelope)
Reads/writes: tmp/{STUDY}/datasets/*.jsonl in place
Audit: output/{STUDY}/audit/phi_scrub_report.json (counts-only)
Eight action classes: keep / birthdate / drop / cap / generalize / suppress_small_cell / date_jitter / hmac_pseudonymize. Configured in scripts/security/phi_scrub.yaml (~200 Indo-VAP-calibrated rules).
HMAC key: ~/.config/report_ai_portal/phi_key (mode 0600, outside the repo).
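The two deterministic HMAC actions can be sketched as below. The pseudonym format and the max_days window are illustrative; the real per-field rules live in scripts/security/phi_scrub.yaml:

```python
import hashlib
import hmac
from datetime import date, timedelta

def hmac_pseudonymize(key: bytes, subject_id: str) -> str:
    # Same key + same ID -> same pseudonym on every run (deterministic).
    digest = hmac.new(key, subject_id.encode(), hashlib.sha256).hexdigest()
    return f"SUBJ-{digest[:12]}"  # illustrative format

def date_jitter(key: bytes, subject_id: str, d: date, max_days: int = 14) -> date:
    # The offset is derived from the HMAC, so every date for one subject
    # shifts by the same amount and intervals between dates are preserved.
    digest = hmac.new(key, subject_id.encode(), hashlib.sha256).digest()
    offset = int.from_bytes(digest[:4], "big") % (2 * max_days + 1) - max_days
    return d + timedelta(days=offset)
```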
Dataset Cleanup
Module: scripts.extraction.dataset_cleanup.clean_trio_datasets()
Step: Step 1.7
Reads/writes: tmp/{STUDY}/datasets/*.jsonl in place
Audit: output/{STUDY}/audit/dataset_cleanup_report.json
Removes junk rows, merges duplicate records, and propagates Step 1's drop events into the cleanup record.
Cleanup Propagation
Module: scripts.extraction.cleanup_propagation.run_propagation()
Step: Step 1.8
Reads/writes: tmp/{STUDY}/{dictionary,pdfs}/ in place
Audit: output/{STUDY}/audit/{dictionary,pdfs}_cleanup_report.json
Computes the propagation drop-set from the dataset audit and prunes matching rows/keys from the staged dictionary + PDF trees. Keeps the published trio bundle internally consistent.
Publish
Function: _publish_staging in main.py
Step: Step 2
Atomic per-leg rename: tmp/{STUDY}/{leg}/ → output/{STUDY}/trio_bundle/{leg}/. A same-filesystem rename is a single inode swap; a cross-filesystem move (e.g. tmpfs staging + disk output) falls back to shutil.copytree + shutil.rmtree.
Zone guard: assert_output_zone(trio_dir) runs before the rename.
Pre-publish: if the destination exists, secure_remove_tree (zero-fill + fsync + unlink) runs first so old bytes aren't recoverable.
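A minimal sketch of the per-leg publish, assuming the destination's parent directory exists. `publish_leg` is illustrative; the real _publish_staging also runs the zone guard and uses secure_remove_tree rather than plain shutil.rmtree:

```python
import os
import shutil

def publish_leg(staged: str, dest: str) -> None:
    """Same-filesystem os.rename is a single atomic inode swap; across
    filesystems it raises OSError (EXDEV) and we fall back to copy + delete."""
    if os.path.exists(dest):
        shutil.rmtree(dest)  # stand-in for secure_remove_tree
    try:
        os.rename(staged, dest)
    except OSError:  # EXDEV: e.g. tmpfs staging + on-disk output
        shutil.copytree(staged, dest)
        shutil.rmtree(staged)
```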
Variables Reference Builder
Module: scripts.extraction.build_variables_reference.build_variables_reference()
Step: Step 3 (AFTER publish; reads the populated trio_bundle/, not staging)
Output: output/{STUDY}/trio_bundle/variables.json — the consolidated variable schema the agent uses to validate variable names in queries.
Lineage Manifest
Step: Step 4
Output: output/{STUDY}/audit/lineage_manifest.json — pairs every raw input SHA-256 with every published trio artifact SHA-256, plus the PHI-key fingerprint, compliance posture, and pipeline version. This is the single artifact an IRB reviewer reads to verify the entire raw → scrub → publish chain.
Output Signpost + AMBER cleanup
Step: Step 5
Re-emits output/{STUDY}/README.md describing the layout.
On success, secure_remove_tree runs over tmp/{STUDY}/. On failure, AMBER is preserved for forensic inspection.
Supporting Services
AI Assistant Agent Layer
- scripts.ai_assistant.agent_graph — LangGraph ReAct agent; the only module that constructs an LLM client. Provider keys flow in via the explicit api_key= kwarg, sourced from the KeyStore (no os.environ lookup).
- scripts.ai_assistant.agent_tools — 12 @tool-decorated functions. ALL_TOOLS is the canonical list; the doc-freshness lint ties prose docs to this list.
- scripts.ai_assistant.agent_prompts — system prompt with a CONVERSATIONAL WORLD section that tells the LLM to answer greetings without tool calls.
- scripts.ai_assistant.phi_safe — agent-side PHI helpers: phi_safe_return, guard_text, guard_user_prompt, sanitise_untrusted_snippet, redact_phi_in_text, sanitise_traceback.
- scripts.ai_assistant.file_access — agent-runtime path validator (the canonical chokepoint for every tool's file I/O).
- scripts.ai_assistant.keystore — in-memory API-key registry.
- scripts.ai_assistant.tool_cache — per-tool memoisation.
Analytical Engine
scripts.ai_assistant.analytical_engine — deterministic
epidemiology helpers (logistic regression, survival, descriptive
stats) called from the run_python_analysis tool. Pre-loaded
DataFrames come from config.TRIO_DATASETS_DIR only (GREEN
zone).
Subprocess Sandbox
- scripts.ai_assistant.sandbox.replicate — public API (run_in_subprocess).
- scripts.ai_assistant.sandbox.runner — child-process entry point; carries the AST + import + dunder + builtin guards.
- scripts.ai_assistant.sandbox.limits — cross-platform rlimits.
Generated .py files are persisted to
output/{STUDY}/agent/analysis/{ts}.py. See Sandbox: Subprocess-Isolated Code Execution for
the full layered story.
File-Access Validator
scripts.ai_assistant.file_access — unified chokepoint that
every agent tool calls before any file I/O. Resolves with
os.path.realpath, verifies containment with
os.path.commonpath. Reads accept trio_bundle/ ∪ agent/
(plus config/study_knowledge.yaml via an explicit allowlist).
Writes accept agent/ only. Sandbox writes narrow further to
agent/analysis/. Audit, telemetry, staging, raw, and the
snapshot baseline are hard-rejected.
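The containment check can be sketched as follows. Function names here are illustrative; the real validator lives in scripts.ai_assistant.file_access:

```python
import os

class ZoneViolationError(PermissionError):
    """Mirrors the error type named on this page."""

def validate_read(path: str, allowed_roots: list) -> str:
    """Resolve with realpath, then verify containment with commonpath."""
    real = os.path.realpath(path)  # collapses symlinks and ../ tricks
    for root in allowed_roots:
        root_real = os.path.realpath(root)
        # Containment holds iff the shared prefix is the root itself.
        if os.path.commonpath([real, root_real]) == root_real:
            return real
    raise ZoneViolationError(f"outside agent read zone: {path}")
```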
Telemetry
scripts.utils.telemetry — agent event logger, attached as a
LangChain callback. Lands in
output/{STUDY}/audit/telemetry/events.jsonl (LLM-rejected via
validate_agent_read). Non-string event payloads are
force-stringified + masked before write.
Web UI
- scripts.ai_assistant.web_ui — Streamlit entry.
- scripts.ai_assistant.ui.wizard — three-step setup flow. Step 1 = LLM config (KeyStore routing). Step 2 = data load (two buttons: Use Existing Study + Load Study). Step 3 = confirm + start chat.
- scripts.ai_assistant.ui.chat — chat surface.
- scripts.ai_assistant.ui.streaming — token stream + error expander (with traceback sanitiser).
- scripts.ai_assistant.ui.conversations — at-rest conversation persistence with PHI redaction.
- scripts.ai_assistant.ui.providers — provider catalog (Anthropic, OpenAI, Google, Ollama, NVIDIA).
- scripts.ai_assistant.ui.model_policy — capability-floor enforcement on UI model selection.
Data Flow
End-to-End Runtime Flow
data/raw/{STUDY_NAME}/data_dictionary/ ──┐ ┐
├──→ load_dictionary ────┐ │
data/raw/{STUDY_NAME}/datasets/ ─────────┼──→ dataset_pipeline ────┤ ├ Phase 1 PARALLEL
│ │ │ (3-worker pool;
data/raw/{STUDY_NAME}/annotated_pdfs/ ───┴──→ pdf_pipeline ────────┤ ┘ join → cleanup)
(orchestrator: pdfplumber│
code path + redacted- │
text LLM merge + │
snapshot fallback at │
data/snapshots/{STUDY}/ │
pdfs/) │
│
(all legs → staging) ▼
tmp/{STUDY_NAME}/{datasets,dictionary,pdfs}/
│
phi_scrub.run_scrub (Step 1.6 — date jitter +
ID pseudonymization on staged datasets;
emits phi_scrub_report.json)
│
dataset_cleanup (emits dataset audit)
│
cleanup_propagation (prunes dict+pdf in staging,
emits dict+pdf audits)
│
_publish_staging (atomic per-leg rename)
│
▼
output/{STUDY_NAME}/trio_bundle/{datasets,
dictionary,
pdfs,
variables.json}
│
build_variables_reference (Step 3 — reads
the published trio bundle)
│
emit_lineage_manifest (Step 4 — raw SHA-256
↔ trio SHA-256 + PHI-key fingerprint)
│
▼
output/{STUDY_NAME}/audit/lineage_manifest.json
│
_emit_output_signpost (Step 5)
│
_cleanup_staging (success only — secure_remove_tree)
│
▼
World 2: AI Assistant
reads trio_bundle/ + agent/ only
Source tree
Expected source tree:
data/raw/{STUDY_NAME}/
├── datasets/
├── annotated_pdfs/
└── data_dictionary/
data/snapshots/{STUDY_NAME}/ # reviewed baseline
├── datasets/ # cleaned + verified,
├── dictionary/ # LLM-INVISIBLE; restored over live trio
├── pdfs/ # ``pdfs/{stem}_variables.json`` as
└── variables.json # per-PDF fallback
Expected processed tree:
output/{STUDY_NAME}/
├── trio_bundle/ # GREEN — LLM read zone
│ ├── datasets/*.jsonl # PHI-scrubbed
│ ├── dictionary/*.json
│ ├── pdfs/*_variables.json # tier: merged | snapshot | empty
│ └── variables.json # consolidated schema
├── audit/ # AUDIT — counts only; LLM hard-rejected
│ ├── lineage_manifest.json
│ ├── phi_scrub_report.json
│ ├── dataset_cleanup_report.json
│ ├── dictionary_cleanup_report.json
│ ├── pdfs_cleanup_report.json
│ └── telemetry/
│ └── events.jsonl
└── agent/ # analysis / conversations
Transient staging root (not a durable artifact):
tmp/{STUDY_NAME}/
├── datasets/
├── dictionary/
├── pdfs/
└── .pdf_cache/ # idempotent LLM-response cache
Security Boundaries
Zone Enforcement
Two complementary chokepoints:
- Pipeline-side directory guards — scripts.security.secure_env. Functions: assert_not_raw, assert_output_zone, assert_write_zone, assert_trio_bundle_zone. Used at pipeline boundaries.
- Agent-runtime path validator — scripts.ai_assistant.file_access. Functions: validate_agent_read, validate_agent_write, validate_sandbox_write, is_agent_readable. Used by every agent tool before any file I/O.
Both raise ZoneViolationError (a PermissionError subclass)
on any zone violation. The agent’s read zone is strictly
trio_bundle/ + agent/ (plus the config/study_knowledge.yaml
allowlist); audit, telemetry, staging, raw, and the snapshot
baseline are hard-rejected.
Three independent gates on every tool return
See PHI Architecture. PHI regex catalog + k=5 + l=2.
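Sketches of the k-anonymity and l-diversity checks, under assumed quasi-identifier semantics (the real gates also run the PHI regex catalog first, and the field names below are hypothetical):

```python
from collections import Counter, defaultdict

def passes_k_anonymity(rows: list, quasi_ids: tuple, k: int = 5) -> bool:
    """Every quasi-identifier combination must describe at least k rows."""
    counts = Counter(tuple(r.get(q) for q in quasi_ids) for r in rows)
    return all(n >= k for n in counts.values())

def passes_l_diversity(rows: list, quasi_ids: tuple,
                       sensitive: str, l: int = 2) -> bool:
    """Each quasi-identifier group must contain at least l distinct
    values of the sensitive attribute."""
    groups = defaultdict(set)
    for r in rows:
        groups[tuple(r.get(q) for q in quasi_ids)].add(r.get(sensitive))
    return all(len(vals) >= l for vals in groups.values())
```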
API keys never in os.environ
ADR-011. The wizard routes pasted keys into the in-memory
KeyStore; *_API_KEY env vars are scrubbed from
os.environ. Keys re-injected only into the short-lived
pipeline subprocess via KeyStore.env_for_subprocess.
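A minimal sketch of the KeyStore pattern; the method names other than env_for_subprocess (which the page names) are illustrative:

```python
import os

class KeyStore:
    """Keys live only in this object, never in os.environ; the child env
    is a one-shot copy built for the short-lived pipeline subprocess."""
    def __init__(self) -> None:
        self._keys: dict = {}

    def put(self, env_name: str, value: str) -> None:
        self._keys[env_name] = value
        os.environ.pop(env_name, None)  # scrub any *_API_KEY leftover

    def env_for_subprocess(self) -> dict:
        env = dict(os.environ)   # copy, so the parent env stays key-free
        env.update(self._keys)
        return env
```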
Subprocess sandbox for run_python_analysis
ADR-010. RLIMIT_AS / RLIMIT_NPROC / RLIMIT_CPU
clamps + sanitised env + read-only trio_bundle/ + AST guards
inside the child. See Sandbox: Subprocess-Isolated Code Execution.
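The rlimit clamps can be sketched with the stdlib resource module; the limit values are placeholders, and the real implementation lives in scripts.ai_assistant.sandbox.limits:

```python
import resource

def clamp_child_limits(mem_bytes: int = 1 << 30,
                       cpu_seconds: int = 30,
                       nproc: int = 64) -> None:
    """Intended to run inside the child (e.g. as a subprocess preexec_fn)
    before any untrusted code executes. Values are placeholders."""
    resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))       # address space
    resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))  # CPU seconds
    resource.setrlimit(resource.RLIMIT_NPROC, (nproc, nproc))            # fork-bomb guard
```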
Design Principles
Modularity
Each pipeline step is a function in main.py that imports its
operative module from scripts/. The step + its module are the
unit of audit; you can verify Step 1.6 by reading
scripts.security.phi_scrub.run_scrub() and
scripts/security/phi_scrub.yaml without needing to read any
other code.
Determinism where possible
HMAC date jitter is deterministic given the key (so re-runs produce identical pseudonyms + offsets).
Step cache uses SHA-256 hashes of inputs (so re-runs skip unchanged steps).
Lineage manifest + per-row provenance dict make every published byte traceable to a raw input hash + pipeline version.
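The hash-based step cache can be sketched as follows; function names and the manifest layout are illustrative:

```python
import hashlib
import json
from pathlib import Path

def inputs_fingerprint(paths: list) -> str:
    """Combined SHA-256 of a step's inputs; a step re-runs only when
    this value changes."""
    h = hashlib.sha256()
    for p in sorted(paths):          # sorted for order-independence
        h.update(p.encode())
        h.update(Path(p).read_bytes())
    return h.hexdigest()

def step_is_fresh(manifest_path: str, step: str, fingerprint: str) -> bool:
    mp = Path(manifest_path)
    manifest = json.loads(mp.read_text()) if mp.exists() else {}
    return manifest.get(step) == fingerprint
```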
Security-first boundaries
Two zone-guard chokepoints (pipeline + agent).
Three agent-output gates (PHI / k-anon / l-diversity).
KeyStore + subprocess sandbox + log redactor as orthogonal defenses.
Single reviewed snapshot baseline prevents an incomplete PDF run from becoming live data while keeping the baseline outside the LLM read zone.
Out-of-scope (explicit non-goals)
These are not part of the active local architecture contract described here:
upload-driven multi-study workflows
HPC / Slurm deployment surfaces
distributed processing claims
historical phase-based roadmap promises
If a feature in the codebase contradicts the architecture described on this page, the page is the source of truth for current behaviour; the feature should either be reconciled or marked as out-of-scope above. New ADRs in Architecture Decisions (ADRs) capture genuine architectural shifts.
See Also
PHI Architecture — full PHI handling story.
Architecture Decisions (ADRs) — ADRs for the security, PDF, snapshot, and agent-boundary decisions.
Sandbox: Subprocess-Isolated Code Execution — subprocess sandbox.
PDF Extraction — PDF orchestrator deep dive.
Operations — operational playbook + snapshot-baseline maintenance.
Agent Instructions (for AI Coding Assistants) — instructions for AI coding assistants.