Agent Instructions (for AI Coding Assistants)

This page is the authoritative briefing for AI coding assistants (Claude Code, Copilot CLI, Codex, Gemini CLI) working on this repository. It supersedes the historical AGENTS.md at the repo root, which has been retired (per the directive to keep project documentation in README and Sphinx).

The remainder of this page is organised the way an assistant’s context-builder reads it: orientation → conventions → rules.

Orientation

Privacy-first, local-first AI Assistant system for clinical research data. The PHI scrubber (Step 1.6) is an honest-broker catalog with eight action classes — keep / birthdate / drop / cap / generalize / suppress_small_cell / date / id — evaluated in strict priority order against ~200 Indo-VAP-calibrated rules. See scripts.security.phi_scrub and scripts/security/phi_scrub.yaml. The HMAC key lives at ~/.config/report_ai_portal/phi_key (outside this repo, never read by agent code).

Four-tier zone model:

  • RED — data/raw/: raw clinical inputs.

  • AMBER — tmp/{STUDY}/: secure staging (mode 0700, umask 0077, zero-fill teardown; optional tmpfs via REPORTALIN_TMPFS_STAGING=1).

  • GREEN — output/{STUDY}/trio_bundle/ (PHI-free artifacts) + output/{STUDY}/agent/ (the agent’s own state). These two zones form the LLM’s read surface, enforced by scripts.ai_assistant.file_access.validate_agent_read().

  • GREEN-PROTECT — the agent tool boundary: PHI regex gate plus k-anonymity and l-diversity for row-level results before the LLM can answer.

  • AUDIT envelope — output/{STUDY}/audit/: counts-only IRB evidence, hard-rejected for the agent.

Plus a fifth, out-of-zone tier: data/snapshots/{STUDY}/ holds the human-reviewed cleaned trio bundle baseline restored when PDF extraction fails or Use Existing Study is selected. The LLM cannot read it.

See Architecture for the full architecture. The IRB-grade benchmark lives at Conformance Evidence.

Quick reference

make sync          # Install all deps (uv sync --all-groups)
make test          # Deterministic subset excluding AI Assistant construction smokes
make test-all      # Full suite including AI Assistant construction smokes
make lint          # ruff check + format
make ci            # lint → typecheck → test
make chat          # Launch Streamlit web UI
make chat-cli      # Launch CLI REPL
make pipeline      # Full data pipeline (dict → datasets + pdfs → variables.json)

Architecture (two-world)

World 1 — Deterministic Pipeline (main.py + scripts/extraction/, scripts/security/, scripts/utils/):

The three extraction legs (dictionary, datasets, PDFs) write into a transient staging workspace at tmp/{STUDY_NAME}/. They run in parallel on a 3-worker concurrent.futures.ThreadPoolExecutor; the cleanup chain (PHI scrub / dataset cleanup / cleanup propagation) and Publish + Variables run sequentially after the join.

The deterministic steps, in order:

  • Every extracted row gets a full _provenance dict (raw_sha256, pipeline_version, extraction_engine, source_file, sheet_name, row_index, study_name, extraction_utc).

  • scripts.security.phi_scrub.run_scrub() (Step 1.6) scrubs staged datasets in place via the eight action classes in strict priority order BEFORE any audit is written, so no raw PHI lands in output/.

  • dataset_cleanup (Step 1.7) runs against staged datasets and emits audit/dataset_cleanup_report.json.

  • scripts.extraction.cleanup_propagation.run_propagation() (Step 1.8) reads the dataset audit, computes the pruning set, and rewrites staged dictionary + PDF artifacts.

  • _publish_staging atomically renames staging → trio_bundle/ (per-leg, with a copytree fallback across filesystems).

  • scripts.extraction.build_variables_reference.build_variables_reference() runs after publish.

  • Step 4 emits audit/lineage_manifest.json pairing every raw input (SHA-256) with every published trio artifact (SHA-256).

  • On success, staging is securely removed (overwrite + fsync + unlink); on failure, tmp/{STUDY_NAME}/ is preserved for operator inspection.
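
The fan-out/join shape described above can be sketched minimally; the leg and step names here are illustrative stand-ins, not the real pipeline functions:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative stand-ins for the three extraction legs (not the real functions).
def extract_dictionary() -> str: return "dictionary"
def extract_datasets() -> str: return "datasets"
def extract_pdfs() -> str: return "pdfs"

def run_pipeline() -> list[str]:
    # Fan out: the three legs run concurrently on a 3-worker pool.
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(leg) for leg in (extract_dictionary, extract_datasets, extract_pdfs)]
        legs = [f.result() for f in futures]  # join: re-raises the first leg failure
    # After the join, the cleanup chain and publish run strictly in order.
    sequential = ["phi_scrub", "dataset_cleanup", "cleanup_propagation", "publish", "variables"]
    return legs + sequential
```

The join point is what guarantees no scrub or publish step ever races an extraction leg.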

PDF extraction: the wizard’s “Load Study” button selects the orchestrator path (scripts.extraction.pdf_pipeline). pdfplumber extracts text locally; the text is PHI-redacted; only redacted text reaches the LLM; the response is re-scrubbed and merged with the code candidate. When the LLM tier is unavailable for any reason, the orchestrator falls back per-PDF to data/snapshots/{STUDY}/pdfs/ (the reviewed baseline). If the PDF leg fails, the pipeline restores the full reviewed baseline over trio_bundle/. The legacy raw-PDF API path (scripts.extraction.extract_pdf_data) is the CLI default and is gated by the two-part REPORTALIN_PDF_PHI_FREE operator attestation.

World 2 — AI Assistant (scripts/ai_assistant/): LangGraph ReAct agent with 12 tools for querying study data. Never accesses raw data.
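
The read-surface rule amounts to a single path gate. Below is a simplified stand-in for the shape of validate_agent_read() — the paths and zone list are illustrative, not the repo's actual implementation:

```python
from pathlib import Path

class ZoneViolationError(Exception):
    """Raised when a tool tries to read outside the agent's read surface."""

def validate_read(path: str, output_root: str, study: str) -> Path:
    # The only readable zones: trio_bundle/ and agent/ under the study's output dir.
    allowed = [
        Path(output_root) / study / "trio_bundle",
        Path(output_root) / study / "agent",
    ]
    resolved = Path(path).resolve()  # collapse symlinks and ../ tricks first
    for zone in allowed:
        try:
            resolved.relative_to(zone.resolve())
            return resolved  # inside an allowed zone
        except ValueError:
            continue
    raise ZoneViolationError(f"read outside agent zones: {resolved}")
```

Resolving before comparison is the important detail: it defeats `../`-based escapes out of the allowed zones.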

Output structure: output/{STUDY_NAME}/trio_bundle/{datasets,pdfs,dictionary,variables.json}, audit/{dataset,dictionary,pdfs}_cleanup_report.json + audit/phi_scrub_report.json + audit/lineage_manifest.json + audit/telemetry/events.jsonl, agent/{analysis,conversations}/; transient staging sibling: tmp/{STUDY_NAME}/{datasets,dictionary,pdfs}/.

Snapshot baseline: data/snapshots/{STUDY_NAME}/{datasets,dictionary,pdfs,variables.json} is the single human-reviewed, cleaned trio bundle per study. The PDF orchestrator reads it as the per-PDF fallback, and the wizard restores it over the live trio_bundle/ for Use Existing Study. The LLM is forbidden from reading it. Maintainer protocol: see Operations.

Wizard step 2: two top-level buttons — Use Existing Study (restore the reviewed snapshot baseline into the live trio_bundle/) and Load Study (run the pipeline subprocess; orchestrator falls back to the reviewed snapshot baseline when the PDF leg cannot produce complete output).

PHI key: sidecar at ~/.config/report_ai_portal/phi_key (resolved via config.PHI_KEY_PATH, overridable with XDG_CONFIG_HOME). Mode must be 0600. Missing = hard-fail for developer/operator CLI pipeline runs. Normal users should create it only through the web UI’s Load Study flow. Developers can bootstrap it via python -m scripts.security.phi_scrub bootstrap-key when running the pipeline outside the web UI. Key rotation = full re-ingestion.
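
The mode check can be sketched as follows; check_key_mode is a hypothetical helper, not the repo's actual implementation:

```python
import os
import stat

def check_key_mode(path: str) -> None:
    """Hard-fail unless the sidecar key file exists with mode 0600."""
    if not os.path.exists(path):
        raise SystemExit(f"PHI key missing at {path}; bootstrap it first")
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode != 0o600:
        raise SystemExit(f"PHI key at {path} has mode {oct(mode)}, expected 0600")
```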

Tech stack

  • Python 3.11+, uv package manager (required)

  • Ruff linter (line-length=100, see pyproject.toml [tool.ruff])

  • MyPy type checker (ignore_missing_imports=true)

  • Pytest (tests/, @pytest.mark.slow for heavy tests)

  • LangChain/LangGraph for AI Assistant agent, Streamlit ≥1.38, <2.0 for web UI

  • Custom type stubs in typings/ for google, anthropic

Critical conventions

Security zones (MUST follow)

  • Never access data/raw/ from agent code — only output/{STUDY}/trio_bundle/.

  • Always call scripts.ai_assistant.file_access.validate_agent_read() or validate_agent_write() before any file I/O in tools. This is the unified chokepoint — it accepts only trio_bundle/ + agent/ paths and rejects audit, telemetry, staging, raw, and arbitrary filesystem paths with ZoneViolationError.

  • Route every free-text tool return through scripts.ai_assistant.phi_safe.guard_text() or wrap the tool with @phi_safe_return.

  • When surfacing row-level data, call scripts.security.kanon_gate.guard_rows_with_kanon_and_ldiv() first — k=5 + l=2. The gate suppresses responses when any quasi-identifier equivalence class has fewer than k members or when l-diversity (l=2) on the sensitive attribute is violated.

  • When writing pipeline code that logs subject data, install the PHI log redactor via scripts.utils.log_hygiene.install_phi_redactor() so raw SUBJID / dates / emails / Aadhaar / phone never land in .logs/*.log.
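
The k=5 / l=2 gate above can be sketched as a small suppression check; this is a simplified stand-in for guard_rows_with_kanon_and_ldiv(), not its real signature:

```python
from collections import defaultdict

def suppress_small_cells(rows, quasi_ids, sensitive, k=5, l=2):
    """Return rows only if every quasi-identifier equivalence class has
    >= k members and >= l distinct sensitive values; otherwise suppress all."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[q] for q in quasi_ids)].append(row)
    for members in classes.values():
        if len(members) < k:
            return []  # k-anonymity violated: suppress the whole response
        if len({m[sensitive] for m in members}) < l:
            return []  # l-diversity violated on the sensitive attribute
    return rows
```

Suppressing the whole response (rather than trimming offending rows) avoids leaking which equivalence class was too small.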

Conversational-shortcut guard on fuzzy search tools

  • Greetings / acknowledgements / queries shorter than 3 chars are short-circuited inside search_variables, find_variable_candidates, search_pdf_context via _query_looks_conversational in scripts/ai_assistant/agent_tools.py. The tool returns a refusal (_CONVERSATIONAL_REFUSAL_MESSAGE) instead of surfacing noisy fuzzy-substring hits.

  • Paired with a CONVERSATIONAL WORLD section at the top of scripts/ai_assistant/agent_prompts.py that tells the LLM to answer greetings / small-talk without any tool call.

  • This is UX hygiene, not a security control. phi_safe.guard_user_prompt still runs on every prompt at UI + CLI entry; this guard operates inside the tool so a retry-happy agent that tries to call it anyway gets a clean refusal rather than a name-variable paraphrase.

  • When adding a new fuzzy search tool, call _query_looks_conversational(query) and return _CONVERSATIONAL_REFUSAL_MESSAGE on True. Covered by tests/test_agent_conversational_guard.py.

Prompt-injection + at-rest defences

  • Input-side gate. Every researcher prompt must pass scripts.ai_assistant.phi_safe.guard_user_prompt() before the LLM is invoked. Already wired at ui/chat.py + cli.py.

  • Untrusted text must be wrapped. Any text surfaced from outside the agent’s control (PDF extracts, dictionary free-text, external vocab) must pass through scripts.ai_assistant.phi_safe.sanitise_untrusted_snippet() before it reaches the LLM. Already applied inside search_pdf_context.

  • At-rest redaction. Any surface that persists user-generated content (conversation JSONs, exports, future telemetry sinks) must run content through scripts.ai_assistant.phi_safe.redact_phi_in_text(). Already wired at conversations.py’s save / export branches.

  • Traceback surfaces. Tool error returns, UI error expanders, and telemetry error payloads must sanitise with scripts.ai_assistant.phi_safe.sanitise_traceback(). Already wired at run_study_analysis + streaming.py.

  • Refused-prompt placeholder. When guard_user_prompt refuses, the persisted conversation must store a category-tagged placeholder (e.g. "[PHI-REFUSED prompt AADHAAR]"), not the raw prompt.

  • Adding a new agent tool = @tool + @phi_safe_return → open with validate_agent_read(...) (or validate_agent_write(...)). Any deviation fails tests/test_agent_tools_phi_safe.py + tests/test_file_access.py.
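
The at-rest redaction pattern can be sketched as category-tagged substitution; the patterns below are an illustrative subset, not the real catalog in phi_safe.redact_phi_in_text():

```python
import re

# Illustrative patterns only — the real catalog is far broader.
_PATTERNS = [
    ("AADHAAR", re.compile(r"\b\d{4}\s\d{4}\s\d{4}\b")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")),
    ("DATE", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
]

def redact_phi_in_text(text: str) -> str:
    # Replace each match with a category-tagged placeholder before persisting,
    # so the saved conversation records *what kind* of PHI was removed.
    for tag, pattern in _PATTERNS:
        text = pattern.sub(f"[PHI-{tag}]", text)
    return text
```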

KeyStore

  • The Streamlit wizard’s step 1 routes the pasted API key into scripts.ai_assistant.keystore (an in-memory KeyStore registry) and scrubs the corresponding *_API_KEY from os.environ.

  • Keys are re-injected only into the short-lived pipeline subprocess via KeyStore.env_for_subprocess().

  • Every LLM client constructor (ChatAnthropic, ChatOpenAI, ChatGoogleGenerativeAI, etc.) takes an explicit api_key= kwarg sourced from the KeyStore — no environment lookup at construction time.

Sandbox

run_python_analysis runs in an isolated subprocess. See Sandbox: Subprocess-Isolated Code Execution. Layered protections include subprocess isolation, RLIMIT_AS / RLIMIT_NPROC / RLIMIT_CPU rlimits, in-child AST + import + dunder + builtin guards, wall-clock timeout, output cap, figure cap. The generated .py file is persisted to output/{STUDY}/agent/analysis/{ts}.py for operator reproduction.
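
One of the layers above, the in-child AST guard, can be sketched as a tree walk; the blocked-import set is an illustrative subset, not the sandbox's real policy:

```python
import ast

BLOCKED_IMPORTS = {"os", "subprocess", "socket"}  # illustrative subset

def check_source(source: str) -> None:
    """Reject code that imports blocked modules or touches dunder attributes,
    mirroring the shape of an in-child AST guard."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            else:
                names = [node.module or ""]
            for name in names:
                if name.split(".")[0] in BLOCKED_IMPORTS:
                    raise ValueError(f"blocked import: {name}")
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            raise ValueError(f"blocked dunder attribute: {node.attr}")
```

The AST pass runs before exec, so rejected code never executes at all; rlimits and the wall-clock timeout then bound whatever passes the static check.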

Config

All paths and settings come from config.py (env vars + YAML overlay from config/config.yaml). Never hardcode paths — use config.TRIO_BUNDLE_DIR, config.TMP_DIR, etc.

Key flags: STUDY_NAME, LOG_LEVEL, LOG_VERBOSE (see .env.example).
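
The layering can be sketched as a resolution function, assuming environment variables take precedence over the YAML overlay (an assumption — check config.py for the real order):

```python
import os

def resolve_setting(name: str, yaml_overlay: dict, default: str) -> str:
    """Illustrative resolution order: environment variable, then the
    config/config.yaml overlay, then the built-in default."""
    if name in os.environ:
        return os.environ[name]
    if name in yaml_overlay:
        return yaml_overlay[name]
    return default
```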

Imports

  • Use from __future__ import annotations in all modules.

  • Lazy-import optional deps (streamlit, langchain) inside functions.

  • First-party packages: scripts, config.
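
The lazy-import convention can be sketched with importlib; optional_import is a hypothetical helper for illustration, not part of the repo:

```python
import importlib

def optional_import(module_name: str):
    """Lazy-import an optional dependency inside a function, returning None
    when it is not installed instead of failing at module import time."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None
```

A UI function would call optional_import("streamlit") at its first use, so modules that never render UI pay no import cost and need no dependency.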

Agent tools

Tools live in scripts/ai_assistant/agent_tools.py as @tool-decorated functions. The docstring becomes the agent-visible description. All tools are collected in ALL_TOOLS list. Use tool_cache for memoization.
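
tool_cache's memoization can be approximated with functools.lru_cache; lookup_variable and the call counter below are illustrative only:

```python
from functools import lru_cache

# Stand-in for tool_cache: memoize pure, read-only tool lookups so repeated
# agent calls with the same arguments don't re-read the trio bundle.
calls = {"count": 0}

@lru_cache(maxsize=128)
def lookup_variable(name: str) -> str:
    calls["count"] += 1  # side channel just to demonstrate the cache hit
    return f"definition of {name}"
```

Memoization is only safe because the trio bundle is immutable between pipeline runs; a cache over mutable state would need invalidation.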

Web UI

  • scripts/ai_assistant/web_ui.py is the main Streamlit app.

  • UI modules split into scripts/ai_assistant/ui/ (streaming, conversations, providers, session, wizard).

  • Sidebar JS in scripts/ai_assistant/ui/assets/bridge.js — uses document (not window.parent.document).

  • Use st.iframe() for injected JS bridge surfaces so the hidden bridge stays isolated and compatible with Streamlit’s current runtime.

Tests

  • Fixtures in tests/conftest.py — use tmp_path + monkeypatch_config to isolate.

  • Synthetic data helpers: _fake_records(n), synthetic_excel().

  • Tests requiring LLM/langchain are excluded from make test (included in make test-all).

  • Zone markers are patched via monkeypatch in fixtures.

Web UI architecture

The Streamlit web UI implements a Claude Desktop-style dark design language. It is production-ready with a setup wizard, conversation history, model switching, and interactive analysis charts.

UI edit-safe files

Only these paths may be touched by UI work:

  • scripts/ai_assistant/web_ui.py

  • scripts/ai_assistant/ui/{chat,conversations,model_policy,providers,shell,state,streaming,wizard}.py

  • scripts/ai_assistant/ui/assets/{theme.css, bridge.js, fonts/}

  • .streamlit/config.toml

  • pyproject.toml (kaleido pin only)

UI edit-forbidden files (hard stop)

  • config.py

  • scripts/ai_assistant/agent_graph.py (read-only; use the three entry points only: stream_query, invoke_query, reset_agent)

  • scripts/ai_assistant/agent_tools.py, agent_prompts.py, analytical_engine.py, study_knowledge.py, file_access.py, tool_cache.py, phi_safe.py, cli.py

  • Everything under scripts/extraction/, scripts/security/, scripts/utils/

Design token system

All design tokens in scripts/ai_assistant/ui/assets/theme.css use the --rpln-* namespace (canonical primary :root block). New CSS must use these — never the deprecated backward-compat scales.

Categories: colors, spacing, type, line-height, radius, z-index, easing, durations. --rpln-accent-orange is a compat alias — use --rpln-accent instead.

Regression gate (run before every UI commit)

uv run pytest tests/ -x -q

Any red test in a non-UI module = hard stop. Revert the wave, do not patch the test.

Key files

  • Entry point — main.py, config.py

  • Pipeline — scripts/extraction/dataset_pipeline.py, scripts/extraction/build_variables_reference.py, scripts/extraction/extract_pdf_data.py, scripts/extraction/pdf_pipeline.py (orchestrator)

  • PHI scrub + catalog — scripts/security/phi_scrub.py, scripts/security/phi_scrub.yaml

  • PHI gate + k-anon + allowlist — scripts/security/phi_gate.py, scripts/security/kanon_gate.py, scripts/security/phi_allowlist.py, scripts/security/phi_patterns.py

  • Phase-0 staging hardening — scripts/utils/secure_staging.py

  • Integrity + governance — scripts/utils/lineage.py, scripts/utils/log_hygiene.py

  • Zone guards — scripts/security/secure_env.py

  • AI Assistant agent — scripts/ai_assistant/agent_graph.py, scripts/ai_assistant/agent_tools.py, scripts/ai_assistant/agent_prompts.py, scripts/ai_assistant/phi_safe.py, scripts/ai_assistant/keystore.py (KeyStore)

  • Sandbox subprocess — scripts/ai_assistant/sandbox/{replicate,limits,runner}.py (subprocess sandbox)

  • Telemetry — scripts/utils/telemetry.py

  • Web UI — scripts/ai_assistant/web_ui.py, scripts/ai_assistant/ui/

  • Config — config.py, config/config.yaml, config/study_knowledge.yaml

  • IRB/Auditor profile — IRB/Auditor Profile, PHI Handling, Conformance Evidence

Documentation