Agent Instructions (for AI Coding Assistants)
=============================================

This page is the authoritative briefing for AI coding assistants (Claude Code, Copilot CLI,
Codex, Gemini CLI) working on this repository. It supersedes the historical ``AGENTS.md`` at
the repo root, which has been retired (per the directive to keep project documentation in
README and Sphinx). The remainder of this page is organised the way an assistant's
context-builder reads it: orientation → conventions → rules.

Orientation
-----------

Privacy-first, local-first AI Assistant system for clinical research data.

The PHI scrubber (Step 1.6) is an *honest-broker catalog* with eight action classes — keep /
birthdate / drop / cap / generalize / suppress_small_cell / date / id — evaluated in strict
priority order against ~200 Indo-VAP-calibrated rules. See :mod:`scripts.security.phi_scrub`
and ``scripts/security/phi_scrub.yaml``. The HMAC key lives at
``~/.config/report_ai_portal/phi_key`` (outside this repo, never read by agent code).

**Four-tier zone model:**

* **RED** — ``data/raw/``: raw clinical inputs.
* **AMBER** — ``tmp/{STUDY}/``: secure staging (mode 0700, umask 0077, zero-fill teardown;
  optional tmpfs via ``REPORTALIN_TMPFS_STAGING=1``).
* **GREEN** — ``output/{STUDY}/trio_bundle/`` (PHI-free artifacts) + ``output/{STUDY}/agent/``
  (the agent's own state). These two zones form the LLM's read surface, enforced by
  :func:`scripts.ai_assistant.file_access.validate_agent_read`.
* **GREEN-PROTECT** — the agent tool boundary: PHI regex gate plus k-anonymity and
  l-diversity for row-level results before the LLM can answer.

Alongside the four tiers sits the **AUDIT envelope** — ``output/{STUDY}/audit/``: counts-only
IRB evidence, hard-rejected for the agent.

Plus a fifth, **out-of-zone** tier: ``data/snapshots/{STUDY}/`` holds the human-reviewed
cleaned trio-bundle baseline, restored when PDF extraction fails or **Use Existing Study** is
selected. **The LLM cannot read it.**

See :doc:`architecture` for the full architecture. The IRB-grade benchmark lives at
:doc:`../irb_auditor/conformance`.

Quick reference
---------------

.. code-block:: bash

   make sync       # Install all deps (uv sync --all-groups)
   make test       # Deterministic subset excluding AI Assistant construction smokes
   make test-all   # Full suite including AI Assistant construction smokes
   make lint       # ruff check + format
   make ci         # lint → typecheck → test
   make chat       # Launch Streamlit web UI
   make chat-cli   # Launch CLI REPL
   make pipeline   # Full data pipeline (dict → datasets + pdfs → variables.json)

Architecture (two-world)
------------------------

**World 1 — Deterministic Pipeline** (``main.py`` → ``scripts/extraction/``,
``scripts/security/``, ``scripts/utils/``): The three extraction legs (dictionary, datasets,
PDFs) write into a transient staging workspace at ``tmp/{STUDY_NAME}/``. The legs run **in
parallel** on a 3-worker ``concurrent.futures.ThreadPoolExecutor``; the cleanup chain (PHI
scrub / dataset cleanup / cleanup propagation) and Publish + Variables are sequential after
the join. Every extracted row gets a full ``_provenance`` dict (raw_sha256, pipeline_version,
extraction_engine, source_file, sheet_name, row_index, study_name, extraction_utc).

:func:`scripts.security.phi_scrub.run_scrub` (Step 1.6) scrubs staged datasets in place via
the eight action classes in strict priority order **BEFORE** any audit is written, so no raw
PHI lands in ``output/``. ``dataset_cleanup`` (Step 1.7) runs against staged datasets and
emits ``audit/dataset_cleanup_report.json``.
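The fan-out/join shape of World 1 can be summarised in a few lines. This is a hedged sketch,
not the pipeline's real orchestration code — the leg and cleanup-chain callables below are
illustrative stand-ins (``main.py`` and ``scripts/extraction/`` are authoritative):

.. code-block:: python

   from concurrent.futures import ThreadPoolExecutor
   from pathlib import Path

   # Illustrative stand-ins for the three extraction legs and the
   # sequential cleanup chain (Steps 1.6-1.8); the real implementations
   # live under scripts/extraction/ and scripts/security/.
   def extract_dictionary(staging: Path) -> None: ...
   def extract_datasets(staging: Path) -> None: ...
   def extract_pdfs(staging: Path) -> None: ...
   def cleanup_chain(staging: Path) -> None: ...

   def run_extraction(staging: Path) -> None:
       """Fan the three legs out in parallel, join, then clean up sequentially."""
       legs = (extract_dictionary, extract_datasets, extract_pdfs)
       with ThreadPoolExecutor(max_workers=3) as pool:
           futures = [pool.submit(leg, staging) for leg in legs]
           for future in futures:
               future.result()  # join point; re-raises the first leg failure
       cleanup_chain(staging)  # strictly sequential after the join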
:func:`scripts.extraction.cleanup_propagation.run_propagation` (Step 1.8) reads the dataset
audit, computes the pruning set, and rewrites staged dictionary + PDF artifacts.
``_publish_staging`` atomically renames staging → ``trio_bundle/`` (per-leg, with a copytree
fallback across filesystems).
:func:`scripts.extraction.build_variables_reference.build_variables_reference` runs after
publish. **Step 4** emits ``audit/lineage_manifest.json``, pairing every raw input (SHA-256)
with every published trio artifact (SHA-256). On success, staging is **securely removed**
(overwrite + fsync + unlink); on failure, ``tmp/{STUDY_NAME}/`` is preserved for operator
inspection.

**PDF extraction:** the wizard's "Load Study" button selects the orchestrator path
(:mod:`scripts.extraction.pdf_pipeline`). pdfplumber extracts text locally; the text is
PHI-redacted; only redacted text reaches the LLM; the response is re-scrubbed and merged with
the code candidate. When the LLM tier is unavailable for any reason, the orchestrator falls
back per-PDF to ``data/snapshots/{STUDY}/pdfs/`` (the reviewed baseline). If the PDF leg
fails, the pipeline restores the full reviewed baseline over ``trio_bundle/``. The legacy
raw-PDF API path (:mod:`scripts.extraction.extract_pdf_data`) is the CLI default and is gated
by the two-part ``REPORTALIN_PDF_PHI_FREE`` operator attestation.

**World 2 — AI Assistant** (``scripts/ai_assistant/``): LangGraph ReAct agent with 12 tools
for querying study data. It never accesses raw data.

**Output structure:**

* ``output/{STUDY_NAME}/trio_bundle/{datasets,pdfs,dictionary,variables.json}``
* ``output/{STUDY_NAME}/audit/`` — ``{dataset,dictionary,pdfs}_cleanup_report.json``,
  ``phi_scrub_report.json``, ``lineage_manifest.json``, ``telemetry/events.jsonl``
* ``output/{STUDY_NAME}/agent/{analysis,conversations}/``
* Transient staging sibling: ``tmp/{STUDY_NAME}/{datasets,dictionary,pdfs}/``

**Snapshot baseline:** ``data/snapshots/{STUDY_NAME}/{datasets,dictionary,pdfs,variables.json}``
is the single human-reviewed, cleaned trio-bundle baseline per study. The PDF orchestrator
reads it as the per-PDF fallback, and the wizard restores it over the live ``trio_bundle/``
for **Use Existing Study**. **The LLM is forbidden from reading it.** Maintainer protocol:
see :doc:`operations`.

**Wizard step 2:** two top-level buttons — *Use Existing Study* (restore the reviewed
snapshot baseline into the live ``trio_bundle/``) and *Load Study* (run the pipeline
subprocess; the orchestrator falls back to the reviewed snapshot baseline when the PDF leg
cannot produce complete output).

**PHI key:** sidecar at ``~/.config/report_ai_portal/phi_key`` (resolved via
``config.PHI_KEY_PATH``, overridable with ``XDG_CONFIG_HOME``). Mode must be ``0600``.
Missing = hard-fail for developer/operator CLI pipeline runs. Normal users should create it
only through the web UI's **Load Study** flow. Developers can bootstrap it via
``python -m scripts.security.phi_scrub bootstrap-key`` when running the pipeline outside the
web UI. Key rotation = full re-ingestion.
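The key-presence contract above translates to roughly the following check. A minimal sketch
under stated assumptions — ``config.PHI_KEY_PATH`` is real per this page, but the helper name
and exact error messages are illustrative:

.. code-block:: python

   import stat
   import sys
   from pathlib import Path

   def require_phi_key(key_path: Path) -> bytes:
       """Hard-fail unless the HMAC key sidecar exists with mode 0600."""
       if not key_path.is_file():
           sys.exit(
               f"PHI key missing at {key_path}. Create it via the web UI's "
               "Load Study flow, or bootstrap it with "
               "`python -m scripts.security.phi_scrub bootstrap-key`."
           )
       mode = stat.S_IMODE(key_path.stat().st_mode)
       if mode != 0o600:
           sys.exit(f"PHI key at {key_path} has mode {oct(mode)}; expected 0600.")
       return key_path.read_bytes()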
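To close out the two-world picture, here is what a World-2 agent tool looks like under the
conventions detailed below (decorator order, PHI guard, zone chokepoint). A hedged sketch:
the module paths are real per this page, but the tool body, the assumed ``Path`` return of
``validate_agent_read``, and the exact decorator behaviour are assumptions —
``scripts/ai_assistant/agent_tools.py`` is authoritative:

.. code-block:: python

   from langchain_core.tools import tool

   from scripts.ai_assistant.file_access import validate_agent_read
   from scripts.ai_assistant.phi_safe import phi_safe_return

   @tool
   @phi_safe_return
   def read_bundle_artifact(relative_path: str) -> str:
       """Return the text of a published trio-bundle artifact."""
       # Unified chokepoint: raises ZoneViolationError for anything outside
       # trio_bundle/ + agent/ (assumed here to return a vetted Path).
       path = validate_agent_read(relative_path)
       return path.read_text(encoding="utf-8")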
Tech stack
----------

* **Python 3.11+**, **uv** package manager (required)
* **Ruff** linter (line-length=100, see ``pyproject.toml [tool.ruff]``)
* **MyPy** type checker (``ignore_missing_imports=true``)
* **Pytest** (``tests/``, ``@pytest.mark.slow`` for heavy tests)
* **LangChain/LangGraph** for the AI Assistant agent, **Streamlit ≥1.38, <2.0** for the web UI
* Custom type stubs in ``typings/`` for google, anthropic

Critical conventions
--------------------

Security zones (MUST follow)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* **Never access** ``data/raw/`` from agent code — only ``output/{STUDY}/trio_bundle/``.
* Always call :func:`scripts.ai_assistant.file_access.validate_agent_read` or
  ``validate_agent_write`` before any file I/O in tools. This is the unified chokepoint — it
  accepts only ``trio_bundle/`` + ``agent/`` paths and rejects audit, telemetry, staging,
  raw, and arbitrary filesystem paths with ``ZoneViolationError``.
* Route every free-text tool return through :func:`scripts.ai_assistant.phi_safe.guard_text`
  or wrap the tool with ``@phi_safe_return``.
* When surfacing row-level data, call
  :func:`scripts.security.kanon_gate.guard_rows_with_kanon_and_ldiv` first — k=5 + l=2. The
  gate suppresses responses when any quasi-identifier equivalence class has fewer than k
  members or when l-diversity (l=2) on the sensitive attribute is violated.
* When writing pipeline code that logs subject data, install the PHI log redactor via
  :func:`scripts.utils.log_hygiene.install_phi_redactor` so raw ``SUBJID`` / dates / emails /
  Aadhaar / phone numbers never land in ``.logs/*.log``.

Conversational-shortcut guard on fuzzy search tools
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Greetings, acknowledgements, and queries shorter than 3 characters are short-circuited
  *inside* ``search_variables``, ``find_variable_candidates``, and ``search_pdf_context`` via
  ``_query_looks_conversational`` in ``scripts/ai_assistant/agent_tools.py``. The tool
  returns a refusal (``_CONVERSATIONAL_REFUSAL_MESSAGE``) instead of surfacing noisy
  fuzzy-substring hits.
* Paired with a CONVERSATIONAL WORLD section at the top of
  ``scripts/ai_assistant/agent_prompts.py`` that tells the LLM to answer greetings and
  small talk without any tool call.
* This is UX hygiene, **not** a security control. ``phi_safe.guard_user_prompt`` still runs
  on every prompt at the UI + CLI entry points; this guard operates inside the tool so a
  retry-happy agent that calls it anyway gets a clean refusal rather than a name-variable
  paraphrase.
* When adding a new fuzzy search tool, call ``_query_looks_conversational(query)`` and return
  ``_CONVERSATIONAL_REFUSAL_MESSAGE`` on ``True``. Covered by
  ``tests/test_agent_conversational_guard.py``.

Prompt-injection + at-rest defences
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* **Input-side gate.** Every researcher prompt must pass
  :func:`scripts.ai_assistant.phi_safe.guard_user_prompt` before the LLM is invoked. Already
  wired at ``ui/chat.py`` + ``cli.py``.
* **Untrusted text must be wrapped.** Any text surfaced from outside the agent's control (PDF
  extracts, dictionary free text, external vocab) must pass through
  :func:`scripts.ai_assistant.phi_safe.sanitise_untrusted_snippet` before it reaches the LLM.
  Already applied inside ``search_pdf_context``.
* **At-rest redaction.** Any surface that persists user-generated content (conversation
  JSONs, exports, future telemetry sinks) must run content through
  :func:`scripts.ai_assistant.phi_safe.redact_phi_in_text`. Already wired at
  ``conversations.py``'s save / export branches.
* **Traceback surfaces.** Tool error returns, UI error expanders, and telemetry error
  payloads must sanitise with :func:`scripts.ai_assistant.phi_safe.sanitise_traceback`.
  Already wired at ``run_study_analysis`` + ``streaming.py``.
* **Refused-prompt placeholder.** When ``guard_user_prompt`` refuses, the persisted
  conversation must store a category-tagged placeholder (e.g.
  ``"[PHI-REFUSED prompt — AADHAAR]"``), **not** the raw prompt.
* Adding a new agent tool = ``@tool`` → ``@phi_safe_return`` → open with
  ``validate_agent_read(...)`` (or ``validate_agent_write(...)``); see the sketch at the end
  of the Architecture section above. Any deviation fails
  ``tests/test_agent_tools_phi_safe.py`` + ``tests/test_file_access.py``.

KeyStore
~~~~~~~~

* The Streamlit wizard's step 1 routes the pasted API key into
  :mod:`scripts.ai_assistant.keystore` (an in-memory ``KeyStore`` registry) and scrubs the
  corresponding ``*_API_KEY`` from ``os.environ``.
* Keys are re-injected only into the short-lived pipeline subprocess via
  :meth:`KeyStore.env_for_subprocess`.
* Every LLM client constructor (``ChatAnthropic``, ``ChatOpenAI``,
  ``ChatGoogleGenerativeAI``, etc.) takes an explicit ``api_key=`` kwarg sourced from the
  KeyStore — no environment lookup at construction time.

Sandbox
~~~~~~~

``run_python_analysis`` runs in an isolated subprocess. See :doc:`sandbox`. Layered
protections include subprocess isolation; ``RLIMIT_AS`` / ``RLIMIT_NPROC`` / ``RLIMIT_CPU``
rlimits; in-child AST, import, dunder, and builtin guards; a wall-clock timeout; an output
cap; and a figure cap. The generated ``.py`` file is persisted to
``output/{STUDY}/agent/analysis/{ts}.py`` for operator reproduction.

Config
~~~~~~

All paths and settings come from ``config.py`` (env vars + YAML overlay from
``config/config.yaml``). Never hardcode paths — use ``config.TRIO_BUNDLE_DIR``,
``config.TMP_DIR``, etc. Key flags: ``STUDY_NAME``, ``LOG_LEVEL``, ``LOG_VERBOSE`` (see
``.env.example``).

Imports
~~~~~~~

* Use ``from __future__ import annotations`` in all modules.
* Lazy-import optional deps (streamlit, langchain) inside functions.
* First-party packages: ``scripts``, ``config``.

Agent tools
~~~~~~~~~~~

Tools live in ``scripts/ai_assistant/agent_tools.py`` as ``@tool``-decorated functions. The
docstring becomes the agent-visible description. All tools are collected in the ``ALL_TOOLS``
list. Use ``tool_cache`` for memoization.

Web UI
~~~~~~

* ``scripts/ai_assistant/web_ui.py`` is the main Streamlit app.
* UI modules are split into ``scripts/ai_assistant/ui/`` (streaming, conversations,
  providers, session, wizard).
* Sidebar JS in ``scripts/ai_assistant/ui/assets/bridge.js`` — uses ``document`` (not
  ``window.parent.document``).
* Render injected JS bridge surfaces through Streamlit's components API
  (``st.components.v1``) so the hidden bridge stays isolated in its own iframe and compatible
  with Streamlit's current runtime; see the sketch under Web UI architecture below.

Tests
~~~~~

* Fixtures in ``tests/conftest.py`` — use ``tmp_path`` + ``monkeypatch_config`` to isolate.
* Synthetic data helpers: ``_fake_records(n)``, ``synthetic_excel()``.
* Tests requiring LLM/langchain are excluded from ``make test`` (included in
  ``make test-all``).
* Zone markers are patched via ``monkeypatch`` in fixtures.

Web UI architecture
-------------------

The Streamlit web UI implements a Claude Desktop-style dark design language. It is
production-ready, with a setup wizard, conversation history, model switching, and interactive
analysis charts.
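As a reference point for the bridge isolation described under Web UI above, here is a minimal
sketch of mounting ``bridge.js`` through the components API — the asset path and helper name
are assumptions, and the real mount lives in the UI shell modules and may differ:

.. code-block:: python

   from pathlib import Path

   import streamlit.components.v1 as components

   def mount_bridge(assets_dir: Path) -> None:
       """Inject bridge.js inside a sandboxed component iframe."""
       bridge_js = (assets_dir / "bridge.js").read_text(encoding="utf-8")
       # components.html renders in its own iframe, so the script's
       # `document` is the iframe document -- consistent with the
       # bridge.js convention above (no window.parent.document).
       components.html(f"<script>{bridge_js}</script>", height=0)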
UI edit-safe files
~~~~~~~~~~~~~~~~~~

Only these paths may be touched by UI work:

* ``scripts/ai_assistant/web_ui.py``
* ``scripts/ai_assistant/ui/{chat,conversations,model_policy,providers,shell,state,streaming,wizard}.py``
* ``scripts/ai_assistant/ui/assets/{theme.css, bridge.js, fonts/}``
* ``.streamlit/config.toml``
* ``pyproject.toml`` (kaleido pin only)

UI edit-forbidden files (hard stop)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* ``config.py``
* ``scripts/ai_assistant/agent_graph.py`` (read-only; use the three entry points only:
  ``stream_query``, ``invoke_query``, ``reset_agent``)
* ``scripts/ai_assistant/agent_tools.py``, ``agent_prompts.py``, ``analytical_engine.py``,
  ``study_knowledge.py``, ``file_access.py``, ``tool_cache.py``, ``phi_safe.py``, ``cli.py``
* Everything under ``scripts/extraction/``, ``scripts/security/``, ``scripts/utils/``

Design token system
~~~~~~~~~~~~~~~~~~~

All design tokens in ``scripts/ai_assistant/ui/assets/theme.css`` use the ``--rpln-*``
namespace (canonical primary ``:root`` block). New CSS must use these tokens — never the
deprecated backward-compat scales. Categories: colors, spacing, type, line-height, radius,
z-index, easing, durations. ``--rpln-accent-orange`` is a compat alias — use
``--rpln-accent`` instead.

Regression gate (run before every UI commit)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   uv run pytest tests/ -x -q

Any red test in a non-UI module = hard stop. Revert the wave; do not patch the test.

Key files
---------

.. list-table::
   :header-rows: 1

   * - Area
     - Files
   * - Entry point
     - ``main.py``, ``config.py``
   * - Pipeline
     - ``scripts/extraction/dataset_pipeline.py``,
       ``scripts/extraction/build_variables_reference.py``,
       ``scripts/extraction/extract_pdf_data.py``,
       ``scripts/extraction/pdf_pipeline.py`` (orchestrator)
   * - PHI scrub + catalog
     - ``scripts/security/phi_scrub.py``, ``scripts/security/phi_scrub.yaml``
   * - PHI gate + k-anon + allowlist
     - ``scripts/security/phi_gate.py``, ``scripts/security/kanon_gate.py``,
       ``scripts/security/phi_allowlist.py``, ``scripts/security/phi_patterns.py``
   * - Phase-0 staging hardening
     - ``scripts/utils/secure_staging.py``
   * - Integrity + governance
     - ``scripts/utils/lineage.py``, ``scripts/utils/log_hygiene.py``
   * - Zone guards
     - ``scripts/security/secure_env.py``
   * - AI Assistant agent
     - ``scripts/ai_assistant/agent_graph.py``, ``scripts/ai_assistant/agent_tools.py``,
       ``scripts/ai_assistant/agent_prompts.py``, ``scripts/ai_assistant/phi_safe.py``,
       ``scripts/ai_assistant/keystore.py`` (KeyStore)
   * - Sandbox subprocess
     - ``scripts/ai_assistant/sandbox/{replicate,limits,runner}.py`` (subprocess sandbox)
   * - Telemetry
     - ``scripts/utils/telemetry.py``
   * - Web UI
     - ``scripts/ai_assistant/web_ui.py``, ``scripts/ai_assistant/ui/``
   * - Config
     - ``config.py``, ``config/config.yaml``, ``config/study_knowledge.yaml``
   * - IRB/Auditor profile
     - :doc:`../irb_auditor/index`, :doc:`../irb_auditor/phi_handling`,
       :doc:`../irb_auditor/conformance`

Documentation
-------------

* **Architecture** — :doc:`architecture`
* **Testing** — :doc:`testing`
* **Contributing** — :doc:`contributing`
* **Operations** — :doc:`operations` (snapshot maintenance lives here)
* **Sandbox** — :doc:`sandbox`
* **Data pipeline (user view)** — :doc:`../user_guide/data_pipeline`