Agent Instructions (for AI Coding Assistants)
This page is the authoritative briefing for AI coding assistants
(Claude Code, Copilot CLI, Codex, Gemini CLI) working on this
repository. It supersedes the historical AGENTS.md at the repo
root, which has been retired (per the directive to keep project
documentation in README and Sphinx).
The remainder of this page is organised the way an assistant’s context-builder reads it: orientation → conventions → rules.
Orientation
Privacy-first, local-first AI Assistant system for clinical research
data. The PHI scrubber (Step 1.6) is an honest-broker catalog with
eight action classes — keep / birthdate / drop / cap / generalize /
suppress_small_cell / date / id — evaluated in strict priority order
against ~200 Indo-VAP-calibrated rules. See
scripts.security.phi_scrub and scripts/security/phi_scrub.yaml.
The HMAC key lives at ~/.config/report_ai_portal/phi_key (outside
this repo, never read by agent code).
Four-tier zone model:
- RED — data/raw/: raw clinical inputs.
- AMBER — tmp/{STUDY}/: secure staging (mode 0700, umask 0077, zero-fill teardown; optional tmpfs via REPORTALIN_TMPFS_STAGING=1).
- GREEN — output/{STUDY}/trio_bundle/ (PHI-free artifacts) + output/{STUDY}/agent/ (the agent’s own state). These two zones form the LLM’s read surface, enforced by scripts.ai_assistant.file_access.validate_agent_read().
- GREEN-PROTECT — the agent tool boundary: PHI regex gate plus k-anonymity and l-diversity for row-level results before the LLM can answer.
- AUDIT envelope — output/{STUDY}/audit/: counts-only IRB evidence, hard-rejected for the agent.
Plus a fifth, out-of-zone tier: data/snapshots/{STUDY}/ holds the
human-reviewed cleaned trio bundle baseline restored when PDF
extraction fails or Use Existing Study is selected. The LLM
cannot read it.
See Architecture for the full architecture. The IRB-grade benchmark lives at Conformance Evidence.
Quick reference
make sync # Install all deps (uv sync --all-groups)
make test # Deterministic subset excluding AI Assistant construction smokes
make test-all # Full suite including AI Assistant construction smokes
make lint # ruff check + format
make ci # lint → typecheck → test
make chat # Launch Streamlit web UI
make chat-cli # Launch CLI REPL
make pipeline # Full data pipeline (dict → datasets + pdfs → variables.json)
Architecture (two-world)
World 1 — Deterministic Pipeline (main.py →
scripts/extraction/, scripts/security/, scripts/utils/):
The three extraction legs (dictionary, datasets, PDFs) write into a
transient staging workspace at tmp/{STUDY_NAME}/. The three legs
run in parallel on a 3-worker
concurrent.futures.ThreadPoolExecutor; the cleanup chain (PHI
scrub / dataset cleanup / cleanup propagation) and Publish + Variables
are sequential after the join.
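A minimal sketch of that scheduling, using hypothetical stand-ins for the leg callables (only the 3-worker pool and the sequential cleanup after the join are taken from this page):
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real dictionary/datasets/PDF legs.
def extract_dictionary(staging): return "dictionary ok"
def extract_datasets(staging): return "datasets ok"
def extract_pdfs(staging): return "pdfs ok"

def run_extraction(staging_dir):
    legs = {"dictionary": extract_dictionary, "datasets": extract_datasets, "pdfs": extract_pdfs}
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {name: pool.submit(fn, staging_dir) for name, fn in legs.items()}
        results = {name: fut.result() for name, fut in futures.items()}  # join point
    # The cleanup chain (PHI scrub, dataset cleanup, propagation) and Publish + Variables
    # run sequentially only after this join.
    return results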
Every extracted row gets a full _provenance dict (raw_sha256,
pipeline_version, extraction_engine, source_file, sheet_name,
row_index, study_name, extraction_utc).
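For illustration, a provenance entry with those fields might look like this (all values are invented; only the field names come from this page):
example_provenance = {
    "raw_sha256": "0f9b0c6f…",                    # illustrative digest
    "pipeline_version": "0.0.0",                  # illustrative
    "extraction_engine": "openpyxl",              # illustrative
    "source_file": "data/raw/EXAMPLE_STUDY/visits.xlsx",
    "sheet_name": "Sheet1",
    "row_index": 42,
    "study_name": "EXAMPLE_STUDY",
    "extraction_utc": "2025-01-01T00:00:00Z",
}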
scripts.security.phi_scrub.run_scrub() (Step 1.6) scrubs staged
datasets in place via the eight action classes in strict priority
order BEFORE any audit is written so no raw PHI lands in
output/. dataset_cleanup (Step 1.7) runs against staged
datasets and emits audit/dataset_cleanup_report.json.
scripts.extraction.cleanup_propagation.run_propagation()
(Step 1.8) reads the dataset audit, computes the pruning set, and
rewrites staged dictionary + PDF artifacts. _publish_staging
atomically renames staging → trio_bundle/ (per-leg, copytree
fallback across filesystems).
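A minimal sketch of the per-leg publish step, assuming simple Path arguments (the real _publish_staging may differ in detail):
import shutil
from pathlib import Path

def publish_leg(staged: Path, published: Path) -> None:
    published.parent.mkdir(parents=True, exist_ok=True)
    try:
        staged.replace(published)                  # atomic rename on the same filesystem
    except OSError:                                # e.g. staging and output on different filesystems
        shutil.copytree(staged, published, dirs_exist_ok=True)
        shutil.rmtree(staged)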
scripts.extraction.build_variables_reference.build_variables_reference()
runs after publish. Step 4 emits audit/lineage_manifest.json
pairing every raw input (SHA-256) with every published trio artifact
(SHA-256). On success, staging is securely removed (overwrite +
fsync + unlink); on failure, tmp/{STUDY_NAME}/ is preserved for
operator inspection.
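A sketch of the overwrite + fsync + unlink teardown for a single staged file; the real staging-hardening code is the authority, this only illustrates the ordering:
import os
from pathlib import Path

def secure_remove(path: Path) -> None:
    size = path.stat().st_size
    with open(path, "r+b") as fh:
        fh.write(b"\x00" * size)     # overwrite the old contents with zeros
        fh.flush()
        os.fsync(fh.fileno())        # push the zeros to disk before removal
    path.unlink()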
PDF extraction: the wizard’s “Load Study” button selects the orchestrator path
(scripts.extraction.pdf_pipeline). pdfplumber extracts text
locally; the text is PHI-redacted; only redacted text reaches the LLM;
the response is re-scrubbed and merged with the code candidate. When
the LLM tier is unavailable for any reason, the orchestrator falls
back per-PDF to data/snapshots/{STUDY}/pdfs/ (the reviewed
baseline). If the PDF leg fails, the pipeline restores the full
reviewed baseline over trio_bundle/.
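The same control flow as a sketch with hypothetical helper stubs (only the ordering is from this page: extract locally, redact, call the LLM, re-scrub, merge, and fall back to the snapshot when the LLM tier is unavailable):
class LLMUnavailableError(RuntimeError): ...

# Hypothetical stand-ins for the real helpers.
def extract_text_locally(pdf): return "redacted-ready text"
def redact_phi(text): return text
def ask_llm(text): raise LLMUnavailableError()           # simulate the LLM tier being down
def load_snapshot_pdf(study, pdf): return {"source": "snapshot baseline"}
def rescrub(answer): return answer
def code_candidate(pdf): return {}
def merge(a, b): return {**b, **a}

def extract_one_pdf(pdf_path, study):
    redacted = redact_phi(extract_text_locally(pdf_path))   # only redacted text may reach the LLM
    try:
        answer = ask_llm(redacted)
    except LLMUnavailableError:
        return load_snapshot_pdf(study, pdf_path)            # per-PDF fallback to data/snapshots/{STUDY}/pdfs/
    return merge(rescrub(answer), code_candidate(pdf_path))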
The legacy raw-PDF API path (scripts.extraction.extract_pdf_data)
is the CLI default and is gated by the two-part
REPORTALIN_PDF_PHI_FREE operator attestation.
World 2 — AI Assistant (scripts/ai_assistant/):
LangGraph ReAct agent with 12 tools for querying study data. Never
accesses raw data.
Output structure:
- output/{STUDY_NAME}/trio_bundle/{datasets,pdfs,dictionary,variables.json}
- audit/{dataset,dictionary,pdfs}_cleanup_report.json + audit/phi_scrub_report.json + audit/lineage_manifest.json + audit/telemetry/events.jsonl
- agent/{analysis,conversations}/
- transient staging sibling: tmp/{STUDY_NAME}/{datasets,dictionary,pdfs}/
Snapshot baseline: data/snapshots/{STUDY_NAME}/{datasets,dictionary,pdfs,variables.json}
is the human-reviewed, single per-study cleaned trio bundle. The PDF
orchestrator reads it as the per-PDF fallback, and the wizard restores
it over the live trio_bundle/ for Use Existing Study. The LLM is
forbidden from reading it. Maintainer protocol: see
Operations.
Wizard step 2: two top-level buttons — Use
Existing Study (restore the reviewed snapshot baseline into the live
trio_bundle/) and Load Study (run the pipeline subprocess;
orchestrator falls back to the reviewed snapshot baseline when the
PDF leg cannot produce complete output).
PHI key: sidecar at ~/.config/report_ai_portal/phi_key
(resolved via config.PHI_KEY_PATH, overridable with
XDG_CONFIG_HOME). Mode must be 0600. Missing = hard-fail for
developer/operator CLI pipeline runs. Normal users should create it only
through the web UI’s Load Study flow. Developers can bootstrap it via
python -m scripts.security.phi_scrub bootstrap-key when running the
pipeline outside the web UI. Key rotation = full re-ingestion.
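A sketch of the hard-fail check, assuming config.PHI_KEY_PATH behaves like a pathlib.Path (the real pipeline code is the authority):
import stat
import sys
import config

def require_phi_key() -> bytes:
    key_path = config.PHI_KEY_PATH
    if not key_path.exists():
        sys.exit("PHI key missing: create it via the web UI or "
                 "python -m scripts.security.phi_scrub bootstrap-key")
    mode = stat.S_IMODE(key_path.stat().st_mode)
    if mode != 0o600:
        sys.exit(f"PHI key at {key_path} must be mode 0600, found {oct(mode)}")
    return key_path.read_bytes()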
Tech stack
- Python 3.11+, uv package manager (required)
- Ruff linter (line-length=100, see pyproject.toml [tool.ruff])
- MyPy type checker (ignore_missing_imports=true)
- Pytest (tests/, @pytest.mark.slow for heavy tests)
- LangChain/LangGraph for the AI Assistant agent; Streamlit ≥1.38, <2.0 for the web UI
- Custom type stubs in typings/ for google, anthropic
Critical conventions
Security zones (MUST follow)
- Never access data/raw/ from agent code — only output/{STUDY}/trio_bundle/.
- Always call scripts.ai_assistant.file_access.validate_agent_read() or validate_agent_write before any file I/O in tools. This is the unified chokepoint — it accepts only trio_bundle/ + agent/ paths and rejects audit, telemetry, staging, raw, and arbitrary filesystem paths with ZoneViolationError.
- Route every free-text tool return through scripts.ai_assistant.phi_safe.guard_text() or wrap the tool with @phi_safe_return.
- When surfacing row-level data, call scripts.security.kanon_gate.guard_rows_with_kanon_and_ldiv() first — k=5 + l=2. The gate suppresses responses when any quasi-identifier equivalence class has fewer than k members or when l-diversity (l=2) on the sensitive attribute is violated (a combined call sketch follows this list).
- When writing pipeline code that logs subject data, install the PHI log redactor via scripts.utils.log_hygiene.install_phi_redactor() so raw SUBJID / dates / emails / Aadhaar / phone numbers never land in .logs/*.log.
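A sketch of a compliant read-and-return path inside a tool body. The exact signatures of validate_agent_read, guard_text, and guard_rows_with_kanon_and_ldiv are assumptions here; the chokepoints themselves are the rule:
import json
from scripts.ai_assistant.file_access import validate_agent_read
from scripts.ai_assistant.phi_safe import guard_text
from scripts.security.kanon_gate import guard_rows_with_kanon_and_ldiv

def preview_dataset(relative_path: str) -> str:
    path = validate_agent_read(relative_path)                  # assumed to return the validated path; rejects anything outside trio_bundle/ + agent/
    rows = json.loads(path.read_text(encoding="utf-8"))
    gated = guard_rows_with_kanon_and_ldiv(rows)               # k=5 / l=2 suppression before any row-level output
    return guard_text(json.dumps(gated, ensure_ascii=False))   # free-text gate on the way out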
Conversational-shortcut guard on fuzzy search tools
- Greetings / acknowledgements / queries shorter than 3 chars are short-circuited inside search_variables, find_variable_candidates, and search_pdf_context via _query_looks_conversational in scripts/ai_assistant/agent_tools.py. The tool returns a refusal (_CONVERSATIONAL_REFUSAL_MESSAGE) instead of surfacing noisy fuzzy-substring hits.
- Paired with a CONVERSATIONAL WORLD section at the top of scripts/ai_assistant/agent_prompts.py that tells the LLM to answer greetings / small-talk without any tool call.
- This is UX hygiene, not a security control. phi_safe.guard_user_prompt still runs on every prompt at UI + CLI entry; this guard operates inside the tool so a retry-happy agent that tries to call it anyway gets a clean refusal rather than a name-variable paraphrase.
- When adding a new fuzzy search tool, call _query_looks_conversational(query) and return _CONVERSATIONAL_REFUSAL_MESSAGE on True (see the sketch after this list). Covered by tests/test_agent_conversational_guard.py.
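Inside scripts/ai_assistant/agent_tools.py, where _query_looks_conversational and _CONVERSATIONAL_REFUSAL_MESSAGE already live, the wiring for a new fuzzy search tool would look roughly like this (the tool name and matcher are invented):
def search_lab_panels(query: str) -> str:           # hypothetical fuzzy search tool body
    if _query_looks_conversational(query):
        return _CONVERSATIONAL_REFUSAL_MESSAGE      # clean refusal instead of noisy substring hits
    return _fuzzy_match(query)                      # hypothetical matcher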
Prompt-injection + at-rest defences
- Input-side gate. Every researcher prompt must pass scripts.ai_assistant.phi_safe.guard_user_prompt() before the LLM is invoked. Already wired at ui/chat.py + cli.py.
- Untrusted text must be wrapped. Any text surfaced from outside the agent’s control (PDF extracts, dictionary free-text, external vocab) must pass through scripts.ai_assistant.phi_safe.sanitise_untrusted_snippet() before it reaches the LLM. Already applied inside search_pdf_context.
- At-rest redaction. Any surface that persists user-generated content (conversation JSONs, exports, future telemetry sinks) must run content through scripts.ai_assistant.phi_safe.redact_phi_in_text(). Already wired at conversations.py’s save / export branches.
- Traceback surfaces. Tool error returns, UI error expanders, and telemetry error payloads must sanitise with scripts.ai_assistant.phi_safe.sanitise_traceback(). Already wired at run_study_analysis + streaming.py.
- Refused-prompt placeholder. When guard_user_prompt refuses, the persisted conversation must store a category-tagged placeholder (e.g. "[PHI-REFUSED prompt — AADHAAR]"), not the raw prompt.
- Adding a new agent tool = @tool → @phi_safe_return → open with validate_agent_read(...) (or validate_agent_write(...)); a skeleton follows this list. Any deviation fails tests/test_agent_tools_phi_safe.py + tests/test_file_access.py.
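A skeleton of that pattern. The tool name, its argument, and the LangChain import location are assumptions; the decorator order and the chokepoint calls are the rule:
from langchain_core.tools import tool                    # assumed import path for the @tool decorator
from scripts.ai_assistant.file_access import validate_agent_read
from scripts.ai_assistant.phi_safe import phi_safe_return

@tool
@phi_safe_return
def describe_bundle_file(relative_path: str) -> str:
    """Describe one trio_bundle artifact in PHI-safe terms."""   # docstring = agent-visible description
    path = validate_agent_read(relative_path)                    # raises ZoneViolationError outside trio_bundle/ + agent/
    text = path.read_text(encoding="utf-8")
    return f"{relative_path}: {len(text.splitlines())} lines"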
KeyStore
- The Streamlit wizard’s step 1 routes the pasted API key into scripts.ai_assistant.keystore (an in-memory KeyStore registry) and scrubs the corresponding *_API_KEY from os.environ.
- Keys are re-injected only into the short-lived pipeline subprocess via KeyStore.env_for_subprocess().
- Every LLM client constructor (ChatAnthropic, ChatOpenAI, ChatGoogleGenerativeAI, etc.) takes an explicit api_key= kwarg sourced from the KeyStore — no environment lookup at construction time (see the sketch after this list).
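Sketched usage under the assumption that the KeyStore exposes a simple getter; only env_for_subprocess() and the explicit api_key= rule are taken from this page, the getter name and model id are invented:
import subprocess
from langchain_anthropic import ChatAnthropic                    # assumed client import path
from scripts.ai_assistant.keystore import KeyStore               # registry named on this page

api_key = KeyStore.get("anthropic")                               # hypothetical getter; check the real KeyStore API
llm = ChatAnthropic(model="example-model-id", api_key=api_key)    # explicit api_key=, never an os.environ lookup
subprocess.run(["uv", "run", "python", "main.py"],
               env=KeyStore.env_for_subprocess(), check=True)     # keys re-injected only for the pipeline subprocess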
Sandbox
run_python_analysis runs in an isolated subprocess. See
Sandbox: Subprocess-Isolated Code Execution. Layered protections include subprocess isolation,
RLIMIT_AS / RLIMIT_NPROC / RLIMIT_CPU rlimits, in-child
AST + import + dunder + builtin guards, wall-clock timeout, output
cap, figure cap. The generated .py file is persisted to
output/{STUDY}/agent/analysis/{ts}.py for operator reproduction.
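A sketch of the in-child resource limits, with illustrative numbers (the project's actual caps live in the sandbox module):
import resource
import subprocess

def _apply_child_limits() -> None:                               # passed as preexec_fn so limits apply only inside the child
    resource.setrlimit(resource.RLIMIT_AS, (1 << 30, 1 << 30))   # 1 GiB address space (illustrative)
    resource.setrlimit(resource.RLIMIT_NPROC, (32, 32))          # process/thread cap (illustrative)
    resource.setrlimit(resource.RLIMIT_CPU, (30, 30))            # 30 CPU-seconds (illustrative)

subprocess.run(["python", "analysis.py"], preexec_fn=_apply_child_limits, timeout=60)  # wall-clock timeout; script path illustrative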
Config
All paths and settings come from config.py (env vars + YAML
overlay from config/config.yaml). Never hardcode paths — use
config.TRIO_BUNDLE_DIR, config.TMP_DIR, etc.
Key flags: STUDY_NAME, LOG_LEVEL, LOG_VERBOSE (see
.env.example).
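In code, that rule looks like this (TRIO_BUNDLE_DIR and TMP_DIR are named on this page and assumed to behave like pathlib.Path; STUDY_NAME as a config attribute is an assumption):
import config

datasets_dir = config.TRIO_BUNDLE_DIR / "datasets"    # derived from config, never a hardcoded "output/..." string
staging_dir = config.TMP_DIR / config.STUDY_NAME      # STUDY_NAME is the flag documented in .env.example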
Imports
- Use from __future__ import annotations in all modules.
- Lazy-import optional deps (streamlit, langchain) inside functions (see the preamble sketch after this list).
- First-party packages: scripts, config.
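A module preamble following both conventions; the function body is illustrative:
from __future__ import annotations

def render_status_panel(message: str) -> None:
    import streamlit as st          # optional dependency imported lazily, inside the function
    st.write(message)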
Agent tools
Tools live in scripts/ai_assistant/agent_tools.py as
@tool-decorated functions. The docstring becomes the
agent-visible description. All tools are collected in the ALL_TOOLS
list. Use tool_cache for memoization.
Web UI
- scripts/ai_assistant/web_ui.py is the main Streamlit app.
- UI modules are split into scripts/ai_assistant/ui/ (streaming, conversations, providers, session, wizard).
- Sidebar JS lives in scripts/ai_assistant/ui/assets/bridge.js — it uses document (not window.parent.document).
- Use st.iframe() for injected JS bridge surfaces so the hidden bridge stays isolated and compatible with Streamlit’s current runtime.
Tests
- Fixtures in tests/conftest.py — use tmp_path + monkeypatch_config to isolate.
- Synthetic data helpers: _fake_records(n), synthetic_excel().
- Tests requiring LLM/langchain are excluded from make test (included in make test-all).
- Zone markers are patched via monkeypatch in fixtures.
Web UI architecture
The Streamlit web UI implements a Claude Desktop-style dark design language. It is production-ready with a setup wizard, conversation history, model switching, and interactive analysis charts.
UI edit-safe files
Only these paths may be touched by UI work:
- scripts/ai_assistant/web_ui.py
- scripts/ai_assistant/ui/{chat,conversations,model_policy,providers,shell,state,streaming,wizard}.py
- scripts/ai_assistant/ui/assets/{theme.css, bridge.js, fonts/}
- .streamlit/config.toml
- pyproject.toml (kaleido pin only)
UI edit-forbidden files (hard stop)
- config.py
- scripts/ai_assistant/agent_graph.py (read-only; use the three entry points only: stream_query, invoke_query, reset_agent)
- scripts/ai_assistant/agent_tools.py, agent_prompts.py, analytical_engine.py, study_knowledge.py, file_access.py, tool_cache.py, phi_safe.py, cli.py
- Everything under scripts/extraction/, scripts/security/, scripts/utils/
Design token system
All design tokens in scripts/ai_assistant/ui/assets/theme.css use
the --rpln-* namespace (canonical primary :root block). New
CSS must use these — never the deprecated backward-compat scales.
Categories: colors, spacing, type, line-height, radius, z-index,
easing, durations. --rpln-accent-orange is a compat alias — use
--rpln-accent instead.
Regression gate (run before every UI commit)
uv run pytest tests/ -x -q
Any red test in a non-UI module = hard stop. Revert the wave, do not patch the test.
Key files
| Area | Files |
|---|---|
| Entry point | main.py |
| Pipeline | scripts/extraction/, scripts/security/, scripts/utils/ |
| PHI scrub + catalog | scripts/security/phi_scrub.py, scripts/security/phi_scrub.yaml |
| PHI gate + k-anon + allowlist | scripts/ai_assistant/phi_safe.py, scripts/security/kanon_gate.py |
| Phase-0 staging hardening | |
| Integrity + governance | |
| Zone guards | scripts/ai_assistant/file_access.py |
| AI Assistant agent | scripts/ai_assistant/agent_graph.py, agent_tools.py, agent_prompts.py |
| Sandbox subprocess | |
| Telemetry | |
| Web UI | scripts/ai_assistant/web_ui.py, scripts/ai_assistant/ui/ |
| Config | config.py, config/config.yaml |
| IRB/Auditor profile | |
Documentation
Architecture — Architecture
Testing — Testing
Contributing — Contributing
Operations — Operations (snapshot maintenance lives here)
Data pipeline (user view) — Data Pipeline