Tech Stack
==========

Every runtime and development dependency, grouped by role, with one paragraph
each on **what** it is, **why** it was chosen, and **how** the project uses it.
Pinned versions and rationale live in ``pyproject.toml``.

Runtime — language and tooling
------------------------------

Python 3.11+
~~~~~~~~~~~~

**What.** The host language.

**Why.** Required for the clean shutdown semantics of
:mod:`concurrent.futures`, for ``asyncio.timeout``, and for the ``X | Y``
union syntax used throughout the codebase.

**How.** ``pyproject.toml`` pins ``requires-python = ">=3.11"``; the CI matrix
runs against 3.11 / 3.12 / 3.13.

uv
~~

**What.** A Rust-based pip / poetry / pipx replacement.

**Why.** 10-100× faster lockfile resolution; reproducible environments.

**How.** Project-wide convention: ``uv sync --all-groups`` to install,
``uv run`` to invoke. The Makefile assumes ``uv``; CI installs it via the
official ``astral-sh/setup-uv`` action, pinned to an immutable commit SHA.

Ruff
~~~~

**What.** A fast Rust-based Python linter + formatter.

**Why.** A single tool that replaces flake8, isort, and pyupgrade, and
includes the ``S`` (flake8-bandit) security rules.

**How.** Configuration at ``pyproject.toml:215-241`` (the ``[tool.ruff.lint]``
section). ``S101`` (assert) is per-file-ignored for ``tests/`` because bare
``assert`` is the pytest idiom; ``S603`` (subprocess) is allowlisted at our
hardened ``subprocess.run`` callsites with ``# noqa: S603``.

mypy
~~~~

**What.** Static type checker.

**Why.** Catches a class of LLM-flow bugs such as nullable provider names
reaching SDK constructors.

**How.** ``pyproject.toml`` configures ``ignore_missing_imports = true`` so
optional dependencies don't block type-checking; custom stubs live in
``typings/`` for ``google.genai`` and ``anthropic``.

Pytest
~~~~~~

**What.** Test runner.

**Why.** Mature ecosystem, ``conftest.py`` fixtures, deterministic markers.

**How.** :doc:`testing` covers the test-file conventions.
``make test`` runs the deterministic subset that excludes the AI Assistant
construction smokes; ``make test-all`` runs the full suite.

Runtime — pipeline
------------------

pandas
~~~~~~

**What.** Tabular dataframe library.

**Why.** Excel reading, JSONL output, dataset cleanup, k-anonymity
equivalence-class lookups all ride on pandas.

**How.** :mod:`scripts.extraction.dataset_pipeline` reads the raw Excel into a
DataFrame; per-row records are serialised to JSONL with the provenance dict.

openpyxl
~~~~~~~~

**What.** Excel ``.xlsx`` reader/writer.

**Why.** pandas's default ``.xlsx`` engine.

**How.** Used implicitly by ``pd.read_excel`` for the dictionary + dataset
legs.

pypdf
~~~~~

**What.** Lightweight PDF text extractor.

**Why.** Powers the legacy raw-PDF API path.

**How.** Used in :mod:`scripts.extraction.extract_pdf_data` when the operator
opts into the gated raw-PDF API path with the two-part attestation.

pdfplumber
~~~~~~~~~~

**What.** Layout-aware PDF extractor.

**Why.** Per-character bounding boxes give better structure recovery than
``pypdf`` for complex multi-section CRFs.

**How.** pdfplumber is the **always-on code path** inside the two-way PDF
orchestrator (:mod:`scripts.extraction.pdf_pipeline`). Extracted text is
PHI-redacted before any LLM call; the LLM response is merged with the code
candidate via ``_merge``.

PyYAML
~~~~~~

**What.** YAML parser.

**Why.** The PHI scrub catalog (``scripts/security/phi_scrub.yaml``) and the
study-knowledge overlay (``config/study_knowledge.yaml``) ship as YAML so
domain experts can edit without touching code.

**How.** Loaded once at import time; cached.

Runtime — agent
---------------

LangChain + LangGraph
~~~~~~~~~~~~~~~~~~~~~

**What.** LLM-agent framework.

**Why.** ``init_chat_model`` gives provider-agnostic construction (Anthropic /
OpenAI / Google / Ollama / NVIDIA all behind one API); LangGraph's ReAct
prebuilt is the agent topology.
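
A hedged sketch of the provider-prefix mechanism in plain Python (an
illustration of the idea, not LangChain's actual implementation; the wrapper
class paths are the ones the five provider integrations are expected to
expose):

```python
# Illustration only: mimics how a "provider:model" spec routes to a
# per-provider wrapper. Real code calls langchain's init_chat_model().
PROVIDER_WRAPPERS = {
    "anthropic": "langchain_anthropic.ChatAnthropic",
    "openai": "langchain_openai.ChatOpenAI",
    "google_genai": "langchain_google_genai.ChatGoogleGenerativeAI",
    "ollama": "langchain_ollama.ChatOllama",
    "nvidia": "langchain_nvidia_ai_endpoints.ChatNVIDIA",
}

def resolve_wrapper(spec: str) -> tuple[str, str]:
    """Split 'provider:model' and map the prefix to a wrapper class path."""
    provider, _, model = spec.partition(":")
    if provider not in PROVIDER_WRAPPERS:
        raise ValueError(f"unknown provider prefix: {provider!r}")
    return PROVIDER_WRAPPERS[provider], model

# e.g. resolve_wrapper("anthropic:claude-...") yields the Anthropic
# wrapper path plus the model id after the colon.
```
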

**How.** :mod:`scripts.ai_assistant.agent_graph` is the only module that
constructs an LLM client; every client takes ``api_key=`` as an explicit kwarg
sourced from the in-memory KeyStore — no ``os.environ`` lookup at construction
time.

LangChain provider packages
~~~~~~~~~~~~~~~~~~~~~~~~~~~

**What.** Per-provider LangChain integrations: ``langchain-anthropic``,
``langchain-openai``, ``langchain-google-genai``, ``langchain-ollama``,
``langchain-nvidia-ai-endpoints``.

**Why.** Each provider has its own client + auth + retry semantics; the
LangChain wrappers normalise them.

**How.** All five are declared runtime dependencies;
``init_chat_model("anthropic:claude-...")`` dispatches to the right wrapper
based on the provider prefix.

anthropic, google-genai (raw SDKs)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**What.** Provider raw SDKs.

**Why.** The PDF orchestrator's ``_extract_via_llm`` calls the raw SDK
directly because the orchestrator's contract is a single non-streaming JSON
response with PHI-redacted text — the heavier LangChain machinery is overkill
here.

**How.** :func:`scripts.extraction.pdf_pipeline._extract_via_llm` dispatches
on ``provider`` ∈ ``{anthropic, google, gemini, google-genai}``.

Streamlit ≥ 1.38, < 2.0
~~~~~~~~~~~~~~~~~~~~~~~~

**What.** Web UI framework.

**Why.** Fast prototyping; built-in ``session_state`` and chat widgets. The
chat UI intentionally has no file-upload surface; source data enters through
the audited extraction pipeline.

**How.** ``scripts/ai_assistant/web_ui.py`` is the entry point; UI primitives
are factored into
``scripts/ai_assistant/ui/{wizard,chat,conversations,streaming,...}.py``.
Theme + bridge JS live in ``scripts/ai_assistant/ui/assets/``.

Plotly + Kaleido
~~~~~~~~~~~~~~~~

**What.** Interactive charts (Plotly) + headless export (Kaleido).

**Why.** ``run_python_analysis`` renders model output as Plotly figures;
Kaleido exports them as PNG so the persisted analysis ``.py`` file produces
reproducible images on a fresh run.
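
The reproducibility claim rests on re-executing the persisted script in a
fresh interpreter. A minimal stdlib sketch of that pattern (all names and
flags below are illustrative; the project's sandbox adds its own hardening):

```python
import pathlib
import subprocess
import sys
import tempfile

def run_in_child(code: str, timeout: float = 60.0) -> str:
    """Write `code` to a temp .py file and execute it in a fresh child
    interpreter, so the parent process never imports what the script does."""
    with tempfile.TemporaryDirectory() as d:
        script = pathlib.Path(d) / "analysis.py"
        script.write_text(code)
        result = subprocess.run(  # noqa: S603 - hardened callsite pattern
            [sys.executable, "-I", str(script)],  # -I: isolated mode
            capture_output=True, text=True, timeout=timeout, check=True,
        )
    return result.stdout

# The child's stdout (here, a stand-in for "figure written") comes back
# to the parent; a real analysis script would write PNGs via Kaleido.
print(run_in_child("print('ok')"))
```
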

**How.** Used inside the sandbox subprocess child only — the agent's parent
process does not ``import plotly``.

Runtime — security
------------------

scripts.security.* (in-tree)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**What.** The PHI handling surface lives entirely in-tree:

* :mod:`scripts.security.phi_scrub` — 8-action honest-broker catalog
* :mod:`scripts.security.phi_patterns` — shared regex catalog
* :mod:`scripts.security.phi_allowlist` — clinical-phrase exemption
* :mod:`scripts.security.phi_gate` — agent-output gate
* :mod:`scripts.security.kanon_gate` — k-anon (k=5) + l-diversity (l=2)
* :mod:`scripts.security.secure_env` — zone guards

**Why.** No external dependency for PHI handling — auditors can read every
line of the security surface without trusting an upstream maintainer.

**How.** See :doc:`phi_architecture` for the full architecture.

cryptography (HMAC + secure_zero_fill)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**What.** Standard library wrapper for HMAC-SHA256 and secure random.

**Why.** Used for per-subject SANT date jitter and ID pseudonymization.

**How.** :func:`scripts.security.phi_scrub.pseudo_id`,
:func:`scripts.security.phi_scrub.date_offset_days`.

Runtime — observability
-----------------------

Python ``logging`` (with custom redactor)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**What.** Standard logging.

**Why.** Familiar API; the redactor is a single ``logging.Filter`` so we don't
need a logging-framework dependency.

**How.** :func:`scripts.utils.log_hygiene.install_phi_redactor` attaches a
filter to the root logger that scrubs API keys + PHI patterns from every log
line at format time.

structlog (deferred)
~~~~~~~~~~~~~~~~~~~~

**What.** Not currently used.

**Why mentioned.** Open question whether to migrate to ``structlog`` for
structured logging in a future phase; for now, standard logging is sufficient.

Development
-----------

pip-audit
~~~~~~~~~

**What.** Dependency vulnerability scanner.

**Why.** Catches known CVEs in pinned dependencies before they reach
production.

**How.** Runs on demand via ``make security`` and should be included in local
release verification.

Sphinx + sphinx-rtd-theme + myst-parser
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**What.** Documentation generator.

**Why.** RST + autodoc gives a free API reference from docstrings; mature
toctree semantics.

**How.** ``make docs`` builds; ``make docs-quality`` runs the doc-freshness
lint and a ``-W`` (warnings-as-errors) Sphinx rebuild. The CI gate lives at
``.github/workflows/docs-quality-check.yml``.

Custom type stubs
-----------------

``typings/`` ships in-tree stubs for two providers whose upstream typing is
incomplete:

* ``typings/anthropic/`` — covers the raw SDK's ``messages.create`` /
  ``messages.stream`` surface used by the PDF orchestrator + legacy raw-PDF
  path.
* ``typings/google/`` — covers the
  ``google.genai.Client.models.generate_content`` surface.

The ``mypy`` config picks up ``typings/`` automatically via ``mypy_path``.

Pinning policy
--------------

* **Major versions capped with caret-style bounds** for runtime deps that the
  agent talks to (LangChain, Anthropic, Google) — e.g.
  ``langchain>=1.0.0,<2.0.0``. Reason: provider APIs evolve; we catch the v2
  break in CI before it reaches production.
* **Streamlit** pinned to ``>=1.38, <2.0`` because ``st.session_state``
  semantics changed materially across major versions.
* **All other deps** constrained with ``>=`` floors only; ``uv.lock`` records
  the resolved versions reproducibly.

Where this is enforced: ``pyproject.toml`` (top-level + dev / test / docs
optional groups). The lockfile (``uv.lock``) is the source of truth for the
installed tree.
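
As a sketch, the policy above maps onto ``pyproject.toml`` constraints roughly
like this (only the pins quoted in this section are shown; the fragment is
illustrative, not the project's actual dependency list):

```toml
# Illustrative fragment — real specifiers live in pyproject.toml.
[project]
requires-python = ">=3.11"
dependencies = [
    "langchain>=1.0.0,<2.0.0",  # major capped: catch the v2 break in CI
    "streamlit>=1.38,<2.0",     # session_state changed across majors
    # everything else: plain ">=" floors; uv.lock records resolved versions
]
```
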