PDF Extraction ============== The PDF orchestrator is the primary path; the legacy raw-PDF API path is documented as the back-compat fallback. Two co-existing extraction paths -------------------------------- Annotated CRF PDFs may carry filled-in patient data, signatures, or example subject IDs in annotations — they are presumed PHI-bearing. The pipeline ships two paths with very different egress postures: .. list-table:: :header-rows: 1 :widths: 22 39 39 * - Path - Module - Egress posture * - **Orchestrator (default)** - :mod:`scripts.extraction.pdf_pipeline` - **No raw PDF bytes leave the host.** Text is extracted locally via ``pdfplumber``, PHI-redacted in place, then sent to the LLM as redacted text only. * - **Legacy raw-PDF API** - :mod:`scripts.extraction.extract_pdf_data` - **Raw PDF bytes are base64-encoded and shipped to the provider.** Refused unless the operator opts in twice (``REPORTALIN_PDF_PHI_FREE=1`` env flag *plus* a non-empty attestation note at ``authorities/phi_free_pdfs.md``). The wizard's "Load Study" button selects the orchestrator. The legacy path is the CLI default for back-compat. Dispatch -------- Both paths route through :func:`scripts.extraction.extract_pdf_data.extract_pdfs_to_jsonl`, which checks the ``REPORTALIN_PDF_EXTRACTION_MODE`` env var: * ``llm`` → orchestrator path (the wizard always sets this). * ``snapshot`` → publish ``data/snapshots/{STUDY}/pdfs/`` baseline verbatim, no LLM call. * unset (CLI default) → legacy raw-PDF API path with the two-part attestation gate. Orchestrator path (the default) ------------------------------- For each PDF, the orchestrator runs: 1. **Code path (always).** ``pdfplumber.open()`` extracts page text + a heuristic candidate via ``_candidate_from_text``. 2. **Capability + provider gate.** :func:`scripts.utils.llm_capabilities.is_capable_model` enforces a model allowlist (Claude Opus/Sonnet 4.6+, GPT-5+, Gemini 2.5 Pro, Llama 3.3 405B). The default allowlist is hardcoded; operators can override with ``REPORTALIN_PDF_LLM_CAPABLE_MODELS`` (comma-separated prefix list) — for example to validate a local Ollama model. :data:`scripts.extraction.pdf_pipeline.ORCHESTRATOR_SUPPORTED_PROVIDERS` restricts the LLM tier to ``anthropic`` + ``google`` (where the actual client wiring exists). 3. **Redact-then-call.** Extracted text is scrubbed via ``phi_patterns.BLOCKING_PATTERNS``; a defensive ``_assert_no_raw_phi_in_payload`` re-checks and raises if any blocking pattern survives. Only the redacted text reaches the LLM. 4. **Merge.** :func:`scripts.extraction.pdf_pipeline._merge` reconciles the LLM response with the code candidate (LLM wins on field-level conflicts; code fills in vars the LLM missed). The LLM response is also re-scrubbed via :func:`scripts.ai_assistant.phi_safe.guard_text` before merge. 5. **Idempotent cache.** Keyed on ``SHA-256(pdf_bytes || provider || model || phi_scrub.yaml hash)``; stored at ``tmp/{STUDY}/.pdf_cache/`` (mode ``0600``). Editing the PHI scrub config invalidates every cache entry. 6. **Snapshot fallback (per-PDF).** When any of (capable model, API key, code-path text non-empty, LLM call success) is missing, the orchestrator publishes the reviewed baseline at ``data/snapshots/{STUDY}/pdfs/{stem}_variables.json`` instead. **Code-only output is never an acceptable result** — heuristic-only metadata is too unreliable to publish without LLM oversight. Output JSON carries an ``extraction_tier`` field (``merged`` / ``llm`` / ``snapshot`` / ``empty``) so any reader can tell which path produced it. The orchestrator never reads raw inputs other than the PDF file itself. It does not touch ``data/raw/{STUDY}/datasets/`` or ``data/raw/{STUDY}/data_dictionary/``. Capability gate details ----------------------- The capability gate is intentionally conservative. The default allowlist is: * **Anthropic** — ``claude-opus-4-6+``, ``claude-sonnet-4-6+`` (older Sonnet struggles on multi-section CRFs) * **OpenAI** — ``gpt-5+`` (GPT-4 family is borderline on complex CRFs) * **Google** — ``gemini-2.5-pro+`` (Flash is excluded — good for chat, weaker on table-heavy PDFs) * **NVIDIA NIM** — ``meta/llama-3.3-405b-instruct+`` only **Ollama is excluded by default** regardless of model name — historically local Ollama models cannot sustain a JSON-schema response on a 30-page CRF. Operator opt-in via ``REPORTALIN_PDF_LLM_CAPABLE_MODELS`` if you've validated a specific local model. Note that the wizard's UI also enforces :data:`ORCHESTRATOR_SUPPORTED_PROVIDERS` (anthropic + google), which is narrower than the capability allowlist. An OpenAI or NVIDIA model passes ``is_capable_model`` but the wizard will not surface "Load Study" as an option for it — the underlying ``_extract_via_llm`` only has client wiring for anthropic + google. This prevents a silent fall-through to snapshot. Legacy raw-PDF API path ----------------------- When ``REPORTALIN_PDF_EXTRACTION_MODE`` is unset (CLI default for back-compat), :func:`scripts.extraction.extract_pdf_data._resolve_pdf_provider` runs a two-part attestation gate: 1. Env flag ``REPORTALIN_PDF_PHI_FREE=1`` set in the runtime env. 2. Non-empty ``authorities/phi_free_pdfs.md`` operator-signed attestation note (under version control, audit-trail material). If either is missing, the function raises ``ValueError`` with a remediation message listing three alternatives: a. Flip both attestations if the source PDFs are verified PHI-free (blank CRFs, protocol-only, MOP). b. Use ``--pdf-source `` with pre-extracted JSON files (no LLM call). c. Skip the PDF leg entirely; the pipeline succeeds without it. The legacy path then base64-encodes the entire PDF and ships it to Anthropic's ``messages.stream`` or Google Gemini's ``generate_content`` for native PDF-document understanding. No redaction is performed on the legacy path — the operator's attestation is the safety story. Output schema (per-form) ------------------------ Both paths write per-form JSON files at ``tmp/{STUDY}/pdfs/{stem}_variables.json`` (later atomically promoted to ``trio_bundle/pdfs/``): .. code-block:: json { "form_name": "Form 1A - Index Case Screening", "source_pdf": "form_1a_index_screening.pdf", "version": "v1.0", "summary": "", "extraction_tier": "merged", "variables": { "ABBREVIATION": { "description": "...", "values": {"1": "Yes", "2": "No"}, "depends_on": "PARENT_ABBREVIATION_OR_NULL", "condition": "Human-readable activation rule", "section_context": "" } }, "sections": { "SECTION_NAME": { "context": "", "variables": ["ABBREV1", "ABBREV2"] } } } Within-file dedup (case-insensitive collision handling) and cross-form dedup run after extraction; both are pure-function helpers in :mod:`scripts.extraction.dedup`. CLI usage --------- .. code-block:: bash # Via Makefile (recommended) make pdf-extract # Via Python (direct module entry point — legacy path) uv run python -m scripts.extraction.extract_pdf_data # Force orchestrator from CLI REPORTALIN_PDF_EXTRACTION_MODE=llm uv run python main.py --pipeline # Force snapshot mode (no LLM, no API egress) REPORTALIN_PDF_EXTRACTION_MODE=snapshot uv run python main.py --pipeline # Skip the PDF leg via pre-extracted JSON uv run python main.py --pipeline --pdf-source /path/to/jsons/ Key files --------- * :mod:`scripts.extraction.pdf_pipeline` — the orchestrator. Two-way merge, capability gate, idempotent cache, snapshot fallback. * :mod:`scripts.extraction.extract_pdf_data` — the legacy raw-PDF API path; also hosts the dispatcher (``extract_pdfs_to_jsonl``). * :mod:`scripts.utils.llm_capabilities` — model-name allowlist (``DEFAULT_CAPABLE_MODEL_PREFIXES``, :func:`scripts.utils.llm_capabilities.is_capable_model`). * :mod:`scripts.security.phi_patterns` — the BLOCKING_PATTERNS used by the orchestrator's redaction step (and by the agent-output PHI gate). * ``config.STUDY_SNAPSHOTS_DIR`` — the reviewed baseline location (``data/snapshots/{STUDY}/``); see :doc:`operations` for the maintenance protocol. Testing ------- PHI-critical tests for this surface: * ``tests/security/test_pdf_redaction_pipeline.py`` — pre-LLM redaction, post-LLM re-scrub, merge contract, idempotent cache, two-way decision (LLM-merged or snapshot, never code-only). * ``tests/security/test_llm_capabilities.py`` — capability allowlist, Ollama-excluded-by-default, env-override semantics. * ``tests/test_pdf_phi_flag.py`` — legacy two-part attestation gate. * ``tests/test_extract_pdf_data.py`` — dispatcher mode selection. Downstream usage ---------------- * :func:`scripts.extraction.build_variables_reference.build_variables_reference` merges PDF extractions with the data dictionary into ``trio_bundle/variables.json`` (the consolidated schema the agent uses to validate variable names). * The agent's ``cohort_builder`` and ``query_dataset`` tools read ``variables.json`` plus per-form ``*_variables.json`` to map user-friendly names back to dataset columns. Licensing --------- PDF extraction uses open-source libraries (``pypdf``, ``pdfplumber``) under permissive licenses. No proprietary PDF tools are required.