PDF Extraction

The PDF orchestrator is the primary path; the legacy raw-PDF API path remains available as a back-compat fallback.

Two co-existing extraction paths

Annotated CRF PDFs may carry filled-in patient data, signatures, or example subject IDs in annotations — they are presumed PHI-bearing. The pipeline ships two paths with very different egress postures:

  • Orchestrator (default), in scripts.extraction.pdf_pipeline. Egress posture: no raw PDF bytes leave the host. Text is extracted locally via pdfplumber, PHI-redacted in place, then sent to the LLM as redacted text only.

  • Legacy raw-PDF API, in scripts.extraction.extract_pdf_data. Egress posture: raw PDF bytes are base64-encoded and shipped to the provider. Refused unless the operator opts in twice (REPORTALIN_PDF_PHI_FREE=1 env flag plus a non-empty attestation note at authorities/phi_free_pdfs.md).

The wizard’s “Load Study” button selects the orchestrator. The legacy path is the CLI default for back-compat.

Dispatch

Both paths route through scripts.extraction.extract_pdf_data.extract_pdfs_to_jsonl(), which checks the REPORTALIN_PDF_EXTRACTION_MODE env var:

  • llm → orchestrator path (the wizard always sets this).

  • snapshot → publish data/snapshots/{STUDY}/pdfs/ baseline verbatim, no LLM call.

  • unset (CLI default) → legacy raw-PDF API path with the two-part attestation gate.
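The three-way dispatch above can be sketched as follows (the function name and return values here are illustrative, not the actual internals of extract_pdfs_to_jsonl()):

```python
import os

def dispatch_pdf_extraction(study: str) -> str:
    """Illustrative sketch of the mode dispatch on REPORTALIN_PDF_EXTRACTION_MODE."""
    mode = os.environ.get("REPORTALIN_PDF_EXTRACTION_MODE")  # unset by default on the CLI
    if mode == "llm":
        return "orchestrator"  # redact-then-call path; the wizard always sets this
    if mode == "snapshot":
        return "snapshot"      # publish the reviewed baseline verbatim, no LLM call
    return "legacy"            # raw-PDF API path behind the two-part attestation gate
```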

Orchestrator path (the default)

For each PDF, the orchestrator runs:

  1. Code path (always). pdfplumber.open() extracts page text + a heuristic candidate via _candidate_from_text.

  2. Capability + provider gate. scripts.utils.llm_capabilities.is_capable_model() enforces a model allowlist (Claude Opus/Sonnet 4.6+, GPT-5+, Gemini 2.5 Pro, Llama 3.3 405B). The default allowlist is hardcoded; operators can override with REPORTALIN_PDF_LLM_CAPABLE_MODELS (comma-separated prefix list) — for example to validate a local Ollama model. scripts.extraction.pdf_pipeline.ORCHESTRATOR_SUPPORTED_PROVIDERS restricts the LLM tier to anthropic + google (where the actual client wiring exists).

  3. Redact-then-call. Extracted text is scrubbed via phi_patterns.BLOCKING_PATTERNS; a defensive _assert_no_raw_phi_in_payload re-checks and raises if any blocking pattern survives. Only the redacted text reaches the LLM.

  4. Merge. scripts.extraction.pdf_pipeline._merge() reconciles the LLM response with the code candidate (LLM wins on field-level conflicts; code fills in vars the LLM missed). The LLM response is also re-scrubbed via scripts.ai_assistant.phi_safe.guard_text() before merge.

  5. Idempotent cache. Keyed on SHA-256(pdf_bytes || provider || model || phi_scrub.yaml hash); stored at tmp/{STUDY}/.pdf_cache/ (mode 0600). Editing the PHI scrub config invalidates every cache entry.

  6. Snapshot fallback (per-PDF). When any of (capable model, API key, code-path text non-empty, LLM call success) is missing, the orchestrator publishes the reviewed baseline at data/snapshots/{STUDY}/pdfs/{stem}_variables.json instead. Code-only output is never an acceptable result — heuristic-only metadata is too unreliable to publish without LLM oversight.
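The redact-then-call guard in step 3 can be sketched as follows. The patterns below are illustrative stand-ins for phi_patterns.BLOCKING_PATTERNS, and the function names only approximate the real module internals:

```python
import re

# Hypothetical stand-ins for phi_patterns.BLOCKING_PATTERNS: compiled
# regexes that must never survive into an LLM payload.
BLOCKING_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),   # SSN-shaped token
    re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),   # date-of-birth-shaped token
]

def redact(text: str) -> str:
    """Scrub every blocking pattern in place before any LLM call."""
    for pat in BLOCKING_PATTERNS:
        text = pat.sub("[REDACTED]", text)
    return text

def assert_no_raw_phi_in_payload(text: str) -> None:
    """Defensive re-check mirroring _assert_no_raw_phi_in_payload:
    raise rather than ship if any blocking pattern survives."""
    for pat in BLOCKING_PATTERNS:
        if pat.search(text):
            raise ValueError(f"blocking PHI pattern survived redaction: {pat.pattern}")
```

Only text that passes the defensive re-check is ever placed in an outbound payload.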

Output JSON carries an extraction_tier field (merged / llm / snapshot / empty) so any reader can tell which path produced it.
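The idempotent cache key from step 5 can be sketched as below. The field layout and separator are hypothetical; the real key construction lives in scripts.extraction.pdf_pipeline:

```python
import hashlib

def pdf_cache_key(pdf_bytes: bytes, provider: str, model: str,
                  scrub_config_bytes: bytes) -> str:
    """Sketch of the idempotent cache key: any change to the PDF, the
    provider/model pair, or the PHI scrub config yields a new key."""
    scrub_hash = hashlib.sha256(scrub_config_bytes).hexdigest()
    h = hashlib.sha256()
    for part in (pdf_bytes, provider.encode(), model.encode(), scrub_hash.encode()):
        h.update(part)
        h.update(b"\x00")  # field separator so adjacent fields cannot collide
    return h.hexdigest()
```

Because the scrub-config hash is part of the key, editing phi_scrub.yaml invalidates every cache entry, exactly as described above.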

The orchestrator never reads raw inputs other than the PDF file itself. It does not touch data/raw/{STUDY}/datasets/ or data/raw/{STUDY}/data_dictionary/.

Capability gate details

The capability gate is intentionally conservative. The default allowlist is:

  • Anthropic: claude-opus-4-6+, claude-sonnet-4-6+ (older Sonnet struggles on multi-section CRFs)

  • OpenAI: gpt-5+ (the GPT-4 family is borderline on complex CRFs)

  • Google: gemini-2.5-pro+ (Flash is excluded: good for chat, weaker on table-heavy PDFs)

  • NVIDIA NIM: meta/llama-3.3-405b-instruct+ only

Ollama is excluded by default regardless of model name — historically, local Ollama models cannot sustain a JSON-schema response on a 30-page CRF. Operators who have validated a specific local model can opt in via REPORTALIN_PDF_LLM_CAPABLE_MODELS.
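The allowlist check can be sketched as a prefix match with the documented env override. The default prefixes below mirror the list above in lowercase-identifier form, and the signature is an approximation of is_capable_model, not its actual interface:

```python
import os

# Hypothetical default allowlist mirroring the documented one; the real
# list is hardcoded in scripts.utils.llm_capabilities.
DEFAULT_CAPABLE_PREFIXES = (
    "claude-opus-4-6", "claude-sonnet-4-6",
    "gpt-5",
    "gemini-2.5-pro",
    "meta/llama-3.3-405b-instruct",
)

def is_capable_model(model: str, provider: str) -> bool:
    """Prefix match against the allowlist, honouring the
    REPORTALIN_PDF_LLM_CAPABLE_MODELS comma-separated override.
    Ollama is excluded by default regardless of model name."""
    override = os.environ.get("REPORTALIN_PDF_LLM_CAPABLE_MODELS")
    if override:
        prefixes = tuple(p.strip() for p in override.split(",") if p.strip())
        return model.startswith(prefixes)
    if provider == "ollama":
        return False
    return model.startswith(DEFAULT_CAPABLE_PREFIXES)
```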

Note that the wizard’s UI also enforces ORCHESTRATOR_SUPPORTED_PROVIDERS (anthropic + google), which is narrower than the capability allowlist. An OpenAI or NVIDIA model may pass is_capable_model, but the wizard will not surface “Load Study” for it, because the underlying _extract_via_llm has client wiring only for anthropic + google. This prevents a silent fall-through to snapshot.

Legacy raw-PDF API path

When REPORTALIN_PDF_EXTRACTION_MODE is unset (CLI default for back-compat), scripts.extraction.extract_pdf_data._resolve_pdf_provider() runs a two-part attestation gate:

  1. The env flag REPORTALIN_PDF_PHI_FREE=1 set in the runtime environment.

  2. A non-empty, operator-signed attestation note at authorities/phi_free_pdfs.md (kept under version control as audit-trail material).

If either is missing, the function raises ValueError with a remediation message listing three alternatives:

  1. Complete both attestations if the source PDFs are verified PHI-free (blank CRFs, protocol-only, MOP).

  2. Use --pdf-source <path> with pre-extracted JSON files (no LLM call).

  3. Skip the PDF leg entirely; the pipeline succeeds without it.
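The two-part gate can be sketched as follows (the function name and message wording are approximations of _resolve_pdf_provider, not its actual code):

```python
import os
from pathlib import Path

def check_pdf_phi_attestation(
    attestation_path: Path = Path("authorities/phi_free_pdfs.md"),
) -> None:
    """Sketch of the two-part attestation gate: both the env flag and a
    non-empty attestation note are required, or the call is refused."""
    flag_ok = os.environ.get("REPORTALIN_PDF_PHI_FREE") == "1"
    note_ok = attestation_path.exists() and attestation_path.read_text().strip() != ""
    if not (flag_ok and note_ok):
        raise ValueError(
            "Raw-PDF API egress refused. Either complete both attestations "
            "(REPORTALIN_PDF_PHI_FREE=1 plus a signed authorities/phi_free_pdfs.md), "
            "pass --pdf-source with pre-extracted JSON, or skip the PDF leg entirely."
        )
```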

The legacy path then base64-encodes the entire PDF and ships it to Anthropic’s messages.stream or Google Gemini’s generate_content for native PDF-document understanding. No redaction is performed on the legacy path — the operator’s attestation is the safety story.
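Packaging the whole PDF for native document understanding can be sketched as below. The block shape follows Anthropic's published base64-document content format; treat it as an assumption and verify against the current provider API docs before relying on it:

```python
import base64

def build_pdf_document_block(pdf_bytes: bytes) -> dict:
    """Sketch of how the legacy path packages raw PDF bytes for a
    provider's native PDF understanding (Anthropic-style content block)."""
    return {
        "type": "document",
        "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": base64.standard_b64encode(pdf_bytes).decode("ascii"),
        },
    }
```

Note that the entire file, redaction-free, is inside that payload, which is exactly why the attestation gate sits in front of this path.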

Output schema (per-form)

Both paths write per-form JSON files at tmp/{STUDY}/pdfs/{stem}_variables.json (later atomically promoted to trio_bundle/pdfs/):

{
  "form_name": "Form 1A - Index Case Screening",
  "source_pdf": "form_1a_index_screening.pdf",
  "version": "v1.0",
  "summary": "<brief>",
  "extraction_tier": "merged",
  "variables": {
    "ABBREVIATION": {
      "description": "...",
      "values": {"1": "Yes", "2": "No"},
      "depends_on": "PARENT_ABBREVIATION_OR_NULL",
      "condition": "Human-readable activation rule",
      "section_context": "<text>"
    }
  },
  "sections": {
    "SECTION_NAME": {
      "context": "<instruction text>",
      "variables": ["ABBREV1", "ABBREV2"]
    }
  }
}

Within-file dedup (case-insensitive collision handling) and cross-form dedup run after extraction; both are pure-function helpers in scripts.extraction.dedup.
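Within-file dedup as a pure function might look like the sketch below (the real helpers in scripts.extraction.dedup may differ; the first-occurrence-wins policy is an assumption for illustration):

```python
def dedup_variables(variables: dict) -> dict:
    """Sketch of within-file, case-insensitive dedup as a pure function.
    The first spelling of each abbreviation wins; later case-variant
    keys are dropped."""
    seen = set()
    result = {}
    for name, meta in variables.items():
        key = name.casefold()
        if key in seen:
            continue  # case-insensitive collision: keep the first spelling
        seen.add(key)
        result[name] = meta
    return result
```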

CLI usage

# Via Makefile (recommended)
make pdf-extract

# Via Python (direct module entry point — legacy path)
uv run python -m scripts.extraction.extract_pdf_data

# Force orchestrator from CLI
REPORTALIN_PDF_EXTRACTION_MODE=llm uv run python main.py --pipeline

# Force snapshot mode (no LLM, no API egress)
REPORTALIN_PDF_EXTRACTION_MODE=snapshot uv run python main.py --pipeline

# Skip the PDF leg via pre-extracted JSON
uv run python main.py --pipeline --pdf-source /path/to/jsons/

Key files

Testing

PHI-critical tests for this surface:

  • tests/security/test_pdf_redaction_pipeline.py — pre-LLM redaction, post-LLM re-scrub, merge contract, idempotent cache, two-way decision (LLM-merged or snapshot, never code-only).

  • tests/security/test_llm_capabilities.py — capability allowlist, Ollama-excluded-by-default, env-override semantics.

  • tests/test_pdf_phi_flag.py — legacy two-part attestation gate.

  • tests/test_extract_pdf_data.py — dispatcher mode selection.
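The two-way decision those redaction-pipeline tests pin down can be sketched as a pure predicate (purely illustrative; the actual decision code and test code live in the modules and files named above):

```python
def decide_tier(capable_model: bool, has_api_key: bool,
                text_nonempty: bool, llm_ok: bool) -> str:
    """Sketch of the per-PDF fallback decision: publish an LLM-merged
    result when every precondition holds, otherwise the reviewed
    snapshot baseline. Heuristic-only output is never an outcome."""
    if capable_model and has_api_key and text_nonempty and llm_ok:
        return "merged"
    return "snapshot"
```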

Downstream usage

  • scripts.extraction.build_variables_reference.build_variables_reference() merges PDF extractions with the data dictionary into trio_bundle/variables.json (the consolidated schema the agent uses to validate variable names).

  • The agent’s cohort_builder and query_dataset tools read variables.json plus per-form *_variables.json to map user-friendly names back to dataset columns.

Licensing

PDF extraction uses open-source libraries (pypdf, pdfplumber) under permissive licenses. No proprietary PDF tools are required.