PDF Extraction
The PDF orchestrator is the primary path; the legacy raw-PDF API path is documented as the back-compat fallback.
Two co-existing extraction paths
Annotated CRF PDFs may carry filled-in patient data, signatures, or example subject IDs in annotations — they are presumed PHI-bearing. The pipeline ships two paths with very different egress postures:
| Path | Module | Egress posture |
|---|---|---|
| Orchestrator (default) | `scripts.extraction.pdf_pipeline` | No raw PDF bytes leave the host. Text is extracted locally via `pdfplumber`; only redacted text reaches the LLM. |
| Legacy raw-PDF API | `scripts.extraction.extract_pdf_data` | Raw PDF bytes are base64-encoded and shipped to the provider. Refused unless the operator opts in twice (`REPORTALIN_PDF_PHI_FREE=1` plus a signed `authorities/phi_free_pdfs.md`). |
The wizard’s “Load Study” button selects the orchestrator. The legacy path is the CLI default for back-compat.
Dispatch
Both paths route through
`scripts.extraction.extract_pdf_data.extract_pdfs_to_jsonl()`,
which checks the `REPORTALIN_PDF_EXTRACTION_MODE` env var:

- `llm` → orchestrator path (the wizard always sets this).
- `snapshot` → publish the `data/snapshots/{STUDY}/pdfs/` baseline verbatim; no LLM call.
- unset (CLI default) → legacy raw-PDF API path with the two-part attestation gate.
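The mode selection above can be sketched as follows (the function name and return labels are illustrative, not the dispatcher's actual internals):

```python
import os

def select_extraction_path() -> str:
    """Illustrative sketch of REPORTALIN_PDF_EXTRACTION_MODE dispatch."""
    mode = os.environ.get("REPORTALIN_PDF_EXTRACTION_MODE")
    if mode == "llm":
        return "orchestrator"  # the wizard always sets this
    if mode == "snapshot":
        return "snapshot"      # publish reviewed baseline, no LLM call
    return "legacy"            # CLI default: raw-PDF API + attestation gate
```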
Orchestrator path (the default)
For each PDF, the orchestrator runs:
1. **Code path (always).** `pdfplumber.open()` extracts page text plus a heuristic candidate via `_candidate_from_text`.
2. **Capability + provider gate.** `scripts.utils.llm_capabilities.is_capable_model()` enforces a model allowlist (Claude Opus/Sonnet 4.6+, GPT-5+, Gemini 2.5 Pro, Llama 3.3 405B). The default allowlist is hardcoded; operators can override it with `REPORTALIN_PDF_LLM_CAPABLE_MODELS` (a comma-separated prefix list), for example to validate a local Ollama model. `scripts.extraction.pdf_pipeline.ORCHESTRATOR_SUPPORTED_PROVIDERS` restricts the LLM tier to `anthropic` + `google` (where the actual client wiring exists).
3. **Redact-then-call.** Extracted text is scrubbed via `phi_patterns.BLOCKING_PATTERNS`; a defensive `_assert_no_raw_phi_in_payload` re-checks and raises if any blocking pattern survives. Only the redacted text reaches the LLM.
4. **Merge.** `scripts.extraction.pdf_pipeline._merge()` reconciles the LLM response with the code candidate (the LLM wins on field-level conflicts; code fills in variables the LLM missed). The LLM response is also re-scrubbed via `scripts.ai_assistant.phi_safe.guard_text()` before the merge.
5. **Idempotent cache.** Keyed on SHA-256(pdf_bytes || provider || model || phi_scrub.yaml hash); stored at `tmp/{STUDY}/.pdf_cache/` (mode `0600`). Editing the PHI scrub config invalidates every cache entry.
6. **Snapshot fallback (per-PDF).** When any of (capable model, API key, non-empty code-path text, LLM call success) is missing, the orchestrator instead publishes the reviewed baseline at `data/snapshots/{STUDY}/pdfs/{stem}_variables.json`. Code-only output is never an acceptable result: heuristic-only metadata is too unreliable to publish without LLM oversight.
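The cache-key recipe can be sketched in a few lines (the field separator and exact concatenation order are assumptions; only the hashed ingredients are documented above):

```python
import hashlib

def cache_key(pdf_bytes: bytes, provider: str, model: str, scrub_cfg: bytes) -> str:
    """Sketch of the idempotent cache key: SHA-256 over the PDF bytes,
    provider, model, and the phi_scrub.yaml content hash. Hashing the
    scrub config means editing it invalidates every cache entry."""
    h = hashlib.sha256()
    for part in (pdf_bytes, provider.encode(), model.encode(),
                 hashlib.sha256(scrub_cfg).digest()):
        h.update(part)
        h.update(b"\x00")  # unambiguous field separator (an assumption)
    return h.hexdigest()
```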
Output JSON carries an `extraction_tier` field
(`merged` / `llm` / `snapshot` / `empty`) so any reader can
tell which path produced it.
The orchestrator never reads raw inputs other than the PDF file
itself. It does not touch data/raw/{STUDY}/datasets/ or
data/raw/{STUDY}/data_dictionary/.
Capability gate details
The capability gate is intentionally conservative. The default allowlist is:
- Anthropic — `claude-opus-4-6+`, `claude-sonnet-4-6+` (older Sonnet struggles on multi-section CRFs)
- OpenAI — `gpt-5+` (the GPT-4 family is borderline on complex CRFs)
- Google — `gemini-2.5-pro+` (Flash is excluded: good for chat, weaker on table-heavy PDFs)
- NVIDIA NIM — `meta/llama-3.3-405b-instruct+` only
Ollama is excluded by default regardless of model name —
historically, local Ollama models cannot sustain a JSON-schema
response on a 30-page CRF. Operators can opt in via
`REPORTALIN_PDF_LLM_CAPABLE_MODELS` once they have validated a
specific local model.
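A minimal sketch of a prefix-allowlist check with the env override; `DEFAULT_PREFIXES` and `is_capable` here are illustrative stand-ins for the real `DEFAULT_CAPABLE_MODEL_PREFIXES` and `is_capable_model`:

```python
import os

# Illustrative defaults; the real list lives in scripts.utils.llm_capabilities.
DEFAULT_PREFIXES = (
    "claude-opus-4-6", "claude-sonnet-4-6",
    "gpt-5", "gemini-2.5-pro", "meta/llama-3.3-405b-instruct",
)

def is_capable(model: str) -> bool:
    """Prefix-based allowlist check with the documented env override."""
    override = os.environ.get("REPORTALIN_PDF_LLM_CAPABLE_MODELS")
    prefixes = (
        tuple(p.strip() for p in override.split(",") if p.strip())
        if override else DEFAULT_PREFIXES
    )
    return model.startswith(prefixes)
```

Setting the override replaces the default list entirely, which is what lets an operator validate a local Ollama model without also re-admitting the cloud defaults.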
Note that the wizard’s UI also enforces
`ORCHESTRATOR_SUPPORTED_PROVIDERS` (`anthropic` + `google`),
which is narrower than the capability allowlist. An OpenAI or NVIDIA
model may pass `is_capable_model`, but the wizard will not surface
“Load Study” as an option for it, because the underlying
`_extract_via_llm` only has client wiring for `anthropic` + `google`.
This prevents a silent fall-through to snapshot.
Legacy raw-PDF API path
When `REPORTALIN_PDF_EXTRACTION_MODE` is unset (the CLI default for
back-compat), `scripts.extraction.extract_pdf_data._resolve_pdf_provider()`
runs a two-part attestation gate:
1. The env flag `REPORTALIN_PDF_PHI_FREE=1` is set in the runtime environment.
2. A non-empty, operator-signed attestation note exists at `authorities/phi_free_pdfs.md` (under version control, audit-trail material).
If either is missing, the function raises ValueError with a
remediation message listing three alternatives:
1. Flip both attestations if the source PDFs are verified PHI-free (blank CRFs, protocol-only, MOP).
2. Use `--pdf-source <path>` with pre-extracted JSON files (no LLM call).
3. Skip the PDF leg entirely; the pipeline succeeds without it.
The legacy path then base64-encodes the entire PDF and ships it to
Anthropic’s messages.stream or Google Gemini’s
generate_content for native PDF-document understanding. No
redaction is performed on the legacy path — the operator’s
attestation is the safety story.
Output schema (per-form)
Both paths write per-form JSON files at tmp/{STUDY}/pdfs/{stem}_variables.json
(later atomically promoted to trio_bundle/pdfs/):
```json
{
  "form_name": "Form 1A - Index Case Screening",
  "source_pdf": "form_1a_index_screening.pdf",
  "version": "v1.0",
  "summary": "<brief>",
  "extraction_tier": "merged",
  "variables": {
    "ABBREVIATION": {
      "description": "...",
      "values": {"1": "Yes", "2": "No"},
      "depends_on": "PARENT_ABBREVIATION_OR_NULL",
      "condition": "Human-readable activation rule",
      "section_context": "<text>"
    }
  },
  "sections": {
    "SECTION_NAME": {
      "context": "<instruction text>",
      "variables": ["ABBREV1", "ABBREV2"]
    }
  }
}
```
Within-file dedup (case-insensitive collision handling) and
cross-form dedup run after extraction; both are pure-function helpers
in scripts.extraction.dedup.
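A case-insensitive within-file dedup could be sketched as below (the keep-first collision policy is an assumption; the real pure-function helpers live in `scripts.extraction.dedup`):

```python
def dedup_case_insensitive(variables: dict) -> dict:
    """Drop later keys that collide case-insensitively with an
    earlier key (keep-first policy, assumed for illustration)."""
    seen, out = set(), {}
    for name, spec in variables.items():
        key = name.casefold()
        if key not in seen:
            seen.add(key)
            out[name] = spec
    return out
```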
CLI usage
```bash
# Via Makefile (recommended)
make pdf-extract

# Via Python (direct module entry point — legacy path)
uv run python -m scripts.extraction.extract_pdf_data

# Force orchestrator from CLI
REPORTALIN_PDF_EXTRACTION_MODE=llm uv run python main.py --pipeline

# Force snapshot mode (no LLM, no API egress)
REPORTALIN_PDF_EXTRACTION_MODE=snapshot uv run python main.py --pipeline

# Skip the PDF leg via pre-extracted JSON
uv run python main.py --pipeline --pdf-source /path/to/jsons/
```
Key files
- `scripts.extraction.pdf_pipeline` — the orchestrator: two-way merge, capability gate, idempotent cache, snapshot fallback.
- `scripts.extraction.extract_pdf_data` — the legacy raw-PDF API path; also hosts the dispatcher (`extract_pdfs_to_jsonl`).
- `scripts.utils.llm_capabilities` — the model-name allowlist (`DEFAULT_CAPABLE_MODEL_PREFIXES`, `is_capable_model()`).
- `scripts.security.phi_patterns` — the `BLOCKING_PATTERNS` used by the orchestrator's redaction step (and by the agent-output PHI gate).
- `config.STUDY_SNAPSHOTS_DIR` — the reviewed baseline location (`data/snapshots/{STUDY}/`); see Operations for the maintenance protocol.
Testing
PHI-critical tests for this surface:
- `tests/security/test_pdf_redaction_pipeline.py` — pre-LLM redaction, post-LLM re-scrub, merge contract, idempotent cache, two-way decision (LLM-merged or snapshot, never code-only).
- `tests/security/test_llm_capabilities.py` — capability allowlist, Ollama excluded by default, env-override semantics.
- `tests/test_pdf_phi_flag.py` — legacy two-part attestation gate.
- `tests/test_extract_pdf_data.py` — dispatcher mode selection.
Downstream usage
- `scripts.extraction.build_variables_reference.build_variables_reference()` merges PDF extractions with the data dictionary into `trio_bundle/variables.json` (the consolidated schema the agent uses to validate variable names).
- The agent's `cohort_builder` and `query_dataset` tools read `variables.json` plus the per-form `*_variables.json` files to map user-friendly names back to dataset columns.
Licensing
PDF extraction uses open-source libraries (pypdf, pdfplumber)
under permissive licenses. No proprietary PDF tools are required.