Dataset Extraction

Extracts tabular clinical data from Excel and CSV files into JSONL, writing first into the study’s staging workspace and then atomically promoting the result into the trio bundle.

Overview

Dataset extraction reads raw study data files from data/raw/{STUDY_NAME}/datasets/, normalises their rows, and writes the resulting JSONL into the study’s AMBER staging workspace (tmp/{STUDY_NAME}/datasets/ by default, or /dev/shm/{STUDY}/ when REPORTALIN_TMPFS_STAGING=1). The staged JSONL is then run through phi_scrub.run_scrub (Step 1.6, using the eight-action catalog defined in scripts/security/phi_scrub.yaml) before any audit artifact is written. A subsequent publish step atomically promotes the now-PHI-free staging bundle into output/{STUDY_NAME}/trio_bundle/datasets/. PHI handling is covered entirely by scripts/security/ (rule + allowlist; not Presidio, and not NER by default; see ADR-004 in developer_guide/decisions.rst).
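
A minimal sketch of how the staging selection described above might look; the helper name staging_root is illustrative, and the real logic lives in scripts/utils/secure_staging:

import os
from pathlib import Path

def staging_root(study_name: str) -> Path:
    # Illustrative only: tmpfs staging when REPORTALIN_TMPFS_STAGING=1,
    # otherwise the default tmp/{STUDY_NAME}/datasets/ workspace.
    if os.environ.get("REPORTALIN_TMPFS_STAGING") == "1":
        return Path("/dev/shm") / study_name
    return Path("tmp") / study_name / "datasets"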

Data Flow

data/raw/{STUDY}/datasets/*.{xlsx,csv}
              │
              ▼
dataset_pipeline.py  →  tmp/{STUDY}/datasets/*.jsonl   (staging)
              │
              ▼  (atomic publish)
output/{STUDY}/trio_bundle/datasets/*.jsonl

Source

  • Path: data/raw/{STUDY_NAME}/datasets/

  • Formats: .xlsx, .csv

  • Auto-discovered — no manual file list needed
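
The auto-discovery above could be as simple as the following sketch (illustrative; the actual discovery helpers live in scripts/extraction/io/):

from pathlib import Path

def discover_sources(study_name: str) -> list[Path]:
    # Hypothetical sketch: list every Excel/CSV file in the raw datasets dir.
    root = Path("data/raw") / study_name / "datasets"
    return sorted(p for p in root.iterdir()
                  if p.suffix.lower() in {".xlsx", ".csv"})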

Output

  • Location: output/{STUDY_NAME}/trio_bundle/datasets/ (config.TRIO_DATASETS_DIR)

  • Format: One JSONL file per source file/sheet

  • Deterministic: sort_keys=True, ensure_ascii=False (see the sketch after this list)

  • Provenance: Every record includes __source_file__, __sheet__, __row_index__ metadata
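
A hedged sketch of the deterministic serialisation noted above (write_jsonl is a hypothetical name; the pipeline’s atomic write helpers in scripts/extraction/io/ handle the real I/O):

import json
from pathlib import Path

def write_jsonl(records: list[dict], dest: Path) -> None:
    # One JSON object per line; sort_keys=True and ensure_ascii=False make
    # the byte output deterministic and keep non-ASCII characters readable.
    with dest.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, sort_keys=True, ensure_ascii=False))
            fh.write("\n")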

JSONL Record Schema

Each line is a JSON object:

{
  "FIELD_A": "value",
  "FIELD_B": 42,
  "__source_file__": "enrollment.xlsx",
  "__sheet__": "Sheet1",
  "__row_index__": 0
}
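
Consumers can separate payload fields from the dunder provenance keys with a few lines; this helper is illustrative, not part of the pipeline:

import json

def split_record(line: str) -> tuple[dict, dict]:
    # Data fields vs. __source_file__ / __sheet__ / __row_index__ metadata.
    record = json.loads(line)
    meta = {k: v for k, v in record.items() if k.startswith("__")}
    data = {k: v for k, v in record.items() if not k.startswith("__")}
    return data, meta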

Zone Enforcement

  • Extraction outputs must live under output/ — never under data/

  • secure_env.assert_not_raw() rejects any write path that resolves inside data/raw/ (a sketch follows this list)

  • secure_env.assert_output_not_in_data() prevents accidental writes into the raw data tree
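
A minimal sketch of the assert_not_raw() guard, assuming a resolve-and-compare check (the real secure_env implementation may differ in naming and error handling):

from pathlib import Path

RAW_ROOT = Path("data/raw").resolve()

def assert_not_raw(path: Path) -> None:
    # Resolve symlinks and relative segments, then refuse anything that
    # lands at or below data/raw/.
    resolved = path.resolve()
    if resolved == RAW_ROOT or RAW_ROOT in resolved.parents:
        raise PermissionError(f"refusing to write inside raw zone: {resolved}")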

CLI Usage

# Via Makefile (recommended)
make extract-datasets

# Via Python
uv run python -c "from scripts.extraction.dataset_pipeline import extract_datasets; extract_datasets()"

Downstream Handoff

JSONL files are written first to tmp/{STUDY}/datasets/, the AMBER staging workspace created by scripts/utils/secure_staging.prepare_staging with mode 0700 and umask 0077. Every row carries a full _provenance dict (raw_sha256, pipeline_version, extraction_engine, source_file, sheet_name, row_index, study_name, extraction_utc). The PHI scrubber (scripts.security.phi_scrub) then runs in place as Step 1.6, before any audit is emitted. After cleanup (Step 1.7) and cleanup propagation (Step 1.8), _publish_staging atomically renames the staging datasets dir into output/{STUDY}/trio_bundle/datasets/. Finally, a per-run audit/lineage_manifest.json pairs every raw input SHA-256 with every published JSONL SHA-256.
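
A sketch of what the atomic promotion might reduce to; publish_staging is a hypothetical stand-in for _publish_staging, and a real implementation must handle tmpfs staging (/dev/shm), where the rename crosses filesystems and os.replace would fail:

import os
from pathlib import Path

def publish_staging(staging: Path, study_name: str) -> Path:
    # Single rename: readers see the old bundle or the new one, never a
    # half-written directory. Fails if the destination already exists and
    # is non-empty, or if staging sits on a different filesystem.
    dest = Path("output") / study_name / "trio_bundle" / "datasets"
    dest.parent.mkdir(parents=True, exist_ok=True)
    os.replace(staging, dest)
    return dest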

Key Files

  • scripts/extraction/dataset_pipeline.py — main extraction logic

  • scripts/extraction/io/ — atomic write helpers, file discovery

  • config.py — TRIO_DATASETS_DIR, DATASETS_DIR