Dataset Extraction

Extracts tabular clinical data from Excel and CSV files into JSONL, writing first into the study’s staging workspace and then atomically promoting the result into the trio bundle.

Overview

Dataset extraction reads raw study data files from data/raw/{STUDY_NAME}/datasets/, normalises their rows, and writes the resulting JSONL into the study’s AMBER staging workspace (tmp/{STUDY_NAME}/datasets/ by default, or /dev/shm/{STUDY}/ when REPORTALIN_TMPFS_STAGING=1). The staged JSONL is then run through phi_scrub.run_scrub (Step 1.6, using the eight-action catalog defined in scripts/security/phi_scrub.yaml) before any audit artifact is written. A subsequent publish step atomically promotes the now-PHI-free staging bundle into output/{STUDY_NAME}/trio_bundle/datasets/. PHI handling is covered entirely by scripts/security/ (rule + allowlist; not Presidio, and not NER by default; see ADR-004 in developer_guide/decisions.rst).
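
A minimal sketch of how the staging selection described above might look; the helper name staging_root is illustrative, and the real logic lives in scripts/utils/secure_staging:

import os
from pathlib import Path

def staging_root(study_name: str) -> Path:
    # Illustrative only: tmpfs staging when REPORTALIN_TMPFS_STAGING=1,
    # otherwise the default tmp/{STUDY_NAME}/datasets/ workspace.
    if os.environ.get("REPORTALIN_TMPFS_STAGING") == "1":
        return Path("/dev/shm") / study_name
    return Path("tmp") / study_name / "datasets"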

Data Flow

data/raw/{STUDY}/datasets/*.{xlsx,csv}
              │
              ▼
dataset_pipeline.py  →  tmp/{STUDY}/datasets/*.jsonl   (staging)
              │
              ▼  (atomic publish)
output/{STUDY}/trio_bundle/datasets/*.jsonl

Source

  • Path: data/raw/{STUDY_NAME}/datasets/

  • Formats: .xlsx, .csv

  • Auto-discovered — no manual file list needed
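
The auto-discovery above could be as simple as the following sketch (illustrative; the actual discovery helpers live in scripts/extraction/io/):

from pathlib import Path

def discover_sources(study_name: str) -> list[Path]:
    # Hypothetical sketch: list every Excel/CSV file in the raw datasets dir.
    root = Path("data/raw") / study_name / "datasets"
    return sorted(p for p in root.iterdir()
                  if p.suffix.lower() in {".xlsx", ".csv"})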

Output

  • Location: output/{STUDY_NAME}/trio_bundle/datasets/ (config.TRIO_DATASETS_DIR)

  • Format: One JSONL file per source file/sheet

  • Deterministic: sort_keys=True, ensure_ascii=False (see the sketch after this list)

  • Provenance: Every record includes __source_file__, __sheet__, __row_index__ metadata
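
A hedged sketch of the deterministic serialisation noted above (write_jsonl is a hypothetical name; the pipeline’s atomic write helpers in scripts/extraction/io/ handle the real I/O):

import json
from pathlib import Path

def write_jsonl(records: list[dict], dest: Path) -> None:
    # One JSON object per line; sort_keys=True and ensure_ascii=False make
    # the byte output deterministic and keep non-ASCII characters readable.
    with dest.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, sort_keys=True, ensure_ascii=False))
            fh.write("\n")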

JSONL Record Schema

Each line is a JSON object:

{
  "FIELD_A": "value",
  "FIELD_B": 42,
  "__source_file__": "enrollment.xlsx",
  "__sheet__": "Sheet1",
  "__row_index__": 0
}
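
Consumers can separate payload fields from the dunder provenance keys with a few lines; this helper is illustrative, not part of the pipeline:

import json

def split_record(line: str) -> tuple[dict, dict]:
    # Data fields vs. __source_file__ / __sheet__ / __row_index__ metadata.
    record = json.loads(line)
    meta = {k: v for k, v in record.items() if k.startswith("__")}
    data = {k: v for k, v in record.items() if not k.startswith("__")}
    return data, meta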

Zone Enforcement

  • Extraction outputs must live under output/ — never under data/

  • secure_env.assert_not_raw() rejects any write path that resolves inside data/raw/ (a sketch follows this list)

  • secure_env.assert_output_not_in_data() prevents accidental writes into the raw data tree
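
A minimal sketch of the assert_not_raw() guard, assuming a resolve-and-compare check (the real secure_env implementation may differ in naming and error handling):

from pathlib import Path

RAW_ROOT = Path("data/raw").resolve()

def assert_not_raw(path: Path) -> None:
    # Resolve symlinks and relative segments, then refuse anything that
    # lands at or below data/raw/.
    resolved = path.resolve()
    if resolved == RAW_ROOT or RAW_ROOT in resolved.parents:
        raise PermissionError(f"refusing to write inside raw zone: {resolved}")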

CLI Usage

# Via Makefile (recommended)
make extract-datasets

# Via Python
uv run python -c "from scripts.extraction.dataset_pipeline import extract_datasets; extract_datasets()"

Downstream Handoff

JSONL files are written first to tmp/{STUDY}/datasets/, the AMBER staging workspace created by scripts/utils/secure_staging.prepare_staging with mode 0700 and umask 0077. Every row carries a full _provenance dict (raw_sha256, pipeline_version, extraction_engine, source_file, sheet_name, row_index, study_name, extraction_utc). The PHI scrubber (scripts.security.phi_scrub) then runs in place as Step 1.6, before any audit is emitted. After cleanup (Step 1.7) and cleanup propagation (Step 1.8), _publish_staging atomically renames the staging datasets dir into output/{STUDY}/trio_bundle/datasets/. Finally, a per-run audit/lineage_manifest.json pairs every raw input SHA-256 with every published JSONL SHA-256.
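
A sketch of what the atomic promotion might reduce to; publish_staging is a hypothetical stand-in for _publish_staging, and a real implementation must handle tmpfs staging (/dev/shm), where the rename crosses filesystems and os.replace would fail:

import os
from pathlib import Path

def publish_staging(staging: Path, study_name: str) -> Path:
    # Single rename: readers see the old bundle or the new one, never a
    # half-written directory. Fails if the destination already exists and
    # is non-empty, or if staging sits on a different filesystem.
    dest = Path("output") / study_name / "trio_bundle" / "datasets"
    dest.parent.mkdir(parents=True, exist_ok=True)
    os.replace(staging, dest)
    return dest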

Key Files

  • scripts/extraction/dataset_pipeline.py — main extraction logic

  • scripts/extraction/io/ — atomic write helpers, file discovery

  • config.py — TRIO_DATASETS_DIR, DATASETS_DIR