Dataset Extraction
==================

Extracts tabular clinical data from Excel and CSV files into JSONL format via the study's staging workspace, which is then atomically promoted into the trio bundle.

.. contents:: On this page
   :local:
   :depth: 2

Overview
--------

Dataset extraction reads raw study data files from ``data/raw/{STUDY_NAME}/datasets/``, normalises their rows, and writes the resulting JSONL into the study's **AMBER staging workspace** (``tmp/{STUDY_NAME}/datasets/`` by default, or ``/dev/shm/{STUDY}/`` when ``REPORTALIN_TMPFS_STAGING=1``). The staged JSONL is then run through ``phi_scrub.run_scrub`` (**Step 1.6**, the eight-action catalog defined in ``scripts/security/phi_scrub.yaml``) before any audit artifact is written. A subsequent publish step atomically promotes the now-PHI-free staging bundle into ``output/{STUDY_NAME}/trio_bundle/datasets/``.

PHI handling is fully covered by ``scripts/security/`` (rule + allowlist; not Presidio, not NER-by-default; see ADR-004 in ``developer_guide/decisions.rst``).

Data Flow
---------

.. code-block:: text

   data/raw/{STUDY}/datasets/*.{xlsx,csv}
                     │
                     ▼
   dataset_pipeline.py → tmp/{STUDY}/datasets/*.jsonl  (staging)
                     │
                     ▼  (atomic publish)
   output/{STUDY}/trio_bundle/datasets/*.jsonl

Source
------

- **Path:** ``data/raw/{STUDY_NAME}/datasets/``
- **Formats:** ``.xlsx``, ``.csv``
- Auto-discovered; no manual file list needed

Output
------

- **Location:** ``output/{STUDY_NAME}/trio_bundle/datasets/`` (``config.TRIO_DATASETS_DIR``)
- **Format:** one JSONL file per source file/sheet
- **Deterministic:** ``sort_keys=True, ensure_ascii=False``
- **Provenance:** every record includes ``__source_file__``, ``__sheet__``, and ``__row_index__`` metadata

JSONL Record Schema
-------------------

Each line is a JSON object:
.. code-block:: json

   {
     "FIELD_A": "value",
     "FIELD_B": 42,
     "__source_file__": "enrollment.xlsx",
     "__sheet__": "Sheet1",
     "__row_index__": 0
   }

Zone Enforcement
----------------

- Extraction outputs must live under ``output/``, never under ``data/``
- ``secure_env.assert_not_raw()`` rejects any write path that resolves inside ``data/raw/``
- ``secure_env.assert_output_not_in_data()`` prevents accidental writes into the raw data tree

CLI Usage
---------

.. code-block:: bash

   # Via Makefile (recommended)
   make extract-datasets

   # Via Python
   uv run python -c "from scripts.extraction.dataset_pipeline import extract_datasets; extract_datasets()"

Downstream Handoff
------------------

JSONL files are written first to ``tmp/{STUDY}/datasets/``, the AMBER staging workspace created by ``scripts/utils/secure_staging.prepare_staging`` with mode 0700 and umask 0077. Every row carries a full ``_provenance`` dict (``raw_sha256``, ``pipeline_version``, ``extraction_engine``, ``source_file``, ``sheet_name``, ``row_index``, ``study_name``, ``extraction_utc``). The PHI scrubber (:mod:`scripts.security.phi_scrub`) then runs in place as Step 1.6, before any audit is emitted. After cleanup (Step 1.7) and cleanup propagation (Step 1.8), ``_publish_staging`` atomically renames the staging datasets directory into ``output/{STUDY}/trio_bundle/datasets/``. A per-run ``audit/lineage_manifest.json`` then pairs every raw input SHA-256 with every published JSONL SHA-256.

Key Files
---------

- ``scripts/extraction/dataset_pipeline.py`` – main extraction logic
- ``scripts/extraction/io/`` – atomic write helpers, file discovery
- ``config.py`` – ``TRIO_DATASETS_DIR``, ``DATASETS_DIR``
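The deterministic serialization noted in the Output section (``sort_keys=True, ensure_ascii=False``) can be sketched as follows. This is a minimal illustration of the idea, not the pipeline's actual writer; the helper name ``to_jsonl_line`` is an assumption.

```python
import json

def to_jsonl_line(record: dict) -> str:
    """Serialize one record as a byte-stable JSONL line (illustrative sketch).

    sort_keys=True makes output independent of dict insertion order, so
    repeated runs over the same input produce identical bytes;
    ensure_ascii=False keeps non-ASCII clinical text readable rather
    than \\u-escaped.
    """
    return json.dumps(record, sort_keys=True, ensure_ascii=False)

# A record shaped like the schema above (insertion order deliberately shuffled).
row = {
    "FIELD_B": 42,
    "FIELD_A": "value",
    "__source_file__": "enrollment.xlsx",
    "__sheet__": "Sheet1",
    "__row_index__": 0,
}
line = to_jsonl_line(row)
print(line)
```

Because keys are sorted, ``FIELD_A`` always precedes ``FIELD_B`` in the output regardless of how the row dict was built, which is what makes published JSONL diffable across runs.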
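The atomic promotion performed by ``_publish_staging`` can be illustrated with a small sketch. Assumptions: the staging and output trees live on the same filesystem (so a single rename is atomic), and the destination directory does not yet exist; the function name here is hypothetical and is not the repository's helper.

```python
import os

def publish_staging(staging_dir: str, published_dir: str) -> None:
    """Atomically promote a staging directory into its published location.

    Illustrative sketch only: os.replace is a single rename syscall when
    source and destination share a filesystem, so readers never observe a
    half-written bundle -- either the old state or the complete new one.
    Assumes published_dir does not already exist.
    """
    os.makedirs(os.path.dirname(published_dir), exist_ok=True)
    os.replace(staging_dir, published_dir)
```

If staging were on ``/dev/shm`` while output is on disk, a plain rename would fail across filesystems; a real implementation would need a copy-then-rename fallback in that case.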
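The zone-enforcement checks described above amount to resolving a candidate write path and rejecting it if it lands inside the raw tree. The following is an illustrative re-implementation of that idea, not the actual ``secure_env`` code; the raw-root location and error type are assumptions.

```python
from pathlib import Path

# Hypothetical raw-zone root, resolved once; the real check lives in
# scripts/security/secure_env.py.
RAW_ROOT = Path("data/raw").resolve()

def assert_not_raw(write_path: str) -> None:
    """Reject any write path that resolves inside the raw data tree.

    Resolving first defeats tricks like 'data/raw/../raw/x' or symlinks
    that point back into the raw zone.
    """
    resolved = Path(write_path).resolve()
    if resolved == RAW_ROOT or RAW_ROOT in resolved.parents:
        raise PermissionError(f"refusing to write inside raw zone: {resolved}")
```

A path under ``output/`` passes silently; any path that resolves under ``data/raw/`` raises before a single byte is written.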