scripts package

RePORTaLiN Scripts Package

Core data processing modules for clinical research data extraction, validation, and de-identification.

This package provides high-level functions for the complete data processing pipeline:

  • Data Dictionary Processing: Load and validate study data dictionaries

  • Data Extraction: Convert Excel files to JSONL format with validation

  • De-identification: Advanced PHI/PII detection and pseudonymization (via utils)

Public API

The package exports 2 main high-level functions via __all__:

  • load_study_dictionary: Process data dictionary Excel files

  • extract_excel_to_jsonl: Extract dataset Excel files to JSONL

For more specialized functionality, import directly from submodules:

  • scripts.load_dictionary: Dictionary processing (2 public functions)

  • scripts.extract_data: Data extraction (6 public functions)

  • scripts.deidentify: De-identification engine (10 public functions)

  • scripts.utils.country_regulations: Privacy regulations (6 public functions)

  • scripts.utils.logging: Enhanced logging (12 public functions)

Usage Examples

Basic Pipeline

Process both data dictionary and dataset with default configuration:

from scripts import load_study_dictionary, extract_excel_to_jsonl

# Step 1: Load data dictionary
dict_success = load_study_dictionary()

# Step 2: Extract dataset (uses config.DATASET_DIR and config.CLEAN_DATASET_DIR)
result = extract_excel_to_jsonl()

if dict_success and result['files_created'] > 0:
    print("βœ“ Pipeline completed successfully!")

Custom Processing

Use individual modules for custom workflows:

from scripts.load_dictionary import process_excel_file
from scripts.extract_data import find_excel_files, process_excel_file as process_data

# Custom dictionary processing
process_excel_file(
    excel_path="custom_dict.xlsx",
    output_dir="results/custom_dict"
)

# Custom data extraction with file discovery
excel_files = find_excel_files("data/custom_dataset")
for file_path in excel_files:
    process_data(
        excel_path=file_path,
        output_dir="results/custom_output"
    )

De-identification Workflow

Complete pipeline with de-identification:

from scripts import extract_excel_to_jsonl
from scripts.deidentify import deidentify_dataset, DeidentificationConfig
import config

# Step 1: Extract data (uses config.DATASET_DIR and config.CLEAN_DATASET_DIR)
result = extract_excel_to_jsonl()

# Step 2: De-identify with custom configuration
deidentify_config = DeidentificationConfig(
    countries=['IN', 'US'],
    enable_encryption=True
)

deidentify_dataset(
    input_dir=f"{config.CLEAN_DATASET_DIR}/cleaned",
    output_dir="results/deidentified/Indo-vap",
    config=deidentify_config
)

Module Structure

The package is organized as follows:

scripts/
β”œβ”€β”€ __init__.py              # Package API (this file)
β”œβ”€β”€ load_dictionary.py       # Data dictionary processing
β”œβ”€β”€ extract_data.py          # Excel to JSONL extraction
β”œβ”€β”€ deidentify.py            # De-identification engine
└── utils/                   # Utility modules
    β”œβ”€β”€ __init__.py
    β”œβ”€β”€ country_regulations.py  # Privacy compliance
    └── logging.py           # Enhanced logging

Version History

  • v0.0.9: Enhanced package-level API with comprehensive documentation

  • v0.0.8: Enhanced load_dictionary module (public API, type hints, docs)

  • v0.0.7: Enhanced extract_data module (public API, type hints, docs)

  • v0.0.6: Enhanced deidentify module (public API, type safety, docs)

  • v0.0.5: Enhanced country_regulations module (public API, docs)

  • v0.0.4: Enhanced logging module (performance, type hints)

  • v0.0.3: Enhanced config module (utilities, robustness)

  • v0.0.1: Initial package structure

See also

  β€’ mod:scripts.load_dictionary - Data dictionary processing

  β€’ mod:scripts.extract_data - Data extraction

  β€’ mod:scripts.deidentify - De-identification

  β€’ mod:scripts.utils.country_regulations - Privacy regulations

  β€’ mod:scripts.utils.logging - Logging utilities

scripts.extract_excel_to_jsonl()[source]

Extract all Excel files from the dataset directory, creating original and cleaned JSONL versions.

Return type: Dict[str, Any]

Returns: Dictionary with extraction statistics
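For orientation, JSONL means one JSON object per line: each record in a sheet becomes one line of the output file. A minimal, self-contained sketch of the format (the field names here are illustrative, not the study's actual columns):

```python
import json

# Two illustrative records, as they might appear after extraction
records = [
    {"SUBJID": "001", "VISIT": "baseline"},
    {"SUBJID": "002", "VISIT": "month_2"},
]

# JSONL: serialize each record on its own line
jsonl_text = "\n".join(json.dumps(r) for r in records)

# Reading it back recovers the original records line by line
parsed = [json.loads(line) for line in jsonl_text.splitlines()]
```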

scripts.load_study_dictionary(file_path=None, json_output_dir=None, preserve_na=True)[source]

Load and process the study data dictionary from Excel to JSONL format.

Parameters:

  β€’ file_path (Optional[str]) – Path to the Excel file (defaults to config.DICTIONARY_EXCEL_FILE)

  β€’ json_output_dir (Optional[str]) – Output directory (defaults to config.DICTIONARY_JSON_OUTPUT_DIR)

  β€’ preserve_na (bool) – If True, preserve empty cells as None; if False, use pandas defaults

Return type: bool

Returns: True if processing was successful, False otherwise

Overview

The scripts package contains the core processing modules for RePORTaLiN.

Enhanced in v0.0.9:

  • βœ… Enhanced package-level documentation with comprehensive usage examples

  • βœ… Clear public API definition (2 high-level functions)

  • βœ… Integration examples for complete data processing pipeline

  • βœ… De-identification workflow documentation

  • βœ… Module structure and cross-references

Package-Level Public API

The package exports 2 high-level functions for the main processing pipeline:

  1. load_study_dictionary - Process data dictionary Excel files

  2. extract_excel_to_jsonl - Extract dataset Excel files to JSONL

Quick Start:

from scripts import load_study_dictionary, extract_excel_to_jsonl

# Step 1: Load data dictionary
dict_success = load_study_dictionary()

# Step 2: Extract dataset (uses config.DATASET_DIR and config.CLEAN_DATASET_DIR)
result = extract_excel_to_jsonl()
extract_success = result['files_created'] > 0

For specialized functionality, import directly from submodules:

  • scripts.load_dictionary - 2 public functions

  • scripts.extract_data - 6 public functions

  • scripts.deidentify - 10 public functions

  • scripts.utils.country_regulations - 6 public functions

  • scripts.utils.logging - 12 public functions

Module Organization

scripts/
β”œβ”€β”€ __init__.py              # Package API (2 exports)
β”œβ”€β”€ load_dictionary.py       # Data dictionary (2 exports)
β”œβ”€β”€ extract_data.py          # Data extraction (6 exports)
β”œβ”€β”€ deidentify.py            # De-identification (10 exports)
└── utils/
    β”œβ”€β”€ country_regulations.py  # Privacy rules (6 exports)
    └── logging.py           # Logging (12 exports)

Submodules

Module Summary

extract_data

Main data extraction module for converting Excel files to JSONL format.

See: scripts.extract_data module

load_dictionary

Data dictionary processing module with intelligent table detection.

See: scripts.load_dictionary module
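"Intelligent table detection" typically means locating where one table ends and the next begins within a sheet. One plausible heuristic (an assumption for illustration, not necessarily the module's actual algorithm) is to split on fully blank rows:

```python
def split_tables(rows):
    """Split a sheet's rows into tables at fully blank rows (heuristic sketch)."""
    tables, current = [], []
    for row in rows:
        if any(cell not in (None, "") for cell in row):
            current.append(row)      # row has data: extend the current table
        elif current:
            tables.append(current)   # blank row closes the current table
            current = []
    if current:
        tables.append(current)
    return tables

# A sheet with two tables separated by a blank row
sheet = [["ID", "Name"], ["1", "Asha"], [None, None], ["Code"], ["TST"]]
tables = split_tables(sheet)
```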

utils

Utility modules including de-identification and logging.

De-identification (scripts.deidentify)

PHI/PII de-identification module with pseudonymization and encryption.

See: scripts.deidentify module
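Pseudonymization replaces an identifier with a stable token so the same value always maps to the same placeholder. A hash-based sketch of the idea (illustrative only; it is not the engine's actual scheme, which also maintains reversible mappings and optional encryption):

```python
import hashlib

def pseudonym(value: str, salt: str = "demo-salt") -> str:
    """Deterministic pseudonym: identical inputs yield identical tokens."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return f"ID-{digest[:8]}"
```

Because the mapping is deterministic, records belonging to the same person remain linkable after de-identification, while the salt prevents trivial dictionary lookups of common names.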

Logging (scripts.utils.logging)

Centralized logging module with custom SUCCESS level.

Key features:

  • Custom SUCCESS log level

  • Timestamped log files

  • Dual output (console + file)

  • Structured logging

See: scripts.utils.logging module
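For context, a custom SUCCESS level is usually registered with the standard library between INFO (20) and WARNING (30). A standalone sketch (the level number 25 and the helper name are assumptions, not necessarily what scripts.utils.logging uses):

```python
import logging

SUCCESS = 25  # assumed value: between INFO (20) and WARNING (30)
logging.addLevelName(SUCCESS, "SUCCESS")

def success(logger: logging.Logger, msg: str, *args) -> None:
    """Log a message at the custom SUCCESS level."""
    if logger.isEnabledFor(SUCCESS):
        logger.log(SUCCESS, msg, *args)

logging.basicConfig(level=SUCCESS)
success(logging.getLogger("demo"), "Operation completed")
```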

Country Regulations (scripts.utils.country_regulations)

Country-specific data privacy regulations module for compliance.

Key features:

  • Multi-country support (14 countries)

  • Privacy frameworks (PUBLIC to CRITICAL levels)

  • Identifier detection and validation

  • Regulatory requirements management

  • Configuration export/import

See: scripts.utils.country_regulations module
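As an illustration of identifier validation, an Aadhaar number is 12 digits, conventionally grouped 4-4-4. A format-only sketch (the real module may additionally verify the Verhoeff check digit, which is omitted here):

```python
import re

# 12 digits, optionally separated into groups of four by single spaces
AADHAAR_RE = re.compile(r"^\d{4} ?\d{4} ?\d{4}$")

def looks_like_aadhaar(value: str) -> bool:
    """Format check only; does not verify the Verhoeff checksum."""
    return bool(AADHAAR_RE.match(value))
```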

Quick Examples

Data Extraction

from scripts.extract_data import extract_excel_to_jsonl

# Extract all Excel files (reads config.DATASET_DIR, writes config.CLEAN_DATASET_DIR)
result = extract_excel_to_jsonl()

Dictionary Loading

from scripts.load_dictionary import load_study_dictionary
import config

# Load data dictionary
load_study_dictionary(
    file_path=config.DICTIONARY_EXCEL_FILE,
    json_output_dir=config.DICTIONARY_JSON_OUTPUT_DIR
)

De-identification

from scripts.deidentify import DeidentificationEngine

# Initialize engine
engine = DeidentificationEngine()

# De-identify text
original = "Patient John Doe, MRN: 123456, SSN: 123-45-6789"
deidentified = engine.deidentify_text(original)

# Save mappings
engine.save_mappings()

Batch De-identification

from scripts.deidentify import deidentify_dataset

# Process entire dataset (maintains directory structure)
stats = deidentify_dataset(
    input_dir="results/dataset/Indo-vap",
    output_dir="results/deidentified/Indo-vap",
    process_subdirs=True
)

print(f"Processed {stats['texts_processed']} texts")

Single File Processing

from scripts.extract_data import process_excel_file
from pathlib import Path

# Process one file
input_file = Path("data/dataset/Indo-vap/10_TST.xlsx")
output_dir = Path("results/dataset/Indo-vap")

result = process_excel_file(str(input_file), str(output_dir))
print(f"Processed {result['records']} records")

Custom Logging

from scripts.utils import logging as log

# Use custom logger
log.info("Processing started")
log.success("Operation completed successfully")
log.warning("Potential issue detected")
log.error("An error occurred", exc_info=True)

Country-Specific De-identification

from scripts.utils.country_regulations import CountryRegulationManager

# Initialize for India
manager = CountryRegulationManager()
manager.set_country("IN")

# Get identifiers
identifiers = manager.get_identifiers()
print(f"Found {len(identifiers)} identifiers for India")

# Validate Aadhaar number
is_valid = manager.validate_identifier("AADHAAR", "1234 5678 9012")

Module Dependencies

scripts/
β”œβ”€β”€ extract_data.py
β”‚   └── uses: logging, config
β”‚
β”œβ”€β”€ load_dictionary.py
β”‚   └── uses: logging, config
β”‚
β”œβ”€β”€ deidentify.py
β”‚   └── uses: config, country_regulations, cryptography (optional)
β”‚
└── utils/
    β”œβ”€β”€ logging.py
    β”‚   └── uses: config
    └── country_regulations.py
        └── uses: re, json, dataclasses
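The tree above marks cryptography as optional for deidentify.py. A common pattern for such optional dependencies, sketched here with illustrative names (not the module's actual internals), is an import guard so that encryption features degrade gracefully when the package is absent:

```python
# Guarded import: encryption is available only if `cryptography` is installed
try:
    from cryptography.fernet import Fernet  # optional dependency
    HAS_CRYPTO = True
except ImportError:
    Fernet = None
    HAS_CRYPTO = False

def encryption_available() -> bool:
    """Report whether the optional cryptography backend can be used."""
    return HAS_CRYPTO
```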

See Also

  β€’ Usage Guide - Usage examples

  β€’ Architecture - Architecture documentation

  β€’ main module - Main module that orchestrates the scripts