scripts package

RePORTaLiN Scripts Package

Core data processing modules for clinical research data extraction, validation, and de-identification.

This package provides high-level functions for the complete data processing pipeline:

  • Data Dictionary Processing: Load and validate study data dictionaries

  • Data Extraction: Convert Excel files to JSONL format with validation

  • De-identification: Advanced PHI/PII detection and pseudonymization (via utils)

Public API

The package exports 2 main high-level functions via __all__:

  • load_study_dictionary: Process data dictionary Excel files

  • extract_excel_to_jsonl: Extract dataset Excel files to JSONL

For more specialized functionality, import directly from submodules:

  • scripts.load_dictionary: Dictionary processing (2 public functions)

  • scripts.extract_data: Data extraction (6 public functions)

  • scripts.deidentify: De-identification engine (10 public functions)

  • scripts.utils.country_regulations: Privacy regulations (6 public functions)

  • scripts.utils.logging: Enhanced logging (12 public functions)

Usage Examples

Basic Pipeline

Process both data dictionary and dataset with default configuration:

from scripts import load_study_dictionary, extract_excel_to_jsonl

# Step 1: Load data dictionary
dict_success = load_study_dictionary()

# Step 2: Extract dataset (uses config.DATASET_DIR and config.CLEAN_DATASET_DIR)
result = extract_excel_to_jsonl()

if dict_success and result['files_created'] > 0:
    print("βœ“ Pipeline completed successfully!")

Custom Processing

Use individual modules for custom workflows:

from scripts.load_dictionary import process_excel_file
from scripts.extract_data import find_excel_files, process_excel_file as process_data

# Custom dictionary processing
process_excel_file(
    excel_path="custom_dict.xlsx",
    output_dir="results/custom_dict"
)

# Custom data extraction with file discovery
excel_files = find_excel_files("data/custom_dataset")
for file_path in excel_files:
    process_data(
        excel_path=file_path,
        output_dir="results/custom_output"
    )

De-identification Workflow

Complete pipeline with de-identification:

from scripts import extract_excel_to_jsonl
from scripts.deidentify import deidentify_dataset, DeidentificationConfig
import config

# Step 1: Extract data (uses config.DATASET_DIR and config.CLEAN_DATASET_DIR)
result = extract_excel_to_jsonl()

# Step 2: De-identify with custom configuration
deidentify_config = DeidentificationConfig(
    countries=['IN', 'US'],
    enable_encryption=True
)

deidentify_dataset(
    input_dir=f"{config.CLEAN_DATASET_DIR}/cleaned",
    output_dir="results/deidentified/Indo-vap",
    config=deidentify_config
)

Module Structure

The package is organized as follows:

scripts/
β”œβ”€β”€ __init__.py              # Package API (this file)
β”œβ”€β”€ load_dictionary.py       # Data dictionary processing
β”œβ”€β”€ extract_data.py          # Excel to JSONL extraction
β”œβ”€β”€ deidentify.py            # De-identification engine
└── utils/                   # Utility modules
    β”œβ”€β”€ __init__.py
    β”œβ”€β”€ country_regulations.py  # Privacy compliance
    └── logging.py           # Enhanced logging

Version History

  • v0.0.9: Enhanced package-level API with comprehensive documentation

  • v0.0.8: Enhanced load_dictionary module (public API, type hints, docs)

  • v0.0.7: Enhanced extract_data module (public API, type hints, docs)

  • v0.0.6: Enhanced deidentify module (public API, type safety, docs)

  • v0.0.5: Enhanced country_regulations module (public API, docs)

  • v0.0.4: Enhanced logging module (performance, type hints)

  • v0.0.3: Enhanced config module (utilities, robustness)

  • v0.0.1: Initial package structure

See also

  β€’ mod:scripts.load_dictionary - Data dictionary processing

  β€’ mod:scripts.extract_data - Data extraction

  β€’ mod:scripts.deidentify - De-identification

  β€’ mod:scripts.utils.country_regulations - Privacy regulations

  β€’ mod:scripts.utils.logging - Logging utilities

scripts.extract_excel_to_jsonl()[source]

Extract all Excel files from the dataset directory, creating original and cleaned JSONL versions.

Return type: Dict[str, Any]

Returns: Dictionary with extraction statistics
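For orientation, JSONL means one JSON object per line: each record in a sheet becomes one line of the output file. A minimal, self-contained sketch of the format (the field names here are illustrative, not the study's actual columns):

```python
import json

# Two illustrative records, as they might appear after extraction
records = [
    {"SUBJID": "001", "VISIT": "baseline"},
    {"SUBJID": "002", "VISIT": "month_2"},
]

# JSONL: serialize each record on its own line
jsonl_text = "\n".join(json.dumps(r) for r in records)

# Reading it back recovers the original records line by line
parsed = [json.loads(line) for line in jsonl_text.splitlines()]
```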

scripts.load_study_dictionary(file_path=None, json_output_dir=None, preserve_na=True)[source]

Load and process the study data dictionary from Excel to JSONL format.

Parameters:

  β€’ file_path (Optional[str]) – Path to the Excel file (defaults to config.DICTIONARY_EXCEL_FILE)

  β€’ json_output_dir (Optional[str]) – Output directory (defaults to config.DICTIONARY_JSON_OUTPUT_DIR)

  β€’ preserve_na (bool) – If True, preserve empty cells as None; if False, use pandas defaults

Return type: bool

Returns: True if processing was successful, False otherwise

Overview

The scripts package contains the core processing modules for RePORTaLiN.

Enhanced in v0.0.9:

  • βœ… Enhanced package-level documentation with comprehensive usage examples

  • βœ… Clear public API definition (2 high-level functions)

  • βœ… Integration examples for complete data processing pipeline

  • βœ… De-identification workflow documentation

  • βœ… Module structure and cross-references

Package-Level Public API

The package exports 2 high-level functions for the main processing pipeline:

  1. load_study_dictionary - Process data dictionary Excel files

  2. extract_excel_to_jsonl - Extract dataset Excel files to JSONL

Quick Start:

from scripts import load_study_dictionary, extract_excel_to_jsonl

# Step 1: Load data dictionary
dict_success = load_study_dictionary()

# Step 2: Extract dataset (uses config.DATASET_DIR and config.CLEAN_DATASET_DIR)
result = extract_excel_to_jsonl()
extract_success = result['files_created'] > 0

For specialized functionality, import directly from submodules:

  • scripts.load_dictionary - 2 public functions

  • scripts.extract_data - 6 public functions

  • scripts.deidentify - 10 public functions

  • scripts.utils.country_regulations - 6 public functions

  • scripts.utils.logging - 12 public functions

Module Organization

scripts/
β”œβ”€β”€ __init__.py              # Package API (2 exports)
β”œβ”€β”€ load_dictionary.py       # Data dictionary (2 exports)
β”œβ”€β”€ extract_data.py          # Data extraction (6 exports)
β”œβ”€β”€ deidentify.py            # De-identification (10 exports)
└── utils/
    β”œβ”€β”€ country_regulations.py  # Privacy rules (6 exports)
    └── logging.py           # Logging (12 exports)

Submodules

Module Summary

extract_data

Main data extraction module for converting Excel files to JSONL format.

See: scripts.extract_data module

load_dictionary

Data dictionary processing module with intelligent table detection.

See: scripts.load_dictionary module
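"Intelligent table detection" typically means locating where one table ends and the next begins within a sheet. One plausible heuristic (an assumption for illustration, not necessarily the module's actual algorithm) is to split on fully blank rows:

```python
def split_tables(rows):
    """Split a sheet's rows into tables at fully blank rows (heuristic sketch)."""
    tables, current = [], []
    for row in rows:
        if any(cell not in (None, "") for cell in row):
            current.append(row)      # row has data: extend the current table
        elif current:
            tables.append(current)   # blank row closes the current table
            current = []
    if current:
        tables.append(current)
    return tables

# A sheet with two tables separated by a blank row
sheet = [["ID", "Name"], ["1", "Asha"], [None, None], ["Code"], ["TST"]]
tables = split_tables(sheet)
```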

utils

Utility modules including de-identification and logging.

De-identification (scripts.deidentify)

PHI/PII de-identification module with pseudonymization and encryption.

See: scripts.deidentify module
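Pseudonymization replaces an identifier with a stable token so the same value always maps to the same placeholder. A hash-based sketch of the idea (illustrative only; it is not the engine's actual scheme, which also maintains reversible mappings and optional encryption):

```python
import hashlib

def pseudonym(value: str, salt: str = "demo-salt") -> str:
    """Deterministic pseudonym: identical inputs yield identical tokens."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return f"ID-{digest[:8]}"
```

Because the mapping is deterministic, records belonging to the same person remain linkable after de-identification, while the salt prevents trivial dictionary lookups of common names.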

Logging (scripts.utils.logging)

Centralized logging module with custom SUCCESS level.

Key features:

  • Custom SUCCESS log level

  • Timestamped log files

  • Dual output (console + file)

  • Structured logging

See: scripts.utils.logging module
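For context, a custom SUCCESS level is usually registered with the standard library between INFO (20) and WARNING (30). A standalone sketch (the level number 25 and the helper name are assumptions, not necessarily what scripts.utils.logging uses):

```python
import logging

SUCCESS = 25  # assumed value: between INFO (20) and WARNING (30)
logging.addLevelName(SUCCESS, "SUCCESS")

def success(logger: logging.Logger, msg: str, *args) -> None:
    """Log a message at the custom SUCCESS level."""
    if logger.isEnabledFor(SUCCESS):
        logger.log(SUCCESS, msg, *args)

logging.basicConfig(level=SUCCESS)
success(logging.getLogger("demo"), "Operation completed")
```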

Country Regulations (scripts.utils.country_regulations)

Country-specific data privacy regulations module for compliance.

Key features:

  • Multi-country support (14 countries)

  • Privacy frameworks (PUBLIC to CRITICAL levels)

  • Identifier detection and validation

  • Regulatory requirements management

  • Configuration export/import

See: scripts.utils.country_regulations module
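As an illustration of identifier validation, an Aadhaar number is 12 digits, conventionally grouped 4-4-4. A format-only sketch (the real module may additionally verify the Verhoeff check digit, which is omitted here):

```python
import re

# 12 digits, optionally separated into groups of four by single spaces
AADHAAR_RE = re.compile(r"^\d{4} ?\d{4} ?\d{4}$")

def looks_like_aadhaar(value: str) -> bool:
    """Format check only; does not verify the Verhoeff checksum."""
    return bool(AADHAAR_RE.match(value))
```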

Quick Examples

Data Extraction

from scripts.extract_data import extract_excel_to_jsonl

# Extract all Excel files (reads config.DATASET_DIR, writes config.CLEAN_DATASET_DIR)
result = extract_excel_to_jsonl()

Dictionary Loading

from scripts.load_dictionary import load_study_dictionary
import config

# Load data dictionary
load_study_dictionary(
    file_path=config.DICTIONARY_EXCEL_FILE,
    json_output_dir=config.DICTIONARY_JSON_OUTPUT_DIR
)

De-identification

from scripts.deidentify import DeidentificationEngine

# Initialize engine
engine = DeidentificationEngine()

# De-identify text
original = "Patient John Doe, MRN: 123456, SSN: 123-45-6789"
deidentified = engine.deidentify_text(original)

# Save mappings
engine.save_mappings()

Batch De-identification

from scripts.deidentify import deidentify_dataset

# Process entire dataset (maintains directory structure)
stats = deidentify_dataset(
    input_dir="results/dataset/Indo-vap",
    output_dir="results/deidentified/Indo-vap",
    process_subdirs=True
)

print(f"Processed {stats['texts_processed']} texts")

Single File Processing

from scripts.extract_data import process_excel_file
from pathlib import Path

# Process one file
input_file = Path("data/dataset/Indo-vap/10_TST.xlsx")
output_dir = Path("results/dataset/Indo-vap")

result = process_excel_file(str(input_file), str(output_dir))
print(f"Processed {result['records']} records")

Custom Logging

from scripts.utils import logging as log

# Use custom logger
log.info("Processing started")
log.success("Operation completed successfully")
log.warning("Potential issue detected")
log.error("An error occurred", exc_info=True)

Country-Specific De-identification

from scripts.utils.country_regulations import CountryRegulationManager

# Initialize for India
manager = CountryRegulationManager()
manager.set_country("IN")

# Get identifiers
identifiers = manager.get_identifiers()
print(f"Found {len(identifiers)} identifiers for India")

# Validate Aadhaar number
is_valid = manager.validate_identifier("AADHAAR", "1234 5678 9012")

Module Dependencies

scripts/
β”œβ”€β”€ extract_data.py
β”‚   └── uses: logging, config
β”‚
β”œβ”€β”€ load_dictionary.py
β”‚   └── uses: logging, config
β”‚
β”œβ”€β”€ deidentify.py
β”‚   └── uses: config, country_regulations, cryptography (optional)
β”‚
└── utils/
    β”œβ”€β”€ logging.py
    β”‚   └── uses: config
    └── country_regulations.py
        └── uses: re, json, dataclasses
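The tree above marks cryptography as optional for deidentify.py. A common pattern for such optional dependencies, sketched here with illustrative names (not the module's actual internals), is an import guard so that encryption features degrade gracefully when the package is absent:

```python
# Guarded import: encryption is available only if `cryptography` is installed
try:
    from cryptography.fernet import Fernet  # optional dependency
    HAS_CRYPTO = True
except ImportError:
    Fernet = None
    HAS_CRYPTO = False

def encryption_available() -> bool:
    """Report whether the optional cryptography backend can be used."""
    return HAS_CRYPTO
```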

See Also

  β€’ Usage Guide - Usage examples

  β€’ Architecture - Architecture documentation

  β€’ main module - Main module that orchestrates the scripts