main module

RePORTaLiN Main Pipeline

Central entry point for the clinical data processing pipeline, orchestrating:

- Data dictionary loading and validation
- Excel to JSONL extraction with type conversion
- PHI/PII de-identification with country-specific compliance

This module provides a complete end-to-end pipeline with comprehensive error handling, progress tracking, and flexible configuration via command-line arguments.

Public API

Exports 2 main functions via __all__:

main: Main pipeline orchestrator
run_step: Pipeline step executor with error handling

Key Features

Multi-Step Pipeline: Dictionary → Extraction → De-identification
Flexible Execution: Skip individual steps or run complete pipeline
Country Compliance: Support for 14 countries (US, IN, ID, BR, etc.)
Error Recovery: Comprehensive error handling with detailed logging
Version Tracking: Built-in version management

Pipeline Steps

Step 0: Data Dictionary Loading (Optional)

- Processes Excel data dictionary files
- Splits multi-table sheets automatically
- Outputs JSONL format with metadata
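The source does not say how multi-table sheets are split. As a minimal sketch of one plausible approach, the code below assumes that tables on a sheet are separated by fully blank rows. The function name split_tables and the sample data are hypothetical and are not part of scripts.load_dictionary.

```python
def split_tables(rows):
    """Group consecutive non-empty rows into tables, using blank rows as separators."""
    tables, current = [], []
    for row in rows:
        # A row counts as blank when every cell is None or whitespace only.
        if all(cell is None or str(cell).strip() == "" for cell in row):
            if current:
                tables.append(current)
                current = []
        else:
            current.append(row)
    if current:
        tables.append(current)
    return tables


# Hypothetical sheet content: two tables separated by two blank rows.
sheet = [
    ["Variable", "Type"],
    ["age", "int"],
    [None, None],
    ["", ""],
    ["Code", "Label"],
    ["1", "Yes"],
    ["2", "No"],
]
tables = split_tables(sheet)
```

Each resulting table keeps its own header row, so it could then be written out as a separate JSONL file.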

Step 1: Data Extraction (Default)

- Converts Excel files to JSONL format
- Dual output: original and cleaned versions
- Type conversion and validation
- Progress tracking with real-time feedback
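To illustrate what type conversion to JSONL can look like, here is a minimal sketch that assumes the spreadsheet rows have already been read into dictionaries. The helpers clean_value and rows_to_jsonl are hypothetical, the conversion rules are assumptions, and only a cleaned output is produced. The actual extraction is performed by scripts.extract_data.extract_excel_to_jsonl(), whose internals are not shown in this document.

```python
import io
import json
import math
from datetime import date, datetime


def clean_value(value):
    """Convert a spreadsheet cell into a JSON-serializable value (hypothetical helper)."""
    if value is None:
        return None
    if isinstance(value, float):
        if math.isnan(value):
            return None          # empty numeric cells often arrive as NaN
        if value.is_integer():
            return int(value)    # 42.0 becomes 42
        return value
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, str):
        stripped = value.strip()
        return stripped or None  # blank strings become null
    return value


def rows_to_jsonl(rows, stream):
    """Write one JSON object per line and return the number of records written."""
    count = 0
    for row in rows:
        record = {key: clean_value(val) for key, val in row.items()}
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


# Hypothetical rows, as they might look after reading an Excel sheet.
rows = [
    {"subject_id": " S001 ", "age": 42.0, "visit": date(2024, 1, 15)},
    {"subject_id": "S002", "age": float("nan"), "visit": None},
]
buffer = io.StringIO()
written = rows_to_jsonl(rows, buffer)
```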

Step 2: De-identification (Opt-in)

- PHI/PII detection and pseudonymization
- Country-specific regulations (HIPAA, GDPR, DPDPA, etc.)
- Encrypted mapping storage
- Date shifting with interval preservation
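Date shifting with interval preservation means that all dates belonging to one subject move by the same offset, so the gaps between them are unchanged. The sketch below shows the idea under stated assumptions: the offset is derived from a keyed hash of the subject ID, and the names shift_for_subject and shift_dates, the secret, and the 365-day range are all hypothetical. The document does not describe how scripts.deidentify chooses its offsets.

```python
import hashlib
import hmac
from datetime import date, timedelta


def shift_for_subject(subject_id, secret, max_days=365):
    """Derive a stable per-subject offset in days from a keyed hash (assumed scheme)."""
    digest = hmac.new(secret, subject_id.encode("utf-8"), hashlib.sha256).digest()
    # Map the first four digest bytes onto the range [-max_days, +max_days].
    offset = int.from_bytes(digest[:4], "big") % (2 * max_days + 1) - max_days
    return timedelta(days=offset)


def shift_dates(dates, subject_id, secret):
    """Shift every date of one subject by the same offset."""
    delta = shift_for_subject(subject_id, secret)
    return [d + delta for d in dates]


# Hypothetical visit dates for one subject.
visits = [date(2024, 1, 10), date(2024, 2, 9), date(2024, 5, 1)]
shifted = shift_dates(visits, "S001", b"demo-secret")

# Intervals between consecutive visits, before and after shifting.
original_gaps = [b - a for a, b in zip(visits, visits[1:])]
shifted_gaps = [b - a for a, b in zip(shifted, shifted[1:])]
```

Because the offset depends only on the subject ID and the secret, repeated runs shift the same subject consistently, and original_gaps equals shifted_gaps.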

Error Handling

The pipeline uses comprehensive error handling:

Step-level Errors: Each step is wrapped in try/except
Validation Errors: Invalid results cause immediate exit
Logging: All errors logged with full stack traces
Exit Codes: Non-zero exit on any failure

Return Codes:

- 0: Success
- 1: Pipeline failure (any step)
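The behavior described above (wrap each step in try/except, log the traceback, exit with code 1) can be sketched as follows. This is an illustration only, not the source of run_step: the name run_step_sketch and the logger name are hypothetical, and treating a return value of False as an "invalid result" is an assumption.

```python
import logging
import sys

log = logging.getLogger("pipeline_sketch")


def run_step_sketch(step_name, func):
    """Run func(); log and exit with code 1 on an exception or a result of False."""
    log.info("Starting: %s", step_name)
    try:
        result = func()
    except Exception:
        # log.exception records the message together with the full stack trace.
        log.exception("Step failed: %s", step_name)
        sys.exit(1)
    if result is False:  # assumed definition of an invalid result
        log.error("Step returned an invalid result: %s", step_name)
        sys.exit(1)
    log.info("Finished: %s", step_name)
    return result


# A successful step returns its result unchanged.
value = run_step_sketch("Working step", lambda: 42)

# A failing step raises SystemExit; catching it here exposes the exit code.
try:
    run_step_sketch("Failing step", lambda: 1 / 0)
except SystemExit as exc:
    exit_code = exc.code
```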

Overview

The main module serves as the central entry point for the RePORTaLiN pipeline. It orchestrates the execution of data processing steps and provides command-line interface functionality.

Enhanced in v0.0.12:

✅ Added -v / --verbose flag for DEBUG-level logging
✅ Enhanced logging throughout pipeline for detailed troubleshooting
✅ Version updated to 0.0.12

Enhanced in v0.0.11:

✅ Enhanced module docstring with comprehensive usage examples (162 lines, 2,214% increase)
✅ Added explicit public API definition via __all__ (2 exports)
✅ Complete command-line arguments documentation
✅ Pipeline steps explanation with detailed features
✅ Four usage examples (basic, custom, de-identification, advanced)
✅ Output structure with directory tree
✅ Error handling and return codes documented

Public API

The module exports 2 functions via __all__:

main - Main pipeline orchestrator with command-line interface
run_step - Pipeline step executor with error handling

Quick Start:

# Run complete pipeline
python3 main.py

# With de-identification
python3 main.py --enable-deidentification --countries IN US

# Custom execution
python3 main.py --skip-dictionary --enable-deidentification

Functions

run_step

main.run_step(step_name, func)

Execute pipeline step with error handling and logging.

Parameters:

step_name (str) – Name of the pipeline step
func (Callable[[], Any]) – Callable function to execute

Return type:

Any

Returns:

Result from the function, or exits with code 1 on error

Example:

from main import run_step

def my_processing_step():
    print("Processing...")
    return True

result = run_step("My Step", my_processing_step)

main

main.main()

Main pipeline orchestrating dictionary loading, data extraction, and de-identification.

Return type:

None

Command-line Arguments:

--skip-dictionary: Skip data dictionary loading
--skip-extraction: Skip data extraction
--enable-deidentification: Enable de-identification (disabled by default)
--skip-deidentification: Skip de-identification even if enabled
--no-encryption: Disable encryption for de-identification mappings
-c, --countries: Country codes (e.g., IN US ID BR) or ALL
-v, --verbose: Enable verbose (DEBUG level) logging
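As a rough illustration of how these flags could be declared, the sketch below builds an argparse parser with the same option names. The flag names, help texts, and version number come from this page. The use of argparse, the function name build_parser, and the store_true / nargs="+" choices are assumptions about the implementation.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(prog="main.py", description="RePORTaLiN pipeline (sketch)")
    parser.add_argument("--skip-dictionary", action="store_true",
                        help="Skip data dictionary loading")
    parser.add_argument("--skip-extraction", action="store_true",
                        help="Skip data extraction")
    parser.add_argument("--enable-deidentification", action="store_true",
                        help="Enable de-identification (disabled by default)")
    parser.add_argument("--skip-deidentification", action="store_true",
                        help="Skip de-identification even if enabled")
    parser.add_argument("--no-encryption", action="store_true",
                        help="Disable encryption for de-identification mappings")
    parser.add_argument("-c", "--countries", nargs="+", metavar="CODE",
                        help="Country codes (e.g., IN US ID BR) or ALL")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable verbose (DEBUG level) logging")
    parser.add_argument("--version", action="version", version="%(prog)s 0.0.12")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is reproducible.
args = build_parser().parse_args(
    ["--skip-dictionary", "--enable-deidentification", "-c", "IN", "US"]
)
```

argparse converts the dashes in option names to underscores, so the parsed values are available as args.skip_dictionary, args.enable_deidentification, and args.countries.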

Main entry point for the pipeline.

Command-line usage:

# Run full pipeline
python main.py

# Skip dictionary loading
python main.py --skip-dictionary

# Skip data extraction
python main.py --skip-extraction

# Enable de-identification with encryption (default)
python main.py --enable-deidentification

# Enable de-identification without encryption (testing only)
python main.py --enable-deidentification --no-encryption

# Skip de-identification step
python main.py --skip-deidentification

# Specify country codes for de-identification
python main.py --enable-deidentification -c IN US BR
python main.py --enable-deidentification --countries ALL

# Show version
python main.py --version

Programmatic usage:

# Import and run
import main
main.main()

Pipeline Steps

The main function executes these steps in order:

Step 0: Load Data Dictionary

Processes the Excel-based data dictionary using scripts.load_dictionary.load_study_dictionary(). Can be skipped with --skip-dictionary.

Step 1: Extract Raw Data

Extracts data from Excel files using scripts.extract_data.extract_excel_to_jsonl(). Can be skipped with --skip-extraction.

Step 2: De-identify Data (Optional)

De-identifies PHI/PII from extracted data using scripts.deidentify.deidentify_dataset(). Must be explicitly enabled with --enable-deidentification.

- Encryption enabled by default for security
- Can disable encryption with --no-encryption (testing only)
- Can specify country codes with -c or --countries
- Can be skipped with --skip-deidentification
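The step selection described above can be summarized in a few lines. The sketch below only computes which steps would run for a given set of flags; it does not call the real scripts.* functions. The function name plan_steps and the helper flags are hypothetical, while the rule that --skip-deidentification overrides --enable-deidentification follows the argument description on this page.

```python
from types import SimpleNamespace


def plan_steps(args):
    """Return the names of the steps that would run, in order, for the given flags."""
    steps = []
    if not args.skip_dictionary:
        steps.append("Step 0: Load Data Dictionary")
    if not args.skip_extraction:
        steps.append("Step 1: Extract Raw Data")
    # De-identification is opt-in and can still be vetoed by --skip-deidentification.
    if args.enable_deidentification and not args.skip_deidentification:
        steps.append("Step 2: De-identify Data")
    return steps


def flags(**overrides):
    """Build a namespace with the documented defaults, then apply overrides."""
    defaults = dict(
        skip_dictionary=False,
        skip_extraction=False,
        enable_deidentification=False,
        skip_deidentification=False,
    )
    defaults.update(overrides)
    return SimpleNamespace(**defaults)


default_run = plan_steps(flags())
full_run = plan_steps(flags(enable_deidentification=True))
vetoed_run = plan_steps(flags(enable_deidentification=True, skip_deidentification=True))
```

With no flags, only Steps 0 and 1 are planned. Step 2 appears only when de-identification is enabled and not skipped.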

Error Handling

All steps are wrapped with error handling:

Exceptions are caught and logged
Detailed error messages with traceback
Program exits with code 1 on error
Ensures clean shutdown

Logging

The module uses centralized logging:

Step execution logged at INFO level
Success logged at SUCCESS level (custom)
Errors logged at ERROR level with traceback
All logs written to timestamped file
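A custom SUCCESS level can be registered with the standard logging module, as in the sketch below. The numeric value 25 (between INFO and WARNING), the logger name, and the message format are assumptions. The project's centralized logging module is not shown in this document, and the timestamped file handler is replaced here by an in-memory stream to keep the example self-contained.

```python
import io
import logging

SUCCESS = 25  # assumed value, between INFO (20) and WARNING (30)
logging.addLevelName(SUCCESS, "SUCCESS")

# Capture output in memory; a real setup would attach a FileHandler instead.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))

log = logging.getLogger("reportalin_sketch")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False  # keep the demo output out of the root logger

log.info("Starting: Extract Raw Data")
log.log(SUCCESS, "Extract Raw Data completed")
log.debug("Only visible when the level is DEBUG")

captured = stream.getvalue().splitlines()
```

Because the logger level is INFO, the DEBUG message is dropped. Setting the level to logging.DEBUG would correspond to what the -v / --verbose flag is described as doing.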
