main module

RePORTaLiN Main Pipeline

Central entry point for the clinical data processing pipeline, orchestrating:

  • Data dictionary loading and validation

  • Excel to JSONL extraction with type conversion

  • PHI/PII de-identification with country-specific compliance

This module provides a complete end-to-end pipeline with comprehensive error handling, progress tracking, and flexible configuration via command-line arguments.

Public API

Exports 2 main functions via __all__:

  • main: Main pipeline orchestrator

  • run_step: Pipeline step executor with error handling

Key Features

  • Multi-Step Pipeline: Dictionary → Extraction → De-identification

  • Flexible Execution: Skip individual steps or run complete pipeline

  • Country Compliance: Support for 14 countries (US, IN, ID, BR, etc.)

  • Error Recovery: Comprehensive error handling with detailed logging

  • Version Tracking: Built-in version management

Pipeline Steps

Step 0: Data Dictionary Loading (Optional)

  • Processes Excel data dictionary files

  • Splits multi-table sheets automatically

  • Outputs JSONL format with metadata

Step 1: Data Extraction (Default)

  • Converts Excel files to JSONL format

  • Dual output: original and cleaned versions

  • Type conversion and validation

  • Progress tracking with real-time feedback

Step 2: De-identification (Opt-in)

  • PHI/PII detection and pseudonymization

  • Country-specific regulations (HIPAA, GDPR, DPDPA, etc.)

  • Encrypted mapping storage

  • Date shifting with interval preservation
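To make "date shifting with interval preservation" concrete, here is a minimal sketch of one common way to do it. It is not taken from scripts.deidentify: the function names, the keyed-hash offset, and the 365-day window are all assumptions chosen for illustration. The idea is that every date belonging to one subject moves by the same offset, so the gaps between that subject's events stay unchanged.

```python
import hashlib
import hmac
from datetime import date, timedelta


def offset_for(subject_id: str, secret: str, max_days: int = 365) -> timedelta:
    """Derive a stable per-subject offset in [-max_days, +max_days].

    Hypothetical helper: a keyed hash makes the offset repeatable for the
    same subject and secret without storing it anywhere.
    """
    digest = hmac.new(secret.encode("utf-8"), subject_id.encode("utf-8"),
                      hashlib.sha256).digest()
    span = 2 * max_days + 1
    return timedelta(days=int.from_bytes(digest[:8], "big") % span - max_days)


def shift_dates(dates, subject_id: str, secret: str):
    """Shift all dates of one subject by the same offset."""
    offset = offset_for(subject_id, secret)
    return [d + offset for d in dates]


visits = [date(2023, 1, 10), date(2023, 1, 24), date(2023, 3, 1)]
shifted = shift_dates(visits, "SUBJ-001", "demo-secret")

# The gaps between consecutive visits are identical before and after shifting.
original_gaps = [(b - a).days for a, b in zip(visits, visits[1:])]
shifted_gaps = [(b - a).days for a, b in zip(shifted, shifted[1:])]
```

Because the offset is constant within a subject, interval-based analyses (for example, days between visits) give the same result on the shifted data.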

Error Handling

The pipeline uses comprehensive error handling:

  1. Step-level Errors: Each step is wrapped in try/except

  2. Validation Errors: Invalid results cause immediate exit

  3. Logging: All errors logged with full stack traces

  4. Exit Codes: Non-zero exit on any failure

Return Codes:

  • 0: Success

  • 1: Pipeline failure (any step)
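The step-level behaviour described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual source of run_step: the name run_step_sketch and the logger name are hypothetical, and the sketch covers only the exception path (it does not attempt the result validation mentioned in item 2, whose rules are not documented here).

```python
import logging
import sys
from typing import Any, Callable

log = logging.getLogger("pipeline_sketch")  # hypothetical logger name


def run_step_sketch(step_name: str, func: Callable[[], Any]) -> Any:
    """Run one step; log and exit with code 1 if it raises."""
    log.info("Starting step: %s", step_name)
    try:
        result = func()
    except Exception:
        # log.exception records the message at ERROR level with the traceback.
        log.exception("Step failed: %s", step_name)
        sys.exit(1)
    log.info("Finished step: %s", step_name)
    return result
```

Since sys.exit raises SystemExit, a caller embedding the pipeline in a larger program can catch that exception and inspect its code attribute instead of letting the process terminate.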

See also

User Documentation:

  • user_guide/quickstart - Quick start guide with basic examples

  • user_guide/usage - Advanced usage patterns and workflows

  • user_guide/configuration - Configuration and command-line options

  • developer_guide/architecture - Technical architecture details

API Reference:

main.main()

Main pipeline orchestrating dictionary loading, data extraction, and de-identification.

Return type:

None

Command-line Arguments:

  • --skip-dictionary: Skip data dictionary loading

  • --skip-extraction: Skip data extraction

  • --enable-deidentification: Enable de-identification (disabled by default)

  • --skip-deidentification: Skip de-identification even if enabled

  • --no-encryption: Disable encryption for de-identification mappings

  • -c, --countries: Country codes (e.g., IN US ID BR) or ALL

  • -v, --verbose: Enable verbose (DEBUG level) logging
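For readers who want to see how these flags fit together, the following sketch declares them with argparse. Whether main.py actually uses argparse, and the exact help strings and defaults, are assumptions; only the flag names come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--skip-dictionary", action="store_true",
                        help="Skip data dictionary loading")
    parser.add_argument("--skip-extraction", action="store_true",
                        help="Skip data extraction")
    parser.add_argument("--enable-deidentification", action="store_true",
                        help="Enable de-identification (disabled by default)")
    parser.add_argument("--skip-deidentification", action="store_true",
                        help="Skip de-identification even if enabled")
    parser.add_argument("--no-encryption", action="store_true",
                        help="Disable encryption for de-identification mappings")
    parser.add_argument("-c", "--countries", nargs="+", metavar="CODE",
                        help="Country codes (e.g., IN US ID BR) or ALL")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable verbose (DEBUG level) logging")
    return parser


# Parse an explicit argument list instead of sys.argv to keep the example isolated.
args = build_parser().parse_args(["--enable-deidentification", "-c", "IN", "US", "-v"])
```

argparse converts dashes to underscores, so the parsed values are available as args.enable_deidentification, args.countries, and so on.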

main.run_step(step_name, func)

Execute pipeline step with error handling and logging.

Parameters:
  • step_name (str) – Name of the pipeline step

  • func (Callable[[], Any]) – Callable function to execute

Return type:

Any

Returns:

Result from the function, or exits with code 1 on error

Overview

The main module serves as the central entry point for the RePORTaLiN pipeline. It orchestrates the execution of data processing steps and provides command-line interface functionality.

Enhanced in v0.0.12:

  • ✅ Added -v / --verbose flag for DEBUG-level logging

  • ✅ Enhanced logging throughout pipeline for detailed troubleshooting

  • ✅ Version updated to 0.0.12

Enhanced in v0.0.11:

  • ✅ Enhanced module docstring with comprehensive usage examples (162 lines, 2,214% increase)

  • ✅ Added explicit public API definition via __all__ (2 exports)

  • ✅ Complete command-line arguments documentation

  • ✅ Pipeline steps explanation with detailed features

  • ✅ Four usage examples (basic, custom, de-identification, advanced)

  • ✅ Output structure with directory tree

  • ✅ Error handling and return codes documented

Public API

The module exports 2 functions via __all__:

  1. main - Main pipeline orchestrator with command-line interface

  2. run_step - Pipeline step executor with error handling

Quick Start:

# Run complete pipeline
python3 main.py

# With de-identification
python3 main.py --enable-deidentification --countries IN US

# Custom execution
python3 main.py --skip-dictionary --enable-deidentification

Functions

run_step

main.run_step(step_name, func)

Execute a pipeline step with comprehensive error handling and logging.

Example:

from main import run_step

def my_processing_step():
    print("Processing...")
    return True

result = run_step("My Step", my_processing_step)

main

main.main()

Main entry point for the pipeline.

Command-line usage:

# Run full pipeline
python main.py

# Skip dictionary loading
python main.py --skip-dictionary

# Skip data extraction
python main.py --skip-extraction

# Enable de-identification with encryption (default)
python main.py --enable-deidentification

# Enable de-identification without encryption (testing only)
python main.py --enable-deidentification --no-encryption

# Skip de-identification step
python main.py --skip-deidentification

# Specify country codes for de-identification
python main.py --enable-deidentification -c IN US BR
python main.py --enable-deidentification --countries ALL

# Show version
python main.py --version

Programmatic usage:

# Import and run
import main
main.main()

Pipeline Steps

The main function executes these steps in order:

  1. Step 0: Load Data Dictionary

    Processes the Excel-based data dictionary using scripts.load_dictionary.load_study_dictionary(). Can be skipped with --skip-dictionary.

  2. Step 1: Extract Raw Data

    Extracts data from Excel files using scripts.extract_data.extract_excel_to_jsonl(). Can be skipped with --skip-extraction.

  3. Step 2: De-identify Data (Optional)

    De-identifies PHI/PII from extracted data using scripts.deidentify.deidentify_dataset(). Must be explicitly enabled with --enable-deidentification.

    • Encryption enabled by default for security

    • Can disable encryption with --no-encryption (testing only)

    • Can specify country codes with -c or --countries

    • Can be skipped with --skip-deidentification
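The way the flags select steps can be summarised in a few lines. The sketch below is an assumption about the control flow, not the module's actual code: plan_steps is a hypothetical helper, and the attribute names simply mirror the command-line flags.

```python
from types import SimpleNamespace


def plan_steps(args) -> list:
    """Return the names of the steps that would run for the given flags."""
    steps = []
    if not args.skip_dictionary:
        steps.append("Step 0: Load Data Dictionary")
    if not args.skip_extraction:
        steps.append("Step 1: Extract Raw Data")
    # De-identification is opt-in, and the skip flag overrides the enable flag.
    if args.enable_deidentification and not args.skip_deidentification:
        steps.append("Step 2: De-identify Data")
    return steps


# Defaults: no flags given, so only Steps 0 and 1 are planned.
default_args = SimpleNamespace(
    skip_dictionary=False,
    skip_extraction=False,
    enable_deidentification=False,
    skip_deidentification=False,
)
default_plan = plan_steps(default_args)
```

In a real run each planned step would then be passed to run_step together with the corresponding function from the scripts package.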

Error Handling

All steps are wrapped with error handling:

  • Exceptions are caught and logged

  • Detailed error messages with traceback

  • Program exits with code 1 on error

  • Ensures clean shutdown

Logging

The module uses centralized logging:

  • Step execution logged at INFO level

  • Success logged at SUCCESS level (custom)

  • Errors logged at ERROR level with traceback

  • All logs written to timestamped file
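A custom SUCCESS level and a timestamped log file can be set up with the standard logging module as sketched below. The numeric level (25), the file name pattern, and the logger name are assumptions for illustration; the project's own implementation lives in scripts.utils.logging and may differ.

```python
import logging
import tempfile
from datetime import datetime
from pathlib import Path

SUCCESS = 25  # assumed value, placed between INFO (20) and WARNING (30)
logging.addLevelName(SUCCESS, "SUCCESS")


def make_logger(log_dir: Path, verbose: bool = False) -> logging.Logger:
    """Create a logger that writes to a timestamped file in log_dir."""
    log_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    logger = logging.getLogger("reportalin_sketch")  # hypothetical name
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    handler = logging.FileHandler(log_dir / f"pipeline_{stamp}.log", encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


log_dir = Path(tempfile.mkdtemp())
logger = make_logger(log_dir)
logger.info("Starting step: Extract Raw Data")
logger.log(SUCCESS, "Step completed: Extract Raw Data")
```

Passing verbose=True lowers the threshold to DEBUG, which corresponds to the behaviour described for the -v / --verbose flag.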

See Also

config

Configuration management

scripts.extract_data

Data extraction functionality

scripts.load_dictionary

Dictionary loading functionality

scripts.deidentify

De-identification utilities

scripts.utils.logging

Logging utilities