main module
RePORTaLiN Main Pipeline
Central entry point for the clinical data processing pipeline, orchestrating:
- Data dictionary loading and validation
- Excel to JSONL extraction with type conversion
- PHI/PII de-identification with country-specific compliance
This module provides a complete end-to-end pipeline with comprehensive error handling, progress tracking, and flexible configuration via command-line arguments.
Public API
Exports 2 main functions via __all__:
main: Main pipeline orchestrator
run_step: Pipeline step executor with error handling
Key Features
Multi-Step Pipeline: Dictionary → Extraction → De-identification
Flexible Execution: Skip individual steps or run complete pipeline
Country Compliance: Support for 14 countries (US, IN, ID, BR, etc.)
Error Recovery: Comprehensive error handling with detailed logging
Version Tracking: Built-in version management
Pipeline Steps
Step 0: Data Dictionary Loading (Optional)
- Processes Excel data dictionary files
- Splits multi-table sheets automatically
- Outputs JSONL format with metadata
Step 1: Data Extraction (Default)
- Converts Excel files to JSONL format
- Dual output: original and cleaned versions
- Type conversion and validation
- Progress tracking with real-time feedback
Step 2: De-identification (Opt-in)
- PHI/PII detection and pseudonymization
- Country-specific regulations (HIPAA, GDPR, DPDPA, etc.)
- Encrypted mapping storage
- Date shifting with interval preservation
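The idea behind "date shifting with interval preservation" can be illustrated with a small sketch. This is not the implementation in scripts.deidentify, which this page does not show. The function names, the hash-derived offset, the `secret` parameter, and the 365-day window are all assumptions chosen for illustration. The point is that every date belonging to one subject moves by the same offset, so the gaps between events stay intact.

```python
import hashlib
from datetime import date, timedelta
from typing import List


def subject_offset(subject_id: str, secret: str, max_days: int = 365) -> timedelta:
    """Derive a stable offset of 1..max_days days for one subject (hypothetical scheme)."""
    digest = hashlib.sha256(f"{secret}:{subject_id}".encode("utf-8")).digest()
    days = int.from_bytes(digest[:4], "big") % max_days + 1
    return timedelta(days=days)


def shift_dates(dates: List[date], subject_id: str, secret: str) -> List[date]:
    """Shift all of a subject's dates back by the same offset."""
    offset = subject_offset(subject_id, secret)
    return [d - offset for d in dates]


visits = [date(2020, 1, 10), date(2020, 2, 9)]
shifted = shift_dates(visits, "SUBJ-001", "demo-secret")

# The absolute dates change, but the interval between visits is preserved.
print(shifted[1] - shifted[0] == visits[1] - visits[0])
```

Deriving the offset from a keyed hash keeps the shift repeatable across runs without storing it; a real pipeline might instead store a random offset in its encrypted mapping.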
Error Handling
The pipeline uses comprehensive error handling:
Step-level Errors: Each step is wrapped in try/except
Validation Errors: Invalid results cause immediate exit
Logging: All errors logged with full stack traces
Exit Codes: Non-zero exit on any failure
Return Codes:
- 0: Success
- 1: Pipeline failure (any step)
See also
User Documentation:
user_guide/quickstart - Quick start guide with basic examples
user_guide/usage - Advanced usage patterns and workflows
user_guide/configuration - Configuration and command-line options
developer_guide/architecture - Technical architecture details
API Reference:
scripts.load_dictionary - Data dictionary processing
scripts.extract_data - Data extraction
scripts.deidentify - De-identification
config - Configuration settings
Overview
The main module serves as the central entry point for the RePORTaLiN pipeline.
It orchestrates the execution of data processing steps and provides command-line
interface functionality.
Enhanced in v0.0.12:
✅ Added -v/--verbose flag for DEBUG-level logging
✅ Enhanced logging throughout pipeline for detailed troubleshooting
✅ Version updated to 0.0.12
Enhanced in v0.0.11:
✅ Enhanced module docstring with comprehensive usage examples (162 lines, 2,214% increase)
✅ Added explicit public API definition via __all__ (2 exports)
✅ Complete command-line arguments documentation
✅ Pipeline steps explanation with detailed features
✅ Four usage examples (basic, custom, de-identification, advanced)
✅ Output structure with directory tree
✅ Error handling and return codes documented
Public API
The module exports 2 functions via __all__:
main - Main pipeline orchestrator with command-line interface
run_step - Pipeline step executor with error handling
Quick Start:
# Run complete pipeline
python3 main.py
# With de-identification
python3 main.py --enable-deidentification --countries IN US
# Custom execution
python3 main.py --skip-dictionary --enable-deidentification
Functions
run_step
main.run_step(step_name, func)
Execute a pipeline step with comprehensive error handling and logging.
Example:
from main import run_step

def my_processing_step():
    print("Processing...")
    return True

result = run_step("My Step", my_processing_step)
main
main.main()
Main pipeline orchestrating dictionary loading, data extraction, and de-identification.
- Return type:
- Command-line Arguments:
--skip-dictionary: Skip data dictionary loading
--skip-extraction: Skip data extraction
--enable-deidentification: Enable de-identification (disabled by default)
--skip-deidentification: Skip de-identification even if enabled
--no-encryption: Disable encryption for de-identification mappings
-c, --countries: Country codes (e.g., IN US ID BR) or ALL
-v, --verbose: Enable verbose (DEBUG level) logging
Main entry point for the pipeline.
Command-line usage:
# Run full pipeline
python main.py
# Skip dictionary loading
python main.py --skip-dictionary
# Skip data extraction
python main.py --skip-extraction
# Enable de-identification with encryption (default)
python main.py --enable-deidentification
# Enable de-identification without encryption (testing only)
python main.py --enable-deidentification --no-encryption
# Skip de-identification step
python main.py --skip-deidentification
# Specify country codes for de-identification
python main.py --enable-deidentification -c IN US BR
python main.py --enable-deidentification --countries ALL
# Show version
python main.py --version
Programmatic usage:
# Import and run
import main
main.main()
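For orientation, the documented flags could be declared with argparse roughly as follows. This is a sketch under assumptions, not the parser from main.py: the helper name `build_parser`, the help strings, and the defaults are illustrative, and the real module may define its options differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--skip-dictionary", action="store_true",
                        help="Skip data dictionary loading")
    parser.add_argument("--skip-extraction", action="store_true",
                        help="Skip data extraction")
    parser.add_argument("--enable-deidentification", action="store_true",
                        help="Enable de-identification (disabled by default)")
    parser.add_argument("--skip-deidentification", action="store_true",
                        help="Skip de-identification even if enabled")
    parser.add_argument("--no-encryption", action="store_true",
                        help="Disable encryption for de-identification mappings")
    parser.add_argument("-c", "--countries", nargs="+", default=None,
                        help="Country codes (e.g., IN US ID BR) or ALL")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable verbose (DEBUG level) logging")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is testable.
args = build_parser().parse_args(["--enable-deidentification", "-c", "IN", "US", "BR"])
print(args.enable_deidentification, args.countries)
```

argparse converts the dashes in each long option to underscores, so `--enable-deidentification` is read back as `args.enable_deidentification`.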
Pipeline Steps
The main function executes these steps in order:
Step 0: Load Data Dictionary
Processes the Excel-based data dictionary using scripts.load_dictionary.load_study_dictionary(). Can be skipped with --skip-dictionary.
Step 1: Extract Raw Data
Extracts data from Excel files using scripts.extract_data.extract_excel_to_jsonl(). Can be skipped with --skip-extraction.
Step 2: De-identify Data (Optional)
De-identifies PHI/PII from extracted data using scripts.deidentify.deidentify_dataset(). Must be explicitly enabled with --enable-deidentification.
- Encryption enabled by default for security
- Can disable encryption with --no-encryption (testing only)
- Can specify country codes with -c or --countries
- Can be skipped with --skip-deidentification
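How the flags combine to select steps can be summarized in a few lines. The function below is a hypothetical helper, not part of main.py. It encodes only what is stated above: Steps 0 and 1 run unless skipped, and Step 2 runs only when it is enabled and not also skipped.

```python
from typing import List


def planned_steps(skip_dictionary: bool = False,
                  skip_extraction: bool = False,
                  enable_deidentification: bool = False,
                  skip_deidentification: bool = False) -> List[str]:
    """Return the step names that would run, in order (illustrative only)."""
    steps = []
    if not skip_dictionary:
        steps.append("Step 0: Load Data Dictionary")
    if not skip_extraction:
        steps.append("Step 1: Extract Raw Data")
    # De-identification is opt-in, and the skip flag overrides the enable flag.
    if enable_deidentification and not skip_deidentification:
        steps.append("Step 2: De-identify Data")
    return steps


print(planned_steps())
print(planned_steps(skip_dictionary=True, enable_deidentification=True))
```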
Error Handling
All steps are wrapped with error handling:
Exceptions are caught and logged
Detailed error messages with traceback
Program exits with code 1 on error
Ensures clean shutdown
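A wrapper with the behavior listed above might look like the following sketch. It is not the source of run_step: the name `run_step_sketch`, the logger name, and the choice to treat a `False` return value as an invalid result are assumptions made for illustration.

```python
import logging
import sys
from typing import Any, Callable

log = logging.getLogger("pipeline_sketch")  # hypothetical logger name


def run_step_sketch(step_name: str, func: Callable[[], Any]) -> Any:
    """Run one step; log and exit with code 1 on failure (illustrative only)."""
    try:
        log.info("Starting: %s", step_name)
        result = func()
        # Assumption: a False return value signals an invalid result.
        if result is False:
            log.error("Step returned an invalid result: %s", step_name)
            sys.exit(1)
        log.info("Completed: %s", step_name)
        return result
    except Exception:
        # log.exception records the message together with the full traceback.
        log.exception("Step failed: %s", step_name)
        sys.exit(1)


print(run_step_sketch("Demo step", lambda: 42))
```

Because `SystemExit` does not derive from `Exception`, the `sys.exit(1)` call inside the `try` block is not swallowed by the `except Exception` handler.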
Logging
The module uses centralized logging:
Step execution logged at INFO level
Success logged at SUCCESS level (custom)
Errors logged at ERROR level with traceback
All logs written to timestamped file
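A custom SUCCESS level can be registered with the standard logging module as sketched below. The numeric value 25 (between INFO at 20 and WARNING at 30), the logger name, and the message format are assumptions; the definitions in scripts.utils.logging are not shown on this page. The sketch writes to an in-memory stream, where the real pipeline is described as writing to a timestamped file.

```python
import io
import logging

SUCCESS = 25  # assumption: placed between INFO (20) and WARNING (30)
logging.addLevelName(SUCCESS, "SUCCESS")


def make_logger(stream) -> logging.Logger:
    """Create a logger that writes '<LEVEL> - <message>' lines to the given stream."""
    logger = logging.getLogger("reportalin_sketch")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


buffer = io.StringIO()
logger = make_logger(buffer)
logger.info("Step 1: Extract Raw Data")
logger.log(SUCCESS, "Step 1 completed")
print(buffer.getvalue())
```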
See Also
config - Configuration management
scripts.extract_data - Data extraction functionality
scripts.load_dictionary - Dictionary loading functionality
scripts.deidentify - De-identification utilities
scripts.utils.logging - Logging utilities