scripts package

RePORTaLiN Scripts Package
Core data processing modules for clinical research data extraction, validation, and de-identification.
This package provides high-level functions for the complete data processing pipeline:
Data Dictionary Processing: Load and validate study data dictionaries
Data Extraction: Convert Excel files to JSONL format with validation
De-identification: Advanced PHI/PII detection and pseudonymization (via utils)
Public API
The package exports 2 main high-level functions via __all__:
- load_study_dictionary: Process data dictionary Excel files
- extract_excel_to_jsonl: Extract dataset Excel files to JSONL
For more specialized functionality, import directly from submodules:
- scripts.load_dictionary: Dictionary processing (2 public functions)
- scripts.extract_data: Data extraction (6 public functions)
- scripts.deidentify: De-identification engine (10 public functions)
- scripts.utils.country_regulations: Privacy regulations (6 public functions)
- scripts.utils.logging: Enhanced logging (12 public functions)
Usage Examples

Basic Pipeline
Process both data dictionary and dataset with default configuration:
from scripts import load_study_dictionary, extract_excel_to_jsonl
# Step 1: Load data dictionary
dict_success = load_study_dictionary()
# Step 2: Extract dataset (uses config.DATASET_DIR and config.CLEAN_DATASET_DIR)
result = extract_excel_to_jsonl()
if dict_success and result['files_created'] > 0:
    print("✓ Pipeline completed successfully!")
Custom Processing
Use individual modules for custom workflows:
from scripts.load_dictionary import process_excel_file
from scripts.extract_data import find_excel_files, process_excel_file as process_data
# Custom dictionary processing
process_excel_file(
    excel_path="custom_dict.xlsx",
    output_dir="results/custom_dict"
)
# Custom data extraction with file discovery
excel_files = find_excel_files("data/custom_dataset")
for file_path in excel_files:
    process_data(
        excel_path=file_path,
        output_dir="results/custom_output"
    )
De-identification Workflow
Complete pipeline with de-identification:
from scripts import extract_excel_to_jsonl
from scripts.deidentify import deidentify_dataset, DeidentificationConfig
import config
# Step 1: Extract data (uses config.DATASET_DIR and config.CLEAN_DATASET_DIR)
result = extract_excel_to_jsonl()
# Step 2: De-identify with custom configuration
deidentify_config = DeidentificationConfig(
    countries=['IN', 'US'],
    enable_encryption=True
)
deidentify_dataset(
    input_dir=f"{config.CLEAN_DATASET_DIR}/cleaned",
    output_dir="results/deidentified/Indo-vap",
    config=deidentify_config
)
Module Structure
The package is organized as follows:
scripts/
├── __init__.py                 # Package API (this file)
├── load_dictionary.py          # Data dictionary processing
├── extract_data.py             # Excel to JSONL extraction
├── deidentify.py               # De-identification engine
└── utils/                      # Utility modules
    ├── __init__.py
    ├── country_regulations.py  # Privacy compliance
    └── logging.py              # Enhanced logging
Version History
v0.0.9: Enhanced package-level API with comprehensive documentation
v0.0.8: Enhanced load_dictionary module (public API, type hints, docs)
v0.0.7: Enhanced extract_data module (public API, type hints, docs)
v0.0.6: Enhanced deidentify module (public API, type safety, docs)
v0.0.5: Enhanced country_regulations module (public API, docs)
v0.0.4: Enhanced logging module (performance, type hints)
v0.0.3: Enhanced config module (utilities, robustness)
v0.0.1: Initial package structure
See also
- scripts.load_dictionary - Data dictionary processing
- scripts.extract_data - Data extraction
- scripts.deidentify - De-identification
- scripts.utils.country_regulations - Privacy regulations
- scripts.utils.logging - Logging utilities
- scripts.extract_excel_to_jsonl()[source]
  Extract all Excel files from the dataset directory, creating original and cleaned JSONL versions.
- scripts.load_study_dictionary(file_path=None, json_output_dir=None, preserve_na=True)[source]
  Load and process the study data dictionary from Excel to JSONL format.
  - Parameters:
    - file_path - path to the dictionary Excel file (defaults to the configured location)
    - json_output_dir - directory for the JSONL output (defaults to the configured location)
    - preserve_na - whether to preserve NA values during processing
  - Return type:
    bool
  - Returns:
    True if processing was successful, False otherwise
Overview
The scripts package contains the core processing modules for RePORTaLiN.
Enhanced in v0.0.9:
✓ Enhanced package-level documentation with comprehensive usage examples
✓ Clear public API definition (2 high-level functions)
✓ Integration examples for complete data processing pipeline
✓ De-identification workflow documentation
✓ Module structure and cross-references
Package-Level Public API
The package exports 2 high-level functions for the main processing pipeline:
load_study_dictionary - Process data dictionary Excel files
extract_excel_to_jsonl - Extract dataset Excel files to JSONL
Quick Start:
from scripts import load_study_dictionary, extract_excel_to_jsonl
# Step 1: Load data dictionary
dict_success = load_study_dictionary()
# Step 2: Extract dataset
extract_success = extract_excel_to_jsonl(
    input_dir="data/dataset/Indo-vap",
    output_dir="results/dataset/Indo-vap"
)
For specialized functionality, import directly from submodules:
- scripts.load_dictionary - 2 public functions
- scripts.extract_data - 6 public functions
- scripts.deidentify - 10 public functions
- scripts.utils.country_regulations - 6 public functions
- scripts.utils.logging - 12 public functions
Module Organization

scripts/
├── __init__.py                 # Package API (2 exports)
├── load_dictionary.py          # Data dictionary (2 exports)
├── extract_data.py             # Data extraction (6 exports)
├── deidentify.py               # De-identification (10 exports)
└── utils/
    ├── country_regulations.py  # Privacy rules (6 exports)
    └── logging.py              # Logging (12 exports)
Submodules
- scripts.extract_data module
- scripts.load_dictionary module
- scripts.utils.logging module
- Centralized Logging Module
CustomFormatter, VerboseLogger, critical(), debug(), error(), get_log_file_path(), get_logger(), get_verbose_logger(), info(), setup_logger(), success(), warning()
- Overview
- Public API
- Setup Functions
- Custom Log Levels
- Logging Functions
- Logging Configuration
- Output Handlers
- Usage Examples
- Logging Best Practices
- See Also
- scripts.deidentify module
- scripts.utils.country_regulations module
Module Summary

extract_data
Main data extraction module for converting Excel files to JSONL format.
Key functions:
- extract_excel_to_jsonl(): Batch processing of Excel files
- process_excel_file(): Single file processing
- convert_dataframe_to_jsonl(): DataFrame to JSONL conversion
- clean_record_for_json(): Type conversion for JSON serialization
- find_excel_files(): File discovery
- is_dataframe_empty(): Empty DataFrame detection
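The conversion step can be sketched with the standard library alone. The helpers below (`clean_record`, `records_to_jsonl`) are illustrative stand-ins, not the package's actual implementations; they show the kind of type coercion `clean_record_for_json()` and `convert_dataframe_to_jsonl()` perform before serialization:

```python
import json
import math
from datetime import date, datetime

def clean_record(record: dict) -> dict:
    """Coerce values that json.dumps cannot serialize directly."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, float) and math.isnan(value):
            cleaned[key] = None  # NaN has no JSON equivalent
        elif isinstance(value, (date, datetime)):
            cleaned[key] = value.isoformat()
        else:
            cleaned[key] = value
    return cleaned

def records_to_jsonl(records: list[dict]) -> str:
    """One JSON object per line -- the JSONL convention."""
    return "\n".join(json.dumps(clean_record(r)) for r in records)

records = [
    {"subject_id": "S001", "visit": date(2021, 3, 14), "score": 7.5},
    {"subject_id": "S002", "visit": date(2021, 3, 15), "score": float("nan")},
]
print(records_to_jsonl(records))
```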
load_dictionary
Data dictionary processing module with intelligent table detection.
Key functions:
- load_study_dictionary(): High-level API for dictionary loading
- process_excel_file(): Excel file processing
- _split_sheet_into_tables(): Automatic table detection
- _process_and_save_tables(): Table output
- _deduplicate_columns(): Duplicate column handling
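The table-detection idea can be illustrated with a small, self-contained sketch. `split_into_tables` below is a hypothetical simplification of what `_split_sheet_into_tables()` does (splitting a sheet into tables at fully blank rows); the real module's heuristics may differ:

```python
def split_into_tables(rows):
    """Split a sheet's rows into separate tables at fully blank rows."""
    tables, current = [], []
    for row in rows:
        if all(cell in (None, "") for cell in row):
            if current:  # a blank row closes the table in progress
                tables.append(current)
                current = []
        else:
            current.append(row)
    if current:
        tables.append(current)
    return tables

sheet = [
    ["Variable", "Type"],
    ["age", "int"],
    ["", ""],            # blank row separates the two tables
    ["Code", "Label"],
    ["1", "Male"],
]
print(split_into_tables(sheet))
```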
utils
Utility modules including de-identification and logging.
De-identification (scripts.deidentify)
PHI/PII de-identification module with pseudonymization and encryption.
Key classes:
- DeidentificationEngine: Main processing engine
- PseudonymGenerator: Cryptographic pseudonym generation
- DateShifter: Multi-format date shifting with interval and format preservation
- MappingStore: Encrypted mapping storage
- PatternLibrary: PHI/PII detection patterns
Key functions:
- deidentify_dataset(): Batch dataset de-identification
- validate_dataset(): Dataset validation
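Two of the core ideas, deterministic pseudonyms and interval-preserving date shifts, can be sketched in a few lines. The helpers below (`pseudonymize`, `shift_dates`) are illustrative only and are not the `PseudonymGenerator`/`DateShifter` API:

```python
import hmac
import hashlib
from datetime import date, timedelta

def pseudonymize(value: str, secret: bytes, prefix: str = "PID") -> str:
    """Keyed hash: the same input and key always map to the same token."""
    digest = hmac.new(secret, value.encode(), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:8].upper()}"

def shift_dates(dates: list[date], offset_days: int) -> list[date]:
    """Shift every date for one subject by the same offset, preserving intervals."""
    return [d + timedelta(days=offset_days) for d in dates]

secret = b"demo-key-not-for-production"
token = pseudonymize("John Doe", secret)
assert token == pseudonymize("John Doe", secret)  # stable across calls

visits = [date(2021, 1, 1), date(2021, 1, 15)]
shifted = shift_dates(visits, offset_days=-42)
assert (shifted[1] - shifted[0]).days == 14       # 14-day interval preserved
print(token, shifted)
```

Determinism matters here: re-running the pipeline must assign the same pseudonym to the same identifier so longitudinal records still link up.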
Logging (scripts.utils.logging)
Centralized logging module with custom SUCCESS level.
Key features:
Custom SUCCESS log level
Timestamped log files
Dual output (console + file)
Structured logging
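A custom SUCCESS level like the one this module provides can be added to Python's standard `logging` in a few lines; the sketch below is a minimal illustration, not the module's actual implementation:

```python
import io
import logging

SUCCESS = 25  # custom level between INFO (20) and WARNING (30)
logging.addLevelName(SUCCESS, "SUCCESS")

def success(self, message, *args, **kwargs):
    if self.isEnabledFor(SUCCESS):
        self._log(SUCCESS, message, args, **kwargs)

logging.Logger.success = success  # attach as a method on all loggers

# Route output to an in-memory stream so the demo is self-contained
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))

logger = logging.getLogger("demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.success("Operation completed")
print(stream.getvalue().strip())  # SUCCESS: Operation completed
```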
Country Regulations (scripts.utils.country_regulations)
Country-specific data privacy regulations module for compliance.
Key features:
Multi-country support (14 countries)
Privacy frameworks (PUBLIC to CRITICAL levels)
Identifier detection and validation
Regulatory requirements management
Configuration export/import
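Identifier detection of this kind typically pairs a per-country format pattern with deeper validation. The registry below is a purely illustrative sketch (the patterns and the `matches_format` helper are hypothetical, and real validation would also check checksums, e.g. the Verhoeff digit used by Aadhaar numbers):

```python
import re

# Illustrative (country, identifier kind) -> format pattern
IDENTIFIER_PATTERNS = {
    ("US", "SSN"): re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    ("IN", "AADHAAR"): re.compile(r"^\d{4} \d{4} \d{4}$"),
    ("IN", "PAN"): re.compile(r"^[A-Z]{5}\d{4}[A-Z]$"),
}

def matches_format(country: str, kind: str, value: str) -> bool:
    """Check a value against the registered format for this country/identifier."""
    pattern = IDENTIFIER_PATTERNS.get((country, kind))
    return bool(pattern and pattern.match(value))

print(matches_format("IN", "AADHAAR", "1234 5678 9012"))  # True
print(matches_format("US", "SSN", "123-45-678"))          # False
```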
Quick Examples

Data Extraction
from scripts.extract_data import extract_excel_to_jsonl
import config
# Extract all Excel files
extract_excel_to_jsonl(
    input_dir=config.DATASET_DIR,
    output_dir=config.CLEAN_DATASET_DIR
)
Dictionary Loading
from scripts.load_dictionary import load_study_dictionary
import config
# Load data dictionary
load_study_dictionary(
    file_path=config.DICTIONARY_EXCEL_FILE,
    json_output_dir=config.DICTIONARY_JSON_OUTPUT_DIR
)
De-identification
from scripts.deidentify import DeidentificationEngine
# Initialize engine
engine = DeidentificationEngine()
# De-identify text
original = "Patient John Doe, MRN: 123456, SSN: 123-45-6789"
deidentified = engine.deidentify_text(original)
# Save mappings
engine.save_mappings()
Batch De-identification
from scripts.deidentify import deidentify_dataset
# Process entire dataset (maintains directory structure)
stats = deidentify_dataset(
    input_dir="results/dataset/Indo-vap",
    output_dir="results/deidentified/Indo-vap",
    process_subdirs=True
)
print(f"Processed {stats['texts_processed']} texts")
Single File Processing
from scripts.extract_data import process_excel_file
from pathlib import Path
# Process one file
input_file = Path("data/dataset/Indo-vap/10_TST.xlsx")
output_dir = Path("results/dataset/Indo-vap")
result = process_excel_file(str(input_file), str(output_dir))
print(f"Processed {result['records']} records")
Custom Logging
from scripts.utils import logging as log
# Use custom logger
log.info("Processing started")
log.success("Operation completed successfully")
log.warning("Potential issue detected")
log.error("An error occurred", exc_info=True)
Country-Specific De-identification
from scripts.utils.country_regulations import CountryRegulationManager
# Initialize for India
manager = CountryRegulationManager()
manager.set_country("IN")
# Get identifiers
identifiers = manager.get_identifiers()
print(f"Found {len(identifiers)} identifiers for India")
# Validate Aadhaar number
is_valid = manager.validate_identifier("AADHAAR", "1234 5678 9012")
Module Dependencies
scripts/
├── extract_data.py
│   └── uses: logging, config
│
├── load_dictionary.py
│   └── uses: logging, config
│
├── deidentify.py
│   └── uses: config, country_regulations, cryptography (optional)
│
└── utils/
    ├── logging.py
    │   └── uses: config
    └── country_regulations.py
        └── uses: re, json, dataclasses
See Also
- Usage Guide
Usage examples
- Architecture
Architecture documentation
- main module
Main module that orchestrates scripts