Architecture

For Developers: Comprehensive Technical Documentation

This document provides in-depth technical details about RePORTaLiN’s architecture, internal algorithms, data structures, dependencies, design patterns, and extension points to enable effective maintenance, debugging, and feature development.

Last Updated: October 23, 2025
Current Version: 0.8.6
Code Optimization: 35% reduction (640 lines) while maintaining 100% functionality
Historical Enhancements: Enhanced modules from v0.0.1 through v0.0.12

System Overview

RePORTaLiN is designed as a modular, pipeline-based system for processing sensitive medical research data from Excel to JSONL format with optional PHI/PII de-identification. The architecture emphasizes:

  • Simplicity: Single entry point (main.py) with clear pipeline stages

  • Modularity: Clear separation of concerns with dedicated modules

  • Robustness: Comprehensive error handling with graceful degradation

  • Transparency: Detailed logging at every step with audit trails

  • Security: Encryption-by-default for de-identification mappings

  • Compliance: Multi-country privacy regulation support (14 countries)

  • Performance: Optimized for high throughput (benchmarks pending)

Architecture Diagram

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         main.py                             β”‚
β”‚                   (Pipeline Orchestrator)                   β”‚
β”‚                                                             β”‚
β”‚  β€’ Command-line argument parsing                           β”‚
β”‚  β€’ Step execution coordination                             β”‚
β”‚  β€’ High-level error handling                              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚                      β”‚
         β–Ό                      β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  load_dict.py  β”‚     β”‚ extract_data.py  β”‚
β”‚                β”‚     β”‚                  β”‚
β”‚ β€’ Sheet split  β”‚     β”‚ β€’ File discovery β”‚
β”‚ β€’ Table detect β”‚     β”‚ β€’ Type convert   β”‚
β”‚ β€’ Duplicate    β”‚     β”‚ β€’ JSONL export   β”‚
β”‚   handling     β”‚     β”‚ β€’ Progress track β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                      β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
                    β–Ό
           β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
           β”‚   config.py    β”‚
           β”‚                β”‚
           β”‚ β€’ Paths        β”‚
           β”‚ β€’ Settings     β”‚
           β”‚ β€’ Dynamic      β”‚
           β”‚   detection    β”‚
           β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
                    β–Ό
           β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
           β”‚ logging  β”‚
           β”‚                β”‚
           β”‚ β€’ Log files    β”‚
           β”‚ β€’ Console out  β”‚
           β”‚ β€’ SUCCESS lvl  β”‚
           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Core Components

1. main.py - Pipeline Orchestrator

Purpose: Central entry point and workflow coordinator

Responsibilities:

  • Parse command-line arguments

  • Execute pipeline steps in sequence

  • Handle top-level errors

  • Coordinate logging

Key Functions:

  • main(): Entry point, orchestrates full pipeline

  • run_step(): Wrapper for step execution with error handling

Design Pattern: Command pattern with error handling decorator
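
A minimal sketch of this step-wrapper pattern (names and messages here are illustrative assumptions, not the exact main.py implementation):

# Illustrative sketch only: names and messages are assumptions, not the
# exact main.py implementation.
import sys
from typing import Callable

def run_step(name: str, func: Callable[[], int]) -> int:
    """Run one pipeline step with top-level error handling."""
    print(f"[STEP] {name} ...")
    try:
        result = func()
        print(f"[OK] {name} completed")
        return result
    except Exception as exc:            # catch, log, and signal failure
        print(f"[ERROR] {name} failed: {exc}", file=sys.stderr)
        return 1

def main() -> int:
    steps = [
        ("Load data dictionary", lambda: 0),   # placeholder callables
        ("Extract Excel to JSONL", lambda: 0),
    ]
    return max(run_step(name, fn) for name, fn in steps)

if __name__ == "__main__":
    sys.exit(main())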

2. config.py - Configuration Management

Purpose: Centralized configuration

Responsibilities:

  • Define all file paths

  • Manage settings

  • Dynamic dataset detection

  • Path resolution

Key Features:

  • Automatic dataset folder detection

  • Relative path resolution

  • Environment-agnostic configuration

  • Type hints for IDE support

Design Pattern: Module-level singleton
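
A minimal sketch of a module-level configuration singleton with dynamic dataset detection (paths and helper names are illustrative, not the actual config.py contents):

# Illustrative sketch only: paths and helper names are assumptions, not the
# real config.py.
from pathlib import Path
from typing import Optional

PROJECT_ROOT = Path(__file__).resolve().parent
DATASET_ROOT = PROJECT_ROOT / "data" / "dataset"
RESULTS_DIR = PROJECT_ROOT / "results"

def _detect_dataset_dir(root: Path) -> Optional[Path]:
    """Return the first subdirectory under data/dataset/, if any."""
    if not root.is_dir():
        return None
    subdirs = sorted(p for p in root.iterdir() if p.is_dir())
    return subdirs[0] if subdirs else None

# Evaluated once at import time, so the module behaves as a singleton.
DATASET_DIR: Optional[Path] = _detect_dataset_dir(DATASET_ROOT)
DATASET_NAME: str = DATASET_DIR.name if DATASET_DIR else "unknown"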

3. scripts/extract_data.py - Data Extraction

Purpose: Convert Excel files to JSONL

Responsibilities:

  • File discovery and validation

  • Excel reading and parsing

  • Data type conversion

  • JSONL serialization

  • Progress tracking with tqdm

Key Functions:

  • extract_excel_to_jsonl(): Batch processing with progress bars

  • process_excel_file(): Single file processing

  • convert_dataframe_to_jsonl(): DataFrame conversion

  • clean_record_for_json(): Type conversion

  • is_dataframe_empty(): Empty detection

  • find_excel_files(): File discovery

Progress Tracking:

  • Uses tqdm for all file and row processing

  • Status messages via tqdm.write() for clean output

  • Summary statistics after completion

Design Pattern: Pipeline pattern with functional composition
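
A minimal sketch of the discovery-and-export flow described above (function names are illustrative, not the exact extract_data.py signatures):

# Illustrative sketch only: names mirror the list above but are not the exact
# extract_data.py API.
import json
from pathlib import Path
from typing import List
import pandas as pd

def find_excel_files(root: Path) -> List[Path]:
    """Recursively discover .xlsx files under the dataset directory."""
    return sorted(root.rglob("*.xlsx"))

def dataframe_to_jsonl(df: pd.DataFrame, out_path: Path, source: str) -> int:
    """Write one JSON object per row, tagging each record with its source file."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as fh:
        for record in df.to_dict(orient="records"):
            record["source_file"] = source
            fh.write(json.dumps(record, ensure_ascii=False, default=str) + "\n")
    return len(df)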

4. scripts/load_dictionary.py - Dictionary Processing

Purpose: Process data dictionary Excel file

Responsibilities:

  • Sheet processing with progress tracking

  • Table detection and splitting

  • Duplicate column handling

  • Table serialization

Key Functions:

  • load_study_dictionary(): High-level API with tqdm progress bars

  • process_excel_file(): Sheet processing

  • _split_sheet_into_tables(): Table detection

  • _process_and_save_tables(): Table output

  • _deduplicate_columns(): Column name handling

Progress Tracking:

  • tqdm progress bars for sheet processing

  • tqdm.write() for status messages

  • Clean console output during processing

Design Pattern: Functional composition, with a Strategy-style table detection algorithm

5. scripts/utils/logging.py - Logging System

Purpose: Centralized logging infrastructure

Responsibilities:

  • Create timestamped log files

  • Dual output (console + file)

  • Custom SUCCESS log level

  • Structured logging with configurable verbosity

Key Features:

  • Custom SUCCESS level (between INFO and WARNING)

  • Timestamped log files in .logs/ directory

  • Console and file handlers with different filtering

  • UTF-8 encoding for international characters

  • Works alongside tqdm for clean progress bar output

  • Verbose mode: DEBUG-level logging via -v flag

Log Levels:

DEBUG (10)    # Verbose mode only: file processing, patterns, details
INFO (20)     # Default: major steps, summaries
SUCCESS (25)  # Custom: successful completions
WARNING (30)  # Potential issues
ERROR (40)    # Failures
CRITICAL (50) # Fatal errors

Console vs. File Output:

  • Console: Only SUCCESS, ERROR, and CRITICAL (keeps terminal clean)

  • File: INFO or DEBUG (depending on --verbose flag) and above
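
A minimal sketch of how the custom SUCCESS level and the console/file split described above can be wired up with the standard logging module (illustrative; the actual scripts/utils/logging.py differs in details):

# Illustrative sketch only: the real scripts/utils/logging.py may differ.
import logging
from pathlib import Path

SUCCESS = 25                              # between INFO (20) and WARNING (30)
logging.addLevelName(SUCCESS, "SUCCESS")

def _success(self, message, *args, **kwargs):
    if self.isEnabledFor(SUCCESS):
        self._log(SUCCESS, message, args, **kwargs)

logging.Logger.success = _success         # expose log.success(...)

class ConsoleFilter(logging.Filter):
    """Allow only SUCCESS, ERROR and CRITICAL records on the console."""
    def filter(self, record: logging.LogRecord) -> bool:
        return record.levelno in (SUCCESS, logging.ERROR, logging.CRITICAL)

Path(".logs").mkdir(exist_ok=True)
logger = logging.getLogger("reportalin")
logger.setLevel(logging.DEBUG)

console = logging.StreamHandler()
console.addFilter(ConsoleFilter())

file_handler = logging.FileHandler(".logs/run.log", encoding="utf-8")
file_handler.setLevel(logging.INFO)       # or DEBUG when --verbose is used

logger.addHandler(console)
logger.addHandler(file_handler)
logger.success("Pipeline step completed")  # console + file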

Verbose Logging:

When --verbose or -v flag is used:

  • Log level set to DEBUG in main.py

  • VerboseLogger class provides tree-view, context-managed output

  • Integrated into all three pipeline steps (Dictionary Loading, Data Extraction, De-identification)

  • Additional details logged throughout pipeline:

    • File lists and processing order

    • Sheet/table detection details with row/column counts

    • Duplicate column detection and removal reasons with detailed explanations

    • Processing phases with timing information (per-file, per-sheet, per-table, overall)

    • PHI/PII pattern matches and replacement counts (de-identification step)

    • Record-level progress (every 1000 records for large files)

    • Validation results with issue tracking

    • Comprehensive timing for all operations

VerboseLogger Class:

The VerboseLogger class provides centralized verbose logging with tree-view formatting and context management:

# In module
from scripts.utils import logging as log
vlog = log.get_verbose_logger()

# Usage in code
with vlog.file_processing("Data dictionary", total_records=43):
    vlog.metric("Total sheets", 43)

    for sheet_idx, sheet_name in enumerate(sheets, 1):
        with vlog.step(f"Sheet {sheet_idx}/{len(sheets)}: '{sheet_name}'"):
            vlog.metric("Tables detected", num_tables)
            vlog.detail("Processing table...")
            vlog.timing("Sheet processing time", elapsed)

Output Format:

DEBUG: β”œβ”€ Processing: Data extraction (65 files)
DEBUG:   β”œβ”€ Total files to process: 65
DEBUG:   β”œβ”€ File 1/65: 10_TST.xlsx
DEBUG:   β”‚  β”œβ”€ Loading Excel file
DEBUG:   β”‚  β”‚  β”œβ”€ Rows: 412
DEBUG:   β”‚  β”‚  β”œβ”€ Columns: 28
DEBUG:   β”‚  β”œβ”€ Cleaning duplicate columns
DEBUG:   β”‚  β”‚  β”œβ”€ Marking SUBJID2 for removal (duplicate of SUBJID)
DEBUG:   β”‚  β”‚  β”œβ”€ Removed 3 duplicate columns: SUBJID2, AGE_2, PHONE_3
DEBUG:   β”‚  └─ ⏱ Total processing time: 0.45s
DEBUG:   └─ ⏱ Overall extraction time: 32.15s

Key Methods:

  • file_processing(name, total_records): Wrap entire file processing

  • step(name): Wrap a processing step

  • detail(message): Log a detail line

  • metric(label, value): Log a metric/statistic

  • timing(operation, seconds): Log timing information

  • items_list(label, items, max_show): Log list of items with truncation
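
A minimal sketch of how such a tree-view, context-managed logger can be built on top of the standard logging module (illustrative only; the real VerboseLogger differs in formatting and bookkeeping):

# Illustrative sketch only: the real VerboseLogger differs in formatting
# (nested │ guides) and bookkeeping.
import logging
import time
from contextlib import contextmanager
from typing import Optional

class TreeLogger:
    def __init__(self, logger: logging.Logger) -> None:
        self._logger = logger
        self._depth = 0

    def _emit(self, text: str) -> None:
        self._logger.debug("%s├─ %s", "  " * self._depth, text)

    @contextmanager
    def step(self, name: str):
        self._emit(f"Processing: {name}")
        self._depth += 1
        start = time.perf_counter()
        try:
            yield
        finally:
            self._depth -= 1
            self._emit(f"⏱ {name}: {time.perf_counter() - start:.2f}s")

    @contextmanager
    def file_processing(self, name: str, total_records: Optional[int] = None):
        with self.step(name):
            if total_records is not None:
                self.metric("Total records", total_records)
            yield

    def detail(self, message: str) -> None:
        self._emit(message)

    def metric(self, label: str, value) -> None:
        self._emit(f"{label}: {value}")

    def timing(self, operation: str, seconds: float) -> None:
        self._emit(f"⏱ {operation}: {seconds:.2f}s")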

Integration Pattern:

All pipeline modules (load_dictionary.py, extract_data.py, deidentify.py) follow this pattern:

  1. Import verbose logger at module level:

    from scripts.utils import logging as log
    vlog = log.get_verbose_logger()
    
  2. Wrap main processing functions with context managers:

    with vlog.file_processing(filename):
        # Main processing
        for item in items:
            with vlog.step(f"Processing {item}"):
                # Item processing
                vlog.metric("Metric", value)
                vlog.timing("Operation", elapsed_time)
    
  3. Always log both to file (verbose) and console (normal):

    log.debug(...)      # File only in verbose mode
    log.info(...)       # File always
    log.success(...)    # Console + file
    vlog.detail(...)    # File only in verbose mode (via context)
    

Usage:

# Standard (INFO level)
python main.py

# Verbose (DEBUG level) with tree-view logs
python main.py -v

# Verbose with de-identification
python main.py -v --enable-deidentification

# In code
from scripts.utils import logging as log

log.debug("Detailed processing info")  # Only in verbose mode
log.info("Major step completed")       # Always logged to file
log.success("Pipeline completed")      # Console + file
vlog.detail("Context-managed detail")  # Only in verbose mode

Design Pattern: Singleton logger instance with configurable formatting; VerboseLogger provides tree-view abstraction over standard logging

6. scripts/deidentify.py - De-identification Engine

Purpose: Remove PHI/PII from text data with pseudonymization

Responsibilities:

  • Detect PHI/PII using regex patterns

  • Generate consistent pseudonyms

  • Encrypt and store mappings

  • Validate de-identified output

  • Support country-specific regulations

  • Progress tracking for large datasets

Key Classes:

  • DeidentificationEngine: Main orchestrator

  • PseudonymGenerator: Creates deterministic placeholders

  • MappingStore: Secure encrypted storage

  • DateShifter: Consistent date shifting

  • PatternLibrary: Detection patterns

Progress Tracking:

  • tqdm progress bars for processing batches

  • tqdm.write() for status messages during processing

  • Summary statistics upon completion

Design Pattern: Strategy pattern for detection, Builder pattern for configuration
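
A minimal sketch of the detection-and-replacement loop (patterns and names are illustrative examples, not the engine's actual pattern library):

# Illustrative sketch only: patterns and names are examples, not the engine's
# actual pattern library.
import re
from typing import Callable, Dict

PATTERNS: Dict[str, re.Pattern] = {
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE_US": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def deidentify_text(text: str, pseudonymize: Callable[[str, str], str]) -> str:
    """Replace every pattern match with a pseudonym for (phi_type, value)."""
    for phi_type, pattern in PATTERNS.items():
        text = pattern.sub(lambda m, t=phi_type: pseudonymize(t, m.group(0)), text)
    return text

# Trivial pseudonymizer used only for demonstration
print(deidentify_text("Contact: jane@example.org", lambda t, v: f"[{t}]"))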

7. scripts/utils/country_regulations.py - Country-Specific Regulations

Purpose: Manage country-specific data privacy regulations

Responsibilities:

  • Define country-specific data fields

  • Provide detection patterns for local identifiers

  • Document regulatory requirements

  • Support multiple jurisdictions simultaneously

Key Classes:

  • CountryRegulationManager: Orchestrates regulations

  • CountryRegulation: Single country configuration

  • DataField: Field definition with validation

  • PrivacyLevel / DataFieldType: Enumerations

Supported Countries: US, EU, GB, CA, AU, IN, ID, BR, PH, ZA, KE, NG, GH, UG

Design Pattern: Registry pattern for country lookup, Factory pattern for regulation creation
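
A minimal sketch of the registry/factory idea (laws, field names, and patterns shown are examples only, not the module's actual definitions):

# Illustrative sketch only: laws, fields, and patterns shown are examples.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FieldDef:
    name: str
    pattern: str                 # regex for a local identifier format

@dataclass
class Regulation:
    country_code: str
    law: str
    fields: List[FieldDef] = field(default_factory=list)

_REGISTRY: Dict[str, Regulation] = {}

def register(reg: Regulation) -> None:
    _REGISTRY[reg.country_code] = reg

def get_regulation(country_code: str) -> Regulation:
    return _REGISTRY[country_code.upper()]

register(Regulation("US", "HIPAA",
                    [FieldDef("ssn", r"\b\d{3}-\d{2}-\d{4}\b")]))
register(Regulation("IN", "DPDP Act 2023",
                    [FieldDef("aadhaar", r"\b\d{4}\s?\d{4}\s?\d{4}\b")]))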

Data Flow

Step-by-Step Data Flow:

1. User invokes: python main.py
                 β”‚
                 β–Ό
2. main.py initializes logging
                 β”‚
                 β–Ό
3. Step 0: load_study_dictionary()
                 β”‚
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚                            β”‚
   β–Ό                            β–Ό
Read Excel           Split sheets into tables
Dictionary                     β”‚
                               β–Ό
                     Deduplicate columns
                               β”‚
                               β–Ό
                     Save as JSONL in:
                     results/data_dictionary_mappings/
                 β”‚
                 β–Ό
4. Step 1: extract_excel_to_jsonl()
                 β”‚
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚                            β”‚
   β–Ό                            β–Ό
Find Excel files    Process each file
in dataset/                    β”‚
                   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                   β”‚                        β”‚
                   β–Ό                        β–Ό
           Read Excel sheets    Convert data types
                   β”‚                        β”‚
                   β–Ό                        β–Ό
           Clean records        Handle NaN/dates
                   β”‚                        β”‚
                   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                               β”‚
                               β–Ό
                     Save as JSONL in:
                     results/dataset/<dataset_name>/
                         β”œβ”€β”€ original/  (all columns)
                         └── cleaned/   (duplicates removed)
                 β”‚
                 β–Ό
5. Step 2: deidentify_dataset() [OPTIONAL]
                 β”‚
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚                            β”‚
   β–Ό                            β–Ό
Recursively find      Process each file
JSONL files                    β”‚
in subdirs         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                   β”‚                        β”‚
                   β–Ό                        β–Ό
           Detect PHI/PII       Generate pseudonyms
                   β”‚                        β”‚
                   β–Ό                        β–Ό
           Replace sensitive    Maintain mappings
                data                        β”‚
                   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                               β”‚
                               β–Ό
                     Save de-identified in:
                     results/deidentified/<dataset_name>/
                         β”œβ”€β”€ original/  (de-identified)
                         β”œβ”€β”€ cleaned/   (de-identified)
                         └── _deidentification_audit.json

                     Store encrypted mappings:
                     results/deidentified/mappings/
                         └── mappings.enc

Data Flow Architecture

The system processes data through three primary pipelines:

Pipeline 1: Data Dictionary Processing

Excel File (Dictionary) β†’ pd.read_excel() β†’ Table Detection β†’ Split Tables
β†’ Column Deduplication β†’ "Ignore Below" Filter β†’ JSONL Export (per table)

Algorithm: Two-Phase Table Detection
1. Horizontal Split: Identify empty rows as boundaries
2. Vertical Split: Within horizontal strips, identify empty columns
3. Result: NxM tables from single sheet

Pipeline 2: Data Extraction

Excel Files (Dataset) β†’ find_excel_files() β†’ pd.read_excel()
β†’ Type Conversion β†’ Duplicate Column Removal β†’ JSONL Export
β†’ File Integrity Check β†’ Statistics Collection

Outputs: Two versions (original/, cleaned/) for validation

Pipeline 3: De-identification (Optional)

JSONL Files β†’ Pattern Matching (Regex + Country-Specific)
β†’ PHI/PII Detection β†’ Pseudonym Generation (Cryptographic Hash)
β†’ Mapping Storage (Encrypted) β†’ Date Shifting (Consistent Offset)
β†’ Validation β†’ Encrypted JSONL Output + Audit Log

Security: Fernet encryption, deterministic pseudonyms, audit trails

Design Decisions

1. JSONL Format

Rationale:

  • Line-oriented: Each record is independent

  • Streaming friendly: Can process files line-by-line

  • Easy to merge: Just concatenate files

  • Human-readable: Each line is valid JSON

  • Standard format: Wide tool support

Alternative Considered: CSV

Rejected Because: CSV doesn’t handle nested structures well

2. Automatic Table Detection

Rationale:

  • Excel sheets often contain multiple logical tables

  • Empty rows/columns serve as natural separators

  • Preserves semantic structure of data

Algorithm:

  1. Find maximum consecutive empty rows/columns

  2. Split at these boundaries

  3. Handle special β€œIgnore below” markers

3. Dynamic Dataset Detection

Rationale:

  • Avoid hardcoding dataset names

  • Enable working with multiple datasets

  • Reduce configuration burden

Implementation: Scan data/dataset/ for first subdirectory

4. Progress Tracking

Rationale:

  • Long-running operations need real-time feedback

  • Users want to know progress and time remaining

  • Helps identify slow operations

  • Clean console output is essential

Implementation:

  • tqdm library for all progress bars (required dependency)

  • tqdm.write() for status messages during progress tracking

  • Consistent usage across all processing modules:

    • extract_data.py: File and row processing

    • load_dictionary.py: Sheet processing

    • deidentify.py: Batch de-identification

Design Decision: tqdm is a required dependency, not an optional one, ensuring a consistent user experience

5. Centralized Configuration

Rationale:

  • Single source of truth

  • Easy to modify paths

  • Reduces coupling

  • Testability

Alternative Considered: Environment variables

Rejected Because: More complex for non-technical users

Algorithms and Data Structures

Algorithm 1: Two-Phase Table Detection

Located in: scripts/load_dictionary.py β†’ _split_sheet_into_tables()

Purpose: Intelligently split Excel sheets containing multiple logical tables into separate tables

Algorithm:

Phase 1: Horizontal Splitting
1. Identify rows where ALL cells are null/empty
2. Use these rows as boundaries to split sheet into horizontal strips
3. Each strip potentially contains one or more tables side-by-side

Phase 2: Vertical Splitting (within each horizontal strip)
1. Identify columns where ALL cells are null/empty
2. Use these columns as boundaries to split strip into tables
3. Remove completely empty tables
4. Drop rows that are entirely null

Result: NxM independent tables from single sheet

Data Structures:

# Input: Raw DataFrame (no assumptions about structure)
df: pd.DataFrame  # header=None, all data preserved

# Intermediate: List of horizontal strips
horizontal_strips: List[pd.DataFrame]

# Output: List of independent tables
all_tables: List[pd.DataFrame]

Edge Cases Handled:

  • Empty rows between tables (common in medical research data dictionaries)

  • Empty columns between tables (side-by-side table layouts)

  • Tables with no data rows (only headers) - preserved with metadata

  • β€œignore below” markers - subsequent tables saved to separate directory

  • Duplicate column names - automatically suffixed with β€œ_1”, β€œ_2”, etc.

Complexity: O(r Γ— c) where r = rows, c = columns
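
A minimal sketch of the two-phase split using pandas (illustrative; the real _split_sheet_into_tables() also handles “ignore below” markers and header-only tables):

# Illustrative sketch only: the real _split_sheet_into_tables() also handles
# "ignore below" markers and header-only tables.
from typing import List
import pandas as pd

def _split(df: pd.DataFrame, axis: int) -> List[pd.DataFrame]:
    """Split into contiguous blocks separated by all-null rows (axis=0) or columns (axis=1)."""
    is_empty = df.isna().all(axis=1) if axis == 0 else df.isna().all(axis=0)
    blocks, current = [], []
    for label in (df.index if axis == 0 else df.columns):
        if bool(is_empty[label]):
            if current:
                blocks.append(current)
                current = []
        else:
            current.append(label)
    if current:
        blocks.append(current)
    return [df.loc[b] if axis == 0 else df[b] for b in blocks]

def split_sheet_into_tables(df: pd.DataFrame) -> List[pd.DataFrame]:
    tables = []
    for strip in _split(df, axis=0):          # Phase 1: horizontal strips
        for table in _split(strip, axis=1):   # Phase 2: tables within a strip
            table = table.dropna(how="all")   # drop rows that are entirely null
            if not table.empty:
                tables.append(table.reset_index(drop=True))
    return tables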

β€”

Algorithm 2: JSON Type Conversion

Located in: scripts/extract_data.py β†’ clean_record_for_json()

Purpose: Convert pandas/numpy types to JSON-serializable Python types

Algorithm:

For each key-value pair in record:
1. If value is pd.isna(value) β†’ None (JSON null)
2. If value is np.integer or np.floating β†’ call .item() to get Python int/float
3. If value is pd.Timestamp, np.datetime64, datetime, date β†’ convert to string
4. Otherwise β†’ keep as-is

Return cleaned dictionary

Type Mappings:

Pandas/Numpy Type    Python Type    JSON Type
pd.NA, np.nan        None           null
np.int64             int            number
np.float64           float          number
pd.Timestamp         str            string (ISO format)
datetime             str            string

Edge Cases:

  • Mixed-type columns β†’ handled by pandas during read_excel()

  • Unicode characters β†’ preserved with ensure_ascii=False

  • Large integers β†’ may lose precision if > 2^53 (JSON limitation)
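
A minimal sketch of this conversion, following the order in the algorithm above (clean_record_for_json() in extract_data.py is the authoritative version):

# Illustrative sketch only, following the order above; clean_record_for_json()
# in extract_data.py is the authoritative version.
from datetime import date, datetime
import numpy as np
import pandas as pd

def clean_record(record: dict) -> dict:
    cleaned = {}
    for key, value in record.items():
        is_container = isinstance(value, (list, tuple, set, dict, np.ndarray))
        if not is_container and pd.isna(value):
            cleaned[key] = None                              # NaN/NaT/pd.NA -> null
        elif isinstance(value, (np.integer, np.floating)):
            cleaned[key] = value.item()                      # numpy scalar -> int/float
        elif isinstance(value, (pd.Timestamp, np.datetime64, datetime, date)):
            cleaned[key] = str(value)                        # dates -> strings
        else:
            cleaned[key] = value                             # already JSON-friendly
    return cleaned

print(clean_record({"VISIT": np.int64(1), "DATE": pd.Timestamp("2020-05-13"), "NOTE": np.nan}))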

β€”

Algorithm 3: Duplicate Column Detection and Removal

Located in: scripts/extract_data.py β†’ clean_duplicate_columns()

Purpose: Remove duplicate columns with numeric suffixes (e.g., SUBJID2, SUBJID3)

Algorithm:

For each column in DataFrame:
1. Match pattern: column name ends with an optional underscore followed by digits
2. Extract base_name (everything before the suffix)
3. If base_name exists as a column:
   - Mark current column for removal (it's a duplicate)
   - Keep the base column
4. Otherwise:
   - Keep the column

Return DataFrame with only non-duplicate columns

Regex Pattern: ^(.+?)_?(\d+)$

Examples:

  • SUBJID (base) + SUBJID2, SUBJID3 β†’ Keep SUBJID, remove others

  • NAME_1 (numbered) + NAME (base) β†’ Keep NAME, remove NAME_1

  • ID3 (numbered) + ID (base) β†’ Keep ID, remove ID3
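
A minimal sketch of the suffix-based check (clean_duplicate_columns() in extract_data.py is the real implementation):

# Illustrative sketch only; clean_duplicate_columns() in extract_data.py is the
# real implementation.
import re
import pandas as pd

SUFFIX_RE = re.compile(r"^(.+?)_?(\d+)$")

def clean_duplicate_columns(df: pd.DataFrame) -> pd.DataFrame:
    keep = []
    for col in df.columns:
        match = SUFFIX_RE.match(str(col))
        if match and match.group(1) in df.columns:
            continue                      # e.g. SUBJID2 dropped when SUBJID exists
        keep.append(col)
    return df[keep]

df = pd.DataFrame(columns=["SUBJID", "SUBJID2", "NAME", "NAME_1", "VISIT"])
print(list(clean_duplicate_columns(df).columns))  # ['SUBJID', 'NAME', 'VISIT']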

β€”

Algorithm 4: Cryptographic Pseudonymization

Located in: scripts/deidentify.py β†’ PseudonymGenerator.generate()

Purpose: Generate deterministic, unique pseudonyms for PHI/PII values

Algorithm:

Input: (value, phi_type, template)

1. Check cache: If (phi_type, value.lower()) already pseudonymized:
   - Return cached pseudonym (ensures consistency)

2. Generate deterministic ID:
   a. Create hash_input = "{salt}:{phi_type}:{value}"
   b. hash_digest = SHA256(hash_input)
   c. Take first 4 bytes of digest
   d. Encode as base32, strip padding, take first 6 chars
   e. Result: Alphanumeric ID (e.g., "A4B8C3")

3. Apply template:
   - Replace {id} placeholder with generated ID
   - Example: "PATIENT-{id}" β†’ "PATIENT-A4B8C3"

4. Cache and return pseudonym

Security Properties:

  • Deterministic: Same input always produces same output (required for data consistency)

  • One-way: Cannot reverse SHA256 without salt

  • Salt-dependent: Different salt produces different pseudonyms

  • Collision-resistant: SHA256 ensures uniqueness

Data Structure:

class PseudonymGenerator:
    salt: str  # Cryptographic salt (32 bytes hex)
    _cache: Dict[Tuple[PHIType, str], str]  # Memoization
    _counter: Dict[PHIType, int]  # Statistics
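
A minimal sketch of the hash → base32 → template scheme described above (the real PseudonymGenerator adds caching and per-type counters):

# Illustrative sketch only: the real PseudonymGenerator adds caching and
# per-type counters.
import base64
import hashlib
import secrets

def make_pseudonym(value: str, phi_type: str, salt: str,
                   template: str = "PATIENT-{id}") -> str:
    digest = hashlib.sha256(f"{salt}:{phi_type}:{value}".encode("utf-8")).digest()
    short_id = base64.b32encode(digest[:4]).decode("ascii").rstrip("=")[:6]
    return template.format(id=short_id)

salt = secrets.token_hex(32)                           # persisted alongside the mappings in practice
print(make_pseudonym("John Doe", "NAME_FULL", salt))
print(make_pseudonym("John Doe", "NAME_FULL", salt))   # identical: deterministic per salt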

β€”

Algorithm 5: Consistent Date Shifting (Country-Aware with Smart Validation)

Located in: scripts/deidentify.py β†’ DateShifter.shift_date()

Changed in version 0.6.0: Enhanced date parsing with smart validation to handle ambiguous dates correctly.

Purpose: Shift all dates by a consistent offset to preserve temporal relationships, using intelligent multi-format detection, country-specific format priority, and smart validation

The Ambiguity Challenge:

Dates like 08/09/2020 or 12/12/2012 can be interpreted differently:

  • US Format (MM/DD): 08/09/2020 = August 9, 2020

  • India Format (DD/MM): 08/09/2020 = September 8, 2020

  • Symmetric: 12/12/2012 = December 12, 2012 (same in both formats)

Solution: Country-based priority with smart validation for unambiguous dates.

Algorithm:

Input: date_string, country_code

1. Check cache: If date_string already shifted:
   - Return cached shifted date (O(1) lookup)

2. Generate consistent offset (first time only):
   a. hash_digest = SHA256(seed)
   b. offset_int = first 4 bytes as integer
   c. offset_days = (offset_int % (2 * range + 1)) - range
   d. Cache offset for all future shifts

3. Determine format priority based on country:
   DD/MM/YYYY priority (day first):
      Countries: IN, ID, BR, ZA, EU, GB, AU, KE, NG, GH, UG
      Formats to try:
         a. YYYY-MM-DD (ISO 8601) - always unambiguous
         b. DD/MM/YYYY (country preference)
         c. DD-MM-YYYY (hyphen variant)
         d. DD.MM.YYYY (European dot notation)

   MM/DD/YYYY priority (month first):
      Countries: US, PH, CA
      Formats to try:
         a. YYYY-MM-DD (ISO 8601) - always unambiguous
         b. MM/DD/YYYY (country preference)
         c. MM-DD-YYYY (hyphen variant)

4. For each format in priority order:
   a. Try parsing date_string with format
   b. If parse successful:

      SMART VALIDATION (for ambiguous slash/hyphen formats):

      If format is MM/DD or DD/MM:
         - Extract first_num and second_num from date string

         CASE 1: first_num > 12
            β†’ Must be day (no 13th month exists!)
            β†’ If trying MM/DD format: REJECT, try next format
            β†’ If trying DD/MM format: ACCEPT

         CASE 2: second_num > 12
            β†’ Must be day (no 13th month exists!)
            β†’ If trying DD/MM format: REJECT, try next format
            β†’ If trying MM/DD format: ACCEPT

         CASE 3: Both numbers ≀ 12 (ambiguous)
            β†’ Trust country preference (first matching format wins)
            β†’ Ensures consistency within dataset

      - If validation passes: BREAK, use this format
      - If validation fails: Continue to next format

   c. If parse failed: Continue to next format

5. If NO format succeeded:
   - Return placeholder: [DATE-HASH]
   - Log warning for manual review

6. Apply shift with successful format:
   a. shifted_date = parsed_date + timedelta(days=offset_days)
   b. Format back to string in SAME format as input
   c. Cache result: {date_string: shifted_string}

7. Return shifted date string

Properties:

  • Unambiguous dates always correct: 13/05/2020 parsed as May 13 (only valid interpretation)

  • Ambiguous dates use country preference: 08/09/2020 interpreted consistently per country

  • Symmetric dates handled: 12/12/2012 uses country format (though result is same)

  • Consistent: All dates shifted by SAME offset (preserves intervals)

  • Deterministic: Seed determines offset (reproducible)

  • Format-preserving: Output format matches input format

  • HIPAA-compliant: Dates obscured while relationships preserved

Example Flow - Unambiguous Date:

# Processing "13/05/2020" for India (DD/MM preference)

Step 1: Try YYYY-MM-DD
   Result: ❌ Doesn't match pattern

Step 2: Try DD/MM/YYYY (India preference)
   Parse: βœ… Day=13, Month=05 (May 13, 2020)
   Validate: first_num=13 > 12 βœ… Must be day (valid)
   Result: βœ… SUCCESS β†’ May 13, 2020

# Processing "13/05/2020" for USA (MM/DD preference)

Step 1: Try YYYY-MM-DD
   Result: ❌ Doesn't match pattern

Step 2: Try MM/DD/YYYY (USA preference)
   Parse: ❌ Month=13 invalid (strptime fails)
   Result: Continue to next format

Step 3: Try DD/MM/YYYY (fallback)
   Parse: βœ… Day=13, Month=05
   Validate: first_num=13 > 12 βœ… Must be day (valid)
   Result: βœ… SUCCESS β†’ May 13, 2020

Example Flow - Ambiguous Date:

# Processing "08/09/2020" for India (DD/MM preference)

Step 1: Try YYYY-MM-DD
   Result: ❌ Doesn't match pattern

Step 2: Try DD/MM/YYYY (India preference)
   Parse: βœ… Day=08, Month=09 (Sep 8, 2020)
   Validate: first_num=8 ≀ 12 βœ…, second_num=9 ≀ 12 βœ…
             Both ambiguous β†’ Trust country preference
   Result: βœ… SUCCESS β†’ Sep 8, 2020 (India interpretation)

# Processing "08/09/2020" for USA (MM/DD preference)

Step 1: Try YYYY-MM-DD
   Result: ❌ Doesn't match pattern

Step 2: Try MM/DD/YYYY (USA preference)
   Parse: βœ… Month=08, Day=09 (Aug 9, 2020)
   Validate: first_num=8 ≀ 12 βœ…, second_num=9 ≀ 12 βœ…
             Both ambiguous β†’ Trust country preference
   Result: βœ… SUCCESS β†’ Aug 9, 2020 (USA interpretation)

Complete Example with Shifting:

# For India (DD/MM/YYYY format):
shifter_in = DateShifter(country_code="IN", seed="abc123")
"2014-09-04" β†’ "2013-12-14"  # ISO β†’ Always Sep 4 (unambiguous)
"04/09/2014" β†’ "14/12/2013"  # DD/MM β†’ Sep 4, 2014 β†’ Dec 14, 2013 (-265 days)
"09/09/2014" β†’ "19/12/2013"  # DD/MM β†’ Sep 9, 2014 β†’ Dec 19, 2013 (-265 days)
"13/05/2020" β†’ "03/08/2019"  # Unambiguous β†’ May 13 (only valid parsing)
# Interval preserved: All dates shifted by -265 days

# For United States (MM/DD/YYYY format):
shifter_us = DateShifter(country_code="US", seed="abc123")
"2014-09-04" β†’ "2013-12-14"  # ISO β†’ Always Sep 4 (unambiguous)
"04/09/2014" β†’ "07/17/2013"  # MM/DD β†’ Apr 9, 2014 β†’ Jul 17, 2013 (-265 days)
"04/14/2014" β†’ "07/22/2013"  # MM/DD β†’ Apr 14, 2014 β†’ Jul 22, 2013 (-265 days)
"13/05/2020" β†’ "08/03/2019"  # Unambiguous β†’ May 13 (validation rejects MM/DD)
# Interval preserved: All dates shifted by -265 days

Time Complexity:

  • Cache hit: O(1) - constant time lookup

  • Cache miss: O(f) where f = number of formats (typically 3-4)

  • Overall: O(1) amortized for repeated dates in large datasets
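
A compact sketch of the core logic above: country-aware format priority, smart validation, and a hash-derived consistent offset (illustrative; the real DateShifter also caches results and supports the DD-MM-YYYY and DD.MM.YYYY variants):

# Illustrative sketch only: the real DateShifter also caches results and
# supports additional separators (DD-MM-YYYY, DD.MM.YYYY).
import hashlib
import re
from datetime import datetime, timedelta

DAY_FIRST = {"IN", "ID", "BR", "ZA", "EU", "GB", "AU", "KE", "NG", "GH", "UG"}

def _offset_days(seed: str, day_range: int = 365) -> int:
    n = int.from_bytes(hashlib.sha256(seed.encode("utf-8")).digest()[:4], "big")
    return n % (2 * day_range + 1) - day_range

def shift_date(date_str: str, country: str, seed: str) -> str:
    formats = ["%Y-%m-%d"]                         # ISO first: always unambiguous
    formats += ["%d/%m/%Y", "%m/%d/%Y"] if country in DAY_FIRST else ["%m/%d/%Y", "%d/%m/%Y"]
    nums = re.findall(r"\d+", date_str)
    for fmt in formats:
        try:
            parsed = datetime.strptime(date_str, fmt)
        except ValueError:
            continue
        # Smart validation (mirrors the rules above; strptime already rejects
        # impossible months, so these checks are explicit restatements).
        if fmt == "%m/%d/%Y" and int(nums[0]) > 12:
            continue
        if fmt == "%d/%m/%Y" and int(nums[1]) > 12:
            continue
        shifted = parsed + timedelta(days=_offset_days(seed))
        return shifted.strftime(fmt)               # format-preserving output
    return "[DATE-HASH]"                           # unparseable: placeholder

print(shift_date("13/05/2020", "US", "abc123"))    # May 13, shifted by the seed's offset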

β€”

Data Structure: Mapping Store (Encrypted)

Located in: scripts/deidentify.py β†’ MappingStore

Purpose: Securely store original β†’ pseudonym mappings

Structure:

# In-memory structure
mappings: Dict[str, Dict[str, Any]] = {
    "PHI_TYPE:original_value": {
        "original": "John Doe",  # Original sensitive value
        "pseudonym": "PATIENT-A4B8C3",  # Generated pseudonym
        "phi_type": "NAME_FULL",  # Type of PHI
        "created_at": "2025-10-13T14:32:15",  # Timestamp
        "metadata": {"pattern": "Full name pattern"}
    },
    ...
}

# On-disk structure (encrypted with Fernet)
File: mappings.enc
Content: Fernet.encrypt(JSON.dumps(mappings))

Encryption: Fernet (symmetric encryption, 128-bit AES in CBC mode with HMAC)

Security:

  • Encryption key stored separately

  • Keys never committed to version control

  • Audit log exports WITHOUT original values by default
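
A minimal sketch of the encrypt/decrypt round trip with Fernet (illustrative; the real MappingStore also manages key storage and incremental updates):

# Illustrative sketch only: the real MappingStore also manages key storage and
# incremental updates.
import json
from pathlib import Path
from cryptography.fernet import Fernet

def save_mappings(mappings: dict, path: Path, key: bytes) -> None:
    token = Fernet(key).encrypt(json.dumps(mappings).encode("utf-8"))
    path.write_bytes(token)

def load_mappings(path: Path, key: bytes) -> dict:
    return json.loads(Fernet(key).decrypt(path.read_bytes()).decode("utf-8"))

key = Fernet.generate_key()                        # stored separately from mappings.enc
mappings = {"NAME_FULL:john doe": {"pseudonym": "PATIENT-A4B8C3"}}
save_mappings(mappings, Path("mappings.enc"), key)
print(load_mappings(Path("mappings.enc"), key))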

β€”

Data Structure: JSONL File Format

Structure:

Each line is a valid JSON object (one record per line):

{"SUBJID": "INV001", "VISIT": 1, "TST_RESULT": "Positive", "source_file": "10_TST.xlsx"}
{"SUBJID": "INV002", "VISIT": 1, "TST_RESULT": "Negative", "source_file": "10_TST.xlsx"}
{"SUBJID": "INV003", "VISIT": 1, "TST_RESULT": "Positive", "source_file": "10_TST.xlsx"}

Advantages:

  • Streamable: Can process without loading entire file into memory

  • Line-oriented: Easy to split, merge, or process in parallel

  • JSON-compatible: Works with standard JSON parsers

  • Human-readable: Can inspect with head, tail, grep

Metadata Fields:

  • source_file: Original Excel filename for traceability

  • _metadata: Optional metadata (e.g., for empty files with structure)
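
A minimal sketch of streaming a JSONL file one record at a time (the path shown is hypothetical):

# Illustrative sketch only; the path shown is hypothetical.
import json
from pathlib import Path
from typing import Dict, Iterator

def iter_jsonl(path: Path) -> Iterator[dict]:
    with path.open("r", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():                       # skip blank lines
                yield json.loads(line)

counts: Dict[str, int] = {}
for record in iter_jsonl(Path("results/dataset/example/cleaned/10_TST.jsonl")):
    src = record.get("source_file", "unknown")
    counts[src] = counts.get(src, 0) + 1           # count records per source file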

β€”

Data Structure: Progress Tracking with tqdm

Integration Pattern:

from tqdm import tqdm
import sys

# File-level progress
for file in tqdm(files, desc="Processing", unit="file",
                 file=sys.stdout, dynamic_ncols=True, leave=True):
    # Use tqdm.write() instead of print() for clean output
    tqdm.write(f"Processing: {file.name}")

    # Row-level progress (if needed)
    for row in tqdm(rows, desc="Rows", leave=False):
        process(row)

# Result: Clean progress bars without interfering with logging

Why tqdm.write():

  • Ensures messages don’t corrupt progress bar display

  • Automatically repositions progress bar after message

  • Works with logging system

Dependencies and Their Roles

pandas (>= 2.0.0)

  • Role: DataFrame manipulation, Excel reading, data analysis

  • Key functions used:

    • pd.read_excel(): Excel file parsing

    • df.to_json(): JSONL export

    • pd.isna(): Null value detection

  • Why chosen: Industry standard for data manipulation in Python

openpyxl (>= 3.1.0)

  • Role: Excel file format (.xlsx) support for pandas

  • Used by: pd.read_excel(engine='openpyxl')

  • Why chosen: Pure Python, no external dependencies, handles modern Excel formats

numpy (>= 1.24.0)

  • Role: Numerical operations, type handling

  • Key types used:

    • np.int64, np.float64: Numeric types from pandas

    • np.datetime64: Datetime types

    • np.nan: Missing value representation

  • Why chosen: Required by pandas, efficient numerical operations

tqdm (>= 4.66.0)

  • Role: Progress bars and user feedback

  • Key features:

    • Real-time progress tracking

    • ETA calculations

    • Clean console output with tqdm.write()

  • Why chosen: Most popular Python progress bar library, excellent integration

cryptography (>= 41.0.0)

  • Role: Encryption for de-identification mappings

  • Key components:

    • Fernet: Symmetric encryption

    • hashlib.sha256(): Cryptographic hashing

    • secrets: Secure random number generation

  • Why chosen: Industry-standard cryptography library, HIPAA-compliant algorithms

sphinx (>= 7.0.0) + extensions

  • Role: Documentation generation

  • Extensions used:

    • sphinx.ext.autodoc: Automatic API documentation from docstrings

    • sphinx.ext.napoleon: Google/NumPy style docstring support

    • sphinx_autodoc_typehints: Type hint documentation

    • sphinx-autobuild: Live documentation preview (dev dependency)

  • Why chosen: Standard for Python project documentation

Documentation Workflow:

Added in version 0.3.0: make docs-watch for automatic documentation rebuilding.

  • Autodoc is ENABLED: Sphinx automatically extracts documentation from Python docstrings

  • NOT Automatic by Default: Documentation does NOT rebuild automatically on every code change

  • Manual Build: Run make docs to regenerate documentation after changes

  • Auto-Rebuild (Development): Use make docs-watch for live preview during documentation development

How Autodoc Works:

  1. Write Google-style docstrings in Python code

  2. Use .. automodule:: directives in .rst files

  3. Run make docs - Sphinx extracts docstrings and generates HTML

  4. Or use make docs-watch - Server auto-rebuilds on file changes

Important: While autodoc extracts documentation automatically from code, you must build the documentation manually (or use watch mode) to see the changes.