Architecture

For Developers: Comprehensive Technical Documentation

This document provides in-depth technical details about RePORTaLiN’s architecture, internal algorithms, data structures, dependencies, design patterns, and extension points to enable effective maintenance, debugging, and feature development.

Last Updated: October 23, 2025
Current Version: 0.8.6
Code Optimization: 35% reduction (640 lines) while maintaining 100% functionality
Historical Enhancements: Enhanced modules from v0.0.1 through v0.0.12

System Overview

RePORTaLiN is designed as a modular, pipeline-based system for processing sensitive medical research data from Excel to JSONL format with optional PHI/PII de-identification. The architecture emphasizes:

  • Simplicity: Single entry point (main.py) with clear pipeline stages

  • Modularity: Clear separation of concerns with dedicated modules

  • Robustness: Comprehensive error handling with graceful degradation

  • Transparency: Detailed logging at every step with audit trails

  • Security: Encryption-by-default for de-identification mappings

  • Compliance: Multi-country privacy regulation support (14 countries)

  • Performance: Optimized for high throughput (benchmarks pending)

Architecture Diagram

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         main.py                             β”‚
β”‚                   (Pipeline Orchestrator)                   β”‚
β”‚                                                             β”‚
β”‚  β€’ Command-line argument parsing                           β”‚
β”‚  β€’ Step execution coordination                             β”‚
β”‚  β€’ High-level error handling                              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚                      β”‚
         β–Ό                      β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  load_dict.py  β”‚     β”‚ extract_data.py  β”‚
β”‚                β”‚     β”‚                  β”‚
β”‚ β€’ Sheet split  β”‚     β”‚ β€’ File discovery β”‚
β”‚ β€’ Table detect β”‚     β”‚ β€’ Type convert   β”‚
β”‚ β€’ Duplicate    β”‚     β”‚ β€’ JSONL export   β”‚
β”‚   handling     β”‚     β”‚ β€’ Progress track β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                      β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
                    β–Ό
           β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
           β”‚   config.py    β”‚
           β”‚                β”‚
           β”‚ β€’ Paths        β”‚
           β”‚ β€’ Settings     β”‚
           β”‚ β€’ Dynamic      β”‚
           β”‚   detection    β”‚
           β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
                    β–Ό
           β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
           β”‚ logging  β”‚
           β”‚                β”‚
           β”‚ β€’ Log files    β”‚
           β”‚ β€’ Console out  β”‚
           β”‚ β€’ SUCCESS lvl  β”‚
           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Core Components

1. main.py - Pipeline Orchestrator

Purpose: Central entry point and workflow coordinator

Responsibilities:

  • Parse command-line arguments

  • Execute pipeline steps in sequence

  • Handle top-level errors

  • Coordinate logging

Key Functions:

  • main(): Entry point, orchestrates full pipeline

  • run_step(): Wrapper for step execution with error handling

Design Pattern: Command pattern with error handling decorator
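
A minimal sketch of this step-wrapper pattern (names and messages here are illustrative assumptions, not the exact main.py implementation):

# Illustrative sketch only: names and messages are assumptions, not the
# exact main.py implementation.
import sys
from typing import Callable

def run_step(name: str, func: Callable[[], int]) -> int:
    """Run one pipeline step with top-level error handling."""
    print(f"[STEP] {name} ...")
    try:
        result = func()
        print(f"[OK] {name} completed")
        return result
    except Exception as exc:            # catch, log, and signal failure
        print(f"[ERROR] {name} failed: {exc}", file=sys.stderr)
        return 1

def main() -> int:
    steps = [
        ("Load data dictionary", lambda: 0),   # placeholder callables
        ("Extract Excel to JSONL", lambda: 0),
    ]
    return max(run_step(name, fn) for name, fn in steps)

if __name__ == "__main__":
    sys.exit(main())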

2. config.py - Configuration Management

Purpose: Centralized configuration

Responsibilities:

  • Define all file paths

  • Manage settings

  • Dynamic dataset detection

  • Path resolution

Key Features:

  • Automatic dataset folder detection

  • Relative path resolution

  • Environment-agnostic configuration

  • Type hints for IDE support

Design Pattern: Module-level singleton
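
A minimal sketch of a module-level configuration singleton with dynamic dataset detection (paths and helper names are illustrative, not the actual config.py contents):

# Illustrative sketch only: paths and helper names are assumptions, not the
# real config.py.
from pathlib import Path
from typing import Optional

PROJECT_ROOT = Path(__file__).resolve().parent
DATASET_ROOT = PROJECT_ROOT / "data" / "dataset"
RESULTS_DIR = PROJECT_ROOT / "results"

def _detect_dataset_dir(root: Path) -> Optional[Path]:
    """Return the first subdirectory under data/dataset/, if any."""
    if not root.is_dir():
        return None
    subdirs = sorted(p for p in root.iterdir() if p.is_dir())
    return subdirs[0] if subdirs else None

# Evaluated once at import time, so the module behaves as a singleton.
DATASET_DIR: Optional[Path] = _detect_dataset_dir(DATASET_ROOT)
DATASET_NAME: str = DATASET_DIR.name if DATASET_DIR else "unknown"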

3. scripts/extract_data.py - Data Extraction

Purpose: Convert Excel files to JSONL

Responsibilities:

  • File discovery and validation

  • Excel reading and parsing

  • Data type conversion

  • JSONL serialization

  • Progress tracking with tqdm

Key Functions:

  • extract_excel_to_jsonl(): Batch processing with progress bars

  • process_excel_file(): Single file processing

  • convert_dataframe_to_jsonl(): DataFrame conversion

  • clean_record_for_json(): Type conversion

  • is_dataframe_empty(): Empty detection

  • find_excel_files(): File discovery

Progress Tracking:

  • Uses tqdm for all file and row processing

  • Status messages via tqdm.write() for clean output

  • Summary statistics after completion

Design Pattern: Pipeline pattern with functional composition
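
A minimal sketch of the discovery-and-export flow described above (function names are illustrative, not the exact extract_data.py signatures):

# Illustrative sketch only: names mirror the list above but are not the exact
# extract_data.py API.
import json
from pathlib import Path
from typing import List
import pandas as pd

def find_excel_files(root: Path) -> List[Path]:
    """Recursively discover .xlsx files under the dataset directory."""
    return sorted(root.rglob("*.xlsx"))

def dataframe_to_jsonl(df: pd.DataFrame, out_path: Path, source: str) -> int:
    """Write one JSON object per row, tagging each record with its source file."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as fh:
        for record in df.to_dict(orient="records"):
            record["source_file"] = source
            fh.write(json.dumps(record, ensure_ascii=False, default=str) + "\n")
    return len(df)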

4. scripts/load_dictionary.py - Dictionary Processing

Purpose: Process data dictionary Excel file

Responsibilities:

  • Sheet processing with progress tracking

  • Table detection and splitting

  • Duplicate column handling

  • Table serialization

Key Functions:

  • load_study_dictionary(): High-level API with tqdm progress bars

  • process_excel_file(): Sheet processing

  • _split_sheet_into_tables(): Table detection

  • _process_and_save_tables(): Table output

  • _deduplicate_columns(): Column name handling

Progress Tracking:

  • tqdm progress bars for sheet processing

  • tqdm.write() for status messages

  • Clean console output during processing

Design Pattern: Functional composition, with a Strategy-style table detection algorithm

5. scripts/utils/logging.py - Logging System

Purpose: Centralized logging infrastructure

Responsibilities:

  • Create timestamped log files

  • Dual output (console + file)

  • Custom SUCCESS log level

  • Structured logging with configurable verbosity

Key Features:

  • Custom SUCCESS level (between INFO and WARNING)

  • Timestamped log files in .logs/ directory

  • Console and file handlers with different filtering

  • UTF-8 encoding for international characters

  • Works alongside tqdm for clean progress bar output

  • Verbose mode: DEBUG-level logging via -v flag

Log Levels:

DEBUG (10)    # Verbose mode only: file processing, patterns, details
INFO (20)     # Default: major steps, summaries
SUCCESS (25)  # Custom: successful completions
WARNING (30)  # Potential issues
ERROR (40)    # Failures
CRITICAL (50) # Fatal errors

Console vs. File Output:

  • Console: Only SUCCESS, ERROR, and CRITICAL (keeps terminal clean)

  • File: INFO or DEBUG (depending on --verbose flag) and above
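
A minimal sketch of how the custom SUCCESS level and the console/file split described above can be wired up with the standard logging module (illustrative; the actual scripts/utils/logging.py differs in details):

# Illustrative sketch only: the real scripts/utils/logging.py may differ.
import logging
from pathlib import Path

SUCCESS = 25                              # between INFO (20) and WARNING (30)
logging.addLevelName(SUCCESS, "SUCCESS")

def _success(self, message, *args, **kwargs):
    if self.isEnabledFor(SUCCESS):
        self._log(SUCCESS, message, args, **kwargs)

logging.Logger.success = _success         # expose log.success(...)

class ConsoleFilter(logging.Filter):
    """Allow only SUCCESS, ERROR and CRITICAL records on the console."""
    def filter(self, record: logging.LogRecord) -> bool:
        return record.levelno in (SUCCESS, logging.ERROR, logging.CRITICAL)

Path(".logs").mkdir(exist_ok=True)
logger = logging.getLogger("reportalin")
logger.setLevel(logging.DEBUG)

console = logging.StreamHandler()
console.addFilter(ConsoleFilter())

file_handler = logging.FileHandler(".logs/run.log", encoding="utf-8")
file_handler.setLevel(logging.INFO)       # or DEBUG when --verbose is used

logger.addHandler(console)
logger.addHandler(file_handler)
logger.success("Pipeline step completed")  # console + file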

Verbose Logging:

When --verbose or -v flag is used:

  • Log level set to DEBUG in main.py

  • VerboseLogger class provides tree-view, context-managed output

  • Integrated into all three pipeline steps (Dictionary Loading, Data Extraction, De-identification)

  • Additional details logged throughout pipeline:

    • File lists and processing order

    • Sheet/table detection details with row/column counts

    • Duplicate column detection and removal reasons with detailed explanations

    • Processing phases with timing information (per-file, per-sheet, per-table, overall)

    • PHI/PII pattern matches and replacement counts (de-identification step)

    • Record-level progress (every 1000 records for large files)

    • Validation results with issue tracking

    • Comprehensive timing for all operations

VerboseLogger Class:

The VerboseLogger class provides centralized verbose logging with tree-view formatting and context management:

# In module
from scripts.utils import logging as log
vlog = log.get_verbose_logger()

# Usage in code
with vlog.file_processing("Data dictionary", total_records=43):
    vlog.metric("Total sheets", 43)

    for sheet_idx, sheet_name in enumerate(sheets, 1):
        with vlog.step(f"Sheet {sheet_idx}/{len(sheets)}: '{sheet_name}'"):
            vlog.metric("Tables detected", num_tables)
            vlog.detail("Processing table...")
            vlog.timing("Sheet processing time", elapsed)

Output Format:

DEBUG: β”œβ”€ Processing: Data extraction (65 files)
DEBUG:   β”œβ”€ Total files to process: 65
DEBUG:   β”œβ”€ File 1/65: 10_TST.xlsx
DEBUG:   β”‚  β”œβ”€ Loading Excel file
DEBUG:   β”‚  β”‚  β”œβ”€ Rows: 412
DEBUG:   β”‚  β”‚  β”œβ”€ Columns: 28
DEBUG:   β”‚  β”œβ”€ Cleaning duplicate columns
DEBUG:   β”‚  β”‚  β”œβ”€ Marking SUBJID2 for removal (duplicate of SUBJID)
DEBUG:   β”‚  β”‚  β”œβ”€ Removed 3 duplicate columns: SUBJID2, AGE_2, PHONE_3
DEBUG:   β”‚  └─ ⏱ Total processing time: 0.45s
DEBUG:   └─ ⏱ Overall extraction time: 32.15s

Key Methods:

  • file_processing(name, total_records): Wrap entire file processing

  • step(name): Wrap a processing step

  • detail(message): Log a detail line

  • metric(label, value): Log a metric/statistic

  • timing(operation, seconds): Log timing information

  • items_list(label, items, max_show): Log list of items with truncation
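
A minimal sketch of how such a tree-view, context-managed logger can be built on top of the standard logging module (illustrative only; the real VerboseLogger differs in formatting and bookkeeping):

# Illustrative sketch only: the real VerboseLogger differs in formatting
# (nested │ guides) and bookkeeping.
import logging
import time
from contextlib import contextmanager
from typing import Optional

class TreeLogger:
    def __init__(self, logger: logging.Logger) -> None:
        self._logger = logger
        self._depth = 0

    def _emit(self, text: str) -> None:
        self._logger.debug("%s├─ %s", "  " * self._depth, text)

    @contextmanager
    def step(self, name: str):
        self._emit(f"Processing: {name}")
        self._depth += 1
        start = time.perf_counter()
        try:
            yield
        finally:
            self._depth -= 1
            self._emit(f"⏱ {name}: {time.perf_counter() - start:.2f}s")

    @contextmanager
    def file_processing(self, name: str, total_records: Optional[int] = None):
        with self.step(name):
            if total_records is not None:
                self.metric("Total records", total_records)
            yield

    def detail(self, message: str) -> None:
        self._emit(message)

    def metric(self, label: str, value) -> None:
        self._emit(f"{label}: {value}")

    def timing(self, operation: str, seconds: float) -> None:
        self._emit(f"⏱ {operation}: {seconds:.2f}s")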

Integration Pattern:

All pipeline modules (load_dictionary.py, extract_data.py, deidentify.py) follow this pattern:

  1. Import verbose logger at module level:

    from scripts.utils import logging as log
    vlog = log.get_verbose_logger()
    
  2. Wrap main processing functions with context managers:

    with vlog.file_processing(filename):
        # Main processing
        for item in items:
            with vlog.step(f"Processing {item}"):
                # Item processing
                vlog.metric("Metric", value)
                vlog.timing("Operation", elapsed_time)
    
  3. Always log both to file (verbose) and console (normal):

    log.debug(...)      # File only in verbose mode
    log.info(...)       # File always
    log.success(...)    # Console + file
    vlog.detail(...)    # File only in verbose mode (via context)
    

Usage:

# Standard (INFO level)
python main.py

# Verbose (DEBUG level) with tree-view logs
python main.py -v

# Verbose with de-identification
python main.py -v --enable-deidentification

# In code
from scripts.utils import logging as log

log.debug("Detailed processing info")  # Only in verbose mode
log.info("Major step completed")       # Always logged to file
log.success("Pipeline completed")      # Console + file
vlog.detail("Context-managed detail")  # Only in verbose mode

Design Pattern: Singleton logger instance with configurable formatting; VerboseLogger provides tree-view abstraction over standard logging

6. scripts/deidentify.py - De-identification Engine

Purpose: Remove PHI/PII from text data with pseudonymization

Responsibilities:

  • Detect PHI/PII using regex patterns

  • Generate consistent pseudonyms

  • Encrypt and store mappings

  • Validate de-identified output

  • Support country-specific regulations

  • Progress tracking for large datasets

Key Classes:

  • DeidentificationEngine: Main orchestrator

  • PseudonymGenerator: Creates deterministic placeholders

  • MappingStore: Secure encrypted storage

  • DateShifter: Consistent date shifting

  • PatternLibrary: Detection patterns

Progress Tracking:

  • tqdm progress bars for processing batches

  • tqdm.write() for status messages during processing

  • Summary statistics upon completion

Design Pattern: Strategy pattern for detection, Builder pattern for configuration
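
A minimal sketch of the detection-and-replacement loop (patterns and names are illustrative examples, not the engine's actual pattern library):

# Illustrative sketch only: patterns and names are examples, not the engine's
# actual pattern library.
import re
from typing import Callable, Dict

PATTERNS: Dict[str, re.Pattern] = {
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE_US": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def deidentify_text(text: str, pseudonymize: Callable[[str, str], str]) -> str:
    """Replace every pattern match with a pseudonym for (phi_type, value)."""
    for phi_type, pattern in PATTERNS.items():
        text = pattern.sub(lambda m, t=phi_type: pseudonymize(t, m.group(0)), text)
    return text

# Trivial pseudonymizer used only for demonstration
print(deidentify_text("Contact: jane@example.org", lambda t, v: f"[{t}]"))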

7. scripts/utils/country_regulations.py - Country-Specific Regulations

Purpose: Manage country-specific data privacy regulations

Responsibilities:

  • Define country-specific data fields

  • Provide detection patterns for local identifiers

  • Document regulatory requirements

  • Support multiple jurisdictions simultaneously

Key Classes:

  • CountryRegulationManager: Orchestrates regulations

  • CountryRegulation: Single country configuration

  • DataField: Field definition with validation

  • PrivacyLevel / DataFieldType: Enumerations

Supported Countries: US, EU, GB, CA, AU, IN, ID, BR, PH, ZA, KE, NG, GH, UG

Design Pattern: Registry pattern for country lookup, Factory pattern for regulation creation
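
A minimal sketch of the registry/factory idea (laws, field names, and patterns shown are examples only, not the module's actual definitions):

# Illustrative sketch only: laws, fields, and patterns shown are examples.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FieldDef:
    name: str
    pattern: str                 # regex for a local identifier format

@dataclass
class Regulation:
    country_code: str
    law: str
    fields: List[FieldDef] = field(default_factory=list)

_REGISTRY: Dict[str, Regulation] = {}

def register(reg: Regulation) -> None:
    _REGISTRY[reg.country_code] = reg

def get_regulation(country_code: str) -> Regulation:
    return _REGISTRY[country_code.upper()]

register(Regulation("US", "HIPAA",
                    [FieldDef("ssn", r"\b\d{3}-\d{2}-\d{4}\b")]))
register(Regulation("IN", "DPDP Act 2023",
                    [FieldDef("aadhaar", r"\b\d{4}\s?\d{4}\s?\d{4}\b")]))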

Data Flow

Step-by-Step Data Flow:

1. User invokes: python main.py
                 β”‚
                 β–Ό
2. main.py initializes logging
                 β”‚
                 β–Ό
3. Step 0: load_study_dictionary()
                 β”‚
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚                            β”‚
   β–Ό                            β–Ό
Read Excel           Split sheets into tables
Dictionary                     β”‚
                               β–Ό
                     Deduplicate columns
                               β”‚
                               β–Ό
                     Save as JSONL in:
                     results/data_dictionary_mappings/
                 β”‚
                 β–Ό
4. Step 1: extract_excel_to_jsonl()
                 β”‚
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚                            β”‚
   β–Ό                            β–Ό
Find Excel files    Process each file
in dataset/                    β”‚
                   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                   β”‚                        β”‚
                   β–Ό                        β–Ό
           Read Excel sheets    Convert data types
                   β”‚                        β”‚
                   β–Ό                        β–Ό
           Clean records        Handle NaN/dates
                   β”‚                        β”‚
                   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                               β”‚
                               β–Ό
                     Save as JSONL in:
                     results/dataset/<dataset_name>/
                         β”œβ”€β”€ original/  (all columns)
                         └── cleaned/   (duplicates removed)
                 β”‚
                 β–Ό
5. Step 2: deidentify_dataset() [OPTIONAL]
                 β”‚
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚                            β”‚
   β–Ό                            β–Ό
Recursively find      Process each file
JSONL files                    β”‚
in subdirs         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                   β”‚                        β”‚
                   β–Ό                        β–Ό
           Detect PHI/PII       Generate pseudonyms
                   β”‚                        β”‚
                   β–Ό                        β–Ό
           Replace sensitive    Maintain mappings
                data                        β”‚
                   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                               β”‚
                               β–Ό
                     Save de-identified in:
                     results/deidentified/<dataset_name>/
                         β”œβ”€β”€ original/  (de-identified)
                         β”œβ”€β”€ cleaned/   (de-identified)
                         └── _deidentification_audit.json

                     Store encrypted mappings:
                     results/deidentified/mappings/
                         └── mappings.enc

Data Flow Architecture

The system processes data through three primary pipelines:

Pipeline 1: Data Dictionary Processing

Excel File (Dictionary) β†’ pd.read_excel() β†’ Table Detection β†’ Split Tables
β†’ Column Deduplication β†’ "Ignore Below" Filter β†’ JSONL Export (per table)

Algorithm: Two-Phase Table Detection
1. Horizontal Split: Identify empty rows as boundaries
2. Vertical Split: Within horizontal strips, identify empty columns
3. Result: NxM tables from single sheet

Pipeline 2: Data Extraction

Excel Files (Dataset) β†’ find_excel_files() β†’ pd.read_excel()
β†’ Type Conversion β†’ Duplicate Column Removal β†’ JSONL Export
β†’ File Integrity Check β†’ Statistics Collection

Outputs: Two versions (original/, cleaned/) for validation

Pipeline 3: De-identification (Optional)

JSONL Files β†’ Pattern Matching (Regex + Country-Specific)
β†’ PHI/PII Detection β†’ Pseudonym Generation (Cryptographic Hash)
β†’ Mapping Storage (Encrypted) β†’ Date Shifting (Consistent Offset)
β†’ Validation β†’ Encrypted JSONL Output + Audit Log

Security: Fernet encryption, deterministic pseudonyms, audit trails

Design Decisions

1. JSONL Format

Rationale:

  • Line-oriented: Each record is independent

  • Streaming friendly: Can process files line-by-line

  • Easy to merge: Just concatenate files

  • Human-readable: Each line is valid JSON

  • Standard format: Wide tool support

Alternative Considered: CSV

Rejected Because: CSV doesn’t handle nested structures well

2. Automatic Table Detection

Rationale:

  • Excel sheets often contain multiple logical tables

  • Empty rows/columns serve as natural separators

  • Preserves semantic structure of data

Algorithm:

  1. Find maximum consecutive empty rows/columns

  2. Split at these boundaries

  3. Handle special β€œIgnore below” markers

3. Dynamic Dataset Detection

Rationale:

  • Avoid hardcoding dataset names

  • Enable working with multiple datasets

  • Reduce configuration burden

Implementation: Scan data/dataset/ for first subdirectory

4. Progress Tracking

Rationale:

  • Long-running operations need real-time feedback

  • Users want to know progress and time remaining

  • Helps identify slow operations

  • Clean console output is essential

Implementation:

  • tqdm library for all progress bars (required dependency)

  • tqdm.write() for status messages during progress tracking

  • Consistent usage across all processing modules:

    • extract_data.py: File and row processing

    • load_dictionary.py: Sheet processing

    • deidentify.py: Batch de-identification

Design Decision: tqdm is a required dependency, not an optional one, ensuring a consistent user experience

5. Centralized Configuration

Rationale:

  • Single source of truth

  • Easy to modify paths

  • Reduces coupling

  • Testability

Alternative Considered: Environment variables

Rejected Because: More complex for non-technical users

Algorithms and Data Structures

Algorithm 1: Two-Phase Table Detection

Located in: scripts/load_dictionary.py β†’ _split_sheet_into_tables()

Purpose: Intelligently split Excel sheets containing multiple logical tables into separate tables

Algorithm:

Phase 1: Horizontal Splitting
1. Identify rows where ALL cells are null/empty
2. Use these rows as boundaries to split sheet into horizontal strips
3. Each strip potentially contains one or more tables side-by-side

Phase 2: Vertical Splitting (within each horizontal strip)
1. Identify columns where ALL cells are null/empty
2. Use these columns as boundaries to split strip into tables
3. Remove completely empty tables
4. Drop rows that are entirely null

Result: NxM independent tables from single sheet

Data Structures:

# Input: Raw DataFrame (no assumptions about structure)
df: pd.DataFrame  # header=None, all data preserved

# Intermediate: List of horizontal strips
horizontal_strips: List[pd.DataFrame]

# Output: List of independent tables
all_tables: List[pd.DataFrame]

Edge Cases Handled:

  • Empty rows between tables (common in medical research data dictionaries)

  • Empty columns between tables (side-by-side table layouts)

  • Tables with no data rows (only headers) - preserved with metadata

  • β€œignore below” markers - subsequent tables saved to separate directory

  • Duplicate column names - automatically suffixed with β€œ_1”, β€œ_2”, etc.

Complexity: O(r Γ— c) where r = rows, c = columns
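
A minimal sketch of the two-phase split using pandas (illustrative; the real _split_sheet_into_tables() also handles “ignore below” markers and header-only tables):

# Illustrative sketch only: the real _split_sheet_into_tables() also handles
# "ignore below" markers and header-only tables.
from typing import List
import pandas as pd

def _split(df: pd.DataFrame, axis: int) -> List[pd.DataFrame]:
    """Split into contiguous blocks separated by all-null rows (axis=0) or columns (axis=1)."""
    is_empty = df.isna().all(axis=1) if axis == 0 else df.isna().all(axis=0)
    blocks, current = [], []
    for label in (df.index if axis == 0 else df.columns):
        if bool(is_empty[label]):
            if current:
                blocks.append(current)
                current = []
        else:
            current.append(label)
    if current:
        blocks.append(current)
    return [df.loc[b] if axis == 0 else df[b] for b in blocks]

def split_sheet_into_tables(df: pd.DataFrame) -> List[pd.DataFrame]:
    tables = []
    for strip in _split(df, axis=0):          # Phase 1: horizontal strips
        for table in _split(strip, axis=1):   # Phase 2: tables within a strip
            table = table.dropna(how="all")   # drop rows that are entirely null
            if not table.empty:
                tables.append(table.reset_index(drop=True))
    return tables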

β€”

Algorithm 2: JSON Type Conversion

Located in: scripts/extract_data.py β†’ clean_record_for_json()

Purpose: Convert pandas/numpy types to JSON-serializable Python types

Algorithm:

For each key-value pair in record:
1. If value is pd.isna(value) β†’ None (JSON null)
2. If value is np.integer or np.floating β†’ call .item() to get Python int/float
3. If value is pd.Timestamp, np.datetime64, datetime, date β†’ convert to string
4. Otherwise β†’ keep as-is

Return cleaned dictionary

Type Mappings:

Pandas/Numpy Type    Python Type    JSON Type
pd.NA, np.nan        None           null
np.int64             int            number
np.float64           float          number
pd.Timestamp         str            string (ISO format)
datetime             str            string

Edge Cases:

  • Mixed-type columns β†’ handled by pandas during read_excel()

  • Unicode characters β†’ preserved with ensure_ascii=False

  • Large integers β†’ may lose precision if > 2^53 (JSON limitation)
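
A minimal sketch of this conversion, following the order in the algorithm above (clean_record_for_json() in extract_data.py is the authoritative version):

# Illustrative sketch only, following the order above; clean_record_for_json()
# in extract_data.py is the authoritative version.
from datetime import date, datetime
import numpy as np
import pandas as pd

def clean_record(record: dict) -> dict:
    cleaned = {}
    for key, value in record.items():
        is_container = isinstance(value, (list, tuple, set, dict, np.ndarray))
        if not is_container and pd.isna(value):
            cleaned[key] = None                              # NaN/NaT/pd.NA -> null
        elif isinstance(value, (np.integer, np.floating)):
            cleaned[key] = value.item()                      # numpy scalar -> int/float
        elif isinstance(value, (pd.Timestamp, np.datetime64, datetime, date)):
            cleaned[key] = str(value)                        # dates -> strings
        else:
            cleaned[key] = value                             # already JSON-friendly
    return cleaned

print(clean_record({"VISIT": np.int64(1), "DATE": pd.Timestamp("2020-05-13"), "NOTE": np.nan}))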

β€”

Algorithm 3: Duplicate Column Detection and Removal

Located in: scripts/extract_data.py β†’ clean_duplicate_columns()

Purpose: Remove duplicate columns with numeric suffixes (e.g., SUBJID2, SUBJID3)

Algorithm:

For each column in DataFrame:
1. Match pattern: column name ends with an optional underscore followed by digits
2. Extract base_name (everything before the suffix)
3. If base_name exists as a column:
   - Mark current column for removal (it's a duplicate)
   - Keep the base column
4. Otherwise:
   - Keep the column

Return DataFrame with only non-duplicate columns

Regex Pattern: ^(.+?)_?(\d+)$

Examples:

  • SUBJID (base) + SUBJID2, SUBJID3 β†’ Keep SUBJID, remove others

  • NAME_1 (numbered) + NAME (base) β†’ Keep NAME, remove NAME_1

  • ID3 (numbered) + ID (base) β†’ Keep ID, remove ID3
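
A minimal sketch of the suffix-based check (clean_duplicate_columns() in extract_data.py is the real implementation):

# Illustrative sketch only; clean_duplicate_columns() in extract_data.py is the
# real implementation.
import re
import pandas as pd

SUFFIX_RE = re.compile(r"^(.+?)_?(\d+)$")

def clean_duplicate_columns(df: pd.DataFrame) -> pd.DataFrame:
    keep = []
    for col in df.columns:
        match = SUFFIX_RE.match(str(col))
        if match and match.group(1) in df.columns:
            continue                      # e.g. SUBJID2 dropped when SUBJID exists
        keep.append(col)
    return df[keep]

df = pd.DataFrame(columns=["SUBJID", "SUBJID2", "NAME", "NAME_1", "VISIT"])
print(list(clean_duplicate_columns(df).columns))  # ['SUBJID', 'NAME', 'VISIT']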

β€”

Algorithm 4: Cryptographic Pseudonymization

Located in: scripts/deidentify.py β†’ PseudonymGenerator.generate()

Purpose: Generate deterministic, unique pseudonyms for PHI/PII values

Algorithm:

Input: (value, phi_type, template)

1. Check cache: If (phi_type, value.lower()) already pseudonymized:
   - Return cached pseudonym (ensures consistency)

2. Generate deterministic ID:
   a. Create hash_input = "{salt}:{phi_type}:{value}"
   b. hash_digest = SHA256(hash_input)
   c. Take first 4 bytes of digest
   d. Encode as base32, strip padding, take first 6 chars
   e. Result: Alphanumeric ID (e.g., "A4B8C3")

3. Apply template:
   - Replace {id} placeholder with generated ID
   - Example: "PATIENT-{id}" β†’ "PATIENT-A4B8C3"

4. Cache and return pseudonym

Security Properties:

  • Deterministic: Same input always produces same output (required for data consistency)

  • One-way: Cannot reverse SHA256 without salt

  • Salt-dependent: Different salt produces different pseudonyms

  • Collision-resistant: SHA256 ensures uniqueness

Data Structure:

class PseudonymGenerator:
    salt: str  # Cryptographic salt (32 bytes hex)
    _cache: Dict[Tuple[PHIType, str], str]  # Memoization
    _counter: Dict[PHIType, int]  # Statistics
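
A minimal sketch of the hash → base32 → template scheme described above (the real PseudonymGenerator adds caching and per-type counters):

# Illustrative sketch only: the real PseudonymGenerator adds caching and
# per-type counters.
import base64
import hashlib
import secrets

def make_pseudonym(value: str, phi_type: str, salt: str,
                   template: str = "PATIENT-{id}") -> str:
    digest = hashlib.sha256(f"{salt}:{phi_type}:{value}".encode("utf-8")).digest()
    short_id = base64.b32encode(digest[:4]).decode("ascii").rstrip("=")[:6]
    return template.format(id=short_id)

salt = secrets.token_hex(32)                           # persisted alongside the mappings in practice
print(make_pseudonym("John Doe", "NAME_FULL", salt))
print(make_pseudonym("John Doe", "NAME_FULL", salt))   # identical: deterministic per salt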

β€”

Algorithm 5: Consistent Date Shifting (Country-Aware with Smart Validation)

Located in: scripts/deidentify.py β†’ DateShifter.shift_date()

Changed in version 0.6.0: Enhanced date parsing with smart validation to handle ambiguous dates correctly.

Purpose: Shift all dates by a consistent offset to preserve temporal relationships, using intelligent multi-format detection, country-specific format priority, and smart validation

The Ambiguity Challenge:

Dates like 08/09/2020 or 12/12/2012 can be interpreted differently:

  • US Format (MM/DD): 08/09/2020 = August 9, 2020

  • India Format (DD/MM): 08/09/2020 = September 8, 2020

  • Symmetric: 12/12/2012 = December 12, 2012 (same in both formats)

Solution: Country-based priority with smart validation for unambiguous dates.

Algorithm:

Input: date_string, country_code

1. Check cache: If date_string already shifted:
   - Return cached shifted date (O(1) lookup)

2. Generate consistent offset (first time only):
   a. hash_digest = SHA256(seed)
   b. offset_int = first 4 bytes as integer
   c. offset_days = (offset_int % (2 * range + 1)) - range
   d. Cache offset for all future shifts

3. Determine format priority based on country:
   DD/MM/YYYY priority (day first):
      Countries: IN, ID, BR, ZA, EU, GB, AU, KE, NG, GH, UG
      Formats to try:
         a. YYYY-MM-DD (ISO 8601) - always unambiguous
         b. DD/MM/YYYY (country preference)
         c. DD-MM-YYYY (hyphen variant)
         d. DD.MM.YYYY (European dot notation)

   MM/DD/YYYY priority (month first):
      Countries: US, PH, CA
      Formats to try:
         a. YYYY-MM-DD (ISO 8601) - always unambiguous
         b. MM/DD/YYYY (country preference)
         c. MM-DD-YYYY (hyphen variant)

4. For each format in priority order:
   a. Try parsing date_string with format
   b. If parse successful:

      SMART VALIDATION (for ambiguous slash/hyphen formats):

      If format is MM/DD or DD/MM:
         - Extract first_num and second_num from date string

         CASE 1: first_num > 12
            β†’ Must be day (no 13th month exists!)
            β†’ If trying MM/DD format: REJECT, try next format
            β†’ If trying DD/MM format: ACCEPT

         CASE 2: second_num > 12
            β†’ Must be day (no 13th month exists!)
            β†’ If trying DD/MM format: REJECT, try next format
            β†’ If trying MM/DD format: ACCEPT

         CASE 3: Both numbers ≀ 12 (ambiguous)
            β†’ Trust country preference (first matching format wins)
            β†’ Ensures consistency within dataset

      - If validation passes: BREAK, use this format
      - If validation fails: Continue to next format

   c. If parse failed: Continue to next format

5. If NO format succeeded:
   - Return placeholder: [DATE-HASH]
   - Log warning for manual review

6. Apply shift with successful format:
   a. shifted_date = parsed_date + timedelta(days=offset_days)
   b. Format back to string in SAME format as input
   c. Cache result: {date_string: shifted_string}

7. Return shifted date string

Properties:

  • Unambiguous dates always correct: 13/05/2020 parsed as May 13 (only valid interpretation)

  • Ambiguous dates use country preference: 08/09/2020 interpreted consistently per country

  • Symmetric dates handled: 12/12/2012 uses country format (though result is same)

  • Consistent: All dates shifted by SAME offset (preserves intervals)

  • Deterministic: Seed determines offset (reproducible)

  • Format-preserving: Output format matches input format

  • HIPAA-compliant: Dates obscured while relationships preserved

Example Flow - Unambiguous Date:

# Processing "13/05/2020" for India (DD/MM preference)

Step 1: Try YYYY-MM-DD
   Result: ❌ Doesn't match pattern

Step 2: Try DD/MM/YYYY (India preference)
   Parse: βœ… Day=13, Month=05 (May 13, 2020)
   Validate: first_num=13 > 12 βœ… Must be day (valid)
   Result: βœ… SUCCESS β†’ May 13, 2020

# Processing "13/05/2020" for USA (MM/DD preference)

Step 1: Try YYYY-MM-DD
   Result: ❌ Doesn't match pattern

Step 2: Try MM/DD/YYYY (USA preference)
   Parse: ❌ Month=13 invalid (strptime fails)
   Result: Continue to next format

Step 3: Try DD/MM/YYYY (fallback)
   Parse: βœ… Day=13, Month=05
   Validate: first_num=13 > 12 βœ… Must be day (valid)
   Result: βœ… SUCCESS β†’ May 13, 2020

Example Flow - Ambiguous Date:

# Processing "08/09/2020" for India (DD/MM preference)

Step 1: Try YYYY-MM-DD
   Result: ❌ Doesn't match pattern

Step 2: Try DD/MM/YYYY (India preference)
   Parse: βœ… Day=08, Month=09 (Sep 8, 2020)
   Validate: first_num=8 ≀ 12 βœ…, second_num=9 ≀ 12 βœ…
             Both ambiguous β†’ Trust country preference
   Result: βœ… SUCCESS β†’ Sep 8, 2020 (India interpretation)

# Processing "08/09/2020" for USA (MM/DD preference)

Step 1: Try YYYY-MM-DD
   Result: ❌ Doesn't match pattern

Step 2: Try MM/DD/YYYY (USA preference)
   Parse: βœ… Month=08, Day=09 (Aug 9, 2020)
   Validate: first_num=8 ≀ 12 βœ…, second_num=9 ≀ 12 βœ…
             Both ambiguous β†’ Trust country preference
   Result: βœ… SUCCESS β†’ Aug 9, 2020 (USA interpretation)

Complete Example with Shifting:

# For India (DD/MM/YYYY format):
shifter_in = DateShifter(country_code="IN", seed="abc123")
"2014-09-04" β†’ "2013-12-14"  # ISO β†’ Always Sep 4 (unambiguous)
"04/09/2014" β†’ "14/12/2013"  # DD/MM β†’ Sep 4, 2014 β†’ Dec 14, 2013 (-265 days)
"09/09/2014" β†’ "19/12/2013"  # DD/MM β†’ Sep 9, 2014 β†’ Dec 19, 2013 (-265 days)
"13/05/2020" β†’ "03/08/2019"  # Unambiguous β†’ May 13 (only valid parsing)
# Interval preserved: All dates shifted by -265 days

# For United States (MM/DD/YYYY format):
shifter_us = DateShifter(country_code="US", seed="abc123")
"2014-09-04" β†’ "2013-12-14"  # ISO β†’ Always Sep 4 (unambiguous)
"04/09/2014" β†’ "07/17/2013"  # MM/DD β†’ Apr 9, 2014 β†’ Jul 17, 2013 (-265 days)
"04/14/2014" β†’ "07/22/2013"  # MM/DD β†’ Apr 14, 2014 β†’ Jul 22, 2013 (-265 days)
"13/05/2020" β†’ "08/03/2019"  # Unambiguous β†’ May 13 (validation rejects MM/DD)
# Interval preserved: All dates shifted by -265 days

Time Complexity:

  • Cache hit: O(1) - constant time lookup

  • Cache miss: O(f) where f = number of formats (typically 3-4)

  • Overall: O(1) amortized for repeated dates in large datasets
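
A compact sketch of the core logic above: country-aware format priority, smart validation, and a hash-derived consistent offset (illustrative; the real DateShifter also caches results and supports the DD-MM-YYYY and DD.MM.YYYY variants):

# Illustrative sketch only: the real DateShifter also caches results and
# supports additional separators (DD-MM-YYYY, DD.MM.YYYY).
import hashlib
import re
from datetime import datetime, timedelta

DAY_FIRST = {"IN", "ID", "BR", "ZA", "EU", "GB", "AU", "KE", "NG", "GH", "UG"}

def _offset_days(seed: str, day_range: int = 365) -> int:
    n = int.from_bytes(hashlib.sha256(seed.encode("utf-8")).digest()[:4], "big")
    return n % (2 * day_range + 1) - day_range

def shift_date(date_str: str, country: str, seed: str) -> str:
    formats = ["%Y-%m-%d"]                         # ISO first: always unambiguous
    formats += ["%d/%m/%Y", "%m/%d/%Y"] if country in DAY_FIRST else ["%m/%d/%Y", "%d/%m/%Y"]
    nums = re.findall(r"\d+", date_str)
    for fmt in formats:
        try:
            parsed = datetime.strptime(date_str, fmt)
        except ValueError:
            continue
        # Smart validation (mirrors the rules above; strptime already rejects
        # impossible months, so these checks are explicit restatements).
        if fmt == "%m/%d/%Y" and int(nums[0]) > 12:
            continue
        if fmt == "%d/%m/%Y" and int(nums[1]) > 12:
            continue
        shifted = parsed + timedelta(days=_offset_days(seed))
        return shifted.strftime(fmt)               # format-preserving output
    return "[DATE-HASH]"                           # unparseable: placeholder

print(shift_date("13/05/2020", "US", "abc123"))    # May 13, shifted by the seed's offset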

β€”

Data Structure: Mapping Store (Encrypted)

Located in: scripts/deidentify.py β†’ MappingStore

Purpose: Securely store original β†’ pseudonym mappings

Structure:

# In-memory structure
mappings: Dict[str, Dict[str, Any]] = {
    "PHI_TYPE:original_value": {
        "original": "John Doe",  # Original sensitive value
        "pseudonym": "PATIENT-A4B8C3",  # Generated pseudonym
        "phi_type": "NAME_FULL",  # Type of PHI
        "created_at": "2025-10-13T14:32:15",  # Timestamp
        "metadata": {"pattern": "Full name pattern"}
    },
    ...
}

# On-disk structure (encrypted with Fernet)
File: mappings.enc
Content: Fernet.encrypt(JSON.dumps(mappings))

Encryption: Fernet (symmetric encryption, 128-bit AES in CBC mode with HMAC)

Security:

  • Encryption key stored separately

  • Keys never committed to version control

  • Audit log exports WITHOUT original values by default
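
A minimal sketch of the encrypt/decrypt round trip with Fernet (illustrative; the real MappingStore also manages key storage and incremental updates):

# Illustrative sketch only: the real MappingStore also manages key storage and
# incremental updates.
import json
from pathlib import Path
from cryptography.fernet import Fernet

def save_mappings(mappings: dict, path: Path, key: bytes) -> None:
    token = Fernet(key).encrypt(json.dumps(mappings).encode("utf-8"))
    path.write_bytes(token)

def load_mappings(path: Path, key: bytes) -> dict:
    return json.loads(Fernet(key).decrypt(path.read_bytes()).decode("utf-8"))

key = Fernet.generate_key()                        # stored separately from mappings.enc
mappings = {"NAME_FULL:john doe": {"pseudonym": "PATIENT-A4B8C3"}}
save_mappings(mappings, Path("mappings.enc"), key)
print(load_mappings(Path("mappings.enc"), key))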

β€”

Data Structure: JSONL File Format

Structure:

Each line is a valid JSON object (one record per line):

{"SUBJID": "INV001", "VISIT": 1, "TST_RESULT": "Positive", "source_file": "10_TST.xlsx"}
{"SUBJID": "INV002", "VISIT": 1, "TST_RESULT": "Negative", "source_file": "10_TST.xlsx"}
{"SUBJID": "INV003", "VISIT": 1, "TST_RESULT": "Positive", "source_file": "10_TST.xlsx"}

Advantages:

  • Streamable: Can process without loading entire file into memory

  • Line-oriented: Easy to split, merge, or process in parallel

  • JSON-compatible: Works with standard JSON parsers

  • Human-readable: Can inspect with head, tail, grep

Metadata Fields:

  • source_file: Original Excel filename for traceability

  • _metadata: Optional metadata (e.g., for empty files with structure)
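
A minimal sketch of streaming a JSONL file one record at a time (the path shown is hypothetical):

# Illustrative sketch only; the path shown is hypothetical.
import json
from pathlib import Path
from typing import Dict, Iterator

def iter_jsonl(path: Path) -> Iterator[dict]:
    with path.open("r", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():                       # skip blank lines
                yield json.loads(line)

counts: Dict[str, int] = {}
for record in iter_jsonl(Path("results/dataset/example/cleaned/10_TST.jsonl")):
    src = record.get("source_file", "unknown")
    counts[src] = counts.get(src, 0) + 1           # count records per source file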

β€”

Data Structure: Progress Tracking with tqdm

Integration Pattern:

from tqdm import tqdm
import sys

# File-level progress
for file in tqdm(files, desc="Processing", unit="file",
                 file=sys.stdout, dynamic_ncols=True, leave=True):
    # Use tqdm.write() instead of print() for clean output
    tqdm.write(f"Processing: {file.name}")

    # Row-level progress (if needed)
    for row in tqdm(rows, desc="Rows", leave=False):
        process(row)

# Result: Clean progress bars without interfering with logging

Why tqdm.write():

  • Ensures messages don’t corrupt progress bar display

  • Automatically repositions progress bar after message

  • Works with logging system

Dependencies and Their Roles

pandas (>= 2.0.0)

  • Role: DataFrame manipulation, Excel reading, data analysis

  • Key functions used:

    • pd.read_excel(): Excel file parsing

    • df.to_json(): JSONL export

    • pd.isna(): Null value detection

  • Why chosen: Industry standard for data manipulation in Python

openpyxl (>= 3.1.0)

  • Role: Excel file format (.xlsx) support for pandas

  • Used by: pd.read_excel(engine='openpyxl')

  • Why chosen: Pure Python, no external dependencies, handles modern Excel formats

numpy (>= 1.24.0)

  • Role: Numerical operations, type handling

  • Key types used:

    • np.int64, np.float64: Numeric types from pandas

    • np.datetime64: Datetime types

    • np.nan: Missing value representation

  • Why chosen: Required by pandas, efficient numerical operations

tqdm (>= 4.66.0)

  • Role: Progress bars and user feedback

  • Key features:

    • Real-time progress tracking

    • ETA calculations

    • Clean console output with tqdm.write()

  • Why chosen: Most popular Python progress bar library, excellent integration

cryptography (>= 41.0.0)

  • Role: Encryption for de-identification mappings

  • Key components:

    • Fernet: Symmetric encryption

    • hashlib.sha256(): Cryptographic hashing

    • secrets: Secure random number generation

  • Why chosen: Industry-standard cryptography library, HIPAA-compliant algorithms

sphinx (>= 7.0.0) + extensions

  • Role: Documentation generation

  • Extensions used:

    • sphinx.ext.autodoc: Automatic API documentation from docstrings

    • sphinx.ext.napoleon: Google/NumPy style docstring support

    • sphinx_autodoc_typehints: Type hint documentation

    • sphinx-autobuild: Live documentation preview (dev dependency)

  • Why chosen: Standard for Python project documentation

Documentation Workflow:

Added in version 0.3.0: make docs-watch for automatic documentation rebuilding.

  • Autodoc is ENABLED: Sphinx automatically extracts documentation from Python docstrings

  • NOT Automatic by Default: Documentation does NOT rebuild automatically on every code change

  • Manual Build: Run make docs to regenerate documentation after changes

  • Auto-Rebuild (Development): Use make docs-watch for live preview during documentation development

How Autodoc Works:

  1. Write Google-style docstrings in Python code

  2. Use .. automodule:: directives in .rst files

  3. Run make docs - Sphinx extracts docstrings and generates HTML

  4. Or use make docs-watch - Server auto-rebuilds on file changes

Important: While autodoc extracts documentation automatically from code, you must build the documentation manually (or use watch mode) to see the changes.