Architecture ============ **For Developers: Comprehensive Technical Documentation** This document provides in-depth technical details about RePORTaLiN's architecture, internal algorithms, data structures, dependencies, design patterns, and extension points to enable effective maintenance, debugging, and feature development. **Last Updated:** October 23, 2025 **Current Version:** |version| **Code Optimization:** 35% reduction (640 lines) while maintaining 100% functionality **Historical Enhancements:** Enhanced modules from v0.0.1 through v0.0.12 System Overview --------------- RePORTaLiN is designed as a modular, pipeline-based system for processing sensitive medical research data from Excel to JSONL format with optional PHI/PII de-identification. The architecture emphasizes: - **Simplicity**: Single entry point (``main.py``) with clear pipeline stages - **Modularity**: Clear separation of concerns with dedicated modules - **Robustness**: Comprehensive error handling with graceful degradation - **Transparency**: Detailed logging at every step with audit trails - **Security**: Encryption-by-default for de-identification mappings - **Compliance**: Multi-country privacy regulation support (14 countries) - **Performance**: Optimized for high throughput (benchmarks pending) Architecture Diagram -------------------- .. code-block:: text ┌─────────────────────────────────────────────────────────────┐ │ main.py │ │ (Pipeline Orchestrator) │ │ │ │ • Command-line argument parsing │ │ • Step execution coordination │ │ • High-level error handling │ └───────────────────┬─────────────────────────────────────────┘ │ ┌──────────┴───────────┐ │ │ ▼ ▼ ┌────────────────┐ ┌──────────────────┐ │ load_dict.py │ │ extract_data.py │ │ │ │ │ │ • Sheet split │ │ • File discovery │ │ • Table detect │ │ • Type convert │ │ • Duplicate │ │ • JSONL export │ │ handling │ │ • Progress track │ └────────┬───────┘ └────────┬─────────┘ │ │ └──────────┬───────────┘ │ ▼ ┌────────────────┐ │ config.py │ │ │ │ • Paths │ │ • Settings │ │ • Dynamic │ │ detection │ └────────┬───────┘ │ ▼ ┌────────────────┐ │ logging │ │ │ │ • Log files │ │ • Console out │ │ • SUCCESS lvl │ └────────────────┘ Core Components --------------- 1. main.py - Pipeline Orchestrator ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Purpose**: Central entry point and workflow coordinator **Responsibilities**: - Parse command-line arguments - Execute pipeline steps in sequence - Handle top-level errors - Coordinate logging **Key Functions**: - ``main()``: Entry point, orchestrates full pipeline - ``run_step()``: Wrapper for step execution with error handling **Design Pattern**: Command pattern with error handling decorator 2. config.py - Configuration Management ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Purpose**: Centralized configuration **Responsibilities**: - Define all file paths - Manage settings - Dynamic dataset detection - Path resolution **Key Features**: - Automatic dataset folder detection - Relative path resolution - Environment-agnostic configuration - Type hints for IDE support **Design Pattern**: Module-level singleton 3. scripts/extract_data.py - Data Extraction ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Purpose**: Convert Excel files to JSONL **Responsibilities**: - File discovery and validation - Excel reading and parsing - Data type conversion - JSONL serialization - Progress tracking with tqdm **Key Functions**: - ``extract_excel_to_jsonl()``: Batch processing with progress bars - ``process_excel_file()``: Single file processing - ``convert_dataframe_to_jsonl()``: DataFrame conversion - ``clean_record_for_json()``: Type conversion - ``is_dataframe_empty()``: Empty detection - ``find_excel_files()``: File discovery **Progress Tracking**: - Uses tqdm for all file and row processing - Status messages via tqdm.write() for clean output - Summary statistics after completion **Design Pattern**: Pipeline pattern with functional composition 4. scripts/load_dictionary.py - Dictionary Processing ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Purpose**: Process data dictionary Excel file **Responsibilities**: - Sheet processing with progress tracking - Table detection and splitting - Duplicate column handling - Table serialization **Key Functions**: - ``load_study_dictionary()``: High-level API with tqdm progress bars - ``process_excel_file()``: Sheet processing - ``_split_sheet_into_tables()``: Table detection - ``_process_and_save_tables()``: Table output - ``_deduplicate_columns()``: Column name handling **Progress Tracking**: - tqdm progress bars for sheet processing - tqdm.write() for status messages - Clean console output during processing **Design Pattern**: Functional composition with table detection algorithm **Design Pattern**: Strategy pattern for table detection 5. scripts/utils/logging.py - Logging System ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Purpose**: Centralized logging infrastructure **Responsibilities**: - Create timestamped log files - Dual output (console + file) - Custom SUCCESS log level - Structured logging with configurable verbosity **Key Features**: - Custom SUCCESS level (between INFO and WARNING) - Timestamped log files in ``.logs/`` directory - Console and file handlers with different filtering - UTF-8 encoding for international characters - Works alongside tqdm for clean progress bar output - **Verbose mode**: DEBUG-level logging via ``-v`` flag **Log Levels**: .. code-block:: python DEBUG (10) # Verbose mode only: file processing, patterns, details INFO (20) # Default: major steps, summaries SUCCESS (25) # Custom: successful completions WARNING (30) # Potential issues ERROR (40) # Failures CRITICAL (50) # Fatal errors **Console vs. File Output**: - **Console**: Only SUCCESS, ERROR, and CRITICAL (keeps terminal clean) - **File**: INFO or DEBUG (depending on ``--verbose`` flag) and above **Verbose Logging**: When ``--verbose`` or ``-v`` flag is used: - Log level set to DEBUG in ``main.py`` - VerboseLogger class provides tree-view, context-managed output - Integrated into all three pipeline steps (Dictionary Loading, Data Extraction, De-identification) - Additional details logged throughout pipeline: - File lists and processing order - Sheet/table detection details with row/column counts - Duplicate column detection and removal reasons with detailed explanations - Processing phases with timing information (per-file, per-sheet, per-table, overall) - PHI/PII pattern matches and replacement counts (de-identification step) - Record-level progress (every 1000 records for large files) - Validation results with issue tracking - Comprehensive timing for all operations **VerboseLogger Class**: The ``VerboseLogger`` class provides centralized verbose logging with tree-view formatting and context management: .. code-block:: python # In module from scripts.utils import logging as log vlog = log.get_verbose_logger() # Usage in code with vlog.file_processing("Data dictionary", total_records=43): vlog.metric("Total sheets", 43) for sheet_idx, sheet_name in enumerate(sheets, 1): with vlog.step(f"Sheet {sheet_idx}/{len(sheets)}: '{sheet_name}'"): vlog.metric("Tables detected", num_tables) vlog.detail("Processing table...") vlog.timing("Sheet processing time", elapsed) **Output Format**: .. code-block:: text DEBUG: ├─ Processing: Data extraction (65 files) DEBUG: ├─ Total files to process: 65 DEBUG: ├─ File 1/65: 10_TST.xlsx DEBUG: │ ├─ Loading Excel file DEBUG: │ │ ├─ Rows: 412 DEBUG: │ │ ├─ Columns: 28 DEBUG: │ ├─ Cleaning duplicate columns DEBUG: │ │ ├─ Marking SUBJID2 for removal (duplicate of SUBJID) DEBUG: │ │ ├─ Removed 3 duplicate columns: SUBJID2, AGE_2, PHONE_3 DEBUG: │ └─ ⏱ Total processing time: 0.45s DEBUG: └─ ⏱ Overall extraction time: 32.15s **Key Methods**: - ``file_processing(name, total_records)``: Wrap entire file processing - ``step(name)``: Wrap a processing step - ``detail(message)``: Log a detail line - ``metric(label, value)``: Log a metric/statistic - ``timing(operation, seconds)``: Log timing information - ``items_list(label, items, max_show)``: Log list of items with truncation **Integration Pattern**: All pipeline modules (``load_dictionary.py``, ``extract_data.py``, ``deidentify.py``) follow this pattern: 1. Import verbose logger at module level: .. code-block:: python from scripts.utils import logging as log vlog = log.get_verbose_logger() 2. Wrap main processing functions with context managers: .. code-block:: python with vlog.file_processing(filename): # Main processing for item in items: with vlog.step(f"Processing {item}"): # Item processing vlog.metric("Metric", value) vlog.timing("Operation", elapsed_time) 3. Always log both to file (verbose) and console (normal): .. code-block:: python log.debug(...) # File only in verbose mode log.info(...) # File always log.success(...) # Console + file vlog.detail(...) # File only in verbose mode (via context) **Usage**: .. code-block:: python from scripts.utils import logging as log # Standard (INFO level) python main.py # Verbose (DEBUG level) with tree-view logs python main.py -v # Verbose with de-identification python main.py -v --enable-deidentification # In code log.debug("Detailed processing info") # Only in verbose mode log.info("Major step completed") # Always logged to file log.success("Pipeline completed") # Console + file vlog.detail("Context-managed detail") # Only in verbose mode **Design Pattern**: Singleton logger instance with configurable formatting; VerboseLogger provides tree-view abstraction over standard logging 6. scripts/deidentify.py - De-identification Engine ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Purpose**: Remove PHI/PII from text data with pseudonymization **Responsibilities**: - Detect PHI/PII using regex patterns - Generate consistent pseudonyms - Encrypt and store mappings - Validate de-identified output - Support country-specific regulations - Progress tracking for large datasets **Key Classes**: - ``DeidentificationEngine``: Main orchestrator - ``PseudonymGenerator``: Creates deterministic placeholders - ``MappingStore``: Secure encrypted storage - ``DateShifter``: Consistent date shifting - ``PatternLibrary``: Detection patterns **Progress Tracking**: - tqdm progress bars for processing batches - tqdm.write() for status messages during processing - Summary statistics upon completion **Design Pattern**: Strategy pattern for detection, Builder pattern for configuration 7. scripts/utils/country_regulations.py - Country-Specific Regulations ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Purpose**: Manage country-specific data privacy regulations **Responsibilities**: - Define country-specific data fields - Provide detection patterns for local identifiers - Document regulatory requirements - Support multiple jurisdictions simultaneously **Key Classes**: - ``CountryRegulationManager``: Orchestrates regulations - ``CountryRegulation``: Single country configuration - ``DataField``: Field definition with validation - ``PrivacyLevel`` / ``DataFieldType``: Enumerations **Supported Countries**: US, EU, GB, CA, AU, IN, ID, BR, PH, ZA, KE, NG, GH, UG **Design Pattern**: Registry pattern for country lookup, Factory pattern for regulation creation Data Flow --------- Step-by-Step Data Flow: .. code-block:: text 1. User invokes: python main.py │ ▼ 2. main.py initializes logging │ ▼ 3. Step 0: load_study_dictionary() │ ┌─────────────┴──────────────┐ │ │ ▼ ▼ Read Excel Split sheets into tables Dictionary │ ▼ Deduplicate columns │ ▼ Save as JSONL in: results/data_dictionary_mappings/ │ ▼ 4. Step 1: extract_excel_to_jsonl() │ ┌─────────────┴──────────────┐ │ │ ▼ ▼ Find Excel files Process each file in dataset/ │ ┌───────────┴────────────┐ │ │ ▼ ▼ Read Excel sheets Convert data types │ │ ▼ ▼ Clean records Handle NaN/dates │ │ └───────────┬────────────┘ │ ▼ Save as JSONL in: results/dataset// ├── original/ (all columns) └── cleaned/ (duplicates removed) │ ▼ 5. Step 2: deidentify_dataset() [OPTIONAL] │ ┌─────────────┴──────────────┐ │ │ ▼ ▼ Recursively find Process each file JSONL files │ in subdirs ┌───────────┴────────────┐ │ │ ▼ ▼ Detect PHI/PII Generate pseudonyms │ │ ▼ ▼ Replace sensitive Maintain mappings data │ └───────────┬────────────┘ │ ▼ Save de-identified in: results/deidentified// ├── original/ (de-identified) ├── cleaned/ (de-identified) └── _deidentification_audit.json Store encrypted mappings: results/deidentified/mappings/ └── mappings.enc Data Flow Architecture ---------------------- The system processes data through three primary pipelines: **Pipeline 1: Data Dictionary Processing** .. code-block:: text Excel File (Dictionary) → pd.read_excel() → Table Detection → Split Tables → Column Deduplication → "Ignore Below" Filter → JSONL Export (per table) Algorithm: Two-Phase Table Detection 1. Horizontal Split: Identify empty rows as boundaries 2. Vertical Split: Within horizontal strips, identify empty columns 3. Result: NxM tables from single sheet **Pipeline 2: Data Extraction** .. code-block:: text Excel Files (Dataset) → find_excel_files() → pd.read_excel() → Type Conversion → Duplicate Column Removal → JSONL Export → File Integrity Check → Statistics Collection Outputs: Two versions (original/, cleaned/) for validation **Pipeline 3: De-identification** *(Optional)* .. code-block:: text JSONL Files → Pattern Matching (Regex + Country-Specific) → PHI/PII Detection → Pseudonym Generation (Cryptographic Hash) → Mapping Storage (Encrypted) → Date Shifting (Consistent Offset) → Validation → Encrypted JSONL Output + Audit Log Security: Fernet encryption, deterministic pseudonyms, audit trails Design Decisions ---------------- 1. JSONL Format ~~~~~~~~~~~~~~~ **Rationale**: - Line-oriented: Each record is independent - Streaming friendly: Can process files line-by-line - Easy to merge: Just concatenate files - Human-readable: Each line is valid JSON - Standard format: Wide tool support **Alternative Considered**: CSV **Rejected Because**: CSV doesn't handle nested structures well 2. Automatic Table Detection ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Rationale**: - Excel sheets often contain multiple logical tables - Empty rows/columns serve as natural separators - Preserves semantic structure of data **Algorithm**: 1. Find maximum consecutive empty rows/columns 2. Split at these boundaries 3. Handle special "Ignore below" markers 3. Dynamic Dataset Detection ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Rationale**: - Avoid hardcoding dataset names - Enable working with multiple datasets - Reduce configuration burden **Implementation**: Scan ``data/dataset/`` for first subdirectory 4. Progress Tracking ~~~~~~~~~~~~~~~~~~~~ **Rationale**: - Long-running operations need real-time feedback - Users want to know progress and time remaining - Helps identify slow operations - Clean console output is essential **Implementation**: - **tqdm** library for all progress bars (required dependency) - **tqdm.write()** for status messages during progress tracking - Consistent usage across all processing modules: - ``extract_data.py``: File and row processing - ``load_dictionary.py``: Sheet processing - ``deidentify.py``: Batch de-identification **Design Decision**: tqdm is a required dependency, not optional, ensuring consistent user experience 5. Centralized Configuration ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Rationale**: - Single source of truth - Easy to modify paths - Reduces coupling - Testability **Alternative Considered**: Environment variables **Rejected Because**: More complex for non-technical users Algorithms and Data Structures ------------------------------- **Algorithm 1: Two-Phase Table Detection** Located in: ``scripts/load_dictionary.py`` → ``_split_sheet_into_tables()`` **Purpose:** Intelligently split Excel sheets containing multiple logical tables into separate tables **Algorithm:** .. code-block:: text Phase 1: Horizontal Splitting 1. Identify rows where ALL cells are null/empty 2. Use these rows as boundaries to split sheet into horizontal strips 3. Each strip potentially contains one or more tables side-by-side Phase 2: Vertical Splitting (within each horizontal strip) 1. Identify columns where ALL cells are null/empty 2. Use these columns as boundaries to split strip into tables 3. Remove completely empty tables 4. Drop rows that are entirely null Result: NxM independent tables from single sheet **Data Structures:** .. code-block:: python # Input: Raw DataFrame (no assumptions about structure) df: pd.DataFrame # header=None, all data preserved # Intermediate: List of horizontal strips horizontal_strips: List[pd.DataFrame] # Output: List of independent tables all_tables: List[pd.DataFrame] **Edge Cases Handled:** - Empty rows between tables (common in medical research data dictionaries) - Empty columns between tables (side-by-side table layouts) - Tables with no data rows (only headers) - preserved with metadata - "ignore below" markers - subsequent tables saved to separate directory - Duplicate column names - automatically suffixed with "_1", "_2", etc. **Complexity:** O(r × c) where r = rows, c = columns --- **Algorithm 2: JSON Type Conversion** Located in: ``scripts/extract_data.py`` → ``clean_record_for_json()`` **Purpose:** Convert pandas/numpy types to JSON-serializable Python types **Algorithm:** .. code-block:: text For each key-value pair in record: 1. If value is pd.isna(value) → None (JSON null) 2. If value is np.integer or np.floating → call .item() to get Python int/float 3. If value is pd.Timestamp, np.datetime64, datetime, date → convert to string 4. Otherwise → keep as-is Return cleaned dictionary **Type Mappings:** ================== ====================== ==================== Pandas/Numpy Type Python Type JSON Type ================== ====================== ==================== pd.NA, np.nan None null np.int64 int number np.float64 float number pd.Timestamp str string (ISO format) datetime str string ================== ====================== ==================== **Edge Cases:** - Mixed-type columns → handled by pandas during read_excel() - Unicode characters → preserved with ensure_ascii=False - Large integers → may lose precision if > 2^53 (JSON limitation) --- **Algorithm 3: Duplicate Column Detection and Removal** Located in: ``scripts/extract_data.py`` → ``clean_duplicate_columns()`` **Purpose:** Remove duplicate columns with numeric suffixes (e.g., SUBJID2, SUBJID3) **Algorithm:** .. code-block:: text For each column in DataFrame: 1. Match pattern: column_name ends with "_?" followed by digits 2. Extract base_name (everything before the suffix) 3. If base_name exists as a column: - Mark current column for removal (it's a duplicate) - Keep the base column 4. Otherwise: - Keep the column Return DataFrame with only non-duplicate columns **Regex Pattern:** ``^(.+?)_?(\d+)$`` **Examples:** - ``SUBJID`` (base) + ``SUBJID2``, ``SUBJID3`` → Keep ``SUBJID``, remove others - ``NAME_1`` (numbered) + ``NAME`` (base) → Keep ``NAME``, remove ``NAME_1`` - ``ID3`` (numbered) + ``ID`` (base) → Keep ``ID``, remove ``ID3`` --- **Algorithm 4: Cryptographic Pseudonymization** Located in: ``scripts/deidentify.py`` → ``PseudonymGenerator.generate()`` **Purpose:** Generate deterministic, unique pseudonyms for PHI/PII values **Algorithm:** .. code-block:: text Input: (value, phi_type, template) 1. Check cache: If (phi_type, value.lower()) already pseudonymized: - Return cached pseudonym (ensures consistency) 2. Generate deterministic ID: a. Create hash_input = "{salt}:{phi_type}:{value}" b. hash_digest = SHA256(hash_input) c. Take first 4 bytes of digest d. Encode as base32, strip padding, take first 6 chars e. Result: Alphanumeric ID (e.g., "A4B8C3") 3. Apply template: - Replace {id} placeholder with generated ID - Example: "PATIENT-{id}" → "PATIENT-A4B8C3" 4. Cache and return pseudonym **Security Properties:** - **Deterministic:** Same input always produces same output (required for data consistency) - **One-way:** Cannot reverse SHA256 without salt - **Salt-dependent:** Different salt produces different pseudonyms - **Collision-resistant:** SHA256 ensures uniqueness **Data Structure:** .. code-block:: python class PseudonymGenerator: salt: str # Cryptographic salt (32 bytes hex) _cache: Dict[Tuple[PHIType, str], str] # Memoization _counter: Dict[PHIType, int] # Statistics --- **Algorithm 5: Consistent Date Shifting (Country-Aware with Smart Validation)** Located in: ``scripts/deidentify.py`` → ``DateShifter.shift_date()`` .. versionchanged:: 0.6.0 Enhanced date parsing with smart validation to handle ambiguous dates correctly. **Purpose:** Shift all dates by consistent offset to preserve temporal relationships, with intelligent multi-format detection, country-specific priority, and smart validation **The Ambiguity Challenge:** Dates like ``08/09/2020`` or ``12/12/2012`` can be interpreted differently: - **US Format (MM/DD):** 08/09/2020 = August 9, 2020 - **India Format (DD/MM):** 08/09/2020 = September 8, 2020 - **Symmetric:** 12/12/2012 = December 12, 2012 (same in both formats) **Solution:** Country-based priority with smart validation for unambiguous dates. **Algorithm:** .. code-block:: text Input: date_string, country_code 1. Check cache: If date_string already shifted: - Return cached shifted date (O(1) lookup) 2. Generate consistent offset (first time only): a. hash_digest = SHA256(seed) b. offset_int = first 4 bytes as integer c. offset_days = (offset_int % (2 * range + 1)) - range d. Cache offset for all future shifts 3. Determine format priority based on country: DD/MM/YYYY priority (day first): Countries: IN, ID, BR, ZA, EU, GB, AU, KE, NG, GH, UG Formats to try: a. YYYY-MM-DD (ISO 8601) - always unambiguous b. DD/MM/YYYY (country preference) c. DD-MM-YYYY (hyphen variant) d. DD.MM.YYYY (European dot notation) MM/DD/YYYY priority (month first): Countries: US, PH, CA Formats to try: a. YYYY-MM-DD (ISO 8601) - always unambiguous b. MM/DD/YYYY (country preference) c. MM-DD-YYYY (hyphen variant) 4. For each format in priority order: a. Try parsing date_string with format b. If parse successful: SMART VALIDATION (for ambiguous slash/hyphen formats): If format is MM/DD or DD/MM: - Extract first_num and second_num from date string CASE 1: first_num > 12 → Must be day (no 13th month exists!) → If trying MM/DD format: REJECT, try next format → If trying DD/MM format: ACCEPT CASE 2: second_num > 12 → Must be day (no 13th month exists!) → If trying DD/MM format: REJECT, try next format → If trying MM/DD format: ACCEPT CASE 3: Both numbers ≤ 12 (ambiguous) → Trust country preference (first matching format wins) → Ensures consistency within dataset c. If validation passes: BREAK, use this format d. If validation fails: Continue to next format c. If parse failed: Continue to next format 5. If NO format succeeded: - Return placeholder: [DATE-HASH] - Log warning for manual review 6. Apply shift with successful format: a. shifted_date = parsed_date + timedelta(days=offset_days) b. Format back to string in SAME format as input c. Cache result: {date_string: shifted_string} 7. Return shifted date string **Properties:** - **Unambiguous dates always correct:** ``13/05/2020`` parsed as May 13 (only valid interpretation) - **Ambiguous dates use country preference:** ``08/09/2020`` interpreted consistently per country - **Symmetric dates handled:** ``12/12/2012`` uses country format (though result is same) - **Consistent:** All dates shifted by SAME offset (preserves intervals) - **Deterministic:** Seed determines offset (reproducible) - **Format-preserving:** Output format matches input format - **HIPAA-compliant:** Dates obscured while relationships preserved **Example Flow - Unambiguous Date:** .. code-block:: python # Processing "13/05/2020" for India (DD/MM preference) Step 1: Try YYYY-MM-DD Result: ❌ Doesn't match pattern Step 2: Try DD/MM/YYYY (India preference) Parse: ✅ Day=13, Month=05 (May 13, 2020) Validate: first_num=13 > 12 ✅ Must be day (valid) Result: ✅ SUCCESS → May 13, 2020 # Processing "13/05/2020" for USA (MM/DD preference) Step 1: Try YYYY-MM-DD Result: ❌ Doesn't match pattern Step 2: Try MM/DD/YYYY (USA preference) Parse: ❌ Month=13 invalid (strptime fails) Result: Continue to next format Step 3: Try DD/MM/YYYY (fallback) Parse: ✅ Day=13, Month=05 Validate: first_num=13 > 12 ✅ Must be day (valid) Result: ✅ SUCCESS → May 13, 2020 **Example Flow - Ambiguous Date:** .. code-block:: python # Processing "08/09/2020" for India (DD/MM preference) Step 1: Try YYYY-MM-DD Result: ❌ Doesn't match pattern Step 2: Try DD/MM/YYYY (India preference) Parse: ✅ Day=08, Month=09 (Sep 8, 2020) Validate: first_num=8 ≤ 12 ✅, second_num=9 ≤ 12 ✅ Both ambiguous → Trust country preference Result: ✅ SUCCESS → Sep 8, 2020 (India interpretation) # Processing "08/09/2020" for USA (MM/DD preference) Step 1: Try YYYY-MM-DD Result: ❌ Doesn't match pattern Step 2: Try MM/DD/YYYY (USA preference) Parse: ✅ Month=08, Day=09 (Aug 9, 2020) Validate: first_num=8 ≤ 12 ✅, second_num=9 ≤ 12 ✅ Both ambiguous → Trust country preference Result: ✅ SUCCESS → Aug 9, 2020 (USA interpretation) **Complete Example with Shifting:** .. code-block:: python # For India (DD/MM/YYYY format): shifter_in = DateShifter(country_code="IN", seed="abc123") "2014-09-04" → "2013-12-14" # ISO → Always Sep 4 (unambiguous) "04/09/2014" → "14/12/2013" # DD/MM → Sep 4, 2014 → Dec 14, 2013 (-265 days) "09/09/2014" → "19/12/2013" # DD/MM → Sep 9, 2014 → Dec 19, 2013 (-265 days) "13/05/2020" → "03/08/2019" # Unambiguous → May 13 (only valid parsing) # Interval preserved: All dates shifted by -265 days # For United States (MM/DD/YYYY format): shifter_us = DateShifter(country_code="US", seed="abc123") "2014-09-04" → "2013-12-14" # ISO → Always Sep 4 (unambiguous) "04/09/2014" → "07/17/2013" # MM/DD → Apr 9, 2014 → Jul 17, 2013 (-265 days) "04/14/2014" → "07/22/2013" # MM/DD → Apr 14, 2014 → Jul 22, 2013 (-265 days) "13/05/2020" → "08/03/2019" # Unambiguous → May 13 (validation rejects MM/DD) # Interval preserved: All dates shifted by -265 days **Time Complexity:** - **Cache hit:** O(1) - constant time lookup - **Cache miss:** O(f) where f = number of formats (typically 3-4) - **Overall:** O(1) amortized for repeated dates in large datasets --- **Data Structure: Mapping Store (Encrypted)** Located in: ``scripts/deidentify.py`` → ``MappingStore`` **Purpose:** Securely store original → pseudonym mappings **Structure:** .. code-block:: python # In-memory structure mappings: Dict[str, Dict[str, Any]] = { "PHI_TYPE:original_value": { "original": "John Doe", # Original sensitive value "pseudonym": "PATIENT-A4B8C3", # Generated pseudonym "phi_type": "NAME_FULL", # Type of PHI "created_at": "2025-10-13T14:32:15", # Timestamp "metadata": {"pattern": "Full name pattern"} }, ... } # On-disk structure (encrypted with Fernet) File: mappings.enc Content: Fernet.encrypt(JSON.dumps(mappings)) **Encryption:** Fernet (symmetric encryption, 128-bit AES in CBC mode with HMAC) **Security:** - Encryption key stored separately - Keys never committed to version control - Audit log exports WITHOUT original values by default --- **Data Structure: JSONL File Format** **Structure:** Each line is a valid JSON object (one record per line): .. code-block:: json {"SUBJID": "INV001", "VISIT": 1, "TST_RESULT": "Positive", "source_file": "10_TST.xlsx"} {"SUBJID": "INV002", "VISIT": 1, "TST_RESULT": "Negative", "source_file": "10_TST.xlsx"} {"SUBJID": "INV003", "VISIT": 1, "TST_RESULT": "Positive", "source_file": "10_TST.xlsx"} **Advantages:** - Streamable: Can process without loading entire file into memory - Line-oriented: Easy to split, merge, or process in parallel - JSON-compatible: Works with standard JSON parsers - Human-readable: Can inspect with `head`, `tail`, `grep` **Metadata Fields:** - ``source_file``: Original Excel filename for traceability - ``_metadata``: Optional metadata (e.g., for empty files with structure) --- **Data Structure: Progress Tracking with tqdm** **Integration Pattern:** .. code-block:: python from tqdm import tqdm import sys # File-level progress for file in tqdm(files, desc="Processing", unit="file", file=sys.stdout, dynamic_ncols=True, leave=True): # Use tqdm.write() instead of print() for clean output tqdm.write(f"Processing: {file.name}") # Row-level progress (if needed) for row in tqdm(rows, desc="Rows", leave=False): process(row) # Result: Clean progress bars without interfering with logging **Why tqdm.write():** - Ensures messages don't corrupt progress bar display - Automatically repositions progress bar after message - Works with logging system Dependencies and Their Roles ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **pandas (>= 2.0.0)** - Role: DataFrame manipulation, Excel reading, data analysis - Key functions used: - ``pd.read_excel()``: Excel file parsing - ``df.to_json()``: JSONL export - ``pd.isna()``: Null value detection - Why chosen: Industry standard for data manipulation in Python **openpyxl (>= 3.1.0)** - Role: Excel file format (.xlsx) support for pandas - Used by: ``pd.read_excel(engine='openpyxl')`` - Why chosen: Pure Python, no external dependencies, handles modern Excel formats **numpy (>= 1.24.0)** - Role: Numerical operations, type handling - Key types used: - ``np.int64``, ``np.float64``: Numeric types from pandas - ``np.datetime64``: Datetime types - ``np.nan``: Missing value representation - Why chosen: Required by pandas, efficient numerical operations **tqdm (>= 4.66.0)** - Role: Progress bars and user feedback - Key features: - Real-time progress tracking - ETA calculations - Clean console output with ``tqdm.write()`` - Why chosen: Most popular Python progress bar library, excellent integration **cryptography (>= 41.0.0)** - Role: Encryption for de-identification mappings - Key components: - ``Fernet``: Symmetric encryption - ``hashlib.sha256()``: Cryptographic hashing - ``secrets``: Secure random number generation - Why chosen: Industry-standard cryptography library, HIPAA-compliant algorithms **sphinx (>= 7.0.0) + extensions** - Role: Documentation generation - Extensions used: - ``sphinx.ext.autodoc``: Automatic API documentation from docstrings - ``sphinx.ext.napoleon``: Google/NumPy style docstring support - ``sphinx_autodoc_typehints``: Type hint documentation - ``sphinx-autobuild``: Live documentation preview (dev dependency) - Why chosen: Standard for Python project documentation **Documentation Workflow**: .. versionadded:: 0.3.0 Added ``make docs-watch`` for automatic documentation rebuilding. - **Autodoc is ENABLED**: Sphinx automatically extracts documentation from Python docstrings - **NOT Automatic by Default**: Documentation does NOT rebuild automatically on every code change - **Manual Build**: Run ``make docs`` to regenerate documentation after changes - **Auto-Rebuild (Development)**: Use ``make docs-watch`` for live preview during documentation development **How Autodoc Works**: 1. Write Google-style docstrings in Python code 2. Use ``.. automodule::`` directives in ``.rst`` files 3. Run ``make docs`` - Sphinx extracts docstrings and generates HTML 4. Or use ``make docs-watch`` - Server auto-rebuilds on file changes **Important**: While autodoc **extracts** documentation automatically from code, you must **build** the documentation manually (or use watch mode) to see the changes.