Architecture
For Developers: Comprehensive Technical Documentation
This document provides in-depth technical details about RePORTaLiN's architecture, internal algorithms, data structures, dependencies, design patterns, and extension points to enable effective maintenance, debugging, and feature development.
Last Updated: October 23, 2025 Current Version: 0.8.6 Code Optimization: 35% reduction (640 lines) while maintaining 100% functionality Historical Enhancements: Enhanced modules from v0.0.1 through v0.0.12
System Overview
RePORTaLiN is designed as a modular, pipeline-based system for processing sensitive medical research data from Excel to JSONL format with optional PHI/PII de-identification. The architecture emphasizes:
Simplicity: Single entry point (main.py) with clear pipeline stages
Modularity: Clear separation of concerns with dedicated modules
Robustness: Comprehensive error handling with graceful degradation
Transparency: Detailed logging at every step with audit trails
Security: Encryption-by-default for de-identification mappings
Compliance: Multi-country privacy regulation support (14 countries)
Performance: Optimized for high throughput (benchmarks pending)
Architecture Diagram
┌────────────────────────────────────┐
│              main.py               │
│       (Pipeline Orchestrator)      │
│                                    │
│ • Command-line argument parsing    │
│ • Step execution coordination      │
│ • High-level error handling        │
└─────────────────┬──────────────────┘
                  │
        ┌─────────┴──────────┐
        │                    │
        ▼                    ▼
┌──────────────────┐   ┌────────────────────┐
│   load_dict.py   │   │  extract_data.py   │
│                  │   │                    │
│ • Sheet split    │   │ • File discovery   │
│ • Table detect   │   │ • Type convert     │
│ • Duplicate      │   │ • JSONL export     │
│   handling       │   │ • Progress track   │
└────────┬─────────┘   └─────────┬──────────┘
         │                       │
         └───────────┬───────────┘
                     │
                     ▼
            ┌──────────────────┐
            │    config.py     │
            │                  │
            │ • Paths          │
            │ • Settings       │
            │ • Dynamic        │
            │   detection      │
            └────────┬─────────┘
                     │
                     ▼
            ┌──────────────────┐
            │     logging      │
            │                  │
            │ • Log files      │
            │ • Console out    │
            │ • SUCCESS lvl    │
            └──────────────────┘
Core Components
1. main.py - Pipeline Orchestrator
Purpose: Central entry point and workflow coordinator
Responsibilities:
Parse command-line arguments
Execute pipeline steps in sequence
Handle top-level errors
Coordinate logging
Key Functions:
main(): Entry point, orchestrates the full pipeline
run_step(): Wrapper for step execution with error handling
Design Pattern: Command pattern with error handling decorator
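A minimal sketch of what such a wrapper could look like; the actual signature, return values, and logging calls in main.py may differ:
import sys
import time
from typing import Callable

def run_step(step_name: str, step_func: Callable[[], int]) -> int:
    """Illustrative wrapper: run one pipeline step with timing and top-level error handling."""
    start = time.time()
    try:
        result = step_func()
        print(f"[SUCCESS] {step_name} completed in {time.time() - start:.2f}s")  # real code uses log.success()
        return result
    except Exception as exc:
        print(f"[ERROR] {step_name} failed: {exc}", file=sys.stderr)             # real code uses log.error()
        sys.exit(1)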
2. config.py - Configuration Management
Purpose: Centralized configuration
Responsibilities:
Define all file paths
Manage settings
Dynamic dataset detection
Path resolution
Key Features:
Automatic dataset folder detection
Relative path resolution
Environment-agnostic configuration
Type hints for IDE support
Design Pattern: Module-level singleton
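An illustrative sketch of the module-level singleton approach, using hypothetical constant names (PROJECT_ROOT, DATASET_ROOT, DATASET_DIR) that may not match the real config.py:
# config.py -- illustrative sketch only; actual constant names may differ
from pathlib import Path
from typing import Optional

PROJECT_ROOT: Path = Path(__file__).resolve().parent
DATASET_ROOT: Path = PROJECT_ROOT / "data" / "dataset"
RESULTS_DIR: Path = PROJECT_ROOT / "results"

def detect_dataset_folder(root: Path = DATASET_ROOT) -> Optional[Path]:
    """Return the first subdirectory under data/dataset/, if any (dynamic detection)."""
    subdirs = sorted(p for p in root.iterdir() if p.is_dir()) if root.exists() else []
    return subdirs[0] if subdirs else None

# Module-level singleton: importing config gives every module the same resolved paths.
DATASET_DIR: Optional[Path] = detect_dataset_folder()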
3. scripts/extract_data.py - Data Extraction
Purpose: Convert Excel files to JSONL
Responsibilities:
File discovery and validation
Excel reading and parsing
Data type conversion
JSONL serialization
Progress tracking with tqdm
Key Functions:
extract_excel_to_jsonl(): Batch processing with progress bars
process_excel_file(): Single file processing
convert_dataframe_to_jsonl(): DataFrame conversion
clean_record_for_json(): Type conversion
is_dataframe_empty(): Empty detection
find_excel_files(): File discovery
Progress Tracking:
Uses tqdm for all file and row processing
Status messages via tqdm.write() for clean output
Summary statistics after completion
Design Pattern: Pipeline pattern with functional composition
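A simplified sketch of how these functions might compose; the function names come from the list above, but the signatures and output layout are assumptions:
# Simplified composition of the extraction flow (signatures are illustrative;
# the real pipeline applies clean_record_for_json and writes original/ and cleaned/ outputs)
import json
from pathlib import Path
import pandas as pd

def find_excel_files(dataset_dir: Path) -> list[Path]:
    """File discovery: all .xlsx files under the dataset directory."""
    return sorted(dataset_dir.glob("*.xlsx"))

def convert_dataframe_to_jsonl(df: pd.DataFrame, out_path: Path, source_name: str) -> int:
    """Write one JSON object per line, preserving Unicode and adding source_file."""
    with out_path.open("w", encoding="utf-8") as fh:
        for record in df.to_dict(orient="records"):
            record["source_file"] = source_name
            # default=str is a crude stand-in for clean_record_for_json (see Algorithm 2)
            fh.write(json.dumps(record, ensure_ascii=False, default=str) + "\n")
    return len(df)

def process_excel_file(xlsx_path: Path, out_dir: Path) -> int:
    df = pd.read_excel(xlsx_path, engine="openpyxl")
    return convert_dataframe_to_jsonl(df, out_dir / f"{xlsx_path.stem}.jsonl", xlsx_path.name)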
4. scripts/load_dictionary.py - Dictionary Processing
Purpose: Process data dictionary Excel file
Responsibilities:
Sheet processing with progress tracking
Table detection and splitting
Duplicate column handling
Table serialization
Key Functions:
load_study_dictionary(): High-level API with tqdm progress bars
process_excel_file(): Sheet processing
_split_sheet_into_tables(): Table detection
_process_and_save_tables(): Table output
_deduplicate_columns(): Column name handling
Progress Tracking:
tqdm progress bars for sheet processing
tqdm.write() for status messages
Clean console output during processing
Design Pattern: Functional composition, with a Strategy pattern for table detection
5. scripts/utils/logging.py - Logging System
Purpose: Centralized logging infrastructure
Responsibilities:
Create timestamped log files
Dual output (console + file)
Custom SUCCESS log level
Structured logging with configurable verbosity
Key Features:
Custom SUCCESS level (between INFO and WARNING)
Timestamped log files in .logs/ directory
Console and file handlers with different filtering
UTF-8 encoding for international characters
Works alongside tqdm for clean progress bar output
Verbose mode: DEBUG-level logging via -v flag
Log Levels:
DEBUG (10) # Verbose mode only: file processing, patterns, details
INFO (20) # Default: major steps, summaries
SUCCESS (25) # Custom: successful completions
WARNING (30) # Potential issues
ERROR (40) # Failures
CRITICAL (50) # Fatal errors
Console vs. File Output:
Console: Only SUCCESS, ERROR, and CRITICAL (keeps terminal clean)
File: INFO or DEBUG (depending on --verbose flag) and above
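A sketch of how the custom SUCCESS level and the console/file split described above can be wired up with the standard logging module; handler setup and the log file name are illustrative:
import logging
from pathlib import Path

SUCCESS = 25  # custom level between INFO (20) and WARNING (30)
logging.addLevelName(SUCCESS, "SUCCESS")

class ConsoleFilter(logging.Filter):
    """Console shows only SUCCESS, ERROR, and CRITICAL records."""
    def filter(self, record: logging.LogRecord) -> bool:
        return record.levelno in (SUCCESS, logging.ERROR, logging.CRITICAL)

Path(".logs").mkdir(exist_ok=True)
logger = logging.getLogger("reportalin")
logger.setLevel(logging.DEBUG)                      # capture everything; handlers filter per destination

file_handler = logging.FileHandler(".logs/run.log", encoding="utf-8")  # real file name is timestamped
file_handler.setLevel(logging.INFO)                 # DEBUG when --verbose is set

console_handler = logging.StreamHandler()
console_handler.addFilter(ConsoleFilter())

logger.addHandler(file_handler)
logger.addHandler(console_handler)
logger.log(SUCCESS, "Pipeline completed")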
Verbose Logging:
When --verbose or -v flag is used:
Log level set to DEBUG in main.py
VerboseLogger class provides tree-view, context-managed output
Integrated into all three pipeline steps (Dictionary Loading, Data Extraction, De-identification)
Additional details logged throughout pipeline:
File lists and processing order
Sheet/table detection details with row/column counts
Duplicate column detection and removal reasons with detailed explanations
Processing phases with timing information (per-file, per-sheet, per-table, overall)
PHI/PII pattern matches and replacement counts (de-identification step)
Record-level progress (every 1000 records for large files)
Validation results with issue tracking
Comprehensive timing for all operations
VerboseLogger Class:
The VerboseLogger class provides centralized verbose logging with tree-view formatting and context management:
# In module
from scripts.utils import logging as log
vlog = log.get_verbose_logger()
# Usage in code
with vlog.file_processing("Data dictionary", total_records=43):
    vlog.metric("Total sheets", 43)
    for sheet_idx, sheet_name in enumerate(sheets, 1):
        with vlog.step(f"Sheet {sheet_idx}/{len(sheets)}: '{sheet_name}'"):
            vlog.metric("Tables detected", num_tables)
            vlog.detail("Processing table...")
            vlog.timing("Sheet processing time", elapsed)
Output Format:
DEBUG: ├─ Processing: Data extraction (65 files)
DEBUG: ├─ Total files to process: 65
DEBUG: ├─ File 1/65: 10_TST.xlsx
DEBUG: │  ├─ Loading Excel file
DEBUG: │  │  ├─ Rows: 412
DEBUG: │  │  └─ Columns: 28
DEBUG: │  ├─ Cleaning duplicate columns
DEBUG: │  │  ├─ Marking SUBJID2 for removal (duplicate of SUBJID)
DEBUG: │  │  └─ Removed 3 duplicate columns: SUBJID2, AGE_2, PHONE_3
DEBUG: │  └─ ⏱ Total processing time: 0.45s
DEBUG: └─ ⏱ Overall extraction time: 32.15s
Key Methods:
file_processing(name, total_records): Wrap entire file processing
step(name): Wrap a processing step
detail(message): Log a detail line
metric(label, value): Log a metric/statistic
timing(operation, seconds): Log timing information
items_list(label, items, max_show): Log a list of items with truncation
Integration Pattern:
All pipeline modules (load_dictionary.py, extract_data.py, deidentify.py) follow this pattern:
Import verbose logger at module level:
from scripts.utils import logging as log

vlog = log.get_verbose_logger()
Wrap main processing functions with context managers:
with vlog.file_processing(filename):
    # Main processing
    for item in items:
        with vlog.step(f"Processing {item}"):
            # Item processing
            vlog.metric("Metric", value)
            vlog.timing("Operation", elapsed_time)
Always log both to file (verbose) and console (normal):
log.debug(...)     # File only in verbose mode
log.info(...)      # File always
log.success(...)   # Console + file
vlog.detail(...)   # File only in verbose mode (via context)
Usage:
from scripts.utils import logging as log
# Standard (INFO level)
python main.py
# Verbose (DEBUG level) with tree-view logs
python main.py -v
# Verbose with de-identification
python main.py -v --enable-deidentification
# In code
log.debug("Detailed processing info") # Only in verbose mode
log.info("Major step completed") # Always logged to file
log.success("Pipeline completed") # Console + file
vlog.detail("Context-managed detail") # Only in verbose mode
Design Pattern: Singleton logger instance with configurable formatting; VerboseLogger provides tree-view abstraction over standard logging
6. scripts/deidentify.py - De-identification Engine
Purpose: Remove PHI/PII from text data with pseudonymization
Responsibilities:
Detect PHI/PII using regex patterns
Generate consistent pseudonyms
Encrypt and store mappings
Validate de-identified output
Support country-specific regulations
Progress tracking for large datasets
Key Classes:
DeidentificationEngine: Main orchestrator
PseudonymGenerator: Creates deterministic placeholders
MappingStore: Secure encrypted storage
DateShifter: Consistent date shifting
PatternLibrary: Detection patterns
Progress Tracking:
tqdm progress bars for processing batches
tqdm.write() for status messages during processing
Summary statistics upon completion
Design Pattern: Strategy pattern for detection, Builder pattern for configuration
7. scripts/utils/country_regulations.py - Country-Specific Regulations
Purpose: Manage country-specific data privacy regulations
Responsibilities:
Define country-specific data fields
Provide detection patterns for local identifiers
Document regulatory requirements
Support multiple jurisdictions simultaneously
Key Classes:
CountryRegulationManager: Orchestrates regulations
CountryRegulation: Single country configuration
DataField: Field definition with validation
PrivacyLevel/DataFieldType: Enumerations
Supported Countries: US, EU, GB, CA, AU, IN, ID, BR, PH, ZA, KE, NG, GH, UG
Design Pattern: Registry pattern for country lookup, Factory pattern for regulation creation
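An illustrative sketch of the registry idea with hypothetical field names and an example detection pattern; the real classes and patterns may differ:
# Illustrative sketch of the registry pattern (class fields and patterns are assumptions)
import re
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DataField:
    name: str
    pattern: re.Pattern          # regex used to detect the local identifier
    description: str = ""

@dataclass
class CountryRegulation:
    country_code: str            # e.g. "IN"
    regulation_name: str
    fields: List[DataField] = field(default_factory=list)

class CountryRegulationManager:
    """Registry: look up regulations by ISO country code."""
    def __init__(self) -> None:
        self._registry: Dict[str, CountryRegulation] = {}

    def register(self, regulation: CountryRegulation) -> None:
        self._registry[regulation.country_code] = regulation

    def get(self, country_code: str) -> CountryRegulation:
        return self._registry[country_code.upper()]

# Hypothetical registration for India (example pattern only)
manager = CountryRegulationManager()
manager.register(CountryRegulation(
    country_code="IN",
    regulation_name="DPDP Act",
    fields=[DataField("aadhaar", re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"), "Aadhaar number")],
))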
Data Flow
Step-by-Step Data Flow:
1. User invokes: python main.py
        │
        ▼
2. main.py initializes logging
        │
        ▼
3. Step 0: load_study_dictionary()
        │
   ┌────┴─────────────────┐
   │                      │
   ▼                      ▼
Read Excel           Split sheets into tables
Dictionary                │
                          ▼
                     Deduplicate columns
                          │
                          ▼
                     Save as JSONL in:
                     results/data_dictionary_mappings/
        │
        ▼
4. Step 1: extract_excel_to_jsonl()
        │
   ┌────┴─────────────────┐
   │                      │
   ▼                      ▼
Find Excel files     Process each file
in dataset/               │
                ┌─────────┴──────────┐
                │                    │
                ▼                    ▼
        Read Excel sheets     Convert data types
                │                    │
                ▼                    ▼
        Clean records         Handle NaN/dates
                │                    │
                └─────────┬──────────┘
                          │
                          ▼
                     Save as JSONL in:
                     results/dataset/<dataset_name>/
                     ├── original/ (all columns)
                     └── cleaned/  (duplicates removed)
        │
        ▼
5. Step 2: deidentify_dataset() [OPTIONAL]
        │
   ┌────┴─────────────────┐
   │                      │
   ▼                      ▼
Recursively find     Process each file
JSONL files               │
in subdirs      ┌─────────┴──────────┐
                │                    │
                ▼                    ▼
        Detect PHI/PII        Generate pseudonyms
                │                    │
                ▼                    ▼
        Replace sensitive     Maintain mappings
        data                         │
                └─────────┬──────────┘
                          │
                          ▼
                     Save de-identified in:
                     results/deidentified/<dataset_name>/
                     ├── original/ (de-identified)
                     ├── cleaned/  (de-identified)
                     └── _deidentification_audit.json

                     Store encrypted mappings:
                     results/deidentified/mappings/
                     └── mappings.enc
Data Flow Architecture
The system processes data through three primary pipelines:
Pipeline 1: Data Dictionary Processing
Excel File (Dictionary) → pd.read_excel() → Table Detection → Split Tables
→ Column Deduplication → "Ignore Below" Filter → JSONL Export (per table)
Algorithm: Two-Phase Table Detection
1. Horizontal Split: Identify empty rows as boundaries
2. Vertical Split: Within horizontal strips, identify empty columns
3. Result: NxM tables from single sheet
Pipeline 2: Data Extraction
Excel Files (Dataset) → find_excel_files() → pd.read_excel()
→ Type Conversion → Duplicate Column Removal → JSONL Export
→ File Integrity Check → Statistics Collection
Outputs: Two versions (original/, cleaned/) for validation
Pipeline 3: De-identification (Optional)
JSONL Files → Pattern Matching (Regex + Country-Specific)
→ PHI/PII Detection → Pseudonym Generation (Cryptographic Hash)
→ Mapping Storage (Encrypted) → Date Shifting (Consistent Offset)
→ Validation → Encrypted JSONL Output + Audit Log
Security: Fernet encryption, deterministic pseudonyms, audit trails
Design Decisions
1. JSONL Format
Rationale:
Line-oriented: Each record is independent
Streaming friendly: Can process files line-by-line
Easy to merge: Just concatenate files
Human-readable: Each line is valid JSON
Standard format: Wide tool support
Alternative Considered: CSV
Rejected Because: CSV doesn't handle nested structures well
2. Automatic Table Detection
Rationale:
Excel sheets often contain multiple logical tables
Empty rows/columns serve as natural separators
Preserves semantic structure of data
Algorithm:
Find maximum consecutive empty rows/columns
Split at these boundaries
Handle special "Ignore below" markers
3. Dynamic Dataset Detection
Rationale:
Avoid hardcoding dataset names
Enable working with multiple datasets
Reduce configuration burden
Implementation: Scan data/dataset/ for first subdirectory
4. Progress Tracking
Rationale:
Long-running operations need real-time feedback
Users want to know progress and time remaining
Helps identify slow operations
Clean console output is essential
Implementation:
tqdm library for all progress bars (required dependency)
tqdm.write() for status messages during progress tracking
Consistent usage across all processing modules:
extract_data.py: File and row processing
load_dictionary.py: Sheet processing
deidentify.py: Batch de-identification
Design Decision: tqdm is a required dependency, not optional, ensuring consistent user experience
5. Centralized Configuration
Rationale:
Single source of truth
Easy to modify paths
Reduces coupling
Testability
Alternative Considered: Environment variables
Rejected Because: More complex for non-technical users
Algorithms and Data Structures
Algorithm 1: Two-Phase Table Detection
Located in: scripts/load_dictionary.py → _split_sheet_into_tables()
Purpose: Intelligently split Excel sheets containing multiple logical tables into separate tables
Algorithm:
Phase 1: Horizontal Splitting
1. Identify rows where ALL cells are null/empty
2. Use these rows as boundaries to split sheet into horizontal strips
3. Each strip potentially contains one or more tables side-by-side
Phase 2: Vertical Splitting (within each horizontal strip)
1. Identify columns where ALL cells are null/empty
2. Use these columns as boundaries to split strip into tables
3. Remove completely empty tables
4. Drop rows that are entirely null
Result: NxM independent tables from single sheet
Data Structures:
# Input: Raw DataFrame (no assumptions about structure)
df: pd.DataFrame # header=None, all data preserved
# Intermediate: List of horizontal strips
horizontal_strips: List[pd.DataFrame]
# Output: List of independent tables
all_tables: List[pd.DataFrame]
Edge Cases Handled:
Empty rows between tables (common in medical research data dictionaries)
Empty columns between tables (side-by-side table layouts)
Tables with no data rows (only headers) - preserved with metadata
"ignore below" markers - subsequent tables saved to separate directory
Duplicate column names - automatically suffixed with "_1", "_2", etc.
Complexity: O(r × c) where r = rows, c = columns
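A simplified sketch of the two-phase split, assuming a small helper for finding contiguous non-empty runs; the real _split_sheet_into_tables() also handles the "ignore below" marker and header metadata:
from typing import List
import pandas as pd

def _contiguous_runs(mask: pd.Series) -> List[range]:
    """Return positional index ranges where mask is False (i.e. non-empty runs)."""
    runs, start = [], None
    for i, is_empty in enumerate(mask.tolist()):
        if not is_empty and start is None:
            start = i
        elif is_empty and start is not None:
            runs.append(range(start, i))
            start = None
    if start is not None:
        runs.append(range(start, len(mask)))
    return runs

def split_sheet_into_tables(df: pd.DataFrame) -> List[pd.DataFrame]:
    tables: List[pd.DataFrame] = []
    # Phase 1: fully-empty rows are horizontal boundaries
    for row_run in _contiguous_runs(df.isna().all(axis=1)):
        strip = df.iloc[list(row_run)]
        # Phase 2: fully-empty columns split each strip into side-by-side tables
        for col_run in _contiguous_runs(strip.isna().all(axis=0)):
            table = strip.iloc[:, list(col_run)].dropna(how="all")  # drop rows that are entirely null
            if not table.empty:
                tables.append(table.reset_index(drop=True))
    return tables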
---
Algorithm 2: JSON Type Conversion
Located in: scripts/extract_data.py → clean_record_for_json()
Purpose: Convert pandas/numpy types to JSON-serializable Python types
Algorithm:
For each key-value pair in record:
1. If pd.isna(value) → None (JSON null)
2. If value is np.integer or np.floating → call .item() to get Python int/float
3. If value is pd.Timestamp, np.datetime64, datetime, or date → convert to string
4. Otherwise → keep as-is
Return cleaned dictionary
Type Mappings:
Pandas/Numpy Type | Python Type | JSON Type
------------------|-------------|---------------------
pd.NA, np.nan     | None        | null
np.int64          | int         | number
np.float64        | float       | number
pd.Timestamp      | str         | string (ISO format)
datetime          | str         | string
Edge Cases:
Mixed-type columns → handled by pandas during read_excel()
Unicode characters → preserved with ensure_ascii=False
Large integers → may lose precision if > 2^53 (JSON limitation)
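A simplified sketch of the conversion rules above; the real clean_record_for_json() may cover additional types:
# Simplified record cleaner following the documented type mappings
from datetime import date, datetime
from typing import Any, Dict
import numpy as np
import pandas as pd

def clean_record_for_json(record: Dict[str, Any]) -> Dict[str, Any]:
    cleaned: Dict[str, Any] = {}
    for key, value in record.items():
        if pd.api.types.is_scalar(value) and pd.isna(value):
            cleaned[key] = None                                   # JSON null
        elif isinstance(value, (np.integer, np.floating)):
            cleaned[key] = value.item()                           # native int/float
        elif isinstance(value, (pd.Timestamp, np.datetime64, datetime, date)):
            cleaned[key] = pd.Timestamp(value).isoformat()        # ISO 8601 string
        else:
            cleaned[key] = value                                  # keep as-is
    return cleaned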
---
Algorithm 3: Duplicate Column Detection and Removal
Located in: scripts/extract_data.py → clean_duplicate_columns()
Purpose: Remove duplicate columns with numeric suffixes (e.g., SUBJID2, SUBJID3)
Algorithm:
For each column in DataFrame:
1. Match pattern: column name ends with an optional "_" followed by digits
2. Extract base_name (everything before the suffix)
3. If base_name exists as a column:
- Mark current column for removal (it's a duplicate)
- Keep the base column
4. Otherwise:
- Keep the column
Return DataFrame with only non-duplicate columns
Regex Pattern: ^(.+?)_?(\d+)$
Examples:
SUBJID (base) + SUBJID2, SUBJID3 → Keep SUBJID, remove the others
NAME_1 (numbered) + NAME (base) → Keep NAME, remove NAME_1
ID3 (numbered) + ID (base) → Keep ID, remove ID3
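A minimal sketch of the rule using the documented regex; the real clean_duplicate_columns() also logs removal reasons:
import re
import pandas as pd

SUFFIX_RE = re.compile(r"^(.+?)_?(\d+)$")

def clean_duplicate_columns(df: pd.DataFrame) -> pd.DataFrame:
    to_drop = []
    columns = set(df.columns)
    for col in df.columns:
        match = SUFFIX_RE.match(str(col))
        if match and match.group(1) in columns:
            to_drop.append(col)      # base column exists: this one is a numbered duplicate
    return df.drop(columns=to_drop)

# Example: SUBJID2 and NAME_1 are dropped, SUBJID and NAME are kept
df = pd.DataFrame(columns=["SUBJID", "SUBJID2", "NAME", "NAME_1", "AGE"])
print(list(clean_duplicate_columns(df).columns))  # ['SUBJID', 'NAME', 'AGE']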
---
Algorithm 4: Cryptographic Pseudonymization
Located in: scripts/deidentify.py → PseudonymGenerator.generate()
Purpose: Generate deterministic, unique pseudonyms for PHI/PII values
Algorithm:
Input: (value, phi_type, template)
1. Check cache: If (phi_type, value.lower()) already pseudonymized:
- Return cached pseudonym (ensures consistency)
2. Generate deterministic ID:
a. Create hash_input = "{salt}:{phi_type}:{value}"
b. hash_digest = SHA256(hash_input)
c. Take first 4 bytes of digest
d. Encode as base32, strip padding, take first 6 chars
e. Result: Alphanumeric ID (e.g., "A4B8C3")
3. Apply template:
- Replace {id} placeholder with generated ID
- Example: "PATIENT-{id}" → "PATIENT-A4B8C3"
4. Cache and return pseudonym
Security Properties:
Deterministic: Same input always produces same output (required for data consistency)
One-way: Cannot reverse SHA256 without salt
Salt-dependent: Different salt produces different pseudonyms
Collision-resistant: SHA256 ensures uniqueness
Data Structure:
class PseudonymGenerator:
    salt: str                                 # Cryptographic salt (32 bytes hex)
    _cache: Dict[Tuple[PHIType, str], str]    # Memoization
    _counter: Dict[PHIType, int]              # Statistics
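A minimal sketch of the hashing and caching steps; salt handling and the template API are simplified here:
# Minimal sketch of deterministic pseudonym generation (salt handling simplified)
import base64
import hashlib
from typing import Dict, Tuple

class PseudonymGenerator:
    def __init__(self, salt: str) -> None:
        self.salt = salt
        self._cache: Dict[Tuple[str, str], str] = {}

    def generate(self, value: str, phi_type: str, template: str = "{type}-{id}") -> str:
        key = (phi_type, value.lower())
        if key in self._cache:                      # same input -> same pseudonym
            return self._cache[key]
        digest = hashlib.sha256(f"{self.salt}:{phi_type}:{value}".encode("utf-8")).digest()
        short_id = base64.b32encode(digest[:4]).decode("ascii").rstrip("=")[:6]
        pseudonym = template.format(type=phi_type, id=short_id)
        self._cache[key] = pseudonym
        return pseudonym

gen = PseudonymGenerator(salt="0f" * 32)
print(gen.generate("John Doe", "PATIENT", template="PATIENT-{id}"))  # e.g. PATIENT-A4B8C3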
---
Algorithm 5: Consistent Date Shifting (Country-Aware with Smart Validation)
Located in: scripts/deidentify.py → DateShifter.shift_date()
Changed in version 0.6.0: Enhanced date parsing with smart validation to handle ambiguous dates correctly.
Purpose: Shift all dates by consistent offset to preserve temporal relationships, with intelligent multi-format detection, country-specific priority, and smart validation
The Ambiguity Challenge:
Dates like 08/09/2020 or 12/12/2012 can be interpreted differently:
US Format (MM/DD): 08/09/2020 = August 9, 2020
India Format (DD/MM): 08/09/2020 = September 8, 2020
Symmetric: 12/12/2012 = December 12, 2012 (same in both formats)
Solution: Country-based priority with smart validation for unambiguous dates.
Algorithm:
Input: date_string, country_code
1. Check cache: If date_string already shifted:
- Return cached shifted date (O(1) lookup)
2. Generate consistent offset (first time only):
a. hash_digest = SHA256(seed)
b. offset_int = first 4 bytes as integer
c. offset_days = (offset_int % (2 * range + 1)) - range
d. Cache offset for all future shifts
3. Determine format priority based on country:
DD/MM/YYYY priority (day first):
Countries: IN, ID, BR, ZA, EU, GB, AU, KE, NG, GH, UG
Formats to try:
a. YYYY-MM-DD (ISO 8601) - always unambiguous
b. DD/MM/YYYY (country preference)
c. DD-MM-YYYY (hyphen variant)
d. DD.MM.YYYY (European dot notation)
MM/DD/YYYY priority (month first):
Countries: US, PH, CA
Formats to try:
a. YYYY-MM-DD (ISO 8601) - always unambiguous
b. MM/DD/YYYY (country preference)
c. MM-DD-YYYY (hyphen variant)
4. For each format in priority order:
   a. Try parsing date_string with the format
   b. If parse succeeds, apply SMART VALIDATION (for ambiguous slash/hyphen formats):
      If format is MM/DD or DD/MM:
      - Extract first_num and second_num from the date string
      CASE 1: first_num > 12
        → Must be a day (no 13th month exists!)
        → If trying MM/DD format: REJECT, try next format
        → If trying DD/MM format: ACCEPT
      CASE 2: second_num > 12
        → Must be a day (no 13th month exists!)
        → If trying DD/MM format: REJECT, try next format
        → If trying MM/DD format: ACCEPT
      CASE 3: Both numbers ≤ 12 (ambiguous)
        → Trust country preference (first matching format wins)
        → Ensures consistency within the dataset
   c. If validation passes: BREAK, use this format
   d. If validation fails: continue to next format
   e. If parse fails: continue to next format
5. If NO format succeeded:
- Return placeholder: [DATE-HASH]
- Log warning for manual review
6. Apply shift with successful format:
a. shifted_date = parsed_date + timedelta(days=offset_days)
b. Format back to string in SAME format as input
c. Cache result: {date_string: shifted_string}
7. Return shifted date string
Properties:
Unambiguous dates always correct: 13/05/2020 parsed as May 13 (the only valid interpretation)
Ambiguous dates use country preference: 08/09/2020 interpreted consistently per country
Symmetric dates handled: 12/12/2012 uses the country format (though the result is the same)
Consistent: All dates shifted by the SAME offset (preserves intervals)
Deterministic: Seed determines offset (reproducible)
Format-preserving: Output format matches input format
HIPAA-compliant: Dates obscured while relationships preserved
Example Flow - Unambiguous Date:
# Processing "13/05/2020" for India (DD/MM preference)
Step 1: Try YYYY-MM-DD
Result: β Doesn't match pattern
Step 2: Try DD/MM/YYYY (India preference)
Parse: β
Day=13, Month=05 (May 13, 2020)
Validate: first_num=13 > 12 β
Must be day (valid)
Result: β
SUCCESS β May 13, 2020
# Processing "13/05/2020" for USA (MM/DD preference)
Step 1: Try YYYY-MM-DD
Result: β Doesn't match pattern
Step 2: Try MM/DD/YYYY (USA preference)
Parse: β Month=13 invalid (strptime fails)
Result: Continue to next format
Step 3: Try DD/MM/YYYY (fallback)
Parse: β
Day=13, Month=05
Validate: first_num=13 > 12 β
Must be day (valid)
Result: β
SUCCESS β May 13, 2020
Example Flow - Ambiguous Date:
# Processing "08/09/2020" for India (DD/MM preference)
Step 1: Try YYYY-MM-DD
Result: β Doesn't match pattern
Step 2: Try DD/MM/YYYY (India preference)
Parse: β
Day=08, Month=09 (Sep 8, 2020)
Validate: first_num=8 β€ 12 β
, second_num=9 β€ 12 β
Both ambiguous β Trust country preference
Result: β
SUCCESS β Sep 8, 2020 (India interpretation)
# Processing "08/09/2020" for USA (MM/DD preference)
Step 1: Try YYYY-MM-DD
Result: β Doesn't match pattern
Step 2: Try MM/DD/YYYY (USA preference)
Parse: β
Month=08, Day=09 (Aug 9, 2020)
Validate: first_num=8 β€ 12 β
, second_num=9 β€ 12 β
Both ambiguous β Trust country preference
Result: β
SUCCESS β Aug 9, 2020 (USA interpretation)
Complete Example with Shifting:
# For India (DD/MM/YYYY format):
shifter_in = DateShifter(country_code="IN", seed="abc123")
"2014-09-04" → "2013-12-14"  # ISO → Always Sep 4 (unambiguous)
"04/09/2014" → "14/12/2013"  # DD/MM → Sep 4, 2014 → Dec 14, 2013 (-265 days)
"09/09/2014" → "19/12/2013"  # DD/MM → Sep 9, 2014 → Dec 19, 2013 (-265 days)
"13/05/2020" → "03/08/2019"  # Unambiguous → May 13 (only valid parsing)
# Interval preserved: All dates shifted by -265 days

# For United States (MM/DD/YYYY format):
shifter_us = DateShifter(country_code="US", seed="abc123")
"2014-09-04" → "2013-12-14"  # ISO → Always Sep 4 (unambiguous)
"04/09/2014" → "07/17/2013"  # MM/DD → Apr 9, 2014 → Jul 17, 2013 (-265 days)
"04/14/2014" → "07/22/2013"  # MM/DD → Apr 14, 2014 → Jul 22, 2013 (-265 days)
"13/05/2020" → "08/03/2019"  # Unambiguous → May 13 (validation rejects MM/DD)
# Interval preserved: All dates shifted by -265 days
Time Complexity:
Cache hit: O(1) - constant time lookup
Cache miss: O(f) where f = number of formats (typically 3-4)
Overall: O(1) amortized for repeated dates in large datasets
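A simplified sketch of country-priority parsing with smart validation; the format list, caching, and placeholder fallback are reduced compared to the real DateShifter:
from datetime import datetime, timedelta
from typing import Tuple

DAY_FIRST = {"IN", "ID", "BR", "ZA", "EU", "GB", "AU", "KE", "NG", "GH", "UG"}

def parse_with_priority(date_string: str, country_code: str) -> Tuple[datetime, str]:
    formats = ["%Y-%m-%d"]                                    # ISO first: always unambiguous
    formats += ["%d/%m/%Y", "%m/%d/%Y"] if country_code in DAY_FIRST else ["%m/%d/%Y", "%d/%m/%Y"]
    for fmt in formats:
        try:
            parsed = datetime.strptime(date_string, fmt)
        except ValueError:
            continue
        if "/" in date_string:                                # smart validation for ambiguous formats
            first, second = (int(p) for p in date_string.split("/")[:2])
            if first > 12 and fmt == "%m/%d/%Y":              # first number must be a day
                continue
            if second > 12 and fmt == "%d/%m/%Y":             # second number must be a day
                continue
        return parsed, fmt
    raise ValueError(f"Unparseable date: {date_string}")      # real code returns a [DATE-HASH] placeholder

def shift_date(date_string: str, country_code: str, offset_days: int) -> str:
    parsed, fmt = parse_with_priority(date_string, country_code)
    return (parsed + timedelta(days=offset_days)).strftime(fmt)  # format-preserving output

print(shift_date("13/05/2020", "US", -265))  # parsed as May 13 despite the MM/DD preference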
---
Data Structure: Mapping Store (Encrypted)
Located in: scripts/deidentify.py → MappingStore
Purpose: Securely store original → pseudonym mappings
Structure:
# In-memory structure
mappings: Dict[str, Dict[str, Any]] = {
    "PHI_TYPE:original_value": {
        "original": "John Doe",                  # Original sensitive value
        "pseudonym": "PATIENT-A4B8C3",           # Generated pseudonym
        "phi_type": "NAME_FULL",                 # Type of PHI
        "created_at": "2025-10-13T14:32:15",     # Timestamp
        "metadata": {"pattern": "Full name pattern"}
    },
    ...
}
# On-disk structure (encrypted with Fernet)
File: mappings.enc
Content: Fernet.encrypt(JSON.dumps(mappings))
Encryption: Fernet (symmetric encryption, 128-bit AES in CBC mode with HMAC)
Security:
Encryption key stored separately
Keys never committed to version control
Audit log exports WITHOUT original values by default
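A sketch of the encrypted persistence using Fernet; key management is simplified here, whereas the real MappingStore keeps its key in a separately protected location:
import json
from pathlib import Path
from cryptography.fernet import Fernet

class MappingStore:
    def __init__(self, key: bytes, path: Path = Path("mappings.enc")) -> None:
        self._fernet = Fernet(key)
        self._path = path
        self.mappings: dict = {}

    def save(self) -> None:
        token = self._fernet.encrypt(json.dumps(self.mappings).encode("utf-8"))
        self._path.write_bytes(token)

    def load(self) -> None:
        if self._path.exists():
            self.mappings = json.loads(self._fernet.decrypt(self._path.read_bytes()))

key = Fernet.generate_key()          # in practice the key is loaded from a protected file, never committed
store = MappingStore(key)
store.mappings["NAME_FULL:john doe"] = {"pseudonym": "PATIENT-A4B8C3", "phi_type": "NAME_FULL"}
store.save()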
---
Data Structure: JSONL File Format
Structure:
Each line is a valid JSON object (one record per line):
{"SUBJID": "INV001", "VISIT": 1, "TST_RESULT": "Positive", "source_file": "10_TST.xlsx"}
{"SUBJID": "INV002", "VISIT": 1, "TST_RESULT": "Negative", "source_file": "10_TST.xlsx"}
{"SUBJID": "INV003", "VISIT": 1, "TST_RESULT": "Positive", "source_file": "10_TST.xlsx"}
Advantages:
Streamable: Can process without loading entire file into memory
Line-oriented: Easy to split, merge, or process in parallel
JSON-compatible: Works with standard JSON parsers
Human-readable: Can inspect with head, tail, grep
Metadata Fields:
source_file: Original Excel filename for traceability
_metadata: Optional metadata (e.g., for empty files with structure)
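A small sketch of streaming JSONL consumption (the file name is hypothetical):
import json
from pathlib import Path
from typing import Dict, Iterator

def iter_jsonl(path: Path) -> Iterator[Dict]:
    """Yield one record per line without loading the whole file into memory."""
    with path.open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:                     # skip blank lines defensively
                yield json.loads(line)

positives = sum(1 for rec in iter_jsonl(Path("10_TST.jsonl"))
                if rec.get("TST_RESULT") == "Positive")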
---
Data Structure: Progress Tracking with tqdm
Integration Pattern:
from tqdm import tqdm
import sys
# File-level progress
for file in tqdm(files, desc="Processing", unit="file",
                 file=sys.stdout, dynamic_ncols=True, leave=True):
    # Use tqdm.write() instead of print() for clean output
    tqdm.write(f"Processing: {file.name}")
# Row-level progress (if needed)
for row in tqdm(rows, desc="Rows", leave=False):
    process(row)
# Result: Clean progress bars without interfering with logging
Why tqdm.write():
Ensures messages don't corrupt progress bar display
Automatically repositions progress bar after message
Works with logging system
Dependencies and Their Roles
pandas (>= 2.0.0)
Role: DataFrame manipulation, Excel reading, data analysis
Key functions used:
- pd.read_excel(): Excel file parsing
- df.to_json(): JSONL export
- pd.isna(): Null value detection
Why chosen: Industry standard for data manipulation in Python
openpyxl (>= 3.1.0)
Role: Excel file format (.xlsx) support for pandas
Used by: pd.read_excel(engine='openpyxl')
Why chosen: Pure Python, no external dependencies, handles modern Excel formats
numpy (>= 1.24.0)
Role: Numerical operations, type handling
Key types used:
- np.int64, np.float64: Numeric types from pandas
- np.datetime64: Datetime types
- np.nan: Missing value representation
Why chosen: Required by pandas, efficient numerical operations
tqdm (>= 4.66.0)
Role: Progress bars and user feedback
Key features:
- Real-time progress tracking
- ETA calculations
- Clean console output with tqdm.write()
Why chosen: Most popular Python progress bar library, excellent integration
cryptography (>= 41.0.0)
Role: Encryption for de-identification mappings
Key components:
- Fernet: Symmetric encryption
- hashlib.sha256(): Cryptographic hashing
- secrets: Secure random number generation
Why chosen: Industry-standard cryptography library, HIPAA-compliant algorithms
sphinx (>= 7.0.0) + extensions
Role: Documentation generation
Extensions used:
- sphinx.ext.autodoc: Automatic API documentation from docstrings
- sphinx.ext.napoleon: Google/NumPy style docstring support
- sphinx_autodoc_typehints: Type hint documentation
- sphinx-autobuild: Live documentation preview (dev dependency)
Why chosen: Standard for Python project documentation
Documentation Workflow:
Added in version 0.3.0: make docs-watch for automatic documentation rebuilding.
Autodoc is ENABLED: Sphinx automatically extracts documentation from Python docstrings
NOT Automatic by Default: Documentation does NOT rebuild automatically on every code change
Manual Build: Run make docs to regenerate documentation after changes
Auto-Rebuild (Development): Use make docs-watch for live preview during documentation development
How Autodoc Works:
Write Google-style docstrings in Python code
Use .. automodule:: directives in .rst files
Run make docs - Sphinx extracts docstrings and generates HTML
Or use make docs-watch - the server auto-rebuilds on file changes
Important: While autodoc extracts documentation automatically from code, you must build the documentation manually (or use watch mode) to see the changes.