scripts.deidentify module
De-identification and Pseudonymization Module
Robust PHI/PII detection and replacement with encrypted mapping storage, country-specific compliance, and comprehensive validation.
This module provides de-identification features designed to support HIPAA/GDPR compliance for medical datasets. It covers 14 countries with country-specific regulations.
Note: This module provides tools to assist with compliance but does not guarantee regulatory compliance. Users are responsible for validating that the de-identification meets their specific regulatory requirements.
Key Features:
- PHI/PII detection using regex patterns (18+ identifier types)
- Country-specific regulations (14 countries: US, IN, ID, BR, etc.)
- Pseudonymization with deterministic hashing
- Date shifting with interval preservation
- Encrypted mapping storage
- Comprehensive validation
- Verbose logging: detailed tree-view logs with timing (v0.0.12+)
Verbose Mode:
When run with the --verbose flag, the module generates detailed logs covering file-by-file de-identification progress, record processing counts (every 1000 records), per-file and overall timing, and validation results with issue tracking.
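The progress reporting described for verbose mode can be pictured with a small sketch. It is not the module's logging code: the logger name and the function `process_with_progress` are hypothetical, and only the "every 1000 records" cadence and the timing summary come from the description.

```python
import logging
import time
from typing import Iterable

logger = logging.getLogger("deidentify.sketch")  # hypothetical logger name


def process_with_progress(records: Iterable[dict], every: int = 1000) -> int:
    """Count records, logging progress every `every` records plus total timing."""
    start = time.perf_counter()
    count = 0
    for count, _record in enumerate(records, start=1):
        if count % every == 0:
            logger.info("processed %d records", count)
    logger.info("done: %d records in %.2fs", count, time.perf_counter() - start)
    return count
```

Nothing is printed unless a handler is configured, for example with `logging.basicConfig(level=logging.INFO)`.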
See also
- ../user_guide/deidentification: De-identification guide and examples
- ../user_guide/country_regulations: Country-specific regulations
- DeidentificationEngine: Core de-identification engine
- deidentify_dataset(): Main de-identification function
class scripts.deidentify.DateShifter(shift_range_days=365, preserve_intervals=True, seed=None, country_code='US')
Bases: object
Consistent date shifting with intelligent multi-format parsing.
Shifts all dates by a consistent offset while maintaining:
- Relative time intervals between dates
- Original date format (ISO 8601, DD/MM/YYYY, MM/DD/YYYY, hyphen/dot-separated)
- Country-specific format priority for consistent interpretation
Supported formats (auto-detected):
- YYYY-MM-DD (ISO 8601): always tried first (unambiguous)
- DD/MM/YYYY or MM/DD/YYYY (slash-separated): country-specific priority
- DD-MM-YYYY or MM-DD-YYYY (hyphen-separated): country-specific priority
- DD.MM.YYYY (dot-separated, European)
Date Interpretation Strategy
The shifter uses a three-tier strategy to handle date ambiguity:
1. Unambiguous formats (ISO 8601): always tried first
   - Example: "2020-01-15" is always January 15, 2020, regardless of country
2. Country-specific preference: for ambiguous dates
   - India (IN): "08/09/2020" interpreted as DD/MM → September 8, 2020
   - USA (US): "08/09/2020" interpreted as MM/DD → August 9, 2020
3. Smart validation: reject logically impossible formats
   - "13/05/2020" can only be DD/MM (no 13th month)
   - "05/25/2020" can only be MM/DD (no 25th month)
Ambiguous Date Handling
For dates where both day and month are ≤ 12 (e.g., 12/12/2012, 08/09/2020):
- Consistency guarantee: all dates from the same country use the same format
- Country setting: the country_code parameter determines interpretation
- Transparency: users know upfront how dates will be interpreted
Examples
Basic usage:
>>> shifter = DateShifter(country_code="IN")
>>> shifter.shift_date("2019-01-11")  # ISO format: shifted by offset, format preserved
'2018-04-13'
Ambiguous dates (country-specific):
>>> shifter_india = DateShifter(country_code="IN")
>>> shifter_india.shift_date("08/09/2020")  # DD/MM for India: September 8, 2020 → shifted
'18/12/2019'
>>> shifter_usa = DateShifter(country_code="US")
>>> shifter_usa.shift_date("08/09/2020")  # MM/DD for USA: August 9, 2020 → shifted
'11/28/2019'
Symmetric dates (country preference):
>>> shifter = DateShifter(country_code="IN")
>>> shifter.shift_date("12/12/2012")  # Ambiguous: interpreted as DD/MM (India preference)
'02/03/2012'
Unambiguous dates (validation):
>>> shifter = DateShifter(country_code="IN")
>>> shifter.shift_date("13/05/2020")  # Must be DD/MM: May 13, 2020 → shifted (13 > 12, can't be month)
'23/08/2019'
Country Format Reference
DD/MM/YYYY countries: IN, ID, BR, ZA, EU, GB, AU, KE, NG, GH, UG
MM/DD/YYYY countries: US, PH, CA
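The three-tier strategy can be sketched with the standard library alone. This is an illustration under stated assumptions, not the DateShifter implementation: `parse_date` and `DAY_FIRST` are hypothetical names, the country set is copied from the Country Format Reference, and dot-separated dates are read day-first only because DD.MM.YYYY is the only dot format listed.

```python
from datetime import date, datetime

# Country codes listed under "DD/MM/YYYY countries"; all others are treated as MM/DD.
DAY_FIRST = {"IN", "ID", "BR", "ZA", "EU", "GB", "AU", "KE", "NG", "GH", "UG"}


def parse_date(text: str, country_code: str = "US") -> tuple[date, str]:
    """Return (parsed date, strptime format) following the three tiers."""
    # Tier 1: ISO 8601 is unambiguous, so it is always tried first.
    try:
        return datetime.strptime(text, "%Y-%m-%d").date(), "%Y-%m-%d"
    except ValueError:
        pass
    for sep in ("/", "-", "."):
        day_first = f"%d{sep}%m{sep}%Y"
        month_first = f"%m{sep}%d{sep}%Y"
        if sep == ".":
            order = [day_first]  # only DD.MM.YYYY is listed for dots
        elif country_code in DAY_FIRST:
            order = [day_first, month_first]  # Tier 2: country preference
        else:
            order = [month_first, day_first]
        for fmt in order:
            # Tier 3: strptime rejects impossible readings (e.g. month 13),
            # so the other layout acts as the fallback.
            try:
                return datetime.strptime(text, fmt).date(), fmt
            except ValueError:
                continue
    raise ValueError(f"Unrecognized date: {text!r}")
```

Returning the matched format alongside the date is one way to let a caller render the shifted date in the same layout it arrived in.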
__init__(shift_range_days=365, preserve_intervals=True, seed=None, country_code='US')
Initialize date shifter with country-specific format interpretation.
Parameters:
- shift_range_days (int): Maximum days to shift (±), default 365
- preserve_intervals (bool): If True, all dates shift by the same offset (recommended for consistency)
- seed (Optional[str]): Optional seed for deterministic shift generation (same seed = same shift)
- country_code (str): Country code determining date format priority for ambiguous dates. Default: "US"
  - "IN", "ID", "BR", "ZA", "EU", "GB", "AU", "KE", "NG", "GH", "UG": DD/MM/YYYY
  - "US", "PH", "CA": MM/DD/YYYY
Note
The country_code setting ensures consistent interpretation of ambiguous dates (e.g., "08/09/2020" or "12/12/2012"). All dates from the same country will use the same format rules. ISO 8601 dates (YYYY-MM-DD) are always unambiguous and parse correctly regardless of country_code.
Examples
>>> # India dataset - interpret slash dates as DD/MM
>>> shifter_in = DateShifter(country_code="IN")
>>> shifter_in.shift_date("08/09/2020")  # September 8, 2020
>>> # USA dataset - interpret slash dates as MM/DD
>>> shifter_us = DateShifter(country_code="US")
>>> shifter_us.shift_date("08/09/2020")  # August 9, 2020
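How a seed might map to a fixed offset within ±shift_range_days can be sketched as follows. The hashing scheme and the name `pick_offset_days` are assumptions made for illustration; the documentation only states that the same seed yields the same shift.

```python
import hashlib
import random
from typing import Optional


def pick_offset_days(shift_range_days: int = 365, seed: Optional[str] = None) -> int:
    """Pick a shift in [-shift_range_days, +shift_range_days]."""
    if seed is None:
        # No seed: a fresh random offset on each call.
        return random.randint(-shift_range_days, shift_range_days)
    # Hash the seed so the same seed always maps to the same offset.
    digest = hashlib.sha256(seed.encode("utf-8")).digest()
    span = 2 * shift_range_days + 1
    return int.from_bytes(digest[:8], "big") % span - shift_range_days
```

In this sketch an offset of 0 is possible; a stricter variant could exclude it so that every date actually moves.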
shift_date(date_str, date_format=None)
Shift a date string by a consistent offset with intelligent format detection.
The algorithm prioritizes unambiguous formats (ISO 8601) and uses country-specific format preferences for ambiguous dates. This ensures consistency for dates like 12/12/2012 or 08/09/2020, which could be interpreted in more than one way.
Returns: Shifted date in the same format as the input
Examples
>>> shifter = DateShifter(country_code="IN")
>>> shifter.shift_date("2019-01-11")  # ISO format preserved
'2018-04-13'
>>> shifter.shift_date("04/09/2014")  # DD/MM/YYYY format (India interprets as Sep 4)
'14/12/2013'
>>> shifter.shift_date("12/12/2012")  # DD/MM/YYYY format (country preference trusted)
'02/03/2012'
Note
For ambiguous dates where both numbers are ≤ 12 (e.g., 12/12/2012, 08/09/2020), the country-specific format is used consistently based on the country_code setting. This ensures all dates from the same country are interpreted with the same rules.
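Format and interval preservation can be demonstrated with a short standalone sketch. The helper `shift_with_format` and the offset of -265 days are hypothetical; the format string is assumed to be already known (for instance from a prior parsing step).

```python
from datetime import datetime, timedelta


def shift_with_format(date_str: str, fmt: str, offset_days: int) -> str:
    """Parse with a known format, shift, and render in the same format."""
    shifted = datetime.strptime(date_str, fmt) + timedelta(days=offset_days)
    return shifted.strftime(fmt)


offset = -265  # hypothetical offset, chosen only for illustration
admitted = shift_with_format("08/09/2020", "%d/%m/%Y", offset)    # '18/12/2019'
discharged = shift_with_format("15/09/2020", "%d/%m/%Y", offset)  # '25/12/2019'

# The 7-day gap between the two original dates survives the shift.
gap = datetime.strptime(discharged, "%d/%m/%Y") - datetime.strptime(admitted, "%d/%m/%Y")
```

Because both dates move by the same offset, any duration computed from the shifted values equals the duration in the original data.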
class scripts.deidentify.DeidentificationConfig(pseudonym_templates=<factory>, enable_date_shifting=True, date_shift_range_days=365, preserve_date_intervals=True, enable_encryption=True, encryption_key=None, enable_validation=True, strict_mode=True, log_detections=True, log_level=20, countries=None, enable_country_patterns=True)
Bases: object
De-identification engine configuration.
__init__(pseudonym_templates=<factory>, enable_date_shifting=True, date_shift_range_days=365, preserve_date_intervals=True, enable_encryption=True, encryption_key=None, enable_validation=True, strict_mode=True, log_detections=True, log_level=20, countries=None, enable_country_patterns=True)
class scripts.deidentify.DeidentificationEngine(config=None, mapping_store=None)
Bases: object
Main engine for PHI/PII detection and de-identification.
Orchestrates the entire de-identification process:
1. Detects PHI/PII using patterns and optional NER
2. Generates consistent pseudonyms
3. Replaces sensitive data with pseudonyms
4. Stores mappings securely
5. Validates results
__init__(config=None, mapping_store=None)
Initialize de-identification engine.
Parameters:
- config (Optional[DeidentificationConfig]): Configuration object
- mapping_store (Optional[MappingStore]): Optional mapping store (creates default if None)
deidentify_record(record, text_fields=None)
De-identify a dictionary record (e.g., from JSONL).
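Record-level processing can be pictured as applying a text scrubber to selected fields of a dictionary. The sketch below is not the engine's code: `deidentify_fields` and the `scrub` callback are hypothetical, and treating `text_fields=None` as "all string fields" is an assumption carried over from the deidentify_dataset parameter description.

```python
from typing import Any, Callable, Dict, List, Optional


def deidentify_fields(
    record: Dict[str, Any],
    scrub: Callable[[str], str],
    text_fields: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Return a copy of record with the selected string fields scrubbed."""
    result = dict(record)
    for key, value in record.items():
        selected = text_fields is None or key in text_fields
        # Non-string values (numbers, nested objects) are left untouched.
        if selected and isinstance(value, str):
            result[key] = scrub(value)
    return result
```

Working on a copy keeps the caller's record unchanged, which makes it easier to compare input and output during validation.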
deidentify_text(text, custom_patterns=None)
De-identify a single text string.
Parameters:
- text (str): Text to de-identify
- custom_patterns (Optional[List[DetectionPattern]]): Optional additional patterns to use
Returns: De-identified text with PHI/PII replaced by pseudonyms
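The detect-and-replace step for a single string might look like the following sketch. The two regexes, the placeholder format, and the name `scrub_text` are all hypothetical; the module's own patterns and pseudonym templates are not shown in this reference.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical patterns as (label, regex); production patterns would be stricter.
PATTERNS: List[Tuple[str, str]] = [
    ("EMAIL", r"[\w.+-]+@[\w-]+\.[\w.]+"),
    ("SSN", r"\b\d{3}-\d{2}-\d{4}\b"),
]


def scrub_text(text: str, mapping: Dict[str, str]) -> str:
    """Replace each match with a numbered placeholder and record the mapping."""
    for label, regex in PATTERNS:
        def replace(match: "re.Match[str]") -> str:
            original = match.group(0)
            # Reuse the placeholder if this value was seen before.
            if original not in mapping:
                mapping[original] = f"[{label}-{len(mapping) + 1}]"
            return mapping[original]
        text = re.sub(regex, replace, text)
    return text
```

Passing the same `mapping` dictionary across calls keeps placeholders consistent for repeated values, which mirrors the "consistent pseudonyms" step of the process.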
class scripts.deidentify.DetectionPattern(phi_type, pattern, priority=50, description='')
Bases: object
PHI/PII detection pattern configuration.
__init__(phi_type, pattern, priority=50, description='')
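One plausible use of the priority field is to decide which pattern wins when matches overlap. The sketch below uses a stand-in dataclass `Pattern` that mirrors the four documented fields; the rule that larger priority values are applied first is an assumption, as this reference does not say how priority is interpreted.

```python
import re
from dataclasses import dataclass


@dataclass
class Pattern:  # hypothetical stand-in mirroring DetectionPattern's fields
    phi_type: str
    pattern: str
    priority: int = 50
    description: str = ""


def claim_spans(text: str, patterns: list[Pattern]) -> dict[tuple[int, int], str]:
    """Map each non-overlapping match span to the PHI type that claimed it."""
    claimed: dict[tuple[int, int], str] = {}
    # Assumption: larger priority values are applied first.
    for p in sorted(patterns, key=lambda p: p.priority, reverse=True):
        for m in re.finditer(p.pattern, text):
            overlaps = any(m.start() < end and start < m.end() for start, end in claimed)
            if not overlaps:
                claimed[m.span()] = p.phi_type
    return claimed
```

With this rule, a specific high-priority pattern (such as an SSN) is not broken apart by a generic low-priority one (such as a bare number).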
class scripts.deidentify.MappingStore(storage_path, encryption_key=None, enable_encryption=True)
Bases: object
Secure storage for PHI-to-pseudonym mappings.
Features:
- Encrypted storage using Fernet (symmetric encryption)
- Separate key management
- JSON serialization
- Audit logging
__init__(storage_path, encryption_key=None, enable_encryption=True)
Initialize mapping store.
export_for_audit(output_path, include_originals=False)
Export mappings for audit purposes.
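The effect of the include_originals flag can be illustrated with a plain-JSON sketch. The function `export_mappings` and the row layout are hypothetical and involve no encryption; the only detail taken from the signature is that originals are excluded by default.

```python
import json
from pathlib import Path
from typing import Dict, Union


def export_mappings(
    mappings: Dict[str, str],
    output_path: Union[str, Path],
    include_originals: bool = False,
) -> None:
    """Write an audit file; original values are omitted unless requested."""
    if include_originals:
        rows = [{"original": o, "pseudonym": p} for o, p in mappings.items()]
    else:
        # Default: only pseudonyms leave the store, so the export holds no PHI.
        rows = [{"pseudonym": p} for p in mappings.values()]
    Path(output_path).write_text(json.dumps(rows, indent=2), encoding="utf-8")
```

An export that includes originals contains PHI and would need the same protection as the mapping store itself.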
class scripts.deidentify.PHIType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Bases: Enum
PHI/PII type categorization.
- ACCOUNT_NUMBER = 'ACCOUNT'
- ADDRESS_CITY = 'CITY'
- ADDRESS_STATE = 'STATE'
- ADDRESS_STREET = 'STREET'
- ADDRESS_ZIP = 'ZIP'
- AGE_OVER_89 = 'AGE'
- CUSTOM = 'CUSTOM'
- DATE = 'DATE'
- DEVICE_ID = 'DEVICE'
- EMAIL = 'EMAIL'
- IP_ADDRESS = 'IP'
- LICENSE_NUMBER = 'LICENSE'
- LOCATION = 'LOCATION'
- MRN = 'MRN'
- NAME_FIRST = 'FNAME'
- NAME_FULL = 'PATIENT'
- NAME_LAST = 'LNAME'
- ORGANIZATION = 'ORG'
- PHONE = 'PHONE'
- SSN = 'SSN'
- URL = 'URL'
class scripts.deidentify.PatternLibrary
Bases: object
Library of regex patterns for detecting PHI/PII.
static get_country_specific_patterns(countries=None)
Get country-specific detection patterns.
class scripts.deidentify.PseudonymGenerator(salt=None)
Bases: object
Generates consistent, deterministic pseudonyms for PHI/PII.
Uses cryptographic hashing to ensure:
- Same input always produces same pseudonym
- Different inputs produce different pseudonyms
- Pseudonyms are not reversible without the mapping table
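Salted hashing gives the determinism described here. The following is a minimal sketch rather than the generator's actual scheme: `make_pseudonym`, the choice of SHA-256, and the `<TYPE>-<8 hex characters>` output shape are all assumptions.

```python
import hashlib


def make_pseudonym(value: str, phi_type: str, salt: str) -> str:
    """Derive a stable pseudonym of the form <TYPE>-<8 hex characters>."""
    digest = hashlib.sha256(f"{salt}:{phi_type}:{value}".encode("utf-8")).hexdigest()
    # Truncation keeps pseudonyms short; collisions become possible but unlikely.
    return f"{phi_type}-{digest[:8].upper()}"
```

Keeping the salt secret matters: without it, an attacker could hash candidate values (for example every possible SSN) and match them against the pseudonyms.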
scripts.deidentify.deidentify_dataset(input_dir, output_dir, text_fields=None, config=None, file_pattern='*.jsonl', process_subdirs=True)
Batch de-identification of JSONL dataset files.
Processes JSONL files while maintaining directory structure. If the input directory contains subdirectories (e.g., "original/", "cleaned/"), the same structure will be replicated in the output directory.
Parameters:
- input_dir (Union[str, Path]): Directory containing JSONL files (may have subdirectories)
- output_dir (Union[str, Path]): Directory to write de-identified files (maintains structure)
- text_fields (Optional[List[str]]): List of field names to de-identify (all string fields if None)
- config (Optional[DeidentificationConfig]): De-identification configuration
- file_pattern (str): Glob pattern for files to process
- process_subdirs (bool): If True, recursively process subdirectories
Returns: Dictionary with processing statistics
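Mirroring the input directory layout in the output can be sketched with pathlib. This is not the library function: `process_tree`, the `transform` callback, and the keys of the returned statistics dictionary are hypothetical.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Union


def process_tree(
    input_dir: Union[str, Path],
    output_dir: Union[str, Path],
    transform: Callable[[dict], dict],
    file_pattern: str = "*.jsonl",
    process_subdirs: bool = True,
) -> Dict[str, int]:
    """Apply transform to every JSONL record, mirroring the directory layout."""
    src, dst = Path(input_dir), Path(output_dir)
    files = src.rglob(file_pattern) if process_subdirs else src.glob(file_pattern)
    stats = {"files": 0, "records": 0}
    # sorted() materializes the file list before any output is written.
    for path in sorted(files):
        target = dst / path.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        with path.open(encoding="utf-8") as fin, target.open("w", encoding="utf-8") as fout:
            for line in fin:
                if line.strip():  # skip blank lines
                    fout.write(json.dumps(transform(json.loads(line))) + "\n")
                    stats["records"] += 1
        stats["files"] += 1
    return stats
```

Using `path.relative_to(src)` is what reproduces subdirectories such as "original/" under the output directory.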
scripts.deidentify.validate_dataset(dataset_dir, file_pattern='*.jsonl', text_fields=None)
Validate that no PHI remains in de-identified dataset.
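A residual-PHI scan over de-identified files might be sketched as below. The two checks, the function `find_residual_phi`, and the shape of the issue records are hypothetical; a real validator would presumably reuse the full detection pattern set.

```python
import json
import re
from pathlib import Path
from typing import Dict, List, Union

# Hypothetical residual-PHI checks used only for this illustration.
CHECKS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.\w+"),
}


def find_residual_phi(
    dataset_dir: Union[str, Path], file_pattern: str = "*.jsonl"
) -> List[Dict[str, object]]:
    """Return one issue per (file, line, field, type) where a check still matches."""
    issues: List[Dict[str, object]] = []
    for path in sorted(Path(dataset_dir).rglob(file_pattern)):
        with path.open(encoding="utf-8") as fh:
            for line_no, line in enumerate(fh, start=1):
                if not line.strip():
                    continue
                for field, value in json.loads(line).items():
                    if not isinstance(value, str):
                        continue
                    for label, rx in CHECKS.items():
                        if rx.search(value):
                            issues.append(
                                {"file": path.name, "line": line_no, "field": field, "type": label}
                            )
    return issues
```

An empty result only means that these particular checks found nothing; it does not prove that no PHI remains.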
Changed in version 0.3.0: Added explicit public API definition via __all__ (10 exports), enhanced module
docstring with comprehensive usage examples (48 lines), and added complete return type annotations.
Overview
The deidentify module provides de-identification for medical datasets, designed to support HIPAA/GDPR compliance.
It supports 14 countries with country-specific regulations, encrypted mapping storage, and comprehensive validation.
Public API:
__all__ = [
'PHIType', # Enum for PHI types
'DetectionPattern', # Dataclass for patterns
'DeidentificationConfig', # Dataclass for configuration
'PatternLibrary', # Pattern library class
'PseudonymGenerator', # Pseudonym generation
'DateShifter', # Date shifting
'MappingStore', # Secure mapping storage
'DeidentificationEngine', # Main engine class
'deidentify_dataset', # Top-level function
'validate_dataset', # Validation function
]
Key Features
- Multi-Country Support: HIPAA (US), GDPR (EU/GB), DPDPA (IN), and 10 other countries (14 country codes in total)
- PHI/PII Detection: 21 PHI types with country-specific patterns
- Pseudonymization: Consistent pseudonyms, reversible only through the encrypted mapping
- Date Shifting: Preserves time intervals while shifting dates
- Encrypted Storage: Fernet encryption for mapping files
- Validation: Comprehensive validation to check de-identification quality
- Audit Trails: Export mappings for compliance audits
Classes
DeidentificationEngine
PseudonymGenerator
DateShifter
MappingStore
PatternLibrary
Enumerations
PHIType
Data Classes
DetectionPattern
DeidentificationConfig
Functions
deidentify_dataset