scripts.deidentify module

Returns:

Shifted date in same format as input

Examples

>>> shifter = DateShifter(country_code="IN")
>>> shifter.shift_date("2019-01-11")
'2018-04-13'  # ISO format preserved
>>> shifter.shift_date("04/09/2014")
'14/12/2013'  # DD/MM/YYYY format (India interprets as Sep 4)
>>> shifter.shift_date("12/12/2012")
'02/03/2012'  # DD/MM/YYYY format (country preference trusted)

Note

For ambiguous dates where both numbers are ≤ 12 (e.g., 12/12/2012, 08/09/2020), the country-specific format is used consistently based on the country_code setting. This ensures all dates from the same country are interpreted with the same rules.

class scripts.deidentify.DeidentificationConfig(pseudonym_templates=<factory>, enable_date_shifting=True, date_shift_range_days=365, preserve_date_intervals=True, enable_encryption=True, encryption_key=None, enable_validation=True, strict_mode=True, log_detections=True, log_level=20, countries=None, enable_country_patterns=True)[source]

Bases: object

De-identification engine configuration.

__init__(pseudonym_templates=<factory>, enable_date_shifting=True, date_shift_range_days=365, preserve_date_intervals=True, enable_encryption=True, encryption_key=None, enable_validation=True, strict_mode=True, log_detections=True, log_level=20, countries=None, enable_country_patterns=True)

countries: Optional[List[str]] = None

date_shift_range_days: int = 365

enable_country_patterns: bool = True

enable_date_shifting: bool = True

enable_encryption: bool = True

enable_validation: bool = True

encryption_key: Optional[bytes] = None

log_detections: bool = True

log_level: int = 20

preserve_date_intervals: bool = True

pseudonym_templates: Dict[PHIType, str]

strict_mode: bool = True

class scripts.deidentify.DeidentificationEngine(config=None, mapping_store=None)[source]

Bases: object

Main engine for PHI/PII detection and de-identification.

Orchestrates the entire de-identification process:

1. Detects PHI/PII using patterns and optional NER
2. Generates consistent pseudonyms
3. Replaces sensitive data with pseudonyms
4. Stores mappings securely
5. Validates results
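The five steps can be illustrated with a minimal, self-contained sketch. It is not the engine's implementation: the two regex patterns, the `pseudonym_for` helper, the `demo-salt` value, and the plain-dict mapping store are all hypothetical stand-ins chosen for illustration.

```python
import hashlib
import re

# Hypothetical patterns for illustration; the real pattern library is not shown here.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def pseudonym_for(value: str, label: str, salt: str = "demo-salt") -> str:
    """Step 2: derive a short, deterministic token from a salted SHA-256 digest."""
    digest = hashlib.sha256(f"{salt}:{label}:{value}".encode("utf-8")).hexdigest()
    return f"{label}-{digest[:6].upper()}"


def deidentify(text: str, mappings: dict) -> str:
    """Steps 1, 3 and 4: detect, replace, and record mappings for one string."""
    for label, pattern in PATTERNS.items():
        def replace(match, label=label):
            original = match.group(0)
            token = pseudonym_for(original, label)
            mappings[original] = token  # remember original -> pseudonym
            return token
        text = pattern.sub(replace, text)
    return text


def validate(text: str):
    """Step 5: re-run detection on the output; any hit counts as a leak."""
    leaks = [m.group(0) for p in PATTERNS.values() for m in p.finditer(text)]
    return (not leaks, leaks)


mappings = {}
clean = deidentify("Contact jane@example.org, SSN 123-45-6789.", mappings)
is_valid, leaks = validate(clean)
```

Validating by re-running the same detectors only catches what those detectors can see, so a passing result here means "no known pattern matched", not "no PHI remains".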

__init__(config=None, mapping_store=None)[source]

Initialize de-identification engine.

Parameters:

config (Optional[DeidentificationConfig]) – Configuration object
mapping_store (Optional[MappingStore]) – Optional mapping store (creates default if None)

deidentify_record(record, text_fields=None)[source]

De-identify a dictionary record (e.g., from JSONL).

Parameters:

record (Dict[str, Any]) – Dictionary containing data
text_fields (Optional[List[str]]) – List of field names to de-identify (all string fields if None)

Return type:

Returns:

De-identified record
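The `text_fields` behaviour described above (named fields only, or every string field when `None`) can be sketched as follows. The helper name `apply_to_text_fields` and the `scrub` callback are assumptions made for this example, not part of the module.

```python
from typing import Any, Callable, Dict, List, Optional


def apply_to_text_fields(
    record: Dict[str, Any],
    scrub: Callable[[str], str],
    text_fields: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Return a copy of record with scrub applied to the selected string fields."""
    result = dict(record)  # shallow copy so the input record is left untouched
    names = text_fields if text_fields is not None else list(record)
    for name in names:
        value = record.get(name)
        if isinstance(value, str):  # non-string and missing values pass through
            result[name] = scrub(value)
    return result


record = {"note": "Seen by Dr. Rao", "age": 42, "id": "abc"}
only_note = apply_to_text_fields(record, str.upper, text_fields=["note"])
all_strings = apply_to_text_fields(record, str.upper)
```

Here `str.upper` stands in for the real scrubbing function so the effect on each field is easy to see.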

deidentify_text(text, custom_patterns=None)[source]

De-identify a single text string.

Parameters:

text (str) – Text to de-identify
custom_patterns (Optional[List[DetectionPattern]]) – Optional additional patterns to use

Return type:

Returns:

De-identified text with PHI/PII replaced by pseudonyms

get_statistics()[source]

Get de-identification statistics.

Return type: Dict[str, Any]
Returns: Dictionary with processing statistics

save_mappings()[source]

Save all mappings to secure storage.

Return type: None

validate_deidentification(text, strict=True)[source]

Validate that no PHI remains in de-identified text.

Parameters:

text (str) – De-identified text to validate
strict (bool) – If True, any detection is considered a failure

Return type:

Tuple[bool, List[str]]

Returns:

Tuple of (is_valid, list of potential PHI found)

class scripts.deidentify.DetectionPattern(phi_type, pattern, priority=50, description='')[source]

Bases: object

PHI/PII detection pattern configuration.

__init__(phi_type, pattern, priority=50, description='')

description: str = ''

pattern: Pattern

phi_type: PHIType

priority: int = 50
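As a rough illustration of how the four fields fit together, the sketch below defines a stand-in dataclass and orders patterns by priority. `SketchPattern` and `by_priority` are hypothetical names, and treating larger numbers as "tried first" is an assumption; the documentation only states that default patterns are sorted by priority.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class SketchPattern:
    """Stand-in with the same four fields as the documented DetectionPattern."""
    label: str            # plays the role of phi_type
    pattern: re.Pattern   # compiled regular expression
    priority: int = 50
    description: str = ""


def by_priority(patterns: List[SketchPattern]) -> List[SketchPattern]:
    # Assumption: larger numbers are tried first.
    return sorted(patterns, key=lambda p: p.priority, reverse=True)


patterns = [
    SketchPattern("PHONE", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
    SketchPattern("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), priority=90),
]
ordered = [p.label for p in by_priority(patterns)]
```

Ordering matters when two patterns could claim overlapping text: whichever runs first decides the label.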

class scripts.deidentify.MappingStore(storage_path, encryption_key=None, enable_encryption=True)[source]

Bases: object

Secure storage for PHI to pseudonym mappings.

Features:

- Encrypted storage using Fernet (symmetric encryption)
- Separate key management
- JSON serialization
- Audit logging

__init__(storage_path, encryption_key=None, enable_encryption=True)[source]

Initialize mapping store.

Parameters:

storage_path (Path) – Path to store mapping file
encryption_key (Optional[bytes]) – Encryption key (generates new if None)
enable_encryption (bool) – Whether to encrypt mappings

add_mapping(original, pseudonym, phi_type, metadata=None)[source]

Add a mapping entry.

Parameters:

original (str) – Original sensitive value
pseudonym (str) – Pseudonymized value
phi_type (PHIType) – Type of PHI
metadata (Optional[Dict]) – Optional additional metadata

Return type:

export_for_audit(output_path, include_originals=False)[source]

Export mappings for audit purposes.

Parameters:

output_path (Path) – Path to export file
include_originals (bool) – Whether to include original values (dangerous!)

Return type:
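The effect of `include_originals` can be pictured with a small sketch. The in-memory layout of `MAPPINGS`, the function name `export_for_audit_sketch`, and the JSON output shape are all assumptions for illustration; the store's real file format is not documented here.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical in-memory layout: original value -> mapping details.
MAPPINGS = {
    "Jane Doe": {"pseudonym": "PATIENT-A4B8", "phi_type": "PATIENT"},
    "jane@example.org": {"pseudonym": "EMAIL-19CF", "phi_type": "EMAIL"},
}


def export_for_audit_sketch(mappings, output_path, include_originals=False):
    """Write one JSON entry per mapping, omitting originals unless requested."""
    rows = []
    for original, details in mappings.items():
        row = dict(details)
        if include_originals:
            row["original"] = original  # re-identifying: keep this file locked down
        rows.append(row)
    Path(output_path).write_text(json.dumps(rows, indent=2), encoding="utf-8")
    return rows


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "audit.json"
    export_for_audit_sketch(MAPPINGS, out)
    exported = json.loads(out.read_text(encoding="utf-8"))
```

With the default `include_originals=False`, the export shows which pseudonyms exist and of what type without exposing the values they replace.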

get_pseudonym(original, phi_type)[source]

Retrieve pseudonym for original value.

Parameters:

original (str) – Original value
phi_type (PHIType) – Type of PHI

Return type:

Optional[str]

Returns:

Pseudonym if exists, None otherwise

save_mappings()[source]

Save mappings to storage file.

Return type: None

class scripts.deidentify.PHIType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

PHI/PII type categorization.

ACCOUNT_NUMBER = 'ACCOUNT'

ADDRESS_CITY = 'CITY'

ADDRESS_STATE = 'STATE'

ADDRESS_STREET = 'STREET'

ADDRESS_ZIP = 'ZIP'

AGE_OVER_89 = 'AGE'

CUSTOM = 'CUSTOM'

DATE = 'DATE'

DEVICE_ID = 'DEVICE'

EMAIL = 'EMAIL'

IP_ADDRESS = 'IP'

LICENSE_NUMBER = 'LICENSE'

LOCATION = 'LOCATION'

MRN = 'MRN'

NAME_FIRST = 'FNAME'

NAME_FULL = 'PATIENT'

NAME_LAST = 'LNAME'

ORGANIZATION = 'ORG'

PHONE = 'PHONE'

SSN = 'SSN'

URL = 'URL'

class scripts.deidentify.PatternLibrary[source]

Bases: object

Library of regex patterns for detecting PHI/PII.

static get_country_specific_patterns(countries=None)[source]

Get country-specific detection patterns.

Parameters: countries (Optional[List[str]]) – List of country codes or None for all
Return type: List[DetectionPattern]
Returns: List of DetectionPattern objects for country-specific identifiers

static get_default_patterns()[source]

Get default detection patterns for common PHI/PII types.

Return type: List[DetectionPattern]
Returns: List of DetectionPattern objects sorted by priority.

class scripts.deidentify.PseudonymGenerator(salt=None)[source]

Bases: object

Generates consistent, deterministic pseudonyms for PHI/PII.

Uses cryptographic hashing to ensure:

- Same input always produces same pseudonym
- Different inputs produce different pseudonyms
- Pseudonyms are not reversible without the mapping table
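One way to obtain these properties is a salted hash truncated to a short identifier, sketched below. `SketchPseudonymizer` is a hypothetical stand-in; the choice of SHA-256, the 8-character identifier, and the way the salt, type label, and value are combined are assumptions, not the module's documented behaviour.

```python
import hashlib
import secrets
from typing import Optional


class SketchPseudonymizer:
    """Illustrative stand-in, not the module's PseudonymGenerator."""

    def __init__(self, salt: Optional[str] = None) -> None:
        # A random salt makes tokens unpredictable across runs; a fixed salt
        # makes them reproducible.
        self.salt = salt if salt is not None else secrets.token_hex(16)

    def generate(self, value: str, label: str, template: str) -> str:
        # Hash salt + type label + value so equal values of different types
        # get different ids, then keep a short uppercase prefix as the id.
        payload = f"{self.salt}|{label}|{value}".encode("utf-8")
        short_id = hashlib.sha256(payload).hexdigest()[:8].upper()
        return template.format(id=short_id)


gen = SketchPseudonymizer(salt="fixed-for-demo")
a = gen.generate("Jane Doe", "NAME_FULL", "PATIENT-{id}")
b = gen.generate("Jane Doe", "NAME_FULL", "PATIENT-{id}")
c = gen.generate("John Roe", "NAME_FULL", "PATIENT-{id}")
```

Truncating the digest keeps pseudonyms readable, but it means distinct inputs differ only with high probability; a real implementation would need a collision check against the mapping table to make the guarantee strict.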

__init__(salt=None)[source]

Initialize pseudonym generator.

Parameters: salt (Optional[str]) – Optional salt for hash function. If None, generates random salt.

generate(value, phi_type, template)[source]

Generate a pseudonym for a given value.

Parameters:

value (str) – The sensitive value to pseudonymize
phi_type (PHIType) – Type of PHI/PII
template (str) – Template string with {id} placeholder

Return type:

Returns:

Pseudonymized value (e.g., “PATIENT-A4B8”)

get_statistics()[source]

Get statistics on pseudonym generation.

Return type: Dict[str, int]
Returns: Dictionary mapping PHI type to count of unique values

scripts.deidentify.deidentify_dataset(input_dir, output_dir, text_fields=None, config=None, file_pattern='*.jsonl', process_subdirs=True)[source]

Batch de-identification of JSONL dataset files.

Processes JSONL files while maintaining directory structure. If the input directory contains subdirectories (e.g., ‘original/’, ‘cleaned/’), the same structure will be replicated in the output directory.

Parameters:

input_dir (Union[str, Path]) – Directory containing JSONL files (may have subdirectories)
output_dir (Union[str, Path]) – Directory to write de-identified files (maintains structure)
text_fields (Optional[List[str]]) – List of field names to de-identify (all string fields if None)
config (Optional[DeidentificationConfig]) – De-identification configuration
file_pattern (str) – Glob pattern for files to process
process_subdirs (bool) – If True, recursively process subdirectories

Return type:

Returns:

Dictionary with processing statistics
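The directory-mirroring behaviour can be sketched with `pathlib`. The function `mirror_jsonl`, its `scrub` callback, and the keys of the returned statistics dictionary are assumptions for illustration; the real function's statistics keys are not documented here.

```python
import json
import tempfile
from pathlib import Path


def mirror_jsonl(input_dir, output_dir, scrub, file_pattern="*.jsonl", process_subdirs=True):
    """Apply scrub to every record and write it to the same relative path."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    finder = input_dir.rglob if process_subdirs else input_dir.glob
    stats = {"files": 0, "records": 0}
    for src in sorted(finder(file_pattern)):
        # Rebuild the path relative to the input root under the output root.
        dst = output_dir / src.relative_to(input_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
            for line in fin:
                if not line.strip():
                    continue  # skip blank lines
                fout.write(json.dumps(scrub(json.loads(line))) + "\n")
                stats["records"] += 1
        stats["files"] += 1
    return stats


with tempfile.TemporaryDirectory() as tmp:
    src_root, dst_root = Path(tmp, "in"), Path(tmp, "out")
    (src_root / "original").mkdir(parents=True)
    (src_root / "original" / "a.jsonl").write_text(
        '{"note": "x"}\n{"note": "y"}\n', encoding="utf-8"
    )
    stats = mirror_jsonl(src_root, dst_root, lambda rec: {**rec, "note": "REDACTED"})
    mirrored = (dst_root / "original" / "a.jsonl").exists()
```

The sketch assumes the output directory is not nested inside the input directory; otherwise a recursive glob could pick up files written by an earlier run.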

scripts.deidentify.validate_dataset(dataset_dir, file_pattern='*.jsonl', text_fields=None)[source]

Validate that no PHI remains in de-identified dataset.

Parameters:

dataset_dir (Union[str, Path]) – Directory containing de-identified JSONL files
file_pattern (str) – Glob pattern for files to validate
text_fields (Optional[List[str]]) – List of field names to validate

Return type:

Returns:

Dictionary with validation results

Changed in version 0.3.0: Added explicit public API definition via __all__ (10 exports), enhanced module docstring with comprehensive usage examples (48 lines), and added complete return type annotations.

Overview

The deidentify module provides robust HIPAA/GDPR-compliant de-identification for medical datasets, supporting 14 countries with country-specific regulations, encrypted mapping storage, and comprehensive validation.

Public API:

__all__ = [
    'PHIType',                    # Enum for PHI types
    'DetectionPattern',           # Dataclass for patterns
    'DeidentificationConfig',     # Dataclass for configuration
    'PatternLibrary',             # Pattern library class
    'PseudonymGenerator',         # Pseudonym generation
    'DateShifter',                # Date shifting
    'MappingStore',               # Secure mapping storage
    'DeidentificationEngine',     # Main engine class
    'deidentify_dataset',         # Top-level function
    'validate_dataset',           # Validation function
]

Key Features

Multi-Country Support: HIPAA (US), GDPR (EU/GB), DPDPA (IN), and 11 other countries
PHI/PII Detection: 21 PHI types with country-specific patterns
Pseudonymization: Consistent pseudonyms that can be reversed only through the encrypted mapping
Date Shifting: Preserves time intervals while shifting dates
Encrypted Storage: Fernet encryption for mapping files
Validation: Comprehensive validation to ensure de-identification quality
Audit Trails: Export mappings for compliance audits

Classes

DeidentificationEngine

class scripts.deidentify.DeidentificationEngine(config=None, mapping_store=None)[source]

Bases: object

Main engine for PHI/PII detection and de-identification.

Orchestrates the entire de-identification process:

1. Detects PHI/PII using patterns and optional NER
2. Generates consistent pseudonyms
3. Replaces sensitive data with pseudonyms
4. Stores mappings securely
5. Validates results

__init__(config=None, mapping_store=None)[source]

Initialize de-identification engine.

Parameters:

config (Optional[DeidentificationConfig]) – Configuration object
mapping_store (Optional[MappingStore]) – Optional mapping store (creates default if None)

deidentify_record(record, text_fields=None)[source]

De-identify a dictionary record (e.g., from JSONL).

Parameters:

record (Dict[str, Any]) – Dictionary containing data
text_fields (Optional[List[str]]) – List of field names to de-identify (all string fields if None)

Return type:

Returns:

De-identified record

deidentify_text(text, custom_patterns=None)[source]

De-identify a single text string.

Parameters:

text (str) – Text to de-identify
custom_patterns (Optional[List[DetectionPattern]]) – Optional additional patterns to use

Return type:

Returns:

De-identified text with PHI/PII replaced by pseudonyms

get_statistics()[source]

Get de-identification statistics.

Return type: Dict[str, Any]
Returns: Dictionary with processing statistics

save_mappings()[source]

Save all mappings to secure storage.

Return type: None

validate_deidentification(text, strict=True)[source]

Validate that no PHI remains in de-identified text.

Parameters:

text (str) – De-identified text to validate
strict (bool) – If True, any detection is considered a failure

Return type:

Tuple[bool, List[str]]

Returns:

Tuple of (is_valid, list of potential PHI found)

PseudonymGenerator

class scripts.deidentify.PseudonymGenerator(salt=None)[source]

Bases: object

Generates consistent, deterministic pseudonyms for PHI/PII.

Uses cryptographic hashing to ensure:

- Same input always produces same pseudonym
- Different inputs produce different pseudonyms
- Pseudonyms are not reversible without the mapping table

__init__(salt=None)[source]

Initialize pseudonym generator.

Parameters: salt (Optional[str]) – Optional salt for hash function. If None, generates random salt.

generate(value, phi_type, template)[source]

Generate a pseudonym for a given value.

Parameters:

value (str) – The sensitive value to pseudonymize
phi_type (PHIType) – Type of PHI/PII
template (str) – Template string with {id} placeholder

Return type:

Returns:

Pseudonymized value (e.g., “PATIENT-A4B8”)

get_statistics()[source]

Get statistics on pseudonym generation.

Return type: Dict[str, int]
Returns: Dictionary mapping PHI type to count of unique values

DateShifter

class scripts.deidentify.DateShifter(shift_range_days=365, preserve_intervals=True, seed=None, country_code='US')[source]

Bases: object

Consistent date shifting with intelligent multi-format parsing.

Shifts all dates by a consistent offset while maintaining:

- Relative time intervals between dates
- Original date format (ISO 8601, DD/MM/YYYY, MM/DD/YYYY, hyphen/dot-separated)
- Country-specific format priority for consistent interpretation

Supported formats (auto-detected):

- YYYY-MM-DD (ISO 8601): always tried first (unambiguous)
- DD/MM/YYYY or MM/DD/YYYY (slash-separated): country-specific priority
- DD-MM-YYYY or MM-DD-YYYY (hyphen-separated): country-specific priority
- DD.MM.YYYY (dot-separated, European)
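Interval and format preservation both follow from applying one fixed offset and re-emitting each date with the format it was parsed in. The sketch below shows the idea; `SketchShifter` is a hypothetical stand-in, and drawing the offset uniformly from plus or minus `shift_range_days` is an assumption about how the range is used.

```python
import random
from datetime import datetime, timedelta


class SketchShifter:
    """Illustrative only: one fixed offset applied to every date."""

    def __init__(self, shift_range_days: int = 365, seed=None) -> None:
        rng = random.Random(seed)
        # Draw the offset once so every date moves by the same amount.
        self.offset = timedelta(days=rng.randint(-shift_range_days, shift_range_days))

    def shift(self, text: str, fmt: str = "%Y-%m-%d") -> str:
        # Parse and re-emit with the same format string to preserve the layout.
        return (datetime.strptime(text, fmt) + self.offset).strftime(fmt)


shifter = SketchShifter(seed=7)
admitted = shifter.shift("2020-03-01")
discharged = shifter.shift("2020-03-11")
gap = datetime.strptime(discharged, "%Y-%m-%d") - datetime.strptime(admitted, "%Y-%m-%d")
```

Whatever offset is drawn, the ten days between the two input dates survive the shift, because both dates move by the same `timedelta`.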

Date Interpretation Strategy

The shifter uses a three-tier strategy to handle date ambiguity:

1. Unambiguous Formats (ISO 8601): Always tried first
   - Example: “2020-01-15” is always January 15, 2020 regardless of country
2. Country-Specific Preference: For ambiguous dates
   - India (IN): “08/09/2020” interpreted as DD/MM → September 8, 2020
   - USA (US): “08/09/2020” interpreted as MM/DD → August 9, 2020
3. Smart Validation: Reject logically impossible formats
   - “13/05/2020” can only be DD/MM (no 13th month)
   - “05/25/2020” can only be MM/DD (no 25th month)

Ambiguous Date Handling

For dates where both day and month are ≤ 12 (e.g., 12/12/2012, 08/09/2020):

Consistency Guarantee: All dates from the same country use the same format
Country Setting: The country_code parameter determines interpretation
Transparency: Users know upfront how dates will be interpreted

Examples

Basic usage:

>>> shifter = DateShifter(country_code="IN")
>>> shifter.shift_date("2019-01-11")  # ISO format
'2018-04-13'  # Shifted by offset, format preserved

Ambiguous dates (country-specific):

>>> shifter_india = DateShifter(country_code="IN")
>>> shifter_india.shift_date("08/09/2020")  # DD/MM for India
'18/12/2019'  # September 8, 2020 → shifted

>>> shifter_usa = DateShifter(country_code="US")
>>> shifter_usa.shift_date("08/09/2020")  # MM/DD for USA
'11/28/2019'  # August 9, 2020 → shifted, MM/DD format preserved

Symmetric dates (country preference):

>>> shifter = DateShifter(country_code="IN")
>>> shifter.shift_date("12/12/2012")  # Ambiguous!
'02/03/2012'  # Interpreted as DD/MM (India preference)

Unambiguous dates (validation):

>>> shifter = DateShifter(country_code="IN")
>>> shifter.shift_date("13/05/2020")  # Must be DD/MM
'23/08/2019'  # May 13, 2020 → shifted (13 > 12, can't be month)

Country Format Reference

DD/MM/YYYY countries: IN, ID, BR, ZA, EU, GB, AU, KE, NG, GH, UG
MM/DD/YYYY countries: US, PH, CA
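The interpretation strategy and the country reference above can be combined into a short parsing sketch. `parse_date` is a hypothetical helper that handles only ISO and slash-separated dates; raising an error for country codes outside the two lists is an assumption, since the fallback for unknown codes is not documented here.

```python
from datetime import datetime

# Country lists copied from the reference above.
DAY_FIRST = {"IN", "ID", "BR", "ZA", "EU", "GB", "AU", "KE", "NG", "GH", "UG"}
MONTH_FIRST = {"US", "PH", "CA"}


def parse_date(text: str, country_code: str = "US"):
    """Return (datetime, format) using ISO first, then the country preference."""
    # Tier 1: ISO 8601 is unambiguous, so try it before anything else.
    try:
        return datetime.strptime(text, "%Y-%m-%d"), "%Y-%m-%d"
    except ValueError:
        pass
    # Tier 2: order the slash formats by country preference.
    if country_code in DAY_FIRST:
        candidates = ["%d/%m/%Y", "%m/%d/%Y"]
    elif country_code in MONTH_FIRST:
        candidates = ["%m/%d/%Y", "%d/%m/%Y"]
    else:
        raise ValueError(f"unknown country code: {country_code!r}")  # assumption
    # Tier 3: strptime rejects impossible readings (such as month 13), so the
    # loop falls through to the other format.
    for fmt in candidates:
        try:
            return datetime.strptime(text, fmt), fmt
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


india, _ = parse_date("08/09/2020", "IN")
usa, _ = parse_date("08/09/2020", "US")
forced, forced_fmt = parse_date("13/05/2020", "US")
```

Returning the matched format alongside the parsed value is what lets a shifter write the result back in the layout the input used.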

MappingStore

class scripts.deidentify.MappingStore(storage_path, encryption_key=None, enable_encryption=True)[source]

Bases: object

Secure storage for PHI to pseudonym mappings.

Features:

- Encrypted storage using Fernet (symmetric encryption)
- Separate key management
- JSON serialization
- Audit logging

__init__(storage_path, encryption_key=None, enable_encryption=True)[source]

Initialize mapping store.

Parameters:

storage_path (Path) – Path to store mapping file
encryption_key (Optional[bytes]) – Encryption key (generates new if None)
enable_encryption (bool) – Whether to encrypt mappings

add_mapping(original, pseudonym, phi_type, metadata=None)[source]

Add a mapping entry.

Parameters:

original (str) – Original sensitive value
pseudonym (str) – Pseudonymized value
phi_type (PHIType) – Type of PHI
metadata (Optional[Dict]) – Optional additional metadata

Return type:

export_for_audit(output_path, include_originals=False)[source]

Export mappings for audit purposes.

Parameters:

output_path (Path) – Path to export file
include_originals (bool) – Whether to include original values (dangerous!)

Return type:

get_pseudonym(original, phi_type)[source]

Retrieve pseudonym for original value.

Parameters:

original (str) – Original value
phi_type (PHIType) – Type of PHI

Return type:

Optional[str]

Returns:

Pseudonym if exists, None otherwise

save_mappings()[source]

Save mappings to storage file.

Return type: None

PatternLibrary

class scripts.deidentify.PatternLibrary[source]

Bases: object

Library of regex patterns for detecting PHI/PII.

static get_country_specific_patterns(countries=None)[source]

Get country-specific detection patterns.

Parameters: countries (Optional[List[str]]) – List of country codes or None for all
Return type: List[DetectionPattern]
Returns: List of DetectionPattern objects for country-specific identifiers

static get_default_patterns()[source]

Get default detection patterns for common PHI/PII types.

Return type: List[DetectionPattern]
Returns: List of DetectionPattern objects sorted by priority.

Enumerations

PHIType

class scripts.deidentify.PHIType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

PHI/PII type categorization.

ACCOUNT_NUMBER = 'ACCOUNT'

ADDRESS_CITY = 'CITY'

ADDRESS_STATE = 'STATE'

ADDRESS_STREET = 'STREET'

ADDRESS_ZIP = 'ZIP'

AGE_OVER_89 = 'AGE'

CUSTOM = 'CUSTOM'

DATE = 'DATE'

DEVICE_ID = 'DEVICE'

EMAIL = 'EMAIL'

IP_ADDRESS = 'IP'

LICENSE_NUMBER = 'LICENSE'

LOCATION = 'LOCATION'

MRN = 'MRN'

NAME_FIRST = 'FNAME'

NAME_FULL = 'PATIENT'

NAME_LAST = 'LNAME'

ORGANIZATION = 'ORG'

PHONE = 'PHONE'

SSN = 'SSN'

URL = 'URL'

Data Classes

DetectionPattern

class scripts.deidentify.DetectionPattern(phi_type, pattern, priority=50, description='')[source]

Bases: object

PHI/PII detection pattern configuration.

description: str = ''

pattern: Pattern

phi_type: PHIType

priority: int = 50

DeidentificationConfig

class scripts.deidentify.DeidentificationConfig(pseudonym_templates=<factory>, enable_date_shifting=True, date_shift_range_days=365, preserve_date_intervals=True, enable_encryption=True, encryption_key=None, enable_validation=True, strict_mode=True, log_detections=True, log_level=20, countries=None, enable_country_patterns=True)[source]

Bases: object

De-identification engine configuration.

countries: Optional[List[str]] = None

date_shift_range_days: int = 365

enable_country_patterns: bool = True

enable_date_shifting: bool = True

enable_encryption: bool = True

enable_validation: bool = True

encryption_key: Optional[bytes] = None

log_detections: bool = True

log_level: int = 20

preserve_date_intervals: bool = True

pseudonym_templates: Dict[PHIType, str]

strict_mode: bool = True

Functions

deidentify_dataset

scripts.deidentify.deidentify_dataset(input_dir, output_dir, text_fields=None, config=None, file_pattern='*.jsonl', process_subdirs=True)[source]

Batch de-identification of JSONL dataset files.

Processes JSONL files while maintaining directory structure. If the input directory contains subdirectories (e.g., ‘original/’, ‘cleaned/’), the same structure will be replicated in the output directory.

Parameters:

input_dir (Union[str, Path]) – Directory containing JSONL files (may have subdirectories)
output_dir (Union[str, Path]) – Directory to write de-identified files (maintains structure)
text_fields (Optional[List[str]]) – List of field names to de-identify (all string fields if None)
config (Optional[DeidentificationConfig]) – De-identification configuration
file_pattern (str) – Glob pattern for files to process
process_subdirs (bool) – If True, recursively process subdirectories

Return type:

Returns:

Dictionary with processing statistics

validate_dataset

scripts.deidentify.validate_dataset(dataset_dir, file_pattern='*.jsonl', text_fields=None)[source]

Validate that no PHI remains in de-identified dataset.

Parameters:

dataset_dir (Union[str, Path]) – Directory containing de-identified JSONL files
file_pattern (str) – Glob pattern for files to validate
text_fields (Optional[List[str]]) – List of field names to validate

Return type: