scripts.deidentify module

De-identification and Pseudonymization Module

Robust PHI/PII detection and replacement with encrypted mapping storage, country-specific compliance, and comprehensive validation.

The module is designed to support HIPAA/GDPR compliance work on medical datasets. It covers 14 countries, each with country-specific regulations and identifier patterns.

Note: This module provides tools to assist with compliance but does not guarantee regulatory compliance. Users are responsible for validating that the de-identification meets their specific regulatory requirements.

Key Features:
  • PHI/PII detection using regex patterns (18+ identifier types)

  • Country-specific regulations (14 countries: US, IN, ID, BR, etc.)

  • Pseudonymization with deterministic hashing

  • Date shifting with interval preservation

  • Encrypted mapping storage

  • Comprehensive validation

  • Verbose logging: Detailed tree-view logs with timing (v0.0.12+)

Verbose Mode:

When run with the --verbose flag, the module writes detailed logs that include file-by-file de-identification progress, record processing counts (every 1000 records), per-file and overall timing, and validation results with issue tracking.

See also

  • ../user_guide/deidentification - De-identification guide and examples

  • ../user_guide/country_regulations - Country-specific regulations

  • DeidentificationEngine - Core de-identification engine

  • deidentify_dataset - Main de-identification function

class scripts.deidentify.DateShifter(shift_range_days=365, preserve_intervals=True, seed=None, country_code='US')[source]

Bases: object

Consistent date shifting with intelligent multi-format parsing.

Shifts all dates by a consistent offset while maintaining:

  • Relative time intervals between dates

  • Original date format (ISO 8601, DD/MM/YYYY, MM/DD/YYYY, hyphen/dot-separated)

  • Country-specific format priority for consistent interpretation

Supported formats (auto-detected):

  • YYYY-MM-DD (ISO 8601): always tried first (unambiguous)

  • DD/MM/YYYY or MM/DD/YYYY (slash-separated): country-specific priority

  • DD-MM-YYYY or MM-DD-YYYY (hyphen-separated): country-specific priority

  • DD.MM.YYYY (dot-separated, European)

Date Interpretation Strategy

The shifter uses a three-tier strategy to handle date ambiguity:

  1. Unambiguous Formats (ISO 8601): Always tried first. Example: “2020-01-15” is always January 15, 2020, regardless of country.

  2. Country-Specific Preference: Applied to ambiguous dates. India (IN): “08/09/2020” is interpreted as DD/MM → September 8, 2020. USA (US): “08/09/2020” is interpreted as MM/DD → August 9, 2020.

  3. Smart Validation: Rejects logically impossible readings. “13/05/2020” can only be DD/MM (no 13th month); “05/25/2020” can only be MM/DD (no 25th month).

Ambiguous Date Handling

For dates where both day and month are ≤ 12 (e.g., 12/12/2012, 08/09/2020):

  • Consistency Guarantee: All dates from the same country use the same format

  • Country Setting: The country_code parameter determines interpretation

  • Transparency: Users know upfront how dates will be interpreted
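The three-tier strategy can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: the function name `parse_date` and the `DAY_FIRST` set are hypothetical, and the country grouping is copied from the Country Format Reference in this document.

```python
from datetime import datetime
from typing import Tuple

# Assumption: day-first country codes as listed in the Country Format Reference.
DAY_FIRST = {"IN", "ID", "BR", "ZA", "EU", "GB", "AU", "KE", "NG", "GH", "UG"}


def parse_date(date_str: str, country_code: str = "US") -> Tuple[datetime, str]:
    """Return (parsed date, strptime format) using a three-tier strategy."""
    # Tier 1: ISO 8601 is unambiguous, so it is tried first.
    candidates = ["%Y-%m-%d"]
    # Tier 2: order slash and hyphen formats by country preference.
    for sep in ("/", "-"):
        day_first = f"%d{sep}%m{sep}%Y"
        month_first = f"%m{sep}%d{sep}%Y"
        if country_code in DAY_FIRST:
            candidates += [day_first, month_first]
        else:
            candidates += [month_first, day_first]
    candidates.append("%d.%m.%Y")  # dot-separated dates are treated as day-first
    # Tier 3: strptime rejects impossible readings (for example month 13),
    # so the loop falls through to the only format that can be valid.
    for fmt in candidates:
        try:
            return datetime.strptime(date_str, fmt), fmt
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {date_str!r}")
```

Returning the matched format alongside the date lets a caller render the shifted date in the same layout it arrived in.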

Examples

Basic usage:

>>> shifter = DateShifter(country_code="IN")
>>> shifter.shift_date("2019-01-11")  # ISO format
'2018-04-13'  # Shifted by offset, format preserved

Ambiguous dates (country-specific):

>>> shifter_india = DateShifter(country_code="IN")
>>> shifter_india.shift_date("08/09/2020")  # DD/MM for India
'18/12/2019'  # September 8, 2020 → shifted

>>> shifter_usa = DateShifter(country_code="US")
>>> shifter_usa.shift_date("08/09/2020")  # MM/DD for USA
'11/28/2019'  # August 9, 2020 → shifted (MM/DD format preserved)

Symmetric dates (country preference):

>>> shifter = DateShifter(country_code="IN")
>>> shifter.shift_date("12/12/2012")  # Ambiguous!
'02/03/2012'  # Interpreted as DD/MM (India preference)

Unambiguous dates (validation):

>>> shifter = DateShifter(country_code="IN")
>>> shifter.shift_date("13/05/2020")  # Must be DD/MM
'23/08/2019'  # May 13, 2020 → shifted (13 > 12, can't be month)

Country Format Reference

  • DD/MM/YYYY countries: IN, ID, BR, ZA, EU, GB, AU, KE, NG, GH, UG

  • MM/DD/YYYY countries: US, PH, CA

__init__(shift_range_days=365, preserve_intervals=True, seed=None, country_code='US')[source]

Initialize date shifter with country-specific format interpretation.

Parameters:
  • shift_range_days (int) – Maximum days to shift (±), default 365

  • preserve_intervals (bool) – If True, all dates shift by same offset (recommended for consistency)

  • seed (Optional[str]) – Optional seed for deterministic shift generation (same seed = same shift)

  • country_code (str) – Country code determining date format priority for ambiguous dates. “IN”, “ID”, “BR”, “ZA”, “EU”, “GB”, “AU”, “KE”, “NG”, “GH”, “UG”: DD/MM/YYYY. “US”, “PH”, “CA”: MM/DD/YYYY. Default: “US”.

Note

The country_code setting ensures consistent interpretation of ambiguous dates (e.g., “08/09/2020” or “12/12/2012”). All dates from the same country will use the same format rules. ISO 8601 dates (YYYY-MM-DD) are always unambiguous and parse correctly regardless of country_code.
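One way a seed could map to a reproducible offset is to hash it and fold the digest into the allowed range. The sketch below is an assumption about how "same seed = same shift" might be achieved; `derive_shift_days` is a hypothetical helper and the module may derive its offset differently.

```python
import hashlib
import random
from typing import Optional


def derive_shift_days(seed: Optional[str], shift_range_days: int = 365) -> int:
    """Map a seed to an offset in [-shift_range_days, +shift_range_days]."""
    if seed is None:
        # No seed: a fresh random offset on each call (not reproducible).
        return random.randint(-shift_range_days, shift_range_days)
    digest = hashlib.sha256(seed.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and fold it into the range.
    span = 2 * shift_range_days + 1
    return int.from_bytes(digest[:8], "big") % span - shift_range_days
```

Note that an offset of zero is a possible result in this sketch; a production implementation may want to exclude it.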

Examples

>>> # India dataset - interpret slash dates as DD/MM
>>> shifter_in = DateShifter(country_code="IN")
>>> shifter_in.shift_date("08/09/2020")  # September 8, 2020
>>> # USA dataset - interpret slash dates as MM/DD
>>> shifter_us = DateShifter(country_code="US")
>>> shifter_us.shift_date("08/09/2020")  # August 9, 2020

shift_date(date_str, date_format=None)[source]

Shift a date string by consistent offset with intelligent format detection.

The algorithm prioritizes unambiguous formats (ISO 8601) and uses country-specific format preferences for ambiguous dates. This ensures consistency for dates like 12/12/2012 or 08/09/2020 which could be interpreted multiple ways.

Parameters:
  • date_str (str) – Date string to shift (e.g., “2019-01-11”, “04/09/2014”, “12/12/2012”)

  • date_format (Optional[str]) – Specific format to use (auto-detected if None)

Return type:

str

Returns:

Shifted date in same format as input

Examples

>>> shifter = DateShifter(country_code="IN")
>>> shifter.shift_date("2019-01-11")
'2018-04-13'  # ISO format preserved (illustrative offset of -273 days)
>>> shifter.shift_date("04/09/2014")
'05/12/2013'  # DD/MM/YYYY format (India interprets as Sep 4), same offset
>>> shifter.shift_date("12/12/2012")
'14/03/2012'  # DD/MM/YYYY format (country preference trusted), same offset

Note

For ambiguous dates where both numbers are ≤ 12 (e.g., 12/12/2012, 08/09/2020), the country-specific format is used consistently based on the country_code setting. This ensures all dates from the same country are interpreted with the same rules.
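Interval preservation follows directly from applying one offset to every date. The sketch below uses a hypothetical `shift` helper and a hard-coded offset of -273 days to show that a seven-day gap between two dates survives the shift and that the input format is kept.

```python
from datetime import datetime, timedelta


def shift(date_str: str, fmt: str, offset_days: int) -> str:
    """Shift a date by offset_days and render it in the format it arrived in."""
    shifted = datetime.strptime(date_str, fmt) + timedelta(days=offset_days)
    return shifted.strftime(fmt)


offset = -273  # hypothetical fixed offset shared by every date in a dataset
fmt = "%d/%m/%Y"
admitted = shift("04/09/2014", fmt, offset)
discharged = shift("11/09/2014", fmt, offset)
gap = datetime.strptime(discharged, fmt) - datetime.strptime(admitted, fmt)
print(admitted, discharged, gap.days)  # 05/12/2013 12/12/2013 7
```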

class scripts.deidentify.DeidentificationConfig(pseudonym_templates=<factory>, enable_date_shifting=True, date_shift_range_days=365, preserve_date_intervals=True, enable_encryption=True, encryption_key=None, enable_validation=True, strict_mode=True, log_detections=True, log_level=20, countries=None, enable_country_patterns=True)[source]

Bases: object

De-identification engine configuration.

__init__(pseudonym_templates=<factory>, enable_date_shifting=True, date_shift_range_days=365, preserve_date_intervals=True, enable_encryption=True, encryption_key=None, enable_validation=True, strict_mode=True, log_detections=True, log_level=20, countries=None, enable_country_patterns=True)
countries: Optional[List[str]] = None
date_shift_range_days: int = 365
enable_country_patterns: bool = True
enable_date_shifting: bool = True
enable_encryption: bool = True
enable_validation: bool = True
encryption_key: Optional[bytes] = None
log_detections: bool = True
log_level: int = 20
preserve_date_intervals: bool = True
pseudonym_templates: Dict[PHIType, str]
strict_mode: bool = True

class scripts.deidentify.DeidentificationEngine(config=None, mapping_store=None)[source]

Bases: object

Main engine for PHI/PII detection and de-identification.

Orchestrates the entire de-identification process:

  1. Detects PHI/PII using patterns and optional NER

  2. Generates consistent pseudonyms

  3. Replaces sensitive data with pseudonyms

  4. Stores mappings securely

  5. Validates results

__init__(config=None, mapping_store=None)[source]

Initialize de-identification engine.

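Parameters:
  • config (Optional[DeidentificationConfig]) – De-identification configuration

  • mapping_store (Optional[MappingStore]) – Store for PHI to pseudonym mappings

deidentify_record(record, text_fields=None)[source]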

De-identify a dictionary record (e.g., from JSONL).

Parameters:
  • record (Dict[str, Any]) – Dictionary containing data

  • text_fields (Optional[List[str]]) – List of field names to de-identify (all string fields if None)

Return type:

Dict[str, Any]

Returns:

De-identified record
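The field-selection rule described above (named fields only, or every string field when text_fields is None) can be sketched as follows. `apply_to_text_fields` and its `transform` argument are hypothetical; the sketch handles only top-level fields and is not the engine's code.

```python
from typing import Any, Callable, Dict, List, Optional


def apply_to_text_fields(
    record: Dict[str, Any],
    transform: Callable[[str], str],
    text_fields: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Return a copy of record with transform applied to selected string fields."""
    result = dict(record)  # shallow copy, so the caller's record is not mutated
    for key, value in record.items():
        selected = text_fields is None or key in text_fields
        if selected and isinstance(value, str):
            result[key] = transform(value)
    return result
```

Non-string values such as integers pass through unchanged, which keeps numeric fields usable after de-identification.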

deidentify_text(text, custom_patterns=None)[source]

De-identify a single text string.

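Parameters:
  • text (str) – Text to de-identify

  • custom_patterns – Optional additional detection patterns (default: None)

Return type: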

str

Returns:

De-identified text with PHI/PII replaced by pseudonyms
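As a rough illustration of regex-based replacement, the sketch below substitutes each match with a placeholder and reuses the same placeholder when a value repeats. The two patterns, the placeholder layout, and the `replace_phi` name are assumptions made for the example; they are not the module's pattern library or pseudonym format.

```python
import re
from typing import Dict

# Hypothetical patterns, for illustration only.
PATTERNS: Dict[str, re.Pattern] = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def replace_phi(text: str) -> str:
    """Replace each match with a numbered placeholder, reused for repeats."""
    seen: Dict[str, str] = {}
    for label, pattern in PATTERNS.items():

        def substitute(match: re.Match, label: str = label) -> str:
            value = match.group(0)
            if value not in seen:
                seen[value] = f"[{label}-{len(seen) + 1}]"
            return seen[value]

        text = pattern.sub(substitute, text)
    return text


print(replace_phi("Mail a@b.org or a@b.org, SSN 123-45-6789"))
# Mail [EMAIL-1] or [EMAIL-1], SSN [SSN-2]
```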

get_statistics()[source]

Get de-identification statistics.

Return type:

Dict[str, Any]

Returns:

Dictionary with processing statistics

save_mappings()[source]

Save all mappings to secure storage.

Return type:

None

validate_deidentification(text, strict=True)[source]

Validate that no PHI remains in de-identified text.

Parameters:
  • text (str) – De-identified text to validate

  • strict (bool) – If True, any detection is considered a failure

Return type:

Tuple[bool, List[str]]

Returns:

Tuple of (is_valid, list of potential PHI found)
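A validation pass can be thought of as re-running detection over the output and reporting anything that still matches. In the sketch below, the `CHECKS` patterns and the `validate` function are hypothetical, and the non-strict behavior (findings reported but not treated as failure) is an assumption, since only the strict case is documented here.

```python
import re
from typing import List, Tuple

# Hypothetical residual-PHI checks, for illustration only.
CHECKS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def validate(text: str, strict: bool = True) -> Tuple[bool, List[str]]:
    """Return (is_valid, findings) for a de-identified text."""
    findings = [
        f"{label}: {match.group(0)}"
        for label, pattern in CHECKS.items()
        for match in pattern.finditer(text)
    ]
    # Strict mode: any finding is a failure.
    # Assumption: in non-strict mode findings are reported as warnings only.
    is_valid = not findings if strict else True
    return is_valid, findings
```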

class scripts.deidentify.DetectionPattern(phi_type, pattern, priority=50, description='')[source]

Bases: object

PHI/PII detection pattern configuration.

__init__(phi_type, pattern, priority=50, description='')
description: str = ''
pattern: Pattern
phi_type: PHIType
priority: int = 50

class scripts.deidentify.MappingStore(storage_path, encryption_key=None, enable_encryption=True)[source]

Bases: object

Secure storage for PHI to pseudonym mappings.

Features:

  • Encrypted storage using Fernet (symmetric encryption)

  • Separate key management

  • JSON serialization

  • Audit logging

__init__(storage_path, encryption_key=None, enable_encryption=True)[source]

Initialize mapping store.

Parameters:
  • storage_path (Path) – Path to store mapping file

  • encryption_key (Optional[bytes]) – Encryption key (generates new if None)

  • enable_encryption (bool) – Whether to encrypt mappings

add_mapping(original, pseudonym, phi_type, metadata=None)[source]

Add a mapping entry.

Parameters:
  • original (str) – Original sensitive value

  • pseudonym (str) – Pseudonymized value

  • phi_type (PHIType) – Type of PHI

  • metadata (Optional[Dict]) – Optional additional metadata

Return type:

None

export_for_audit(output_path, include_originals=False)[source]

Export mappings for audit purposes.

Parameters:
  • output_path (Path) – Path to export file

  • include_originals (bool) – Whether to include original values (dangerous!)

Return type:

None
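The reason include_originals defaults to False is that an audit export containing original values is itself re-identifying data. A minimal sketch of that behavior, assuming mappings are held as a list of dictionaries with hypothetical keys `original`, `pseudonym`, and `phi_type`:

```python
import json
from pathlib import Path
from typing import Dict, List


def export_audit(
    mappings: List[Dict[str, str]],
    output_path: Path,
    include_originals: bool = False,
) -> None:
    """Write mapping entries as JSON, dropping original values unless requested."""
    rows = []
    for entry in mappings:
        row = {"pseudonym": entry["pseudonym"], "phi_type": entry["phi_type"]}
        if include_originals:
            # Re-identifying data: the exported file needs the same protection
            # as the mapping store itself.
            row["original"] = entry["original"]
        rows.append(row)
    output_path.write_text(json.dumps(rows, indent=2), encoding="utf-8")
```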

get_pseudonym(original, phi_type)[source]

Retrieve pseudonym for original value.

Parameters:
  • original (str) – Original value

  • phi_type (PHIType) – Type of PHI

Return type:

Optional[str]

Returns:

Pseudonym if exists, None otherwise

save_mappings()[source]

Save mappings to storage file.

Return type:

None

class scripts.deidentify.PHIType(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

PHI/PII type categorization.

ACCOUNT_NUMBER = 'ACCOUNT'
ADDRESS_CITY = 'CITY'
ADDRESS_STATE = 'STATE'
ADDRESS_STREET = 'STREET'
ADDRESS_ZIP = 'ZIP'
AGE_OVER_89 = 'AGE'
CUSTOM = 'CUSTOM'
DATE = 'DATE'
DEVICE_ID = 'DEVICE'
EMAIL = 'EMAIL'
IP_ADDRESS = 'IP'
LICENSE_NUMBER = 'LICENSE'
LOCATION = 'LOCATION'
MRN = 'MRN'
NAME_FIRST = 'FNAME'
NAME_FULL = 'PATIENT'
NAME_LAST = 'LNAME'
ORGANIZATION = 'ORG'
PHONE = 'PHONE'
SSN = 'SSN'
URL = 'URL'

class scripts.deidentify.PatternLibrary[source]

Bases: object

Library of regex patterns for detecting PHI/PII.

static get_country_specific_patterns(countries=None)[source]

Get country-specific detection patterns.

Parameters:

countries (Optional[List[str]]) – List of country codes or None for all

Return type:

List[DetectionPattern]

Returns:

List of DetectionPattern objects for country-specific identifiers

static get_default_patterns()[source]

Get default detection patterns for common PHI/PII types.

Return type:

List[DetectionPattern]

Returns:

List of DetectionPattern objects sorted by priority.
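Priority ordering matters when patterns can overlap, for example a specific SSN pattern and a generic run of digits. The sketch below mirrors the documented DetectionPattern fields in a stand-in dataclass; `SketchPattern`, the sample patterns, and the choice that a higher priority value is applied first are all assumptions.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class SketchPattern:
    """Stand-in with the same field names as DetectionPattern."""

    phi_type: str
    pattern: re.Pattern
    priority: int = 50
    description: str = ""


def sort_by_priority(patterns: List[SketchPattern]) -> List[SketchPattern]:
    # Assumption: a higher priority value means the pattern is applied earlier.
    return sorted(patterns, key=lambda p: p.priority, reverse=True)


patterns = [
    SketchPattern("ACCOUNT", re.compile(r"\b\d{6,}\b"), 40, "generic digit run"),
    SketchPattern("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 90, "US SSN"),
    SketchPattern("EMAIL", re.compile(r"\b\S+@\S+\.\S+\b")),
]
print([p.phi_type for p in sort_by_priority(patterns)])  # ['SSN', 'EMAIL', 'ACCOUNT']
```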

class scripts.deidentify.PseudonymGenerator(salt=None)[source]

Bases: object

Generates consistent, deterministic pseudonyms for PHI/PII.

Uses cryptographic hashing to ensure:

  • Same input always produces same pseudonym

  • Different inputs produce different pseudonyms

  • Pseudonyms are not reversible without the mapping table

__init__(salt=None)[source]

Initialize pseudonym generator.

Parameters:

salt (Optional[str]) – Optional salt for hash function. If None, generates random salt.

generate(value, phi_type, template)[source]

Generate a pseudonym for a given value.

Parameters:
  • value (str) – The sensitive value to pseudonymize

  • phi_type (PHIType) – Type of PHI/PII

  • template (str) – Template string with {id} placeholder

Return type:

str

Returns:

Pseudonymized value (e.g., “PATIENT-A4B8”)
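Deterministic pseudonyms can be produced by hashing the salted value and formatting a short slice of the digest into the template. The class below is a sketch under assumptions: the name `SketchPseudonymizer`, the use of SHA-256, and the 8-character ID length are illustrative choices, not details confirmed by this reference.

```python
import hashlib
import secrets
from typing import Optional


class SketchPseudonymizer:
    """Illustrative salted-hash pseudonym generator."""

    def __init__(self, salt: Optional[str] = None) -> None:
        # A random salt is generated when none is supplied.
        self.salt = salt if salt is not None else secrets.token_hex(16)

    def generate(self, value: str, phi_type: str, template: str) -> str:
        material = f"{self.salt}:{phi_type}:{value}".encode("utf-8")
        # Truncating the digest keeps pseudonyms short, at the cost of a
        # small collision probability between distinct inputs.
        short_id = hashlib.sha256(material).hexdigest()[:8].upper()
        return template.format(id=short_id)
```

With a fixed salt the same value always yields the same pseudonym, which keeps records linkable across files; with a fresh random salt, pseudonyms differ between runs.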

get_statistics()[source]

Get statistics on pseudonym generation.

Return type:

Dict[str, int]

Returns:

Dictionary mapping PHI type to count of unique values

scripts.deidentify.deidentify_dataset(input_dir, output_dir, text_fields=None, config=None, file_pattern='*.jsonl', process_subdirs=True)[source]

Batch de-identification of JSONL dataset files.

Processes JSONL files while maintaining directory structure. If the input directory contains subdirectories (e.g., ‘original/’, ‘cleaned/’), the same structure will be replicated in the output directory.

Parameters:
  • input_dir (Union[str, Path]) – Directory containing JSONL files (may have subdirectories)

  • output_dir (Union[str, Path]) – Directory to write de-identified files (maintains structure)

  • text_fields (Optional[List[str]]) – List of field names to de-identify (all string fields if None)

  • config (Optional[DeidentificationConfig]) – De-identification configuration

  • file_pattern (str) – Glob pattern for files to process

  • process_subdirs (bool) – If True, recursively process subdirectories

Return type:

Dict[str, Any]

Returns:

Dictionary with processing statistics
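Mirroring the directory structure comes down to computing each file's path relative to the input directory and recreating it under the output directory. The sketch below shows that traversal with a hypothetical `process_tree` helper and a caller-supplied `transform`; the statistics keys are invented for the example.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict


def process_tree(
    input_dir: Path,
    output_dir: Path,
    transform: Callable[[Dict[str, Any]], Dict[str, Any]],
    file_pattern: str = "*.jsonl",
    process_subdirs: bool = True,
) -> Dict[str, int]:
    """Apply transform to every JSONL record, mirroring the input layout."""
    found = input_dir.rglob(file_pattern) if process_subdirs else input_dir.glob(file_pattern)
    stats = {"files": 0, "records": 0}
    for src in sorted(found):  # materialize the listing before writing output
        dst = output_dir / src.relative_to(input_dir)  # same relative path
        dst.parent.mkdir(parents=True, exist_ok=True)
        with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
            for line in fin:
                if not line.strip():
                    continue  # skip blank lines
                fout.write(json.dumps(transform(json.loads(line))) + "\n")
                stats["records"] += 1
        stats["files"] += 1
    return stats
```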

scripts.deidentify.validate_dataset(dataset_dir, file_pattern='*.jsonl', text_fields=None)[source]

Validate that no PHI remains in de-identified dataset.

Parameters:
  • dataset_dir (Union[str, Path]) – Directory containing de-identified JSONL files

  • file_pattern (str) – Glob pattern for files to validate

  • text_fields (Optional[List[str]]) – List of field names to validate

Return type:

Dict[str, Any]

Returns:

Dictionary with validation results

Changed in version 0.3.0: Added explicit public API definition via __all__ (10 exports), enhanced module docstring with comprehensive usage examples (48 lines), and added complete return type annotations.

Overview

The deidentify module provides de-identification designed to support HIPAA/GDPR compliance for medical datasets, covering 14 countries with country-specific regulations, encrypted mapping storage, and comprehensive validation.

Public API:

__all__ = [
    'PHIType',                    # Enum for PHI types
    'DetectionPattern',           # Dataclass for patterns
    'DeidentificationConfig',     # Dataclass for configuration
    'PatternLibrary',             # Pattern library class
    'PseudonymGenerator',         # Pseudonym generation
    'DateShifter',                # Date shifting
    'MappingStore',               # Secure mapping storage
    'DeidentificationEngine',     # Main engine class
    'deidentify_dataset',         # Top-level function
    'validate_dataset',           # Validation function
]

Key Features

  • Multi-Country Support: HIPAA (US), GDPR (EU/GB), DPDPA (IN), and the remaining supported countries (14 in total)

  • PHI/PII Detection: 21 PHI types with country-specific patterns

  • Pseudonymization: Consistent pseudonyms, reversible only through the encrypted mapping

  • Date Shifting: Preserves time intervals while shifting dates

  • Encrypted Storage: Fernet encryption for mapping files

  • Validation: Comprehensive validation to ensure de-identification quality

  • Audit Trails: Export mappings for compliance audits
