De-identification

For Users: Protecting Privacy

This feature helps you protect sensitive patient information by replacing real names, dates, and other personal details with safe placeholders. It follows medical privacy rules from 14 different countries.

Changed in version 0.3.0: Enhanced privacy protection with improved detection and better support for international regulations.

See also

For information about privacy rules in specific countries, see Country-Specific Privacy Rules.

Overview

The privacy protection feature can detect and replace 21 types of personal information, including:

  β€’ Names and Addresses: Patient names, street addresses, cities

  β€’ Dates: Birth dates, admission dates, appointment dates

  β€’ ID Numbers: Social security numbers, medical record numbers, account numbers

  β€’ Contact Info: Phone numbers, email addresses, website URLs

It also provides:

  β€’ Encrypted Storage: Fernet encryption for mapping tables with secure key management

  β€’ Date Shifting: Preserves temporal relationships while shifting dates by a consistent offset

  β€’ Validation: Comprehensive validation to ensure no PHI leakage in de-identified output

  β€’ Security: Built-in encryption, access control, and audit trail capabilities

  β€’ Directory Structure Preservation: Maintains original file organization during batch processing

What’s New in 0.8.6

Better Privacy Protection:
  • Improved detection of sensitive information

  • More secure replacement methods

  • Easier to use with better error messages

Enhanced Security:
  • Stronger encryption for mapping files

  • Better protection of patient information

  • Comprehensive audit trail for compliance

Improved Documentation:
  • Clear examples for common tasks

  • Step-by-step privacy protection guides

  • Easy-to-follow security best practices

How It Works

Privacy protection happens automatically as part of the data processing:

Step 1: Data Extraction

Your Excel files are converted to a simpler format (JSONL):

  • results/dataset/Indo-vap/original/ - All your data preserved

  • results/dataset/Indo-vap/cleaned/ - Duplicate information removed

Step 2: Privacy Protection (Optional)

Both folders are protected while keeping the same structure:

  • results/deidentified/Indo-vap/original/ - Protected original files

  • results/deidentified/Indo-vap/cleaned/ - Protected cleaned files

  • results/deidentified/mappings/mappings.enc - Encrypted lookup table

  • results/deidentified/Indo-vap/_deidentification_audit.json - Record of changes

What You Get:

  1. Consistent Replacement: The same name always gets the same safe placeholder (see the short example after this list)

  2. Secure Storage: Your lookup table is encrypted and protected

  3. Same Organization: Protected files are organized exactly like your original files

  4. Complete Record: Full audit trail of what was changed (without showing the original values)

  5. Easy Review: You can verify the protection worked correctly
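
For example, the same name maps to the same placeholder every time it appears. A minimal sketch using the Python API described later in this guide (the exact placeholder values will differ):

from scripts.deidentify import DeidentificationEngine

engine = DeidentificationEngine()

# "John Doe" receives the same [PATIENT-...] placeholder in both calls
first = engine.deidentify_text("Patient John Doe admitted on 01/15/2020")
second = engine.deidentify_text("Follow-up note for John Doe")
print(first)
print(second)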

Getting Started

Basic Usage

To protect patient privacy in your data, run:

# Protect data for United States privacy rules
python main.py --enable-deidentification --countries US

# Protect data for multiple countries
python main.py --enable-deidentification --countries IN US GB

This will:

  β€’ Find and replace sensitive information like names, dates, and phone numbers

  β€’ Create protected versions of your files

  β€’ Save an encrypted lookup table so you can track changes

  β€’ Generate a report showing what was protected

What Gets Protected

The privacy feature protects 21 types of sensitive information; the full list is given in Supported PHI/PII Types below.

The public API can be imported from scripts.deidentify:

from scripts.deidentify import (
    MappingStore,            # Secure mapping storage
    DeidentificationEngine,  # Main engine

    # Top-level Functions
    deidentify_dataset,      # Batch processing
    validate_dataset,        # Validation
)

What to Import:

  • For Basic Use: Import DeidentificationEngine and optionally DeidentificationConfig

  • For Batch Processing: Import deidentify_dataset and validate_dataset

  • For Advanced Use: Import specific classes like DateShifter, MappingStore, etc.

  • For Custom Patterns: Import PHIType and DetectionPattern

Example - Basic Usage:

from scripts.deidentify import DeidentificationEngine, DeidentificationConfig

# Configure with custom settings
config = DeidentificationConfig(
    enable_date_shifting=True,
    enable_encryption=True,
    countries=['US', 'IN']
)

# Create engine
engine = DeidentificationEngine(config=config)

# De-identify text
text = "Patient John Doe, MRN: AB123456, DOB: 01/15/1980"
deidentified = engine.deidentify_text(text)
print(deidentified)
# Output: "Patient [PATIENT-A4B8], MRN: [MRN-X7Y2], DOB: [DATE-1980-01-15]"

Example - Batch Processing:

from scripts.deidentify import deidentify_dataset, validate_dataset

# Process entire dataset
stats = deidentify_dataset(
    input_dir="data/patient_records",
    output_dir="data/deidentified",
    config=config
)

# Validate results
validation = validate_dataset(
    dataset_dir="data/deidentified"
)

if validation['is_valid']:
    print("βœ“ No PHI detected in output")
else:
    print(f"⚠ Found {len(validation['potential_phi_found'])} issues")

Example - Custom Patterns:

from scripts.deidentify import (
    DeidentificationEngine,
    PHIType,
    DetectionPattern
)
import re

# Define custom pattern for employee IDs
custom_pattern = DetectionPattern(
    phi_type=PHIType.CUSTOM,
    pattern=re.compile(r'EMP-\d{6}'),
    priority=85,
    description="Employee ID format: EMP-XXXXXX"
)

# Use with engine
engine = DeidentificationEngine()
text = "Employee EMP-123456 accessed record"
deidentified = engine.deidentify_text(text, custom_patterns=[custom_pattern])

Basic Usage

from scripts.deidentify import DeidentificationEngine

# Initialize engine
engine = DeidentificationEngine()

# De-identify text
original = "Patient John Doe, MRN: 123456, DOB: 01/15/1980"
deidentified = engine.deidentify_text(original)
# Output: "Patient [PATIENT-A4B8], MRN: [MRN-X7Y2], DOB: [DATE-1980-01-15]"

# Save mappings
engine.save_mappings()

Batch Processing

from scripts.deidentify import deidentify_dataset

# Process entire dataset (maintains directory structure)
# Input directory contains: original/ and cleaned/ subdirectories
stats = deidentify_dataset(
    input_dir="results/dataset/Indo-vap",
    output_dir="results/deidentified/Indo-vap",
    process_subdirs=True  # Recursively process subdirectories
)

print(f"Processed {stats['texts_processed']} texts")
print(f"Detected {stats['total_detections']} PHI items")

# Output structure:
# results/deidentified/Indo-vap/
#   β”œβ”€β”€ original/          (de-identified original files)
#   β”œβ”€β”€ cleaned/           (de-identified cleaned files)
#   └── _deidentification_audit.json

Command Line Interface

# Basic usage - processes subdirectories recursively
python -m scripts.deidentify \
    --input-dir results/dataset/Indo-vap \
    --output-dir results/deidentified/Indo-vap

# With validation
python -m scripts.deidentify \
    --input-dir results/dataset/Indo-vap \
    --output-dir results/deidentified/Indo-vap \
    --validate

# Specify text fields
python -m scripts.deidentify \
    --input-dir results/dataset/Indo-vap \
    --output-dir results/deidentified/Indo-vap \
    --text-fields patient_name notes diagnosis

# Disable encryption (not recommended)
python -m scripts.deidentify \
    --input-dir results/dataset/Indo-vap \
    --output-dir results/deidentified/Indo-vap \
    --no-encryption

Pipeline Integration

The de-identification step processes both original/ and cleaned/ subdirectories while maintaining the same file structure in the output directory.

# Enable de-identification in main pipeline
python main.py --enable-deidentification

# Skip de-identification
python main.py --enable-deidentification --skip-deidentification

# Disable encryption (not recommended for production)
python main.py --enable-deidentification --no-encryption

Output Directory Structure:

results/
β”œβ”€β”€ dataset/
β”‚   └── Indo-vap/
β”‚       β”œβ”€β”€ original/        (extracted JSONL files)
β”‚       └── cleaned/         (cleaned JSONL files)
β”œβ”€β”€ deidentified/
β”‚   β”œβ”€β”€ Indo-vap/
β”‚   β”‚   β”œβ”€β”€ original/        (de-identified original files)
β”‚   β”‚   β”œβ”€β”€ cleaned/         (de-identified cleaned files)
β”‚   β”‚   └── _deidentification_audit.json
β”‚   └── mappings/
β”‚       └── mappings.enc     (encrypted mapping table)
└── data_dictionary_mappings/

Important

Version Control Best Practices

The .gitignore file is pre-configured with security best practices:

Safe to Track in Git:

  • βœ… De-identified datasets (results/deidentified/Indo-vap/)

  • βœ… Data dictionary mappings (results/data_dictionary_mappings/)

  • βœ… Source code and documentation

Never Commit to Git:

  • ❌ Original datasets with PHI (results/dataset/)

  • ❌ Deidentification mappings (results/deidentified/mappings/)

  • ❌ Encryption keys (*.key, *.pem, *.fernet)

  • ❌ Audit logs (*_deidentification_audit.json)

Always review git status before committing to ensure no PHI/PII files are staged.
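
One way to double-check is a small pre-commit-style script. The helper below is a sketch (it is not part of the tool) that flags staged paths matching the rules above:

# Hypothetical helper: warn if PHI-bearing paths are staged for commit.
import subprocess

FORBIDDEN_PREFIXES = (
    "results/dataset/",                 # original datasets with PHI
    "results/deidentified/mappings/",   # de-identification mappings
)
FORBIDDEN_SUFFIXES = (".key", ".pem", ".fernet", "_deidentification_audit.json")

staged = subprocess.run(
    ["git", "diff", "--cached", "--name-only"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

flagged = [
    path for path in staged
    if path.startswith(FORBIDDEN_PREFIXES) or path.endswith(FORBIDDEN_SUFFIXES)
]

if flagged:
    print("⚠ Potential PHI/PII files staged for commit:")
    for path in flagged:
        print(f"  - {path}")
else:
    print("βœ“ No PHI/PII paths staged")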

Supported PHI/PII Types

The tool detects and de-identifies 21 PHI/PII types, covering the 18 HIPAA Safe Harbor identifier categories:

Names

  • First names

  • Last names

  • Full names

Medical Identifiers

  • Medical Record Numbers (MRN)

  • Account numbers

  • License/certificate numbers

Government Identifiers

  • Social Security Numbers (SSN)

Contact Information

  • Phone numbers (US and international formats)

  • Email addresses

  • Fax numbers

Geographic Information

  • Street addresses

  • Cities

  • States

  • ZIP codes

Temporal Information

  • Dates (all formats including DOB)

  • Ages over 89 (HIPAA requirement)

Technical Identifiers

  • Device identifiers

  • URLs

  • IP addresses (IPv4)

Custom Identifiers

  • Easy to extend with new detection rules

  • User-defined PHI types

Pseudonym Formats

Different PHI types use different pseudonym formats:

PHI Type     Example Original       Pseudonym Format
Name         John Doe               [PATIENT-A4B8C2]
MRN          AB123456               [MRN-X7Y2Z9]
SSN          123-45-6789            [SSN-Q3W8E5]
Phone        123-4567               [PHONE-E5R7T9]
Email        patient@example.com    [EMAIL-T9Y3U8]
Date         01/15/1980             Shifted date or [DATE-1]
Address      123 Main St            [STREET-Z2X5C8]
ZIP          12345                  [ZIP-K9L4M7]
Age >89      Age 92                 [AGE-K4L8P6]

Configuration

Directory Structure Processing

The de-identification tool automatically processes subfolders to maintain the same file structure between input and output directories:

from scripts.deidentify import deidentify_dataset

# Process with subdirectories (default)
stats = deidentify_dataset(
    input_dir="results/dataset/Indo-vap",
    output_dir="results/deidentified/Indo-vap",
    process_subdirs=True  # Recursively process all subdirectories
)

# Process only top-level files (no subdirectories)
stats = deidentify_dataset(
    input_dir="results/dataset/Indo-vap",
    output_dir="results/deidentified/Indo-vap",
    process_subdirs=False  # Only process files in the root directory
)

Features:

  • Maintains relative directory structure in output

  • Processes both original/ and cleaned/ subdirectories

  • Creates output directories automatically

  • Preserves file naming conventions

  • Single mapping table shared across all subdirectories

DeidentificationConfig

from scripts.deidentify import DeidentificationConfig, DeidentificationEngine

config = DeidentificationConfig(
    # Date shifting
    enable_date_shifting=True,
    date_shift_range_days=365,
    preserve_date_intervals=True,

    # Security
    enable_encryption=True,
    encryption_key=None,  # Auto-generate if None

    # Validation
    enable_validation=True,
    strict_mode=True,

    # Logging
    log_detections=True,
    log_level=logging.INFO
)

engine = DeidentificationEngine(config=config)

Custom PHI Patterns

from scripts.deidentify import DetectionPattern, PHIType
import re

# Define custom pattern
custom_pattern = DetectionPattern(
    phi_type=PHIType.CUSTOM,
    pattern=re.compile(r'\bSTUDY-\d{4}\b'),
    priority=85,
    description="Custom Study ID format"
)

# Use in de-identification
deidentified = engine.deidentify_text(
    text="Study ID: STUDY-1234",
    custom_patterns=[custom_pattern]
)

Advanced Features

Date Shifting

Date shifting preserves temporal relationships while obscuring actual dates. The date shifter automatically uses intelligent multi-format parsing with country-specific defaults:

from scripts.deidentify import DateShifter

# For India (DD/MM/YYYY format priority)
shifter_india = DateShifter(
    shift_range_days=365,
    preserve_intervals=True,
    country_code="IN"
)

# All dates shift by same offset, format preserved
date1 = shifter_india.shift_date("04/09/2014")  # September 4, 2014 (DD/MM/YYYY)
date2 = shifter_india.shift_date("09/09/2014")  # September 9, 2014
# Output: 14/12/2013, 19/12/2013 (5 days interval preserved)

# ISO 8601 format also supported
date3 = shifter_india.shift_date("2014-09-04")  # September 4, 2014
# Output: 2013-12-14 (format preserved)

# For United States (MM/DD/YYYY format priority)
shifter_us = DateShifter(
    shift_range_days=365,
    preserve_intervals=True,
    country_code="US"
)

date4 = shifter_us.shift_date("04/09/2014")  # April 9, 2014 (MM/DD/YYYY)
# Output: Different interpretation due to country format

Supported Date Formats (auto-detected):

  • ISO 8601: YYYY-MM-DD (e.g., 2014-09-04) - International standard, all countries

  • Slash-separated: DD/MM/YYYY or MM/DD/YYYY (e.g., 04/09/2014)

  • Hyphen-separated: DD-MM-YYYY or MM-DD-YYYY (e.g., 04-09-2014)

  • Dot-separated: DD.MM.YYYY (e.g., 04.09.2014) - European format

Primary Format by Country:

  • DD/MM/YYYY (Day first): India (IN), UK (GB), Australia (AU), Indonesia (ID), Brazil (BR), South Africa (ZA), EU countries, Kenya (KE), Nigeria (NG), Ghana (GH), Uganda (UG)

  • MM/DD/YYYY (Month first): United States (US), Philippines (PH), Canada (CA)

Key Features:

  • Intelligent multi-format detection (tries multiple formats automatically)

  • Original format preservation (shifted dates maintain the input format)

  • Consistent offset across all dates in a dataset

  • Temporal relationships preserved (intervals between dates maintained)

  • Country-specific format priority

  • Fallback to [DATE-HASH] placeholder only if all formats fail

Understanding Date Format Handling

Added in version 0.6.0: Improved date parsing with country-specific format priority and smart validation.

The date shifter uses an intelligent algorithm to handle ambiguous dates correctly:

The Ambiguity Problem

Dates like 08/09/2020 or 12/12/2012 can be interpreted in multiple ways:

Date String    US Format (MM/DD)           India Format (DD/MM)        Ambiguity
08/09/2020     August 9, 2020              September 8, 2020           ⚠️ Both valid
12/12/2012     December 12, 2012           December 12, 2012           ⚠️ Symmetric date
13/05/2020     ❌ Invalid (no 13th month)   May 13, 2020                βœ… Unambiguous
05/25/2020     May 25, 2020                ❌ Invalid (no 25th month)   βœ… Unambiguous

The Solution: Country-Based Priority with Smart Validation

The date shifter uses a three-step algorithm:

  1. Try ISO 8601 First (YYYY-MM-DD): Always unambiguous, works for all countries

  2. Try Country-Specific Format: Use the country’s preferred interpretation

  3. Smart Validation: Reject formats that are logically impossible

Algorithm Details:

# Example: Processing "13/05/2020" for India (DD/MM preference)

Step 1: Try ISO 8601 (YYYY-MM-DD)
  Result: ❌ Doesn't match pattern

Step 2: Try DD/MM/YYYY (India preference)
  Parse: βœ… Day=13, Month=05 (May 13, 2020)
  Validate: first_num=13 > 12 βœ… Valid (day can be >12)
  Result: βœ… Success! β†’ May 13, 2020

# Example: Processing "13/05/2020" for USA (MM/DD preference)

Step 1: Try ISO 8601 (YYYY-MM-DD)
  Result: ❌ Doesn't match pattern

Step 2: Try MM/DD/YYYY (USA preference)
  Parse: ❌ Month=13 is invalid (only 12 months)
  Result: Parsing fails, try next format

Step 3: Try DD/MM/YYYY (fallback)
  Parse: βœ… Day=13, Month=05
  Result: βœ… Success! β†’ May 13, 2020

Smart Validation Rules (sketched in code after this list):

  • If first number > 12 β†’ Must be day (can’t be month)

  • If second number > 12 β†’ Must be day (can’t be month)

  • If both numbers ≀ 12 β†’ Trust country preference (ambiguous case)

Examples by Country:

from scripts.deidentify import DateShifter

# India: DD/MM/YYYY preference
shifter_india = DateShifter(country_code="IN")

shifter_india.shift_date("2020-01-15")   # ISO β†’ Always Jan 15, 2020
shifter_india.shift_date("13/05/2020")   # Unambiguous β†’ May 13, 2020
shifter_india.shift_date("08/09/2020")   # Ambiguous β†’ Sep 8, 2020 (DD/MM)
shifter_india.shift_date("12/12/2012")   # Symmetric β†’ Dec 12, 2012 (DD/MM)

# United States: MM/DD/YYYY preference
shifter_us = DateShifter(country_code="US")

shifter_us.shift_date("2020-01-15")      # ISO β†’ Always Jan 15, 2020
shifter_us.shift_date("13/05/2020")      # Unambiguous β†’ May 13, 2020
shifter_us.shift_date("08/09/2020")      # Ambiguous β†’ Aug 9, 2020 (MM/DD)
shifter_us.shift_date("12/12/2012")      # Symmetric β†’ Dec 12, 2012 (MM/DD)

Best Practices:

  1. Use ISO 8601 when possible (YYYY-MM-DD): Eliminates all ambiguity

  2. Set country code correctly: Ensures consistent interpretation within your dataset

  3. Validate output: Review shifted dates to ensure they make sense

  4. Document format: Record which format your source data uses

Tip

For symmetric dates like 12/12/2012 or 01/01/2020, the interpretation doesn’t affect the result since both formats yield the same date. However, consistency is still maintained for audit purposes.

Warning

Mixing date formats within a single dataset (e.g., some files using DD/MM and others using MM/DD) can lead to inconsistent interpretations. Always use a consistent format across your dataset, preferably ISO 8601.

Encrypted Mapping Storage

Mapping tables are stored in a centralized location within the results/deidentified/mappings/ directory:

from cryptography.fernet import Fernet
from scripts.deidentify import DeidentificationConfig, DeidentificationEngine

# Generate and save key
encryption_key = Fernet.generate_key()
with open('encryption_key.bin', 'wb') as f:
    f.write(encryption_key)

# Use encrypted storage
config = DeidentificationConfig(
    enable_encryption=True,
    encryption_key=encryption_key
)

engine = DeidentificationEngine(config=config)

# Mappings stored in: results/deidentified/mappings/mappings.enc
# This single mapping file is used across all datasets and subdirectories
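
On subsequent runs, load the same key so the engine can decrypt the existing mapping table and keep pseudonyms consistent. A short sketch reusing the configuration options shown above:

from scripts.deidentify import DeidentificationConfig, DeidentificationEngine

# Load the key saved earlier; without it the encrypted mappings cannot be reused
with open('encryption_key.bin', 'rb') as f:
    encryption_key = f.read()

config = DeidentificationConfig(
    enable_encryption=True,
    encryption_key=encryption_key
)
engine = DeidentificationEngine(config=config)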

Record De-identification

# De-identify structured records
record = {
    "patient_name": "John Doe",
    "mrn": "123456",
    "notes": "Patient has diabetes. DOB: 01/15/1980",
    "lab_value": 95.5  # Numeric field preserved
}

# Specify which fields to de-identify
deidentified = engine.deidentify_record(
    record,
    text_fields=["patient_name", "notes"]
)

Validation

# Validate de-identified text
is_valid, issues = engine.validate_deidentification(deidentified_text)

if not is_valid:
    print(f"Validation failed! Issues: {issues}")
else:
    print("βœ“ No PHI detected")

# Validate entire dataset (processes all subdirectories)
from scripts.deidentify import validate_dataset

validation_results = validate_dataset(
    "results/deidentified/Indo-vap"
)

print(f"Valid: {validation_results['is_valid']}")
print(f"Issues: {len(validation_results['potential_phi_found'])}")
print(f"Files validated: {validation_results['total_files']}")
print(f"Records validated: {validation_results['total_records']}")

Security

Encryption

Mapping storage uses Fernet (symmetric encryption):

  • Encryption method: AES-128 in CBC mode

  • Key management: Separate from data files

  • Format: Base64-encoded encrypted JSON

Cryptographic Pseudonyms

Pseudonyms are generated using:

  • Hash method: SHA-256 hashing

  • Salt: Random or deterministic per session

  • Encoding: Base32 for readability

  • Property: Irreversible without mapping table

Best Practices

  1. Protect Encryption Keys

    • Store keys separately from mapping files

    • Use key management systems in production

    • Rotate keys periodically

  2. Enable Validation

    • Always validate after de-identification

    • Manual review of sample outputs

    • Regular updates to detection rules

  3. Audit Logging

    • Enable comprehensive logging

    • Monitor for validation failures

    • Track mapping usage

  4. Access Control

    • Restrict access to mapping files

    • Separate re-identification permissions

    • Log all mapping exports

HIPAA Compliance

The tool follows HIPAA Safe Harbor requirements:

βœ“ Removes all 18 HIPAA identifiers

βœ“ Ages over 89 handled appropriately

βœ“ Geographic subdivisions (ZIP codes) de-identified

βœ“ Dates shifted to preserve intervals

βœ“ No re-identification without authorization

Performance

Benchmarks

Typical performance on modern hardware:

  • Text Processing: ~1,000 records/second

  • Detection Speed: ~500 KB/second

  • Mapping Lookup: O(1) average case

  • Encryption Overhead: ~5-10% slowdown

Optimization Tips

  1. Batch Processing: Process files in parallel (see the sketch after this list)

  2. Detection Order: Put common items first

  3. Caching: Pseudonyms cached automatically

  4. Validation: Disable in production if pre-validated
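
A safe place to parallelize is validation, which is read-only; de-identification itself is best kept in a single process so the shared mapping table stays consistent. A sketch using a process pool (assumes validate_dataset can be called independently per directory):

from concurrent.futures import ProcessPoolExecutor
from scripts.deidentify import validate_dataset

SUBDIRS = [
    "results/deidentified/Indo-vap/original",
    "results/deidentified/Indo-vap/cleaned",
]

if __name__ == "__main__":  # guard required for process pools on spawn-based platforms
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(validate_dataset, SUBDIRS))
    for subdir, result in zip(SUBDIRS, results):
        print(subdir, "βœ“ valid" if result["is_valid"] else "⚠ issues found")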

Examples

See scripts/deidentify.py --help for command-line usage:

python -m scripts.deidentify --help

Examples include:

  1. Basic text de-identification

  2. Consistent pseudonyms

  3. Structured record de-identification

  4. Custom patterns

  5. Date shifting

  6. Batch processing

  7. Validation workflow

  8. Mapping management

  9. Security features

Testing

The de-identification tool can be tested using the main process:

# Test on a small dataset
python main.py --enable-deidentification

Expected Output

When processing the Indo-vap dataset:

De-identifying files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 86/86 [00:08<00:00, 10.34it/s]
INFO:reportalin:De-identification complete:
INFO:reportalin:  Texts processed: 1854110
INFO:reportalin:  Total detections: 365620
INFO:reportalin:  Unique mappings: 5398
INFO:reportalin:  Output structure:
INFO:reportalin:    - results/deidentified/Indo-vap/original/  (de-identified original files)
INFO:reportalin:    - results/deidentified/Indo-vap/cleaned/   (de-identified cleaned files)

What happens:

  • Processes both original/ and cleaned/ subdirectories (43 files each = 86 total)

  • Detects and replaces PHI/PII in all string fields

  • Creates 5,398 unique pseudonym mappings

  • Generates encrypted mapping table at results/deidentified/mappings/mappings.enc

  • Exports audit log at results/deidentified/Indo-vap/_deidentification_audit.json

Sample De-identification:

Before:

{
    "HHC1": "10200009B",
    "TST_DAT1": "2014-06-11 00:00:00",
    "TST_ENDAT1": "2014-06-14 00:00:00"
}

After:

{
    "HHC1": "[MRN-XTHM4A]",
    "TST_DAT1": "[DATE-A4A986]",
    "TST_ENDAT1": "[DATE-B3C874]"
}

Verification

βœ“ Detection for all PHI types

βœ“ Pseudonym consistency

βœ“ Date shifting and intervals

βœ“ Mapping storage and encryption

βœ“ Batch processing

βœ“ Validation

βœ“ Edge cases and error handling

Troubleshooting

Common Issues

β€œNo files matching β€˜*.jsonl’ found”

# Solution: Ensure extraction step completed first
python main.py --skip-deidentification  # Run extraction
python main.py --enable-deidentification --skip-extraction  # Then deidentify

Encryption error - β€œcryptography package not available”

# Solution: Install cryptography
pip install "cryptography>=41.0.0"

Validation fails on de-identified text

# Solution: Inspect the reported issues, then adjust detection order and exclusions
is_valid, issues = engine.validate_deidentification(text)
print(issues)

Dates not shifting consistently

# Solution: Enable interval preservation
config = DeidentificationConfig(
    enable_date_shifting=True,
    preserve_date_intervals=True
)

Custom patterns not detected

# Solution: Increase priority
custom_pattern = DetectionPattern(
    phi_type=PHIType.CUSTOM,
    pattern=your_detection_rule,
    priority=90  # Higher priority
)

Output directory structure different from input

# Solution: Ensure process_subdirs is enabled
stats = deidentify_dataset(
    input_dir="results/dataset/Indo-vap",
    output_dir="results/deidentified/Indo-vap",
    process_subdirs=True  # Must be True to preserve structure
)

β€œCould not parse date” warnings

# The tool uses smart multi-format date recognition
# Supported formats (auto-detected, original format preserved):
#   - YYYY-MM-DD: ISO 8601 standard (e.g., 2014-09-04)
#   - DD/MM/YYYY or MM/DD/YYYY: Slash-separated (e.g., 04/09/2014)
#   - DD-MM-YYYY or MM-DD-YYYY: Hyphen-separated (e.g., 04-09-2014)
#   - DD.MM.YYYY: Dot-separated European format (e.g., 04.09.2014)
#
# Format priority based on country:
#   - DD/MM/YYYY priority: India, UK, Australia, Indonesia, Brazil, South Africa, EU, Kenya, Nigeria, Ghana, Uganda
#   - MM/DD/YYYY priority: United States, Philippines, Canada
#
# Only truly unsupported formats are replaced with [DATE-HASH] placeholders
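
If a source column uses a format outside this list, one workaround is to normalize it to ISO 8601 before de-identification. A minimal sketch; the input format shown is only an example:

from datetime import datetime

def to_iso(date_str: str, source_format: str = "%d %B %Y") -> str:
    """Convert e.g. '04 September 2014' to '2014-09-04' (ISO 8601)."""
    return datetime.strptime(date_str, source_format).strftime("%Y-%m-%d")

to_iso("04 September 2014")   # -> '2014-09-04'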

Date format interpretation and preservation

The date shifter automatically tries multiple formats and preserves the original format:

For India (IN) with DD/MM/YYYY priority:
- Input: 04/09/2014 β†’ Interpreted as September 4, 2014 (DD/MM/YYYY)
- Output: 14/12/2013 (format preserved: DD/MM/YYYY)

- Input: 2014-09-04 β†’ Interpreted as September 4, 2014 (ISO 8601)
- Output: 2013-12-14 (format preserved: YYYY-MM-DD)

For United States (US) with MM/DD/YYYY priority:
- Input: 04/09/2014 β†’ Interpreted as April 9, 2014 (MM/DD/YYYY)
- Output: 10/23/2013 (format preserved: MM/DD/YYYY)

- Input: 2014-04-09 β†’ Interpreted as April 9, 2014 (ISO 8601)
- Output: 2013-10-23 (format preserved: YYYY-MM-DD)

For all countries:
- 2014-09-04 is interpreted as September 4, 2014 (ISO 8601, YYYY-MM-DD)
- Only dates that cannot be parsed in any supported format are replaced with a [DATE-HASH] pseudonym

Technical Reference

For complete technical details, including the key classes and functions, see the scripts.deidentify module documentation.

Migration Guide

Breaking Changes: None - The de-identification tool is fully backward compatible

New Features (Available in current version):

  1. Use Explicit Imports (Recommended):

    # Recommended import style
    from scripts.deidentify import DeidentificationEngine
    engine = DeidentificationEngine()
    
  2. Type Checking Benefits:

    If you use type checkers (mypy, pyright), you’ll get better type inference:

    # Type checkers now understand return types
    result: None = engine.save_mappings()  # Correctly inferred as None
    
  3. API Discovery:

    You can now see exactly what’s public:

    from scripts import deidentify
    print(deidentify.__all__)
    # ['PHIType', 'DetectionPattern', 'DeidentificationConfig', ...]
    

No Changes Required: All existing code continues to work without modification.

See Also

Related User Guides:

API & Technical References:

External Resources: