De-identification

For Users: Protecting Privacy

This feature helps you protect sensitive patient information by replacing real names, dates, and other personal details with safe placeholders. It follows medical privacy rules from 14 different countries.

Changed in version 0.3.0: Enhanced privacy protection with improved detection and better support for international regulations.

See also

For information about privacy rules in specific countries, see Country-Specific Privacy Rules.

Overview

The privacy protection feature can detect and replace 21 types of personal information, including:

  β€’ Names and Addresses: Patient names, street addresses, cities

  β€’ Dates: Birth dates, admission dates, appointment dates

  β€’ ID Numbers: Social security numbers, medical record numbers, account numbers

  β€’ Contact Info: Phone numbers, email addresses, website URLs

It also provides:

  β€’ Encrypted Storage: Fernet encryption for mapping tables with secure key management

  β€’ Date Shifting: Preserves temporal relationships while shifting dates by a consistent offset

  β€’ Validation: Comprehensive validation to ensure no PHI leakage in de-identified output

  β€’ Security: Built-in encryption, access control, and audit trail capabilities

  β€’ Directory Structure Preservation: Maintains original file organization during batch processing

What’s New in 0.8.6

Better Privacy Protection:
  • Improved detection of sensitive information

  • More secure replacement methods

  • Easier to use with better error messages

Enhanced Security:
  • Stronger encryption for mapping files

  • Better protection of patient information

  • Comprehensive audit trail for compliance

Improved Documentation:
  • Clear examples for common tasks

  • Step-by-step privacy protection guides

  • Easy-to-follow security best practices

How It Works

Privacy protection happens automatically as part of the data processing:

Step 1: Data Extraction

Your Excel files are converted to a simpler format (JSONL):

  • results/dataset/Indo-vap/original/ - All your data preserved

  • results/dataset/Indo-vap/cleaned/ - Duplicate information removed

Step 2: Privacy Protection (Optional)

Both folders are protected while keeping the same structure:

  • results/deidentified/Indo-vap/original/ - Protected original files

  • results/deidentified/Indo-vap/cleaned/ - Protected cleaned files

  • results/deidentified/mappings/mappings.enc - Encrypted lookup table

  • results/deidentified/Indo-vap/_deidentification_audit.json - Record of changes

What You Get:

  1. Consistent Replacement: The same name always gets the same safe placeholder (see the short example after this list)

  2. Secure Storage: Your lookup table is encrypted and protected

  3. Same Organization: Protected files are organized exactly like your original files

  4. Complete Record: Full audit trail of what was changed (without showing the original values)

  5. Easy Review: You can verify the protection worked correctly
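
For example, the same name maps to the same placeholder every time it appears. A minimal sketch using the Python API described later in this guide (the exact placeholder values will differ):

from scripts.deidentify import DeidentificationEngine

engine = DeidentificationEngine()

# "John Doe" receives the same [PATIENT-...] placeholder in both calls
first = engine.deidentify_text("Patient John Doe admitted on 01/15/2020")
second = engine.deidentify_text("Follow-up note for John Doe")
print(first)
print(second)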

Getting Started

Basic Usage

To protect patient privacy in your data, run:

# Protect data for United States privacy rules
python main.py --enable-deidentification --countries US

# Protect data for multiple countries
python main.py --enable-deidentification --countries IN US GB

This will:

  β€’ Find and replace sensitive information like names, dates, and phone numbers

  β€’ Create protected versions of your files

  β€’ Save an encrypted lookup table so you can track changes

  β€’ Generate a report showing what was protected

What Gets Protected

The privacy feature protects 21 types of sensitive information; the full list is given in Supported PHI/PII Types below.

The public API can be imported from scripts.deidentify:

from scripts.deidentify import (
    MappingStore,            # Secure mapping storage
    DeidentificationEngine,  # Main engine

    # Top-level Functions
    deidentify_dataset,      # Batch processing
    validate_dataset,        # Validation
)

What to Import:

  • For Basic Use: Import DeidentificationEngine and optionally DeidentificationConfig

  • For Batch Processing: Import deidentify_dataset and validate_dataset

  • For Advanced Use: Import specific classes like DateShifter, MappingStore, etc.

  • For Custom Patterns: Import PHIType and DetectionPattern

Example - Basic Usage:

from scripts.deidentify import DeidentificationEngine, DeidentificationConfig

# Configure with custom settings
config = DeidentificationConfig(
    enable_date_shifting=True,
    enable_encryption=True,
    countries=['US', 'IN']
)

# Create engine
engine = DeidentificationEngine(config=config)

# De-identify text
text = "Patient John Doe, MRN: AB123456, DOB: 01/15/1980"
deidentified = engine.deidentify_text(text)
print(deidentified)
# Output: "Patient [PATIENT-A4B8], MRN: [MRN-X7Y2], DOB: [DATE-1980-01-15]"

Example - Batch Processing:

from scripts.deidentify import deidentify_dataset, validate_dataset

# Process entire dataset
stats = deidentify_dataset(
    input_dir="data/patient_records",
    output_dir="data/deidentified",
    config=config
)

# Validate results
validation = validate_dataset(
    dataset_dir="data/deidentified"
)

if validation['is_valid']:
    print("βœ“ No PHI detected in output")
else:
    print(f"⚠ Found {len(validation['potential_phi_found'])} issues")

Example - Custom Patterns:

from scripts.deidentify import (
    DeidentificationEngine,
    PHIType,
    DetectionPattern
)
import re

# Define custom pattern for employee IDs
custom_pattern = DetectionPattern(
    phi_type=PHIType.CUSTOM,
    pattern=re.compile(r'EMP-\d{6}'),
    priority=85,
    description="Employee ID format: EMP-XXXXXX"
)

# Use with engine
engine = DeidentificationEngine()
text = "Employee EMP-123456 accessed record"
deidentified = engine.deidentify_text(text, custom_patterns=[custom_pattern])

Basic Usage

from scripts.deidentify import DeidentificationEngine

# Initialize engine
engine = DeidentificationEngine()

# De-identify text
original = "Patient John Doe, MRN: 123456, DOB: 01/15/1980"
deidentified = engine.deidentify_text(original)
# Output: "Patient [PATIENT-A4B8], MRN: [MRN-X7Y2], DOB: [DATE-1980-01-15]"

# Save mappings
engine.save_mappings()

Batch Processing

from scripts.deidentify import deidentify_dataset

# Process entire dataset (maintains directory structure)
# Input directory contains: original/ and cleaned/ subdirectories
stats = deidentify_dataset(
    input_dir="results/dataset/Indo-vap",
    output_dir="results/deidentified/Indo-vap",
    process_subdirs=True  # Recursively process subdirectories
)

print(f"Processed {stats['texts_processed']} texts")
print(f"Detected {stats['total_detections']} PHI items")

# Output structure:
# results/deidentified/Indo-vap/
#   β”œβ”€β”€ original/          (de-identified original files)
#   β”œβ”€β”€ cleaned/           (de-identified cleaned files)
#   └── _deidentification_audit.json

Command Line Interface

# Basic usage - processes subdirectories recursively
python -m scripts.deidentify \
    --input-dir results/dataset/Indo-vap \
    --output-dir results/deidentified/Indo-vap

# With validation
python -m scripts.deidentify \
    --input-dir results/dataset/Indo-vap \
    --output-dir results/deidentified/Indo-vap \
    --validate

# Specify text fields
python -m scripts.deidentify \
    --input-dir results/dataset/Indo-vap \
    --output-dir results/deidentified/Indo-vap \
    --text-fields patient_name notes diagnosis

# Disable encryption (not recommended)
python -m scripts.deidentify \
    --input-dir results/dataset/Indo-vap \
    --output-dir results/deidentified/Indo-vap \
    --no-encryption

Pipeline Integration

The de-identification step processes both original/ and cleaned/ subdirectories while maintaining the same file structure in the output directory.

# Enable de-identification in main pipeline
python main.py --enable-deidentification

# Skip de-identification
python main.py --enable-deidentification --skip-deidentification

# Disable encryption (not recommended for production)
python main.py --enable-deidentification --no-encryption

Output Directory Structure:

results/
β”œβ”€β”€ dataset/
β”‚   └── Indo-vap/
β”‚       β”œβ”€β”€ original/        (extracted JSONL files)
β”‚       └── cleaned/         (cleaned JSONL files)
β”œβ”€β”€ deidentified/
β”‚   β”œβ”€β”€ Indo-vap/
β”‚   β”‚   β”œβ”€β”€ original/        (de-identified original files)
β”‚   β”‚   β”œβ”€β”€ cleaned/         (de-identified cleaned files)
β”‚   β”‚   └── _deidentification_audit.json
β”‚   └── mappings/
β”‚       └── mappings.enc     (encrypted mapping table)
└── data_dictionary_mappings/

Important

Version Control Best Practices

The .gitignore file is pre-configured with security best practices:

Safe to Track in Git:

  • βœ… De-identified datasets (results/deidentified/Indo-vap/)

  • βœ… Data dictionary mappings (results/data_dictionary_mappings/)

  • βœ… Source code and documentation

Never Commit to Git:

  • ❌ Original datasets with PHI (results/dataset/)

  • ❌ Deidentification mappings (results/deidentified/mappings/)

  • ❌ Encryption keys (*.key, *.pem, *.fernet)

  • ❌ Audit logs (*_deidentification_audit.json)

Always review git status before committing to ensure no PHI/PII files are staged.
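
One way to double-check is a small pre-commit-style script. The helper below is a sketch (it is not part of the tool) that flags staged paths matching the rules above:

# Hypothetical helper: warn if PHI-bearing paths are staged for commit.
import subprocess

FORBIDDEN_PREFIXES = (
    "results/dataset/",                 # original datasets with PHI
    "results/deidentified/mappings/",   # de-identification mappings
)
FORBIDDEN_SUFFIXES = (".key", ".pem", ".fernet", "_deidentification_audit.json")

staged = subprocess.run(
    ["git", "diff", "--cached", "--name-only"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

flagged = [
    path for path in staged
    if path.startswith(FORBIDDEN_PREFIXES) or path.endswith(FORBIDDEN_SUFFIXES)
]

if flagged:
    print("⚠ Potential PHI/PII files staged for commit:")
    for path in flagged:
        print(f"  - {path}")
else:
    print("βœ“ No PHI/PII paths staged")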

Supported PHI/PII Types

The tool detects and de-identifies 21 PHI/PII types, covering the 18 HIPAA Safe Harbor identifier categories:

Names

  • First names

  • Last names

  • Full names

Medical Identifiers

  • Medical Record Numbers (MRN)

  • Account numbers

  • License/certificate numbers

Government Identifiers

  • Social Security Numbers (SSN)

Contact Information

  • Phone numbers (US and international formats)

  • Email addresses

  • Fax numbers

Geographic Information

  • Street addresses

  • Cities

  • States

  • ZIP codes

Temporal Information

  • Dates (all formats including DOB)

  • Ages over 89 (HIPAA requirement)

Technical Identifiers

  • Device identifiers

  • URLs

  • IP addresses (IPv4)

Custom Identifiers

  • Easy to extend with new detection rules

  • User-defined PHI types

Pseudonym Formats

Different PHI types use different pseudonym formats:

PHI Type     Example Original       Pseudonym Format
Name         John Doe               [PATIENT-A4B8C2]
MRN          AB123456               [MRN-X7Y2Z9]
SSN          123-45-6789            [SSN-Q3W8E5]
Phone        123-4567               [PHONE-E5R7T9]
Email        patient@example.com    [EMAIL-T9Y3U8]
Date         01/15/1980             Shifted date or [DATE-1]
Address      123 Main St            [STREET-Z2X5C8]
ZIP          12345                  [ZIP-K9L4M7]
Age >89      Age 92                 [AGE-K4L8P6]

Configuration

Directory Structure Processing

The de-identification tool automatically processes subfolders to maintain the same file structure between input and output directories:

from scripts.deidentify import deidentify_dataset

# Process with subdirectories (default)
stats = deidentify_dataset(
    input_dir="results/dataset/Indo-vap",
    output_dir="results/deidentified/Indo-vap",
    process_subdirs=True  # Recursively process all subdirectories
)

# Process only top-level files (no subdirectories)
stats = deidentify_dataset(
    input_dir="results/dataset/Indo-vap",
    output_dir="results/deidentified/Indo-vap",
    process_subdirs=False  # Only process files in the root directory
)

Features:

  • Maintains relative directory structure in output

  • Processes both original/ and cleaned/ subdirectories

  • Creates output directories automatically

  • Preserves file naming conventions

  • Single mapping table shared across all subdirectories

DeidentificationConfig

from scripts.deidentify import DeidentificationConfig, DeidentificationEngine

config = DeidentificationConfig(
    # Date shifting
    enable_date_shifting=True,
    date_shift_range_days=365,
    preserve_date_intervals=True,

    # Security
    enable_encryption=True,
    encryption_key=None,  # Auto-generate if None

    # Validation
    enable_validation=True,
    strict_mode=True,

    # Logging
    log_detections=True,
    log_level=logging.INFO
)

engine = DeidentificationEngine(config=config)

Custom PHI Patterns

from scripts.deidentify import DetectionPattern, PHIType
import re

# Define custom pattern
custom_pattern = DetectionPattern(
    phi_type=PHIType.CUSTOM,
    pattern=re.compile(r'\bSTUDY-\d{4}\b'),
    priority=85,
    description="Custom Study ID format"
)

# Use in de-identification
deidentified = engine.deidentify_text(
    text="Study ID: STUDY-1234",
    custom_patterns=[custom_pattern]
)

Advanced Features

Date Shifting

Date shifting preserves temporal relationships while obscuring actual dates. The date shifter automatically uses intelligent multi-format parsing with country-specific defaults:

from scripts.deidentify import DateShifter

# For India (DD/MM/YYYY format priority)
shifter_india = DateShifter(
    shift_range_days=365,
    preserve_intervals=True,
    country_code="IN"
)

# All dates shift by same offset, format preserved
date1 = shifter_india.shift_date("04/09/2014")  # September 4, 2014 (DD/MM/YYYY)
date2 = shifter_india.shift_date("09/09/2014")  # September 9, 2014
# Output: 14/12/2013, 19/12/2013 (5 days interval preserved)

# ISO 8601 format also supported
date3 = shifter_india.shift_date("2014-09-04")  # September 4, 2014
# Output: 2013-12-14 (format preserved)

# For United States (MM/DD/YYYY format priority)
shifter_us = DateShifter(
    shift_range_days=365,
    preserve_intervals=True,
    country_code="US"
)

date4 = shifter_us.shift_date("04/09/2014")  # April 9, 2014 (MM/DD/YYYY)
# Output: Different interpretation due to country format

Supported Date Formats (auto-detected):

  • ISO 8601: YYYY-MM-DD (e.g., 2014-09-04) - International standard, all countries

  • Slash-separated: DD/MM/YYYY or MM/DD/YYYY (e.g., 04/09/2014)

  • Hyphen-separated: DD-MM-YYYY or MM-DD-YYYY (e.g., 04-09-2014)

  • Dot-separated: DD.MM.YYYY (e.g., 04.09.2014) - European format

Primary Format by Country:

  • DD/MM/YYYY (Day first): India (IN), UK (GB), Australia (AU), Indonesia (ID), Brazil (BR), South Africa (ZA), EU countries, Kenya (KE), Nigeria (NG), Ghana (GH), Uganda (UG)

  • MM/DD/YYYY (Month first): United States (US), Philippines (PH), Canada (CA)

Key Features:

  • Intelligent multi-format detection (tries multiple formats automatically)

  • Original format preservation (shifted dates maintain the input format)

  • Consistent offset across all dates in a dataset

  • Temporal relationships preserved (intervals between dates maintained)

  • Country-specific format priority

  • Fallback to [DATE-HASH] placeholder only if all formats fail

Understanding Date Format Handling

Added in version 0.6.0: Improved date parsing with country-specific format priority and smart validation.

The date shifter uses an intelligent algorithm to handle ambiguous dates correctly:

The Ambiguity Problem

Dates like 08/09/2020 or 12/12/2012 can be interpreted in multiple ways:

Date String    US Format (MM/DD)           India Format (DD/MM)        Ambiguity
08/09/2020     August 9, 2020              September 8, 2020           ⚠️ Both valid
12/12/2012     December 12, 2012           December 12, 2012           ⚠️ Symmetric date
13/05/2020     ❌ Invalid (no 13th month)   May 13, 2020                βœ… Unambiguous
05/25/2020     May 25, 2020                ❌ Invalid (no 25th month)   βœ… Unambiguous

The Solution: Country-Based Priority with Smart Validation

The date shifter uses a three-step algorithm:

  1. Try ISO 8601 First (YYYY-MM-DD): Always unambiguous, works for all countries

  2. Try Country-Specific Format: Use the country’s preferred interpretation

  3. Smart Validation: Reject formats that are logically impossible

Algorithm Details:

# Example: Processing "13/05/2020" for India (DD/MM preference)

Step 1: Try ISO 8601 (YYYY-MM-DD)
  Result: ❌ Doesn't match pattern

Step 2: Try DD/MM/YYYY (India preference)
  Parse: βœ… Day=13, Month=05 (May 13, 2020)
  Validate: first_num=13 > 12 βœ… Valid (day can be >12)
  Result: βœ… Success! β†’ May 13, 2020

# Example: Processing "13/05/2020" for USA (MM/DD preference)

Step 1: Try ISO 8601 (YYYY-MM-DD)
  Result: ❌ Doesn't match pattern

Step 2: Try MM/DD/YYYY (USA preference)
  Parse: ❌ Month=13 is invalid (only 12 months)
  Result: Parsing fails, try next format

Step 3: Try DD/MM/YYYY (fallback)
  Parse: βœ… Day=13, Month=05
  Result: βœ… Success! β†’ May 13, 2020

Smart Validation Rules (sketched in code after this list):

  • If first number > 12 β†’ Must be day (can’t be month)

  • If second number > 12 β†’ Must be day (can’t be month)

  • If both numbers ≀ 12 β†’ Trust country preference (ambiguous case)

Examples by Country:

from scripts.deidentify import DateShifter

# India: DD/MM/YYYY preference
shifter_india = DateShifter(country_code="IN")

shifter_india.shift_date("2020-01-15")   # ISO β†’ Always Jan 15, 2020
shifter_india.shift_date("13/05/2020")   # Unambiguous β†’ May 13, 2020
shifter_india.shift_date("08/09/2020")   # Ambiguous β†’ Sep 8, 2020 (DD/MM)
shifter_india.shift_date("12/12/2012")   # Symmetric β†’ Dec 12, 2012 (DD/MM)

# United States: MM/DD/YYYY preference
shifter_us = DateShifter(country_code="US")

shifter_us.shift_date("2020-01-15")      # ISO β†’ Always Jan 15, 2020
shifter_us.shift_date("13/05/2020")      # Unambiguous β†’ May 13, 2020
shifter_us.shift_date("08/09/2020")      # Ambiguous β†’ Aug 9, 2020 (MM/DD)
shifter_us.shift_date("12/12/2012")      # Symmetric β†’ Dec 12, 2012 (MM/DD)

Best Practices:

  1. Use ISO 8601 when possible (YYYY-MM-DD): Eliminates all ambiguity

  2. Set country code correctly: Ensures consistent interpretation within your dataset

  3. Validate output: Review shifted dates to ensure they make sense

  4. Document format: Record which format your source data uses

Tip

For symmetric dates like 12/12/2012 or 01/01/2020, the interpretation doesn’t affect the result since both formats yield the same date. However, consistency is still maintained for audit purposes.

Warning

Mixing date formats within a single dataset (e.g., some files using DD/MM and others using MM/DD) can lead to inconsistent interpretations. Always use a consistent format across your dataset, preferably ISO 8601.

Encrypted Mapping Storage

Mapping tables are stored in a centralized location within the results/deidentified/mappings/ directory:

from cryptography.fernet import Fernet
from scripts.deidentify import DeidentificationConfig, DeidentificationEngine

# Generate and save key
encryption_key = Fernet.generate_key()
with open('encryption_key.bin', 'wb') as f:
    f.write(encryption_key)

# Use encrypted storage
config = DeidentificationConfig(
    enable_encryption=True,
    encryption_key=encryption_key
)

engine = DeidentificationEngine(config=config)

# Mappings stored in: results/deidentified/mappings/mappings.enc
# This single mapping file is used across all datasets and subdirectories
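
On subsequent runs, load the same key so the engine can decrypt the existing mapping table and keep pseudonyms consistent. A short sketch reusing the configuration options shown above:

from scripts.deidentify import DeidentificationConfig, DeidentificationEngine

# Load the key saved earlier; without it the encrypted mappings cannot be reused
with open('encryption_key.bin', 'rb') as f:
    encryption_key = f.read()

config = DeidentificationConfig(
    enable_encryption=True,
    encryption_key=encryption_key
)
engine = DeidentificationEngine(config=config)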

Record De-identification

# De-identify structured records
record = {
    "patient_name": "John Doe",
    "mrn": "123456",
    "notes": "Patient has diabetes. DOB: 01/15/1980",
    "lab_value": 95.5  # Numeric field preserved
}

# Specify which fields to de-identify
deidentified = engine.deidentify_record(
    record,
    text_fields=["patient_name", "notes"]
)

Validation

# Validate de-identified text
is_valid, issues = engine.validate_deidentification(deidentified_text)

if not is_valid:
    print(f"Validation failed! Issues: {issues}")
else:
    print("βœ“ No PHI detected")

# Validate entire dataset (processes all subdirectories)
from scripts.deidentify import validate_dataset

validation_results = validate_dataset(
    "results/deidentified/Indo-vap"
)

print(f"Valid: {validation_results['is_valid']}")
print(f"Issues: {len(validation_results['potential_phi_found'])}")
print(f"Files validated: {validation_results['total_files']}")
print(f"Records validated: {validation_results['total_records']}")

Security

Encryption

Mapping storage uses Fernet (symmetric encryption):

  • Encryption method: AES-128 in CBC mode

  • Key management: Separate from data files

  • Format: Base64-encoded encrypted JSON

Cryptographic Pseudonyms

Pseudonyms are generated using:

  • Hash method: SHA-256 hashing

  • Salt: Random or deterministic per session

  • Encoding: Base32 for readability

  • Property: Irreversible without mapping table

Best Practices

  1. Protect Encryption Keys

    • Store keys separately from mapping files

    • Use key management systems in production

    • Rotate keys periodically

  2. Enable Validation

    • Always validate after de-identification

    • Manual review of sample outputs

    • Regular updates to detection rules

  3. Audit Logging

    • Enable comprehensive logging

    • Monitor for validation failures

    • Track mapping usage

  4. Access Control

    • Restrict access to mapping files

    • Separate re-identification permissions

    • Log all mapping exports

HIPAA Compliance

The tool follows HIPAA Safe Harbor requirements:

βœ“ Removes all 18 HIPAA identifiers

βœ“ Ages over 89 handled appropriately

βœ“ Geographic subdivisions (ZIP codes) de-identified

βœ“ Dates shifted to preserve intervals

βœ“ No re-identification without authorization

Performance

Benchmarks

Typical performance on modern hardware:

  • Text Processing: ~1,000 records/second

  • Detection Speed: ~500 KB/second

  • Mapping Lookup: O(1) average case

  • Encryption Overhead: ~5-10% slowdown

Optimization Tips

  1. Batch Processing: Process files in parallel (see the sketch after this list)

  2. Detection Order: Put common items first

  3. Caching: Pseudonyms cached automatically

  4. Validation: Disable in production if pre-validated
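
A safe place to parallelize is validation, which is read-only; de-identification itself is best kept in a single process so the shared mapping table stays consistent. A sketch using a process pool (assumes validate_dataset can be called independently per directory):

from concurrent.futures import ProcessPoolExecutor
from scripts.deidentify import validate_dataset

SUBDIRS = [
    "results/deidentified/Indo-vap/original",
    "results/deidentified/Indo-vap/cleaned",
]

if __name__ == "__main__":  # guard required for process pools on spawn-based platforms
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(validate_dataset, SUBDIRS))
    for subdir, result in zip(SUBDIRS, results):
        print(subdir, "βœ“ valid" if result["is_valid"] else "⚠ issues found")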

Examples

See scripts/deidentify.py --help for command-line usage:

python -m scripts.deidentify --help

Examples include:

  1. Basic text de-identification

  2. Consistent pseudonyms

  3. Structured record de-identification

  4. Custom patterns

  5. Date shifting

  6. Batch processing

  7. Validation workflow

  8. Mapping management

  9. Security features

Testing

The de-identification tool can be tested using the main process:

# Test on a small dataset
python main.py --enable-deidentification

Expected Output

When processing the Indo-vap dataset:

De-identifying files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 86/86 [00:08<00:00, 10.34it/s]
INFO:reportalin:De-identification complete:
INFO:reportalin:  Texts processed: 1854110
INFO:reportalin:  Total detections: 365620
INFO:reportalin:  Unique mappings: 5398
INFO:reportalin:  Output structure:
INFO:reportalin:    - results/deidentified/Indo-vap/original/  (de-identified original files)
INFO:reportalin:    - results/deidentified/Indo-vap/cleaned/   (de-identified cleaned files)

What happens:

  • Processes both original/ and cleaned/ subdirectories (43 files each = 86 total)

  • Detects and replaces PHI/PII in all string fields

  • Creates 5,398 unique pseudonym mappings

  • Generates encrypted mapping table at results/deidentified/mappings/mappings.enc

  • Exports audit log at results/deidentified/Indo-vap/_deidentification_audit.json

Sample De-identification:

Before:

{
    "HHC1": "10200009B",
    "TST_DAT1": "2014-06-11 00:00:00",
    "TST_ENDAT1": "2014-06-14 00:00:00"
}

After:

{
    "HHC1": "[MRN-XTHM4A]",
    "TST_DAT1": "[DATE-A4A986]",
    "TST_ENDAT1": "[DATE-B3C874]"
}

Verification

βœ“ Detection for all PHI types

βœ“ Pseudonym consistency

βœ“ Date shifting and intervals

βœ“ Mapping storage and encryption

βœ“ Batch processing

βœ“ Validation

βœ“ Edge cases and error handling

Troubleshooting

Common Issues

β€œNo files matching β€˜*.jsonl’ found”

# Solution: Ensure extraction step completed first
python main.py --skip-deidentification  # Run extraction
python main.py --enable-deidentification --skip-extraction  # Then deidentify

Encryption error - β€œcryptography package not available”

# Solution: Install cryptography
pip install "cryptography>=41.0.0"

Validation fails on de-identified text

# Solution: Inspect the reported issues, then adjust detection order and exclusions
is_valid, issues = engine.validate_deidentification(text)
print(issues)

Dates not shifting consistently

# Solution: Enable interval preservation
config = DeidentificationConfig(
    enable_date_shifting=True,
    preserve_date_intervals=True
)

Custom patterns not detected

# Solution: Increase priority
custom_pattern = DetectionPattern(
    phi_type=PHIType.CUSTOM,
    pattern=your_detection_rule,
    priority=90  # Higher priority
)

Output directory structure different from input

# Solution: Ensure process_subdirs is enabled
stats = deidentify_dataset(
    input_dir="results/dataset/Indo-vap",
    output_dir="results/deidentified/Indo-vap",
    process_subdirs=True  # Must be True to preserve structure
)

β€œCould not parse date” warnings

# The tool uses smart multi-format date recognition
# Supported formats (auto-detected, original format preserved):
#   - YYYY-MM-DD: ISO 8601 standard (e.g., 2014-09-04)
#   - DD/MM/YYYY or MM/DD/YYYY: Slash-separated (e.g., 04/09/2014)
#   - DD-MM-YYYY or MM-DD-YYYY: Hyphen-separated (e.g., 04-09-2014)
#   - DD.MM.YYYY: Dot-separated European format (e.g., 04.09.2014)
#
# Format priority based on country:
#   - DD/MM/YYYY priority: India, UK, Australia, Indonesia, Brazil, South Africa, EU, Kenya, Nigeria, Ghana, Uganda
#   - MM/DD/YYYY priority: United States, Philippines, Canada
#
# Only truly unsupported formats are replaced with [DATE-HASH] placeholders
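
If a source column uses a format outside this list, one workaround is to normalize it to ISO 8601 before de-identification. A minimal sketch; the input format shown is only an example:

from datetime import datetime

def to_iso(date_str: str, source_format: str = "%d %B %Y") -> str:
    """Convert e.g. '04 September 2014' to '2014-09-04' (ISO 8601)."""
    return datetime.strptime(date_str, source_format).strftime("%Y-%m-%d")

to_iso("04 September 2014")   # -> '2014-09-04'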

Date format interpretation and preservation

The date shifter automatically tries multiple formats and preserves the original format:

For India (IN) with DD/MM/YYYY priority:
- Input: 04/09/2014 β†’ Interpreted as September 4, 2014 (DD/MM/YYYY)
- Output: 14/12/2013 (format preserved: DD/MM/YYYY)

- Input: 2014-09-04 β†’ Interpreted as September 4, 2014 (ISO 8601)
- Output: 2013-12-14 (format preserved: YYYY-MM-DD)

For United States (US) with MM/DD/YYYY priority:
- Input: 04/09/2014 β†’ Interpreted as April 9, 2014 (MM/DD/YYYY)
- Output: 10/23/2013 (format preserved: MM/DD/YYYY)

- Input: 2014-04-09 β†’ Interpreted as April 9, 2014 (ISO 8601)
- Output: 2013-10-23 (format preserved: YYYY-MM-DD)

For all countries:
- 2014-09-04 is interpreted as September 4, 2014 (ISO 8601, YYYY-MM-DD)
- Only dates that cannot be parsed in any supported format are replaced with a [DATE-HASH] pseudonym

Technical Reference

For complete technical details, including the key classes and functions, see the scripts.deidentify module documentation.

Migration Guide

Breaking Changes: None - The de-identification tool is fully backward compatible

New Features (Available in current version):

  1. Use Explicit Imports (Recommended):

    # Recommended import style
    from scripts.deidentify import DeidentificationEngine
    engine = DeidentificationEngine()
    
  2. Type Checking Benefits:

    If you use type checkers (mypy, pyright), you’ll get better type inference:

    # Type checkers now understand return types
    result: None = engine.save_mappings()  # Correctly inferred as None
    
  3. API Discovery:

    You can now see exactly what’s public:

    from scripts import deidentify
    print(deidentify.__all__)
    # ['PHIType', 'DetectionPattern', 'DeidentificationConfig', ...]
    

No Changes Required: All existing code continues to work without modification.

See Also

Related User Guides:

API & Technical References:

External Resources: