De-identification
For Users: Protecting Privacy
This feature helps you protect sensitive patient information by replacing real names, dates, and other personal details with safe placeholders. It follows medical privacy rules from 14 different countries.
Changed in version 0.3.0: Enhanced privacy protection with improved detection and better support for international regulations.
See also
For information about privacy rules in specific countries, see Country-Specific Privacy Rules.
Overview
The privacy protection feature can detect and replace 21 types of personal information including:
Names and Addresses: Patient names, street addresses, cities
Dates: Birth dates, admission dates, appointment dates
ID Numbers: Social security numbers, medical record numbers, account numbers
Contact Info: Phone numbers, email addresses, website URLs
Key features include:
Encrypted Storage: Fernet encryption for mapping tables with secure key management
Date Shifting: Preserves temporal relationships while shifting dates by consistent offset
Validation: Comprehensive validation to ensure no PHI leakage in de-identified output
Security: Built-in encryption, access control, and audit trail capabilities
Directory Structure Preservation: Maintains original file organization during batch processing
What's New in 0.8.6
- Better Privacy Protection:
Improved detection of sensitive information
More secure replacement methods
Easier to use with better error messages
- Enhanced Security:
Stronger encryption for mapping files
Better protection of patient information
Comprehensive audit trail for compliance
- Improved Documentation:
Clear examples for common tasks
Step-by-step privacy protection guides
Easy-to-follow security best practices
How It Works
Privacy protection happens automatically as part of the data processing:
Step 1: Data Extraction
Your Excel files are converted to a simpler format (JSONL):
results/dataset/Indo-vap/original/ - All your data preserved
results/dataset/Indo-vap/cleaned/ - Duplicate information removed
Step 2: Privacy Protection (Optional)
Both folders are protected while keeping the same structure:
results/deidentified/Indo-vap/original/ - Protected original files
results/deidentified/Indo-vap/cleaned/ - Protected cleaned files
results/deidentified/mappings/mappings.enc - Encrypted lookup table
results/deidentified/Indo-vap/_deidentification_audit.json - Record of changes
What You Get:
Consistent Replacement: The same name always gets the same safe placeholder
Secure Storage: Your lookup table is encrypted and protected
Same Organization: Protected files are organized exactly like your original files
Complete Record: Full audit trail of what was changed (without showing the original values)
Easy Review: You can verify the protection worked correctly
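The consistent-replacement guarantee can be sketched with a simple lookup table. The MappingTable class below is illustrative only, not the tool's API; the real tool derives placeholders cryptographically and stores the table encrypted.

```python
class MappingTable:
    """Illustrative sketch: one shared table ensures the same value
    always maps to the same placeholder."""

    def __init__(self) -> None:
        self._map: dict[str, str] = {}

    def placeholder_for(self, value: str, phi_type: str) -> str:
        # First sighting: mint a new placeholder; afterwards, reuse it
        if value not in self._map:
            self._map[value] = f"[{phi_type}-{len(self._map) + 1:04d}]"
        return self._map[value]

table = MappingTable()
first = table.placeholder_for("John Doe", "PATIENT")
again = table.placeholder_for("John Doe", "PATIENT")
assert first == again  # consistent replacement across the whole dataset
```

Because every file shares the one table, "John Doe" in file 1 and "John Doe" in file 40 receive the identical placeholder, which is what makes the de-identified dataset still usable for analysis.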
Getting Started
Basic Usage
To protect patient privacy in your data, run:
# Protect data for United States privacy rules
python main.py --enable-deidentification --countries US
# Protect data for multiple countries
python main.py --enable-deidentification --countries IN US GB
This will:
- Find and replace sensitive information like names, dates, and phone numbers
- Create protected versions of your files
- Save an encrypted lookup table so you can track changes
- Generate a report showing what was protected
What Gets Protected
The privacy feature protects 21 types of sensitive information (see Supported PHI/PII Types below). The public API can be imported as:
from scripts.deidentify import (
    MappingStore,            # Secure mapping storage
    DeidentificationEngine,  # Main engine
    # Top-level functions
    deidentify_dataset,      # Batch processing
    validate_dataset,        # Validation
)
What to Import:
For Basic Use: Import DeidentificationEngine and optionally DeidentificationConfig
For Batch Processing: Import deidentify_dataset and validate_dataset
For Advanced Use: Import specific classes like DateShifter, MappingStore, etc.
For Custom Patterns: Import PHIType and DetectionPattern
Example - Basic Usage:
from scripts.deidentify import DeidentificationEngine, DeidentificationConfig
# Configure with custom settings
config = DeidentificationConfig(
enable_date_shifting=True,
enable_encryption=True,
countries=['US', 'IN']
)
# Create engine
engine = DeidentificationEngine(config=config)
# De-identify text
text = "Patient John Doe, MRN: AB123456, DOB: 01/15/1980"
deidentified = engine.deidentify_text(text)
print(deidentified)
# Output: "Patient [PATIENT-A4B8], MRN: [MRN-X7Y2], DOB: [DATE-1980-01-15]"
Example - Batch Processing:
from scripts.deidentify import deidentify_dataset, validate_dataset
# Process entire dataset
stats = deidentify_dataset(
input_dir="data/patient_records",
output_dir="data/deidentified",
config=config
)
# Validate results
validation = validate_dataset(
dataset_dir="data/deidentified"
)
if validation['is_valid']:
    print("✓ No PHI detected in output")
else:
    print(f"✗ Found {len(validation['potential_phi_found'])} issues")
Example - Custom Patterns:
from scripts.deidentify import (
DeidentificationEngine,
PHIType,
DetectionPattern
)
import re
# Define custom pattern for employee IDs
custom_pattern = DetectionPattern(
phi_type=PHIType.CUSTOM,
pattern=re.compile(r'EMP-\d{6}'),
priority=85,
description="Employee ID format: EMP-XXXXXX"
)
# Use with engine
engine = DeidentificationEngine()
text = "Employee EMP-123456 accessed record"
deidentified = engine.deidentify_text(text, custom_patterns=[custom_pattern])
Basic Usage
from scripts.deidentify import DeidentificationEngine
# Initialize engine
engine = DeidentificationEngine()
# De-identify text
original = "Patient John Doe, MRN: 123456, DOB: 01/15/1980"
deidentified = engine.deidentify_text(original)
# Output: "Patient [PATIENT-A4B8], MRN: [MRN-X7Y2], DOB: [DATE-1980-01-15]"
# Save mappings
engine.save_mappings()
Batch Processing
from scripts.deidentify import deidentify_dataset
# Process entire dataset (maintains directory structure)
# Input directory contains: original/ and cleaned/ subdirectories
stats = deidentify_dataset(
input_dir="results/dataset/Indo-vap",
output_dir="results/deidentified/Indo-vap",
process_subdirs=True # Recursively process subdirectories
)
print(f"Processed {stats['texts_processed']} texts")
print(f"Detected {stats['total_detections']} PHI items")
# Output structure:
# results/deidentified/Indo-vap/
# ├── original/ (de-identified original files)
# ├── cleaned/ (de-identified cleaned files)
# └── _deidentification_audit.json
Command Line Interface
# Basic usage - processes subdirectories recursively
python -m scripts.deidentify \
--input-dir results/dataset/Indo-vap \
--output-dir results/deidentified/Indo-vap
# With validation
python -m scripts.deidentify \
--input-dir results/dataset/Indo-vap \
--output-dir results/deidentified/Indo-vap \
--validate
# Specify text fields
python -m scripts.deidentify \
--input-dir results/dataset/Indo-vap \
--output-dir results/deidentified/Indo-vap \
--text-fields patient_name notes diagnosis
# Disable encryption (not recommended)
python -m scripts.deidentify \
--input-dir results/dataset/Indo-vap \
--output-dir results/deidentified/Indo-vap \
--no-encryption
Pipeline Integration
The de-identification step processes both original/ and cleaned/ subdirectories
while maintaining the same file structure in the output directory.
# Enable de-identification in main pipeline
python main.py --enable-deidentification
# Skip de-identification
python main.py --enable-deidentification --skip-deidentification
# Disable encryption (not recommended for production)
python main.py --enable-deidentification --no-encryption
Output Directory Structure:
results/
├── dataset/
│   └── Indo-vap/
│       ├── original/ (extracted JSONL files)
│       └── cleaned/ (cleaned JSONL files)
├── deidentified/
│   ├── Indo-vap/
│   │   ├── original/ (de-identified original files)
│   │   ├── cleaned/ (de-identified cleaned files)
│   │   └── _deidentification_audit.json
│   └── mappings/
│       └── mappings.enc (encrypted mapping table)
└── data_dictionary_mappings/
Important
Version Control Best Practices
The .gitignore file is pre-configured with security best practices:
Safe to Track in Git:
✓ De-identified datasets (results/deidentified/Indo-vap/)
✓ Data dictionary mappings (results/data_dictionary_mappings/)
✓ Source code and documentation
Never Commit to Git:
✗ Original datasets with PHI (results/dataset/)
✗ De-identification mappings (results/deidentified/mappings/)
✗ Encryption keys (*.key, *.pem, *.fernet)
✗ Audit logs (*_deidentification_audit.json)
Always review git status before committing to ensure no PHI/PII files are staged.
Supported PHI/PII Types
The tool detects and de-identifies the following 21 PHI/PII types, covering the 18 HIPAA identifiers:
Names
First names
Last names
Full names
Medical Identifiers
Medical Record Numbers (MRN)
Account numbers
License/certificate numbers
Government Identifiers
Social Security Numbers (SSN)
Contact Information
Phone numbers (US and international formats)
Email addresses
Fax numbers
Geographic Information
Street addresses
Cities
States
ZIP codes
Temporal Information
Dates (all formats including DOB)
Ages over 89 (HIPAA requirement)
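The over-89 rule above can be sketched as a small post-processing step. The generalize_age helper is hypothetical, used only to illustrate the HIPAA Safe Harbor requirement that ages above 89 be collapsed into a single category:

```python
import re

def generalize_age(text: str) -> str:
    """Illustrative sketch: replace ages over 89 with a '90+' category,
    as HIPAA Safe Harbor requires."""
    def repl(match: re.Match) -> str:
        age = int(match.group(1))
        return "Age 90+" if age > 89 else match.group(0)
    return re.sub(r'Age\s+(\d{1,3})', repl, text)

print(generalize_age("Age 92, Age 45"))  # -> "Age 90+, Age 45"
```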
Technical Identifiers
Device identifiers
URLs
IP addresses (IPv4)
Custom Identifiers
Easy to extend with new detection rules
User-defined PHI types
Pseudonym Formats
Different PHI types use different pseudonym formats:

| PHI Type | Example Original | Pseudonym Format |
|---|---|---|
| Name | John Doe | [PATIENT-A4B8] |
| MRN | AB123456 | [MRN-X7Y2] |
| SSN | 123-45-6789 | |
| Phone | | |
| Date | 01/15/1980 | Shifted date or [DATE-HASH] |
| Address | 123 Main St | |
| ZIP | 12345 | |
| Age >89 | Age 92 | |
Configuration
Directory Structure Processing
The de-identification tool automatically processes subfolders to maintain the same file structure between input and output directories:
from scripts.deidentify import deidentify_dataset
# Process with subdirectories (default)
stats = deidentify_dataset(
input_dir="results/dataset/Indo-vap",
output_dir="results/deidentified/Indo-vap",
process_subdirs=True # Recursively process all subdirectories
)
# Process only top-level files (no subdirectories)
stats = deidentify_dataset(
input_dir="results/dataset/Indo-vap",
output_dir="results/deidentified/Indo-vap",
process_subdirs=False # Only process files in the root directory
)
Features:
Maintains relative directory structure in output
Processes both original/ and cleaned/ subdirectories
Creates output directories automatically
Preserves file naming conventions
Single mapping table shared across all subdirectories
DeidentificationConfig
import logging
from scripts.deidentify import DeidentificationConfig, DeidentificationEngine
config = DeidentificationConfig(
# Date shifting
enable_date_shifting=True,
date_shift_range_days=365,
preserve_date_intervals=True,
# Security
enable_encryption=True,
encryption_key=None, # Auto-generate if None
# Validation
enable_validation=True,
strict_mode=True,
# Logging
log_detections=True,
log_level=logging.INFO
)
engine = DeidentificationEngine(config=config)
Custom PHI Patterns
from scripts.deidentify import DetectionPattern, PHIType
import re
# Define custom pattern
custom_pattern = DetectionPattern(
phi_type=PHIType.CUSTOM,
pattern=re.compile(r'\bSTUDY-\d{4}\b'),
priority=85,
description="Custom Study ID format"
)
# Use in de-identification
deidentified = engine.deidentify_text(
text="Study ID: STUDY-1234",
custom_patterns=[custom_pattern]
)
Advanced Features
Date Shifting
Date shifting preserves temporal relationships while obscuring actual dates. The date shifter automatically uses intelligent multi-format parsing with country-specific defaults:
from scripts.deidentify import DateShifter
# For India (DD/MM/YYYY format priority)
shifter_india = DateShifter(
shift_range_days=365,
preserve_intervals=True,
country_code="IN"
)
# All dates shift by same offset, format preserved
date1 = shifter_india.shift_date("04/09/2014") # September 4, 2014 (DD/MM/YYYY)
date2 = shifter_india.shift_date("09/09/2014") # September 9, 2014
# Output: 14/12/2013, 19/12/2013 (5 days interval preserved)
# ISO 8601 format also supported
date3 = shifter_india.shift_date("2014-09-04") # September 4, 2014
# Output: 2013-12-14 (format preserved)
# For United States (MM/DD/YYYY format priority)
shifter_us = DateShifter(
shift_range_days=365,
preserve_intervals=True,
country_code="US"
)
date4 = shifter_us.shift_date("04/09/2014") # April 9, 2014 (MM/DD/YYYY)
# Output: Different interpretation due to country format
Supported Date Formats (auto-detected):
ISO 8601: YYYY-MM-DD (e.g., 2014-09-04) - International standard, all countries
Slash-separated: DD/MM/YYYY or MM/DD/YYYY (e.g., 04/09/2014)
Hyphen-separated: DD-MM-YYYY or MM-DD-YYYY (e.g., 04-09-2014)
Dot-separated: DD.MM.YYYY (e.g., 04.09.2014) - European format
Primary Format by Country:
DD/MM/YYYY (Day first): India (IN), UK (GB), Australia (AU), Indonesia (ID), Brazil (BR), South Africa (ZA), EU countries, Kenya (KE), Nigeria (NG), Ghana (GH), Uganda (UG)
MM/DD/YYYY (Month first): United States (US), Philippines (PH), Canada (CA)
Key Features:
Intelligent multi-format detection (tries multiple formats automatically)
Original format preservation (shifted dates maintain the input format)
Consistent offset across all dates in a dataset
Temporal relationships preserved (intervals between dates maintained)
Country-specific format priority
Fallback to [DATE-HASH] placeholder only if all formats fail
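The key features above can be sketched in a few lines. SimpleDateShifter is illustrative only, not the tool's DateShifter (which adds country-specific format priority and smart validation); it shows the core idea: one offset for every date, with the input format carried through to the output.

```python
from datetime import datetime, timedelta
import random

class SimpleDateShifter:
    """Illustrative sketch: shift all dates by one consistent random
    offset, preserving each date's original format."""

    # Formats tried in order; DD-first priority, as for e.g. India
    FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y", "%d.%m.%Y"]

    def __init__(self, shift_range_days: int = 365, seed=None):
        rng = random.Random(seed)
        # One offset reused for every date -> intervals are preserved
        self.offset = timedelta(days=rng.randint(-shift_range_days, shift_range_days))

    def shift_date(self, date_str: str) -> str:
        for fmt in self.FORMATS:
            try:
                parsed = datetime.strptime(date_str, fmt)
            except ValueError:
                continue
            return (parsed + self.offset).strftime(fmt)  # same format back
        return "[DATE-HASH]"  # fallback when no format matches

shifter = SimpleDateShifter(seed=42)
a = shifter.shift_date("04/09/2014")
b = shifter.shift_date("09/09/2014")
# a and b are still 5 days apart, and both stay in DD/MM/YYYY form
```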
Understanding Date Format Handling
Added in version 0.6.0: Improved date parsing with country-specific format priority and smart validation.
The date shifter uses an intelligent algorithm to handle ambiguous dates correctly:
The Ambiguity Problem
Dates like 08/09/2020 or 12/12/2012 can be interpreted in multiple ways:
| Date String | US Format (MM/DD) | India Format (DD/MM) | Ambiguity |
|---|---|---|---|
| 08/09/2020 | August 9, 2020 | September 8, 2020 | ⚠️ Both valid |
| 12/12/2012 | December 12, 2012 | December 12, 2012 | ⚠️ Symmetric date |
| 13/05/2020 | ❌ Invalid (no 13th month) | May 13, 2020 | ✓ Unambiguous |
| 05/25/2020 | May 25, 2020 | ❌ Invalid (no 25th month) | ✓ Unambiguous |
The Solution: Country-Based Priority with Smart Validation
The date shifter uses a three-step algorithm:
1. Try ISO 8601 First (YYYY-MM-DD): Always unambiguous, works for all countries
2. Try Country-Specific Format: Use the country's preferred interpretation
3. Smart Validation: Reject formats that are logically impossible
Algorithm Details:
# Example: Processing "13/05/2020" for India (DD/MM preference)
Step 1: Try ISO 8601 (YYYY-MM-DD)
        Result: ✗ Doesn't match pattern
Step 2: Try DD/MM/YYYY (India preference)
        Parse: ✓ Day=13, Month=05 (May 13, 2020)
        Validate: first_num=13 > 12 → valid (a day can be >12)
        Result: ✓ Success → May 13, 2020

# Example: Processing "13/05/2020" for USA (MM/DD preference)
Step 1: Try ISO 8601 (YYYY-MM-DD)
        Result: ✗ Doesn't match pattern
Step 2: Try MM/DD/YYYY (USA preference)
        Parse: ✗ Month=13 is invalid (only 12 months)
        Result: Parsing fails, try next format
Step 3: Try DD/MM/YYYY (fallback)
        Parse: ✓ Day=13, Month=05
        Result: ✓ Success → May 13, 2020
Smart Validation Rules:
If first number > 12 → it must be the day (can't be a month)
If second number > 12 → it must be the day (can't be a month)
If both numbers ≤ 12 → trust country preference (ambiguous case)
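The three rules above reduce to a small function. resolve_day_month is a hypothetical name used only for illustration; it returns the (day, month) interpretation the rules select:

```python
def resolve_day_month(first: int, second: int, prefer_day_first: bool) -> tuple:
    """Illustrative sketch of the smart-validation rules.
    Returns (day, month) for a 'first/second' date string."""
    if first > 12:            # first number can't be a month -> it is the day
        return (first, second)
    if second > 12:           # second number can't be a month -> it is the day
        return (second, first)
    # Both <= 12: genuinely ambiguous, trust the country preference
    return (first, second) if prefer_day_first else (second, first)

print(resolve_day_month(13, 5, prefer_day_first=False))  # (13, 5): day forced to 13
print(resolve_day_month(8, 9, prefer_day_first=True))    # (8, 9): DD/MM preference wins
```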
Examples by Country:
from scripts.deidentify import DateShifter
# India: DD/MM/YYYY preference
shifter_india = DateShifter(country_code="IN")
shifter_india.shift_date("2020-01-15") # ISO → always Jan 15, 2020
shifter_india.shift_date("13/05/2020") # Unambiguous → May 13, 2020
shifter_india.shift_date("08/09/2020") # Ambiguous → Sep 8, 2020 (DD/MM)
shifter_india.shift_date("12/12/2012") # Symmetric → Dec 12, 2012 (DD/MM)
# United States: MM/DD/YYYY preference
shifter_us = DateShifter(country_code="US")
shifter_us.shift_date("2020-01-15") # ISO → always Jan 15, 2020
shifter_us.shift_date("13/05/2020") # Unambiguous → May 13, 2020
shifter_us.shift_date("08/09/2020") # Ambiguous → Aug 9, 2020 (MM/DD)
shifter_us.shift_date("12/12/2012") # Symmetric → Dec 12, 2012 (MM/DD)
Best Practices:
Use ISO 8601 when possible (YYYY-MM-DD): Eliminates all ambiguity
Set country code correctly: Ensures consistent interpretation within your dataset
Validate output: Review shifted dates to ensure they make sense
Document format: Record which format your source data uses
Tip
For symmetric dates like 12/12/2012 or 01/01/2020, the interpretation
doesn't affect the result, since both formats yield the same date. However,
consistency is still maintained for audit purposes.
Warning
Mixing date formats within a single dataset (e.g., some files using DD/MM and others using MM/DD) can lead to inconsistent interpretations. Always use a consistent format across your dataset, preferably ISO 8601.
Encrypted Mapping Storage
Mapping tables are stored in a centralized location within the results/deidentified/mappings/
directory:
from cryptography.fernet import Fernet
from scripts.deidentify import DeidentificationConfig
# Generate and save key
encryption_key = Fernet.generate_key()
with open('encryption_key.bin', 'wb') as f:
f.write(encryption_key)
# Use encrypted storage
config = DeidentificationConfig(
enable_encryption=True,
encryption_key=encryption_key
)
engine = DeidentificationEngine(config=config)
# Mappings stored in: results/deidentified/mappings/mappings.enc
# This single mapping file is used across all datasets and subdirectories
Record De-identification
# De-identify structured records
record = {
"patient_name": "John Doe",
"mrn": "123456",
"notes": "Patient has diabetes. DOB: 01/15/1980",
"lab_value": 95.5 # Numeric field preserved
}
# Specify which fields to de-identify
deidentified = engine.deidentify_record(
record,
text_fields=["patient_name", "notes"]
)
Validation
# Validate de-identified text
is_valid, issues = engine.validate_deidentification(deidentified_text)
if not is_valid:
print(f"Validation failed! Issues: {issues}")
else:
print("✓ No PHI detected")
# Validate entire dataset (processes all subdirectories)
from scripts.deidentify import validate_dataset
validation_results = validate_dataset(
"results/deidentified/Indo-vap"
)
print(f"Valid: {validation_results['is_valid']}")
print(f"Issues: {len(validation_results['potential_phi_found'])}")
print(f"Files validated: {validation_results['total_files']}")
print(f"Records validated: {validation_results['total_records']}")
Security
Encryption
Mapping storage uses Fernet (symmetric encryption):
Encryption method: AES-128 in CBC mode
Key management: Separate from data files
Format: Base64-encoded encrypted JSON
Cryptographic Pseudonyms
Pseudonyms are generated using:
Hash method: SHA-256 hashing
Salt: Random or deterministic per session
Encoding: Base32 for readability
Property: Irreversible without mapping table
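The four properties above can be sketched in a few lines of stdlib Python. make_pseudonym is illustrative only; the real generator is scripts.deidentify.PseudonymGenerator, and the tag length here is an assumption chosen to match the examples elsewhere in this guide:

```python
import base64
import hashlib

def make_pseudonym(value: str, phi_type: str, salt: bytes) -> str:
    """Illustrative sketch: deterministic, irreversible pseudonym.
    SHA-256 over salt+value, Base32-encoded for readability."""
    digest = hashlib.sha256(salt + value.encode("utf-8")).digest()
    tag = base64.b32encode(digest).decode("ascii")[:4]  # short, readable tag
    return f"[{phi_type}-{tag}]"

salt = b"per-session-salt"
p = make_pseudonym("John Doe", "PATIENT", salt)
# Same input + same salt -> identical placeholder every time
assert p == make_pseudonym("John Doe", "PATIENT", salt)
```

Because only a truncated hash is stored in the output, the original value cannot be recovered from the placeholder; re-identification requires the (encrypted) mapping table.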
Best Practices
Protect Encryption Keys
Store keys separately from mapping files
Use key management systems in production
Rotate keys periodically
Enable Validation
Always validate after de-identification
Manual review of sample outputs
Regular updates to detection rules
Audit Logging
Enable comprehensive logging
Monitor for validation failures
Track mapping usage
Access Control
Restrict access to mapping files
Separate re-identification permissions
Log all mapping exports
HIPAA Compliance
The tool follows HIPAA Safe Harbor requirements:
✓ Removes all 18 HIPAA identifiers
✓ Ages over 89 handled appropriately
✓ Geographic subdivisions (ZIP codes) de-identified
✓ Dates shifted to preserve intervals
✓ No re-identification without authorization
Performance
Benchmarks
Typical performance on modern hardware:
Text Processing: ~1,000 records/second
Detection Speed: ~500 KB/second
Mapping Lookup: O(1) average case
Encryption Overhead: ~5-10% slowdown
Optimization Tips
Batch Processing: Process files in parallel
Detection Order: Put common items first
Caching: Pseudonyms cached automatically
Validation: Disable in production if pre-validated
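The parallel-processing tip can be sketched with a thread pool. deidentify_file below is a stand-in worker, not the tool's API; a real worker would run DeidentificationEngine over each record and write the result, sharing one mapping table behind a lock:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def deidentify_file(path: Path) -> int:
    """Stand-in worker: count records in one JSONL file.
    A real worker would de-identify each line and write it out."""
    with path.open() as f:
        return sum(1 for _ in f)

def process_in_parallel(paths, workers: int = 4) -> int:
    """Files are independent, so they can be processed concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(deidentify_file, paths))
```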
Examples
See scripts/deidentify.py --help for command-line usage:
python -m scripts.deidentify --help
Examples include:
Basic text de-identification
Consistent pseudonyms
Structured record de-identification
Custom patterns
Date shifting
Batch processing
Validation workflow
Mapping management
Security features
Testing
The de-identification tool can be tested using the main process:
# Test on a small dataset
python main.py --enable-deidentification
Expected Output
When processing the Indo-vap dataset:
De-identifying files: 100%|██████████| 86/86 [00:08<00:00, 10.34it/s]
INFO:reportalin:De-identification complete:
INFO:reportalin: Texts processed: 1854110
INFO:reportalin: Total detections: 365620
INFO:reportalin: Unique mappings: 5398
INFO:reportalin: Output structure:
INFO:reportalin: - results/deidentified/Indo-vap/original/ (de-identified original files)
INFO:reportalin: - results/deidentified/Indo-vap/cleaned/ (de-identified cleaned files)
What happens:
Processes both original/ and cleaned/ subdirectories (43 files each = 86 total)
Detects and replaces PHI/PII in all string fields
Creates 5,398 unique pseudonym mappings
Generates encrypted mapping table at results/deidentified/mappings/mappings.enc
Exports audit log at results/deidentified/Indo-vap/_deidentification_audit.json
Sample De-identification:
Before:
{
"HHC1": "10200009B",
"TST_DAT1": "2014-06-11 00:00:00",
"TST_ENDAT1": "2014-06-14 00:00:00"
}
After:
{
"HHC1": "[MRN-XTHM4A]",
"TST_DAT1": "[DATE-A4A986]",
"TST_ENDAT1": "[DATE-B3C874]"
}
Verification
✓ Detection for all PHI types
✓ Pseudonym consistency
✓ Date shifting and intervals
✓ Mapping storage and encryption
✓ Batch processing
✓ Validation
✓ Edge cases and error handling
Troubleshooting
Common Issues
"No files matching '*.jsonl' found"
# Solution: Ensure extraction step completed first
python main.py --skip-deidentification # Run extraction
python main.py --enable-deidentification --skip-extraction # Then deidentify
Encryption error - "cryptography package not available"
# Solution: Install cryptography (quoted so the shell doesn't treat >= as redirection)
pip install "cryptography>=41.0.0"
Validation fails on de-identified text
# Solution: Check detection order and exclusions
engine.validate_deidentification(text)
Dates not shifting consistently
# Solution: Enable interval preservation
config = DeidentificationConfig(
enable_date_shifting=True,
preserve_date_intervals=True
)
Custom patterns not detected
# Solution: Increase priority
custom_pattern = DetectionPattern(
phi_type=PHIType.CUSTOM,
pattern=your_detection_rule,
priority=90 # Higher priority
)
Output directory structure different from input
# Solution: Ensure process_subdirs is enabled
stats = deidentify_dataset(
input_dir="results/dataset/Indo-vap",
output_dir="results/deidentified/Indo-vap",
process_subdirs=True # Must be True to preserve structure
)
"Could not parse date" warnings
# The tool uses smart multi-format date recognition
# Supported formats (auto-detected, original format preserved):
# - YYYY-MM-DD: ISO 8601 standard (e.g., 2014-09-04)
# - DD/MM/YYYY or MM/DD/YYYY: Slash-separated (e.g., 04/09/2014)
# - DD-MM-YYYY or MM-DD-YYYY: Hyphen-separated (e.g., 04-09-2014)
# - DD.MM.YYYY: Dot-separated European format (e.g., 04.09.2014)
#
# Format priority based on country:
# - DD/MM/YYYY priority: India, UK, Australia, Indonesia, Brazil, South Africa, EU, Kenya, Nigeria, Ghana, Uganda
# - MM/DD/YYYY priority: United States, Philippines, Canada
#
# Only truly unsupported formats are replaced with [DATE-HASH] placeholders
Date format interpretation and preservation
The date shifter automatically tries multiple formats and preserves the original format:
For India (IN) with DD/MM/YYYY priority:
- Input: 04/09/2014 → interpreted as September 4, 2014 (DD/MM/YYYY)
- Output: 14/12/2013 (format preserved: DD/MM/YYYY)
- Input: 2014-09-04 → interpreted as September 4, 2014 (ISO 8601)
- Output: 2013-12-14 (format preserved: YYYY-MM-DD)
For United States (US) with MM/DD/YYYY priority:
- Input: 04/09/2014 → interpreted as April 9, 2014 (MM/DD/YYYY)
- Output: 10/23/2013 (format preserved: MM/DD/YYYY)
- Input: 2014-04-09 → interpreted as April 9, 2014 (ISO 8601)
- Output: 2013-10-23 (format preserved: YYYY-MM-DD)
For all countries:
- 2014-09-04 is interpreted as September 4, 2014 (YYYY-MM-DD)
- Dates that cannot be parsed in any supported format are replaced with a [DATE-HASH] pseudonym
Technical Reference
For complete technical details, see the scripts.deidentify module documentation.
Key Classes
scripts.deidentify.DeidentificationEngine - Main processing engine
scripts.deidentify.PseudonymGenerator - Pseudonym generation
scripts.deidentify.DateShifter - Date shifting
scripts.deidentify.MappingStore - Encrypted storage
scripts.deidentify.PatternLibrary - PHI patterns
Key Functions
scripts.deidentify.deidentify_dataset() - Batch processing
scripts.deidentify.validate_dataset() - Dataset validation
Migration Guide
Breaking Changes: None - The de-identification tool is fully backward compatible
New Features (Available in current version):
Use Explicit Imports (Recommended):
# Recommended import style
from scripts.deidentify import DeidentificationEngine
engine = DeidentificationEngine()
Type Checking Benefits:
If you use type checkers (mypy, pyright), you'll get better type inference:
# Type checkers now understand return types
result: None = engine.save_mappings()  # Correctly inferred as None
API Discovery:
You can now see exactly what's public:
from scripts import deidentify
print(deidentify.__all__)
# ['PHIType', 'DetectionPattern', 'DeidentificationConfig', ...]
No Changes Required: All existing code continues to work without modification.
See Also
Related User Guides:
Quick Start - Getting started with RePORTaLiN
Usage Guide - General usage guide and examples
Configuration - De-identification configuration options
Country-Specific Privacy Rules - Country-specific privacy compliance
Troubleshooting - Common issues and solutions
API & Technical References:
scripts.deidentify module - Complete technical documentation
Contributing - Best practices for error handling and design
Extending RePORTaLiN - Extending de-identification features
Changelog - Version 0.0.6 changelog
External Resources:
HIPAA Safe Harbor Method - Official HIPAA de-identification guidance
GDPR Article 4(5) - GDPR pseudonymization definition
DPDPA 2023 (India) - Indian data protection regulations