De-identification
=================

**For Users: Protecting Privacy**

This feature helps you protect sensitive patient information by replacing real names, dates, 
and other personal details with safe placeholders. It follows medical privacy rules from 
14 different countries.

.. versionchanged:: 0.3.0
   Enhanced privacy protection with improved detection and better support for international regulations.

.. seealso::
   For information about privacy rules in specific countries, see :doc:`country_regulations`.

Overview
--------

The privacy protection feature can detect and replace 21 types of personal information, including:

* **Names and Addresses**: Patient names, street addresses, cities
* **Dates**: Birth dates, admission dates, appointment dates
* **ID Numbers**: Social security numbers, medical record numbers, account numbers
* **Contact Info**: Phone numbers, email addresses, website URLs

It also provides:

* **Encrypted Storage**: Fernet encryption for mapping tables with secure key management
* **Date Shifting**: Preserves temporal relationships by shifting all dates by a consistent offset
* **Validation**: Checks de-identified output for remaining PHI
* **Security**: Built-in encryption, access control, and audit trail capabilities
* **Directory Structure Preservation**: Maintains original file organization during batch processing

What's New in |version|
~~~~~~~~~~~~~~~~~~~~~~~~

**Better Privacy Protection**:
  - Improved detection of sensitive information
  - More secure replacement methods
  - Easier to use with better error messages

**Enhanced Security**:
  - Stronger encryption for mapping files
  - Better protection of patient information
  - Comprehensive audit trail for compliance

**Improved Documentation**:
  - Clear examples for common tasks
  - Step-by-step privacy protection guides
  - Easy-to-follow security best practices

How It Works
------------

When enabled, privacy protection runs as part of the data processing pipeline:

**Step 1: Data Extraction**

Your Excel files are converted to a simpler format (JSONL):

* ``results/dataset/Indo-vap/original/`` - All your data preserved
* ``results/dataset/Indo-vap/cleaned/`` - Duplicate information removed

**Step 2: Privacy Protection** (Optional)

Both folders are protected while keeping the same structure:

* ``results/deidentified/Indo-vap/original/`` - Protected original files
* ``results/deidentified/Indo-vap/cleaned/`` - Protected cleaned files
* ``results/deidentified/mappings/mappings.enc`` - Encrypted lookup table
* ``results/deidentified/Indo-vap/_deidentification_audit.json`` - Record of changes

**What You Get:**

1. **Consistent Replacement**: The same name always gets the same safe placeholder
2. **Secure Storage**: Your lookup table is encrypted and protected
3. **Same Organization**: Protected files are organized exactly like your original files
4. **Complete Record**: Full audit trail of what was changed (without showing the original values)
5. **Easy Review**: You can verify the protection worked correctly

Getting Started
---------------

Basic Usage
~~~~~~~~~~~

To protect patient privacy in your data, run:

.. code-block:: bash

   # Protect data for United States privacy rules
   python main.py --enable-deidentification --countries US

   # Protect data for multiple countries
   python main.py --enable-deidentification --countries IN US GB

This will:

- Find and replace sensitive information like names, dates, and phone numbers
- Create protected versions of your files
- Save an encrypted lookup table so you can track changes
- Generate a report showing what was protected

What Gets Protected
~~~~~~~~~~~~~~~~~~~

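The privacy feature protects 21 types of sensitive information; the full list
appears under *Supported PHI/PII Types* below.

The same functionality is available from Python through the ``scripts.deidentify``
module, for example:

.. code-block:: python

    from scripts.deidentify import (
        MappingStore,               # Secure mapping storage
        DeidentificationEngine,     # Main engine

        # Top-level functions
        deidentify_dataset,         # Batch processing
        validate_dataset,           # Validation
    )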

**What to Import**:

- **For Basic Use**: Import ``DeidentificationEngine`` and optionally ``DeidentificationConfig``
- **For Batch Processing**: Import ``deidentify_dataset`` and ``validate_dataset``
- **For Advanced Use**: Import specific classes like ``DateShifter``, ``MappingStore``, etc.
- **For Custom Patterns**: Import ``PHIType`` and ``DetectionPattern``

**Example - Basic Usage**:

.. code-block:: python

    from scripts.deidentify import DeidentificationEngine, DeidentificationConfig
    
    # Configure with custom settings
    config = DeidentificationConfig(
        enable_date_shifting=True,
        enable_encryption=True,
        countries=['US', 'IN']
    )
    
    # Create engine
    engine = DeidentificationEngine(config=config)
    
    # De-identify text
    text = "Patient John Doe, MRN: AB123456, DOB: 01/15/1980"
    deidentified = engine.deidentify_text(text)
    print(deidentified)
    # Example output (pseudonym suffixes and the shifted date vary per run):
    # "Patient [PATIENT-A4B8C2], MRN: [MRN-X7Y2Z9], DOB: <shifted date>"

**Example - Batch Processing**:

.. code-block:: python

    from scripts.deidentify import (
        DeidentificationConfig,
        deidentify_dataset,
        validate_dataset,
    )

    config = DeidentificationConfig(countries=['US', 'IN'])

    # Process entire dataset
    stats = deidentify_dataset(
        input_dir="data/patient_records",
        output_dir="data/deidentified",
        config=config
    )
    
    # Validate results
    validation = validate_dataset(
        dataset_dir="data/deidentified"
    )
    
    if validation['is_valid']:
        print("✓ No PHI detected in output")
    else:
        print(f"⚠ Found {len(validation['potential_phi_found'])} issues")

**Example - Custom Patterns**:

.. code-block:: python

    from scripts.deidentify import (
        DeidentificationEngine,
        PHIType,
        DetectionPattern
    )
    import re
    
    # Define custom pattern for employee IDs
    custom_pattern = DetectionPattern(
        phi_type=PHIType.CUSTOM,
        pattern=re.compile(r'EMP-\d{6}'),
        priority=85,
        description="Employee ID format: EMP-XXXXXX"
    )
    
    # Use with engine
    engine = DeidentificationEngine()
    text = "Employee EMP-123456 accessed record"
    deidentified = engine.deidentify_text(text, custom_patterns=[custom_pattern])

Basic Usage
~~~~~~~~~~~

.. code-block:: python

    from scripts.deidentify import DeidentificationEngine

    # Initialize engine
    engine = DeidentificationEngine()

    # De-identify text
    original = "Patient John Doe, MRN: 123456, DOB: 01/15/1980"
    deidentified = engine.deidentify_text(original)
    # Example output (pseudonym suffixes vary per run):
    # "Patient [PATIENT-A4B8C2], MRN: [MRN-X7Y2Z9], DOB: <shifted date or [DATE-...] placeholder>"

    # Save mappings
    engine.save_mappings()

Batch Processing
~~~~~~~~~~~~~~~~

.. code-block:: python

    from scripts.deidentify import deidentify_dataset

    # Process entire dataset (maintains directory structure)
    # Input directory contains: original/ and cleaned/ subdirectories
    stats = deidentify_dataset(
        input_dir="results/dataset/Indo-vap",
        output_dir="results/deidentified/Indo-vap",
        process_subdirs=True  # Recursively process subdirectories
    )

    print(f"Processed {stats['texts_processed']} texts")
    print(f"Detected {stats['total_detections']} PHI items")
    
    # Output structure:
    # results/deidentified/Indo-vap/
    #   ├── original/          (de-identified original files)
    #   ├── cleaned/           (de-identified cleaned files)
    #   └── _deidentification_audit.json

Command Line Interface
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

    # Basic usage - processes subdirectories recursively
    python -m scripts.deidentify \
        --input-dir results/dataset/Indo-vap \
        --output-dir results/deidentified/Indo-vap

    # With validation
    python -m scripts.deidentify \
        --input-dir results/dataset/Indo-vap \
        --output-dir results/deidentified/Indo-vap \
        --validate

    # Specify text fields
    python -m scripts.deidentify \
        --input-dir results/dataset/Indo-vap \
        --output-dir results/deidentified/Indo-vap \
        --text-fields patient_name notes diagnosis
        
    # Disable encryption (not recommended)
    python -m scripts.deidentify \
        --input-dir results/dataset/Indo-vap \
        --output-dir results/deidentified/Indo-vap \
        --no-encryption

Pipeline Integration
~~~~~~~~~~~~~~~~~~~~

The de-identification step processes both ``original/`` and ``cleaned/`` subdirectories
while maintaining the same file structure in the output directory.

.. code-block:: bash

    # Enable de-identification in main pipeline
    python main.py --enable-deidentification

    # Skip de-identification
    python main.py --enable-deidentification --skip-deidentification
    
    # Disable encryption (not recommended for production)
    python main.py --enable-deidentification --no-encryption

**Output Directory Structure:**

.. code-block:: text

    results/
    ├── dataset/
    │   └── Indo-vap/
    │       ├── original/        (extracted JSONL files)
    │       └── cleaned/         (cleaned JSONL files)
    ├── deidentified/
    │   ├── Indo-vap/
    │   │   ├── original/        (de-identified original files)
    │   │   ├── cleaned/         (de-identified cleaned files)
    │   │   └── _deidentification_audit.json
    │   └── mappings/
    │       └── mappings.enc     (encrypted mapping table)
    └── data_dictionary_mappings/

.. important::
   **Version Control Best Practices**
   
   The ``.gitignore`` file is pre-configured with security best practices:
   
   **Safe to Track in Git:**
   
   * ✅ De-identified datasets (``results/deidentified/Indo-vap/``), excluding the audit log
   * ✅ Data dictionary mappings (``results/data_dictionary_mappings/``)
   * ✅ Source code and documentation
   
   **Never Commit to Git:**
   
   * ❌ Original datasets with PHI (``results/dataset/``)
   * ❌ Deidentification mappings (``results/deidentified/mappings/``)
   * ❌ Encryption keys (``*.key``, ``*.pem``, ``*.fernet``)
   * ❌ Audit logs (``*_deidentification_audit.json``)
   
   Always review ``git status`` before committing to ensure no PHI/PII files are staged.

Supported PHI/PII Types
-----------------------

The tool detects and de-identifies the following 21 PHI/PII types, which are based on the HIPAA Safe Harbor identifier list:

Names
~~~~~

* First names
* Last names
* Full names

Medical Identifiers
~~~~~~~~~~~~~~~~~~~

* Medical Record Numbers (MRN)
* Account numbers
* License/certificate numbers

Government Identifiers
~~~~~~~~~~~~~~~~~~~~~~

* Social Security Numbers (SSN)

Contact Information
~~~~~~~~~~~~~~~~~~~

* Phone numbers (US and international formats)
* Email addresses
* Fax numbers

Geographic Information
~~~~~~~~~~~~~~~~~~~~~~

* Street addresses
* Cities
* States
* ZIP codes

Temporal Information
~~~~~~~~~~~~~~~~~~~~

* Dates (all formats including DOB)
* Ages over 89 (HIPAA requirement)

Technical Identifiers
~~~~~~~~~~~~~~~~~~~~~

* Device identifiers
* URLs
* IP addresses (IPv4)

Custom Identifiers
~~~~~~~~~~~~~~~~~~

* Easy to extend with new detection rules
* User-defined PHI types

Pseudonym Formats
-----------------

Different PHI types use different pseudonym formats:

.. list-table::
   :header-rows: 1
   :widths: 20 30 50

   * - PHI Type
     - Example Original
     - Pseudonym Format
   * - Name
     - John Doe
     - ``[PATIENT-A4B8C2]``
   * - MRN
     - AB123456
     - ``[MRN-X7Y2Z9]``
   * - SSN
     - 123-45-6789
     - ``[SSN-Q3W8E5]``
   * - Phone
     - (555) 123-4567
     - ``[PHONE-E5R7T9]``
   * - Email
     - patient@example.com
     - ``[EMAIL-T9Y3U8]``
   * - Date
     - 01/15/1980
     - Shifted date in the original format, or ``[DATE-A4A986]`` if no format matches
   * - Address
     - 123 Main St
     - ``[STREET-Z2X5C8]``
   * - ZIP
     - 12345
     - ``[ZIP-K9L4M7]``
   * - Age >89
     - Age 92
     - ``[AGE-K4L8P6]``

Configuration
-------------

Directory Structure Processing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The de-identification tool automatically processes subfolders to maintain 
the same file structure between input and output directories:

.. code-block:: python

    from scripts.deidentify import deidentify_dataset

    # Process with subdirectories (default)
    stats = deidentify_dataset(
        input_dir="results/dataset/Indo-vap",
        output_dir="results/deidentified/Indo-vap",
        process_subdirs=True  # Recursively process all subdirectories
    )
    
    # Process only top-level files (no subdirectories)
    stats = deidentify_dataset(
        input_dir="results/dataset/Indo-vap",
        output_dir="results/deidentified/Indo-vap",
        process_subdirs=False  # Only process files in the root directory
    )

**Features:**

* Maintains relative directory structure in output
* Processes both ``original/`` and ``cleaned/`` subdirectories
* Creates output directories automatically
* Preserves file naming conventions
* Single mapping table shared across all subdirectories
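
The structure-preserving behaviour listed above can be illustrated with a short,
standalone sketch. It does not use the tool's own code: ``mirror_jsonl_tree`` and
its ``transform`` argument are hypothetical names, and ``deidentify_dataset`` itself
may be implemented differently.

.. code-block:: python

    from pathlib import Path
    from typing import Callable


    def mirror_jsonl_tree(
        input_dir: Path,
        output_dir: Path,
        transform: Callable[[str], str],
        process_subdirs: bool = True,
    ) -> int:
        """Apply ``transform`` to each line of every .jsonl file and write the
        result to the same relative path under ``output_dir`` (hypothetical helper)."""
        # "**/*.jsonl" also matches files in the root directory itself
        pattern = "**/*.jsonl" if process_subdirs else "*.jsonl"
        written = 0
        for source in sorted(input_dir.glob(pattern)):
            # Keep the path relative to the input root, e.g. original/file.jsonl
            target = output_dir / source.relative_to(input_dir)
            target.parent.mkdir(parents=True, exist_ok=True)
            lines = source.read_text(encoding="utf-8").splitlines()
            target.write_text(
                "".join(transform(line) + "\n" for line in lines),
                encoding="utf-8",
            )
            written += 1
        return written

Because the output path is derived from the path relative to the input root,
``original/`` and ``cleaned/`` keep their names and contents layout without any
per-folder configuration.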

DeidentificationConfig
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    import logging

    from scripts.deidentify import DeidentificationConfig, DeidentificationEngine

    config = DeidentificationConfig(
        # Date shifting
        enable_date_shifting=True,
        date_shift_range_days=365,
        preserve_date_intervals=True,
        
        # Security
        enable_encryption=True,
        encryption_key=None,  # Auto-generate if None
        
        # Validation
        enable_validation=True,
        strict_mode=True,
        
        # Logging
        log_detections=True,
        log_level=logging.INFO
    )

    engine = DeidentificationEngine(config=config)

Custom PHI Patterns
~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    import re

    from scripts.deidentify import DeidentificationEngine, DetectionPattern, PHIType

    engine = DeidentificationEngine()

    # Define custom pattern
    custom_pattern = DetectionPattern(
        phi_type=PHIType.CUSTOM,
        pattern=re.compile(r'\bSTUDY-\d{4}\b'),
        priority=85,
        description="Custom Study ID format"
    )

    # Use in de-identification
    deidentified = engine.deidentify_text(
        text="Study ID: STUDY-1234",
        custom_patterns=[custom_pattern]
    )

Advanced Features
-----------------

Date Shifting
~~~~~~~~~~~~~

Date shifting preserves temporal relationships while obscuring actual dates.
The date shifter tries several date formats in turn, starting with the format preferred for the configured country:

.. code-block:: python

    from scripts.deidentify import DateShifter

    # For India (DD/MM/YYYY format priority)
    shifter_india = DateShifter(
        shift_range_days=365,
        preserve_intervals=True,
        country_code="IN"
    )

    # All dates shift by the same offset and keep their input format
    date1 = shifter_india.shift_date("04/09/2014")  # September 4, 2014 (DD/MM/YYYY)
    date2 = shifter_india.shift_date("09/09/2014")  # September 9, 2014
    # Example output: 14/12/2013, 19/12/2013 (5-day interval preserved; the offset varies)

    # ISO 8601 format is also supported
    date3 = shifter_india.shift_date("2014-09-04")  # September 4, 2014
    # Example output: 2013-12-14 (format preserved)

    # For United States (MM/DD/YYYY format priority)
    shifter_us = DateShifter(
        shift_range_days=365,
        preserve_intervals=True,
        country_code="US"
    )

    date4 = shifter_us.shift_date("04/09/2014")  # April 9, 2014 (MM/DD/YYYY)
    # Parsed month-first, so the shifted result differs from the India example

**Supported Date Formats** (auto-detected):

* **ISO 8601**: ``YYYY-MM-DD`` (e.g., 2014-09-04) - International standard, all countries
* **Slash-separated**: ``DD/MM/YYYY`` or ``MM/DD/YYYY`` (e.g., 04/09/2014)
* **Hyphen-separated**: ``DD-MM-YYYY`` or ``MM-DD-YYYY`` (e.g., 04-09-2014)
* **Dot-separated**: ``DD.MM.YYYY`` (e.g., 04.09.2014) - European format

**Primary Format by Country:**

* **DD/MM/YYYY** (Day first): India (IN), UK (GB), Australia (AU), Indonesia (ID), 
  Brazil (BR), South Africa (ZA), EU countries, Kenya (KE), Nigeria (NG), 
  Ghana (GH), Uganda (UG)
* **MM/DD/YYYY** (Month first): United States (US), Philippines (PH), Canada (CA)

**Key Features:**

* Intelligent multi-format detection (tries multiple formats automatically)
* Original format preservation (shifted dates maintain the input format)
* Consistent offset across all dates in a dataset
* Temporal relationships preserved (intervals between dates maintained)
* Country-specific format priority
* Fallback to [DATE-HASH] placeholder only if all formats fail
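
As a rough illustration of the consistent-offset and format-preservation ideas
(not the tool's actual ``DateShifter``), the sketch below applies one fixed offset
to every date and re-emits each date in the format it was parsed with.
``FixedOffsetShifter`` and its parameters are assumptions made for this example;
the offset is passed in explicitly so that the output is reproducible.

.. code-block:: python

    from datetime import datetime, timedelta


    class FixedOffsetShifter:
        """Hypothetical helper: shift every date by the same number of days."""

        def __init__(self, offset_days, formats=("%Y-%m-%d", "%d/%m/%Y")):
            self.offset = timedelta(days=offset_days)
            self.formats = formats  # tried in order; ISO 8601 first

        def shift_date(self, value):
            for fmt in self.formats:
                try:
                    parsed = datetime.strptime(value, fmt)
                except ValueError:
                    continue  # not this format, try the next one
                # Re-emit the shifted date in the format it was read with
                return (parsed + self.offset).strftime(fmt)
            raise ValueError(f"Unsupported date format: {value!r}")


    shifter = FixedOffsetShifter(offset_days=-264)
    print(shifter.shift_date("04/09/2014"))  # 14/12/2013
    print(shifter.shift_date("09/09/2014"))  # 19/12/2013 (still 5 days apart)
    print(shifter.shift_date("2014-09-04"))  # 2013-12-14

Since the same offset is added to every date, the difference between any two
shifted dates equals the difference between the originals. The sketch raises an
error for unsupported input, whereas the tool is described as falling back to a
placeholder.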

Understanding Date Format Handling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. versionadded:: 0.6.0
   Improved date parsing with country-specific format priority and smart validation.

The date shifter resolves ambiguous dates by combining the configured country's format preference with validity checks:

**The Ambiguity Problem**

Dates like ``08/09/2020`` or ``12/12/2012`` can be interpreted in multiple ways:

.. list-table::
   :header-rows: 1
   :widths: 20 25 25 30

   * - Date String
     - US Format (MM/DD)
     - India Format (DD/MM)
     - Ambiguity
   * - ``08/09/2020``
     - August 9, 2020
     - September 8, 2020
     - ⚠️ Both valid
   * - ``12/12/2012``
     - December 12, 2012
     - December 12, 2012
     - ⚠️ Symmetric date
   * - ``13/05/2020``
     - ❌ Invalid (no 13th month)
     - May 13, 2020
     - ✅ Unambiguous
   * - ``05/25/2020``
     - May 25, 2020
     - ❌ Invalid (no 25th month)
     - ✅ Unambiguous

**The Solution: Country-Based Priority with Smart Validation**

The date shifter uses a three-step algorithm:

1. **Try ISO 8601 First** (``YYYY-MM-DD``): Always unambiguous, works for all countries
2. **Try Country-Specific Format**: Use the country's preferred interpretation
3. **Smart Validation**: Reject formats that are logically impossible

**Algorithm Details:**

.. code-block:: text

    # Example: Processing "13/05/2020" for India (DD/MM preference)
    
    Step 1: Try ISO 8601 (YYYY-MM-DD)
      Result: ❌ Doesn't match pattern
    
    Step 2: Try DD/MM/YYYY (India preference)
      Parse: ✅ Day=13, Month=05 (May 13, 2020)
      Validate: first_num=13 > 12 ✅ Valid (day can be >12)
      Result: ✅ Success! → May 13, 2020
    
    # Example: Processing "13/05/2020" for USA (MM/DD preference)
    
    Step 1: Try ISO 8601 (YYYY-MM-DD)
      Result: ❌ Doesn't match pattern
    
    Step 2: Try MM/DD/YYYY (USA preference)
      Parse: ❌ Month=13 is invalid (only 12 months)
      Result: Parsing fails, try next format
    
    Step 3: Try DD/MM/YYYY (fallback)
      Parse: ✅ Day=13, Month=05
      Result: ✅ Success! → May 13, 2020

**Smart Validation Rules:**

* If first number > 12 → **Must be day** (can't be month)
* If second number > 12 → **Must be day** (can't be month)  
* If both numbers ≤ 12 → **Trust country preference** (ambiguous case)
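
The steps and validation rules above can be sketched in plain Python. This is an
illustrative reimplementation, not the tool's source: ``parse_ambiguous_date`` and
``MONTH_FIRST_COUNTRIES`` are hypothetical names, and only ISO 8601 and
slash-separated dates are handled.

.. code-block:: python

    from datetime import date, datetime

    # Assumption for this sketch: the countries listed in this guide as
    # MM/DD/YYYY-first; every other country code is treated as day-first.
    MONTH_FIRST_COUNTRIES = {"US", "PH", "CA"}


    def parse_ambiguous_date(value: str, country_code: str) -> date:
        # Step 1: ISO 8601 is unambiguous for every country.
        try:
            return datetime.strptime(value, "%Y-%m-%d").date()
        except ValueError:
            pass

        # Step 2: choose the country's preferred order, keep the other as fallback.
        if country_code.upper() in MONTH_FIRST_COUNTRIES:
            candidates = ("%m/%d/%Y", "%d/%m/%Y")
        else:
            candidates = ("%d/%m/%Y", "%m/%d/%Y")

        # Step 3: strptime rejects impossible readings (for example month 13),
        # so an invalid preferred reading falls through to the other order.
        for fmt in candidates:
            try:
                return datetime.strptime(value, fmt).date()
            except ValueError:
                continue
        raise ValueError(f"Could not parse date: {value!r}")


    print(parse_ambiguous_date("08/09/2020", "IN"))  # 2020-09-08
    print(parse_ambiguous_date("08/09/2020", "US"))  # 2020-08-09
    print(parse_ambiguous_date("13/05/2020", "US"))  # 2020-05-13

When both numbers are 12 or less, both readings are valid and the country
preference alone decides, which is why the country code has to match the source data.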

**Examples by Country:**

.. code-block:: python

    from scripts.deidentify import DateShifter
    
    # India: DD/MM/YYYY preference
    shifter_india = DateShifter(country_code="IN")
    
    shifter_india.shift_date("2020-01-15")   # ISO → Always Jan 15, 2020
    shifter_india.shift_date("13/05/2020")   # Unambiguous → May 13, 2020
    shifter_india.shift_date("08/09/2020")   # Ambiguous → Sep 8, 2020 (DD/MM)
    shifter_india.shift_date("12/12/2012")   # Symmetric → Dec 12, 2012 (DD/MM)
    
    # United States: MM/DD/YYYY preference
    shifter_us = DateShifter(country_code="US")
    
    shifter_us.shift_date("2020-01-15")      # ISO → Always Jan 15, 2020
    shifter_us.shift_date("13/05/2020")      # Unambiguous → May 13, 2020
    shifter_us.shift_date("08/09/2020")      # Ambiguous → Aug 9, 2020 (MM/DD)
    shifter_us.shift_date("12/12/2012")      # Symmetric → Dec 12, 2012 (MM/DD)

**Best Practices:**

1. **Use ISO 8601 when possible** (``YYYY-MM-DD``): Eliminates all ambiguity
2. **Set country code correctly**: Ensures consistent interpretation within your dataset
3. **Validate output**: Review shifted dates to ensure they make sense
4. **Document format**: Record which format your source data uses

.. tip::
   For symmetric dates like ``12/12/2012`` or ``01/01/2020``, the interpretation 
   doesn't affect the result since both formats yield the same date. However, 
   consistency is still maintained for audit purposes.

.. warning::
   Mixing date formats within a single dataset (e.g., some files using DD/MM and 
   others using MM/DD) can lead to inconsistent interpretations. Always use a 
   consistent format across your dataset, preferably ISO 8601.

Encrypted Mapping Storage
~~~~~~~~~~~~~~~~~~~~~~~~~~

Mapping tables are stored in a centralized location within the ``results/deidentified/mappings/``
directory:

.. code-block:: python

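    from cryptography.fernet import Fernet
    from scripts.deidentify import DeidentificationConfig, DeidentificationEngine

    # Generate a key and save it separately from the mapping files.
    # The .key suffix matches the *.key pattern in the project's .gitignore.
    encryption_key = Fernet.generate_key()
    with open('encryption_key.key', 'wb') as f:
        f.write(encryption_key)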

    # Use encrypted storage
    config = DeidentificationConfig(
        enable_encryption=True,
        encryption_key=encryption_key
    )

    engine = DeidentificationEngine(config=config)
    
    # Mappings stored in: results/deidentified/mappings/mappings.enc
    # This single mapping file is used across all datasets and subdirectories

Record De-identification
~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    # De-identify structured records
    record = {
        "patient_name": "John Doe",
        "mrn": "123456",
        "notes": "Patient has diabetes. DOB: 01/15/1980",
        "lab_value": 95.5  # Numeric field preserved
    }

    # Specify which fields to de-identify; fields that are not listed
    # (such as lab_value) are left unchanged
    deidentified = engine.deidentify_record(
        record,
        text_fields=["patient_name", "mrn", "notes"]
    )
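
The field-selection behaviour can be pictured with a small standalone helper.
``apply_to_fields`` is a hypothetical name used only for this sketch, and
``transform`` stands in for whatever per-string de-identification is applied;
the engine's own ``deidentify_record`` may differ in detail.

.. code-block:: python

    from typing import Any, Callable, Dict, Iterable


    def apply_to_fields(
        record: Dict[str, Any],
        text_fields: Iterable[str],
        transform: Callable[[str], str],
    ) -> Dict[str, Any]:
        """Return a copy of ``record`` with ``transform`` applied to the
        listed string fields; all other fields are copied unchanged."""
        result = dict(record)  # shallow copy, the input record is not modified
        for field in text_fields:
            value = result.get(field)
            if isinstance(value, str):  # skip missing and non-string values
                result[field] = transform(value)
        return result

Any field left out of ``text_fields`` passes through untouched, so identifiers
such as ``mrn`` need to be listed explicitly.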

Validation
~~~~~~~~~~

.. code-block:: python

    # Validate de-identified text
    is_valid, issues = engine.validate_deidentification(deidentified_text)

    if not is_valid:
        print(f"Validation failed! Issues: {issues}")
    else:
        print("✓ No PHI detected")

    # Validate entire dataset (processes all subdirectories)
    from scripts.deidentify import validate_dataset

    validation_results = validate_dataset(
        "results/deidentified/Indo-vap"
    )

    print(f"Valid: {validation_results['is_valid']}")
    print(f"Issues: {len(validation_results['potential_phi_found'])}")
    print(f"Files validated: {validation_results['total_files']}")
    print(f"Records validated: {validation_results['total_records']}")
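
Conceptually, validation re-scans the output for anything that still looks like
PHI. The sketch below shows that idea with two deliberately simple patterns;
``find_residual_phi``, ``PLACEHOLDER`` and ``RESIDUAL_PATTERNS`` are hypothetical
and far less thorough than the tool's detection rules.

.. code-block:: python

    import re

    # Bracketed pseudonyms such as [SSN-Q3W8E5] are expected in the output
    PLACEHOLDER = re.compile(r"\[[A-Z]+-[A-Z0-9-]+\]")

    # Illustrative detectors only (assumptions for this sketch)
    RESIDUAL_PATTERNS = {
        "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    }


    def find_residual_phi(text):
        """Return (type, match) pairs for suspicious strings left in ``text``."""
        # Remove placeholders first so they are never reported as leaks
        stripped = PLACEHOLDER.sub(" ", text)
        issues = []
        for phi_type, pattern in RESIDUAL_PATTERNS.items():
            for match in pattern.finditer(stripped):
                issues.append((phi_type, match.group()))
        return issues


    print(find_residual_phi("Patient [PATIENT-A4B8C2], SSN [SSN-Q3W8E5]"))  # []
    print(find_residual_phi("Call 123-45-6789"))  # [('SSN', '123-45-6789')]

An empty result only means that none of the configured patterns matched, so a
manual review of sample outputs remains worthwhile.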

Security
--------

Encryption
~~~~~~~~~~

Mapping storage uses **Fernet** (symmetric encryption):

* Encryption method: AES-128 in CBC mode, authenticated with HMAC-SHA256
* Key management: Separate from data files
* Format: URL-safe Base64-encoded token containing the encrypted JSON

Cryptographic Pseudonyms
~~~~~~~~~~~~~~~~~~~~~~~~~

Pseudonyms are generated using:

* Hash method: SHA-256 hashing
* Salt: Random or deterministic per session
* Encoding: Base32 for readability
* Property: Irreversible without mapping table
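
A minimal sketch of how a pseudonym with these properties could be derived using
only the standard library. The function name ``make_pseudonym``, the way the salt
is combined with the value, and the six-character truncation are assumptions
chosen to match the placeholder examples in this guide; they are not a
description of ``PseudonymGenerator`` internals.

.. code-block:: python

    import base64
    import hashlib
    import secrets


    def make_pseudonym(value: str, phi_type: str, salt: bytes, length: int = 6) -> str:
        """Derive a bracketed placeholder from a salted SHA-256 digest."""
        material = salt + phi_type.encode("utf-8") + b":" + value.encode("utf-8")
        digest = hashlib.sha256(material).digest()
        # Base32 uses only A-Z and 2-7, which keeps the token readable
        token = base64.b32encode(digest).decode("ascii")[:length]
        return f"[{phi_type}-{token}]"


    salt = secrets.token_bytes(16)  # random per session; must stay secret
    first = make_pseudonym("John Doe", "PATIENT", salt)
    second = make_pseudonym("John Doe", "PATIENT", salt)
    print(first == second)  # True: same input and salt give the same pseudonym

Six Base32 characters carry 30 bits, so a production implementation would need
to handle collisions between different values, for example through the mapping table.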

Best Practices
~~~~~~~~~~~~~~

1. **Protect Encryption Keys**

   * Store keys separately from mapping files
   * Use key management systems in production
   * Rotate keys periodically

2. **Enable Validation**

   * Always validate after de-identification
   * Manual review of sample outputs
   * Regular updates to detection rules

3. **Audit Logging**

   * Enable comprehensive logging
   * Monitor for validation failures
   * Track mapping usage

4. **Access Control**

   * Restrict access to mapping files
   * Separate re-identification permissions
   * Log all mapping exports

HIPAA Compliance
~~~~~~~~~~~~~~~~

The tool follows HIPAA Safe Harbor requirements:

✓ Removes all 18 HIPAA identifiers

✓ Ages over 89 handled appropriately

✓ Geographic subdivisions (ZIP codes) de-identified

✓ Dates shifted to preserve intervals

✓ No re-identification without authorization

Performance
-----------

Benchmarks
~~~~~~~~~~

Typical performance on modern hardware:

* **Text Processing**: ~1,000 records/second
* **Detection Speed**: ~500 KB/second
* **Mapping Lookup**: O(1) average case
* **Encryption Overhead**: ~5-10% slowdown

Optimization Tips
~~~~~~~~~~~~~~~~~

1. **Batch Processing**: Process files in parallel
2. **Detection Order**: Put common items first
3. **Caching**: Pseudonyms cached automatically
4. **Validation**: Skip re-validation only for data that has already been validated

Examples
--------

Run the module with ``--help`` to see the command-line usage:

.. code-block:: bash

    python -m scripts.deidentify --help

Examples include:

1. Basic text de-identification
2. Consistent pseudonyms
3. Structured record de-identification
4. Custom patterns
5. Date shifting
6. Batch processing
7. Validation workflow
8. Mapping management
9. Security features

Testing
-------

The de-identification tool can be tested by running it through the main pipeline:

.. code-block:: bash

    # Test on a small dataset
    python main.py --enable-deidentification

Expected Output
~~~~~~~~~~~~~~~

When processing the Indo-vap dataset:

.. code-block:: text

    De-identifying files: 100%|██████████| 86/86 [00:08<00:00, 10.34it/s]
    INFO:reportalin:De-identification complete:
    INFO:reportalin:  Texts processed: 1854110
    INFO:reportalin:  Total detections: 365620
    INFO:reportalin:  Unique mappings: 5398
    INFO:reportalin:  Output structure:
    INFO:reportalin:    - results/deidentified/Indo-vap/original/  (de-identified original files)
    INFO:reportalin:    - results/deidentified/Indo-vap/cleaned/   (de-identified cleaned files)

**What happens:**

* Processes both ``original/`` and ``cleaned/`` subdirectories (43 files each = 86 total)
* Detects and replaces PHI/PII in all string fields
* Creates 5,398 unique pseudonym mappings
* Generates encrypted mapping table at ``results/deidentified/mappings/mappings.enc``
* Exports audit log at ``results/deidentified/Indo-vap/_deidentification_audit.json``

**Sample De-identification:**

Before:

.. code-block:: json

    {
        "HHC1": "10200009B",
        "TST_DAT1": "2014-06-11 00:00:00",
        "TST_ENDAT1": "2014-06-14 00:00:00"
    }

After:

.. code-block:: json

    {
        "HHC1": "[MRN-XTHM4A]",
        "TST_DAT1": "[DATE-A4A986]",
        "TST_ENDAT1": "[DATE-B3C874]"
    }

Verification
~~~~~~~~~~~~~

✓ Detection for all PHI types

✓ Pseudonym consistency

✓ Date shifting and intervals

✓ Mapping storage and encryption

✓ Batch processing

✓ Validation

✓ Edge cases and error handling

Troubleshooting
---------------

Common Issues
~~~~~~~~~~~~~

**"No files matching '*.jsonl' found"**

.. code-block:: bash

    # Solution: Ensure extraction step completed first
    python main.py --skip-deidentification  # Run extraction
    python main.py --enable-deidentification --skip-extraction  # Then deidentify

**Encryption error - "cryptography package not available"**

.. code-block:: bash

    # Solution: Install cryptography (quoted so the shell does not treat >= as a redirect)
    pip install "cryptography>=41.0.0"

**Validation fails on de-identified text**

.. code-block:: python

    # Solution: Inspect the reported issues, then adjust detection order and exclusions
    is_valid, issues = engine.validate_deidentification(text)
    print(issues)

**Dates not shifting consistently**

.. code-block:: python

    # Solution: Enable interval preservation
    config = DeidentificationConfig(
        enable_date_shifting=True,
        preserve_date_intervals=True
    )

**Custom patterns not detected**

.. code-block:: python

    # Solution: Increase priority
    custom_pattern = DetectionPattern(
        phi_type=PHIType.CUSTOM,
        pattern=your_detection_rule,
        priority=90  # Higher priority
    )

**Output directory structure different from input**

.. code-block:: python

    # Solution: Ensure process_subdirs is enabled
    stats = deidentify_dataset(
        input_dir="results/dataset/Indo-vap",
        output_dir="results/deidentified/Indo-vap",
        process_subdirs=True  # Must be True to preserve structure
    )

**"Could not parse date" warnings**

.. code-block:: text

    # The tool uses smart multi-format date recognition
    # Supported formats (auto-detected, original format preserved):
    #   - YYYY-MM-DD: ISO 8601 standard (e.g., 2014-09-04)
    #   - DD/MM/YYYY or MM/DD/YYYY: Slash-separated (e.g., 04/09/2014)
    #   - DD-MM-YYYY or MM-DD-YYYY: Hyphen-separated (e.g., 04-09-2014)
    #   - DD.MM.YYYY: Dot-separated European format (e.g., 04.09.2014)
    # 
    # Format priority based on country:
    #   - DD/MM/YYYY priority: India, UK, Australia, Indonesia, Brazil, South Africa, EU, Kenya, Nigeria, Ghana, Uganda
    #   - MM/DD/YYYY priority: United States, Philippines, Canada
    # 
    # Only truly unsupported formats are replaced with [DATE-HASH] placeholders

**Date format interpretation and preservation**

The date shifter automatically tries multiple formats and preserves the original format:

.. code-block:: text

    For India (IN) with DD/MM/YYYY priority:
    - Input: 04/09/2014 → Interpreted as September 4, 2014 (DD/MM/YYYY)
    - Output: 14/12/2013 (format preserved: DD/MM/YYYY)
    
    - Input: 2014-09-04 → Interpreted as September 4, 2014 (ISO 8601)
    - Output: 2013-12-14 (format preserved: YYYY-MM-DD)
    
    For United States (US) with MM/DD/YYYY priority:
    - Input: 04/09/2014 → Interpreted as April 9, 2014 (MM/DD/YYYY)
    - Output: 10/23/2013 (format preserved: MM/DD/YYYY)
    
    - Input: 2014-04-09 → Interpreted as April 9, 2014 (ISO 8601)
    - Output: 2013-10-23 (format preserved: YYYY-MM-DD)
    
    For all countries:
    - 2014-09-04 is always interpreted as September 4, 2014 (YYYY-MM-DD)
    - A [DATE-HASH] pseudonym is used only when no supported format matches

Technical Reference
-------------------

For complete technical details, see the :doc:`../api/scripts.deidentify` documentation.

Key Classes
~~~~~~~~~~~

* :class:`scripts.deidentify.DeidentificationEngine` - Main processing engine
* :class:`scripts.deidentify.PseudonymGenerator` - Pseudonym generation
* :class:`scripts.deidentify.DateShifter` - Date shifting
* :class:`scripts.deidentify.MappingStore` - Encrypted storage
* :class:`scripts.deidentify.PatternLibrary` - PHI patterns

Key Functions
~~~~~~~~~~~~~

* :func:`scripts.deidentify.deidentify_dataset` - Batch processing
* :func:`scripts.deidentify.validate_dataset` - Dataset validation

Migration Guide
---------------

**Breaking Changes**: None - The de-identification tool is fully backward compatible

**New Features** (Available in current version):

1. **Use Explicit Imports** (Recommended):

   .. code-block:: python
   
      # Recommended import style
      from scripts.deidentify import DeidentificationEngine
      engine = DeidentificationEngine()

2. **Type Checking Benefits**:
   
   If you use type checkers (mypy, pyright), you'll get better type inference:
   
   .. code-block:: python
   
      # Type checkers now understand return types
      result: None = engine.save_mappings()  # Correctly inferred as None

3. **API Discovery**:
   
   You can now see exactly what's public:
   
   .. code-block:: python
   
      from scripts import deidentify
      print(deidentify.__all__)
      # ['PHIType', 'DetectionPattern', 'DeidentificationConfig', ...]

**No Changes Required**: All existing code continues to work without modification.

See Also
--------

**Related User Guides**:

* :doc:`quickstart` - Getting started with RePORTaLiN
* :doc:`usage` - General usage guide and examples
* :doc:`configuration` - De-identification configuration options
* :doc:`country_regulations` - Country-specific privacy compliance
* :doc:`troubleshooting` - Common issues and solutions

**API & Technical References**:

* :doc:`../api/scripts.deidentify` - Complete technical documentation
* :doc:`../developer_guide/contributing` - Best practices for error handling and design
* :doc:`../developer_guide/extending` - Extending de-identification features
* :doc:`../changelog` - Version 0.0.6 changelog

**External Resources**:

* `HIPAA Safe Harbor Method <https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html>`_ - Official HIPAA de-identification guidance
* `GDPR Article 4(5) <https://gdpr.eu/article-4-definitions/>`_ - GDPR pseudonymization definition
* `DPDPA 2023 (India) <https://www.meity.gov.in/writereaddata/files/Digital%20Personal%20Data%20Protection%20Act%202023.pdf>`_ - Indian data protection regulations