scripts package
===============

.. automodule:: scripts
   :members:
   :undoc-members:
   :show-inheritance:

Overview
--------

The ``scripts`` package contains the core processing modules for RePORTaLiN.

**Enhanced in v0.0.9:**

- ✅ Enhanced package-level documentation with comprehensive usage examples
- ✅ Clear public API definition (2 high-level functions)
- ✅ Integration examples for complete data processing pipeline
- ✅ De-identification workflow documentation
- ✅ Module structure and cross-references

Package-Level Public API
~~~~~~~~~~~~~~~~~~~~~~~~~

The package exports 2 high-level functions for the main processing pipeline:

1. **load_study_dictionary** - Process data dictionary Excel files
2. **extract_excel_to_jsonl** - Extract dataset Excel files to JSONL

**Quick Start:**

.. code-block:: python

   from scripts import load_study_dictionary, extract_excel_to_jsonl
   
   # Step 1: Load data dictionary
   dict_success = load_study_dictionary()
   
   # Step 2: Extract dataset
   extract_success = extract_excel_to_jsonl(
       input_dir="data/dataset/Indo-vap",
       output_dir="results/dataset/Indo-vap"
   )

For specialized functionality, import directly from submodules:

- ``scripts.load_dictionary`` - 2 public functions
- ``scripts.extract_data`` - 6 public functions  
- ``scripts.deidentify`` - 10 public functions
- ``scripts.utils.country_regulations`` - 6 public functions
- ``scripts.utils.logging`` - 12 public functions

Module Organization
~~~~~~~~~~~~~~~~~~~

::

   scripts/
   ├── __init__.py              # Package API (2 exports)
   ├── load_dictionary.py       # Data dictionary (2 exports)
   ├── extract_data.py          # Data extraction (6 exports)
   └── utils/
       ├── deidentify.py        # De-identification (10 exports)
       ├── country_regulations.py  # Privacy rules (6 exports)
       └── logging.py           # Logging (12 exports)

Submodules
----------

.. toctree::
   :maxdepth: 2

   scripts.extract_data
   scripts.load_dictionary
   scripts.utils.logging
   scripts.deidentify
   scripts.utils.country_regulations

Module Summary
--------------

extract_data
~~~~~~~~~~~~

.. currentmodule:: scripts.extract_data

Main data extraction module for converting Excel files to JSONL format.

Key functions:

- :func:`extract_excel_to_jsonl`: Batch processing of Excel files
- :func:`process_excel_file`: Single file processing
- :func:`convert_dataframe_to_jsonl`: DataFrame to JSONL conversion
- :func:`clean_record_for_json`: Type conversion for JSON serialization
- :func:`find_excel_files`: File discovery
- :func:`is_dataframe_empty`: Empty DataFrame detection

See: :doc:`scripts.extract_data`

load_dictionary
~~~~~~~~~~~~~~~

.. currentmodule:: scripts.load_dictionary

Data dictionary processing module with intelligent table detection.

Key functions:

- :func:`load_study_dictionary`: High-level API for dictionary loading
- :func:`process_excel_file`: Excel file processing
- :func:`_split_sheet_into_tables`: Automatic table detection
- :func:`_process_and_save_tables`: Table output
- :func:`_deduplicate_columns`: Duplicate column handling

See: :doc:`scripts.load_dictionary`

utils
~~~~~

Utility modules including de-identification and logging.

De-identification (``scripts.deidentify``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. currentmodule:: scripts.deidentify

PHI/PII de-identification module with pseudonymization and encryption.

Key classes:

- :class:`DeidentificationEngine`: Main processing engine
- :class:`PseudonymGenerator`: Cryptographic pseudonym generation
- :class:`DateShifter`: Multi-format date shifting with interval preservation and format preservation
- :class:`MappingStore`: Encrypted mapping storage
- :class:`PatternLibrary`: PHI/PII detection patterns

Key functions:

- :func:`deidentify_dataset`: Batch dataset de-identification
- :func:`validate_dataset`: Dataset validation

See: :doc:`scripts.deidentify`

Logging (``scripts.utils.logging``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. currentmodule:: scripts.utils.logging

Centralized logging module with custom SUCCESS level.

Key features:

- Custom SUCCESS log level
- Timestamped log files
- Dual output (console + file)
- Structured logging

See: :doc:`scripts.utils.logging`

Country Regulations (``scripts.utils.country_regulations``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. currentmodule:: scripts.utils.country_regulations

Country-specific data privacy regulations module for compliance.

Key features:

- Multi-country support (14 countries)
- Privacy frameworks (PUBLIC to CRITICAL levels)
- Identifier detection and validation
- Regulatory requirements management
- Configuration export/import

See: :doc:`scripts.utils.country_regulations`

Quick Examples
--------------

Data Extraction
~~~~~~~~~~~~~~~

.. code-block:: python

   from scripts.extract_data import extract_excel_to_jsonl
   import config
   
   # Extract all Excel files
   extract_excel_to_jsonl(
       input_dir=config.DATASET_DIR,
       output_dir=config.CLEAN_DATASET_DIR
   )

Dictionary Loading
~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from scripts.load_dictionary import load_study_dictionary
   import config
   
   # Load data dictionary
   load_study_dictionary(
       excel_file=config.DICTIONARY_EXCEL_FILE,
       output_dir=config.DICTIONARY_JSON_OUTPUT_DIR
   )

De-identification
~~~~~~~~~~~~~~~~~

For comprehensive de-identification examples, see :doc:`../user_guide/deidentification`.

Quick example:

.. code-block:: python

   from scripts.deidentify import DeidentificationEngine
   
   engine = DeidentificationEngine()
   original = "Patient John Doe, MRN: 123456, SSN: 123-45-6789"
   deidentified = engine.deidentify_text(original)

See :ref:`deidentification-basic-usage` for complete usage patterns.

Batch De-identification
~~~~~~~~~~~~~~~~~~~~~~~~

For batch processing with directory structure preservation, 
see :ref:`deidentification-batch-processing`.

Quick example:

.. code-block:: python

   from scripts.deidentify import deidentify_dataset
   
   stats = deidentify_dataset(
       input_dir="results/dataset/Indo-vap",
       output_dir="results/deidentified/Indo-vap"
   )

Single File Processing
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from scripts.extract_data import process_excel_file
   from pathlib import Path
   
   # Process one file
   input_file = Path("data/dataset/Indo-vap/10_TST.xlsx")
   output_dir = Path("results/dataset/Indo-vap")
   
   result = process_excel_file(str(input_file), str(output_dir))
   print(f"Processed {result['records']} records")

Custom Logging
~~~~~~~~~~~~~~

.. code-block:: python

   from scripts.utils import logging as log
   
   # Use custom logger
   log.info("Processing started")
   log.success("Operation completed successfully")
   log.warning("Potential issue detected")
   log.error("An error occurred", exc_info=True)

Country-Specific De-identification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from scripts.utils.country_regulations import CountryRegulationManager
   
   # Initialize for India
   manager = CountryRegulationManager()
   manager.set_country("IN")
   
   # Get identifiers
   identifiers = manager.get_identifiers()
   print(f"Found {len(identifiers)} identifiers for India")
   
   # Validate Aadhaar number
   is_valid = manager.validate_identifier("AADHAAR", "1234 5678 9012")

Module Dependencies
-------------------

.. code-block:: text

   scripts/
   ├── extract_data.py
   │   └── uses: logging, config
   │
   ├── load_dictionary.py
   │   └── uses: logging, config
   │
   └── utils/
       ├── logging.py
       │   └── uses: config
       ├── deidentify.py
       │   └── uses: config, country_regulations, cryptography (optional)
       └── country_regulations.py
           └── uses: re, json, dataclasses

See Also
--------

:doc:`../user_guide/usage`
   Usage examples

:doc:`../developer_guide/architecture`
   Architecture documentation

:doc:`main`
   Main module that orchestrates scripts