Configuration
=============

**For Users: Customizing Your Setup**

RePORTaLiN comes with sensible defaults that work out of the box. This guide shows you 
how to adjust settings if you need to customize where files are stored or how the tool behaves.

.. versionchanged:: 0.3.0
   Added automatic directory creation and configuration validation to make setup easier.

Configuration File
------------------

The main configuration file is ``config.py`` in the project root. It defines all paths, 
settings, and parameters used throughout the pipeline.

What's New in |version|
~~~~~~~~~~~~~~~~~~~~~~~~

✨ **New Features**:
   - Automatically creates folders you need
   - Checks your setup and warns you if something's wrong
   - Better error messages when files are missing
   - Improved handling of dataset folder names

Dynamic Dataset Detection
-------------------------

RePORTaLiN automatically detects your dataset folder:

.. code-block:: python

   # config.py automatically finds the first folder in data/dataset/
   DATASET_DIR = os.path.join(DATA_DIR, "dataset", dataset_folder)

This means you can work with any dataset without modifying code:

.. code-block:: text

   data/dataset/
   └── my_study_data/         # Automatically detected
       ├── file1.xlsx
       └── file2.xlsx

.. versionchanged:: 0.3.0
   Improved automatic detection with better error handling for special cases.

Configuration Variables
-----------------------

Project Root
~~~~~~~~~~~~

.. code-block:: python

   ROOT_DIR = os.path.dirname(os.path.abspath(__file__)) if '__file__' in globals() else os.getcwd()

- **Purpose**: Absolute path to project root directory
- **Usage**: All other paths are relative to this
- **Modification**: Not recommended (auto-detected)

.. versionchanged:: 0.3.0
   Added support for running in interactive environments like Jupyter notebooks.

Data Directories
~~~~~~~~~~~~~~~~

.. code-block:: python

   DATA_DIR = os.path.join(ROOT_DIR, "data")
   RESULTS_DIR = os.path.join(ROOT_DIR, "results")

- **DATA_DIR**: Location of raw input data
- **RESULTS_DIR**: Location for processed outputs
- **Modification**: Can be changed if you want different locations

Dataset Paths
~~~~~~~~~~~~~

.. code-block:: python

   DATASET_BASE_DIR = os.path.join(DATA_DIR, "dataset")
   DATASET_FOLDER_NAME = get_dataset_folder()  # Auto-detected
   DATASET_DIR = os.path.join(DATASET_BASE_DIR, DATASET_FOLDER_NAME or DEFAULT_DATASET_NAME)
   DATASET_NAME = normalize_dataset_name(DATASET_FOLDER_NAME)

- **DATASET_BASE_DIR**: Parent directory for all datasets
- **DATASET_FOLDER_NAME**: Name of detected folder (returned by ``get_dataset_folder()``)
- **DATASET_DIR**: Full path to current dataset (auto-detected)
- **DATASET_NAME**: Cleaned dataset name (e.g., "Indo-vap_csv_files" → "Indo-vap")

.. versionchanged:: 0.3.0
   Now automatically cleans up dataset names and handles common file endings better.

Output Directories
~~~~~~~~~~~~~~~~~~

.. code-block:: python

   CLEAN_DATASET_DIR = os.path.join(RESULTS_DIR, "dataset", DATASET_NAME)
   DICTIONARY_JSON_OUTPUT_DIR = os.path.join(RESULTS_DIR, "data_dictionary_mappings")

- **CLEAN_DATASET_DIR**: Where extracted JSONL files are saved
- **DICTIONARY_JSON_OUTPUT_DIR**: Where dictionary tables are saved

Data Dictionary
~~~~~~~~~~~~~~~

.. code-block:: python

   DICTIONARY_EXCEL_FILE = os.path.join(
       DATA_DIR, 
       "data_dictionary_and_mapping_specifications",
       "RePORT_DEB_to_Tables_mapping.xlsx"
   )

- **Purpose**: Path to the data dictionary Excel file
- **Modification**: Change filename if your dictionary has a different name

Logging Settings
~~~~~~~~~~~~~~~~

.. code-block:: python

   LOG_LEVEL = logging.INFO
   LOG_NAME = "reportalin"

- **LOG_LEVEL**: Controls verbosity (INFO, DEBUG, WARNING, ERROR)
- **LOG_NAME**: Logger instance name

Available log levels:

- ``logging.DEBUG``: Detailed diagnostic information
- ``logging.INFO``: General informational messages (default)
- ``logging.WARNING``: Warning messages
- ``logging.ERROR``: Error messages only

De-identification Settings
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. versionadded:: 0.3.0
   De-identification configuration is now documented with comprehensive examples.

De-identification settings can be customized using the configuration options:

.. code-block:: python

   from scripts.deidentify import DeidentificationConfig
   
   config = DeidentificationConfig(
       # Pseudonym templates
       pseudonym_templates={
           PHIType.NAME_FULL: "PATIENT-{id}",
           PHIType.MRN: "MRN-{id}",
           # ... other templates
       },
       
       # Date shifting
       enable_date_shifting=True,
       date_shift_range_days=365,
       preserve_date_intervals=True,
       
       # Security
       enable_encryption=True,
       encryption_key=None,  # Auto-generated if None
       
       # Validation
       enable_validation=True,
       strict_mode=True,
       
       # Logging
       log_detections=True,
       log_level=logging.INFO,
       
       # Country-specific regulations
       countries=['IN', 'US'],  # None for default (IN)
       enable_country_patterns=True
   )

**Key Parameters**:

- **pseudonym_templates**: Custom format for pseudonyms (e.g., "PATIENT-{id}")
- **enable_date_shifting**: Shift dates by consistent offset
- **date_shift_range_days**: Maximum shift range (±365 days default)
- **preserve_date_intervals**: Keep time intervals consistent
- **enable_encryption**: Encrypt mapping files with Fernet
- **encryption_key**: Custom encryption key (auto-generated if None)
- **enable_validation**: Validate de-identified output
- **strict_mode**: Fail on validation errors
- **log_detections**: Log detected PHI/PII items
- **countries**: List of country codes for country-specific patterns
- **enable_country_patterns**: Use country-specific detection patterns

**Example Configurations**:

Basic de-identification (India-specific):

.. code-block:: python

   config = DeidentificationConfig()  # Uses defaults

Multi-country de-identification:

.. code-block:: python

   config = DeidentificationConfig(
       countries=['US', 'IN', 'BR', 'ID'],
       enable_encryption=True
   )

Testing/development (no encryption):

.. code-block:: python

   config = DeidentificationConfig(
       enable_encryption=False,
       log_level=logging.DEBUG
   )

See :doc:`deidentification` for complete de-identification guide.

Helper Tools
------------

.. versionadded:: 0.3.0

The configuration now provides helpful tools for common tasks.

ensure_directories()
~~~~~~~~~~~~~~~~~~~~

Automatically creates all required directories if they don't exist.

.. code-block:: python

   from config import ensure_directories
   
   # Create all necessary directories
   ensure_directories()

**What it creates**:
   - ``RESULTS_DIR``
   - ``CLEAN_DATASET_DIR``
   - ``DICTIONARY_JSON_OUTPUT_DIR``

**When to use**: 
   - At the start of your pipeline
   - Before writing any output files
   - When setting up a new environment

validate_config()
~~~~~~~~~~~~~~~~~

Validates the configuration and returns a list of warnings.

.. code-block:: python

   from config import validate_config
   
   warnings = validate_config()
   if warnings:
       print("Configuration warnings:")
       for warning in warnings:
           print(f"  - {warning}")
   else:
       print("Configuration is valid!")

**What it checks**:
   - ``DATA_DIR`` exists
   - ``DATASET_DIR`` exists
   - ``DICTIONARY_EXCEL_FILE`` exists

**Returns**: 
   - Empty list if all valid
   - List of warning strings if issues found

**When to use**: 
   - Before starting the pipeline
   - For debugging configuration issues
   - In automated testing

normalize_dataset_name()
~~~~~~~~~~~~~~~~~~~~~~~~

Normalize a dataset folder name by removing common suffixes.

.. code-block:: python

   from config import normalize_dataset_name
   
   name = normalize_dataset_name("Indo-vap_csv_files")
   print(name)  # Output: "Indo-vap"

**Parameters**:
   - ``folder_name`` (Optional[str]): Dataset folder name

**Returns**: 
   - Normalized name, or ``DEFAULT_DATASET_NAME`` if None

**Examples**:

.. code-block:: python

   normalize_dataset_name("study_csv_files")  # → "study"
   normalize_dataset_name("test_files")       # → "test"
   normalize_dataset_name(None)               # → "RePORTaLiN_sample"

Customizing Configuration
--------------------------

Example 1: Change Log Level
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To see more detailed debug information:

.. code-block:: python

   # config.py
   import logging
   
   LOG_LEVEL = logging.DEBUG  # More verbose logging

Example 2: Custom Data Location
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To use a different data directory:

.. code-block:: python

   # config.py
   DATA_DIR = "/path/to/my/data"
   RESULTS_DIR = "/path/to/my/results"

Example 3: Different Dictionary File
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If your data dictionary has a different name:

.. code-block:: python

   # config.py
   DICTIONARY_EXCEL_FILE = os.path.join(
       DATA_DIR,
       "data_dictionary_and_mapping_specifications",
       "MyCustomDictionary.xlsx"
   )

Environment Variables
---------------------

You can also use environment variables for configuration:

.. code-block:: python

   # config.py
   import os
   
   # Use environment variable with fallback
   DATA_DIR = os.getenv("REPORTALIN_DATA_DIR", os.path.join(ROOT_DIR, "data"))

Then set the environment variable:

.. code-block:: bash

   export REPORTALIN_DATA_DIR="/my/custom/data/path"
   python main.py

Configuration Best Practices
-----------------------------

1. **Don't Hardcode Paths**
   
   ❌ Bad:
   
   .. code-block:: python
   
      file_path = "/Users/john/data/file.xlsx"
   
   ✅ Good:
   
   .. code-block:: python
   
      file_path = os.path.join(config.DATA_DIR, "file.xlsx")

2. **Use Path Objects**
   
   For more robust path handling:
   
   .. code-block:: python
   
      from pathlib import Path
      
      DATA_DIR = Path(ROOT_DIR) / "data"
      DATASET_DIR = DATA_DIR / "dataset" / dataset_name

3. **Keep Configuration Separate**
   
   Don't mix configuration with business logic:
   
   ❌ Bad: Hardcoding paths in processing functions
   
   ✅ Good: Use the configuration file

4. **Document Changes**
   
   If you modify ``config.py``, document why:
   
   .. code-block:: python
   
      # Changed to use external storage per project requirements
      DATA_DIR = "/mnt/shared/research_data"

Accessing Configuration
-----------------------

In Your Code
~~~~~~~~~~~~

.. code-block:: python

   import config
   
   # Access configuration variables
   print(f"Dataset: {config.DATASET_NAME}")
   print(f"Input dir: {config.DATASET_DIR}")
   print(f"Output dir: {config.CLEAN_DATASET_DIR}")

From Command Line
~~~~~~~~~~~~~~~~~

.. code-block:: bash

   # Print current configuration
   python -c "import config; print(f'Dataset: {config.DATASET_NAME}')"

Directory Structure
-------------------

The configuration creates this structure:

.. code-block:: text

   RePORTaLiN/
   ├── data/
   │   ├── dataset/
   │   │   └── <dataset_name>/          # Auto-detected
   │   └── data_dictionary_and_mapping_specifications/
   │       └── RePORT_DEB_to_Tables_mapping.xlsx
   │
   └── results/
       ├── dataset/
       │   └── <dataset_name>/          # Mirrors input structure
       └── data_dictionary_mappings/
           ├── Codelists/
           ├── tblENROL/
           └── ...

Troubleshooting Configuration
------------------------------

Problem: "Dataset not found"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Cause**: No folder exists in ``data/dataset/``

**Solution**: Create a dataset folder:

.. code-block:: bash

   mkdir -p data/dataset/my_dataset
   # Add Excel files to this directory

Problem: "Permission denied"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Cause**: Output directories not writable

**Solution**: Check permissions:

.. code-block:: bash

   chmod -R 755 results/
   chmod 755 .logs/

Problem: "Config file not found"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Cause**: Not running from the correct folder

**Solution**: Ensure you're in the correct directory:

.. code-block:: bash

   cd /path/to/RePORTaLiN
   python main.py

See Also
--------

- :doc:`quickstart`: Quick start guide with validation examples
- :doc:`usage`: How to use configuration in practice
- :doc:`troubleshooting`: Configuration troubleshooting with ``validate_config()``
- :doc:`../api/config`: Complete technical documentation for configuration settings
- :doc:`../developer_guide/extending`: Extending configuration for custom needs
- :doc:`../changelog`: Version 0.0.3 configuration enhancements