config module ============= .. automodule:: config :members: :undoc-members: :show-inheritance: Overview -------- The ``config`` module provides centralized configuration management for RePORTaLiN. All paths, settings, and parameters are defined here to ensure consistency across all pipeline components. .. versionchanged:: 0.3.0 Added ``ensure_directories()``, ``validate_config()``, and ``normalize_dataset_name()`` functions. Enhanced error handling and type safety. Fixed suffix removal bug. Module Metadata --------------- __version__ ~~~~~~~~~~~ .. code-block:: python __version__ = '1.0.0' Module version string. __all__ ~~~~~~~ .. code-block:: python __all__ = [ 'ROOT_DIR', 'DATA_DIR', 'RESULTS_DIR', 'DATASET_BASE_DIR', 'DATASET_FOLDER_NAME', 'DATASET_DIR', 'DATASET_NAME', 'CLEAN_DATASET_DIR', 'DICTIONARY_EXCEL_FILE', 'DICTIONARY_JSON_OUTPUT_DIR', 'LOG_LEVEL', 'LOG_NAME', 'ensure_directories', 'validate_config', 'DEFAULT_DATASET_NAME' ] Public API exports. Only these symbols are exported with ``from config import *``. Configuration Variables ----------------------- Constants ~~~~~~~~~ DEFAULT_DATASET_NAME ^^^^^^^^^^^^^^^^^^^^ .. code-block:: python DEFAULT_DATASET_NAME = "RePORTaLiN_sample" Default dataset name used when no dataset folder is detected. .. versionadded:: 0.3.0 Extracted as a public constant. DATASET_SUFFIXES ^^^^^^^^^^^^^^^^ .. code-block:: python DATASET_SUFFIXES = ('_csv_files', '_files') Tuple of common suffixes removed from dataset folder names during normalization. .. versionadded:: 0.3.0 Internal constant (not in __all__). Directory Paths ~~~~~~~~~~~~~~~ ROOT_DIR ^^^^^^^^ .. code-block:: python ROOT_DIR = os.path.dirname(os.path.abspath(__file__)) if '__file__' in globals() else os.getcwd() Absolute path to the project root directory. All other paths are relative to this. .. versionchanged:: 0.3.0 Added fallback to ``os.getcwd()`` when ``__file__`` is not available (REPL environments). DATA_DIR ^^^^^^^^ .. code-block:: python DATA_DIR = os.path.join(ROOT_DIR, "data") Path to the data directory containing input files. RESULTS_DIR ^^^^^^^^^^^ .. code-block:: python RESULTS_DIR = os.path.join(ROOT_DIR, "results") Path to the results directory for output files. Dataset Paths ~~~~~~~~~~~~~ DATASET_BASE_DIR ^^^^^^^^^^^^^^^^ .. code-block:: python DATASET_BASE_DIR = os.path.join(DATA_DIR, "dataset") Base directory containing dataset folders. DATASET_DIR ^^^^^^^^^^^ .. code-block:: python DATASET_DIR = get_dataset_folder() Path to the current dataset directory (auto-detected). DATASET_NAME ^^^^^^^^^^^^ .. code-block:: python DATASET_NAME = normalize_dataset_name(DATASET_FOLDER_NAME) Name of the current dataset (e.g., "Indo-vap"), extracted by removing common suffixes from the dataset folder name using the ``normalize_dataset_name()`` function. .. versionchanged:: 0.3.0 Now uses ``normalize_dataset_name()`` function for improved suffix handling. Output Paths ~~~~~~~~~~~~ CLEAN_DATASET_DIR ^^^^^^^^^^^^^^^^^ .. code-block:: python CLEAN_DATASET_DIR = os.path.join(RESULTS_DIR, "dataset", DATASET_NAME) Output directory for extracted JSONL files. DICTIONARY_JSON_OUTPUT_DIR ^^^^^^^^^^^^^^^^^^^^^^^^^^ .. code-block:: python DICTIONARY_JSON_OUTPUT_DIR = os.path.join(RESULTS_DIR, "data_dictionary_mappings") Output directory for data dictionary tables. Dictionary File ~~~~~~~~~~~~~~~ DICTIONARY_EXCEL_FILE ^^^^^^^^^^^^^^^^^^^^^ .. code-block:: python DICTIONARY_EXCEL_FILE = os.path.join( DATA_DIR, "data_dictionary_and_mapping_specifications", "RePORT_DEB_to_Tables_mapping.xlsx" ) Path to the data dictionary Excel file. Logging Settings ~~~~~~~~~~~~~~~~ LOG_LEVEL ^^^^^^^^^ .. code-block:: python LOG_LEVEL = logging.INFO Logging verbosity level. Options: - ``logging.DEBUG``: Detailed diagnostic information - ``logging.INFO``: General informational messages (default) - ``logging.WARNING``: Warning messages - ``logging.ERROR``: Error messages only LOG_NAME ^^^^^^^^ .. code-block:: python LOG_NAME = "reportalin" Logger instance name used throughout the application. Functions --------- get_dataset_folder ~~~~~~~~~~~~~~~~~~ .. autofunction:: config.get_dataset_folder :no-index: Automatically detect the dataset folder from the file system. Returns the first alphabetically sorted folder in ``data/dataset/``, excluding hidden folders (those starting with '.'). **Returns**: - ``str``: Name of the detected dataset folder - ``None``: If no folders exist or directory is inaccessible **Example**: .. code-block:: python from config import get_dataset_folder folder = get_dataset_folder() if folder: print(f"Detected dataset: {folder}") else: print("No dataset folder found") .. versionchanged:: 0.3.0 Removed faulty ``'..' not in f`` check. Added explicit empty list validation. normalize_dataset_name ~~~~~~~~~~~~~~~~~~~~~~ .. autofunction:: config.normalize_dataset_name :no-index: Normalize dataset folder name by removing common suffixes. **Parameters**: - ``folder_name`` (``Optional[str]``): The dataset folder name to normalize **Returns**: - ``str``: Normalized dataset name **Algorithm**: Removes the longest matching suffix from ``DATASET_SUFFIXES`` to handle overlapping suffixes correctly (e.g., ``_csv_files`` vs ``_files``). **Examples**: .. code-block:: python from config import normalize_dataset_name # Remove suffix name = normalize_dataset_name("Indo-vap_csv_files") print(name) # Output: "Indo-vap" # Handle overlapping suffixes name = normalize_dataset_name("test_files") print(name) # Output: "test" # Fallback to default name = normalize_dataset_name(None) print(name) # Output: "RePORTaLiN_sample" .. versionadded:: 0.3.0 Extracted from inline code. Uses longest-match algorithm. ensure_directories ~~~~~~~~~~~~~~~~~~ .. autofunction:: config.ensure_directories :no-index: Create necessary directories if they don't exist. Creates: - ``RESULTS_DIR`` - ``CLEAN_DATASET_DIR`` - ``DICTIONARY_JSON_OUTPUT_DIR`` **Example**: .. code-block:: python from config import ensure_directories # Create all required directories ensure_directories() .. versionadded:: 0.3.0 New utility function for directory management. validate_config ~~~~~~~~~~~~~~~ .. autofunction:: config.validate_config :no-index: Validate configuration and return list of warnings for missing or invalid paths. **Returns**: - ``List[str]``: List of warning messages (empty if all valid) **Validates**: - ``DATA_DIR`` exists - ``DATASET_DIR`` exists - ``DICTIONARY_EXCEL_FILE`` exists **Example**: .. code-block:: python from config import validate_config warnings = validate_config() if warnings: print("Configuration warnings:") for warning in warnings: print(f" - {warning}") else: print("Configuration is valid!") .. versionadded:: 0.3.0 New utility function for configuration validation. Usage Examples -------------- Basic Usage ~~~~~~~~~~~ .. code-block:: python import config # Access configuration print(f"Dataset: {config.DATASET_NAME}") print(f"Input: {config.DATASET_DIR}") print(f"Output: {config.CLEAN_DATASET_DIR}") Using New Utility Functions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python from config import ensure_directories, validate_config # Ensure all directories exist ensure_directories() # Validate configuration warnings = validate_config() if warnings: for warning in warnings: print(f"Warning: {warning}") Custom Configuration ~~~~~~~~~~~~~~~~~~~~ .. code-block:: python # config.py modifications import os # Use environment variable DATA_DIR = os.getenv("REPORTALIN_DATA", os.path.join(ROOT_DIR, "data")) # Custom logging import logging LOG_LEVEL = logging.DEBUG Best Practices -------------- 1. **Always call ensure_directories()** before file operations: .. code-block:: python from config import ensure_directories, CLEAN_DATASET_DIR ensure_directories() # Now safe to write to CLEAN_DATASET_DIR 2. **Validate configuration at startup**: .. code-block:: python from config import validate_config warnings = validate_config() if warnings: logger.warning("Configuration issues detected:") for warning in warnings: logger.warning(f" {warning}") 3. **Use constants instead of hardcoded values**: .. code-block:: python from config import DEFAULT_DATASET_NAME # Good dataset = folder_name or DEFAULT_DATASET_NAME # Avoid dataset = folder_name or "RePORTaLiN_sample" Directory Structure ------------------- The configuration defines this structure: .. code-block:: text RePORTaLiN/ ├── data/ (DATA_DIR) │ ├── dataset/ (DATASET_BASE_DIR) │ │ └── / (DATASET_DIR) │ └── data_dictionary_and_mapping_specifications/ │ └── RePORT_DEB_to_Tables_mapping.xlsx (DICTIONARY_EXCEL_FILE) │ └── results/ (RESULTS_DIR) ├── dataset/ │ └── / (CLEAN_DATASET_DIR) └── data_dictionary_mappings/ (DICTIONARY_JSON_OUTPUT_DIR) See Also -------- :doc:`../user_guide/configuration` User guide for configuration and utility functions :doc:`../user_guide/quickstart` Quick start guide with configuration validation :doc:`../user_guide/troubleshooting` Troubleshooting with validation utilities :doc:`../developer_guide/extending` Extending the configuration module :doc:`../developer_guide/contributing` Configuration module contribution guidelines :doc:`../changelog` Version 0.0.3 changes and enhancements :mod:`main` Main pipeline that uses configuration :mod:`scripts.extract_data` Data extraction using configuration paths