config module
RePORTaLiN Configuration Module
Centralized configuration management with dynamic dataset detection, automatic path resolution, and flexible logging configuration.
Overview
The config module provides centralized configuration management for RePORTaLiN.
All paths, settings, and parameters are defined here to ensure consistency across
all pipeline components.
Changed in version 0.3.0: Added ensure_directories(), validate_config(), and normalize_dataset_name() functions.
Enhanced error handling and type safety. Fixed suffix removal bug.
Module Metadata
__version__
__version__ = '1.0.0'
Module version string.
__all__
__all__ = [
'ROOT_DIR', 'DATA_DIR', 'RESULTS_DIR', 'DATASET_BASE_DIR',
'DATASET_FOLDER_NAME', 'DATASET_DIR', 'DATASET_NAME', 'CLEAN_DATASET_DIR',
'DICTIONARY_EXCEL_FILE', 'DICTIONARY_JSON_OUTPUT_DIR',
'LOG_LEVEL', 'LOG_NAME',
'ensure_directories', 'validate_config',
'DEFAULT_DATASET_NAME'
]
Public API exports. Only these symbols are exported with from config import *.
Configuration Variables
Constants
DEFAULT_DATASET_NAME
DEFAULT_DATASET_NAME = "RePORTaLiN_sample"
Default dataset name used when no dataset folder is detected.
Added in version 0.3.0: Extracted as a public constant.
DATASET_SUFFIXES
DATASET_SUFFIXES = ('_csv_files', '_files')
Tuple of common suffixes removed from dataset folder names during normalization.
Added in version 0.3.0: Internal constant (not in __all__).
Directory Paths
ROOT_DIR
ROOT_DIR = os.path.dirname(os.path.abspath(__file__)) if '__file__' in globals() else os.getcwd()
Absolute path to the project root directory. All other paths are relative to this.
Changed in version 0.3.0: Added fallback to os.getcwd() when __file__ is not available (REPL environments).
DATA_DIR
DATA_DIR = os.path.join(ROOT_DIR, "data")
Path to the data directory containing input files.
RESULTS_DIR
RESULTS_DIR = os.path.join(ROOT_DIR, "results")
Path to the results directory for output files.
Dataset Paths
DATASET_BASE_DIR
DATASET_BASE_DIR = os.path.join(DATA_DIR, "dataset")
Base directory containing dataset folders.
DATASET_FOLDER_NAME
DATASET_FOLDER_NAME = get_dataset_folder() or DEFAULT_DATASET_NAME
Name of the auto-detected dataset folder, falling back to DEFAULT_DATASET_NAME when no folder is detected.
DATASET_DIR
DATASET_DIR = os.path.join(DATASET_BASE_DIR, DATASET_FOLDER_NAME)
Path to the current dataset directory (auto-detected).
DATASET_NAME
DATASET_NAME = normalize_dataset_name(DATASET_FOLDER_NAME)
Name of the current dataset (e.g., “Indo-vap”), extracted by removing common suffixes
from the dataset folder name using the normalize_dataset_name() function.
Changed in version 0.3.0: Now uses normalize_dataset_name() function for improved suffix handling.
Output Paths
CLEAN_DATASET_DIR
CLEAN_DATASET_DIR = os.path.join(RESULTS_DIR, "dataset", DATASET_NAME)
Output directory for extracted JSONL files.
DICTIONARY_JSON_OUTPUT_DIR
DICTIONARY_JSON_OUTPUT_DIR = os.path.join(RESULTS_DIR, "data_dictionary_mappings")
Output directory for data dictionary tables.
Dictionary File
DICTIONARY_EXCEL_FILE
DICTIONARY_EXCEL_FILE = os.path.join(
DATA_DIR,
"data_dictionary_and_mapping_specifications",
"RePORT_DEB_to_Tables_mapping.xlsx"
)
Path to the data dictionary Excel file.
Logging Settings
LOG_LEVEL
LOG_LEVEL = logging.INFO
Logging verbosity level. Options:
- logging.DEBUG: Detailed diagnostic information
- logging.INFO: General informational messages (default)
- logging.WARNING: Warning messages
- logging.ERROR: Error messages only
LOG_NAME
LOG_NAME = "reportalin"
Logger instance name used throughout the application.
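As a sketch of how these two settings might be consumed elsewhere in the project (the actual logging setup lives outside this module), a logger can be retrieved by LOG_NAME and set to LOG_LEVEL:
```python
import logging

# Mirrors the documented settings; values assumed from this page.
LOG_LEVEL = logging.INFO
LOG_NAME = "reportalin"

# Retrieve the shared application logger by name and apply the
# configured verbosity level.
logger = logging.getLogger(LOG_NAME)
logger.setLevel(LOG_LEVEL)

logger.info("logger configured")  # emitted at INFO and above
```
Because loggers are looked up by name, any module calling logging.getLogger(LOG_NAME) receives the same configured instance.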
Functions
get_dataset_folder
- config.get_dataset_folder()[source]
Automatically detect the dataset folder from the file system. Returns the first
alphabetically sorted folder in data/dataset/, excluding hidden folders
(those starting with ‘.’).
- Returns:
str: Name of the detected dataset folder
None: If no folders exist or the directory is inaccessible
Note
Folders starting with ‘.’ are excluded as they are typically hidden. Errors during directory listing are silently handled to avoid issues during module initialization (before logging is configured).
Example:
from config import get_dataset_folder
folder = get_dataset_folder()
if folder:
    print(f"Detected dataset: {folder}")
else:
    print("No dataset folder found")
Changed in version 0.3.0: Removed faulty '..' not in f check. Added explicit empty list validation.
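The detection logic described above can be sketched as follows. This is an illustrative reimplementation, not the module's exact code; it takes base_dir as a parameter (the real function reads DATASET_BASE_DIR) to stay self-contained:
```python
import os
from typing import Optional

def detect_dataset_folder(base_dir: str) -> Optional[str]:
    """Return the first visible subfolder of base_dir, or None."""
    try:
        folders = sorted(
            entry for entry in os.listdir(base_dir)
            if not entry.startswith('.')                      # skip hidden folders
            and os.path.isdir(os.path.join(base_dir, entry))  # folders only
        )
    except OSError:
        # Errors are swallowed: at import time logging is not yet configured.
        return None
    return folders[0] if folders else None
```
Sorting before taking the first entry makes detection deterministic when several dataset folders are present.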
normalize_dataset_name
- config.normalize_dataset_name(folder_name)[source]
Normalize dataset folder name by removing common suffixes.
- Parameters:
folder_name (Optional[str]) – The dataset folder name to normalize
- Return type:
str
- Returns:
Normalized dataset name
- Algorithm:
Removes the longest matching suffix from DATASET_SUFFIXES to handle overlapping suffixes correctly (e.g., _csv_files vs _files).
Note
Removes the longest matching suffix to ensure correct normalization regardless of suffix ordering (e.g., ‘_csv_files’ before ‘_files’). Whitespace is stripped before suffix removal for consistent behavior.
Examples:
from config import normalize_dataset_name
# Remove suffix
name = normalize_dataset_name("Indo-vap_csv_files")
print(name) # Output: "Indo-vap"
# Handle overlapping suffixes
name = normalize_dataset_name("test_files")
print(name) # Output: "test"
# Fallback to default
name = normalize_dataset_name(None)
print(name) # Output: "RePORTaLiN_sample"
Added in version 0.3.0: Extracted from inline code. Uses longest-match algorithm.
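Putting the pieces together, the longest-match behavior can be sketched as below. This is an illustrative reimplementation under the documented constants, not the module's exact code:
```python
from typing import Optional

DATASET_SUFFIXES = ('_csv_files', '_files')
DEFAULT_DATASET_NAME = "RePORTaLiN_sample"

def normalize_dataset_name(folder_name: Optional[str]) -> str:
    """Strip the longest matching suffix; fall back to the default name."""
    if not folder_name:
        return DEFAULT_DATASET_NAME
    name = folder_name.strip()  # whitespace is stripped before suffix removal
    # Check longer suffixes first so '_csv_files' wins over '_files'.
    for suffix in sorted(DATASET_SUFFIXES, key=len, reverse=True):
        if name.endswith(suffix):
            return name[:-len(suffix)]
    return name
```
Sorting the suffixes by length descending is what makes "Indo-vap_csv_files" normalize to "Indo-vap" rather than "Indo-vap_csv".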
ensure_directories
- config.ensure_directories()[source]
Create necessary directories if they don’t exist.
- Return type:
None
- Creates:
RESULTS_DIR
CLEAN_DATASET_DIR
DICTIONARY_JSON_OUTPUT_DIR
Example:
from config import ensure_directories
# Create all required directories
ensure_directories()
Added in version 0.3.0: New utility function for directory management.
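A minimal sketch of what ensure_directories() likely does, parameterized here for self-containment (the real function takes no arguments and operates on the module constants listed above):
```python
import os

def ensure_directories(*directories: str) -> None:
    """Create each directory (and any missing parents), tolerating existing ones."""
    for directory in directories:
        os.makedirs(directory, exist_ok=True)  # no error if it already exists
```
In the module itself the equivalent call would cover RESULTS_DIR, CLEAN_DATASET_DIR, and DICTIONARY_JSON_OUTPUT_DIR; exist_ok=True makes repeated calls safe.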
validate_config
- config.validate_config()[source]
Validate configuration and return a list of warnings for missing or invalid paths.
- Returns:
List[str]: List of warning messages (empty if all valid)
- Validates:
DATA_DIR exists
DATASET_DIR exists
DICTIONARY_EXCEL_FILE exists
Example:
from config import validate_config
warnings = validate_config()
if warnings:
    print("Configuration warnings:")
    for warning in warnings:
        print(f" - {warning}")
else:
    print("Configuration is valid!")
Added in version 0.3.0: New utility function for configuration validation.
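The validation step can be sketched as follows; the path labels are passed in explicitly here for illustration, whereas the real function checks the module constants directly:
```python
import os
from typing import Dict, List

def validate_paths(paths: Dict[str, str]) -> List[str]:
    """Return one warning per configured path that does not exist."""
    warnings = []
    for label, path in paths.items():
        if not os.path.exists(path):
            warnings.append(f"{label} not found: {path}")
    return warnings
```
Returning warnings rather than raising lets the caller decide whether a missing path is fatal, which matches the logger-based usage shown under Best Practices.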
Usage Examples
Basic Usage
import config
# Access configuration
print(f"Dataset: {config.DATASET_NAME}")
print(f"Input: {config.DATASET_DIR}")
print(f"Output: {config.CLEAN_DATASET_DIR}")
Using New Utility Functions
from config import ensure_directories, validate_config
# Ensure all directories exist
ensure_directories()
# Validate configuration
warnings = validate_config()
if warnings:
    for warning in warnings:
        print(f"Warning: {warning}")
Custom Configuration
# config.py modifications
import os
# Use environment variable
DATA_DIR = os.getenv("REPORTALIN_DATA", os.path.join(ROOT_DIR, "data"))
# Custom logging
import logging
LOG_LEVEL = logging.DEBUG
Best Practices
Always call ensure_directories() before file operations:
from config import ensure_directories, CLEAN_DATASET_DIR

ensure_directories()
# Now safe to write to CLEAN_DATASET_DIR
Validate configuration at startup:
from config import validate_config

warnings = validate_config()
if warnings:
    logger.warning("Configuration issues detected:")
    for warning in warnings:
        logger.warning(f"  {warning}")
Use constants instead of hardcoded values:
from config import DEFAULT_DATASET_NAME

# Good
dataset = folder_name or DEFAULT_DATASET_NAME

# Avoid
dataset = folder_name or "RePORTaLiN_sample"
Directory Structure
The configuration defines this structure:
RePORTaLiN/
├── data/ (DATA_DIR)
│ ├── dataset/ (DATASET_BASE_DIR)
│ │ └── <dataset_name>/ (DATASET_DIR)
│ └── data_dictionary_and_mapping_specifications/
│ └── RePORT_DEB_to_Tables_mapping.xlsx (DICTIONARY_EXCEL_FILE)
│
└── results/ (RESULTS_DIR)
├── dataset/
│ └── <dataset_name>/ (CLEAN_DATASET_DIR)
└── data_dictionary_mappings/ (DICTIONARY_JSON_OUTPUT_DIR)
See Also
- Configuration
User guide for configuration and utility functions
- Quick Start
Quick start guide with configuration validation
- Troubleshooting
Troubleshooting with validation utilities
- Extending RePORTaLiN
Extending the configuration module
- Contributing
Configuration module contribution guidelines
- Changelog
Version 0.3.0 changes and enhancements
- main
Main pipeline that uses configuration
- scripts.extract_data
Data extraction using configuration paths