config module

RePORTaLiN Configuration Module

Centralized configuration management with dynamic dataset detection, automatic path resolution, and flexible logging configuration.


Overview

The config module provides centralized configuration management for RePORTaLiN. All paths, settings, and parameters are defined here to ensure consistency across all pipeline components.

Changed in version 0.3.0: Added ensure_directories(), validate_config(), and normalize_dataset_name() functions. Enhanced error handling and type safety. Fixed suffix removal bug.

Module Metadata

__version__

__version__ = '1.0.0'

Module version string.

__all__

__all__ = [
    'ROOT_DIR', 'DATA_DIR', 'RESULTS_DIR', 'DATASET_BASE_DIR',
    'DATASET_FOLDER_NAME', 'DATASET_DIR', 'DATASET_NAME', 'CLEAN_DATASET_DIR',
    'DICTIONARY_EXCEL_FILE', 'DICTIONARY_JSON_OUTPUT_DIR',
    'LOG_LEVEL', 'LOG_NAME',
    'ensure_directories', 'validate_config',
    'DEFAULT_DATASET_NAME'
]

Public API exports. Only these symbols are exported with from config import *.

Configuration Variables

Constants

DEFAULT_DATASET_NAME

DEFAULT_DATASET_NAME = "RePORTaLiN_sample"

Default dataset name used when no dataset folder is detected.

Added in version 0.3.0: Extracted as a public constant.

DATASET_SUFFIXES

DATASET_SUFFIXES = ('_csv_files', '_files')

Tuple of common suffixes removed from dataset folder names during normalization.

Added in version 0.3.0: Internal constant (not in __all__).

Directory Paths

ROOT_DIR

ROOT_DIR = os.path.dirname(os.path.abspath(__file__)) if '__file__' in globals() else os.getcwd()

Absolute path to the project root directory. All other paths are relative to this.

Changed in version 0.3.0: Added fallback to os.getcwd() when __file__ is not available (REPL environments).

DATA_DIR

DATA_DIR = os.path.join(ROOT_DIR, "data")

Path to the data directory containing input files.

RESULTS_DIR

RESULTS_DIR = os.path.join(ROOT_DIR, "results")

Path to the results directory for output files.

Dataset Paths

DATASET_BASE_DIR

DATASET_BASE_DIR = os.path.join(DATA_DIR, "dataset")

Base directory containing dataset folders.

DATASET_FOLDER_NAME

DATASET_FOLDER_NAME = get_dataset_folder()

Name of the auto-detected dataset folder, or None if no folder was found. Note that get_dataset_folder() returns a folder name, not a full path.

DATASET_DIR

DATASET_DIR = os.path.join(DATASET_BASE_DIR, DATASET_FOLDER_NAME or DEFAULT_DATASET_NAME)

Path to the current dataset directory, built from DATASET_BASE_DIR and the auto-detected folder name (falling back to DEFAULT_DATASET_NAME).

DATASET_NAME

DATASET_NAME = normalize_dataset_name(DATASET_FOLDER_NAME)

Name of the current dataset (e.g., “Indo-vap”), extracted by removing common suffixes from the dataset folder name using the normalize_dataset_name() function.

Changed in version 0.3.0: Now uses normalize_dataset_name() function for improved suffix handling.

Output Paths

CLEAN_DATASET_DIR

CLEAN_DATASET_DIR = os.path.join(RESULTS_DIR, "dataset", DATASET_NAME)

Output directory for extracted JSONL files.

DICTIONARY_JSON_OUTPUT_DIR

DICTIONARY_JSON_OUTPUT_DIR = os.path.join(RESULTS_DIR, "data_dictionary_mappings")

Output directory for data dictionary tables.

Dictionary File

DICTIONARY_EXCEL_FILE

DICTIONARY_EXCEL_FILE = os.path.join(
    DATA_DIR,
    "data_dictionary_and_mapping_specifications",
    "RePORT_DEB_to_Tables_mapping.xlsx"
)

Path to the data dictionary Excel file.

Logging Settings

LOG_LEVEL

LOG_LEVEL = logging.INFO

Logging verbosity level. Options:

  • logging.DEBUG: Detailed diagnostic information

  • logging.INFO: General informational messages (default)

  • logging.WARNING: Warning messages

  • logging.ERROR: Error messages only

LOG_NAME

LOG_NAME = "reportalin"

Logger instance name used throughout the application.
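The two settings are typically combined when wiring up the application logger. A minimal sketch (the constants below are inlined stand-ins for config.LOG_LEVEL and config.LOG_NAME):

```python
import logging

# Stand-ins for config.LOG_LEVEL and config.LOG_NAME; in the project
# you would write: from config import LOG_LEVEL, LOG_NAME
LOG_LEVEL = logging.INFO
LOG_NAME = "reportalin"

# Fetch the shared application logger and apply the configured level.
logger = logging.getLogger(LOG_NAME)
logger.setLevel(LOG_LEVEL)

logger.info("pipeline starting")      # emitted at INFO
logger.debug("detailed diagnostics")  # suppressed at INFO
```

Because logging.getLogger() returns the same instance for the same name, every module that asks for LOG_NAME shares this configuration.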

Functions

get_dataset_folder

config.get_dataset_folder()[source]

Detect the first dataset folder in data/dataset/, excluding hidden folders. Folders are sorted alphabetically and the first match is returned.

Return type:

Optional[str]

Returns:

Name of the first dataset folder found, or None if no folders exist or the directory is inaccessible

Note

Folders starting with ‘.’ are excluded as they are typically hidden. Errors during directory listing are silently handled to avoid issues during module initialization (before logging is configured).
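The detection logic can be sketched as follows (a simplified reconstruction, not the module's actual source; DATASET_BASE_DIR here points at a temporary location rather than data/dataset/):

```python
import os
import tempfile

# Stand-in for config.DATASET_BASE_DIR, rooted in a temporary directory
# for this sketch.
DATASET_BASE_DIR = os.path.join(tempfile.mkdtemp(), "dataset")

def get_dataset_folder():
    """Return the first non-hidden folder in DATASET_BASE_DIR, or None."""
    try:
        folders = sorted(
            f for f in os.listdir(DATASET_BASE_DIR)
            if not f.startswith('.')
            and os.path.isdir(os.path.join(DATASET_BASE_DIR, f))
        )
    except OSError:
        # Swallow listing errors: this runs at import time, before
        # logging is configured.
        return None
    return folders[0] if folders else None
```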

Example:

from config import get_dataset_folder

folder = get_dataset_folder()
if folder:
    print(f"Detected dataset: {folder}")
else:
    print("No dataset folder found")

Changed in version 0.3.0: Removed faulty '..' not in f check. Added explicit empty list validation.

normalize_dataset_name

config.normalize_dataset_name(folder_name)[source]

Normalize dataset folder name by removing common suffixes.

Parameters:

folder_name (Optional[str]) – The dataset folder name to normalize

Return type:

str

Returns:

Normalized dataset name

Algorithm:

Removes the longest matching suffix from DATASET_SUFFIXES to ensure correct normalization regardless of suffix ordering (e.g., ‘_csv_files’ before ‘_files’). Whitespace is stripped before suffix removal for consistent behavior.
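The longest-match rule can be sketched like this (a reconstruction based on the documented constants, not the module's actual source):

```python
# Mirror the documented constants.
DATASET_SUFFIXES = ('_csv_files', '_files')
DEFAULT_DATASET_NAME = "RePORTaLiN_sample"

def normalize_dataset_name(folder_name):
    """Strip the longest matching suffix; fall back to the default name."""
    if not folder_name:
        return DEFAULT_DATASET_NAME
    name = folder_name.strip()
    # Try the longest suffix first, so '_csv_files' wins over '_files'.
    for suffix in sorted(DATASET_SUFFIXES, key=len, reverse=True):
        if name.endswith(suffix):
            return name[:-len(suffix)]
    return name
```

Sorting by length rather than relying on tuple order keeps the result correct even if DATASET_SUFFIXES is later extended in an arbitrary order.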

Examples:

from config import normalize_dataset_name

# Remove suffix
name = normalize_dataset_name("Indo-vap_csv_files")
print(name)  # Output: "Indo-vap"

# Handle overlapping suffixes
name = normalize_dataset_name("test_files")
print(name)  # Output: "test"

# Fallback to default
name = normalize_dataset_name(None)
print(name)  # Output: "RePORTaLiN_sample"

Added in version 0.3.0: Extracted from inline code. Uses longest-match algorithm.

ensure_directories

config.ensure_directories()[source]

Create necessary directories if they don’t exist.

Return type:

None

Creates:
  • RESULTS_DIR

  • CLEAN_DATASET_DIR

  • DICTIONARY_JSON_OUTPUT_DIR

Example:

from config import ensure_directories

# Create all required directories
ensure_directories()
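One common way to implement this behavior is os.makedirs(..., exist_ok=True) over the three output paths. A self-contained sketch (the paths below stand in for the real configuration constants, not the module's actual source):

```python
import os
import tempfile

# Stand-ins for config.RESULTS_DIR, config.CLEAN_DATASET_DIR and
# config.DICTIONARY_JSON_OUTPUT_DIR, rooted in a temporary directory.
ROOT_DIR = tempfile.mkdtemp()
RESULTS_DIR = os.path.join(ROOT_DIR, "results")
CLEAN_DATASET_DIR = os.path.join(RESULTS_DIR, "dataset", "RePORTaLiN_sample")
DICTIONARY_JSON_OUTPUT_DIR = os.path.join(RESULTS_DIR, "data_dictionary_mappings")

def ensure_directories():
    # exist_ok=True makes repeated calls safe ("if they don't exist"),
    # and makedirs creates intermediate directories as needed.
    for path in (RESULTS_DIR, CLEAN_DATASET_DIR, DICTIONARY_JSON_OUTPUT_DIR):
        os.makedirs(path, exist_ok=True)

ensure_directories()
ensure_directories()  # idempotent: no error on the second call
```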

Added in version 0.3.0: New utility function for directory management.

validate_config

config.validate_config()[source]

Validate configuration and return a list of warnings for missing or invalid paths.

Return type:

List[str]

Returns:

List of warning messages (empty if the configuration is valid)

Validates:
  • DATA_DIR exists

  • DATASET_DIR exists

  • DICTIONARY_EXCEL_FILE exists

Example:

from config import validate_config

warnings = validate_config()
if warnings:
    print("Configuration warnings:")
    for warning in warnings:
        print(f"  - {warning}")
else:
    print("Configuration is valid!")

Added in version 0.3.0: New utility function for configuration validation.
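The validation pattern can be sketched as follows (a reconstruction, not the module's actual source; the checked paths are deliberately nonexistent stand-ins for config.DATA_DIR, config.DATASET_DIR, and config.DICTIONARY_EXCEL_FILE):

```python
import os

# Hypothetical stand-ins for the real configuration paths.
CHECKED_PATHS = {
    "DATA_DIR": "/nonexistent/data",
    "DATASET_DIR": "/nonexistent/data/dataset/RePORTaLiN_sample",
    "DICTIONARY_EXCEL_FILE": "/nonexistent/data/RePORT_DEB_to_Tables_mapping.xlsx",
}

def validate_config():
    """Return one warning per missing path (empty list if all exist)."""
    warnings = []
    for name, path in CHECKED_PATHS.items():
        if not os.path.exists(path):
            warnings.append(f"{name} does not exist: {path}")
    return warnings
```

Returning warnings instead of raising lets the caller decide whether a missing path is fatal, which matches the logging-based usage shown in the example above.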

Usage Examples

Basic Usage

import config

# Access configuration
print(f"Dataset: {config.DATASET_NAME}")
print(f"Input: {config.DATASET_DIR}")
print(f"Output: {config.CLEAN_DATASET_DIR}")

Using New Utility Functions

from config import ensure_directories, validate_config

# Ensure all directories exist
ensure_directories()

# Validate configuration
warnings = validate_config()
if warnings:
    for warning in warnings:
        print(f"Warning: {warning}")

Custom Configuration

# config.py modifications
import os

# Use environment variable
DATA_DIR = os.getenv("REPORTALIN_DATA", os.path.join(ROOT_DIR, "data"))

# Custom logging
import logging
LOG_LEVEL = logging.DEBUG

Best Practices

  1. Always call ensure_directories() before file operations:

    from config import ensure_directories, CLEAN_DATASET_DIR
    
    ensure_directories()
    # Now safe to write to CLEAN_DATASET_DIR
    
  2. Validate configuration at startup:

    from config import validate_config
    
    warnings = validate_config()
    if warnings:
        logger.warning("Configuration issues detected:")
        for warning in warnings:
            logger.warning(f"  {warning}")
    
  3. Use constants instead of hardcoded values:

    from config import DEFAULT_DATASET_NAME
    
    # Good
    dataset = folder_name or DEFAULT_DATASET_NAME
    
    # Avoid
    dataset = folder_name or "RePORTaLiN_sample"
    

Directory Structure

The configuration defines this structure:

RePORTaLiN/
├── data/                           (DATA_DIR)
│   ├── dataset/                    (DATASET_BASE_DIR)
│   │   └── <dataset_name>/         (DATASET_DIR)
│   └── data_dictionary_and_mapping_specifications/
│       └── RePORT_DEB_to_Tables_mapping.xlsx  (DICTIONARY_EXCEL_FILE)
│
└── results/                        (RESULTS_DIR)
    ├── dataset/
    │   └── <dataset_name>/         (CLEAN_DATASET_DIR)
    └── data_dictionary_mappings/   (DICTIONARY_JSON_OUTPUT_DIR)

See Also

Configuration

User guide for configuration and utility functions

Quick Start

Quick start guide with configuration validation

Troubleshooting

Troubleshooting with validation utilities

Extending RePORTaLiN

Extending the configuration module

Contributing

Configuration module contribution guidelines

Changelog

Version 0.3.0 changes and enhancements

main

Main pipeline that uses configuration

scripts.extract_data

Data extraction using configuration paths