config module
RePORTaLiN Configuration Module
Centralized configuration management with dynamic dataset detection, automatic path resolution, and flexible logging configuration.
Overview
The config module provides centralized configuration management for RePORTaLiN.
All paths, settings, and parameters are defined here to ensure consistency across
all pipeline components.
Changed in version 0.3.0: Added ensure_directories(), validate_config(), and normalize_dataset_name() functions.
Enhanced error handling and type safety. Fixed suffix removal bug.
Module Metadata
__version__
__version__ = '1.0.0'
Module version string.
__all__
__all__ = [
'ROOT_DIR', 'DATA_DIR', 'RESULTS_DIR', 'DATASET_BASE_DIR',
'DATASET_FOLDER_NAME', 'DATASET_DIR', 'DATASET_NAME', 'CLEAN_DATASET_DIR',
'DICTIONARY_EXCEL_FILE', 'DICTIONARY_JSON_OUTPUT_DIR',
'LOG_LEVEL', 'LOG_NAME',
'ensure_directories', 'validate_config',
'DEFAULT_DATASET_NAME'
]
Public API exports. Only these symbols are exported with from config import *.
Configuration Variables
Constants
DEFAULT_DATASET_NAME
DEFAULT_DATASET_NAME = "RePORTaLiN_sample"
Default dataset name used when no dataset folder is detected.
Added in version 0.3.0: Extracted as a public constant.
DATASET_SUFFIXES
DATASET_SUFFIXES = ('_csv_files', '_files')
Tuple of common suffixes removed from dataset folder names during normalization.
Added in version 0.3.0: Internal constant (not in __all__).
Directory Paths
ROOT_DIR
ROOT_DIR = os.path.dirname(os.path.abspath(__file__)) if '__file__' in globals() else os.getcwd()
Absolute path to the project root directory. All other paths are relative to this.
Changed in version 0.3.0: Added fallback to os.getcwd() when __file__ is not available (REPL environments).
DATA_DIR
DATA_DIR = os.path.join(ROOT_DIR, "data")
Path to the data directory containing input files.
RESULTS_DIR
RESULTS_DIR = os.path.join(ROOT_DIR, "results")
Path to the results directory for output files.
Dataset Paths
DATASET_BASE_DIR
DATASET_BASE_DIR = os.path.join(DATA_DIR, "dataset")
Base directory containing dataset folders.
DATASET_FOLDER_NAME
DATASET_FOLDER_NAME = get_dataset_folder() or DEFAULT_DATASET_NAME
Name of the auto-detected dataset folder, falling back to DEFAULT_DATASET_NAME when no folder is detected.
DATASET_DIR
DATASET_DIR = os.path.join(DATASET_BASE_DIR, DATASET_FOLDER_NAME)
Path to the current dataset directory (auto-detected).
DATASET_NAME
DATASET_NAME = normalize_dataset_name(DATASET_FOLDER_NAME)
Name of the current dataset (e.g., “Indo-vap”), extracted by removing common suffixes
from the dataset folder name using the normalize_dataset_name() function.
Changed in version 0.3.0: Now uses normalize_dataset_name() function for improved suffix handling.
Output Paths
CLEAN_DATASET_DIR
CLEAN_DATASET_DIR = os.path.join(RESULTS_DIR, "dataset", DATASET_NAME)
Output directory for extracted JSONL files.
DICTIONARY_JSON_OUTPUT_DIR
DICTIONARY_JSON_OUTPUT_DIR = os.path.join(RESULTS_DIR, "data_dictionary_mappings")
Output directory for data dictionary tables.
Dictionary File
DICTIONARY_EXCEL_FILE
DICTIONARY_EXCEL_FILE = os.path.join(
DATA_DIR,
"data_dictionary_and_mapping_specifications",
"RePORT_DEB_to_Tables_mapping.xlsx"
)
Path to the data dictionary Excel file.
Logging Settings
LOG_LEVEL
LOG_LEVEL = logging.INFO
Logging verbosity level. Options:
- logging.DEBUG: Detailed diagnostic information
- logging.INFO: General informational messages (default)
- logging.WARNING: Warning messages
- logging.ERROR: Error messages only
LOG_NAME
LOG_NAME = "reportalin"
Logger instance name used throughout the application.
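As a sketch of how these two settings might be consumed elsewhere in the project (the actual logging setup lives outside this module), a logger can be retrieved by LOG_NAME and set to LOG_LEVEL:
```python
import logging

# Mirrors the documented settings; values assumed from this page.
LOG_LEVEL = logging.INFO
LOG_NAME = "reportalin"

# Retrieve the shared application logger by name and apply the
# configured verbosity level.
logger = logging.getLogger(LOG_NAME)
logger.setLevel(LOG_LEVEL)

logger.info("logger configured")  # emitted at INFO and above
```
Because loggers are looked up by name, any module calling logging.getLogger(LOG_NAME) receives the same configured instance.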
Functions
get_dataset_folder
- config.get_dataset_folder()[source]
Automatically detect the dataset folder from the file system. Returns the first
alphabetically sorted folder in data/dataset/, excluding hidden folders
(those starting with ‘.’).
- Returns:
str: Name of the detected dataset folder
None: If no folders exist or the directory is inaccessible
Note
Folders starting with ‘.’ are excluded as they are typically hidden. Errors during directory listing are silently handled to avoid issues during module initialization (before logging is configured).
Example:
from config import get_dataset_folder
folder = get_dataset_folder()
if folder:
    print(f"Detected dataset: {folder}")
else:
    print("No dataset folder found")
Changed in version 0.3.0: Removed faulty '..' not in f check. Added explicit empty list validation.
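The detection logic described above can be sketched as follows. This is an illustrative reimplementation, not the module's exact code; it takes base_dir as a parameter (the real function reads DATASET_BASE_DIR) to stay self-contained:
```python
import os
from typing import Optional

def detect_dataset_folder(base_dir: str) -> Optional[str]:
    """Return the first visible subfolder of base_dir, or None."""
    try:
        folders = sorted(
            entry for entry in os.listdir(base_dir)
            if not entry.startswith('.')                      # skip hidden folders
            and os.path.isdir(os.path.join(base_dir, entry))  # folders only
        )
    except OSError:
        # Errors are swallowed: at import time logging is not yet configured.
        return None
    return folders[0] if folders else None
```
Sorting before taking the first entry makes detection deterministic when several dataset folders are present.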
normalize_dataset_name
- config.normalize_dataset_name(folder_name)[source]
Normalize dataset folder name by removing common suffixes.
- Parameters:
folder_name (Optional[str]) – The dataset folder name to normalize
- Return type:
str
- Returns:
Normalized dataset name
- Algorithm:
Removes the longest matching suffix from DATASET_SUFFIXES to handle overlapping suffixes correctly (e.g., _csv_files vs _files).
Note
Removes the longest matching suffix to ensure correct normalization regardless of suffix ordering (e.g., ‘_csv_files’ before ‘_files’). Whitespace is stripped before suffix removal for consistent behavior.
Examples:
from config import normalize_dataset_name
# Remove suffix
name = normalize_dataset_name("Indo-vap_csv_files")
print(name) # Output: "Indo-vap"
# Handle overlapping suffixes
name = normalize_dataset_name("test_files")
print(name) # Output: "test"
# Fallback to default
name = normalize_dataset_name(None)
print(name) # Output: "RePORTaLiN_sample"
Added in version 0.3.0: Extracted from inline code. Uses longest-match algorithm.
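Putting the pieces together, the longest-match behavior can be sketched as below. This is an illustrative reimplementation under the documented constants, not the module's exact code:
```python
from typing import Optional

DATASET_SUFFIXES = ('_csv_files', '_files')
DEFAULT_DATASET_NAME = "RePORTaLiN_sample"

def normalize_dataset_name(folder_name: Optional[str]) -> str:
    """Strip the longest matching suffix; fall back to the default name."""
    if not folder_name:
        return DEFAULT_DATASET_NAME
    name = folder_name.strip()  # whitespace is stripped before suffix removal
    # Check longer suffixes first so '_csv_files' wins over '_files'.
    for suffix in sorted(DATASET_SUFFIXES, key=len, reverse=True):
        if name.endswith(suffix):
            return name[:-len(suffix)]
    return name
```
Sorting the suffixes by length descending is what makes "Indo-vap_csv_files" normalize to "Indo-vap" rather than "Indo-vap_csv".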
ensure_directories
- config.ensure_directories()[source]
Create necessary directories if they don’t exist.
- Return type:
None
- Creates:
RESULTS_DIR
CLEAN_DATASET_DIR
DICTIONARY_JSON_OUTPUT_DIR
Example:
from config import ensure_directories
# Create all required directories
ensure_directories()
Added in version 0.3.0: New utility function for directory management.
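A minimal sketch of what ensure_directories() likely does, parameterized here for self-containment (the real function takes no arguments and operates on the module constants listed above):
```python
import os

def ensure_directories(*directories: str) -> None:
    """Create each directory (and any missing parents), tolerating existing ones."""
    for directory in directories:
        os.makedirs(directory, exist_ok=True)  # no error if it already exists
```
In the module itself the equivalent call would cover RESULTS_DIR, CLEAN_DATASET_DIR, and DICTIONARY_JSON_OUTPUT_DIR; exist_ok=True makes repeated calls safe.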
validate_config
- config.validate_config()[source]
Validate configuration and return a list of warnings for missing or invalid paths.
- Returns:
List[str]: List of warning messages (empty if all valid)
- Validates:
DATA_DIR exists
DATASET_DIR exists
DICTIONARY_EXCEL_FILE exists
Example:
from config import validate_config
warnings = validate_config()
if warnings:
    print("Configuration warnings:")
    for warning in warnings:
        print(f" - {warning}")
else:
    print("Configuration is valid!")
Added in version 0.3.0: New utility function for configuration validation.
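The validation step can be sketched as follows; the path labels are passed in explicitly here for illustration, whereas the real function checks the module constants directly:
```python
import os
from typing import Dict, List

def validate_paths(paths: Dict[str, str]) -> List[str]:
    """Return one warning per configured path that does not exist."""
    warnings = []
    for label, path in paths.items():
        if not os.path.exists(path):
            warnings.append(f"{label} not found: {path}")
    return warnings
```
Returning warnings rather than raising lets the caller decide whether a missing path is fatal, which matches the logger-based usage shown under Best Practices.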
Usage Examples
Basic Usage
import config
# Access configuration
print(f"Dataset: {config.DATASET_NAME}")
print(f"Input: {config.DATASET_DIR}")
print(f"Output: {config.CLEAN_DATASET_DIR}")
Using New Utility Functions
from config import ensure_directories, validate_config
# Ensure all directories exist
ensure_directories()
# Validate configuration
warnings = validate_config()
if warnings:
    for warning in warnings:
        print(f"Warning: {warning}")
Custom Configuration
# config.py modifications
import os
# Use environment variable
DATA_DIR = os.getenv("REPORTALIN_DATA", os.path.join(ROOT_DIR, "data"))
# Custom logging
import logging
LOG_LEVEL = logging.DEBUG
Best Practices
Always call ensure_directories() before file operations:
from config import ensure_directories, CLEAN_DATASET_DIR

ensure_directories()
# Now safe to write to CLEAN_DATASET_DIR
Validate configuration at startup:
from config import validate_config

warnings = validate_config()
if warnings:
    logger.warning("Configuration issues detected:")
    for warning in warnings:
        logger.warning(f"  {warning}")
Use constants instead of hardcoded values:
from config import DEFAULT_DATASET_NAME

# Good
dataset = folder_name or DEFAULT_DATASET_NAME

# Avoid
dataset = folder_name or "RePORTaLiN_sample"
Directory Structure
The configuration defines this structure:
RePORTaLiN/
├── data/ (DATA_DIR)
│ ├── dataset/ (DATASET_BASE_DIR)
│ │ └── <dataset_name>/ (DATASET_DIR)
│ └── data_dictionary_and_mapping_specifications/
│ └── RePORT_DEB_to_Tables_mapping.xlsx (DICTIONARY_EXCEL_FILE)
│
└── results/ (RESULTS_DIR)
├── dataset/
│ └── <dataset_name>/ (CLEAN_DATASET_DIR)
└── data_dictionary_mappings/ (DICTIONARY_JSON_OUTPUT_DIR)
See Also
- Configuration
User guide for configuration and utility functions
- Quick Start
Quick start guide with configuration validation
- Troubleshooting
Troubleshooting with validation utilities
- Extending RePORTaLiN
Extending the configuration module
- Contributing
Configuration module contribution guidelines
- Changelog
Version 0.3.0 changes and enhancements
- main
Main pipeline that uses configuration
- scripts.extract_data
Data extraction using configuration paths