Configuration
For Users: Customizing Your Setup
RePORTaLiN comes with sensible defaults that work out of the box. This guide shows you how to adjust settings if you need to customize where files are stored or how the tool behaves.
Changed in version 0.3.0: Added automatic directory creation and configuration validation to make setup easier.
Configuration File
The main configuration file is config.py in the project root. It defines all paths,
settings, and parameters used throughout the pipeline.
What’s New in 0.3.0
- ✨ New Features:
Automatically creates folders you need
Checks your setup and warns you if something’s wrong
Better error messages when files are missing
Improved handling of dataset folder names
Dynamic Dataset Detection
RePORTaLiN automatically detects your dataset folder:
# config.py automatically finds the first folder in data/dataset/
DATASET_DIR = os.path.join(DATA_DIR, "dataset", dataset_folder)
This means you can work with any dataset without modifying code:
data/dataset/
└── my_study_data/ # Automatically detected
├── file1.xlsx
└── file2.xlsx
Changed in version 0.3.0: Improved automatic detection with better error handling for special cases.
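For reference, the detection logic is conceptually similar to the sketch below. This is a minimal illustration, not the exact implementation of get_dataset_folder() in config.py, which may differ in sorting and error handling:
import os

def get_dataset_folder():
    """Return the first subfolder of data/dataset/, or None if empty.

    Illustrative sketch only; the real helper in config.py may differ.
    """
    base = os.path.join(DATA_DIR, "dataset")
    if not os.path.isdir(base):
        return None
    folders = sorted(
        entry for entry in os.listdir(base)
        if os.path.isdir(os.path.join(base, entry))
    )
    return folders[0] if folders else None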
Configuration Variables
Project Root
ROOT_DIR = os.path.dirname(os.path.abspath(__file__)) if '__file__' in globals() else os.getcwd()
Purpose: Absolute path to project root directory
Usage: All other paths are relative to this
Modification: Not recommended (auto-detected)
Changed in version 0.3.0: Added support for running in interactive environments like Jupyter notebooks.
Data Directories
DATA_DIR = os.path.join(ROOT_DIR, "data")
RESULTS_DIR = os.path.join(ROOT_DIR, "results")
DATA_DIR: Location of raw input data
RESULTS_DIR: Location for processed outputs
Modification: Can be changed if you want different locations
Dataset Paths
DATASET_BASE_DIR = os.path.join(DATA_DIR, "dataset")
DATASET_FOLDER_NAME = get_dataset_folder() # Auto-detected
DATASET_DIR = os.path.join(DATASET_BASE_DIR, DATASET_FOLDER_NAME or DEFAULT_DATASET_NAME)
DATASET_NAME = normalize_dataset_name(DATASET_FOLDER_NAME)
DATASET_BASE_DIR: Parent directory for all datasets
DATASET_FOLDER_NAME: Name of the detected folder (returned by get_dataset_folder())
DATASET_DIR: Full path to the current dataset (auto-detected)
DATASET_NAME: Cleaned dataset name (e.g., “Indo-vap_csv_files” → “Indo-vap”)
Changed in version 0.3.0: Now automatically cleans up dataset names and handles common file endings better.
Output Directories
CLEAN_DATASET_DIR = os.path.join(RESULTS_DIR, "dataset", DATASET_NAME)
DICTIONARY_JSON_OUTPUT_DIR = os.path.join(RESULTS_DIR, "data_dictionary_mappings")
CLEAN_DATASET_DIR: Where extracted JSONL files are saved
DICTIONARY_JSON_OUTPUT_DIR: Where dictionary tables are saved
Data Dictionary
DICTIONARY_EXCEL_FILE = os.path.join(
DATA_DIR,
"data_dictionary_and_mapping_specifications",
"RePORT_DEB_to_Tables_mapping.xlsx"
)
Purpose: Path to the data dictionary Excel file
Modification: Change filename if your dictionary has a different name
Logging Settings
LOG_LEVEL = logging.INFO
LOG_NAME = "reportalin"
LOG_LEVEL: Controls verbosity (INFO, DEBUG, WARNING, ERROR)
LOG_NAME: Logger instance name
Available log levels:
logging.DEBUG: Detailed diagnostic information
logging.INFO: General informational messages (default)
logging.WARNING: Warning messages
logging.ERROR: Error messages only
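Because these are standard logging levels, you can retrieve the pipeline’s logger anywhere in your own code with the standard library:
import logging
import config

# Look up the shared logger by its configured name and set its level
logger = logging.getLogger(config.LOG_NAME)
logger.setLevel(config.LOG_LEVEL)
logger.debug("Only shown when LOG_LEVEL is logging.DEBUG")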
De-identification Settings
Added in version 0.3.0: De-identification configuration is now documented with comprehensive examples.
De-identification settings can be customized using the configuration options:
from scripts.deidentify import DeidentificationConfig
config = DeidentificationConfig(
# Pseudonym templates
pseudonym_templates={
PHIType.NAME_FULL: "PATIENT-{id}",
PHIType.MRN: "MRN-{id}",
# ... other templates
},
# Date shifting
enable_date_shifting=True,
date_shift_range_days=365,
preserve_date_intervals=True,
# Security
enable_encryption=True,
encryption_key=None, # Auto-generated if None
# Validation
enable_validation=True,
strict_mode=True,
# Logging
log_detections=True,
log_level=logging.INFO,
# Country-specific regulations
countries=['IN', 'US'], # None for default (IN)
enable_country_patterns=True
)
Key Parameters:
pseudonym_templates: Custom format for pseudonyms (e.g., “PATIENT-{id}”)
enable_date_shifting: Shift dates by consistent offset
date_shift_range_days: Maximum shift range (±365 days default)
preserve_date_intervals: Keep time intervals consistent
enable_encryption: Encrypt mapping files with Fernet
encryption_key: Custom encryption key (auto-generated if None)
enable_validation: Validate de-identified output
strict_mode: Fail on validation errors
log_detections: Log detected PHI/PII items
countries: List of country codes for country-specific patterns
enable_country_patterns: Use country-specific detection patterns
Example Configurations:
Basic de-identification (India-specific):
config = DeidentificationConfig() # Uses defaults
Multi-country de-identification:
config = DeidentificationConfig(
countries=['US', 'IN', 'BR', 'ID'],
enable_encryption=True
)
Testing/development (no encryption):
config = DeidentificationConfig(
enable_encryption=False,
log_level=logging.DEBUG
)
See De-identification for the complete de-identification guide.
Helper Tools
Added in version 0.3.0.
The configuration module provides helper functions for common tasks.
ensure_directories()
Automatically creates all required directories if they don’t exist.
from config import ensure_directories
# Create all necessary directories
ensure_directories()
- What it creates:
RESULTS_DIR
CLEAN_DATASET_DIR
DICTIONARY_JSON_OUTPUT_DIR
- When to use:
At the start of your pipeline
Before writing any output files
When setting up a new environment
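A minimal sketch of what the helper does (the actual function in config.py may create additional directories or log its actions):
import os

def ensure_directories():
    """Create required output directories; safe to call repeatedly."""
    for path in (RESULTS_DIR, CLEAN_DATASET_DIR, DICTIONARY_JSON_OUTPUT_DIR):
        os.makedirs(path, exist_ok=True)  # exist_ok avoids errors on reruns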
validate_config()
Validates the configuration and returns a list of warnings.
from config import validate_config
warnings = validate_config()
if warnings:
print("Configuration warnings:")
for warning in warnings:
print(f" - {warning}")
else:
print("Configuration is valid!")
- What it checks:
DATA_DIR exists
DATASET_DIR exists
DICTIONARY_EXCEL_FILE exists
- Returns:
Empty list if all valid
List of warning strings if issues found
- When to use:
Before starting the pipeline
For debugging configuration issues
In automated testing
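Conceptually, the validation amounts to the following sketch (the real validate_config() in config.py may perform additional checks):
import os

def validate_config():
    """Return a list of warning strings for missing paths (empty if valid)."""
    warnings = []
    checks = [
        ("DATA_DIR", DATA_DIR),
        ("DATASET_DIR", DATASET_DIR),
        ("DICTIONARY_EXCEL_FILE", DICTIONARY_EXCEL_FILE),
    ]
    for label, path in checks:
        if not os.path.exists(path):
            warnings.append(f"{label} does not exist: {path}")
    return warnings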
normalize_dataset_name()
Normalize a dataset folder name by removing common suffixes.
from config import normalize_dataset_name
name = normalize_dataset_name("Indo-vap_csv_files")
print(name) # Output: "Indo-vap"
- Parameters:
folder_name (Optional[str]): Dataset folder name
- Returns:
Normalized name, or
DEFAULT_DATASET_NAME if None
Examples:
normalize_dataset_name("study_csv_files") # → "study"
normalize_dataset_name("test_files") # → "test"
normalize_dataset_name(None) # → "RePORTaLiN_sample"
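The behavior shown in the examples above is consistent with a simple suffix-stripping sketch like this (the actual helper in config.py may handle more cases):
def normalize_dataset_name(folder_name):
    """Strip common suffixes; return DEFAULT_DATASET_NAME for None."""
    if folder_name is None:
        return DEFAULT_DATASET_NAME
    # Check the longer suffix first so "_csv_files" wins over "_files"
    for suffix in ("_csv_files", "_files"):
        if folder_name.endswith(suffix):
            return folder_name[: -len(suffix)]
    return folder_name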
Customizing Configuration
Example 1: Change Log Level
To see more detailed debug information:
# config.py
import logging
LOG_LEVEL = logging.DEBUG # More verbose logging
Example 2: Custom Data Location
To use a different data directory:
# config.py
DATA_DIR = "/path/to/my/data"
RESULTS_DIR = "/path/to/my/results"
Example 3: Different Dictionary File
If your data dictionary has a different name:
# config.py
DICTIONARY_EXCEL_FILE = os.path.join(
DATA_DIR,
"data_dictionary_and_mapping_specifications",
"MyCustomDictionary.xlsx"
)
Environment Variables
You can also use environment variables for configuration:
# config.py
import os
# Use environment variable with fallback
DATA_DIR = os.getenv("REPORTALIN_DATA_DIR", os.path.join(ROOT_DIR, "data"))
Then set the environment variable:
export REPORTALIN_DATA_DIR="/my/custom/data/path"
python main.py
Configuration Best Practices
Don’t Hardcode Paths
❌ Bad:
file_path = "/Users/john/data/file.xlsx"
✅ Good:
file_path = os.path.join(config.DATA_DIR, "file.xlsx")
Use Path Objects
For more robust path handling:
from pathlib import Path

DATA_DIR = Path(ROOT_DIR) / "data"
DATASET_DIR = DATA_DIR / "dataset" / dataset_name
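Path objects compose with the / operator and offer convenient checks, for example:
from pathlib import Path
import config

# Wrap the configured root in a Path for ergonomic joins and checks
data_dir = Path(config.ROOT_DIR) / "data"
print(data_dir.exists())               # True once the directory is created
print((data_dir / "dataset").is_dir())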
Keep Configuration Separate
Don’t mix configuration with business logic:
❌ Bad: Hardcoding paths in processing functions
✅ Good: Use the configuration file
Document Changes
If you modify config.py, document why:
# Changed to use external storage per project requirements
DATA_DIR = "/mnt/shared/research_data"
Accessing Configuration
In Your Code
import config
# Access configuration variables
print(f"Dataset: {config.DATASET_NAME}")
print(f"Input dir: {config.DATASET_DIR}")
print(f"Output dir: {config.CLEAN_DATASET_DIR}")
From Command Line
# Print current configuration
python -c "import config; print(f'Dataset: {config.DATASET_NAME}')"
Directory Structure
The configuration creates this structure:
RePORTaLiN/
├── data/
│ ├── dataset/
│ │ └── <dataset_name>/ # Auto-detected
│ └── data_dictionary_and_mapping_specifications/
│ └── RePORT_DEB_to_Tables_mapping.xlsx
│
└── results/
├── dataset/
│ └── <dataset_name>/ # Mirrors input structure
└── data_dictionary_mappings/
├── Codelists/
├── tblENROL/
└── ...
Troubleshooting Configuration
Problem: “Dataset not found”
Cause: No folder exists in data/dataset/
Solution: Create a dataset folder:
mkdir -p data/dataset/my_dataset
# Add Excel files to this directory
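You can then confirm the folder is detected:
# Should report no warnings once the dataset folder exists
python -c "import config; print(config.validate_config() or 'OK')"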
Problem: “Permission denied”
Cause: Output directories not writable
Solution: Check permissions:
chmod -R 755 results/
chmod 755 .logs/
Problem: “Config file not found”
Cause: Not running from the correct folder
Solution: Ensure you’re in the correct directory:
cd /path/to/RePORTaLiN
python main.py
See Also
Quick Start: Quick start guide with validation examples
Usage Guide: How to use configuration in practice
Troubleshooting: Configuration troubleshooting with validate_config()
config module: Complete technical documentation for configuration settings
Extending RePORTaLiN: Extending configuration for custom needs
Changelog: Version 0.3.0 configuration enhancements