Configuration

For Users: Customizing Your Setup

RePORTaLiN comes with sensible defaults that work out of the box. This guide shows you how to adjust settings if you need to customize where files are stored or how the tool behaves.

Changed in version 0.3.0: Added automatic directory creation and configuration validation to make setup easier.

Configuration File

The main configuration file is config.py in the project root. It defines all paths, settings, and parameters used throughout the pipeline.

What’s New in 0.8.6

New Features:
  • Automatically creates the folders you need

  • Checks your setup and warns you if something’s wrong

  • Better error messages when files are missing

  • Improved handling of dataset folder names

Dynamic Dataset Detection

RePORTaLiN automatically detects your dataset folder:

# config.py automatically finds the first folder in data/dataset/
DATASET_DIR = os.path.join(DATA_DIR, "dataset", dataset_folder)

This means you can work with any dataset without modifying code:

data/dataset/
└── my_study_data/         # Automatically detected
    ├── file1.xlsx
    └── file2.xlsx
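
For reference, the detection amounts to picking the first subfolder of data/dataset/. A minimal sketch of how this could be implemented (illustrative only; the actual get_dataset_folder() in config.py may differ):

import os

def get_dataset_folder():
    """Return the name of the first subfolder under data/dataset/, or None."""
    base = os.path.join(DATA_DIR, "dataset")  # DATA_DIR as defined in config.py
    if not os.path.isdir(base):
        return None
    subdirs = sorted(
        entry for entry in os.listdir(base)
        if os.path.isdir(os.path.join(base, entry))
    )
    return subdirs[0] if subdirs else None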

Changed in version 0.3.0: Improved automatic detection with better error handling for special cases.

Configuration Variables

Project Root

ROOT_DIR = os.path.dirname(os.path.abspath(__file__)) if '__file__' in globals() else os.getcwd()
  • Purpose: Absolute path to project root directory

  • Usage: All other paths are relative to this

  • Modification: Not recommended (auto-detected)

Changed in version 0.3.0: Added support for running in interactive environments like Jupyter notebooks.

Data Directories

DATA_DIR = os.path.join(ROOT_DIR, "data")
RESULTS_DIR = os.path.join(ROOT_DIR, "results")
  • DATA_DIR: Location of raw input data

  • RESULTS_DIR: Location for processed outputs

  • Modification: Can be changed if you want different locations

Dataset Paths

DATASET_BASE_DIR = os.path.join(DATA_DIR, "dataset")
DATASET_FOLDER_NAME = get_dataset_folder()  # Auto-detected
DATASET_DIR = os.path.join(DATASET_BASE_DIR, DATASET_FOLDER_NAME or DEFAULT_DATASET_NAME)
DATASET_NAME = normalize_dataset_name(DATASET_FOLDER_NAME)
  • DATASET_BASE_DIR: Parent directory for all datasets

  • DATASET_FOLDER_NAME: Name of detected folder (returned by get_dataset_folder())

  • DATASET_DIR: Full path to current dataset (auto-detected)

  • DATASET_NAME: Cleaned dataset name (e.g., “Indo-vap_csv_files” → “Indo-vap”)
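
With the example tree shown earlier (data/dataset/my_study_data/), these resolve to:

import config

print(config.DATASET_FOLDER_NAME)  # "my_study_data"
print(config.DATASET_NAME)         # "my_study_data" (no suffix to strip)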

Changed in version 0.3.0: Now automatically cleans up dataset names and handles common file endings better.

Output Directories

CLEAN_DATASET_DIR = os.path.join(RESULTS_DIR, "dataset", DATASET_NAME)
DICTIONARY_JSON_OUTPUT_DIR = os.path.join(RESULTS_DIR, "data_dictionary_mappings")
  • CLEAN_DATASET_DIR: Where extracted JSONL files are saved

  • DICTIONARY_JSON_OUTPUT_DIR: Where dictionary tables are saved

Data Dictionary

DICTIONARY_EXCEL_FILE = os.path.join(
    DATA_DIR,
    "data_dictionary_and_mapping_specifications",
    "RePORT_DEB_to_Tables_mapping.xlsx"
)
  • Purpose: Path to the data dictionary Excel file

  • Modification: Change filename if your dictionary has a different name

Logging Settings

LOG_LEVEL = logging.INFO
LOG_NAME = "reportalin"
  • LOG_LEVEL: Controls verbosity (DEBUG, INFO, WARNING, ERROR)

  • LOG_NAME: Logger instance name

Available log levels:

  • logging.DEBUG: Detailed diagnostic information

  • logging.INFO: General informational messages (default)

  • logging.WARNING: Warning messages

  • logging.ERROR: Error messages only
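
For example, with Python's standard logging module (a minimal sketch; the pipeline's own logging setup may differ):

import logging
from config import LOG_LEVEL, LOG_NAME

logging.basicConfig(level=LOG_LEVEL)
logger = logging.getLogger(LOG_NAME)
logger.info("visible at INFO and below")
logger.debug("only visible when LOG_LEVEL is logging.DEBUG")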

De-identification Settings

Added in version 0.3.0: De-identification configuration is now documented with comprehensive examples.

De-identification behavior can be customized through the DeidentificationConfig options:

import logging

# PHIType is assumed to be exported from the same module as DeidentificationConfig
from scripts.deidentify import DeidentificationConfig, PHIType

config = DeidentificationConfig(
    # Pseudonym templates
    pseudonym_templates={
        PHIType.NAME_FULL: "PATIENT-{id}",
        PHIType.MRN: "MRN-{id}",
        # ... other templates
    },

    # Date shifting
    enable_date_shifting=True,
    date_shift_range_days=365,
    preserve_date_intervals=True,

    # Security
    enable_encryption=True,
    encryption_key=None,  # Auto-generated if None

    # Validation
    enable_validation=True,
    strict_mode=True,

    # Logging
    log_detections=True,
    log_level=logging.INFO,

    # Country-specific regulations
    countries=['IN', 'US'],  # None for default (IN)
    enable_country_patterns=True
)

Key Parameters:

  • pseudonym_templates: Custom format for pseudonyms (e.g., “PATIENT-{id}”)

  • enable_date_shifting: Shift dates by consistent offset

  • date_shift_range_days: Maximum shift range (±365 days default)

  • preserve_date_intervals: Keep time intervals consistent (see the toy example after this list)

  • enable_encryption: Encrypt mapping files with Fernet

  • encryption_key: Custom encryption key (auto-generated if None)

  • enable_validation: Validate de-identified output

  • strict_mode: Fail on validation errors

  • log_detections: Log detected PHI/PII items

  • countries: List of country codes for country-specific patterns

  • enable_country_patterns: Use country-specific detection patterns
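
To see what consistent date shifting means, here is a toy illustration in plain Python (not the library's API): every date in a record shifts by the same random offset, so the intervals between dates survive de-identification.

import random
from datetime import date, timedelta

# One offset per record, drawn once from the +/-365-day range
offset = timedelta(days=random.randint(-365, 365))

visit_1 = date(2021, 3, 1) + offset
visit_2 = date(2021, 3, 15) + offset

assert (visit_2 - visit_1).days == 14  # the 14-day interval is preserved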

Example Configurations:

Basic de-identification (India-specific):

config = DeidentificationConfig()  # Uses defaults

Multi-country de-identification:

config = DeidentificationConfig(
    countries=['US', 'IN', 'BR', 'ID'],
    enable_encryption=True
)

Testing/development (no encryption):

config = DeidentificationConfig(
    enable_encryption=False,
    log_level=logging.DEBUG
)

See De-identification for the complete de-identification guide.

Helper Tools

Added in version 0.3.0.

The configuration module provides helper functions for common tasks.

ensure_directories()

Automatically creates all required directories if they don’t exist.

from config import ensure_directories

# Create all necessary directories
ensure_directories()
What it creates:
  • RESULTS_DIR

  • CLEAN_DATASET_DIR

  • DICTIONARY_JSON_OUTPUT_DIR

When to use:
  • At the start of your pipeline

  • Before writing any output files

  • When setting up a new environment

validate_config()

Validates the configuration and returns a list of warnings.

from config import validate_config

warnings = validate_config()
if warnings:
    print("Configuration warnings:")
    for warning in warnings:
        print(f"  - {warning}")
else:
    print("Configuration is valid!")
What it checks:
  • DATA_DIR exists

  • DATASET_DIR exists

  • DICTIONARY_EXCEL_FILE exists

Returns:
  • Empty list if all valid

  • List of warning strings if issues found

When to use:
  • Before starting the pipeline

  • For debugging configuration issues

  • In automated testing
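
A typical startup sequence combines both helpers (a sketch; adapt to your own entry point):

from config import ensure_directories, validate_config

warnings = validate_config()
for warning in warnings:
    print(f"Warning: {warning}")

ensure_directories()  # safe to call even when the directories already exist

# ... run the pipeline ...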

normalize_dataset_name()

Normalizes a dataset folder name by removing common suffixes such as “_csv_files” and “_files”.

from config import normalize_dataset_name

name = normalize_dataset_name("Indo-vap_csv_files")
print(name)  # Output: "Indo-vap"
Parameters:
  • folder_name (Optional[str]): Dataset folder name

Returns:
  • Normalized name, or DEFAULT_DATASET_NAME if None

Examples:

normalize_dataset_name("study_csv_files")  # → "study"
normalize_dataset_name("test_files")       # → "test"
normalize_dataset_name(None)               # → "RePORTaLiN_sample"
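
The behavior amounts to stripping known suffixes. A minimal sketch (illustrative; the actual suffix list in config.py may differ):

def normalize_dataset_name(folder_name):
    if folder_name is None:
        return DEFAULT_DATASET_NAME  # "RePORTaLiN_sample"
    for suffix in ("_csv_files", "_files"):  # assumed suffix list
        if folder_name.endswith(suffix):
            return folder_name[: -len(suffix)]
    return folder_name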

Customizing Configuration

Example 1: Change Log Level

To see more detailed debug information:

# config.py
import logging

LOG_LEVEL = logging.DEBUG  # More verbose logging

Example 2: Custom Data Location

To use a different data directory:

# config.py
DATA_DIR = "/path/to/my/data"
RESULTS_DIR = "/path/to/my/results"

Example 3: Different Dictionary File

If your data dictionary has a different name:

# config.py
DICTIONARY_EXCEL_FILE = os.path.join(
    DATA_DIR,
    "data_dictionary_and_mapping_specifications",
    "MyCustomDictionary.xlsx"
)

Environment Variables

You can also use environment variables for configuration:

# config.py
import os

# Use environment variable with fallback
DATA_DIR = os.getenv("REPORTALIN_DATA_DIR", os.path.join(ROOT_DIR, "data"))

Then set the environment variable:

export REPORTALIN_DATA_DIR="/my/custom/data/path"
python main.py
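
The same pattern extends to other settings. For example, a hypothetical REPORTALIN_LOG_LEVEL variable could control verbosity:

# config.py (REPORTALIN_LOG_LEVEL is a hypothetical variable, shown for illustration)
import logging
import os

LOG_LEVEL = getattr(
    logging,
    os.getenv("REPORTALIN_LOG_LEVEL", "INFO").upper(),
    logging.INFO,  # fall back to INFO if the value is unrecognized
)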

Configuration Best Practices

  1. Don’t Hardcode Paths

    ❌ Bad:

    file_path = "/Users/john/data/file.xlsx"
    

    ✅ Good:

    file_path = os.path.join(config.DATA_DIR, "file.xlsx")
    
  2. Use Path Objects

    For more robust path handling:

    from pathlib import Path
    
    DATA_DIR = Path(ROOT_DIR) / "data"
    DATASET_DIR = DATA_DIR / "dataset" / dataset_name
    
  3. Keep Configuration Separate

    Don’t mix configuration with business logic:

    ❌ Bad: Hardcoding paths in processing functions

    ✅ Good: Use the configuration file

  4. Document Changes

    If you modify config.py, document why:

    # Changed to use external storage per project requirements
    DATA_DIR = "/mnt/shared/research_data"
    

Accessing Configuration

In Your Code

import config

# Access configuration variables
print(f"Dataset: {config.DATASET_NAME}")
print(f"Input dir: {config.DATASET_DIR}")
print(f"Output dir: {config.CLEAN_DATASET_DIR}")

From Command Line

# Print current configuration
python -c "import config; print(f'Dataset: {config.DATASET_NAME}')"
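
The validation helper can be run the same way:

# Check the configuration without starting the pipeline
python -c "import config; print(config.validate_config() or 'Configuration is valid')"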

Directory Structure

The configuration expects (and, for outputs, creates) this directory structure:

RePORTaLiN/
├── data/
│   ├── dataset/
│   │   └── <dataset_name>/          # Auto-detected
│   └── data_dictionary_and_mapping_specifications/
│       └── RePORT_DEB_to_Tables_mapping.xlsx
│
└── results/
    ├── dataset/
    │   └── <dataset_name>/          # Mirrors input structure
    └── data_dictionary_mappings/
        ├── Codelists/
        ├── tblENROL/
        └── ...

Troubleshooting Configuration

Problem: “Dataset not found”

Cause: No folder exists in data/dataset/

Solution: Create a dataset folder:

mkdir -p data/dataset/my_dataset
# Add Excel files to this directory

Problem: “Permission denied”

Cause: Output directories not writable

Solution: Check permissions:

chmod -R 755 results/
chmod 755 .logs/

Problem: “Config file not found”

Cause: main.py is being run from outside the project root

Solution: Change to the project root first:

cd /path/to/RePORTaLiN
python main.py

See Also