Configuration

For Users: Customizing Your Setup

RePORTaLiN comes with sensible defaults that work out of the box. This guide shows you how to adjust settings if you need to customize where files are stored or how the tool behaves.

Changed in version 0.3.0: Added automatic directory creation and configuration validation to make setup easier.

Configuration File

The main configuration file is config.py in the project root. It defines all paths, settings, and parameters used throughout the pipeline.

What’s New in 0.8.6

New Features:
  • Automatically creates the folders you need

  • Checks your setup and warns you if something’s wrong

  • Better error messages when files are missing

  • Improved handling of dataset folder names

Dynamic Dataset Detection

RePORTaLiN automatically detects your dataset folder:

# config.py automatically finds the first folder in data/dataset/
DATASET_DIR = os.path.join(DATA_DIR, "dataset", dataset_folder)

This means you can work with any dataset without modifying code:

data/dataset/
└── my_study_data/         # Automatically detected
    ├── file1.xlsx
    └── file2.xlsx
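
For reference, the detection amounts to picking the first subfolder of data/dataset/. A minimal sketch of how this could be implemented (illustrative only; the actual get_dataset_folder() in config.py may differ):

import os

def get_dataset_folder():
    """Return the name of the first subfolder under data/dataset/, or None."""
    base = os.path.join(DATA_DIR, "dataset")  # DATA_DIR as defined in config.py
    if not os.path.isdir(base):
        return None
    subdirs = sorted(
        entry for entry in os.listdir(base)
        if os.path.isdir(os.path.join(base, entry))
    )
    return subdirs[0] if subdirs else None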

Changed in version 0.3.0: Improved automatic detection with better error handling for special cases.

Configuration Variables

Project Root

ROOT_DIR = os.path.dirname(os.path.abspath(__file__)) if '__file__' in globals() else os.getcwd()
  • Purpose: Absolute path to project root directory

  • Usage: All other paths are relative to this

  • Modification: Not recommended (auto-detected)

Changed in version 0.3.0: Added support for running in interactive environments like Jupyter notebooks.

Data Directories

DATA_DIR = os.path.join(ROOT_DIR, "data")
RESULTS_DIR = os.path.join(ROOT_DIR, "results")
  • DATA_DIR: Location of raw input data

  • RESULTS_DIR: Location for processed outputs

  • Modification: Can be changed if you want different locations

Dataset Paths

DATASET_BASE_DIR = os.path.join(DATA_DIR, "dataset")
DATASET_FOLDER_NAME = get_dataset_folder()  # Auto-detected
DATASET_DIR = os.path.join(DATASET_BASE_DIR, DATASET_FOLDER_NAME or DEFAULT_DATASET_NAME)
DATASET_NAME = normalize_dataset_name(DATASET_FOLDER_NAME)
  • DATASET_BASE_DIR: Parent directory for all datasets

  • DATASET_FOLDER_NAME: Name of detected folder (returned by get_dataset_folder())

  • DATASET_DIR: Full path to current dataset (auto-detected)

  • DATASET_NAME: Cleaned dataset name (e.g., “Indo-vap_csv_files” → “Indo-vap”)
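
With the example tree shown earlier (data/dataset/my_study_data/), these resolve to:

import config

print(config.DATASET_FOLDER_NAME)  # "my_study_data"
print(config.DATASET_NAME)         # "my_study_data" (no suffix to strip)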

Changed in version 0.3.0: Now automatically cleans up dataset names and handles common file endings better.

Output Directories

CLEAN_DATASET_DIR = os.path.join(RESULTS_DIR, "dataset", DATASET_NAME)
DICTIONARY_JSON_OUTPUT_DIR = os.path.join(RESULTS_DIR, "data_dictionary_mappings")
  • CLEAN_DATASET_DIR: Where extracted JSONL files are saved

  • DICTIONARY_JSON_OUTPUT_DIR: Where dictionary tables are saved

Data Dictionary

DICTIONARY_EXCEL_FILE = os.path.join(
    DATA_DIR,
    "data_dictionary_and_mapping_specifications",
    "RePORT_DEB_to_Tables_mapping.xlsx"
)
  • Purpose: Path to the data dictionary Excel file

  • Modification: Change filename if your dictionary has a different name

Logging Settings

LOG_LEVEL = logging.INFO
LOG_NAME = "reportalin"
  • LOG_LEVEL: Controls verbosity (DEBUG, INFO, WARNING, ERROR)

  • LOG_NAME: Logger instance name

Available log levels:

  • logging.DEBUG: Detailed diagnostic information

  • logging.INFO: General informational messages (default)

  • logging.WARNING: Warning messages

  • logging.ERROR: Error messages only
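
For example, with Python's standard logging module (a minimal sketch; the pipeline's own logging setup may differ):

import logging
from config import LOG_LEVEL, LOG_NAME

logging.basicConfig(level=LOG_LEVEL)
logger = logging.getLogger(LOG_NAME)
logger.info("visible at INFO and below")
logger.debug("only visible when LOG_LEVEL is logging.DEBUG")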

De-identification Settings

Added in version 0.3.0: De-identification configuration is now documented with comprehensive examples.

De-identification behavior can be customized through the DeidentificationConfig options:

import logging

# PHIType is assumed to be exported from the same module as DeidentificationConfig
from scripts.deidentify import DeidentificationConfig, PHIType

config = DeidentificationConfig(
    # Pseudonym templates
    pseudonym_templates={
        PHIType.NAME_FULL: "PATIENT-{id}",
        PHIType.MRN: "MRN-{id}",
        # ... other templates
    },

    # Date shifting
    enable_date_shifting=True,
    date_shift_range_days=365,
    preserve_date_intervals=True,

    # Security
    enable_encryption=True,
    encryption_key=None,  # Auto-generated if None

    # Validation
    enable_validation=True,
    strict_mode=True,

    # Logging
    log_detections=True,
    log_level=logging.INFO,

    # Country-specific regulations
    countries=['IN', 'US'],  # None for default (IN)
    enable_country_patterns=True
)

Key Parameters:

  • pseudonym_templates: Custom format for pseudonyms (e.g., “PATIENT-{id}”)

  • enable_date_shifting: Shift dates by consistent offset

  • date_shift_range_days: Maximum shift range (±365 days default)

  • preserve_date_intervals: Keep time intervals consistent (see the toy example after this list)

  • enable_encryption: Encrypt mapping files with Fernet

  • encryption_key: Custom encryption key (auto-generated if None)

  • enable_validation: Validate de-identified output

  • strict_mode: Fail on validation errors

  • log_detections: Log detected PHI/PII items

  • countries: List of country codes for country-specific patterns

  • enable_country_patterns: Use country-specific detection patterns
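
To see what consistent date shifting means, here is a toy illustration in plain Python (not the library's API): every date in a record shifts by the same random offset, so the intervals between dates survive de-identification.

import random
from datetime import date, timedelta

# One offset per record, drawn once from the +/-365-day range
offset = timedelta(days=random.randint(-365, 365))

visit_1 = date(2021, 3, 1) + offset
visit_2 = date(2021, 3, 15) + offset

assert (visit_2 - visit_1).days == 14  # the 14-day interval is preserved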

Example Configurations:

Basic de-identification (India-specific):

config = DeidentificationConfig()  # Uses defaults

Multi-country de-identification:

config = DeidentificationConfig(
    countries=['US', 'IN', 'BR', 'ID'],
    enable_encryption=True
)

Testing/development (no encryption):

config = DeidentificationConfig(
    enable_encryption=False,
    log_level=logging.DEBUG
)

See De-identification for the complete de-identification guide.

Helper Tools

Added in version 0.3.0.

The configuration module provides helper functions for common tasks.

ensure_directories()

Automatically creates all required directories if they don’t exist.

from config import ensure_directories

# Create all necessary directories
ensure_directories()
What it creates:
  • RESULTS_DIR

  • CLEAN_DATASET_DIR

  • DICTIONARY_JSON_OUTPUT_DIR

When to use:
  • At the start of your pipeline

  • Before writing any output files

  • When setting up a new environment

validate_config()

Validates the configuration and returns a list of warnings.

from config import validate_config

warnings = validate_config()
if warnings:
    print("Configuration warnings:")
    for warning in warnings:
        print(f"  - {warning}")
else:
    print("Configuration is valid!")
What it checks:
  • DATA_DIR exists

  • DATASET_DIR exists

  • DICTIONARY_EXCEL_FILE exists

Returns:
  • Empty list if all valid

  • List of warning strings if issues found

When to use:
  • Before starting the pipeline

  • For debugging configuration issues

  • In automated testing
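
A typical startup sequence combines both helpers (a sketch; adapt to your own entry point):

from config import ensure_directories, validate_config

warnings = validate_config()
for warning in warnings:
    print(f"Warning: {warning}")

ensure_directories()  # safe to call even when the directories already exist

# ... run the pipeline ...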

normalize_dataset_name()

Normalizes a dataset folder name by removing common suffixes such as “_csv_files” and “_files”.

from config import normalize_dataset_name

name = normalize_dataset_name("Indo-vap_csv_files")
print(name)  # Output: "Indo-vap"
Parameters:
  • folder_name (Optional[str]): Dataset folder name

Returns:
  • Normalized name, or DEFAULT_DATASET_NAME if None

Examples:

normalize_dataset_name("study_csv_files")  # → "study"
normalize_dataset_name("test_files")       # → "test"
normalize_dataset_name(None)               # → "RePORTaLiN_sample"
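
The behavior amounts to stripping known suffixes. A minimal sketch (illustrative; the actual suffix list in config.py may differ):

def normalize_dataset_name(folder_name):
    if folder_name is None:
        return DEFAULT_DATASET_NAME  # "RePORTaLiN_sample"
    for suffix in ("_csv_files", "_files"):  # assumed suffix list
        if folder_name.endswith(suffix):
            return folder_name[: -len(suffix)]
    return folder_name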

Customizing Configuration

Example 1: Change Log Level

To see more detailed debug information:

# config.py
import logging

LOG_LEVEL = logging.DEBUG  # More verbose logging

Example 2: Custom Data Location

To use a different data directory:

# config.py
DATA_DIR = "/path/to/my/data"
RESULTS_DIR = "/path/to/my/results"

Example 3: Different Dictionary File

If your data dictionary has a different name:

# config.py
DICTIONARY_EXCEL_FILE = os.path.join(
    DATA_DIR,
    "data_dictionary_and_mapping_specifications",
    "MyCustomDictionary.xlsx"
)

Environment Variables

You can also use environment variables for configuration:

# config.py
import os

# Use environment variable with fallback
DATA_DIR = os.getenv("REPORTALIN_DATA_DIR", os.path.join(ROOT_DIR, "data"))

Then set the environment variable:

export REPORTALIN_DATA_DIR="/my/custom/data/path"
python main.py
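
The same pattern extends to other settings. For example, a hypothetical REPORTALIN_LOG_LEVEL variable could control verbosity:

# config.py (REPORTALIN_LOG_LEVEL is a hypothetical variable, shown for illustration)
import logging
import os

LOG_LEVEL = getattr(
    logging,
    os.getenv("REPORTALIN_LOG_LEVEL", "INFO").upper(),
    logging.INFO,  # fall back to INFO if the value is unrecognized
)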

Configuration Best Practices

  1. Don’t Hardcode Paths

    ❌ Bad:

    file_path = "/Users/john/data/file.xlsx"
    

    ✅ Good:

    file_path = os.path.join(config.DATA_DIR, "file.xlsx")
    
  2. Use Path Objects

    For more robust path handling:

    from pathlib import Path
    
    DATA_DIR = Path(ROOT_DIR) / "data"
    DATASET_DIR = DATA_DIR / "dataset" / dataset_name
    
  3. Keep Configuration Separate

    Don’t mix configuration with business logic:

    ❌ Bad: Hardcoding paths in processing functions

    ✅ Good: Use the configuration file

  4. Document Changes

    If you modify config.py, document why:

    # Changed to use external storage per project requirements
    DATA_DIR = "/mnt/shared/research_data"
    

Accessing Configuration

In Your Code

import config

# Access configuration variables
print(f"Dataset: {config.DATASET_NAME}")
print(f"Input dir: {config.DATASET_DIR}")
print(f"Output dir: {config.CLEAN_DATASET_DIR}")

From Command Line

# Print current configuration
python -c "import config; print(f'Dataset: {config.DATASET_NAME}')"
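
The validation helper can be run the same way:

# Check the configuration without starting the pipeline
python -c "import config; print(config.validate_config() or 'Configuration is valid')"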

Directory Structure

The configuration expects (and, for outputs, creates) this directory structure:

RePORTaLiN/
├── data/
│   ├── dataset/
│   │   └── <dataset_name>/          # Auto-detected
│   └── data_dictionary_and_mapping_specifications/
│       └── RePORT_DEB_to_Tables_mapping.xlsx
│
└── results/
    ├── dataset/
    │   └── <dataset_name>/          # Mirrors input structure
    └── data_dictionary_mappings/
        ├── Codelists/
        ├── tblENROL/
        └── ...

Troubleshooting Configuration

Problem: “Dataset not found”

Cause: No folder exists in data/dataset/

Solution: Create a dataset folder:

mkdir -p data/dataset/my_dataset
# Add Excel files to this directory

Problem: “Permission denied”

Cause: Output directories not writable

Solution: Check permissions:

chmod -R 755 results/
chmod 755 .logs/

Problem: “Config file not found”

Cause: main.py is being run from outside the project root

Solution: Change to the project root first:

cd /path/to/RePORTaLiN
python main.py

See Also