Quick Start

For Users: Simplified Execution Guide

This guide provides clear, step-by-step instructions to get you started with RePORTaLiN in just a few minutes. No technical expertise required!

What Does RePORTaLiN Do?

RePORTaLiN is a tool that:

  1. πŸ“Š Converts Excel files to JSON Lines (JSONL), a simple one-record-per-line JSON format

  2. πŸ” Organizes data dictionary information into structured tables

  3. πŸ”’ Protects sensitive patient information (optional de-identification)

  4. βœ… Validates data integrity and generates detailed logs

Think of it as an automated data processing assistant that handles tedious file conversions safely and efficiently.
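
Curious what the conversion looks like conceptually? Here is a minimal pandas sketch of the Excel-to-JSONL idea (illustrative only, not RePORTaLiN's actual implementation; the file names are hypothetical):

import pandas as pd

# Read one Excel sheet into a DataFrame (hypothetical file name)
df = pd.read_excel("data/dataset/my_dataset/file1.xlsx")

# Write JSON Lines: one JSON object per row, one record per line
df.to_json("file1.jsonl", orient="records", lines=True)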

Prerequisites

Before you begin, ensure you have:

βœ… Python 3.13 or higher installed

Check version: python3 --version

βœ… Project files downloaded or cloned to your computer

βœ… Excel data files in the data/dataset/ folder

βœ… 5-10 minutes for initial setup

Verify Configuration (Optional)

Before running the pipeline, validate your setup:

from config import validate_config

# validate_config() returns a list of warning strings; an empty list means the setup is valid
warnings = validate_config()
if warnings:
    for warning in warnings:
        print(warning)
else:
    print("Configuration is valid")

See Configuration for details.

Expected Output

You should see output similar to:

Processing sheets: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 14/14 [00:00<00:00, 122.71sheet/s]
SUCCESS: Excel processing complete!
SUCCESS: Step 0: Loading Data Dictionary completed successfully.
Found 43 Excel files to process...
Processing files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 43/43 [00:15<00:00, 2.87file/s]
SUCCESS: Step 1: Extracting Raw Data to JSONL completed successfully.
RePORTaLiN pipeline finished.

Understanding the Output

After the pipeline completes, you’ll find:

  1. Extracted Data in results/dataset/<dataset_name>/

    results/dataset/Indo-vap/
    β”œβ”€β”€ original/                 (All columns preserved)
    β”‚   β”œβ”€β”€ 10_TST.jsonl          (631 records)
    β”‚   β”œβ”€β”€ 11_IGRA.jsonl         (262 records)
    β”‚   β”œβ”€β”€ 12A_FUA.jsonl         (2,831 records)
    β”‚   └── ...                   (43 files total)
    └── cleaned/                  (Duplicate columns removed)
        β”œβ”€β”€ 10_TST.jsonl          (631 records)
        β”œβ”€β”€ 11_IGRA.jsonl         (262 records)
        β”œβ”€β”€ 12A_FUA.jsonl         (2,831 records)
        └── ...                   (43 files total)
    

    Note: Each extraction creates two versions in separate subdirectories:

    • original/ - All columns preserved as-is from Excel files

    • cleaned/ - Duplicate columns removed (e.g., SUBJID2, SUBJID3)

  2. Data Dictionary Mappings in results/data_dictionary_mappings/

    results/data_dictionary_mappings/
    β”œβ”€β”€ Codelists/
    β”‚   β”œβ”€β”€ Codelists_table_1.jsonl
    β”‚   └── Codelists_table_2.jsonl
    β”œβ”€β”€ tblENROL/
    β”‚   └── tblENROL_table.jsonl
    └── ...                       (14 sheets)
    
  3. De-identified Data (if --enable-deidentification is used) in results/deidentified/<dataset_name>/

    results/deidentified/Indo-vap/
    β”œβ”€β”€ original/                 (De-identified original files)
    β”‚   β”œβ”€β”€ 10_TST.jsonl
    β”‚   └── ...
    β”œβ”€β”€ cleaned/                  (De-identified cleaned files)
    β”‚   β”œβ”€β”€ 10_TST.jsonl
    β”‚   └── ...
    └── _deidentification_audit.json
    
  4. Execution Logs in .logs/

    .logs/
    └── reportalin_20251002_132124.log
    

Viewing the Results

JSONL files can be viewed in several ways:

Using head (command-line preview):

# View first few lines
head results/dataset/Indo-vap/original/10_TST.jsonl

Using Python:

import pandas as pd

# Read JSONL file
df = pd.read_json('results/dataset/Indo-vap/original/10_TST.jsonl', lines=True)
print(df.head())

Using jq (command-line JSON processor):

# Pretty-print first record
head -n 1 results/dataset/Indo-vap/original/10_TST.jsonl | jq

Command-Line Options

Skip Specific Steps

You can skip individual pipeline steps:

# Skip data dictionary loading
python main.py --skip-dictionary

# Skip data extraction
python main.py --skip-extraction

# Skip both (useful for testing)
python main.py --skip-dictionary --skip-extraction

View Help

python main.py --help

Using Make Commands

For convenience, you can use Make commands:

# Run the pipeline
make run

# Clean cache files
make clean

# Run tests (if available)
make test

Working with Different Datasets

RePORTaLiN automatically detects your dataset:

  1. Place your Excel files in data/dataset/<your_dataset_name>/

  2. Run python main.py

  3. Results appear in results/dataset/<your_dataset_name>/

Example:

# Your data structure
data/dataset/
└── my_research_data/
    β”œβ”€β”€ file1.xlsx
    β”œβ”€β”€ file2.xlsx
    └── ...

# Automatically creates
results/dataset/
└── my_research_data/
    β”œβ”€β”€ original/
    β”‚   β”œβ”€β”€ file1.jsonl
    β”‚   β”œβ”€β”€ file2.jsonl
    β”‚   └── ...
    └── cleaned/
        β”œβ”€β”€ file1.jsonl
        β”œβ”€β”€ file2.jsonl
        └── ...
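
To sanity-check an extraction, you can count the records (lines) in each output file. A small sketch, assuming the layout above (the dataset name is hypothetical):

from pathlib import Path

# Count lines (= records) in every JSONL file under the cleaned/ output
dataset_dir = Path("results/dataset/my_research_data/cleaned")
for path in sorted(dataset_dir.glob("*.jsonl")):
    with path.open() as f:
        n_records = sum(1 for _ in f)
    print(f"{path.name}: {n_records} records")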

Checking the Logs

Logs provide detailed information about the extraction process:

# View the latest log
ls -lt .logs/ | head -n 2
cat .logs/reportalin_20251002_132124.log

Logs include:

  • Timestamp for each operation

  • Files processed and record counts

  • Warnings and errors (if any)

  • Success confirmations
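
To pull just the warnings and errors out of the newest log, here is a short Python sketch (it assumes log lines contain the literal words WARNING or ERROR; adjust if your log format differs):

from pathlib import Path

# Pick the most recently modified log file in .logs/
latest = max(Path(".logs").glob("reportalin_*.log"), key=lambda p: p.stat().st_mtime)
print(f"Latest log: {latest}")

# Print only the lines that look like warnings or errors
with latest.open() as f:
    for line in f:
        if "WARNING" in line or "ERROR" in line:
            print(line.rstrip())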

Common First-Run Issues

Issue: β€œNo Excel files found”

Solution: Ensure your Excel files are in data/dataset/<folder_name>/

ls data/dataset/*/

β€”

Issue: β€œPermission denied” when creating logs

Solution: Ensure the .logs directory is writable:

chmod 755 .logs/

β€”

Issue: β€œPackage not found”

Solution: Ensure dependencies are installed:

pip install -r requirements.txt

Step-by-Step Execution

Step 1: Install Dependencies (One-time setup)

Open your terminal/command prompt and navigate to the RePORTaLiN project folder:

cd /path/to/RePORTaLiN

Install required Python packages:

pip install -r requirements.txt

You should see packages being installed (pandas, openpyxl, tqdm, etc.). This takes 1-2 minutes.

βœ… Expected Output: β€œSuccessfully installed pandas-2.0.0 openpyxl-3.1.0…” (versions may vary)

β€”

Step 2: Verify Your Data Files

Check that your Excel files are in the right location:

ls data/dataset/

βœ… Expected Output: You should see a folder (e.g., Indo-vap_csv_files/) containing .xlsx files

If you don’t see any folders, create one and place your Excel files there:

mkdir -p data/dataset/my_dataset/
cp /path/to/your/excel/files/*.xlsx data/dataset/my_dataset/

β€”

Step 3: Run the Basic Pipeline

Execute the main pipeline with this simple command:

python3 main.py

βœ… Expected Output: You’ll see two progress bars:

Processing sheets: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 14/14 [00:01<00:00, 12.71sheet/s]
SUCCESS: Step 0: Loading Data Dictionary completed successfully.

Found 43 Excel files to process...
Processing files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 43/43 [00:15<00:00, 2.87file/s]
SUCCESS: Step 1: Extracting Raw Data to JSONL completed successfully.

RePORTaLiN pipeline finished.

⏱️ Time: Usually 15-30 seconds depending on file size

β€”

Step 4: Check Your Results

Navigate to the results folder:

cd results/dataset/
ls

βœ… Expected Output: You’ll see a folder with your dataset name (e.g., Indo-vap/)

Look inside:

ls results/dataset/Indo-vap/

βœ… Expected Output:

original/    (Contains .jsonl files with all original columns)
cleaned/     (Contains .jsonl files with duplicate columns removed)

Each folder contains the same files but with different processing levels:

  • original/ = Exact Excel data, just converted to JSONL

  • cleaned/ = Duplicate columns (like SUBJID2, SUBJID3) removed for cleaner data
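
To see exactly which columns were dropped for a given file, compare the column sets of the two versions. A minimal pandas sketch:

import pandas as pd

# Load the same file from both output versions
orig = pd.read_json("results/dataset/Indo-vap/original/10_TST.jsonl", lines=True)
clean = pd.read_json("results/dataset/Indo-vap/cleaned/10_TST.jsonl", lines=True)

# Columns present in original/ but removed from cleaned/
dropped = set(orig.columns) - set(clean.columns)
print(f"Columns removed during cleaning: {sorted(dropped)}")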

β€”

Step 5: View Your Converted Data (Optional)

Open any .jsonl file to see the converted data:

head -n 5 results/dataset/Indo-vap/original/10_TST.jsonl

βœ… Expected Output: You’ll see JSON-formatted data, one record per line:

{"SUBJID": "INV001", "VISIT": 1, "TST_RESULT": "Positive", "source_file": "10_TST.xlsx"}
{"SUBJID": "INV002", "VISIT": 1, "TST_RESULT": "Negative", "source_file": "10_TST.xlsx"}
...
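
If you want to confirm that every line is well-formed JSON, a quick sketch:

import json

# Check that each line parses as a standalone JSON object
bad = 0
with open("results/dataset/Indo-vap/original/10_TST.jsonl") as f:
    for i, line in enumerate(f, start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError as e:
            print(f"Line {i} is not valid JSON: {e}")
            bad += 1
if bad == 0:
    print("All records parsed successfully")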

πŸŽ‰ Congratulations! Your data has been successfully converted!

Advanced Usage: De-identification

If you need to remove sensitive patient information (PHI/PII), use the de-identification feature:

Step 1: Run with De-identification Enabled

python3 main.py --enable-deidentification

βœ… Expected Output: Additional processing step for de-identification:

De-identifying dataset: results/dataset/Indo-vap -> results/deidentified/Indo-vap
Processing both 'original' and 'cleaned' subdirectories...
Countries: IN (default)
De-identifying files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 43/43 [00:25<00:00, 1.72file/s]

De-identification complete:
  Texts processed: 15,234
  Total detections: 1,250
  Countries: IN (default)
  Unique mappings: 485

⏱️ Time: Additional 20-40 seconds for de-identification

β€”

Step 2: Specify Countries (For multi-country studies)

python3 main.py --enable-deidentification --countries IN US ID BR

This applies privacy regulations for India, United States, Indonesia, and Brazil.

βœ… Supported Countries: US, IN, ID, BR, PH, ZA, EU, GB, CA, AU, KE, NG, GH, UG

β€”

Step 3: View De-identified Results

head -n 3 results/deidentified/Indo-vap/original/10_TST.jsonl

βœ… Expected Output: Sensitive data replaced with placeholders:

{"SUBJID": "[MRN-X7Y2A9]", "PATIENT_NAME": "[PATIENT-A4B8C3]", "DOB": "[DATE-1]", ...}
{"SUBJID": "[MRN-K2M5P1]", "PATIENT_NAME": "[PATIENT-D9F2G7]", "DOB": "[DATE-2]", ...}

Note: Original β†’ Pseudonym mappings are encrypted and stored securely in: results/deidentified/mappings/mappings.enc
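
To get a feel for what was replaced, you can tally the bracketed placeholder tokens in a de-identified file. A minimal sketch (the placeholder pattern is inferred from the sample output above and may not cover every token type):

import json
import re
from collections import Counter

# Placeholders look like [PATIENT-A4B8C3], [MRN-X7Y2A9], [DATE-1], ...
placeholder = re.compile(r"\[([A-Z]+)-[A-Za-z0-9]+\]")

counts = Counter()
with open("results/deidentified/Indo-vap/original/10_TST.jsonl") as f:
    for line in f:
        for value in json.loads(line).values():
            if isinstance(value, str):
                counts.update(m.group(1) for m in placeholder.finditer(value))

for kind, n in counts.most_common():
    print(f"{kind}: {n} placeholders")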

Troubleshooting

Problem: β€œNo Excel files found”

Solution: Check that your Excel files (.xlsx) are in data/dataset/<folder_name>/

ls data/dataset/*/

β€”

Problem: β€œPackage β€˜pandas’ not found”

Solution: Install dependencies:

pip install -r requirements.txt

β€”

Problem: β€œPermission denied” when accessing files

Solution: Run with appropriate permissions:

# On macOS/Linux
chmod +x main.py
python3 main.py

# On Windows (run as Administrator)
python main.py

β€”

Problem: Files are being skipped

Solution: This is normal! The pipeline skips files that were already processed successfully. To reprocess, delete the output folder:

rm -rf results/dataset/my_dataset/
python3 main.py

β€”

Problem: β€œValidation found potential PHI” warning after de-identification

Solution: This warning is deliberately cautious. Review the log file for details:

cat .logs/reportalin_*.log | grep "potential PHI"

If it’s a false positive (like β€œ[MRN-ABC123]” being detected), you can safely proceed.

Common Use Cases

Use Case 1: Process only data dictionary, skip extraction

python3 main.py --skip-extraction

β€”

Use Case 2: Process only data extraction, skip dictionary

python3 main.py --skip-dictionary

β€”

Use Case 3: Reprocess everything from scratch

rm -rf results/
python3 main.py

β€”

Use Case 4: De-identify for multiple countries without encryption (testing only)

python3 main.py --enable-deidentification --countries ALL --no-encryption

⚠️ Warning: --no-encryption should only be used for testing! Always use encryption in production.

β€”

Next Steps

βœ… You’re done! Your data has been successfully processed.

What’s next?

  1. πŸ“Š Analyze your data: Use the .jsonl files with pandas, jq, or any JSON tool (see the sketch after this list)

  2. πŸ“– Read the full documentation: Learn about advanced configuration options

  3. πŸ”’ Review de-identification: Check the audit log at results/deidentified/<dataset_name>/_deidentification_audit.json

  4. πŸ“ Check logs: Detailed operation logs are in .logs/reportalin_<timestamp>.log

Need help? See the Troubleshooting guide or review the logs for detailed error messages.