Usage Guide

For Users: Working with RePORTaLiN

This guide shows you different ways to use RePORTaLiN for your daily tasks.

Basic Usage

Run Complete Pipeline

python main.py

This executes both pipeline steps:

  1. Data dictionary loading

  2. Data extraction

Skip Specific Steps

# Skip dictionary loading (useful if already processed)
python main.py --skip-dictionary

# Skip data extraction
python main.py --skip-extraction

# Run neither (useful for testing configuration)
python main.py --skip-dictionary --skip-extraction

Detailed Logging Mode

Want to see exactly what’s happening during processing? Use verbose mode to get detailed tree-view logs for each step:

# Enable detailed logging
python main.py --verbose

# Short form
python main.py -v

# With privacy protection enabled
python main.py -v --enable-deidentification --countries IN US

Verbose Mode Output: When running with --verbose, you get detailed step-by-step logs in the format:

DEBUG: ├─ Processing: Data dictionary loading (N sheets)
DEBUG:   ├─ Total sheets: N
DEBUG:   ├─ Sheet 1/N: 'sheet_name'
DEBUG:   │  ├─ Tables detected: 2
DEBUG:   │  ├─ Table 1 (sheet_table_1)
DEBUG:   │  │  ├─ Rows: 100
DEBUG:   │  │  ├─ Columns: 25
DEBUG:   │  │  ├─ Saved to: /path/to/output.jsonl
DEBUG:   │  │  └─ ⏱ Table processing time: 0.23s
DEBUG:   └─ ⏱ Overall processing time: 2.45s

What you’ll see in the log file:

  • File-level details: Which files are being processed and their locations

  • Sheet/Table details: How many sheets/tables found, rows/columns per table

  • Processing phases: Load, save, clean, validate operations

  • Metrics: Counts, totals, and statistics for each operation

  • Timing information: Duration for each step and overall processing time

  • Error context: Detailed error messages with processing state

  • Progress updates: Real-time updates for large datasets

Where to find logs: Look in the .logs/ folder for files named reportalin_YYYYMMDD_HHMMSS.log
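
If you prefer to locate the most recent log programmatically, here is a minimal Python sketch; it assumes only the timestamped naming convention above, under which lexicographic order matches chronological order:

from pathlib import Path

# Timestamped filenames (reportalin_YYYYMMDD_HHMMSS.log) sort
# chronologically, so the last entry is the most recent run.
logs = sorted(Path(".logs").glob("reportalin_*.log"))
if logs:
    print("Latest log:", logs[-1])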

Impact on speed: Minimal; verbose output goes to the log file, so processing stays just as fast.

Log File Structure: The log file captures all verbose output with timestamps:

2024-10-23 14:30:45,123 - reportalin - DEBUG - ├─ Processing: Data dictionary loading (43 sheets)
2024-10-23 14:30:45,125 - reportalin - DEBUG -   ├─ Total sheets: 43
2024-10-23 14:30:45,200 - reportalin - DEBUG -   ├─ Sheet 1/43: 'TST Screening'
2024-10-23 14:30:45,250 - reportalin - DEBUG -   │  ├─ Tables detected: 1
2024-10-23 14:30:45,260 - reportalin - DEBUG -   │  ├─ Table 1 (TST Screening_table)
2024-10-23 14:30:45,270 - reportalin - DEBUG -   │  │  ├─ Rows: 50
2024-10-23 14:30:45,271 - reportalin - DEBUG -   │  │  ├─ Columns: 12
2024-10-23 14:30:45,280 - reportalin - DEBUG -   │  │  └─ ⏱ Table processing time: 0.08s

Note

Console output remains unchanged (only SUCCESS/ERROR messages shown). All verbose output goes to the log file for detailed analysis without cluttering the terminal.

Working with Multiple Datasets

RePORTaLiN can process different datasets by simply changing the data directory:

Scenario 1: Sequential Processing

Process multiple datasets one at a time:

# Process first dataset
# Ensure data/dataset/ contains only Indo-vap_csv_files
python main.py

# Move results to backup
mv results/dataset/Indo-vap results/dataset/Indo-vap_backup

# Process second dataset
# Replace data/dataset/ contents with new dataset
python main.py
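
To automate sequential runs, a sketch along these lines can help. The staging/ directory and its dataset subdirectories are hypothetical, and it assumes results land under results/dataset/<dataset name> as in the example above:

import shutil
import subprocess
from pathlib import Path

# Hypothetical staging area with one subdirectory per dataset
STAGING = Path("staging")
DATA_DIR = Path("data/dataset")

for dataset in sorted(p for p in STAGING.iterdir() if p.is_dir()):
    # Replace data/dataset/ contents with the next dataset
    if DATA_DIR.exists():
        shutil.rmtree(DATA_DIR)
    shutil.copytree(dataset, DATA_DIR)

    # Run the full pipeline for this dataset
    subprocess.run(["python", "main.py"], check=True)

    # Move results aside so the next run starts clean
    results = Path("results/dataset") / dataset.name
    if results.exists():
        shutil.move(str(results), str(results) + "_backup")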

Scenario 2: Parallel Processing

Use separate project directories for parallel processing:

# Terminal 1
cd /path/to/RePORTaLiN_project1
python main.py

# Terminal 2
cd /path/to/RePORTaLiN_project2
python main.py

De-identification Workflows

Running De-identification

Enable de-identification in the main pipeline:

# Basic de-identification (uses default: India)
python main.py --enable-deidentification

# Specify countries
python main.py --enable-deidentification --countries IN US ID

# Use all supported countries
python main.py --enable-deidentification --countries ALL

# Disable encryption (testing only - NOT recommended)
python main.py --enable-deidentification --no-encryption

Country-Specific De-identification

The system supports 14 countries with specific privacy regulations:

# India (default)
python main.py --enable-deidentification --countries IN

# Multiple countries (for international studies)
python main.py --enable-deidentification --countries IN US ID BR

# All countries (detects identifiers from all 14 supported countries)
python main.py --enable-deidentification --countries ALL

Supported countries: US, EU, GB, CA, AU, IN, ID, BR, PH, ZA, KE, NG, GH, UG

For detailed information, see Country-Specific Privacy Rules.

De-identification Output Structure

The de-identified data maintains the same directory structure:

results/deidentified/Indo-vap/
├── original/
│   ├── 10_TST.jsonl          # De-identified original files
│   ├── 11_IGRA.jsonl
│   └── ...
├── cleaned/
│   ├── 10_TST.jsonl          # De-identified cleaned files
│   ├── 11_IGRA.jsonl
│   └── ...
└── _deidentification_audit.json  # Audit log

results/deidentified/mappings/
└── mappings.enc                   # Encrypted mapping table
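
To get a quick look at the audit log, a small sketch like the following works; the audit schema is project-specific, so this only loads the JSON and prints its top-level structure:

import json

# Load the audit log written alongside the de-identified output
with open("results/deidentified/Indo-vap/_deidentification_audit.json") as f:
    audit = json.load(f)

# Print top-level keys (the exact schema is project-specific)
print(list(audit) if isinstance(audit, dict) else type(audit).__name__)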

Standalone De-identification

You can also run de-identification separately:

# De-identify existing dataset
python -m scripts.deidentify \
    --input-dir results/dataset/Indo-vap \
    --output-dir results/deidentified/Indo-vap \
    --countries IN US

# List supported countries
python -m scripts.deidentify --list-countries

# Validate de-identified output
python -m scripts.deidentify \
    --input-dir results/dataset/Indo-vap \
    --output-dir results/deidentified/Indo-vap \
    --validate

Working with De-identified Data

import pandas as pd

# Read de-identified file
df = pd.read_json('results/deidentified/Indo-vap/cleaned/10_TST.jsonl', lines=True)

# PHI/PII has been replaced with pseudonyms
print(df.head())
# Shows: [PATIENT-X7Y2], [SSN-A4B8], [DATE-1], etc.
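
To spot-check that pseudonyms are present, you can count bracketed tokens. This sketch assumes only the token format shown above (e.g. [PATIENT-X7Y2]):

import pandas as pd

df = pd.read_json('results/deidentified/Indo-vap/cleaned/10_TST.jsonl', lines=True)

# Count bracketed pseudonym tokens such as [PATIENT-X7Y2] or [DATE-1]
pattern = r"\[[A-Z]+-[A-Za-z0-9]+\]"
counts = {col: int(df[col].astype(str).str.count(pattern).sum()) for col in df.columns}
print({col: n for col, n in counts.items() if n > 0})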

For complete de-identification documentation, see De-identification.

Understanding Progress Output

Progress Bars and Status Messages

RePORTaLiN provides real-time feedback during processing using progress bars:

Processing Files: 100%|████████████████| 43/43 [00:15<00:00,  2.87files/s]
✓ Processing 10_TST.xlsx: 1,234 rows
✓ Processing 11_IGRA.xlsx: 2,456 rows
...

Summary:
--------
Successfully processed: 43 files
Total records: 50,123
Time elapsed: 15.2 seconds

Key Features:

  • tqdm progress bars: Show percentage, speed, and time remaining

  • Clean output: Status messages use tqdm.write() to avoid interfering with progress bars (see the sketch after this list)

  • Real-time updates: Instant feedback on current operation

  • Summary statistics: Final counts and timing information
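
To illustrate the tqdm.write() pattern mentioned above, here is a minimal, self-contained sketch (the file names and timing are placeholders, not RePORTaLiN's actual code):

import time
from tqdm import tqdm

# Status messages routed through tqdm.write() appear above the bar
# instead of breaking it mid-line.
for name in tqdm(["10_TST.xlsx", "11_IGRA.xlsx"], desc="Processing Files"):
    time.sleep(0.2)  # stand-in for real work
    tqdm.write(f"✓ Processing {name}: done")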

Modules with Progress Tracking:

  1. Data Dictionary Loading (load_dictionary.py):

    • Progress bar for processing sheets

    • Status messages for each table extracted

    • Summary of tables created

  2. Data Extraction (extract_data.py):

    • Progress bar for files being processed

    • Per-file row counts

    • Final summary with totals

  3. De-identification (deidentify.py):

    • Progress bar for batch processing

    • Detection statistics per file

    • Final summary with replacement counts

Note: Progress bars require the tqdm library, which is installed automatically with pip install -r requirements.txt.

Verbose Logging Details

Understanding Verbose Output

Each pipeline step (Dictionary Loading, Data Extraction, De-identification) produces detailed tree-view logs when --verbose is enabled.

Step 0: Data Dictionary Loading

Verbose output shows the processing of Excel sheets and detected tables:

├─ Processing: dictionary_file.xlsx (43 sheets)
│  ├─ Total sheets: 43
│  ├─ Sheet 1/43: 'TST Screening v1.0'
│  │  ├─ Loading Excel file
│  │  │  ├─ Rows: 45
│  │  │  ├─ Columns: 15
│  │  ├─ Tables detected: 1
│  │  ├─ Table 1 (TST_Screening_table)
│  │  │  ├─ Rows: 44
│  │  │  ├─ Columns: 15
│  │  │  ├─ Saved to: /path/to/TST_Screening/TST_Screening_table.jsonl
│  │  │  └─ ⏱ Table processing time: 0.15s
│  │  └─ ⏱ Sheet processing time: 0.18s
│  └─ ⏱ Overall processing time: 45.23s

What this tells you:

  • Total sheets and which sheet is being processed

  • Number of tables found in each sheet

  • Rows/columns for each table

  • Time taken for each table and sheet

  • Output file locations

Step 1: Data Extraction

Verbose output shows file processing with duplicate column removal:

├─ Processing: Data extraction (65 files)
│  ├─ Total files to process: 65
│  ├─ File 1/65: 10_TST.xlsx
│  │  ├─ Loading Excel file
│  │  │  ├─ Rows: 412
│  │  │  ├─ Columns: 28
│  │  ├─ Saving original version
│  │  │  ├─ Created: 10_TST.jsonl (412 records)
│  │  ├─ Cleaning duplicate columns
│  │  │  ├─ Marking SUBJID2 for removal (duplicate of SUBJID)
│  │  │  ├─ Keeping NAME_ALT (different from NAME)
│  │  │  ├─ Removed 3 duplicate columns: SUBJID2, AGE_2, PHONE_3
│  │  ├─ Saving cleaned version
│  │  │  ├─ Created: 10_TST.jsonl (412 records)
│  │  │  └─ ⏱ Total processing time: 0.45s
│  ├─ File 2/65: 11_IGRA.xlsx
│  │  ...
│  └─ ⏱ Overall extraction time: 32.15s

What this tells you:

  • Which file is being processed and current progress

  • Rows/columns in the source file

  • Duplicate column detection and removal (a conceptual sketch follows this list)

  • Records created for original and cleaned versions

  • Processing time per file
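
The duplicate-column messages above reflect content-based comparison. The following is a conceptual sketch of that idea, not RePORTaLiN's actual implementation:

import pandas as pd

def find_duplicate_columns(df: pd.DataFrame) -> list:
    """Flag a column as a duplicate when its values match an earlier column's."""
    duplicates, kept = [], []
    for col in df.columns:
        if any(df[col].equals(df[k]) for k in kept):
            duplicates.append(col)  # e.g. SUBJID2 duplicating SUBJID
        else:
            kept.append(col)        # e.g. NAME_ALT differing from NAME
    return duplicates

df = pd.DataFrame({"SUBJID": [1, 2], "SUBJID2": [1, 2], "NAME_ALT": ["a", "b"]})
print(find_duplicate_columns(df))  # ['SUBJID2']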

Step 2: De-identification (when --enable-deidentification is used)

Verbose output shows de-identification and validation details:

├─ Processing: De-identification (65 files)
│  ├─ Total files to process: 65
│  ├─ File 1/65: results/dataset/original/10_TST.jsonl
│  │  ├─ Reading and de-identifying records
│  │  │  ├─ Processed 100 records...
│  │  │  ├─ Processed 200 records...
│  │  │  ├─ Processed 412 records
│  │  ├─ Records processed: 412
│  │  └─ ⏱ File processing time: 0.89s
│  ├─ File 2/65: results/dataset/original/11_IGRA.jsonl
│  │  ...
│  └─ ⏱ Overall de-identification time: 78.34s
│
├─ Processing: Dataset validation (65 files)
│  ├─ Total files to validate: 65
│  ├─ File 1/65: 10_TST.jsonl
│  │  ├─ Records validated: 412
│  │  └─ ⏱ File validation time: 0.12s
│  └─ ⏱ Overall validation time: 8.45s

What this tells you:

  • De-identification progress with record counts

  • Validation results per file

  • Processing and validation times

  • Any PHI/PII issues detected during validation

Analyzing Log Files

When processing completes, analyze the log file in .logs/:

# Follow the most recent log file
tail -f "$(ls -t .logs/reportalin_*.log | head -1)"

# Search for specific errors
grep "ERROR" .logs/reportalin_*.log

# Count operations
grep "βœ“ Complete" .logs/reportalin_*.log | wc -l

# Extract timing information
grep "⏱" .logs/reportalin_*.log

Performance Tuning: Use verbose logs to identify bottlenecks (the sketch after this list shows one way to rank the slowest steps):

  • If table processing is slow: Check for large tables or memory issues

  • If file extraction is slow: Check for duplicate column detection overhead

  • If de-identification is slow: Check for slow pattern matching or encryption
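
A minimal sketch for ranking the slowest operations, assuming only the "⏱ ... time: X.XXs" line format shown in the examples above:

import re
from pathlib import Path

# Extract "⏱ <label> time: <seconds>s" entries from the latest log
pattern = re.compile(r"⏱ (.+?time): ([0-9.]+)s")
logs = sorted(Path(".logs").glob("reportalin_*.log"))
if logs:
    timings = pattern.findall(logs[-1].read_text())
    # Print the ten longest-running operations
    for label, secs in sorted(timings, key=lambda t: float(t[1]), reverse=True)[:10]:
        print(f"{float(secs):8.2f}s  {label}")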

See Also

For additional information: