Usage Guide
For Users: Working with RePORTaLiN
This guide shows you different ways to use RePORTaLiN for your daily tasks.
Basic Usage
Run Complete Pipeline
python main.py
This executes both pipeline steps:
1. Data dictionary loading
2. Data extraction
Skip Specific Steps
# Skip dictionary loading (useful if already processed)
python main.py --skip-dictionary
# Skip data extraction
python main.py --skip-extraction
# Run neither (useful for testing configuration)
python main.py --skip-dictionary --skip-extraction
Detailed Logging Mode
Want to see exactly what's happening during processing? Use verbose mode to get detailed tree-view logs for each step:
# Enable detailed logging
python main.py --verbose
# Short form
python main.py -v
# With privacy protection enabled
python main.py -v --enable-deidentification --countries IN US
Verbose Mode Output: When running with --verbose, you get detailed step-by-step logs in the format:
DEBUG: ├── Processing: Data dictionary loading (N sheets)
DEBUG: ├── Total sheets: N
DEBUG: ├── Sheet 1/N: 'sheet_name'
DEBUG: │   ├── Tables detected: 2
DEBUG: │   └── Table 1 (sheet_table_1)
DEBUG: │       ├── Rows: 100
DEBUG: │       ├── Columns: 25
DEBUG: │       ├── Saved to: /path/to/output.jsonl
DEBUG: │       └── ⏱ Table processing time: 0.23s
DEBUG: └── ⏱ Overall processing time: 2.45s
What you'll see in the log file:
File-level details: Which files are being processed and their locations
Sheet/Table details: How many sheets/tables found, rows/columns per table
Processing phases: Load, save, clean, validate operations
Metrics: Counts, totals, and statistics for each operation
Timing information: Duration for each step and overall processing time
Error context: Detailed error messages with processing state
Progress updates: Real-time updates for large datasets
Where to find logs: Look in the .logs/ folder for files named reportalin_YYYYMMDD_HHMMSS.log
Impact on speed: Minimal; verbose output is written to the log file rather than the console, so overall processing time is essentially unchanged
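If you want to open the newest log without hunting through the folder, note that the timestamped file names sort chronologically. A minimal sketch, assuming the reportalin_YYYYMMDD_HHMMSS.log naming shown above:

from pathlib import Path

# Find the most recent RePORTaLiN log file in .logs/
# (timestamped names sort lexicographically = chronologically)
logs = sorted(Path(".logs").glob("reportalin_*.log"))
if logs:
    print(f"Latest log: {logs[-1]}")
else:
    print("No log files found in .logs/")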
Log File Structure: The log file captures all verbose output with timestamps:
2024-10-23 14:30:45,123 - reportalin - DEBUG - ├── Processing: Data dictionary loading (43 sheets)
2024-10-23 14:30:45,125 - reportalin - DEBUG - ├── Total sheets: 43
2024-10-23 14:30:45,200 - reportalin - DEBUG - ├── Sheet 1/43: 'TST Screening'
2024-10-23 14:30:45,250 - reportalin - DEBUG - │   ├── Tables detected: 1
2024-10-23 14:30:45,260 - reportalin - DEBUG - │   └── Table 1 (TST Screening_table)
2024-10-23 14:30:45,270 - reportalin - DEBUG - │       ├── Rows: 50
2024-10-23 14:30:45,271 - reportalin - DEBUG - │       ├── Columns: 12
2024-10-23 14:30:45,280 - reportalin - DEBUG - │       └── ⏱ Table processing time: 0.08s
Note
Console output remains unchanged (only SUCCESS/ERROR messages shown). All verbose output goes to the log file for detailed analysis without cluttering the terminal.
Working with Multiple Datasets
RePORTaLiN can process different datasets by simply changing the data directory:
Scenario 1: Sequential Processing
Process multiple datasets one at a time:
# Process first dataset
# Ensure data/dataset/ contains only Indo-vap_csv_files
python main.py
# Move results to backup
mv results/dataset/Indo-vap results/dataset/Indo-vap_backup
# Process second dataset
# Replace data/dataset/ contents with new dataset
python main.py
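If you repeat these steps often, a small driver script can automate them. This is a hypothetical sketch: the staging directory, dataset names, and backup convention are placeholders to adapt to your layout.

import shutil
import subprocess
from pathlib import Path

# Hypothetical driver for sequential processing: stage each dataset
# into data/dataset/, run the pipeline, then set the results aside.
# "staging/" and the dataset names below are placeholders.
datasets = ["Indo-vap_csv_files", "second_dataset"]

for name in datasets:
    data_dir = Path("data/dataset")
    shutil.rmtree(data_dir, ignore_errors=True)
    shutil.copytree(Path("staging") / name, data_dir / name)

    subprocess.run(["python", "main.py"], check=True)

    # Back up results before the next run (the results folder name may
    # differ from the staging name, e.g. Indo-vap for Indo-vap_csv_files)
    for result in Path("results/dataset").glob("*"):
        if not result.name.endswith("_backup"):
            shutil.move(str(result), str(result) + "_backup")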
Scenario 2: Parallel Processing
Use separate project directories for parallel processing:
# Terminal 1
cd /path/to/RePORTaLiN_project1
python main.py
# Terminal 2
cd /path/to/RePORTaLiN_project2
python main.py
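The same idea works from a single Python script, if you prefer to launch both runs at once (the directory paths are the placeholders from the example above):

import subprocess

# Launch the pipeline in two independent project checkouts concurrently
projects = ["/path/to/RePORTaLiN_project1", "/path/to/RePORTaLiN_project2"]
procs = [subprocess.Popen(["python", "main.py"], cwd=p) for p in projects]

# Wait for both runs and report their exit codes
for proj, proc in zip(projects, procs):
    print(f"{proj}: exit code {proc.wait()}")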
De-identification Workflows
Running De-identification
Enable de-identification in the main pipeline:
# Basic de-identification (uses default: India)
python main.py --enable-deidentification
# Specify countries
python main.py --enable-deidentification --countries IN US ID
# Use all supported countries
python main.py --enable-deidentification --countries ALL
# Disable encryption (testing only - NOT recommended)
python main.py --enable-deidentification --no-encryption
Country-Specific De-identification
The system supports 14 countries with specific privacy regulations:
# India (default)
python main.py --enable-deidentification --countries IN
# Multiple countries (for international studies)
python main.py --enable-deidentification --countries IN US ID BR
# All countries (detects identifiers from all 14 supported countries)
python main.py --enable-deidentification --countries ALL
Supported countries: US, EU, GB, CA, AU, IN, ID, BR, PH, ZA, KE, NG, GH, UG
For detailed information, see Country-Specific Privacy Rules.
De-identification Output Structure
The de-identified data maintains the same directory structure:
results/deidentified/Indo-vap/
├── original/
│   ├── 10_TST.jsonl               # De-identified original files
│   ├── 11_IGRA.jsonl
│   └── ...
├── cleaned/
│   ├── 10_TST.jsonl               # De-identified cleaned files
│   ├── 11_IGRA.jsonl
│   └── ...
└── _deidentification_audit.json   # Audit log

results/deidentified/mappings/
└── mappings.enc                   # Encrypted mapping table
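To inspect the audit log programmatically, you can load it as ordinary JSON. Its exact schema is not documented in this section, so this sketch simply loads and previews it:

import json
from pathlib import Path

# Load the de-identification audit log and preview its contents
audit_path = Path("results/deidentified/Indo-vap/_deidentification_audit.json")
with audit_path.open() as f:
    audit = json.load(f)
print(json.dumps(audit, indent=2)[:1000])  # preview the first part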
Standalone De-identification
You can also run de-identification separately:
# De-identify existing dataset
python -m scripts.deidentify \
--input-dir results/dataset/Indo-vap \
--output-dir results/deidentified/Indo-vap \
--countries IN US
# List supported countries
python -m scripts.deidentify --list-countries
# Validate de-identified output
python -m scripts.deidentify \
--input-dir results/dataset/Indo-vap \
--output-dir results/deidentified/Indo-vap \
--validate
Working with De-identified Data
import pandas as pd
# Read de-identified file
df = pd.read_json('results/deidentified/Indo-vap/cleaned/10_TST.jsonl', lines=True)
# PHI/PII has been replaced with pseudonyms
print(df.head())
# Shows: [PATIENT-X7Y2], [SSN-A4B8], [DATE-1], etc.
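To gauge how much of a file was pseudonymized, you can count the bracketed placeholders. A minimal sketch, assuming pseudonyms follow the [TYPE-CODE] pattern shown above:

import pandas as pd

# Count bracketed pseudonyms such as [PATIENT-X7Y2] or [DATE-1]
# across all string columns (pattern follows the examples above)
df = pd.read_json('results/deidentified/Indo-vap/cleaned/10_TST.jsonl', lines=True)
pattern = r"\[[A-Z]+-[A-Za-z0-9]+\]"

counts = {}
for col in df.select_dtypes(include="object"):
    n = df[col].astype(str).str.count(pattern).sum()
    if n:
        counts[col] = int(n)
print(counts)  # column -> pseudonym count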
For complete de-identification documentation, see De-identification.
Understanding Progress Output
Progress Bars and Status Messages
RePORTaLiN provides real-time feedback during processing using progress bars:
Processing Files: 100%|████████████████| 43/43 [00:15<00:00, 2.87files/s]
✓ Processing 10_TST.xlsx: 1,234 rows
✓ Processing 11_IGRA.xlsx: 2,456 rows
...
Summary:
--------
Successfully processed: 43 files
Total records: 50,123
Time elapsed: 15.2 seconds
Key Features:
tqdm progress bars: Show percentage, speed, and time remaining
Clean output: Status messages use tqdm.write() to avoid interfering with progress bars
Real-time updates: Instant feedback on current operation
Summary statistics: Final counts and timing information
Modules with Progress Tracking:
Data Dictionary Loading (load_dictionary.py):
  Progress bar for processing sheets
  Status messages for each table extracted
  Summary of tables created

Data Extraction (extract_data.py):
  Progress bar for files being processed
  Per-file row counts
  Final summary with totals

De-identification (deidentify.py):
  Progress bar for batch processing
  Detection statistics per file
  Final summary with replacement counts
Note: Progress bars require the tqdm library, which is installed automatically with pip install -r requirements.txt.
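The clean-output pattern described above is easy to reproduce in your own scripts. A minimal sketch of the same tqdm / tqdm.write() combination (not the project's actual code):

from tqdm import tqdm
import time

# tqdm draws the progress bar; tqdm.write() prints status lines
# without corrupting it
files = ["10_TST.xlsx", "11_IGRA.xlsx", "12_CXR.xlsx"]
for name in tqdm(files, desc="Processing Files", unit="files"):
    time.sleep(0.1)  # stand-in for real per-file work
    tqdm.write(f"✓ Processing {name}: done")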
Verbose Logging Details
Understanding Verbose Output
Each pipeline step (Dictionary Loading, Data Extraction, De-identification) produces detailed tree-view logs when --verbose is enabled.
Step 0: Data Dictionary Loading
Verbose output shows the processing of Excel sheets and detected tables:
├── Processing: dictionary_file.xlsx (43 sheets)
│   ├── Total sheets: 43
│   ├── Sheet 1/43: 'TST Screening v1.0'
│   │   ├── Loading Excel file
│   │   │   ├── Rows: 45
│   │   │   └── Columns: 15
│   │   ├── Tables detected: 1
│   │   ├── Table 1 (TST_Screening_table)
│   │   │   ├── Rows: 44
│   │   │   ├── Columns: 15
│   │   │   ├── Saved to: /path/to/TST_Screening/TST_Screening_table.jsonl
│   │   │   └── ⏱ Table processing time: 0.15s
│   │   └── ⏱ Sheet processing time: 0.18s
│   └── ⏱ Overall processing time: 45.23s
What this tells you:
- Total sheets and which sheet is being processed
- Number of tables found in each sheet
- Rows/columns for each table
- Time taken for each table and sheet
- Output file locations
Step 1: Data Extraction
Verbose output shows file processing with duplicate column removal:
├── Processing: Data extraction (65 files)
│   ├── Total files to process: 65
│   ├── File 1/65: 10_TST.xlsx
│   │   ├── Loading Excel file
│   │   │   ├── Rows: 412
│   │   │   └── Columns: 28
│   │   ├── Saving original version
│   │   │   └── Created: 10_TST.jsonl (412 records)
│   │   ├── Cleaning duplicate columns
│   │   │   ├── Marking SUBJID2 for removal (duplicate of SUBJID)
│   │   │   ├── Keeping NAME_ALT (different from NAME)
│   │   │   └── Removed 3 duplicate columns: SUBJID2, AGE_2, PHONE_3
│   │   └── Saving cleaned version
│   │       ├── Created: 10_TST.jsonl (412 records)
│   │       └── ⏱ Total processing time: 0.45s
│   ├── File 2/65: 11_IGRA.xlsx
│   │   ...
│   └── ⏱ Overall extraction time: 32.15s
What this tells you:
- Which file is being processed and current progress
- Rows/columns in the source file
- Duplicate column detection and removal
- Records created for original and cleaned versions
- Processing time per file
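For intuition, duplicate-column cleaning of this kind boils down to comparing each column's values against the columns already kept. The following is an illustrative sketch of the idea, not the project's actual implementation:

import pandas as pd

# Drop a column only when its values are identical to a column
# already kept (e.g. SUBJID2 duplicating SUBJID); columns with
# different contents, like NAME_ALT vs NAME, are preserved
def drop_duplicate_columns(df: pd.DataFrame) -> pd.DataFrame:
    kept = []
    for col in df.columns:
        if any(df[col].equals(df[k]) for k in kept):
            print(f"Marking {col} for removal (duplicate of earlier column)")
        else:
            kept.append(col)
    return df[kept]

df = pd.DataFrame({"SUBJID": [1, 2], "SUBJID2": [1, 2], "NAME_ALT": ["a", "b"]})
print(drop_duplicate_columns(df).columns.tolist())  # ['SUBJID', 'NAME_ALT']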
Step 2: De-identification (when --enable-deidentification is used)
Verbose output shows de-identification and validation details:
├── Processing: De-identification (65 files)
│   ├── Total files to process: 65
│   ├── File 1/65: results/dataset/original/10_TST.jsonl
│   │   ├── Reading and de-identifying records
│   │   │   ├── Processed 100 records...
│   │   │   ├── Processed 200 records...
│   │   │   └── Processed 412 records
│   │   ├── Records processed: 412
│   │   └── ⏱ File processing time: 0.89s
│   ├── File 2/65: results/dataset/original/11_IGRA.jsonl
│   │   ...
│   └── ⏱ Overall de-identification time: 78.34s
│
├── Processing: Dataset validation (65 files)
│   ├── Total files to validate: 65
│   ├── File 1/65: 10_TST.jsonl
│   │   ├── Records validated: 412
│   │   └── ⏱ File validation time: 0.12s
│   └── ⏱ Overall validation time: 8.45s
What this tells you:
- De-identification progress with record counts
- Validation results per file
- Processing and validation times
- Any PHI/PII issues detected during validation
Analyzing Log Files
When processing completes, analyze the log file in .logs/:
# View the latest log file
tail -f .logs/reportalin_*.log
# Search for specific errors
grep "ERROR" .logs/reportalin_*.log
# Count operations
grep "β Complete" .logs/reportalin_*.log | wc -l
# Extract timing information
grep "β±" .logs/reportalin_*.log
Performance Tuning: Use verbose logs to identify bottlenecks:
If table processing is slow: Check for large tables or memory issues
If file extraction is slow: Check for duplicate column detection overhead
If de-identification is slow: Check for slow pattern matching or encryption
See Also
For additional information:
Quick Start: Quick start guide
Configuration: Configuration options
De-identification: Complete de-identification guide
Country-Specific Privacy Rules: Country-specific privacy regulations
Troubleshooting: Common issues and solutions