Troubleshooting
For Users: Solving Common Problems
This guide helps you fix common issues you might run into while using RePORTaLiN.
Installation Issues
Missing Package Errors
Problem: An error such as ModuleNotFoundError: No module named 'pandas'
Solution 1: Install dependencies
pip install -r requirements.txt
Solution 2: Verify virtual environment
# Check if virtual environment is activated
which python
# Should show path to .venv/bin/python
# If not, activate:
source .venv/bin/activate
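The same check can be made from inside Python, which helps when `which` is unavailable (for example on Windows). This is a minimal sketch using only the standard library; the helper name `in_virtual_environment` is illustrative and not part of RePORTaLiN.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Virtual environment active:", in_virtual_environment())
```

If the second line prints False, activate the environment and run the check again.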
Solution 3: Reinstall dependencies
pip install --force-reinstall -r requirements.txt
Python Version Issues
Problem: SyntaxError or version compatibility errors
Solution: Ensure Python 3.13+ is installed
python --version
# Should show Python 3.13.x or higher
If you have multiple Python versions:
python3.13 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
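To confirm which interpreter a script is actually running under, the version can also be checked programmatically. The sketch below assumes the 3.13 minimum stated above; `python_is_supported` is a hypothetical helper, not a RePORTaLiN function.

```python
import sys

REQUIRED = (3, 13)  # minimum version stated in this guide


def python_is_supported(version=None, required=REQUIRED) -> bool:
    """Compare a (major, minor, ...) version tuple against the required minimum."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= tuple(required)


current = sys.version.split()[0]
if python_is_supported():
    print(f"Python {current} is supported")
else:
    print(f"Python {REQUIRED[0]}.{REQUIRED[1]}+ required, found {current}")
```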
Permission Errors
Problem: Permission denied when installing packages
Solution 1: Use virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Solution 2: Install for user only
pip install --user -r requirements.txt
Data Processing Issues
Debugging with Verbose Logging
Problem: Need to understand what the pipeline is doing or troubleshoot issues
Solution: Enable verbose (DEBUG) logging
# Enable verbose logging
python main.py -v
# View log file in real-time
tail -f .logs/reportalin_*.log
# Filter for specific issues
python main.py -v 2>&1 | grep -E "ERROR|WARNING|DEBUG.*Processing"
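Where `grep` is not available, the same filter can be approximated in Python. This sketch reuses the pattern from the command above; the first two sample lines come from the verbose output described in this guide, and the third is an invented line for illustration only.

```python
import re

# Same pattern as the grep -E command above.
PATTERN = re.compile(r"ERROR|WARNING|DEBUG.*Processing")


def filter_log_lines(lines):
    """Yield only the lines that match PATTERN, without trailing newlines."""
    for line in lines:
        if PATTERN.search(line):
            yield line.rstrip("\n")


sample = [
    "DEBUG - Processing 10_TST.xlsx",
    "DEBUG - Excel file loaded successfully. Found 17 sheets",
    "ERROR - example error line",  # illustrative, not real pipeline output
]
for line in filter_log_lines(sample):
    print(line)
```

To filter a real log, pass an open file object instead of `sample`.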
What you’ll see in verbose mode:
File Discovery
DEBUG - Excel files: ['10_TST.xlsx', '11_IGRA.xlsx', '12A_FUA.xlsx', ...]
DEBUG - Processing 10_TST.xlsx
Table Detection
DEBUG - Excel file loaded successfully. Found 17 sheets: ['Codelists', 'Notes', ...]
DEBUG - Processing 3 tables from sheet 'Codelists'
De-identification Details
DEBUG - Initialized DeidentificationEngine with config: countries=['IN'], encryption=True
DEBUG - Files to process: ['1A_ICScreening.jsonl', '1B_HCScreening.jsonl', ...]
DEBUG - Processed 1000 records from 1A_ICScreening.jsonl
DEBUG - Detected 42 PHI/PII items: ['person_name', 'phone', 'email', ...]
No Excel Files Found
Problem: Found 0 Excel files to process
Diagnosis: Check if files exist
ls -la data/dataset/*/
# Should show .xlsx files
Solution 1: Verify directory structure
data/
└── dataset/
    └── <dataset_name>/    # Must have a folder here
        ├── file1.xlsx
        └── file2.xlsx
Solution 2: Check file extensions
# Excel files must have .xlsx extension (not .xls)
# Convert .xls to .xlsx if needed
Solution 3: Verify configuration
python -c "import config; print(config.DATASET_DIR)"
# Should print correct path
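The three checks above can be combined into one small script. This is a sketch that assumes the layout shown in Solution 1 (files exactly one folder below `data/dataset`); the pipeline's own discovery logic may differ, and `find_excel_files` is a hypothetical helper.

```python
from pathlib import Path


def find_excel_files(dataset_root):
    """Return (.xlsx files, legacy .xls files) found one folder below dataset_root."""
    root = Path(dataset_root)
    xlsx = sorted(root.glob("*/*.xlsx"))
    xls = sorted(root.glob("*/*.xls"))
    return xlsx, xls


xlsx_files, xls_files = find_excel_files("data/dataset")
print(f"{len(xlsx_files)} .xlsx file(s) found")
for path in xls_files:
    print(f"Needs conversion to .xlsx: {path}")
```

A count of zero together with a list of `.xls` files points to Solution 2; a count of zero with no `.xls` files usually means the dataset folder is missing or misplaced.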
Empty Output Files
Problem: JSONL files are created but contain no data
Diagnosis: Check if Excel sheets have data
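import pandas as pd
# sheet_name=None loads every sheet into a dict of DataFrames
sheets = pd.read_excel('data/dataset/<dataset_name>/myfile.xlsx', sheet_name=None)
for name, df in sheets.items():
    print(name, df.shape)  # (rows, columns); 0 rows means the sheet has no data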
Solution: RePORTaLiN automatically skips empty sheets. This is expected behavior. Check logs for details:
cat .logs/reportalin_*.log | grep "empty"
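To see at a glance which output files are empty, the records in each JSONL file can be counted. This sketch assumes the output lives under `results/dataset`, as described elsewhere in this guide; `count_jsonl_records` is a hypothetical helper.

```python
import json
from pathlib import Path


def count_jsonl_records(path):
    """Count non-blank lines in a JSONL file, checking that each parses as JSON."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                json.loads(line)  # raises ValueError on a malformed record
                count += 1
    return count


output_root = Path("results/dataset")  # output location mentioned in this guide
for jsonl_path in sorted(output_root.rglob("*.jsonl")):
    print(jsonl_path.name, count_jsonl_records(jsonl_path))
```

Files reporting 0 records should correspond to the sheets that the log marks as empty.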
Memory Errors
Problem: MemoryError when processing large files
Solution 1: Process files one at a time
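from pathlib import Path
from scripts.extract_data import process_excel_file
excel_files = sorted(Path("data/dataset").glob("*/*.xlsx"))  # adjust to your dataset
output_dir = Path("test_output")
output_dir.mkdir(exist_ok=True)
# Process individually instead of batch
for excel_file in excel_files:
    process_excel_file(str(excel_file), str(output_dir))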
Solution 2: Increase available memory
# Close other applications
# Or run on a machine with more RAM
Solution 3: Process in chunks (for very large files)
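import pandas as pd
# pd.read_excel() has no chunksize parameter, so read fixed-size row windows
chunk_size, start = 1000, 0
while True:
    # Keep the header row and skip the data rows already processed
    chunk = pd.read_excel('large_file.xlsx', skiprows=range(1, start + 1), nrows=chunk_size)
    if chunk.empty:
        break
    # Process chunk
    start += chunk_size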
Date/Time Conversion Issues
Problem: Dates not converting correctly or appearing as numbers
Explanation: Excel stores dates as serial numbers (in the default 1900 date system, serial 1 is 1900-01-01 and each whole number is one day). RePORTaLiN automatically handles this conversion.
Solution: If dates still appear incorrect:
import pandas as pd
# Read with explicit date columns
df = pd.read_excel(
    'file.xlsx',
    parse_dates=['date_column1', 'date_column2']
)
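If a date has already reached the output as a raw serial number, it can be converted by hand. This sketch is a common recipe for the 1900 date system and is not taken from the RePORTaLiN source; it is valid for serials from 61 (1900-03-01) onward, because Excel counts a non-existent 1900-02-29.

```python
from datetime import datetime, timedelta

# Day 0 for the conversion. The two-day offset from 1900-01-01 absorbs
# Excel's 1-based counting and its fictitious 1900-02-29.
EXCEL_EPOCH = datetime(1899, 12, 30)


def excel_serial_to_datetime(serial: float) -> datetime:
    """Convert an Excel 1900-system serial number to a datetime."""
    return EXCEL_EPOCH + timedelta(days=serial)


print(excel_serial_to_datetime(45292))    # 2024-01-01 00:00:00
print(excel_serial_to_datetime(45292.5))  # 2024-01-01 12:00:00
```

The fractional part of a serial is the time of day, so 0.5 corresponds to noon.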
Logging Issues
Changed in version 0.3.0: Logging system enhanced for better reliability and speed.
No Log Files Created
Problem: .logs/ folder is empty after running the tool
Solution 1: Check folder permissions
chmod 755 .logs/
python main.py
Solution 2: Verify logging is enabled
python -c "import config; print(config.LOG_LEVEL)"
Solution 3: Check for early errors
# Run with verbose output
python main.py 2>&1 | tee output.log
Note: The logging system is designed to work reliably even with multiple processes. If logs are missing, check for early errors or folder permission issues.
Log Files Too Large
Problem: Log files consuming too much disk space
Solution: Implement log rotation
# In config.py or scripts/utils/logging.py
from logging.handlers import RotatingFileHandler
handler = RotatingFileHandler(
    log_file,  # Path to the active log file
    maxBytes=10*1024*1024,  # Rotate after 10 MB
    backupCount=5  # Keep the five most recent rotated files
)
Console Output Issues
Problem: Console shows too much or too little output
Solution: The console handler is filtered to show only SUCCESS, ERROR, and CRITICAL messages by default.
# To see all messages (including INFO and DEBUG), check the log files
cat .logs/reportalin_*.log
# Or modify the console filter in scripts/utils/logging.py
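For orientation, a console filter of the kind described above could look like the following. This is a sketch built on the standard `logging` module, not the code in `scripts/utils/logging.py`; the numeric value 25 for SUCCESS and the names `ConsoleFilter` and `console_filter_demo` are assumptions.

```python
import logging
import sys

SUCCESS = 25  # assumed value for a custom level between INFO (20) and WARNING (30)
logging.addLevelName(SUCCESS, "SUCCESS")


class ConsoleFilter(logging.Filter):
    """Let through only SUCCESS, ERROR, and CRITICAL records."""

    def filter(self, record):
        return record.levelno == SUCCESS or record.levelno >= logging.ERROR


logger = logging.getLogger("console_filter_demo")
logger.setLevel(logging.DEBUG)
console = logging.StreamHandler(sys.stdout)
console.addFilter(ConsoleFilter())
console.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
logger.addHandler(console)

logger.info("hidden on the console")
logger.warning("also hidden on the console")
logger.log(SUCCESS, "shown on the console")
logger.error("shown on the console")
```

Relaxing the condition in `filter` (for example to `record.levelno >= logging.INFO`) is the kind of change that would make the console more verbose.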
Configuration Issues
Quick Configuration Check
Added in version 0.3.0.
Use the built-in validation utility:
from config import validate_config
warnings = validate_config()
if warnings:
    print("Configuration issues found:")
    for warning in warnings:
        print(f" ⚠️ {warning}")
else:
    print("✓ Configuration is valid!")
This automatically checks for:
Missing data directory
Missing dataset directory
Missing data dictionary file
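If `validate_config` is unavailable (for example on a version older than 0.3.0), the three checks listed above can be approximated by hand. This is a hypothetical stand-in, not the actual implementation; the function name `check_paths` and the example paths are placeholders, and the real values live in `config.py`.

```python
from pathlib import Path


def check_paths(data_dir, dataset_dir, dictionary_file):
    """Return a list of warning strings; an empty list means all three paths exist."""
    warnings = []
    if not Path(data_dir).is_dir():
        warnings.append(f"Data directory not found: {data_dir}")
    if not Path(dataset_dir).is_dir():
        warnings.append(f"Dataset directory not found: {dataset_dir}")
    if not Path(dictionary_file).is_file():
        warnings.append(f"Data dictionary file not found: {dictionary_file}")
    return warnings


# Placeholder paths for illustration only.
for warning in check_paths("data", "data/dataset/my_dataset", "data/dictionary.xlsx"):
    print(warning)
```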
Dataset Not Auto-Detected
Problem: Pipeline doesn’t detect dataset folder
Diagnosis: Check what’s being detected
python -c "import config; print(config.DATASET_NAME)"
Solution 1: Use validation utility
from config import validate_config, ensure_directories
# Check for issues
warnings = validate_config()
for warning in warnings:
    print(warning)
# Ensure directories exist
ensure_directories()
Solution 2: Ensure folder exists in correct location
mkdir -p data/dataset/my_dataset
cp *.xlsx data/dataset/my_dataset/
Solution 3: Check for hidden folders
ls -la data/dataset/
# Should show folders (not starting with '.')
Solution 4: Manually specify in config.py
# config.py
import os
DATASET_NAME = "my_dataset"  # Or use DEFAULT_DATASET_NAME, which config.py already defines
DATASET_DIR = os.path.join(DATA_DIR, "dataset", DATASET_NAME)
Wrong Output Directory
Problem: Results appear in unexpected location
Solution: Check configuration
python -c "import config; print(config.CLEAN_DATASET_DIR)"
The output should be: results/dataset/<dataset_name>/
Path Issues
Problem: FileNotFoundError for data dictionary or other files
Solution 1: Verify you’re in project root
pwd
# Should show /path/to/RePORTaLiN
# If not:
cd /path/to/RePORTaLiN
python main.py
Solution 2: Check if files exist
ls data/data_dictionary_and_mapping_specifications/*.xlsx
Solution 3: Update paths in config.py if files are elsewhere
Performance Issues
Slow Processing
Problem: Pipeline takes much longer than the expected ~15-20 seconds
Diagnosis: Check file count and sizes
find data/dataset/ -name "*.xlsx" | wc -l
du -sh data/dataset/
Solution 1: Verify no network drives
# Process locally, not on network drives
cp -r /network/drive/data ./data
Solution 2: Check system resources
# macOS
top
# Linux
htop
Solution 3: Disable antivirus temporarily
Antivirus software can slow file operations significantly.
Progress Bar Not Showing
Problem: Progress bars don’t display
Solution 1: Ensure tqdm is installed
pip install tqdm
Solution 2: Check that you are running in a proper terminal
Some IDEs don’t support progress bars. Run in a regular terminal:
python main.py
Data Quality Issues
Duplicate Column Names
Problem: Warning about duplicate columns in data dictionary
Explanation: This is handled automatically. RePORTaLiN renames duplicates to column_name_2, column_name_3, etc.
No Action Needed: This is expected behavior for some Excel files.
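The renaming scheme described above can be illustrated with a short sketch. It follows the `column_name_2`, `column_name_3` pattern from the explanation but is not the pipeline's own code; `rename_duplicates` is a hypothetical helper, and it does not guard against a column that is already literally named `id_2`.

```python
def rename_duplicates(columns):
    """Append _2, _3, ... to repeated names, leaving first occurrences unchanged."""
    seen = {}
    renamed = []
    for name in columns:
        seen[name] = seen.get(name, 0) + 1
        renamed.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return renamed


print(rename_duplicates(["id", "visit_date", "id", "id"]))
# ['id', 'visit_date', 'id_2', 'id_3']
```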
Missing Data/NaN Values
Problem: null values in JSONL output
Explanation: This is correct. Empty cells in Excel are converted to null in JSON format.
If You Need Different Behavior:
import pandas as pd
# Read JSONL and fill nulls
df = pd.read_json('output.jsonl', lines=True)
df.fillna('', inplace=True) # or other value
# Save back
df.to_json('output_cleaned.jsonl', orient='records', lines=True)
Incorrect Data Types
Problem: Numbers stored as strings or vice versa
Solution: The pipeline automatically infers types. If you need specific types:
import pandas as pd
df = pd.read_json('output.jsonl', lines=True)
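# Convert specific columns
# 'Int64' (nullable integer) tolerates missing values; plain int raises on NaN
df['age'] = pd.to_numeric(df['age'], errors='coerce').astype('Int64')
df['date'] = pd.to_datetime(df['date'])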
Advanced Troubleshooting
Enable Debug Logging
For detailed diagnostic information:
# config.py
import logging
LOG_LEVEL = logging.DEBUG
Then run:
python main.py 2>&1 | tee debug.log
Inspect Intermediate Results
Check what’s happening at each stage:
import os
import config
from scripts.load_dictionary import load_study_dictionary
# Test dictionary loading
load_study_dictionary(
    config.DICTIONARY_EXCEL_FILE,
    config.DICTIONARY_JSON_OUTPUT_DIR
)
# Check output
print(os.listdir(config.DICTIONARY_JSON_OUTPUT_DIR))
Test Single File
Process one file in isolation:
from scripts.extract_data import process_excel_file
from pathlib import Path
test_file = Path("data/dataset/Indo-vap/10_TST.xlsx")
output_dir = Path("test_output")
output_dir.mkdir(exist_ok=True)
result = process_excel_file(str(test_file), str(output_dir))
print(result)
Verify Dependencies
Ensure all dependencies are correctly installed:
pip list | grep -E 'pandas|openpyxl|numpy|tqdm'
Should show:
numpy 2.x.x
openpyxl 3.x.x
pandas 2.x.x
tqdm 4.x.x
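On systems without `grep`, the same information is available from Python's standard library. This sketch lists the four packages named above and marks any that are missing; `installed_versions` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGES = ["pandas", "openpyxl", "numpy", "tqdm"]  # names from the pip list check above


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


for name, found in installed_versions(PACKAGES).items():
    print(f"{name:10} {found or 'NOT INSTALLED'}")
```

The output is also convenient to paste into a bug report.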
Getting Help
If you’re still experiencing issues:
Check the logs:
cat .logs/reportalin_*.log
Search existing issues: Check the GitHub repository
Create a minimal reproducible example
Include diagnostic information:
python --version
pip list
python -c "import config; print(config.DATASET_DIR)"
Common Error Messages
TypeError: Object of type 'Timestamp' is not JSON serializable
Cause: Date conversion issue
Solution: Already handled in the pipeline. If you see this, update to the latest version.
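If the error appears in your own post-processing code rather than in the pipeline, a fallback serializer usually resolves it. The sketch below shows the general technique with the standard library and is not the pipeline's implementation; `to_jsonable` is a hypothetical helper. It also covers pandas `Timestamp` values, since `Timestamp` is a subclass of `datetime`.

```python
import json
from datetime import date, datetime


def to_jsonable(value):
    """Fallback for json.dumps: turn date-like objects into ISO 8601 strings."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


record = {"subject": "example", "visit": datetime(2024, 1, 1, 9, 30)}
print(json.dumps(record, default=to_jsonable))
# {"subject": "example", "visit": "2024-01-01T09:30:00"}
```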
UnicodeDecodeError
Cause: File encoding issue
Solution: Ensure Excel files are saved in standard format (Excel 2007+ .xlsx)
PermissionError: [Errno 13] Permission denied
Cause: File in use or insufficient permissions
Solution:
# Close Excel files
# Check permissions
chmod -R 755 data/ results/
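Before changing permissions, it can help to confirm which folder is actually unwritable. This sketch checks the folders mentioned in this guide from the current working directory; `unwritable_paths` is a hypothetical helper, and folders that do not exist yet are skipped.

```python
import os


def unwritable_paths(paths):
    """Return the paths that exist but that the current user cannot write to."""
    return [p for p in paths if os.path.exists(p) and not os.access(p, os.W_OK)]


for path in unwritable_paths(["data", "results", ".logs"]):
    print(f"No write permission: {path}")
```

No output means every existing folder is writable, in which case an open Excel file is the more likely cause.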
See Also
Configuration: Configuration options
Usage Guide: Usage examples
Architecture: Technical system design
GitHub Issues: Report new problems