Quick Start
For Users: Simplified Execution Guide
This guide provides clear, step-by-step instructions to get you started with RePORTaLiN in just a few minutes. No technical expertise required!
What Does RePORTaLiN Do?

RePORTaLiN is a tool that:

- Converts Excel files to a simpler JSON format (JSONL)
- Organizes data dictionary information into structured tables
- Protects sensitive patient information (optional de-identification)
- Validates data integrity and generates detailed logs

Think of it as an automated data processing assistant that handles tedious file conversions safely and efficiently.
Prerequisites

Before you begin, ensure you have:

- ✅ Python 3.13 or higher installed (check with: python3 --version)
- ✅ Project files downloaded or cloned to your computer
- ✅ Excel data files in the data/dataset/ folder
- ✅ 5-10 minutes of time for initial setup
Verify Configuration (Optional)

Before running the pipeline, validate your setup:

from config import validate_config

warnings = validate_config()
if warnings:
    for warning in warnings:
        print(warning)
else:
    print("Configuration is valid")
See Configuration for details.
Expected Output

You should see output similar to:

Processing sheets: 100%|██████████| 14/14 [00:00<00:00, 122.71sheet/s]
SUCCESS: Excel processing complete!
SUCCESS: Step 0: Loading Data Dictionary completed successfully.
Found 43 Excel files to process...
Processing files: 100%|██████████| 43/43 [00:15<00:00, 2.87file/s]
SUCCESS: Step 1: Extracting Raw Data to JSONL completed successfully.
RePORTaLiN pipeline finished.
Understanding the Output

After the pipeline completes, you'll find:

Extracted Data in results/dataset/<dataset_name>/

results/dataset/Indo-vap/
├── original/              (All columns preserved)
│   ├── 10_TST.jsonl       (631 records)
│   ├── 11_IGRA.jsonl      (262 records)
│   ├── 12A_FUA.jsonl      (2,831 records)
│   └── ...                (43 files total)
└── cleaned/               (Duplicate columns removed)
    ├── 10_TST.jsonl       (631 records)
    ├── 11_IGRA.jsonl      (262 records)
    ├── 12A_FUA.jsonl      (2,831 records)
    └── ...                (43 files total)

Note: Each extraction creates two versions in separate subdirectories:

- original/ - All columns preserved as-is from Excel files
- cleaned/ - Duplicate columns removed (e.g., SUBJID2, SUBJID3)

Data Dictionary Mappings in results/data_dictionary_mappings/

results/data_dictionary_mappings/
├── Codelists/
│   ├── Codelists_table_1.jsonl
│   └── Codelists_table_2.jsonl
├── tblENROL/
│   └── tblENROL_table.jsonl
└── ...                    (14 sheets)

De-identified Data (if --enable-deidentification is used) in results/deidentified/<dataset_name>/

results/deidentified/Indo-vap/
├── original/              (De-identified original files)
│   ├── 10_TST.jsonl
│   └── ...
├── cleaned/               (De-identified cleaned files)
│   ├── 10_TST.jsonl
│   └── ...
└── _deidentification_audit.json

Execution Logs in .logs/

.logs/
└── reportalin_20251002_132124.log
Viewing the Results
JSONL files can be viewed in several ways:
Using a text editor:
# View first few lines
head results/dataset/Indo-vap/original/10_TST.jsonl
Using Python:
import pandas as pd
# Read JSONL file
df = pd.read_json('results/dataset/Indo-vap/original/10_TST.jsonl', lines=True)
print(df.head())
Using jq (command-line JSON processor):
# Pretty-print first record
head -n 1 results/dataset/Indo-vap/original/10_TST.jsonl | jq
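If pandas or jq aren't available, the standard library is enough to inspect JSONL. A minimal sketch using only `json` (the sample records and temporary file stand in for a real extraction output):

```python
import json
import tempfile
from pathlib import Path

# Sample records mimicking the JSONL layout shown above (made up for the demo).
sample = [
    {"SUBJID": "INV001", "VISIT": 1, "TST_RESULT": "Positive"},
    {"SUBJID": "INV002", "VISIT": 1, "TST_RESULT": "Negative"},
]

# Write a small JSONL file: one JSON object per line.
path = Path(tempfile.mkdtemp()) / "10_TST.jsonl"
path.write_text("\n".join(json.dumps(r) for r in sample) + "\n")

# Read it back line by line -- no third-party packages required.
records = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
print(len(records), records[0]["SUBJID"])  # 2 INV001
```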
Command-Line Options

Skip Specific Steps
You can skip individual pipeline steps:
# Skip data dictionary loading
python main.py --skip-dictionary
# Skip data extraction
python main.py --skip-extraction
# Skip both (useful for testing)
python main.py --skip-dictionary --skip-extraction
View Help

python main.py --help

Using Make Commands
For convenience, you can use Make commands:
# Run the pipeline
make run
# Clean cache files
make clean
# Run tests (if available)
make test
Working with Different Datasets

RePORTaLiN automatically detects your dataset:

1. Place your Excel files in data/dataset/<your_dataset_name>/
2. Run python main.py
3. Results appear in results/dataset/<your_dataset_name>/
Example:
# Your data structure
data/dataset/
└── my_research_data/
    ├── file1.xlsx
    ├── file2.xlsx
    └── ...

# Automatically creates
results/dataset/
└── my_research_data/
    ├── file1.jsonl
    ├── file2.jsonl
    └── ...
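The input-to-output naming convention shown above can be sketched in a few lines. This is an illustration only, not the pipeline's actual code (the `output_path` helper is hypothetical):

```python
from pathlib import Path

def output_path(xlsx_path: str, results_root: str = "results/dataset") -> Path:
    """Hypothetical helper: map an input .xlsx under data/dataset/ to the
    corresponding .jsonl under results/dataset/, following the layout above."""
    p = Path(xlsx_path)
    dataset = p.parent.name              # e.g. "my_research_data"
    return Path(results_root) / dataset / (p.stem + ".jsonl")

print(output_path("data/dataset/my_research_data/file1.xlsx").as_posix())
# results/dataset/my_research_data/file1.jsonl
```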
Checking the Logs
Logs provide detailed information about the extraction process:
# View the latest log
ls -lt .logs/ | head -n 2
cat .logs/reportalin_20251002_132124.log
Logs include:

- Timestamp for each operation
- Files processed and record counts
- Warnings and errors (if any)
- Success confirmations
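Because log file names embed a timestamp, the latest run's log sorts last. A small sketch that picks it out programmatically (the throwaway `.logs` directory and file names here are fabricated for the demo):

```python
import tempfile
from pathlib import Path

# Throwaway .logs directory with two fabricated log files for the demo.
logs = Path(tempfile.mkdtemp()) / ".logs"
logs.mkdir()
(logs / "reportalin_20251001_090000.log").write_text("older run\n")
(logs / "reportalin_20251002_132124.log").write_text("newer run\n")

# The timestamp in the name sorts lexicographically, so max() is the latest log.
latest = max(logs.glob("reportalin_*.log"))
print(latest.name)  # reportalin_20251002_132124.log
```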
Common First-Run Issues

Issue: "No Excel files found"

Solution: Ensure your Excel files are in data/dataset/<folder_name>/

ls data/dataset/*/

---

Issue: "Permission denied" when creating logs

Solution: Ensure the .logs directory is writable:

chmod 755 .logs/

---

Issue: "Package not found"

Solution: Ensure dependencies are installed:

pip install -r requirements.txt
Step-by-Step Execution
Step 1: Install Dependencies (One-time setup)
Open your terminal/command prompt and navigate to the RePORTaLiN project folder:
cd /path/to/RePORTaLiN
Install required Python packages:
pip install -r requirements.txt
You should see packages being installed (pandas, openpyxl, tqdm, etc.). This takes 1-2 minutes.
✅ Expected Output: "Successfully installed pandas-2.0.0 openpyxl-3.1.0…" (versions may vary)
---
Step 2: Verify Your Data Files
Check that your Excel files are in the right location:
ls data/dataset/
✅ Expected Output: You should see a folder (e.g., Indo-vap_csv_files/) containing .xlsx files
If you don't see any folders, create one and place your Excel files there:
mkdir -p data/dataset/my_dataset/
cp /path/to/your/excel/files/*.xlsx data/dataset/my_dataset/
---
Step 3: Run the Basic Pipeline
Execute the main pipeline with this simple command:
python3 main.py
✅ Expected Output: You'll see two progress bars:

Processing sheets: 100%|██████████| 14/14 [00:01<00:00, 12.71sheet/s]
SUCCESS: Step 0: Loading Data Dictionary completed successfully.
Found 43 Excel files to process...
Processing files: 100%|██████████| 43/43 [00:15<00:00, 2.87file/s]
SUCCESS: Step 1: Extracting Raw Data to JSONL completed successfully.
RePORTaLiN pipeline finished.

⏱️ Time: Usually 15-30 seconds depending on file size
---
Step 4: Check Your Results
Navigate to the results folder:
cd results/dataset/
ls
✅ Expected Output: You'll see a folder with your dataset name (e.g., Indo-vap/)

Look inside:

ls results/dataset/Indo-vap/

✅ Expected Output:

original/ (Contains .jsonl files with all original columns)
cleaned/ (Contains .jsonl files with duplicate columns removed)

Each folder contains the same files but with different processing levels:

- original/ = Exact Excel data, just converted to JSONL
- cleaned/ = Duplicate columns (like SUBJID2, SUBJID3) removed for cleaner data
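The original/ vs. cleaned/ distinction can be illustrated with a toy record. This sketch shows the idea of dropping numbered duplicates like SUBJID2 and SUBJID3; it is not the pipeline's actual cleaning code, and the field names are illustrative:

```python
import re

# One toy record as it might appear in original/ (field names are illustrative).
original = {"SUBJID": "INV001", "SUBJID2": "INV001", "SUBJID3": "INV001", "VISIT": 1}

def drop_numbered_duplicates(record: dict) -> dict:
    """Sketch only: drop columns like SUBJID2/SUBJID3 when the base column
    (SUBJID) is also present. Not the pipeline's actual cleaning logic."""
    cleaned = {}
    for key, value in record.items():
        m = re.fullmatch(r"(.+?)(\d+)", key)
        if m and m.group(1) in record:
            continue  # numbered duplicate of an existing base column
        cleaned[key] = value
    return cleaned

print(drop_numbered_duplicates(original))  # {'SUBJID': 'INV001', 'VISIT': 1}
```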
---
Step 5: View Your Converted Data (Optional)
Open any .jsonl file to see the converted data:
head -n 5 results/dataset/Indo-vap/original/10_TST.jsonl
✅ Expected Output: You'll see JSON-formatted data, one record per line:
{"SUBJID": "INV001", "VISIT": 1, "TST_RESULT": "Positive", "source_file": "10_TST.xlsx"}
{"SUBJID": "INV002", "VISIT": 1, "TST_RESULT": "Negative", "source_file": "10_TST.xlsx"}
...
🎉 Congratulations! Your data has been successfully converted!
Advanced Usage: De-identification
If you need to remove sensitive patient information (PHI/PII), use the de-identification feature:
Step 1: Run with De-identification Enabled
python3 main.py --enable-deidentification
✅ Expected Output: An additional processing step for de-identification:

De-identifying dataset: results/dataset/Indo-vap -> results/deidentified/Indo-vap
Processing both 'original' and 'cleaned' subdirectories...
Countries: IN (default)
De-identifying files: 100%|██████████| 43/43 [00:25<00:00, 1.72file/s]
De-identification complete:
Texts processed: 15,234
Total detections: 1,250
Countries: IN (default)
Unique mappings: 485

⏱️ Time: Additional 20-40 seconds for de-identification
---
Step 2: Specify Countries (For multi-country studies)
python3 main.py --enable-deidentification --countries IN US ID BR
This applies privacy regulations for India, United States, Indonesia, and Brazil.
✅ Supported Countries: US, IN, ID, BR, PH, ZA, EU, GB, CA, AU, KE, NG, GH, UG
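If you script pipeline runs, it can help to validate country codes before launching. A hypothetical pre-flight check (`check_countries` is not part of RePORTaLiN; the supported set mirrors the list above):

```python
# Hypothetical pre-flight check; the supported set mirrors the codes above.
SUPPORTED = {"US", "IN", "ID", "BR", "PH", "ZA", "EU",
             "GB", "CA", "AU", "KE", "NG", "GH", "UG"}

def check_countries(requested):
    """Normalize codes to upper case and reject anything unsupported."""
    codes = [c.upper() for c in requested]
    unknown = sorted(set(codes) - SUPPORTED)
    if unknown:
        raise ValueError(f"Unsupported country codes: {', '.join(unknown)}")
    return codes

print(check_countries(["in", "US", "id", "br"]))  # ['IN', 'US', 'ID', 'BR']
```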
---
Step 3: View De-identified Results
head -n 3 results/deidentified/Indo-vap/original/10_TST.jsonl
✅ Expected Output: Sensitive data replaced with placeholders:

{"SUBJID": "[MRN-X7Y2A9]", "PATIENT_NAME": "[PATIENT-A4B8C3]", "DOB": "[DATE-1]", ...}
{"SUBJID": "[MRN-K2M5P1]", "PATIENT_NAME": "[PATIENT-D9F2G7]", "DOB": "[DATE-2]", ...}

Note: Original → Pseudonym mappings are encrypted and stored securely in:
results/deidentified/mappings/mappings.enc
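A quick way to spot-check de-identified output is to look for values that do not match the placeholder pattern. A sketch, assuming placeholders always look like [TYPE-CODE] as in the sample output above (the exact format may differ):

```python
import re

# Assumed placeholder shape, based on the sample output above: [TYPE-CODE].
PLACEHOLDER = re.compile(r"\[(?:MRN|PATIENT|DATE)-[A-Za-z0-9]+\]")

values = ["[MRN-X7Y2A9]", "[PATIENT-A4B8C3]", "[DATE-1]", "John Doe"]
flagged = [v for v in values if not PLACEHOLDER.fullmatch(v)]
print(flagged)  # ['John Doe'] -- any value left un-replaced stands out
```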
Troubleshooting

Problem: "No Excel files found"

Solution: Check that your Excel files (.xlsx) are in data/dataset/<folder_name>/
ls data/dataset/*/
---
Problem: "Package 'pandas' not found"
Solution: Install dependencies:
pip install -r requirements.txt
---
Problem: "Permission denied" when accessing files
Solution: Run with appropriate permissions:
# On macOS/Linux
chmod +x main.py
python3 main.py
# On Windows (run as Administrator)
python main.py
---
Problem: Files are being skipped
Solution: This is normal! The pipeline skips files that were already processed successfully. To reprocess, delete the output folder:
rm -rf results/dataset/my_dataset/
python3 main.py
---
Problem: "Validation found potential PHI" warning after de-identification
Solution: This is a cautious warning. Review the log file for details:
cat .logs/reportalin_*.log | grep "potential PHI"
If it's a false positive (like "[MRN-ABC123]" being detected), you can safely proceed.
Common Use Cases
Use Case 1: Process only data dictionary, skip extraction
python3 main.py --skip-extraction
---
Use Case 2: Process only data extraction, skip dictionary
python3 main.py --skip-dictionary
---
Use Case 3: Reprocess everything from scratch
rm -rf results/
python3 main.py
---
Use Case 4: De-identify for multiple countries without encryption (testing only)
python3 main.py --enable-deidentification --countries ALL --no-encryption
⚠️ Warning: --no-encryption should only be used for testing! Always use encryption in production.
---
Next Steps

✅ You're done! Your data has been successfully processed.

What's next?

- Analyze your data: Use the .jsonl files with pandas, jq, or any JSON tool
- Read the full documentation: Learn about advanced configuration options
- Review de-identification: Check the audit log at results/deidentified/_deidentification_audit.json
- Check logs: Detailed operation logs are in .logs/reportalin_<timestamp>.log
Need help? See the Troubleshooting guide or review the logs for detailed error messages.