Quick Start
===========

**For Users: Simplified Execution Guide**

This guide provides clear, step-by-step instructions to get you started with
RePORTaLiN in just a few minutes. No technical expertise required!

What Does RePORTaLiN Do?
------------------------

RePORTaLiN is a tool that:

1. 📊 **Converts** Excel files to a simpler JSON format (JSONL)
2. 🔍 **Organizes** data dictionary information into structured tables
3. 🔒 **Protects** sensitive patient information (optional de-identification)
4. ✅ **Validates** data integrity and generates detailed logs

Think of it as an automated data processing assistant that handles tedious
file conversions safely and efficiently.

Prerequisites
-------------

Before you begin, ensure you have:

✅ **Python 3.13 or higher** installed (check with ``python3 --version``)

✅ **Project files** downloaded or cloned to your computer

✅ **Excel data files** in the ``data/dataset/`` folder

✅ **5-10 minutes** of time for initial setup

Verify Configuration (Optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Before running the pipeline, validate your setup:

.. code-block:: python

   from config import validate_config

   warnings = validate_config()
   if warnings:
       for warning in warnings:
           print(warning)
   else:
       print("Configuration is valid")

See :doc:`configuration` for details.

Expected Output
---------------

After running the pipeline (``python3 main.py``, covered step by step below),
you should see output similar to:

.. code-block:: text

   Processing sheets: 100%|██████████| 14/14 [00:00<00:00, 122.71sheet/s]
   SUCCESS: Excel processing complete!
   SUCCESS: Step 0: Loading Data Dictionary completed successfully.
   Found 43 Excel files to process...
   Processing files: 100%|██████████| 43/43 [00:15<00:00, 2.87file/s]
   SUCCESS: Step 1: Extracting Raw Data to JSONL completed successfully.
   RePORTaLiN pipeline finished.

Understanding the Output
------------------------

After the pipeline completes, you'll find:

1. **Extracted Data** in ``results/dataset/<dataset_name>/``

   .. code-block:: text

      results/dataset/Indo-vap/
      ├── original/            (All columns preserved)
      │   ├── 10_TST.jsonl     (631 records)
      │   ├── 11_IGRA.jsonl    (262 records)
      │   ├── 12A_FUA.jsonl    (2,831 records)
      │   └── ...              (43 files total)
      └── cleaned/             (Duplicate columns removed)
          ├── 10_TST.jsonl     (631 records)
          ├── 11_IGRA.jsonl    (262 records)
          ├── 12A_FUA.jsonl    (2,831 records)
          └── ...              (43 files total)

   **Note:** Each extraction creates two versions in separate subdirectories:

   - **original/** - All columns preserved as-is from Excel files
   - **cleaned/** - Duplicate columns removed (e.g., SUBJID2, SUBJID3)

2. **Data Dictionary Mappings** in ``results/data_dictionary_mappings/``

   .. code-block:: text

      results/data_dictionary_mappings/
      ├── Codelists/
      │   ├── Codelists_table_1.jsonl
      │   └── Codelists_table_2.jsonl
      ├── tblENROL/
      │   └── tblENROL_table.jsonl
      └── ...                  (14 sheets)

3. **De-identified Data** (if ``--enable-deidentification`` is used) in
   ``results/deidentified/<dataset_name>/``

   .. code-block:: text

      results/deidentified/Indo-vap/
      ├── original/            (De-identified original files)
      │   ├── 10_TST.jsonl
      │   └── ...
      ├── cleaned/             (De-identified cleaned files)
      │   ├── 10_TST.jsonl
      │   └── ...
      └── _deidentification_audit.json

4. **Execution Logs** in ``.logs/``

   .. code-block:: text

      .logs/
      └── reportalin_20251002_132124.log
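If you want to confirm that the ``original/`` and ``cleaned/`` versions
contain the same number of records (only duplicate *columns* should differ),
a minimal sketch along these lines counts the lines in each JSONL file. The
dataset path is an assumption based on the example layout above; adjust it to
your dataset name:

.. code-block:: python

   from pathlib import Path

   # Count records in each output version. The path below assumes the
   # example layout shown above; substitute your own dataset name.
   base = Path("results/dataset/Indo-vap")
   for version in ("original", "cleaned"):
       for file in sorted((base / version).glob("*.jsonl")):
           with file.open() as f:
               count = sum(1 for _ in f)  # one JSON record per line
           print(f"{version}/{file.name}: {count} records")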
Viewing the Results
-------------------

JSONL files can be viewed in several ways:

**Using a text editor:**

.. code-block:: bash

   # View first few lines
   head results/dataset/Indo-vap/original/10_TST.jsonl

**Using Python:**

.. code-block:: python

   import pandas as pd

   # Read JSONL file
   df = pd.read_json('results/dataset/Indo-vap/original/10_TST.jsonl', lines=True)
   print(df.head())

**Using jq (command-line JSON processor):**

.. code-block:: bash

   # Pretty-print first record
   head -n 1 results/dataset/Indo-vap/original/10_TST.jsonl | jq '.'

Command-Line Options
--------------------

Skip Specific Steps
~~~~~~~~~~~~~~~~~~~

You can skip individual pipeline steps:

.. code-block:: bash

   # Skip data dictionary loading
   python main.py --skip-dictionary

   # Skip data extraction
   python main.py --skip-extraction

   # Skip both (useful for testing)
   python main.py --skip-dictionary --skip-extraction

View Help
~~~~~~~~~

.. code-block:: bash

   python main.py --help

Using Make Commands
-------------------

For convenience, you can use Make commands:

.. code-block:: bash

   # Run the pipeline
   make run

   # Clean cache files
   make clean

   # Run tests (if available)
   make test

Working with Different Datasets
-------------------------------

RePORTaLiN automatically detects your dataset:

1. Place your Excel files in ``data/dataset/<dataset_name>/``
2. Run ``python main.py``
3. Results appear in ``results/dataset/<dataset_name>/``

Example:

.. code-block:: text

   # Your data structure
   data/dataset/
   └── my_research_data/
       ├── file1.xlsx
       ├── file2.xlsx
       └── ...

   # Automatically creates
   results/dataset/
   └── my_research_data/
       ├── file1.jsonl
       ├── file2.jsonl
       └── ...

Checking the Logs
-----------------

Logs provide detailed information about the extraction process:

.. code-block:: bash

   # View the latest log
   ls -lt .logs/ | head -n 2
   cat .logs/reportalin_20251002_132124.log

Logs include:

- Timestamp for each operation
- Files processed and record counts
- Warnings and errors (if any)
- Success confirmations
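Rather than copying a timestamped filename by hand, you can let Python pick
the newest log and surface any problems. This is a minimal sketch that
assumes the ``reportalin_<timestamp>.log`` naming shown above (the timestamp
format makes the names sort chronologically) and that log lines contain the
literal words ``WARNING`` or ``ERROR``:

.. code-block:: python

   from pathlib import Path

   # Find the most recent pipeline log; timestamped names sort
   # chronologically, so the last one is the newest.
   logs = sorted(Path(".logs").glob("reportalin_*.log"))
   if logs:
       latest = logs[-1]
       print(f"Latest log: {latest}")
       # Print only lines flagged as warnings or errors (assumed
       # level names; adjust if your logs use different markers).
       for line in latest.read_text().splitlines():
           if "WARNING" in line or "ERROR" in line:
               print(line)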
Common First-Run Issues
-----------------------

**Issue**: "No Excel files found"

**Solution**: Ensure your Excel files are in ``data/dataset/<dataset_name>/``:

.. code-block:: bash

   ls data/dataset/*/

----

**Issue**: "Permission denied" when creating logs

**Solution**: Ensure the ``.logs`` directory is writable:

.. code-block:: bash

   chmod 755 .logs/

----

**Issue**: "Package not found"

**Solution**: Ensure dependencies are installed:

.. code-block:: bash

   pip install -r requirements.txt

Step-by-Step Execution
----------------------

**Step 1: Install Dependencies** (One-time setup)

Open your terminal/command prompt and navigate to the RePORTaLiN project
folder:

.. code-block:: bash

   cd /path/to/RePORTaLiN

Install the required Python packages:

.. code-block:: bash

   pip install -r requirements.txt

You should see packages being installed (pandas, openpyxl, tqdm, etc.). This
takes 1-2 minutes.

✅ **Expected Output:** "Successfully installed pandas-2.0.0 openpyxl-3.1.0..."
(versions may vary)

----

**Step 2: Verify Your Data Files**

Check that your Excel files are in the right location:

.. code-block:: bash

   ls data/dataset/

✅ **Expected Output:** You should see a folder (e.g., ``Indo-vap_csv_files/``)
containing .xlsx files.

If you don't see any folders, create one and place your Excel files there:

.. code-block:: bash

   mkdir -p data/dataset/my_dataset/
   cp /path/to/your/excel/files/*.xlsx data/dataset/my_dataset/

----

**Step 3: Run the Basic Pipeline**

Execute the main pipeline with this simple command:

.. code-block:: bash

   python3 main.py

✅ **Expected Output:** You'll see two progress bars:

.. code-block:: text

   Processing sheets: 100%|██████████| 14/14 [00:01<00:00, 12.71sheet/s]
   SUCCESS: Step 0: Loading Data Dictionary completed successfully.
   Found 43 Excel files to process...
   Processing files: 100%|██████████| 43/43 [00:15<00:00, 2.87file/s]
   SUCCESS: Step 1: Extracting Raw Data to JSONL completed successfully.
   RePORTaLiN pipeline finished.

⏱️ **Time:** Usually 15-30 seconds, depending on file size

----

**Step 4: Check Your Results**

Navigate to the results folder:

.. code-block:: bash

   cd results/dataset/
   ls

✅ **Expected Output:** You'll see a folder with your dataset name
(e.g., ``Indo-vap/``)

Look inside:

.. code-block:: bash

   ls results/dataset/Indo-vap/

✅ **Expected Output:**

.. code-block:: text

   original/   (Contains .jsonl files with all original columns)
   cleaned/    (Contains .jsonl files with duplicate columns removed)

Each folder contains the same files but with different processing levels:

- **original/** = Exact Excel data, just converted to JSONL
- **cleaned/** = Duplicate columns (like SUBJID2, SUBJID3) removed for
  cleaner data

----

**Step 5: View Your Converted Data** (Optional)

Open any .jsonl file to see the converted data:

.. code-block:: bash

   head -n 5 results/dataset/Indo-vap/original/10_TST.jsonl

✅ **Expected Output:** You'll see JSON-formatted data, one record per line:

.. code-block:: text

   {"SUBJID": "INV001", "VISIT": 1, "TST_RESULT": "Positive", "source_file": "10_TST.xlsx"}
   {"SUBJID": "INV002", "VISIT": 1, "TST_RESULT": "Negative", "source_file": "10_TST.xlsx"}
   ...

🎉 **Congratulations!** Your data has been successfully converted!

Advanced Usage: De-identification
---------------------------------

If you need to remove sensitive patient information (PHI/PII), use the
de-identification feature:

**Step 1: Run with De-identification Enabled**

.. code-block:: bash

   python3 main.py --enable-deidentification

✅ **Expected Output:** An additional processing step runs for
de-identification:

.. code-block:: text

   De-identifying dataset: results/dataset/Indo-vap -> results/deidentified/Indo-vap
   Processing both 'original' and 'cleaned' subdirectories...
   Countries: IN (default)
   De-identifying files: 100%|██████████| 43/43 [00:25<00:00, 1.72file/s]
   De-identification complete:
     Texts processed: 15,234
     Total detections: 1,250
     Countries: IN (default)
     Unique mappings: 485

⏱️ **Time:** An additional 20-40 seconds for de-identification

----

**Step 2: Specify Countries** (For multi-country studies)

.. code-block:: bash

   python3 main.py --enable-deidentification --countries IN US ID BR

This applies privacy regulations for India, the United States, Indonesia,
and Brazil.

✅ **Supported Countries:** US, IN, ID, BR, PH, ZA, EU, GB, CA, AU, KE, NG,
GH, UG

----

**Step 3: View De-identified Results**

.. code-block:: bash

   head -n 3 results/deidentified/Indo-vap/original/10_TST.jsonl

✅ **Expected Output:** Sensitive data replaced with placeholders:

.. code-block:: text

   {"SUBJID": "[MRN-X7Y2A9]", "PATIENT_NAME": "[PATIENT-A4B8C3]", "DOB": "[DATE-1]", ...}
   {"SUBJID": "[MRN-K2M5P1]", "PATIENT_NAME": "[PATIENT-D9F2G7]", "DOB": "[DATE-2]", ...}

**Note:** Original → pseudonym mappings are encrypted and stored securely in
``results/deidentified/mappings/mappings.enc``.
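To spot-check that sensitive fields were actually replaced, you can scan one
record for the ``[TAG-...]`` placeholder style shown above. This is a minimal
sketch; the file path is illustrative, and the placeholder pattern is an
assumption based on the sample output in this guide:

.. code-block:: python

   import json
   import re
   from pathlib import Path

   # Placeholder pattern assumed from the sample output above,
   # e.g. [MRN-X7Y2A9], [PATIENT-A4B8C3], [DATE-1].
   placeholder = re.compile(r"\[[A-Z]+-[A-Za-z0-9]+\]")

   # Illustrative path; point this at any de-identified file.
   path = Path("results/deidentified/Indo-vap/original/10_TST.jsonl")
   with path.open() as f:
       record = json.loads(f.readline())  # first record only

   # Print every field whose value carries a placeholder.
   for key, value in record.items():
       if isinstance(value, str) and placeholder.search(value):
           print(f"{key}: {value}")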
Troubleshooting
---------------

**Problem:** "No Excel files found"

**Solution:** Check that your Excel files (.xlsx) are in
``data/dataset/<dataset_name>/``:

.. code-block:: bash

   ls data/dataset/*/

----

**Problem:** "Package 'pandas' not found"

**Solution:** Install dependencies:

.. code-block:: bash

   pip install -r requirements.txt

----

**Problem:** "Permission denied" when accessing files

**Solution:** Run with appropriate permissions:

.. code-block:: bash

   # On macOS/Linux
   chmod +x main.py
   python3 main.py

   # On Windows (run as Administrator)
   python main.py

----

**Problem:** Files are being skipped

**Solution:** This is normal! The pipeline skips files that were already
processed successfully. To reprocess, delete the output folder:

.. code-block:: bash

   rm -rf results/dataset/my_dataset/
   python3 main.py

----

**Problem:** "Validation found potential PHI" warning after de-identification

**Solution:** This is a cautious warning. Review the log file for details:

.. code-block:: bash

   cat .logs/reportalin_*.log | grep "potential PHI"

If it's a false positive (like "[MRN-ABC123]" being detected), you can safely
proceed.

Common Use Cases
----------------

**Use Case 1: Process only the data dictionary, skip extraction**

.. code-block:: bash

   python3 main.py --skip-extraction

----

**Use Case 2: Process only data extraction, skip the dictionary**

.. code-block:: bash

   python3 main.py --skip-dictionary

----

**Use Case 3: Reprocess everything from scratch**

.. code-block:: bash

   rm -rf results/
   python3 main.py

----

**Use Case 4: De-identify for multiple countries without encryption**
(testing only)

.. code-block:: bash

   python3 main.py --enable-deidentification --countries ALL --no-encryption

**⚠️ Warning:** ``--no-encryption`` should only be used for testing! Always
use encryption in production.

Next Steps
----------

✅ **You're done!** Your data has been successfully processed.

**What's next?**

1. 📊 **Analyze your data:** Use the .jsonl files with pandas, jq, or any
   JSON tool (see the sketch at the end of this page)
2. 📖 **Read the full documentation:** Learn about advanced configuration
   options
3. 🔒 **Review de-identification:** Check the audit log at
   ``results/deidentified/<dataset_name>/_deidentification_audit.json``
4. 📝 **Check logs:** Detailed operation logs are in
   ``.logs/reportalin_<timestamp>.log``

**Need help?** See the :doc:`troubleshooting` guide or review the logs for
detailed error messages.
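As a starting point for step 1 above, here is a minimal sketch that loads
every cleaned JSONL file into a single pandas DataFrame. The dataset path is
an assumption based on the examples in this guide, and the sketch assumes the
pipeline has produced at least one cleaned file:

.. code-block:: python

   from pathlib import Path

   import pandas as pd

   # Load all cleaned JSONL files into one DataFrame for analysis.
   # Files with different columns are aligned by column name; missing
   # values become NaN.
   base = Path("results/dataset/Indo-vap/cleaned")
   frames = [pd.read_json(f, lines=True) for f in sorted(base.glob("*.jsonl"))]
   df = pd.concat(frames, ignore_index=True)
   print(df.shape)
   print(df.head())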