Usage Guide
===========

**For Users: Working with RePORTaLiN**

This guide shows you different ways to use RePORTaLiN for your daily tasks.

Basic Usage
-----------

Run Complete Pipeline
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   python main.py

This executes both pipeline steps:

1. Data dictionary loading
2. Data extraction

Skip Specific Steps
~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   # Skip dictionary loading (useful if already processed)
   python main.py --skip-dictionary

   # Skip data extraction
   python main.py --skip-extraction

   # Run neither (useful for testing configuration)
   python main.py --skip-dictionary --skip-extraction

Detailed Logging Mode
~~~~~~~~~~~~~~~~~~~~~

Want to see exactly what's happening during processing? Use verbose mode to get
detailed tree-view logs for each step:

.. code-block:: bash

   # Enable detailed logging
   python main.py --verbose

   # Short form
   python main.py -v

   # With privacy protection enabled
   python main.py -v --enable-deidentification --countries IN US

**Verbose Mode Output**: When running with ``--verbose``, you get detailed
step-by-step logs in the format::

   DEBUG: ├─ Processing: Data dictionary loading (N sheets)
   DEBUG: ├─ Total sheets: N
   DEBUG: ├─ Sheet 1/N: 'sheet_name'
   DEBUG: │ ├─ Tables detected: 2
   DEBUG: │ ├─ Table 1 (sheet_table_1)
   DEBUG: │ │ ├─ Rows: 100
   DEBUG: │ │ ├─ Columns: 25
   DEBUG: │ │ ├─ Saved to: /path/to/output.jsonl
   DEBUG: │ │ └─ ⏱ Table processing time: 0.23s
   DEBUG: └─ ⏱ Overall processing time: 2.45s

**What you'll see in the log file:**

- **File-level details**: Which files are being processed and their locations
- **Sheet/Table details**: How many sheets/tables found, rows/columns per table
- **Processing phases**: Load, save, clean, validate operations
- **Metrics**: Counts, totals, and statistics for each operation
- **Timing information**: Duration for each step and overall processing time
- **Error context**: Detailed error messages with processing state
- **Progress updates**: Real-time updates for large datasets

**Where to find logs:** Look in the ``.logs/`` folder for files named
``reportalin_YYYYMMDD_HHMMSS.log``

**Impact on speed:** Minimal - your processing will be just as fast

**Log File Structure**: The log file captures all verbose output with timestamps:

.. code-block:: text

   2024-10-23 14:30:45,123 - reportalin - DEBUG - ├─ Processing: Data dictionary loading (43 sheets)
   2024-10-23 14:30:45,125 - reportalin - DEBUG - ├─ Total sheets: 43
   2024-10-23 14:30:45,200 - reportalin - DEBUG - ├─ Sheet 1/43: 'TST Screening'
   2024-10-23 14:30:45,250 - reportalin - DEBUG - │ ├─ Tables detected: 1
   2024-10-23 14:30:45,260 - reportalin - DEBUG - │ ├─ Table 1 (TST Screening_table)
   2024-10-23 14:30:45,270 - reportalin - DEBUG - │ │ ├─ Rows: 50
   2024-10-23 14:30:45,271 - reportalin - DEBUG - │ │ ├─ Columns: 12
   2024-10-23 14:30:45,280 - reportalin - DEBUG - │ │ └─ ⏱ Table processing time: 0.08s

.. note::

   Console output remains unchanged (only SUCCESS/ERROR messages shown). All
   verbose output goes to the log file for detailed analysis without cluttering
   the terminal.
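If you are unsure which log belongs to the run you just finished, the sketch
below locates the newest log file and prints its final lines. It is a minimal
example, assuming only the ``reportalin_*.log`` naming shown above:

.. code-block:: python

   import glob
   import os

   # Find the newest RePORTaLiN log file in .logs/ (assumes at least one
   # reportalin_*.log file exists) and show its last few lines.
   latest = max(glob.glob('.logs/reportalin_*.log'), key=os.path.getmtime)
   print(f"Newest log: {latest}")

   with open(latest, encoding='utf-8') as fh:
       for line in fh.readlines()[-10:]:
           print(line.rstrip())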
Working with Multiple Datasets
------------------------------

RePORTaLiN can process different datasets by simply changing the data directory:

Scenario 1: Sequential Processing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Process multiple datasets one at a time:

.. code-block:: bash

   # Process first dataset
   # Ensure data/dataset/ contains only Indo-vap_csv_files
   python main.py

   # Move results to backup
   mv results/dataset/Indo-vap results/dataset/Indo-vap_backup

   # Process second dataset
   # Replace data/dataset/ contents with new dataset
   python main.py

Scenario 2: Parallel Processing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use separate project directories for parallel processing:

.. code-block:: bash

   # Terminal 1
   cd /path/to/RePORTaLiN_project1
   python main.py

   # Terminal 2
   cd /path/to/RePORTaLiN_project2
   python main.py

De-identification Workflows
---------------------------

Running De-identification
~~~~~~~~~~~~~~~~~~~~~~~~~

Enable de-identification in the main pipeline:

.. code-block:: bash

   # Basic de-identification (uses default: India)
   python main.py --enable-deidentification

   # Specify countries
   python main.py --enable-deidentification --countries IN US ID

   # Use all supported countries
   python main.py --enable-deidentification --countries ALL

   # Disable encryption (testing only - NOT recommended)
   python main.py --enable-deidentification --no-encryption

Country-Specific De-identification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The system supports 14 countries with specific privacy regulations:

.. code-block:: bash

   # India (default)
   python main.py --enable-deidentification --countries IN

   # Multiple countries (for international studies)
   python main.py --enable-deidentification --countries IN US ID BR

   # All countries (detects identifiers from all 14 supported countries)
   python main.py --enable-deidentification --countries ALL

Supported countries: US, EU, GB, CA, AU, IN, ID, BR, PH, ZA, KE, NG, GH, UG

For detailed information, see :doc:`country_regulations`.

De-identification Output Structure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The de-identified data maintains the same directory structure:

.. code-block:: text

   results/deidentified/Indo-vap/
   ├── original/
   │   ├── 10_TST.jsonl                  # De-identified original files
   │   ├── 11_IGRA.jsonl
   │   └── ...
   ├── cleaned/
   │   ├── 10_TST.jsonl                  # De-identified cleaned files
   │   ├── 11_IGRA.jsonl
   │   └── ...
   └── _deidentification_audit.json      # Audit log

   results/deidentified/mappings/
   └── mappings.enc                      # Encrypted mapping table

Standalone De-identification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can also run de-identification separately:

.. code-block:: bash

   # De-identify existing dataset
   python -m scripts.deidentify \
       --input-dir results/dataset/Indo-vap \
       --output-dir results/deidentified/Indo-vap \
       --countries IN US

   # List supported countries
   python -m scripts.deidentify --list-countries

   # Validate de-identified output
   python -m scripts.deidentify \
       --input-dir results/dataset/Indo-vap \
       --output-dir results/deidentified/Indo-vap \
       --validate

Working with De-identified Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import pandas as pd

   # Read de-identified file
   df = pd.read_json('results/deidentified/Indo-vap/cleaned/10_TST.jsonl', lines=True)

   # PHI/PII has been replaced with pseudonyms
   print(df.head())
   # Shows: [PATIENT-X7Y2], [SSN-A4B8], [DATE-1], etc.
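To spot-check that identifiers really were replaced, you can count the bracketed
pseudonym placeholders in a de-identified file. A minimal sketch, assuming the
``[TYPE-CODE]`` placeholder format shown above:

.. code-block:: python

   import pandas as pd

   # Count bracketed pseudonym placeholders (e.g., [PATIENT-X7Y2], [DATE-1])
   # per column as a quick sanity check that replacement happened.
   df = pd.read_json('results/deidentified/Indo-vap/cleaned/10_TST.jsonl', lines=True)

   pattern = r'\[[A-Z]+-[A-Z0-9]+\]'   # assumed placeholder format
   for col in df.columns:
       n = int(df[col].astype(str).str.count(pattern).sum())
       if n:
           print(f"{col}: {n} placeholders")

A column reporting zero placeholders is not necessarily a problem; it may simply
contain no identifying values.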
For complete de-identification documentation, see :doc:`deidentification`.

Understanding Progress Output
-----------------------------

Progress Bars and Status Messages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

RePORTaLiN provides real-time feedback during processing using progress bars:

.. code-block:: text

   Processing Files: 100%|████████████████| 43/43 [00:15<00:00, 2.87files/s]
   ✓ Processing 10_TST.xlsx: 1,234 rows
   ✓ Processing 11_IGRA.xlsx: 2,456 rows
   ...

   Summary:
   --------
   Successfully processed: 43 files
   Total records: 50,123
   Time elapsed: 15.2 seconds

**Key Features**:

- **tqdm progress bars**: Show percentage, speed, and time remaining
- **Clean output**: Status messages use ``tqdm.write()`` to avoid interfering with progress bars
- **Real-time updates**: Instant feedback on current operation
- **Summary statistics**: Final counts and timing information

**Modules with Progress Tracking**:

1. **Data Dictionary Loading** (``load_dictionary.py``):

   - Progress bar for processing sheets
   - Status messages for each table extracted
   - Summary of tables created

2. **Data Extraction** (``extract_data.py``):

   - Progress bar for files being processed
   - Per-file row counts
   - Final summary with totals

3. **De-identification** (``deidentify.py``):

   - Progress bar for batch processing
   - Detection statistics per file
   - Final summary with replacement counts

**Note**: Progress bars require the ``tqdm`` library, which is installed
automatically with ``pip install -r requirements.txt``.

Verbose Logging Details
-----------------------

Understanding Verbose Output
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each pipeline step (Dictionary Loading, Data Extraction, De-identification)
produces detailed tree-view logs when ``--verbose`` is enabled.

**Step 0: Data Dictionary Loading**

Verbose output shows the processing of Excel sheets and detected tables:

.. code-block:: text

   ├─ Processing: dictionary_file.xlsx (43 sheets)
   │ ├─ Total sheets: 43
   │ ├─ Sheet 1/43: 'TST Screening v1.0'
   │ │ ├─ Loading Excel file
   │ │ │ ├─ Rows: 45
   │ │ │ ├─ Columns: 15
   │ │ ├─ Tables detected: 1
   │ │ ├─ Table 1 (TST_Screening_table)
   │ │ │ ├─ Rows: 44
   │ │ │ ├─ Columns: 15
   │ │ │ ├─ Saved to: /path/to/TST_Screening/TST_Screening_table.jsonl
   │ │ │ └─ ⏱ Table processing time: 0.15s
   │ │ └─ ⏱ Sheet processing time: 0.18s
   │ └─ ⏱ Overall processing time: 45.23s

**What this tells you:**

- Total sheets and which sheet is being processed
- Number of tables found in each sheet
- Rows/columns for each table
- Time taken for each table and sheet
- Output file locations

**Step 1: Data Extraction**

Verbose output shows file processing with duplicate column removal:

.. code-block:: text

   ├─ Processing: Data extraction (65 files)
   │ ├─ Total files to process: 65
   │ ├─ File 1/65: 10_TST.xlsx
   │ │ ├─ Loading Excel file
   │ │ │ ├─ Rows: 412
   │ │ │ ├─ Columns: 28
   │ │ ├─ Saving original version
   │ │ │ ├─ Created: 10_TST.jsonl (412 records)
   │ │ ├─ Cleaning duplicate columns
   │ │ │ ├─ Marking SUBJID2 for removal (duplicate of SUBJID)
   │ │ │ ├─ Keeping NAME_ALT (different from NAME)
   │ │ │ ├─ Removed 3 duplicate columns: SUBJID2, AGE_2, PHONE_3
   │ │ ├─ Saving cleaned version
   │ │ │ ├─ Created: 10_TST.jsonl (412 records)
   │ │ │ └─ ⏱ Total processing time: 0.45s
   │ ├─ File 2/65: 11_IGRA.xlsx
   │ │ ...
   │ └─ ⏱ Overall extraction time: 32.15s

**What this tells you:**

- Which file is being processed and current progress
- Rows/columns in the source file
- Duplicate column detection and removal
- Records created for original and cleaned versions
- Processing time per file
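If you want to confirm which duplicate columns were dropped for a particular
file, you can diff the original and cleaned outputs directly. A minimal sketch;
the ``original/`` and ``cleaned/`` paths below are illustrative and should be
adjusted to match your extraction output layout:

.. code-block:: python

   import pandas as pd

   # Compare column sets between the original and cleaned extraction outputs
   # to see which duplicate columns were removed (paths are illustrative).
   original = pd.read_json('results/dataset/Indo-vap/original/10_TST.jsonl', lines=True)
   cleaned = pd.read_json('results/dataset/Indo-vap/cleaned/10_TST.jsonl', lines=True)

   removed = sorted(set(original.columns) - set(cleaned.columns))
   print(f"Columns removed during cleaning: {removed}")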
**Step 2: De-identification** (when ``--enable-deidentification`` is used)

Verbose output shows de-identification and validation details:

.. code-block:: text

   ├─ Processing: De-identification (65 files)
   │ ├─ Total files to process: 65
   │ ├─ File 1/65: results/dataset/original/10_TST.jsonl
   │ │ ├─ Reading and de-identifying records
   │ │ │ ├─ Processed 100 records...
   │ │ │ ├─ Processed 200 records...
   │ │ │ ├─ Processed 412 records
   │ │ ├─ Records processed: 412
   │ │ └─ ⏱ File processing time: 0.89s
   │ ├─ File 2/65: results/dataset/original/11_IGRA.jsonl
   │ │ ...
   │ └─ ⏱ Overall de-identification time: 78.34s
   │
   ├─ Processing: Dataset validation (65 files)
   │ ├─ Total files to validate: 65
   │ ├─ File 1/65: 10_TST.jsonl
   │ │ ├─ Records validated: 412
   │ │ └─ ⏱ File validation time: 0.12s
   │ └─ ⏱ Overall validation time: 8.45s

**What this tells you:**

- De-identification progress with record counts
- Validation results per file
- Processing and validation times
- Any PHI/PII issues detected during validation

**Analyzing Log Files**

When processing completes, analyze the log file in ``.logs/``:

.. code-block:: bash

   # View the latest log file
   tail -f .logs/reportalin_*.log

   # Search for specific errors
   grep "ERROR" .logs/reportalin_*.log

   # Count operations
   grep "✓ Complete" .logs/reportalin_*.log | wc -l

   # Extract timing information
   grep "⏱" .logs/reportalin_*.log

**Performance Tuning**:

Use verbose logs to identify bottlenecks (a small log-parsing sketch at the end
of this page shows one way to rank the slowest operations):

- If table processing is slow: Check for large tables or memory issues
- If file extraction is slow: Check for duplicate column detection overhead
- If de-identification is slow: Check for slow pattern matching or encryption

See Also
--------

For additional information:

- :doc:`quickstart`: Quick start guide
- :doc:`configuration`: Configuration options
- :doc:`deidentification`: Complete de-identification guide
- :doc:`country_regulations`: Country-specific privacy regulations
- :doc:`troubleshooting`: Common issues and solutions
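The log-parsing sketch referenced under **Performance Tuning**: it totals the
``⏱`` timing lines from the newest log file and prints the slowest categories
first. This is a minimal example, assuming the ``⏱ <label>: <seconds>s`` format
shown in the verbose log excerpts above:

.. code-block:: python

   import glob
   import os
   import re
   from collections import defaultdict

   # Sum the "⏱ <label>: <seconds>s" timing lines from the newest log file
   # and print the slowest categories first (assumes the verbose log format
   # shown in the examples above).
   latest = max(glob.glob('.logs/reportalin_*.log'), key=os.path.getmtime)
   timing = re.compile(r'⏱ (?P<label>[^:]+): (?P<secs>[\d.]+)s')

   totals = defaultdict(float)
   with open(latest, encoding='utf-8') as fh:
       for line in fh:
           match = timing.search(line)
           if match:
               totals[match.group('label').strip()] += float(match.group('secs'))

   for label, secs in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
       print(f"{secs:8.2f}s  {label}")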