scripts.extract_data module

Data Extraction Module

Extracts raw data from Excel files and converts it to JSONL format, providing intelligent type conversion, duplicate column handling, data validation, progress tracking, and comprehensive error recovery.

Key Features:
  • Dual output: Creates both original and cleaned JSONL versions

  • Duplicate column removal: Intelligently removes SUBJID2, SUBJID3, etc.

  • Type conversion: Handles pandas/numpy types, dates, NaN values

  • Integrity checks: Validates output files before skipping

  • Error recovery: Continues processing even if individual files fail

  • Progress tracking: Real-time progress bars

  • Verbose logging: Detailed tree-view logs with timing (v0.0.12+)

Verbose Mode:

When running with --verbose flag, detailed logs are generated including file-by-file processing, Excel loading details (rows/columns), duplicate column detection, and per-file timing information.

See also

  • ../user_guide/usage – Usage examples and detailed tutorials

  • extract_excel_to_jsonl – Main extraction function

  • process_excel_file – Process individual Excel files

scripts.extract_data.clean_duplicate_columns(df)[source]

Remove duplicate columns ending with numeric suffixes (e.g., SUBJID2, SUBJID3).

Only removes columns if:

  1. The column name ends with an optional underscore and digits (e.g., SUBJID2, NAME_3)

  2. The base column name exists (e.g., SUBJID, NAME)

  3. The content is identical to the base column OR the column is entirely null

This prevents accidental removal of legitimate columns that happen to end with numbers.

Return type:

DataFrame
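The three removal rules can be sketched as follows (a minimal illustration; `clean_duplicate_columns_sketch` is a hypothetical stand-in, not the module's actual implementation):

```python
import re

import pandas as pd

def clean_duplicate_columns_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative version of the three removal rules above."""
    to_drop = []
    for col in df.columns:
        # Rule 1: name ends with an optional underscore and digits
        m = re.fullmatch(r"(.+?)_?\d+", str(col))
        if not m:
            continue
        base = m.group(1)
        # Rule 2: the base column must exist
        if base not in df.columns:
            continue
        # Rule 3: content identical to the base column, or entirely null
        if df[col].isna().all() or df[col].equals(df[base]):
            to_drop.append(col)
    return df.drop(columns=to_drop)
```

With columns SUBJID, SUBJID2 (identical content) and AGE2 (no base AGE column), only SUBJID2 is dropped, which is exactly how legitimate number-suffixed columns survive.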

scripts.extract_data.clean_record_for_json(record)[source]

Convert pandas record to JSON-serializable types.

Handles NaN, infinity, numpy types, and datetime objects, ensuring all values are properly serializable to JSON format.

Parameters:

record (dict) – Dictionary with potentially non-JSON-serializable values

Return type:

dict

Returns:

Dictionary with all values converted to JSON-serializable types

Note

  • NaN values are converted to None

  • Infinity values (+inf, -inf) are converted to None

  • Numpy types are converted to Python native types

  • Datetime objects are converted to ISO format strings

scripts.extract_data.convert_dataframe_to_jsonl(df, output_file, source_filename)[source]

Convert DataFrame to JSONL format, handling empty DataFrames with column metadata.

Return type:

int

scripts.extract_data.extract_excel_to_jsonl()[source]

Extract all Excel files from dataset directory, creating original and cleaned JSONL versions.

Return type:

Dict[str, Any]

Returns:

Dictionary with extraction statistics

scripts.extract_data.find_excel_files(directory)[source]

Find all Excel files (.xlsx) in the specified directory.

Return type:

List[Path]

scripts.extract_data.process_excel_file(excel_file, output_dir)[source]

Process Excel file to JSONL format, creating both original and cleaned versions.

Return type:

Tuple[bool, int, Optional[str]]

Overview

The extract_data module converts Excel files to JSONL format with intelligent type conversion, progress tracking, and comprehensive error handling.

Enhanced in v0.0.7:

  • βœ… Added explicit public API definition via __all__ (6 exports)

  • βœ… Enhanced module docstring with comprehensive usage examples (40 lines, 790% increase)

  • βœ… Complete type hint coverage verified and robust error handling

  • βœ… Code quality verified with backward compatibility preserved

Public API (6 Exports)

The module explicitly exports these 6 functions via __all__:

  1. extract_excel_to_jsonl - Batch process all Excel files in a directory

  2. process_excel_file - Process a single Excel file

  3. find_excel_files - Find all Excel files in a directory

  4. convert_dataframe_to_jsonl - Convert DataFrame to JSONL format

  5. clean_record_for_json - Clean record for JSON serialization

  6. clean_duplicate_columns - Remove duplicate columns from DataFrame

This explicit API definition:

  • βœ… Provides clear separation of public vs internal API

  • βœ… Improves IDE autocomplete and type checking

  • βœ… Prevents accidental usage of private implementation details

  • βœ… Documents the stable, supported interface

Enhanced Documentation (v0.0.7)

The module docstring has been enhanced from 171 to 1,524 characters (40 lines), including:

Three Comprehensive Usage Examples:

  1. Basic Batch Processing - Process all Excel files in a directory

  2. Single File Processing - Process one file with error handling

  3. Custom DataFrame Conversion - Convert custom DataFrame to JSONL

Key Features Highlighted:

  • Intelligent type conversion (timestamps, NaN values, pandas types)

  • Progress tracking with file counts and rates

  • Duplicate column handling (cleaned vs original versions)

  • Comprehensive error handling and logging

For Complete Examples:

See the module docstring for ready-to-use code snippets that demonstrate:

  • Real-world file paths and directory structures

  • Error handling patterns

  • Expected output formats

  • Processing metrics (records, timing, files)

These examples can be copied directly into your code for immediate use.

Functions

extract_excel_to_jsonl

scripts.extract_data.extract_excel_to_jsonl()[source]

Extract all Excel files from dataset directory, creating original and cleaned JSONL versions.

Return type:

Dict[str, Any]

Returns:

Dictionary with extraction statistics

Batch process all Excel files. The input and output directories are taken from the project configuration (see config), so the function accepts no arguments.

Example:

from scripts.extract_data import extract_excel_to_jsonl

stats = extract_excel_to_jsonl()
print(f"Extraction statistics: {stats}")

process_excel_file

scripts.extract_data.process_excel_file(excel_file, output_dir)[source]

Process Excel file to JSONL format, creating both original and cleaned versions.

Return type:

Tuple[bool, int, Optional[str]]

Process a single Excel file.

Parameters:

  • excel_file (str): Path to the Excel file

  • output_dir (str): Directory for the output JSONL files

Returns:

  • tuple: (success, records, error) matching Tuple[bool, int, Optional[str]]:

    - success (bool): True if the file was processed successfully

    - records (int): Number of records written

    - error (str or None): Error message on failure, otherwise None

Example:

from scripts.extract_data import process_excel_file

success, records, error = process_excel_file(
    "data/dataset/Indo-vap/10_TST.xlsx",
    "results/dataset/Indo-vap"
)
if success:
    print(f"Processed {records} records")
else:
    print(f"Failed: {error}")

convert_dataframe_to_jsonl

scripts.extract_data.convert_dataframe_to_jsonl(df, output_file, source_filename)[source]

Convert DataFrame to JSONL format, handling empty DataFrames with column metadata.

Return type:

int

Convert a pandas DataFrame to JSONL format. Empty DataFrames are handled by writing column metadata instead of data records.

Parameters:

  • df (pd.DataFrame): DataFrame to convert

  • output_file (str): Path to the output JSONL file

  • source_filename (str): Original source file name (recorded as metadata)

Returns:

  • int: Number of records written

Example:

import pandas as pd
from scripts.extract_data import convert_dataframe_to_jsonl

df = pd.read_excel("input.xlsx")
n_records = convert_dataframe_to_jsonl(
    df=df,
    output_file="output.jsonl",
    source_filename="input.xlsx"
)
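A minimal sketch of how the conversion and the empty-DataFrame metadata handling might look (the metadata record's `_source` and `_columns` keys are assumptions, not the module's actual format):

```python
import json

import pandas as pd

def dataframe_to_jsonl_sketch(df, output_file, source_filename):
    """Write one JSON object per row; for an empty DataFrame, emit a single
    metadata record so the column names survive (record shape is an assumption)."""
    count = 0
    with open(output_file, "w", encoding="utf-8") as fh:
        if len(df) == 0:
            meta = {"_source": source_filename, "_columns": list(df.columns)}
            fh.write(json.dumps(meta) + "\n")
            return 0
        for raw in df.to_dict(orient="records"):
            # unwrap numpy scalars; stringify anything else json can't handle
            record = {k: (v.item() if hasattr(v, "item") else v)
                      for k, v in raw.items()}
            fh.write(json.dumps(record, default=str) + "\n")
            count += 1
    return count
```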

clean_record_for_json

scripts.extract_data.clean_record_for_json(record)[source]

Convert pandas record to JSON-serializable types.

Handles NaN, infinity, numpy types, and datetime objects, ensuring all values are properly serializable to JSON format.

Parameters:

record (dict) – Dictionary with potentially non-JSON-serializable values

Return type:

dict

Returns:

Dictionary with all values converted to JSON-serializable types

Note

  • NaN values are converted to None

  • Infinity values (+inf, -inf) are converted to None

  • Numpy types are converted to Python native types

  • Datetime objects are converted to ISO format strings

Example:

import pandas as pd
from scripts.extract_data import clean_record_for_json

record = {
    'date': pd.Timestamp('2025-01-01'),
    'value': 42,
    'missing': pd.NA
}

cleaned = clean_record_for_json(record)
# e.g. {'date': '2025-01-01T00:00:00', 'value': 42, 'missing': None}

find_excel_files

scripts.extract_data.find_excel_files(directory)[source]

Find all Excel files (.xlsx) in the specified directory.

Return type:

List[Path]

Parameters:

  • directory (str): Directory to search

Returns:

  • list: Sorted list of Excel file paths

Example:

from scripts.extract_data import find_excel_files

files = find_excel_files("data/dataset/Indo-vap")
print(f"Found {len(files)} Excel files")
for file in files:
    print(f"  - {file}")
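A minimal sketch of the discovery step (skipping Excel lock files prefixed `~$` is an assumption about the module's behavior, not documented above):

```python
from pathlib import Path
from typing import List

def find_excel_files_sketch(directory: str) -> List[Path]:
    """Return sorted .xlsx paths; ~$-prefixed Excel lock files are skipped."""
    return sorted(p for p in Path(directory).glob("*.xlsx")
                  if not p.name.startswith("~$"))
```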

is_dataframe_empty

scripts.extract_data.is_dataframe_empty(df)[source]

Check if DataFrame is completely empty (no rows AND no columns).

Return type:

bool

Parameters:

  • df (pd.DataFrame): DataFrame to check

Returns:

  • bool: True if the DataFrame has no rows and no columns, False otherwise

Example:

import pandas as pd
from scripts.extract_data import is_dataframe_empty

df1 = pd.DataFrame()
print(is_dataframe_empty(df1))  # True

df2 = pd.DataFrame({'a': [1, 2, 3]})
print(is_dataframe_empty(df2))  # False
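Because "empty" here means no rows AND no columns, the check differs from pandas' own DataFrame.empty, which is also True for a zero-row frame that still has columns. A one-line sketch:

```python
import pandas as pd

def is_dataframe_empty_sketch(df: pd.DataFrame) -> bool:
    # "Completely empty" = no rows AND no columns; df.empty alone would
    # also be True for a frame with columns but zero rows.
    return df.shape == (0, 0)
```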

Data Flow

The extraction process follows this flow:

1. extract_excel_to_jsonl(input_dir, output_dir)
   β”‚
   β”œβ”€β”€ find_excel_files(input_dir)
   β”‚   └── Returns: [file1.xlsx, file2.xlsx, ...]
   β”‚
   └── For each file:
       └── process_excel_file(file, output_dir)
           β”‚
           β”œβ”€β”€ Read Excel: pd.read_excel(file)
           β”‚
           β”œβ”€β”€ Check if empty: is_dataframe_empty(df)
           β”‚   └── If empty: Skip file
           β”‚
           └── convert_dataframe_to_jsonl(df, output_file, source)
               β”‚
               └── For each record:
                   β”œβ”€β”€ clean_record_for_json(record)
                   β”‚   └── Convert types for JSON
                   β”‚
                   └── Write to JSONL file
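The per-file loop with error recovery can be sketched as a self-contained aggregator (`process` stands in for process_excel_file and follows its Tuple[bool, int, Optional[str]] contract; the stats keys are assumptions):

```python
def extract_all_sketch(files, process):
    """Aggregate per-file results; one failing file does not stop the batch."""
    stats = {"processed": 0, "failed": 0, "records": 0, "errors": []}
    for f in files:
        success, records, error = process(f)  # Tuple[bool, int, Optional[str]]
        if success:
            stats["processed"] += 1
            stats["records"] += records
        else:
            stats["failed"] += 1
            stats["errors"].append((str(f), error))
    return stats
```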

Type Conversions

The module handles these type conversions:

Input Type      Output Type    Notes
------------    -----------    ----------
pd.Timestamp    str            ISO format
pd.NA           None           Null value
np.nan          None           Null value
float('nan')    None           Null value
int, float      Preserved      As-is
str             Preserved      As-is
bool            Preserved      As-is
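The conversions in the table can be sketched as a single dispatch function (an illustration of the rules, not the module's exact code):

```python
import math
from datetime import date, datetime

import numpy as np
import pandas as pd

def to_json_safe(value):
    """Apply the table's conversions to a single value (illustrative only)."""
    if value is None or value is pd.NA:
        return None                                   # pd.NA -> None
    if isinstance(value, (pd.Timestamp, datetime, date)):
        return value.isoformat()                      # -> ISO format string
    if isinstance(value, np.generic):
        value = value.item()                          # numpy scalar -> native
    if isinstance(value, float) and not math.isfinite(value):
        return None                                   # NaN / +-inf -> None
    return value                                      # int, float, str, bool as-is
```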

Error Handling

The module handles these error scenarios:

  1. File Not Found: Logs error and skips file

  2. Excel Read Error: Logs error with details

  3. Empty DataFrame: Skips silently (not an error)

  4. Type Conversion Error: Logs warning, converts to None

  5. Write Error: Logs error and raises exception
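The log-and-continue pattern behind scenarios 1-2 can be sketched as follows (`safe_process` and its return values are illustrative, not the module's API):

```python
import logging

logger = logging.getLogger("extract_data")

def safe_process(path, process):
    """Log and continue instead of aborting the whole batch."""
    try:
        return process(path)
    except FileNotFoundError:
        logger.error("File not found, skipping: %s", path)   # scenario 1
        return (False, 0, "not found")
    except Exception as exc:
        logger.error("Failed to process %s: %s", path, exc)  # scenario 2
        return (False, 0, str(exc))
```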

Progress Tracking

Uses tqdm for progress bars:

  • File-level progress: Processing files: 100%|β–ˆβ–ˆ| 43/43 [00:15<00:00]

  • Real-time file count and processing rate

  • Estimated time remaining
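A minimal sketch of the file-level progress loop (with a fallback so the code still runs where tqdm is not installed):

```python
try:
    from tqdm import tqdm
except ImportError:                      # fallback: plain iteration, no bar
    def tqdm(iterable, **kwargs):
        return iterable

def process_with_progress(files, process):
    """Run process() over files behind a 'Processing files' progress bar."""
    return [process(f) for f in tqdm(files, desc="Processing files", unit="file")]
```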

Performance

Typical performance:

  • 43 files: ~15-20 seconds

  • ~50,000 records: ~2-3 files/second

  • Memory usage: One file in memory at a time

See Also

  • scripts.load_dictionary.load_study_dictionary() – Dictionary processing

  • config – Configuration for paths

  • Usage Guide – Usage examples and detailed tutorials