Contributing
For Developers: Contributing to RePORTaLiN
We welcome contributions to RePORTaLiN! This guide will help you get started with development, testing, and submitting your improvements.
Current Version: |version| (October 28, 2025)
Project Status:
✅ Complete data extraction and transformation pipeline
✅ De-identification with encryption and multi-country compliance
✅ Comprehensive documentation (user guides, developer guides, API reference)
✅ Production-ready codebase with robust error handling and type safety
✅ 68% code reduction while maintaining 100% functionality
For detailed version history and feature additions, see Changelog.
Getting Started
Fork the Repository
Visit the GitHub repository and click “Fork”
Clone Your Fork
git clone https://github.com/YOUR_USERNAME/RePORTaLiN.git
cd RePORTaLiN
Set Up Development Environment
# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
Create a Branch
git checkout -b feature/your-feature-name
Version Management
Added in version 0.2.0: Hybrid version management system with automatic semantic versioning via conventional commits.
RePORTaLiN uses a hybrid version management system that combines:
Single source of truth: __version__.py
Automatic bumping: Post-commit hook for VS Code/GUI workflows
Manual control: Makefile targets and CLI scripts when needed
How Version Bumping Works
Automatic (Recommended for Most Users):
When you commit with a conventional commit message, the version is automatically bumped:
# From VS Code or command line - just commit normally!
git add .
git commit -m "feat: add new feature"
# → Post-commit hook detects "feat:" and bumps 0.2.0 → 0.3.0
# → Amends commit to include __version__.py change
Version Bump Rules:
| Commit Type | Version Bump | Example |
|---|---|---|
| fix: | Patch | 0.2.0 → 0.2.1 |
| feat: | Minor | 0.2.0 → 0.3.0 |
| feat!: or BREAKING CHANGE: | Major | 0.2.0 → 1.0.0 |
| docs:, style:, refactor:, test:, chore: | No bump | 0.2.0 → 0.2.0 |
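The decision the post-commit hook makes can be sketched roughly as follows. This is a simplified illustration of the rules in the table above, not the actual hook implementation; the function name and regular expression are hypothetical.
import re

def bump_for_commit(message: str, version: str) -> str:
    """Return the new version implied by a conventional commit message (simplified sketch)."""
    major, minor, patch = (int(p) for p in version.split("."))
    first_line = message.splitlines()[0]
    if "BREAKING CHANGE:" in message or re.match(r"^\w+(\([^)]*\))?!:", first_line):
        return f"{major + 1}.0.0"              # feat!: / BREAKING CHANGE -> major bump
    if first_line.startswith("feat"):
        return f"{major}.{minor + 1}.0"        # feat: -> minor bump
    if first_line.startswith("fix"):
        return f"{major}.{minor}.{patch + 1}"  # fix: -> patch bump
    return version                             # docs:, chore:, etc. -> no bump

# Example: bump_for_commit("feat: add CSV export", "0.2.0") returns "0.3.0"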
Manual Version Bumping:
When you need explicit control (e.g., for releases):
# Bump patch version (0.2.0 → 0.2.1)
make bump-patch
git commit -m "chore: bump version to 0.2.1"
# Bump minor version (0.2.0 → 0.3.0)
make bump-minor
git commit -m "chore: bump version to 0.3.0"
# Bump major version (0.2.0 → 1.0.0)
make bump-major
git commit -m "chore: bump version to 1.0.0"
Smart Commit (Preview Before Committing):
Use smart-commit when you want to see the version change before committing:
# Preview version bump
./scripts/utils/smart-commit "feat: add new feature"
# Shows: Current: 0.2.0 → New: 0.3.0
# Asks for confirmation before committing
Check Current Version:
# Quick version check
make show-version
# or
python main.py --version
Version Import Pattern
All modules import version from __version__.py:
# Correct: Import from __version__.py
from __version__ import __version__
# The import itself binds __version__ in this module's namespace,
# so it is re-exported at the module level automatically.
This ensures version consistency across:
CLI output (--version flag)
Module __version__ attributes
Sphinx documentation (docs/sphinx/conf.py)
Package metadata
Never hardcode versions in module files - always import from __version__.py.
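For reference, the single source of truth needs little more than the version string itself. A minimal sketch of what __version__.py might contain (the actual file in the repository may hold additional metadata):
# __version__.py - single source of truth for the project version
__version__ = "0.3.0"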
Development Workflow
Making Changes
Make your changes in your feature branch
Follow the Coding Standards below
Add or update tests as needed
Update documentation if needed
Ensure all tests pass
# Run tests (if available)
make test
# Clean build artifacts
make clean
# Test the pipeline
python main.py
Commit Guidelines
Changed in version 0.2.0: RePORTaLiN now uses Conventional Commits with automatic semantic versioning. Version bumps are handled automatically via post-commit hook.
Use Conventional Commits for automatic semantic versioning:
# Patch bump (0.2.0 → 0.2.1) - Bug fixes
✅ fix: correct date conversion bug in extract_data.py
✅ fix(deidentify): handle missing PHI patterns gracefully
# Minor bump (0.2.0 → 0.3.0) - New features
✅ feat: add CSV output format support
✅ feat(cli): add --verbose flag for DEBUG logging
# Major bump (0.2.0 → 1.0.0) - Breaking changes
✅ feat!: redesign configuration file structure
✅ feat: remove deprecated --legacy-mode flag
BREAKING CHANGE: Configuration now uses YAML instead of JSON
# No version bump - Documentation, refactoring, etc.
✅ docs: update README with new examples
✅ refactor: simplify table detection algorithm
✅ chore: update dependencies
Conventional Commit Format:
<type>[optional scope][optional !]: <description>
[optional body]
[optional footer(s)]
Commit Types:
feat: New feature (→ Minor bump)
fix: Bug fix (→ Patch bump)
feat!: or BREAKING CHANGE: Breaking change (→ Major bump)
docs: Documentation only (no version bump)
style: Code style/formatting (no version bump)
refactor: Code refactoring (no version bump)
test: Add/update tests (no version bump)
chore: Maintenance tasks (no version bump)
How It Works:
Commit normally from VS Code or CLI with a conventional commit message
Post-commit hook automatically detects the commit type
Version is bumped in __version__.py based on the commit type
Commit is amended to include the version change
Final commit contains both your changes AND the version bump
Examples:
# Option 1: VS Code (recommended)
# Just commit normally - version bumps automatically!
git add .
git commit -m "feat: add CSV export" # → Auto-bumps to 0.3.0
# Option 2: CLI with preview (smart-commit)
./scripts/utils/smart-commit "feat: add CSV export" # Shows version before commit
# Option 3: Manual version bump
make bump-minor # Bump minor version manually
git commit -m "chore: bump version"
Good Examples:
✅ feat: add support for CSV output format
✅ fix: correct date parsing in extract_data.py
✅ docs: update configuration documentation
✅ feat(deidentify): add encryption support
✅ fix(cli)!: change --output flag to --output-dir
BREAKING CHANGE: --output flag renamed for clarity
Bad Examples:
❌ Update (too vague, no type)
❌ Fix bug (no description, no type)
❌ Changes (meaningless)
❌ Added feature (wrong tense, no type)
Coding Standards
Python Style
Follow PEP 8 guidelines:
Use 4 spaces for indentation
Max line length: 100 characters (flexible for readability)
Use descriptive variable names
Add docstrings to all public functions
Example:
def process_data(input_file: str, output_dir: str) -> dict:
    """
    Process a single data file.

    Args:
        input_file: Path to input Excel file
        output_dir: Directory for output JSONL file

    Returns:
        Dictionary with processing results

    Raises:
        FileNotFoundError: If input_file doesn't exist
    """
    # Implementation here
    pass
Documentation
Use Google-style docstrings:
def my_function(param1: str, param2: int = 0) -> bool:
    """
    Brief description of function.

    Longer description with more details about what the function
    does and why it exists.

    Args:
        param1 (str): Description of param1
        param2 (int, optional): Description of param2. Defaults to 0.

    Returns:
        bool: Description of return value

    Raises:
        ValueError: When param1 is empty
        TypeError: When param2 is negative

    Example:
        >>> result = my_function("test", 5)
        >>> print(result)
        True

    Note:
        Any important notes about usage

    See Also:
        :func:`related_function`: Related functionality
    """
    pass
Building Documentation
Added in version 0.3.0: make docs-watch for auto-rebuild on file changes.
The project uses Sphinx for documentation with autodoc enabled. Documentation is automatically extracted from Python docstrings when you build the docs.
Build Commands:
# Build HTML documentation (manual)
make docs
# Build and open in browser
make docs-open
# Auto-rebuild on file changes (requires sphinx-autobuild)
make docs-watch
Auto-Rebuild Workflow (Recommended for documentation development):
Install sphinx-autobuild (already in requirements.txt): pip install -r requirements.txt
Start the auto-rebuild server: make docs-watch
Open http://127.0.0.1:8000 in your browser
Edit any .rst file or Python docstring - changes appear automatically!
What Gets Auto-Generated:
All Python module documentation (via .. automodule:: directives)
Function signatures with type hints
Class hierarchies and methods
Cross-references between modules
Best Practices:
Always update docstrings when changing function signatures
Run make docs before committing to catch documentation errors
Use auto-rebuild during development for instant feedback
Check that autodoc picks up your changes correctly
Note: Documentation does NOT rebuild automatically on every code change by default.
You must explicitly run make docs or use make docs-watch for auto-rebuild.
Code Organization
One class/major function per file (for large implementations)
Related utility functions can be grouped
Keep functions focused (single responsibility)
Limit function length (prefer < 50 lines)
Example structure:
# module.py
"""
Module docstring explaining purpose.
"""
import standard_library
import third_party
import local_modules

# Constants
MAX_RETRIES = 3
DEFAULT_TIMEOUT = 30

# Main functions
def public_function():
    """Public API function."""
    pass

def _private_helper():
    """Private helper function."""
    pass
Error Handling
Changed in version 0.3.0: Logging module now uses specific exceptions (ValueError) instead of generic Exception.
Changed in version 0.3.0: De-identification module demonstrates robust error handling with 9 try/except blocks for cryptography imports, country regulations, pattern loading, mapping I/O, and file processing.
Use appropriate exception handling:
# Good: Specific exception handling
try:
    data = read_file(path)
except FileNotFoundError:
    log.error(f"File not found: {path}")
    raise
except PermissionError:
    log.error(f"Permission denied: {path}")
    raise
Best Practices for Error Handling:
Optional Dependency Handling:
# From deidentify.py - handling optional cryptography
try:
    from cryptography.fernet import Fernet
    CRYPTO_AVAILABLE = True
except ImportError:
    CRYPTO_AVAILABLE = False
    logging.warning("cryptography package not available. Encryption disabled.")
This pattern allows graceful degradation when optional dependencies are missing.
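Code that depends on the optional feature can then check the flag before using it. A hypothetical sketch (the function name and its presence in deidentify.py are illustrative assumptions, not the actual API):
def encrypt_mapping(data: bytes, key: bytes) -> bytes:
    """Encrypt mapping bytes when cryptography is available; otherwise return them unchanged."""
    if not CRYPTO_AVAILABLE:
        logging.warning("Encryption requested but cryptography is not installed; skipping.")
        return data
    return Fernet(key).encrypt(data)  # key must be a Fernet-compatible base64 key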
File I/O Error Handling:
# From deidentify.py - mapping storage
try:
    with open(self.storage_path, 'rb') as f:
        data = f.read()
    # Process data...
except FileNotFoundError:
    # Expected on first run
    return
except Exception as e:
    logging.error(f"Failed to load mappings: {e}")
    self.mappings = {}
Batch Processing with Granular Error Handling:
# From deidentify.py - dataset processing
for jsonl_file in files:
    try:
        # Process file...
        files_processed += 1
    except FileNotFoundError:
        files_failed += 1
        tqdm.write(f"✗ File not found: {jsonl_file}")
    except json.JSONDecodeError as e:
        files_failed += 1
        tqdm.write(f"✗ JSON error: {str(e)}")
    except Exception as e:
        files_failed += 1
        tqdm.write(f"✗ Error: {str(e)}")
This ensures one file’s error doesn’t stop the entire batch.
Re-raising After Logging:
# Critical errors should be re-raised after logging
try:
    self.storage_path.parent.mkdir(parents=True, exist_ok=True)
    # Save data...
except Exception as e:
    logging.error(f"Failed to save mappings: {e}")
    raise  # Re-raise to signal failure to caller
Public API Definition
Added in version 0.3.0: All utility modules now define explicit public APIs using __all__.
Define __all__ to explicitly declare your module’s public API:
# At the top of your module (after imports)
__all__ = [
# Enums
'MyEnum',
# Data Classes
'MyDataClass',
# Classes
'MyMainClass',
'MyHelperClass',
# Functions
'my_public_function',
'validate_data',
]
Benefits:
Prevents accidental exposure of internal implementation
Improves IDE autocomplete and import suggestions
Makes API surface explicit and maintainable
Helps with API versioning and deprecation
Example from De-identification Module:
__all__ = [
    # Enums
    'PHIType',
    # Data Classes
    'DetectionPattern',
    'DeidentificationConfig',
    # Core Classes
    'PatternLibrary',
    'PseudonymGenerator',
    'DateShifter',
    'MappingStore',
    'DeidentificationEngine',
    # Top-level Functions
    'deidentify_dataset',
    'validate_dataset',
]
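One practical effect of defining __all__: a wildcard import only picks up the listed names, so internal helpers stay out of consumers' namespaces. For illustration (the import path and the private helper name are assumptions):
# Only names listed in __all__ are exported by a wildcard import
from deidentify import *   # gets PHIType, DeidentificationEngine, deidentify_dataset, ...
# Anything not in __all__ (e.g., a hypothetical _build_regex helper) must be imported explicitly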
Return Type Annotations
Changed in version 0.3.0: All functions now include explicit return type annotations, including -> None for
functions that don’t return values.
Always include return type annotations:
# Good: Explicit return types
def process_data(data: Dict[str, Any]) -> List[str]:
    """Process data and return results."""
    return []

def save_results(path: Path, data: Dict) -> None:
    """Save results to file. Returns nothing."""
    with open(path, 'w') as f:
        json.dump(data, f)

# Avoid: Missing return type
def unclear_function(x):  # What does this return?
    pass