References
What. Every regulation, standard, paper, and external resource cited in the RePORT AI Portal codebase or IRB/Auditor profile, collected in one place with URLs and a line on which pillar / module they back.
Why. Regulatory traceability is a developer concern. If you’re touching the PHI scrubber’s catalog or the agent-boundary gate, you should be able to reach the primary source for HIPAA §164.514(b)(2)(i) or ICMR §11.7 from the docs in one click. If you’re adding a new rule class, you should know which regulation the rule answers.
How. Sorted by concern area (regulation / standard / technique / benchmark). Each entry includes a short “used for” line pointing at the module or pillar the reference backs.
Primary Regulations
HIPAA Privacy Rule — §164.514 De-identification
Full text. https://www.ecfr.gov/current/title-45/subtitle-A/subchapter-C/part-164/subpart-E/section-164.514
HHS guidance. https://www.hhs.gov/hipaa/for-professionals/special-topics/de-identification/index.html
What we use it for. §164.514(b)(2)(i)(A–R) is the reference list of 18 identifier classes that the PHI Architecture catalog maps to. §164.514(b)(2)(i)(C) specifically backs the age-over-89 cap and the date-precision rules.
DPDPA 2023 — Digital Personal Data Protection Act
Text. https://www.meity.gov.in/static/uploads/2024/06/2bf1f0e9f04e6fb4f8fef35e82c42aa5.pdf
DPDP Rules 2025 (notified 13 Nov 2025). https://www.meity.gov.in/static/uploads/2025/11/53450e6e5dc0bfa85ebd78686cadad39.pdf
What we use it for. India’s primary personal-data regulation. §2(t) defines “personal data”; §9 governs children’s data; §8 governs data-fiduciary obligations (minimization, retention, accuracy). The PHI catalog’s drop rules for Indian government IDs (Aadhaar / ABHA / PAN / voter / PM-JAY) anchor to DPDPA.
SPDI Rules 2011 (under IT Act §43A)
Brief. https://prsindia.org/files/bills_acts/bills_parliament/2011/IT_Rules_and_Regulations_Brief_2011.pdf
What we use it for. Still the in-force regulation until DPDPA’s substantive provisions kick in. Rule 3 defines Sensitive Personal Data or Information (SPDI); health data and biometric data are explicitly covered.
Aadhaar Act 2016 — §29 restrictions on sharing identity information
Text. https://uidai.gov.in/images/Aadhaar_Act_2016_as_amended.pdf
What we use it for. Justifies the aggressive drop rule for any Aadhaar-shaped identifier (12-digit,
1234 5678 9012) in both the scrub catalog and the phi_patterns BLOCKING regex list.
ICMR National Ethical Guidelines for Biomedical & Health Research (2017)
Text. https://main.icmr.nic.in/sites/default/files/guidelines/ICMR_Ethical_Guidelines_2017.pdf
What we use it for. §11 (confidentiality and community-level privacy) backs the suppress_small_cell action for household-contact counts; §11.7 explicitly mandates k-anonymity-style controls for cohort studies; §5 backs date-precision requirements.
ABDM Health Data Management Policy (NHA)
Text. https://abdm.gov.in/publications/health_data_management_policy
What we use it for. Governs ABHA (health ID) records. Referenced in the PHI catalog for any
ABHA/health_idcolumn shape.
RePORT India Common Protocol
Project site. https://www.reportindia.org
What we use it for. The parent study protocol under which Indo-VAP runs. Dictates the 72-hour IRB notification window for PHI breaches, which the study team must encode in its breach-response runbook before production ingest.
Standards & Frameworks
NIST SP 800-188 — De-Identifying Government Datasets
Text. https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-188.pdf
What we use it for. §5.2 backs the integrity-chain requirement (SHA-256 of every raw input in every row’s provenance + in the lineage manifest). §6.3-6.5 backs the AMBER transient-workspace hardening (mode 0700, zero-fill on teardown).
NIST SP 800-175B — Guideline for Using Cryptographic Standards
Text. https://csrc.nist.gov/publications/detail/sp/800-175b/rev-1/final
What we use it for. HMAC-SHA256 as a keyed-MAC construction; key-rotation semantics for the sidecar HMAC key.
NIST SP 800-53 (SI-7 Software, Firmware, and Information Integrity)
Text. https://csrc.nist.gov/publications/detail/sp/800-53/rev-5/final
What we use it for. SI-7 backs the per-run lineage manifest hash chain.
STROBE — Strengthening the Reporting of Observational Studies in Epidemiology
Text. https://www.equator-network.org/reporting-guidelines/strobe/
What we use it for. Reporting-fidelity requirements that drive the provenance-on-every-row design; specifically §6 (data sources + measurement) and §14 (descriptive data).
RECORD — REporting of studies Conducted using Observational Routinely-collected health Data
What we use it for. Extension of STROBE for routinely-collected data (EHR, registry). §3 backs NA-preservation behaviour — clinical strings like “NR” / “NA” / “NK” must not be coerced to Python None during extraction.
CDISC SDTM / ODM — Clinical Data Interchange Standards
What we use it for. Origin / source traceability requirements answered by the provenance dict’s
source_file/sheet_name/row_indexfields; ODM variable-definition shape in the study dictionary.
FDA 21 CFR Part 11 — Electronic Records + Electronic Signatures
Text. https://www.ecfr.gov/current/title-21/chapter-I/subchapter-A/part-11
What we use it for. §11.10(e) backs the audit-trail requirement (who / what / when for every transformation). The lineage manifest plus the per-row
_provenance.pipeline_version+extraction_enginefields satisfy this.
HHS Honest Broker guidance
OHRP guidance. https://www.hhs.gov/ohrp/regulations-and-policy/guidance/research-with-protected-health-information/index.html
What we use it for. Canonical “honest broker” pattern that the four-tier architecture implements as code (raw → staging → published → agent).
Techniques
SANT — Shift-And-Not-Truncate
Primary citation. El Emam et al., “A method for managing re-identification risk from small geographic areas in Canada,” BMC Medical Informatics and Decision Making, 2010.
What we use it for. Per-subject constant date offset so intra-subject intervals are preserved exactly — the
scripts.security.phi_scrub.date_offset_days()algorithm.
k-anonymity — Sweeney 2002
Paper. https://epic.org/wp-content/uploads/privacy/reidentification/Sweeney_Article.pdf
What we use it for. Backs the
scripts.security.kanon_gate.kanon_check()equivalence-class test at the agent boundary. Default k = 5 per ICMR §11.7.
HMAC (RFC 2104)
What we use it for. Keyed-hash-based pseudonymization of subject IDs (
scripts.security.phi_scrub.pseudo_id()) and per-subject date offset derivation.
Benchmarks & Comparative Studies
Microsoft Presidio benchmarks (2024-2025)
Paper / writeup. https://microsoft.github.io/presidio/evaluation/
What we use it for. Cited in Architecture Decisions (ADRs) ADR-004 as evidence for the “rule+allowlist over Presidio” choice — 22.7% precision in mixed enterprise data, ~84% F1 on clinical notes.
John Snow Labs Clinical NER
What we use it for. Reference point for what commercial clinical NER can do (~98.6% F1). Not used at runtime — documented in ADR-004 as a “what we gave up” note.
Tools & Libraries Cited in Decisions
pdfplumber
What we use it for. Long-term target for local-only PDF extraction (to replace the current external-API path under ADR-006). Not yet in the runtime.
Ollama
Site. https://ollama.com
What we use it for. Local LLM runtime used by the agent.
Reading Order for a New Contributor
If you’re new to the project and need to come up to speed:
Read Overview for the pain narrative.
Read PHI Architecture for the four-tier + 8-action story.
Read Architecture Decisions (ADRs) in full — the Why answers are here.
Come back here as a reference when you need to justify or challenge an architectural choice.
Read the HIPAA §164.514(b)(2)(i) primary source and the NIST SP 800-188 first three sections to ground the regulatory vocabulary.