References

What. Every regulation, standard, paper, and external resource cited in the RePORT AI Portal codebase or IRB/Auditor profile, collected in one place, each with its URL and a note on which pillar or module it backs.

Why. Regulatory traceability is a developer concern. If you’re touching the PHI scrubber’s catalog or the agent-boundary gate, you should be able to reach the primary source for HIPAA §164.514(b)(2)(i) or ICMR §11.7 from the docs in one click. If you’re adding a new rule class, you should know which regulation the rule answers.

How. Sorted by concern area (regulation / standard / technique / benchmark / tool). Each entry includes a short “used for” line pointing at the module or pillar the reference backs.

Primary Regulations 

HIPAA Privacy Rule — §164.514 De-identification 

Full text. https://www.ecfr.gov/current/title-45/subtitle-A/subchapter-C/part-164/subpart-E/section-164.514
HHS guidance. https://www.hhs.gov/hipaa/for-professionals/special-topics/de-identification/index.html
What we use it for. §164.514(b)(2)(i)(A–R) is the reference list of 18 identifier classes that the PHI Architecture catalog maps to. §164.514(b)(2)(i)(C) specifically backs the age-over-89 cap and the date-precision rules.
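
A minimal sketch of the two §164.514(b)(2)(i)(C) rules, for orientation only. The function names and the "90+" label are illustrative assumptions, not the catalog's actual identifiers; the rule itself lets ages over 89 be aggregated into a single "90 or older" category and keeps only the year of a date.

```python
from datetime import date

AGE_CAP = 90  # ages over 89 collapse into one "90 or older" category


def cap_age(age_years: int) -> str:
    """Return the age as text, collapsing anything over 89 into '90+'."""
    if age_years < 0:
        raise ValueError("age must be non-negative")
    return "90+" if age_years >= AGE_CAP else str(age_years)


def year_only(d: date) -> int:
    """Reduce a date to its year, dropping month and day."""
    return d.year
```

A pipeline that shifts dates rather than truncating them (see SANT under Techniques) would replace year_only with its own date action; the sketch only shows what the regulation's wording asks for.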

DPDPA 2023 — Digital Personal Data Protection Act 

Text. https://www.meity.gov.in/static/uploads/2024/06/2bf1f0e9f04e6fb4f8fef35e82c42aa5.pdf
DPDP Rules 2025 (notified 13 Nov 2025). https://www.meity.gov.in/static/uploads/2025/11/53450e6e5dc0bfa85ebd78686cadad39.pdf
What we use it for. India’s primary personal-data regulation. §2(t) defines “personal data”; §9 governs children’s data; §8 governs data-fiduciary obligations (minimization, retention, accuracy). The PHI catalog’s drop rules for Indian government IDs (Aadhaar / ABHA / PAN / voter / PM-JAY) anchor to DPDPA.

SPDI Rules 2011 (under IT Act §43A)

Brief. https://prsindia.org/files/bills_acts/bills_parliament/2011/IT_Rules_and_Regulations_Brief_2011.pdf
What we use it for. Still the in-force regulation until DPDPA’s substantive provisions kick in. Rule 3 defines Sensitive Personal Data or Information (SPDI); health data and biometric data are explicitly covered.

ICMR National Ethical Guidelines for Biomedical & Health Research (2017)

Text. https://main.icmr.nic.in/sites/default/files/guidelines/ICMR_Ethical_Guidelines_2017.pdf
What we use it for. §11 (confidentiality and community-level privacy) backs the suppress_small_cell action for household-contact counts; §11.7 explicitly mandates k-anonymity-style controls for cohort studies; §5 backs date-precision requirements.
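
As a rough illustration of what a small-cell suppression action does, the sketch below masks any count under a threshold. The function name, the threshold of 5 (borrowed from the k = 5 default cited under k-anonymity), and the use of None as the suppressed marker are all assumptions; the project's suppress_small_cell implementation is not reproduced on this page.

```python
from typing import Optional


def suppress_small_cell_sketch(count: int, threshold: int = 5) -> Optional[int]:
    """Return the count unchanged, or None when it falls below the threshold."""
    if count < 0:
        raise ValueError("count must be non-negative")
    # Counts below the threshold are withheld rather than rounded or bucketed.
    return count if count >= threshold else None
```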

ABDM Health Data Management Policy (NHA)

Text. https://abdm.gov.in/publications/health_data_management_policy
What we use it for. Governs ABHA (health ID) records. Referenced in the PHI catalog for any ABHA / health_id column shape.

RePORT India Common Protocol 

Project site. https://www.reportindia.org
What we use it for. The parent study protocol under which Indo-VAP runs. Dictates the 72-hour IRB notification window for PHI breaches, which the study team must encode in its breach-response runbook before production ingest.

Standards & Frameworks 

NIST SP 800-188 — De-Identifying Government Datasets 

Text. https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-188.pdf
What we use it for. §5.2 backs the integrity-chain requirement (the SHA-256 of every raw input is recorded in every row’s provenance and in the lineage manifest). §6.3–6.5 back the AMBER transient-workspace hardening (mode 0700, zero-fill on teardown).
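
The two controls named above can be sketched with the standard library alone. This is an illustration under assumptions, not the project's code: the helper names, the "amber_" prefix, and the file name are hypothetical, and overwriting a file in place does not guarantee erasure on SSDs or copy-on-write filesystems.

```python
import hashlib
import os
import tempfile


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def zero_fill_and_remove(path: str) -> None:
    """Overwrite a file with zero bytes, flush to disk, then delete it."""
    size = os.path.getsize(path)
    with open(path, "r+b") as handle:
        handle.write(b"\0" * size)  # fine for a sketch; chunk this for large files
        handle.flush()
        os.fsync(handle.fileno())
    os.remove(path)


# tempfile.mkdtemp creates a directory accessible only to the creating user
# (mode 0700 on POSIX systems).
workspace = tempfile.mkdtemp(prefix="amber_")
raw_path = os.path.join(workspace, "raw_input.bin")
with open(raw_path, "wb") as handle:
    handle.write(b"abc")

raw_digest = sha256_file(raw_path)  # the value a provenance record would carry
zero_fill_and_remove(raw_path)
os.rmdir(workspace)
```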

NIST SP 800-175B — Guideline for Using Cryptographic Standards 

Text. https://csrc.nist.gov/publications/detail/sp/800-175b/rev-1/final
What we use it for. HMAC-SHA256 as a keyed-MAC construction; key-rotation semantics for the sidecar HMAC key.

NIST SP 800-53 (SI-7 Software, Firmware, and Information Integrity)

Text. https://csrc.nist.gov/publications/detail/sp/800-53/rev-5/final
What we use it for. SI-7 backs the per-run lineage manifest hash chain.
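
For readers unfamiliar with the pattern, a hash chain links each manifest entry to the hash of the one before it, so editing any earlier entry invalidates every later hash. The sketch below assumes JSON-serializable entries and an all-zero genesis value; the entry fields and function names are hypothetical and do not describe the actual manifest format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first link


def chain_hash(prev_hash: str, entry: dict) -> str:
    """Hash the previous link's digest together with a canonical JSON entry."""
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def build_chain(entries: list) -> list:
    """Return a list of {'entry', 'hash'} links, each bound to its predecessor."""
    chain, prev = [], GENESIS
    for entry in entries:
        prev = chain_hash(prev, entry)
        chain.append({"entry": entry, "hash": prev})
    return chain


def verify_chain(chain: list) -> bool:
    """Recompute every link and report whether the stored hashes still match."""
    prev = GENESIS
    for link in chain:
        if chain_hash(prev, link["entry"]) != link["hash"]:
            return False
        prev = link["hash"]
    return True
```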

STROBE — Strengthening the Reporting of Observational Studies in Epidemiology 

Text. https://www.equator-network.org/reporting-guidelines/strobe/
What we use it for. Reporting-fidelity requirements that drive the provenance-on-every-row design; specifically checklist item 8 (data sources / measurement) and item 14 (descriptive data).

RECORD — REporting of studies Conducted using Observational Routinely-collected health Data 

Text. https://www.record-statement.org
What we use it for. Extension of STROBE for routinely-collected data (EHR, registry). §3 backs NA-preservation behaviour — clinical strings like “NR” / “NA” / “NK” must not be coerced to Python None during extraction.
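
A small sketch of the NA-preservation behaviour described above, assuming cells arrive as strings or None. The function name is hypothetical; the point is only that a genuinely empty cell becomes None while coded clinical strings survive as text.

```python
from typing import Optional


def normalize_cell(raw: Optional[str]) -> Optional[str]:
    """Map empty cells to None but keep coded strings such as 'NR', 'NA', 'NK'."""
    if raw is None:
        return None
    text = str(raw).strip()
    if text == "":
        return None
    # No lookup against a "missing value" list: 'NA' stays the string 'NA'.
    return text
```

When extraction goes through pandas, passing keep_default_na=False to read_csv is one way to get the same effect, since "NA" is among the strings pandas converts to NaN by default.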

CDISC SDTM / ODM — Clinical Data Interchange Standards 

Overview. https://www.cdisc.org/standards/foundational/sdtm
ODM. https://www.cdisc.org/standards/data-exchange/odm
What we use it for. Origin / source traceability requirements answered by the provenance dict’s source_file / sheet_name / row_index fields; ODM variable-definition shape in the study dictionary.

FDA 21 CFR Part 11 — Electronic Records + Electronic Signatures 

Text. https://www.ecfr.gov/current/title-21/chapter-I/subchapter-A/part-11
What we use it for. §11.10(e) backs the audit-trail requirement (who / what / when for every transformation). The lineage manifest plus the per-row _provenance.pipeline_version + extraction_engine fields satisfy this.
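
To make the field names concrete, here is a hedged sketch of a per-row provenance record using only the fields this page names (source_file, sheet_name, row_index, pipeline_version, extraction_engine). The TypedDict, the helper name, and the example values are assumptions; the real record may carry more fields, such as the raw-input hash.

```python
from typing import TypedDict


class Provenance(TypedDict):
    source_file: str        # file the row was extracted from
    sheet_name: str         # worksheet within that file
    row_index: int          # position of the row in the sheet
    pipeline_version: str   # version of the pipeline that produced the row
    extraction_engine: str  # extractor that read the source


def attach_provenance(row: dict, prov: Provenance) -> dict:
    """Return a copy of the row with its provenance under the '_provenance' key."""
    return {**row, "_provenance": dict(prov)}
```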

HHS Honest Broker guidance 

OHRP guidance. https://www.hhs.gov/ohrp/regulations-and-policy/guidance/research-with-protected-health-information/index.html
What we use it for. Canonical “honest broker” pattern that the four-tier architecture implements as code (raw → staging → published → agent).

Techniques 

SANT — Shift-And-Not-Truncate 

Primary citation. El Emam et al., “A method for managing re-identification risk from small geographic areas in Canada,” BMC Medical Informatics and Decision Making, 2010.
What we use it for. Per-subject constant date offset so intra-subject intervals are preserved exactly — the scripts.security.phi_scrub.date_offset_days() algorithm.
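
A minimal sketch of a per-subject constant offset, assuming it is derived from a keyed hash of the subject ID (the HMAC entry below mentions this derivation). The function names, the ±365-day window, and the key handling are assumptions; the signature of the real date_offset_days() is not documented here.

```python
import hashlib
import hmac
from datetime import date, timedelta


def sketch_date_offset_days(subject_id: str, key: bytes, max_shift: int = 365) -> int:
    """Derive a stable offset in [-max_shift, max_shift] from HMAC-SHA256."""
    digest = hmac.new(key, subject_id.encode("utf-8"), hashlib.sha256).digest()
    number = int.from_bytes(digest[:8], "big")
    return number % (2 * max_shift + 1) - max_shift


def shift_date(d: date, subject_id: str, key: bytes) -> date:
    """Apply the subject's constant offset to one of that subject's dates."""
    return d + timedelta(days=sketch_date_offset_days(subject_id, key))
```

Because every date for a given subject moves by the same number of days, the gap between any two of that subject's dates is unchanged after shifting.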

k-anonymity — Sweeney 2002 

Paper. https://epic.org/wp-content/uploads/privacy/reidentification/Sweeney_Article.pdf
What we use it for. Backs the scripts.security.kanon_gate.kanon_check() equivalence-class test at the agent boundary. Default k = 5 per ICMR §11.7.
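
The equivalence-class test can be sketched as follows: group rows by their quasi-identifier values and require every group to hold at least k rows. The function name, argument names, and row shape (a list of dicts) are assumptions and may differ from kanon_check().

```python
from collections import Counter
from typing import Iterable, Mapping, Sequence


def sketch_kanon_check(
    rows: Iterable[Mapping[str, object]],
    quasi_identifiers: Sequence[str],
    k: int = 5,
) -> bool:
    """Return True when every quasi-identifier combination occurs at least k times."""
    classes = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(size >= k for size in classes.values())
```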

l-diversity — Machanavajjhala et al. 2007 

Paper. https://dl.acm.org/doi/10.1145/1217299.1217302
What we use it for. Complement to k-anonymity that guards against sensitive-attribute homogeneity: an equivalence class can satisfy k and still leak if every record in it shares the same outcome. The agent boundary enforces both k-anonymity (k=5) and l-diversity (l=2) on every row-returning tool; see scripts.security.kanon_gate.l_diversity_check() and scripts.security.kanon_gate.guard_rows_with_kanon_and_ldiv().
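
A sketch of the simplest "distinct values" reading of l-diversity: each equivalence class must contain at least l different values of the sensitive attribute. The function and argument names are hypothetical, and the project's l_diversity_check() may use a different variant.

```python
from collections import defaultdict
from typing import Iterable, Mapping, Sequence


def sketch_l_diversity_check(
    rows: Iterable[Mapping[str, object]],
    quasi_identifiers: Sequence[str],
    sensitive: str,
    l: int = 2,
) -> bool:
    """Return True when every equivalence class has at least l distinct sensitive values."""
    values_by_class = defaultdict(set)
    for row in rows:
        group = tuple(row[q] for q in quasi_identifiers)
        values_by_class[group].add(row[sensitive])
    return all(len(values) >= l for values in values_by_class.values())
```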

HMAC (RFC 2104)

Text. https://datatracker.ietf.org/doc/html/rfc2104
What we use it for. Keyed-hash-based pseudonymization of subject IDs (scripts.security.phi_scrub.pseudo_id()) and per-subject date offset derivation.
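
Keyed-hash pseudonymization can be sketched in a few lines with the standard hmac module. The function name, the 16-character truncation, and the UTF-8 encoding of the subject ID are assumptions for illustration; pseudo_id() itself is not reproduced here.

```python
import hashlib
import hmac


def sketch_pseudo_id(subject_id: str, key: bytes, length: int = 16) -> str:
    """Return a stable pseudonym: the first `length` hex chars of HMAC-SHA256."""
    digest = hmac.new(key, subject_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]
```

The same subject ID and key always yield the same pseudonym, which keeps joins across tables working, while someone without the key cannot recompute or confirm the mapping by hashing candidate IDs.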

Benchmarks & Comparative Studies 

Microsoft Presidio benchmarks (2024-2025)

Paper / writeup. https://microsoft.github.io/presidio/evaluation/
What we use it for. Cited in ADR-004, under Architecture Decisions (ADRs), as evidence for the “rule+allowlist over Presidio” choice: 22.7% precision on mixed enterprise data and ~84% F1 on clinical notes.

John Snow Labs Clinical NER 

Site. https://www.johnsnowlabs.com
What we use it for. Reference point for what commercial clinical NER can do (~98.6% F1). Not used at runtime — documented in ADR-004 as a “what we gave up” note.

i2b2 / n2c2 de-identification shared tasks 

Site. https://n2c2.dbmi.hms.harvard.edu/data-sets
What we use it for. Corpus used to benchmark clinical de-identification systems and justify the current rule-catalog posture.

Tools & Libraries Cited in Decisions 

pdfplumber 

Site. https://github.com/jsvine/pdfplumber
What we use it for. Long-term target for local-only PDF extraction (to replace the current external-API path under ADR-006). Not yet in the runtime.

Ollama 

Site. https://ollama.com
What we use it for. Local LLM runtime used by the agent.

Reading Order for a New Contributor 

If you’re new to the project and need to come up to speed:

Read Overview for the pain narrative.
Read PHI Architecture for the four-tier + 8-action story.
Read Architecture Decisions (ADRs) in full; that is where the Why answers live.
Come back here as a reference when you need to justify or challenge an architectural choice.
Read the HIPAA §164.514(b)(2)(i) primary source and the first three sections of NIST SP 800-188 to ground the regulatory vocabulary.