Data Pipeline
=============

This page explains what happens when a user loads a study. It avoids
implementation detail; developers should see :doc:`../developer_guide/architecture`
and :doc:`../developer_guide/operations`.

What "Load Study" Means
-----------------------

Loading a study turns local raw study files into a published study bundle
that the assistant can query. At a high level, the portal:

1. reads the study datasets, data dictionary, and optional annotated PDFs;
2. stages extracted files in a temporary workspace;
3. applies the PHI-scrub rules to dataset fields;
4. cleans and aligns the study artifacts;
5. publishes the scrubbed bundle under ``output/{STUDY}/trio_bundle/``;
6. writes audit files under ``output/{STUDY}/audit/``;
7. opens the assistant against the published bundle.

Input Folder
------------

Place one study under ``data/raw/{STUDY_NAME}/``:

.. code-block:: text

   data/raw/Indo-VAP/
   ├── datasets/          # .xlsx or .csv study files
   ├── data_dictionary/   # data dictionary workbook or CSV
   └── annotated_pdfs/    # optional CRF templates

The repository does not ship raw study data. The local study team decides
which files are placed here.

Output Folder
-------------

After a successful run, look under ``output/{STUDY_NAME}/``:

.. code-block:: text

   output/Indo-VAP/
   ├── trio_bundle/   # scrubbed bundle used by the assistant
   ├── audit/         # counts and lineage evidence
   ├── agent/         # chat state and generated analysis
   └── README.md      # local output summary

Users normally interact with ``trio_bundle/`` through the chat UI. The
``audit/`` folder is for review and troubleshooting.

Running the Pipeline
--------------------

Normal users run the pipeline from the web UI:

.. code-block:: bash

   make chat

Then click **Load Study**. The command-line ``make pipeline`` path is for
developers and deployment operators who have already provisioned the local
PHI key.

Using an Existing Study
-----------------------

If ``output/{STUDY}/trio_bundle/`` already exists, the web UI can skip the
pipeline and use the existing published bundle. This is useful when the
study was already loaded and you only want to ask questions. A shell sketch
of this check appears at the end of this page.

PDFs
----

Annotated PDFs are optional. If they are present, the portal can use them
to enrich variable descriptions. If they are missing or unavailable, the
dataset and dictionary portions can still be used.

If your PDFs may contain PHI and you plan to use a hosted LLM provider,
review the setting in :doc:`configuration` before running the PDF path.

Audit Files
-----------

The audit folder is the user-facing evidence trail. It can help answer:

* which raw files were processed;
* whether scrub rules ran;
* what was published;
* whether a run needs review.

The audit files are not a substitute for study-team review, but they give
the team a concrete starting point.

Troubleshooting
---------------

If a run fails:

* check the terminal output first;
* confirm ``STUDY_NAME`` matches the folder under ``data/raw/``;
* confirm the expected subfolders exist;
* confirm the PHI key exists if the scrubber asks for it;
* rerun after fixing the input or configuration issue.

A pre-flight sketch of these checks appears below. For failure semantics,
snapshot maintenance, and low-level pipeline behavior, see
:doc:`../developer_guide/operations`.
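As a quick pre-flight for the checklist above, a minimal sketch like the
following can confirm the input layout before rerunning. The folder names
come from this page; ``STUDY_NAME`` is set to the example study, and the PHI
key location is deployment-specific, so it is not checked here.

.. code-block:: bash

   STUDY_NAME=Indo-VAP   # must match the folder name under data/raw/

   # Confirm the expected input layout before rerunning.
   # annotated_pdfs/ is optional, so a "missing" there is only a note.
   for sub in datasets data_dictionary annotated_pdfs; do
       if [ -d "data/raw/${STUDY_NAME}/${sub}" ]; then
           echo "ok:      data/raw/${STUDY_NAME}/${sub}"
       else
           echo "missing: data/raw/${STUDY_NAME}/${sub}"
       fi
   done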
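Similarly, the shortcut described in *Using an Existing Study* comes down to
whether a published bundle directory already exists. This is a sketch of
that condition only, not the portal's actual logic:

.. code-block:: bash

   STUDY_NAME=Indo-VAP   # example study from this page

   # A published bundle means the web UI can skip the pipeline.
   if [ -d "output/${STUDY_NAME}/trio_bundle" ]; then
       echo "bundle found: the web UI can reuse it"
   else
       echo "no bundle yet: Load Study will run the full pipeline"
   fi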