Data Pipeline
This page explains what happens when a user loads a study. It avoids implementation detail; developers should see the Architecture and Operations pages.
What “Load Study” Means
Loading a study turns local raw study files into a published study bundle that the assistant can query.
At a high level, the portal:
reads the study datasets, data dictionary, and optional annotated PDFs;
stages extracted files in a temporary workspace;
applies the PHI-scrub rules to dataset fields;
cleans and aligns the study artifacts;
publishes the scrubbed bundle under output/{STUDY}/trio_bundle/;
writes audit files under output/{STUDY}/audit/;
opens the assistant against the published bundle.
Input Folder
Place one study under data/raw/{STUDY_NAME}/:
data/raw/Indo-VAP/
├── datasets/ # .xlsx or .csv study files
├── data_dictionary/ # data dictionary workbook or CSV
└── annotated_pdfs/ # optional CRF templates
The repository does not ship raw study data. The local study team owns which files are placed here.
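Before clicking Load Study, it can help to confirm the folder layout matches. A minimal shell sketch (not part of the portal), assuming the example study name above; adjust Indo-VAP to your own study folder:

```shell
# Verify the expected input layout for one study under data/raw/.
check_study_inputs() {
  study="$1"
  status=0
  for sub in datasets data_dictionary; do
    if [ ! -d "data/raw/$study/$sub" ]; then
      echo "missing required subfolder: data/raw/$study/$sub"
      status=1
    fi
  done
  # annotated_pdfs is optional, so absence is only a note
  if [ ! -d "data/raw/$study/annotated_pdfs" ]; then
    echo "note: no annotated_pdfs/ (optional)"
  fi
  return $status
}

check_study_inputs "Indo-VAP" || echo "fix the folders above before loading the study"
```

The check only looks for the subfolders; which files belong inside them remains the study team's call.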
Output Folder
After a successful run, look under output/{STUDY_NAME}/:
output/Indo-VAP/
├── trio_bundle/ # scrubbed bundle used by the assistant
├── audit/ # counts and lineage evidence
├── agent/ # chat state and generated analysis
└── README.md # local output summary
Users normally interact with trio_bundle/ through the chat UI. The
audit/ folder is for review and troubleshooting.
Running the Pipeline
Normal users run the pipeline from the web UI:
make chat
Then click Load Study.
The command-line make pipeline path is for developers and deployment
operators who have already provisioned the local PHI key.
Using an Existing Study
If output/{STUDY}/trio_bundle/ already exists, the web UI can skip
the pipeline and use the existing published bundle. This is useful when
the study was already loaded and you only want to ask questions.
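A quick way to see whether reuse is possible is to check for the published bundle directly. A sketch assuming the example study name, not a portal command:

```shell
# True when a published bundle already exists for the given study.
bundle_ready() {
  [ -d "output/$1/trio_bundle" ]
}

if bundle_ready "Indo-VAP"; then
  echo "published bundle found: the web UI can skip the pipeline"
else
  echo "no published bundle yet: click Load Study first"
fi
```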
PDFs
Annotated PDFs are optional. If they are present, the portal can use them to enrich variable descriptions. If they are missing or unavailable, the dataset and dictionary portions can still be used.
If your PDFs may contain PHI and you plan to use a hosted LLM provider, review the setting in Configuration before running the PDF path.
Audit Files
The audit folder is the user-facing evidence trail. It can help answer:
which raw files were processed;
whether scrub rules ran;
what was published;
whether a run needs review.
The audit files are not a substitute for study-team review, but they give the team a concrete starting point.
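To start a review, listing the audit folder newest-first is often enough. File names and formats are study- and version-specific, so treat this as a sketch:

```shell
# List a study's audit evidence, newest first.
list_audit() {
  dir="output/$1/audit"
  if [ -d "$dir" ]; then
    ls -1t "$dir"
  else
    echo "no audit folder yet: $dir"
  fi
}

list_audit "Indo-VAP"
```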
Troubleshooting
If a run fails:
check the terminal output first;
confirm STUDY_NAME matches the folder under data/raw/;
confirm expected subfolders exist;
confirm the PHI key exists if the scrubber asks for it;
rerun after fixing the input or configuration issue.
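The folder-name and PHI-key checks above can be scripted. The PHI key location varies by deployment, so PHI_KEY_FILE below is a hypothetical path; use the one from your Configuration:

```shell
STUDY_NAME="Indo-VAP"                            # must match the folder name under data/raw/ exactly
PHI_KEY_FILE="${PHI_KEY_FILE:-secrets/phi.key}"  # hypothetical default; see Configuration

if [ ! -d "data/raw/$STUDY_NAME" ]; then
  echo "STUDY_NAME mismatch: data/raw/$STUDY_NAME does not exist"
fi
if [ ! -f "$PHI_KEY_FILE" ]; then
  echo "PHI key not found at $PHI_KEY_FILE"
fi
```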
For failure semantics, snapshot maintenance, and low-level pipeline behavior, see Operations.