Data Pipeline

This page explains what happens when a user loads a study. It stays at a high level and avoids implementation detail; developers should see the Architecture and Operations pages instead.

What “Load Study” Means

Loading a study turns local raw study files into a published study bundle that the assistant can query.

At a high level, the portal:

  1. reads the study datasets, data dictionary, and optional annotated PDFs;

  2. stages extracted files in a temporary workspace;

  3. applies the PHI-scrub rules to dataset fields;

  4. cleans and aligns the study artifacts;

  5. publishes the scrubbed bundle under output/{STUDY}/trio_bundle/;

  6. writes audit files under output/{STUDY}/audit/;

  7. opens the assistant against the published bundle.
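The seven steps above can be sketched as an ordered list of stages plus the bundle path they publish to. This is a minimal illustration only: the stage names, `STAGES`, and `bundle_dir` are hypothetical and do not come from the actual pipeline code.

```python
from pathlib import Path

# Hypothetical stage names mirroring the seven steps above; the real
# pipeline's internal names and signatures may differ.
STAGES = [
    "read_inputs",      # datasets, data dictionary, optional annotated PDFs
    "stage_workspace",  # extract files into a temporary workspace
    "scrub_phi",        # apply PHI-scrub rules to dataset fields
    "clean_and_align",  # clean and align study artifacts
    "publish_bundle",   # publish under output/{STUDY}/trio_bundle/
    "write_audit",      # write audit files under output/{STUDY}/audit/
    "open_assistant",   # open the assistant against the published bundle
]

def bundle_dir(output_root: Path, study: str) -> Path:
    """Published bundle location for a study, per the output layout."""
    return output_root / study / "trio_bundle"
```

Running the stages strictly in this order is what makes the published bundle safe to query: scrubbing always happens before publishing, and the audit trail is written before the assistant opens.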

Input Folder

Place one study under data/raw/{STUDY_NAME}/:

data/raw/Indo-VAP/
├── datasets/          # .xlsx or .csv study files
├── data_dictionary/   # data dictionary workbook or CSV
└── annotated_pdfs/    # optional CRF templates

The repository does not ship raw study data; the local study team decides which files are placed here.
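A quick way to confirm the input layout matches the tree above is to check for the expected subfolders before running. The helper below is a sketch, not part of the portal; `missing_input_folders` and the folder-name tuples simply restate the layout shown here.

```python
from pathlib import Path

REQUIRED = ("datasets", "data_dictionary")  # must exist per the layout above
OPTIONAL = ("annotated_pdfs",)              # PDFs are optional (see below)

def missing_input_folders(study_root: Path) -> list[str]:
    """Return the required subfolders absent under data/raw/{STUDY_NAME}/."""
    return [name for name in REQUIRED if not (study_root / name).is_dir()]
```

An empty return value means the required layout is in place; any names listed are folders the study team still needs to create or populate.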

Output Folder

After a successful run, look under output/{STUDY_NAME}/:

output/Indo-VAP/
├── trio_bundle/       # scrubbed bundle used by the assistant
├── audit/             # counts and lineage evidence
├── agent/             # chat state and generated analysis
└── README.md          # local output summary

Users normally interact with trio_bundle/ through the chat UI. The audit/ folder is for review and troubleshooting.

Running the Pipeline

Most users run the pipeline from the web UI:

make chat

Then click Load Study.

The command-line path (make pipeline) is for developers and deployment operators who have already provisioned the local PHI key.

Using an Existing Study

If output/{STUDY}/trio_bundle/ already exists, the web UI can skip the pipeline and use the existing published bundle. This is useful when the study was already loaded and you only want to ask questions.
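The reuse decision amounts to checking that a published bundle already exists and is not empty. A minimal sketch of that check follows; `can_reuse_bundle` is an illustrative name, not the portal's actual function.

```python
from pathlib import Path

def can_reuse_bundle(output_root: Path, study: str) -> bool:
    """True when output/{STUDY}/trio_bundle/ exists and contains files,
    meaning the web UI could skip the pipeline and query it directly."""
    bundle = output_root / study / "trio_bundle"
    return bundle.is_dir() and any(bundle.iterdir())
```

Note the non-empty check: an empty trio_bundle/ folder left behind by an interrupted run should trigger a fresh pipeline run rather than being reused.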

PDFs

Annotated PDFs are optional. If they are present, the portal can use them to enrich variable descriptions. If they are missing or unavailable, the dataset and dictionary portions can still be used.

If your PDFs may contain PHI and you plan to use a hosted LLM provider, review the setting in Configuration before running the PDF path.
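The optional-PDF behavior can be expressed as a simple fallback: gather annotated PDFs when the folder exists, and return nothing otherwise so the dataset-and-dictionary path still runs. This sketch assumes the `annotated_pdfs/` folder name from the input layout; `pdf_sources` is a hypothetical helper.

```python
from pathlib import Path

def pdf_sources(study_root: Path) -> list[Path]:
    """Annotated PDFs if present; an empty list means the pipeline proceeds
    with datasets and the dictionary only."""
    pdf_dir = study_root / "annotated_pdfs"
    if not pdf_dir.is_dir():
        return []
    return sorted(pdf_dir.glob("*.pdf"))
```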

Audit Files

The audit folder is the user-facing evidence trail. It can help answer:

  • which raw files were processed;

  • whether scrub rules ran;

  • what was published;

  • whether a run needs review.

The audit files are not a substitute for study-team review, but they give the team a concrete starting point.
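A reviewer's starting point is often just collecting whatever evidence the audit folder holds into one view. The sketch below assumes JSON audit files; the real file names and schema are pipeline-specific, and `summarize_audit` is an illustrative helper, not part of the portal.

```python
import json
from pathlib import Path

def summarize_audit(audit_dir: Path) -> dict:
    """Gather JSON audit files under output/{STUDY}/audit/ into one dict
    keyed by filename; unreadable files are flagged with None for review."""
    summary = {}
    for path in sorted(audit_dir.glob("*.json")):
        try:
            summary[path.name] = json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            summary[path.name] = None  # evidence exists but could not be read
    return summary
```

Keeping unreadable files visible (rather than skipping them) matters here: a corrupt audit file is itself a signal that the run needs review.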

Troubleshooting

If a run fails:

  • check the terminal output first;

  • confirm STUDY_NAME matches the folder under data/raw/;

  • confirm expected subfolders exist;

  • confirm the PHI key exists if the scrubber asks for it;

  • rerun after fixing the input or configuration issue.
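The checklist above can be folded into a preflight sketch that reports problems before a run. Everything here is illustrative: `preflight` is not a portal function, and the `PHI_KEY` environment variable name is an assumption — check Configuration for how the PHI key is actually provisioned.

```python
import os
from pathlib import Path

def preflight(study_name: str, raw_root: Path = Path("data/raw")) -> list[str]:
    """Return human-readable problems; an empty list means the checks pass."""
    problems = []
    study_root = raw_root / study_name
    if not study_root.is_dir():
        problems.append(
            f"no folder data/raw/{study_name}/ (does STUDY_NAME match?)")
    else:
        for sub in ("datasets", "data_dictionary"):
            if not (study_root / sub).is_dir():
                problems.append(f"missing expected subfolder {sub}/")
    if not os.environ.get("PHI_KEY"):  # hypothetical variable name
        problems.append("PHI key not provisioned (needed if the scrubber runs)")
    return problems
```

Fix anything reported, then rerun; the terminal output remains the authoritative record of what actually failed.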

For failure semantics, snapshot maintenance, and low-level pipeline behavior, see Operations.