Pachyderm Pipeline System (PPS)
Pachyderm Pipeline System is a parallel, containerized analysis platform.
PPS has two components, jobs and pipelines, and understanding each gives you a full picture of PPS.
Jobs are transformations that run only once.
Broadly, they take the following inputs:
- a transformation image (refer to the pipeline spec for instructions on creating your own image)
- an entry point to run the transformation
- some other configuration options about how to run the job (parallelism, partitioning method, etc)
- at least one PFS input: a `Repo` containing some data and a `CommitID` per input repo
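Put together, these inputs are typically described in a JSON pipeline spec. The following is only a sketch: the names `wordcount`, `my-image`, `data`, and the `cmd` entry point are placeholders, and the exact field names can vary between Pachyderm versions, so consult the pipeline spec docs for the authoritative schema.

```json
{
  "pipeline": {"name": "wordcount"},
  "transform": {
    "image": "my-image",
    "cmd": ["python3", "/app/transform.py"]
  },
  "inputs": [{"repo": {"name": "data"}}]
}
```

Here `transform.image` is the transformation image, `transform.cmd` is the entry point, and each entry in `inputs` names a PFS `Repo` to read from.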
When creating a job, PPS:
- creates an output `Repo` with the same name as the job
- uses Kubernetes to spin up containers with the image you specify, in the configuration you specify
- mounts the input at `/pfs/your_repo_name` for use by your code in that container, and `/pfs/out` for writing output, which is connected to the newly created output `Repo`
- runs the containers with the entry point you provided
- stores the output in a new commit on the newly created output `Repo`
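From your code's point of view, the steps above reduce to "read from the mounted input, write to `/pfs/out`". A minimal sketch of such a transformation is below; in a real job the directories would be `/pfs/your_repo_name` and `/pfs/out`, but here they are parameters (and the line-counting logic is just an illustrative stand-in for your own transformation) so the sketch runs outside a cluster.

```python
import os

def transform(input_dir, output_dir):
    """Count lines in each input file, writing one result file per input.

    Inside a PPS container, input_dir would be /pfs/your_repo_name and
    output_dir would be /pfs/out; they are parameters here so the sketch
    can run anywhere.
    """
    os.makedirs(output_dir, exist_ok=True)
    for name in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, name)
        with open(path) as f:
            count = sum(1 for _ in f)
        # Everything written under output_dir ends up in the new commit
        # on the job's output Repo when the job finishes.
        with open(os.path.join(output_dir, name + ".count"), "w") as out:
            out.write(str(count) + "\n")
```

Because output is collected from `/pfs/out` only when the containers exit, your entry point can write results incrementally without worrying about partial commits.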
You’ll be using and composing pipelines frequently with PPS, so you’ll quickly want to understand how your outputs relate to your inputs.
Check out the flush-commit docs for specifics on tracking provenance.