During development, it’s very common to update pipelines, whether you’re changing your code or just cranking up parallelism. For example, when developing a machine learning model, you will likely need to try out a bunch of different versions of your model while your training data stays relatively constant. This is where update-pipeline comes in.
## Updating your pipeline specification
If you are updating parallelism, adding another input repo, or otherwise modifying your pipeline specification, you just need to update your JSON file and run update-pipeline:

```
$ pachctl update-pipeline -f pipeline.json
```

The -f flag can also take a URL if your JSON manifest is hosted on GitHub or elsewhere.
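For instance, cranking up parallelism might look like the following minimal spec. This is a sketch, not a definitive example: the edges pipeline name, the image, and the images input repo are illustrative, and the exact shape of the parallelism_spec and input fields varies by Pachyderm version, so check the pipeline specification reference for yours.

```
{
  "pipeline": {
    "name": "edges"
  },
  "transform": {
    "image": "myaccount/edges",
    "cmd": ["python3", "/edges.py"]
  },
  "parallelism_spec": {
    "constant": 4
  },
  "input": {
    "atom": {
      "repo": "images",
      "glob": "/*"
    }
  }
}
```

After editing the spec, re-run the update-pipeline command above to apply the change.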
## Updating the code used in a pipeline
You can also use update-pipeline to update the code you are using in one or more of your pipelines. To update the code in your pipeline:
- Make the code changes.
- Re-build your Docker image.
- Call update-pipeline with the --push-images flag (a sketch of the full loop follows this list).
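A minimal sketch of that loop, assuming your pipeline code is baked into a Docker image (the myaccount/edges image name and edges.json manifest are illustrative):

```
# Steps 1 and 2: edit your code, then rebuild the image that ships it
$ docker build -t myaccount/edges .

# Step 3: have Pachyderm tag and push the image and update the pipeline
$ pachctl update-pipeline -f edges.json --push-images
```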
You need to call update-pipeline with the --push-images flag because, if you have already run your pipeline, Pachyderm has already pulled the specified images. It won’t re-pull new versions of those images unless we tell it to (which ensures that we don’t waste time pulling images when we don’t need to). When --push-images is specified, Pachyderm will do the following:
- Tag your image with a new unique tag.
- Push that tagged image to your registry (e.g., DockerHub).
- Update the pipeline specification that you previously gave to Pachyderm with the new unique tag.
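Conceptually, that is roughly equivalent to doing the following by hand (the v2 tag is illustrative):

```
$ docker tag myaccount/edges myaccount/edges:v2   # 1. tag the image uniquely
$ docker push myaccount/edges:v2                  # 2. push it to your registry
# 3. point transform.image in edges.json at the new tag, then:
$ pachctl update-pipeline -f edges.json
```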
For example, you could update the Python code used in the OpenCV pipeline via:

```
$ pachctl update-pipeline -f edges.json --push-images --password <registry password> -u <registry user>
```
As of 1.5.1, updating a pipeline will NOT reprocess previously processed data by default. New data committed to the inputs will be processed with the new code, and those results will be “mixed” with the results of processing data with the previous code. Furthermore, data that Pachyderm tried and failed to process with the previous code (because the code errored) will be processed with the new code.
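For instance, after an update without flags, only data committed from then on is handled by the new code. A sketch using pachctl 1.x syntax (the images repo and file name are illustrative):

```
# New data committed to an input repo is processed with the updated code
$ pachctl put-file images master new-image.png -f new-image.png

# The resulting job runs the new image; earlier results stay in older commits
$ pachctl list-job
```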
update-pipeline (without flags) is designed for the situation in which your code needs to be fixed because it encountered an unexpected new form of data.
If you’d like to update your pipeline and have the updated pipeline reprocess all the data that is currently in the HEAD commits of your input repos, you should use the --reprocess flag. This type of update will automatically trigger a job that reprocesses all of the input data in its current state (i.e., the HEAD commits) with the updated pipeline. From that point on, the updated pipeline will continue to be used to process any new input data. Previous results will still be available via their corresponding commit IDs.
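For example, to update the edges pipeline from above and rerun it over everything currently in its inputs:

```
$ pachctl update-pipeline -f edges.json --reprocess
```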