Updating Pipelines

During development, it’s very common to update pipelines, whether it’s changing your code or just cranking up parallelism. For example, when developing a machine learning model you will likely need to try out a bunch of different versions of your model while your training data stays relatively constant. This is where update-pipeline comes in.

Updating your pipeline specification

When you are updating parallelism, adding another input repo, or otherwise modifying your pipeline specification, you just need to update your JSON file and call update-pipeline:

$ pachctl update-pipeline -f pipeline.json 

Similar to create-pipeline, update-pipeline with the -f flag can also take a URL if your JSON manifest is hosted on GitHub or elsewhere.
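
As a minimal sketch of both forms, the shell function below passes either a local path or a URL to -f. The function name update_spec, the DRY_RUN switch, and the example.com URL are illustrative assumptions, not part of pachctl. By default the script only prints the command it would run, so you can review it first.

```shell
#!/bin/sh
# Sketch: apply a pipeline spec from a local file or a URL.
# update_spec and DRY_RUN are hypothetical helpers, not pachctl features.

update_spec() {
  spec="$1"
  case "$spec" in
    http://*|https://*) ;;  # remote manifest: hand the URL straight to -f
    *) [ -f "$spec" ] || { echo "missing spec: $spec" >&2; return 1; } ;;
  esac
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "pachctl update-pipeline -f $spec"   # print only
  else
    pachctl update-pipeline -f "$spec"
  fi
}

# Placeholder URL; substitute the location of your own manifest.
update_spec "https://example.com/pipeline.json"
```

Set DRY_RUN=0 to actually run the command once the printed line looks right.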

Updating the code used in a pipeline

You can also use update-pipeline to update the code you are using in one or more of your pipelines. To update the code in your pipeline:

  1. Make the code changes.
  2. Re-build your Docker image.
  3. Call update-pipeline with the --push-images flag.

You need to call update-pipeline with the --push-images flag because, if you have already run your pipeline, Pachyderm has already pulled the specified images. It won’t re-pull new versions of the images unless you tell it to, which avoids wasting time on pulls that aren’t needed. When --push-images is specified, Pachyderm will do the following:

  1. Tag your image with a new unique tag.
  2. Push that tagged image to your registry (e.g., DockerHub).
  3. Update the pipeline specification that you previously gave to Pachyderm with the new unique tag.
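
To make those three steps concrete, here is a rough by-hand equivalent using the standard docker tag and docker push commands. It only illustrates the idea and is not Pachyderm’s implementation: the timestamp-based tag, the image name pachyderm/opencv, the DRY_RUN switch, and the run helper are all assumptions.

```shell
#!/bin/sh
# Sketch of what --push-images automates. Not Pachyderm's actual code.
IMAGE="${IMAGE:-pachyderm/opencv}"   # assumed: the image named in your pipeline spec
TAG="${TAG:-$(date +%s)}"            # assumed tagging scheme: Unix timestamp

# Print each command instead of running it unless DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

run docker tag "$IMAGE" "$IMAGE:$TAG"   # 1. tag the local image with a unique tag
run docker push "$IMAGE:$TAG"           # 2. push the tagged image to the registry
# 3. The pipeline spec must then reference "$IMAGE:$TAG". With --push-images,
#    Pachyderm updates that reference for you.
```

Because a fresh tag is used on every update, the new tag is one Pachyderm has not pulled before.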

For example, you could update the Python code used in the OpenCV pipeline via:

$ pachctl update-pipeline -f edges.json --push-images --password <registry password> -u <registry user>
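
Steps 2 and 3 can be combined in a small wrapper script, sketched below. It assumes the Dockerfile is in the current directory and that IMAGE matches the image named in edges.json. IMAGE, SPEC, REGISTRY_USER, REGISTRY_PASSWORD, DRY_RUN, and the two helper functions are hypothetical names introduced for this example. Only the pachctl flags come from the command above.

```shell
#!/bin/sh
# Sketch: rebuild the image, then update the pipeline with --push-images.
IMAGE="${IMAGE:-pachyderm/opencv}"   # assumed to match the image in the spec
SPEC="${SPEC:-edges.json}"
REGISTRY_USER="${REGISTRY_USER:-<registry user>}"
REGISTRY_PASSWORD="${REGISTRY_PASSWORD:-<registry password>}"

# Print each command instead of running it unless DRY_RUN=0.
# Note that the printed line includes the password argument.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

rebuild_and_update() {
  # Step 2: re-build the Docker image from the current directory.
  run docker build -t "$IMAGE" . || return 1
  # Step 3: update the pipeline and let Pachyderm tag and push the image.
  run pachctl update-pipeline -f "$SPEC" --push-images \
    -u "$REGISTRY_USER" --password "$REGISTRY_PASSWORD"
}

rebuild_and_update
```

Stopping when the build fails keeps a broken image from being pushed and rolled out to the pipeline.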

Re-processing commits, from commit

Changing your pipeline code implies that your previously computed results aren’t in sync with (or generated by) your most recent code. By default (if the “from” field in the pipeline spec is not given), Pachyderm will start a new “commit tree” and re-compute the results with your new code, committing them to the new commit tree.

In some cases, such as changing parallelism, you don’t want to archive previous data and re-compute results. Or maybe you want to use your new code only for new input data (from that point on). If so, you can specify the “from” field in your pipeline specification with a commit ID. Pachyderm will then only process new data from that commit ID on with the new code.

Note that from can take a branch name. As such, you can just specify "from": "master" to process only the new data with the updated pipeline (because master points to the latest commit).