Getting Your Data into Pachyderm

If you’re running Pachyderm in the cloud, data in Pachyderm is backed by an object store such as S3 or GCS. Files in Pachyderm are content-addressed as part of how we build our version control semantics, so the objects in the bucket are not “human-readable.” We recommend you give Pachyderm its own bucket.

There are several ways to get your data into Pachyderm.

PFS Mount: This is a “toy” method for getting data into Pachyderm if you just have some local files (or dummy files) and you just want to test things out.

Pachctl CLI: This is the best option for real use cases and scripting the input process.

Golang Client: Ideal for Golang users who want to script the file input process.

Other Language Clients: Pachyderm uses a protobuf API, which supports many other languages; we just haven’t built full clients for them yet.

PFS Mount

This is the easiest method if you just have some local files (or dummy files) and you just want to test things out in Pachyderm. This is NOT a production method for getting data into Pachyderm.

Pachyderm allows you to mount the data in its distributed file system locally using FUSE and explore it like any other directory.

FUSE comes pre-installed on most Linux distributions. For OS X, you’ll need to install [OSX FUSE](https://osxfuse.github.io/).

First create the mount point:

$ mkdir ~/pfs

And then mount it:

# We background this process because it blocks.
$ pachctl mount ~/pfs &

This will mount pfs on ~/pfs. You can inspect the filesystem like you would any other local filesystem, using ls or a web browser.

Once you have pfs mounted, you can add files to Pachyderm via whatever method you prefer to manipulate a local file system: mv, cp, >, |, etc.

Don’t forget, you’ll need to create a repo and start a commit in Pachyderm first:

# Create a repo called "data"
$ pachctl create-repo data

# Start a commit on repo "data"
$ pachctl start-commit data

Now add whatever files you want to ``~/pfs/<repo_name>/<commit_ID>/<file_name>``.
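
The path layout above can be sketched as a small shell helper. This is a minimal sketch, not part of Pachyderm: `pfs_target` and `example.csv` are hypothetical names, and the guarded section assumes that `pachctl start-commit` prints the new commit ID on stdout, which you should confirm for your version.

```shell
# Hypothetical helper: build the destination path inside a PFS mount.
# Layout taken from the text: <mount>/<repo_name>/<commit_ID>/<file_name>
pfs_target() {
    mount_point=$1
    repo=$2
    commit=$3
    src=$4
    printf '%s/%s/%s/%s\n' "$mount_point" "$repo" "$commit" "$(basename "$src")"
}

# Only touch Pachyderm when pachctl is installed, the repo is visible in
# the mount, and the (hypothetical) sample file exists.
if command -v pachctl >/dev/null 2>&1 && [ -d "$HOME/pfs/data" ] && [ -f ./example.csv ]; then
    # Assumption: start-commit prints the new commit ID on stdout.
    commit=$(pachctl start-commit data)
    cp ./example.csv "$(pfs_target "$HOME/pfs" data "$commit" ./example.csv)"
fi
```

Keeping the path construction in one function makes it easy to see exactly where a file will land before anything is copied into the mount.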

Pachctl CLI

The pachctl CLI is the primary method of interaction with Pachyderm. To get data into Pachyderm, you should use the put-file command. Below are some example uses of put-file. See the pachctl put-file reference for complete documentation.

Note

Commits in Pachyderm must be explicitly started and finished, so put-file can only be called on an open commit (started, but not finished). The -c option starts the commit, puts the data, and finishes the commit, all in a single command.
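
To make the open-commit rule concrete, here is a hedged sketch of the explicit sequence that -c collapses into one command. The function name `put_in_own_commit` is hypothetical, and the branch argument to start-commit and finish-commit is an assumption modeled on the put-file examples; argument forms differ between releases, so check `pachctl help` before relying on it. By default the sketch only prints the commands rather than running them.

```shell
# PACHCTL defaults to "echo pachctl", so the commands are printed, not run.
# Set PACHCTL=pachctl to execute them against a real cluster.
PACHCTL=${PACHCTL:-echo pachctl}

# Hypothetical helper: put one file inside its own commit.
put_in_own_commit() {
    repo=$1
    branch=$2
    file=$3
    # Assumed argument forms; verify against your pachctl version.
    $PACHCTL start-commit "$repo" "$branch" &&
    $PACHCTL put-file "$repo" "$branch" -f "$file" &&
    $PACHCTL finish-commit "$repo" "$branch"
}

put_in_own_commit data master example.csv
```

The `&&` chain stops at the first failing step, so a failed start-commit never leads to a put-file call.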

Add a single file:

$ pachctl put-file <repo> <branch> -f <file>

Start and finish the commit while adding a file using -c:

$ pachctl put-file -c <repo> <branch> -f <file>

Put data from a URL:

$ pachctl put-file <repo> <branch> -f http://url_path

Add multiple files at once by using the -i option. The target file should be a list of files, paths, or URLs that you want to input all at once:

$ pachctl put-file <repo> <branch> -i <file>
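
One way to produce the list file for -i is to generate it with find. This is a sketch under assumptions: `make_input_list` is a hypothetical helper, the demo files are throwaway, and the one-entry-per-line format is assumed rather than stated in the text above.

```shell
# Hypothetical helper: write the path of every regular file under a
# directory to a list file, one path per line (assumed format for -i).
make_input_list() {
    dir=$1
    list=$2
    find "$dir" -type f | sort > "$list"
}

# Demo with throwaway files in a temporary directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/in"
printf 'a\n' > "$workdir/in/a.txt"
printf 'b\n' > "$workdir/in/b.txt"

make_input_list "$workdir/in" "$workdir/files.txt"
cat "$workdir/files.txt"

# The list can then be passed to put-file, using the placeholders from the text:
#   pachctl put-file <repo> <branch> -i "$workdir/files.txt"
```

Writing the list outside the scanned directory keeps the list file itself out of the input set.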

Pipe data from stdin into a file:

$ echo "data" | pachctl put-file <repo> <branch> <path>

Add an entire directory by using the recursive flag, -r:

$ pachctl put-file -r <repo> <branch> -f <dir>

Golang Client

For any Go users, we’ve built a Golang client so you can easily script Pachyderm commands. Check out the autogenerated godocs on put-file.

Other Language Clients

Pachyderm uses a simple protocol buffer API. Protobufs support a number of other languages, any of which can be used to drive Pachyderm programmatically. We haven’t built clients for them yet, but it’s not too hard, and it’s an easy way to contribute to Pachyderm if you’re looking to get involved.