Data Movement Operators


Data movement operators help users move data across different UFSs. Each operator triggers background jobs through Alluxio's job service, and users can track the progress of these jobs from the CLI. Data movement operators are also the building blocks for PDDM, which triggers different data movement operators periodically based on its settings. The configuration of each job is written to the logs, together with error logs for any errors that occur in the job.


The copy operator copies a file or directory in the Alluxio file system. The copy is distributed across workers using the scheduler.

If copy is run on a directory, files in the directory will be recursively copied.

$ ./bin/alluxio job copy --src <src> --dst <dst> --submit [--check-content] [--partial-listing]


  • --check-content specifies whether to check the content hash when copying files.
  • --partial-listing specifies whether to use the batch listStatus API instead of the traditional listStatus. This limits memory usage and starts the copy sooner for large directories, but the progress report cannot show the total number of files because the whole directory has not been listed yet.
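
As an illustration, the submit command can be wrapped in a small shell function. This is a minimal sketch and not part of Alluxio: the `submit_copy` function and the `ALLUXIO_BIN` variable are hypothetical names, and only the flags documented above are used.

```shell
# Hypothetical helper; ALLUXIO_BIN and submit_copy are illustrative names, not Alluxio features.
ALLUXIO_BIN="${ALLUXIO_BIN:-./bin/alluxio}"

# Submit a copy job from $1 to $2, checking content hashes and using batched listing.
submit_copy() {
  "$ALLUXIO_BIN" job copy --src "$1" --dst "$2" --submit --check-content --partial-listing
}
```

Calling `submit_copy <src> <dst>` runs the same command as above with both optional flags enabled. Remove a flag from the function body if that behaviour is not wanted.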

After submitting the command, you can check the job status by running the following:

$ ./bin/alluxio job copy --src <src> --dst <dst> --progress [--format TEXT|JSON] [--verbose]

And you would get the following output for a running job:

Progress for jobId 1c849041-ef26-4ed7-a932-2af5549754d7 copying path '/dir-99' to '/dir-100':
        Settings: "check-content: false"
        Job Submitted: 2023-06-30 12:30:45.0
        Job Id: 111111
        Job State: RUNNING
        Files qualified so far: 1, 826.38MB
        Files Failed: 0
        Files Skipped: 0
        Files Succeeded: 1
        Bytes Copied: 826.38MB
        Throughput: 1621.09KB/s
        Files failure rate: 0.00%

You would get the following output for a finished job:

Progress for jobId 1c849041-ef26-4ed7-a932-2af5549754d7 copying path '/dir-99' to '/dir-100':
        Settings: "check-content: false"
        Job Submitted: 2023-06-30 12:30:45.0
        Job Id: 111111
        Job State: SUCCEEDED, finished at 2023-06-30 12:45:50.1
        Files qualified: 1, 826.38MB
        Files Failed: 0
        Files Skipped: 0
        Files Succeeded: 1
        Bytes Copied: 826.38MB
        Throughput: 1621.09KB/s
        Files failure rate: 0.00%


  • --format specifies the output format. The default is TEXT.
  • --verbose outputs job details.
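
The progress report can also be consumed by a script. The sketch below extracts the job state from the TEXT output and polls while the job reports RUNNING. It assumes the state is printed as an uppercase word after "Job State:", as in the samples above; `job_state`, `wait_for_copy`, and `ALLUXIO_BIN` are hypothetical names, and any state other than RUNNING (including an empty result if the command fails) ends the loop.

```shell
# Hypothetical helpers; ALLUXIO_BIN, job_state and wait_for_copy are illustrative names.
ALLUXIO_BIN="${ALLUXIO_BIN:-./bin/alluxio}"

# Read a TEXT progress report on stdin and print the "Job State" value (e.g. RUNNING).
job_state() {
  sed -n 's/^[[:space:]]*Job State:[[:space:]]*\([A-Z_]*\).*$/\1/p' | head -n 1
}

# Poll the copy job from $1 to $2 every $3 seconds (default 10) while it reports RUNNING,
# then print the last observed state.
wait_for_copy() {
  interval="${3:-10}"
  while :; do
    state="$("$ALLUXIO_BIN" job copy --src "$1" --dst "$2" --progress --format TEXT | job_state)"
    [ "$state" = "RUNNING" ] || break
    sleep "$interval"
  done
  printf '%s\n' "$state"
}
```

For example, `wait_for_copy <src> <dst> 30` checks the job every 30 seconds and prints the state it ended in.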

To stop the job, run the following:

$ ./bin/alluxio job copy --src <src> --dst <dst> --stop

And you would get the following output:

Copy job from '/dir-99' to '/dir-100' is successfully stopped.


  • When a job is terminated, the job status will show as STOPPED.
  • Tasks that have already started will not be terminated, and the list will not be reported to the REST API.
  • Alluxio scans files in batches, so Files qualified may not be the final count while the job is in progress. When a job finishes, one can expect that Files Qualified = Files Failed + Files Skipped + Files Succeeded.
  • Alluxio actually counts objects instead of files. Objects include files and directories.
  • If a new task for the same file is kicked off before the previous task finishes, one of them will fail with an error.
  • The maximum time a user needs to wait for a task to finish after the job is STOPPED equals the configured gRPC timeout (this configuration is global for the entire cluster).
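
The relationship between the counters of a finished job can be checked mechanically. The following is a minimal sketch that reads a TEXT progress report on stdin and verifies that Files qualified equals the sum of the other three counters. `check_counts` is a hypothetical name, and the parsing assumes the field labels and plain integer counts shown in the sample output above.

```shell
# Hypothetical helper; check_counts is an illustrative name, not an Alluxio command.
# Reads a TEXT progress report on stdin and succeeds only if
# Files qualified = Files Failed + Files Skipped + Files Succeeded.
check_counts() {
  awk -F': *' '
    /Files qualified/ { qualified = $2 + 0 }   # "1, 826.38MB" converts to 1
    /Files Failed/    { failed    = $2 + 0 }
    /Files Skipped/   { skipped   = $2 + 0 }
    /Files Succeeded/ { succeeded = $2 + 0 }
    END { if (qualified == failed + skipped + succeeded) exit 0; exit 1 }
  '
}
```

The function is meant to be fed the output of the `--progress` command through a pipe. A non-zero exit status indicates that the counters do not add up, which is expected while a job is still scanning.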

Delta copy:

  • Copy now supports delta copy, which skips files or directories that already exist on the target side.


  • Currently, copy only supports S3, HDFS and GCS as UFSs.


The move operator copies a file or directory in the Alluxio file system, distributed across workers using the scheduler, and then deletes the original file or directory.

If move is run on a directory, files in the directory will be recursively moved.

$ ./bin/alluxio job move --src <src> --dst <dst> --submit [--check-content] [--partial-listing]


  • --check-content specifies whether to check the content hash when moving files.
  • --partial-listing specifies whether to use the batch listStatus API instead of the traditional listStatus. This limits memory usage and starts the move sooner for large directories, but the progress report cannot show the total number of files because the whole directory has not been listed yet.

After submitting the command, you can check the job status by running the following:

$ ./bin/alluxio job move --src <src> --dst <dst> --progress [--format TEXT|JSON] [--verbose]

And you would get the following output for a running job:

Progress for jobId 1c849041-ef26-4ed7-a932-2af5549754d7 moving path '/dir-99' to '/dir-100':
        Settings: "check-content: false"
        Job Submitted: 2023-06-30 12:30:45.0
        Job Id: 111111
        Job State: RUNNING
        Files qualified so far: 1, 826.38MB
        Files Failed: 0
        Files Succeeded: 1
        Bytes Moved: 826.38MB
        Throughput: 1621.09KB/s
        Files failure rate: 0.00%

You would get the following output for a finished job:

Progress for jobId 1c849041-ef26-4ed7-a932-2af5549754d7 moving path '/dir-99' to '/dir-100':
        Settings: "check-content: false"
        Job Submitted: 2023-06-30 12:30:45.0
        Job Id: 111111
        Job State: SUCCEEDED, finished at 2023-06-30 12:45:50.1
        Files qualified: 1, 826.38MB
        Files Failed: 0
        Files Succeeded: 1
        Bytes Moved: 826.38MB
        Throughput: 1621.09KB/s
        Files failure rate: 0.00%


  • --format specifies the output format. The default is TEXT.
  • --verbose outputs job details.

To stop the job, run the following:

$ ./bin/alluxio job move --src <src> --dst <dst> --stop

And you would get the following output:

Move job from '/dir-99' to '/dir-100' is successfully stopped.


  • When a job is terminated, the job status will show as STOPPED.
  • Tasks that have already started will not be terminated, and the list will not be reported to the REST API.
  • Alluxio scans files in batches, so Files qualified may not be the final count while the job is in progress. When a job finishes, one can expect that Files Qualified = Files Failed + Files Skipped + Files Succeeded.
  • Alluxio actually counts objects instead of files. Objects include files and directories.
  • If a new task for the same file is kicked off before the previous task finishes, one of them will fail with an error.
  • The maximum time a user needs to wait for a task to finish after the job is STOPPED equals the configured gRPC timeout (this configuration is global for the entire cluster).


  • Currently, move only supports S3, HDFS and GCS as UFSs.
  • Note that when moving files from HDFS, GCS (or a similar file system) to another UFS, there may be concurrency issues when deleting directories, so empty directories may be left in the source directory even after the files have been moved.


The load operator loads data/metadata from the under storage system into Alluxio storage. For example, load can be used to prefetch data for analytics jobs. If load is run on a directory, files in the directory will be recursively loaded.

$ ./bin/alluxio job load --path <path> --submit [--metadata-only]


  • --metadata-only specifies whether to load metadata only.

After submitting the command, you can check the job status by running the following:

$ ./bin/alluxio job load --path <path> --progress [--format TEXT|JSON] [--verbose]

And you would get the following output:

Progress for loading path '/dir-99':
        Settings:       bandwidth: unlimited    verify: false
        Job State: SUCCEEDED
        Files Processed: 1000
        Bytes Loaded: 125.00MB
        Throughput: 2509.80KB/s
        Block load failure rate: 0.00%
        Files Failed: 0


  • --format specifies the output format. The default is TEXT.
  • --verbose outputs job details.
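
When load is used to prefetch data before an analytics job, it can be useful to gate the next step on the load result. The sketch below reads a TEXT load progress report on stdin and succeeds only if the job state is SUCCEEDED and Files Failed is 0. `load_ok` is a hypothetical name, and the parsing assumes the field labels shown in the sample output above.

```shell
# Hypothetical helper; load_ok is an illustrative name, not an Alluxio command.
# Reads a TEXT load progress report on stdin and succeeds only if the job
# state is SUCCEEDED and "Files Failed" is 0.
load_ok() {
  awk -F': *' '
    /Job State/    { succeeded = ($2 ~ /^SUCCEEDED/) }
    /Files Failed/ { failed = $2 + 0; seen = 1 }
    END { if (succeeded && seen && failed == 0) exit 0; exit 1 }
  '
}
```

The function is meant to receive the output of the `--progress` command through a pipe, so that a following step runs only when it exits with status 0.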

To stop the job, run the following:

$ ./bin/alluxio job load --path <path> --stop