Union Mount

Moving data from one storage system to another while keeping computation jobs running correctly is a common challenge for many Alluxio users. Traditionally, this is done by manually copying the data to the destination, asking every user of the data to update their applications and queries to use the new URI, and waiting until all the updates are complete. With Alluxio, this can be done seamlessly by mounting multiple storage systems to a single mount point using a union UFS.

On a union UFS, each under storage system is mounted as a sub-UFS. A path in Alluxio is mapped to one or more corresponding files in the mounted sub-UFSes. When a file is written to Alluxio, it can be configured to be written to one or more sub-UFSes. When a file is read from Alluxio, it can be read transparently from any sub-UFS mounted with the union UFS. This allows users to move data among storage systems easily while accessing data with a single Alluxio URI, regardless of which storage system the file is in.

This page describes how to configure a union UFS to mount multiple storage systems.

Mounting Multiple Storage Systems with Union UFS

Note: Before attempting a union UFS mount, validate that each sub-UFS can perform basic operations on its own by using the ./bin/alluxio runTests [--directory <PATH>] command. Start by mounting each sub-UFS as its own mount point and confirming that it passes the tests. Once every sub-UFS has been validated, unmount them from Alluxio before configuring the union UFS.
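
The validation pass described in the note can be scripted. The following is a minimal sketch, assuming each sub-UFS can be mounted on its own without extra options. validate_sub_ufs and the ALLUXIO_BIN variable are hypothetical names introduced here for illustration; only the fs mount, runTests --directory, and fs unmount commands come from the Alluxio CLI.

```shell
# Hypothetical helper: mount one sub-UFS on its own, run the basic tests, then unmount it.
# ALLUXIO_BIN is an assumed variable pointing at the Alluxio launcher script.
ALLUXIO_BIN="${ALLUXIO_BIN:-./bin/alluxio}"

validate_sub_ufs() {
  local mount_path="$1" ufs_uri="$2"
  local status=0

  # Mount the sub-UFS as a standalone mount point.
  "$ALLUXIO_BIN" fs mount "$mount_path" "$ufs_uri" || return 1

  # Run the basic read/write tests against that mount point and remember the result.
  "$ALLUXIO_BIN" runTests --directory "$mount_path" || status=$?

  # Always unmount, so the sub-UFS is free to be used in the union mount later.
  "$ALLUXIO_BIN" fs unmount "$mount_path" || return 1

  return "$status"
}
```

For a local directory, this might be invoked as validate_sub_ufs /ufs1-test /tmp/ufs1, where /ufs1-test is an arbitrary mount path. A sub-UFS that needs credentials would also need the corresponding --option arguments on the mount line.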

Similar to mounting any other under storage system, a union UFS can be mounted using the alluxio fs mount command:

./bin/alluxio fs mount \
   --option alluxio-union.<UFS_A_ALIAS>.uri=<UFS_A_URI> \
   --option alluxio-union.<UFS_A_ALIAS>.option.<KEY>=<VALUE> \
   --option alluxio-union.<UFS_B_ALIAS>.uri=<UFS_B_URI> \
   --option alluxio-union.<UFS_B_ALIAS>.option.<KEY>=<VALUE> \
   --option alluxio-union.priority.read=<ALIAS_1>,<ALIAS_2>,<ALIAS_3>... \
   --option alluxio-union.collection.create=<UFS_ALIASES>  \
   <ALLUXIO_MOUNT_PATH> union://<UNION_UFS_AUTHORITY>/

In the command above:

  • alluxio-union.<UFS_ALIAS>.uri specifies the URI of a sub under storage system (sub-UFS) to be mounted. <UFS_ALIAS> should be replaced with an alias representing the sub-UFS. Users can mount multiple sub-UFSes by giving each one a different alias. For example, alluxio-union.hdfs.uri=hdfs://local_hdfs/user_a/ tells the union UFS to mount a sub-UFS at hdfs://local_hdfs/user_a/ with the alias hdfs.
  • alluxio-union.<UFS_ALIAS>.option.<KEY> specifies an optional UFS option for a specific sub-UFS. <UFS_ALIAS> denotes the alias of a sub-UFS defined by alluxio-union.<UFS_ALIAS>.uri, and <KEY> denotes the option key to set on that sub-UFS. For example, alluxio-union.s3.option.aws.accessKeyId=MYKEYID sets the aws.accessKeyId option for sub-UFS s3 to the value MYKEYID.
  • alluxio-union.priority.read specifies an ordered list of sub-UFS aliases that sets the read priority of the sub-UFSes. When a file is read from a union UFS, the read is attempted on the sub-UFSes in the order their aliases appear in this list. The first sub-UFS with the file available for reading will be used. All sub-UFS aliases must appear in this list.
  • alluxio-union.collection.create specifies an unordered list of sub-UFS aliases to write new files to. When a new file is written to a union UFS, it will be written to all the sub-UFSes whose aliases are specified in this list.
  • <ALLUXIO_MOUNT_PATH> specifies the Alluxio path where the union UFS is mounted.
  • <UNION_UFS_AUTHORITY> specifies a unique authority of the mounted union UFS. It can be empty or any arbitrary name. This is used to identify a specific union UFS.

For example, the following command can be used to mount an HDFS directory and an S3 bucket under the Alluxio path /union:

./bin/alluxio fs mount \
   --option alluxio-union.hdfs.uri=hdfs://local_hdfs/user_a/ \
   --option alluxio-union.hdfs.option.alluxio.underfs.hdfs.configuration=/opt/hdfs/core-site.xml:/opt/hdfs/hdfs-site.xml \
   --option alluxio-union.s3.uri=s3://mybucket/ \
   --option alluxio-union.s3.option.aws.accessKeyId=MYKEYID \
   --option alluxio-union.s3.option.aws.secretKey=MYSECRETKEY \
   --option alluxio-union.priority.read=hdfs,s3 \
   --option alluxio-union.collection.create=s3  \
   /union union://union_ufs_1/
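
Because every sub-UFS alias must appear in alluxio-union.priority.read, it may be worth checking the alias list before running the mount. The following is a minimal sketch; check_read_priority is a hypothetical helper introduced here, not part of the Alluxio CLI.

```shell
# Hypothetical pre-flight check: verify that every sub-UFS alias is named in the
# comma-separated value intended for alluxio-union.priority.read.
# Usage: check_read_priority <priority_list> <alias>...
check_read_priority() {
  local priority_list="$1"
  shift
  local alias missing=0

  for alias in "$@"; do
    # Wrap both sides in commas so that "hdfs" does not match "hdfs2".
    case ",${priority_list}," in
      *",${alias},"*) ;;
      *)
        echo "alias '${alias}' is missing from alluxio-union.priority.read" >&2
        missing=1
        ;;
    esac
  done

  return "$missing"
}
```

For the HDFS and S3 example, check_read_priority "hdfs,s3" hdfs s3 succeeds, while check_read_priority "hdfs" hdfs s3 reports s3 as missing and returns a non-zero status.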

Reading and Writing Data on Union UFS

Data on a union UFS can be accessed just like data on any other UFS. Files that exist in any of the sub-UFSes will show up in the mount point. For example, if sub-UFS a contains a file at path /a/b and another one at /c, and sub-UFS b contains a file at path /a/d and a file at /c, then the union UFS will expose a directory structure as follows:

+-a
| +-b
| +-d
|
+-c

When file data is read from a union UFS, it will be read from the sub-UFS of the highest priority which contains the file. In the above example, if sub-UFS a has the highest priority, then reading data from /c will end up reading from sub-UFS a. Reading data from /a/d will end up reading from sub-UFS b since only b has the file.
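
The read-priority rule can be modeled with plain local directories standing in for sub-UFSes. This is an illustrative sketch only, not how Alluxio implements reads; union_resolve is a hypothetical helper, and the directories passed to it are assumed to be listed in read-priority order.

```shell
# Illustrative model: print the first location that contains the requested path,
# checking the given root directories in read-priority order.
# Usage: union_resolve <relative_path> <root_dir>...
union_resolve() {
  local rel_path="$1"
  shift
  local root

  for root in "$@"; do
    if [ -e "${root}/${rel_path}" ]; then
      printf '%s\n' "${root}/${rel_path}"
      return 0
    fi
  done

  # No root directory contains the path.
  return 1
}
```

With directories laid out like sub-UFS a and sub-UFS b in the example, resolving c with a listed first selects the copy under a, and resolving a/d selects the copy under b because only b contains it.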

When file data is written to a union UFS, it will be written to all sub-UFSes specified in the alluxio-union.collection.create property. A write request will complete after the file is completed on all target sub-UFSes. If one of the sub-UFSes fails to write the file, the write request will fail.
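
The write rule can be modeled on local directories as well. This is an illustrative sketch only, not how Alluxio implements writes; union_write is a hypothetical helper. It copies a file into every target directory and fails as soon as one copy fails. This page does not specify whether a failed write is cleaned up on the sub-UFSes that already succeeded, so the sketch makes no attempt to model that.

```shell
# Illustrative model: write a file to every target root directory and fail
# the whole request if any single target fails.
# Usage: union_write <source_file> <relative_path> <root_dir>...
union_write() {
  local src="$1" rel_path="$2"
  shift 2
  local root target

  for root in "$@"; do
    target="${root}/${rel_path}"
    # Create parent directories as needed, then copy the file.
    mkdir -p "$(dirname "$target")" || return 1
    cp "$src" "$target" || return 1
  done

  # Reached only when every target accepted the file.
  return 0
}
```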

It may be helpful to use the ./bin/alluxio runTests [--directory <PATH>] command to validate that most of the basic read/write operations will work on the union UFS.

Example: Running Alluxio Locally with Union UFS

First, start the Alluxio servers locally:

./bin/alluxio format
./bin/alluxio-start.sh local

If your ramdisk is not mounted, likely because this is the first time you are running Alluxio, you may need to start Alluxio with the SudoMount option.

./bin/alluxio-start.sh local SudoMount

This will start one Alluxio master and one Alluxio worker locally. You can see the master UI at http://localhost:19999.

Run a simple example program:

./bin/alluxio runTests

If the test fails with permission errors, make sure that the current user (${USER}) has read/write access to the local directory mounted to Alluxio. By default, the login user is the current user of the host OS. To change the user, set the value of alluxio.security.login.username in conf/alluxio-site.properties to the desired username.

After this succeeds, create two local directories to be used as sub-UFSes. In this example we will use /tmp/ufs1 and /tmp/ufs2:

mkdir /tmp/ufs1
mkdir /tmp/ufs2

Mount the union UFS with the local sub-UFSes using the following command:

./bin/alluxio fs mount \
  --option alluxio-union.a.uri=/tmp/ufs1/ \
  --option alluxio-union.b.uri=/tmp/ufs2/ \
  --option alluxio-union.priority.read=a,b \
  --option alluxio-union.collection.create=a \
  /union/ union://test/

To test the mounted union UFS, run the following command to copy a new file to the mount point and persist it to the UFS:

./bin/alluxio fs -Dalluxio.user.file.writetype.default=CACHE_THROUGH copyFromLocal LICENSE /union

After the command finishes, you should find the LICENSE file in the local directory /tmp/ufs1, but not in /tmp/ufs2.
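
That outcome can be checked from the shell. The following is a minimal sketch; check_create_target is a hypothetical helper introduced here, and it only inspects the local directories rather than querying Alluxio.

```shell
# Hypothetical check: succeed only if the file exists in the directory that
# should have received it and is absent from the directory that should not.
# Usage: check_create_target <file_name> <written_dir> <skipped_dir>
check_create_target() {
  local name="$1" written_dir="$2" skipped_dir="$3"
  [ -f "${written_dir}/${name}" ] && [ ! -e "${skipped_dir}/${name}" ]
}
```

After the copy above, check_create_target LICENSE /tmp/ufs1 /tmp/ufs2 should return success, because alluxio-union.collection.create lists only the alias a.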

Stop Alluxio by running:

./bin/alluxio-stop.sh local