Java API

Slack Docker Pulls GitHub edit source

Applications primarily interact with Alluxio through its File System API. Java users can either use the Alluxio Java Client, or the Hadoop-Compatible Java Client, which wraps the Alluxio Java Client to implement the Hadoop API.

Another common option is through Alluxio S3 API. Users can interact with Alluxio using the same S3 clients used for AWS S3 operations. This makes it easy to change existing S3 workloads to use Alluxio.

Alluxio also provides a POSIX API after mounting Alluxio as a local FUSE volume. Right now Alluxio POSIX API mainly targets the ML/AI workloads (especially read heavy workloads).

By setting up an Alluxio Proxy, users can also interact with Alluxio through REST API for operations S3 API doesn’t support. This is mainly for admin actions not supported by S3 API. For example, mount and unmount operations. The REST API is currently used for the Go and Python language bindings.

Java Client

Alluxio provides access to data through a file system interface. Files in Alluxio offer write-once semantics: they become immutable after they have been written in their entirety and cannot be read before being completed. Alluxio provides users with two different File System APIs to access the same file system:

  1. Alluxio file system API and
  2. Hadoop compatible file system API

The Alluxio file system API provides full functionality, while the Hadoop compatible API gives users the flexibility of leveraging Alluxio without having to modify existing code written using Hadoop’s API with limitations.

Configuring Dependency

To build your Java application to access Alluxio File System using maven, include the artifact alluxio-shaded-client in your pom.xml like the following:

<dependency>
  <groupId>org.alluxio</groupId>
  <artifactId>alluxio-shaded-client</artifactId>
  <version>2.9.5</version>
</dependency>

Available since 2.0.1, this artifact is self-contained by including all its transitive dependencies in a shaded form to prevent potential dependency conflicts. This artifact is recommended generally for a project to use Alluxio client.

Alternatively, an application can also depend on the alluxio-core-client-fs artifact for the Alluxio file system interface or the alluxio-core-client-hdfs artifact for the Hadoop compatible file system interface of Alluxio. These two artifacts do not include transitive dependencies and therefore much smaller in size. They also both include in alluxio-shaded-client artifact.

Alluxio Java API

This section introduces the basic operations to use the Alluxio FileSystem interface. Read the javadoc for the complete list of API methods. All resources with the Alluxio Java API are specified through an AlluxioURI which represents the path to the resource.

Getting a File System Client

To obtain an Alluxio File System client in Java code, use FileSystem.Factory#get():

FileSystem fs = FileSystem.Factory.get();

Creating a File

All metadata operations as well as opening a file for reading or creating a file for writing are executed through the FileSystem object. Since Alluxio files are immutable once written, the idiomatic way to create files is to use FileSystem#createFile(AlluxioURI), which returns a stream object that can be used to write the file. For example:

Note: there are some file path name limitation when creating files through Alluxio. Please check Alluxio limitations

FileSystem fs = FileSystem.Factory.get();
AlluxioURI path = new AlluxioURI("/myFile");
// Create a file and get its output stream
FileOutStream out = fs.createFile(path);
// Write data
out.write(...);
// Close and complete file
out.close();

Accessing an existing file in Alluxio

All operations on existing files or directories require the user to specify the AlluxioURI. An AlluxioURI can be used to perform various operations, such as modifying the file metadata (i.e. TTL or pin state) or getting an input stream to read the file.

Reading Data

Use FileSystem#openFile(AlluxioURI) to obtain a stream object that can be used to read a file. For example:

FileSystem fs = FileSystem.Factory.get();
AlluxioURI path = new AlluxioURI("/myFile");
// Open the file for reading
FileInStream in = fs.openFile(path);
// Read data
in.read(...);
// Close file relinquishing the lock
in.close();

Specifying Operation Options

For all FileSystem operations, an additional options field may be specified, which allows users to specify non-default settings for the operation. For example:

FileSystem fs = FileSystem.Factory.get();
AlluxioURI path = new AlluxioURI("/myFile");
// Generate options to set a custom blocksize of 64 MB
CreateFilePOptions options = CreateFilePOptions
                              .newBuilder()
                              .setBlockSizeBytes(64 * Constants.MB)
                              .build();
FileOutStream out = fs.createFile(path, options);

Programmatically Modifying Configuration

Alluxio configuration can be set through alluxio-site.properties, but these properties apply to all instances of Alluxio that read from the file. If fine-grained configuration management is required, pass in a customized configuration object when creating the FileSystem object. The generated FileSystem object will have modified configuration properties, independent of any other FileSystem clients.

FileSystem normalFs = FileSystem.Factory.get();
AlluxioURI normalPath = new AlluxioURI("/normalFile");
// Create a file with default properties
FileOutStream normalOut = normalFs.createFile(normalPath);
...
normalOut.close();

// Create a file system with custom configuration
InstancedConfiguration conf = InstancedConfiguration.defaults();
conf.set(PropertyKey.SECURITY_LOGIN_USERNAME, "alice");
FileSystem customizedFs = FileSystem.Factory.create(conf);
AlluxioURI customizedPath = new AlluxioURI("/customizedFile");
// The newly created file will be created under the username "alice"
FileOutStream customizedOut = customizedFs.createFile(customizedPath);
...
customizedOut.close();

// normalFs can still be used as a FileSystem client with the default username.
// Likewise, using customizedFs will use the username "alice".

IO Options

Alluxio uses two different storage types: Alluxio managed storage and under storage. Alluxio managed storage is the memory, SSD, and/or HDD allocated to Alluxio workers. Under storage is the storage resource managed by the underlying storage system, such as S3, Swift or HDFS. Users can specify the interaction with Alluxio managed storage and under storage through ReadType and WriteType. ReadType specifies the data read behavior when reading a file. WriteType specifies the data write behavior when writing a new file, i.e. whether the data should be written in Alluxio Storage.

Below is a table of the expected behaviors of ReadType. Reads will always prefer Alluxio storage over the under storage.

Read TypeBehavior
CACHE_PROMOTE Data is moved to the highest tier in the worker where the data was read. If the data was not in the Alluxio storage of the local worker, a replica will be added to the local Alluxio worker. If there is no local Alluxio worker, a replica will be added to a remote Alluxio worker if the data was fetched from the under storage system.
CACHE If the data was not in the Alluxio storage of the local worker, a replica will be added to the local Alluxio worker. If there is no local Alluxio worker, a replica will be added to a remote Alluxio worker if the data was fetched from the under storage system.
NO_CACHE Data is read without storing a replica in Alluxio. Note that a replica may already exist in Alluxio.

Below is a table of the expected behaviors of WriteType.

Write TypeBehavior
CACHE_THROUGH Data is written synchronously to a Alluxio worker and the under storage system.
MUST_CACHE Data is written synchronously to a Alluxio worker. No data will be written to the under storage.
THROUGH Data is written synchronously to the under storage. No data will be written to Alluxio.
ASYNC_THROUGH Data is written synchronously to a Alluxio worker and asynchronously to the under storage system. This is the default write type.

Location policy

Alluxio provides location policy to choose which workers to store the blocks of a file.

Using Alluxio’s Java API, users can set the policy in CreateFilePOptions for writing files and OpenFilePOptions for reading files into Alluxio.

Users can override the default policy class in the configuration file. Two configuration properties are available:

  1. alluxio.user.ufs.block.read.location.policy This controls which worker is selected to cache a block that is not currently cached in Alluxio and will be read from UFS.
  2. alluxio.user.block.write.location.policy.class This controls which worker is selected to cache a block generated from the client, and possibly persist it to the UFS.

The built-in policies include:

  • LocalFirstPolicy

    This is the default policy.

    A policy that returns the local worker first, and if the local worker doesn’t exist or doesn’t have enough capacity, will select the nearest worker from the active workers list with sufficient capacity.

    • If no worker meets capacity criteria, will randomly select a worker from the list of all workers.
  • LocalFirstAvoidEvictionPolicy

    This is the same as LocalFirstPolicy with the following addition:

    The property alluxio.user.block.avoid.eviction.policy.reserved.size.bytes is used as buffer space on each worker when calculating available space to store each block.

    • If no worker meets availability criteria, will randomly select a worker from the list of all workers.
  • MostAvailableFirstPolicy

    A policy that returns the worker with the most available bytes.

    • If no worker meets availability criteria, will randomly select a worker from the list of all workers.
  • RoundRobinPolicy

    A policy that chooses the worker for the next block in a round-robin manner and skips workers that do not have enough space.

    • If no worker meets availability criteria, will randomly select a worker from the list of all workers.
  • SpecificHostPolicy

    Always returns a worker with the hostname specified by property alluxio.worker.hostname.

    • If no value is set, will randomly select a worker from the list of all workers.
  • DeterministicHashPolicy

    This policy maps the blockId to several deterministic Alluxio workers. The number of workers a block can be mapped to can be specified by alluxio.user.ufs.block.read.location.policy.deterministic.hash.shards. The default is 1. It skips the workers that do not have enough capacity to hold the block.

    This policy is useful for limiting the amount of replication that occurs when reading blocks from the UFS with high concurrency. With 30 workers and 100 remote clients reading the same block concurrently, the replication level for the block would get close to 30 as each worker reads and caches the block for one or more clients. If the clients use DeterministicHashPolicy with 3 shards, the 100 clients will split their reads between just 3 workers, so that the replication level for the block will be only 3 when the data is first loaded.

    Note that the hash function relies on the number of workers in the cluster, so if the number of workers changes, the workers chosen by the policy for a given block will likely change.

  • CapacityBaseRandomPolicy

    This policy chooses a worker with a probability equal to the worker’s normalized capacity, i.e. the ratio of its capacity over the total capacity of all workers. It randomly distributes workload based on the worker capacities so that larger workers get more requests.

    This policy is useful for clusters where workers have heterogeneous storage capabilities, but the distribution of workload does not match that of storage. For example, in a cluster of 5 workers, one of the workers has only half the capacity of the others, however, it is co-located with a client that generates twice the amount of read requests than others. In this scenario, the default LocalFirstPolicy will quickly cause the smaller worker to go out of space, while the larger workers has plenty of storage left unused. Although the client will retry with a different worker when the local worker is out of space, this will increase I/O latency.

    Note that the randomness is based on capacity instead of availability, because in the long run, all workers will be filled up and have availability close to 0, which would cause this policy to degrade to a uniformly distributed random policy.

  • CapacityBasedDeterministicHashPolicy

    This policy is a combination of DeterministicHashPolicy and CapacityBaseRandomPolicy. It ensures each block is always assigned to the same set of workers. Additionally, provided that block requests follow a uniform distribution, they are assigned to each worker with a probability equal to the worker’s normalized capacity. The number of workers that a block can be assigned to can be specified by alluxio.user.ufs.block.read.location.policy.deterministic.hash.shards.

    This policy is useful when CapacityBaseRandomPolicy causes too many replicas across multiple workers, and one wish to limit the number of replication, in a way similar to DeterministicHashPolicy.

    Note that this is not a random policy in itself. The outcome distribution of this policy is dependent on the distribution of the block requests. When the distribution of block requests is highly skewed, the workers chosen will not follow a distribution based on workers’ normalized capacities.

Alluxio supports custom policies, so you can also develop your own policy appropriate for your workload by implementing the interface alluxio.client.block.policy.BlockLocationPolicy. Note that a default policy must have a constructor which takes alluxio.conf.AlluxioConfiguration. To use the ASYNC_THROUGH write type, all the blocks of a file must be written to the same worker.

Write Tier

Alluxio allows a client to select a tier preference when writing blocks to a local worker. Currently this policy preference exists only for local workers, not remote workers; remote workers will write blocks to the highest tier.

By default, data is written to the top tier. Users can modify the default setting through the alluxio.user.file.write.tier.default property or override it through an option to the FileSystem#createFile(AlluxioURI, CreateFilePOptions) API call.

Javadoc

For additional API information, please refer to the Alluxio javadocs.

Hadoop-Compatible Java Client

On top of the Alluxio file system, Alluxio also has a convenience class alluxio.hadoop.FileSystem that provides applications with a Hadoop compatible FileSystem interface. This client translates Hadoop file operations to Alluxio file system operations, allowing users to reuse existing code written for Hadoop without modification. Read its javadoc for more details.

Example

Here is a piece of example code to read ORC files from the Alluxio file system using the Hadoop interface.

// create a new hadoop configuration
org.apache.hadoop.conf.Configuration conf = new org.apache.hadoop.conf.Configuration();
// enforce hadoop client to bind alluxio.hadoop.FileSystem for URIs like alluxio://
conf.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem");
conf.set("fs.AbstractFileSystem.alluxio.impl", "alluxio.hadoop.AlluxioFileSystem");

// Now alluxio address can be used like any other hadoop-compatible file system URIs
org.apache.orc.OrcFile.ReaderOptions options = new org.apache.orc.OrcFile.ReaderOptions(conf)
org.apache.orc.Reader orc = org.apache.orc.OrcFile.createReader(
    new Path("alluxio://localhost:19998/path/file.orc"), options);

Examples in Source Code

There are several example Java programs. They are: