File System API
Applications primarily interact with Alluxio through its File System API. Java users can either use the Alluxio Java Client, or the Hadoop-Compatible Java Client, which wraps the Alluxio Java Client to implement the Hadoop API.
Alluxio also provides a POSIX API after mounting Alluxio as a local FUSE volume.
By setting up an Alluxio Proxy, users can also interact with Alluxio through a REST API similar to the File System API. The REST API is currently used for the Go and Python language bindings.
A fourth option is to interact with Alluxio through its S3 API. Users can interact using the same S3 clients used for AWS S3 operations. This makes it easy to change existing S3 workloads to use Alluxio.
Java Client
Alluxio provides access to data through a file system interface. Files in Alluxio offer write-once semantics: they become immutable after they have been written in their entirety and cannot be read before being completed. Alluxio provides users two different File System APIs to access the same file system:
The Alluxio file system API provides full functionality, while the Hadoop compatible API gives users the flexibility of leveraging Alluxio without having to modify existing code written using Hadoop’s API with limitations.
Configuring Dependency
To build your Java application to access Alluxio File System using maven,
include the artifact alluxio-shaded-client
in your pom.xml
like the following:
<dependency>
<groupId>org.alluxio</groupId>
<artifactId>alluxio-shaded-client</artifactId>
<version>2.7.2</version>
</dependency>
Available since 2.0.1
, this artifact is self-contained by including all its
transitive dependencies in a shaded form to prevent potential dependency conflicts.
This artifact is recommended generally for a project to use Alluxio client.
Alternatively, an application can also depend on the alluxio-core-client-fs
artifact for
the Alluxio file system interface
or the alluxio-core-client-hdfs
artifact for the
Hadoop compatible file system interface of Alluxio.
These two artifacts do not include transitive dependencies and therefore much smaller in size,
also both included in alluxio-shaded-client
artifact.
Alluxio Java API
This section introduces the basic operations to use the Alluxio FileSystem
interface.
Read the javadoc
for the complete list of API methods.
All resources with the Alluxio Java API are specified through an
AlluxioURI
which represents the path to the resource.
Getting a File System Client
To obtain an Alluxio File System client in Java code, use
FileSystem.Factory#get()
:
FileSystem fs = FileSystem.Factory.get();
Creating a File
All metadata operations as well as opening a file for reading or creating a file for writing are
executed through the FileSystem
object. Since Alluxio files are immutable once written, the
idiomatic way to create files is to use
FileSystem#createFile(AlluxioURI)
,
which returns a stream object that can be used to write the file. For example:
FileSystem fs = FileSystem.Factory.get();
AlluxioURI path = new AlluxioURI("/myFile");
// Create a file and get its output stream
FileOutStream out = fs.createFile(path);
// Write data
out.write(...);
// Close and complete file
out.close();
Accessing an existing file in Alluxio
All operations on existing files or directories require the user to specify the AlluxioURI
.
An AlluxioURI
can be used to perform various operations, such as modifying the file
metadata (i.e. TTL or pin state) or getting an input stream to read the file.
Reading Data
Use FileSystem#openFile(AlluxioURI)
to obtain a stream object that can be used to read a file. For example:
FileSystem fs = FileSystem.Factory.get();
AlluxioURI path = new AlluxioURI("/myFile");
// Open the file for reading
FileInStream in = fs.openFile(path);
// Read data
in.read(...);
// Close file relinquishing the lock
in.close();
Specifying Operation Options
For all FileSystem
operations, an additional options
field may be specified, which allows
users to specify non-default settings for the operation. For example:
FileSystem fs = FileSystem.Factory.get();
AlluxioURI path = new AlluxioURI("/myFile");
// Generate options to set a custom blocksize of 64 MB
CreateFilePOptions options = CreateFilePOptions
.newBuilder()
.setBlockSizeBytes(64 * Constants.MB)
.build();
FileOutStream out = fs.createFile(path, options);
Programmatically Modifying Configuration
Alluxio configuration can be set through alluxio-site.properties
, but these properties apply to
all instances of Alluxio that read from the file. If fine-grained configuration management is
required, pass in a customized configuration object when creating the FileSystem
object.
The generated FileSystem
object will have modified configuration properties, independent of any
other FileSystem
clients.
FileSystem normalFs = FileSystem.Factory.get();
AlluxioURI normalPath = new AlluxioURI("/normalFile");
// Create a file with default properties
FileOutStream normalOut = normalFs.createFile(normalPath);
...
normalOut.close();
// Create a file system with custom configuration
InstancedConfiguration conf = InstancedConfiguration.defaults();
conf.set(PropertyKey.SECURITY_LOGIN_USERNAME, "alice");
FileSystem customizedFs = FileSystem.Factory.create(conf);
AlluxioURI normalPath = new AlluxioURI("/customizedFile");
// The newly created file will be created under the username "alice"
FileOutStream customizedOut = customizedFs.createFile(customizedPath);
...
customizedOut.close();
// normalFs can still be used as a FileSystem client with the default username.
// Likewise, using customizedFs will use the username "alice".
IO Options
Alluxio uses two different storage types: Alluxio managed storage and under storage. Alluxio managed
storage is the memory, SSD, and/or HDD allocated to Alluxio workers. Under storage is the storage
resource managed by the underlying storage system, such as S3, Swift or HDFS. Users can specify the
interaction with Alluxio managed storage and under storage through ReadType
and WriteType
.
ReadType
specifies the data read behavior when reading a file. WriteType
specifies the data
write behavior when writing a new file, i.e. whether the data should be written in Alluxio Storage.
Below is a table of the expected behaviors of ReadType
. Reads will always prefer Alluxio storage
over the under storage.
Read Type | Behavior |
---|---|
CACHE_PROMOTE | Data is moved to the highest tier in the worker where the data was read. If the data was not in the Alluxio storage of the local worker, a replica will be added to the local Alluxio worker. If there is no local Alluxio worker, a replica will be added to a remote Alluxio worker if the data was fetched from the under storage system. |
CACHE | If the data was not in the Alluxio storage of the local worker, a replica will be added to the local Alluxio worker. If there is no local Alluxio worker, a replica will be added to a remote Alluxio worker if the data was fetched from the under storage system. |
NO_CACHE | Data is read without storing a replica in Alluxio. Note that a replica may already exist in Alluxio. |
Below is a table of the expected behaviors of WriteType
.
Write Type | Behavior |
---|---|
CACHE_THROUGH | Data is written synchronously to a Alluxio worker and the under storage system. |
MUST_CACHE | Data is written synchronously to a Alluxio worker. No data will be written to the under storage. This is the default write type. |
THROUGH | Data is written synchronously to the under storage. No data will be written to Alluxio. |
ASYNC_THROUGH | Data is written synchronously to a Alluxio worker and asynchronously to the under storage system. Experimental. |
Location policy
Alluxio provides location policy to choose which workers to store the blocks of a file.
Using Alluxio’s Java API, users can set the policy in CreateFilePOptions
for writing files and
OpenFilePOptions
for reading files into Alluxio.
Users can override the default policy class in the
configuration file at property
alluxio.user.block.write.location.policy.class
. The built-in policies include:
-
This is the default policy.
A policy that returns the local worker first, and if the local worker doesn’t exist or have enough availability, will select the nearest worker from the active workers list with sufficient availability.
The definition of ‘nearest worker’ is based on ‘Tiered Locality’.
The calculation of which worker gets selected is done for each block write.
- If no worker meets availability criteria, will randomly select a worker from the list of all workers.
-
This is the same as
LocalFirstPolicy
with the following addition:A policy that returns the local worker first, and if the local worker doesn’t exist or have enough availability, will select the nearest worker from the active workers list with sufficient availability.
The calculation of which worker gets selected is done for each block write.
The PropertyKey
USER_FILE_WRITE_AVOID_EVICTION_POLICY_RESERVED_BYTES
(alluxio.user.block.avoid.eviction.policy.reserved.size.bytes) is used as buffer space on each worker when calculating available space to store each block.- If no worker meets availability criteria, will randomly select a worker from the list of all workers.
-
A policy that returns the worker with the most available bytes.
- If no worker meets availability criteria, will randomly select a worker from the list of all workers.
-
A policy that chooses the worker for the next block in a round-robin manner and skips workers that do not have enough space.
- If no worker meets availability criteria, will randomly select a worker from the list of all workers.
-
Always returns a worker with the hostname specified by PropertyKey.WORKER_HOSTNAME (alluxio.worker.hostname).
- If no value is set, will randomly select a worker from the list of all workers.
-
This policy maps the blockId to several deterministic Alluxio workers. The number of workers a block can be mapped to can be passed through the constructor. The default is 1. It skips the workers that do not have enough capacity to hold the block.
This policy is useful for limiting the amount of replication that occurs when reading blocks from the UFS with high concurrency. With 30 workers and 100 remote clients reading the same block concurrently, the replication level for the block would get close to 30 as each worker reads and caches the block for one or more clients. If the clients use DeterministicHashPolicy with 3 shards, the 100 clients will split their reads between just 3 workers, so that the replication level for the block will be only 3 when the data is first loaded.
Note that the hash function relies on the number of workers in the cluster, so if the number of workers changes, the workers chosen by the policy for a given block will likely change.
Alluxio supports custom policies, so you can also develop your own policy appropriate for your
workload by implementing the interface alluxio.client.block.policy.BlockLocationPolicy
. Note that a
default policy must have a constructor which takes alluxio.conf.AlluxioConfiguration
.
To use the ASYNC_THROUGH
write type, all the blocks of a file must be written to the same worker.
Write Tier
Alluxio allows a client to select a tier preference when writing blocks to a local worker. Currently this policy preference exists only for local workers, not remote workers; remote workers will write blocks to the highest tier.
By default, data is written to the top tier. Users can modify the default setting through the
alluxio.user.file.write.tier.default
configuration
property or override it through an option to the
FileSystem#createFile(AlluxioURI, CreateFilePOptions)
API call.
Javadoc
For additional API information, please refer to the Alluxio javadocs.
Hadoop-Compatible Java Client
On top of the Alluxio file system, Alluxio also has a convenience class
alluxio.hadoop.FileSystem
that provides applications a
Hadoop compatible FileSystem
interface.
This client translates Hadoop file operations to Alluxio file system operations,
allowing users to reuse existing code written for Hadoop without modification.
Read its javadoc
for more details.
Example
Here is a piece of example code to read ORC files from the Alluxio file system using the Hadoop interface.
// create a new hadoop configuration
org.apache.hadoop.conf.Configuration conf = new org.apache.hadoop.conf.Configuration();
// enforce hadoop client to bind alluxio.hadoop.FileSystem for URIs like alluxio://
conf.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem");
conf.set("fs.AbstractFileSystem.alluxio.impl", "alluxio.hadoop.AlluxioFileSystem");
// Now alluxio address can be used like any other hadoop-compatible file system URIs
org.apache.orc.OrcFile.ReaderOptions options = new org.apache.orc.OrcFile.ReaderOptions(conf)
org.apache.orc.Reader orc = org.apache.orc.OrcFile.createReader(
new Path("alluxio://localhost:19998/path/file.orc"), options);
Rest API
For portability with other languages, the Alluxio API is also accessible via an HTTP proxy in the form of a REST API. Alluxio’s Python and Go clients rely on this REST API to talk to Alluxio.
The REST API documentation
is generated as part of the Alluxio build and accessible through
${ALLUXIO_HOME}/core/server/proxy/target/miredot/index.html
. The main difference between
the REST API and the Alluxio Java API is in how streams are represented. While the Alluxio Java API
can use in-memory streams, the REST API decouples the stream creation and access (see the
create
and open
REST API methods and the streams
resource endpoints for details).
The HTTP proxy is a standalone server that can be started using
${ALLUXIO_HOME}/bin/alluxio-start.sh proxy
and stopped using ${ALLUXIO_HOME}/bin/alluxio-stop.sh
proxy
. By default, the REST API is available on port 39999.
There are performance implications of using the HTTP proxy. In particular, using the proxy requires an extra network hop to perform filesystem operations. For optimal performance, it is recommended to run the proxy server and an Alluxio worker on each compute node.
Python
Alluxio has a Python Client for interacting with Alluxio through its REST API. The Python client exposes an API similar to the Alluxio Java API. See the doc for detailed documentation about all available methods. See the example on how to perform basic file system operations in Alluxio.
The Python client requires an Alluxio proxy that exposes the REST API to function.
Install Python Client Library
$ pip install alluxio
Example Usage
The following program includes examples of how to create directory, download, upload, check existence for, and list status for files in Alluxio.
This example can also be found here in the Python package’s repository.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
import sys
import alluxio
from alluxio import option
def colorize(code):
def _(text, bold=False):
c = code
if bold:
c = '1;%s' % c
return '\033[%sm%s\033[0m' % (c, text)
return _
green = colorize('32')
def info(s):
print green(s)
def pretty_json(obj):
return json.dumps(obj, indent=2)
def main():
py_test_root_dir = '/py-test-dir'
py_test_nested_dir = '/py-test-dir/nested'
py_test = py_test_nested_dir + '/py-test'
py_test_renamed = py_test_root_dir + '/py-test-renamed'
client = alluxio.Client('localhost', 39999)
info("creating directory %s" % py_test_nested_dir)
opt = option.CreateDirectory(recursive=True)
client.create_directory(py_test_nested_dir, opt)
info("done")
info("writing to %s" % py_test)
with client.open(py_test, 'w') as f:
f.write('Alluxio works with Python!\n')
with open(sys.argv[0]) as this_file:
f.write(this_file)
info("done")
info("getting status of %s" % py_test)
stat = client.get_status(py_test)
print pretty_json(stat.json())
info("done")
info("renaming %s to %s" % (py_test, py_test_renamed))
client.rename(py_test, py_test_renamed)
info("done")
info("getting status of %s" % py_test_renamed)
stat = client.get_status(py_test_renamed)
print pretty_json(stat.json())
info("done")
info("reading %s" % py_test_renamed)
with client.open(py_test_renamed, 'r') as f:
print f.read()
info("done")
info("listing status of paths under /")
root_stats = client.list_status('/')
for stat in root_stats:
print pretty_json(stat.json())
info("done")
info("deleting %s" % py_test_root_dir)
opt = option.Delete(recursive=True)
client.delete(py_test_root_dir, opt)
info("done")
info("asserting that %s is deleted" % py_test_root_dir)
assert not client.exists(py_test_root_dir)
info("done")
if __name__ == '__main__':
main()
Go
Alluxio has a Go Client for interacting with Alluxio through its REST API. The Go client exposes an API similar to the Alluxio Java API. See the godoc for detailed documentation about all available methods. The godoc includes examples of how to download, upload, check existence for, and list status for files in Alluxio.
The Go client requires an Alluxio proxy that exposes the REST API to function.
Install Go Client Library
$ go get -d github.com/Alluxio/alluxio-go
Example Usage
If there is no Alluxio proxy running locally, replace “localhost” below with a hostname of a proxy.
package main
import (
"fmt"
"io/ioutil"
"log"
"strings"
"time"
alluxio "github.com/Alluxio/alluxio-go"
"github.com/Alluxio/alluxio-go/option"
)
func write(fs *alluxio.Client, path, s string) error {
id, err := fs.CreateFile(path, &option.CreateFile{})
if err != nil {
return err
}
defer fs.Close(id)
_, err = fs.Write(id, strings.NewReader(s))
return err
}
func read(fs *alluxio.Client, path string) (string, error) {
id, err := fs.OpenFile(path, &option.OpenFile{})
if err != nil {
return "", err
}
defer fs.Close(id)
r, err := fs.Read(id)
if err != nil {
return "", err
}
defer r.Close()
content, err := ioutil.ReadAll(r)
if err != nil {
return "", err
}
return string(content), err
}
func main() {
fs := alluxio.NewClient("localhost", 39999, 10*time.Second)
path := "/test_path"
exists, err := fs.Exists(path, &option.Exists{})
if err != nil {
log.Fatal(err)
}
if exists {
if err := fs.Delete(path, &option.Delete{}); err != nil {
log.Fatal(err)
}
}
if err := write(fs, path, "Success"); err != nil {
log.Fatal(err)
}
content, err := read(fs, path)
if err != nil {
log.Fatal(err)
}
fmt.Printf("Result: %v\n", content)
}