Amazon AWS S3
- Basic Setup
- Running Alluxio Locally with S3
- Advanced Setup
- Configure S3 Region
- Advanced Credentials Setup
- Enabling Server Side Encryption
- Accessing S3 through a proxy
- Using a non-Amazon service provider
- Connecting to Oracle Cloud Infrastructure (OCI) object storage
- Using v2 S3 Signatures
- [Experimental] S3 streaming upload
- Tuning for High Concurrency
- Identity and Access Control of S3 Objects
This guide describes how to configure Amazon S3 as Alluxio’s under storage system.
The Alluxio binaries must be available on the machine.
In preparation for using S3 with Alluxio, create a new bucket or use an existing bucket. You
should also note the directory you want to use in that bucket, either by creating a new directory in
the bucket, or using an existing one. For the purposes of this guide, the S3 bucket is called
S3_BUCKET, and the directory in that bucket is called S3_DIRECTORY.
Alluxio unifies access to different storage systems through the unified namespace feature. An S3 location can be either mounted at the root of the Alluxio namespace or at a nested directory.
Root Mount Point
Create conf/alluxio-site.properties if it does not exist.
$ cp conf/alluxio-site.properties.template conf/alluxio-site.properties
Configure Alluxio to use S3 as its under storage system by modifying conf/alluxio-site.properties.
Specify an existing S3 bucket and directory as the under storage system by modifying
conf/alluxio-site.properties to include:
Note that if you want to mount the whole S3 bucket, add a trailing slash after the bucket name (for example, s3://<S3_BUCKET>/).
Specify the AWS credentials for S3 access by setting s3a.accessKeyId and s3a.secretKey in conf/alluxio-site.properties:
s3a.accessKeyId=<S3 ACCESS KEY>
s3a.secretKey=<S3 SECRET KEY>
For other methods of setting AWS credentials, see the credentials section in Advanced Setup.
After these changes, Alluxio should be configured to work with S3 as its under storage system, and you can try Running Alluxio Locally with S3.
An S3 location can be mounted at a nested directory in the Alluxio namespace to have unified access to multiple under storage systems. Alluxio’s Command Line Interface can be used for this purpose.
$ ./bin/alluxio fs mount \
  --option s3a.accessKeyId=<AWS_ACCESS_KEY_ID> \
  --option s3a.secretKey=<AWS_SECRET_KEY_ID> \
  /mnt/s3 s3://<S3_BUCKET>/<S3_DIRECTORY>
Running Alluxio Locally with S3
Start up Alluxio locally to see that everything works.
$ ./bin/alluxio format
$ ./bin/alluxio-start.sh local
This should start an Alluxio master and an Alluxio worker. You can see the master UI at http://localhost:19999.
Before running an example program, make sure the root mount point set in alluxio-site.properties is a valid path in the under storage (UFS).
Also make sure the user running the example program has write permissions to the Alluxio file system.
Run a simple example program:
$ ./bin/alluxio runTests
Visit your S3 directory
s3://<S3_BUCKET>/<S3_DIRECTORY> to verify the files
and directories created by Alluxio exist. For this test, you should see files named like:
To stop Alluxio, you can run:
$ ./bin/alluxio-stop.sh local
Configure S3 Region
Configure the S3 region when accessing S3 buckets to improve performance.
Otherwise, global S3 bucket access is enabled, which introduces extra requests.
S3 region can be set in
Advanced Credentials Setup
You can specify credentials in different ways, from highest to lowest priority:
- s3a.accessKeyId and s3a.secretKey specified as mount options
- s3a.accessKeyId and s3a.secretKey specified as Java system properties
- Environment variables AWS_ACCESS_KEY (either is acceptable) and AWS_SECRET_KEY (either is acceptable) on the Alluxio servers
- Profile file containing credentials at
- AWS Instance profile credentials, if you are using an EC2 instance
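The priority order above amounts to a first-match lookup across an ordered list of sources. The following Python sketch illustrates that idea only; it is not Alluxio's implementation. The helper names (resolve_credentials, from_mapping) and the sample values are hypothetical, and only the first three sources are modeled.

```python
from typing import Callable, Dict, List, Optional, Tuple

Credentials = Tuple[str, str]  # (access key, secret key)
Provider = Callable[[], Optional[Credentials]]


def from_mapping(mapping: Dict[str, str], access: str, secret: str) -> Provider:
    """Build a provider that yields credentials only if both keys are present."""
    def provider() -> Optional[Credentials]:
        if access in mapping and secret in mapping:
            return mapping[access], mapping[secret]
        return None
    return provider


def resolve_credentials(chain: List[Tuple[str, Provider]]) -> Tuple[str, Credentials]:
    """Return the first source, in priority order, that yields credentials."""
    for name, provider in chain:
        creds = provider()
        if creds is not None:
            return name, creds
    raise LookupError("no credentials found in any source")


# Hypothetical inputs standing in for the real sources.
mount_options: Dict[str, str] = {}
system_properties = {"s3a.accessKeyId": "PROP_KEY", "s3a.secretKey": "PROP_SECRET"}
environment = {"AWS_ACCESS_KEY": "ENV_KEY", "AWS_SECRET_KEY": "ENV_SECRET"}

chain = [
    ("mount options", from_mapping(mount_options, "s3a.accessKeyId", "s3a.secretKey")),
    ("system properties", from_mapping(system_properties, "s3a.accessKeyId", "s3a.secretKey")),
    ("environment", from_mapping(environment, "AWS_ACCESS_KEY", "AWS_SECRET_KEY")),
]

source, creds = resolve_credentials(chain)
print(source, creds[0])
```

Because no mount options are set in this example, the system properties win over the environment variables.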
When using an AWS Instance profile as the credentials’ provider:
- Create an IAM Role with access to the mounted bucket
- Create an Instance profile as a container for the defined IAM Role
- Launch an EC2 instance using the created profile
Note that the IAM role will need access to both the files in the bucket as well as the bucket itself
in order to determine the bucket’s owner. Automatically assigning an owner to the bucket can be
avoided by setting the property
See Amazon’s documentation for more details.
Enabling Server Side Encryption
You may encrypt your data stored in S3. Encryption applies only to data at rest in S3; the data is transferred in decrypted form when read by clients. Note that enabling this also enables HTTPS to comply with requirements for reading and writing objects.
Enable this feature by configuring
By default, a request directed at the bucket named “mybucket” will be sent to the host name “mybucket.s3.amazonaws.com”. You can disable DNS-style bucket addressing and use path-style data access instead, for example “http://s3.amazonaws.com/mybucket”, by setting the following configuration:
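As an illustration of the two addressing styles (this is not the Alluxio configuration itself), the Python sketch below builds both URL forms for a bucket. The function names are hypothetical, and the default endpoint is the one named in this section.

```python
def dns_style_url(bucket: str, endpoint: str = "s3.amazonaws.com") -> str:
    """DNS-style addressing: the bucket name is part of the host name."""
    return f"http://{bucket}.{endpoint}"


def path_style_url(bucket: str, endpoint: str = "s3.amazonaws.com") -> str:
    """Path-style addressing: the bucket name is the first path segment."""
    return f"http://{endpoint}/{bucket}"


print(dns_style_url("mybucket"))
print(path_style_url("mybucket"))
```

Path-style access is typically what S3-compatible services need when they cannot resolve one host name per bucket.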
Accessing S3 through a proxy
To communicate with S3 through a proxy, modify
conf/alluxio-site.properties to include:
<PROXY_HOST> and <PROXY_PORT> should be replaced by the host and port of your proxy.
Using a non-Amazon service provider
To use an S3 service provider other than “s3.amazonaws.com”, modify conf/alluxio-site.properties to include:
Replace <S3_ENDPOINT> with the hostname and port of your S3 service, e.g.,
http://localhost:9000. Only use this parameter if you are using a provider other than s3.amazonaws.com.
Connecting to Oracle Cloud Infrastructure (OCI) object storage
Both the endpoint and region values need to be updated to use a non-home region.
All OCI object storage regions need to use
Using v2 S3 Signatures
Some S3 service providers only support v2 signatures. For these S3 providers, you can enforce using
the v2 signatures by setting the
[Experimental] S3 streaming upload
Because S3 is an object store, by default the whole file is sent from the client to the worker,
stored in a temporary directory on the local disk, and uploaded in the
close() method.
To enable S3 streaming upload, you need to modify
conf/alluxio-site.properties to include:
The default upload process is safer but has the following issues:
- Slow upload time. The file has to be sent to the Alluxio worker first, and then the Alluxio worker is responsible for uploading the file to S3. The two processes are sequential.
- The temporary directory must have the capacity to store the whole file.
- Slow close(). The execution time of the close() method is proportional to the file size and inversely proportional to the bandwidth, that is, O(FILE_SIZE/BANDWIDTH). A slow close() is unexpected and has already been a bottleneck in the Alluxio Fuse integration. The Alluxio Fuse method that calls close() is asynchronous, so if we write a big file through Alluxio Fuse to S3, the Fuse write operation returns much earlier than the file has been written to S3.
The S3 streaming upload feature addresses the above issues and is based on the S3 low-level multipart upload.
The S3 streaming upload has the following advantages:
- Shorter upload time. The Alluxio worker uploads buffered data while receiving new data. The total upload time will be at least as fast as the default method.
- Smaller capacity requirement. Data is buffered and uploaded in partitions (sized by alluxio.underfs.s3.streaming.upload.partition.size, which is 64MB by default). When a partition is successfully uploaded, it is deleted.
- Faster close(). We begin uploading data when the buffered data reaches the partition size instead of uploading the whole file in close().
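To make the difference concrete, here is a small Python sketch of the arithmetic under a deliberately simplified model: it assumes every full partition has already been uploaded by the time close() is called, which may not hold on a slow link. The function names are hypothetical; the 64MB default is the partition size mentioned above.

```python
MB = 1024 * 1024
DEFAULT_PARTITION_SIZE = 64 * MB  # default partition size stated in this guide


def partition_count(file_size: int, partition_size: int = DEFAULT_PARTITION_SIZE) -> int:
    """Number of partitions needed for the file (integer ceiling division)."""
    return (file_size + partition_size - 1) // partition_size


def bytes_pending_at_close(file_size: int, streaming: bool,
                           partition_size: int = DEFAULT_PARTITION_SIZE) -> int:
    """Bytes that still have to be uploaded inside close().

    Default upload: the whole file. Streaming upload (simplified model):
    only the tail that never filled a complete partition.
    """
    if not streaming:
        return file_size
    return file_size % partition_size


def close_seconds(pending_bytes: int, bandwidth_bytes_per_sec: int) -> float:
    """O(FILE_SIZE/BANDWIDTH) applied to whatever is left at close()."""
    return pending_bytes / bandwidth_bytes_per_sec


file_size = 200 * MB
bandwidth = 50 * MB  # hypothetical worker-to-S3 bandwidth per second

print(partition_count(file_size))
print(close_seconds(bytes_pending_at_close(file_size, streaming=False), bandwidth))
print(close_seconds(bytes_pending_at_close(file_size, streaming=True), bandwidth))
```

In this model a 200MB file is split into four partitions, and close() only has to push the final 8MB instead of the full 200MB.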
If an S3 streaming upload is interrupted, intermediate partitions may remain in S3, and S3 will charge for that data.
To reduce the charges, users can modify
conf/alluxio-site.properties to include:
Intermediate multipart uploads in all non-readonly S3 mount points
older than the clean age (configured by
will be cleaned when a leading master starts
or a cleanup interval (configured by
alluxio.underfs.cleanup.interval) is reached.
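The age-based selection described above can be pictured with the following Python sketch. It only models the selection rule (older than the clean age); the function name, upload IDs, and timestamps are hypothetical, and the real cleanup is performed by the Alluxio master.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def stale_uploads(initiated: Dict[str, datetime], now: datetime,
                  clean_age: timedelta) -> List[str]:
    """Return the IDs of multipart uploads started more than clean_age ago."""
    return sorted(uid for uid, started in initiated.items()
                  if now - started > clean_age)


now = datetime(2024, 1, 10, 12, 0, 0)
uploads = {
    "upload-a": datetime(2024, 1, 1, 12, 0, 0),   # 9 days old
    "upload-b": datetime(2024, 1, 9, 12, 0, 0),   # 1 day old
    "upload-c": datetime(2024, 1, 10, 11, 0, 0),  # 1 hour old
}

print(stale_uploads(uploads, now, clean_age=timedelta(days=3)))
```

Choosing a clean age comfortably longer than the longest expected upload avoids removing partitions of uploads that are still in progress.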
Tuning for High Concurrency
When using Alluxio to access S3 with a great number of clients per Alluxio server, these parameters can be tuned so that Alluxio uses a configuration optimized for the S3 backend.
# If the S3 connection is slow, a larger timeout is useful
alluxio.underfs.s3.socket.timeout=500sec
alluxio.underfs.s3.request.timeout=5min
# If we expect a great number of concurrent metadata operations
alluxio.underfs.s3.admin.threads.max=80
# If the total number of metadata + data operations is huge
alluxio.underfs.s3.threads.max=160
# For a worker, the number of concurrent writes to S3
# For a master, the number of threads to concurrently rename files within a directory
alluxio.underfs.s3.upload.threads.max=80
# Thread-pool size to submit delete and rename operations to S3 on master
alluxio.underfs.object.store.service.threads=80
Identity and Access Control of S3 Objects
S3 identity and access management is very different from the traditional POSIX permission model. For instance, S3 ACL does not support groups or directory-level settings. Alluxio makes the best effort to inherit permission information including file owner, group and permission mode from S3 ACL information.
Why is 403 Access Denied Error Returned
The S3 credentials set in the Alluxio configuration correspond to an AWS user. If this user does not have the required permissions to access an S3 bucket or object, a 403 permission denied error will be returned.
If you see a 403 error in the Alluxio server log when accessing an S3 service, double-check that:
- You are using the correct AWS credentials. See credential setup.
- Your AWS user has permissions to access the buckets and objects mounted to Alluxio.
Read more AWS troubleshooting guidance for 403 error.
File Owner and Group
Alluxio file system sets the file owner based on the AWS account configured in Alluxio to connect to S3. Since there is no group in S3 ACL, the owner is reused as the group.
By default, Alluxio extracts the display name of this AWS account as the file owner.
If this display name is not available,
this AWS user’s canonical user ID is used.
The canonical user ID is typically a long string and is
thus often inconvenient to read and use in practice.
Optionally, the property
alluxio.underfs.s3.owner.id.to.username.mapping can be used to specify a preset mapping
from canonical user IDs to Alluxio usernames, in the format “id1=user1;id2=user2”.
For example, edit
alluxio-site.properties to include
This configuration helps Alluxio recognize all objects owned by this AWS account as owned by
john in Alluxio namespace.
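To illustrate the “id1=user1;id2=user2” format, the Python sketch below parses such a value into a lookup table. It is not Alluxio's parser: the function name and the canonical IDs are hypothetical, and falling back to the raw canonical ID for unmapped owners is an assumption made for the example.

```python
from typing import Dict


def parse_owner_mapping(value: str) -> Dict[str, str]:
    """Parse 'id1=user1;id2=user2' into {canonical_id: username}."""
    mapping: Dict[str, str] = {}
    for pair in value.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing ';'
        canonical_id, sep, username = pair.partition("=")
        canonical_id, username = canonical_id.strip(), username.strip()
        if not sep or not canonical_id or not username:
            raise ValueError(f"malformed mapping entry: {pair!r}")
        mapping[canonical_id] = username
    return mapping


# Hypothetical canonical user IDs, shortened for readability.
mapping = parse_owner_mapping("0123abcd=john;4567ef01=jane")

print(mapping["0123abcd"])
# Assumption for this sketch: unmapped IDs fall back to the raw canonical ID.
print(mapping.get("89abcdef", "89abcdef"))
```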
To find out the AWS S3 canonical ID of your account,
check the console,
expand the “Account Identifiers” tab, and refer to “Canonical User ID”.
Alluxio checks the S3 bucket read/write ACLs to determine the owner’s permissions to an Alluxio file.
For example, if the AWS user has read-only access to the mounted bucket, the mounted
directory and files will be set with
If the AWS user has full access to the underlying bucket,
the mounted directory and files will be set with
Note that Alluxio only inherits bucket-level ACLs when determining file system permissions for a mount point,
and ignores the ACLs that are set to individual objects.
With 0700 as the default file permissions, Alluxio file system users other than the
file owner cannot access the files under the mount point.
This may create problems for different users to read mounted data.
For example, a Presto job running under user presto may encounter an error like “Query failed: Failed to list directory” when accessing a mount point owned by the user john with permission bits 0700.
In the Alluxio master log (master.log), one can find errors like:
Error=alluxio.exception.AccessControlException: Permission denied: user=presto, access=--x, path=/mnt/s3/myobject: failed at s3, inode owner=john, inode group=john, inode mode=rwx------
This is because the mounted directory has permission 0700, and thus the application user presto is not able to access this file.
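The denial in the log follows from an ordinary POSIX-style mode check. The Python sketch below reproduces it for the values in the error message (owner john, group john, mode rwx------, requested access --x). It is a simplified model, not Alluxio's permission checker, and the function name is hypothetical.

```python
from typing import Iterable

READ, WRITE, EXECUTE = 4, 2, 1


def has_access(user: str, user_groups: Iterable[str], owner: str, group: str,
               mode: int, requested: int) -> bool:
    """Check requested rwx bits against the owner/group/other triplet of mode."""
    if user == owner:
        granted = (mode >> 6) & 0o7
    elif group in set(user_groups):
        granted = (mode >> 3) & 0o7
    else:
        granted = mode & 0o7
    return (granted & requested) == requested


# Values taken from the log line above: mode rwx------ is 0700.
print(has_access("presto", ["presto"], "john", "john", 0o700, EXECUTE))
print(has_access("john", ["john"], "john", "john", 0o700, EXECUTE))
# With a more permissive mode such as 0755, other users can traverse the directory.
print(has_access("presto", ["presto"], "john", "john", 0o755, EXECUTE))
```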
To share the S3 mount point with other users in the Alluxio namespace, choose one of the following options:
- (Option 1): Set
alluxio-site.properties for root mount or pass
fs mount command; this gives all users
- (Option 2): Set alluxio.underfs.s3.default.mode to a new default value other than 0700 that enables other users to access the mount point.
Note that chmod on Alluxio directories and files does NOT propagate to the underlying
S3 buckets or objects.
Alluxio supports authentication with the AssumeRole API. When using AssumeRole, the AWS access key and secret key will ONLY be used to get temporary credentials. ALL subsequent accesses will be using the temporary credentials created with AssumeRole.
To use AssumeRole, there are 2 compulsory properties on Alluxio masters and workers:
Note: Please make sure the role exists, and the user (for the access key and secret key) has the permission to assume the role specified by the target role ARN.
There are 2 optional keys:
# This specifies a name for the session.
# The temporary credential will be associated with a session.
# The session name is suffixed by a random string to make sure of the uniqueness.
alluxio.underfs.s3.assumerole.session.prefix="alluxio-assume-role"
# Typically this value is between 900 and 3600.
# The session will be automatically refreshed by the AWS client,
# no need to do anything to refresh the temporary credentials.
alluxio.underfs.s3.assumerole.session.duration.second=900
# Enable the HTTPS protocol for assume role. The default value is true.
alluxio.underfs.s3.assumerole.https.enabled=true
# Enable the HTTPS protocol for assume role proxy. The default value is false.
alluxio.underfs.s3.assumerole.proxy.https.enabled=false
# Set the proxy host for assume role. Note that you have to set both proxy host
# and proxy port in the Alluxio configuration. Otherwise, the proxy setting will be fetched from
# your environment setting.
alluxio.underfs.s3.assumerole.proxy.host=<HOSTNAME>
# Set the proxy port for assume role. Note that you have to set both proxy host
# and proxy port in the Alluxio configuration. Otherwise, the proxy setting will be fetched from
# your environment setting.
alluxio.underfs.s3.assumerole.proxy.port=<PORT_NUMBER>
Note: The JVM/System environment variables HTTP(S)_PROXY, http(s)_proxy, http(s).proxyHost and http(s).proxyPort will be automatically picked up by AWS SDK if you don’t set the proxy host and port in Alluxio configuration.
A sample setup looks like below:
aws.accessKeyId=FOO
aws.secretKey=BAR
alluxio.underfs.s3.assumerole.enabled=true
alluxio.underfs.s3.assumerole.session.second=1000
alluxio.underfs.s3.assumerole.session.prefix="alluxio"
alluxio.underfs.s3.assumerole.rolearn=arn:aws:iam::123456:role/example-role
AssumeRole Temporary Token Propagating And Refreshing From the Master
In the previous AssumeRole configuration, AWS credentials must be stored on every Alluxio Master and Worker node so that each node can request a token from S3. This does not scale well for large clusters, and may not be acceptable in security-sensitive environments where Workers cannot be trusted with such credentials. The Alluxio Storage Integration Access Token Framework supports storing AWS credentials only on master nodes; the masters propagate the AssumeRole access token to the workers that need to access S3. The workers can also ask the master to refresh the token when it expires. The following configuration properties are needed to enable the AssumeRole Temporary Token on the Masters:
# Properties mentioned above
aws.accessKeyId=FOO
aws.secretKey=BAR
alluxio.underfs.s3.assumerole.enabled=true
alluxio.underfs.s3.assumerole.session.second=1000
alluxio.underfs.s3.assumerole.session.prefix="alluxio"
alluxio.underfs.s3.assumerole.rolearn=arn:aws:iam::123456:role/example-role
# Extra properties required on the master
alluxio.security.underfs.temporary.credential.enabled=true
alluxio.security.underfs.mount.temporary.credential.enabled=true
alluxio.underfs.s3.assumerole.session.scope=USER
alluxio.underfs.s3.assumerole.session.scope is set to USER_PATH by default but can also be set to USER.
When set to USER_PATH, a new token is requested for each unique combination of user and path.
USER_PATH is the most secure setting but can have a performance impact if the number of token requests is large.
USER only requests a token for each unique user, which reduces the number of requests when the same user is accessing a large number of paths.
aws.accessKeyId and aws.secretKey are not required if the Master uses an AWS Instance
profile; please refer to Advanced Credentials Setup.
The following configuration properties are needed on the Workers:
alluxio.underfs.s3.assumerole.enabled=true
alluxio.security.underfs.temporary.credential.enabled=true
alluxio.security.underfs.mount.temporary.credential.enabled=true
alluxio.underfs.s3.assumerole.session.scope=USER
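The trade-off between the two values of alluxio.underfs.s3.assumerole.session.scope can be estimated with a short Python sketch. It assumes one token request per unique key and ignores expiry and refresh, so real request counts may differ. The function name, users, and paths are hypothetical.

```python
from typing import List, Tuple


def token_requests(accesses: List[Tuple[str, str]], scope: str) -> int:
    """Count tokens needed for (user, path) accesses under the given scope."""
    if scope == "USER_PATH":
        keys = {(user, path) for user, path in accesses}
    elif scope == "USER":
        keys = {user for user, _path in accesses}
    else:
        raise ValueError(f"unknown scope: {scope}")
    return len(keys)


accesses = [
    ("alice", "/mnt/s3/a"),
    ("alice", "/mnt/s3/b"),
    ("alice", "/mnt/s3/c"),
    ("alice", "/mnt/s3/a"),  # repeated access, same key in both scopes
    ("bob", "/mnt/s3/a"),
]

print(token_requests(accesses, "USER_PATH"))
print(token_requests(accesses, "USER"))
```

The gap between the two counts grows with the number of distinct paths each user touches, which is where USER reduces load at the cost of coarser-grained tokens.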
Enabling AWS-SDK Debug Level
If issues are encountered when running against your S3 backend, enable additional logging to track HTTP requests. Modify
conf/log4j.properties to add the following properties:
log4j.logger.com.amazonaws=WARN
log4j.logger.com.amazonaws.request=DEBUG
log4j.logger.org.apache.http.wire=DEBUG
See Amazon’s documentation for more details.
Prevent Creating Zero-byte Files
Alluxio creates zero-byte files in S3 as a performance optimization.
For a bucket mounted with read-write access, zero-byte file creation (S3 PUT operation) is not
restricted to write operations using Alluxio but also occurs when listing contents of the under storage.
To disable the PUT operation, mount the bucket with the
flag or set
alluxio.master.mount.table.root.readonly=true for root mount.