Quick Start Guide
- Prerequisites
- Downloading Alluxio
- Configuring Alluxio
- Validating Alluxio environment
- Starting Alluxio
- Using the Alluxio Shell
- [Bonus] Mounting in Alluxio
- [Bonus] Accelerating Data Access with Alluxio
- Stopping Alluxio
- Conclusion
The simplest way to quickly try out Alluxio is to install it locally on a single machine. In this quick start guide, we will install Alluxio on your local machine, mount example data, and perform basic tasks with the data in Alluxio. During this guide, you will:
- Download and configure Alluxio
- Validate the Alluxio environment
- Start Alluxio locally
- Perform basic tasks via Alluxio Shell
- [Bonus] Mount a public Amazon S3 bucket in Alluxio
- Shut down Alluxio
[Bonus] If you have an AWS account with an access key id and secret access key, you will be able to perform additional tasks in this guide. Sections of the guide which require your AWS account information will be labeled with [Bonus].
Note: This guide is meant to get you interacting with an Alluxio system quickly. Alluxio performs best in a distributed environment with big data workloads, and neither condition is easy to reproduce on a single local machine. If you are interested in running a larger-scale example that highlights the performance benefits of Alluxio, try the instructions in either of these two whitepapers: Accelerating on-demand data analytics with Alluxio, Accelerating data analytics on ceph object storage with Alluxio.
Prerequisites
For the following quick start guide, you will need:
- Mac OS X or Linux
- Java 7 or newer
- [Bonus] AWS account and keys
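The operating system and Java requirements above can be checked before going further. The following is a minimal sketch, not part of Alluxio: the function names are illustrative, and it assumes `java -version` prints a quoted version string such as "1.8.0_181" or "11.0.2".

```shell
# Sketch of a prerequisite check; these helper names are illustrative, not Alluxio commands.

# Succeeds only for the operating systems listed in the prerequisites.
os_supported() {
  case "$1" in
    Darwin|Linux) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints the major Java version for a version string such as "1.8.0_181" or "11.0.2".
java_major() {
  v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;   # legacy "1.N" scheme used up to Java 8
  esac
  # Keep only the leading digits.
  echo "${v%%[!0-9]*}"
}

# Reports whether this machine meets the OS and Java requirements.
check_prereqs() {
  os_supported "$(uname -s)" || { echo "Unsupported OS: $(uname -s)"; return 1; }
  command -v java >/dev/null 2>&1 || { echo "java not found on PATH"; return 1; }
  # `java -version` writes to stderr; take the first quoted string as the version.
  version="$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)"
  major="$(java_major "$version")"
  [ "${major:-0}" -ge 7 ] || { echo "Java 7 or newer required, found: $version"; return 1; }
  echo "Prerequisites look OK (Java $version)"
}
```

After defining these functions in your shell, run check_prereqs to see the result for your machine.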
Setup SSH (Mac OS X)
If you are using Mac OS X, you may have to enable the ability to ssh into localhost. To enable remote login, open System Preferences, then open Sharing, and make sure Remote Login is enabled.
Downloading Alluxio
First, download the Alluxio release. You can download the latest 1.6.1 release pre-built for various versions of Hadoop from the Alluxio download page.
Next, you can unpack the download with the following commands. Your filename may be different depending on which pre-built binaries you have downloaded.
tar -xzf alluxio-1.6.1-bin.tar.gz
cd alluxio-1.6.1
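The tar command creates a directory alluxio-1.6.1 containing all of the Alluxio source files and Java binaries, and the cd command moves into it. The remaining commands in this guide are run from that directory.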
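Because the filename depends on which pre-built binaries you downloaded, it can help to look up the directory name inside the archive rather than typing it. This is a small sketch using standard tar and awk; archive_root is an illustrative helper name, and it assumes the archive has a single top-level directory.

```shell
# Prints the top-level directory name inside a gzipped tarball without extracting it.
# Assumes the archive contains exactly one top-level directory.
archive_root() {
  # The first listed entry looks like "alluxio-1.6.1/"; keep the part before the first slash.
  tar -tzf "$1" | awk -F/ 'NR == 1 { print $1 }'
}
```

For example, after unpacking, cd "$(archive_root alluxio-1.6.1-bin.tar.gz)" enters the extracted directory whatever it is called.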
Configuring Alluxio
Before we start Alluxio, we have to configure it. We will be using most of the default settings.
In the ${ALLUXIO_HOME}/conf directory, create the conf/alluxio-site.properties configuration file from the template.
cp conf/alluxio-site.properties.template conf/alluxio-site.properties
Update alluxio.master.hostname in conf/alluxio-site.properties to the hostname of the machine you plan to run the Alluxio master on.
echo "alluxio.master.hostname=localhost" >> conf/alluxio-site.properties
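Appending with echo adds a new line every time the command is run, so repeating it leaves duplicate entries in the file. If you expect to rerun the setup, a helper along these lines keeps a single line per key. It is only a sketch: set_property is a hypothetical name, and it assumes entries are written as key=value with no spaces around the equals sign.

```shell
# Sets KEY=VALUE in a properties file, replacing any earlier line for the same key.
# Usage: set_property FILE KEY VALUE
# Assumes "key=value" lines without spaces around "="; comment lines are left untouched.
set_property() {
  prop_file="$1"; prop_key="$2"; prop_value="$3"
  [ -f "$prop_file" ] || : > "$prop_file"
  prop_tmp="${prop_file}.tmp.$$"
  # Copy every line whose key differs, then append the new assignment.
  awk -F= -v k="$prop_key" '$1 != k' "$prop_file" > "$prop_tmp"
  printf '%s=%s\n' "$prop_key" "$prop_value" >> "$prop_tmp"
  mv "$prop_tmp" "$prop_file"
}
```

With this helper, set_property conf/alluxio-site.properties alluxio.master.hostname localhost can be run any number of times and the file still contains one hostname line.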
[Bonus] Configuration for AWS
If you have an AWS account with an access key id and secret access key, you can update your Alluxio configuration now in preparation for interacting with Amazon S3 later in this guide. Add your AWS access information to the Alluxio configuration by adding the keys to the conf/alluxio-site.properties file. The following commands will update the configuration.
echo "aws.accessKeyId=AWS_ACCESS_KEY_ID" >> conf/alluxio-site.properties
echo "aws.secretKey=AWS_SECRET_ACCESS_KEY" >> conf/alluxio-site.properties
You will have to replace AWS_ACCESS_KEY_ID with your AWS access key id, and AWS_SECRET_ACCESS_KEY with your AWS secret access key. Now, Alluxio is fully configured for the rest of this guide.
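A common slip is to run the two commands as written and leave the placeholder text in the file. The check below is a sketch that looks for exactly those literal placeholders; has_placeholder_credentials is an illustrative helper, not an Alluxio command.

```shell
# Succeeds if the file still contains the literal placeholder values from the commands above.
has_placeholder_credentials() {
  grep -q -e '^aws\.accessKeyId=AWS_ACCESS_KEY_ID$' \
          -e '^aws\.secretKey=AWS_SECRET_ACCESS_KEY$' "$1"
}
```

For example, has_placeholder_credentials conf/alluxio-site.properties && echo "Replace the AWS placeholders before mounting S3" prints a reminder only when a placeholder is still present.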
Validating Alluxio environment
Before starting Alluxio, you might want to make sure that your system environment is ready for running Alluxio services. You can run the following command to validate your local environment with your Alluxio configuration:
./bin/alluxio validateEnv local
This will report potential problems that might prevent you from starting Alluxio services locally. If you configured Alluxio to run in a cluster and you want to validate the environment on all nodes, you can run the following command instead:
./bin/alluxio validateEnv all
You can also make the command run only a specific validation task. For example, the following command runs only the validation tasks that check your local system resource limits:
./bin/alluxio validateEnv local ulimit
You can check out this page for detailed usage information regarding this command.
Starting Alluxio
Next, we will format Alluxio. The following command formats the Alluxio journal and the worker storage directory so that the master and worker can start.
./bin/alluxio format
Now, we can start Alluxio! By default, Alluxio is configured to start a master and a worker on localhost. We can start both with the following command:
./bin/alluxio-start.sh local
Congratulations! Alluxio is now up and running! You can visit http://localhost:19999 to see the status of the Alluxio master, and visit http://localhost:30000 to see the status of the Alluxio worker.
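The web UI may take a moment to respond after the start script returns. If you are scripting these steps, a generic retry loop such as the sketch below can wait for it. wait_until and WAIT_INTERVAL are illustrative names, and the example usage assumes curl is installed.

```shell
# Retries a command until it succeeds or the attempt budget is used up.
# Usage: wait_until ATTEMPTS COMMAND [ARGS...]
# WAIT_INTERVAL (seconds between attempts, default 1) is an illustrative knob.
wait_until() {
  attempts="$1"; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Pause between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "${WAIT_INTERVAL:-1}"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, wait_until 30 curl -fsS -o /dev/null http://localhost:19999 polls the master web UI for up to about 30 seconds and returns non-zero if it never answers.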
Using the Alluxio Shell
Now that Alluxio is running, we can examine the Alluxio file system with the Alluxio shell. The Alluxio shell enables many command-line operations for interacting with Alluxio. You can invoke the Alluxio shell with the following command:
./bin/alluxio fs
This will print out the available Alluxio command-line operations.
For example, you can list files in Alluxio with the ls command. To list all files in the root directory, use the following command:
./bin/alluxio fs ls /
Unfortunately, we do not have any files in Alluxio yet. We can solve that by copying a file into Alluxio. The copyFromLocal shell command copies a local file into Alluxio.
./bin/alluxio fs copyFromLocal LICENSE /LICENSE
Copied LICENSE to /LICENSE
After copying the LICENSE file, we should be able to see it in Alluxio. List the files in Alluxio with the command:
./bin/alluxio fs ls /
26.22KB 06-20-2016 11:30:04:415 In Memory /LICENSE
The output shows the file that exists in Alluxio, as well as some other useful information, like the size of the file, the date it was created, and the in-Alluxio status of the file.
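To make the columns explicit, the sketch below labels each field of a listing line. It assumes the column layout shown in the sample output above (other Alluxio versions may print different columns) and paths without spaces; describe_ls_line is an illustrative helper name.

```shell
# Labels the fields of `alluxio fs ls` lines read from standard input.
# Assumes the layout: SIZE DATE TIME [STATUS WORDS...] PATH, with no spaces in PATH.
describe_ls_line() {
  awk '{
    status = ""; sep = ""
    # Everything between the time (field 3) and the path (last field) is the status.
    for (i = 4; i < NF; i++) { status = status sep $i; sep = " " }
    printf "size=%s date=%s time=%s status=%s path=%s\n", $1, $2, $3, status, $NF
  }'
}
```

For example, ./bin/alluxio fs ls / | describe_ls_line prints one labeled line per entry. Directory entries have an empty status, as in the recursive listing later in this guide.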
You can also view the contents of the file through the Alluxio shell. The cat command will print the contents of the file.
./bin/alluxio fs cat /LICENSE
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
...
With the default configuration, Alluxio uses the local file system as its UnderFileSystem (UFS). The default path for the UFS is ./underFSStorage. We can see what is in the UFS with:
ls ./underFSStorage/
However, the directory doesn’t exist! By default, Alluxio writes data only into Alluxio space, not to the UFS.
We can tell Alluxio to persist the file from Alluxio space to the UFS with the persist shell command.
./bin/alluxio fs persist /LICENSE
persisted file /LICENSE with size 26847
Now, if we examine the local UFS again, the file should appear.
ls ./underFSStorage
LICENSE
If we browse the Alluxio file system in the master’s web UI, we can see the LICENSE file as well as other useful information. Here, the Persistence State column shows the file as PERSISTED.
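The persist command reported a size of 26847, while the earlier listing showed 26.22KB. The two figures agree if the listing divides the byte count by 1024 and rounds to two decimals. That is an inference from the sample output, not a documented rule, and bytes_to_kb is an illustrative helper:

```shell
# Converts a byte count to kilobytes (1 KB = 1024 bytes), rounded to two decimals.
bytes_to_kb() {
  LC_ALL=C awk -v b="$1" 'BEGIN { printf "%.2fKB\n", b / 1024 }'
}

bytes_to_kb 26847   # 26847 / 1024 = 26.2178..., printed as 26.22KB
```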
[Bonus] Mounting in Alluxio
Alluxio unifies access to storage systems with the unified namespace feature. Read the Unified Namespace blog post and the unified namespace documentation for more detailed explanations of the feature.
This feature allows users to mount different storage systems into the Alluxio namespace and access the files across various storage systems through the Alluxio namespace seamlessly.
First, we will create a directory in Alluxio to store our mount points.
./bin/alluxio fs mkdir /mnt
Successfully created directory /mnt
Next, we will mount an existing sample S3 bucket to Alluxio. We have provided a sample S3 bucket for you to use in the rest of this guide.
./bin/alluxio fs mount -readonly alluxio://localhost:19998/mnt/s3 s3a://alluxio-quick-start/data
Mounted s3a://alluxio-quick-start/data at alluxio://localhost:19998/mnt/s3
Now, the S3 bucket is mounted into the Alluxio namespace.
We can list the files from S3 through the Alluxio namespace. We can use the familiar ls shell command to list the files from the S3 mounted directory.
./bin/alluxio fs ls /mnt/s3
87.86KB 06-20-2016 12:50:51:660 Not In Memory /mnt/s3/sample_tweets_100k.csv
933.21KB 06-20-2016 12:50:53:633 Not In Memory /mnt/s3/sample_tweets_1m.csv
149.77MB 06-20-2016 12:50:55:473 Not In Memory /mnt/s3/sample_tweets_150m.csv
9.61MB 06-20-2016 12:50:55:821 Not In Memory /mnt/s3/sample_tweets_10m.csv
We can see the newly mounted files and directories in the Alluxio web UI as well.
With Alluxio’s unified namespace, you can interact with data from different storage systems seamlessly. For example, with the ls shell command, you can recursively list all the files that exist under a directory.
./bin/alluxio fs ls -R /
26.22KB 06-20-2016 11:30:04:415 In Memory /LICENSE
1.00B 06-20-2016 12:28:39:176 /mnt
4.00B 06-20-2016 12:30:41:986 /mnt/s3
87.86KB 06-20-2016 12:50:51:660 Not In Memory /mnt/s3/sample_tweets_100k.csv
933.21KB 06-20-2016 12:50:53:633 Not In Memory /mnt/s3/sample_tweets_1m.csv
149.77MB 06-20-2016 12:50:55:473 Not In Memory /mnt/s3/sample_tweets_150m.csv
9.61MB 06-20-2016 12:50:55:821 Not In Memory /mnt/s3/sample_tweets_10m.csv
This shows all the files under the root of the Alluxio file system, from all of the mounted storage systems. The /LICENSE file is in your local file system, while the files under /mnt/s3/ are in S3.
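When a listing spans several storage systems, it can be useful to pull out only the files that are not yet held in Alluxio memory. The filter below is a sketch that relies on the "Not In Memory" text shown in the sample output above and on paths without spaces; not_in_memory is an illustrative helper name.

```shell
# Prints the paths of entries marked "Not In Memory" in `alluxio fs ls` output on stdin.
# Assumes the status text and column layout shown in the sample output, and no spaces in paths.
not_in_memory() {
  awk '/ Not In Memory / { print $NF }'
}
```

For example, ./bin/alluxio fs ls -R / | not_in_memory would print the four /mnt/s3 files from the listing above and skip /LICENSE and the directories.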
[Bonus] Accelerating Data Access with Alluxio
Since Alluxio leverages memory to store data, it can accelerate access to data. First, let’s take a look at the status of a file in Alluxio (mounted from S3).
./bin/alluxio fs ls /mnt/s3/sample_tweets_150m.csv
149.77MB 06-20-2016 12:50:55:473 Not In Memory /mnt/s3/sample_tweets_150m.csv
The output shows that the file is Not In Memory. This file is a sample of tweets. Let’s see how many tweets mention the word “kitten”. With the following command, we can count the number of tweets with “kitten”.
time ./bin/alluxio fs cat /mnt/s3/sample_tweets_150m.csv | grep -c kitten
889
real 0m22.857s
user 0m7.557s
sys 0m1.181s
Depending on your network connection, the operation may take over 20 seconds. If reading this file takes too long, you may use a smaller dataset. The other files in the directory are smaller subsets of this file.
Now, let’s see how many tweets mention the word “puppy”.
time ./bin/alluxio fs cat /mnt/s3/sample_tweets_150m.csv | grep -c puppy
1553
real 0m25.998s
user 0m6.828s
sys 0m1.048s
As you can see, it takes a lot of time to access the data for each command. Alluxio can accelerate access to this data by using memory to store the data. However, the cat shell command does not cache data in Alluxio memory. There is a separate shell command, load, which tells Alluxio to store the data in memory. You can tell Alluxio to load the data into memory with the following command.
./bin/alluxio fs load /mnt/s3/sample_tweets_150m.csv
After loading the file, you can check the status with the ls command:
./bin/alluxio fs ls /mnt/s3/sample_tweets_150m.csv
149.77MB 06-20-2016 12:50:55:473 In Memory /mnt/s3/sample_tweets_150m.csv
The output shows that the file is now In Memory. Now that the file is in memory, reading it should be much faster.
Let’s count the number of tweets with the word “puppy”.
time ./bin/alluxio fs cat /mnt/s3/sample_tweets_150m.csv | grep -c puppy
1553
real 0m1.917s
user 0m2.306s
sys 0m0.243s
As you can see, reading the file was very quick, taking only about two seconds! And, since the data is in Alluxio memory, you can easily read the file again just as quickly. Let’s now count how many tweets mention the word “bunny”.
time ./bin/alluxio fs cat /mnt/s3/sample_tweets_150m.csv | grep -c bunny
907
real 0m1.983s
user 0m2.362s
sys 0m0.240s
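To put a number on the improvement, the real times reported by time can be compared directly. The sketch below converts the "0m25.998s" format to seconds and divides the two values; to_seconds and speedup are illustrative helpers, and the figures come from the sample runs above, so your own measurements will differ.

```shell
# Converts a `time` value such as "0m25.998s" to seconds.
to_seconds() {
  LC_ALL=C awk -v t="$1" 'BEGIN {
    split(t, parts, "m")          # parts[1] = minutes, parts[2] = seconds with trailing "s"
    sub(/s$/, "", parts[2])
    printf "%.3f\n", parts[1] * 60 + parts[2]
  }'
}

# Prints how many times faster the second measurement is than the first.
speedup() {
  LC_ALL=C awk -v before="$(to_seconds "$1")" -v after="$(to_seconds "$2")" \
    'BEGIN { printf "%.1f\n", before / after }'
}

# "puppy" query before and after loading the file into memory:
speedup 0m25.998s 0m1.917s   # 25.998 / 1.917, printed as 13.6
```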
Congratulations! You installed Alluxio locally and used Alluxio to accelerate access to data!
Stopping Alluxio
Once you are done interacting with your local Alluxio installation, you can stop Alluxio with the following command:
./bin/alluxio-stop.sh local
Conclusion
Congratulations on completing the quick start guide for Alluxio! You have successfully downloaded and installed Alluxio on your local computer, and performed some basic interactions via the Alluxio shell. This was a simple example of how to get started with Alluxio.
There are several next steps available. You can learn more about the various features of Alluxio in our documentation. You can also deploy Alluxio in your environment, mount your existing storage systems to Alluxio, or configure your applications to work with Alluxio. Additional resources are below.
Deploying Alluxio
Alluxio can be deployed in many different environments.
- Alluxio on Local Machine
- Alluxio Standalone on a Cluster
- Alluxio on Virtual Box
- Alluxio on Docker
- Alluxio Standalone with Fault Tolerance
- Alluxio on EC2
- Alluxio on GCE
- Alluxio with Mesos on EC2
- Alluxio with Fault Tolerance on EC2
- Alluxio with YARN on EC2
- Alluxio YARN Integration
- Alluxio Standalone with YARN
Under Storage Systems
There are many under storage systems that can be accessed through Alluxio.
- Alluxio with Azure Blob Store
- Alluxio with S3
- Alluxio with GCS
- Alluxio with Minio
- Alluxio with Ceph
- Alluxio with Swift
- Alluxio with GlusterFS
- Alluxio with MapR-FS
- Alluxio with HDFS
- Alluxio with Secure HDFS
- Alluxio with OSS
- Alluxio with NFS
Frameworks and Applications
Different frameworks and applications work with Alluxio.