Getting Started

If you are new to Alluxio, this guide is a good place to start. For other installation methods, see the Alluxio Deployment documentation.

Introduction

We will install Alluxio locally and then run through some basic cluster operations:

  1. Verify Prerequisites.
  2. Install Alluxio locally.
  3. Perform basic tasks via Alluxio Shell.
  4. Mount a public Amazon S3 bucket in Alluxio.
  5. Accelerate data access.
  6. Stop Alluxio.

Prerequisites

Alluxio components have specific requirements which you must meet before proceeding.

Install Alluxio Locally

Alluxio creates a distributed filesystem across one or more machines, which constitute your Alluxio cluster. For this introduction, we’ll install Alluxio locally: all of the Alluxio components run on a single machine, and the filesystem is ‘distributed’ across local storage only. Follow the step-by-step instructions for Alluxio Deployment. Once Alluxio is installed, continue with the steps below.

To verify the installation, run jps at the terminal. The output should include the processes ‘AlluxioMaster’, ‘AlluxioWorker’, ‘AlluxioProxy’, ‘AlluxioJobMaster’, and ‘AlluxioJobWorker’.
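
The check above can be scripted. The following is a minimal sketch, not part of Alluxio: the function name missing_alluxio_processes is hypothetical, and it assumes jps prints one process per line with the process name as a separate word.

```shell
# Sketch: report which expected Alluxio processes are missing from
# jps-style output read on stdin. Exit status is 0 only if all are present.
# Usage (hypothetical helper, run where jps is available):
#   jps | missing_alluxio_processes
missing_alluxio_processes() {
  expected="AlluxioMaster AlluxioWorker AlluxioProxy AlluxioJobMaster AlluxioJobWorker"
  listing="$(cat)"
  missing=0
  for name in $expected; do
    # -w restricts grep to whole-word matches of the process name.
    if ! printf '%s\n' "$listing" | grep -qw "$name"; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
}
```

If the function prints nothing and returns 0, all five processes were found in the listing.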

Using the Alluxio Shell

Now that Alluxio is running, we can examine the Alluxio filesystem from the command line with the Alluxio shell. In this section we’ll cover basic file system operations including how to copy files into Alluxio and persist them to under storage.

  1. Change directory to the Alluxio install directory by running cd ~/alluxio.
  2. You can invoke the Alluxio shell by running ./bin/alluxio fs, which will list all of the available command-line operations.
  3. Let’s list all the files in Alluxio.
    $ ./bin/alluxio fs ls /
    
  4. Unfortunately, we don’t have any files in Alluxio. We can solve that by copying a file into Alluxio using copyFromLocal.
    $ ./bin/alluxio fs copyFromLocal conf/alluxio-site.properties.template /alluxio-site.properties.template
    Copied conf/alluxio-site.properties.template to /alluxio-site.properties.template
    
  5. After copying the file, we should be able to see it in Alluxio. List the files in Alluxio again with ls. The output shows the file that now exists in Alluxio, along with other useful information such as the size of the file, the date it was created, and the percentage of the file that is in Alluxio memory.
    $ ./bin/alluxio fs ls /
    -rw-r--r--    <owner>    <group>    1229    NOT_PERSISTED    09-27-2017    10:05:07:412    100%    /alluxio-site.properties.template
    
  6. You can also view the contents of the file using the cat command.
    $ ./bin/alluxio fs cat /alluxio-site.properties.template
    ...
    
  7. With the default configuration, Alluxio uses the local file system as its under storage (US). The default path for the US is ./under-storage. We can see what’s in the US as follows:
    $ ls ./under-storage/
    
  8. The directory is empty. By default, Alluxio will write data only into Alluxio space, not to the US. We can tell Alluxio to persist the file from Alluxio space to the US using the shell command persist.
    $ ./bin/alluxio fs persist /alluxio-site.properties.template
    persisted file /alluxio-site.properties.template with size 1193
    
  9. Now, if we examine the US again, the file should appear.
    $ ls ./under-storage
    alluxio-site.properties.template
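
Steps 4 and 8 above can be combined into one helper. This is a sketch under assumptions: copy_and_persist and the ALLUXIO variable are hypothetical names introduced here, and the helper only chains the copyFromLocal and persist commands already shown.

```shell
# Path to the Alluxio CLI; override ALLUXIO if you are not in the install directory.
ALLUXIO="${ALLUXIO:-./bin/alluxio}"

# Copy a local file into Alluxio, then persist it to the under storage.
# Stops at the first failing step and returns its exit status.
# Usage (hypothetical helper):
#   copy_and_persist conf/alluxio-site.properties.template /alluxio-site.properties.template
copy_and_persist() {
  src="$1"
  dst="$2"
  "$ALLUXIO" fs copyFromLocal "$src" "$dst" || return
  "$ALLUXIO" fs persist "$dst"
}
```

Keeping the CLI path in a variable makes it possible to run the helper from outside the install directory, or to substitute a stub when testing the script.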
    

Exploring the Web UI

Alluxio has a user-friendly web interface that lets users monitor and manage the system. The master and each worker serve their own web UI. The default port for the web interface is 19999 for the master and 30000 for the workers.

If we browse the Alluxio file system in the master’s web UI we can see the file we copied earlier, as well as other useful information. Notice the ‘persistence state’ column shows the file is persisted.
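
To check from the terminal that a web UI is being served, a small helper can build the URL from the default ports given above. This is a sketch: alluxio_web_url is a hypothetical name, and the http scheme and the localhost default are assumptions for a local install.

```shell
# Hypothetical helper: print the web UI URL for a role, using the default
# ports stated above. The http scheme and localhost default are assumptions.
# Usage (needs curl and a running cluster):
#   curl -s -o /dev/null -w '%{http_code}\n' "$(alluxio_web_url master)"
alluxio_web_url() {
  role="$1"
  host="${2:-localhost}"
  case "$role" in
    master) port=19999 ;;
    worker) port=30000 ;;
    *) echo "unknown role: $role" >&2; return 1 ;;
  esac
  echo "http://${host}:${port}"
}
```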

Mount a Storage System

Alluxio unifies access to different storage systems with the unified namespace feature, which enables users to mount different storage systems into the Alluxio namespace and access the files across those systems seamlessly.

  1. Create a directory in Alluxio to store your mount points.
    $ ./bin/alluxio fs mkdir /mnt
    Successfully created directory /mnt
    

NOTE: The rest of this example requires Amazon AWS account credentials.

  1. You will need to provide credentials to access the alluxio-quick-start bucket. Set Alluxio properties aws.accessKeyId and aws.secretKey in conf/alluxio-site.properties and restart Alluxio.

  2. Mount an existing sample S3 bucket to Alluxio. We have provided a sample S3 bucket for you to use in this guide.
    $ ./bin/alluxio fs mount -readonly alluxio://localhost:19998/mnt/s3 s3a://alluxio-quick-start/data
    Mounted s3a://alluxio-quick-start/data at alluxio://localhost:19998/mnt/s3
    
  3. Now the S3 bucket is mounted into the Alluxio namespace. We can list the files from S3 through the Alluxio namespace using the familiar ls shell command.
    $ ./bin/alluxio fs ls -h /mnt/s3
    -r-x------    <owner>    <group>    933.21KB    PERSISTED    09-27-2017    11:34:20:072    0%    /mnt/s3/sample_tweets_1m.csv
    -r-x------    <owner>    <group>    9.61MB      PERSISTED    09-27-2017    11:34:20:076    0%    /mnt/s3/sample_tweets_10m.csv
    -r-x------    <owner>    <group>    87.86KB     PERSISTED    09-27-2017    11:34:20:076    0%    /mnt/s3/sample_tweets_100k.csv
    -r-x------    <owner>    <group>    149.77MB    PERSISTED    09-27-2017    11:34:20:077    0%    /mnt/s3/sample_tweets_150m.csv
    
  4. With Alluxio’s unified namespace, you can interact with data from different storage systems seamlessly. For example, with the ls shell command, you can recursively list all the files that exist under a directory. The following output shows all the files under the root of the Alluxio file system, from all of the mounted storage systems. The alluxio-site.properties.template file is in your local file system, while the files under /mnt/s3/ are in S3.
    $ ./bin/alluxio fs ls -hR /
    -rw-r--r--    <owner>    <group>    1229B       NOT_PERSISTED 09-27-2017    10:05:07:412    100%  /alluxio-site.properties.template
    dr-x------    <owner>    <group>    1           PERSISTED     09-27-2017    11:34:20:072    DIR   /mnt
    dr-x------    <owner>    <group>    4           PERSISTED     09-27-2017    11:34:20:072    DIR   /mnt/s3
    -r-x------    <owner>    <group>    933.21KB    PERSISTED     09-27-2017    11:34:20:072    0%    /mnt/s3/sample_tweets_1m.csv
    -r-x------    <owner>    <group>    9.61MB      PERSISTED     09-27-2017    11:34:20:076    0%    /mnt/s3/sample_tweets_10m.csv
    -r-x------    <owner>    <group>    87.86KB     PERSISTED     09-27-2017    11:34:20:076    0%    /mnt/s3/sample_tweets_100k.csv
    -r-x------    <owner>    <group>    149.77MB    PERSISTED     09-27-2017    11:34:20:077    0%    /mnt/s3/sample_tweets_150m.csv
    
  5. You can see the newly mounted files and directories in the Alluxio web UI as well.
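
The credentials step above asks you to set aws.accessKeyId and aws.secretKey in conf/alluxio-site.properties. One way to do that without typing the keys into the file by hand is sketched below. The function name write_s3_credentials is hypothetical, and it assumes your keys are already exported in the standard AWS environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

```shell
# Append the two credential properties named above to a properties file.
# Reads the standard AWS environment variables; refuses to write empty values.
# Usage (hypothetical helper, run from the Alluxio install directory):
#   write_s3_credentials conf/alluxio-site.properties
write_s3_credentials() {
  conf_file="${1:-conf/alluxio-site.properties}"
  if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must be set" >&2
    return 1
  fi
  {
    echo "aws.accessKeyId=${AWS_ACCESS_KEY_ID}"
    echo "aws.secretKey=${AWS_SECRET_ACCESS_KEY}"
  } >> "$conf_file"
}
```

The helper appends rather than overwrites, so running it twice leaves duplicate entries; remember to restart Alluxio after changing the file.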

Accelerating Data Access

Alluxio leverages memory to accelerate data access. This exercise is designed so you can experience this acceleration first hand.

First, let’s take a look at the status of a file in Alluxio, mounted from S3.

$ ./bin/alluxio fs ls -h /mnt/s3/sample_tweets_150m.csv
-r-x------    <owner>    <group>    149.77MB    PERSISTED     09-27-2017    11:34:20:077    0%    /mnt/s3/sample_tweets_150m.csv

The 0% in the output shows that none of the file is in Alluxio memory. This file is a sample of tweets. Let’s see how many tweets mention the word ‘kitten’.

$ time ./bin/alluxio fs cat /mnt/s3/sample_tweets_150m.csv | grep -c kitten
889

real	0m22.857s
user	0m7.557s
sys	0m1.181s

Now, let’s see how many tweets mention the word ‘puppy’.

$ time ./bin/alluxio fs cat /mnt/s3/sample_tweets_150m.csv | grep -c puppy
1553

real	0m25.998s
user	0m6.828s
sys	0m1.048s

As you can see, each command takes more than 20 seconds to access the data. Alluxio can accelerate access to this data by storing it in memory. However, the cat shell command does not cache data in Alluxio memory. There is a separate shell command, load, which tells Alluxio to store the data in memory.

$ ./bin/alluxio fs load /mnt/s3/sample_tweets_150m.csv

After loading the file, check the status with the ls command. The output shows that the file is now in memory, so reading it should be much faster.

$ ./bin/alluxio fs ls /mnt/s3/sample_tweets_150m.csv
-r-x------    <owner>    <group>    149.77MB    PERSISTED     09-27-2017    11:34:20:077    100%    /mnt/s3/sample_tweets_150m.csv

Let’s again count the number of tweets with the word ‘puppy’.

$ time ./bin/alluxio fs cat /mnt/s3/sample_tweets_150m.csv | grep -c puppy
1553

real	0m1.917s
user	0m2.306s
sys	0m0.243s

As you can see, reading the file was very fast: about two seconds instead of more than twenty. And, since the data is in Alluxio memory, you can read the file again just as quickly. Let’s observe this by counting how many tweets mention the word ‘bunny’.

$ time ./bin/alluxio fs cat /mnt/s3/sample_tweets_150m.csv | grep -c bunny
907

real	0m1.983s
user	0m2.362s
sys	0m0.240s
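
The pattern in this section (check the in-memory percentage, then load only when needed) can be sketched as a pair of helpers. Both function names and the ALLUXIO variable are hypothetical; the parsing assumes the nine-column ls layout shown in this guide, with the percentage in the eighth column and no spaces in paths.

```shell
# Path to the Alluxio CLI; override ALLUXIO if you are not in the install directory.
ALLUXIO="${ALLUXIO:-./bin/alluxio}"

# Print "<percent> <path>" for each file line of `alluxio fs ls` output on stdin.
# Directory lines (which show DIR instead of a percentage) are skipped.
in_alluxio_percent() {
  awk 'NF >= 9 && $8 ~ /%$/ { sub(/%$/, "", $8); print $8, $9 }'
}

# Load a file into Alluxio memory only if it is not already fully cached.
# Usage (hypothetical helper):
#   load_if_needed /mnt/s3/sample_tweets_150m.csv
load_if_needed() {
  path="$1"
  pct="$("$ALLUXIO" fs ls "$path" | in_alluxio_percent | awk '{ print $1; exit }')"
  if [ "$pct" = "100" ]; then
    echo "already in memory: $path"
  else
    "$ALLUXIO" fs load "$path"
  fi
}
```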

Stop Your Cluster

Alluxio can be stopped and started at the cluster level. Stopping the cluster stops all Alluxio services on all nodes, which in this case means your local computer. All data will remain available after the cluster is restarted, so long as none of the nodes in the cluster were rebooted in the meantime.

$ ./bin/alluxio-stop.sh all 

Next Steps

Congratulations on successfully installing Alluxio on your local computer and performing some basic operations!

There are several next steps available. You can learn more about the key features of Alluxio. You can also deploy fault-tolerant Alluxio on a cluster, transparently mount storage systems with the Alluxio unified namespace, or configure your applications to work with the Alluxio file system API.