Running Hadoop MapReduce on Alluxio
This guide describes how to get Alluxio running with Apache Hadoop MapReduce, so that you can easily run your MapReduce programs with files stored on Alluxio.
The prerequisite for this part is that you have Java installed. We also assume that you have set up Alluxio and Hadoop in accordance with these guides: Local Mode or Cluster Mode. In order to run some simple MapReduce examples, we also recommend you download the MapReduce examples jar matching your Hadoop version, or, if you are using Hadoop 1, this examples jar.
Compiling the Alluxio Client
In order to use Alluxio with your version of Hadoop, you will have to re-compile the Alluxio client jar, specifying your Hadoop version. You can do this by running the following in your Alluxio directory:
mvn install -Dhadoop.version=<YOUR_HADOOP_VERSION> -DskipTests
<YOUR_HADOOP_VERSION> can target many different distributions of Hadoop. For example,
mvn install -Dhadoop.version=2.7.1 -DskipTests would compile Alluxio for Apache Hadoop version 2.7.1.
Please visit the
Building Alluxio Master Branch page for more
information about support for other distributions.
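As an illustration, a vendor distribution can usually be targeted with its full version string. The CDH version below is a hypothetical example; check the Building Alluxio Master Branch page for the distributions actually supported:

```shell
# Hypothetical vendor version string (example only); substitute the
# version reported by your Hadoop distribution.
mvn install -Dhadoop.version=2.6.0-cdh5.4.0 -DskipTests
```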
After the compilation succeeds, the new Alluxio client jar can be found at /<PATH_TO_ALLUXIO>/client/alluxio-1.4.0-client.jar.
This is the jar that you should use for the rest of this guide.
Add the following three properties to the core-site.xml file in your Hadoop installation's conf directory:
<property>
  <name>fs.alluxio.impl</name>
  <value>alluxio.hadoop.FileSystem</value>
  <description>The Alluxio FileSystem (Hadoop 1.x and 2.x)</description>
</property>
<property>
  <name>fs.alluxio-ft.impl</name>
  <value>alluxio.hadoop.FaultTolerantFileSystem</value>
  <description>The Alluxio FileSystem (Hadoop 1.x and 2.x) with fault tolerant support</description>
</property>
<property>
  <name>fs.AbstractFileSystem.alluxio.impl</name>
  <value>alluxio.hadoop.AlluxioFileSystem</value>
  <description>The Alluxio AbstractFileSystem (Hadoop 2.x)</description>
</property>
This will allow your MapReduce jobs to recognize URIs with the Alluxio scheme (i.e.,
alluxio://) in their input and output files.
Also extend $HADOOP_CLASSPATH by editing hadoop-env.sh, in the same conf directory, to include the Alluxio client jar:
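As a sketch, the entry could look like this; the jar path mirrors the examples later in this guide, so adjust it to your installation:

```shell
# Prepend the Alluxio client jar to the Hadoop classpath.
# Path shown matches the other examples in this guide; adjust for your install.
export HADOOP_CLASSPATH=/<PATH_TO_ALLUXIO>/client/alluxio-1.4.0-client.jar:${HADOOP_CLASSPATH}
```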
This ensures the Alluxio client jar is available to the MapReduce job client, which creates and submits jobs that interact with Alluxio-scheme URIs.
NOTE: Starting from the Alluxio 1.3.0 release, adding the Alluxio client jar to
$HADOOP_CLASSPATH is required.
Since that release, security is enabled by default; if
$HADOOP_CLASSPATH does not include the Alluxio
client jar, running MapReduce on Alluxio might result in a "Failed to login: No Alluxio User is found" error.
Distributing the Alluxio Client Jar
In order for the MapReduce job to be able to read and write files in Alluxio, the Alluxio client jar must be distributed to all the nodes in the cluster. This ensures that the TaskTracker and JobClient have everything they need to interface with Alluxio.
This guide on
how to include 3rd party libraries from Cloudera
describes several ways to distribute the jars. From that guide, the recommended way to distribute
the Alluxio client jar is to use the distributed cache, via the
-libjars command line option.
Another way is to manually distribute the client jar to all the Hadoop nodes.
Below are instructions for the two main alternatives:
1. Using the -libjars command line option.
You can run a job by using the
-libjars command line option when using
hadoop jar ..., specifying
/<PATH_TO_ALLUXIO>/client/alluxio-1.4.0-client.jar as the argument of
-libjars. This will place the jar in the Hadoop DistributedCache, making it available to all
the nodes. For example, the following command adds the Alluxio client jar to the
-libjars option:
hadoop jar hadoop-examples-1.2.1.jar wordcount -libjars /<PATH_TO_ALLUXIO>/client/alluxio-1.4.0-client.jar <INPUT FILES> <OUTPUT DIRECTORY>
2. Distributing the jars to all nodes manually.
To install Alluxio on each node, place the client jar
/<PATH_TO_ALLUXIO>/client/alluxio-1.4.0-client.jar in the
$HADOOP_HOME/lib (it may be $HADOOP_HOME/share/hadoop/common/lib for different versions of Hadoop) directory of every
MapReduce node, and then restart Hadoop.
Alternatively, add this jar to the
mapreduce.application.classpath configuration property of your Hadoop deployment
to ensure the jar is on the classpath.
Note that the jar must be installed again for each update to a new release. On the other hand, when the jar is
already on every node, the
-libjars command line option is not needed.
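As a sketch, the classpath entry for the second alternative could be added to mapred-site.xml along these lines; the jar path and the trailing default entries are assumptions to adjust for your deployment:

```xml
<property>
  <name>mapreduce.application.classpath</name>
  <!-- The Alluxio client jar path is illustrative; the entries after it
       reflect a typical Hadoop default and may differ in your setup. -->
  <value>/<PATH_TO_ALLUXIO>/client/alluxio-1.4.0-client.jar,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
```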
Avoiding Conflicting Client Dependencies
It may be the case that the under storage library you are using has dependency conflicts with
Hadoop. For example, using the S3A client to talk to S3 requires higher versions of several libraries
included in Hadoop. You can resolve this conflict by enabling UFS delegation,
alluxio.user.ufs.delegation.enabled=true, which delegates client operations to the under storage
through Alluxio servers. See Configuration Settings for how to modify
the Alluxio configuration. Alternatively you can manually resolve the conflicts when generating the
MapReduce classpath and/or jars, keeping only the highest versions of each dependency.
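For instance, the delegation flag can be set in conf/alluxio-site.properties; the file location assumes the standard Alluxio layout described in Configuration Settings:

```properties
# Delegate client under-storage operations to the Alluxio servers,
# so under-storage client libraries only need to be on the servers.
alluxio.user.ufs.delegation.enabled=true
```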
Running Hadoop wordcount with Alluxio Locally
First, compile Alluxio with the appropriate Hadoop version:
mvn clean install -Dhadoop.version=<YOUR_HADOOP_VERSION>
For simplicity, we will assume a pseudo-distributed Hadoop cluster, started by running the following (depending on the Hadoop version, you might need to replace ./bin with ./sbin):
cd $HADOOP_HOME
./bin/stop-all.sh
./bin/start-all.sh
Start Alluxio locally:
./bin/alluxio-stop.sh all
./bin/alluxio-start.sh local
You can add a sample file to Alluxio to run wordcount on. From your Alluxio directory:
./bin/alluxio fs copyFromLocal LICENSE /wordcount/input.txt
This command will copy the
LICENSE file into the Alluxio namespace with the path /wordcount/input.txt.
Now we can run a MapReduce job for wordcount.
bin/hadoop jar hadoop-examples-1.2.1.jar wordcount -libjars /<PATH_TO_ALLUXIO>/client/alluxio-1.4.0-client.jar alluxio://localhost:19998/wordcount/input.txt alluxio://localhost:19998/wordcount/output
After this job completes, the result of the wordcount will be in the
/wordcount/output directory in Alluxio. You can see the resulting files by running:
./bin/alluxio fs ls /wordcount/output
./bin/alluxio fs cat /wordcount/output/part-r-00000