Running Hadoop MapReduce on Alluxio
- Initial Setup
- Prepare the Alluxio client jar
- Configuring Hadoop
- Distributing the Alluxio Client Jar
- Running Hadoop wordcount with Alluxio Locally
This guide describes how to get Alluxio running with Apache Hadoop MapReduce, so that you can easily run your MapReduce programs with files stored on Alluxio.
Initial Setup
The prerequisites for this guide are:
- Java is installed.
- You have set up an Alluxio cluster in accordance with these guides: Local Mode or Cluster Mode.
- To run some simple MapReduce examples, we also recommend you download the MapReduce examples jar for your Hadoop version, or, if you are using Hadoop 1, this examples jar (see the example below).
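For instance, a sketch of fetching the examples jar for Hadoop 2.7.3 might look like the following; the Maven Central URL and version are assumptions, so adjust them to match your Hadoop installation:
# Hypothetical download of the MapReduce examples jar from Maven Central.
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-examples/2.7.3/hadoop-mapreduce-examples-2.7.3.jar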
Prepare the Alluxio client jar
For MapReduce applications to communicate with the Alluxio service, the Alluxio client jar must be on their classpath. We recommend you download the tarball from the Alluxio download page. The Alluxio client jar for Hadoop can be found at /<PATH_TO_ALLUXIO>/client/hadoop/alluxio-1.5.0-hadoop-client.jar.
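For example, after downloading you might extract the tarball and locate the client jar as follows; the archive and directory names below are illustrative and depend on the exact package you downloaded:
# Extract the Alluxio tarball and locate the Hadoop client jar (file names are assumptions).
tar -xzf alluxio-1.5.0-bin.tar.gz
ls alluxio-1.5.0/client/hadoop/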
Alternatively, advanced users can choose to compile this client jar from the source code. Follow the instructions here and use the generated jar at /<PATH_TO_ALLUXIO>/core/client/runtime/target/alluxio-core-client-runtime-1.5.0-jar-with-dependencies.jar for the rest of this guide.
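As a rough sketch, such a build could look like the following; the source path placeholder and the -Dhadoop.version flag are assumptions, so consult the build instructions for the exact options:
# Build Alluxio from source, skipping tests; set hadoop.version to match your deployment.
cd /<PATH_TO_ALLUXIO_SOURCE>
mvn clean install -DskipTests -Dhadoop.version=2.7.3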
Configuring Hadoop
Add the following three properties to the core-site.xml file in the conf directory of your Hadoop installation:
<property>
<name>fs.alluxio.impl</name>
<value>alluxio.hadoop.FileSystem</value>
<description>The Alluxio FileSystem (Hadoop 1.x and 2.x)</description>
</property>
<property>
<name>fs.alluxio-ft.impl</name>
<value>alluxio.hadoop.FaultTolerantFileSystem</value>
<description>The Alluxio FileSystem (Hadoop 1.x and 2.x) with fault tolerant support</description>
</property>
<property>
<name>fs.AbstractFileSystem.alluxio.impl</name>
<value>alluxio.hadoop.AlluxioFileSystem</value>
<description>The Alluxio AbstractFileSystem (Hadoop 2.x)</description>
</property>
This will allow your MapReduce jobs to recognize URIs with the Alluxio scheme (i.e., alluxio://) in their input and output files.
Optionally, modify $HADOOP_CLASSPATH in hadoop-env.sh (also in the conf directory) to include:
export HADOOP_CLASSPATH=/<PATH_TO_ALLUXIO>/client/alluxio-1.5.0-client.jar:${HADOOP_CLASSPATH}
This ensures the Alluxio client jar is available to the MapReduce job client that creates and submits jobs interacting with URIs with the Alluxio scheme.
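With the core-site.xml properties and $HADOOP_CLASSPATH in place, a quick sanity check is to list an Alluxio path through Hadoop's FileSystem shell. This sketch assumes the Alluxio master is running on localhost at the default port 19998:
# From $HADOOP_HOME: list the Alluxio root directory via the Hadoop FileSystem shell.
bin/hadoop fs -ls alluxio://localhost:19998/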
Distributing the Alluxio Client Jar
In order for MapReduce applications to read and write files in Alluxio, the Alluxio client jar must be on the application classpath on every node.
This guide on how to include 3rd party libraries from Cloudera describes several ways to distribute the jars. From that guide, the recommended way to distribute the Alluxio client jar is to use the distributed cache, via the -libjars command line option. Another way is to manually distribute the client jar to all the Hadoop nodes. Below are instructions for the two main alternatives:
1. Using the -libjars command line option.
You can use the -libjars command line option when running hadoop jar ..., specifying /<PATH_TO_ALLUXIO>/client/alluxio-1.5.0-client.jar as the argument of -libjars. Hadoop will place the jar in the Hadoop DistributedCache, making it available to all the nodes. For example, the following command adds the Alluxio client jar to the -libjars option:
bin/hadoop jar libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount -libjars /<PATH_TO_ALLUXIO>/client/alluxio-1.5.0-client.jar <INPUT FILES> <OUTPUT DIRECTORY>
2. Distributing the client jar to all nodes manually.
To install Alluxio on each node, place the client jar /<PATH_TO_ALLUXIO>/client/alluxio-1.5.0-client.jar in the $HADOOP_HOME/lib directory (which may be $HADOOP_HOME/share/hadoop/common/lib for different versions of Hadoop) of every MapReduce node, and then restart Hadoop. Alternatively, add this jar to the mapreduce.application.classpath system property for your Hadoop deployment to ensure it is on the classpath.
Note that the jar must be installed again each time you update to a new Alluxio release. On the other hand, when the jar is already on every node, the -libjars command line option is not needed.
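For the manual approach, a minimal sketch is shown below. It assumes passwordless scp to each node, that HADOOP_HOME points to the same path on every node, and a hypothetical nodes.txt file listing the hostnames of your MapReduce nodes:
# Copy the Alluxio client jar into Hadoop's lib directory on every node.
# The target may instead be $HADOOP_HOME/share/hadoop/common/lib on some Hadoop versions.
for host in $(cat nodes.txt); do
  scp /<PATH_TO_ALLUXIO>/client/alluxio-1.5.0-client.jar ${host}:${HADOOP_HOME}/lib/
done
Remember to restart Hadoop once the jar is in place on every node.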
Running Hadoop wordcount with Alluxio Locally
For simplicity, we will assume a pseudo-distributed Hadoop cluster, started by running the following (depending on the Hadoop version, you may need to replace ./bin with ./sbin):
cd $HADOOP_HOME
bin/stop-all.sh
bin/start-all.sh
Start Alluxio locally:
bin/alluxio-start.sh local
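Before launching the job, you can verify the local Alluxio instance is healthy, for example by running Alluxio's built-in test suite (this assumes the default local deployment; you can also check the master web UI at http://localhost:19999):
# Run Alluxio's built-in read/write tests against the local cluster.
bin/alluxio runTests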
You can add a sample file to Alluxio to run wordcount on. From your Alluxio directory:
bin/alluxio fs copyFromLocal LICENSE /wordcount/input.txt
This command will copy the LICENSE file into the Alluxio namespace with the path /wordcount/input.txt.
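To confirm the input is in place before running the job, you can list and inspect it, for example:
# List the wordcount input directory and preview the copied file.
bin/alluxio fs ls /wordcount
bin/alluxio fs cat /wordcount/input.txt | head -n 5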
Now we can run a MapReduce wordcount job (using Hadoop 2.7.3 as an example).
bin/hadoop jar libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount -libjars /<PATH_TO_ALLUXIO>/client/alluxio-1.5.0-client.jar alluxio://localhost:19998/wordcount/input.txt alluxio://localhost:19998/wordcount/output
After this job completes, the result of the wordcount will be in the /wordcount/output directory in Alluxio. You can see the resulting files by running:
bin/alluxio fs ls /wordcount/output
bin/alluxio fs cat /wordcount/output/part-r-00000