Running Impala with Alluxio
Impala is an open source distributed SQL query engine for running queries on data stored in Apache Hadoop. This guide describes how to set up Impala through Cloudera Manager to use Alluxio as its filesystem.
Prerequisites
You should already have Cloudera’s Distribution (CDH) installed. These instructions have been tested with CDH 6 and use Cloudera Manager throughout.
It is also assumed that Alluxio has been installed on the cluster.
Running CDH Impala
To run CDH Impala applications with Alluxio, some additional configuration is required.
The following configurations assume that Alluxio is installed in /opt/alluxio.
Configuring core-site.xml files
Add the following property to core-site.xml in each of the sections listed below:
- Under the HDFS component, select Configuration and search for Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml
<property> <name>fs.alluxio.impl</name> <value>alluxio.hadoop.FileSystem</value> </property>
- Under the Impala component, select Configuration and search for Impala Catalog Server Advanced Configuration Snippet (Safety Valve) for core-site.xml
<property> <name>fs.alluxio.impl</name> <value>alluxio.hadoop.FileSystem</value> </property>
Impala Daemon Advanced Configuration Snippet (Safety Valve) for core-site.xml
<property> <name>fs.alluxio.impl</name> <value>alluxio.hadoop.FileSystem</value> </property>
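As a rough sanity check once the configuration has been deployed, the sketch below tests whether a given core-site.xml contains the property shown above. The function name has_alluxio_impl is hypothetical, and the check is a plain text match rather than real XML parsing, so it assumes the name and value elements are written exactly as in this guide.

```shell
# Hypothetical helper: succeed only if the given core-site.xml contains both
# the fs.alluxio.impl property name and the alluxio.hadoop.FileSystem value.
# This is a fixed-string text match, not an XML parse.
has_alluxio_impl() {
  conf_file="$1"
  if [ ! -r "$conf_file" ]; then
    echo "cannot read: $conf_file" >&2
    return 1
  fi
  grep -qF '<name>fs.alluxio.impl</name>' "$conf_file" &&
    grep -qF '<value>alluxio.hadoop.FileSystem</value>' "$conf_file"
}
```

Point it at whichever core-site.xml the service actually reads. On many CDH hosts the client copy is at /etc/hadoop/conf/core-site.xml, but that location is an assumption to verify on your cluster.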
Configuring CLASSPATH
Edit the following sections to add the Alluxio client jar to the application classpath.
The following examples assume that the jar is located at /opt/alluxio/client/alluxio-enterprise-2.10.0-3.4-client.jar; double-check that the value you set is a valid path to the Alluxio client jar.
- Under the YARN (MR2 Included) component, select Configuration and search for Gateway Client Environment Advanced Configuration Snippet (Safety Valve) for hadoop-env.sh
HADOOP_CLASSPATH=/opt/alluxio/client/alluxio-enterprise-2.10.0-3.4-client.jar:${HADOOP_CLASSPATH}
YARN Application Classpath
/opt/alluxio/client/alluxio-enterprise-2.10.0-3.4-client.jar
MR Application Classpath
/opt/alluxio/client/alluxio-enterprise-2.10.0-3.4-client.jar
- Under the Impala component, select Configuration and search for Impala Service Environment Advanced Configuration Snippet (Safety Valve)
CLASSPATH=/opt/alluxio/client/alluxio-enterprise-2.10.0-3.4-client.jar:${CLASSPATH}
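Because every setting above depends on the jar path being correct, it may be worth checking the path on each host before saving the configuration. The following is a minimal sketch; check_client_jar is a hypothetical helper, and it only confirms that the path names a readable, non-empty .jar file, not that the jar is a compatible Alluxio client.

```shell
# Hypothetical helper: verify that a path ends in .jar and refers to a
# regular file that is readable and non-empty.
check_client_jar() {
  jar_path="$1"
  case "$jar_path" in
    *.jar) ;;
    *)
      echo "not a .jar path: $jar_path" >&2
      return 1
      ;;
  esac
  if [ ! -f "$jar_path" ] || [ ! -r "$jar_path" ] || [ ! -s "$jar_path" ]; then
    echo "missing, unreadable, or empty: $jar_path" >&2
    return 1
  fi
  echo "ok: $jar_path"
}
```

For example, running check_client_jar /opt/alluxio/client/alluxio-enterprise-2.10.0-3.4-client.jar prints an "ok" line when the jar is in place and returns a non-zero status otherwise.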
Configuring HIVE_AUX_JARS_PATH
Under the Hive component, select Configuration and search for Hive Auxiliary JARs Directory
/opt/alluxio/client/
Example: Create an Impala table in Alluxio from HDFS
The following example creates an internal table in Impala backed by files in Alluxio.
Download the MovieLens 100K dataset from http://grouplens.org/datasets/movielens/.
Unzip the archive and upload the extracted data to /ml-100k/ in Alluxio:
$ ./bin/alluxio fs mkdir /ml-100k
$ ./bin/alluxio fs copyFromLocal /path/to/ml-100k alluxio:///ml-100k
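The table created later in this example declares five columns separated by '|'. As an optional check on the local data, the sketch below counts rows in a file that do not have the expected number of fields. The helper name count_bad_rows is hypothetical, and it is assumed here that the user data is the u.user file of the unzipped archive.

```shell
# Hypothetical helper: print how many lines of a '|'-delimited file have a
# field count different from the expected number of columns.
count_bad_rows() {
  data_file="$1"
  expected_cols="$2"
  awk -F'|' -v n="$expected_cols" 'NF != n { bad++ } END { print bad + 0 }' "$data_file"
}
```

For example, count_bad_rows /path/to/ml-100k/u.user 5 prints 0 when every row has exactly five fields.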
Connect to Impala using impala-shell:
$ impala-shell -i myHostname
where myHostname is the name of the host to connect to.
Create a new internal table pointing to Alluxio.
CREATE TABLE u_user (
  userid INT,
  age INT,
  gender CHAR(1),
  occupation STRING,
  zipcode STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'alluxio://master_hostname:port/ml-100k';
To create an external table instead, change CREATE TABLE to CREATE EXTERNAL TABLE in the previous command.
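The LOCATION clause expects a full Alluxio URI with the master's hostname and port filled in. A small sketch of assembling that value follows; alluxio_uri is a hypothetical helper, and 19998 is used only because it is Alluxio's default master RPC port, so substitute the port your cluster actually uses.

```shell
# Hypothetical helper: build an alluxio:// URI from host, port, and path.
# One leading slash on the path is stripped so the result contains exactly
# one slash between the port and the path.
alluxio_uri() {
  host="$1"
  port="$2"
  uri_path="$3"
  printf 'alluxio://%s:%s/%s\n' "$host" "$port" "${uri_path#/}"
}

# Example with placeholder values; prints alluxio://master_hostname:19998/ml-100k
alluxio_uri master_hostname 19998 /ml-100k
```

The printed string can be pasted into the LOCATION clause of the CREATE TABLE statement.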