Running Impala with Alluxio

Impala is an open source distributed SQL query engine for running queries on data stored in Apache Hadoop. This guide describes how to set up Impala through Cloudera Manager to use Alluxio as its filesystem.

Prerequisites

You should already have Cloudera’s Distribution (CDH) installed. CDH 6 has been tested, and Cloudera Manager is used for the instructions in the rest of this document.

It is also assumed that Alluxio has been installed on the cluster.

Running CDH Impala

To run CDH Impala applications with Alluxio, some additional configuration is required. The following configurations assume that Alluxio is installed in /opt/alluxio.

Configuring core-site.xml files

Add the following property to core-site.xml in each of the following places:

  1. Under the HDFS component, select Configuration and search for Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml
     <property>
         <name>fs.alluxio.impl</name>  
         <value>alluxio.hadoop.FileSystem</value>
     </property>
    
  2. Under the Impala component, select Configuration and search for
    1. Impala Catalog Server Advanced Configuration Snippet (Safety Valve) for core-site.xml
       <property>
           <name>fs.alluxio.impl</name>  
           <value>alluxio.hadoop.FileSystem</value>
       </property>
      
    2. Impala Daemon Advanced Configuration Snippet (Safety Valve) for core-site.xml
       <property>
           <name>fs.alluxio.impl</name>  
           <value>alluxio.hadoop.FileSystem</value>
       </property>
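
After the services have been restarted and the client configuration redeployed, it can be useful to confirm that the property actually reached a deployed core-site.xml. The following is a minimal sketch, not part of the original instructions: the helper name has_alluxio_impl is hypothetical, the path /etc/hadoop/conf/core-site.xml is an assumption about where the client configuration lives on your hosts, and the check is a plain text match rather than a real XML parse.

```shell
# Hypothetical helper: succeed if the given core-site.xml mentions both the
# fs.alluxio.impl property name and the alluxio.hadoop.FileSystem value.
# This is a plain text match, not a real XML parse.
has_alluxio_impl() {
  conf_file="$1"
  [ -r "$conf_file" ] || return 1
  grep -q '<name>fs.alluxio.impl</name>' "$conf_file" &&
    grep -q '<value>alluxio.hadoop.FileSystem</value>' "$conf_file"
}

# Assumed location of the deployed client configuration; adjust as needed.
CORE_SITE="${CORE_SITE:-/etc/hadoop/conf/core-site.xml}"
if has_alluxio_impl "$CORE_SITE"; then
  echo "fs.alluxio.impl found in $CORE_SITE"
else
  echo "fs.alluxio.impl not found in $CORE_SITE"
fi
```

Set CORE_SITE to another file to check a different copy of the configuration.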
      

Configuring CLASSPATH

Edit the following settings to add the Alluxio client jar to the application classpath. The examples assume that the jar is located at /opt/alluxio/client/alluxio-enterprise-2.8.0-5.3-client.jar; verify that the value you set is a valid path to the Alluxio client jar.

  1. Under the YARN (MR2 Included) component, select Configuration and search for
    1. Gateway Client Environment Advanced Configuration Snippet (Safety Valve) for hadoop-env.sh
      HADOOP_CLASSPATH=/opt/alluxio/client/alluxio-enterprise-2.8.0-5.3-client.jar:${HADOOP_CLASSPATH}
      
    2. YARN Application Classpath
      /opt/alluxio/client/alluxio-enterprise-2.8.0-5.3-client.jar
      
    3. MR Application Classpath
      /opt/alluxio/client/alluxio-enterprise-2.8.0-5.3-client.jar
      
  2. Under the Impala component, select Configuration and search for Impala Service Environment Advanced Configuration Snippet (Safety Valve)
     CLASSPATH=/opt/alluxio/client/alluxio-enterprise-2.8.0-5.3-client.jar:${CLASSPATH}
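
Since every classpath entry above depends on the jar path being valid, a quick check on each host may save a restart cycle. This is a rough sketch only: check_client_jar is a hypothetical helper name, and the default path is simply the one assumed in this guide.

```shell
# Hypothetical helper: succeed only if the path ends in .jar and points to a
# readable, non-empty file.
check_client_jar() {
  jar_path="$1"
  case "$jar_path" in
    *.jar) ;;
    *) return 1 ;;
  esac
  [ -r "$jar_path" ] && [ -s "$jar_path" ]
}

# Path assumed in this guide; adjust to your installation.
ALLUXIO_CLIENT_JAR="${ALLUXIO_CLIENT_JAR:-/opt/alluxio/client/alluxio-enterprise-2.8.0-5.3-client.jar}"
if check_client_jar "$ALLUXIO_CLIENT_JAR"; then
  echo "client jar looks valid: $ALLUXIO_CLIENT_JAR"
else
  echo "client jar missing, empty, or not a .jar: $ALLUXIO_CLIENT_JAR"
fi
```

The check only inspects the file on the local host, so it would need to be repeated wherever the classpath is used.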
    

Configuring HIVE_AUX_JARS_PATH

Under the Hive component, select Configuration, search for Hive Auxiliary JARs Directory, and set it to:

/opt/alluxio/client/

Example: Create an Impala table in Alluxio from HDFS

The following example creates an internal table in Impala backed by files in Alluxio. Download the MovieLens 100K dataset from http://grouplens.org/datasets/movielens/, unzip it, and upload the extracted data into /ml-100k/ in Alluxio:

$ ./bin/alluxio fs mkdir /ml-100k
$ ./bin/alluxio fs copyFromLocal /path/to/ml-100k alluxio:///ml-100k

Connect to Impala using the impala-shell:

impala-shell -i myHostname

Here, myHostname is the name of the Impala host to connect to.

Create a new internal table pointing to Alluxio, replacing master_hostname and port with the address of your Alluxio master:

CREATE TABLE u_user (
  userid INT,
  age INT,
  gender CHAR(1),
  occupation STRING,
  zipcode STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'alluxio://master_hostname:port/ml-100k';

An external table can be created by modifying the previous command from CREATE TABLE to CREATE EXTERNAL TABLE.
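
The column list in the CREATE TABLE statement above implies rows with five '|'-separated fields. As a rough local sanity check on the downloaded data, the sketch below counts rows that do not fit that shape. It assumes the table is modeled on the u.user file of the MovieLens 100K dataset; count_bad_rows is a hypothetical helper name and the USER_FILE path is a placeholder.

```shell
# Hypothetical helper: print the number of lines in a file that do not have
# exactly five '|'-separated fields (0 means every row fits the schema).
count_bad_rows() {
  awk -F'|' 'NF != 5 { bad++ } END { print bad + 0 }' "$1"
}

# Assumed local path to the unzipped dataset's user file; adjust as needed.
USER_FILE="${USER_FILE:-/path/to/ml-100k/u.user}"
if [ -r "$USER_FILE" ]; then
  echo "rows not matching the 5-column schema: $(count_bad_rows "$USER_FILE")"
else
  echo "cannot read $USER_FILE"
fi
```

A non-zero count suggests that some rows would not split cleanly into the five declared columns.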