Running Presto with Alluxio

This guide describes how to run Presto with Alluxio, so that you can easily use Presto to query Hive tables in Alluxio’s tiered storage.

Prerequisites

The prerequisite for this part is that you have Java installed; the version must be Java 8 Update 60 or higher (8u60+), 64-bit. An Alluxio cluster should also be set up by following the Alluxio Deployment guides.
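You can check that a suitable Java is installed with:

$ java -version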

Please download Presto (this doc uses presto-0.170). Also, please complete the Hive setup by following Hive On Alluxio.

Configuration

Presto gets database and table metadata from the Hive Metastore; the file system location of the table data comes from the table’s metadata entries. To read that data from HDFS, you need to add the Hadoop conf files (core-site.xml, hdfs-site.xml) and use hive.config.resources in /<PATH_TO_PRESTO>/etc/catalog/hive.properties to point to their locations, on every Presto worker.

Configure core-site.xml

You need to add the following configuration items to the core-site.xml configured in hive.properties:

<property>
  <name>fs.alluxio.impl</name>
  <value>alluxio.hadoop.FileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.alluxio.impl</name>
  <value>alluxio.hadoop.AlluxioFileSystem</value>
  <description>The Alluxio AbstractFileSystem (Hadoop 2.x)</description>
</property>

To use fault tolerant mode, set the Alluxio cluster properties appropriately in an alluxio-site.properties file which is on the classpath.

alluxio.zookeeper.enabled=true
alluxio.zookeeper.address=[zookeeper_hostname]:2181

Alternatively, you can add the properties to the Hadoop core-site.xml configuration, which is then propagated to Alluxio.

<configuration>
  <property>
    <name>alluxio.zookeeper.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>alluxio.zookeeper.address</name>
    <value>[zookeeper_hostname]:2181</value>
  </property>
</configuration>

Configure additional Alluxio properties

Similar to the above, add any additional Alluxio properties to the Hadoop core-site.xml in the Hadoop configuration directory on each node. For example, to change alluxio.user.file.writetype.default from its default of MUST_CACHE to CACHE_THROUGH:

<property>
  <name>alluxio.user.file.writetype.default</name>
  <value>CACHE_THROUGH</value>
</property>

Alternatively, you can append the path to alluxio-site.properties to Presto’s JVM config at etc/jvm.config under the Presto folder. The advantage of this approach is that all the Alluxio properties are kept within the same alluxio-site.properties file.

...
-Xbootclasspath/p:<path-to-alluxio-site-properties>

It is also recommended to increase alluxio.user.network.netty.timeout to a larger value (e.g. 10 minutes) to avoid timeout failures when reading large files from remote workers.
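For example, you could set the following in alluxio-site.properties (the 10min value is only an illustration; tune it for your workload):

alluxio.user.network.netty.timeout=10min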

Enable hive.force-local-scheduling

It is recommended to co-locate Presto workers with Alluxio workers so that Presto workers can read data locally. An important option to enable in Presto is hive.force-local-scheduling, which forces splits to be scheduled on the same nodes as the Alluxio workers serving the split data. By default, hive.force-local-scheduling is set to false, so Presto will not attempt to schedule work on the same machines as the Alluxio worker nodes.
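To enable it, add the following line to /<PATH_TO_PRESTO>/etc/catalog/hive.properties on each node:

hive.force-local-scheduling=true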

Increase hive.max-split-size

Presto’s Hive integration uses the hive.max-split-size property to control the parallelism of a query. It is recommended to set this size to no less than Alluxio’s block size, to avoid read contention within the same block.
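For example, assuming your cluster uses Alluxio’s default 512MB block size (an assumption; check alluxio.user.block.size.bytes.default on your cluster), you could match it in hive.properties:

hive.max-split-size=512MB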

Distribute the Alluxio Client Jar

Distribute the Alluxio client jar to all Presto worker nodes:

  • You must put the Alluxio client jar /path/to/alluxio/client/alluxio-enterprise-1.8.0-client.jar into the Presto cluster’s worker directory ${PRESTO_HOME}/plugin/hive-hadoop2/ (for different versions of Hadoop, put the jar into the corresponding folder), and then restart the coordinator and worker processes (see the sketch after this list).
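A minimal sketch of one way to distribute the jar, assuming passwordless SSH, the same ${PRESTO_HOME} path on every node, and a presto-nodes.txt file listing the Presto hostnames (all of these are assumptions; adapt to your own deployment tooling):

# Copy the Alluxio client jar into the hive-hadoop2 plugin directory of every Presto node.
$ for host in $(cat presto-nodes.txt); do \
    scp /path/to/alluxio/client/alluxio-enterprise-1.8.0-client.jar \
        ${host}:${PRESTO_HOME}/plugin/hive-hadoop2/; \
  done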

Presto CLI examples

Configure Alluxio as the default filesystem of Hive by following the instructions here.

Create a table in Hive and load a file from a local path into it:

You can download the data file from http://grouplens.org/datasets/movielens/ (this example uses the ml-100k dataset).

hive> CREATE TABLE u_user (
userid INT,
age INT,
gender CHAR(1),
occupation STRING,
zipcode STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

hive> LOAD DATA LOCAL INPATH '<path_to_ml-100k>/u.user'
OVERWRITE INTO TABLE u_user;

Visit the Alluxio Web UI at http://<MASTER_HOSTNAME>:19999 and you can see the directory and file that Hive created:

[Screenshot: the Hive table directory and file shown in the Alluxio Web UI]

Alternatively, you can follow the instructions to create the tables from existing files in Alluxio.
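As a sketch of that alternative, assuming the u.user file has already been copied into Alluxio at alluxio://<MASTER_HOSTNAME>:19998/ml-100k (the path and the u_user_alluxio table name are illustrative), you could create an external table over the existing data:

hive> CREATE EXTERNAL TABLE u_user_alluxio (
userid INT,
age INT,
gender CHAR(1),
occupation STRING,
zipcode STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'alluxio://<MASTER_HOSTNAME>:19998/ml-100k';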

Next, start your Hive metastore service. Hive metastore listens on port 9083 by default.

$ /<PATH_TO_HIVE>/bin/hive --service metastore

The following is an example of the Presto configuration in /<PATH_TO_PRESTO>/etc/catalog/hive.properties:

connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
hive.config.resources=/<PATH_TO_HADOOP>/etc/hadoop/core-site.xml,/<PATH_TO_HADOOP>/etc/hadoop/hdfs-site.xml

Start your Presto server. The server runs on port 8080 by default:

$ /<PATH_TO_PRESTO>/bin/launcher run

Follow the Presto CLI guidance to download the presto-cli-<PRESTO_VERSION>-executable.jar, rename it to presto, and make it executable with chmod +x (sometimes the executable presto already exists at /<PATH_TO_PRESTO>/bin/presto and you can use it directly).
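For example:

$ mv presto-cli-<PRESTO_VERSION>-executable.jar presto
$ chmod +x presto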

Run a single query similar to:

$ ./presto --server localhost:8080 --execute "use default;select * from u_user limit 10;" --catalog hive --debug

You can then see the query results in the console:

[Screenshot: Presto query results in the console]

Presto server log:

[Screenshot: Presto server query log]