Running Presto with Alluxio
This guide describes how to run Presto with Alluxio, so that you can easily use Presto to query Hive tables stored in Alluxio’s tiered storage.
Prerequisites
The prerequisite for this part is that you have Java. And the Java version must use Java 8 Update 60 or higher (8u60+), 64-bit. Alluxio cluster should also be set up in accordance to these guides for either Local Mode or Cluster Mode.
Please Download Presto(this doc uses presto-0.170). Also, please complete Hive setup using Hive On Alluxio
Configuration
Presto gets the database and table metadata information from Hive Metastore. At the same time,
the file system location of table data is obtained from the table’s metadata entries. So you need to configure
Presto on HDFS. In order to access HDFS,
you need to add the Hadoop conf files (core-site.xml,hdfs-site.xml), and use hive.config.resources
in
file /<PATH_TO_PRESTO>/etc/catalog/hive.properties
to point to the file’s location for every Presto worker.
Configure core-site.xml
You need to add the following configuration items to the core-site.xml
configured in hive.properties
:
<property>
<name>fs.alluxio.impl</name>
<value>alluxio.hadoop.FileSystem</value>
</property>
<property>
<name>fs.AbstractFileSystem.alluxio.impl</name>
<value>alluxio.hadoop.AlluxioFileSystem</value>
<description>The Alluxio AbstractFileSystem (Hadoop 2.x)</description>
</property>
If your Alluxio is HA, another configuration need to added:
<property>
<name>fs.alluxio-ft.impl</name>
<value>alluxio.hadoop.FaultTolerantFileSystem</value>
<description>The Alluxio FileSystem (Hadoop 1.x and 2.x) with fault tolerant support</description>
</property>
Configure additional Alluxio properties
Similar to above, add additional Alluxio properties to core-site.xml
of Hadoop configuration in Hadoop directory on each node.
For example, change alluxio.user.file.writetype.default
from default MUST_CACHE
to CACHE_THROUGH
:
<property>
<name>alluxio.user.file.writetype.default</name>
<value>CACHE_THROUGH</value>
</property>
Alternatively, you can also append the path to alluxio-site.properties
to Presto’s JVM config at etc/jvm.config
under Presto folder. The advantage of this approach is to have all the Alluxio properties set within the same file of alluxio-site.properties
.
...
-Xbootclasspath/p:<path-to-alluxio-site-properties>
Also, it’s recommended to increase alluxio.user.network.netty.timeout.ms
to a bigger value (e.g. 10 mins) to avoid the timeout
failure when reading large files from remote worker.
Increase hive.max-split-size
Presto’s Hive integration uses the config hive.max-split-size
to control the parallelism of the query. It’s recommended to set this size no less than Alluxio’s block size to avoid the read contention within the same block.
Distribute the Alluxio Client Jar
Distribute the Alluxio client jar to all worker nodes in Presto:
- You must put Alluxio client jar
/<PATH_TO_ALLUXIO>/client/presto/alluxio-1.5.0-presto-client.jar
into Presto cluster’s worker directory$PRESTO_HOME/plugin/hive-hadoop2/
(For different versions of Hadoop, put the appropriate folder), And restart the process of coordinator and worker.
Alternatively, advanced users can choose to compile this client jar from the source code. Follow the instructs here and use the generated jar at /<PATH_TO_ALLUXIO>/core/client/runtime/target/alluxio-core-client-runtime-1.5.0-jar-with-dependencies.jar
for the rest of this guide.
Presto cli examples
Create a table in Hive and load a file in local path into Hive:
You can download the data file from http://grouplens.org/datasets/movielens/
hive> CREATE TABLE u_user (
userid INT,
age INT,
gender CHAR(1),
occupation STRING,
zipcode STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
hive> LOAD DATA LOCAL INPATH '<path_to_ml-100k>/u.user'
OVERWRITE INTO TABLE u_user;
View Alluxio WebUI at http://master_hostname:19999
and you can see the directory and file Hive creates:
Using a single query:
/home/path/presto/presto-cli-0.170-executable.jar --server masterIp:prestoPort --execute "use default;select * from u_user limit 10;" --user username --debug
And you can see the query results from console:
Presto Server log: