Presto SDK with Local Cache

Slack Docker Pulls GitHub edit source

Presto provides an SDK way to combined with Alluxio. With the SDK, hot data that need to be scanned frequently can be cached locally on Presto Workers that execute the TableScan operator.

Prerequisites

  • Setup Java for Java 8 Update 161 or higher (8u161+), 64-bit.
  • Deploy Presto.
  • Alluxio has been set up and is running following the deployment guide here.
  • Make sure that the Alluxio client jar that provides the SDK is available. This Alluxio client jar file can be found at /<PATH_TO_ALLUXIO>/client/alluxio-${VERSION}-client.jar in the tarball downloaded from Alluxio download page.
  • Make sure that Hive Metastore is running to serve metadata information of Hive tables. The default port of Hive Metastore is 9083. Executing lsof -i:9083 can check whether the Hive Metastore process exists or not.

Basic Setup

Configure Presto to connect to Hive Metastore

Presto gets the database and table metadata information (including file system locations) from the Hive Metastore, via Presto’s Hive Connector. Here is an example Presto configuration file ${PRESTO_HOME}/etc/hive.properties, for a catalog using the Hive connector, where the metastore is located on localhost.

connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083

Enable local caching for Presto

To enable local caching, add the following configurations in ${PRESTO_HOME}/etc/hive.properties:

hive.node-selection-strategy=SOFT_AFFINITY
cache.enabled=true
cache.type=ALLUXIO
cache.base-directory=file:///tmp/alluxio
cache.alluxio.max-cache-size=100MB

Here cache.enabled=true and cache.type=ALLUXIO are to enable the local caching feature in Presto. cache.base-directory is used for specifying the path for local caching. cache.alluxio.max-cache-size is to allocate the space for local caching.

Distribute the Alluxio client jar to all Presto servers

As Presto communicates with Alluxio servers by the SDK provided in the Alluxio client jar, the Alluxio client jar must be in the classpath of Presto servers. Put the Alluxio client jar /<PATH_TO_ALLUXIO>/client/alluxio-2.9.1-client.jar into the directory ${PRESTO_HOME}/plugin/hive-hadoop2/ (this directory may differ across versions) on all Presto servers. Restart the Presto workers and coordinator

$ ${PRESTO_HOME}/bin/launcher restart

After completing the basic configuration, Presto should be able to access data in Alluxio.

Example

Create a Hive

Create a Hive table by hive client specifying its LOCATION to Alluxio.

hive> CREATE TABlE employee_parquet_alluxio (name string, salary int)
PARTITIONED BY (doj string)
STORED AS PARQUET
LOCATION 'alluxio://Master01:19998/alluxio/employee_parquet_alluxio';

Replace Master01:19998 to your Alluxio Master address. Note that we set STORED AS PARQUET here since currently only parquet and orc format is supported in Presto Local Caching.

Insert data

Insert some data into the created Hive table for testing.

INSERT INTO employee_parquet_alluxio select 'jack', 15000, '2023-02-26';
INSERT INTO employee_parquet_alluxio select 'make', 25000, '2023-02-25';
INSERT INTO employee_parquet_alluxio select 'amy', 20000, '2023-02-26';

Query table using Presto

Follow Presto CLI instructions to download the presto-cli-<PRESTO_VERSION>-executable.jar,
rename it to presto-cli, and make it executable with chmod + x. Run a single query with presto-cli to select the data from the table.

presto> SELECT * FROM employee_parquet_alluxio;

You can see that data are cached in the directory specified in /etc/catalog/hive.properties. In our example, we should see the files are cached in /tmp/alluxio/LOCAL.

Advanced Setup

Monitor metrics about local caching

In order to expose the metrics of local caching, follow the steps below:

  • Step 1: Add -Dalluxio.metrics.conf.file=<ALLUXIO_HOME>/conf/metrics.properties to specify the metrics configuration for the SDK used by Presto.
  • Step 2: Add sink.jmx.class=alluxio.metrics.sink.JmxSink to <ALLUXIO_HOME>/conf/metrics.properties to expose the metrics.
  • Step 3: Add cache.alluxio.metrics-enabled=true in <PRESTO_HOME>/etc/catalog.hive.properties to enable metric collection.
  • Step 4: Restart the Presto process by executing <PRESTO_HOME>/bin/laucher restart.
  • Step 5: Metrics about local caching should be seen in JMX if we access Presto’s JMX RESTful API <PRESTO_NODE_HOST_NAME>:<PRESTO_PORT>/v1/jmx.

The following metrics would be useful for tracking local caching:

Metric Name Type Description
Client.CacheBytesReadCache METER Bytes read from client.
Client.CachePutErrors COUNTER Number of failures when putting cached data in the client cache.
Client.CachePutInsufficientSpaceErrors COUNTER Number of failures when putting cached data in the client cache due to insufficient space made after eviction.
Client.CachePutNotReadyErrors COUNTER Number of failures when cache is not ready to add pages.
Client.CachePutBenignRacingErrors COUNTER Number of failures when adding pages due to racing eviction. This error is benign.
Client.CachePutStoreWriteErrors COUNTER Number of failures when putting cached data in the client cache due to failed writes to page store.
Client.CachePutEvictionErrors COUNTER Number of failures when putting cached data in the client cache due to failed eviction.
Client.CachePutStoreDeleteErrors COUNTER Number of failures when putting cached data in the client cache due to failed deletes in page store.
Client.CacheGetErrors COUNTER Number of failures when getting cached data in the client cache.
Client.CacheGetNotReadyErrors COUNTER Number of failures when cache is not ready to get pages.
Client.CacheGetStoreReadErrors COUNTER Number of failures when getting cached data in the client cache due to failed read from page stores.
Client.CacheDeleteNonExistingPageErrors COUNTER Number of failures when deleting pages due to absence.
Client.CacheDeleteNotReadyErrors COUNTER Number of failures when cache is not ready to delete pages.
Client.CacheDeleteFromStoreErrors COUNTER Number of failures when deleting pages from page stores.
Client.CacheHitRate GAUGE Cache hit rate: (# bytes read from cache) / (# bytes requested).
Client.CachePagesEvicted METER Total number of pages evicted from the client cache.
Client.CacheBytesEvicted METER Total number of bytes evicted from the client cache.