Running Alluxio on Google Cloud Dataproc
This guide describes how to configure Alluxio to run on Google Cloud Dataproc.
- Overview
- Prerequisites
- Basic Setup
- Next steps
- Spark on Alluxio in Dataproc
- Presto on Alluxio in Dataproc
Overview
Google Cloud Dataproc is a managed on-demand service to run Spark and Hadoop compute workloads. It manages the deployment of various Hadoop services and allows for hooks into these services for customizations. Aside from the added performance benefits of caching, Alluxio also enables users to run compute workloads against on-premise storage or storage in a different cloud provider, e.g. AWS S3 or Azure Blob Store.
Prerequisites
- Account with Cloud Dataproc API enabled
- A GCS Bucket
- gcloud CLI: Make sure that the CLI is set up with the necessary GCS interoperable storage access keys. Note: GCS interoperability must be enabled in the Interoperability tab of the GCS settings.
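One way to create interoperable (HMAC) access keys for a service account is with gsutil, as in the sketch below; the service account email is a placeholder, and keys for a user account can instead be created from the Interoperability tab.
# create HMAC keys for a service account (placeholder email)
$ gsutil hmac create <service-account-email>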
A GCS bucket is required to serve as Alluxio's Root Under File System and as the location for the bootstrap script. If required, the root UFS can be reconfigured to be HDFS or any other supported under store.
Basic Setup
When creating a Dataproc cluster, Alluxio can be installed using an initialization action.
The Alluxio initialization action is hosted in a publicly readable GCS location gs://alluxio-public/enterprise-dataproc/2.1.2-1.0/alluxio-dataproc.sh.
- The base64 encoded license should be passed using the metadata key alluxio_license_base64 (see the preparation sketch after this list).
Base64 encode the license using:
$(cat license.json | base64 | tr -d "\n")
- Host the Alluxio Enterprise tarball in a private location, for example by staging it in your bucket as shown in the sketch below, and pass in its location using the metadata key alluxio_download_path, e.g. alluxio_download_path=gs://<my-bucket>/alluxio-enterprise-2.1.2-1.0-all.tar.gz.
- The root UFS URI is a required argument, passed using the metadata key alluxio_root_ufs_uri.
- Additional properties can be specified using the metadata key alluxio_site_properties, delimited using ;
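Before running the cluster create command below, the license string and the tarball location can be prepared as in the following sketch; ALLUXIO_LICENSE_BASE64 is an illustrative variable name, and license.json and <my-bucket> are placeholders for your license file and private bucket.
# base64 encode the license into an illustrative shell variable
$ ALLUXIO_LICENSE_BASE64=$(cat license.json | base64 | tr -d "\n")
# stage the Alluxio Enterprise tarball in your private bucket
$ gsutil cp alluxio-enterprise-2.1.2-1.0-all.tar.gz gs://<my-bucket>/
The value of ALLUXIO_LICENSE_BASE64 can then be substituted for <license_string> in the command below.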
$ gcloud dataproc clusters create <cluster_name> \
  --initialization-actions gs://alluxio-public/enterprise-dataproc/2.1.2-1.0/alluxio-dataproc.sh \
  --metadata alluxio_root_ufs_uri=<gs://my_bucket>,alluxio_site_properties="alluxio.master.mount.table.root.option.fs.gcs.accessKeyId=<gcs_access_key_id>;alluxio.master.mount.table.root.option.fs.gcs.secretAccessKey=<gcs_secret_access_key>",alluxio_license_base64=<license_string>,alluxio_download_path=gs://<my-bucket>/alluxio-enterprise-2.1.2-1.0-all.tar.gz
- Additional files can be downloaded into /opt/alluxio/conf using the metadata key alluxio_download_files_list by specifying http(s) or gs uris delimited using ;
$ gcloud dataproc clusters create <cluster_name> \
  --initialization-actions gs://alluxio-public/enterprise-dataproc/2.1.2-1.0/alluxio-dataproc.sh \
  --metadata alluxio_root_ufs_uri=<under_storage_address>,alluxio_download_files_list="gs://$my_bucket/$my_file;https://$server/$file"
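For example, a custom core-site.xml (a hypothetical extra configuration file) could be staged in your bucket before cluster creation and then referenced in alluxio_download_files_list:
# stage the file; the bucket and file name are placeholders
$ gsutil cp ./core-site.xml gs://<my-bucket>/
The file is placed under /opt/alluxio/conf when the initialization action runs.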
Next steps
The status of the cluster deployment can be monitored using the CLI.
$ gcloud dataproc clusters list
Identify the name of the cluster's master instance and SSH into it to test the deployment.
$ gcloud compute ssh <cluster_name>-m
Test that Alluxio is running as expected:
$ alluxio runTests
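The deployment can also be inspected with the standard Alluxio CLI, for example:
# show a cluster summary (masters, workers, capacity)
$ alluxio fsadmin report
# list the root of the Alluxio namespace
$ alluxio fs ls /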
Alluxio is installed in /opt/alluxio/ by default.
Spark, Hive and Presto are already configured to connect to Alluxio.
Note: The default Alluxio Worker memory is set to 1/3 of the physical memory on the instance. If a specific value is desired, set alluxio.worker.memory.size in the provided alluxio-site.properties or through the alluxio_site_properties metadata key, as shown below.
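For example, a fixed worker memory size can be passed at cluster creation through the alluxio_site_properties metadata key; the 16GB value below is only illustrative:
$ gcloud dataproc clusters create <cluster_name> \
  --initialization-actions gs://alluxio-public/enterprise-dataproc/2.1.2-1.0/alluxio-dataproc.sh \
  --metadata alluxio_root_ufs_uri=<under_storage_address>,alluxio_site_properties="alluxio.worker.memory.size=16GB"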
Spark on Alluxio in Dataproc
The Alluxio initialization script configures Spark for Alluxio.
To run a Spark application accessing data from Alluxio, simply refer to the path as alluxio:///<path_to_file>.
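As a minimal sketch, assuming a text file already exists at /path/to/file in Alluxio, it can be read from a spark-shell session on the master node:
$ spark-shell
scala> sc.textFile("alluxio:///path/to/file").count()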
Follow the steps in our Alluxio on Spark documentation to get started.
Presto on Alluxio in Dataproc
The Alluxio initialization script configures Presto for Alluxio. If installing the optional Presto component, Presto must be installed before Alluxio. Initialization actions are executed sequentially, so when Presto is installed through its own initialization action, the Presto action must precede the Alluxio action.
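For example, when Presto is installed through the public Dataproc Presto initialization action (the gs://dataproc-initialization-actions/presto/presto.sh path below is an assumption and may vary by region and Dataproc version), the two actions would be ordered as follows:
$ gcloud dataproc clusters create <cluster_name> \
  --initialization-actions gs://dataproc-initialization-actions/presto/presto.sh,gs://alluxio-public/enterprise-dataproc/2.1.2-1.0/alluxio-dataproc.sh \
  --metadata alluxio_root_ufs_uri=<under_storage_address>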