Burst Compute to Google Cloud Dataproc


This page details how to leverage a public cloud, such as Google Cloud Platform (GCP), to scale analytic workloads directly on data residing on-premises without manually copying and synchronizing the data into the cloud. We will show an example of running on-demand Presto and Hive with Alluxio in GCP, connected to an on-premises HDFS cluster. The same environment can also be used to run Spark or other Hadoop services with Alluxio.

At a high level, this tutorial shows you how to:

  • Set up two geographically dispersed GCP Dataproc clusters connected via Alluxio.
  • Execute performance benchmarks.

Objectives

In this tutorial, you will create two Dataproc clusters: one proxying as an on-premises Hadoop deployment, and the other in a different region running Presto and Alluxio. Compute workloads will be burst to the latter, with data served from the on-premise proxy via Alluxio.

Optionally, for benchmarking purposes, deploy a third cluster running Presto without Alluxio. Both Presto clusters are pre-configured to use Hive metadata from the on-premises cluster, so no table definitions need to be duplicated.
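
While the exact configuration is provisioned automatically by the Terraform setup, conceptually the Hive catalog on each Presto cluster simply points at the metastore running on the on-premises proxy. A minimal sketch of such a catalog file, using standard Presto Hive connector properties with an illustrative hostname (not necessarily the values this tutorial uses), might look like:

# Illustrative catalog sketch only; the tutorial provisions the real configuration
connector.name=hive-hadoop2
hive.metastore.uri=thrift://on-prem-a-m:9083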

Outline

  • Create Dataproc clusters.
    • Proxy for on-premise Hadoop with Hive and HDFS installed, named on-prem-a.
    • Zero-Copy Burst cluster with Presto and Alluxio, named cloud-compute-b.
    • (Optional) Burst cluster with Presto without Alluxio, named cloud-compute-c.
  • Prepare data. This tutorial uses TPC-DS data, generated using this repository.
  • Execute queries on the Presto CLI.
  • Update data on-premises and see how changes are seamlessly reflected in the public cloud via Alluxio.

Figure: GCP tutorial architecture

Costs

This tutorial uses billable components of Google Cloud, including Dataproc, Compute Engine, and Cloud DNS.

Use the Pricing Calculator to generate a cost estimate based on your usage.

Before you begin

Complete the following prerequisites before starting this tutorial.

  • Identify the project to use. Note: The project name must not contain a colon “:”.
  • Enable the following APIs for your project: Cloud Shell API, Compute Engine API, Dataproc API, and Cloud DNS API (example commands follow this list).
  • Plan to run terraform commands with Cloud Shell.
  • Check quotas for each region and any applicable global quotas for compute resources. By default, this tutorial uses:
    • 1 Dataproc cluster with a total of 6 instances and 96 cores in us-west1.
    • Another 2 Dataproc clusters with a total of 12 instances and 192 cores in us-east1.
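
For example, one way to enable these APIs and inspect a region's compute quotas from Cloud Shell (the service identifiers below are the standard ones for the APIs listed above; adjust the region as needed):

# Enable the required APIs for the current project
$ gcloud services enable cloudshell.googleapis.com compute.googleapis.com dataproc.googleapis.com dns.googleapis.com
# List per-region quota usage and limits (for example, CPUs in us-east1)
$ gcloud compute regions describe us-east1 --flatten="quotas[]" --format="table(quotas.metric, quotas.usage, quotas.limit)"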

Launching Clusters using Terraform

For this tutorial, you will use Terraform to execute the following:

  • Network. Create two VPCs in different regions, and enable Cloud VPC and DNS peering to connect them.
  • Dataproc. Spin up two Dataproc clusters and optionally a third for benchmarking.
  • Prepare Data. Mount the TPC-DS dataset from the on-premise proxy cluster into Alluxio so it can be accessed by the compute cluster cloud-compute-b.

Create the clusters by running the following commands in Cloud Shell:

  1. Download the Terraform file.
    $ mkdir alluxio-dataproc && cd alluxio-dataproc
    $ wget https://alluxio-public.storage.googleapis.com/gcp-burst-tutorial/stable/main.tf
    
  2. Define the GCP project to create resources in. Create a file named terraform.tfvars:
    $ echo project_name = \"$(gcloud config get-value core/project)\" > terraform.tfvars
    
  3. (Optional) If you don’t wish to benchmark, disable deploying the third cluster by setting the variable cluster_count_c to 0. The default value of 1 deploys this cluster.
    $ echo 'cluster_count_c = "0"' >> terraform.tfvars
    
  4. Initialize the Terraform working directory to download the necessary provider plugins. You only need to run this once per working directory.
    $ terraform init
    
  5. Create the networking resources and launch the two Dataproc clusters.
    $ terraform apply
    
    • Type yes to confirm resource creation.
    • Keep this terminal open; you will use it to destroy the resources once you are done with the tutorial.

This final command will take about 5 minutes to provision and launch the necessary GCP resources.
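
Optionally, once the apply completes, you can confirm what was created with standard gcloud commands (the regions match the defaults listed in the prerequisites):

# List the Dataproc clusters in each region and the VPC networks created by Terraform
$ gcloud dataproc clusters list --region=us-west1
$ gcloud dataproc clusters list --region=us-east1
$ gcloud compute networks list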

You can spin up or tear down the optional third cluster at any time by updating the value of cluster_count_c and running terraform apply again, as shown below. The other two clusters will not be affected by re-running the apply command. If the third cluster is not launched, skip any subsequent instructions referring to cloud-compute-c.
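
For example, if you disabled the third cluster in step 3, one way to bring it up later (this assumes the cluster_count_c line already exists in your terraform.tfvars):

# Flip the cluster count back to 1 and re-apply; the other clusters are untouched
$ sed -i 's/cluster_count_c = "0"/cluster_count_c = "1"/' terraform.tfvars
$ terraform apply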

What’s Next

Now that Terraform is done launching the clusters, you will copy a dataset into the on-premise Hadoop cluster and run queries from the Presto compute clusters in parallel to compare performance.

Here’s a summary of what’s to come:

  1. Load the TPC-DS dataset into HDFS and create table definitions in Hive on the on-premise proxy cluster, named on-prem-a.
  2. Prepare to run queries on the two Presto clusters in parallel (a timing tip follows this list).
  3. Run queries from the zero-copy burst cluster, named cloud-compute-b, to measure execution performance with Alluxio.
    • Running the query for the first time will access HDFS in the on-prem-a cluster.
    • Running the query a second time will read data from Alluxio, since it was cached during the first run, resulting in a faster query.
  4. Run the identical query from the burst cluster without Alluxio, named cloud-compute-c, to measure execution performance as a baseline for comparison.
    • Execution times for this cluster should be consistent across multiple runs, and similar to the first query run on cloud-compute-b with Alluxio.
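
The Presto CLI shows query progress when run interactively; for a simple wall-clock comparison between clusters, you can also wrap a non-interactive run with the shell's time builtin, for example (the catalog and schema names here are assumptions about this setup):

cloud-compute-b-m$ time presto --catalog hive --schema default --execute "SELECT COUNT(*) FROM store_sales"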

After executing the benchmarks, you will also see Alluxio's metadata synchronization capabilities in action: changes to the on-premises cluster, such as updating tables in Hive, seamlessly propagate to Alluxio without any user intervention.
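
As a hedged illustration (the file name below is a placeholder, the catalog and schema names are assumptions, and sync timing depends on the Alluxio configuration provisioned by Terraform): once the item table has been created later in this tutorial, duplicating one of its data files in HDFS on the on-premises cluster should show up in query results from the burst cluster without any manual refresh.

# On the on-premises proxy: duplicate an existing data file (replace the placeholder with a real file name)
on-prem-a-m$ hdfs dfs -cp /tmp/tpcds/item/<existing-data-file> /tmp/tpcds/item/extra_copy
# From the burst cluster: the row count grows once Alluxio syncs the new file's metadata
cloud-compute-b-m$ presto --catalog hive --schema default --execute "SELECT COUNT(*) FROM item"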

Prepare data

SSH into the Hadoop cluster on-prem-a to prepare TPC-DS data for query execution. Execute the following in a Cloud Shell terminal.

If you do not already have one, gcloud will prompt you to generate an SSH key pair.

$ gcloud compute ssh --zone "us-west1-a" "on-prem-a-m"

Once you’ve logged into the cluster, copy data from Alluxio’s public GCS bucket.

on-prem-a-m$ location="gs://alluxio-public-west/spark/unpart_sf100_10k"
on-prem-a-m$ hdfs dfs -mkdir /tmp/tpcds/
on-prem-a-m$ hadoop distcp ${location}/store_sales/ hdfs:///tmp/tpcds/
on-prem-a-m$ hadoop distcp ${location}/item/ hdfs:///tmp/tpcds/

This step takes about 4 minutes with the default configuration.
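
Optionally, sanity-check the copy with a standard HDFS command before moving on:

on-prem-a-m$ hdfs dfs -du -h /tmp/tpcds/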

Once the data is copied into HDFS, create the table metadata in Hive. The downloaded create-table script contains the placeholders master, zone, and project_name; the commands below substitute them with your cluster's actual values before running the script.

on-prem-a-m$ wget https://alluxio-public.storage.googleapis.com/gcp-burst-tutorial/stable/create-table
on-prem-a-m$ master=on-prem-a-m && zone=us-west1-a && project_name=$(gcloud config get-value core/project)
on-prem-a-m$ for f in master zone project_name; do sed -i "s/${f}/${!f}/g" create-table; done
on-prem-a-m$ hive -f create-table
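
Optionally, confirm that the table definitions were created (this assumes the create-table script uses the default Hive database; adjust if it creates its own):

on-prem-a-m$ hive -e "SHOW TABLES;"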

Keep the SSH session open as you will re-use the same session later.

Run Applications

You may choose to run different applications on Google Dataproc to leverage Alluxio. To continue benchmarking with the TPC-DS data, Presto is pre-configured in this setup.
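
For example, assuming the Presto CLI is available as presto on each compute cluster's master and the catalog is named hive (both assumptions about this setup; the column names below follow the standard TPC-DS schema), a query against the tables created earlier might look like:

cloud-compute-b-m$ presto --catalog hive --schema default --execute "
  SELECT i_brand, SUM(ss_ext_sales_price) AS revenue
  FROM store_sales JOIN item ON ss_item_sk = i_item_sk
  GROUP BY i_brand
  ORDER BY revenue DESC
  LIMIT 10"

Run the same query twice on cloud-compute-b-m to observe the effect of Alluxio caching, and once on cloud-compute-c-m (if deployed) as the baseline for comparison.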


Cleaning up

When you are done, in the same Cloud Shell terminal used to create the resources, run terraform destroy to tear down the previously created GCP resources. Type yes to approve.

Congratulations, you’re done!

If you want to explore more, take a look at customizable data policies in Alluxio to migrate your data from the on-premises environment to Cloud Storage. Define policies based on the date data was created, updated, or accessed, and move or migrate data across storage systems, such as HDFS and GCS, online without any interruption to your workloads. An example policy: move the HDFS path hdfs://on_prem_hdfs/user_a/mydir to the GCS path gs://mybucket/mydir once its create time in HDFS is older than two days.