Burst Compute to Google Cloud Dataproc
This page details how to leverage a public cloud, such as Google Cloud Platform (GCP), to scale analytic workloads directly on data residing on-premises without manually copying and synchronizing the data into the cloud. We will show an example of running on-demand Presto and Hive with Alluxio in GCP, connecting to an on-premises HDFS cluster. The same environment can also be used to run Spark or other Hadoop services with Alluxio.
Objectives
- Set up two geographically dispersed GCP Dataproc clusters connected via Alluxio.
- Execute performance benchmarks.
In this tutorial, you will create two Dataproc clusters: one acting as a proxy for an on-premises Hadoop deployment, and the other in a different region running Presto and Alluxio. Compute workloads are bursted to the latter, with data served from the on-premises proxy via Alluxio.
Outline
- Create Dataproc clusters.
  - Proxy for on-premises Hadoop with Hive and HDFS installed, named on-prem-cluster.
  - Zero-Copy Burst cluster with Presto and Alluxio, named cloud-compute-cluster.
- Prepare data. This tutorial uses TPC-DS data, generated using this repository.
- Execute queries on the Presto CLI.
- Update data on-premises and see how changes are seamlessly reflected in the public cloud via Alluxio.
Costs
This tutorial uses billable components of Google Cloud, including Dataproc and Compute Engine.
Use the Pricing Calculator to generate a cost estimate based on your usage.
Before you begin
Complete the following prerequisites before starting this tutorial.
- Identify the project to use. Note: The project name must not contain a colon “:”.
- Enable the following APIs for your project: Cloud Shell API, Compute Engine API, Dataproc API and Cloud DNS API.
- Plan to run terraform commands with Cloud Shell.
- Check quotas for each region and any applicable global
quotas for compute resources.
By default, this tutorial uses:
- 1 Dataproc on-prem-cluster in us-west1
  - 3 masters * n1-highmem-16 (16 vCPU & 104GB MEM)
  - 5 workers * n1-highmem-16 (16 vCPU & 104GB MEM & 375GB local SSD)
- 1 Dataproc cloud-compute-cluster in us-east1
  - 1 master * n1-highmem-16 (16 vCPU & 104GB MEM)
  - 5 workers * n1-highmem-16 (16 vCPU & 104GB MEM & 375GB local SSD)
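These defaults need enough Compute Engine quota (vCPUs and local SSD) in both regions. As a sketch, assuming your Cloud Shell session is already set to the chosen project, you can enable the required APIs and review each region's quotas as follows:
# enable the APIs listed in the prerequisites
$ gcloud services enable cloudshell.googleapis.com compute.googleapis.com dataproc.googleapis.com dns.googleapis.com
# inspect the quotas section of the output for each region used by the clusters
$ gcloud compute regions describe us-west1
$ gcloud compute regions describe us-east1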
Launching Clusters using Terraform
For this tutorial, you will use Terraform to execute the following:
- Network. Create two VPCs in different regions, and enable VPC peering and DNS peering to connect them.
- Dataproc. Spin up two Dataproc clusters.
- Prepare Data. Mount the TPC-DS dataset in the on-premises proxy cluster to be accessed by the compute cluster cloud-compute-cluster.
Create clusters by running the commands shown with Cloud Shell:
- Download the terraform file.
$ wget https://alluxio-public.storage.googleapis.com/enterprise-terraform/stable/gcp_hybrid_apache_simple.tar.gz
$ tar -zxf gcp_hybrid_apache_simple.tar.gz
$ cd gcp_hybrid_apache_simple
- Initialize the Terraform working directory to download the necessary plugins to execute.
You only need to run this once for the working directory.
$ terraform init
- Create the networking resources and launch the two Dataproc clusters.
Cloud Shell preconfigures your terminal with the Google Cloud Platform project and credentials,
so no additional variables are needed. After running through the whole GCP tutorial, you can experiment with the
variables defined in variables.tf and relaunch the clusters with a new configuration.
$ terraform apply
- Type yes to confirm resource creation.
- Keep the terminal open to destroy the resources once you are done with the tutorial.
Note: Some resources, like the VPC, require names that are unique within the project. If you run into a name collision with the default Terraform setup, you can add a prefix to the resource names by updating the variable custom_name in variables.tf.
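For example, instead of editing the file, the prefix can be overridden on the command line; this sketch assumes custom_name is a plain string variable in variables.tf, and the value shown is arbitrary:
# prefix resource names with "burst-demo" to avoid collisions
$ terraform apply -var="custom_name=burst-demo"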
What’s Next
Now that Terraform has finished launching the clusters, you will copy a dataset into the on-premises Hadoop cluster and run queries from the Presto compute cluster.
Here’s a summary of what’s to come:
- Load the TPC-DS dataset into HDFS and create table definitions in Hive on the on-premises proxy cluster, named on-prem-cluster.
- Run queries from the zero-copy burst cluster, named cloud-compute-cluster, to measure execution performance with Alluxio.
  - Running a query for the first time will access HDFS in the on-prem-cluster cluster.
  - Running the same query a second time will read data from Alluxio, since the data was cached by the first run, resulting in a faster query.
Changes to the cluster on-premises, such as updating tables in Hive, will seamlessly propagate to Alluxio without any user intervention. After executing the benchmarks, you will also learn about Alluxio’s metadata synchronization capabilities.
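As a preview of that capability, Alluxio can be told to re-sync metadata for a path on demand by lowering the sync interval for a single command. This is a general Alluxio sketch rather than a step in this tutorial, and the path shown is hypothetical:
# a sync interval of 0 forces a metadata sync with the under store on this listing (hypothetical path)
$ alluxio fs ls -Dalluxio.user.file.metadata.sync.interval=0 /tmp/tpcds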
Prepare data
SSH into one of the master instances of the Hadoop cluster on-prem-cluster
to prepare TPC-DS data for query execution.
Execute the following in a new Cloud Shell terminal.
You may need to generate an SSH key pair.
$ gcloud compute ssh --zone "us-west1-a" "on-prem-cluster-m-0"
Once you’ve logged into the cluster, copy data from Alluxio’s public GCS bucket.
on-prem-cluster-m-0$ location="gs://alluxio-public-west/spark/unpart_sf100_10k"
on-prem-cluster-m-0$ hdfs dfs -mkdir /tmp/tpcds/
on-prem-cluster-m-0$ hadoop distcp ${location}/store_sales/ hdfs:///tmp/tpcds/
on-prem-cluster-m-0$ hadoop distcp ${location}/item/ hdfs:///tmp/tpcds/
This step takes about 4 minutes with the default configuration.
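Optionally, verify the copy before continuing; this is just a sanity check, not a required step:
# list the sizes of the copied TPC-DS directories in HDFS
on-prem-cluster-m-0$ hdfs dfs -du -h /tmp/tpcds/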
Once data is copied into HDFS, create the table metadata in Hive.
Dataproc sets the HDFS nameservice to the Dataproc cluster name when running in high availability mode.
Replace hdfs_host_info with the actual nameservice so that the tables point to the HDFS cluster inside on-prem-cluster.
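If you want to confirm the nameservice value before running the substitution, you can read it from the HDFS client configuration; under the default setup the output should match the cluster name:
on-prem-cluster-m-0$ hdfs getconf -confKey dfs.nameservices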
on-prem-cluster-m-0$ wget https://alluxio-public.storage.googleapis.com/hybrid-apache-simple/stable/create-table
on-prem-cluster-m-0$ sed -i "s/hdfs_host_info/on-prem-cluster/g" create-table
on-prem-cluster-m-0$ hive -f create-table
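To confirm that the table definitions were created, you can list them in Hive; the exact table names come from the downloaded create-table script:
on-prem-cluster-m-0$ hive -e "SHOW TABLES;"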
Keep the SSH session open as you will re-use the same session later.
Run Applications
You may choose to run different applications on Google Dataproc to leverage Alluxio. To continue benchmarking with the TPC-DS data, Presto has been pre-configured in your setup.
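For example, a query from the burst cluster might look like the following sketch; the zone, master host name, catalog, schema, and table name are assumptions based on a typical Dataproc Presto setup, not values taken from this tutorial's configuration:
$ gcloud compute ssh --zone "us-east1-b" "cloud-compute-cluster-m"
cloud-compute-cluster-m$ presto --catalog hive --schema default --execute "SELECT COUNT(*) FROM store_sales"
Running the same query twice lets you compare a cold read served from the on-premises HDFS with a warm read served from the Alluxio cache.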
Cleaning up
When done, in the same Cloud Shell terminal used to create the resources, run terraform destroy to tear down the
previously created GCP resources.
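From the same gcp_hybrid_apache_simple working directory:
$ terraform destroy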
Type yes
to approve.
Congratulations, you’re done!
If you want to explore more, take a look at customizable data policies in Alluxio to migrate your on-premises data to Cloud Storage.
Define policies based on date created, updated, or accessed and move or migrate data across storage
systems, such as HDFS and GCS, online without any interruption to your workloads.
An example policy would look like: move the HDFS path hdfs://on_prem_hdfs/user_a/mydir to the GCS path gs://mybucket/mydir once the create time in HDFS is older than two days.