GCP Deployment Guide using Terraform

This guide describes the Terraform modules used to deploy Alluxio on GCP as a hybrid cloud compute cluster, connecting to an HDFS cluster in a remote network.

Overview

The hybrid cloud tutorial uses several modules to set up the GCP resources necessary to create two VPCs connected via VPC peering, an HDFS cluster in one of the VPCs, and an Alluxio cluster in the other VPC that mounts the HDFS cluster into its filesystem. To customize the Alluxio cluster to connect to an existing HDFS cluster, users are expected to build their own Terraform configuration, invoking the modules used in the tutorial. Several tools are provided within the Alluxio cluster to verify that the cluster in GCP can successfully connect to a remote HDFS cluster.

Prerequisites

To follow this guide, you will need Terraform installed and GCP credentials configured, either by running in a Cloud Shell or by authenticating with gcloud auth application-default login.

Terraform configuration layout

Using the tutorial’s configuration layout as an example, the layout of the hybrid cloud setup consists of the following components:

  • A VPC and subnet, to define the network in which cloud resources will be created. It is strongly recommended to create a new VPC as opposed to using the default VPC. The vpc_with_internet module creates a VPC, a subnet within the VPC, a firewall to allow all traffic ingress from the created subnet, and another firewall to allow external SSH access.
  • VPC peering connection, to enable the corporate network in which the HDFS and Hive clusters reside to communicate with the cloud VPC in which the Alluxio cluster will reside and vice versa. The vpc_peering module creates the resources needed for the peering connection.
  • The Alluxio Dataproc cluster, consisting of the cluster of instances hosting the Alluxio service. The alluxio_cloud_cluster module encompasses the various resources needed to set up Alluxio in Dataproc. A firewall is added to open the Alluxio and Presto web ports.
  • A GCS bucket to serve as an intermediary for initialization and configuration files to be downloaded to or copied between instances.

The following diagram outlines the basic relationships and resources created by each module:

[Diagram: alluxio_terraform_gcp_modules]

Connectivity tools

While developing the Terraform configuration, we strongly recommend testing connectivity with a small Alluxio cluster before launching a full-sized cluster. These tools can be found in the Manager tab of the Alluxio master web UI (http://MASTER_PUBLIC_DNS:19999).

The tools provided are:


Alluxio modules

Terraform modules are generally sourced from the public Terraform Registry, but they can also be sourced from a local path; the local-path approach is the one used by the GCP hybrid cloud tutorial.
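
For illustration, both forms of source are sketched below; the registry module address is a generic placeholder, not a module this tutorial uses.

module "registry_example" {
  // Sourced from the public Terraform Registry (hypothetical module address)
  source  = "example-org/network/google"
  version = "~> 1.0"
}

module "local_example" {
  // Sourced from a local path, as the tutorial's Alluxio modules are
  source = "./alluxio/vpc_with_internet/gcp"
}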

Getting started

Create an empty directory to host your Terraform configuration files; henceforth this path will be referred to as /path/to/terraform/root.
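
For example:

$ mkdir -p /path/to/terraform/root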

Copy the modules in the alluxio directory and the scripts in the dataproc directory from the extracted tutorial tarball:

$ wget https://alluxio-public.storage.googleapis.com/enterprise-terraform/stable/gcp_hybrid_apache_simple.tar.gz
$ tar -zxf gcp_hybrid_apache_simple.tar.gz
$ cd gcp_hybrid_apache_simple
$ cp -r alluxio/ /path/to/terraform/root
$ cp -r dataproc/ /path/to/terraform/root
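
After copying, the Terraform root should contain the modules and scripts referenced later in this guide; a sketch of the expected layout:

/path/to/terraform/root
├── alluxio
│   ├── alluxio_cloud_cluster/gcp
│   ├── vpc_peering/gcp
│   └── vpc_with_internet/gcp
└── dataproc
    ├── alluxio-dataproc.sh
    └── presto-dataproc.sh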

Create a main.tf file and declare a GCP provider. The google-beta provider is used instead of the google provider because some of the required features only exist in google-beta. Once those features are merged back into the google provider in future releases, the google provider can be used instead.

Set its region and zone to the desired GCP region and zone. By default, all resources will be created using this provider in its defined region. In order to create resources in distinct regions, declare another provider and set an alias for both providers.

provider "google-beta" {
  alias       = "google_compute"
  credentials = file("account.json")
  project     = "my-project-id"
  region      = "us-east1"
  zone        = "us-east1-d"
  version     = "~> 3.21"
}
  • The project field can be removed if launching in a Cloud Shell.
  • The credentials field can be removed if launching in a Cloud Shell or authenticating with gcloud auth application-default login.

Similarly, each resource must also declare which provider it is associated with.

resource "google_storage_bucket" "shared_gs_bucket" {
  provider      = google-beta.google_compute
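  // name and other required arguments are omitted for brevity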
}

Also in the main file, you can declare a module by setting its source to the relative path of the desired module. For example, to create a VPC using the aforementioned provider, add:

module "vpc_compute" {
  source = "./alluxio/vpc_with_internet/gcp"
  providers = {
    google-beta = google-beta.google_compute
  }
}

Below is a template main.tf with two GCP providers and a minimal layout of Alluxio modules, denoting the relationships between modules.

provider "google-beta" {
  alias   = "google_compute"
  region  = "us-east1"
  zone    = "us-east1-d"
  version = "~> 3.21"
}

provider "google-beta" {
  alias   = "google_on_prem"
  region  = "us-west1"
  zone    = "us-west1-a"
  version = "~> 3.21"
}

resource "google_storage_bucket" "shared_gs_bucket" {
  provider      = google-beta.google_compute
  name          = "my-bucket-name"
  force_destroy = true
}

// Mocking for the VPC network of on-prem HDFS/Hive cluster
module "vpc_on_prem" {
  source = "./alluxio/vpc_with_internet/gcp"
  providers = {
    google-beta = google-beta.google_on_prem
  }
}

// VPC for the compute cluster
module "vpc_compute" {
  source = "./alluxio/vpc_with_internet/gcp"
  providers = {
    google-beta = google-beta.google_compute
  }
}

resource "google_storage_bucket_object" "compute_alluxio_bootstrap" {
  provider = google-beta.google_compute
  bucket   = google_storage_bucket.shared_gs_bucket.name
  name     = "alluxio-dataproc.sh"
  source = "dataproc/alluxio-dataproc.sh"
}

resource "google_storage_bucket_object" "compute_presto_bootstrap" {
  provider = google-beta.google_compute
  bucket   = google_storage_bucket.shared_gs_bucket.name
  name     = "presto-dataproc.sh"
  source   = "dataproc/presto-dataproc.sh"
}

module "dataproc_compute" {
  source = "./alluxio/alluxio_cloud_cluster/gcp"
  providers = {
    google-beta = google-beta.google_compute
  }

  vpc_self_link    = module.vpc_compute.vpc_self_link
  subnet_self_link = module.vpc_compute.subnet_self_link
  staging_bucket   = google_storage_bucket.shared_gs_bucket.name

  alluxio_bootstrap_gs_uri = "gs://${google_storage_bucket.shared_gs_bucket.name}/${google_storage_bucket_object.compute_alluxio_bootstrap.name}"
  presto_bootstrap_gs_uri  = "gs://${google_storage_bucket.shared_gs_bucket.name}/${google_storage_bucket_object.compute_presto_bootstrap.name}"
  
  // Values from the on-prem HDFS/Hive cluster need to be filled in
  // so that Alluxio and Presto inside this compute cluster can connect to it
  on_prem_hdfs_address = "${on_prem_hdfs_address}"
  on_prem_hms_address  = "${on_prem_hive_metastore_address}"
}

// Mocking for the network connection between on-prem cluster and compute cluster
module "vpc_peering" {
  source = "./alluxio/vpc_peering/gcp"
  providers = {
    google-beta.on_prem       = google-beta.google_on_prem
    google-beta.cloud_compute = google-beta.google_compute
  }

  use_default_name = var.use_default_name
  custom_name      = var.custom_name

  on_prem_vpc_self_link          = module.vpc_on_prem.vpc_self_link
  on_prem_subnet_self_link       = module.vpc_on_prem.subnet_self_link
  cloud_compute_vpc_self_link    = module.vpc_compute.vpc_self_link
  cloud_compute_subnet_self_link = module.vpc_compute.subnet_self_link
}

A more concrete example can be found in gcp_hybrid_apache_simple/main.tf, which creates a Dataproc cluster mocking the HDFS and Hive cluster in high-availability mode, a Dataproc cluster running Alluxio and Presto that connects to the remote HDFS and Hive, and a VPC peering connection between the networks of the two Dataproc clusters.
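
With the configuration in place, the standard Terraform workflow applies:

$ cd /path/to/terraform/root
$ terraform init
$ terraform plan
$ terraform apply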

The following sections describe the various input variables and outputs of each module.

Module common variables

Each module published by Alluxio has the following Terraform variables in common (a usage sketch follows the list):

  • enabled
    • type: bool
    • default: true
    • description: If set to false, the module will not create any resources, effectively disabling the module. This is useful when a module should only be invoked conditionally.
  • depends_on_variable
    • type: any
    • default: null
    • description: This placeholder variable can be used to define an explicit dependency on another module or resource.
  • use_default_name
    • type: bool
    • default: true
    • description: Each resource has a default readable name. If set to false, a random string will be prefixed to the default name to avoid name collisions when the same Terraform configuration is invoked concurrently. When developing and testing, it is recommended to set this to true, but on a production cluster, this should be set to false.
  • custom_name
    • type: string
    • default: ""
    • description: Each resource created will be prefixed with the provided custom name. This is helpful to identify the resources created by Terraform.
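
The sketch below shows these common variables in a hypothetical invocation of the vpc_with_internet module; the variable values and the dependency target are illustrative placeholders, not part of the tutorial.

module "vpc_example" {
  source = "./alluxio/vpc_with_internet/gcp"
  providers = {
    google-beta = google-beta.google_compute
  }

  // create no resources when the (assumed) condition variable is false
  enabled = var.create_vpc

  // wait for the shared bucket before creating this module's resources
  depends_on_variable = google_storage_bucket.shared_gs_bucket.id

  // prefix a random string to default names to avoid collisions
  use_default_name = false

  // prefix resource names with a recognizable label
  custom_name = "my-test"
}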

Specific module input variables and outputs


Connecting to an existing HDFS cluster

After the Terraform configuration is ready, launch a small cluster with 1 worker. The Alluxio web UI URL should be accessible at http://MASTER_PUBLIC_IP:19999; it is also provided as an output of the alluxio_cloud_cluster module. Access the connectivity tools under the Manager tab.
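
For example, assuming the module's outputs are re-exported from your root configuration (the output name below is hypothetical; check the module for the exact key):

$ terraform output
alluxio_web_ui = "http://MASTER_PUBLIC_IP:19999"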