GCP Deployment Guide using Terraform
This guide describes the Terraform modules used to deploy Alluxio on GCP as a hybrid cloud compute cluster, connecting to an HDFS cluster in a remote network.
- Overview
- Prerequisites
- Terraform configuration layout
- Connectivity tools
- Alluxio modules
- Connecting to an existing HDFS cluster
- Firewall Configuration between On-premise and Compute Clusters
Overview
The hybrid cloud tutorial uses several modules to set up the GCP resources necessary to create two VPCs connected via VPC peering, an HDFS cluster in one of the VPCs, and an Alluxio cluster in the other VPC that mounts the HDFS cluster into its filesystem. To customize the Alluxio cluster to connect to an existing HDFS cluster, users are expected to build their own Terraform configuration, invoking the modules used in the tutorial. Several tools are provided within the Alluxio cluster to verify that the cluster in GCP is able to successfully connect to a remote HDFS cluster.
Prerequisites
- Terraform (0.12 or higher)
- Familiarity with the GCP hybrid cloud tutorial
Terraform configuration layout
Using the tutorial’s configuration layout as an example, the layout of the hybrid cloud setup consists of the following components:
- A VPC and subnet, to define the network in which cloud resources will be created.
It is strongly recommended to create a new VPC as opposed to using the default VPC.
The vpc_with_internet module creates a VPC, a subnet within the VPC, a firewall to allow all traffic ingress from the created subnet, and another firewall to allow external ssh access.
- A VPC peering connection, to enable the corporate network in which the HDFS and Hive clusters reside to communicate with the cloud VPC in which the Alluxio cluster will reside, and vice versa.
The vpc_peering module creates the resources needed for the peering connection.
- The Alluxio Dataproc cluster, consisting of the cluster of instances hosting the Alluxio service.
The alluxio_cloud_cluster module encompasses the various resources needed to set up Alluxio in Dataproc. A firewall is added to open the Alluxio and Presto web ports.
- A GCS bucket to serve as an intermediary for initialization and configuration files to be downloaded to or copied between instances.
The following diagram outlines the basic relationships and resources created by each module:
Connectivity tools
While developing the Terraform configuration, we strongly recommend testing the connectivity with a small Alluxio cluster before launching a full-sized cluster. These tools can be found in the Alluxio master web UI Manager tab (http://MASTER_PUBLIC_DNS:19999).
The tools provided are:
Alluxio modules
Terraform modules are generally sourced from the public Terraform Registry, but they can also be sourced from a local path. This is the approach used by the GCP hybrid cloud tutorial.
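For example, the two source styles look as follows. This is a minimal sketch focusing only on the source attribute; the registry module name and version are illustrative, and provider wiring plus any required inputs are omitted for brevity:
module "registry_example" {
  // Registry-sourced modules use the "<namespace>/<name>/<provider>" form
  // (example module and version shown; required inputs omitted)
  source  = "terraform-google-modules/network/google"
  version = "~> 3.0"
}

module "local_example" {
  // Local-path source, as used by the tutorial's modules
  source = "./alluxio/vpc_with_internet/gcp"
}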
Getting started
Create an empty directory to host your Terraform configuration files; henceforth this path will be referred to as /path/to/terraform/root.
Copy the modules in the alluxio directory and the scripts in the dataproc directory from the extracted tutorial tarball.
$ wget https://alluxio-public.storage.googleapis.com/enterprise-terraform/stable/gcp_hybrid_apache_simple.tar.gz
$ tar -zxf gcp_hybrid_apache_simple.tar.gz
$ cd gcp_hybrid_apache_simple
$ cp -r alluxio/ /path/to/terraform/root
$ cp -r dataproc/ /path/to/terraform/root
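After copying, the Terraform root directory should contain roughly the following layout (main.tf is created in the next step):
/path/to/terraform/root
├── main.tf       # Terraform configuration you will write below
├── alluxio/      # Terraform modules copied from the tutorial tarball
└── dataproc/     # Dataproc bootstrap scripts (alluxio-dataproc.sh, presto-dataproc.sh)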
Create a main.tf file and declare a GCP provider.
The google-beta provider is used instead of the google provider because some of the required features only exist in google-beta.
When those features are merged back into the google provider in future releases, the google provider will be used.
Set its region and zone to the desired GCP region and zone.
By default, all resources will be created using this provider in its defined region.
In order to create resources in distinct regions, declare another provider and set an alias for both providers.
provider "google-beta" {
alias = "google_compute"
credentials = file("account.json")
project = "my-project-id"
region = "us-east1"
zone = "us-east1-d"
version = "~> 3.21"
}
- The project field can be removed if launching in a Cloud Shell.
- The credentials field can be removed if launching in a Cloud Shell or authenticating with gcloud auth application-default login (see the example below).
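For example, when running in a Cloud Shell or after authenticating with gcloud auth application-default login, the same provider can be declared without those two fields:
provider "google-beta" {
  alias   = "google_compute"
  region  = "us-east1"
  zone    = "us-east1-d"
  version = "~> 3.21"
}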
Similarly, each resource must also declare which provider it is associated with.
resource "google_storage_bucket" "shared_gs_bucket" {
provider = google-beta.google_compute
}
Also in the main file, you can declare a module by setting its source to the relative path of the desired module.
For example, to create a VPC using the aforementioned provider, add:
module "vpc_compute" {
source = "./alluxio/vpc_with_internet/gcp"
providers = {
google-beta = google-beta.google_compute
}
}
Below is a template main.tf
with 2 GCP providers and a minimal layout of Alluxio modules,
denoting the relationship between modules.
provider "google-beta" {
alias = "google_compute"
region = "us-east1"
zone = "us-east1-d"
version = "~> 3.21"
}
provider "google-beta" {
alias = "google_on_prem"
region = "us-west1"
zone = "us-west1-a"
version = "~> 3.21"
}
resource "google_storage_bucket" "shared_gs_bucket" {
provider = google-beta.google_compute
name = "my-bucket-name"
force_destroy = true
}
// Mocking for the VPC network of on-prem HDFS/Hive cluster
module "vpc_on_prem" {
source = "./alluxio/vpc_with_internet/gcp"
providers = {
google-beta = google-beta.google_on_prem
}
}
// VPC for the compute cluster
module "vpc_compute" {
source = "./alluxio/vpc_with_internet/gcp"
providers = {
google-beta = google-beta.google_compute
}
}
resource "google_storage_bucket_object" "compute_alluxio_bootstrap" {
provider = google-beta.google_compute
bucket = google_storage_bucket.shared_gs_bucket.name
name = "alluxio-dataproc.sh"
source = "dataproc/alluxio-dataproc.sh"
}
resource "google_storage_bucket_object" "compute_presto_bootstrap" {
provider = google-beta.google_compute
bucket = google_storage_bucket.shared_gs_bucket.name
name = "presto-dataproc.sh"
source = "dataproc/presto-dataproc.sh"
}
module "dataproc_compute" {
source = "./alluxio/alluxio_cloud_cluster/gcp"
providers = {
google-beta = google-beta.google_compute
}
vpc_self_link = module.vpc_compute.vpc_self_link
subnet_self_link = module.vpc_compute.subnet_self_link
staging_bucket = google_storage_bucket.shared_gs_bucket.name
alluxio_bootstrap_gs_uri = "gs://${google_storage_bucket.shared_gs_bucket.name}/${google_storage_bucket_object.compute_alluxio_bootstrap.name}"
presto_bootstrap_gs_uri = "gs://${google_storage_bucket.shared_gs_bucket.name}/${google_storage_bucket_object.compute_presto_bootstrap.name}"
// Values from the on-prem HDFS/Hive cluster need to be filled in
// so that alluxio and presto inside this compute cluster can connect to the on-prem HDFS/Hive cluster
on_prem_hdfs_address = "${on_prem_hdfs_address}"
on_prem_hms_address = "${on_prem_hive_metastore_address}"
}
// Mocking for the network connection between on-prem cluster and compute cluster
module "vpc_peering" {
source = "./alluxio/vpc_peering/gcp"
providers = {
google-beta.on_prem = google-beta.google_on_prem
google-beta.cloud_compute = google-beta.google_compute
}
use_default_name = var.use_default_name
custom_name = var.custom_name
on_prem_vpc_self_link = module.vpc_on_prem.vpc_self_link
on_prem_subnet_self_link = module.vpc_on_prem.subnet_self_link
cloud_compute_vpc_self_link = module.vpc_compute.vpc_self_link
cloud_compute_subnet_self_link = module.vpc_compute.subnet_self_link
}
A more concrete example can be found in gcp_hybrid_apache_simple/main.tf, which creates
a Dataproc cluster mocking the HDFS and Hive cluster in high availability mode,
a Dataproc cluster running Alluxio and Presto connecting to the remote HDFS and Hive,
and a VPC peering connection between the networks of the two Dataproc clusters.
The following sections describe the various input variables and outputs of each module.
Module common variables
Each module published by Alluxio has the following Terraform variables in common (a usage sketch follows the list):
enabled
- type: bool
- default: true
- description: If set to false, the module will not create any resources, effectively disabling the module. This is useful when a module should only be invoked conditionally.
depends_on_variable
- type: any
- default: null
- description: This placeholder variable can be used to define an explicit dependency on another module or resource.
use_default_name
- type: bool
- default: true
- description: Each resource has a default readable name. If set to false, a random string will be prefixed to the default name to avoid name collisions when the same Terraform configuration is invoked concurrently. When developing and testing, it is recommended to set this to false, but on a production cluster, this should be set to true.
custom_name
- type: string
- default: ""
- description: Each resource created will be prefixed with the provided custom name. This is helpful for identifying the resources created by Terraform.
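These common variables are passed alongside a module's specific inputs. Below is a minimal sketch based on the vpc_peering invocation shown earlier; the values are illustrative and should be adapted to your configuration:
module "vpc_peering" {
  source = "./alluxio/vpc_peering/gcp"
  providers = {
    google-beta.on_prem       = google-beta.google_on_prem
    google-beta.cloud_compute = google-beta.google_compute
  }

  // Set to false to skip creating the peering resources entirely
  enabled = true
  // Thread an output from another module through to force creation ordering
  depends_on_variable = module.vpc_compute.vpc_self_link
  // Keep readable default names in production; set to false when launching
  // the same configuration concurrently during development and testing
  use_default_name = true
  // Prefix created resources with an identifying name (illustrative value)
  custom_name = "hybrid-tutorial"

  on_prem_vpc_self_link          = module.vpc_on_prem.vpc_self_link
  on_prem_subnet_self_link       = module.vpc_on_prem.subnet_self_link
  cloud_compute_vpc_self_link    = module.vpc_compute.vpc_self_link
  cloud_compute_subnet_self_link = module.vpc_compute.subnet_self_link
}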
Specific module input variables and outputs
Connecting to an existing HDFS cluster
After the Terraform configuration is ready, launch a small cluster with 1 worker.
The Alluxio web UI URL should be accessible at http://MASTER_PUBLIC_IP:19999; this is also provided as an output of the alluxio_cloud_cluster module.
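If desired, that module output can be re-exported from the root configuration. The output name referenced below (master_web_ui) is hypothetical; substitute the actual output name exposed by the alluxio_cloud_cluster module:
output "alluxio_master_web_ui" {
  // "master_web_ui" is a hypothetical output name used for illustration;
  // check the alluxio_cloud_cluster module's outputs for the real name
  value = module.dataproc_compute.master_web_ui
}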
Access the connectivity tools under the Manager tab.
Firewall Configuration between On-premise and Compute Clusters
With respect to communication between the cloud compute cluster and on-premise cluster, there are 4 considerations for network connectivity:
- Compute cluster egress = what traffic is allowed to be sent out of the compute cluster to a destination
- Compute cluster ingress = what traffic is allowed to be received into the compute cluster from a destination
- On-premise cluster egress = what traffic is allowed to be sent out of the on-premise cluster to a destination
- On-premise cluster ingress = what traffic is allowed to be received into the on-premise cluster from a destination
For each consideration, one needs to know the ports through which traffic will traverse.
The compute cluster settings are defined by configuring the firewall rules associated with its VPC network and instances. Generally, all egress traffic is open; this implies all ports are available to send information to any destination. Ingress traffic typically should be more restrictive to protect services from being publicly accessible. Specific ports and source ranges should be whitelisted to permit the expected traffic from the on-premise cluster.
For the on-premise cluster, its firewall similarly needs to allow inbound traffic from the compute cluster by opening the required ports.
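As an illustration, a firewall rule on the on-premise VPC could allow ingress from the compute cluster's subnet on the HDFS and Hive Metastore ports. This is a minimal sketch: the source range and the ports (shown as common defaults, 8020 for the NameNode, 9866/50010 for DataNodes, 9083 for the Hive Metastore) are assumptions to be adjusted to your deployment:
resource "google_compute_firewall" "allow_compute_to_on_prem" {
  provider  = google-beta.google_on_prem
  name      = "allow-compute-to-on-prem"
  network   = module.vpc_on_prem.vpc_self_link
  direction = "INGRESS"

  // CIDR of the compute cluster's subnet; replace with your actual range
  source_ranges = ["10.1.0.0/16"]

  allow {
    protocol = "tcp"
    // Common default ports for the HDFS NameNode, DataNodes, and Hive Metastore;
    // adjust to the ports actually used by your on-premise services
    ports = ["8020", "9866", "50010", "9083"]
  }
}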