AWS Deployment Guide using Terraform

This guide describes the Terraform modules used to deploy Alluxio on AWS EMR as a hybrid cloud compute cluster that connects to an HDFS cluster in a remote network.

Overview

The hybrid cloud tutorial uses several modules to set up the necessary AWS resources: two VPCs connected via VPC peering, an HDFS cluster in one of the VPCs, and an Alluxio cluster in the other VPC that mounts the HDFS cluster into its filesystem. To connect the Alluxio cluster to an existing HDFS cluster instead, users are expected to build their own Terraform configuration that invokes the modules used in the tutorial. The Alluxio cluster includes several tools to verify that the cluster in AWS can successfully connect to a remote HDFS cluster.

Prerequisites

Terraform configuration layout

Using the tutorial’s configuration layout as an example, the layout of the hybrid cloud setup consists of the following components:

  • A VPC and subnet, to define the network in which cloud resources will be created. It is strongly recommended to create a new VPC rather than use the default VPC. The vpc_with_internet module creates a VPC, a subnet within the VPC, and a gateway to allow access to the public internet. The module also takes into account the instance types that will be launched in the subnet, allowing a user to blacklist an availability zone if the instance type is not available there.
  • A VPC peering connection, to enable two-way communication between the corporate network in which the HDFS and Hive clusters reside and the cloud VPC in which the Alluxio cluster will reside. The vpc_peering module creates the resources needed for the peering connection.
  • Security groups, to define the ingress and egress rules of the Alluxio cluster instances. The alluxio_security_group module creates two security groups: one for EMR instances to communicate with each other and one to open the Alluxio Web UI port.
  • The Alluxio EMR cluster, consisting of the cluster of instances hosting the Alluxio service. The alluxio_cloud_cluster module encompasses the various resources needed to set up Alluxio in EMR.
  • An S3 bucket, to serve as an intermediary for configuration files to be copied between instances.

The setup may also have the following optional resources:

  • A key pair, to enable SSH access to the Alluxio cluster instances. Either specify an existing key pair name or create a new one by providing a public key to import. The key-pair module available in the public Terraform registry imports your public key as a temporary key pair.
  • A Kerberos configuration, to secure the cluster with the Kerberos authentication protocol. The kerberos_config module helps organize the parameters necessary to connect to a KDC and create keytabs.

The following diagram outlines the basic relationships and resources created by each module:

[Diagram: alluxio_terraform_aws_modules]

Alluxio modules

Terraform modules are generally sourced from the public Terraform Registry, but they can also be sourced from a local path. The AWS hybrid cloud tutorial uses the local path approach for the Alluxio modules.
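As a minimal sketch of the two sourcing styles, the fragment below declares one module from the public registry and one from a local path. The module labels (registry_key_pair, local_vpc) and the argument values are illustrative assumptions; the two source strings are the ones that appear in the tutorial template.

```hcl
# Sourced from the public Terraform Registry (<namespace>/<name>/<provider>).
# The module label and argument values here are hypothetical.
module "registry_key_pair" {
  source     = "terraform-aws-modules/key-pair/aws"
  key_name   = "my-key-pair"
  public_key = file("~/.ssh/id_rsa.pub")
}

# Sourced from a local path, relative to the directory containing this file.
# A local source must start with "./" or "../".
module "local_vpc" {
  source = "./alluxio/vpc_with_internet/aws"
}
```

Run terraform init after adding or changing a module source so that Terraform can install the module before planning.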

Getting started

Create an empty directory to host your Terraform configuration files; henceforth this path will be referred to as /path/to/terraform/root.

Copy modules in the alluxio directory and scripts in the emr directory from the extracted tutorial tarball.

$ wget https://alluxio-public.s3.amazonaws.com/enterprise-terraform/stable/aws_hybrid_apache_simple.tar.gz
$ tar -zxf aws_hybrid_apache_simple.tar.gz
$ cd aws_hybrid_apache_simple
$ cp -r alluxio/ /path/to/terraform/root
$ cp -r emr/ /path/to/terraform/root

Create a main.tf file and declare an AWS provider. Set its region to the desired AWS region. By default, all resources will be created using this provider in its defined region. In order to create resources in distinct regions, declare another provider and set an alias for both providers.

provider "aws" {
  region  = "us-east-1"
  alias   = "aws_1"
  version = "~> 2.56"
}

When providers are declared with aliases, each resource must also declare which provider it is associated with.

resource "aws_s3_bucket" "shared_s3_bucket" {
  provider = aws.aws_1
}

Also in the main file, you can declare a module by setting its source to the relative path of the desired module. For example, to create a VPC using the aforementioned provider, add:

module "vpc_compute" {
  source = "./alluxio/vpc_with_internet/aws"
  providers = {
    aws = aws.aws_1
  }
}

Below is a template main.tf with two AWS providers and a minimal layout of Alluxio modules, showing the relationships between modules.

provider "aws" {
  alias   = "aws_compute"
  region  = "us-east-1"
  version = "~> 2.56"
}

provider "aws" {
  alias   = "aws_on_prem"
  region  = "us-west-1"
  version = "~> 2.56"
}

resource "aws_s3_bucket" "shared_s3_bucket" {
  provider      = aws.aws_compute
  bucket        = "my-bucket-name"
  force_destroy = true
}

module "vpc_on_prem" {
  source = "./alluxio/vpc_with_internet/aws"
  providers = {
    aws = aws.aws_on_prem
  }
}

module "vpc_compute" {
  source = "./alluxio/vpc_with_internet/aws"
  providers = {
    aws = aws.aws_compute
  }
}

module "security_group_compute" {
  source = "./alluxio/alluxio_security_group/aws"
  providers = {
    aws = aws.aws_compute
  }
  name   = "alluxio-security-group"
  vpc_id = module.vpc_compute.vpc_id
}

module "aws_key_pair_compute" {
  source = "terraform-aws-modules/key-pair/aws"
  providers = {
    aws = aws.aws_compute
  }
  key_name   = "my-key-pair"
  public_key = file("~/.ssh/id_rsa.pub")
}

resource "aws_s3_bucket_object" "compute_alluxio_bootstrap" {
  provider = aws.aws_compute
  bucket   = aws_s3_bucket.shared_s3_bucket.bucket
  key      = "alluxio-emr.sh"
  source   = "emr/alluxio-emr.sh"
}

resource "aws_s3_bucket_object" "compute_presto_bootstrap" {
  provider = aws.aws_compute
  bucket   = aws_s3_bucket.shared_s3_bucket.bucket
  key      = "presto-emr.sh"
  source   = "emr/presto-emr.sh"
}

module "alluxio_compute" {
  source = "./alluxio/alluxio_emr/aws"
  providers = {
    aws = aws.aws_compute
  }
  name                          = "alluxio-cluster"
  aws_security_group_id         = module.security_group_compute.emr_managed_sg_id
  additional_security_group_ids = [module.security_group_compute.alluxio_sg_id]
  aws_subnet_id                 = module.vpc_compute.subnet_id
  aws_key_pair_name             = module.aws_key_pair_compute.this_key_pair_key_name

  alluxio_working_bucket   = aws_s3_bucket.shared_s3_bucket.bucket
  alluxio_bootstrap_s3_uri = "s3://${aws_s3_bucket.shared_s3_bucket.bucket}/${aws_s3_bucket_object.compute_alluxio_bootstrap.key}"
  presto_bootstrap_s3_uri  = "s3://${aws_s3_bucket.shared_s3_bucket.bucket}/${aws_s3_bucket_object.compute_presto_bootstrap.key}"
}

module "vpc_peering" {
  source = "./alluxio/vpc_peering/aws"
  providers = {
    aws.cloud_compute = aws.aws_compute
    aws.on_prem       = aws.aws_on_prem
  }
  cloud_compute_vpc_id            = module.vpc_compute.vpc_id
  cloud_compute_subnet_id         = module.vpc_compute.subnet_id
  cloud_compute_security_group_id = module.security_group_compute.emr_managed_sg_id
  on_prem_vpc_id                  = module.vpc_on_prem.vpc_id
  on_prem_subnet_id               = module.vpc_on_prem.subnet_id
  on_prem_security_group_id       = aws_security_group.security_group_on_prem.id
}
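The template's vpc_peering module refers to aws_security_group.security_group_on_prem, which the template itself does not declare. A minimal sketch of such a resource is shown below, assuming the on-premises side is also provisioned by this configuration. The group name, description, and egress rule are illustrative assumptions and are not taken from the tutorial; substitute your actual on-premises security group if one already exists.

```hcl
# Hypothetical security group for the instances in the on-prem VPC.
# Only the resource type, label, provider alias, and vpc_id reference
# follow from the template above; every other value is an assumption.
resource "aws_security_group" "security_group_on_prem" {
  provider    = aws.aws_on_prem
  name        = "on-prem-security-group"
  description = "Security group for on-prem HDFS instances (example)"
  vpc_id      = module.vpc_on_prem.vpc_id

  # Allow all outbound traffic; tighten this for real deployments.
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

No ingress rules are declared here; the sketch assumes that any rules required for cross-VPC traffic are added separately, for example by the peering module or by your own rules.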

The following sections describe the various input variables and outputs of each module.

Module common variables

Each module published by Alluxio has the following Terraform variables in common:

  • enabled
    • type: bool
    • default: true
    • description: If set to false, the module will not create any resources, effectively disabling the module. This is useful when a module should only be invoked conditionally.
  • depends_on_variable
    • type: any
    • default: null
    • description: This placeholder variable can be used to define an explicit dependency to another module or resource.
  • name
    • type: string
    • default: ""
    • description: Each resource created will be prefixed with the provided name. If left empty, a random string will be used. This is helpful to identify the resources created by Terraform.
  • randomize_name
    • type: bool
    • default: true
    • description: When left as true, a random string is appended to the end of var.name. This generates a unique name for resources to avoid name collisions when the same Terraform configuration is invoked concurrently. While developing and testing, it is recommended to leave this set to true; for a production cluster, set it to false.
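As an illustration of how these common variables might be combined, the sketch below is a variation of the vpc_compute block from the template. The create_compute_vpc variable and the "alluxio-prod" name are hypothetical, and the dependency on the shared S3 bucket is chosen only to demonstrate depends_on_variable.

```hcl
# Hypothetical toggle controlling whether the compute VPC is created.
variable "create_compute_vpc" {
  type    = bool
  default = true
}

module "vpc_compute" {
  source = "./alluxio/vpc_with_internet/aws"
  providers = {
    aws = aws.aws_compute
  }

  # Create resources only when the toggle is true.
  enabled = var.create_compute_vpc

  # Use a fixed, predictable resource name prefix with no random suffix.
  name           = "alluxio-prod"
  randomize_name = false

  # Referencing an attribute of another resource makes Terraform
  # create that resource before evaluating this module's resources.
  depends_on_variable = aws_s3_bucket.shared_s3_bucket.id
}
```

Setting randomize_name to false keeps names stable across runs, which suits production, but concurrent runs of the same configuration would then collide on resource names.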

Specific module input variables and outputs


Security Configuration

Alluxio can be configured to run in one of two authentication modes: SIMPLE or KERBEROS. In SIMPLE mode, Alluxio infers the client user from the operating system user and does not require additional credentials. In KERBEROS mode, authentication is enforced via the Kerberos protocol.

Configuring Simple Authentication

This is the default authentication mode. No additional configuration is required.

Configuring Kerberos Authentication

Please follow the steps below to configure Kerberos authentication for the Alluxio cluster:

To enable Kerberos authentication for all services on the cluster, define a kerberos_config module in your Terraform file and update the alluxio_emr module as follows:

module "my_kerberos_config" {
  source = "./alluxio/kerberos_config/aws"
  providers = {
    aws = <your AWS provider for the target region, e.g. aws.aws_compute>
  }
  // <add kerberos settings - see next step>
}

module "my_alluxio_cluster" {
  source = "./alluxio/alluxio_emr/aws"
  providers = {
    aws = <your AWS provider for the target region, e.g. aws.aws_compute>
  }
  // <other cluster settings>
  kerberos_type                     = module.my_kerberos_config.kerberos_type
  kerberos_configuration            = module.my_kerberos_config.kerberos_configuration
  emr_security_configuration_string = module.my_kerberos_config.emr_security_configuration_string
}

The Kerberos configuration module supports the following authentication setups:

  • kerberos_local_kdc: Starts a local KDC in the Alluxio cluster for authentication
  • kerberos_external_kdc: Configures the Alluxio cluster to authenticate with an existing KDC
  • kerberos_local_kdc_with_ad: Starts a local KDC in the Alluxio cluster and sets up cross-realm authentication with an external Active Directory service and a KDC for the on-premises Hadoop cluster

Please choose an authentication setup and follow the instructions below:


Connectivity tools

While developing the Terraform configuration, we strongly recommend testing connectivity with a small Alluxio cluster before launching a full-sized cluster. The connectivity tools can be found in the Manager tab of the Alluxio master web UI (http://MASTER_PUBLIC_DNS:19999).

The tools provided are: