AWS Deployment Guide using Terraform

This guide will present a step-by-step process for connecting a remote Alluxio and Presto cluster to an on-premises data store to fully leverage the benefits of data caching and locality with Alluxio.

Overview

This tutorial will leverage Alluxio, Hashicorp Terraform, and PrestoDB to build the cloud-based portion of a hybrid cloud data analytics solution. It enables efficient compute utilization and caching in AWS while still allowing data to be kept secure on-premises. By using Alluxio, we avoid redundant data transfer and allow elastic expansion of storage in AWS, keeping data transfer costs low while enabling fast analytics workloads using the resources of the cloud.

Prerequisites

  • Hashicorp Terraform 0.12+
  • AWS Direct Connect or an AWS Site-to-Site VPN connection to establish a secure, private, encrypted tunnel between the on-premise network and AWS

Provision and Launch a Cluster on AWS EMR

This section uses Hashicorp Terraform to provision an AWS EMR cluster with Alluxio Enterprise Edition.

Alluxio modules

Terraform modules are generally sourced from the public Terraform Registry, but they can also be sourced from a local path. This is the approach used by the AWS tutorial.
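
As an illustration, a root configuration can reference the locally copied module with a relative source path instead of a registry address. The module name and path below are assumptions for the sketch; the tutorial's own configuration files define the actual module blocks and their input variables.

module "alluxio" {
  # Load the Alluxio module from the locally copied directory rather than
  # the public Terraform Registry (hypothetical relative path).
  source = "./alluxio"

  # Any input variables required by the module would be set here.
}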

Getting started

Create an empty directory to host your Terraform configuration files. This tutorial uses /tmp/alluxio. You may choose a different path; if you do, adjust the paths in the commands throughout this section.

Copy the modules in the alluxio directory and the scripts in the emr directory from the extracted tutorial tarball into your configuration directory.

$ wget https://alluxio-public.s3.amazonaws.com/enterprise-terraform/stable/aws_hybrid_compute_only_simple.tar.gz
$ tar -zxf aws_hybrid_compute_only_simple.tar.gz
$ cd aws_hybrid_compute_only_simple
$ cp -r alluxio/ /tmp/alluxio
$ cp -r emr/ /tmp/alluxio

Once all of the modules are downloaded and copied, run the following commands, typing yes when terraform apply prompts for confirmation:

$ terraform init
$ terraform apply

This will provision and launch an EMR cluster with Presto and Alluxio via Terraform. Once the command exits, the cluster should be available to use. You can access the Alluxio master at the URL printed at the end of the terraform apply output, or view all of the created resources in the AWS EMR console.
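
If you want to retrieve the master address later without scrolling through the apply output, a Terraform output can surface it. The output name and module attribute below are hypothetical, and the tutorial's configuration may already define an equivalent output; Alluxio's master web UI listens on port 19999 by default.

output "alluxio_master_web_ui" {
  # Hypothetical output exposing the EMR master's public DNS, where the
  # Alluxio master web UI is served on its default port.
  value = "http://${module.alluxio.master_public_dns}:19999"
}

Running terraform output alluxio_master_web_ui after the apply completes prints the address.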

Firewall Configuration between On-premise and Compute Clusters

For communication between the cloud compute cluster and the on-premise cluster, there are four considerations for network connectivity:

  1. Compute cluster egress = what traffic is allowed to be sent out of the compute cluster to a destination
  2. Compute cluster ingress = what traffic is allowed to be received into the compute cluster from a source
  3. On-premise cluster egress = what traffic is allowed to be sent out of the on-premise cluster to a destination
  4. On-premise cluster ingress = what traffic is allowed to be received into the on-premise cluster from a source

For each consideration, you need to know the ports through which traffic will traverse.

The compute cluster settings are defined by configuring the security group(s) associated with the instances. Generally, all egress traffic is open; this implies all ports are available to send information to any destination. Ingress traffic typically should be more restrictive to protect services from being publicly accessible. Specific ports and destinations should be whitelisted to permit the expected traffic from the on-premise cluster.
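
As a sketch, an ingress rule on the compute cluster's security group might allowlist only the on-premise network's CIDR block on a specific port. The resource names, CIDR block, and port below are placeholders rather than the tutorial's actual values.

resource "aws_security_group_rule" "onprem_ingress" {
  # Allow inbound traffic from the on-premise network only, and only on
  # the port the on-premise cluster is expected to reach (placeholder
  # values; 19998 is the default Alluxio master RPC port, used here as
  # an example).
  type              = "ingress"
  protocol          = "tcp"
  from_port         = 19998
  to_port           = 19998
  cidr_blocks       = ["10.0.0.0/16"]  # on-premise network CIDR (placeholder)
  security_group_id = aws_security_group.compute_cluster.id
}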

The on-premise cluster's firewall similarly needs to allow the compute cluster to communicate by opening the required ports. You can verify the compute cluster's connectivity with the connectivity tools provided in Data Orchestration Hub.

HDFS

If the compute cluster mounts the on-premise HDFS, the on-premise NameNode and DataNode ports must be reachable from the compute cluster. The defaults are typically 8020 for the NameNode RPC port and 9866 (50010 on older Hadoop versions) for DataNode data transfer; confirm the ports configured in your deployment.

Hive Metastore

If the compute cluster connects to the on-premise Hive Metastore, the metastore's Thrift service port must be reachable from the compute cluster. The default is typically 9083; confirm the port configured in your deployment.

KDC

The KDC setup applies to both Apache Hadoop and CDH for your on-premise cluster.

If the KDC is local to your compute cluster, you can skip this section.

If the compute cluster is using the KDC in the on-premise cluster, or if you are using cross-realm authentication with Active Directory, the compute cluster will need access to the following ports on the master node in the on-premise cluster (a sketch of a corresponding security group rule follows the list):

  • KDC server port: 88
  • KDC admin server port: 749
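
As a minimal sketch, if the compute cluster's security group does not already allow all outbound traffic, an egress rule toward the on-premise KDC could be added for these ports. The resource names and CIDR block are placeholders; Kerberos uses both TCP and UDP on port 88, and the admin server uses TCP on port 749, so a similar rule would be needed for each protocol and port.

resource "aws_security_group_rule" "kdc_egress_tcp_88" {
  # Allow outbound Kerberos traffic (TCP 88) from the compute cluster to
  # the on-premise KDC subnet (placeholder CIDR). Equivalent rules would
  # cover UDP 88 and TCP 749.
  type              = "egress"
  protocol          = "tcp"
  from_port         = 88
  to_port           = 88
  cidr_blocks       = ["192.168.0.0/24"]  # on-premise KDC subnet (placeholder)
  security_group_id = aws_security_group.compute_cluster.id
}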

Guided Wizards for On-prem Connectivity

After the AWS cluster is running, further changes to the cluster should be made using the Data Orchestration Hub. The Hub provides self-guided wizards to connect the cluster in AWS to on-prem data and metadata sources, including Hive, HDFS, or other object storage alternatives.

What’s Next