AWS Deployment Guide using Terraform
This guide will present a step-by-step process for connecting a remote Alluxio and Presto cluster to an on-premises data store to fully leverage the benefits of data caching and locality with Alluxio.
- Provision and Launch a Cluster on AWS EMR
- Firewall Configuration between On-premises and Compute Clusters
- Guided Wizards for On-prem Connectivity
- What’s Next
This tutorial will leverage Alluxio, Hashicorp Terraform, and PrestoDB to build a cloud-based part of a hybrid cloud data analytics solution. It enables efficient compute utilization and caching in AWS while still allowing data to be kept secure on-premises. By using Alluxio, we prevent redundant data transfer and allow elastic expansion of storage in AWS to provide a solution that keeps data transfer costs low while enabling lightning-fast analytics workloads using the resources of the cloud.
- Hashicorp Terraform 0.12+
- AWS Direct Connect or a site-to-site AWS VPN, which lets you establish a secure, private, encrypted tunnel between AWS and your on-premises network
Provision and Launch a Cluster on AWS EMR
This section uses Hashicorp Terraform to provision an AWS EMR cluster with Alluxio Enterprise Edition.
Create an empty directory to host your Terraform configuration files. This tutorial uses
/tmp/alluxio. You may choose a different path; if you do, remember to adjust the paths in any
commands provided in this section.
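As a minimal sketch, the working directory can be created as follows (assuming the /tmp/alluxio path used throughout this guide):

```shell
# Create the working directory that will hold the Terraform configuration.
mkdir -p /tmp/alluxio
cd /tmp/alluxio
```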
Download and extract the Terraform files, then copy the modules in the
alluxio directory and the scripts in the
emr directory from the extracted archive into /tmp/alluxio:
$ wget https://alluxio-public.s3.amazonaws.com/enterprise-terraform/stable/aws_hybrid_compute_only_simple.tar.gz
$ tar -zxf aws_hybrid_compute_only_simple.tar.gz
$ cd aws_hybrid_compute_only_simple
$ cp -r alluxio/ /tmp/alluxio
$ cp -r emr/ /tmp/alluxio
Once all of the modules are downloaded and copied, run the following commands and enter
yes at the Terraform prompt:
$ terraform init
$ terraform apply
This will provision and launch an EMR cluster with Presto and Alluxio via Terraform. Once the command exits, the cluster should be available to use. You can access the Alluxio master at the URL printed at the end of the Terraform output. You can also view all of the created resources in the AWS EMR console.
Firewall Configuration between On-premises and Compute Clusters
With respect to communication between the cloud compute cluster and the on-premises cluster, there are four considerations for network connectivity:
- Compute cluster egress = what traffic is allowed to be sent out of the compute cluster to a destination
- Compute cluster ingress = what traffic is allowed to be received into the compute cluster from a source
- On-premises cluster egress = what traffic is allowed to be sent out of the on-premises cluster to a destination
- On-premises cluster ingress = what traffic is allowed to be received into the on-premises cluster from a source

For each consideration, you need to know which ports the traffic will traverse.
The compute cluster settings are defined by configuring the security group(s) associated with the instances. Generally, all egress traffic is open; this means any port can be used to send information to any destination. Ingress traffic should typically be more restrictive to protect services from being publicly accessible. Specific ports and sources should be whitelisted to permit the expected traffic from the on-premises cluster.
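Since this guide already manages the cluster with Terraform, such an ingress rule could be sketched in Terraform as well. The CIDR block and security group reference below are placeholders you would replace with your own values; 19998 is Alluxio's default master RPC port:

```hcl
# Hypothetical rule: allow the on-premises network to reach the
# Alluxio master RPC port on the compute cluster.
resource "aws_security_group_rule" "alluxio_master_from_onprem" {
  type              = "ingress"
  from_port         = 19998                # Alluxio master RPC port (default)
  to_port           = 19998
  protocol          = "tcp"
  cidr_blocks       = ["10.0.0.0/8"]       # replace with your on-premises CIDR
  security_group_id = aws_security_group.compute.id  # placeholder reference
}
```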
Similarly, the on-premises cluster's firewall needs to open its ports so the compute cluster can communicate with it. You can verify the compute cluster's connectivity with the connectivity tools provided in Data Orchestration Hub.
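A quick reachability check can also be run by hand from the on-premises side. This is a sketch: the hostname below is a placeholder for your EMR master, and 19998/19999 are Alluxio's default master RPC and web UI ports.

```shell
# Test whether a TCP connection to a host:port succeeds within 5 seconds,
# using bash's built-in /dev/tcp so no extra tools are required.
check_port() {
  timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

MASTER_HOST="ec2-203-0-113-10.compute-1.amazonaws.com"  # placeholder EMR master
for port in 19998 19999; do   # Alluxio default master RPC and web UI ports
  if check_port "$MASTER_HOST" "$port"; then
    echo "port $port reachable"
  else
    echo "port $port blocked"
  fi
done
```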
The KDC setup applies to both Apache Hadoop and CDH for your on-premises cluster.
If the KDC is local to your compute cluster, you can skip this section.
If the compute cluster is using the KDC in the on-premises cluster, or if you are using cross-realm authentication with Active Directory, you will need access to the following ports on the master node in the on-premises cluster:
- KDC server port: 88 by default
- KDC admin server port: 749 by default
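These ports can be checked from a compute node in the same way. The hostname below is a placeholder for your on-premises KDC; 88 and 749 are the Kerberos defaults for the KDC and kadmin services respectively.

```shell
# Confirm the on-premises KDC ports are reachable from a compute node.
KDC_HOST="kdc.example.com"   # placeholder for your on-premises KDC host
for port in 88 749; do       # Kerberos KDC and kadmin default ports
  if timeout 5 bash -c "exec 3<>/dev/tcp/$KDC_HOST/$port" 2>/dev/null; then
    echo "KDC port $port reachable"
  else
    echo "KDC port $port blocked"
  fi
done
```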
Guided Wizards for On-prem Connectivity
After the AWS cluster is running, further changes to the cluster should be made using the Data Orchestration Hub. The Hub provides self-guided wizards to connect the cluster in AWS with on-prem data and metadata sources, including connections to Hive, HDFS, or object storage alternatives.
What's Next
- Customize your deployment in AWS by using advanced Terraform variables.