AWS Advanced Deployment Guide using Terraform
This guide describes the Terraform modules used to deploy Alluxio
on AWS EMR as a hybrid cloud compute cluster,
connecting to an HDFS cluster in a remote network.
Overview
The hybrid cloud tutorial uses several modules to set up the AWS resources necessary
to create two VPCs connected via VPC peering, an HDFS cluster in one of the VPCs,
and an Alluxio cluster in the other VPC that mounts the HDFS cluster into its filesystem.
To customize the Alluxio cluster to connect to an existing HDFS cluster,
users are expected to build their own Terraform configuration
that invokes the modules used in the tutorial.
Several tools are provided within the Alluxio cluster
to ensure that the cluster in AWS is able to successfully connect to a remote HDFS cluster.
Prerequisites
Using the tutorial’s configuration layout as an example,
the hybrid cloud setup consists of the following components:
- A VPC and subnet, to define the network in which cloud resources will be created.
It is strongly recommended to create a new VPC as opposed to using the default VPC.
The vpc_with_internet module creates a VPC, a subnet within the VPC,
and a gateway to allow access to the public internet.
The module also takes into account the instance types that would be launched in the subnet,
allowing a user to blacklist an availability zone if the instance type is not available there.
- VPC peering connection, to enable the corporate network in which the HDFS and Hive clusters
reside to communicate with the cloud VPC in which the Alluxio cluster will reside and vice versa.
The vpc_peering module creates the resources needed for the peering connection.
- Security groups, to define the ingress and egress rules of the Alluxio cluster instances.
The alluxio_security_group module creates two security groups:
one for EMR instances to communicate with each other and one to open the Alluxio Web UI port.
- The Alluxio EMR cluster, consisting of the cluster of instances hosting the Alluxio service.
The alluxio_cloud_cluster module encompasses the various resources needed to set up Alluxio in EMR.
- An S3 bucket to serve as an intermediary for configuration files to be copied between instances.
The setup may also have the following optional resources:
- A key pair, to enable SSH access to the Alluxio cluster instances.
Either specify an existing key pair name or create a new one by providing a public key to import.
The key-pair module, available in the public Terraform registry, imports your public key as a temporary key pair.
- A Kerberos configuration, to secure the cluster with the Kerberos authentication protocol.
The kerberos_config module helps organize the parameters necessary to connect to a KDC and create keytabs.
The following diagram outlines the basic relationships and resources created by each module:
Alluxio modules
Terraform modules are generally sourced from the public Terraform Registry,
but they can also be sourced from a local path.
The AWS hybrid cloud tutorial uses the local-path approach.
Getting started
Create an empty directory to host your Terraform configuration files;
henceforth this path will be referred to as /path/to/terraform/root.
Copy the modules in the alluxio directory and the scripts in the emr directory
from the extracted tutorial tarball.
$ wget https://alluxio-public.s3.amazonaws.com/enterprise-terraform/stable/aws_hybrid_apache_simple.tar.gz
$ tar -zxf aws_hybrid_apache_simple.tar.gz
$ cd aws_hybrid_apache_simple
$ cp -r alluxio/ /path/to/terraform/root
$ cp -r emr/ /path/to/terraform/root
Create a main.tf file and declare an AWS provider.
Set its region to the desired AWS region.
By default, all resources will be created using this provider in its defined region.
To create resources in distinct regions, declare another provider and
set an alias for both providers.
provider "aws" {
region = "us-east-1"
alias = "aws_1"
version = "~> 2.56"
}
Similarly, each resource must also declare which provider it is associated with.
resource "aws_s3_bucket" "shared_s3_bucket" {
provider = aws.aws_1
}
Also in the main file, you can declare a module by setting its source to the relative path of the desired module.
For example, to create a VPC using the aforementioned provider, add:
module "vpc_compute" {
source = "./alluxio/vpc_with_internet/aws"
providers = {
aws = aws.aws_1
}
}
Below is a template main.tf with two AWS providers and a minimal layout of Alluxio modules,
denoting the relationship between modules.
provider "aws" {
alias = "aws_compute"
region = "us-east-1"
version = "~> 2.56"
}
provider "aws" {
alias = "aws_on_prem"
region = "us-west-1"
version = "~> 2.56"
}
resource "aws_s3_bucket" "shared_s3_bucket" {
provider = aws.aws_compute
bucket = "my-bucket-name"
force_destroy = true
}
module "vpc_on_prem" {
source = "./alluxio/vpc_with_internet/aws"
providers = {
aws = aws.aws_on_prem
}
}
module "vpc_compute" {
source = "./alluxio/vpc_with_internet/aws"
providers = {
aws = aws.aws_compute
}
}
module "security_group_compute" {
source = "./alluxio/alluxio_security_group/aws"
providers = {
aws = aws.aws_compute
}
name = "alluxio-security-group"
vpc_id = module.vpc_compute.vpc_id
}
module "aws_key_pair_compute" {
source = "terraform-aws-modules/key-pair/aws"
version = "0.6.0"
providers = {
aws = aws.aws_compute
}
key_name = "my-key-pair"
public_key = file("~/.ssh/id_rsa.pub")
}
resource "aws_s3_bucket_object" "compute_alluxio_bootstrap" {
provider = aws.aws_compute
bucket = aws_s3_bucket.shared_s3_bucket.bucket
key = "alluxio-emr.sh"
source = "emr/alluxio-emr.sh"
}
resource "aws_s3_bucket_object" "compute_presto_bootstrap" {
provider = aws.aws_compute
bucket = aws_s3_bucket.shared_s3_bucket.bucket
key = "presto-emr.sh"
source = "emr/presto-emr.sh"
}
module "alluxio_compute" {
source = "./alluxio/alluxio_emr/aws"
providers = {
aws = aws.aws_compute
}
name = "alluxio-cluster"
aws_security_group_id = module.security_group_compute.emr_managed_sg_id
additional_security_group_ids = [module.security_group_compute.alluxio_sg_id]
aws_subnet_id = module.vpc_compute.subnet_id
aws_key_pair_name = module.aws_key_pair_compute.this_key_pair_key_name
alluxio_working_bucket = aws_s3_bucket.shared_s3_bucket.bucket
alluxio_bootstrap_s3_uri = "s3://${aws_s3_bucket.shared_s3_bucket.bucket}/${aws_s3_bucket_object.compute_alluxio_bootstrap.key}"
presto_bootstrap_s3_uri = "s3://${aws_s3_bucket.shared_s3_bucket.bucket}/${aws_s3_bucket_object.compute_presto_bootstrap.key}"
}
module "vpc_peering" {
source = "./alluxio/vpc_peering/aws"
providers = {
aws.cloud_compute = aws.aws_compute
aws.on_prem = aws.aws_on_prem
}
cloud_compute_vpc_id = module.vpc_compute.vpc_id
cloud_compute_subnet_id = module.vpc_compute.subnet_id
cloud_compute_security_group_id = module.security_group_compute.emr_managed_sg_id
on_prem_vpc_id = module.vpc_on_prem.vpc_id
on_prem_subnet_id = module.vpc_on_prem.subnet_id
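# security_group_on_prem is not declared by any module in this template;
# declare an aws_security_group resource with this name in the on-prem VPC.
on_prem_security_group_id = aws_security_group.security_group_on_prem.id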
}
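The template references aws_security_group.security_group_on_prem without declaring it. The following is a minimal sketch of such a resource, assuming the on-prem cluster lives in the VPC created by module.vpc_on_prem. The resource name matches the reference in the template; the group name and the open egress rule are illustrative assumptions, not values taken from the tutorial.

```hcl
# Hypothetical security group for the on-prem side of the peering connection.
# The name and rule below are placeholders; tighten them to match your HDFS cluster.
resource "aws_security_group" "security_group_on_prem" {
  provider = aws.aws_on_prem
  name     = "on-prem-security-group"
  vpc_id   = module.vpc_on_prem.vpc_id

  # Allow all outbound traffic from instances in this group.
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

The vpc_peering module receives both security group ids as inputs, so it may add the cross-VPC ingress rules itself; inspect the module source to confirm before adding your own.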
The following sections describe the various input variables and outputs of each module.
Module common variables
Each module published by Alluxio has the following Terraform variables in common:
enabled
- type:
bool
- default:
true
- description: If set to false, the module will not create any resources, effectively disabling the module.
This is useful when a module should only be invoked conditionally.
depends_on_variable
- type:
any
- default:
null
- description: This placeholder variable can be used to define an explicit dependency to another module or resource.
name
- type:
string
- default:
""
- description: Each resource created will be prefixed with the provided name.
If left empty, a random string will be used.
This is helpful to identify the resources created by Terraform.
randomize_name
- type:
bool
- default:
true
- description: When left as true, a random string is appended to the end of var.name.
This generates a unique name for resources to avoid name collisions
when the same Terraform configuration is invoked concurrently.
While developing and testing, it is recommended to leave this set to true,
but on a production cluster, this should be set as false.
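As an illustration of the common variables, the sketch below disables a module conditionally and pins a predictable resource name prefix. The variable create_on_prem_vpc and the name "alluxio-prod" are hypothetical and chosen only for this example.

```hcl
# Hypothetical toggle; the variable name is an assumption for illustration.
variable "create_on_prem_vpc" {
  type    = bool
  default = true
}

module "vpc_on_prem" {
  source = "./alluxio/vpc_with_internet/aws"
  providers = {
    aws = aws.aws_on_prem
  }

  # When false, the module creates no resources.
  enabled = var.create_on_prem_vpc

  # Prefix created resources with a fixed string and skip the random suffix,
  # as recommended above for production clusters.
  name           = "alluxio-prod"
  randomize_name = false
}
```

When a module is disabled this way, anything that reads its outputs presumably needs to tolerate empty values, so it is worth checking downstream references before toggling it off.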
The alluxio_cloud_cluster module creates an EMR cluster configured with a bootstrap action
to configure and deploy Alluxio.
aws_subnet_id
- type:
string
- REQUIRED
- description: Id of VPC subnet to create EMR cluster in
aws_security_group_id
- type:
string
- REQUIRED
- description: Id of security group to associate with EMR cluster instances
aws_key_pair_name
- type:
string
- default:
""
- description: Name of AWS key pair to launch instances with.
If left blank, SSH access to instances will not be available.
additional_security_group_ids
- type:
list(string)
- default:
[]
- description: List of additional security group ids to associate EMR instances with
applications
- type:
list(string)
- default:
[]
- description: List of application names to deploy on EMR cluster.
Hadoop, Hive, Spark, and Presto applications will be set if value is left as an empty list.
bootstrap_actions
- type:
list(object({
path = string
name = string
args = list(string)
}))
- default:
[]
- description: List of additional bootstrap actions to execute after provisioning
emr_release_label
- type:
string
- default:
"emr-5.29.0"
- description: Use emr-5.29.0 for Hadoop 2 and emr-6.0.0 for Hadoop 3.
See https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop.html for more details
emr_configurations_json_file
- type:
string
- default:
""
- description: JSON file containing configuration overrides for EMR applications.
See https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
log_uri
- type:
string
- default:
""
- description: S3 URL to write EMR logs to.
The S3 bucket created for the cluster will be used if value is left empty.
ebs_root_volume_size
- type:
number
- default:
32
- description: Size in GB to allocate for the root volume on each instance
master_config
- type:
object({
ebs_volume_size = number
instance_count = number
instance_type = string
})
- default:
{
ebs_volume_size = 10
instance_count = 1
instance_type = "r4.xlarge"
}
- description: Master instance(s) configuration details
masters_spot_price
- type:
number
- default:
0
- description: Provision spot instances for masters with the given price.
If set to 0, provisions on demand.
worker_config
- type:
object({
ebs_volume_size = number
instance_count = number
instance_type = string
})
- default:
{
ebs_volume_size = 32
instance_count = 1
instance_type = "r4.xlarge"
}
- description: Worker instance(s) configuration details
workers_spot_price
- type:
number
- default:
2.0
- description: Provision spot instances for workers with the given price.
If set to 0, provisions on demand.
alluxio_tarball_url
- type:
string
- default:
"https://downloads.alluxio.io/protected/files/alluxio-enterprise-trial.tar.gz"
- description: Alluxio tarball download url
alluxio_additional_properties
- type:
string
- default:
""
- description: A string containing a delimited set of properties which should be added to the alluxio-site.properties file.
The delimiter by default is a semicolon ‘;’.
alluxio_active_sync_list
- type:
string
- default:
"/"
- description: A string containing a delimited set of Alluxio paths where UFS metadata will be periodically synced with the Alluxio namespace.
The delimiter by default is a semicolon ‘;’.
alluxio_nvme_percentage
- type:
number
- default:
0
- description: Percentage of the instance attached NVMe SSD to be configured as Alluxio worker storage.
Set this variable when your worker instance type contains NVMe SSDs by default.
alluxio_working_bucket
- type:
string
- default:
null
- description: S3 bucket to be used as the Alluxio working bucket and EMR log location.
The bucket must be created in the same region as the EMR cluster.
alluxio_bootstrap_s3_uri
- type:
string
- default:
null
- description: S3 URI of the Alluxio bootstrap script
presto_bootstrap_s3_uri
- type:
string
- default:
null
- description: S3 URI of the Presto bootstrap script
on_prem_hdfs_address
- type:
string
- default:
""
- description: On-prem HDFS address (e.g. hdfs://<on_prem_hdfs_master_hostname>:8020/path/to/mount) for Alluxio to connect to.
on_prem_hms_address
- type:
string
- default:
""
- description: On-prem hive metastore address (e.g. thrift://:9083) for Presto and Spark to connect to
on_prem_core_site_uri
- type:
string
- default:
""
- description: An s3:// or http(s):// URI to download on_prem core-site.xml file from.
If provided, Presto and Alluxio will be configured with the given core-site.xml
on_prem_hdfs_site_uri
- type:
string
- default:
""
- description: An s3:// or http(s):// URI to download on_prem hdfs-site.xml file from.
If provided, Presto and Alluxio will be configured with the given hdfs-site.xml
hdfs_version
- type:
string
- default:
"hadoop-2.8"
- description: Version of the hdfs to connect to.
Valid values include cdh-5.11, cdh-5.12, cdh-5.13, cdh-5.14, cdh-5.15, cdh-5.16, cdh-5.6, cdh-5.8,
cdh-6.0, cdh-6.1, cdh-6.2, cdh-6.3,
hadoop-2.2, hadoop-2.3, hadoop-2.4, hadoop-2.5, hadoop-2.6, hadoop-2.7, hadoop-2.8, hadoop-2.9,
hadoop-3.0, hadoop-3.1, hadoop-3.2,
hdp-2.0, hdp-2.1, hdp-2.2, hdp-2.3, hdp-2.4, hdp-2.5, hdp-2.6, hdp-3.0, hdp-3.1
kerberos_type
- type:
string
- default:
""
- description: Name of kerberos type, where empty string indicates no kerberos authentication
kerberos_configuration
- type:
map(string)
- default:
{}
- description: Map of kerberos configuration properties, output from kerberos_config module
emr_security_configuration_string
- type:
string
- default:
""
- description: String to set as the configuration field of the
aws_emr_security_configuration resource, output from kerberos_config module
hadoop_master_public_dns
- description: Public DNS of the hadoop cluster master
hadoop_master_private_dns
- description: Private DNS of the hadoop cluster master
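To show how several of these variables combine, here is a sketch of a cluster module call that resizes the workers, points Alluxio at a remote HDFS, and exposes the master address as a root-level output. It reuses the resource and module names from the template main.tf; the HDFS hostname, the instance counts, and the output name are assumptions made for illustration.

```hcl
module "alluxio_compute" {
  source = "./alluxio/alluxio_emr/aws"
  providers = {
    aws = aws.aws_compute
  }

  name                  = "alluxio-cluster"
  aws_security_group_id = module.security_group_compute.emr_managed_sg_id
  aws_subnet_id         = module.vpc_compute.subnet_id

  alluxio_working_bucket   = aws_s3_bucket.shared_s3_bucket.bucket
  alluxio_bootstrap_s3_uri = "s3://${aws_s3_bucket.shared_s3_bucket.bucket}/${aws_s3_bucket_object.compute_alluxio_bootstrap.key}"
  presto_bootstrap_s3_uri  = "s3://${aws_s3_bucket.shared_s3_bucket.bucket}/${aws_s3_bucket_object.compute_presto_bootstrap.key}"

  # Object-typed variables are set as a whole, so all three attributes are given.
  worker_config = {
    ebs_volume_size = 64
    instance_count  = 3
    instance_type   = "r4.xlarge"
  }

  # 0 requests on-demand instances instead of spot instances.
  workers_spot_price = 0

  # Placeholder hostname; replace with the address of your HDFS namenode.
  on_prem_hdfs_address = "hdfs://hdfs-master.example.internal:8020/"
  hdfs_version         = "hadoop-2.8"

  # Semicolon-delimited entries appended to alluxio-site.properties.
  alluxio_additional_properties = "alluxio.user.file.writetype.default=CACHE_THROUGH;alluxio.user.file.readtype.default=CACHE"
}

# Hypothetical output name; surfaces the module output at the root level.
output "alluxio_master_public_dns" {
  value = module.alluxio_compute.hadoop_master_public_dns
}
```

After terraform apply, the value can be read with terraform output alluxio_master_public_dns.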
The alluxio_security_group module creates two separate security groups.
One is set as the EMR managed security group,
which has constraints on which ports can be opened.
The id of this security group is returned by emr_managed_sg_id.
The other security group opens the Alluxio web UI port.
The id of this security group is returned by alluxio_sg_id.
vpc_id
- type:
string
- REQUIRED
- description: Id of VPC to create security group in
alluxio_web_ui_rule_cidr_blocks
- type:
list(string)
- default:
[]
- description: List of CIDR blocks to set as an ingress rule for Alluxio Web UI port.
This cannot contain 0.0.0.0/0 because EMR will fail to start.
A CIDR block representing your IP address is typically in the format IP_ADDRESS/32.
If this is left unset, the Alluxio Web UI will not be accessible,
but the security group can be updated after creation to open access.
emr_managed_sg_id
- description: Id of EMR managed security group
alluxio_sg_id
- description: Id of Alluxio security group
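A short sketch of opening the Alluxio Web UI to a single workstation follows. The address 203.0.113.10 comes from a range reserved for documentation and stands in for your own public IP.

```hcl
module "security_group_compute" {
  source = "./alluxio/alluxio_security_group/aws"
  providers = {
    aws = aws.aws_compute
  }

  name   = "alluxio-security-group"
  vpc_id = module.vpc_compute.vpc_id

  # A /32 block admits exactly one address. Do not use 0.0.0.0/0 here,
  # since EMR will fail to start with it.
  alluxio_web_ui_rule_cidr_blocks = ["203.0.113.10/32"]
}
```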
The kerberos_config module is a helper module to organize all Kerberos-related configurations.
Each variable represents a particular Kerberos configuration setup.
Only one variable should be populated, with its enabled flag set to true.
kerberos_local_kdc
- type:
object({
enabled = bool
default_realm = string
kdc_admin_password = string
})
- description: Configuration for a local KDC residing in the same cluster as Alluxio
kerberos_external_kdc
- type:
object({
enabled = bool
default_realm = string
kdc_admin_password = string
kdc_server_address = string
kdc_admin_server_address = string
tls_cert_zip_s3_uri = string
})
- description: Configuration for an external KDC, residing outside of the Alluxio cluster
kerberos_local_kdc_with_ad
- type:
object({
enabled = bool
default_realm = string
kdc_admin_password = string
tls_cert_zip_s3_uri = string
ad_admin_server = string
ad_cross_realm_trust_principal_password = string
ad_domain = string
ad_domain_join_password = string
ad_domain_join_user = string
ad_kdc_server = string
ad_realm = string
onprem_admin_password = string
onprem_admin_server = string
onprem_admin_user = string
onprem_cross_realm_trust_password = string
onprem_domain = string
onprem_kdc_server = string
onprem_realm = string
})
- description: Configuration for cross realm authentication using active directory
and a local KDC residing in the same cluster as Alluxio
emr_security_configuration_string
- description: String to set as the configuration field of the aws_emr_security_configuration resource
kerberos_configuration
- description: Map of kerberos configuration properties
kerberos_type
- description: Type of kerberos authentication to configure for
The vpc_peering module creates a peering connection between two VPCs.
Each of the declared variables is required.
cloud_compute_vpc_id
- type:
string
- REQUIRED
- description: VPC id of cloud compute cluster
cloud_compute_subnet_id
- type:
string
- REQUIRED
- description: Subnet id within the given VPC in cloud compute cluster
cloud_compute_security_group_id
- type:
string
- REQUIRED
- description: Security group id of cloud compute cluster
on_prem_vpc_id
- type:
string
- REQUIRED
- description: VPC id of on-prem cluster
on_prem_subnet_id
- type:
string
- REQUIRED
- description: Subnet id within the given VPC in on-prem cluster
on_prem_security_group_id
- type:
string
- REQUIRED
- description: Security group id of on-prem cluster
vpc_peering_id
- description: Id of vpc peering connection
The vpc_with_internet module creates a VPC and a subnet within the VPC.
In order to accommodate instance types that may not be available in every availability zone,
a blacklist is provided to avoid creating a subnet in an availability zone that does not support
the desired instance type.
aws_vpc_cidr
- type:
string
- default:
""
- description: VPC CIDR block to create VPC with. If left empty, a random CIDR block will be used
aws_subnet_zone
- type:
string
- default:
null
- description: Availability zone to create subnet and related resources in.
If zone is not provided, AWS will pick a random zone to create the subnet in.
aws_availability_zone_blacklist
- type:
list(string)
- default:
["us-east-1e", "us-east-1f"]
- description: Availability zones in which the subnet should not be created,
for example because they do not offer the desired instance types.
aws_instance_types
- type:
list(string)
- default:
[]
- description: Instance types to check against the availability zone of the created subnet.
If the availability zone doesn’t support one or more of the given instance types,
the error "no EC2 Instance Type Offerings found matching criteria" will be thrown directly.
aws_subnet_cidr
- type:
string
- default:
""
- description: Subnet CIDR block to create subnet with, within the created VPC. If left empty, a random CIDR block will be used
aws_dns_servers
- type:
list(string)
- default:
[]
- description: DNS servers to use. If not specified, the default AWS DNS server will be used.
vpc_id
- description: Id of the created VPC
subnet_id
- description: Id of the created subnet
vpc_cidr
- description: CIDR block of VPC
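The sketch below sets explicit CIDR blocks for both VPCs and checks instance type availability in the compute subnet. The CIDR ranges and name prefixes are assumptions for illustration; they are chosen not to overlap because AWS VPC peering does not accept VPCs with overlapping CIDR blocks.

```hcl
module "vpc_compute" {
  source = "./alluxio/vpc_with_internet/aws"
  providers = {
    aws = aws.aws_compute
  }

  name            = "alluxio-compute"
  aws_vpc_cidr    = "10.0.0.0/16"
  aws_subnet_cidr = "10.0.1.0/24"

  # Instance types that must be offered in the zone chosen for the subnet.
  aws_instance_types = ["r4.xlarge"]

  # Zones to avoid when placing the subnet (the documented default is shown).
  aws_availability_zone_blacklist = ["us-east-1e", "us-east-1f"]
}

module "vpc_on_prem" {
  source = "./alluxio/vpc_with_internet/aws"
  providers = {
    aws = aws.aws_on_prem
  }

  name            = "alluxio-on-prem"
  aws_vpc_cidr    = "10.1.0.0/16"
  aws_subnet_cidr = "10.1.1.0/24"
}
```

Leaving both CIDR variables empty falls back to random blocks, which may be acceptable for testing but makes the address plan harder to reason about once the VPCs are peered.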
Security Configuration
Alluxio can be configured to run in one of two authentication modes: SIMPLE or KERBEROS.
In SIMPLE mode, Alluxio infers the client user from the operating system user and does not require
additional credentials.
In KERBEROS mode, authentication is enforced via the Kerberos protocol.
Configuring Simple Authentication
This is the default authentication mode. No additional configuration is required.
Configuring Kerberos Authentication
Follow the steps below to configure Kerberos authentication for the Alluxio cluster.
To enable Kerberos authentication for all services on the cluster, define a kerberos_config
module in your Terraform file and update the alluxio_emr module as follows:
module "my_kerberos_config" {
source = "./alluxio/kerberos_config/aws"
providers = {
aws = <your AWS provider alias, e.g. aws.aws_compute>
}
// <add kerberos settings - see next step>
}
module "my_alluxio_cluster" {
source = "./alluxio/alluxio_emr/aws"
providers = {
aws = <your AWS provider alias, e.g. aws.aws_compute>
}
// <other cluster settings>
kerberos_type = module.my_kerberos_config.kerberos_type
kerberos_configuration = module.my_kerberos_config.kerberos_configuration
emr_security_configuration_string = module.my_kerberos_config.emr_security_configuration_string
}
The Kerberos configuration module supports the following authentication setups:
- kerberos_local_kdc: Starts a local KDC in the Alluxio cluster for authentication.
- kerberos_external_kdc: Configures the Alluxio cluster to authenticate with an existing KDC.
- kerberos_local_kdc_with_ad: Starts a local KDC in the Alluxio cluster and sets up cross-realm
authentication with an external Active Directory service and a KDC for an on-premise Hadoop cluster.
Choose an authentication setup and follow the corresponding instructions below.
The kerberos_local_kdc setup starts a KDC within the Alluxio cluster and configures all services on the cluster to
authenticate with the KDC.
To set up a local KDC in the Alluxio cluster, configure the kerberos_config module as below:
module "my_kerberos_config" {
source = "./alluxio/kerberos_config/aws"
providers = {
aws = <your AWS provider alias, e.g. aws.aws_compute>
}
kerberos_local_kdc = {
enabled = true
default_realm = "ALLUXIO.COM"
kdc_admin_password = "admin"
}
}
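In the kerberos_local_kdc section:
- default_realm specifies the default Kerberos realm of the local KDC.
- kdc_admin_password specifies the admin password for the local KDC.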
The kerberos_external_kdc setup allows you to connect services on the Alluxio cluster with an existing KDC so you can
centralize credential management of multiple clusters.
To set up Kerberos authentication with an existing KDC, configure the kerberos_config module as below:
module "my_kerberos_config" {
source = "./alluxio/kerberos_config/aws"
providers = {
aws = <your AWS provider alias, e.g. aws.aws_compute>
}
kerberos_external_kdc = {
enabled = true
default_realm = "ALLUXIO.COM"
kdc_admin_password = "admin"
kdc_server_address = "domain.example.com:88"
kdc_admin_server_address = "domain.example.com:749"
tls_cert_zip_s3_uri = "s3://bucket_name/emr-certs.zip"
}
}
In the kerberos_external_kdc section:
- default_realm specifies the default Kerberos realm of the external KDC.
- kdc_admin_password specifies the admin password for the external KDC.
This is used for creating user credentials for Alluxio services.
- kdc_server_address specifies the fully qualified domain name or IP address of the external KDC server.
A port can also be specified; otherwise port 88 is used.
- kdc_admin_server_address specifies the fully qualified domain name or IP address of the external
Kerberos admin server.
A port can also be specified; otherwise port 749 is used.
- tls_cert_zip_s3_uri specifies the URL of a zip file with certificates in Amazon S3 used for
in-transit data encryption.
See the EMR encryption documentation for details.
The kerberos_local_kdc_with_ad setup configures services on the Alluxio cluster to establish cross-realm authentication with an
Active Directory service and a Kerberized on-premise Hadoop cluster.
To set up cross-realm authentication, configure the kerberos_config module as below:
module "my_kerberos_config" {
source = "./alluxio/kerberos_config/aws"
providers = {
aws = <your AWS provider alias, e.g. aws.aws_compute>
}
kerberos_local_kdc_with_ad = {
enabled = true
default_realm = "ALLUXIO.COM"
kdc_admin_password = "admin"
tls_cert_zip_s3_uri = "s3://bucket_name/emr-certs.zip"
ad_admin_server = "myad.com"
ad_cross_realm_trust_principal_password = "admin"
ad_domain = "example.com"
ad_domain_join_password = "admin"
ad_domain_join_user = "CrossRealmAdmin"
ad_kdc_server = "myad.com"
ad_realm = "EXAMPLE.COM"
onprem_admin_password = "admin"
onprem_admin_server = "onprem.com"
onprem_admin_user = "kadmin/admin"
onprem_cross_realm_trust_password = "admin"
onprem_domain = "onprem.com"
onprem_kdc_server = "onprem.com"
onprem_realm = "ONPREM.COM"
}
}
In the kerberos_local_kdc_with_ad section:
- default_realm specifies the default Kerberos realm of the local KDC.
- kdc_admin_password specifies the admin password for the local KDC.
- tls_cert_zip_s3_uri specifies the URL of a zip file with certificates in Amazon S3 used for
in-transit data encryption.
See the EMR encryption documentation for details.
- ad_admin_server specifies the fully qualified domain name or IP address of the Active Directory
admin server.
A port can also be specified; otherwise port 749 is used.
- ad_cross_realm_trust_principal_password specifies the cross-realm principal password with Active
Directory, which must be identical across realms.
- ad_domain specifies the domain name of the Active Directory service.
- ad_domain_join_user specifies the user logon name of an Active Directory account with
permission to join computers to the domain.
- ad_domain_join_password specifies the password for the Active Directory domain join user.
- ad_kdc_server specifies the fully qualified domain name or IP address of the Active Directory
KDC server.
A port can also be specified; otherwise port 88 is used.
- ad_realm specifies the Kerberos realm name of the Active Directory service.
- onprem_admin_password specifies the admin password for the on-prem KDC.
This is used for creating user credentials for Alluxio services.
- onprem_admin_server specifies the fully qualified domain name or IP address of the on-premise
KDC admin server.
A port can also be specified; otherwise port 749 is used.
- onprem_admin_user specifies the admin user name for the on-prem KDC.
This is used for creating user credentials for Alluxio services.
- onprem_cross_realm_trust_password specifies the cross-realm principal password with the on-premise
Kerberos realm, which must be identical across realms.
- onprem_domain specifies the domain name of the on-premise KDC.
- onprem_kdc_server specifies the fully qualified domain name or IP address of the on-premise KDC
server. A port can also be specified; otherwise port 88 is used.
- onprem_realm specifies the Kerberos realm name of the on-premise KDC.
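The Kerberos examples in this section hard-code passwords such as "admin" for readability. One way to keep real credentials out of the configuration files is to pass them in as input variables, sketched here for the local KDC setup. The variable name kdc_admin_password and the provider alias aws.aws_compute are assumptions for illustration.

```hcl
# Hypothetical input variable; supply the value outside of version control,
# for example through the TF_VAR_kdc_admin_password environment variable.
variable "kdc_admin_password" {
  type = string
}

module "my_kerberos_config" {
  source = "./alluxio/kerberos_config/aws"
  providers = {
    aws = aws.aws_compute
  }

  kerberos_local_kdc = {
    enabled            = true
    default_realm      = "ALLUXIO.COM"
    kdc_admin_password = var.kdc_admin_password
  }
}
```

Terraform still records variable values in its state, so the state file should be stored and shared with the same care as the password itself.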