Spark on Kubernetes
This guide describes how to integrate Apache Spark with Alluxio in a Kubernetes environment.
Applications using Spark 1.1 or later can access Alluxio through its HDFS-compatible interface. With Alluxio as the data access layer, Spark applications can transparently access data in many different types of persistent storage services. Data can be actively fetched or transparently cached into Alluxio to speed up I/O, especially when the Spark deployment is remote from the data. In addition, Alluxio helps simplify the architecture by decoupling compute from physical storage. When the data path in the persistent under storage is hidden from Spark, changes to the under storage can be made independently of application logic; meanwhile, as a near-compute cache, Alluxio can still provide data locality to compute frameworks.
Prerequisites
This guide assumes that the Alluxio cluster is deployed on Kubernetes.
docker is also required to build the custom Spark image.
Prepare image
To integrate with Spark, Alluxio jars and configuration files must be added within the Spark image. Spark containers need to be launched with this modified image in order to connect to the Alluxio cluster.
Among the files listed in the Alluxio installation instructions to download, locate the tarball named alluxio-enterprise-DA-3.2-8.0.0-release.tar.gz.
Extract the following Alluxio jars from the tarball:
- client/alluxio-DA-3.2-8.0.0-client.jar
- client/ufs/alluxio-underfs-s3a-shaded-DA-3.2-8.0.0.jar, if using an S3 bucket as a UFS
Prepare an empty directory as the working directory to build an image from.
Within this directory, create the directory files/alluxio/
and copy the aforementioned jar files into it.
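For example, the layout can be prepared as follows (the directory name spark-image-build is an arbitrary choice used for illustration):

```shell
# Create an empty working directory with the expected layout
mkdir -p spark-image-build/files/alluxio
# The two jars extracted from the release tarball go into
# spark-image-build/files/alluxio/ before building the image
ls -R spark-image-build
```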
Create a Dockerfile with the operations to modify the base Spark image.
The following example defines arguments for:
- SPARK_VERSION=3.5.2 as the Spark version
- UFS_JAR=files/alluxio/alluxio-underfs-s3a-shaded-DA-3.2-8.0.0.jar as the path to the UFS jar copied into files/alluxio/
- CLIENT_JAR=files/alluxio/alluxio-DA-3.2-8.0.0-client.jar as the path to the Alluxio client jar copied into files/alluxio/
# Use the official Spark 3.5.x (or above) image as the base image
ARG SPARK_VERSION=3.5.2
ARG IMAGE=apache/spark:${SPARK_VERSION}
FROM $IMAGE
ARG SPARK_UID=185
ARG UFS_JAR=files/alluxio/alluxio-underfs-s3a-shaded-DA-3.2-8.0.0.jar
ARG CLIENT_JAR=files/alluxio/alluxio-DA-3.2-8.0.0-client.jar
USER root
# Create the /opt/alluxio directory
RUN mkdir -p /opt/alluxio && \
mkdir -p /opt/alluxio/lib
# Copy the Alluxio UFS and client jars into the image
COPY ${UFS_JAR} /opt/alluxio/lib/
COPY ${CLIENT_JAR} /opt/alluxio/
ENV SPARK_HOME=/opt/spark
WORKDIR /opt/spark/work-dir
RUN chmod g+w /opt/spark/work-dir
RUN chmod a+x /opt/decom.sh
ENTRYPOINT [ "/opt/entrypoint.sh" ]
USER ${SPARK_UID}
Build the image, replacing <PRIVATE_REGISTRY> with the URL of your private container registry and <SPARK_VERSION> with the corresponding Spark version.
In the following examples, we will continue to use 3.5.2 as the Spark version indicated by <SPARK_VERSION>.
$ docker build --platform linux/amd64 -t <PRIVATE_REGISTRY>/alluxio/spark:<SPARK_VERSION> .
Push the image by running:
$ docker push <PRIVATE_REGISTRY>/alluxio/spark:<SPARK_VERSION>
Deploy Spark
There are a few things we need to do before submitting a Spark job:
- Install Spark operator
- Set Alluxio config map
- Create a service account for Alluxio (if you are using IAM)
- Add additional parameters in the Spark job
Install Spark Operator
If you are using aws-samples/emr-on-eks-benchmark to create the EKS cluster, the spark-operator will be installed in the scripts, so you do not need to install it again.
The following instructions are derived from the spark-operator getting started guide.
Add the spark-operator repo in Helm.
$ helm repo add spark-operator https://kubeflow.github.io/spark-operator
$ helm repo update
To add custom configurations, you can create a spark-operator.yaml file.
For example, the following sets the job namespace to spark (not required, but we will use this as an example):
spark:
  jobNamespaces:
    - spark
Install the spark operator with those configurations by running the command:
$ helm install spark-operator spark-operator/spark-operator -f spark-operator.yaml \
--namespace spark-operator \
--create-namespace \
--set webhook.enable=true \
--set rbac.create=true
The webhook.enable setting is needed to mount configmaps for Alluxio.
Check the status of the Spark operator. If the status is Running, it is ready for jobs to be submitted.
$ kubectl get pod -n spark-operator
NAME READY STATUS RESTARTS AGE
spark-operator-controller-5db84774f-2d72c 1/1 Running 0 28s
spark-operator-webhook-7c9fb4788-d98fw 1/1 Running 0 28s
When complete with Spark, uninstall the Spark operator and its related components with the command:
$ helm uninstall spark-operator -n spark-operator
Create a ConfigMap for Alluxio
This configmap provides the Alluxio configuration to the Spark jobs, which act as Alluxio clients.
The configmap can be created from the alluxio-site.properties of the existing Alluxio cluster config map built by the Alluxio operator.
To show alluxio-site.properties from the Alluxio cluster config map, run:
$ kubectl get configmap <ALLUXIO_NAMESPACE>-<ALLUXIO_CLUSTER_NAME>-conf -o yaml
If following the Install Alluxio on Kubernetes instructions, the value of <ALLUXIO_NAMESPACE>-<ALLUXIO_CLUSTER_NAME>-conf would be default-alluxio-conf.
Running the following command will generate an alluxio-config.yaml file:
$ cat <<EOF > alluxio-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alluxio-config
  namespace: spark
data:
  alluxio-site.properties: |-
$( kubectl -n ${alluxioNamespace} get configmap ${alluxioNamespace}-${alluxioClusterName}-conf -o json \
  | jq -r '.data["alluxio-site.properties"]' \
  | sed -e 's/^/    /' )
EOF
Note:
- ${alluxioNamespace} and ${alluxioClusterName} should match the values for the existing Alluxio cluster. If you followed the Install Alluxio on Kubernetes instructions, they would be default and alluxio respectively.
- The jq command is used to parse JSON
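As an illustration of what the jq and sed steps do, here is the same pipeline applied to a minimal, made-up configmap JSON (the property values are sample data only, not your cluster's actual configuration):

```shell
# Parse the JSON, pull out the alluxio-site.properties value,
# and indent every line so it nests under the YAML block scalar
printf '%s' '{"data":{"alluxio-site.properties":"alluxio.cluster.name=default-alluxio\nalluxio.mount.table.source=ETCD"}}' \
  | jq -r '.data["alluxio-site.properties"]' \
  | sed -e 's/^/    /'
```

This prints the two sample properties, each indented by four spaces, ready to sit under the alluxio-site.properties: |- key in the generated YAML.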
Create the configmap by running the command:
$ kubectl -n spark apply -f alluxio-config.yaml
Create a Service Account for Alluxio
An Alluxio service account is used if you are using IAM for authentication/authorization.
Create a spark-s3-access-sa.yaml file with the following contents:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: alluxio-s3-access
  namespace: spark
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<YOUR_AWS_ACCOUNT_ID>:role/S3EKSServiceAccountRole
where <YOUR_AWS_ACCOUNT_ID> should be replaced with your AWS account ID.
Create the service account with the command:
$ kubectl -n spark apply -f spark-s3-access-sa.yaml
Provide Alluxio properties as Spark configuration values for job submission
For the Spark cluster to properly communicate with the Alluxio cluster, certain properties must be aligned between the Alluxio client and Alluxio server.
In particular, the values for hadoopConf should be set to match the values of your Alluxio deployment.
Take note of the following properties under alluxio-site.properties of the previously created alluxio-config.yaml file:
alluxio.etcd.endpoints
alluxio.cluster.name
alluxio.k8s.env.deployment
alluxio.mount.table.source
alluxio.worker.membership.manager.type
Add the following to your Spark application yaml file for job submission; see the next section for a full example of alluxio-sparkApplication.yaml.
sparkConf:
  spark.hadoop.fs.s3a.aws.credentials.provider: com.amazonaws.auth.DefaultAWSCredentialsProviderChain
  spark.hadoop.fs.s3a.access.key: <YOUR_S3_ACCESS_KEY>
  spark.hadoop.fs.s3a.secret.key: <YOUR_S3_SECRET_KEY>
  spark.hadoop.fs.s3a.impl: alluxio.hadoop.FileSystem
  spark.driver.extraClassPath: /opt/alluxio/alluxio-DA-3.2-8.0.0-client.jar:/opt/alluxio/conf
  spark.executor.extraClassPath: /opt/alluxio/alluxio-DA-3.2-8.0.0-client.jar:/opt/alluxio/conf
hadoopConf:
  alluxio.etcd.endpoints: http://alluxio-etcd.default:2379
  alluxio.cluster.name: default-alluxio
  alluxio.k8s.env.deployment: true
  alluxio.mount.table.source: ETCD
  alluxio.worker.membership.manager.type: ETCD
The above example assumes Alluxio was deployed following the Install Alluxio on Kubernetes instructions.
Examples
Using Spark to read and write a file
This section provides an example of how to use Spark to read and write a file. In this simple example, we will read a text file from S3, transform its contents, and write the result back to S3.
Create a Scala file
We need to write a Scala file, package it into a JAR, and upload the JAR to S3.
First create a folder that contains a build.sbt file with the following contents:
scalaVersion := "2.12.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.3.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.1"
Create a Scala file spark-scala-demo.scala with the following example content, which will be used to generate the Spark jar:
import org.apache.spark.sql.SparkSession

object S3WordCount {
  def main(args: Array[String]): Unit = {
    // create SparkSession; the master URL is supplied by the Spark operator at submission
    val spark = SparkSession.builder
      .appName("AlluxioSparkExample")
      .getOrCreate()
    // create sparkContext
    val sc = spark.sparkContext
    val inputPath = "s3a://<MY_BUCKET>/test-1/file.txt"
    val outputPath = "s3a://<MY_BUCKET>/test-1/Output"
    // read file
    val s = sc.textFile(inputPath)
    // print file
    println("Original File Content:")
    s.collect().foreach(println)
    // duplicate each line
    val doubled = s.map(line => line + line)
    // save result
    doubled.saveAsTextFile(outputPath)
    // stop
    spark.stop()
  }
}
Update inputPath to the S3 path where you put your input file, and outputPath to the S3 path where you want the output written. Both must be accessible with the provided credentials.
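The map step concatenates each line with itself. As a quick local sketch of that transformation (using awk here rather than Spark):

```shell
# Each output line is the corresponding input line repeated twice,
# mirroring the s.map(line => line + line) step in the Spark job
printf 'hello\nworld\n' | awk '{ print $0 $0 }'
# prints:
# hellohello
# worldworld
```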
Use the sbt tool to build the JAR from the folder containing the Scala file. If sbt is not already installed, install it first (e.g. with brew install sbt on macOS).
$ sbt package
Find the generated JAR in the ./target/scala-2.12/ directory and note its name (i.e. <SPARK_JOB_JAR_FILE>.jar). Upload it to S3:
$ aws s3 cp ./target/scala-2.12/<SPARK_JOB_JAR_FILE>.jar s3://<BUCKET_NAME>/<S3_PATH>/alluxioread_2.12-0.1.jar
replacing <BUCKET_NAME>/<S3_PATH> with an accessible S3 location.
Create Spark application
Create an alluxio-sparkApplication.yaml file with the following example content:
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: alluxio-word-count
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: <PRIVATE_REGISTRY>/alluxio/spark:<SPARK_VERSION>
  imagePullPolicy: Always
  mainClass: S3WordCount
  mainApplicationFile: s3a://<BUCKET_NAME>/<S3_PATH>/alluxioread_2.12-0.1.jar
  sparkVersion: 3.5.2
  driver:
    javaOptions: "-Divy.cache.dir=/tmp -Divy.home=/tmp -Daws.accessKeyId=<YOUR_S3_ACCESS_KEY> -Daws.secretKey=<YOUR_S3_SECRET_KEY>"
    labels:
      version: 3.5.2
    cores: 1
    coreLimit: 1200m
    memory: 512m
    serviceAccount: alluxio-s3-access
    configMaps:
      - name: alluxio-config
        path: /opt/alluxio/conf/
  executor:
    labels:
      version: 3.5.2
    cores: 1
    coreLimit: 1200m
    memory: 512m
    configMaps:
      - name: alluxio-config
        path: /opt/alluxio/conf/
  deps:
    repositories:
      - https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
    packages:
      - org.apache.hadoop:hadoop-aws:3.2.2
  sparkConf:
    spark.hadoop.fs.s3a.aws.credentials.provider: com.amazonaws.auth.DefaultAWSCredentialsProviderChain
    spark.hadoop.fs.s3a.access.key: <YOUR_S3_ACCESS_KEY>
    spark.hadoop.fs.s3a.secret.key: <YOUR_S3_SECRET_KEY>
    spark.hadoop.fs.s3a.impl: alluxio.hadoop.FileSystem
    spark.driver.extraClassPath: /opt/alluxio/alluxio-DA-3.2-8.0.0-client.jar:/opt/alluxio/conf
    spark.executor.extraClassPath: /opt/alluxio/alluxio-DA-3.2-8.0.0-client.jar:/opt/alluxio/conf
  hadoopConf:
    alluxio.etcd.endpoints: http://alluxio-etcd.default:2379
    alluxio.mount.table.source: ETCD
    alluxio.worker.membership.manager.type: ETCD
    alluxio.k8s.env.deployment: true
    alluxio.cluster.name: default-alluxio
Note the following customizations:
- Under spec.image, specify the location of the custom Spark image
- Set the S3 path to the uploaded jar in spec.mainApplicationFile in place of s3a://<BUCKET_NAME>/<S3_PATH>/alluxioread_2.12-0.1.jar
- Set the access credentials to S3 in the following locations:
  - javaOptions for both the driver and executor
  - As spark.hadoop.fs.s3a.* properties in sparkConf
- Alluxio specific configurations for sparkConf and hadoopConf, as previously described in provide Alluxio properties to Spark
Deploy the Spark application with the command:
$ kubectl create -n spark -f alluxio-sparkApplication.yaml