Running Spark on Alluxio in DC/OS
This guide describes how to run Apache Spark on Alluxio in a DC/OS environment with framework v2.3.0-2.1.0-1.0.
Prerequisites
- A DC/OS cluster
- A Docker registry on the DC/OS cluster
Build Spark Docker Image
The DC/OS CLI subcommand dcos alluxio-enterprise can be used to build a custom Spark image with the Alluxio client jar and the client-side configuration required to connect to an Alluxio cluster deployed on DC/OS.
dcos alluxio-enterprise plan start build-spark-client -p DOCKER_PUBLISH_URL=<registry-host>:<registry-port> -p DOCKER_SPARK_CLIENT_BASE=mesosphere/spark:1.0.6-2.0.2-hadoop-2.6 -p DOCKER_SPARK_DIST_HOME=/opt/spark/dist
Parameters:
- DOCKER_PUBLISH_URL: Docker registry URL to which the built custom Spark image is pushed.
- DOCKER_SPARK_CLIENT_BASE: The Docker image containing the desired Spark version.
- DOCKER_SPARK_DIST_HOME: Path to the Spark home directory within the Docker image. The Alluxio client jar and configuration will be copied into this location.
Run a Spark Job on Alluxio
SSH into a DC/OS node and run the Spark Docker image.
$ sudo docker run -it --net=host <registry-host>:<registry-port>/alluxio/spark-aee /bin/bash
From within the Docker container, run the Spark shell.
$ ./bin/spark-shell --master mesos://master.mesos:5050 --conf "spark.mesos.executor.docker.image=registry.marathon.l4lb.thisdcos.directory:5000/alluxio/spark-aee" --conf "spark.mesos.executor.docker.forcePullImage=false" --conf "spark.scheduler.minRegisteredResourcesRatio=1" --conf "spark.scheduler.maxRegisteredResourcesWaitingTime=5s" --conf "spark.driver.extraClassPath=/opt/spark/dist/jars/alluxio-enterprise-2.1.0-1.0-client.jar" --conf "spark.executor.extraClassPath=/opt/spark/dist/jars/alluxio-enterprise-2.1.0-1.0-client.jar" --executor-memory 1G
Note: For locality, ensure that Spark executors are registered on every node running an Alluxio worker before running a job. You can check the status by visiting the Mesos Web UI at <DC/OS DnsAddress>/mesos. Once the cluster is ready for use, you will see the task status change from Staging to Running for each Spark executor.
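Alternatively, as a quick sanity check from within the Spark shell itself (a sketch, not part of the service documentation), you can print the block managers the driver currently knows about; each executor appears as a host:port entry, and the driver is listed as well.
scala> // Sketch: list the block managers the driver sees (executors plus the driver itself)
scala> sc.getExecutorMemoryStatus.keys.foreach(println)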
Change path/to/file to the path of the file you want to run the Spark count job on.
scala> sc.setLogLevel("INFO")
scala> val file = sc.textFile("alluxio://master-0-node.alluxio-enterprise.mesos:19998/path/to/file")
scala> file.count()
Run the job again to see performance benefits with Alluxio.
Note that a Spark locality level of NODE_LOCAL indicates that locality was achieved.
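As a further illustration beyond the original walkthrough, results can also be written back through the same alluxio:// scheme. The snippet below is a sketch; the output path is a placeholder and must not already exist.
scala> // Sketch: word-count the file and write the result back to Alluxio (placeholder output path)
scala> val counts = file.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.saveAsTextFile("alluxio://master-0-node.alluxio-enterprise.mesos:19998/path/to/output")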