Alluxio Community Day VIII

Join us at our next virtual community event on December 14th featuring fellow Alluxio community users from Apache Iceberg and WeRide.

Deploy Alluxio on Kubernetes

Slack Docker Pulls GitHub edit source

Alluxio can be run on Kubernetes. This guide demonstrates how to run Alluxio on Kubernetes using the specification included in the Alluxio Docker image or helm.

Prerequisites

  • A Kubernetes cluster (version 1.11+) with Beta feature gate APIs enabled
    • The Alluxio Helm chart which the Kubernetes resource specifications are built from supports Kubernetes version 1.11+.
    • Beta feature gates are enabled by default for Kubernetes cluster installations
  • Cluster access to an Alluxio Docker image alluxio/alluxio. If using a private Docker registry, refer to the Kubernetes private image registry documentation.
  • Ensure the cluster’s Kubernetes Network Policy allows for connectivity between applications (Alluxio clients) and the Alluxio Pods on the defined ports.

Basic Setup

This tutorial walks through a basic Alluxio setup on Kubernetes. Alluxio supports two methods of installation on Kubernetes: either using helm charts or using kubectl. When available, helm is the preferred way to install Alluxio. If helm is not available or if additional deployment customization is desired, kubectl can be used directly using native Kubernetes resource specifications.

Note: From Alluxio 2.3 on, Alluxio only supports helm 3. See how to migrate from helm 2 to 3 here.

The Alluxio Helm chart source code is located here. Alternatively, you can extract the Helm chart directory from the Alluxio Docker image:

$ id=$(docker create alluxio/alluxio:2.8.0-SNAPSHOT)
$ docker cp $id:/opt/alluxio/integration/kubernetes/ - > kubernetes.tar
$ docker rm -v $id 1>/dev/null
$ tar -xvf kubernetes.tar
$ cd kubernetes/helm-chart/alluxio

Depending on the configuration used to deploy Alluxio, you will likely need to provision Persistent Volumes.

  • Embedded Journal requires a Persistent Volume for each master Pod to be provisioned and is the preferred HA mechanism for Alluxio on Kubernetes. The volume, once claimed, is persisted across restarts of the master process.
  • When using the UFS Journal an Alluxio master can also be configured to use a persistent volume for storing the journal. If Alluxio is configured to use a UFS journal and with an external journal location like HDFS, the rest of this section can be skipped.
  • When Alluxio workers have short-circuit access, you may need to use Volumes to mount the domain socket to the workers.

There are multiple ways to create a PersistentVolume. This is an example which defines one with hostPath for the Alluxio Master journal:

# Name the file alluxio-master-journal-pv.yaml
kind: PersistentVolume
apiVersion: v1
metadata:
  name: alluxio-journal-0
  labels:
    type: local
spec:
  storageClassName: standard
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /tmp/alluxio-journal-0

Note:

  • By default each journal volume should be at least 1Gi, because each Alluxio master Pod will have one PersistentVolumeClaim that requests for 1Gi storage. You will see how to configure the journal size in later sections.
  • If this hostPath is not already present on the host, Kubernetes can be configured to create it. However the assigned user:group permissions may prevent the Alluxio masters & workers from accessing it. Please ensure the permissions are set to allow the pods to access the directory.
    • See the Kubernetes volume docs for more details
    • From Alluxio v2.1 on, Alluxio Docker containers will run as non-root user alluxio with UID 1000 and GID 1000 by default.

Then create the persistent volume with kubectl:

$ kubectl create -f alluxio-master-journal-pv.yaml

Deploy


Verify

If using persistent volumes, the status of the volume(s) should change to CLAIMED and the status of the volume claims should be BOUNDED. You can validate the status of your PersistentVolume and PersistentVolumeClaims using the follow kubectl commands:

$ kubectl get pv
$ kubectl get pvc
  • If you have unbound PersistentVolumeClaims, please ensure you have provisioned matching PersistentVolumes. See “(Optional) Provision a Persistent Volume” in Basic Setup.

Once ready, access the Alluxio CLI from the master Pod and run basic I/O tests.

$ kubectl exec -ti alluxio-master-0 /bin/bash

From the master Pod, execute the following:

$ alluxio runTests

Access the Web UI

The Alluxio UI can be accessed from outside the kubernetes cluster using port forwarding.

$ kubectl port-forward alluxio-master-$i <local-port>:19999

The command above allocates a port on the local node <local-port> and forwards traffic on <local-port> to port 19999 of pod alluxio-master-$i. The pod alluxio-master-$i does NOT have to be on the node you are running this command.

Note: i=0 for the the first master Pod. When running multiple masters, forward port for each master. Only the primary master serves the Web UI.

For example, you are on a node with hostname master-node-1 and you would like to serve the Alluxio master web UI for alluxio-master-0 on master-node-1:8080. Here’s the command you can run:

[alice@master-node-1 ~]$ kubectl port-forward --address 0.0.0.0 pods/alluxio-master-0 8080:19999

This forwards the local port master-node-1:8080 to the port on the Pod alluxio-master-0:19999. The Pod alluxio-master-0 does NOT need to be running on master-node-1.

You will see messages like below when there are incoming connections.

[alice@master-node-1 ~]$ kubectl port-forward --address 0.0.0.0 alluxio-master-0 8080:19999
Forwarding from 0.0.0.0:8080 -> 19999
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080

You can terminate the process to stop the port forwarding, with either Ctrl + C or kill.

For more information about K8s port-forward see the K8s doc.

Advanced Setup

Enable Short-circuit Access

Short-circuit access enables clients to perform read and write operations directly against the worker bypassing the networking interface. For performance-critical applications it is recommended to enable short-circuit operations against Alluxio because it can increase a client’s read and write throughput when co-located with an Alluxio worker.

This feature is enabled by default (see next section to disable this feature), however requires extra configuration to work properly in Kubernetes environments.

There are two modes for using short-circuit.

Option1: Use local mode

In this mode, the Alluxio client and local Alluxio worker recognize each other if the client hostname matches the worker hostname. This is called Hostname Introspection. In this mode, the Alluxio client and local Alluxio worker share the tiered storage of Alluxio worker.


Option2: Use uuid (default)

This is the default policy used for short-circuit in Kubernetes.

If the client or worker container is using virtual networking, their hostnames may not match. In such a scenario, set the following property to use filesystem inspection to enable short-circuit operations and make sure the client container mounts the directory specified as the domain socket path. Short-circuit writes are then enabled if the worker UUID is located on the client filesystem.

Domain Socket Path. The domain socket is a volume which should be mounted on:

  • All Alluxio workers
  • All application containers which intend to read/write through Alluxio

This domain socket volume can be either a PersistentVolumeClaim or a hostPath Volume.

Use PersistentVolumeClaim. By default, this domain socket volume is a PersistentVolumeClaim. You need to provision a PersistentVolume to this PersistentVolumeClaim. And this PersistentVolume should be either local or hostPath.


Use hostPath Volume. You can also directly define the workers to use a hostPath Volume for domain socket.


Verify Short-circuit Operations

To verify short-circuit reads and writes monitor the metrics displayed under:

  1. the metrics tab of the web UI as Domain Socket Alluxio Read and Domain Socket Alluxio Write
  2. or, the metrics json as cluster.BytesReadDomain and cluster.BytesWrittenDomain
  3. or, the fsadmin metrics CLI as Short-circuit Read (Domain Socket) and Alluxio Write (Domain Socket)

Disable Short-Circuit Operations

To disable short-circuit operations, the operation depends on how you deploy Alluxio.

Note: As mentioned, disabling short-circuit access for Alluxio workers will result in worse I/O throughput


Enable remote logging

Alluxio supports a centralized log server that collects logs for all Alluxio processes. You can find the specific section at Remote logging. This can be enabled on K8s too, so that all Alluxio pods will send logs to this log server.


Verify log server

You can go into the log server pod and verify the logs exist.

$ kubectl exec -it <logserver-pod-name> bash
# In the logserver pod
bash-4.4$ pwd
/opt/alluxio
# You should see logs collected from other Alluxio pods
bash-4.4$ ls -al logs
total 16
drwxrwsr-x    4 1001     bin           4096 Jan 12 03:14 .
drwxr-xr-x    1 alluxio  alluxio         18 Jan 12 02:38 ..
drwxr-sr-x    2 alluxio  bin           4096 Jan 12 03:14 job_master
-rw-r--r--    1 alluxio  bin            600 Jan 12 03:14 logserver.log
drwxr-sr-x    2 alluxio  bin           4096 Jan 12 03:14 master
drwxr-sr-x    2 alluxio  bin           4096 Jan 12 03:14 worker
drwxr-sr-x    2 alluxio  bin           4096 Jan 12 03:14 job_worker

POSIX API

Once Alluxio is deployed on Kubernetes, there are multiple ways in which a client application can connect to it. For applications using the POSIX API, application containers can simply mount the Alluxio FileSystem.

In order to use the POSIX API, first deploy the Alluxio FUSE daemon.


CSI

Other than using Alluxio FUSE daemon, you could also use CSI to mount the Alluxio FileSystem into application containers.

In order to use CSI, you need a Kubernetes cluster with version at least 1.17, with RBAC enabled in API Server.

Step 1: Customize configurations and generate templates

You can either use the default CSI configurations provided in here under the csi section, or you can customize them to make it suitable for your workload. Here are some common properties that you can customize:

property name Description
alluxioPath The path in Alluxio which will be mounted
mountPath The path that Alluxio will be mounted to in the application container
javaOptions The customized options which will be passes to fuse daemon
mountOptions Alluxio Fuse mount options

Then please use helm-generate.sh (see here for usage) to generate related templates. All CSI related templates will be under ${ALLUXIO_HOME}/integration/kubernetes/<deploy-mode>/csi.

Step 2: Deploy CSI services

Modify or add any configuration properties as required, then create the respective resources.

$ mv alluxio-csi-controller-rbac.yaml.template alluxio-csi-controller-rbac.yaml
$ mv alluxio-csi-controller.yaml.template alluxio-csi-controller.yaml
$ mv alluxio-csi-driver.yaml.template alluxio-csi-driver.yaml
$ mv alluxio-csi-nodeplugin.yaml.template alluxio-csi-nodeplugin.yaml

Then run

$ kubectl apply -f alluxio-csi-controller-rbac.yaml -f alluxio-csi-controller.yaml -f alluxio-csi-driver.yaml -f alluxio-csi-nodeplugin.yaml

to deploy CSI-related services.

Step 3: Provisioning

We provide both templates for k8s dynamic provisioning and static provisioning. Please choose the suitable provisioning methods according to your use case. You can refer to Persistent Volumes | Kubernetes and Dynamic Volume Provisioning | Kubernetes to get more details.


Step 4: Deploy applications

Now you can put the PVC name in your application pod spec to use the Alluxio FileSystem. The template alluxio-nginx-pod.yaml.template shows how to use PVC in the pod. You can also deploy it by running

$ mv alluxio-nginx-pod.yaml.templte alluxio-nginx-pod.yaml
$ kubectl apply -f alluxio-nginx-pod.yaml

to validate that CSI has been deployed, and you can successfully access the data stored in Alluxio.

For more information on how to configure a pod to use a persistent volume for storage in Kubernetes, please refer to here.

Toggle Master or Worker in Helm chart

In use cases where you wish to install Alluxio masters and workers separately with the Helm chart, use the following respective toggles:

master:
  enabled: false

worker:
  enabled: false

Kubernetes Configuration Options

The following options are provided in our Helm chart as additional parameters for experienced Kubernetes users.

ServiceAccounts

By default Kubernetes will assign the namespace’s default ServiceAccount to new pods in a namespace. You may specify for Alluxio pods to use any existing ServiceAccounts you may have in your cluster through the following:


Node Selectors & Tolerations

Kubernetes provides many options to control the scheduling of pods onto nodes in the cluster. The most direct of which is a node selector.

However, Kubernetes will avoid scheduling pods on any tainted nodes. To allow certain pods to schedule on such nodes, Kubernetes allows you to specify tolerations for those taints. See the Kubernetes documentation on taints and tolerations for more details.


Host Aliases

If you wish to add or override hostname resolution in the pods, Kubernetes exposes the containers’ /etc/hosts file via host aliases. This can be particularly useful for providing hostname addresses for services not managed by Kubernetes, like HDFS.


Deployment Strategy

By default Kubernetes will use the ‘RollingUpdate’ deployment strategy to progressively upgrade Pods when changes are detected.


ImagePullSecrets

Kubernetes supports accessing images from a Private Registry. After creating the registry credentials Secret in Kubernetes, you pass the secret to your Pods via imagePullSecrets.


Troubleshooting

Alluxio workers use host networking with the physical host IP as the hostname. Check the cluster firewall if an error such as the following is encountered:

Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Host is unreachable: <host>/<IP>:29999
  • Check that <host> matches the physical host address and is not a virtual container hostname. Ping from a remote client to check the address is resolvable.
    $ ping <host>
    
  • Verify that a client can connect to the workers on the ports specified in the worker deployment specification. The default ports are [29998, 29999, 29996, 30001, 30002, 30003]. Check access to the given port from a remote client using a network utility such as ncat:
    $ nc -zv <IP> 29999
    

From Alluxio v2.1 on, Alluxio Docker containers will run as non-root user alluxio with UID 1000 and GID 1000 by default. Kubernetes hostPath volumes are only writable by root so you need to update the permission accordingly.

To change the log level for Alluxio servers (master and workers), use the CLI command logLevel as follows:

Access the Alluxio CLI from the master Pod.

$ kubectl exec -ti alluxio-master-0 /bin/bash

From the master Pod, execute the following:

$ alluxio logLevel --level DEBUG --logName alluxio

The Alluxio master and job master run as separate containers of the master Pod. Similarly, the Alluxio worker and job worker run as separate containers of a worker Pod. Logs can be accessed for the individual containers as follows.

Master:

$ kubectl logs -f alluxio-master-0 -c alluxio-master

Worker:

$ kubectl logs -f alluxio-worker-<id> -c alluxio-worker

Job Master:

$ kubectl logs -f alluxio-master-0 -c alluxio-job-master

Job Worker:

$ kubectl logs -f alluxio-worker-<id> -c alluxio-job-worker

In order for an application container to mount the hostPath volume, the node running the container must have the Alluxio FUSE daemon running. The default spec alluxio-fuse.yaml runs as a DaemonSet, launching an Alluxio FUSE daemon on each node of the cluster.

If there are issues accessing Alluxio using the POSIX API:

  1. Identify the node the application container ran on using the command kubectl describe pods or the dashboard.
  2. Use the command kubectl describe nodes <node> to identify the alluxio-fuse Pod running on that node.
  3. Tail logs for the identified Pod to view any errors encountered: kubectl logs -f alluxio-fuse-<id>.

Alluxio workers create a domain socket used for short-circuit access by default. On Mac OS X, Alluxio workers may fail to start if the location for this domain socket is a path which is longer than what the filesystem accepts.

2020-07-27 21:39:06,030 ERROR GrpcDataServer - Alluxio worker gRPC server failed to start on /opt/domain/1d6d7c85-dee0-4ac5-bbd1-86eb496a2a50
java.io.IOException: Failed to bind
	at io.grpc.netty.NettyServer.start(NettyServer.java:252)
	at io.grpc.internal.ServerImpl.start(ServerImpl.java:184)
	at io.grpc.internal.ServerImpl.start(ServerImpl.java:90)
	at alluxio.grpc.GrpcServer.lambda$start$0(GrpcServer.java:77)
	at alluxio.retry.RetryUtils.retry(RetryUtils.java:39)
	at alluxio.grpc.GrpcServer.start(GrpcServer.java:77)
	at alluxio.worker.grpc.GrpcDataServer.<init>(GrpcDataServer.java:107)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at alluxio.util.CommonUtils.createNewClassInstance(CommonUtils.java:273)
	at alluxio.worker.DataServer$Factory.create(DataServer.java:47)
	at alluxio.worker.AlluxioWorkerProcess.<init>(AlluxioWorkerProcess.java:162)
	at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:46)
	at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:38)
	at alluxio.worker.AlluxioWorker.main(AlluxioWorker.java:72)
Caused by: io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Filename too long

If this is the case, set the following properties to limit the path length:

  • alluxio.worker.data.server.domain.socket.as.uuid=false
  • alluxio.worker.data.server.domain.socket.address=/opt/domain/d

Note: You may see performance degradation due to lack of node locality.

This is most likely caused due to the Kubernetes configured Pod resource limits having the limits.memory set too low.

Firstly, double check the configured values for your Alluxio worker Pod limits.memory. Note that the Pod consists of two containers, each with their own resource limits.

Check the configured resource requests and limits using kubectl describe pod, kubectl get pod, or equivalent Kube API requests. eg.,

$ kubectl get po -o json alluxio-worker-xxxxx | jq '.spec.containers[].resources'
{
  "limits": {
    "cpu": "4",
    "memory": "4G"
  },
  "requests": {
    "cpu": "1",
    "memory": "2G"
  }
}
{
  "limits": {
    "cpu": "4",
    "memory": "4G"
  },
  "requests": {
    "cpu": "1",
    "memory": "1G"
  }
}

If you used the Helm chart, the default values are:

worker:
  resources:
    limits:
      cpu: "4"
      memory: "4G"
    requests:
      cpu: "1"
      memory: "2G"

jobWorker:
  resources:
    limits:
      cpu: "4"
      memory: "4G"
    requests:
      cpu: "1"
      memory: "1G"
  • Even if you did not configure any values with Helm, you may still have resource limits in place due to a LimitRange applied to your namespace

Next, ensure that the nodes that the Alluxio worker pods are running on have sufficient resources matching your configured values. You can check that the nodes you intend to schedule Alluxio worker Pods on have sufficient resources to meet your requests using kubectl describe node, kubectl get node, or equivalent Kube API requests. eg.,

$ kubectl get no -o json k8sworkernode-0 | jq '.status.allocatable'
{
  "cpu": "8",
  "ephemeral-storage": "123684658586",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "64886128Ki",
  "pods": "110"
}

Isolating Alluxio worker Pods from other Pods in your Kubernetes cluster can be accomplished with the help of node selectors and node taints + tolerations.

  • Keep in mind that the Alluxio worker Pod definition uses a DaemonSet, so there will be worker Pods assigned to all eligible nodes

Next, verify the Alluxio workers’ configured ramdisk sizes (if any). See the list of Alluxio configuration properties for additional details.

  • If you used the Helm chart, the Alluxio site properties are configured using properties. eg.,
properties:
  alluxio.worker.ramdisk.size: 2G
  alluxio.worker.tieredstore.levels: 1
  alluxio.worker.tieredstore.level0.alias: MEM
  alluxio.worker.tieredstore.level0.dirs.mediumtype: MEM
  alluxio.worker.tieredstore.level0.dirs.path: /dev/shm
  alluxio.worker.tieredstore.level0.dirs.quota: 2G
  • Otherwise, you can view and modify the site properties in the alluxio-config ConfigMap. eg.,
$ kubectl get cm -o json alluxio-config | jq '.data.ALLUXIO_WORKER_JAVA_OPTS'
"-Dalluxio.worker.ramdisk.size=2G
-Dalluxio.worker.tieredstore.levels=1
-Dalluxio.worker.tieredstore.level0.alias=MEM
-Dalluxio.worker.tieredstore.level0.dirs.mediumtype=MEM
-Dalluxio.worker.tieredstore.level0.dirs.path=/dev/shm
-Dalluxio.worker.tieredstore.level0.dirs.quota=2G "

NOTE: Our DaemonSet uses emptyDir volumes as the Alluxio worker Pod’s ramdisk in Kubernetes.

spec:
  template:
    spec:
      volumes:
        - name: mem
          emptyDir:
            medium: "Memory"
            sizeLimit: 1G

This results in the following nuances:

  • sizeLimit has no effect on the size of the allocated ramdisk unless the SizeMemoryBackedVolumes feature gate is enabled (enabled by default as of Kubernetes 1.22).
  • As stated in the Kubernetes emptyDir documentation, if no size is specified then memory-backed emptyDir volumes will have capacity allocated equal to half the available memory on the host node. This capacity is reflected inside of your containers (for example when running df -u). However if the combined size of your ramdisk and container memory usage exceeds the pod’s limits.memory then the Kubernetes scheduler will trigger an OOMKilled on that pod. This is a very likely overlooked source of memory consumption in Alluxio worker Pods.

Lastly, verify the Alluxio worker JVM heap and off-heap maximum capacities. These are configured with the JVM flags -Xmx/-XX:MaxHeapSize and -XX:MaxDirectMemorySize respectively.

To adjust those values, you would have to manually update the (...)_JAVA_OPTS environment variables in the alluxio-config ConfigMap. For example:

apiVersion: v1
kind: ConfigMap
metadata:
  name: alluxio-config
data:
  ALLUXIO_JAVA_OPTS: |-
    -Xmx2g -Dalluxio.master.hostname=alluxio-master-0 ...
  ALLUXIO_MASTER_JAVA_OPTS: |-
    -Dalluxio.master.hostname=${ALLUXIO_MASTER_HOSTNAME}
  ALLUXIO_JOB_MASTER_JAVA_OPTS: |-
    -Dalluxio.master.hostname=${ALLUXIO_MASTER_HOSTNAME}
  ALLUXIO_WORKER_JAVA_OPTS: |-
    -XX:MaxDirectMemorySize=2g -Dalluxio.worker.hostname=${ALLUXIO_WORKER_HOSTNAME} ...
  ALLUXIO_JOB_WORKER_JAVA_OPTS: |-
    -XX:MaxDirectMemorySize=1g -Dalluxio.worker.hostname=${ALLUXIO_WORKER_HOSTNAME} ...
  ALLUXIO_FUSE_JAVA_OPTS: |-
    -Dalluxio.user.hostname=${ALLUXIO_CLIENT_HOSTNAME} -XX:MaxDirectMemorySize=2g
  ALLUXIO_WORKER_TIEREDSTORE_LEVEL0_DIRS_PATH: /dev/shm

Thus to avoid worker Pods running into OOMKilled errors,

  1. Verify that the nodes your Alluxio worker Pods are scheduled on have sufficient memory to satisfy all the limits.memory specifications assigned.
  2. Ensure you have configured alluxio.worker.ramdisk.size and alluxio.worker.tieredstore.level0.dirs.quota low enough such that the memory consumed by the ramdisk combined with the JVM memory options (-Xmx, -XX:MaxDirectMemorySize) do not exceed the Pod’s limits.memory. It is recommended to allow for some overhead as memory may be consumed by other processes as well.

Aside: There is currently an open issue in Alluxio where Alluxio’s interpretation of byte sizes differs from Kubernetes (due to Kubernetes distinguishing between “-bibytes”). This is unlikely to cause OOMKilled errors unless you are operating on very tight memory margins.

It is a known issue that in some early versions of Java 8, the JVM running in a container will determine its heap size(if not specified with -Xmx and -Xms) based on the memory of the physical host instead of the container. In that case, the JVM may attempt to use more memory than the container resource limit and gets killed. You can find more detailed explanations here.

Since Java 8u131, some JVM flags can be turned on in order to correctly read the memory from cgroup. You can refer to our values.yaml from our Helm chart template, and uncomment the below options. These options will be added to the JVM options of all Alluxio containers, including the masters and workers etc. You can find more detailed explanations here.

# Recommended JVM Heap options for running in Docker
# Ref: https://developers.redhat.com/blog/2017/03/14/java-inside-docker/
# These JVM options are common to all Alluxio services
jvmOptions:
  - "-XX:+UnlockExperimentalVMOptions"
  - "-XX:+UseCGroupMemoryLimitForHeap"
  - "-XX:MaxRAMFraction=2"

From Java git 8u191 on, the container support works out-of-the-box. So you don’t need to turn on the flags mentioned above any more.

You should check the Java version in the container you are using to ensure the correct memory limits are respected. Also it is recommended to go to the running container and double check the JVM process is running with the correct memory consumption.