Running Apache Flink on Alluxio
This guide describes how to get Alluxio running with Apache Flink, so that you can easily work with files stored in Alluxio.
Prerequisites
- Set up Java 8 Update 161 or higher (8u161+), 64-bit.
- Alluxio has been set up and is running.
- Flink has been installed and set up.
Configuration
Apache Flink can access Alluxio through its generic file system wrapper for the Hadoop file system. Therefore, most of the Alluxio configuration is done in Hadoop configuration files.
Set property in core-site.xml
If you have a Hadoop setup next to the Flink installation, add the following property to the core-site.xml configuration file:
<property>
<name>fs.alluxio.impl</name>
<value>alluxio.hadoop.FileSystem</value>
</property>
If you do not have a Hadoop setup, create a file called core-site.xml with the following contents:
<configuration>
<property>
<name>fs.alluxio.impl</name>
<value>alluxio.hadoop.FileSystem</value>
</property>
</configuration>
Specify path to core-site.xml in conf/flink-conf.yaml
Next, you have to specify the path to the Hadoop configuration in Flink. Open the conf/flink-conf.yaml file in the Flink root directory and set the fs.hdfs.hadoopconf configuration value to the directory containing core-site.xml. (For newer Hadoop versions, the directory usually ends with etc/hadoop.)
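For example, if the Hadoop configuration directory is /etc/hadoop/conf (this path is only an illustration; adjust it to your environment), the entry in conf/flink-conf.yaml would be:
fs.hdfs.hadoopconf: /etc/hadoop/conf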
Distribute the Alluxio Client Jar
To communicate with Alluxio, Flink programs need the Alluxio core client jar. We recommend downloading the tarball from the Alluxio download page. Alternatively, advanced users can compile the client jar from the source code by following the instructions here. The Alluxio client jar can be found at /<PATH_TO_ALLUXIO>/client/alluxio-2.10.0-SNAPSHOT-client.jar.
We need to make the Alluxio jar file available to Flink, because it contains the configured alluxio.hadoop.FileSystem class. There are different ways to achieve that:
- Put the /<PATH_TO_ALLUXIO>/client/alluxio-2.10.0-SNAPSHOT-client.jar file into the lib directory of Flink (for local and standalone cluster setups); see the example command after this list.
- Put the /<PATH_TO_ALLUXIO>/client/alluxio-2.10.0-SNAPSHOT-client.jar file into the ship directory for Flink on YARN.
- Specify the location of the jar file in the HADOOP_CLASSPATH environment variable (make sure it is available on all cluster nodes as well). For example:
$ export HADOOP_CLASSPATH=/<PATH_TO_ALLUXIO>/client/alluxio-2.10.0-SNAPSHOT-client.jar
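For the first option, copying the client jar into Flink's lib directory might look like the following, where /<PATH_TO_FLINK> is a placeholder for your Flink installation directory:
$ cp /<PATH_TO_ALLUXIO>/client/alluxio-2.10.0-SNAPSHOT-client.jar /<PATH_TO_FLINK>/lib/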
Translate additional Alluxio site properties to Flink
In addition, if there are any client-related properties specified in conf/alluxio-site.properties, translate them to env.java.opts in {FLINK_HOME}/conf/flink-conf.yaml so that Flink picks up the Alluxio configuration. For example, if you want to configure the Alluxio client to use CACHE_THROUGH as the write type, add the following to {FLINK_HOME}/conf/flink-conf.yaml:
env.java.opts: -Dalluxio.user.file.writetype.default=CACHE_THROUGH
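If more than one client property needs to be translated, the additional properties can be appended to the same env.java.opts entry as extra -D flags. A minimal sketch, assuming you also want to set the client read type via alluxio.user.file.readtype.default (shown purely as an illustration):
env.java.opts: -Dalluxio.user.file.writetype.default=CACHE_THROUGH -Dalluxio.user.file.readtype.default=CACHE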
Note: If any Flink clusters are already running, stop and restart them so the configuration changes take effect.
Using Alluxio with Flink
To use Alluxio with Flink, just specify paths with the alluxio:// scheme. If Alluxio is installed locally, a valid path would look like alluxio://localhost:19998/user/hduser/gutenberg.
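Alluxio paths can also be used directly from a Flink program. Below is a minimal sketch in Java using the Flink DataSet API, assuming a local Alluxio master listening on port 19998; the class name AlluxioReadWrite and the concrete paths are only examples.
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class AlluxioReadWrite {
  public static void main(String[] args) throws Exception {
    final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Read a text file that is stored in Alluxio; the path is an example.
    DataSet<String> lines = env.readTextFile("alluxio://localhost:19998/LICENSE");

    // Write the data set back to Alluxio under another (example) path.
    lines.writeAsText("alluxio://localhost:19998/flink-output");

    // Trigger execution of the sink defined above.
    env.execute("Alluxio read/write example");
  }
}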
Wordcount Example
This example assumes you have set up Alluxio and Flink as previously described.
Put the file LICENSE into Alluxio, assuming you are in the top level Alluxio project directory:
$ bin/alluxio fs copyFromLocal LICENSE alluxio://localhost:19998/LICENSE
Run the following command from the top level Flink project directory:
$ bin/flink run examples/batch/WordCount.jar \
--input alluxio://localhost:19998/LICENSE \
--output alluxio://localhost:19998/output
Open your browser and check http://localhost:19999/browse. There should be an output file named output which contains the word counts of the file LICENSE.
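You can also inspect the result from the command line with the Alluxio shell, run from the top level Alluxio project directory, for example:
$ bin/alluxio fs ls alluxio://localhost:19998/output
$ bin/alluxio fs cat alluxio://localhost:19998/output
Note that if the job ran with a parallelism greater than one, output is a directory of part files rather than a single file.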