Alluxio Transformation Service


Data Transformations

Using its job service and catalog service, Alluxio can transform a table into a new table. If the table is not partitioned, the transformation runs at the table level; if the table is partitioned, it runs at the partition level. The data of the original table is not modified; the data of the new table is persisted in a new location managed by Alluxio. Once the transformation is done, Presto users can transparently query the new data.

There are three kinds of supported transformations:

  1. Coalesce files into a specified number of files.
  2. Convert CSV files to Parquet files.
  3. Sort each file by specified columns.

The transformed data is always in Parquet format.

Before running a transformation, you should first attach a database. The following command attaches the “default” database in Hive to Alluxio.

$ ${ALLUXIO_HOME}/bin/alluxio table attachdb hive thrift://localhost:9083 default

The following command coalesces the files under each partition of table “test” into one file.

$ ${ALLUXIO_HOME}/bin/alluxio table transform default test

After running the above command, you’ll see output like:

Started transformation job with job ID 1572296710137, you can monitor the status of the job with './bin/alluxio table transformStatus 1572296710137'.

Now follow the instructions in the output to monitor the status of the transformation:

$ ${ALLUXIO_HOME}/bin/alluxio table transformStatus 1572296710137

It will show the status of the transformation job:

database: default
table: test
transformation: write(hive).option(hive.num.files, 1)
job ID: 1572296710137
job status: COMPLETED

Since the transformation has completed, you can run your Presto queries on the transformed table transparently.

You can find out the location of the transformed data with the following Presto query:

presto:default> select "$path" from test;

You should see output like:

alluxio://localhost:19998/catalog/default/tables/test/_internal_/part=0/20191024-213102-905-R29wf/part-0.parquet

If your queries filter on certain columns, you can speed up query execution by sorting each Parquet file according to the Z-order of those columns. This allows the query engine to skip reading data that falls outside the filtering criteria and spend less time waiting for unnecessary data transfer.
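To build intuition for what Z-ordering does, here is a minimal Python sketch of Morton-order bit interleaving, the idea behind the ZORDER algorithm. This is an illustration only, not Alluxio's implementation; the helper name z_order_key is hypothetical.

```python
def z_order_key(col1: int, col2: int, bits: int = 8) -> int:
    """Interleave the bits of two column values into a single sort key."""
    key = 0
    for i in range(bits):
        key |= ((col1 >> i) & 1) << (2 * i + 1)  # bits of col1 at odd positions
        key |= ((col2 >> i) & 1) << (2 * i)      # bits of col2 at even positions
    return key

# Rows sorted by the interleaved key cluster values that are close in BOTH
# columns, so a filter on either column can skip whole runs of rows.
rows = [(1, 5), (7, 2), (1, 4), (6, 6)]
rows.sort(key=lambda r: z_order_key(r[0], r[1]))
```

Because the key mixes bits from both columns, neither column dominates the sort order the way it would in a plain lexicographic sort, which is why Z-ordering helps filters on either column.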

The following command sorts each file under each partition according to columns col1 and col2, using the ZORDER algorithm.

$ ${ALLUXIO_HOME}/bin/alluxio table transform default test -d "sort(col1, col2).option(algorithm, ZORDER)"

The sorted files are stored in the same way as the coalesced files described above.

You can also combine coalescing and sorting: the original files are coalesced first (and converted to Parquet if they are CSV files), then each coalesced file is sorted into a new file.
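To make the combined pipeline concrete, here is a minimal Python sketch of the "coalesce, then sort" semantics. This is not Alluxio code; the row data is made up, and a simple single-column key stands in for Z-order.

```python
# Several small input files, each a list of (col1, col2) rows.
input_files = [
    [(3, "c"), (1, "a")],   # rows of file 1
    [(2, "b")],             # rows of file 2
]

# Step 1: coalesce -- concatenate all rows into a single output file.
coalesced = [row for f in input_files for row in f]

# Step 2: sort the coalesced file on the chosen column(s).
coalesced.sort(key=lambda row: row[0])

print(coalesced)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```

In the real transformation each step writes Parquet files; the point here is only the ordering of the two steps: sorting happens per coalesced file, after coalescing.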

The following command first coalesces each partition into one file, then sorts the file according to the Z-order of columns col1 and col2.

$ ${ALLUXIO_HOME}/bin/alluxio table transform default test -d "write(hive).option(hive.num.files, 1); sort(col1, col2).option(algorithm, ZORDER)"