Data Lake Connectors


Data lake connectors enable compute engines such as Trino and Spark to query data as structured tables.

The supported connectors include:

Instructions for configuring each of these connectors can be found in the documentation for the respective compute engine.

Known limitations

Iceberg

Because Iceberg maintains table state in metadata files, it is strongly recommended not to cache those files. If metadata files are persisted in the cache, subsequent accesses to the related files may produce errors or warnings.

After determining where the metadata files are located, mark those paths as skipCache using the cache filter feature.
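
The metadata for an Iceberg table lives in the `metadata/` directory under the table location, which holds the `*.metadata.json` version files along with manifest lists and manifest files. The sketch below is only an illustration of which paths to target as skipCache; the table location is a placeholder, and the exact syntax for registering skipCache paths is described in the cache filter documentation.

```python
import re

# Location of the Iceberg table in the UFS (placeholder value).
TABLE_LOCATION = "hdfs://namenode:8020/warehouse/db.db/events"

# Everything under the table's metadata/ directory is Iceberg metadata:
# version files (*.metadata.json), manifest lists (snap-*.avro),
# and manifest files (*.avro).
METADATA_PATTERN = re.compile(re.escape(TABLE_LOCATION) + r"/metadata/.*")

def is_iceberg_metadata(path: str) -> bool:
    """Return True if the path points at an Iceberg metadata file
    that should be excluded from the cache (skipCache)."""
    return METADATA_PATTERN.fullmatch(path) is not None

if __name__ == "__main__":
    for p in [
        TABLE_LOCATION + "/metadata/00003-abc.metadata.json",
        TABLE_LOCATION + "/data/part-00000.parquet",
    ]:
        print(p, "-> skipCache" if is_iceberg_metadata(p) else "-> cacheable")
```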

Caching data when writing to HDFS

When writing data with HDFS as the UFS, the data is not cached at write time, even if the write type is configured to persist data to the cache. The newly written data is cached only when it is later read for the first time (a cold read). This behavior was observed with Trino connecting to HDFS; it was not observed with Trino connecting to S3.
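
As a consequence, if newly written data should be served from the cache, it must be read once after writing so that the cold read populates the cache. The following is a minimal sketch using the Trino Python client; the host, catalog, schema, and table names are placeholders.

```python
import trino

# Connection details are placeholders; adjust for your deployment.
conn = trino.dbapi.connect(
    host="trino-coordinator", port=8080, user="admin",
    catalog="hive", schema="default",
)
cur = conn.cursor()

# Write new data through Trino; with HDFS as the UFS, the written
# files are not cached at this point.
cur.execute("INSERT INTO events SELECT * FROM staging_events")
cur.fetchall()  # drive the INSERT to completion

# A first (cold) read of the new data is what populates the cache;
# subsequent reads of the same files can then be served from cache.
cur.execute("SELECT count(*) FROM events")
print(cur.fetchone())
```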