Catalog
Overview
Alluxio 2.1.0 introduced a new service within Alluxio called the Alluxio Catalog Service. The Alluxio Catalog Service is a service for managing access to structured data, which serves a purpose similar to the Apache Hive Metastore
SQL engines like Presto, SparkSQL, and Hive, leverage these metastore-like services to determine which, and how much data to read when executing queries. They store information about different database catalogs, tables, storage formats, data location, and more. The Alluxio Catalog Service is designed to make it simple and straightforward to retrieve and serve structured table metadata to Presto query engines, e.g.PrestoDB, Trino and Starburst Presto.
Architecture
The Alluxio Catalog Service is designed in a way very similar to the normal Alluxio filesystem. The service itself is not responsible for retaining all data, but is rather a caching service for metadata that originates in another location (i.e. MySQL, Hive). The locations where the metadata originates from are called UDBs (Under DataBase). UDBs are responsible for the management and storage of the metadata. Currently, Hive is the only supported UDB. The Alluxio Catalog Service caches and makes the metadata available universally through the Alluxio filesystem namespace.
Query Engine Metadata service Under meta service
+--------+ +--------------------------+ +----------------+
| Presto | <---> | Alluxio Catalog Service | <---> | Hive Metastore |
+--------+ +--------------------------+ +----------------+
Users that have tables which span multiple storage services (i.e. AWS S3, HDFS, GCS) - would typically need to configure their SQL engines to connect to each one of these services individually in order to make requests. Using the Alluxio catalog service, all the user needs to do is configure a single Alluxio client, and data any supported under storage systems locations can be served and read through Alluxio.
Using The Alluxio Catalog Service
Here are the basic configuration parameters and ways to interact with the Alluxio Catalog Service. More details can be found in the command line interface documentation.
Alluxio Server Configuration
By default, the catalog service is enabled. To explicitly disable it, add the following
line to your alluxio-site.properties
alluxio.table.enabled=false
By default, mounted databases and tables will exist underneath /catalog
directory in Alluxio.
Configure the root directory for the catalog service by configuring
alluxio.table.catalog.path=</desired/alluxio/path>
Attaching Databases
In order for Alluxio to serve information about structured table data, Alluxio must be informed
about where databases are located.
The attachdb
command is the first command that should be used to inform the Catalog service about
table locations.
The Table CLI can be used to perform any actions regarding attaching, browsing, or detaching
databases from Alluxio.
$ ${ALLUXIO_HOME}/bin/alluxio table
Usage: alluxio table [generic options]
[attachdb [-o|--option <key=value>] [--db <alluxio db name>] [--ignore-sync-errors] <udb type> <udb connection uri> <udb db name>]
[detachdb <db name>]
[ls [<db name> [<table name>]]]
[sync <db name>]
[transform <db name> <table name>]
[transformStatus [<job ID>]]
To attach a database use the attachdb
command. Currently, hive
and glue
are supported as the
<udb type>
.
See the attachdb command documentation
for more details.
The following command maps the hive database default
into a database in Alluxio called
alluxio_db
from the metastore located at thrift://metastore_host:9083
$ ${ALLUXIO_HOME}/bin/alluxio table attachdb --db alluxio_db hive \
thrift://metastore_host:9083 default
UDB Configuration File
To specify a configuration file for the UDB, append an option -o catalog.db.config.file
to
attachdb
command.
Each time the configuration file is changed, you can use alluxio table sync
to apply the changes.
The configuration file is in JSON format, and can contain these configurations:
-
Tables and partitions bypassing specification:
You can specify some tables and partitions to be bypassed from Alluxio, so that they will not be cached in Alluxio, instead clients will be directed to access them directly from the UDB. This can be helpful when some tables and partitions are large, and accommodating them in the cache is undesirable. An example configuration is like the following:
{ "bypass": { "tables": [ "table1", {"table": "table2", "partitions": ["table2_part1", "table2_part2"]} ] } }
You can specify which tables and partitions within these tables should be bypassed from Alluxio. By specifying only the table name, all partitions of that table, if any, will be bypassed. Otherwise, you can specify specific partitions to bypass.
In the example above, table 1 is fully bypassed. Partition 1 and 2 of table 2 are bypassed, and any other partitions, if any, are not.
Note: When databases are attached, all tables are synced from the configured UDB. If out-of-band updates occur to the database or table and the user wants query results to reflect the updates, the database must be synced. See Syncing Databases for more information.
Exploring Attached Databases
Once attached, check that the database has been mounted with alluxio table ls
$ ${ALLUXIO_HOME}/bin/alluxio table ls
alluxio_db
List the tables underneath the database with alluxio tables ls <db_name>
.
If any tables exist underneath the corresponding database in hive,
they will appear when executing this command.
$ ${ALLUXIO_HOME}/bin/alluxio table ls alluxio_db
test
In this case there is a table test
in the alluxio_db
.
To get more information about a table in a database, run
$ alluxio table ls <db_name> <table_name>
This command will dump information about the table to the console. An example of the output is below:
$ ${ALLUXIO_HOME}/bin/alluxio table ls alluxio_db test
db_name: "alluxio_db"
table_name: "test"
owner: "alluxio"
schema {
cols {
name: "c1"
type: "int"
comment: ""
}
}
layout {
layoutType: "hive"
layoutSpec {
spec: "test"
}
layoutData: "\022\004test\032\004test\"\004test*\345\001\n\232\001\n2org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe\022(org.apache.hadoop.mapred.TextInputFormat\032:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat\0227alluxio://localhost:19998/catalog/test/tables/test/hive\032\v\020\377\377\377\377\377\377\377\377\377\001 \0002\v\022\002c1\032\003int*\000"
}
parameters {
key: "totalSize"
value: "6"
}
parameters {
key: "numRows"
value: "3"
}
parameters {
key: "rawDataSize"
value: "3"
}
parameters {
key: "COLUMN_STATS_ACCURATE"
value: "{\"BASIC_STATS\":\"true\"}"
}
parameters {
key: "numFiles"
value: "3"
}
parameters {
key: "transient_lastDdlTime"
value: "1571191595"
}
Detaching databases
If desired, databases can be detached from the Alluxio namespace by using
$ alluxio table detach <database name>
For the previous examples, to detach we would run:
$ ${ALLUXIO_HOME}/bin/alluxio table detachdb alluxio_db
Running alluxio table ls
afterwards will not display the database any more.
Syncing databases
When the underlying database and tables are updated, users can invoke the sync command to refresh the information stored in the Alluxio catalog metadata. See the sync command documentation for more details.
$ alluxio table sync <database name>
For the previous examples, to sync we would run:
$ ${ALLUXIO_HOME}/bin/alluxio table sync alluxio_db
This sync command will update the Alluxio catalog metadata according to the updates, deletions, and additions in the UDB tables, it will update db config as well.
Automated Periodic UDB Syncing
The Alluxio Catalog Service has the ability to automatically sync with the underlying database at regular intervals.
This is convenient to automatically keep the Alluxio catalog up-to-date with the changes in the underlying databases.
The interval is configured by the server parameter: alluxio.table.catalog.udb.sync.interval
.
For example, if you set alluxio.table.catalog.udb.sync.interval=4h
, the Alluxio catalog will automatically sync
the databases every 4 hours.
The default value for the parameter is 4 hours.
If you want to disable the automatic syncing, set the parameter alluxio.table.catalog.udb.sync.interval=-1
.
A separate configuration parameter alluxio.table.catalog.udb.sync.initial.delay
controls the initial delay before the
first sync, on a startup.
Using Alluxio Structured Data with Presto
PrestoDB version 0.232 or above and Trino version 332 or above has built-in support for the Alluxio Catalog Service in their hive-hadoop2 connector. For instructions to setup Alluxio Catalog Service with those versions of PrestoDB or Trino, please consult PrestoDB’s documentation Trino’s documentation.
If you are using PrestoDB and Trino’s earlier versions, you can use the hive-alluxio
connector
included in the Alluxio distribution.
The latest Alluxio distribution contains a presto connector jar which can be dropped into the
${PRESTO_HOME}/plugins
directory to enable connectivity to the catalog service via Presto.
Enabling the Alluxio Catalog Service with Presto
Assuming you have Alluxio and Presto installation on your local machine at ${ALLUXIO_HOME}
and
${PRESTO_HOME}
respectively, you need to copy the Alluxio connector files into the
Presto installation as a new plugin. This must be done on all Presto nodes.
For PrestoDB installations, run the following.
$ cp -R ${ALLUXIO_HOME}/client/presto/plugins/prestodb-hive-alluxio-227/ ${PRESTO_HOME}/plugin/hive-alluxio/
For Trino installations, run the following.
$ cp -R ${ALLUXIO_HOME}/client/presto/plugins/presto-hive-alluxio-319/ ${PRESTO_HOME}/plugin/hive-alluxio/
Additionally, you’ll need to create a new catalog to use the Alluxio connector and Alluxio Catalog Service:
/etc/catalog/catalog_alluxio.properties
connector.name=hive-alluxio
hive.metastore=alluxio
hive.metastore.alluxio.master.address=HOSTNAME:PORT
Creating the catalog_alluxio.properties
file means a new catalog named catalog_alluxio
is added
to Presto.
Setting connector.name=hive-alluxio
sets the connector type to the name of the
new Alluxio connector for Presto, which is hive-alluxio
.
If you are using PrestoDB version 0.232 or above and Trino version 332 or above, support for Alluxio Catalog Service is built into
the hive-hadoop2
connector, so you should set connector.name=hive-hadoop2
here.
The hive.metastore=alluxio
means Hive metastore connection will use the alluxio
type, in order
to communicate with the Alluxio Catalog service.
The setting hive.metastore.alluxio.master.address=HOSTNAME:PORT
defines the host and port of the
Alluxio catalog service, which is the same host and port as the Alluxio master.
Once configured on each node, restart all presto coordinators and workers.
JAVA 11 Support
Trino 330 or above will only run with Java 11. Starting with the 2.4 version, Alluxio also supports Java 11. If you have both Java 8 and 11 installed on the same node, you will need to modify JAVA_HOME environment variable to point to the correct Java installation directory. The following is an example of launching presto
$ JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64" ${PRESTO_HOME}/bin/launcher start
Update Alluxio client library to the lastest version
Both PrestoDb and Trino now includes Alluxio’s client library in their release.
However, the version included in those releases may not be the latest version.
To update the client library to the latest version, first remove the old alluxio client library from ${PRESTO_HOME}/plugin/hive-hadoop2/
directory.
Then you can follow
instructions to copy the latest client to presto plugin directory.
This client update needs to be performed on all Presto nodes (coordinator and workers).
After the files are in place, be sure to restart the presto cluster so the new library is loaded.
Using the Alluxio Catalog Service with Presto
In order to utilize the Alluxio Presto plugin start the presto CLI with the following (assuming
the /etc/catalog/catalog_alluxio.properties
file has been created)
$ presto --catalog catalog_alluxio
By default, presto will now retrieve database and table information from Alluxio’s catalog service when executing any queries.
Confirm that configuration is correct by running some of the following queries:
- List out the attached databases:
SHOW SCHEMAS;
- List tables from one of the schemas:
SHOW TABLES FROM <schema name>;
- Run a simple query which will read data from the metastore and load data from a table:
DESCRIBE <schema name>.<table name>;
SELECT count(*) FROM <schema name>.<table name>;