Trino ( https://trino.io/ ) is a distributed query engine that can run queries against a variety of data sources. In this short article, we will show you how to configure Trino to execute queries against data stored on VAST.
Note that you can use either VAST-S3 or VAST-NFS (with or without multipath) as a data source for Trino. For now, we will focus on VAST-S3.
Set up VAST-S3 credentials
In order for Trino (or any application) to read and write data via VAST-S3, you must first create a set of credentials. For specific details on how to accomplish this, please refer to Managing S3 User Permissions. Simply put, you will need to:
1. Create a new user on VAST (alternatively you can re-use an existing user, including one which exists within your AD/LDAP/NIS environment).
2. (Optional, but recommended): Grant 'Allow Create Bucket' permissions to the user.
3. Create an Access/Secret keypair for the user.
Create an S3 Bucket on VAST
Next, you will need to create a bucket for use with Trino. There are multiple ways of accomplishing this; please refer to First Steps to Working with S3 for some of them. It is recommended to:
1. Use the same access/secret key obtained from the previous steps
2. Choose a bucket name which does not collide with previously created bucket names, AND does not collide with any 'root level' directories on the VAST filesystem.
Here's an example of using s3cmd :
1. Create an "s3cfg" file containing the relevant parameters. Note that in this example we are disabling SSL/HTTPS for simplicity; this is not required.
# cat trino.s3cfg
access_key = P1XO278JTWX48HVK7S5B
secret_key = 2izPPWryu1+O9KM+t8hDZnjP+eR5R/uo+Cl2qD8L
host_base = trinopool.vast.company.com:80
host_bucket = trinopool.vast.company.com:80
use_https = False
server_side_encryption = False
signature_v2 = False
signurl_use_https = False
encrypt = False
2. Make a bucket:
s3cmd -c trino.s3cfg mb s3://trino
Note: in more advanced scenarios, you may also want to adjust the S3 bucket ACL and Object ACLs to allow other users to access this data. However, in this article, we will presume that only a single user/access_key is used for all steps.
Set up a Hive Metastore
Trino requires a metastore (such as the Hive Metastore) to be available for storing table schemas/DDL. If you already have a Hive Metastore available, you can skip this step. If you do not, here is a quick recipe for running one within Docker. Note that you will need "docker-compose" installed on your machine.
1. Clone this repo:
git clone https://github.com/andypern/hive-metastore-docker && cd hive-metastore-docker
2. Edit the conf/metastore-site.xml file to match your environment (S3 endpoint, credentials, and warehouse directory).
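As a sketch, the properties you will typically need to adjust look like the following. The endpoint, keys, and warehouse path shown here reuse this article's example values; yours will differ, and your repo's file may contain additional properties that should be left in place.

```xml
<configuration>
  <!-- Port the metastore listens on for thrift connections -->
  <property>
    <name>metastore.thrift.uris</name>
    <value>thrift://0.0.0.0:9083</value>
  </property>
  <!-- Default location for new schemas/tables; here, the 'trino' bucket on VAST -->
  <property>
    <name>metastore.warehouse.dir</name>
    <value>s3a://trino/</value>
  </property>
  <!-- VAST S3 endpoint and credentials (same values as the s3cfg example above) -->
  <property>
    <name>fs.s3a.endpoint</name>
    <value>http://trinopool.vast.company.com:80</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>P1XO278JTWX48HVK7S5B</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>2izPPWryu1+O9KM+t8hDZnjP+eR5R/uo+Cl2qD8L</value>
  </property>
  <!-- Required for bucket-in-path style URLs, which VAST-S3 uses -->
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
</configuration>
```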
3. Build the hive-metastore container (typically by running docker-compose build from the repo directory):
docker-compose build
4. Now, bring it up
docker-compose up -d
5. Wait approximately 1-2 minutes for everything to start up, then run the following commands to make sure that the hive-metastore service is listening on the appropriate port:
sudo lsof -i:9083
curl $IP:9083 # where $IP is an IP that other machines can access.
If you have issues, re-running `docker-compose up -d` will sometimes kick-start hive-metastore. If you continue to have issues, run `docker-compose up` (omitting the '-d' flag) to print the logs to the screen so that you can debug.
Now you can move on to getting Trino going.
Create trino configs
The following example assumes a one-node Trino cluster where the single node acts as both the coordinator and a worker/executor. Multi-node environments require additional configuration; please refer to the Trino documentation for details.
To keep things simple, we will use the trinodb/trino Docker container for this installation. To follow these steps in your environment, you will need Docker installed on your host. You will also need to make sure that your host has access to the internet (to pull Docker images), as well as to the VAST cluster over the high-speed network.
1. Create a directory on your host which will hold the catalog configuration files for Trino to use. In this example, we will use "/home/vastdata/trino/catalog".
2. Inside the catalog folder, create an s3.properties file. Note that the "hive.metastore.uri" should point to the host where you are running the hive-metastore from the previous section:
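As an example, an s3.properties for this setup might look like the following. METASTORE_HOST is a placeholder for your hive-metastore host; the endpoint and keys reuse this article's example values. Setting hive.storage-format=PARQUET is an assumption that makes tables created via CTAS use Parquet, matching the later section (Trino's Hive connector otherwise defaults to ORC).

```properties
# Use the Hive connector (the startup log later shows catalog s3 on hive-hadoop2)
connector.name=hive-hadoop2
# Point at the hive-metastore from the previous section (placeholder host)
hive.metastore.uri=thrift://METASTORE_HOST:9083
# VAST S3 endpoint and credentials (same values as the s3cfg example above)
hive.s3.endpoint=http://trinopool.vast.company.com:80
hive.s3.aws-access-key=P1XO278JTWX48HVK7S5B
hive.s3.aws-secret-key=2izPPWryu1+O9KM+t8hDZnjP+eR5R/uo+Cl2qD8L
# Bucket-in-path style URLs, no SSL (matching the s3cfg example)
hive.s3.path-style-access=true
hive.s3.ssl.enabled=false
# Write new tables as Parquet (assumption; default would be ORC)
hive.storage-format=PARQUET
```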
3. Also, within the catalog folder, create the following tpcds.properties file, which will allow us to synthesize data for queries:
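The tpcds connector is built into Trino and needs only a single line:

```properties
connector.name=tpcds
```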
You're almost there! Since we're using Docker, we need to make sure that our catalog files are available to Trino using a bind mount. Make sure that you adjust the path to your catalog folder in the following command appropriately.
1. The following command will launch a one-node trino cluster.
docker run -d \
  -p 8080:8080 \
  -v /home/vastdata/trino/catalog:/etc/trino/catalog \
  trinodb/trino
2. Wait about 1-2 minutes, then run the following command to view the logs for the container
docker logs `docker ps -q --filter=ancestor=trinodb/trino`
2021-09-22T17:13:20.994Z INFO main io.trino.metadata.StaticCatalogStore Added catalog s3 using connector hive-hadoop2
2021-09-22T17:13:20.996Z INFO main io.trino.security.AccessControlManager Using system access control default
2021-09-22T17:13:21.028Z INFO main io.trino.server.Server ======== SERVER STARTED ========
You should now also be able to browse to the following URL (substituting the hostname or IP of the host you are running this on): http://trinohost:8080 . Note that the login is "admin", with no password.
Using the Trino client
In order to execute Trino queries, you need a Trino client. Luckily, the Docker container we are using includes one; we merely need to attach to it using the following:
docker exec -it `docker ps -q --filter ancestor=trinodb/trino` trino
This will present you with a prompt:
Let's check to make sure our catalogs show up:
trino> show catalogs;
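You should see output similar to the following: the s3 and tpcds catalogs we configured, plus Trino's built-in system catalog (exact formatting varies by Trino version).

```
 Catalog
---------
 s3
 system
 tpcds
(3 rows)
```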
The "tpcds" catalog contains "synthetic" tables within a series of "schemas". Here's how to see a list:
show schemas from tpcds;
Each of the above schemas represents a different "size" of dataset. For more information on the layout of TPCDS, please see this helpful documentation: http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.8.0.pdf
Let's use the smallest schema for now. Here's how to list the tables within a schema:
show tables from tpcds.tiny;
Here's how to run a quick query against one of those tables:
trino> select count(*) from tpcds.tiny.catalog_sales;
Creating a Parquet Table on VAST-S3
Now that you have Trino running and know how to query the built-in TPCDS dataset, let's make a table that will live on VAST. Trino lets you do this with the "Create Table As Select" (CTAS) query type.
First, you need to create a "schema" within VAST-S3.
create schema s3.tpcds_tiny;
Now, create a new Parquet table with a CTAS statement:
create table s3.tpcds_tiny.catalog_sales
as select * from tpcds.tiny.catalog_sales;
If all went well, you should now be able to see the table and query from it:
trino> show tables from s3.tpcds_tiny;
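As a quick sanity check, the row count of the new table on VAST should match the source table it was created from:

```sql
select count(*) from s3.tpcds_tiny.catalog_sales;
select count(*) from tpcds.tiny.catalog_sales;
```

Both queries should return the same number.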
You can also see the data in the s3 bucket:
s3cmd -c trino.s3cfg ls s3://trino/tpcds_tiny.db/catalog_sales/
2021-09-24 08:01 4897581 s3://trino/tpcds_tiny.db/catalog_sales/20210924_080104_00005_b4iqx_66249692-00c9-48ad-a8c3-c9a5b7859a8d