To run queries against one or more VAST databases using a third-party query engine, complete the following steps:

- Verify that the prerequisites are met.
- Configure VAST Cluster for interaction with a third-party query engine.
- Install and configure the VAST connector for the third-party query engine.
- Verify that the environment is operational.
- Fill one or more VAST databases with data.
- Verify that the query engine can run queries against the VAST database(s).
If your intended query engine is Trino:

- A Trino cluster must be up and running.

  Tip
  Consult VAST Support if you need help with installing a Trino cluster.

- Low-latency, high-bandwidth networking must be available between the Trino nodes and the VAST cluster.
Complete the following steps:

- (Recommended) Create a VIP pool dedicated to VAST Database.
- Set up a database owner user.
- Create at least one database and a schema.
It is recommended to create a dedicated VIP pool containing only the CNodes that you want to dedicate to servicing VAST Database workloads. This prevents performance implications for other VAST Cluster services.

- In VAST Web UI, choose Network Access -> Virtual IP Pools and click the + Create VIP Pool button. In the dialog that opens, complete the fields and click Create. For the complete VIP pool creation procedure, see Managing Virtual IP Pools.
- If using VAST CLI, run the `vippool create` command.
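As an illustrative sketch only, a dedicated pool might be created from VAST CLI as follows; the pool name, IP range, and parameter names shown here are assumptions and should be checked against your VAST CLI version:

```
vcli: admin> vippool create --name vastdb-pool --start-ip 172.19.120.1 --end-ip 172.19.120.8
```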
- Choose an existing VAST Cluster user or create a new one. For instructions on how to create a VAST Cluster user, see Managing Local Users.
- Ensure that the user is allowed to create S3 buckets:
  - In VAST Web UI, open the Update User dialog (User Management -> Users -> edit a user) and verify that the Allow Create Bucket option is enabled.
  - If using VAST CLI, run the `user query --context local --uid UID` command and verify that the `allow_create_bucket` property is set to `True`.
- Ensure that the user has an S3 access key pair defined and enabled.
  - To check whether S3 access keys exist for a user, log in to VAST CLI and run the `user query --context local --uid UID` command.
  - To generate a new S3 access key pair for a user via VAST Web UI, open the Update User dialog (User Management -> Users -> edit a user) and click the Create new key button.
  - To generate a new S3 access key pair via VAST CLI, run the `user generate-key` command.
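For illustration, checking and then generating keys from VAST CLI might look like the following session; the UID value `1000` is a placeholder, and any additional arguments accepted by `user generate-key` should be verified against your VAST CLI version:

```
vcli: admin> user query --context local --uid 1000
vcli: admin> user generate-key
```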
By creating a VAST database, you create an Element Store view where database tables will be stored. A database view must have an S3 policy and the Database protocol type. You can manage database views in the same way as other views.

Caution
A VAST database table created on an Element Store view looks like a file on a filesystem. Ensure that you have proper access policies in place to prevent unauthorized access to or manipulation of database tables.

A database can have one or more schemas. A schema contains one or more tables, which store data.
To create a database and a schema via VAST Web UI:

- Choose Database -> Database.
- In the Database page that opens, click + New Database.
- In the Create Database dialog, make the following required settings:
  - Database Name: Enter the name for your database. Database naming requirements are the same as for S3 bucket names.
  - Policy name: Select an S3 policy. This policy should match the policy chosen for the database view.
  - Database Owner: Specify the name of the database owner user.
- Click Create.
  The new database is displayed in the database tree.
- Select the newly created database in the database tree and click + Add Schema.
- In the Add Schema dialog that opens, enter a name for the schema. Schema naming requirements are the same as for S3 object names.
- Click Create.
  The new schema is displayed under the database in the database tree.

To create a database and a schema via VAST CLI, run the `schema create` command. For example:

```
vcli: admin> schema create --name schema1 --database-name vastdb
```
Follow this procedure if you are going to use Trino as the query engine to run queries against a VAST database.

- Copy the VAST Connector distribution package to your Trino nodes and install the connector into your Trino plugins directory:

  ```
  $ cd /my/trino/plugins
  $ unzip /tmp/trino-plugin.zip
  <...>
  # Rename to match the plugin name
  $ mv trino-vast-375 vast
  ```
- Create a VAST Connector configuration file named `vast.properties` in the Trino `catalogs` configuration directory, for example: `/etc/trino/catalogs/vast.properties`.

  Note
  The file name needs to align with the directory name used when you installed the plugin (`vast` in the example).

  In the configuration file, specify the following information:

  - `endpoint`: The name of the endpoint to be used to access the tabular data. This is the same endpoint name as used for S3 access. You can specify one or more IP addresses from the VIP pool serving the database/bucket.
  - `access_key_id` and `secret_access_key`: A pair of S3 access keys of the database owner user.
  - (Optional) `data_endpoints`: A list of CNode VIP addresses for the VAST Connector to use for load balancing. Use this option if the endpoint you specified in `endpoint` is not load-balanced across the VAST cluster.

  Below is an example configuration file:

  ```
  endpoint=http://${ENDPOINT_NAME}
  region=us-east-1
  access_key_id=${S3_ACCESS_KEY}
  secret_access_key=${S3_SECRET_KEY}
  # Preliminary tuning parameters
  num_of_splits=256
  num_of_subsplits=10
  vast.http-client.request-timeout=60m
  vast.http-client.idle-timeout=60m
  # Hard-code the CNode VIPs if the "endpoint" value isn't load-balanced
  data_endpoints=http://172.19.120.1,http://172.19.120.2,http://172.19.120.3,http://172.19.120.4,http://172.19.120.5,http://172.19.120.6,http://172.19.120.7,http://172.19.120.8
  ```
- Configure the VAST Connector to log to the Trino server logs by adding `com.vastdata=INFO` to the `log.properties` file:

  ```
  echo "com.vastdata=INFO" >> /etc/trino/log.properties
  ```

- Restart Trino.
- Check the Trino server log to ensure that the VAST Connector is loaded.
If you are using Trino as the query engine, run the following commands from your Trino client and confirm that the result is as expected:

- List Trino catalogs:

  ```
  trino> SHOW CATALOGS;
  ```

  The output should include a catalog named `vast`:

  ```
   Catalog
  ---------
   <...>
   vast
   <...>
  ```

- List schemas in the `vast` catalog:

  ```
  trino> SHOW SCHEMAS FROM vast;
  ```

  The output should include the database and the schema you created when configuring the VAST cluster:

  ```
        Schema
  --------------------
   <...>
   vastdb/schema1
   <...>
  ```

- List tables in schema `vastdb/schema1`:

  ```
  trino> SHOW TABLES FROM vast."vastdb/schema1";
  ```

  If you did not create any tables in the schema, the command completes but the output shows no tables.

- (Optional) Set `vastdb/schema1` as the default namespace for this session. This allows you to omit the catalog and schema in the following steps.

  ```
  trino> USE vast."vastdb/schema1";
  ```

- Create a table in schema `vastdb/schema1`:

  ```
  trino:vastdb/schema1> CREATE TABLE cats (weight BIGINT, name VARCHAR, attitude REAL, date DATE, time TIMESTAMP);
  ```

- List the columns in the newly created table:

  ```
  trino:vastdb/schema1> DESCRIBE cats;
  ```

  The output should show the columns:

  ```
    Column  |     Type     | Extra | Comment
  ----------+--------------+-------+---------
   weight   | bigint       |       |
   name     | varchar      |       |
   attitude | real         |       |
   date     | date         |       |
   time     | timestamp(3) |       |
  (5 rows)
  ```

- Add data to the newly created table:

  ```
  trino:vastdb/schema1> INSERT INTO cats (weight, name, attitude, date, time) VALUES (14, 'jeff', 2.5, date('2022-12-07'), now());
  ```

- Run a transaction:

  ```
  trino:vastdb/schema1> START TRANSACTION;
  INSERT INTO cats (weight, name, attitude, date, time) VALUES (12, 'happy', 18.7, date('2022-12-02'), now());
  INSERT INTO cats (weight, name, attitude, date, time) VALUES (8, 'joy', 19.2, date('2022-12-02'), now());
  INSERT INTO cats (weight, name, attitude, date, time) VALUES (11, 'chester', 13.2, date('2022-12-02'), now());
  COMMIT;
  ```

- List table rows:

  ```
  trino:vastdb/schema1> SELECT * FROM cats;
  ```

  The output should contain the following rows:

  ```
   weight |  name   | attitude |    date    |          time
  --------+---------+----------+------------+-------------------------
       14 | jeff    |      2.5 | 2022-12-07 | 2022-12-08 19:50:51.471
       12 | happy   |     18.7 | 2022-12-02 | 2022-12-08 19:55:35.325
        8 | joy     |     19.2 | 2022-12-02 | 2022-12-08 19:55:36.390
       11 | chester |     13.2 | 2022-12-02 | 2022-12-08 19:55:37.180
  (4 rows)
  ```

- Verify that you can view a query execution plan:

  ```
  trino:vastdb/schema1> EXPLAIN SELECT * FROM cats WHERE weight > 10 AND attitude < 10;
  ```
Use any of the following methods to fill your VAST database with data:

- Run a CTAS query from your query engine's client.
- Insert data directly into a VAST database table.
- Import data from multiple Parquet files using a single RPC call.
- Import data from one Parquet file at a time.
Using your query engine's client, connect to the data source where the data resides and run a CREATE TABLE AS SELECT (CTAS) query. A CTAS query creates a copy of the source table in the VAST database.

The syntax is similar to the following:

```
CREATE TABLE <VAST database table> AS SELECT * FROM <data source table>
```
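As a concrete sketch, assuming the `vast` catalog and `vastdb/schema1` schema configured earlier, plus a hypothetical source table `hive.default.cats_src`, a CTAS query in Trino might look like:

```
CREATE TABLE vast."vastdb/schema1".cats_copy AS
SELECT * FROM hive.default.cats_src;
```

The source catalog and table names here are illustrative placeholders; substitute whatever catalog actually holds your source data.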
To insert data directly into a VAST database table:

- Create a VAST database table into which to insert the data, using VAST Web UI or VAST CLI:
  - In VAST Web UI, choose Database -> Database, select a database and a schema in the database tree, and click the + Add Table button. Complete the fields in the dialog that opens and click Create.

    Note
    For the complete procedure, see Creating a Table via VAST Web UI.

  - In VAST CLI, run the `table create` command.

- Run an INSERT command from your query engine's client. The syntax is similar to the following:

  ```
  INSERT INTO <VAST database table> SELECT * FROM <data source table>
  ```
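For example, assuming the `cats` table created in the verification steps and a hypothetical source table `hive.default.cats_src` with a matching column layout, the INSERT might look like:

```
INSERT INTO vast."vastdb/schema1".cats
SELECT * FROM hive.default.cats_src;
```

Again, the source catalog and table are placeholders; the target must be an existing VAST database table whose columns match the selected data.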
You can fill the VAST database with data from Parquet files contained in a VAST-stored S3 bucket. To do so, make an RPC call from Trino to the VAST cluster. The data is imported directly from the S3 bucket to the database table(s), keeping Trino out of the data path.

Tip
Before importing the data, ensure that the VAST database owner user has valid S3 access keys that provide access to the S3 bucket with the Parquet files.

The RPC call format is as follows:

```
CALL vast.rpc.import_data(
    ARRAY[],
    MAP(
        ARRAY['<bucket path to file1>.parquet', '<bucket path to file2>.parquet', ..., '<bucket path to fileN>.parquet'],
        ARRAY[ARRAY[]]
    ),
    '<VAST database>/<VAST database schema>/<VAST database table>');
```

Where:

- `<bucket path to file>` specifies the path to a source Parquet file that contains data to be imported. You can specify one or more files as a comma-separated list.
- `<VAST database>/<VAST database schema>/<VAST database table>` specifies the target table on the VAST cluster to import the data to.

For example:

```
CALL vast.rpc.import_data(
    ARRAY[],
    MAP(
        ARRAY[
            'parquet-data/PARQUETDATA/09513780/2018/07/27.parquet',
            'parquet-data/PARQUETDATA/09513780/2018/07/09.parquet'
        ],
        ARRAY[ARRAY[], ARRAY[]]
    ),
    'vastdb/schema1/riverflow');
```
You can import Parquet objects one at a time using the following syntax:

```
INSERT INTO "<VAST database table> vast.import_data" ("$parquet_file_path") VALUES ('<bucket path to file>.parquet');
```

Where:

- `<VAST database table>` is the name of the VAST database table into which to insert the data.
- `<bucket path to file>` is the path to the source Parquet file.

For example:

```
INSERT INTO "riverflow vast.import_data" ("$parquet_file_path") VALUES ('parquet-data/PARQUETDATA/09474000/2018/07/24.parquet');
```