Vectorization Query Computing
When handling large datasets, the vectorization execution engine can greatly improve computing efficiency. By vectorizing data, multiple data elements can be processed at the same time, using parallel computing and SIMD instruction sets to speed up the process. SynxDB Elastic Vectorization (referred to as Vectorization) is a vectorization plugin based on the SynxDB Elastic kernel, designed to optimize query performance.
Enable vectorization
SET vector.enable_vectorization = ON;
Usage
Vectorization provides two parameters for users: vector.enable_vectorization
and vector.max_batch_size
.
Parameter name |
Description |
---|---|
|
Controls whether vectorized queries are enabled. It is disabled by default. |
|
The size of the vectorized batch, which controls how many rows the executor processes in one cycle. The range is |
Enable or disable vectorization queries
You can temporarily enable or disable the vectorization feature in a connection by setting the vector.enable_vectorization
variable. This setting is effective only for the current connection, and it will reset to the default value after disconnecting.
After uninstalling vectorization, setting the vector.enable_vectorization
variable will have no effect. When you reinstall it, the vector.enable_vectorization
will be restored to the default value on
.
SET vector.enable_vectorization TO [on|off];
Set vectorization batch size
The batch size of vectorization greatly affects performance. If the value is too small, queries might slow down. If the value is too large, it might increase memory usage without improving performance.
SET vector.max_batch_size TO <batch_number>;
Verify whether a query is vectorized
You can use EXPLAIN
to check whether a query is a vectorization query.
If the first line of the
QUERY PLAN
has a “Vec” label, it means the query uses vectorization.gpadmin=# EXPLAIN SELECT * FROM int8_tbl; QUERY PLAN ----------------------------------------------------------------------------------- Vec Gather Motion 3:1 (slice1; segments: 3) (cost=0.00..431.00 rows=1 width=16) -> Vec Seq Scan on int8_tbl (cost=0.00..431.00 rows=1 width=16) Optimizer: Pivotal Optimizer (GPORCA) (3 rows)
If the first line of the
QUERY PLAN
does not have a “Vec” label, it means the query does not use vectorization.gpadmin=# EXPLAIN SELECT * FROM int8_tbl; QUERY PLAN ------------------------------------------------------------------------------- Gather Motion 3:1 (slice1; segments: 3) (cost=0.00..431.00 rows=1 width=16) -> Seq Scan on int8_tbl (cost=0.00..431.00 rows=1 width=16) Optimizer: Pivotal Optimizer (GPORCA) (3 rows)
Supported features
The following table describes the operators, functions, and data types supported by vectorized execution.
Feature |
Supported or not |
Description |
---|---|---|
Storage format |
Supported |
AOCS |
Storage format |
Not supported |
HEAP |
Data types |
Supported |
int2, int4, int8, float8, bool, char, tid, date, time, timestamp, timestamptz, varchar, text, numeric |
Data types |
Not supported |
Custom type |
Scan operator |
Supported |
Scanning AOCS tables, complex filter conditions |
Scan operator |
Not supported |
Non-AOCS tables |
Agg operator |
Supported |
Aggregate functions: min, max, count, sum, avg Aggregation plans: PlanAggregate (simple aggregate), GroupAggregate (sorted aggregate), HashAggregate (hash aggregate) |
Agg operator |
Not Supported |
Aggregate functions: sum(int8), sum(float8), stddev (standard deviation), variance Aggregation plans: MixedAggregate (mixed aggregate) |
Limit operator |
Supported |
All |
ForeignScan operator |
Supported |
All |
Result operator |
Supported |
All |
Append operator |
Supported |
All |
Subquery operator |
Supported |
All |
Sequence operator |
Supported |
All |
NestedLoopJoin operator |
Supported |
Join types: inner join, left join, semi join, anti join |
NestedLoopJoin operator |
Not supported |
Join types: right join, full join, semi-anti join Join conditions: different data types, complex inequality conditions |
Material operator |
Supported |
All |
ShareInputScan operator |
Supported |
All |
ForeignScan operator |
Supported |
All |
HashJoin operator |
Supported |
Join types: inner join, left join, right join, full join, semi join, anti join, semi-anti join |
HashJoin operator |
Not Supported |
Join conditions: different data types, complex inequality conditions |
Sort operator |
Supported |
Sorting order: ascending, descending algorithms: order by, order by limit |
Motion operator |
Supported |
GATHER (sending tuples from multiple senders to one receiver), GATHER_SINGLE (single-node gathering), HASH (simple hash conditions), BROADCAST (broadcast gathering), EXPLICIT (explicit gathering) |
Motion operator |
Not supported |
Hash gathering (complex hash conditions) |
Expressions |
Supported |
case when, is distinct, is not distinct, grouping, groupid, stddev_sample, abs, round, upper, textcat, date_pli, coalesce, substr |
Bench |
Supported |
ClickHouse, TPC-H, TPC-DS, ICW, ICW-ORCA |
Single-node threaded execution
By performing threaded parallel computation within the execution node, better utilization of multi-core machine resources can be achieved, reducing query time and improving query performance. The following vectorized operators support threaded acceleration:
Scan (Filter)
Join
Agg
Sort
Currently, the vectorized query feature supports enabling threaded queries in a single-node deployment. To enable it, the following GUC values need to be configured:
set vector.enable_vectorization=on; -- Enables vectorized query.
set vector.enable_plan_merge=on; -- Enables Pipeline scheduling execution.
set vector.pool_threads=8; -- Configures the number of execution threads.
Vectorized threaded execution relies on the underlying PAX storage file format. Take the tpcds
PAX table store_sales
as an example. When threaded execution is not enabled:
gpadmin=# set vector.pool_threads=0;
SET
gpadmin=# explain analyze select ss_quantity from store_sales
where ss_quantity between 6 and 10 and ss_list_price between 5 and 5+10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------
---------------------------------
Vec Seq Scan on store_sales (cost=0.00..661957.82 rows=28801 width=4) (actual time=8804.432..8805.236 rows
=94117 loops=1)
Filter: ((ss_quantity >= 6) AND (ss_quantity <= 10) AND (ss_list_price >= '5'::numeric(15,2)) AND (ss_lis
t_price <= '15'::numeric(15,2)))
Planning Time: 0.475 ms
(slice0) Executor memory: 114K bytes.
Memory used: 128000kB
Optimizer: Postgres query optimizer
Execution Time: 8809.841 ms
(7 rows)
When threaded execution is enabled and the pool_threads
parameter is set to 8
:
gpadmin=# set vector.pool_threads=8;
SET
gpadmin=# explain analyze select ss_quantity from store_sales
where ss_quantity between 6 and 10 and ss_list_price between 5 and 5+10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------
---------------------------------
After enabling threaded queries, the query time is significantly reduced.