Vectorization Query Computing

When handling large datasets, the vectorization execution engine can greatly improve computing efficiency. By vectorizing data, multiple data elements can be processed at the same time, using parallel computing and SIMD instruction sets to speed up the process. SynxDB Elastic Vectorization (referred to as Vectorization) is a vectorization plugin based on the SynxDB Elastic kernel, designed to optimize query performance.

Enable vectorization

SET vector.enable_vectorization = ON;

Usage

Vectorization provides two parameters for users: vector.enable_vectorization and vector.max_batch_size.

Parameter name	Description
`vector.enable_vectorization`	Controls whether vectorized queries are enabled. It is disabled by default.
`vector.max_batch_size`	The size of the vectorized batch, which controls how many rows the executor processes in one cycle. The range is `[0, 600000]`, and the default value is `16384`.

Enable or disable vectorization queries

You can temporarily enable or disable the vectorization feature in a connection by setting the vector.enable_vectorization variable. This setting is effective only for the current connection, and it will reset to the default value after disconnecting.

After uninstalling vectorization, setting the vector.enable_vectorization variable will have no effect. When you reinstall it, the vector.enable_vectorization will be restored to the default value on.

SET vector.enable_vectorization TO [on|off];

Set vectorization batch size

The batch size of vectorization greatly affects performance. If the value is too small, queries might slow down. If the value is too large, it might increase memory usage without improving performance.

SET vector.max_batch_size TO <batch_number>;

Verify whether a query is vectorized

You can use EXPLAIN to check whether a query is a vectorization query.

If the first line of the QUERY PLAN has a “Vec” label, it means the query uses vectorization.

gpadmin=# EXPLAIN SELECT * FROM int8_tbl;
                                    QUERY PLAN                                     
-----------------------------------------------------------------------------------
Vec Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..431.00 rows=1 width=16)
->  Vec Seq Scan on int8_tbl  (cost=0.00..431.00 rows=1 width=16)
Optimizer: Pivotal Optimizer (GPORCA)
(3 rows)

If the first line of the QUERY PLAN does not have a “Vec” label, it means the query does not use vectorization.

gpadmin=# EXPLAIN SELECT * FROM int8_tbl;
                                QUERY PLAN                                   
-------------------------------------------------------------------------------
Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..431.00 rows=1 width=16)
->  Seq Scan on int8_tbl  (cost=0.00..431.00 rows=1 width=16)
Optimizer: Pivotal Optimizer (GPORCA)
(3 rows)

Supported features

The following table describes the operators, functions, and data types supported by vectorized execution.

Feature	Supported or not	Description
Storage format	Supported	AOCS
Storage format	Not supported	HEAP
Data types	Supported	int2, int4, int8, float8, bool, char, tid, date, time, timestamp, timestamptz, varchar, text, numeric
Data types	Not supported	Custom type
Scan operator	Supported	Scanning AOCS tables, complex filter conditions
Scan operator	Not supported	Non-AOCS tables
Agg operator	Supported	Aggregate functions: min, max, count, sum, avg Aggregation plans: PlanAggregate (simple aggregate), GroupAggregate (sorted aggregate), HashAggregate (hash aggregate)
Agg operator	Not Supported	Aggregate functions: sum(int8), sum(float8), stddev (standard deviation), variance Aggregation plans: MixedAggregate (mixed aggregate)
Limit operator	Supported	All
ForeignScan operator	Supported	All
Result operator	Supported	All
Append operator	Supported	All
Subquery operator	Supported	All
Sequence operator	Supported	All
NestedLoopJoin operator	Supported	Join types: inner join, left join, semi join, anti join
NestedLoopJoin operator	Not supported	Join types: right join, full join, semi-anti join Join conditions: different data types, complex inequality conditions
Material operator	Supported	All
ShareInputScan operator	Supported	All
ForeignScan operator	Supported	All
HashJoin operator	Supported	Join types: inner join, left join, right join, full join, semi join, anti join, semi-anti join
HashJoin operator	Not Supported	Join conditions: different data types, complex inequality conditions
Sort operator	Supported	Sorting order: ascending, descending algorithms: order by, order by limit
Motion operator	Supported	GATHER (sending tuples from multiple senders to one receiver), GATHER_SINGLE (single-node gathering), HASH (simple hash conditions), BROADCAST (broadcast gathering), EXPLICIT (explicit gathering)
Motion operator	Not supported	Hash gathering (complex hash conditions)
Expressions	Supported	case when, is distinct, is not distinct, grouping, groupid, stddev_sample, abs, round, upper, textcat, date_pli, coalesce, substr
Bench	Supported	ClickHouse, TPC-H, TPC-DS, ICW, ICW-ORCA

Single-node threaded execution

By performing threaded parallel computation within the execution node, better utilization of multi-core machine resources can be achieved, reducing query time and improving query performance. The following vectorized operators support threaded acceleration:

Scan (Filter)
Join
Agg
Sort

Currently, the vectorized query feature supports enabling threaded queries in a single-node deployment. To enable it, the following GUC values need to be configured:

set vector.enable_vectorization=on;  -- Enables vectorized query.
set vector.enable_plan_merge=on; -- Enables Pipeline scheduling execution.
set vector.pool_threads=8;  -- Configures the number of execution threads.

Vectorized threaded execution relies on the underlying PAX storage file format. Take the tpcds PAX table store_sales as an example. When threaded execution is not enabled:

gpadmin=# set vector.pool_threads=0;
SET
gpadmin=# explain analyze select ss_quantity from store_sales 
where ss_quantity between 6 and 10 and ss_list_price between 5 and 5+10;
                                                               QUERY PLAN                                 

------------------------------------------------------------------------------------------------------------
---------------------------------
Vec Seq Scan on store_sales  (cost=0.00..661957.82 rows=28801 width=4) (actual time=8804.432..8805.236 rows
=94117 loops=1)
   Filter: ((ss_quantity >= 6) AND (ss_quantity <= 10) AND (ss_list_price >= '5'::numeric(15,2)) AND (ss_lis
t_price <= '15'::numeric(15,2)))
Planning Time: 0.475 ms
   (slice0)    Executor memory: 114K bytes.
Memory used:  128000kB
Optimizer: Postgres query optimizer
Execution Time: 8809.841 ms
(7 rows)

When threaded execution is enabled and the pool_threads parameter is set to 8:

gpadmin=# set vector.pool_threads=8;
SET
gpadmin=# explain analyze select ss_quantity from store_sales 
where ss_quantity between 6 and 10 and ss_list_price between 5 and 5+10;
                                                               QUERY PLAN                                 

------------------------------------------------------------------------------------------------------------
---------------------------------

After enabling threaded queries, the query time is significantly reduced.