Use pgvector for Vector Similarity Search

pgvector is an open-source plugin for vector similarity search. It supports both exact and approximate nearest neighbor searches, as well as L2 distance, inner product, and cosine distance. For more details, see pgvector/pgvector: Open-source vector similarity search for Postgres. SynxDB allows you to use pgvector for data storage, querying, indexing, hybrid searches, and more through SQL statements.

This document explains how to use pgvector in SynxDB.

Quick start

Enable the extension (do this once in each database where you want to use it):

CREATE EXTENSION vector;

Create a vector column with 3 dimensions:

CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));

Insert vector data:

INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

Get the nearest neighbors by L2 distance:

SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

Note: Use <#> for inner product and <=> for cosine distance. Use <+> for L1 distance.
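For example, the same query ordered by cosine distance:

SELECT * FROM items ORDER BY embedding <=> '[3,1,2]' LIMIT 5;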

Store data

Create a table with a vector column:

CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));

Or add a vector column to an existing table:

ALTER TABLE items ADD COLUMN embedding vector(3);

Insert vectors:

INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

Upsert vectors (insert, or update on conflict):

INSERT INTO items (id, embedding) VALUES (1, '[1,2,3]'), (2, '[4,5,6]')
    ON CONFLICT (id) DO UPDATE SET embedding = EXCLUDED.embedding;

Update vectors:

UPDATE items SET embedding = '[1,2,3]' WHERE id = 1;

Delete vectors:

DELETE FROM items WHERE id = 1;
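For bulk loading, COPY is usually much faster than individual INSERT statements. A minimal sketch, assuming a hypothetical CSV file in which each line contains one quoted vector literal such as "[1,2,3]" (use \copy in psql to load a client-side file):

COPY items (embedding) FROM '/path/to/vectors.csv' WITH (FORMAT csv);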

Query data

Get the nearest neighbors to a vector:

SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

The supported distance functions are:

  • <->: L2 distance

  • <#>: negative inner product

  • <=>: cosine distance

  • <+>: L1 distance

Get the nearest neighbors of a row:

SELECT * FROM items WHERE id != 1 ORDER BY embedding <-> (SELECT embedding FROM items WHERE id = 1) LIMIT 5;

Get rows within a specific distance range:

SELECT * FROM items WHERE embedding <-> '[3,1,2]' < 5;

Note: To use an index with a distance-range query like this, combine it with ORDER BY and LIMIT.

Get the distance:

SELECT embedding <-> '[3,1,2]' AS distance FROM items;

For inner product, multiply by -1 (because <#> returns the negative inner product).

SELECT (embedding <#> '[3,1,2]') * -1 AS inner_product FROM items;

For cosine similarity, use 1 minus the cosine distance.

SELECT 1 - (embedding <=> '[3,1,2]') AS cosine_similarity FROM items;

Calculate the average of vectors:

SELECT AVG(embedding) FROM items;

Calculate the average of a group of vectors:

SELECT category_id, AVG(embedding) FROM items GROUP BY category_id;

Index data

By default, pgvector performs exact nearest neighbor search, which provides perfect recall.

To speed up queries, you can add an index to use approximate nearest neighbor search, which trades some recall for performance. Unlike a regular index, an approximate index can cause queries to return different results.

pgvector supports the following index types:

  • HNSW

  • IVFFlat

HNSW index

About HNSW

HNSW (Hierarchical Navigable Small World) is an efficient algorithm for approximate nearest neighbor search, designed for processing large-scale and high-dimensional datasets.

The basic principles of HNSW are as follows:

  • Multi-level graph structure: HNSW organizes data by building a multi-level graph. In this graph, each node represents a data point (or vector), and the edges between nodes reflect their relative proximity in space.

  • Search optimization: This multi-level structure allows the search process to skip over many irrelevant data points quickly, narrowing down the neighboring area of the query vector. This greatly improves query efficiency.

HNSW is particularly useful in the following scenarios:

  • High-dimensional data: HNSW index is very effective for high-dimensional datasets, because it is good at handling complex proximity relationships in high-dimensional space.

  • Large datasets: HNSW index is suitable for large datasets because it balances query speed and recall rate better than many other types of indexes.

Compared with IVFFlat, creating an HNSW index takes longer and uses more memory, but it offers better query performance (a better speed-recall tradeoff). Unlike IVFFlat, HNSW does not require a training step, so you can create the index even when the table has no data.

Add an index for each distance function you plan to use.

Create an HNSW index

Each distance metric has its specific use cases. The choice of which method to use for creating an index depends on the type of search you want to optimize. For example, if your application focuses on getting vectors with similar direction but possibly different magnitudes, an index created with cosine distance might be more suitable. On the other hand, if you are concerned with the straight-line distance between vectors, you should choose an index based on L2 distance.

L2 distance:

CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);

Inner product:

CREATE INDEX ON items USING hnsw (embedding vector_ip_ops);

Cosine distance:

CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops);

Note: For the vector type, indexes support up to 2000 dimensions (see the full list of supported types below).

Hamming distance:

CREATE INDEX ON items USING hnsw (embedding bit_hamming_ops);

Jaccard distance:

CREATE INDEX ON items USING hnsw (embedding bit_jaccard_ops);

Supported types are:

  • vector: up to 2000 dimensions

  • halfvec: up to 4000 dimensions

  • bit: up to 64000 dimensions

  • sparsevec: up to 1000 non-zero elements

HNSW index options

  • m: The maximum number of connections per layer (16 by default).

  • ef_construction: The size of the dynamic candidate list used to build the graph (64 by default).

CREATE INDEX ON items USING hnsw (embedding vector_l2_ops) WITH (m = 16, ef_construction = 64);

A larger ef_construction value provides higher recall but at the cost of index build time and insert speed.

HNSW index query options

Specify the size of the dynamic candidate list for searches (40 by default). A larger value improves recall but reduces speed.

SET hnsw.ef_search = 100;

Use SET LOCAL within a transaction to apply it to a single query:

BEGIN;
SET LOCAL hnsw.ef_search = 100;
SELECT ...
COMMIT;

HNSW index build time

Index build speed increases greatly when the internal graph structure of the HNSW index fits into maintenance_work_mem.

SET maintenance_work_mem = '8GB';

If the graph no longer fits, you will receive a notification:

NOTICE:  hnsw graph no longer fits into maintenance_work_mem after 100000 tuples
DETAIL:  Building will take significantly more time.
HINT:  Increase maintenance_work_mem to speed up builds.

Note: Do not set maintenance_work_mem too high, because doing so can exhaust the server's memory.

Like other index types, it is faster to create the index after loading the initial data.

You can also speed up index creation by increasing the number of parallel workers (2 by default).

SET max_parallel_maintenance_workers = 7; -- Including the leader thread

For more workers, you might also need to increase max_parallel_workers (8 by default).
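For example, to raise the overall worker limit (an illustrative value):

SET max_parallel_workers = 15;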

View HNSW index progress

You can check the progress when creating an HNSW index:

SELECT phase, round(100.0 * blocks_done / nullif(blocks_total, 0), 1) AS "%" FROM pg_stat_progress_create_index;

The HNSW index build process includes the following phases:

  1. initializing: The starting phase of index creation. In this phase, the system prepares all necessary resources and configurations to begin building the index.

  2. loading tuples: Data points (or vectors) are added to the multi-level graph, and the corresponding connections are established.

IVFFlat index

About IVFFlat

The IVFFlat index is a method for efficient vector search in large datasets, particularly useful for the Approximate Nearest Neighbor (ANN) search.

The basic principles of IVFFlat index are as follows:

  • Partitioned search space: IVFFlat works by dividing the data into multiple “lists”. These lists are created by clustering the dataset (for example, using the K-means algorithm), with each list representing a cluster in the data space.

  • Reduced search complexity: Instead of searching through the entire dataset, the search first identifies which lists (or clusters) the search vector is likely to belong to, then only searches within those lists, reducing computing time.

IVFFlat is particularly useful in the following scenarios:

  • Large datasets: For datasets containing many vectors, a full search (checking every vector) can be very time-consuming. IVFFlat optimizes the search process through clustering and partitioning.

  • Approximate search: IVFFlat is an approximate nearest neighbor search method, suitable for scenarios where quick response times are needed and some inaccuracy in search results is acceptable.

To achieve good recall with IVFFlat, follow these best practices:

  • Create the index after some data has been inserted into the table.

  • Choose an appropriate number of lists. For tables with up to 1 million rows, it is recommended to use the number of rows divided by 1000 as the number of lists. For tables with more than 1 million rows, use the square root of the number of rows as the number of lists.

  • Specify an appropriate number of probes during queries (the higher the number of probes, the higher the recall, but the slower the query). A good starting point is the square root of the number of lists.

Add an index for each distance function you plan to use.
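As a worked example of the sizing rules above: a table with 500,000 rows suggests lists = 500000 / 1000 = 500 and a starting point of about √500 ≈ 22 probes, while a table with 4,000,000 rows suggests lists = √4000000 = 2000 and about √2000 ≈ 45 probes.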

Create an IVFFlat index

Each distance metric has its specific use cases. The choice of which method to use for creating an index depends on the type of search you want to optimize. For example, if your application focuses on finding vectors with similar direction but possibly different magnitudes, an index created with cosine distance might be more suitable. On the other hand, if you are concerned with the straight-line distance between vectors, you should choose an index based on L2 distance.

The lists parameter specifies the number of partitions (lists).

L2 distance:

CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);

Inner product:

CREATE INDEX ON items USING ivfflat (embedding vector_ip_ops) WITH (lists = 100);

Cosine distance:

CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

Attention

SynxDB currently supports indexing vectors with up to 2000 dimensions.

Hamming distance:

CREATE INDEX ON items USING ivfflat (embedding bit_hamming_ops) WITH (lists = 100);

Supported types:

  • vector: up to 2000 dimensions

  • halfvec: up to 4000 dimensions

  • bit: up to 64,000 dimensions

Specify the number of probes

Attention

The number of probes means how many “lists” the system checks during an approximate nearest neighbor search. These lists are subsets of the dataset created through clustering algorithms like K-means. Increasing the number of probes means the system checks more lists to get the nearest neighbors, thus improving accuracy.

A higher number of probes increases search accuracy but also adds computing overhead, slowing down the search. Therefore, the number of probes is a parameter that should be balanced based on specific use cases.

Specify the number of probes (1 by default):

SET ivfflat.probes = 10;

A larger number of probes improves recall at the cost of speed. You can also set it equal to the number of lists for exact nearest neighbor search (in this case, the planner does not use the index).

Use SET LOCAL within a transaction to set the number of probes for a single query:

BEGIN;
SET LOCAL ivfflat.probes = 10;
SELECT ...
COMMIT;

View IVFFlat index progress

You can check the index progress during its creation:

SELECT phase, tuples_done, tuples_total FROM pg_stat_progress_create_index;

The progress phases include:

  1. initializing: The starting phase of index creation. The system prepares all necessary resources and configurations during this phase.

  2. performing k-means: The k-means algorithm is used to divide the vector dataset into multiple lists (or clusters).

  3. sorting tuples: Sorting the data (tuples). This is done based on vector values or the lists they belong to, optimizing the index structure and improving search efficiency.

  4. loading tuples: The data is loaded into the index structure; each tuple is written to its assigned list in the index.

Attention

tuples_done and tuples_total are only populated during the loading tuples phase.

Use filters in indexes

A vector search query often includes a WHERE clause that restricts results to rows that meet specific conditions:

SELECT * FROM items WHERE category_id = 123 ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

Create an index on one or more WHERE columns for exact searches:

CREATE INDEX ON items (category_id);

Create a partial index on a vector column for approximate searches:

CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100) WHERE (category_id = 123);

For approximate searches across many different values of the WHERE columns, use partitioning:

CREATE TABLE items (embedding vector(3), category_id int) PARTITION BY LIST(category_id);
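Each partition then needs its own index. A minimal sketch that adds one partition for a hypothetical category value and indexes it:

CREATE TABLE items_category_123 PARTITION OF items FOR VALUES IN (123);
-- As recommended earlier, create the IVFFlat index after loading data into the partition.
CREATE INDEX ON items_category_123 USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);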

Half-precision vectors

Use the halfvec type to store half-precision vectors:

CREATE TABLE items (id bigserial PRIMARY KEY, embedding halfvec(3));
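halfvec columns accept the same bracketed literal format as vector, with values stored as 16-bit floats. For example (assuming the table above):

INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;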

Half-precision index

Index vectors at half precision to reduce index size:

CREATE INDEX ON items USING hnsw ((embedding::halfvec(3)) halfvec_l2_ops);

Get the nearest neighbors:

SELECT * FROM items ORDER BY embedding::halfvec(3) <-> '[1,2,3]' LIMIT 5;

Binary vectors

Use the bit type to store binary vectors:

CREATE TABLE items (id bigserial PRIMARY KEY, embedding bit(3));
INSERT INTO items (embedding) VALUES ('000'), ('111');

Get the nearest neighbors using the Hamming distance:

SELECT * FROM items ORDER BY bit_count(embedding # '101') LIMIT 5;
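The query above computes the Hamming distance manually with bit_count. If your pgvector version provides the dedicated binary distance operators <~> (Hamming) and <%> (Jaccard), which back the bit_hamming_ops and bit_jaccard_ops operator classes mentioned earlier, the same search can be written as:

SELECT * FROM items ORDER BY embedding <~> '101' LIMIT 5;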

Hybrid search

Perform hybrid search by combining vector search with SynxDB full-text search. The following example assumes that the items table also has content (text) and textsearch (tsvector) columns:

SELECT id, content FROM items, plainto_tsquery('hello search') query
    WHERE textsearch @@ query ORDER BY ts_rank_cd(textsearch, query) DESC LIMIT 5;
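A variation on the same pattern, again assuming the hypothetical content and textsearch columns: filter candidate rows with full-text search, then rank the matches by vector distance:

SELECT id, content FROM items, plainto_tsquery('hello search') query
    WHERE textsearch @@ query ORDER BY embedding <=> '[3,1,2]' LIMIT 5;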

pgvector performance

Use EXPLAIN ANALYZE for performance debugging:

EXPLAIN ANALYZE SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;

Exact search

To speed up queries without an index, you can increase the value of the max_parallel_workers_per_gather parameter.

SET max_parallel_workers_per_gather = 4;

If vectors are normalized to length 1 (for example, OpenAI embeddings), using inner product provides the best performance.

SELECT * FROM items ORDER BY embedding <#> '[3,1,2]' LIMIT 5;

Approximate search

To speed up queries with an index, you can increase the number of inverted lists (at the cost of some recall).

CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 1000);

These guidelines cover nearest neighbor search and performance optimization in pgvector. Adjust them based on your workload and data.
