Cluster Failover

SynxDB Elastic v4 features fast failover. When a data warehouse node or another service in the cluster fails, the cluster quickly detects the failure and automatically recovers the failed component along with its stored data, rapidly restoring the cluster to its pre-failure state.

Failover mechanism

SynxDB Elastic is deployed in a containerized environment based on Kubernetes (K8s). Components and services that form the database cluster are deployed in different pods and are managed and scheduled by the K8s cluster.

When a data warehouse node (Segment), data warehouse proxy service, or cluster-level control service in a SynxDB Elastic cluster fails, K8s automatically restarts the corresponding Pod and re-attaches its Persistent Volume within a short time, so all data from before the failure is preserved and restored. The restarted Pod keeps the same network configuration it had before the failure. Once the failed Pod restarts and enters the Running state, the database resumes normal operation.

Note: Until the failed Pod enters the Running state, the database cluster or the data warehouse that the Pod belongs to does not function normally.
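As an illustration, you can observe this recovery behavior with kubectl. The following is a minimal sketch; the namespace synxdb-system and the Pod name cloudberry-proxy-0 match the examples used later in this section:

    # Watch Pods restart after a failure; recovery is complete when the status returns to Running.
    kubectl get pods -n synxdb-system -w

    # Confirm that the restarted Pod is still bound to its Persistent Volume Claim.
    kubectl get pvc -n synxdb-system

    # Inspect the restarted Pod's events, network configuration, and volume bindings.
    kubectl describe pod cloudberry-proxy-0 -n synxdb-system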

Failover scenarios

This section lists the following failover scenarios and provides corresponding examples.

  • Failover for cluster metadata proxy service (example Pod: cloudberry-proxy-0)

  • Failover for cluster metadata safekeeper service (example Pod: safekeeper-0)

  • Failover for cluster metadata pageserver service (example Pod: pageserver-0)

  • Failover for data warehouse proxy service in the cluster (example Pod: cloudberry-proxy-0)

  • Failover for data warehouse node in the cluster (example Pod: wh9527-0)

The following examples simulate a failure by deleting the Pod of the corresponding service in the K8s cluster. After the Pod is deleted, failover success is verified by observing the Pod restart and then testing the functionality of the corresponding module in the database.
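If you prefer the command line to the k9s console, the same failure can be simulated directly with kubectl. This is a sketch; substitute the namespace and Pod name of the scenario you are testing:

    # Simulate a failure by deleting the Pod (here: the cluster metadata proxy service).
    kubectl delete pod cloudberry-proxy-0 -n synxdb-system

    # Observe the automatic restart; failover is complete once the status is Running.
    kubectl get pod cloudberry-proxy-0 -n synxdb-system -w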

Prerequisites

Before proceeding with the following tests, ensure the following prerequisites are met:

  1. An example SynxDB Elastic cluster has been created and is accessible.

  2. The Kubernetes command-line tool kubectl and the k9s console are installed locally (see the verification commands below).
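You can verify both tools from a terminal, for example:

    # Check that kubectl is installed and can reach the cluster.
    kubectl version --client
    kubectl cluster-info

    # Check that the k9s console is installed.
    k9s version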

Cluster metadata proxy service failover

This example simulates the failover of the SynxDB Elastic cluster metadata proxy service after a failure.

  1. Access the k9s console and locate the cloudberry-proxy-0 Pod in the synxdb-system namespace.


  2. Set a local forwarding port for cloudberry-proxy-0. Press Shift+F to open the port-forwarding settings, then enter the container port (default: 5432) and the local port to forward to. The address can be left as the default, localhost. (For a kubectl-only version of this procedure, see the sketch after these steps.)

    Note: Avoid setting the local port to 5432 to prevent conflicts with a local PostgreSQL installation.


  3. Log in to the cluster using the forwarding port and address set in the previous step: psql -h localhost -p 15432 -U gpadmin -d postgres.

  4. Enter the cloudmeta database and verify that data can be accessed:

    -- Enter the cloudmeta database
    \c cloudmeta
    -- List tables in the current database
    \dt
    
  5. In the k9s console, switch to the Persistent Volume Claim (PVC) view by typing :pvc. Find the corresponding storage claim and delete it with Ctrl+D.


  6. In the Pod list, delete cloudberry-proxy-0.


    After the Pod is deleted, K8s restarts it automatically. The metadata cluster remains temporarily inaccessible until the Pod's status returns to Running.

  7. Wait for the Pod to restart. Once its status is Running, set up local port forwarding again and access the cloudmeta database. The database should now be accessible.

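For reference, the same procedure can be driven entirely with kubectl instead of k9s. The following is a minimal sketch; <pvc-name> is a placeholder for the actual claim reported by kubectl get pvc:

    # Forward a local port to the metadata proxy (avoid 5432 locally).
    kubectl port-forward pod/cloudberry-proxy-0 15432:5432 -n synxdb-system &

    # Connect through the forwarded port and list tables in the metadata database.
    psql -h localhost -p 15432 -U gpadmin -d cloudmeta -c '\dt'

    # Simulate the failure: delete the storage claim and the Pod.
    # (Replace <pvc-name> with the claim shown by: kubectl get pvc -n synxdb-system)
    kubectl delete pvc <pvc-name> -n synxdb-system
    kubectl delete pod cloudberry-proxy-0 -n synxdb-system

    # Wait for the restarted Pod to return to Running, then reconnect.
    kubectl get pod cloudberry-proxy-0 -n synxdb-system -w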

Cluster metadata safekeeper service failover

This example simulates the failover of the SynxDB Elastic cluster metadata safekeeper service (example Pod: safekeeper-0) after a failure. For detailed steps and commands, refer to the cluster metadata proxy service failover example.

  1. Access the k9s console and locate the safekeeper-0 Pod in the synxdb-system namespace.

  2. Switch to the PVC view, find the corresponding storage claim, and delete it.

  3. Delete the safekeeper-0 Pod. K8s automatically restarts it. The database remains accessible during this period.

  4. The safekeeper-0 Pod’s status returns to Running, and the failover is complete.

Cluster metadata pageserver service failover

This example simulates the failover of the SynxDB Elastic cluster metadata pageserver service after a failure. For detailed steps and commands, refer to the cluster metadata proxy service failover example; a kubectl sketch of the sequence, which applies to both the safekeeper and pageserver services, follows the steps below.

  1. Access the k9s console and locate the pageserver-0 Pod in the synxdb-system namespace.

  2. Switch to the PVC view, find the corresponding storage claim, and delete it.

  3. Delete the pageserver-0 Pod. K8s automatically restarts it. The database remains accessible during this period.

  4. The pageserver-0 Pod’s status returns to Running, and the failover is complete.
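The same delete-and-watch sequence for either metadata service can also be performed with kubectl. A sketch using pageserver-0 (substitute safekeeper-0 for the safekeeper test; <pvc-name> is a placeholder for the actual claim):

    # Look up the storage claim that belongs to the service.
    kubectl get pvc -n synxdb-system

    # Delete the claim and the Pod.
    kubectl delete pvc <pvc-name> -n synxdb-system
    kubectl delete pod pageserver-0 -n synxdb-system

    # The failover is complete once the Pod is Running again.
    kubectl get pod pageserver-0 -n synxdb-system -w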

Data warehouse proxy service failover

This example simulates the failover of the SynxDB Elastic data warehouse cluster proxy service after a failure.

  1. In k9s, go to the data warehouse namespace (e.g., cloud007-513d75d5-3033-4649-b913-07f6115ec41a+) and locate the proxy service Pod cloudberry-proxy-0.

  2. Set a local forwarding port for cloudberry-proxy-0. Press Shift+F to open the port-forwarding settings, then enter the container port (default: 5432) and the local port to forward to. The address can be left as the default, localhost.

  3. Log in to the cluster locally using the forwarding port and address set in the previous step: psql -h localhost -p 15432 -U gpadmin -d postgres

  4. Create a test database and insert test data:

    -- Create the test database.
    CREATE DATABASE test_db;
    
    -- Connect to the new database (note: no trailing semicolon on psql meta-commands).
    \c test_db
    
    -- Create a table and insert test data.
    CREATE TABLE projects (
        project_id SERIAL PRIMARY KEY,
        project_name TEXT,
        project_manager TEXT,
        department TEXT
    ) USING heap;
    
    INSERT INTO projects (project_name, project_manager, department) VALUES
    ('Project Alpha', 'Alice', 'Engineering'),
    ('Project Beta', 'Bob', 'HR'),
    ('Project Gamma', 'Charlie', 'Sales');
    
  5. Delete the cloudberry-proxy-0 Pod and its corresponding PVC. At this point, test_db will be temporarily inaccessible.


  6. When the Pod's status automatically changes to Running, test_db becomes accessible again. A verification sketch follows these steps.

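To confirm that the test data survived the failover, you can re-establish the port forward and query the test table. A minimal sketch, reusing the example port from the steps above; <warehouse-namespace> is a placeholder for the data warehouse namespace from step 1:

    # Re-establish local port forwarding to the restarted proxy Pod.
    kubectl port-forward pod/cloudberry-proxy-0 15432:5432 -n <warehouse-namespace> &

    # The rows inserted before the failure should still be present.
    psql -h localhost -p 15432 -U gpadmin -d test_db -c 'SELECT * FROM projects;'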

Data warehouse node failover

This example simulates the failover of a SynxDB Elastic cluster data node after a failure.

  1. In k9s, go to the data warehouse namespace (e.g., cloud007-513d75d5-3033-4649-b913-07f6115ec41a+) and locate the data node Pod (e.g., wh9527-0).

  2. Go to the PVC list, locate the corresponding PVC, and delete it.


  3. Delete the Pod wh9527-0. At this point, test_db will be briefly inaccessible.


  4. The deleted Pod is automatically restarted. When its status changes to Running, database access is restored, as the sketch below illustrates.
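As with the other scenarios, this failover can be observed from the command line. A sketch assuming the example Pod above, with <warehouse-namespace> as a placeholder for the namespace from step 1 and the port forward from the previous example still active:

    # Watch the data node Pod restart after deletion.
    kubectl get pod wh9527-0 -n <warehouse-namespace> -w

    # Once the Pod is Running, the test data is reachable again through the proxy.
    psql -h localhost -p 15432 -U gpadmin -d test_db -c 'SELECT count(*) FROM projects;'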