Load Data Using PXF
PXF is an extensible framework that allows SynxDB to query external data files, such as those in Hadoop or cloud object stores, without needing to load them into the database first. The metadata for these files is not managed by the database.
With PXF, you can achieve high-performance parallel data access between your SynxDB cluster and external data sources. It is particularly useful for scenarios where you need to analyze large datasets stored in systems like Hadoop or cloud object stores (for example, Amazon S3) without a separate ETL process.
Typical use cases for PXF include:
Querying data in-place: Access data in external systems for ad-hoc analysis and data exploration without moving it.
High-speed data loading: Use PXF as a parallel pipeline to efficiently ingest large datasets into SynxDB.
Data federation: Execute queries that join local tables in SynxDB with data from external sources.
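The in-place query and federation use cases above work through PXF external tables. As a hedged sketch (the table name, columns, and HDFS path are illustrative assumptions, not taken from this guide), an external table over a delimited file in HDFS might be declared like this:

```shell
# Hypothetical example: DDL for a readable PXF external table over a
# comma-delimited file in HDFS. Table name, columns, and path are
# assumptions for illustration only.
PXF_DDL=$(cat <<'SQL'
CREATE EXTERNAL TABLE sales_ext (id int, amount numeric)
LOCATION ('pxf://data/sales?PROFILE=hdfs:text')
FORMAT 'TEXT' (DELIMITER ',');
SQL
)

# Against a running SynxDB you might apply it with, e.g.:
#   echo "$PXF_DDL" | psql -d mydb
echo "$PXF_DDL"
```

Once such a table exists, it can be queried or joined against local tables like any other table.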
PXF comes with built-in connectors for accessing data from various sources, including:
HDFS files
Hive tables
HBase tables
JDBC-accessible databases
You can also create your own connectors to other data storage or processing engines.
This document provides a guide on how to install, configure, and run PXF for use with SynxDB.
Install and configure PXF
Follow these steps to install and set up PXF. These operations are typically performed on the coordinator node of your SynxDB cluster.
Prerequisites
Before installing PXF, ensure the following dependencies are installed on your system:
JDK 1.8 or JDK 11
A running SynxDB instance.
Step 1: Install the PXF package
First, obtain the PXF RPM package (for example, pxf-1.0.0-1.x86_64.rpm) from Synx Data Labs Technical Support. Then, install it using yum or dnf:

sudo yum install pxf-1.0.0-1.x86_64.rpm
Step 2: Set Up Your Environment
After the installation, you need to configure the necessary environment variables for PXF to function correctly.
Set JAVA_HOME:

export JAVA_HOME=<path_to_your_java_home>
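If you are unsure what to use for JAVA_HOME, one common approach is to work backwards from the java binary on your PATH. The helper below is a small sketch (not part of PXF or SynxDB) that strips the trailing bin/java from a resolved java path:

```shell
# Hypothetical helper: derive a JAVA_HOME candidate from the full path of a
# java binary by stripping the trailing /bin/java component.
derive_java_home() {
  dirname "$(dirname "$1")"
}

# Example with a typical OpenJDK layout (path is illustrative):
derive_java_home /usr/lib/jvm/java-11-openjdk/bin/java

# In practice you might resolve symlinks first, e.g.:
#   export JAVA_HOME=$(derive_java_home "$(readlink -f "$(which java)")")
```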
Source your SynxDB environment. The path should point to your SynxDB installation directory.
source /usr/local/synxdb/greenplum_path.sh
Note: The path /usr/local/synxdb/ is an example. Adjust it to match your actual SynxDB installation path.

Set the GPHOME, PXF_HOME, and PXF_BASE environment variables:

export GPHOME=/usr/local/synxdb
export PXF_HOME=/usr/local/pxf-1.0.0
export PXF_BASE=${HOME}/pxf-base

Important: GPHOME must point to your SynxDB installation directory, and PXF_HOME should point to the directory created by the RPM installation. It is highly recommended to set PXF_BASE to a location outside of PXF_HOME. If PXF_BASE is not set, it defaults to PXF_HOME, and server configurations might be deleted upon re-installation.
Step 3: Run PXF
After setting up the environment, you need to initialize and start the PXF service.
Add the PXF binaries to your PATH. You can add this line to your .bashrc or .zshrc profile for convenience:

export PATH=${PXF_HOME}/bin:$PATH

Prepare and start the PXF service. The pxf prepare command initializes the PXF configuration based on your environment and only needs to be run once:

pxf prepare
pxf start
If the service fails to start due to a port conflict on the default port (5888), you can specify a different port. First, add the following line to the ${PXF_BASE}/conf/pxf-application.properties file:

server.port=<new_port_number>

Then, add the following line to the ${PXF_BASE}/conf/pxf-env.sh file so that the control scripts also use the new port:

export PXF_PORT=<new_port_number>

After making these changes, run pxf start again.
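Before writing a replacement port into the two files above, it may help to confirm the value is usable at all. The validator below is a hedged sketch (an assumption, not part of the PXF tooling): it accepts only numeric ports in the unprivileged range, since PXF normally runs as a non-root user:

```shell
# Hypothetical validator: a candidate PXF port must be all digits and fall
# in the unprivileged range 1024-65535.
valid_pxf_port() {
  local port="$1"
  case "$port" in
    ''|*[!0-9]*) return 1 ;;   # empty or contains a non-digit
  esac
  [ "$port" -ge 1024 ] && [ "$port" -le 65535 ]
}

# Example: check the PXF default before falling back to an alternative.
if valid_pxf_port 5888; then
  echo "5888 ok"
fi
```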
After these steps, PXF is running and ready to be used with SynxDB to access external data.
Next steps
This guide covers the basic installation of PXF. PXF is a powerful tool with many connectors and configuration options. For more on the following topics, refer to the apache/cloudberry-pxf documentation:

Configuring PXF for different data sources (including HDFS, Hive, and S3)
Using PXF with external tables
Advanced topics and troubleshooting