Anonymize Data Using Anon
Anon is a SynxDB extension based on PostgreSQL Anonymizer. It provides data masking features for SynxDB to prevent sensitive data leakage.
Data masking, also known as data anonymization, refers to the process of removing sensitive information from data. This sensitive information includes personal names, phone numbers, addresses, ID numbers, and so on. Anon uses the SECURITY LABEL feature provided by PostgreSQL to specify masking rules and applies these rules to designated database objects to achieve data masking. For example, an original phone number 0609110911
becomes 06******11
after data masking.
How it works
Anon primarily uses PostgreSQL’s SECURITY LABEL feature to store masking rules. It creates corresponding views for tables in the original schema under a new schema. When a user who needs sensitive data hidden connects, they are actually using the views with the same names in the new schema instead of the tables in the original schema. This is how masking rules are applied to specified database objects.
Install Anon
Install using the gppkg package.
# Install gppkg -i anon-*.gppkg # Uninstall gppkg -r anon*
Verification. After installation, if the extension is created successfully via
psql
, the installation is successful.# Command psql postgres -c "ALTER DATABASE postgres SET session_preload_libraries = 'anon';" psql postgres -c "CREATE EXTENSION IF NOT EXISTS anon CASCADE;" # Expected result NOTICE: installing required extension "pgcrypto" WARNING: referential integrity (FOREIGN KEY) constraints are not supported in |product_name|, will not be enforced CREATE EXTENSION
User guide
Anon’s main feature is dynamic masking. The following section briefly introduces how to use dynamic masking and lists the available masking functions.
Warning
Anon also provides a static masking feature, but this feature is very “dangerous” because it directly updates the database with the masked data, causing the loss of original data. Therefore, it is not recommended to use this feature in a production environment.
However, this feature is useful for testing scenarios, and many existing tests use static masking. The relevant functions are anon.anonymize_database()
, anon.anonymize_table()
, and anon.anonymize_column()
.
Dynamic masking
Dynamic masking hides sensitive information for specified users. Here is a specific example.
Specify the shared libraries to be loaded at the start of a connection. Note: After setting this parameter, you must open a new connection for it to take effect.
ALTER DATABASE :DBNAME SET session_preload_libraries = 'anon';
Create the extension and initialize it.
CREATE EXTENSION IF NOT EXISTS pgcrypto; CREATE EXTENSION IF NOT EXISTS anon CASCADE; SELECT anon.init();
Create a table for testing.
CREATE TABLE people ( id TEXT, firstname TEXT, lastname TEXT, phone TEXT); INSERT INTO people VALUES ('T1','Sarah', 'Conor','0609110911');
Specify masking rules for the columns of the people table.
SECURITY LABEL FOR anon ON COLUMN people.lastname IS 'MASKED WITH FUNCTION anon.fake_last_name()'; SECURITY LABEL FOR anon ON COLUMN people.phone IS 'MASKED WITH FUNCTION anon.partial(phone,2,$$******$$,2)';
Create a new user and designate them as “MASKED”.
CREATE ROLE skynet LOGIN; ALTER ROLE skynet WITH PASSWORD '123456'; SECURITY LABEL FOR anon ON ROLE skynet IS 'MASKED';
Start dynamic masking.
SELECT anon.start_dynamic_masking();
Switch to the new user to view the masked data.
\c - skynet SELECT * FROM people; -- The test result is as follows dynamic_masking=> SELECT * FROM people; id | firstname | lastname | phone ----+-----------+----------+------------ T1 | Sarah | Watsica | 06******11 (1 row)
Attention
If a function like anon.fake_last_name()
is specified for masking, you must disable the ORCA optimizer before querying the data.
set optimizer to off;
This is because the fake_last_name
function queries a data table, which is not currently supported. If your custom masking function has similar behavior, you also need to disable the ORCA optimizer when querying.
As you can see, apart from common operations like creating and initializing the extension, the basic steps for dynamic masking are: specify masking rules for the database objects you want to mask (for example, a column), designate a user as “MASKED”, and start dynamic masking. Finally, when that user logs in, the data they see will be masked.
Masking functions
Anon provides the following categories of masking functions. Different functions can be used for different data types.
Destruction
In many cases, the best way to hide the content of a column is to replace all its values with a static value. For example, you can replace an entire column with the word “CONFIDENTIAL”.
SECURITY LABEL FOR anon
ON COLUMN users.address
IS 'MASKED WITH VALUE ''CONFIDENTIAL'' ';
Adding noise
This series of functions adds a certain proportion or value of deviation to the original value. For example, adding a +/-10% deviation to the salary column.
anon.noise(original_value,ratio)
:original_value
can be of type integer, bigint, or double precision.anon.dnoise(original_value, interval)
:original_value
can be of type date, timestamp, or time.
Note: A user might try to deduce the original value by making multiple attempts and averaging the results.
Randomization
Anon provides a large number of functions to generate completely random data.
Basic random values
Function |
Description |
---|---|
anon.random_date() |
Returns a date. |
anon.random_string(n) |
Returns a TEXT value with n characters. |
anon.random_zip() |
Returns a five-digit code. |
anon.random_phone(p) |
Returns an eight-digit phone number with p as the prefix. |
anon.random_hash(seed) |
Given a seed, returns a hash of a random string. |
Random between
Function |
Description |
---|---|
anon.random_date_between(d1,d2) |
Returns a date between d1 and d2. |
anon.random_int_between(i1,i2) |
Returns an integer between i1 and i2. |
anon.random_bigint_between(b1,b2) |
Returns a bigint between b1 and b2. |
Note that the values returned by these functions are inclusive of the boundaries. For example, anon.random_int_between(1,3)
can return 1, 2, or 3.
Random in array
Function |
Description |
---|---|
|
Returns an element from the given array. For example, |
Random in enum
Function |
Description |
---|---|
|
Returns any value from the range of the specified enum type. |
Random in range
Function |
Description |
---|---|
|
Returns 5 of type INT. |
|
Returns 7 of type BIGINT. |
|
Returns a NUMERIC value between 0.1 and 0.9. |
|
Returns a date in the year 2001. |
|
Returns a timestamp in October 2022. |
|
Returns a timestamp with time zone in October 2022. |
Faking
Faking also generates random content, but unlike Randomization, Faking generates data that looks plausible. Before using this series of functions, you must first call the anon.init()
function to import the random dataset.
This type of function includes:
anon.fake_address()
anon.fake_city()
anon.fake_country()
anon.fake_company()
anon.fake_email()
anon.fake_first_name()
anon.fake_iban()
anon.fake_last_name()
anon.fake_postcode()
anon.fake_siret()
For columns of type TEXT and VARCHAR, you can also use the classic Lorem Ipsum generator:
Function |
Description |
---|---|
|
Returns 5 paragraphs of text. |
|
Returns 2 paragraphs of text. |
|
Returns 4 paragraphs of text. |
|
Returns 20 words. |
|
Returns 7 characters. |
|
Returns a string of the same length as the original string. |
Pseudonymization
Pseudonymization is similar to Faking, but Pseudonymization generates deterministic values, meaning that given a seed and a salt, it will always generate the same value. You also need to call the anon.init()
function before using this category of functions.
anon.pseudo_first_name('seed','salt')
anon.pseudo_last_name('seed','salt')
anon.pseudo_email('seed','salt')
anon.pseudo_city('seed','salt')
anon.pseudo_country('seed','salt')
anon.pseudo_company('seed','salt')
anon.pseudo_iban('seed','salt')
anon.pseudo_siret('seed','salt')
The second parameter, salt
, is optional. If no salt
is given, a random one will be chosen.
Generic hashing
Function |
Description |
---|---|
|
Returns the hash of a given value based on a salt and a hash algorithm. |
|
Choose a salt and a hash algorithm from a predefined list. |
A random salt is generated when the extension is initialized, and the default hash algorithm is sha512. You can change these two values with the following two functions.
anon.set_secret_salt(value)
anon.set_algorithm(value)
Possible values are: md5, sha1, sha224, sha256, sha384, or sha512.
Note: The salt and the hash algorithm used should be stored with the same level of security as the original dataset, otherwise attackers might be able to calculate the original data from these values. Generally, it is more recommended to use anon.hash()
, because the salt does not appear directly in the masking rule.
Partial scrambling
Partial scrambling ignores a part of the data. For example, a credit card number can be replaced with 40XX XXXX XXXX XX96
.
Function |
Description |
---|---|
anon.partial(‘abcdefgh’,1,’xxxx’,3) |
returns ‘axxxxfgh’. |
anon.partial_email(‘daamien@gmail.com’) |
becomes ‘da*@gm**.com’. |
Generalization
Generalization refers to replacing a precise value with a range. For example, if Lin Qian’ge’s actual age is 21, you can say that her age is between 20 and 30. Generalization will change the data type of the column, so the related functions cannot be used with dynamic masking. The related functions are as follows:
generalize_int4range(value, step)
generalize_int8range(value, step)
generalize_numrange(value, step)
Value is the data to be generalized, and step is the size of each range.