Dear Andrew @andatki,
I have found a working solution for the problem discussed in your book related to cloning tables without constraints, copying rows, and recreating constraints. I would like to share this solution, which may help other readers who are facing similar issues.
Solution: SCRUB_BATCHES Procedure
Below is the SCRUB_BATCHES
procedure I created. This procedure iterates over the current tables, scrubs sensitive data using predefined functions, and copies the data to a new table:
-- Ensure the hstore extension is enabled
-- CREATE EXTENSION IF NOT EXISTS hstore WITH SCHEMA rideshare;
CREATE OR REPLACE PROCEDURE SCRUB_BATCHES(schema_name text, tablename text)
LANGUAGE plpgsql
AS $$
DECLARE
current_id INT;
max_id INT;
batch_size INT := 1000;
rows_inserted INT;
column_list text;
value_list text;
attr_rec RECORD;
scrub_functions hstore := 'name => SCRUB_NAME,
-- Additional functions follow
email => SCRUB_EMAIL,
secret => SCRUB_SECRET,
-- Additional functions follow
ssn => SCRUB_SSN';
function_name text;
key text;
value text;
BEGIN
-- Get the minimum and maximum IDs for the specified table
EXECUTE format('SELECT MIN(id), MAX(id) FROM %I.%I', schema_name, tablename)
INTO current_id, max_id;
-- Loop over the table in batches of `batch_size`
WHILE current_id IS NOT NULL AND current_id <= max_id LOOP
-- Reset the column and value lists for each batch
column_list := 'id';
value_list := 'id';
-- Retrieve the list of attributes for the specified table
FOR attr_rec IN
SELECT a.attname, col_description(a.attrelid, a.attnum) as comment
FROM pg_attribute a
WHERE a.attrelid = format('%I.%I', schema_name, tablename)::regclass
AND a.attnum > 0
AND NOT a.attisdropped
AND a.attname NOT IN ('id') -- Exclude 'id' column from updates
LOOP
-- Determine the appropriate scrubbing function based on attribute name patterns
function_name := NULL;
FOR key, value IN SELECT * FROM each(scrub_functions)
LOOP
IF attr_rec.attname NOT IN ('id') AND attr_rec.attname NOT ILIKE '%_id'
AND attr_rec.attname ILIKE '%' || key || '%' THEN
function_name := value;
EXIT;
END IF;
END LOOP;
-- Append attribute to the column and value lists with the determined scrubbing function
column_list := column_list || format(', %I', attr_rec.attname);
IF function_name IS NOT NULL THEN
value_list := value_list || format(', CASE WHEN %I.%I.%I IS NOT NULL THEN %s(%I.%I.%I) ELSE %I.%I.%I END',
schema_name, tablename, attr_rec.attname,
function_name, schema_name, tablename, attr_rec.attname,
schema_name, tablename, attr_rec.attname);
ELSE
-- Default case if no specific scrubbing function is defined
value_list := value_list || format(', %I.%I.%I', schema_name, tablename, attr_rec.attname);
END IF;
END LOOP;
-- Execute the insert statement with the dynamically built column and value lists
EXECUTE format('INSERT INTO %I.%I_copy (%s) SELECT %s FROM %I.%I WHERE id >= %L AND id < %L',
schema_name, tablename,
column_list,
value_list,
schema_name, tablename, current_id::bigint, (current_id + batch_size)::bigint);
GET DIAGNOSTICS rows_inserted = ROW_COUNT;
COMMIT;
RAISE NOTICE 'Table: %, current_id: % - Number of rows inserted: %', tablename, current_id, rows_inserted;
current_id := current_id + batch_size + 1;
END LOOP;
END $$;
This SCRUB_BATCHES
procedure can be executed for iterating over the current tables as follows:
CALL SCRUB_BATCHES(schema_name, table_rec.tablename);
It can be easily changed to use a batched UPDATE
statement, as explained in the book on page 64
.
Issues with Existing Constraints and Indexes
Following the book explanations, I was able to use this procedure on all tables. However, I still encountered issues with existing constraints that cascade on the old tables and the new, copied tables.
Example constraints:
table_name | foreign_key | pg_get_constraintdef
--------------------------+---------------------------+-----------------------------------------------
trip_positions | trip_positions_pkey_copy | PRIMARY KEY (id)
trip_positions | fk_rails_9688ac8706_copy | FOREIGN KEY (trip_id) REFERENCES trips_old(id)
trip_positions_old | trip_positions_pkey | PRIMARY KEY (id)
trip_positions_old | fk_rails_9688ac8706 | FOREIGN KEY (trip_id) REFERENCES trips_old(id)
-- Additional constraints follow
Example indexes:
indexname | tablename
--------------------------+---------------------
trip_positions_pkey | trip_positions_old
trip_positions_pkey_copy | trip_positions
-- Additional indexes follow
After several hours of working on this script, I could not find a good way to remove these constraints and indexes automatically.
If you would like to see the full example, I can upload it to the forum for further discussion and review.
Thank you for your time and consideration.
Best regards,
Viktor
PS: If found the chapter order Performing Database Maintenance
and Performing Updates in Batches
not logical and weird. I would expect it in opposite order. Also, a missed a notice, that VACUUM
cannot run in a transaction and cannot run in a function or in a procedure. Which is crucial for the provided examples.