Doing upserts in Postgres DBs

The case#

While creating a projection for a specific use-case, we faced the problem that there was no more room for optimization with indexes on the database. So we had to step back and re-think the whole data model which consisted of 4 tables each having more than 50 million rows. Until this point, we hat cheap inserts and expensive gets on the data model.

After the analysis of the data, we figured out that the approach we took, in the beginning, was a bit too naive. Well at a certain point when your data grows to an amount you have barely experience how to handle it or until now you could easily handle all data, and then it’s time to re-define the use case and optimize from both sides: insert and get.

It turned out that not all the data is relevant and the data was connected in certain ways, so we already reduced the number of tables from 4 down to 2. This improvement already reduced the cost for joining significantly, but on the other hand, the inserts were now attached to a certain logic, because the data had to be combined. Nevertheless, the service and database could handle this amount of requests to cope with the load.

On the other hand, we wanted the projection to be lightning fast. On a projection, there should usually be no joins if it is meant to be fast. So we took an intermediate step to update the projections (there were 2 in total) on each insert or update of the intermediate tables. Furthermore, we computed the eligibility of a data record for the projection. Additionally, we partitioned the projection tables. Fortunately Postgres 10 does much work on partioning already for you. You still have to create the partitions on your own but a CREATE TABLE IF EXISTS does the magic before each insert. In Postgres 9 you still had to write triggers. You still can in Postgres 10 as well if you want full flexibility. On AWS RDS the latest Postgres version available is currently 10, meaning if you cannot to the lovely INSERT ... ON CONFLICT DO UPDATE, but you have to implement it on your own. With our experience in Redshift , we already knew how to achieve this. (BTW Postgres 11 will support those upserts.)

Regarding Domain-driven Design the question rose where to put which logic: put the CREATE TABLE IF EXISTS for the partition as a stored procedure or in the infrastructure layer of the service? We decided on the infrastructure layer, although it might be slower than the stored procedure, but it is better testable and extendable and the code can be put under version control.

It was a good decision to split logic between inserts and update of the projection. The next step will be the optimization of the UI which is another topic. But the database layer is now at a stable state.

Conclusion#

There will be a point in time when all your data cannot be handled any by a single data source. Projections are a neat solution but they should be optimized as much as possible or kept at a specific amount of data, meaning old data should be deleted or archived.

© 2021