python postgresql bulk upsert

Psycopg2, Postgresql, Python: Fastest way to bulk-insert

stackoverflow.com › questions › 2271787 › psycopg2-postgresql-python-fastest-way-to-bulk-insert

Yeah, I would vote for COPY, providing you can write a file to the server's hard drive (not the drive the app is running on) as COPY will only read off the server.

Answer from Andy Shellam on Stack Overflow

Stack Overflow

stackoverflow.com › questions › 2271787 › psycopg2-postgresql-python-fastest-way-to-bulk-insert

Psycopg2, Postgresql, Python: Fastest way to bulk-insert - Stack Overflow

Top answer

1 of 9

Yeah, I would vote for COPY, providing you can write a file to the server's hard drive (not the drive the app is running on) as COPY will only read off the server.

2 of 9

There is a new psycopg2 manual containing examples for all the options.

The COPY option is the most efficient. Then the executemany. Then the execute with pyformat.

Medium

medium.com › @kennethhughesa › optimization-of-upsert-methods-in-postgresql-python-ac11b8471494

Optimization of Upsert Methods in PostgreSQL/Python | by Kenny Hughes | Medium

June 5, 2022 - Using PostgreSQL/Python to ingest Exchange Traded Fund (ETF) holdings via Upsert statements in an ETL (Extract, Transform, Load) housed in a Github Actions CI/CD Pipeline.

Discussions

python - Bulk Upsert with SQLAlchemy Postgres - Stack Overflow

I'm following the SQLAlchemy documentation here to write a bulk upsert statement with Postgres. For demonstration purposes, I have a simple table MyTable: class MyTable(base): __tablename__ = ' More on stackoverflow.com

stackoverflow.com

python - How do I increase the speed of a bulk UPSERT in postgreSQL? - Stack Overflow

I am trying to load many millions of data records, from multiple distinct sources, to a postgresql table with the following design: CREATE TABLE public.variant_fact ( variant_id bigint NOT NULL... More on stackoverflow.com

stackoverflow.com

python - Bulk upsert with SQLAlchemy - Stack Overflow

I am working on bulk upserting lots of data into PostgreSQL with SQLAlchemy 1.1.0b, and I'm running into duplicate key errors. from sqlalchemy import * from sqlalchemy.orm import sessionmaker from More on stackoverflow.com

stackoverflow.com

sql - Bulk/batch update/upsert in PostgreSQL - Stack Overflow

I'm writing a Django-ORM enchancement that attempts to cache models and postpone model saving until the end of the transaction. It's all almost done, however I came across an unexpected difficulty ... More on stackoverflow.com

stackoverflow.com

Videos

49:26

YouTube

Bulk Inserts with PostgreSQL: Four+ Methods for Efficient Data ...

April 21, 2023

06:07

YouTube

Evaluating Which Python library is Best suitable for Bulk insert ...

November 5, 2022

18:48

YouTube

psycopg2 python Bulk Insert with mogrify, Update, Delete rows ...

December 26, 2021

3.59K

stackoverflow.com

python - Bulk update of rows in Postgres DB using psycopg2 - Stack ...

View all

GitHub

gist.github.com › aisayko › dcacd546bcb17a740dec703de6b2377e

Postgresql bulk upsert in Python (Django) · GitHub

Postgresql bulk upsert in Python (Django). GitHub Gist: instantly share code, notes, and snippets.

Stack Overflow

stackoverflow.com › questions › 55368162 › bulk-upsert-with-sqlalchemy-postgres

python - Bulk Upsert with SQLAlchemy Postgres - Stack Overflow

Top answer

1 of 1

update_stmt = insert_stmt.on_conflict_do_update(
    index_elements=[MyTable.id],
    set_=dict(data=values)
)

index_elements should either be a list of strings or a list of column objects. So either [MyTable.id] or ['id'] (This is correct)

set_ should be a dictionary with column names as keys and valid sql update objects as values. You can reference values from the insert block using the excluded attribute. So to get the result you are hoping for here you would want set_={'test_value': insert_stmt.excluded.test_value} (The error you made is that data= in the example isn't a magic argument... it was the name of the column on their example table)

So, the whole thing would be

update_stmt = insert_stmt.on_conflict_do_update(
    index_elements=[MyTable.id],
    set_={'test_value': insert_stmt.excluded.test_value}
)

Of course, in a real world example I usually want to change more then one column. In that case I would do something like...

update_columns = {col.name: col for col in insert_stmt.excluded if col.name not in ('id', 'datetime_created')}
update_statement = insert_stmt.on_conflict_do_update(index_elements=['id'], set_=update_columns)

(This example would overwrite every column except for the id and datetime_created columns)

PostgreSQL

postgresql.org › message-id › CA+mi_8bGfzhEr0+t2FjZGDRnQP45MC1E3C_djdBem_xZQXsD8A@mail.gmail.com

PostgreSQL: Re: Fastest way to insert/update many rows

August 12, 2014 - Yes: using copy to populate a temp table and then update via a query is the fastest way to bulk-update in postgres, regardless of the psycopg usage. > I have to set one column in each row, is there a way to update cursors like in PL/pgSQL's ...

sqlpey

sqlpey.com › python › bulk-upsert-sqlalchemy-postgresql

How to Perform Bulk Upsert with SQLAlchemy in PostgreSQL - …

December 6, 2024 - Learn how to efficiently perform a bulk upsert (update or insert) in PostgreSQL using Python and SQLAlchemy.

Find elsewhere

Google Bing Mojeek

GitHub

gist.github.com › luke › 5697511

I needed to upsert (insert or update) bajillions of records into postgresql. After trying various libs including upsert (which was slow as hell) I ended up doing a bit of research and trying 3 different methods. This one won. While I'm manually building the sql string no user data is passed in. Its loaded via the copy from statement as CSV. Call it by passing psycopg2 cursor, table name, selector fields ( your primary keys ), setting fields (the fields you want to set) and data. The data should be a list of

August 22, 2018 - I needed to upsert (insert or update) bajillions of records into postgresql. After trying various libs including upsert (which was slow as hell) I ended up doing a bit of research and trying 3 different methods. This one won. While I'm manually building the sql string no user data is passed in.

GitHub

gist.github.com › amorgun › 2a04764f0fc80e646efeb79cd7ad0b70

SqlAlchemy postgres bulk upsert · GitHub

Is there a way to pass to set_ in on_conflict_do_update a list of values from items, instead of a default value for all the records being upserted? ... session.execute( postgresql.insert(MyModel.__table__) .values(items) .on_conflict_do_update( index_elements=MyModel.__table__.primary_key.columns, set_=items ) )

Naysan

naysan.ca › 2020 › 05 › 09 › pandas-to-postgresql-using-psycopg2-bulk-insert-performance-benchmark

Pandas to PostgreSQL using Psycopg2: Bulk Insert Performance Benchmark | Naysan Saran

May 9, 2020 Comments Off Coding Databases Pandas-PostgreSQL Python · If you have ever tried to insert a relatively large dataframe into a PostgreSQL table, you know that single inserts are to be avoided at all costs because of how long they take to execute. There are multiple ways to do bulk inserts with Psycopg2 (see this Stack Overflow page and this blog post for instance).

Stack Overflow

stackoverflow.com › questions › 66557655 › how-do-i-increase-the-speed-of-a-bulk-upsert-in-postgresql

python - How do I increase the speed of a bulk UPSERT in postgreSQL? - Stack Overflow

Top answer

1 of 2

Sorting arglist by "variant_name" and "start" (the first two columns in the index) should make sure that most of the index lookups will be hitting already cached pages. Having the table also be clustered on that index would help make sure the table pages are also accessed in a cache friendly way (although it won't stay clustered very well in the face of new data).

Also, your index is gratuitously double the size it needs to be. There is no point in doing INCLUDE on a column that is already part of the main part of the index. That is going to cost you CPU and IO to format and write the data (and the WAL) and also reduce the amount of data which fits in cache.

2 of 2

Turning off WAL (setting the table UNLOGGED) means that the table will be empty after a crash, because it cannot be recovered. If you are considering running ALTER TABLE later to change it to a LOGGED table, know that this operation will dump the whole table into WAL, so you won't win anything.

For a simple statement like that on an unlogged table, the only way to speed it up are:

drop all indexes, triggers and constraints except variant_fact_unique – but creating them again will be expensive, so you might not win overall
make sure you have fast storage and enough RAM

Stack Overflow

stackoverflow.com › questions › 38579049 › bulk-upsert-with-sqlalchemy

python - Bulk upsert with SQLAlchemy - Stack Overflow

Top answer

1 of 1

from https://stackoverflow.com/a/26018934/465974

After I found this command, I was able to perform upserts, but it is worth mentioning that this operation is slow for a bulk "upsert".

The alternative is to get a list of the primary keys you would like to upsert, and query the database for any matching ids:

DNMTechs

dnmtechs.com › performing-a-bulk-upsert-in-postgresql-using-sqlalchemy-in-python-3

Performing a Bulk Upsert in PostgreSQL using SQLAlchemy in Python 3 – DNMTechs – Sharing and Storing Technology Knowledge

Performing a bulk upsert operation in PostgreSQL using SQLAlchemy in Python 3 can be achieved by generating a multi-row `INSERT` statement with an `ON CONFLICT` clause. This allows us to insert new rows and update existing rows in a single operation, reducing the number of round trips to the ...

Stack Overflow

stackoverflow.com › questions › 7019831 › bulk-batch-update-upsert-in-postgresql

sql - Bulk/batch update/upsert in PostgreSQL - Stack Overflow

Top answer

1 of 7

155

Bulk insert

You can modify the bulk insert of three columns by @Ketema:

INSERT INTO "table" (col1, col2, col3)
  VALUES (11, 12, 13) , (21, 22, 23) , (31, 32, 33);

It becomes:

INSERT INTO "table" (col1, col2, col3)
  VALUES (unnest(array[11,21,31]), 
          unnest(array[12,22,32]), 
          unnest(array[13,23,33]))

Replacing the values with placeholders:

INSERT INTO "table" (col1, col2, col3)
  VALUES (unnest(?), unnest(?), unnest(?))

You have to pass arrays or lists as arguments to this query. This means you can do huge bulk inserts without doing string concatenation (and all its hazzles and dangers: sql injection and quoting hell).

Bulk update

PostgreSQL has added the FROM extension to UPDATE. You can use it in this way:

update "table" 
  set value = data_table.new_value
  from 
    (select unnest(?) as key, unnest(?) as new_value) as data_table
  where "table".key = data_table.key;

The manual is missing a good explanation, but there is an example on the postgresql-admin mailing list. I tried to elaborate on it:

create table tmp
(
  id serial not null primary key,
  name text,
  age integer
);

insert into tmp (name,age) 
values ('keith', 43),('leslie', 40),('bexley', 19),('casey', 6);

update tmp set age = data_table.age
from
(select unnest(array['keith', 'leslie', 'bexley', 'casey']) as name, 
        unnest(array[44, 50, 10, 12]) as age) as data_table
where tmp.name = data_table.name;

There are also other posts on StackExchange explaining UPDATE...FROM.. using a VALUES clause instead of a subquery. They might by easier to read, but are restricted to a fixed number of rows.

2 of 7

I've used 3 strategies for batch transactional work:

Generate SQL statements on the fly, concatenate them with semicolons, and then submit the statements in one shot. I've done up to 100 inserts in this way, and it was quite efficient (done against Postgres).
JDBC has batching capabilities built in, if configured. If you generate transactions, you can flush your JDBC statements so that they transact in one shot. This tactic requires fewer database calls, as the statements are all executed in one batch.
Hibernate also supports JDBC batching along the lines of the previous example, but in this case you execute a flush() method against the Hibernate Session, not the underlying JDBC connection. It accomplishes the same thing as JDBC batching.

Incidentally, Hibernate also supports a batching strategy in collection fetching. If you annotate a collection with @BatchSize, when fetching associations, Hibernate will use IN instead of =, leading to fewer SELECT statements to load up the collections.

Schinckel

schinckel.net › 2019 › 12 › 13 › asyncpg-and-upserting-bulk-data

asyncpg and upserting bulk data - Schinckel.net

December 13, 2019 - I’d normally just use psycopg2, but since I was using Python 3, I thought I’d try out asyncpg. ... import asyncio import asyncpg async def write(): conn = await asyncpg.connect('postgres://:@:/database') await conn.execute('query', [params]) await conn.close() asyncio.get_event_loop().run_until_complete(write()) So, that part was fine. There are more hoops to jump through because async, but meh. But I had an interesting case: I wanted to do a bulk insert, and I didn’t know how many records I was going to be inserting.

Overflow

overflow.no › blog › 2025 › 1 › 5 › using-staging-tables-for-faster-bulk-upserts-with-python-and-postgresql

Using staging tables for faster bulk upserts with Python and PostgreSQL | Overflow

This allows for moving larger amounts of data faster, using copy_to_table(), and then doing the upsert in-engine within a transaction, ensuring rollback if anything unexpected happens. PostgreSQL also supports temporary tables, which can be automatically cleaned up after the transaction by using the ON COMMIT DROP clause.

Stack Overflow

stackoverflow.com › questions › 64679758 › anyway-to-upsert-database-using-postgresql-in-python

sql - Anyway to Upsert database using PostgreSQL in Python - Stack Overflow

Top answer

1 of 2

I would try something like this:

sql = '''INSERT INTO temp.tickets
    (id, created_at, updated_at, emails, status)
VALUES
    (%s, %s, %s, %s, %s)
    ON CONFLICT (id)
    DO UPDATE SET ( emails, status) values (%s,%s) 
    
    '''

cursor = cm.cursor()
## cm is a custom module
cursor.execute(sql, (ticket['id'],
                     ticket['created_at'],
                     ticket['updated_at'],
                     ticket['emails'], 
                     ticket['status'],
                     ticket['emails'], 
                      ticket['status'] )

Thre number of %s must match the number of parameters.

2 of 2

-1

When Postgres encounters a captured conflict it basically creates a record called EXCLUDED that contains the values you attempted to insert, You can refer to this record in DO UPDATE. Try the following:

INSERT INTO temp.tickets
    (id, created_at, updated_at, emails, status)
VALUES
    (%s, %s, %s, %s, %s)
    ON CONFLICT (id)
    DO UPDATE 
          SET emails = excluded.emails
            , status = excluded.status
            , updated_at = excluded.updated_at    -- my assumption.  
 ...

You will have to format is into the requirements of your source language.

reddit.com › r/dataengineering › processing and inserting large sets of data into postgres with python

r/dataengineering on Reddit: Processing and Inserting Large sets of data into Postgres with Python

June 23, 2024 -

I am migrating data from a Microsoft Access Database (.mdb) into a Postgres Database. The dataset is very large and I need to process the data before loading it into the database(Postgres). My approach is very slow and I need some help making it faster.

I've created an insert method which when called establishes a connection to the Postgres Database first and then executes the insert query before it closes the connection.

So far I've tried the following approaches: Single Insert: Where I process a single row of data and insert it. This approach was slow because, anytime the insert method is called, a connection needs to be established before the insertion is done and the connection is closed. This happens for every single row, making it slow.

Then I tried Bulk Insert: Where I process all the columns before calling the insert method. I created a bulk insert method for this approach using pandas and SQLAlchemy. This approach, however, takes a very long time too because even though the connection is established only once, all the data must be processed first before the insertion happens.

Now I'm thinking the best approach will be to use Batching: Where I process a chunk of the data, insert, and then go back to process the rest and insert. I realized combining multithreading with Batching can help to improve the speed significantly.

Please I need suggestions on which approach will be best. Even if it is different from the one I have mentioned.

Top answer

1 of 5

Does "processing" mean data transformation? If so, I would load the data in raw into a staging/replication area and then transform/process between staging and the production table.

2 of 5

Thank you all for your contributions. I just found that the best approach will be to convert it to CSV and use COPY to import it into Postgres. I haven't tried this approach yet but found it to be the best based on responses from here and other forums. With the processing of the data, I may use some parallelism if need be. Thank you all once again.

Stack Overflow

stackoverflow.com › questions › 25955200 › sqlalchemy-performing-a-bulk-upsert-if-exists-update-else-insert-in-postgr

python - SQLAlchemy - performing a bulk upsert (if exists, update, else insert) in postgresql - Stack Overflow

Top answer

1 of 7

There is an upsert-esque operation in SQLAlchemy:

db.session.merge()

After I found this command, I was able to perform upserts, but it is worth mentioning that this operation is slow for a bulk "upsert".

The alternative is to get a list of the primary keys you would like to upsert, and query the database for any matching ids:

# Imagine that post1, post5, and post1000 are posts objects with ids 1, 5 and 1000 respectively
# The goal is to "upsert" these posts.
# we initialize a dict which maps id to the post object

my_new_posts = {1: post1, 5: post5, 1000: post1000} 

for each in posts.query.filter(posts.id.in_(my_new_posts.keys())).all():
    # Only merge those posts which already exist in the database
    db.session.merge(my_new_posts.pop(each.id))

# Only add those posts which did not exist in the database 
db.session.add_all(my_new_posts.values())

# Now we commit our modifications (merges) and inserts (adds) to the database!
db.session.commit()

2 of 7

You can leverage the on_conflict_do_update variant. A simple example would be the following:

from sqlalchemy.dialects.postgresql import insert

class Post(Base):
    """
    A simple class for demonstration
    """

    id = Column(Integer, primary_key=True)
    title = Column(Unicode)

# Prepare all the values that should be "upserted" to the DB
values = [
    {"id": 1, "title": "mytitle 1"},
    {"id": 2, "title": "mytitle 2"},
    {"id": 3, "title": "mytitle 3"},
    {"id": 4, "title": "mytitle 4"},
]

stmt = insert(Post).values(values)
stmt = stmt.on_conflict_do_update(
    # Let's use the constraint name which was visible in the original posts error msg
    constraint="post_pkey",

    # The columns that should be updated on conflict
    set_={
        "title": stmt.excluded.title
    }
)
session.execute(stmt)

See the Postgres docs for more details about ON CONFLICT DO UPDATE.

See the SQLAlchemy docs for more details about on_conflict_do_update.

Side-Note on duplicated column names

The above code uses the column names as dict keys both in the values list and the argument to set_. If the column-name is changed in the class-definition this needs to be changed everywhere or it will break. This can be avoided by accessing the column definitions, making the code a bit uglier, but more robust:

coldefs = Post.__table__.c

values = [
    {coldefs.id.name: 1, coldefs.title.name: "mytitlte 1"},
    ...
]

stmt = stmt.on_conflict_do_update(
    ...
    set_={
        coldefs.title.name: stmt.excluded.title
        ...
    }
)

Trvrm

trvrm.github.io › bulk-psycopg2-inserts.html

Efficient Postgres Bulk Inserts using Psycopg2 and Unnest

October 22, 2015 - def bulkInsertRate(count): tester=Tester(count) start=time.time() tester.fastInsert() duration=time.time()-start return count/duration def normalInsertRate(count): tester=Tester(count) start=time.time() tester.insert() duration=time.time()-start return count/duration

PyPI

pypi.org › project › django-pg-bulk-update

django-pg-bulk-update · PyPI

Django extension, executing bulk update operations for PostgreSQL

      » pip install django-pg-bulk-update

Published Jun 30, 2024

Version 3.7.3

Homepage https://github.com/M1hacka/django-pg-bulk-update