Yeah, I would vote for COPY, providing you can write a file to the server's hard drive (not the drive the app is running on) as COPY will only read off the server.

Answer from Andy Shellam on Stack Overflow
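If writing a file to the server's own disk isn't possible, COPY FROM STDIN streams the data from the client instead. A minimal sketch with psycopg2 (the table name "staging" and the calling convention around `conn` are assumptions, not from the answer):

```python
import csv
import io

def rows_to_csv_buffer(rows):
    """Serialize an iterable of row tuples into an in-memory CSV buffer."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    buf.seek(0)
    return buf

def copy_rows(conn, rows, table="staging"):
    """Stream rows into `table` with COPY FROM STDIN (no server-side file needed)."""
    with conn.cursor() as cur:
        cur.copy_expert(
            f"COPY {table} FROM STDIN WITH (FORMAT csv)",
            rows_to_csv_buffer(rows),
        )
    conn.commit()
```

Because the data goes over the existing client connection, this keeps COPY's speed without requiring filesystem access on the database host.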
๐ŸŒ
Medium
medium.com โ€บ @kennethhughesa โ€บ optimization-of-upsert-methods-in-postgresql-python-ac11b8471494
Optimization of Upsert Methods in PostgreSQL/Python | by Kenny Hughes | Medium
June 5, 2022 - Using PostgreSQL/Python to ingest Exchange Traded Fund (ETF) holdings via Upsert statements in an ETL (Extract, Transform, Load) housed in a Github Actions CI/CD Pipeline.
Discussions

python - Bulk Upsert with SQLAlchemy Postgres - Stack Overflow
I'm following the SQLAlchemy documentation here to write a bulk upsert statement with Postgres. For demonstration purposes, I have a simple table MyTable: class MyTable(base): __tablename__ = '
python - How do I increase the speed of a bulk UPSERT in postgreSQL? - Stack Overflow
I am trying to load many millions of data records, from multiple distinct sources, to a postgresql table with the following design: CREATE TABLE public.variant_fact ( variant_id bigint NOT NULL...
python - Bulk upsert with SQLAlchemy - Stack Overflow
I am working on bulk upserting lots of data into PostgreSQL with SQLAlchemy 1.1.0b, and I'm running into duplicate key errors. from sqlalchemy import * from sqlalchemy.orm import sessionmaker from
sql - Bulk/batch update/upsert in PostgreSQL - Stack Overflow
I'm writing a Django-ORM enhancement that attempts to cache models and postpone model saving until the end of the transaction. It's all almost done, however I came across an unexpected difficulty ...
๐ŸŒ
GitHub
gist.github.com โ€บ aisayko โ€บ dcacd546bcb17a740dec703de6b2377e
Postgresql bulk upsert in Python (Django) ยท GitHub
Postgresql bulk upsert in Python (Django). GitHub Gist: instantly share code, notes, and snippets.
๐ŸŒ
PostgreSQL
postgresql.org โ€บ message-id โ€บ CA+mi_8bGfzhEr0+t2FjZGDRnQP45MC1E3C_djdBem_xZQXsD8A@mail.gmail.com
PostgreSQL: Re: Fastest way to insert/update many rows
August 12, 2014 - Yes: using copy to populate a temp table and then update via a query is the fastest way to bulk-update in postgres, regardless of the psycopg usage. > I have to set one column in each row, is there a way to update cursors like in PL/pgSQL's ...
๐ŸŒ
sqlpey
sqlpey.com โ€บ python โ€บ bulk-upsert-sqlalchemy-postgresql
How to Perform Bulk Upsert with SQLAlchemy in PostgreSQL - โ€ฆ
December 6, 2024 - Learn how to efficiently perform a bulk upsert (update or insert) in PostgreSQL using Python and SQLAlchemy.
gist.github.com โ€บ amorgun โ€บ 2a04764f0fc80e646efeb79cd7ad0b70
SqlAlchemy postgres bulk upsert ยท GitHub
Is there a way to pass to set_ in on_conflict_do_update a list of values from items, instead of a default value for all the records being upserted?

session.execute(
    postgresql.insert(MyModel.__table__)
    .values(items)
    .on_conflict_do_update(
        index_elements=MyModel.__table__.primary_key.columns,
        set_=items
    )
)
๐ŸŒ
Naysan
naysan.ca โ€บ 2020 โ€บ 05 โ€บ 09 โ€บ pandas-to-postgresql-using-psycopg2-bulk-insert-performance-benchmark
Pandas to PostgreSQL using Psycopg2: Bulk Insert Performance Benchmark | Naysan Saran
May 9, 2020 - If you have ever tried to insert a relatively large dataframe into a PostgreSQL table, you know that single inserts are to be avoided at all costs because of how long they take to execute. There are multiple ways to do bulk inserts with Psycopg2 (see this Stack Overflow page and this blog post for instance).
๐ŸŒ
DNMTechs
dnmtechs.com โ€บ performing-a-bulk-upsert-in-postgresql-using-sqlalchemy-in-python-3
Performing a Bulk Upsert in PostgreSQL using SQLAlchemy in Python 3 โ€“ DNMTechs โ€“ Sharing and Storing Technology Knowledge
Performing a bulk upsert operation in PostgreSQL using SQLAlchemy in Python 3 can be achieved by generating a multi-row `INSERT` statement with an `ON CONFLICT` clause. This allows us to insert new rows and update existing rows in a single operation, reducing the number of round trips to the ...
Top answer (1 of 7, 155 votes)

Bulk insert

You can modify @Ketema's three-column bulk insert:

INSERT INTO "table" (col1, col2, col3)
  VALUES (11, 12, 13) , (21, 22, 23) , (31, 32, 33);

It becomes:

INSERT INTO "table" (col1, col2, col3)
  VALUES (unnest(array[11,21,31]), 
          unnest(array[12,22,32]), 
          unnest(array[13,23,33]))

Replacing the values with placeholders:

INSERT INTO "table" (col1, col2, col3)
  VALUES (unnest(?), unnest(?), unnest(?))

You have to pass arrays or lists as arguments to this query. This means you can do huge bulk inserts without doing string concatenation (and all its hassles and dangers: SQL injection and quoting hell).
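Driven from Python, this works because psycopg2 adapts lists to Postgres arrays, so each placeholder is bound to one whole column of values. A sketch (table and column names follow the example above; the SELECT form of unnest is used here, which produces the same rows):

```python
def columns_from_rows(rows):
    """Transpose row tuples into one list per column, ready for unnest()."""
    return [list(col) for col in zip(*rows)]

# Each %s receives a whole Python list, adapted to a Postgres array.
INSERT_SQL = (
    'INSERT INTO "table" (col1, col2, col3) '
    "SELECT unnest(%s), unnest(%s), unnest(%s)"
)

def bulk_insert(cur, rows):
    """Insert all rows in a single statement, e.g. rows = [(11, 12, 13), (21, 22, 23)]."""
    cur.execute(INSERT_SQL, columns_from_rows(rows))
```

The statement text stays constant no matter how many rows are sent; only the bound arrays grow.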

Bulk update

PostgreSQL has added the FROM extension to UPDATE. You can use it in this way:

update "table" 
  set value = data_table.new_value
  from 
    (select unnest(?) as key, unnest(?) as new_value) as data_table
  where "table".key = data_table.key;

The manual is missing a good explanation, but there is an example on the postgresql-admin mailing list. I tried to elaborate on it:

create table tmp
(
  id serial not null primary key,
  name text,
  age integer
);

insert into tmp (name,age) 
values ('keith', 43),('leslie', 40),('bexley', 19),('casey', 6);

update tmp set age = data_table.age
from
(select unnest(array['keith', 'leslie', 'bexley', 'casey']) as name, 
        unnest(array[44, 50, 10, 12]) as age) as data_table
where tmp.name = data_table.name;
 

There are also other posts on StackExchange explaining UPDATE ... FROM using a VALUES clause instead of a subquery. They might be easier to read, but are restricted to a fixed number of rows.
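A hedged sketch of that VALUES-based variant from Python: since the row count is fixed when the statement is built, the placeholder list is generated to match. Table and column names follow the tmp example above; the ::int cast is an assumption to help Postgres infer the parameter type:

```python
def build_values_update(n_rows):
    """Build UPDATE ... FROM (VALUES ...) for n_rows (name, age) pairs."""
    values = ", ".join(["(%s, %s::int)"] * n_rows)
    return (
        "UPDATE tmp SET age = data_table.age "
        f"FROM (VALUES {values}) AS data_table(name, age) "
        "WHERE tmp.name = data_table.name"
    )

def flatten(pairs):
    """Interleave (name, age) pairs into a flat parameter list."""
    return [v for pair in pairs for v in pair]

# cur.execute(build_values_update(2), flatten([("keith", 44), ("leslie", 50)]))
```

Unlike the unnest form, the SQL text here changes with the batch size, so prepared-statement caching benefits are lost for varying batches.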

Answer 2 of 7 (26 votes)

I've used 3 strategies for batch transactional work:

  1. Generate SQL statements on the fly, concatenate them with semicolons, and then submit the statements in one shot. I've done up to 100 inserts in this way, and it was quite efficient (done against Postgres).
  2. JDBC has batching capabilities built in, if configured. If you generate transactions, you can flush your JDBC statements so that they transact in one shot. This tactic requires fewer database calls, as the statements are all executed in one batch.
  3. Hibernate also supports JDBC batching along the lines of the previous example, but in this case you execute a flush() method against the Hibernate Session, not the underlying JDBC connection. It accomplishes the same thing as JDBC batching.

Incidentally, Hibernate also supports a batching strategy in collection fetching. If you annotate a collection with @BatchSize, when fetching associations, Hibernate will use IN instead of =, leading to fewer SELECT statements to load up the collections.
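The batching idea in strategies 2 and 3 translates directly to this document's Python context: send statements in pages of a fixed size so each network round trip carries many rows. A sketch (the posts table is hypothetical; psycopg2.extras.execute_batch does essentially this internally):

```python
def pages(rows, page_size=100):
    """Split a list of rows into chunks of at most page_size."""
    return [rows[i:i + page_size] for i in range(0, len(rows), page_size)]

def batched_insert(cur, rows, page_size=100):
    """Insert rows in pages; one executemany() per page bounds each round trip."""
    for page in pages(rows, page_size):
        cur.executemany("INSERT INTO posts (id, title) VALUES (%s, %s)", page)
```

The page size trades memory per round trip against the number of round trips, the same knob JDBC batch size exposes.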

๐ŸŒ
Schinckel
schinckel.net โ€บ 2019 โ€บ 12 โ€บ 13 โ€บ asyncpg-and-upserting-bulk-data
asyncpg and upserting bulk data - Schinckel.net
December 13, 2019 - I'd normally just use psycopg2, but since I was using Python 3, I thought I'd try out asyncpg. ...

import asyncio
import asyncpg

async def write():
    conn = await asyncpg.connect('postgres://:@:/database')
    await conn.execute('query', [params])
    await conn.close()

asyncio.get_event_loop().run_until_complete(write())

So, that part was fine. There are more hoops to jump through because of async, but meh. But I had an interesting case: I wanted to do a bulk insert, and I didn't know how many records I was going to be inserting.
๐ŸŒ
Overflow
overflow.no โ€บ blog โ€บ 2025 โ€บ 1 โ€บ 5 โ€บ using-staging-tables-for-faster-bulk-upserts-with-python-and-postgresql
Using staging tables for faster bulk upserts with Python and PostgreSQL | Overflow
This allows for moving larger amounts of data faster, using copy_to_table(), and then doing the upsert in-engine within a transaction, ensuring rollback if anything unexpected happens. PostgreSQL also supports temporary tables, which can be automatically cleaned up after the transaction by using the ON COMMIT DROP clause.
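A sketch of that staging-table flow with psycopg2 (the snippet itself uses asyncpg's copy_to_table; the target table items with primary key id is an assumption): COPY into a temp table created ON COMMIT DROP, then upsert from it inside the same transaction.

```python
def staging_upsert_sql(table, key, cols):
    """Build the merge step that runs after COPYing into the staging table."""
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in cols if c != key)
    return (
        f"INSERT INTO {table} SELECT * FROM {table}_stage "
        f"ON CONFLICT ({key}) DO UPDATE SET {updates}"
    )

def staged_upsert(conn, csv_buffer, table="items", key="id", cols=("id", "title")):
    with conn.cursor() as cur:
        # Temp table mirrors the target and vanishes when the transaction ends.
        cur.execute(
            f"CREATE TEMP TABLE {table}_stage (LIKE {table} INCLUDING ALL) "
            "ON COMMIT DROP"
        )
        cur.copy_expert(
            f"COPY {table}_stage FROM STDIN WITH (FORMAT csv)", csv_buffer
        )
        cur.execute(staging_upsert_sql(table, key, list(cols)))
    conn.commit()  # commit makes the upsert durable and drops the staging table
```

If anything fails before commit, the whole load rolls back, including the staging table itself.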
๐ŸŒ
Reddit
reddit.com โ€บ r/dataengineering โ€บ processing and inserting large sets of data into postgres with python
r/dataengineering on Reddit: Processing and Inserting Large sets of data into Postgres with Python
June 23, 2024 -

I am migrating data from a Microsoft Access Database (.mdb) into a Postgres Database. The dataset is very large and I need to process the data before loading it into the database (Postgres). My approach is very slow and I need some help making it faster.

I've created an insert method which when called establishes a connection to the Postgres Database first and then executes the insert query before it closes the connection.

So far I've tried the following approaches: Single Insert: Where I process a single row of data and insert it. This approach was slow because, every time the insert method is called, a connection needs to be established before the insertion is done and then the connection is closed. This happens for every single row, making it slow.

Then I tried Bulk Insert: Where I process all the rows before calling the insert method. I created a bulk insert method for this approach using pandas and SQLAlchemy. This approach, however, takes a very long time too because even though the connection is established only once, all the data must be processed first before the insertion happens.

Now I'm thinking the best approach will be to use Batching: Where I process a chunk of the data, insert, and then go back to process the rest and insert. I realized combining multithreading with Batching can help to improve the speed significantly.

Please I need suggestions on which approach will be best. Even if it is different from the one I have mentioned.
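A minimal sketch of the batching approach the post converges on: one long-lived connection, process a chunk, insert it, repeat. process_row and the target table here are hypothetical placeholders, not from the post:

```python
def chunked(iterable, size):
    """Yield lists of at most `size` items from any iterable (length unknown up front)."""
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def migrate(conn, source_rows, process_row, size=1000):
    """Interleave processing and inserting so neither has to finish first."""
    with conn.cursor() as cur:
        for chunk in chunked(source_rows, size):
            cur.executemany(
                "INSERT INTO target (a, b) VALUES (%s, %s)",
                [process_row(r) for r in chunk],
            )
    conn.commit()
```

This avoids both failure modes described above: no per-row connection churn, and no requirement to hold or pre-process the full dataset before the first insert.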

Top answer (1 of 7, 60 votes)

There is an upsert-esque operation in SQLAlchemy:

db.session.merge()

After I found this command, I was able to perform upserts, but it is worth mentioning that this operation is slow for a bulk "upsert".

The alternative is to get a list of the primary keys you would like to upsert, and query the database for any matching ids:

# Imagine that post1, post5, and post1000 are posts objects with ids 1, 5 and 1000 respectively
# The goal is to "upsert" these posts.
# we initialize a dict which maps id to the post object

my_new_posts = {1: post1, 5: post5, 1000: post1000} 

for each in posts.query.filter(posts.id.in_(my_new_posts.keys())).all():
    # Only merge those posts which already exist in the database
    db.session.merge(my_new_posts.pop(each.id))

# Only add those posts which did not exist in the database 
db.session.add_all(my_new_posts.values())

# Now we commit our modifications (merges) and inserts (adds) to the database!
db.session.commit()
Answer 2 of 7 (53 votes)

You can leverage the on_conflict_do_update variant. A simple example would be the following:

from sqlalchemy import Column, Integer, Unicode
from sqlalchemy.dialects.postgresql import insert

class Post(Base):
    """
    A simple class for demonstration
    """

    id = Column(Integer, primary_key=True)
    title = Column(Unicode)

# Prepare all the values that should be "upserted" to the DB
values = [
    {"id": 1, "title": "mytitle 1"},
    {"id": 2, "title": "mytitle 2"},
    {"id": 3, "title": "mytitle 3"},
    {"id": 4, "title": "mytitle 4"},
]

stmt = insert(Post).values(values)
stmt = stmt.on_conflict_do_update(
    # Let's use the constraint name which was visible in the original post's error msg
    constraint="post_pkey",

    # The columns that should be updated on conflict
    set_={
        "title": stmt.excluded.title
    }
)
session.execute(stmt)

See the Postgres docs for more details about ON CONFLICT DO UPDATE.

See the SQLAlchemy docs for more details about on_conflict_do_update.

Side-Note on duplicated column names

The above code uses the column names as dict keys both in the values list and the argument to set_. If the column-name is changed in the class-definition this needs to be changed everywhere or it will break. This can be avoided by accessing the column definitions, making the code a bit uglier, but more robust:

coldefs = Post.__table__.c

values = [
    {coldefs.id.name: 1, coldefs.title.name: "mytitle 1"},
    ...
]

stmt = stmt.on_conflict_do_update(
    ...
    set_={
        coldefs.title.name: stmt.excluded.title
        ...
    }
)
๐ŸŒ
Trvrm
trvrm.github.io โ€บ bulk-psycopg2-inserts.html
Efficient Postgres Bulk Inserts using Psycopg2 and Unnest
October 22, 2015 -

def bulkInsertRate(count):
    tester = Tester(count)
    start = time.time()
    tester.fastInsert()
    duration = time.time() - start
    return count / duration

def normalInsertRate(count):
    tester = Tester(count)
    start = time.time()
    tester.insert()
    duration = time.time() - start
    return count / duration
๐ŸŒ
PyPI
pypi.org โ€บ project โ€บ django-pg-bulk-update
django-pg-bulk-update ยท PyPI
Django extension, executing bulk update operations for PostgreSQL
      ยป pip install django-pg-bulk-update
    
Published ย  Jun 30, 2024
Version ย  3.7.3