Answer from Andy Shellam on Stack Overflow: Yeah, I would vote for COPY, provided you can write a file to the server's hard drive (not the drive the app is running on), as COPY will only read off the server.
There is a new psycopg2 manual containing examples for all the options.
The COPY option is the most efficient, followed by executemany, followed by execute with pyformat.
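For reference, here is a minimal sketch of the COPY route with psycopg2; the connection string, table, and column names are placeholders. Note that psycopg2's copy_from streams the data over STDIN, so you do not actually need write access to the server's disk:

```python
import io
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")  # placeholder DSN
rows = [(1, "alice"), (2, "bob")]  # whatever you want to load

# Serialize the rows into an in-memory tab-separated buffer so COPY can
# stream them without a server-side file. Values containing tabs or
# newlines would need escaping first.
buf = io.StringIO()
for row in rows:
    buf.write("\t".join(str(col) for col in row) + "\n")
buf.seek(0)

with conn, conn.cursor() as cur:
    # COPY ... FROM STDIN reads from the client connection.
    cur.copy_from(buf, "my_table", columns=("id", "name"))
```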
Sorting arglist by "variant_name" and "start" (the first two columns in the index) should make sure that most of the index lookups will be hitting already cached pages. Having the table also be clustered on that index would help make sure the table pages are also accessed in a cache friendly way (although it won't stay clustered very well in the face of new data).
Also, your index is gratuitously double the size it needs to be. There is no point in doing INCLUDE on a column that is already part of the main part of the index. That is going to cost you CPU and IO to format and write the data (and the WAL) and also reduce the amount of data which fits in cache.
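As a rough illustration, assuming arglist is the list of parameter dicts being fed to executemany (as in the question this answer addresses), the pre-sort is a one-liner:

```python
# A tiny example of what arglist might look like (dicts keyed by column name).
arglist = [
    {"variant_name": "v2", "start": 10, "value": 1.5},
    {"variant_name": "v1", "start": 5, "value": 0.3},
]

# Sort by the two leading index columns so successive rows hit index pages
# that are already in cache.
arglist.sort(key=lambda row: (row["variant_name"], row["start"]))
```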
Turning off WAL (setting the table UNLOGGED) means that the table will be empty after a crash, because it cannot be recovered. If you are considering running ALTER TABLE later to change it to a LOGGED table, know that this operation will dump the whole table into WAL, so you won't win anything.
For a simple statement like that on an unlogged table, the only ways to speed it up are:
- drop all indexes, triggers and constraints except variant_fact_unique (but creating them again will be expensive, so you might not win overall)
- make sure you have fast storage and enough RAM
Bulk insert
You can modify the bulk insert of three columns by @Ketema:
INSERT INTO "table" (col1, col2, col3)
VALUES (11, 12, 13), (21, 22, 23), (31, 32, 33);
It becomes:
INSERT INTO "table" (col1, col2, col3)
VALUES (unnest(array[11,21,31]),
unnest(array[12,22,32]),
unnest(array[13,23,33]))
Replacing the values with placeholders:
INSERT INTO "table" (col1, col2, col3)
VALUES (unnest(?), unnest(?), unnest(?))
You have to pass arrays or lists as arguments to this query. This means you can do huge bulk inserts without doing string concatenation (and all its hassles and dangers: SQL injection and quoting hell).
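With psycopg2, which adapts Python lists to Postgres arrays and uses %s placeholders rather than ?, the call could look like the sketch below; the table and column names are only illustrative. Newer PostgreSQL versions may reject set-returning functions inside VALUES, so the equivalent SELECT form of the same statement is used here:

```python
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")  # placeholder DSN
col1 = [11, 21, 31]
col2 = [12, 22, 32]
col3 = [13, 23, 33]

with conn, conn.cursor() as cur:
    # Each %s receives a whole Python list, which psycopg2 sends as one array.
    cur.execute(
        'INSERT INTO "table" (col1, col2, col3) '
        "SELECT unnest(%s), unnest(%s), unnest(%s)",
        (col1, col2, col3),
    )
```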
Bulk update
PostgreSQL has added the FROM extension to UPDATE. You can use it in this way:
update "table"
set value = data_table.new_value
from
(select unnest(?) as key, unnest(?) as new_value) as data_table
where "table".key = data_table.key;
The manual is missing a good explanation, but there is an example on the postgresql-admin mailing list. I tried to elaborate on it:
create table tmp
(
id serial not null primary key,
name text,
age integer
);
insert into tmp (name,age)
values ('keith', 43),('leslie', 40),('bexley', 19),('casey', 6);
update tmp set age = data_table.age
from
(select unnest(array['keith', 'leslie', 'bexley', 'casey']) as name,
unnest(array[44, 50, 10, 12]) as age) as data_table
where tmp.name = data_table.name;
There are also other posts on StackExchange explaining UPDATE ... FROM using a VALUES clause instead of a subquery. They might be easier to read, but are restricted to a fixed number of rows.
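As a sketch of the same technique driven from psycopg2, reusing the tmp table from the example above (the array casts just make the parameter types explicit):

```python
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")  # placeholder DSN
names = ["keith", "leslie", "bexley", "casey"]
ages = [44, 50, 10, 12]

with conn, conn.cursor() as cur:
    # The two lists are adapted to Postgres arrays; unnest() turns them back
    # into a two-column derived table that drives the UPDATE.
    cur.execute(
        """
        UPDATE tmp
        SET age = data_table.age
        FROM (SELECT unnest(%s::text[]) AS name,
                     unnest(%s::int[])  AS age) AS data_table
        WHERE tmp.name = data_table.name
        """,
        (names, ages),
    )
```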
I've used 3 strategies for batch transactional work:
- Generate SQL statements on the fly, concatenate them with semicolons, and then submit the statements in one shot. I've done up to 100 inserts in this way, and it was quite efficient (done against Postgres).
- JDBC has batching capabilities built in, if configured. If you generate transactions, you can flush your JDBC statements so that they transact in one shot. This tactic requires fewer database calls, as the statements are all executed in one batch.
- Hibernate also supports JDBC batching along the lines of the previous example, but in this case you execute a flush() method against the Hibernate Session, not against the underlying JDBC connection. It accomplishes the same thing as JDBC batching.
Incidentally, Hibernate also supports a batching strategy in collection fetching. If you annotate a collection with @BatchSize, when fetching associations, Hibernate will use IN instead of =, leading to fewer SELECT statements to load up the collections.
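If you are doing the equivalent work from Python rather than Java, psycopg2's execute_batch plays roughly the same role as JDBC batching: it groups many parameter sets into a few round trips. A small sketch, with table and column names as placeholders:

```python
import psycopg2
from psycopg2.extras import execute_batch

conn = psycopg2.connect("dbname=mydb user=postgres")  # placeholder DSN
rows = [(1, "a"), (2, "b"), (3, "c")]

with conn, conn.cursor() as cur:
    # execute_batch sends many parameter sets per round trip, much like
    # JDBC's addBatch()/executeBatch().
    execute_batch(
        cur,
        "INSERT INTO my_table (id, name) VALUES (%s, %s)",
        rows,
        page_size=100,
    )
```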
I am migrating data from a Microsoft Access Database (.mdb) into a Postgres Database. The dataset is very large and I need to process the data before loading it into the database (Postgres). My approach is very slow and I need some help making it faster.
I've created an insert method which when called establishes a connection to the Postgres Database first and then executes the insert query before it closes the connection.
So far I've tried the following approaches: Single Insert: where I process a single row of data and insert it. This approach was slow because every time the insert method is called, a connection needs to be established, the insertion done, and the connection closed. This happens for every single row, making it slow.
Then I tried Bulk Insert: Where I process all the columns before calling the insert method. I created a bulk insert method for this approach using pandas and SQLAlchemy. This approach, however, takes a very long time too because even though the connection is established only once, all the data must be processed first before the insertion happens.
Now I'm thinking the best approach will be to use Batching: Where I process a chunk of the data, insert, and then go back to process the rest and insert. I realized combining multithreading with Batching can help to improve the speed significantly.
I need suggestions on which approach would be best, even if it is different from the ones I have mentioned.
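One possible shape for the batching approach described above; process_chunk stands in for whatever per-row cleanup the Access data needs, and execute_values performs the multi-row insert so the connection is opened only once:

```python
import psycopg2
from psycopg2.extras import execute_values

def chunks(rows, size=1000):
    """Yield successive slices of the source rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def process_chunk(chunk):
    """Stand-in for whatever per-row processing the Access data needs."""
    return [(row_id, name.strip()) for row_id, name in chunk]

source_rows = [(1, " alice "), (2, " bob ")]  # rows read from the .mdb file

conn = psycopg2.connect("dbname=mydb user=postgres")  # opened once, reused
with conn, conn.cursor() as cur:
    for chunk in chunks(source_rows):
        # execute_values expands the single %s into a multi-row VALUES list.
        execute_values(
            cur,
            "INSERT INTO target_table (id, name) VALUES %s",
            process_chunk(chunk),
        )
```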
There is an upsert-esque operation in SQLAlchemy:
db.session.merge()
After I found this command, I was able to perform upserts, but it is worth mentioning that this operation is slow for a bulk "upsert".
The alternative is to get a list of the primary keys you would like to upsert, and query the database for any matching ids:
# Imagine that post1, post5, and post1000 are posts objects with ids 1, 5 and 1000 respectively
# The goal is to "upsert" these posts.
# we initialize a dict which maps id to the post object
my_new_posts = {1: post1, 5: post5, 1000: post1000}
for each in posts.query.filter(posts.id.in_(my_new_posts.keys())).all():
    # Only merge those posts which already exist in the database
    db.session.merge(my_new_posts.pop(each.id))
# Only add those posts which did not exist in the database
db.session.add_all(my_new_posts.values())
# Now we commit our modifications (merges) and inserts (adds) to the database!
db.session.commit()
You can leverage the on_conflict_do_update variant. A simple example would be the following:
from sqlalchemy import Column, Integer, Unicode
from sqlalchemy.dialects.postgresql import insert

class Post(Base):
    """
    A simple class for demonstration
    """
    __tablename__ = "post"

    id = Column(Integer, primary_key=True)
    title = Column(Unicode)
# Prepare all the values that should be "upserted" to the DB
values = [
    {"id": 1, "title": "mytitle 1"},
    {"id": 2, "title": "mytitle 2"},
    {"id": 3, "title": "mytitle 3"},
    {"id": 4, "title": "mytitle 4"},
]
stmt = insert(Post).values(values)
stmt = stmt.on_conflict_do_update(
    # Let's use the constraint name which was visible in the original post's error msg
    constraint="post_pkey",
    # The columns that should be updated on conflict
    set_={
        "title": stmt.excluded.title
    }
)
session.execute(stmt)
See the Postgres docs for more details about ON CONFLICT DO UPDATE.
See the SQLAlchemy docs for more details about on_conflict_do_update.
Side-Note on duplicated column names
The above code uses the column names as dict keys both in the values list and in the argument to set_. If a column name is changed in the class definition, it needs to be changed everywhere or the code will break. This can be avoided by accessing the column definitions, making the code a bit uglier but more robust:
coldefs = Post.__table__.c

values = [
    {coldefs.id.name: 1, coldefs.title.name: "mytitle 1"},
    ...
]
stmt = stmt.on_conflict_do_update(
    ...
    set_={
        coldefs.title.name: stmt.excluded.title
        ...
    }
)
I would try something like this:
sql = '''INSERT INTO temp.tickets
         (id, created_at, updated_at, emails, status)
         VALUES
         (%s, %s, %s, %s, %s)
         ON CONFLICT (id)
         DO UPDATE SET (emails, status) = (%s, %s)
      '''
cursor = cm.cursor()
## cm is a custom module
cursor.execute(sql, (ticket['id'],
                     ticket['created_at'],
                     ticket['updated_at'],
                     ticket['emails'],
                     ticket['status'],
                     ticket['emails'],
                     ticket['status']))
The number of %s placeholders must match the number of parameters.
When Postgres encounters a conflict captured by ON CONFLICT, it creates a record called EXCLUDED containing the values you attempted to insert. You can refer to this record in DO UPDATE. Try the following:
INSERT INTO temp.tickets
(id, created_at, updated_at, emails, status)
VALUES
(%s, %s, %s, %s, %s)
ON CONFLICT (id)
DO UPDATE
SET emails = excluded.emails
, status = excluded.status
, updated_at = excluded.updated_at -- my assumption.
...
You will have to format this to fit the requirements of your source language.
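For example, with psycopg2 it might look like the sketch below, reusing the table and column names from the statement above; execute_values batches the rows, and EXCLUDED supplies the conflicting values so each parameter is passed only once:

```python
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=mydb user=postgres")  # placeholder DSN
tickets = [
    {"id": 1, "created_at": "2020-01-01", "updated_at": "2020-06-01",
     "emails": 3, "status": "open"},
]

sql = """
    INSERT INTO temp.tickets (id, created_at, updated_at, emails, status)
    VALUES %s
    ON CONFLICT (id) DO UPDATE
    SET emails     = EXCLUDED.emails,
        status     = EXCLUDED.status,
        updated_at = EXCLUDED.updated_at
"""

with conn, conn.cursor() as cur:
    # EXCLUDED already carries the values we tried to insert, so each row's
    # parameters only need to be supplied once.
    execute_values(
        cur,
        sql,
        [(t["id"], t["created_at"], t["updated_at"], t["emails"], t["status"])
         for t in tickets],
    )
```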