Update: You can save yourself some typing by using this method.


If you are using PostgreSQL 9.5 or later, you can perform the UPSERT using a temporary table and an INSERT ... ON CONFLICT statement:

import pandas as pd
import sqlalchemy as sa

# …

with engine.begin() as conn:
    # step 0.0 - create test environment
    conn.exec_driver_sql("DROP TABLE IF EXISTS main_table")
    conn.exec_driver_sql(
        "CREATE TABLE main_table (id int primary key, txt varchar(50))"
    )
    conn.exec_driver_sql(
        "INSERT INTO main_table (id, txt) VALUES (1, 'row 1 old text')"
    )
    # step 0.1 - create DataFrame to UPSERT
    df = pd.DataFrame(
        [(2, "new row 2 text"), (1, "row 1 new text")], columns=["id", "txt"]
    )
    
    # step 1 - create temporary table and upload DataFrame
    conn.exec_driver_sql(
        "CREATE TEMPORARY TABLE temp_table AS SELECT * FROM main_table WHERE false"
    )
    df.to_sql("temp_table", conn, index=False, if_exists="append")

    # step 2 - merge temp_table into main_table
    conn.exec_driver_sql(
        """\
        INSERT INTO main_table (id, txt) 
        SELECT id, txt FROM temp_table
        ON CONFLICT (id) DO
            UPDATE SET txt = EXCLUDED.txt
        """
    )

    # step 3 - confirm results
    result = conn.exec_driver_sql("SELECT * FROM main_table ORDER BY id").all()
    print(result)  # [(1, 'row 1 new text'), (2, 'new row 2 text')]
Answer from Gord Thompson on Stack Overflow
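The same temp-table-then-merge pattern can be exercised without a PostgreSQL server: SQLite 3.24+ (bundled with modern Python) accepts the same INSERT ... ON CONFLICT ... DO UPDATE syntax, so a stdlib-only sketch of the flow looks like this (sqlite3 stands in for the Postgres connection; the table and rows are the same test data as above):

```python
import sqlite3

# In-memory stand-in for the PostgreSQL database from the answer above
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE main_table (id INTEGER PRIMARY KEY, txt TEXT)")
conn.execute("INSERT INTO main_table (id, txt) VALUES (1, 'row 1 old text')")

# Rows that df.to_sql(..., if_exists="append") would stage in the temp table
rows = [(2, "new row 2 text"), (1, "row 1 new text")]
conn.execute("CREATE TEMPORARY TABLE temp_table (id INTEGER, txt TEXT)")
conn.executemany("INSERT INTO temp_table (id, txt) VALUES (?, ?)", rows)

# Merge: insert new ids, update existing ones from the staged rows.
# SQLite requires a WHERE clause on the SELECT before ON CONFLICT
# to resolve a parsing ambiguity; "WHERE true" is enough.
conn.execute(
    """
    INSERT INTO main_table (id, txt)
    SELECT id, txt FROM temp_table WHERE true
    ON CONFLICT (id) DO UPDATE SET txt = excluded.txt
    """
)
result = conn.execute("SELECT * FROM main_table ORDER BY id").fetchall()
print(result)  # [(1, 'row 1 new text'), (2, 'new row 2 text')]
```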
GitHub
github.com › ThibTrip › pangres
GitHub - ThibTrip/pangres: SQL upsert using pandas DataFrames for PostgreSQL, SQlite and MySQL with extra features
GitHub
github.com › ryanbaumann › Pandas-to_sql-upsert
GitHub - ryanbaumann/Pandas-to_sql-upsert: Extend pandas to_sql function to perform multi-threaded, concurrent "insert or update" command in memory
The goal of this library is to extend the Python Pandas to_sql() function to be: Muti-threaded (improving time-to-insert on large datasets) Allow the to_sql() command to run an 'insert if does not exist' to the database ...
Discussions

I made a Pandas.to_sql_upsert()
Do you have any performance tests on it?
r/dataengineering, December 28, 2024
Faster loading of Dataframes from Pandas to Postgres
I believe odo implements this kind of approach.
r/Python, May 3, 2017
GitHub
gist.github.com › Nikolay-Lysenko › 0887f4b59dc4914cec9b236c317d06e3
Upsert (a hybrid of insert and update) from pandas.DataFrame to PostgreSQL database · GitHub
Upsert (a hybrid of insert and update) from pandas.DataFrame to PostgreSQL database - upsert_from_pandas_to_postgres.py
Top answer
1 of 6
24

2 of 6
18

I have needed this so many times, I ended up creating a gist for it.

The function is below. It will create the table if this is the first time the dataframe is persisted, and will update the table if it already exists:

import uuid

import pandas as pd
import sqlalchemy

def upsert_df(df: pd.DataFrame, table_name: str, engine: sqlalchemy.engine.Engine) -> bool:
    """Implements the equivalent of pd.DataFrame.to_sql(..., if_exists='update')
    (which does not exist). Creates or updates the db records based on the
    dataframe records.
    Conflicts to determine an update are based on the dataframe's index.
    This will set a unique constraint on the table covering the index columns.
    1. Create a temp table from the dataframe
    2. Insert/update from the temp table into table_name
    Returns: True if successful
    """
    with engine.begin() as conn:
        # If the table does not exist, just use to_sql to create it.
        # %(table_name)s is a driver-level (psycopg2 pyformat) parameter.
        table_exists = conn.exec_driver_sql(
            """SELECT EXISTS (
                SELECT FROM information_schema.tables
                WHERE  table_schema = 'public'
                AND    table_name   = %(table_name)s)""",
            {"table_name": table_name},
        ).scalar()
        if not table_exists:
            df.to_sql(table_name, conn)
            return True

        # If it already exists, upload the dataframe to a uniquely named temp table
        temp_table_name = f"temp_{uuid.uuid4().hex[:6]}"
        df.to_sql(temp_table_name, conn, index=True)

        index = list(df.index.names)
        index_sql_txt = ", ".join([f'"{i}"' for i in index])
        columns = list(df.columns)
        headers = index + columns
        headers_sql_txt = ", ".join(
            [f'"{i}"' for i in headers]
        )  # "index1", "index2", ..., "col1", "col2", ...

        # col1 = EXCLUDED.col1, col2 = EXCLUDED.col2, ...
        update_column_stmt = ", ".join(
            [f'"{col}" = EXCLUDED."{col}"' for col in columns]
        )

        # For the ON CONFLICT clause, Postgres requires that the conflict
        # target columns have a unique constraint
        conn.exec_driver_sql(
            f"""
            ALTER TABLE "{table_name}" DROP CONSTRAINT IF EXISTS unique_constraint_for_upsert;
            ALTER TABLE "{table_name}" ADD CONSTRAINT unique_constraint_for_upsert UNIQUE ({index_sql_txt});
            """
        )

        # Compose and execute the upsert query, then drop the temp table
        conn.exec_driver_sql(
            f"""
            INSERT INTO "{table_name}" ({headers_sql_txt})
            SELECT {headers_sql_txt} FROM "{temp_table_name}"
            ON CONFLICT ({index_sql_txt}) DO UPDATE
            SET {update_column_stmt};
            """
        )
        conn.exec_driver_sql(f'DROP TABLE "{temp_table_name}"')

    return True
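To see what upsert_df actually sends to the server, its string-building steps can be reproduced in isolation. The table name "my_table", the temp-table name "temp_ab12cd", and the columns below are hypothetical placeholders:

```python
# Rebuild the SQL that upsert_df composes, for a hypothetical table
# "my_table" indexed by "id" with data columns "txt" and "updated_at".
index = ["id"]
columns = ["txt", "updated_at"]
headers = index + columns

index_sql_txt = ", ".join(f'"{i}"' for i in index)
headers_sql_txt = ", ".join(f'"{h}"' for h in headers)

# "col" = EXCLUDED."col" for every non-index column
update_column_stmt = ", ".join(f'"{c}" = EXCLUDED."{c}"' for c in columns)

query_upsert = (
    f'INSERT INTO "my_table" ({headers_sql_txt})\n'
    f'SELECT {headers_sql_txt} FROM "temp_ab12cd"\n'
    f"ON CONFLICT ({index_sql_txt}) DO UPDATE\n"
    f"SET {update_column_stmt};"
)
print(query_upsert)
# INSERT INTO "my_table" ("id", "txt", "updated_at")
# SELECT "id", "txt", "updated_at" FROM "temp_ab12cd"
# ON CONFLICT ("id") DO UPDATE
# SET "txt" = EXCLUDED."txt", "updated_at" = EXCLUDED."updated_at";
```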
Reddit
reddit.com › r/dataengineering › i made a pandas.to_sql_upsert()
r/dataengineering on Reddit: I made a Pandas.to_sql_upsert()
December 28, 2024

Hi guys. I made a Pandas.to_sql() upsert that uses the same syntax as Pandas.to_sql(), but allows you to upsert based on unique column(s): https://github.com/vile319/sql_upsert

This is incredibly useful to me for scraping multiple times daily with a live baseball database. The only thing is, I would prefer if pandas had this built in to the package, and I did open a pull request about it, but I think they are too busy to care.

Maybe it is just a stupid idea? I would like to know your opinions on whether or not pandas should have upsert. I think my code handles it pretty well as a workaround, but I feel like Pandas could just do this as part of their package. Maybe I am just thinking about this all wrong?

Not sure if this is the wrong subreddit to post this on. While this I guess is technically self promotion, I would much rather delete my package in exchange for pandas adopting any equivalent.

GitHub
github.com › ryanbaumann › Pandas-to_sql-upsert › blob › master › Pandas_tosql_upsert.ipynb
Pandas-to_sql-upsert/Pandas_tosql_upsert.ipynb at master · ryanbaumann/Pandas-to_sql-upsert
"DB_TYPE = 'postgresql'\n", "DB_DRIVER = 'psycopg2'\n", "DB_USER = 'admin'\n", "DB_PASS = 'password'\n", "DB_HOST = 'localhost'\n", "DB_PORT = '5432'\n", "DB_NAME = 'pandas_upsert'\n", "POOL_SIZE = 50\n", "### Config update complete ###\n", "SQLALCHEMY_DATABASE_URI = '%s+%s://%s:%s@%s:%s/%s' %(DB_TYPE, DB_DRIVER, DB_USER,\n", " DB_PASS, DB_HOST, DB_PORT, DB_NAME)\n", "#Add more threads to the pool for execution\n", "engine = create_engine(SQLALCHEMY_DATABASE_URI, pool_size=POOL_SIZE, max_overflow=0)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false ·
Author   ryanbaumann
Readthedocs
aws-sdk-pandas.readthedocs.io › en › 3.2.1 › stubs › awswrangler.postgresql.to_sql.html
awswrangler.postgresql.to_sql — AWS SDK for pandas 3.2.1 documentation
AWS SDK for pandas 3.2.1 · About · Install · At Scale · Tutorials · API Reference · License · Contribute · GitHub · awswrangler.postgresql.to_sql(df: DataFrame, con: pg8000.Connection, table: str, schema: str, mode: Literal['append', 'overwrite', 'upsert'] = 'append', index: bool = False, dtype: Dict[str, str] | None = None, varchar_lengths: Dict[str, int] | None = None, use_column_names: bool = False, chunksize: int = 200, upsert_conflict_columns: List[str] | None = None, insert_conflict_columns: List[str] | None = None) → None¶ ·
GitHub
github.com › ThibTrip › pangres › wiki › Aupsert
Aupsert
SQL upsert using pandas DataFrames for PostgreSQL, SQlite and MySQL with extra features - ThibTrip/pangres
Author   ThibTrip
GitHub
github.com › ryanbaumann › Pandas-to_sql-upsert › blob › master › readme.md
Pandas-to_sql-upsert/readme.md at master · ryanbaumann/Pandas-to_sql-upsert
The goal of this library is to extend the Python Pandas to_sql() function to be: Muti-threaded (improving time-to-insert on large datasets) Allow the to_sql() command to run an 'insert if does not exist' to the database ...
Author   ryanbaumann
Minwook-shin
minwook-shin.github.io › pandas-dataframe-to-sql-upsert
Implementing a PostgreSQL Upsert with the Pandas to_sql Method
January 23, 2022 - Today I will try upserting Pandas dataframe data into a PostgreSQL database, with the PostgreSQL database running separately on my local machine. As far as I know (as of January 23, 2022), the to_sql method provided by Pandas cannot directly apply update logic for each row when it conflicts with a primary-key or unique constraint in the database.
GitHub
gist.github.com › gordthompson › ae7a1528fde1c00c03fdbb5c53c8f90f
Build a PostgreSQL INSERT … ON CONFLICT statement and upsert a DataFrame
Build a PostgreSQL INSERT … ON CONFLICT statement and upsert a DataFrame - postgresql_df_upsert.py
Readthedocs
aws-sdk-pandas.readthedocs.io › en › 3.10.1 › stubs › awswrangler.postgresql.to_sql.html
awswrangler.postgresql.to_sql — AWS SDK for pandas 3.10.1 documentation
AWS SDK for pandas 3.10.1 · About · Install · At Scale · Tutorials · API Reference · License · Contribute · GitHub · awswrangler.postgresql.to_sql(df: DataFrame, con: pg8000.Connection, table: str, schema: str, mode: Literal['append', 'overwrite', 'upsert'] = 'append', overwrite_method: Literal['drop', 'cascade', 'truncate', 'truncate cascade'] = 'drop', index: bool = False, dtype: dict[str, str] | None = None, varchar_lengths: dict[str, int] | None = None, use_column_names: bool = False, chunksize: int = 200, upsert_conflict_columns: list[str] | None = None, insert_conflict_columns: list[str] | None = None, commit_transaction: bool = True) → None¶ ·
GitHub
github.com › reachanu21 › Pandas-to_sql-upsert
GitHub - reachanu21/Pandas-to_sql-upsert: Extend pandas to_sql function to perform multi-threaded, concurrent "insert or update" command in memory
The goal of this library is to extend the Python Pandas to_sql() function to be: Muti-threaded (improving time-to-insert on large datasets) Allow the to_sql() command to run an 'insert if does not exist' to the database ...
Author   reachanu21
PyPI
pypi.org › project › pangres
pangres · PyPI
Upsert with pandas DataFrames (ON CONFLICT DO NOTHING or ON CONFLICT DO UPDATE) for PostgreSQL, MySQL, SQlite and potentially other databases behaving like SQlite (untested) with some additional optional features (see features). Upserting can be done with primary keys or unique keys.
pip install pangres
Published   Nov 05, 2023
Version   4.2.1
GitHub
github.com › Ianphorsman › PandasSqlWrapper
GitHub - Ianphorsman/PandasSqlWrapper: Provides upsert and schema updating capabilities and wraps basic functionality expected when communicating between dataframes and sql tables.
sql_data = PandasSQLWrapper( ... database to communicate back performed actions ) Performs an upsert on a sql table and updates table schema by adding columns if necessary....
Author   Ianphorsman
GitHub
github.com › ryanbaumann › Pandas-to_sql-upsert › blob › master › to_sql_newrows.py
Pandas-to_sql-upsert/to_sql_newrows.py at master · ryanbaumann/Pandas-to_sql-upsert
May 2, 2016 - DB_TYPE = 'postgresql' DB_DRIVER = 'psycopg2' DB_USER = 'admin' DB_PASS = 'password' DB_HOST = 'localhost' DB_PORT = '5432' DB_NAME = 'pandas_upsert' POOL_SIZE = 50 · TABLENAME = 'test_upsert' SQLALCHEMY_DATABASE_URI = '%s+%s://%s:%s@%s:%s/%s' % (DB_TYPE, DB_DRIVER, DB_USER, DB_PASS, DB_HOST, DB_PORT, DB_NAME) ENGINE = create_engine( SQLALCHEMY_DATABASE_URI, pool_size=POOL_SIZE, max_overflow=0) ·
Author   ryanbaumann
Pandas
pandas.pydata.org › docs › reference › api › pandas.DataFrame.to_sql.html
pandas.DataFrame.to_sql — pandas 3.0.2 documentation
>>> df3 = pd.DataFrame({"name": ['User 8', 'User 9']}) >>> df3.to_sql(name='users', con=engine, if_exists='delete_rows', ... index_label='id') 2 >>> with engine.connect() as conn: ... conn.execute(text("SELECT * FROM users")).fetchall() [(0, 'User 8'), (1, 'User 9')] Use method to define a callable insertion method to do nothing if there’s a primary key conflict on a table in a PostgreSQL database.
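The method= callable mentioned in that snippet can be sketched as follows, adapting the PostgreSQL example from the pandas documentation. The conflict column "a" is the docs' placeholder for the column carrying the unique or primary-key constraint; a real PostgreSQL engine is needed to actually run the insert:

```python
from sqlalchemy.dialects.postgresql import insert

def insert_on_conflict_nothing(table, conn, keys, data_iter):
    """to_sql method= hook: skip rows whose key already exists.

    Column "a" is assumed to carry the unique/primary-key constraint.
    """
    data = [dict(zip(keys, row)) for row in data_iter]
    stmt = (
        insert(table.table)
        .values(data)
        .on_conflict_do_nothing(index_elements=["a"])
    )
    result = conn.execute(stmt)
    return result.rowcount

# Usage (engine and df are assumed to exist):
# df.to_sql("conflict_table", engine, if_exists="append",
#           method=insert_on_conflict_nothing)
```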
Medium
medium.com › @kennethhughesa › optimization-of-upsert-methods-in-postgresql-python-ac11b8471494
Optimization of Upsert Methods in PostgreSQL/Python | by Kenny Hughes | Medium
June 5, 2022 - I then would run a SQL DELETE statement to remove deprecated data. What I found was that when the Upsert was performed from within the database engine a significant compute advantage took place. However, when the Upsert procedure involved transferring data from Python to the database engine the ingestion time was well below compute and time performance standards (the data ingestion script was going to run on an Github Action Virtual Machine Runner).
Readthedocs
aws-sdk-pandas.readthedocs.io › en › stable › stubs › awswrangler.postgresql.to_sql.html
awswrangler.postgresql.to_sql — AWS SDK for pandas 3.14.0 documentation
AWS SDK for pandas 3.14.0 · About · Install · At Scale · Tutorials · API Reference · License · Contribute · GitHub · awswrangler.postgresql.to_sql(df: DataFrame, con: pg8000.Connection, table: str, schema: str, mode: Literal['append', 'overwrite', 'upsert'] = 'append', overwrite_method: Literal['drop', 'cascade', 'truncate', 'truncate cascade'] = 'drop', index: bool = False, dtype: dict[str, str] | None = None, varchar_lengths: dict[str, int] | None = None, use_column_names: bool = False, chunksize: int = 200, upsert_conflict_columns: list[str] | None = None, insert_conflict_columns: list[str] | None = None, commit_transaction: bool = True) → None¶ ·