As mentioned in a comment, starting from pandas 0.15 you can pass a chunksize argument to read_sql to read and process the query results chunk by chunk:

sql = "SELECT * FROM My_Table"
for chunk in pd.read_sql_query(sql, engine, chunksize=5):
    print(chunk)

Reference: http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#querying

Answer from Kamil Sindi on Stack Overflow
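For a self-contained illustration of the chunking pattern, here is a minimal sketch using a throwaway in-memory SQLite table in place of a real engine (the table name and columns are placeholders):

```python
import sqlite3

import pandas as pd

# Throwaway in-memory database standing in for a real engine
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE My_Table (id INTEGER, value REAL)")
conn.executemany("INSERT INTO My_Table VALUES (?, ?)",
                 [(i, i * 0.5) for i in range(12)])

sql = "SELECT * FROM My_Table"
total = 0
for chunk in pd.read_sql_query(sql, conn, chunksize=5):
    # each chunk is an ordinary DataFrame of at most 5 rows
    total += len(chunk)

print(total)  # 12 rows processed, never more than 5 in memory at once
```

With chunksize set, read_sql_query returns an iterator of DataFrames rather than a single DataFrame, so only one chunk needs to fit in memory at a time.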
Top answer from Code Review Stack Exchange

Your code

def import_db_table(chunk_size, offset):

It doesn't look like you need to pass offset to this function; all it does is let you start reading from a given row instead of from the top. I would omit it, or at least give it a default value of 0. It also looks like connection should be one of the parameters.

    dfs_ct = []
    j = 0
    start = dt.datetime.now()
    df = pd.DataFrame()

    while True:
        sql_ct = "SELECT * FROM my_table limit %d offset %d" % (chunk_size, offset)
        dfs_ct.append(psql.read_sql_query(sql_ct, connection))

        offset += chunk_size

        if len(dfs_ct[-1]) < chunk_size:
            break

As written, the while loop should stop here. You can also get better performance by making a generator instead of a list out of the query results. For example:

Code suggestions

    def generate_df_pieces(connection, chunk_size, offset=0):
        while True:
            sql_ct = "SELECT * FROM my_table limit %d offset %d" % (chunk_size, offset)
            df_piece = psql.read_sql_query(sql_ct, connection)

            # don't yield an empty data frame
            if not df_piece.shape[0]:
                break
            yield df_piece

            # don't make an unnecessary database query
            if df_piece.shape[0] < chunk_size:
                break

            offset += chunk_size

Then you can call:

    df = pd.concat(generate_df_pieces(connection, chunk_size, offset=offset))

The function pd.concat can take any sequence or iterable. Making that sequence a generator, as here, is more efficient than growing a list, because you never need to keep more than one df_piece in memory before combining them into the final, larger frame.
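A minimal, self-contained sketch of this pattern, using an in-memory SQLite table (my_table and its columns are placeholders for the real schema):

```python
import sqlite3

import pandas as pd

# Throwaway in-memory table; "my_table" and its columns are placeholders
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE my_table (id INTEGER, val INTEGER)")
connection.executemany("INSERT INTO my_table VALUES (?, ?)",
                       [(i, i * i) for i in range(10)])

def generate_df_pieces(connection, chunk_size, offset=0):
    while True:
        sql_ct = "SELECT * FROM my_table LIMIT %d OFFSET %d" % (chunk_size, offset)
        df_piece = pd.read_sql_query(sql_ct, connection)
        if not df_piece.shape[0]:           # don't yield an empty frame
            break
        yield df_piece
        if df_piece.shape[0] < chunk_size:  # last (short) chunk reached
            break
        offset += chunk_size

# only one piece is alive at a time until concat assembles the result
df = pd.concat(generate_df_pieces(connection, chunk_size=4), ignore_index=True)
print(len(df))  # 10
```

Note the ignore_index=True: each LIMIT/OFFSET query comes back with its own 0-based index, so without it the concatenated frame would have repeating index values.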

Back to your code

        df = pd.concat(dfs_ct)

You're resetting the entire dataframe each time and rebuilding it anew from the whole list! If this were outside of the loop it would make sense.

        # Convert columns to datetime
        columns = ['col1', 'col2', 'col3','col4', 'col5', 'col6',
                   'col7', 'col8', 'col9', 'col10', 'col11', 'col12',
                   'col13', 'col14', 'col15']

        for column in columns:
            df[column] = pd.to_datetime(df[column], errors='coerce')

        # Remove the uninteresting columns
        columns_remove = ['col42', 'col43', 'col67', 'col52', 'col39',
                          'col48', 'col49', 'col50', 'col60', 'col61',
                          'col62', 'col63', 'col64', 'col75', 'col80']

        for c in df.columns:
            if c not in columns_remove:
                df = df.drop(c, axis=1)

This part could be done in the loop / generator function or outside. Dropping columns is a good thing to place inside, since then the big dataframe you build never needs to be wider than you want. If you can select only the columns you want in the SQL query itself, that would be even better, as there would be less to send over the connection.
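As a sketch, the query string could be built from a keep-list instead of dropping columns afterwards (the column names here are placeholders):

```python
# Sketch: build the SELECT from a keep-list (names are placeholders)
columns_keep = ["col1", "col2", "col3"]
chunk_size, offset = 1000, 0
sql_ct = "SELECT %s FROM my_table LIMIT %d OFFSET %d" % (
    ", ".join(columns_keep), chunk_size, offset)
print(sql_ct)  # SELECT col1, col2, col3 FROM my_table LIMIT 1000 OFFSET 0
```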

Another point about df.drop: by default it returns a new dataframe, so pass inplace=True to avoid copying your huge dataframe. It also accepts a list of columns to drop:

Code suggestions

        df.drop(columns_remove, axis=1, inplace=True)

gives the same result without looping and copying df over and over. You can also use:

        columns_remove_numbers = [ ... ] # list the column numbers
        columns_remove = df.columns[columns_remove_numbers]

So you don't have to type all those strings.
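For example, with a small throwaway frame (in the real code the columns would be col1..col80):

```python
import pandas as pd

# Small throwaway frame; the real table has many more columns
df = pd.DataFrame({"a": [1], "b": [2], "c": [3], "d": [4]})
columns_remove_numbers = [1, 3]                      # positions, not names
columns_remove = df.columns[columns_remove_numbers]  # Index(['b', 'd'])
df.drop(columns_remove, axis=1, inplace=True)
print(list(df.columns))  # ['a', 'c']
```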

Back to your code

        j+=1
        print('{} seconds: completed {} rows'.format((dt.datetime.now() - start).seconds, j*chunk_size))

If you use the generator function version of this, you could put this inside that function to keep track of the performance.
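As a sketch of what that could look like, again with a throwaway in-memory SQLite table standing in for the real one:

```python
import datetime as dt
import sqlite3

import pandas as pd

# Throwaway in-memory table; "my_table" is a placeholder
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE my_table (id INTEGER)")
connection.executemany("INSERT INTO my_table VALUES (?)",
                       [(i,) for i in range(9)])

def generate_df_pieces(connection, chunk_size, offset=0):
    start = dt.datetime.now()
    rows_done = 0
    while True:
        sql_ct = "SELECT * FROM my_table LIMIT %d OFFSET %d" % (chunk_size, offset)
        df_piece = pd.read_sql_query(sql_ct, connection)
        if not df_piece.shape[0]:
            break
        rows_done += df_piece.shape[0]
        # the progress report now lives next to the work it measures
        print('{} seconds: completed {} rows'.format(
            (dt.datetime.now() - start).seconds, rows_done))
        yield df_piece
        if df_piece.shape[0] < chunk_size:
            break
        offset += chunk_size

df = pd.concat(generate_df_pieces(connection, chunk_size=4))
```

Counting actual rows (rows_done) inside the generator is also more accurate than the original j*chunk_size, which overstates progress on the final short chunk.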
