As mentioned in a comment, starting from pandas 0.15 you can pass a chunksize argument to read_sql to read and process the query results chunk by chunk:

sql = "SELECT * FROM My_Table"
for chunk in pd.read_sql_query(sql, engine, chunksize=5):
    print(chunk)

Reference: http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#querying

Answer from Kamil Sindi on Stack Overflow
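For a self-contained illustration of the chunking pattern, here is a minimal sketch using a throwaway in-memory SQLite table in place of a real engine (the table name and columns are placeholders):

```python
import sqlite3

import pandas as pd

# Throwaway in-memory database standing in for a real engine
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE My_Table (id INTEGER, value REAL)")
conn.executemany("INSERT INTO My_Table VALUES (?, ?)",
                 [(i, i * 0.5) for i in range(12)])

sql = "SELECT * FROM My_Table"
total = 0
for chunk in pd.read_sql_query(sql, conn, chunksize=5):
    # each chunk is an ordinary DataFrame of at most 5 rows
    total += len(chunk)

print(total)  # 12 rows processed, never more than 5 in memory at once
```

With chunksize set, read_sql_query returns an iterator of DataFrames rather than a single DataFrame, so only one chunk needs to fit in memory at a time.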
Top answer from Code Review Stack Exchange

Your code

def import_db_table(chunk_size, offset):

It doesn't look like you need to pass offset to this function; all it does is let you start reading from a given row instead of from the top. I would omit it, or at least give it a default value of 0. It also looks like connection should be one of the parameters.

    dfs_ct = []
    j = 0
    start = dt.datetime.now()
    df = pd.DataFrame()

    while True:
        sql_ct = "SELECT * FROM my_table limit %d offset %d" % (chunk_size, offset)
        dfs_ct.append(psql.read_sql_query(sql_ct, connection))

        offset += chunk_size

        if len(dfs_ct[-1]) < chunk_size:
            break

As written, the while loop should stop here. You can also get better performance by making a generator instead of a list out of the query results. For example:

Code suggestions

    def generate_df_pieces(connection, chunk_size, offset=0):
        while True:
            sql_ct = "SELECT * FROM my_table limit %d offset %d" % (chunk_size, offset)
            df_piece = psql.read_sql_query(sql_ct, connection)

            # don't yield an empty data frame
            if not df_piece.shape[0]:
                break
            yield df_piece

            # don't make an unnecessary database query
            if df_piece.shape[0] < chunk_size:
                break

            offset += chunk_size

Then you can call:

    df = pd.concat(generate_df_pieces(connection, chunk_size, offset=offset))

The function pd.concat can take any sequence or iterable. Making that sequence a generator, as here, is more efficient than growing a list, because you never need to keep more than one df_piece in memory before combining them into the final, larger frame.
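A minimal, self-contained sketch of this pattern, using an in-memory SQLite table (my_table and its columns are placeholders for the real schema):

```python
import sqlite3

import pandas as pd

# Throwaway in-memory table; "my_table" and its columns are placeholders
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE my_table (id INTEGER, val INTEGER)")
connection.executemany("INSERT INTO my_table VALUES (?, ?)",
                       [(i, i * i) for i in range(10)])

def generate_df_pieces(connection, chunk_size, offset=0):
    while True:
        sql_ct = "SELECT * FROM my_table LIMIT %d OFFSET %d" % (chunk_size, offset)
        df_piece = pd.read_sql_query(sql_ct, connection)
        if not df_piece.shape[0]:           # don't yield an empty frame
            break
        yield df_piece
        if df_piece.shape[0] < chunk_size:  # last (short) chunk reached
            break
        offset += chunk_size

# only one piece is alive at a time until concat assembles the result
df = pd.concat(generate_df_pieces(connection, chunk_size=4), ignore_index=True)
print(len(df))  # 10
```

Note the ignore_index=True: each LIMIT/OFFSET query comes back with its own 0-based index, so without it the concatenated frame would have repeating index values.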

Back to your code

        df = pd.concat(dfs_ct)

You're resetting the entire dataframe each time and rebuilding it anew from the whole list! If this were outside of the loop it would make sense.

        # Convert columns to datetime
        columns = ['col1', 'col2', 'col3','col4', 'col5', 'col6',
                   'col7', 'col8', 'col9', 'col10', 'col11', 'col12',
                   'col13', 'col14', 'col15']

        for column in columns:
            df[column] = pd.to_datetime(df[column], errors='coerce')

        # Remove the uninteresting columns
        columns_remove = ['col42', 'col43', 'col67', 'col52', 'col39',
                          'col48', 'col49', 'col50', 'col60', 'col61',
                          'col62', 'col63', 'col64', 'col75', 'col80']

        for c in df.columns:
            if c not in columns_remove:
                df = df.drop(c, axis=1)

This part could be done in the loop / generator function or outside. Dropping columns is a good thing to place inside, since then the big dataframe you build never needs to be wider than you want. If you can select only the columns you want in the SQL query itself, that would be even better, as there would be less to send over the connection.
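As a sketch, the query string could be built from a keep-list instead of dropping columns afterwards (the column names here are placeholders):

```python
# Sketch: build the SELECT from a keep-list (names are placeholders)
columns_keep = ["col1", "col2", "col3"]
chunk_size, offset = 1000, 0
sql_ct = "SELECT %s FROM my_table LIMIT %d OFFSET %d" % (
    ", ".join(columns_keep), chunk_size, offset)
print(sql_ct)  # SELECT col1, col2, col3 FROM my_table LIMIT 1000 OFFSET 0
```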

Another point about df.drop: by default it returns a new dataframe, so pass inplace=True to avoid copying your huge dataframe. It also accepts a list of columns to drop:

Code suggestions

        df.drop(columns_remove, axis=1, inplace=True)

gives the same result without looping and copying df over and over. You can also use:

        columns_remove_numbers = [ ... ] # list the column numbers
        columns_remove = df.columns[columns_remove_numbers]

So you don't have to type all those strings.
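For example, with a small throwaway frame (in the real code the columns would be col1..col80):

```python
import pandas as pd

# Small throwaway frame; the real table has many more columns
df = pd.DataFrame({"a": [1], "b": [2], "c": [3], "d": [4]})
columns_remove_numbers = [1, 3]                      # positions, not names
columns_remove = df.columns[columns_remove_numbers]  # Index(['b', 'd'])
df.drop(columns_remove, axis=1, inplace=True)
print(list(df.columns))  # ['a', 'c']
```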

Back to your code

        j+=1
        print('{} seconds: completed {} rows'.format((dt.datetime.now() - start).seconds, j*chunk_size))

If you use the generator function version of this, you could put this inside that function to keep track of the performance.
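As a sketch of what that could look like, again with a throwaway in-memory SQLite table standing in for the real one:

```python
import datetime as dt
import sqlite3

import pandas as pd

# Throwaway in-memory table; "my_table" is a placeholder
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE my_table (id INTEGER)")
connection.executemany("INSERT INTO my_table VALUES (?)",
                       [(i,) for i in range(9)])

def generate_df_pieces(connection, chunk_size, offset=0):
    start = dt.datetime.now()
    rows_done = 0
    while True:
        sql_ct = "SELECT * FROM my_table LIMIT %d OFFSET %d" % (chunk_size, offset)
        df_piece = pd.read_sql_query(sql_ct, connection)
        if not df_piece.shape[0]:
            break
        rows_done += df_piece.shape[0]
        # the progress report now lives next to the work it measures
        print('{} seconds: completed {} rows'.format(
            (dt.datetime.now() - start).seconds, rows_done))
        yield df_piece
        if df_piece.shape[0] < chunk_size:
            break
        offset += chunk_size

df = pd.concat(generate_df_pieces(connection, chunk_size=4))
```

Counting actual rows (rows_done) inside the generator is also more accurate than the original j*chunk_size, which overstates progress on the final short chunk.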
