As mentioned in a comment, starting from pandas 0.15, read_sql has a chunksize option to read and process the query result chunk by chunk:

sql = "SELECT * FROM My_Table"
for chunk in pd.read_sql_query(sql, engine, chunksize=5):
    print(chunk)

Reference: http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#querying

Answer from Kamil Sindi on Stack Overflow
Architecture-performance
Reading a SQL table by chunks with Pandas - Architecture et Performance
October 29, 2023 - In the following export_csv function, we create a connection, read the data by chunks with read_sql() and append the rows to a CSV file with to_csv():

    def export_csv(
        chunksize=1000,
        connect_string=CONNECT_STRING,
        sql_query=SQL_QUERY,
        csv_file_path=CSV_FP,
    ):
        engine = create_engine(connect_string)
        connection = engine.connect().execution_options(
            stream_results=True, max_row_buffer=chunksize
        )
        header = True
        mode = "w"
        for df in pd.read_sql(sql_query, connection, chunksize=chunksize):
            df.to_csv(csv_file_path, mode=mode, header=header, index=False)
            if header:
                header = False
                mode = "a"
        connection.close()
Discussions

python - Pandas SQL chunksize - Stack Overflow
This is more of a question on understanding than programming. I am quite new to Pandas and SQL. I am using pandas to read data from SQL with some specific chunksize. When I run a sql query e.g. im...
stackoverflow.com
Reading table with chunksize still pumps the memory
I'm trying to migrate database tables from MySQL to SQL Server:

    import pandas as pd
    from sqlalchemy import create_engine
    my_engine = create_engine("mysql+pymysql://root:pass@localhost/gen")
    ms_engi...

github.com
February 9, 2016
I want to load a multi million row SQL output into python, how do i go about doing this efficiently?
If you can still connect to the database you can read from it directly using Pandas' read_sql_table() function. If the table is too large and you run into memory limits you can use the chunksize parameter of read_sql_table and write each chunk to a file and then merge the files. I think my chunked approach below uses less memory than just reading the entire database response into a dataframe directly. I'm not completely sure why. Here is my code that reads an SQL query and returns a dataframe. You could use pd.read_sql_table() instead of read_sql_query if you just want a table and not a query. Pass in a string of "SELECT * FROM schemaname.tablename;" to query if you just want everything from the table.

    import sqlalchemy
    import pandas as pd
    import tempfile

    def make_connectstring(prefix, db, uname, hostname, port):
        """return an sql connectstring"""
        connectstring = prefix + "://" + uname + "@" + hostname + \
            ":" + port + "/" + db
        return connectstring

    def query_to_df(connectstring, query, verbose=False, chunksize=100000):
        """
        Return DataFrame from SELECT query and connectstring

        Given a valid SQL SELECT query and a connectstring, return a
        Pandas DataFrame with the response data.

        Args:
            connectstring: string with connection parameters
            query: Valid SQL, containing a SELECT query
            verbose: prints chunk progress if True. Default False.
            chunksize: Number of lines to read per chunk. Default 100000

        Returns:
            df: A Pandas DataFrame containing the response of query
        """
        engine = sqlalchemy.create_engine(connectstring,
                                          server_side_cursors=True,
                                          connect_args=make_ssl_args())
        # get the data to temp chunk files
        i = 0
        paths_chunks = []
        with tempfile.TemporaryDirectory() as td:
            for df in pd.read_sql_query(sql=query, con=engine, chunksize=chunksize):
                path = td + "/chunk" + str(i) + ".hdf5"
                df.to_hdf(path, key='data')
                if verbose:
                    print("wrote", path)
                paths_chunks.append(path)
                i += 1

            # Merge the chunks using concat, the most efficient way AFAIK
            df = pd.DataFrame()
            for path in paths_chunks:
                df_scratch = pd.read_hdf(path)
                df = pd.concat([df, df_scratch])
                if verbose:
                    print("read", path)
        return df

    prefix = "postgresql"
    uname = "username"
    db = "database_name"
    port = "5432"
    hostname = "hostname_of_db_server"
    connectstring = make_connectstring(prefix, db, uname, hostname, port)
    query = "SELECT * FROM schemaname.tablename"
    df = query_to_df(connectstring, query)
r/learnpython
July 24, 2018
pandas read_sql reads the entire table in to memory despite specifying chunksize
When I tried reading the table using pandas.read_sql_table I ran out of memory even though I had passed in the chunksize parameter.
github.com
May 13, 2016
Python⇒Speed
Loading SQL data into Pandas without running out of memory
January 6, 2023 - The result is an iterable of ...

    for chunk_dataframe in pd.read_sql(
            "SELECT * FROM users", engine, chunksize=1000):
        print(f"Got dataframe w/{len(chunk_dataframe)} rows")
Pandas
pandas.read_sql — pandas 3.0.1 documentation
pandas.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None, dtype_backend=<no_default>, dtype=None)
Read SQL query or database table into a DataFrame. This function is a convenience wrapper around read_sql_table and read_sql_query (for backward compatibility).
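The dispatch the documentation describes can be sketched with an in-memory SQLite engine; the table name "t" and the data are invented for the example:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite engine; the table name "t" is made up.
engine = create_engine("sqlite://")
pd.DataFrame({"a": [1, 2, 3]}).to_sql("t", engine, index=False)

# A bare table name dispatches to read_sql_table ...
df_table = pd.read_sql("t", engine)
# ... while a SELECT statement dispatches to read_sql_query.
df_query = pd.read_sql("SELECT a FROM t WHERE a > 1", engine)

print(len(df_table), len(df_query))  # 3 2
```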
Top answer
1 of 3

Let's consider two options and what happens in both cases:

  1. chunksize is None (the default value):
    • pandas passes query to database
    • database executes query
    • pandas checks and sees that chunksize is None
    • pandas tells database that it wants to receive all rows of the result table at once
    • database returns all rows of the result table
    • pandas stores the result table in memory and wraps it into a data frame
    • now you can use the data frame
  2. chunksize is not None:
    • pandas passes query to database
    • database executes query
    • pandas checks and sees that chunksize has some value
    • pandas creates a query iterator (a usual 'while True' loop which breaks when the database says there is no more data left) and iterates over it each time you want the next chunk of the result table
    • pandas tells database that it wants to receive chunksize rows
    • database returns the next chunksize rows from the result table
    • pandas stores the next chunksize rows in memory and wraps it into a data frame
    • now you can use the data frame

For more details, see the pandas/io/sql.py module; it is well documented
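The two cases above can be sketched with a throwaway in-memory SQLite table (the table name and sizes are invented for the example):

```python
import sqlite3

import pandas as pd

# Build a small throwaway table so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, value REAL)")
conn.executemany("INSERT INTO my_table VALUES (?, ?)",
                 [(i, i * 0.5) for i in range(10)])

# Case 1: chunksize is None -> one DataFrame holding the full result.
df = pd.read_sql_query("SELECT * FROM my_table", conn)
print(len(df))  # 10

# Case 2: chunksize=4 -> an iterator; each step yields a DataFrame
# of at most 4 rows (here 4, 4 and 2), fetched on demand.
chunk_lengths = [len(chunk) for chunk in
                 pd.read_sql_query("SELECT * FROM my_table", conn,
                                   chunksize=4)]
print(chunk_lengths)  # [4, 4, 2]
```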

2 of 3

When you do not provide a chunksize, the full result of the query is put in a dataframe at once.

When you do provide a chunksize, the return value of read_sql_query is an iterator of multiple dataframes. This means that you can iterate through this like:

for df in result:
    print(df)

and in each step df is a dataframe (not an array!) that holds the data of a part of the query. See the docs on this: http://pandas.pydata.org/pandas-docs/stable/io.html#querying

To answer your question regarding memory, you have to know that there are two steps in retrieving the data from the database: execute and fetch.
First the query is executed (result = con.execute()) and then the data are fetched from this result set as a list of tuples (data = result.fetch()). When fetching you can specify how many rows at once you want to fetch. And this is what pandas does when you provide a chunksize.
But many database drivers already put all the data into memory during the execute step, not only when fetching. So in that regard it should not matter much for memory usage, apart from the fact that the copying of the data into a DataFrame happens in smaller steps when iterating with chunksize.
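The execute/fetch split can be seen directly with a plain DB-API driver; pandas essentially runs the fetchmany() loop for you when you pass chunksize. A minimal sketch with sqlite3 (table and batch size are made up):

```python
import sqlite3

# A small table to fetch from.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(7)])

# Step 1: execute -- the driver runs the query and holds a result set.
cur = conn.execute("SELECT x FROM t")

# Step 2: fetch -- pull rows in batches, which is what pandas does
# for you when you provide a chunksize.
batch_sizes = []
while True:
    rows = cur.fetchmany(3)  # at most 3 rows per call
    if not rows:
        break
    batch_sizes.append(len(rows))

print(batch_sizes)  # [3, 3, 1]
```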

GitHub
Reading table with chunksize still pumps the memory · Issue #12265 · pandas-dev/pandas
February 9, 2016 -

    import pandas as pd
    from sqlalchemy ...

    for table_name in ['topics', 'fiction', 'compact']:
        for table in pd.read_sql_query('SELECT * FROM %s' % table_name, my_engine, chunksize=100000):
            table.to_sql(name=table_name, con=ms_engine, if_exists='append')
Author   klonuo
Medium
Efficiently Reading and Writing Large Datasets with Pandas and SQL | by Devesh Poojari | Medium
March 13, 2024 - We also specify a chunksize of 50000 rows, which means that the pd.read_sql() function will return a new DataFrame containing 50,000 rows at a time. We can then use a for loop to iterate over the chunks of data returned by the pd.read_sql() function.
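The read-a-chunk / write-a-chunk pipeline described there can be sketched end to end with toy in-memory SQLite databases (the table names and the 30-row chunksize are invented for the example):

```python
import sqlite3

import pandas as pd

# A toy 100-row source table.
src = sqlite3.connect(":memory:")
pd.DataFrame({"id": range(100), "val": range(100)}).to_sql(
    "events", src, index=False)

dst = sqlite3.connect(":memory:")

# Stream the source in 30-row chunks and append each chunk to the
# destination, so only one chunk is held in memory at a time.
for chunk in pd.read_sql("SELECT * FROM events", src, chunksize=30):
    chunk.to_sql("events_copy", dst, if_exists="append", index=False)

n_copied = pd.read_sql("SELECT COUNT(*) AS n FROM events_copy", dst)["n"][0]
print(n_copied)  # 100
```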
Like Geeks
Pandas read_sql with chunksize: Unlock Parallel Processing
    import pandas as pd
    from sqlalchemy ...

    query = """
        ... bytes_received FROM data_usage_logs
    """

    # Use chunksize to read the SQL query in chunks
    chunk_size = 5000
    chunks = pd.read_sql(query, engine, chunksize=chunk_size)
    for chunk in chunks:
        ...
Rip Tutorial
pandas Tutorial => To read mysql to dataframe, In case of large...
    import pandas as pd
    from sqlalchemy import create_engine
    from sqlalchemy.engine.url import URL

    # sqlalchemy engine
    engine = create_engine(URL(
        drivername="mysql",
        username="user",
        password="password",
        host="host",
        database="database",
    ))
    conn = engine.connect()
    generator_df = pd.read_sql(sql=query,            # mysql query
                               con=conn,
                               chunksize=chunksize)  # size you want to fetch each time
    for dataframe in generator_df:
        for row in dataframe.itertuples():
            pass  # whatever you want to do
Pandas
pandas.read_sql_query — pandas 0.17.0 documentation
pandas.read_sql_query(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, chunksize=None)
GitHub
pandas read_sql reads the entire table in to memory despite specifying chunksize · Issue #13168 · pandas-dev/pandas
May 13, 2016 -

    eng = sqlalchemy.create_engine("mysql+mysqldb://user:pass@localhost/db_name")
    dframe = pandas.read_sql_table('table_name', eng, chunksize=100)
Published   May 13, 2016
Author   jeetjitsu
datagy
Pandas read_sql: Reading SQL into DataFrames • datagy
February 22, 2023 -

    # Reading SQL Queries in Chunks
    import pandas as pd
    import sqlite3

    conn = sqlite3.connect('users')
    df = pd.DataFrame()
    for chunk in pd.read_sql(sql="SELECT * FROM users", con=conn, index_col='userid', chunksize=2):
        df = pd.concat([df, chunk])
GitHub
pd.read_sql_query with chunksize: pandasSQL_builder should only be called when first chunk is requested · Issue #19457 · pandas-dev/pandas
January 30, 2018 - Therefore pandasSQL_builder will be called within the Thread requesting the chunks.

    @attr.s(auto_attribs=True)
    class PDSQLQueryWrapper:
        """Wrap the iterator.

        To create the db engine in the thread that calls the iterator first.
        """
        _read_sql_query_iterator = None
        query: str
        url: str
        chunksize: int

        def __iter__(self):
            return self

        def __next__(self):
            if self._read_sql_query_iterator is None:
                self._read_sql_query_iterator = pd.read_sql_query(
                    self.query, self.url, chunksize=self.chunksize)
            return next(self._read_sql_query_iterator)
Andrew Wheeler
Chunking it up in pandas | Andrew Wheeler
August 12, 2021 - In the python pandas library, you ... chunks of the dataset, instead of the whole dataframe.

    data_chunks = pandas.read_sql_table('tablename', db_connection, chunksize=2000)

I thought for awhile this was somewhat worthless, as ...
Like Geeks
Read SQL Query/Table into DataFrame using Pandas read_sql
October 16, 2023 - If we have a large ‘users’ table, ...

    con = engine.connect().execution_options(stream_results=True)
    chunks = pd.read_sql("SELECT * FROM users", con, chunksize=500)
    for chunk in chunks:
        print(chunk)