As mentioned in a comment, starting from pandas 0.15, you have a chunksize option in read_sql to read and process the query chunk by chunk:
sql = "SELECT * FROM My_Table"
for chunk in pd.read_sql_query(sql, engine, chunksize=5):
    print(chunk)
Reference: http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#querying
Answer from Kamil Sindi on Stack Overflow
Update: Make sure to check out the answer below, as Pandas now has built-in support for chunked loading.
You could simply try to read the input table chunk-wise and assemble your full dataframe from the individual pieces afterwards, like this:
import pandas as pd
import pandas.io.sql as psql
chunk_size = 10000
offset = 0
dfs = []
while True:
    # ORDER BY must come before LIMIT/OFFSET for deterministic paging
    sql = "SELECT * FROM MyTable ORDER BY ID LIMIT %d OFFSET %d" % (chunk_size, offset)
    dfs.append(psql.read_frame(sql, cnxn))  # psql.read_frame is the old API; modern pandas uses pd.read_sql
    offset += chunk_size
    if len(dfs[-1]) < chunk_size:
        break
full_df = pd.concat(dfs)
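As a self-contained sketch of this LIMIT/OFFSET pattern, the following uses an in-memory SQLite table standing in for MyTable and the current pd.read_sql_query in place of the deprecated psql.read_frame; the table, column names, and data are made up for illustration:

```python
import sqlite3

import pandas as pd

# Throwaway in-memory table standing in for MyTable (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (ID INTEGER PRIMARY KEY, val INTEGER)")
conn.executemany("INSERT INTO MyTable (ID, val) VALUES (?, ?)",
                 [(i, i * 2) for i in range(1, 26)])

chunk_size = 10
offset = 0
dfs = []
while True:
    # ORDER BY must come before LIMIT/OFFSET for deterministic paging.
    sql = "SELECT * FROM MyTable ORDER BY ID LIMIT %d OFFSET %d" % (chunk_size, offset)
    dfs.append(pd.read_sql_query(sql, conn))
    offset += chunk_size
    if len(dfs[-1]) < chunk_size:  # a short chunk means we reached the end
        break

full_df = pd.concat(dfs, ignore_index=True)
print(len(full_df))  # 25
```

With 25 rows and a chunk size of 10, the loop fetches chunks of 10, 10, and 5 rows, then stops because the last chunk came back short.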
It might also be possible that the whole dataframe is simply too large to fit in memory, in that case you will have no other option than to restrict the number of rows or columns you're selecting.
python - Pandas SQL chunksize - Stack Overflow
Reading table with chunksize still pumps the memory
I want to load a multi-million-row SQL output into Python; how do I go about doing this efficiently?
pandas read_sql reads the entire table into memory despite specifying chunksize
Let's consider two options and what happens in both cases:
- chunksize is None (the default value):
- pandas passes the query to the database
- the database executes the query
- pandas checks and sees that chunksize is None
- pandas tells the database that it wants to receive all rows of the result table at once
- the database returns all rows of the result table
- pandas stores the result table in memory and wraps it into a data frame
- now you can use the data frame
- chunksize is not None:
- pandas passes the query to the database
- the database executes the query
- pandas checks and sees that chunksize has some value
- pandas creates a query iterator (the usual 'while True' loop, which breaks when the database says there is no more data left) and iterates over it each time you want the next chunk of the result table
- pandas tells the database that it wants to receive chunksize rows
- the database returns the next chunksize rows from the result table
- pandas stores the next chunksize rows in memory and wraps them into a data frame
- now you can use the data frame
For more details, see the pandas/io/sql.py module; it is well documented.
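The chunksize path above can be sketched end to end. This is only an illustration: an in-memory SQLite table stands in for a real database, and the table and column names are made up:

```python
import sqlite3

import pandas as pd

# Small throwaway table (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t (x) VALUES (?)", [(i,) for i in range(100)])

total = 0
n_chunks = 0
# With chunksize set, read_sql_query returns an iterator of DataFrames,
# so each chunk can be aggregated and then discarded.
for chunk in pd.read_sql_query("SELECT x FROM t", conn, chunksize=30):
    total += int(chunk["x"].sum())
    n_chunks += 1

print(n_chunks, total)  # 4 chunks (30 + 30 + 30 + 10 rows), total 4950
```

Only one chunk of rows is wrapped into a DataFrame at a time, which is exactly the iterator behaviour described in the steps above.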
When you do not provide a chunksize, the full result of the query is put in a dataframe at once.
When you do provide a chunksize, the return value of read_sql_query is an iterator of multiple dataframes. This means that you can iterate through this like:
for df in result:
    print(df)
and in each step df is a dataframe (not an array!) that holds the data of a part of the query. See the docs on this: http://pandas.pydata.org/pandas-docs/stable/io.html#querying
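For instance (a minimal sketch against an in-memory SQLite table; the table and data are made up), each item yielded by the iterator really is a DataFrame holding a slice of the full result:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t (x) VALUES (?)", [(i,) for i in range(5)])

result = pd.read_sql_query("SELECT x FROM t", conn, chunksize=2)
# Each element is a DataFrame, not an array: 2 + 2 + 1 rows here.
shapes = [df.shape for df in result]
print(shapes)  # [(2, 1), (2, 1), (1, 1)]
```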
To answer your question regarding memory, you have to know that there are two steps in retrieving the data from the database: execute and fetch.
First the query is executed (result = con.execute()) and then the data are fetched from this result set as a list of tuples (data = result.fetch()). When fetching you can specify how many rows at once you want to fetch. And this is what pandas does when you provide a chunksize.
But many database drivers already put all the data into memory during the execute step, not only when fetching. So in that regard it should not matter much for memory, apart from the fact that the copying of the data into a DataFrame happens in smaller steps while iterating with chunksize.
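The execute/fetch split described above can be observed directly with a raw DB-API driver. A sketch using sqlite3 with an illustrative table and a batch size of 3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t (x) VALUES (?)", [(i,) for i in range(7)])

cur = conn.execute("SELECT x FROM t")  # the execute step
batch_sizes = []
while True:
    rows = cur.fetchmany(3)            # the fetch step, 3 rows at a time
    if not rows:                       # empty list means no rows left
        break
    batch_sizes.append(len(rows))

print(batch_sizes)  # [3, 3, 1]
```

fetchmany is the DB-API equivalent of what pandas does with chunksize; whether the driver buffered everything at execute time is driver-specific, as the paragraph above notes.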