As mentioned in a comment, starting from pandas 0.15, you have a chunksize option in read_sql to read and process the query chunk by chunk:
sql = "SELECT * FROM My_Table"
for chunk in pd.read_sql_query(sql, engine, chunksize=5):
    print(chunk)
Reference: http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#querying
Answer from Kamil Sindi on Stack Overflow
Update: Make sure to check out the answer below, as Pandas now has built-in support for chunked loading.
You could simply try to read the input table chunk-wise and assemble your full dataframe from the individual pieces afterwards, like this:
import pandas as pd
import pandas.io.sql as psql
chunk_size = 10000
offset = 0
dfs = []
while True:
    # ORDER BY must come before LIMIT/OFFSET for deterministic paging
    sql = "SELECT * FROM MyTable ORDER BY ID LIMIT %d OFFSET %d" % (chunk_size, offset)
    dfs.append(psql.read_frame(sql, cnxn))  # psql.read_frame is the old API; modern pandas uses pd.read_sql
    offset += chunk_size
    if len(dfs[-1]) < chunk_size:
        break
full_df = pd.concat(dfs)
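As a self-contained sketch of this LIMIT/OFFSET pattern, the following uses an in-memory SQLite table standing in for MyTable and the current pd.read_sql_query in place of the deprecated psql.read_frame; the table, column names, and data are made up for illustration:

```python
import sqlite3

import pandas as pd

# Throwaway in-memory table standing in for MyTable (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (ID INTEGER PRIMARY KEY, val INTEGER)")
conn.executemany("INSERT INTO MyTable (ID, val) VALUES (?, ?)",
                 [(i, i * 2) for i in range(1, 26)])

chunk_size = 10
offset = 0
dfs = []
while True:
    # ORDER BY must come before LIMIT/OFFSET for deterministic paging.
    sql = "SELECT * FROM MyTable ORDER BY ID LIMIT %d OFFSET %d" % (chunk_size, offset)
    dfs.append(pd.read_sql_query(sql, conn))
    offset += chunk_size
    if len(dfs[-1]) < chunk_size:  # a short chunk means we reached the end
        break

full_df = pd.concat(dfs, ignore_index=True)
print(len(full_df))  # 25
```

With 25 rows and a chunk size of 10, the loop fetches chunks of 10, 10, and 5 rows, then stops because the last chunk came back short.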
It might also be possible that the whole dataframe is simply too large to fit in memory, in that case you will have no other option than to restrict the number of rows or columns you're selecting.
python - Pandas SQL chunksize - Stack Overflow
Reading table with chunksize still pumps the memory
I want to load a multi-million-row SQL output into Python; how do I go about doing this efficiently?
pandas read_sql reads the entire table into memory despite specifying chunksize
Let's consider two options and what happens in both cases:
- chunksize is None (the default value):
- pandas passes the query to the database
- the database executes the query
- pandas checks and sees that chunksize is None
- pandas tells the database that it wants to receive all rows of the result table at once
- the database returns all rows of the result table
- pandas stores the result table in memory and wraps it into a data frame
- now you can use the data frame
- chunksize is not None:
- pandas passes the query to the database
- the database executes the query
- pandas checks and sees that chunksize has some value
- pandas creates a query iterator (the usual 'while True' loop, which breaks when the database says there is no more data left) and iterates over it each time you want the next chunk of the result table
- pandas tells the database that it wants to receive chunksize rows
- the database returns the next chunksize rows from the result table
- pandas stores the next chunksize rows in memory and wraps them into a data frame
- now you can use the data frame
For more details, see the pandas/io/sql.py module; it is well documented.
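The chunksize path above can be sketched end to end. This is only an illustration: an in-memory SQLite table stands in for a real database, and the table and column names are made up:

```python
import sqlite3

import pandas as pd

# Small throwaway table (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t (x) VALUES (?)", [(i,) for i in range(100)])

total = 0
n_chunks = 0
# With chunksize set, read_sql_query returns an iterator of DataFrames,
# so each chunk can be aggregated and then discarded.
for chunk in pd.read_sql_query("SELECT x FROM t", conn, chunksize=30):
    total += int(chunk["x"].sum())
    n_chunks += 1

print(n_chunks, total)  # 4 chunks (30 + 30 + 30 + 10 rows), total 4950
```

Only one chunk of rows is wrapped into a DataFrame at a time, which is exactly the iterator behaviour described in the steps above.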
When you do not provide a chunksize, the full result of the query is put in a dataframe at once.
When you do provide a chunksize, the return value of read_sql_query is an iterator of multiple dataframes. This means that you can iterate through this like:
for df in result:
    print(df)
and in each step df is a dataframe (not an array!) that holds the data of a part of the query. See the docs on this: http://pandas.pydata.org/pandas-docs/stable/io.html#querying
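For instance (a minimal sketch against an in-memory SQLite table; the table and data are made up), each item yielded by the iterator really is a DataFrame holding a slice of the full result:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t (x) VALUES (?)", [(i,) for i in range(5)])

result = pd.read_sql_query("SELECT x FROM t", conn, chunksize=2)
# Each element is a DataFrame, not an array: 2 + 2 + 1 rows here.
shapes = [df.shape for df in result]
print(shapes)  # [(2, 1), (2, 1), (1, 1)]
```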
To answer your question regarding memory, you have to know that there are two steps in retrieving the data from the database: execute and fetch.
First the query is executed (result = con.execute()) and then the data are fetched from this result set as a list of tuples (data = result.fetch()). When fetching you can specify how many rows at once you want to fetch. And this is what pandas does when you provide a chunksize.
But many database drivers already put all the data into memory during the execute step, not only when fetching. So in that regard it should not matter much for memory, apart from the fact that the copying of the data into a DataFrame happens in smaller steps while iterating with chunksize.
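The execute/fetch split described above can be observed directly with a raw DB-API driver. A sketch using sqlite3 with an illustrative table and a batch size of 3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t (x) VALUES (?)", [(i,) for i in range(7)])

cur = conn.execute("SELECT x FROM t")  # the execute step
batch_sizes = []
while True:
    rows = cur.fetchmany(3)            # the fetch step, 3 rows at a time
    if not rows:                       # empty list means no rows left
        break
    batch_sizes.append(len(rows))

print(batch_sizes)  # [3, 3, 1]
```

fetchmany is the DB-API equivalent of what pandas does with chunksize; whether the driver buffered everything at execute time is driver-specific, as the paragraph above notes.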