As mentioned in a comment, starting from pandas 0.15, you have a chunksize option in read_sql to read and process the query chunk by chunk:
sql = "SELECT * FROM My_Table"
for chunk in pd.read_sql_query(sql , engine, chunksize=5):
print(chunk)
Reference: http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#querying
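For example, a minimal sketch of streaming over a large table without ever holding the full result in memory (the engine URL and the table name my_table are placeholders; the per-chunk work here is just a row count):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///mydb.sqlite")  # any SQLAlchemy engine works

total_rows = 0
# each iteration yields a DataFrame of at most 10000 rows
for chunk in pd.read_sql_query("SELECT * FROM my_table", engine, chunksize=10000):
    total_rows += len(chunk)  # replace with your own per-chunk processing
print(total_rows)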
Update: Make sure to check out the answer above, as pandas now has built-in support for chunked loading.
You could simply try to read the input table chunk-wise and assemble your full dataframe from the individual pieces afterwards, like this:
import pandas as pd
import pandas.io.sql as psql

chunk_size = 10000
offset = 0
dfs = []
while True:
    # ORDER BY must come before LIMIT/OFFSET in SQL
    sql = "SELECT * FROM MyTable ORDER BY ID LIMIT %d OFFSET %d" % (chunk_size, offset)
    dfs.append(psql.read_frame(sql, cnxn))  # in modern pandas, use pd.read_sql instead
    offset += chunk_size
    # the last chunk comes back short, so we are done
    if len(dfs[-1]) < chunk_size:
        break
full_df = pd.concat(dfs)
It might also be that the whole dataframe is simply too large to fit in memory; in that case your only option is to restrict the number of rows or columns you're selecting.
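If you go that route, a quick sketch of restricting the query itself (the column names and the WHERE clause are made up for illustration; cnxn is the same connection as above):

import pandas as pd

# pull only the columns you need and filter rows server-side
sql = "SELECT id, value FROM MyTable WHERE id > 1000000"
df = pd.read_sql(sql, cnxn)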
Yes, with integer location: the iloc starting index is the 'offset' and the ending index is the offset incremented by the 'limit':
df.sort_values('type', ascending=False).iloc[2:6]
Output:
   id type city
7   8    u    L
3   2    o    G
9   8    k    U
4   6    i    F
And you can add reset_index to clean up indexing.
print(df.sort_values('type', ascending=False).iloc[2:6].reset_index(drop=True))
Output:
   id type city
0   8    u    L
1   2    o    G
2   8    k    U
3   6    i    F
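If you need this repeatedly, a small illustrative helper (the name limit_offset is invented here) wraps the same iloc slice:

def limit_offset(df, limit, offset=0, sort_by=None, ascending=True):
    """Mimic SQL's LIMIT/OFFSET with an iloc slice."""
    if sort_by is not None:
        df = df.sort_values(sort_by, ascending=ascending)
    return df.iloc[offset:offset + limit].reset_index(drop=True)

# equivalent to: ... ORDER BY type DESC LIMIT 4 OFFSET 2
print(limit_offset(df, limit=4, offset=2, sort_by='type', ascending=False))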
Update: let's sort by type and index:
df.index.name = 'index'
df[['id','type']].sort_values(['type','index'], ascending=[False,True]).iloc[2:6].reset_index()
Output:
   index      id           type
0      3    6525  small_airport
1      5  322127  small_airport
2      6    6527  small_airport
3      7    6528  small_airport
You could use sort_values with ascending=False, and use .loc to slice the result (after resetting the index) down to the rows and columns of interest:
offset = 2
limit = 4
(df.sort_values(by='type', ascending=False).reset_index(drop=True)
.loc[offset : offset+limit-1, ['id','type']])
   id type
2   8    u
3   2    o
4   8    k
5   6    i
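Note the off-by-one difference between the two answers: iloc slices exclude the end bound, while .loc label slices include it, which is why one writes iloc[2:6] and the other loc[offset : offset+limit-1]. A quick sketch (df is any frame with the columns above):

offset, limit = 2, 4
sorted_df = df.sort_values(by='type', ascending=False).reset_index(drop=True)

via_iloc = sorted_df.iloc[offset:offset + limit]    # end bound excluded
via_loc = sorted_df.loc[offset:offset + limit - 1]  # label end bound included
assert via_iloc.equals(via_loc)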
Thanks for all the input (still new here)! I accidentally stumbled upon the solution, which is to reduce the chunksize in df.to_sql from
df.to_sql(chunksize=1000)
to
df.to_sql(chunksize=200)
After digging, it turns out there's a limitation on the SQL Server side (https://discuss.dizzycoding.com/to_sql-pyodbc-count-field-incorrect-or-syntax-error/).
In my case, I had the same "Output exceeds the size limit" error, and I fixed it by adding method='multi' to df.to_sql(method='multi'). First I tried the chunksize solution and it didn't work. So check whether you're in the same scenario!
with engine.connect().execution_options(autocommit=True) as conn:
    df.to_sql('mytable', con=conn, method='multi', if_exists='replace', index=True)
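If you'd rather derive the chunksize than guess it, SQL Server caps a single statement at 2100 parameters, and with method='multi' each inserted cell consumes one parameter; a sketch of that arithmetic (assuming the same df and engine as above):

# with method='multi', one INSERT carries roughly chunksize * n_cols parameters
n_cols = len(df.columns) + 1          # +1 because index=True also writes the index
safe_chunksize = 2100 // n_cols - 1   # stay safely under SQL Server's cap

with engine.connect().execution_options(autocommit=True) as conn:
    df.to_sql('mytable', con=conn, method='multi',
              if_exists='replace', index=True, chunksize=safe_chunksize)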