python - Executing an SQL query on a Pandas dataset - Stack Overflow
python - Read SQL query into pandas dataframe and replace string in query - Stack Overflow
python - Pandas read_sql with parameters - Stack Overflow
Do folks ever use Pandas when they should use SQL?
Videos
This is not what pandas.query is supposed to do. You can look at package pandasql (same like sqldf in R )
Update: Note pandasql hasn't been maintained since 2017. Use another library from an answer below.
import pandas as pd
import pandasql as ps
df = pd.DataFrame([[1234, 'Customer A', '123 Street', np.nan],
[1234, 'Customer A', np.nan, '333 Street'],
[1233, 'Customer B', '444 Street', '333 Street'],
[1233, 'Customer B', '444 Street', '666 Street']], columns=
['ID', 'Customer', 'Billing Address', 'Shipping Address'])
q1 = """SELECT ID FROM df """
print(ps.sqldf(q1, locals()))
ID
0 1234
1 1234
2 1233
3 1233
Update 2020-07-10
update the
pandasql
ps.sqldf("select * from df")
Much better solution is to use duckdb. It is much faster than sqldf because it does not have to load the entire data into sqlite and load back to pandas.
Update: duckdb is also faster than polars (which is also a very good solution and people are moving to polars for its performance), see https://benchmark.clickhouse.com/
pip install duckdb
import pandas as pd
import duckdb
test_df = pd.DataFrame.from_dict({"i":[1, 2, 3, 4], "j":["one", "two", "three", "four"]})
duckdb.query("SELECT * FROM test_df where i>2").df() # returns a result dataframe
Performance improvement over pandasql: test data NYC yellow cabs ~120mb of csv data
nyc = pd.read_csv('https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv',low_memory=False)
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
pysqldf("SELECT * FROM nyc where trip_distance>10")
# wall time 16.1s
duckdb.query("SELECT * FROM nyc where trip_distance>10").df()
# wall time 183ms
A improvement of speed of roughly 100x
This article gives good details and claims 1000x improvement over pandasql: https://duckdb.org/2021/05/14/sql-on-pandas.html
The read_sql docs say this params argument can be a list, tuple or dict (see docs).
To pass the values in the sql query, there are different syntaxes possible: ?, :1, :name, %s, %(name)s (see PEP249).
But not all of these possibilities are supported by all database drivers, which syntax is supported depends on the driver you are using (psycopg2 in your case I suppose).
In your second case, when using a dict, you are using 'named arguments', and according to the psycopg2 documentation, they support the %(name)s style (and so not the :name I suppose), see http://initd.org/psycopg/docs/usage.html#query-parameters.
So using that style should work:
df = psql.read_sql(('select "Timestamp","Value" from "MyTable" '
'where "Timestamp" BETWEEN %(dstart)s AND %(dfinish)s'),
db,params={"dstart":datetime(2014,6,24,16,0),"dfinish":datetime(2014,6,24,17,0)},
index_col=['Timestamp'])
I was having trouble passing a large number of parameters when reading from a SQLite Table. Then it turns out since you pass a string to read_sql, you can just use f-string. Tried the same with MSSQL pyodbc and it works as well.
For SQLite, it would look like this:
# write a sample table into memory
from sqlalchemy import create_engine
df = pd.DataFrame({'Timestamp': pd.date_range('2020-01-17', '2020-04-24', 10), 'Value1': range(10)})
engine = create_engine('sqlite://', echo=False)
df.to_sql('MyTable', engine);
# query the table using a query
tpl = (1, 3, 5, 8, 9)
query = f"""SELECT Timestamp, Value1 FROM MyTable WHERE Value1 IN {tpl}"""
df = pd.read_sql(query, engine)
If the parameters are datetimes, it's a bit more complicated but calling the datetime conversion function of the SQL dialect you're using should do the job.
start, end = '2020-01-01', '2020-04-01'
query = f"""SELECT Timestamp, Value1 FROM MyTable WHERE Timestamp BETWEEN STRFTIME("{start}") AND STRFTIME("{end}")"""
df = pd.read_sql(query, engine)