It seems that your data are measured with resolution 0.1 and that the range is at least 18.7. My guess given the mention of "weather" is that they are Celsius temperatures.
Let's guess that the variable has a range 50 in those units: the tails beyond the quartiles are often longer than the difference between the quartiles. That would mean of the order of 500 distinct values.
It seems that your sample size is of the order of 500000, so on average each distinct value occurs about 1000 times, and ties are everywhere.
It's also entirely possible that your data are quirkier than that if human readings are involved. Many observers use some final digits rather than others, although the quirks can vary, including preferences for 0 and 5 as final digits or for even digits.
Ties are likely to be the issue, together with a rule that the same values must be assigned to the same bin.
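To make the effect of ties concrete, here is a minimal sketch, not the method used by any particular binning routine. It assumes made-up readings at resolution 0.1 and a hypothetical helper, bin_keeping_ties_together, that asks for four equal-sized bins while forcing equal values into the same bin.

```python
from collections import Counter


def bin_keeping_ties_together(values, n_bins):
    """Assign each value to a quantile-style bin; equal values always share a bin."""
    n = len(values)
    counts = Counter(values)
    assignment = {}
    cumulative = 0
    for value in sorted(counts):
        cumulative += counts[value]
        # Bin that the last observation with this value would fall into
        # (integer ceiling division, then shifted to a 0-based bin index).
        assignment[value] = (cumulative * n_bins + n - 1) // n - 1
    return [assignment[v] for v in values]


# Hypothetical data: 100 readings but only 3 distinct values.
readings = [10.0] * 50 + [10.1] * 30 + [10.2] * 20
bins = bin_keeping_ties_together(readings, n_bins=4)
bin_sizes = Counter(bins)
print(sorted(bin_sizes.items()))  # [(1, 50), (3, 50)]
```

Four bins of 25 were requested, but only two bins are populated, each holding 50 observations, because no block of tied values can be split.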
Answer from Nick Cox on Stack Exchange
Break this up into three parts to help isolate the problem and improve readability:
- Build the SQL string
- Set parameter values
- Execute pandas.read_sql_query
Build SQL
First ensure the ? placeholders are being set correctly. Use str.format with str.join and len to fill in the ?s dynamically, based on the length of member_list. The examples below assume a member_list of three elements.
Example
member_list = (1,2,3)
sql = """select member_id, yearmonth
from queried_table
where yearmonth between {0} and {0}
and member_id in ({1})"""
sql = sql.format('?', ','.join('?' * len(member_list)))
print(sql)
Returns
select member_id, yearmonth
from queried_table
where yearmonth between ? and ?
and member_id in (?,?,?)
Set Parameter Values
Now ensure parameter values are organized into a flat tuple
Example
# generator to flatten values of irregular nested sequences,
# modified from answers http://stackoverflow.com/questions/952914/making-a-flat-list-out-of-list-of-lists-in-python
def flatten(l):
    for el in l:
        try:
            # el is iterable: recurse into it
            yield from flatten(el)
        except TypeError:
            # el is a scalar: yield it as-is
            yield el
params = tuple(flatten((201601, 201603, member_list)))
print(params)
Returns
(201601, 201603, 1, 2, 3)
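One caveat worth illustrating: a flattener that recurses into anything iterable will also recurse into strings, and a one-character string is itself iterable, so string parameters never bottom out. The variant below is a sketch under the assumption that only lists and tuples should be unpacked; flatten_params is a hypothetical name, and the string year-month values are made up for illustration.

```python
def flatten_params(items):
    """Yield scalars from nested lists/tuples, leaving strings whole."""
    for item in items:
        if isinstance(item, (list, tuple)):
            # Only unpack real containers; strings are treated as scalars.
            yield from flatten_params(item)
        else:
            yield item


member_list = (1, 2, 3)
params = tuple(flatten_params(("2016-01", "2016-03", member_list)))
print(params)  # ('2016-01', '2016-03', 1, 2, 3)
```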
Execute
Finally bring the sql and params values together in the read_sql_query call
# pass params by keyword: the third positional argument is index_col, not params
query = pd.read_sql_query(sql, db2conn, params=params)
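To check the three steps end to end without a DB2 connection, the sketch below runs the same kind of query against an in-memory SQLite database from the standard library. The table contents are made up, and the plain sqlite3 execute call stands in for pandas here; the same sql string and params tuple are what would be handed to read_sql_query.

```python
import sqlite3

# Throwaway in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("create table queried_table (member_id integer, yearmonth integer)")
conn.executemany(
    "insert into queried_table values (?, ?)",
    [(1, 201601), (2, 201602), (3, 201604), (4, 201602)],
)

member_list = (1, 2, 3)

# Step 1: build the SQL with one ? per member.
sql = """select member_id, yearmonth
from queried_table
where yearmonth between ? and ?
and member_id in ({})
order by member_id""".format(",".join("?" * len(member_list)))

# Step 2: flat tuple of parameter values, in placeholder order.
params = (201601, 201603) + tuple(member_list)

# Step 3: execute with the parameters bound by the driver.
rows = conn.execute(sql, params).fetchall()
print(rows)  # [(1, 201601), (2, 201602)]
conn.close()
```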
WARNING! Although my proposed solution here works, it is prone to SQL injection attacks. Therefore, it should never be used directly in backend code! It is only safe for offline analysis.
If you're using Python 3.6+ you could also use a formatted string literal for your query (cf. https://docs.python.org/3/whatsnew/3.6.html#whatsnew36-pep498)
start, end = 201601, 201603
selected_members = (111, 222, 333, 444, 555)  # must be a tuple
query = f"""
SELECT member_id, yearmonth FROM queried_table
WHERE yearmonth BETWEEN {start} AND {end}
AND member_id IN {selected_members}
"""
df = pd.read_sql_query(query, db2conn)
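One pitfall of interpolating a tuple this way is the one-element case, where Python's tuple repr keeps a trailing comma. The sketch below is only an illustration: build_in_query is a hypothetical helper, and SQLite from the standard library stands in for the real database to show that the resulting statement is rejected.

```python
import sqlite3


def build_in_query(selected_members):
    # Same interpolation technique as above: relies on the tuple's repr.
    return f"SELECT member_id FROM queried_table WHERE member_id IN {selected_members}"


multi = build_in_query((111, 222))
single = build_in_query((111,))
print(multi)   # ... WHERE member_id IN (111, 222)
print(single)  # ... WHERE member_id IN (111,)

conn = sqlite3.connect(":memory:")
conn.execute("create table queried_table (member_id integer)")
try:
    conn.execute(single)
    single_is_valid_sql = True
except sqlite3.OperationalError:
    # The trailing comma makes the statement a syntax error in SQLite.
    single_is_valid_sql = False
conn.close()
```

With a single member, bound ? placeholders avoid the problem entirely.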
You can use the pandas sqlio module to run a query and store the result in a pandas DataFrame.
Let's say you have a psycopg2 connection; then you can use pandas sqlio like this.
import pandas.io.sql as sqlio
data = sqlio.read_sql_query("SELECT * FROM table", connection)
# Now data is a pandas dataframe having the results of above query.
data.head()
For me, sqlio pandas module is working fine. Please have a look at it and let me know if this is what you are looking for.
This may be helpful for your case:
import pandas.io.sql as sqlio
df = sqlio.read_sql_query(query, connection)
Where in your case, query = "select * from table"
I think aus_lacy is a bit off in his solution - first you have to convert the QuerySet to a string containing the SQL backing the QuerySet
from django.db import connection
query = str(ModelToRetrive.objects.all().query)
df = pandas.read_sql_query(query, connection)
Also there is a less memory efficient but still valid solution:
df = DataFrame(list(ModelToRetrive.objects.values('id','some_attribute_1','some_attribute_2')))
You need to use Django's built-in QuerySet API. Once you create a QuerySet, you can use the pandas read_sql_query method to construct the data frame. The simplest way to construct a QuerySet is to query the model's entire table, which can be done like so:
db_query = YourModel.objects.all()
You can use filters which are passed in as args when querying the database to create different QuerySet objects depending on what your needs are.
Then using pandas you could do something like:
d_frame = pandas.read_sql_query(db_query, other_args...)
You can pass a cursor object to the DataFrame constructor. For postgres:
import psycopg2
from pandas import DataFrame
conn = psycopg2.connect("dbname='db' user='user' host='host' password='pass'")
cur = conn.cursor()
cur.execute("select instrument, price, date from my_prices")
df = DataFrame(cur.fetchall(), columns=['instrument', 'price', 'date'])
Then set the index (set_index returns a new DataFrame, so assign the result):
df = df.set_index('date', drop=False)
or directly:
df.index = df['date']
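If you would rather not type the column names by hand, DB-API cursors expose them through cursor.description once a query has run. The sketch below uses an in-memory SQLite database with made-up rows in place of the Postgres connection; the resulting columns list is what you could pass as columns= to the DataFrame constructor.

```python
import sqlite3

# In-memory stand-in for the real database, with made-up prices.
conn = sqlite3.connect(":memory:")
conn.execute("create table my_prices (instrument text, price real, date text)")
conn.executemany(
    "insert into my_prices values (?, ?, ?)",
    [("AAA", 1.5, "2020-01-01"), ("BBB", 2.5, "2020-01-02")],
)

cur = conn.cursor()
cur.execute("select instrument, price, date from my_prices order by date")

# Each entry of cursor.description is a sequence whose first item is the column name.
columns = [col[0] for col in cur.description]
rows = cur.fetchall()

# Plain-Python view of the result; DataFrame(rows, columns=columns) is the pandas analogue.
records = [dict(zip(columns, row)) for row in rows]
print(columns)
print(records[0])
conn.close()
```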
Update: recent pandas have the following functions: read_sql_table and read_sql_query.
First create a db engine (a connection can also work here):
from sqlalchemy import create_engine
# see sqlalchemy docs for how to write this url for your database type:
engine = create_engine('mysql://scott:tiger@localhost/foo')
See sqlalchemy database urls.
pandas_read_sql_table
table_name = 'my_prices'
df = pd.read_sql_table(table_name, engine)
pandas_read_sql_query
df = pd.read_sql_query("SELECT instrument, price, date FROM my_prices;", engine)
The old answer referenced read_frame, which has been deprecated (see the version history of this question for that answer).
It often makes sense to read first, and then transform the result to your requirements (as these transformations are usually efficient and readable in pandas). In your example, you can pivot the result:
df.reset_index().pivot(index='date', columns='instrument', values='price')
Note: You can omit the reset_index if you don't specify an index_col in the read_frame.
You need to use the params keyword argument:
f = pd.read_sql_query('SELECT open FROM NYSEMSFT WHERE date = (?)', conn, params=(date,))
As @alecxe and @Ted Petrou have already said, use explicit parameter names, especially for the params parameter: it is the fifth parameter of the pd.read_sql_query() function, and you passed it as the third one (which is index_col).
Besides that, you can improve your code by getting rid of the for date in dates: loop using the following trick:
import sqlite3
import pandas as pd
dates=['2001-01-01','2002-02-02']
qry = 'select * from aaa where open in ({})'
conn = sqlite3.connect(r'D:\temp\.data\a.sqlite')
df = pd.read_sql(qry.format(','.join(list('?' * len(dates)))), conn, params=dates)
Demo:
Source SQLite table:
sqlite> .mode column
sqlite> .header on
sqlite> select * from aaa;
open
----------
2016-12-25
2001-01-01
2002-02-02
Test run:
In [40]: %paste
dates=['2001-01-01','2002-02-02']
qry = 'select * from aaa where open in ({})'
conn = sqlite3.connect(r'D:\temp\.data\a.sqlite')
df = pd.read_sql(qry.format(','.join(list('?' * len(dates)))), conn, params=dates)
## -- End pasted text --
In [41]: df
Out[41]:
open
0 2001-01-01
1 2002-02-02
Explanation:
In [35]: qry = 'select * from aaa where open in ({})'
In [36]: ','.join(list('?' * len(dates)))
Out[36]: '?,?'
In [37]: qry.format(','.join(list('?' * len(dates))))
Out[37]: 'select * from aaa where open in (?,?)'
In [38]: dates.append('2003-03-03') # <-- let's add a third parameter
In [39]: qry.format(','.join(list('?' * len(dates))))
Out[39]: 'select * from aaa where open in (?,?,?)'
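One edge case the trick does not cover is an empty list, which would produce in (), a form that standard SQL does not allow. The sketch below wraps the placeholder building in a hypothetical helper, in_clause_query, that refuses empty input; it runs against an in-memory copy of the aaa table, using plain sqlite3 rather than pandas.

```python
import sqlite3


def in_clause_query(template, values):
    """Fill the {} in template with one ? per value; reject empty input."""
    if not values:
        raise ValueError("need at least one value for the IN clause")
    return template.format(",".join("?" * len(values)))


# In-memory copy of the demo table.
conn = sqlite3.connect(":memory:")
conn.execute("create table aaa (open text)")
conn.executemany(
    "insert into aaa values (?)",
    [("2016-12-25",), ("2001-01-01",), ("2002-02-02",)],
)

dates = ["2001-01-01", "2002-02-02"]
qry = in_clause_query("select * from aaa where open in ({}) order by open", dates)
rows = conn.execute(qry, dates).fetchall()
print(qry)
print(rows)

try:
    in_clause_query("select * from aaa where open in ({})", [])
    empty_rejected = False
except ValueError:
    empty_rejected = True
conn.close()
```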