Just to underline my comment to @maxymoo's answer, it's almost invariably a bad idea ("code smell") to add names dynamically to a Python namespace. There are a number of reasons, the most salient being:
Created names might easily conflict with variables already used by your logic.
Since the names are dynamically created, you typically also end up using dynamic techniques to retrieve the data.
This is why dicts were included in the language. The correct way to proceed is:
d = {}
for name in companies:
    d[name] = pd.DataFrame()
Nowadays you can write a single dict comprehension expression to do the same thing, but some people find it less readable:
d = {name: pd.DataFrame() for name in companies}
Once d is created the DataFrame for company x can be retrieved as d[x], so you can look up a specific company quite easily. To operate on all companies you would typically use a loop like:
for name, df in d.items():
    # operate on DataFrame 'df' for company 'name'
In Python 2 you were better off writing
for name, df in d.iteritems():
because this avoids instantiating the list of (name, df) tuples
that .items() creates in the older version.
That's now largely of historical interest, though there will of
course be Python 2 applications still extant and requiring
(hopefully occasional) maintenance.
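As a minimal end-to-end sketch of the dict approach described above (the company names and the "price" column are made up for illustration and are not from the original question):

```python
import pandas as pd

# Hypothetical company names, used only for illustration.
companies = ["acme", "globex", "initech"]

# One empty DataFrame per company, keyed by name.
d = {name: pd.DataFrame() for name in companies}

# Look up a single company by its name and fill its (initially empty) frame.
x = "globex"
d[x]["price"] = [10.0, 10.5]

# Operate on every company in turn.
for name, df in d.items():
    print(name, df.shape)
```

Because the names are ordinary dict keys, they can never collide with variables in your own code, and the same lookup works whether the key is a literal or comes from another variable.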
You can do this (although obviously use exec with extreme caution if this is going to be public-facing code):
for c in companies:
    exec('{} = pd.DataFrame()'.format(c))
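To illustrate one reason for that caution, here is a small sketch of how dynamically created names can overwrite existing ones. A plain dict stands in for the module namespace, and the company names are hypothetical; lists are used instead of DataFrames to keep the example short.

```python
# A plain dict stands in for a module namespace that already holds a variable.
namespace = {"total": 100}

# Hypothetical company names; one of them happens to be called "total".
companies = ["acme", "total"]

for c in companies:
    exec("{} = []".format(c), namespace)

# The original value of 'total' has been silently replaced.
print(namespace["total"])

# With a dict, the names live in their own container and nothing is clobbered.
total = 100
d = {c: [] for c in companies}
print(total, d["total"])
```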
I think you think your code is doing something that it is not actually doing.
Specifically, this line: df = pd.read_csv(file)
You might think that on each iteration of the for loop this line is executed with df replaced by a string from dfs and file replaced by a filename from files. While the latter is true, the former is not.
Each iteration of the for loop reads a csv file and stores it in the variable df, overwriting the DataFrame that was read in during the previous iteration. In other words, df in your for loop is not being replaced with the variable names you defined in dfs.
The key takeaway here is that strings (e.g., 'df1', 'df2', etc.) cannot be substituted and used as variable names when executing code.
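A small sketch of that rebinding, using in-memory text in place of real CSV files (the file contents are made up for illustration):

```python
import io
import pandas as pd

# In-memory stand-ins for CSV files.
files = [io.StringIO("a,b\n1,2\n"), io.StringIO("a,b\n3,4\n5,6\n")]
dfs = ["df1", "df2"]

for df, file in zip(dfs, files):
    # At this point 'df' holds the string 'df1' or 'df2' ...
    df = pd.read_csv(file)
    # ... and now the same name 'df' is rebound to the DataFrame just read.

# Only the DataFrame from the last file is still reachable through 'df';
# no variables named df1 or df2 were created along the way.
print(type(df).__name__, df.shape)
```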
One way to achieve the result you want is to store each csv file read by pd.read_csv() in a dictionary, where the key is the name of the dataframe (e.g., 'df1', 'df2', etc.) and the value is the dataframe returned by pd.read_csv().
list_of_dfs = {}
for df, file in zip(dfs, files):
    list_of_dfs[df] = pd.read_csv(file)
    print(list_of_dfs[df].shape)
    print(list_of_dfs[df].dtypes)
    print(list(list_of_dfs[df]))
You can then reference each of your dataframes like this:
print(list_of_dfs['df1'])
print(list_of_dfs['df2'])
You can learn more about dictionaries here:
https://docs.python.org/3.6/tutorial/datastructures.html#dictionaries
Use a dictionary to store your DataFrames and access them by name
files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
dfs_names = ('df1', 'df2', 'df3', 'df4', 'df5', 'df6')
dfs = {}
for dfn, file in zip(dfs_names, files):
    dfs[dfn] = pd.read_csv(file)
    print(dfs[dfn].shape)
    print(dfs[dfn].dtypes)
print(dfs['df3'])
Use a list to store your DataFrames and access them by index
files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
dfs = []
for file in files:
    dfs.append(pd.read_csv(file))
    print(dfs[-1].shape)   # dfs[-1] is the DataFrame that was just appended
    print(dfs[-1].dtypes)
print(dfs[2])
Do not store the intermediate DataFrames; just process each one and add it to the resulting DataFrame.
files = ('data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv')
df = pd.DataFrame()
for file in files:
    df_n = pd.read_csv(file)
    print(df_n.shape)
    print(df_n.dtypes)
    # do whatever you want to do with df_n
    df = pd.concat([df, df_n])  # DataFrame.append was removed in pandas 2.0
print(df)
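A variation worth considering, sketched here with in-memory text instead of real files (the contents are made up): collect the pieces in a list and call pd.concat once at the end. This avoids copying the growing result on every iteration, and it works on pandas versions where DataFrame.append is no longer available.

```python
import io
import pandas as pd

# In-memory stand-ins for CSV files.
sources = [
    io.StringIO("id,value\n1,10\n2,20\n"),
    io.StringIO("id,value\n3,30\n"),
]

frames = []
for src in sources:
    df_n = pd.read_csv(src)
    # per-file processing would go here
    frames.append(df_n)

# One concatenation at the end; ignore_index renumbers the rows 0..n-1.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```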
If you are going to process them differently, then you do not need a general structure to store them. Simply handle each one independently.
df = pd.DataFrame()
def do_general_stuff(d):  # here we do the things common to every DataFrame
    print(d.shape, d.dtypes)
df1 = pd.read_csv("data1.csv")
# do whatever you want with df1
do_general_stuff(df1)
df = pd.concat([df, df1])  # DataFrame.append was removed in pandas 2.0
del df1
df2 = pd.read_csv("data2.csv")
# do whatever you want with df2
do_general_stuff(df2)
df = pd.concat([df, df2])
del df2
df3 = pd.read_csv("data3.csv")
# do whatever you want with df3
do_general_stuff(df3)
df = pd.concat([df, df3])
del df3
# ... and so on
And one geeky way, but don't ask how it works:)
from collections import namedtuple
files = ['data1.csv', 'data2.csv', 'data3.csv', 'data4.csv', 'data5.csv', 'data6.csv']
df = namedtuple('Cdfs',
                ['df1', 'df2', 'df3', 'df4', 'df5', 'df6']
                )(*[pd.read_csv(file) for file in files])
for df_n in df._fields:
    print(getattr(df, df_n).shape, getattr(df, df_n).dtypes)
print(df.df3)
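For anyone who does want to ask how it works, here is the same idea unrolled into separate steps. This is only a sketch: it uses two made-up in-memory sources in place of the six CSV files.

```python
import io
from collections import namedtuple

import pandas as pd

# In-memory stand-ins for CSV files.
sources = [io.StringIO("a\n1\n"), io.StringIO("a\n2\n3\n")]
names = ["df1", "df2"]

# Step 1: namedtuple() builds a new tuple subclass with one field per name.
Cdfs = namedtuple("Cdfs", names)

# Step 2: read every source into a DataFrame.
frames = [pd.read_csv(src) for src in sources]

# Step 3: unpack the list so that each DataFrame fills one field, in order.
df = Cdfs(*frames)

print(df._fields)       # the field names given in step 1
print(df.df2.shape)     # attribute access by name
print(df[1] is df.df2)  # positional access returns the same object
```

The result is immutable like any tuple, so it suits a fixed, known set of frames better than a collection that grows over time.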
I would generally discourage you from creating lots of variables with related names, which is a dangerous design pattern in Python (although it's common in SAS, for example). A better option would be to create a dictionary of dataframes with the key as your 'variable name'
df_dict = dict()
# df_2011, df_2012 and df_2013 are assumed names for your existing yearly DataFrames
for year, df in (("2011", df_2011), ("2012", df_2012), ("2013", df_2013)):
    df_dict["pivot_" + year] = pd.pivot_table(df, index=["income"], columns=["area"], values=["id"], aggfunc='count')
I'm assuming here that your dataframes are held in variables named df_2011, df_2012 and df_2013 (a Python name cannot start with a digit, so they cannot literally be called 2011, 2012, 2013).
I don't see any other way but to create a list or a dictionary of data frames; you'd have to name them manually otherwise.
df_list = [pd.pivot_table(df, index=["income"], columns=["area"], values=["id"], aggfunc='count') for df in (df_2011, df_2012, df_2013)]
You can find an example here.
I am learning Python and am having trouble accessing data from multiple dataframes.
I want to make multiple bar plots with different dataframes. All of the dataframes have the same columns. So I thought, instead of writing the code one by one, maybe I could somehow iterate through the dataframes. But I haven't found the right way to do it. Could anyone advise me? I am curious whether it can be done in one go instead of writing it out for every dataframe.
This should do it:
for i in province_id:
    for j in year:
        locals()['sub_data_{}_{}'.format(i,j)] = data[(data.provid==i) & (data.wave==j)]
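One caveat, shown as a small sketch: writing into locals() only has this effect at module level, where locals() is the same mapping as globals(). Inside a function it does not create a usable local variable. The function name below is made up for the demonstration.

```python
def make_variable():
    # Inside a function, writing into the mapping returned by locals()
    # does not create a real local variable.
    locals()["sub_data_1_2004"] = "some data"
    try:
        return sub_data_1_2004  # looked up as a (nonexistent) global name
    except NameError:
        return "NameError"

print(make_variable())
```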
I initially suggested using exec, which is not usually considered best practice for safety reasons. That said, if your code is not exposed to anyone with malicious intentions, it should be OK, and I'll leave it here for the sake of completeness (written with the function-call form, which Python 3 requires):
for i in province_id:
    for j in year:
        exec("sub_data_{}_{} = data[(data.provid==i) & (data.wave==j)]".format(i,j))
Nevertheless, for most use cases, it's probably better to use a collection of some sort, e.g. a dictionary, because it will be cumbersome to reference dynamically generated variable names in subsequent parts of your code. It's also a one-liner:
data_dict = {key:g for key,g in data.groupby(['provid','wave'])}
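A quick sketch of what that one-liner produces, using a small made-up frame with the same provid and wave column names: when grouping by two columns, each key is a (provid, wave) tuple.

```python
import pandas as pd

# Made-up data with the column names used above.
data = pd.DataFrame({
    "provid": ["a", "a", "b", "b"],
    "wave": [2004, 2005, 2004, 2004],
    "value": [1, 2, 3, 4],
})

data_dict = {key: g for key, g in data.groupby(["provid", "wave"])}

# Keys are (provid, wave) tuples; each value is the matching sub-DataFrame.
print(sorted(data_dict))
print(data_dict[("b", 2004)])
```

Only combinations that actually occur in the data become keys, so there is no entry for ("b", 2005) here.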
I think the best approach is to create a dictionary of DataFrames with groupby, filtering first by boolean indexing:
df = pd.DataFrame({'A':list('abcdef'),
                   'wave':[2004,2005,2004,2005,2005,2004],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'provid':list('aaabbb')})
print (df)
   A  C  D  E provid  wave
0  a  7  1  5      a  2004
1  b  8  3  3      a  2005
2  c  9  5  6      a  2004
3  d  4  7  9      b  2005
4  e  2  1  2      b  2005
5  f  3  0  4      b  2004
province_id = ['a','b']
year = [2004]
df = df[(df.provid.isin(province_id)) & (df.wave.isin(year))]
print (df)
   A  C  D  E provid  wave
0  a  7  1  5      a  2004
2  c  9  5  6      a  2004
5  f  3  0  4      b  2004
dfs = {'{0[0]}_{0[1]}'.format(i) : x for i, x in df.groupby(['provid','wave'])}
Another solution:
dfs = dict(tuple(df.groupby(df['provid'] + '_' + df['wave'].astype(str))))
print (dfs)
{'a_2004':    A  C  D  E provid  wave
0  a  7  1  5      a  2004
2  c  9  5  6      a  2004, 'b_2004':    A  C  D  E provid  wave
5  f  3  0  4      b  2004}
Finally, you can select each DataFrame:
print (dfs['b_2004'])
   A  C  D  E provid  wave
5  f  3  0  4      b  2004
Your approach should be changed to:
sub_data = {}
province_id = ['a','b']
year = [2004]
for i in province_id:
    for j in year:
        sub_data[i + '_' + str(j)] = df[(df.provid==i) & (df.wave==j)]
print (sub_data)
{'a_2004':    A  C  D  E provid  wave
0  a  7  1  5      a  2004
2  c  9  5  6      a  2004, 'b_2004':    A  C  D  E provid  wave
5  f  3  0  4      b  2004}
I found the answer I was looking for:
import pandas as pd
gbl = globals()
for i in locations:
    gbl['df_'+i] = df[df.Start_Location==i]
This will create 3 data frames: df_HOME, df_OFFICE and df_SHOPPING
Thanks,
Use groupby() and then call its get_group() method:
import pandas as pd
import io
text = b"""Start_Location End_Location Date
OFFICE HOME 3-Apr-15
OFFICE HOME 3-Apr-15
HOME SHOPPING 3-Apr-15
HOME SHOPPING 4-Apr-15
HOME SHOPPING 4-Apr-15
SHOPPING HOME 5-Apr-15
SHOPPING HOME 5-Apr-15
HOME SHOPPING 5-Apr-15"""
locations = ["HOME", "OFFICE", "SHOPPING"]
df = pd.read_csv(io.BytesIO(text), sep=r"\s+")  # delim_whitespace is deprecated in recent pandas
g = df.groupby("Start_Location")
for name, df2 in g:
    globals()["df_" + name.lower()] = df2
However, I don't think adding global variables in a for loop is a good approach; you can instead convert the groupby to a dict:
d = dict(iter(g))
Then you can use d["HOME"] to get the data.
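For completeness, a short sketch of get_group() alongside the dict conversion, on a trimmed-down copy of the sample data (the rows are shortened for illustration):

```python
import io
import pandas as pd

text = """Start_Location End_Location Date
OFFICE HOME 3-Apr-15
HOME SHOPPING 3-Apr-15
HOME SHOPPING 4-Apr-15
SHOPPING HOME 5-Apr-15"""

df = pd.read_csv(io.StringIO(text), sep=r"\s+")
g = df.groupby("Start_Location")

# Pull out one group by its key ...
home = g.get_group("HOME")

# ... or materialize every group at once as {key: sub-DataFrame}.
d = dict(iter(g))

print(home)
print(sorted(d))
```

get_group() is convenient when you need only one or two groups; the dict is handier when you will loop over all of them.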