I'm currently working on simulated data. For each simulation, I have to create a dataframe with around 1500-4500 rows. Each row depends on the data of previous rows, so I must iterate over the dataframe to build it row by row. I want to repeat this process 500 times, and each of these runs is completely independent from the others.
I have scripts to generate all this data, but they take too much time since they can only run one simulation at a time. Is it possible for a single script to run the simulations in parallel so I can merge them into a single .csv file at the end of all calculations?
I have detailed the question on Stack Overflow for the sake of completeness.
https://stackoverflow.com/questions/74449651/how-to-create-multiple-dataframes-from-a-single-dataframe-based-on-a-condition-i
Please let me know if anything is unclear. Thank you.
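Since the runs are fully independent, one way to sketch this is with multiprocessing.Pool; run_simulation below is a stand-in for your own row-by-row logic, and its contents (column names, the update rule) are invented for illustration:

```python
import pandas as pd
from multiprocessing import Pool

def run_simulation(sim_id):
    # Stand-in for your real row-by-row logic: each new row depends on the
    # previous one, so the loop stays sequential *within* one simulation.
    rows = [{'sim': sim_id, 'step': 0, 'value': float(sim_id)}]
    for step in range(1, 5):
        prev = rows[-1]['value']
        rows.append({'sim': sim_id, 'step': step, 'value': prev * 0.5 + 1})
    return pd.DataFrame(rows)

if __name__ == '__main__':
    # The 500 simulations are independent, so map them across worker processes
    with Pool() as pool:
        dfs = pool.map(run_simulation, range(500))
    # merge everything into one CSV at the end
    pd.concat(dfs, ignore_index=True).to_csv('all_simulations.csv', index=False)
```

Each worker builds its own DataFrame, and only the final concat/to_csv happens in the parent process, so no locking is needed.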
I don't think you need to create a new dictionary for that.
Just slice your data frame directly whenever needed:
df[['name', 'time', 'session 1']]
If you think the following design can help you, set name and time as the index (df.set_index(['name', 'time'])) and then simply use
df['session 1']
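A minimal sketch of that indexing idea, with invented sample data matching the column names above:

```python
import pandas as pd

# Invented sample data matching the column names used above
df = pd.DataFrame({'name': ['Alex', 'Lee'],
                   'time': [135, 136],
                   'session 1': [10, 2],
                   'session 2': [3, 6]})

indexed = df.set_index(['name', 'time'])
print(indexed['session 1'])  # a Series keyed by (name, time)
```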
Organize it into a dictionary of dataframes:
dict_of_dfs = {f'df {i}':df[['name','time', i]] for i in df.columns[2:]}
Then you can access each dataframe as you would any other dictionary value:
>>> dict_of_dfs['df session1']
name time session1
0 Alex 135 10
1 Lee 136 2
>>> dict_of_dfs['df session2']
name time session2
0 Alex 135 3
1 Lee 136 6
You can use df.columns to get all column names and then create sub-dataframes:
outdflist = []
# for each column beyond the first:
for col in oridf.columns[1:]:
    # create a sub-dataframe with the desired columns:
    subdf = oridf[['GeneID', col]]
    # append it to the list of dataframes:
    outdflist.append(subdf)
# to view all dataframes created:
for df in outdflist:
    print(df)
Output:
GeneID A
0 1 0.3
1 2 0.5
2 3 0.4
3 4 0.9
4 5 0.1
GeneID B
0 1 0.2
1 2 0.3
2 3 0.1
3 4 0.7
4 5 0.4
GeneID C
0 1 0.6
1 2 0.1
2 3 0.5
3 4 0.1
4 5 0.7
GeneID D
0 1 0.4
1 2 0.2
2 3 0.1
3 4 0.6
4 5 0.2
GeneID E
0 1 0.8
1 2 0.6
2 3 0.3
3 4 0.7
4 5 0.5
The above for loop can also be written more simply as a list comprehension:
outdflist = [oridf[['GeneID', col]]
             for col in oridf.columns[1:]]
You can do this with groupby:
d = {'df' + str(x): y for x, y in df.groupby(level=0, axis=1)}
d
Out[989]:
{'dfA': A
0 0.3
1 0.5
2 0.4
3 0.9
4 0.1, 'dfB': B
0 0.2
1 0.3
2 0.1
3 0.7
4 0.4, 'dfC': C
0 0.6
1 0.1
2 0.5
3 0.1
4 0.7, 'dfD': D
0 0.4
1 0.2
2 0.1
3 0.6
4 0.2, 'dfE': E
0 0.8
1 0.6
2 0.3
3 0.7
4 0.5, 'dfGeneID': GeneID
0 1
1 2
2 3
3 4
4 5}
Just to underline my comment to @maxymoo's answer, it's almost invariably a bad idea ("code smell") to add names dynamically to a Python namespace. There are a number of reasons, the most salient being:
Created names might easily conflict with variables already used by your logic.
Since the names are dynamically created, you typically also end up using dynamic techniques to retrieve the data.
This is why dicts were included in the language. The correct way to proceed is:
d = {}
for name in companies:
    d[name] = pd.DataFrame()
Nowadays you can write a single dict comprehension expression to do the same thing, but some people find it less readable:
d = {name: pd.DataFrame() for name in companies}
Once d is created the DataFrame for company x can be retrieved as d[x], so you can look up a specific company quite easily. To operate on all companies you would typically use a loop like:
for name, df in d.items():
    # operate on DataFrame 'df' for company 'name'
In Python 2 you were better off writing
for name, df in d.iteritems():
because that avoids instantiating the list of (name, df) tuples that .items() created in the older version. This is now largely of historical interest, though there will of course be Python 2 applications still extant and requiring (hopefully occasional) maintenance.
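Putting the dict pattern together in one runnable sketch (the company names and data here are invented):

```python
import pandas as pd

# Hypothetical company names and data, just to make the pattern concrete
companies = ['acme', 'globex']
d = {name: pd.DataFrame({'price': [1.0, 2.0]}) for name in companies}

# look up a single company's DataFrame by name
acme_df = d['acme']

# operate on all companies in one loop
for name, df in d.items():
    df['doubled'] = df['price'] * 2
```

Lookup stays explicit (d['acme']) instead of relying on dynamically created variable names.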
You can do this (although obviously use exec with extreme caution if this is going to be public-facing code):
for c in companies:
    exec('{} = pd.DataFrame()'.format(c))
I have two dataframes from which I want to create multiple new dataframes. My code currently looks like this:
import pandas as pd
df_h = pd.read_csv('filename1.csv',skiprows=6)
df_c = pd.read_csv('filename2.csv', skiprows=6)
merged_tables, sheet_titles = ( [] for i in range(2))
c1 = df_c[(df_c['Document'].str.startswith("AB")) & (df_c['Symbol '] == "ARD")]
h1 = df_h[df_h["Code "] == 7]
h1.at['Total', 'Amount '] = h1['Amount '].sum()
c1.at['Total', 'Amount '] = c1['Amount '].sum()
h1.reset_index(drop=True, inplace=True)
c1.reset_index(drop=True, inplace=True)
merged_table1 = pd.concat([h1,c1],axis=1)
merged_tables.append(merged_table1)
sheet_titles.append(7)
So what I'm doing is basically checking two conditions in the first dataframe and one condition in the second dataframe, and assigning the results as new dataframes. Then I'm adding a new row to sum one column, resetting the index in both dataframes, merging them, and appending the new dataframe to a list, which I later use to create an Excel file.
But I want to create more new dataframes like this:
c10 = df_c[(df_c['Document'].str.startswith("CD")) & (df_c['Symbol '] == "ARD")]
h10 = df_h[df_h["Code "] == 23]
h10.at['Total', 'Amount '] = h10['Amount '].sum()
c10.at['Total', 'Amount '] = c10['Amount '].sum()
h10.reset_index(drop=True, inplace=True)
c10.reset_index(drop=True, inplace=True)
merged_table10 = pd.concat([h10,c10],axis=1)
merged_tables.append(merged_table10)
sheet_titles.append(23)
c19 = df_c[(df_c['Document'].str.startswith("EF")) & (df_c['Symbol '] == "ARD")]
h19 = df_h[df_h["Code "] == 30]
h19.at['Total', 'Amount '] = h19['Amount '].sum()
c19.at['Total', 'Amount '] = c19['Amount '].sum()
h19.reset_index(drop=True, inplace=True)
c19.reset_index(drop=True, inplace=True)
merged_table19 = pd.concat([h19,c19],axis=1)
merged_tables.append(merged_table19)
sheet_titles.append(30)Currently i'm just explicitly repeating the same code for all new dataframes that i want to create, only changing the conditions and variables name, as i don't know how to wrap my head around writing some for loop to it and reducing amount of code.
Basically, what's always changing for each new dataframe are starting characters from first condition, code number from second and sheet title that's being appended to a list. All the other operations, so suming a column, reseting index, merging selected tables and appending it to a list will always remain the same.
If it would be a csv file or data stored in a different lists, i would just make for loop with many elifs, but as it's pandas dataframe and instead of every element you're usually accessing whole column, i don't know how to write it efficently, as i know that writing up that many variables and repeating such amount of code isn't very efficent.
I know that i have to declare those changing conditions anyway, but wrapping it up in some concise for loop or function would definitely make it more efficient and scalable.
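One way to sketch that loop, using small invented stand-ins for the two CSVs but the real column names from the snippets above; the (prefix, code) pairs are the only part that varies:

```python
import pandas as pd

# Synthetic stand-ins for the two CSVs (values invented, column names real)
df_c = pd.DataFrame({'Document': ['AB1', 'CD2', 'EF3'],
                     'Symbol ': ['ARD', 'ARD', 'ARD'],
                     'Amount ': [10.0, 20.0, 30.0]})
df_h = pd.DataFrame({'Code ': [7, 23, 30],
                     'Amount ': [1.0, 2.0, 3.0]})

# (document prefix, code number) pairs drive everything else
cases = [("AB", 7), ("CD", 23), ("EF", 30)]

merged_tables, sheet_titles = [], []
for prefix, code in cases:
    c = df_c[df_c['Document'].str.startswith(prefix) & (df_c['Symbol '] == "ARD")].copy()
    h = df_h[df_h["Code "] == code].copy()
    h.at['Total', 'Amount '] = h['Amount '].sum()
    c.at['Total', 'Amount '] = c['Amount '].sum()
    h.reset_index(drop=True, inplace=True)
    c.reset_index(drop=True, inplace=True)
    merged_tables.append(pd.concat([h, c], axis=1))
    sheet_titles.append(code)
```

Adding a new sheet then means adding one tuple to cases rather than another block of near-identical code.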
To solve the error, array1, which is a multidimensional array, is manipulated so as to get each sub-array; this is done by converting it to a list.
Then a dictionary with all the array values is created to make the solution cleaner.
import pandas as pd
# get number of columns in multidimensional array
num_cols = len(array1)
# create column names
array1_col_names = ['reps_' + str(i) for i in range(num_cols)]
# get elements of array 1 and store them in a dictionary
values = dict(zip(array1_col_names, list(array1)))
values.update({'labels': array2, 'weight': array3}) # add other arrays to dictionary
# create dataframe from the dictionary
df = pd.DataFrame(values)
df
You can convert each np (NumPy) array to a pd (pandas) DataFrame and then concatenate them with pd.concat:
import numpy as np
import pandas as pd

at1=[43,54.5,40.4]
at2=[-0.33333,4,0.4]
at3=[91.3333,79,85.6]
at4=[-4,0,0.2]
at5=[31.3333,12,31.6]
at6=[1.33333,5.5,1.6]
at7=[48.6667,24,45]
at8=[59.6667,67,53.4]
at9=[11.3333,43,8.8]
base_at=np.array([at1,at2,at3,at4,at5,at6,at7,at8,at9])
This is your array1
at10=np.array([43,25,2])
at11=np.array([-3,5,20])
These are your array2 and array3 respectively.
at10_df=pd.DataFrame(at10).T
base_at_df=pd.DataFrame(base_at)
at11_df=pd.DataFrame(at11).T
This converts all three arrays into their respective DataFrames.
pd.concat([base_at_df,at10_df,at11_df])
And this is your DataFrame comprising all three arrays.
Here's the solution; I haven't changed the column names, but I think you can handle it from here.
You can use concat:
In [11]: pd.concat([df1['c'], df2['c']], axis=1, keys=['df1', 'df2'])
Out[11]:
df1 df2
2014-01-01 NaN -0.978535
2014-01-02 -0.106510 -0.519239
2014-01-03 -0.846100 -0.313153
2014-01-04 -0.014253 -1.040702
2014-01-05 0.315156 -0.329967
2014-01-06 -0.510577 -0.940901
2014-01-07 NaN -0.024608
2014-01-08 NaN -1.791899
[8 rows x 2 columns]
The axis argument determines the way the DataFrames are stacked:
df1 = pd.DataFrame([1, 2, 3])
df2 = pd.DataFrame(['a', 'b', 'c'])
pd.concat([df1, df2], axis=0)
0
0 1
1 2
2 3
0 a
1 b
2 c
pd.concat([df1, df2], axis=1)
0 0
0 1 a
1 2 b
2 3 c
Well, I'm not sure that merge would be the way to go. Personally I would build a new data frame by creating an index of the dates and then constructing the columns using list comprehensions. Possibly not the most pythonic way, but it seems to work for me!
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(5,3), index=pd.date_range('01/02/2014',periods=5,freq='D'), columns=['a','b','c'] )
df2 = pd.DataFrame(np.random.randn(8,3), index=pd.date_range('01/01/2014',periods=8,freq='D'), columns=['a','b','c'] )
# Create an index list from the set of dates in both data frames
Index = list(set(list(df1.index) + list(df2.index)))
Index.sort()
df3 = pd.DataFrame({'df1': [df1.loc[Date, 'c'] if Date in df1.index else np.nan for Date in Index],
                    'df2': [df2.loc[Date, 'c'] if Date in df2.index else np.nan for Date in Index]},
                   index=Index)
df3
I suggest using str.get_dummies:
df = df.join(df.pop('Fruit_Color').str.get_dummies(','))
print (df)
Fruit_Type Fruit_Description Green Red Yellow
0 Apple Just an apple 1 1 1
1 Banana Just a Banana 1 0 1
2 Orange Just an Orange 0 1 1
3 Grape Just a Grape 0 0 0
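For reference, a self-contained version, with the input column reconstructed to match the output shown above (the exact strings are assumptions):

```python
import pandas as pd

# Input reconstructed from the output above; the description strings are guesses
df = pd.DataFrame({'Fruit_Type': ['Apple', 'Banana', 'Orange', 'Grape'],
                   'Fruit_Description': ['Just an apple', 'Just a Banana',
                                         'Just an Orange', 'Just a Grape'],
                   'Fruit_Color': ['Green,Red,Yellow', 'Green,Yellow',
                                   'Red,Yellow', '']})

# pop removes the column, get_dummies splits it into one 0/1 column per colour
df = df.join(df.pop('Fruit_Color').str.get_dummies(','))
```

A row with an empty colour string (Grape here) simply gets zeros in every dummy column.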
You can create the columns using assign:
df.assign(
    green=lambda d: d['Fruit_color'].str.contains('Green', case=True),
    red=lambda d: d['Fruit_color'].str.contains('Red', case=True),
    yellow=lambda d: d['Fruit_color'].str.contains('Yellow', case=True),
)
This results in a new dataframe with three additional Boolean columns, namely "green", "red" and "yellow".
To detect a row with no known colour, you can also assign other_color=lambda d: ~(d['green'] | d['red'] | d['yellow']).
Another possibility is to use pandas.concat to concatenate multiple dataframes, but it's less elegant than the above solution.
You can use pd.read_csv with the correct separators, but you have to do two things:
- Before: split the file into its [Level1] and [Level2] parts
- After: transpose the Level1 frame and set its columns
Here's the code, to go straight inside your with open(...) block:
from io import StringIO  # to read strings as files for read_csv

reader = output.read()  # simply the entire file text, not split into lines
parts = reader.split('[Level2]\n')
lvl1_lines = parts[0].split('[Level1]\n')[1].replace('"', '')
lvl2_lines = "Speed Energylevel\n" + parts[1]

df_level1 = pd.read_csv(StringIO(lvl1_lines), sep=r'\s*=\s*', engine='python').transpose().reset_index()
df_level1.columns = df_level1.iloc[0]  # set the correct column names
df_level1 = df_level1[1:]  # drop the header row
df_level2 = pd.read_csv(StringIO(lvl2_lines), sep=r'\s+')
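End to end, the same split-then-parse approach, with the file text inlined as a string (the [Level1]/[Level2] layout and the sample values are assumptions based on the question):

```python
import pandas as pd
from io import StringIO

# Invented sample matching the assumed [Level1]/[Level2] layout
text = '[Level1]\nName = "Motor"\nType = "AC"\n[Level2]\n10 1.5\n20 2.5\n'

parts = text.replace('"', '').split('[Level2]\n')
lvl1_lines = parts[0].split('[Level1]\n')[1]
lvl2_lines = 'Speed Energylevel\n' + parts[1]

# key = value pairs become a one-row frame after transposing
df_level1 = pd.read_csv(StringIO(lvl1_lines), sep=r'\s*=\s*', engine='python',
                        header=None, index_col=0).T
# whitespace-separated numeric table
df_level2 = pd.read_csv(StringIO(lvl2_lines), sep=r'\s+')
```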
Try this:
import re
import pandas as pd

def read_txt_file(path):
    n = 0
    # non-greedy key so trailing spaces before '=' are not captured
    pattern = re.compile(r'(.+?)\s*=\s*"(.+)"')
    level1 = {}
    with open(path) as fp:
        for line in fp:
            line = line.strip()
            n += 1
            if line == '[Level2]':
                break
            m = pattern.match(line)
            if m is not None:
                key = m.group(1)
                value = m.group(2)
                level1[key] = value
    level1 = pd.DataFrame(level1, index=[0])
    level2 = pd.read_csv(path, sep=r'\s+', skiprows=n, header=None,
                         names=['Speed', 'EnergyLevel'])
    return level1, level2
Usage:
level1, level2 = read_txt_file('data.txt')