UPDATE: a memory-saving method: first set a new index with a gap for the new row:
In [30]: df
Out[30]:
Col1 Col2 Col3
0 A B 1
1 B C 1
2 D E 1
3 E F 1
If we want to insert a new row between indexes 1 and 2, we split the index at position 2:
In [31]: idxs = np.split(df.index, 2)
Set a new index (with a gap at position 2):
In [32]: df.set_index(idxs[0].union(idxs[1] + 1), inplace=True)
In [33]: df
Out[33]:
Col1 Col2 Col3
0 A B 1
1 B C 1
3 D E 1
4 E F 1
Insert the new row with index 2:
In [34]: df.loc[2] = ['X','X',2]
In [35]: df
Out[35]:
Col1 Col2 Col3
0 A B 1
1 B C 1
3 D E 1
4 E F 1
2 X X 2
Sort the index:
In [38]: df.sort_index(inplace=True)
In [39]: df
Out[39]:
Col1 Col2 Col3
0 A B 1
1 B C 1
2 X X 2
3 D E 1
4 E F 1
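The steps above can be collected into one runnable sketch (the position variable `pos` is mine; index slicing is used instead of np.split, but the gap-and-union idea is the same):

```python
import pandas as pd

df = pd.DataFrame({'Col1': list('ABDE'),
                   'Col2': list('BCEF'),
                   'Col3': [1, 1, 1, 1]})

pos = 2                                  # insert between labels 1 and 2
head, tail = df.index[:pos], df.index[pos:]
df.index = head.union(tail + 1)          # index becomes [0, 1, 3, 4]

df.loc[2] = ['X', 'X', 2]                # label 2 is now free
df.sort_index(inplace=True)
print(df)
```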
PS: you can also insert a whole DataFrame instead of a single row using df.append(new_df):
In [42]: df
Out[42]:
Col1 Col2 Col3
0 A B 1
1 B C 1
2 D E 1
3 E F 1
In [43]: idxs = np.split(df.index, 2)
In [45]: new_df = pd.DataFrame([['X', 'X', 10], ['Y','Y',11]], columns=df.columns)
In [49]: new_df.index += idxs[1].min()
In [51]: new_df
Out[51]:
Col1 Col2 Col3
2 X X 10
3 Y Y 11
In [52]: df = df.set_index(idxs[0].union(idxs[1]+len(new_df)))
In [53]: df
Out[53]:
Col1 Col2 Col3
0 A B 1
1 B C 1
4 D E 1
5 E F 1
In [54]: df = df.append(new_df)
In [55]: df
Out[55]:
Col1 Col2 Col3
0 A B 1
1 B C 1
4 D E 1
5 E F 1
2 X X 10
3 Y Y 11
In [56]: df.sort_index(inplace=True)
In [57]: df
Out[57]:
Col1 Col2 Col3
0 A B 1
1 B C 1
2 X X 10
3 Y Y 11
4 D E 1
5 E F 1
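Note that DataFrame.append was removed in pandas 2.0, so on current pandas the same block insert is done with pd.concat; a sketch under that assumption:

```python
import pandas as pd

df = pd.DataFrame({'Col1': list('ABDE'), 'Col2': list('BCEF'), 'Col3': [1] * 4})
new_df = pd.DataFrame([['X', 'X', 10], ['Y', 'Y', 11]], columns=df.columns)

pos = 2
new_df.index += pos                          # new rows take labels 2 and 3
head, tail = df.index[:pos], df.index[pos:]
df.index = head.union(tail + len(new_df))    # old tail shifts to 4 and 5

df = pd.concat([df, new_df]).sort_index()    # pd.concat instead of df.append
print(df)
```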
OLD answer:
One way (among many) to achieve that would be to split your DF and concatenate it back together with the needed DF in the desired order:
Original DF:
In [12]: df
Out[12]:
Col1 Col2 Col3
0 A B 1
1 B C 1
2 D E 1
3 E F 1
Let's split it into two parts (rows [0:1] and [2:end]):
In [13]: dfs = np.split(df, [2])
In [14]: dfs
Out[14]:
[ Col1 Col2 Col3
0 A B 1
1 B C 1, Col1 Col2 Col3
2 D E 1
3 E F 1]
Now we can concatenate it together with a new DF in the desired order:
In [15]: pd.concat([dfs[0], pd.DataFrame([['C','D', 1]], columns=df.columns), dfs[1]], ignore_index=True)
Out[15]:
Col1 Col2 Col3
0 A B 1
1 B C 1
2 C D 1
3 D E 1
4 E F 1
Answer from MaxU - stand with Ukraine on Stack Overflow.

Use the timeits, Luke!

Conclusion
List comprehensions perform best on smaller amounts of data because they incur very little overhead, even though they are not vectorized. On larger data, however, loc and numpy.where perform better: vectorisation wins the day.
Keep in mind that the applicability of a method depends on your data, the number of conditions, and the data type of your columns. My suggestion is to test various methods on your data before settling on an option.
One sure takeaway from here, however, is that list comprehensions are pretty competitive: they're implemented in C and are highly optimised for performance.
Benchmarking code, for reference. Here are the functions being timed:
def numpy_where(df):
    return df.assign(is_rich=np.where(df['salary'] >= 50, 'yes', 'no'))

def list_comp(df):
    return df.assign(is_rich=['yes' if x >= 50 else 'no' for x in df['salary']])

def loc(df):
    df = df.assign(is_rich='no')
    df.loc[df['salary'] > 50, 'is_rich'] = 'yes'
    return df
Another method is to use the pandas mask (or, depending on the use case, where) method. First initialize a Series with a default value (chosen as "no") and replace some of the values depending on a condition (a little like a mix between loc[] and numpy.where()).
df['is_rich'] = pd.Series('no', index=df.index).mask(df['salary'] > 50, 'yes')
It is probably the fastest option. For example, for a frame with 10 million rows, the mask() option is about 40% faster than the loc option.1
I also updated the perfplot benchmark in cs95's answer to compare how the mask method performs compared to the other methods:

1: The benchmark result that compares mask with loc.
def mask(df):
    return df.assign(is_rich=pd.Series('no', index=df.index).mask(df['salary'] > 50, 'yes'))
df = pd.DataFrame({'salary': np.random.rand(10_000_000)*100})
%timeit mask(df)
# 391 ms ± 3.87 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit loc(df)
# 558 ms ± 75.6 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
There are two steps to create & populate a new column using only a row number (in this approach iloc is not used).
First, get the row index value by using the row number:
rowIndex = df.index[someRowNumber]
Then, use the row index with the loc function to reference the specific row and add the new column / value:
df.loc[rowIndex, 'New Column Title'] = "some value"
These two steps can be combined into one line as follows:
df.loc[df.index[someRowNumber], 'New Column Title'] = "some value"
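A minimal runnable demo of the combined one-liner (the frame and values here are my own toy data):

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30]}, index=['x', 'y', 'z'])

someRowNumber = 1                    # positional row number
rowIndex = df.index[someRowNumber]   # -> label 'y'
df.loc[rowIndex, 'New Column Title'] = 'some value'
print(df)
```

Only the targeted row gets the value; the other rows of the new column are filled with NaN.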
If you have a dataframe like
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'X': [1.5, 6.777, 2.444, np.nan], 'Y': [1.111, np.nan, 8.77, np.nan], 'Z': [5.0, 2.333, 10, 6.6666]})
then instead of iloc, you can use .loc with a row index and column name, like df.loc[row_indexer, column_indexer] = value:
df.loc[[0, 3], 'Z'] = 3
Output:
       X      Y       Z
0  1.500  1.111   3.000
1  6.777    NaN   2.333
2  2.444  8.770  10.000
3    NaN    NaN   3.000
Hi there,
I've got a bit stuck with Pandas. I'm trying to add a new column to my dataframe, and then give each row a 1-10 value based on whether it's present in another df.
My main dataframe is divided into 10 sub-dataframes (df1, df2, df3...) and each row in the dataframe has a unique 'GroupID' value. I'd like to give every entry in the main dataframe a different value in a new column (called Segment); so if it's present in df1, give it the value 1, for example.
I hope I've explained this as clearly as possible, and many thanks in advance!
This is what I've written, but it's returning an error:
if Attempt[Attempt['Group ID'].isin(df1['Group ID'])]:
    Attempt["Segment"] = '1'
elif Attempt[Attempt['Group ID'].isin(df2['Group ID'])]:
    Attempt["Segment"] = '2'
elif Attempt[Attempt['Group ID'].isin(df3['Group ID'])]:
    Attempt["Segment"] = '3'
elif Attempt[Attempt['Group ID'].isin(df4['Group ID'])]:
    Attempt["Segment"] = '4'
elif Attempt[Attempt['Group ID'].isin(df5['Group ID'])]:
    Attempt["Segment"] = '5'
elif Attempt[Attempt['Group ID'].isin(df6['Group ID'])]:
    Attempt["Segment"] = '6'
elif Attempt[Attempt['Group ID'].isin(df7['Group ID'])]:
    Attempt["Segment"] = '7'
elif Attempt[Attempt['Group ID'].isin(df8['Group ID'])]:
    Attempt["Segment"] = '8'
elif Attempt[Attempt['Group ID'].isin(df9['Group ID'])]:
    Attempt["Segment"] = '9'
elif Attempt[Attempt['Group ID'].isin(df10['Group ID'])]:
    Attempt["Segment"] = '10'
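The if/elif chain above fails because each condition is a whole DataFrame, not a scalar (pandas raises "The truth value of a DataFrame is ambiguous"). One hedged sketch of a vectorised fix uses numpy.select; the toy frames here stand in for the poster's df1..df10:

```python
import numpy as np
import pandas as pd

# toy stand-ins for the poster's main frame and sub-frames
Attempt = pd.DataFrame({'Group ID': [1, 2, 3, 4]})
df1 = pd.DataFrame({'Group ID': [1]})
df2 = pd.DataFrame({'Group ID': [2, 3]})
sub_dfs = [df1, df2]                 # extend with df3..df10 as needed

# one boolean mask per sub-frame, one segment label per mask
conditions = [Attempt['Group ID'].isin(d['Group ID']) for d in sub_dfs]
choices = [str(i) for i in range(1, len(sub_dfs) + 1)]
Attempt['Segment'] = np.select(conditions, choices, default='0')
print(Attempt)
```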
I have a dataframe that, after grouping by User and Time, looks like below:
User | Time  | Event type
-----|-------|-----------
A    | xx:xx | 1
A    | yy:yy | 0
B    | xx:xx | 0
B    | yy:yy | 1
B    | zz:zz | 0
Now I want to check the current, the previous and the next event.
If the current event is 1 and there is no previous event, I want to add a 0 event manually (as a new row).
If the current event is 0 and there is no next event, I want to add a 1 event manually.
So basically I can have complete 0-1 sessions.
Is there any way to achieve this?
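One possible approach (my own sketch, not from the thread): group by User, prepend a 0 row when a user's first event is 1, and append a 1 row when the last event is 0, then glue everything back with pd.concat. The placeholder Time of None is an assumption.

```python
import pandas as pd

df = pd.DataFrame({'User': ['A', 'A', 'B', 'B', 'B'],
                   'Time': ['01:00', '02:00', '01:00', '02:00', '03:00'],
                   'Event type': [1, 0, 0, 1, 0]})

pieces = []
for user, g in df.groupby('User', sort=False):
    g = g.sort_values('Time')
    if g['Event type'].iloc[0] == 1:   # starts with 1: prepend a 0 event
        pieces.append(pd.DataFrame({'User': [user], 'Time': [None],
                                    'Event type': [0]}))
    pieces.append(g)
    if g['Event type'].iloc[-1] == 0:  # ends with 0: append a 1 event
        pieces.append(pd.DataFrame({'User': [user], 'Time': [None],
                                    'Event type': [1]}))

out = pd.concat(pieces, ignore_index=True)
print(out)
```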
I think you can use loc if you need to update two columns to the same value:
df1.loc[df1['stream'] == 2, ['feat','another_feat']] = 'aaaa'
print(df1)
stream feat another_feat
a 1 some_value some_value
b 2 aaaa aaaa
c 2 aaaa aaaa
d 3 some_value some_value
If you need to update them separately, one option is:
df1.loc[df1['stream'] == 2, 'feat'] = 10
print(df1)
stream feat another_feat
a 1 some_value some_value
b 2 10 some_value
c 2 10 some_value
d 3 some_value some_value
Another common option is to use numpy.where:
df1['feat'] = np.where(df1['stream'] == 2, 10, 20)
print(df1)
stream feat another_feat
a 1 20 some_value
b 2 10 some_value
c 2 10 some_value
d 3 20 some_value
EDIT: If you need to divide all columns except stream where the condition is True, use:
print(df1)
stream feat another_feat
a 1 4 5
b 2 4 5
c 2 2 9
d 3 1 7
# filter all columns except 'stream'
cols = [col for col in df1.columns if col != 'stream']
print(cols)
['feat', 'another_feat']
df1.loc[df1['stream'] == 2, cols] = df1 / 2
print(df1)
stream feat another_feat
a 1 4.0 5.0
b 2 2.0 2.5
c 2 1.0 4.5
d 3 1.0 7.0
If you are working with multiple conditions, it is possible to use multiple nested numpy.where calls or numpy.select:
df0 = pd.DataFrame({'Col': [5, 0, -6]})
df0['New Col1'] = np.where((df0['Col'] > 0), 'Increasing',
                  np.where((df0['Col'] < 0), 'Decreasing', 'No Change'))
df0['New Col2'] = np.select([df0['Col'] > 0, df0['Col'] < 0],
                            ['Increasing', 'Decreasing'],
                            default='No Change')
print(df0)
Col New Col1 New Col2
0 5 Increasing Increasing
1 0 No Change No Change
2 -6 Decreasing Decreasing
You can do the same with .ix, like this:
In [1]: df = pd.DataFrame(np.random.randn(5,4), columns=list('abcd'))
In [2]: df
Out[2]:
a b c d
0 -0.323772 0.839542 0.173414 -1.341793
1 -1.001287 0.676910 0.465536 0.229544
2 0.963484 -0.905302 -0.435821 1.934512
3 0.266113 -0.034305 -0.110272 -0.720599
4 -0.522134 -0.913792 1.862832 0.314315
In [3]: df.ix[df.a>0, ['b','c']] = 0
In [4]: df
Out[4]:
a b c d
0 -0.323772 0.839542 0.173414 -1.341793
1 -1.001287 0.676910 0.465536 0.229544
2 0.963484 0.000000 0.000000 1.934512
3 0.266113 0.000000 0.000000 -0.720599
4 -0.522134 -0.913792 1.862832 0.314315
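Note that .ix was deprecated in pandas 0.20 and removed in 1.0; on modern pandas the same boolean row selection works with .loc:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=list('abcd'))
df.loc[df.a > 0, ['b', 'c']] = 0   # .loc replaces the removed .ix
print(df)
```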
EDIT
After the extra information, the following will return all columns (where some condition is met) with halved values:
>>> condition = df.a > 0
>>> df[condition][[i for i in df.columns.values if i not in ['a']]].apply(lambda x: x/2)
You could use groupby to check if 'C' and 'D' are in the 'col_name' column and add them if not. The code would end up looking something like this:
df = pd.DataFrame([{'owner': 'svc', 'name': 'dmn_dmn', 'col_name': 'A', 'test_col1': 1, 'test_col2': 'String1'},
                   {'owner': 'svc', 'name': 'dmn_dmn', 'col_name': 'B', 'test_col1': 2, 'test_col2': 'String12'},
                   {'owner': 'svc', 'name': 'dmn_dmn', 'col_name': 'C', 'test_col1': 'remain_constant_3', 'test_col2': 'String13'},
                   {'owner': 'svc', 'name': 'dmn_dmn', 'col_name': 'D', 'test_col1': 'remain_constant_3', 'test_col2': 'String14'},
                   {'owner': 'svc', 'name': 'time1', 'col_name': 'E', 'test_col1': 5, 'test_col2': 'String1123'}])
for g, g_hold in df.groupby('name'):
    if 'C' not in g_hold['col_name'].tolist():
        df = df.append({'owner': 'svc', 'name': g, 'col_name': 'C', 'test_col1': 'remain_constant_3', 'test_col2': 'String13'}, ignore_index=True)
    if 'D' not in g_hold['col_name'].tolist():
        df = df.append({'owner': 'svc', 'name': g, 'col_name': 'D', 'test_col1': 'remain_constant_3', 'test_col2': 'String14'}, ignore_index=True)
print(df.sort_values(['name', 'col_name']))
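On pandas 2.x, where DataFrame.append is gone, the same idea can collect the missing rows first and add them with a single pd.concat; a trimmed sketch (extra columns omitted for brevity):

```python
import pandas as pd

df = pd.DataFrame([{'owner': 'svc', 'name': 'dmn_dmn', 'col_name': 'A'},
                   {'owner': 'svc', 'name': 'dmn_dmn', 'col_name': 'C'},
                   {'owner': 'svc', 'name': 'time1',   'col_name': 'E'}])

required = ['C', 'D']                       # rows every name must have
missing = [{'owner': 'svc', 'name': name, 'col_name': col}
           for name, grp in df.groupby('name')
           for col in required
           if col not in grp['col_name'].tolist()]

df = pd.concat([df, pd.DataFrame(missing)], ignore_index=True)
print(df.sort_values(['name', 'col_name']))
```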
A better way is to use reindex with a MultiIndex built from the unique values of each key column:
import pandas as pd
df = pd.DataFrame({"owner": ["owner"] * 9,
                   "name": ["dmn_dmn", "dmn_dmn", "dmn_dmn", "dmn_dmn", "time1", "time1", "sap", "sap", "sap"],
                   "col_name": ["A", "B", "C", "D", "A", "B", "A", "B", "D"]})
index = pd.MultiIndex.from_product([df.owner.unique(), df.name.unique(), df.col_name.unique()])
result = df.set_index(['owner', 'name', 'col_name']).reindex(index).reset_index()
print(result)
The easiest way of doing this is probably to first convert the dataframe back to a list of rows, then use base python syntax to repeat each row n times, and then convert that back to a dataframe:
import pandas as pd
df = pd.DataFrame({
    "event": ["A", "B", "C", "D"],
    "budget": [123, 433, 1000, 1299],
    "duration_days": [6, 3, 4, 2]
})
pd.DataFrame([
    row                                       # select the full row
    for row in df.to_dict(orient="records")   # for each row in the dataframe
    for _ in range(row["duration_days"])      # repeat it duration_days times
])
Which gives the following dataframe:
| event | budget | duration_days |
|---|---|---|
| A | 123 | 6 |
| A | 123 | 6 |
| A | 123 | 6 |
| A | 123 | 6 |
| A | 123 | 6 |
| A | 123 | 6 |
| B | 433 | 3 |
| B | 433 | 3 |
| B | 433 | 3 |
| C | 1000 | 4 |
| C | 1000 | 4 |
| C | 1000 | 4 |
| C | 1000 | 4 |
| D | 1299 | 2 |
| D | 1299 | 2 |
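For reference, pandas also has a vectorised idiom for this (a common alternative, not from the answer above): repeat each index label duration_days times with Index.repeat and select the repeated labels with loc:

```python
import pandas as pd

df = pd.DataFrame({'event': ['A', 'B', 'C', 'D'],
                   'budget': [123, 433, 1000, 1299],
                   'duration_days': [6, 3, 4, 2]})

# each row is selected duration_days times; reset_index renumbers 0..n-1
out = df.loc[df.index.repeat(df['duration_days'])].reset_index(drop=True)
print(out)
```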
First of all, I think your dataset has a problem: if you wrap a list value in single quotes, like '['1','2','3','4']', Python will raise a syntax error.
So I will take the data as below:
df = {'event':['A','B','C','D'],'budget':['123','433','1000','1299'],'duration_days':['6','3','4','2']}
Then convert it to a data frame as required:
data = []
for i in range(len(df['event'])):
    for j in range(int(df['duration_days'][i])):
        temp = [df['event'][i], df['budget'][i], df['duration_days'][i]]
        data.append(temp)
data_df = pd.DataFrame(data, columns=['event', 'budget', 'duration_days'])
data_df
|   | event | budget | duration_days |
|---|---|---|---|
| 0 | A | 123 | 6 |
| 1 | A | 123 | 6 |
| 2 | A | 123 | 6 |
| 3 | A | 123 | 6 |
| 4 | A | 123 | 6 |
| 5 | A | 123 | 6 |
| 6 | B | 433 | 3 |
| 7 | B | 433 | 3 |
| 8 | B | 433 | 3 |