Use str.contains to build a boolean mask, then numpy.where:
m = df['a'].str.contains('foo') & (df['b'] == 'bar')
print(m)
0     True
1    False
2    False
dtype: bool

df['new'] = np.where(m, 'yes', 'no')
print(df)
        a       b    c  new
0     foo     bar  baz  yes
1     bar     foo  baz   no
2  foobar  barfoo  baz   no
Or, if you also need to check column b for substrings:
m = df['a'].str.contains('foo') & df['b'].str.contains('bar')
df['new'] = np.where(m, 'yes', 'no')
print(df)
        a       b    c  new
0     foo     bar  baz  yes
1     bar     foo  baz   no
2  foobar  barfoo  baz  yes
If you need a custom function instead (this will be slower on bigger DataFrames):
def somefunction(row):
    if 'foo' in row['a'] and row['b'] == 'bar':
        return 'yes'
    return 'no'

print(df.apply(somefunction, axis=1))
0    yes
1     no
2     no
dtype: object
def somefunction(row):
    if 'foo' in row['a'] and 'bar' in row['b']:
        return 'yes'
    return 'no'

print(df.apply(somefunction, axis=1))
0    yes
1     no
2    yes
dtype: object
Timings:
df = pd.concat([df]*1000).reset_index(drop=True)

def somefunction(row):
    if 'foo' in row['a'] and row['b'] == 'bar':
        return 'yes'
    return 'no'

In [269]: %timeit df['new'] = df.apply(somefunction, axis=1)
10 loops, best of 3: 60.7 ms per loop

In [270]: %timeit df['new1'] = np.where(df['a'].str.contains('foo') & (df['b'] == 'bar'), 'yes', 'no')
100 loops, best of 3: 3.25 ms per loop

df = pd.concat([df]*10000).reset_index(drop=True)

def somefunction(row):
    if 'foo' in row['a'] and row['b'] == 'bar':
        return 'yes'
    return 'no'

In [272]: %timeit df['new'] = df.apply(somefunction, axis=1)
1 loop, best of 3: 614 ms per loop

In [273]: %timeit df['new1'] = np.where(df['a'].str.contains('foo') & (df['b'] == 'bar'), 'yes', 'no')
10 loops, best of 3: 23.5 ms per loop
Answer from jezrael on Stack Overflow.
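One case the answer above does not cover: if the column holds missing values, str.contains returns NaN for them, which breaks the boolean mask. A minimal sketch (reusing the example's column names a and b) of the na and case parameters of str.contains:

```python
import numpy as np
import pandas as pd

# Sample frame with a missing value; column names follow the example above.
df = pd.DataFrame({'a': ['foo', None, 'FOOBAR'],
                   'b': ['bar', 'bar', 'barfoo']})

# na=False treats missing values as non-matches; case=False ignores case.
m = df['a'].str.contains('foo', na=False, case=False) & df['b'].str.contains('bar')
df['new'] = np.where(m, 'yes', 'no')
print(df['new'].tolist())  # ['yes', 'no', 'yes']
```

Without na=False, the None row would yield NaN in the mask and the & operation would raise or propagate missing values.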
Your exception probably comes from writing
if row['a'].str.contains('foo') == True
Inside apply with axis=1, row['a'] is a plain Python string, so it has neither the .str accessor nor a .contains method; test membership with the in operator instead:
if 'foo' in row['a']:
I have a dataframe with a few million rows of names and accompanying columns with relevant info. I want to narrow down the dataframe to only include names from a list of 2,000 names. What's the best method of going about this when I have middle names and states to help distinguish between duplicate names?
Here's an example of the list of names:
John Smith Alabama R
John Smith Alabama
Jeremy Smith Washington P
What I want to do is first match the name and state to the dataframe if there is a middle initial match (the last letter in the list name if there is a middle name). If not, then I would just like to match by the name and state.
Here's what I tried so far:
df2 <- df[grep(paste(list_of_names, collapse = "|"), df$name_state_middle_initial),]
However, I'm only getting complete string matches with the above code. Any help would be great!
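The question above is asked in R, but the two-pass logic it describes (match on name + state + middle initial first, then fall back to name + state alone) can be sketched with pandas merges. All names here (columns name, state, mi, the frames df and targets) are illustrative assumptions, with the name fields already split into columns:

```python
import pandas as pd

# Hypothetical frames: df is the big table, targets the short name list;
# targets['mi'] is None where the list entry has no middle initial.
df = pd.DataFrame({'name': ['John Smith', 'John Smith', 'Jeremy Smith'],
                   'state': ['Alabama', 'Alabama', 'Washington'],
                   'mi': ['R', 'B', 'P']})
targets = pd.DataFrame({'name': ['John Smith', 'John Smith', 'Jeremy Smith'],
                        'state': ['Alabama', 'Alabama', 'Washington'],
                        'mi': ['R', None, 'P']})

# Pass 1: strict match on name + state + middle initial.
strict = df.merge(targets, on=['name', 'state', 'mi'])

# Pass 2: name + state only, for targets that carry no middle initial.
loose = df.merge(targets[targets['mi'].isna()][['name', 'state']],
                 on=['name', 'state'])

# Combine and drop rows matched by both passes.
result = pd.concat([strict, loose]).drop_duplicates()
print(result)
```

This keeps both Alabama John Smiths (via the initial-less list entry) while the R-initial entry and Jeremy Smith P match strictly.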
- startswith and in return a Boolean.
- The in operator is a test of membership.
- This can be performed with a list-comprehension or filter.
- Using a list-comprehension, with in, is the fastest implementation tested.
- If case is not an issue, consider mapping all the words to lowercase: l = list(map(str.lower, l)).
- Tested with python 3.11.0

filter:
- Using filter creates a filter object, so list() is used to show all the matching values in a list.
l = ['ones', 'twos', 'threes']
wanted = 'three'
# using startswith
result = list(filter(lambda x: x.startswith(wanted), l))
# using in
result = list(filter(lambda x: wanted in x, l))
print(result)
[out]:
['threes']
list-comprehension
l = ['ones', 'twos', 'threes']
wanted = 'three'
# using startswith
result = [v for v in l if v.startswith(wanted)]
# using in
result = [v for v in l if wanted in v]
print(result)
[out]:
['threes']
Which implementation is faster?
- Tested in Jupyter Lab using the words corpus from nltk v3.7, which has 236736 words.
- Words with 'three': ['three', 'threefold', 'threefolded', 'threefoldedness', 'threefoldly', 'threefoldness', 'threeling', 'threeness', 'threepence', 'threepenny', 'threepennyworth', 'threescore', 'threesome']
from nltk.corpus import words
wanted = 'three'

%timeit list(filter(lambda x: x.startswith(wanted), words.words()))
%timeit list(filter(lambda x: wanted in x, words.words()))
%timeit [v for v in words.words() if v.startswith(wanted)]
%timeit [v for v in words.words() if wanted in v]

%timeit results:
62.8 ms ± 816 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
53.8 ms ± 982 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
56.9 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
47.5 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
A simple, direct answer:
test_list = ['one', 'two', 'threefour']
r = [s for s in test_list if s.startswith('three')]
print(r[0] if r else 'nomatch')
Result:
threefour
Not sure what you want to do in the non-matching case. r[0] is exactly what you asked for if there is a match, but it's undefined if there is no match. The print deals with this, but you may want to do so differently.
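One way to make the non-matching case explicit without building the whole list first is next() on a generator expression with a default, which also stops at the first match:

```python
test_list = ['one', 'two', 'threefour']

# next() returns the first element of the generator,
# or the given default when nothing matches.
first = next((s for s in test_list if s.startswith('three')), 'nomatch')
print(first)  # threefour

no_hit = next((s for s in test_list if s.startswith('xyz')), 'nomatch')
print(no_hit)  # nomatch
```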
I am facing the problem of a very long-running for loop.
There are two Python lists (A and B):
A contains around 170,000 strings with lengths between 1 and 100 characters. B contains around 3,000 strings with the same length variety.
Now I need to find the items from list A that contain at least one item from list B.
Since each string from A needs to be compared with each string from B, this results in 510,000,000 comparisons, which seems computationally too expensive.
What possibilities are there to speed this up?
I don't want to stop after the first match, as there could be more matches. The goal is to store all matches in some new variable/db.
Pseudo-code:
A = []  # length: 170,000 (strings)
B = []  # length: 3,000 (strings)

for item in A:
    for element in B:
        if element in item:
            print("store the item which contains the element to db")

# Some sample content
A[0] = "This is some random text in which I want to find words"
A[1] = "It is just some random text"
...
B[0] = "text"
B[1] = "some random text"
...
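One common way to cut this cost is to compile all B entries into a single regex alternation, so each string from A is scanned once instead of 3,000 times. A sketch, assuming plain substring (not regex) semantics are wanted, hence re.escape:

```python
import re

# Small stand-ins for the real A (170,000 items) and B (3,000 items).
A = ["This is some random text in which I want to find words",
     "It is just some random text",
     "Nothing of interest here"]
B = ["text", "some random text"]

# One alternation over all needles; re.escape keeps them literal substrings.
pattern = re.compile("|".join(map(re.escape, B)))

# Keep every item of A that contains at least one needle.
matches = [item for item in A if pattern.search(item)]
print(matches)
```

For very many needles, a dedicated multi-pattern matcher such as an Aho-Corasick automaton (e.g. the third-party pyahocorasick package) scales better still, since its cost per scanned string does not grow with the number of needles.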