One option is just to use the regex | character to try to match each of the substrings in the words in your Series s (still using str.contains).
You can construct the regex by joining the words in searchfor with |:
>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0 cat
1 hat
2 dog
3 fog
dtype: object
As @AndyHayden noted in the comments below, take care if your substrings have special characters such as $ and ^ which you want to match literally. These characters have specific meanings in the context of regular expressions and will affect the matching.
You can make your list of substrings safer by escaping non-alphanumeric characters with re.escape:
>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']
The strings in this new list will match each character literally when used with str.contains.
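To make the effect of escaping concrete, here is a minimal sketch that runs the same substrings through str.contains with and without re.escape. The sample Series is hypothetical, since the original s is not shown in full.

```python
import re
import pandas as pd

# Hypothetical sample data (the original Series s is not shown in full).
s = pd.Series(['cat', 'hat', 'dog', 'fog', 'pet', '$money', 'x^y'])

matches = ['$money', 'x^y']

# Unescaped: $ and ^ act as anchors, so neither alternative can match here.
unescaped_hits = s[s.str.contains('|'.join(matches))]

# Escaped: every character is matched literally.
safe_pattern = '|'.join(re.escape(m) for m in matches)
escaped_hits = s[s.str.contains(safe_pattern)]

print(unescaped_hits.tolist())
print(escaped_hits.tolist())
```

If only a single literal substring is needed, passing regex=False to str.contains avoids the regex engine altogether.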
You can use str.contains alone with a regex pattern using OR (|):
s[s.str.contains('og|at')]
Or you could add the series to a dataframe then use str.contains:
df = pd.DataFrame(s)
df[s.str.contains('og|at')]
Output:
0 cat
1 hat
2 dog
3 fog
The patterns should be combined into one regular expression, written as a single string:
"nt|nv" # rather than "nt" | " nv"
f_recs[f_recs['Behavior'].str.contains("nt|nv", na=False)]
Python doesn't let you use the or (|) operator on strings:
In [1]: "nt" | "nv"
TypeError: unsupported operand type(s) for |: 'str' and 'str'
If you have the patterns in a list, it is convenient to join them with a pipe (|) and pass the result to str.contains. Pass na=False to return False for NaNs and, if needed, case=False to make the match case-insensitive.
lst = ['nt', 'nv', 'nf']
df['Behavior'].str.contains('|'.join(lst), na=False)
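As a rough illustration of what na=False and case=False change, the sketch below applies the joined pattern to a small made-up column. The 'Behavior' values are assumptions chosen to include a NaN and an upper-case entry.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one upper-case value and one missing value.
df = pd.DataFrame({'Behavior': ['NT', 'nv', 'other', np.nan, 'conf']})

lst = ['nt', 'nv', 'nf']
pattern = '|'.join(lst)  # 'nt|nv|nf'

# na=False turns the missing row into False; case=False lets 'NT' match 'nt'.
mask = df['Behavior'].str.contains(pattern, na=False, case=False)
filtered = df[mask]

print(mask.tolist())
print(filtered['Behavior'].tolist())
```

Note that 'conf' is selected as well, because str.contains looks for the pattern anywhere in the string, not only at word boundaries.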
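Otherwise, it might be cleaner to use a character class instead of alternation. Inside square brackets the | is a literal character, so it should not be used as a separator there. For the example in the OP, that is:
df['Behavior'].str.contains(r'n[tvf]')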
To require that all substrings are present (an AND condition), combine the individual masks with &:
df[(df['col_name'].str.contains('apple')) & (df['col_name'].str.contains('banana'))]
You can also do it with a single regex that uses lookaheads:
df[df['col_name'].str.contains(r'^(?=.*apple)(?=.*banana)')]
You can then build your list of words into a regex string like so:
base = r'^{}'
expr = '(?=.*{})'
words = ['apple', 'banana', 'cat'] # example
base.format(''.join(expr.format(w) for w in words))
will render:
'^(?=.*apple)(?=.*banana)(?=.*cat)'
The resulting pattern can then be passed to str.contains, so the filter is built dynamically from the word list.
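Putting the pieces together, a minimal sketch of the dynamic AND filter might look like the following. The DataFrame contents are hypothetical, re.escape is added as a precaution for words containing regex metacharacters, and object dtype is requested on the assumption that the pattern should be handled by Python's re module, which supports lookaheads.

```python
import re
import pandas as pd

# Hypothetical data; object dtype so that Python's re module evaluates the pattern.
df = pd.DataFrame(
    {'col_name': ['apple banana cat', 'banana cat apple pie',
                  'apple cat', 'banana only']},
    dtype=object,
)

words = ['apple', 'banana', 'cat']

# One lookahead per word: every word must occur somewhere in the string.
pattern = '^' + ''.join('(?=.*{})'.format(re.escape(w)) for w in words)

result = df[df['col_name'].str.contains(pattern)]

print(pattern)
print(result['col_name'].tolist())
```

The same result can be obtained without lookaheads by combining one str.contains mask per word with &, which may be easier to read for short word lists.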
Hi, I don't get why when I use str.contains to get exact matches from a list of keywords, the output still contains partial matches. Here is an extract of what I have (I'm only including one keyword in the list for the example):
keyword= ['SE.TER.ENRL']
subset = df[df['Code'].str.contains('|'.join(keyword), case=False, na=False)]
Output: ['SE.TER.ENRL' 'SE.TER.ENRL.FE' 'SE.TER.ENRL.FE.ZS']
Does anyone know how to get around this?
Thanks!
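One possible explanation is that str.contains searches for the pattern anywhere inside each value, and that an unescaped . in the keyword matches any character. A minimal sketch of an exact-match alternative, assuming whole-value equality is what is wanted, is to use isin instead. The extra rows in the sample 'Code' column are hypothetical.

```python
import pandas as pd

# Hypothetical data extending the codes from the question.
df = pd.DataFrame({'Code': ['SE.TER.ENRL', 'SE.TER.ENRL.FE',
                            'SE.TER.ENRL.FE.ZS', 'se.ter.enrl',
                            'SEXTERXENRL']})

keyword = ['SE.TER.ENRL']

# Substring search with a regex: partial matches (and 'SEXTERXENRL') slip through.
loose = df[df['Code'].str.contains('|'.join(keyword), case=False, na=False)]

# Exact, case-sensitive match on the whole value.
exact = df[df['Code'].isin(keyword)]

# Exact, case-insensitive match: normalise both sides before comparing.
exact_ci = df[df['Code'].str.lower().isin([k.lower() for k in keyword])]

print(loose['Code'].tolist())
print(exact['Code'].tolist())
print(exact_ci['Code'].tolist())
```

If a regex is still required, escaping the keywords with re.escape and matching the whole value (for example with str.fullmatch) is another option.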