Answer from Andy Hayden on Stack Overflow:

They should be one regular expression, and should be in one string:
"nt|nv" # rather than "nt" | " nv"
f_recs[f_recs['Behavior'].str.contains("nt|nv", na=False)]
Python doesn't let you use the or (|) operator on strings:
In [1]: "nt" | "nv"
TypeError: unsupported operand type(s) for |: 'str' and 'str'
If you have the patterns in a list, then it might be convenient if you join them by a pipe (|) and pass it to str.contains. Return False for NaNs by na=False and turn off case sensitivity by case=False.
lst = ['nt', 'nv', 'nf']
df['Behavior'].str.contains('|'.join(lst), na=False)
Otherwise, it might be cleaner to group the alternations. For the example in the OP, that is:
df['Behavior'].str.contains(r'n[tvf]')  # inside a character class, '|' is literal, so use [tvf]
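As a runnable sketch (the sample data here is made up for illustration), both forms produce the same boolean mask; note that inside a character class the pipe is matched literally, so the grouped form is written n[tvf]:

```python
import pandas as pd

# Hypothetical data to illustrate both approaches.
df = pd.DataFrame({'Behavior': ['nt', 'nv', 'nf', 'xx', None]})

lst = ['nt', 'nv', 'nf']
by_join = df['Behavior'].str.contains('|'.join(lst), na=False)
by_class = df['Behavior'].str.contains(r'n[tvf]', na=False)

print(by_join.tolist())   # [True, True, True, False, False]
print(by_class.tolist())  # [True, True, True, False, False]
```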
You can do that as follows:
df[(df['col_name'].str.contains('apple')) & (df['col_name'].str.contains('banana'))]
You can also do it in regex expression style:
df[df['col_name'].str.contains(r'^(?=.*apple)(?=.*banana)')]
You can then, build your list of words into a regex string like so:
base = r'^{}'
expr = '(?=.*{})'
words = ['apple', 'banana', 'cat'] # example
base.format(''.join(expr.format(w) for w in words))
will render:
'^(?=.*apple)(?=.*banana)(?=.*cat)'
Then you can do your stuff dynamically.
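Putting the pieces together (with hypothetical sample data), the dynamically built lookahead pattern can be passed straight to str.contains to require that every word appears:

```python
import pandas as pd

# Build the pattern dynamically from a list of required words.
base = r'^{}'
expr = '(?=.*{})'
words = ['apple', 'banana']  # example list
pattern = base.format(''.join(expr.format(w) for w in words))
# pattern is now '^(?=.*apple)(?=.*banana)'

# Hypothetical data: only rows containing ALL the words match.
df = pd.DataFrame({'col_name': ['apple banana', 'apple pie', 'banana split']})
mask = df['col_name'].str.contains(pattern)
print(mask.tolist())  # [True, False, False]
```

Lookaheads are non-capturing, so pandas will not emit its "match groups" warning here.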
You need to set the regex flag (to interpret your search as a regular expression):
whatIwant = df['Column_with_text'].str.contains('value1|value2|value3',
case=False, regex=True)
df['New_Column'] = np.where(whatIwant, df['Column_with_text'], np.nan)  # np.where needs a value for non-matches
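Note that np.where requires a third argument giving the value for non-matching rows (calling it with only two arguments raises a ValueError). A self-contained sketch, with made-up data and np.nan as an assumed fallback:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({'Column_with_text': ['has value1', 'nothing', 'VALUE2 here']})

whatIwant = df['Column_with_text'].str.contains('value1|value2|value3',
                                                case=False, regex=True)
# Third argument: fallback for rows where the pattern does not match.
df['New_Column'] = np.where(whatIwant, df['Column_with_text'], np.nan)
print(df['New_Column'].tolist())  # ['has value1', nan, 'VALUE2 here']
```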
------ Edit ------
Based on the updated problem statement, here is an updated answer:
You need to define a capture group in the regular expression using parentheses and use the extract() function to return the values found within the capture group. The lower() function deals with any upper-case letters.
df['MatchedValues'] = df['Text'].str.lower().str.extract( '('+pattern+')', expand=False)
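For a self-contained sketch, here pattern is filled in with a hypothetical value ('apples|bananas') purely for illustration; in the original question it would be built from your own list of search terms:

```python
import pandas as pd

# Hypothetical data and pattern for illustration.
df = pd.DataFrame({'Text': ['I like Apples', 'no fruit here', 'Bananas rot']})
pattern = 'apples|bananas'  # assumed example pattern

# Parentheses form the capture group; extract() returns what it captured.
df['MatchedValues'] = df['Text'].str.lower().str.extract('(' + pattern + ')',
                                                         expand=False)
print(df['MatchedValues'].tolist())  # ['apples', nan, 'bananas']
```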
Here is one way:
import numpy as np

foods = ['apples', 'oranges', 'grapes', 'blueberries']

def matcher(x):
    for i in foods:
        if i.lower() in x.lower():
            return i
    return np.nan

df['Match'] = df['Text'].apply(matcher)
# Text Match
# 0 I want to buy some apples. apples
# 1 Oranges are good for the health. oranges
# 2 John is eating some grapes. grapes
# 3 This line does not contain any fruit names. NaN
# 4 I bought 2 blueberries yesterday. blueberries
One option is just to use the regex | character to try to match each of the substrings against the words in your Series s (still using str.contains).
You can construct the regex by joining the words in searchfor with |:
>>> searchfor = ['og', 'at']
>>> s[s.str.contains('|'.join(searchfor))]
0 cat
1 hat
2 dog
3 fog
dtype: object
As @AndyHayden noted in the comments below, take care if your substrings have special characters such as $ and ^ which you want to match literally. These characters have specific meanings in the context of regular expressions and will affect the matching.
You can make your list of substrings safer by escaping non-alphanumeric characters with re.escape:
>>> import re
>>> matches = ['$money', 'x^y']
>>> safe_matches = [re.escape(m) for m in matches]
>>> safe_matches
['\\$money', 'x\\^y']
The strings in this new list will match each character literally when used with str.contains.
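A brief runnable sketch (with invented sample strings) tying re.escape together with str.contains:

```python
import re
import pandas as pd

# Hypothetical data containing regex metacharacters.
s = pd.Series(['$money talks', 'x^y = z', 'no match'])
matches = ['$money', 'x^y']

# Escape metacharacters so '$' and '^' are matched literally.
safe_matches = [re.escape(m) for m in matches]
mask = s.str.contains('|'.join(safe_matches))
print(mask.tolist())  # [True, True, False]
```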
You can use str.contains alone with a regex pattern using OR (|):
s[s.str.contains('og|at')]
Or you could add the series to a dataframe then use str.contains:
df = pd.DataFrame(s)
df[s.str.contains('og|at')]
Output:
0 cat
1 hat
2 dog
3 fog
You could either use .str again to get access to the string methods, or (better, IMHO) use case=False to guarantee case insensitivity:
>>> df = pd.DataFrame({"body": ["ball", "red BALL", "round sphere"]})
>>> df[df["body"].str.contains("ball")]
body
0 ball
>>> df[df["body"].str.lower().str.contains("ball")]
body
0 ball
1 red BALL
>>> df[df["body"].str.contains("ball", case=False)]
body
0 ball
1 red BALL
>>> df[df["body"].str.contains("ball", case=True)]
body
0 ball
(Note that if you're going to be doing assignments, it's a better habit to use df.loc, to avoid the dreaded SettingWithCopyWarning, but if we're just selecting here it doesn't matter.)
You can also use contains inside query:
In [2]: df = pd.DataFrame({'body': ['Ball', 'cUbE', 'bAll'], 'color': ['red', 'green', 'blue']})
In [3]: df
Out[3]:
body color
0 Ball red
1 cUbE green
2 bAll blue
In [4]: df.query('body.str.contains("ball", case=False).values')
Out[4]:
body color
0 Ball red
2 bAll blue
If you try to match multiple patterns use |:
In [5]: df.query('body.str.contains("ball|cube", case=False).values')
Out[5]:
body color
0 Ball red
1 cUbE green
2 bAll blue