Use str.contains for a boolean mask and then numpy.where:
m = df['a'].str.contains('foo') & (df['b'] == 'bar')
print (m)
0 True
1 False
2 False
dtype: bool
df['new'] = np.where(m, 'yes', 'no')
print (df)
a b c new
0 foo bar baz yes
1 bar foo baz no
2 foobar barfoo baz no
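The frame used in the example above is not shown; here is a self-contained sketch that reconstructs it from the printed output (the column values are an assumption inferred from that output):

```python
import numpy as np
import pandas as pd

# Reconstructed sample frame matching the printed output above
df = pd.DataFrame({'a': ['foo', 'bar', 'foobar'],
                   'b': ['bar', 'foo', 'barfoo'],
                   'c': ['baz', 'baz', 'baz']})

# Boolean mask: 'a' contains 'foo' AND 'b' equals 'bar' exactly
m = df['a'].str.contains('foo') & (df['b'] == 'bar')
df['new'] = np.where(m, 'yes', 'no')
```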
Or if you also need to check column b for substrings:
m = df['a'].str.contains('foo') & df['b'].str.contains('bar')
df['new'] = np.where(m, 'yes', 'no')
print (df)
a b c new
0 foo bar baz yes
1 bar foo baz no
2 foobar barfoo baz yes
If you need a custom function (which will be slower on a bigger DataFrame):
def somefunction(row):
    if 'foo' in row['a'] and row['b'] == 'bar':
        return 'yes'
    return 'no'
print (df.apply(somefunction, axis=1))
0 yes
1 no
2 no
dtype: object
def somefunction(row):
    if 'foo' in row['a'] and 'bar' in row['b']:
        return 'yes'
    return 'no'
print (df.apply(somefunction, axis=1))
0 yes
1 no
2 yes
dtype: object
Timings:
df = pd.concat([df]*1000).reset_index(drop=True)
def somefunction(row):
    if 'foo' in row['a'] and row['b'] == 'bar':
        return 'yes'
    return 'no'
In [269]: %timeit df['new'] = df.apply(somefunction, axis=1)
10 loops, best of 3: 60.7 ms per loop
In [270]: %timeit df['new1'] = np.where(df['a'].str.contains('foo') & (df['b'] == 'bar'), 'yes', 'no')
100 loops, best of 3: 3.25 ms per loop
df = pd.concat([df]*10000).reset_index(drop=True)
def somefunction(row):
    if 'foo' in row['a'] and row['b'] == 'bar':
        return 'yes'
    return 'no'
In [272]: %timeit df['new'] = df.apply(somefunction, axis=1)
1 loop, best of 3: 614 ms per loop
In [273]: %timeit df['new1'] = np.where(df['a'].str.contains('foo') & (df['b'] == 'bar'), 'yes', 'no')
10 loops, best of 3: 23.5 ms per loop
Your exception probably comes from the fact that you write
if row['a'].str.contains('foo')==True
Inside an apply with axis=1, row['a'] is a plain Python string, which has no .str accessor (and strings have no .contains method either). Use the in operator instead:
if 'foo' in row['a']:
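Since each cell seen during a row-wise apply is a plain Python string, membership is tested with the in operator. A minimal sketch (the sample column is an assumption):

```python
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar', 'foobar']})

# Inside apply, each cell is a str, so use `in`, not the .str accessor
result = df['a'].apply(lambda cell: 'foo' in cell)
```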
You can use character classes [] with the re module:
re.findall('A0[0-9].0[0-9]|A0[0-9]','A01')
output:
['A01']
No occurrence:
re.findall('A0[0-9].0[0-9]|A0[0-9]','A11')
output:
[]
Use re.match() to check this. Here is an example:
import re
section_id = "A01.09"
if re.match(r"^A0[0-9](\.0[0-9])?$", section_id):
    print("yes")
Here the regex means A0X is mandatory and .0X is optional, where X is a digit from 0-9.
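To make the mandatory/optional behaviour concrete, here is a sketch checking a few hypothetical section IDs against that pattern:

```python
import re

# "A0X" is mandatory, ".0X" is optional, X a digit 0-9
pattern = re.compile(r"^A0[0-9](\.0[0-9])?$")

# Hypothetical section IDs used purely for illustration
results = {s: bool(pattern.match(s))
           for s in ["A01", "A01.09", "A11", "A01.9"]}
# "A11" fails (second character must be 0); "A01.9" fails (suffix must be .0X)
```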
Hi, for the life of me, I do not know why I am getting a partial match for this regex
I want to match and print out "FOO-2334" but I am only getting back "FOO"
It has something to do with the hyphen...I think.
Any hints please?
import re
myStr = "FOO-2334 is an id"
matches = re.findall(r'(FOO|BAR)-[\d]{4}', myStr)
for m in matches:
    print(f"{m}")
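The partial result is not caused by the hyphen: when a pattern contains a capturing group, re.findall returns only the group's contents, so (FOO|BAR) yields just "FOO". A non-capturing group (?:...) makes findall return the whole match:

```python
import re

myStr = "FOO-2334 is an id"

# (?:FOO|BAR) groups the alternation without capturing,
# so findall returns the entire matched substring
matches = re.findall(r'(?:FOO|BAR)-\d{4}', myStr)
```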
- startswith and in return a Boolean.
- The in operator is a test of membership.
- This can be performed with a list-comprehension or filter.
- Using a list-comprehension, with in, is the fastest implementation tested.
- If case is not an issue, consider mapping all the words to lowercase: l = list(map(str.lower, l)).
- Tested with python 3.11.0
filter:
- Using filter creates a filter object, so list() is used to show all the matching values in a list.
l = ['ones', 'twos', 'threes']
wanted = 'three'
# using startswith
result = list(filter(lambda x: x.startswith(wanted), l))
# using in
result = list(filter(lambda x: wanted in x, l))
print(result)
[out]:
['threes']
list-comprehension
l = ['ones', 'twos', 'threes']
wanted = 'three'
# using startswith
result = [v for v in l if v.startswith(wanted)]
# using in
result = [v for v in l if wanted in v]
print(result)
[out]:
['threes']
Which implementation is faster?
- Tested in Jupyter Lab using the words corpus from nltk v3.7, which has 236736 words.
- Words containing 'three': ['three', 'threefold', 'threefolded', 'threefoldedness', 'threefoldly', 'threefoldness', 'threeling', 'threeness', 'threepence', 'threepenny', 'threepennyworth', 'threescore', 'threesome']
from nltk.corpus import words
%timeit list(filter(lambda x: x.startswith(wanted), words.words()))
%timeit list(filter(lambda x: wanted in x, words.words()))
%timeit [v for v in words.words() if v.startswith(wanted)]
%timeit [v for v in words.words() if wanted in v]
%timeit results:
62.8 ms ± 816 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
53.8 ms ± 982 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
56.9 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
47.5 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
A simple, direct answer:
test_list = ['one', 'two', 'threefour']
r = [s for s in test_list if s.startswith('three')]
print(r[0] if r else 'nomatch')
Result:
threefour
Not sure what you want to do in the non-matching case. r[0] is exactly what you asked for if there is a match, but it's undefined if there is no match. The print deals with this, but you may want to do so differently.
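For the non-matching case, one common alternative (an addition, not from the original answer) is next() with a default, which avoids building the full list and indexing into it:

```python
test_list = ['one', 'two', 'threefour']

# next() returns the first match from the generator,
# or the default 'nomatch' if nothing starts with 'three'
first = next((s for s in test_list if s.startswith('three')), 'nomatch')
```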
I have a dataframe with a few million rows of names and accompanying columns with relevant info. I want to narrow down the dataframe to only include names from a list of 2,000 names. What's the best method of going about this when I have middle names and states to help distinguish between duplicate names?
Here's an example of the list of names:
John Smith Alabama R
John Smith Alabama
Jeremy Smith Washington P
What I want to do is first match the name and state to the dataframe if there is a middle initial match (the last letter in the list name if there is a middle name). If not, then I would just like to match by the name and state.
Here's what I tried so far:
df2 <- df[grep(paste(list_of_names, collapse = "|"), df$name_state_middle_initial),]
However, I'm only getting complete string matches with the above code. Any help would be great!
Let's say we have: ThisIsAPrettyLongSingleWord and GirlWasPrettySkinny
"Pretty" matches, but nothing else; there is only a partial match.
You can use the almost-ready-to-be-everyones-regex package with fuzzy matching:
>>> import regex
>>> bigString = "AGAHKGHKHASNHADKRGHFKXXX_I_AM_THERE_XXXXXMHHGRFSAHGSKHASGKHGKHSKGHAK"
>>> regex.search('(?:I_AM_HERE){e<=1}',bigString).group(0)
'I_AM_THERE'
Or:
>>> bigString = "AGAH_I_AM_HERE_RGHFKXXX_I_AM_THERE_XXX_I_AM_NOWHERE_EREXXMHHGRFS"
>>> print(regex.findall('I_AM_(?:HERE){e<=3}',bigString))
['I_AM_HERE', 'I_AM_THERE', 'I_AM_NOWHERE']
The new regex module will (hopefully) be part of Python 3.4.
If you have pip, just type pip install regex or pip3 install regex until Python 3.4 is out (with regex part of it...)
Answer to a comment: "Is there a way to know the best of the three in your second example? How to use the BESTMATCH flag here?"
Either use the best match flag (?b) to get the single best match:
print(regex.search(r'(?b)I_AM_(?:ERE){e<=3}', bigString).group(0))
# I_AM_THE
Or combine with difflib or take a levenshtein distance with a list of all acceptable matches to the first literal:
import regex
def levenshtein(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for index2, char2 in enumerate(s2):
        newDistances = [index2 + 1]
        for index1, char1 in enumerate(s1):
            if char1 == char2:
                newDistances.append(distances[index1])
            else:
                newDistances.append(1 + min((distances[index1],
                                             distances[index1 + 1],
                                             newDistances[-1])))
        distances = newDistances
    return distances[-1]
bigString = "AGAH_I_AM_NOWHERE_HERE_RGHFKXXX_I_AM_THERE_XXX_I_AM_HERE_EREXXMHHGRFS"
cl=[(levenshtein(s,'I_AM_HERE'),s) for s in regex.findall('I_AM_(?:HERE){e<=3}',bigString)]
print(cl)
print([t[1] for t in sorted(cl, key=lambda t: t[0])])
print(regex.search(r'(?e)I_AM_(?:ERE){e<=3}', bigString).group(0))
Prints:
[(3, 'I_AM_NOWHERE'), (1, 'I_AM_THERE'), (0, 'I_AM_HERE')]
['I_AM_HERE', 'I_AM_THERE', 'I_AM_NOWHERE']
Here is a bit of a hacky way to do it with difflib (smallString is the pattern to look for, bigString the text from above):
from difflib import get_close_matches
window = len(smallString) + 1  # allow for longer matches
chunks = [bigString[i:i+window] for i in range(len(bigString) - window)]
get_close_matches(smallString, chunks, 1)
Output:
['_I_AM_THERE']
I am facing the problem of a very long-running for loop.
There are two Python lists (A and B):
A contains around 170,000 strings with lengths between 1 and 100 characters. B contains around 3,000 strings with the same length variety.
Now I need to find the items from list A which contain at least one item from list B.
Considering that each string from A needs to be compared with each string from B, this results in 510,000,000 comparisons, which seems computationally too expensive.
What possibilities are there to speed things up?
I don't want to stop after the first match, as there could be more matches. The goal is to store all matches in some new variable/db.
Pseudo-code:
A = []  # length: 170,000 (strings)
B = []  # length: 3,000 (strings)
for item in A:
    for element in B:
        if element in item:
            print("store the item which contains the element to db")

# Some sample content
A[0] = "This is some random text in which I want to find words"
A[1] = "It is just some random text"
...
B[0] = "text"
B[1] = "some random text"
...
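One possibility (a suggested technique, not from the original post) is to fold all of B into a single compiled regex alternation, so scanning A makes one regex pass per item instead of 3,000 substring tests. A sketch using the sample content above:

```python
import re

A = ["This is some random text in which I want to find words",
     "It is just some random text"]
B = ["text", "some random text"]

# One alternation over all needles; longest first so the engine prefers
# longer matches, and re.escape guards any regex metacharacters in B.
pattern = re.compile("|".join(re.escape(b)
                              for b in sorted(B, key=len, reverse=True)))

# Keep every item of A that contains at least one needle from B
matches = [item for item in A if pattern.search(item)]
```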