Hi, I don't get why when I use str.contains to get exact matches from a list of keywords, the output still contains partial matches. Here is an extract of what I have (I'm only including one keyword in the list for the example):
keyword= ['SE.TER.ENRL']
subset = df[df['Code'].str.contains('|'.join(keyword), case=False, na=False)]
Output: ['SE.TER.ENRL' 'SE.TER.ENRL.FE' 'SE.TER.ENRL.FE.ZS']
Does anyone know how to get around this?
Thanks!
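A rough sketch of one way around this, assuming `Code` holds plain strings and that "exact" means the whole cell equals one of the keywords. `str.contains` does a regex search anywhere in the string (and an unescaped `.` matches any character), so a whole-value test such as `isin`, or `str.fullmatch` with escaped keywords, is probably closer to what is wanted. The sample frame below is made up to mirror the output above.

```python
import re
import pandas as pd

# Hypothetical sample data mirroring the output shown in the question.
df = pd.DataFrame({"Code": ["SE.TER.ENRL", "SE.TER.ENRL.FE", "SE.TER.ENRL.FE.ZS", None]})
keyword = ["SE.TER.ENRL"]

# Option 1: whole-value comparison, case-sensitive, no regex involved.
subset_isin = df[df["Code"].isin(keyword)]

# Option 2: keep the regex and the case-insensitivity, but require the whole
# string to match and escape regex metacharacters such as ".".
pattern = "(?:" + "|".join(re.escape(k) for k in keyword) + ")"
subset_full = df[df["Code"].str.fullmatch(pattern, case=False, na=False)]

print(subset_isin["Code"].tolist())
print(subset_full["Code"].tolist())
```

`isin` is the simpler choice when case does not matter; `str.fullmatch` needs pandas 1.1 or newer.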
My problem is that using str.contains or str.match returns rows that contain even substrings of the string I am looking for:
new_dataframe = df[df['number'].str.match(number)]
I want only the rows that are an exact match for the string.
Is there a way to match a list of strings exactly with the strings in a pandas column to filter out the ones that do not have?
Say, words = ['ab', 'ml']
df =
| data |
|---|
| 'example string ab' |
| 'absolute value' |
After filtering, I must get only the row with value 'example string ab' for it contains exact string 'ab' from the list 'words'.
You can use the word-boundaries of regular expressions. Example:
import re
s = '98787This is correct'
for words in ['This is correct', 'This', 'is', 'correct']:
    if re.search(r'\b' + words + r'\b', s):
        print('{0} found'.format(words))
That yields:
is found
correct found
For an exact match, replace the \b assertions with ^ and $ to restrict the match to the beginning and end of the line.
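The same idea can be carried over to the pandas question above. This is only a sketch, assuming the `data` column and the `words` list from that question; the non-capturing group `(?:...)` is my addition so that the boundaries apply to every alternative.

```python
import re
import pandas as pd

words = ["ab", "ml"]
df = pd.DataFrame({"data": ["example string ab", "absolute value"]})

# Alternation of the escaped words, e.g. "ab|ml".
alternatives = "|".join(re.escape(w) for w in words)

# \b(?:ab|ml)\b matches a listed word only when it stands alone as a word.
whole_word = r"\b(?:" + alternatives + r")\b"
word_hits = df[df["data"].str.contains(whole_word, regex=True, na=False)]

# ^...$ semantics: the entire cell must equal one of the words.
exact_hits = df[df["data"].str.fullmatch("(?:" + alternatives + ")", na=False)]

print(word_hits["data"].tolist())
print(exact_hits["data"].tolist())
```

Here only 'example string ab' survives the word-boundary filter, and nothing survives the whole-cell filter because no cell is exactly 'ab' or 'ml'.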
Use the comparison operator == instead of in then:
if text == 'This is correct':
    print("Correct")
This will check to see if the whole string is just 'This is correct'. If it isn't, it will be False
It sounds like the thing you're trying to do is somewhat insane. With 40k first names to search for, false positives are inevitable. At the same time, with only 40k names, false negatives are also inevitable. People's names are untidy; hopefully you have plans to accommodate. Even when you get correct matches for a "first" and "last" name, as your example email shows, there's no guarantee that they'll be the first and last names of the same person.
Maybe someone with experience in natural-language-processing AI would be able to solve your problem in a robust way. More likely you've resigned yourself to a solution that simply isn't robust. You still pretty definitely need case-sensitivity and "whole word" matching.
I'm not convinced by the example you give of a false positive. The pandas function you're using is regex-based. r'tero' does not match 't er o'; it does match 'interoperability'. With name lists as long as you're using, it seems more likely that you overlooked some other match in the email in question. I would kinda expect just a few of the names to be responsible for the majority of false positives; outputting the matched text will help you identify them.
- Case-sensitive regex matching should be the default.
- I think \b...\b as a regex pattern will give the kind of "whole word" matching you need.
- pandas' Series.str.extract will do the capturing.
Given the size of your datasets, you may be a bit concerned with the performance. Or you may not, it's up to you.
I haven't tested this at all:
# Import datasets and create lists/variables
import re
import pandas as pd
from typing import Iterable
# Document, sheet, and column names:
names_source_file = 'names.xlsx'
first_names_sheet = 'Alle Navne'
first_names_column = 'Names'
last_names_sheet = 'Frie Efternavne'
last_names_column = 'Frie Efternavne'
subject_file = 'Entreprise Beskeder.xlsx'
subject_sheet = 'dataark'
subject_column = 'Besked'
output_first_name = 'Navner'
output_last_name = 'Efternavner'
output_file = 'PythonExport.xlsx'
# Build (very large!) search patterns. Raw strings keep \b as a word boundary
# (in a normal string it is a backspace), names are escaped, and the capture
# group around the alternation is what str.extract returns.
first_names_df = pd.read_excel(names_source_file, sheet_name=first_names_sheet)
first_names: Iterable[str] = first_names_df[first_names_column].dropna().astype(str)
first_names_regex = r'\b({})\b'.format('|'.join(map(re.escape, first_names)))
last_names_df = pd.read_excel(names_source_file, sheet_name=last_names_sheet)
last_names: Iterable[str] = last_names_df[last_names_column].dropna().astype(str)
last_names_regex = r'\b({})\b'.format('|'.join(map(re.escape, last_names)))
# Import dataset and drop rows where the subject column is null:
data_frame = pd.read_excel(subject_file, sheet_name=subject_sheet)
data_frame = data_frame.dropna(subset=[subject_column])
# Add columns for found first and last names:
data_frame[output_first_name] = data_frame[subject_column].str.extract(
    first_names_regex,
    expand=False
)
data_frame[output_last_name] = data_frame[subject_column].str.extract(
    last_names_regex,
    expand=False
)
# Save the result
data_frame.to_excel(output_file)
One obvious problem that I still haven't talked about is that there may be multiple name matches in a given subject. Assuming that you care about multiple matches, you can probably do something with extractall.
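As a rough illustration of the multiple-match case (not tested against the real data): `str.extractall` returns one row per match, and `str.findall` gives one list per message. The names and messages below are hypothetical stand-ins for the Excel columns.

```python
import re
import pandas as pd

# Hypothetical stand-ins for the name list and the message column.
first_names = ["Anna", "Bo"]
messages = pd.Series(["Anna met Bo and Annabel", "nobody here", "Bo called"])

# One capture group wrapped in word boundaries, names escaped.
pattern = r"\b(" + "|".join(re.escape(n) for n in first_names) + r")\b"

# One row per match, indexed by (original row, match number).
all_matches = messages.str.extractall(pattern)

# One list of matched names per message; empty list when nothing matches.
names_per_message = messages.str.findall(pattern)

print(all_matches)
print(names_per_message.tolist())
```

`extractall` is handy when you want to join the matches back onto the original rows; `findall` is easier if a list per row is enough. Note that 'Annabel' is not reported as 'Anna' because of the trailing \b.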
To see what is being matched, use apply() with a python function:
import re
regex = re.compile(pat)
def search(item):
    mo = regex.search(item)
    if mo:
        return mo[0]
    else:
        return ''
df.msg.apply(search)
This will yield a Series with the names that matched or '' if there isn't a match.
Hi,
I have a dataframe with columns made up of strings and dates, from which I'd like to create a new dataframe containing only the rows where column B has a specific string in its first four characters.
The dataframe looks like this:
| A | B | C |
|---|---|---|
| ['textstring1', 'textstring2',...,'textstringN'] | ['1234-5678-9'] | 2018-01-23 |
| ['textstring1', 'textstring2',...,'textstringN'] | ['9876-5432-1'] | 2018-02-12 |
And I wish to create a dataframe with the rows containing '1234' in the first four characters of the cells in column B.
The code I have so far looks like this (example)
import pandas as pd
df = pd.DataFrame(["['1234-9493']", "['1254-1234']", "['3838-1234']", "['1235-3845']"])
df_sorted = df[(df[0].str.contains('1234'))]
df_sorted
However, it doesn't take the position into account and the output looks like:
| | 0 |
|---|---|
| 0 | 1234-9493 |
| 1 | 1254-1234 |
| 2 | 3838-1234 |
How I wish it would look like:
| | 0 |
|---|---|
| 0 | 1234-9493 |
How can I change the code to take the position of the substring into account?
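A possible approach, sketched on the example frame above: anchor the pattern to the start of the string. Since the cells in the example are strings that literally begin with `['`, that prefix has to be part of the anchored pattern (or be stripped first). This assumes every cell has that same wrapper.

```python
import pandas as pd

df = pd.DataFrame(["['1234-9493']", "['1254-1234']", "['3838-1234']", "['1235-3845']"])

# ^ anchors the match at the start of the cell; \[ is a literal bracket.
df_anchored = df[df[0].str.contains(r"^\['1234", na=False)]

# Equivalent without regex: strip the wrapper characters, then test the prefix.
df_prefix = df[df[0].str.strip("[]'").str.startswith("1234")]

print(df_anchored[0].tolist())
print(df_prefix[0].tolist())
```

Both filters keep only the first row. If the column held real lists rather than strings, the first element would have to be pulled out first (for example with `.str[0]`) before testing the prefix.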
fruitlist is a string, not a list.
fruitlist = str(sys.argv[2:]).upper() converts the whole sys.argv[2:] list to a single str and then upper-cases it, so in does a substring test rather than a list membership test.
To avoid this, build a list instead:
fruitlist = [x.upper() for x in sys.argv[2:]]
full code:
import sys
fruitlist = [x.upper() for x in sys.argv[2:]]
print(sys.argv[1])
print(fruitlist)
if sys.argv[1].strip() in fruitlist:
    print(sys.argv[1], 'exact match found in list')
Your fruitlist isn't actually a list; it is a string. Here is the correct code, which makes it a list not a string:
import sys
fruitlist = [str(a).upper() for a in sys.argv[2:]]
print(sys.argv[1])
print(fruitlist)
if sys.argv[1].strip() in fruitlist:
    print(sys.argv[1], 'exact match found in list')
You could simply use ==
string_a == string_b
It should return True if the two strings are equal. But this does not solve your issue.
Edit 2: You should use len(df1.index) instead of len(df1.columns). Indeed, len(df1.columns) will give you the number of columns, and not the number of rows.
Edit 3: After reading your second post, I've understood your problem. The solution you propose could lead to some errors. For instance, if you have:
ls=['[email protected]','[email protected]', '[email protected]']
the first and the third element will match str.contains(r'(?:\s|^|Ei:|EI:|EI-)'+ls[i]) And this is an unwanted behaviour.
You could add a check on the end of the string: str.contains(r'(?:\s|^|Ei:|EI:|EI-)'+ls[i]+r'(?:\s|$)')
Like this:
for i in range(len(ls)):
    df1 = df[df['A'].str.contains(r'(?:\s|^|Ei:|EI:|EI-)'+ls[i]+r'(?:\s|$)')]
    if len(df1.index) != 0:
        print (ls[i])
(Remove the parentheses in the "print" if you use Python 2.7)
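For what it's worth, here is a small self-contained version of that loop. The addresses and the frame are hypothetical (the third address deliberately extends the first), and `re.escape` is my addition so that the dots in an address are not treated as regex wildcards.

```python
import re
import pandas as pd

# Hypothetical addresses; the third one extends the first.
ls = ["ann@example.com", "bob@example.com", "ann@example.com.au"]
df = pd.DataFrame({"A": ["EI:ann@example.com.au sent", "Ei:bob@example.com", "nothing"]})

found = []
for address in ls:
    # Require a delimiter before and whitespace or end-of-string after.
    pattern = r"(?:\s|^|Ei:|EI:|EI-)" + re.escape(address) + r"(?:\s|$)"
    df1 = df[df["A"].str.contains(pattern, na=False)]
    if len(df1.index) != 0:
        found.append(address)

print(found)
```

With the trailing check in place, the shorter address is no longer reported for the row that actually contains the longer one.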
Thanks for the help. But seems like I found a solution that is working as of now.
Must use str.contains(r'(?:\s|^|Ei:|EI:|EI-)'+ls[i]) This seems to solve the problem.
Although thanks to @IsaacDj for his help.