Join the list on the pipe character |, which represents different options in regex.
import re

string_lst = ['fun', 'dum', 'sun', 'gum']
x = "I love to have fun."
print(re.findall(r"(?=(" + '|'.join(string_lst) + r"))", x))
Output: ['fun']
You cannot use match, as it only matches from the start of the string.
Using search you will get only the first match, so use findall instead.
Also, use a lookahead if you have overlapping matches that do not start at the same position.
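To see why the lookahead matters for overlapping matches, here is a small sketch (the word list dum/um is made up for illustration): plain alternation consumes the characters it matches, while the zero-width lookahead does not.

```python
import re

# Plain alternation consumes 'dum', so the overlapping 'um' is skipped.
print(re.findall(r"dum|um", "dumdum"))
# ['dum', 'dum']

# The zero-width lookahead consumes nothing, so overlapping hits are reported.
print(re.findall(r"(?=(dum|um))", "dumdum"))
# ['dum', 'um', 'dum', 'um']
```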
Answer from vks on Stack Overflow
The regex module has named lists (sets, actually):
#!/usr/bin/env python
import regex as re # $ pip install regex
p = re.compile(r"\L<words>", words=['fun', 'dum', 'sun', 'gum'])
if p.search("I love to have fun."):
    print('matched')
Here words is just a name; you can use anything you like instead.
The .search() method is used instead of putting .* before/after the named list.
To emulate named lists using stdlib's re module:
#!/usr/bin/env python
import re
words = ['fun', 'dum', 'sun', 'gum']
longest_first = sorted(words, key=len, reverse=True)
p = re.compile(r'(?:{})'.format('|'.join(map(re.escape, longest_first))))
if p.search("I love to have fun."):
    print('matched')
re.escape() is used to escape regex meta-characters such as .*? inside individual words (to match the words literally).
sorted() puts the longest words first among the alternatives, emulating the regex module's longest-match behavior; compare:
>>> import re
>>> re.findall("(funny|fun)", "it is funny")
['funny']
>>> re.findall("(fun|funny)", "it is funny")
['fun']
>>> import regex
>>> regex.findall(r"\L<words>", "it is funny", words=['fun', 'funny'])
['funny']
>>> regex.findall(r"\L<words>", "it is funny", words=['funny', 'fun'])
['funny']
So I spent a few days figuring out my beautiful regex, found here, to parse WhatsApp messages. It goes through the text and puts it into groups, but now... I just don't know how to use it to pull out the data.
I am in Google Colab, pulling my WhatsApp raw text into a list like this:
def read_file(file):
    '''Reads a WhatsApp text file into a list of strings'''
    x = open(file, 'r', encoding='utf-8')  # Opens the text file; the contents cannot be explored yet
    y = x.read()  # By now it is one huge chunk of string that we need to separate line by line
    content = y.splitlines()  # The splitlines method converts the chunk of string into a list of strings
    return content

chat = read_file('18042020_cut.txt')
Cool. so like...what do I do now?
I tried:
content = re.search('[(?P<date>\d{2}/\d{2}/\d{4}),\s(?P<time>\d{1,2}:\d{2}:\d{2}.{3})]\s(?P<sender>[:]*):\s(?P<message>.+|\n+(?!)[\d{2}/\d{2}/\d{4})', chat).group('Content')
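Not the asker's exact pattern, but a simplified sketch (with a made-up sample line) of how named groups are pulled out of WhatsApp-style lines once search() returns a match object:

```python
import re

# Hypothetical WhatsApp-style line for illustration.
line = "[18/04/2020, 21:05:12] Alice: hello there"

# Simplified pattern: square brackets are escaped, and the sender is
# 'everything up to the next colon'.
pattern = re.compile(
    r"\[(?P<date>\d{2}/\d{2}/\d{4}),\s(?P<time>\d{1,2}:\d{2}:\d{2})\]\s"
    r"(?P<sender>[^:]*):\s(?P<message>.+)"
)

m = pattern.search(line)
if m:
    print(m.group('date'))     # 18/04/2020
    print(m.group('sender'))   # Alice
    print(m.group('message'))  # hello there
```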
I am working on a project where I have an input list of filenames, and I want to compute a regular expression that is as precise as possible and validates against all elements of that list. Is there a regular expression library or method to solve this? I've tried looking online, but I only find results about validating a list of string inputs against an already-defined regular expression, which is the opposite of what I am trying to do.
I have a DB with 2 tables that is being used to correlate Excel tables/ranges to specific files. One DB table has the data source name and the file name pattern of the file that data source is saved in. The other table has all the Excel tables/ranges that can be found in each data source. By joining these two tables, I can get a list of all data sources, the Excel ranges each data source includes, and the file name pattern they come from.
I want to fill the database's file patterns using existing XML files which store actual file names that had been used in the past to store the data. Processing these XML files, I can determine that, for example, data source A has in the past had file names of BAB.xls, CAB.xls, and DAB.xls.
I want to try and create a program that can take the list ['BAB.xls', 'CAB.xls', 'DAB.xls'] and return a regular expression like /.AB\.xls/.
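There is no standard-library routine for this, but for same-length names like these a minimal sketch can generalize column by column: characters shared by all names stay literal, and differing columns become a character class (infer_pattern is a made-up helper for illustration, not a library function).

```python
import re

def infer_pattern(names):
    """Generalize equal-length names: shared characters stay literal,
    differing columns become a character class."""
    if len(set(len(n) for n in names)) != 1:
        raise ValueError("this sketch only handles names of equal length")
    parts = []
    for chars in zip(*names):          # walk the names column by column
        uniq = sorted(set(chars))
        if len(uniq) == 1:
            parts.append(re.escape(uniq[0]))
        else:
            parts.append('[' + ''.join(map(re.escape, uniq)) + ']')
    return ''.join(parts)

pattern = infer_pattern(['BAB.xls', 'CAB.xls', 'DAB.xls'])
print(pattern)  # [BCD]AB\.xls
```
Real regex-inference is a harder problem (variable lengths, avoiding over-generalization), but this covers the fixed-length case in the example.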
Full Example (Python 3):
For Python 2.x look into Note below
import re
mylist = ["dog", "cat", "wildcat", "thundercat", "cow", "hooo"]
r = re.compile(".*cat")
newlist = list(filter(r.match, mylist)) # Read Note below
print(newlist)
Prints:
['cat', 'wildcat', 'thundercat']
Note:
For Python 2.x developers, filter() already returns a list. In Python 3.x, filter() was changed to return an iterator, so it has to be converted to a list (in order to see it printed out nicely).
You can create an iterator in Python 3.x, or a list in Python 2.x, by using:
filter(r.match, mylist)
To convert the Python 3.x iterator to a list, simply wrap it: list(filter(..)).
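A related sketch, reusing the list above: r.match anchors at the beginning of each string, which is why the answer's pattern needs the leading .* - with re.search you can drop it and still find 'cat' anywhere.

```python
import re

mylist = ["dog", "cat", "wildcat", "thundercat", "cow", "hooo"]

# search() scans the whole string, so no leading .* is needed.
print(list(filter(re.compile("cat").search, mylist)))
# ['cat', 'wildcat', 'thundercat']

# match() only succeeds at the start of the string.
print(list(filter(re.compile("cat").match, mylist)))
# ['cat']
```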
Firstly, your regex does not seem to work properly. The Key field should allow values that include f, right? So its group should not be ([0-9A-Ea-e]+) but instead ([0-9A-Fa-f]+). Also, it is a good - actually, a wonderful - practice to prefix regex strings with r, because it avoids problems with \ escaping. (If you do not understand why, look up raw strings.)
Now, my approach to the problem. First, I would create a regex without pipes:
>>> regex = r"(Key):[\s]*([0-9A-Fa-f]+)[\s]*" \
... r"(Index):[\s]*([0-9]+)[\s]*" \
... r"(Field 1):[\s]*([0-9]+)[\s]*" \
... r"(Field 2):[\s]*([0-9 A-Za-z]+)[\s]*" \
... r"(Field 3):[\s]*([-+]?[0-9]+)[\s]*"
With this change, the findall() will return only one tuple of found groups for an entire line. In this tuple, each key is followed by its value:
>>> re.findall(regex, line)
[('Key', 'af12d9', 'Index', '0', 'Field 1', '1234', 'Field 2', '1234 Ring ', 'Field 3', '-10')]
So I get the tuple...
>>> found = re.findall(regex, line)[0]
>>> found
('Key', 'af12d9', 'Index', '0', 'Field 1', '1234', 'Field 2', '1234 Ring ', 'Field 3', '-10')
...and using slices I get only the keys...
>>> found[::2]
('Key', 'Index', 'Field 1', 'Field 2', 'Field 3')
...and also only the values:
>>> found[1::2]
('af12d9', '0', '1234', '1234 Ring ', '-10')
Then I create a list of tuples, each containing a key and its corresponding value, with the zip() function:
>>> zip(found[::2], found[1::2])
[('Key', 'af12d9'), ('Index', '0'), ('Field 1', '1234'), ('Field 2', '1234 Ring '), ('Field 3', '-10')]
(In Python 3, zip() returns an iterator; wrap it in list() to see the pairs.)
The grand finale is to pass the list of tuples to the dict() constructor:
>>> dict(zip(found[::2], found[1::2]))
{'Field 3': '-10', 'Index': '0', 'Field 1': '1234', 'Key': 'af12d9', 'Field 2': '1234 Ring '}
I find this solution the best, but it is indeed a subjective question in some sense. HTH anyway :)
OK, with help of brandizzi, I have found THE answer to this question.
Solution:
listconfig = []
for line in list_of_strings:
    matched = re.search(r"Key:[\s]*(?P<key>[0-9A-Fa-f]+)[\s]*"
                        r"(Index:[\s]*(?P<index>[0-9]+)[\s]*)?"
                        r"(Field 1:[\s]*(?P<field_1>[0-9]+)[\s]*)?"
                        r"(Field 2:[\s]*(?P<field_2>[0-9 A-Za-z]+)[\s]*)?"
                        r"(Field 3:[\s]*(?P<field_3>[-+]?[0-9]+)[\s]*)?", line)
    if matched:
        print(matched.groupdict())
        listconfig.append(matched.groupdict())
You can use the builtin any():
r = re.compile('.*search.*')
if any(r.match(line) for line in output):
    do_stuff()
Passing a lazy generator to any() allows it to exit on the first match, without having to check any further into the iterable.
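The short-circuit is easy to observe with a generator that records which lines it actually handed to any() (the line list here is made up):

```python
import re

r = re.compile('.*search.*')
output = ['first line', 'searching here', 'never examined', 'nor this one']

checked = []
def lines():
    for line in output:
        checked.append(line)   # record every line any() actually pulls
        yield line

found = any(r.match(line) for line in lines())
print(found)    # True
print(checked)  # ['first line', 'searching here'] - iteration stopped at the match
```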
Starting with Python 3.8 and the introduction of assignment expressions (PEP 572, the := operator), we can also capture a witness of an any() expression when a match is found and use it directly:
# pattern = re.compile('.*search.*')
# items = ['hello', 'searched', 'world', 'still', 'searching']
if any((match := pattern.match(x)) for x in items):
print(match.group(0))
# 'searched'
For each item, this:
- Applies the regex match (pattern.match(x))
- Assigns the result to a match variable (either None or a re.Match object)
- Uses the truth value of match as part of the any expression (None -> False, Match -> True)
- If match is None, the any search loop continues
- If match has captured a group, we exit the any expression, which is considered True, and the match variable can be used within the condition's body
I'd like to be able to match strings which meet these criteria:
['foo' OR 'bar' OR 'Python'] AND ['me', OR 'you' OR 'we']
Use lookaheads. ^(?=.*foo|.*bar|.*Python)(?=.*me|.*you|.*we)
Add \b around the words (e.g. \bfoo\b) if you want them matched as isolated words; otherwise you get matches like fool.
https://regex101.com/r/dnqSjr/1
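Applied in Python (sample strings invented for illustration), with the \b boundaries included:

```python
import re

# One lookahead per AND-group; \b keeps 'fool' from matching 'foo'.
pattern = re.compile(r'^(?=.*\b(?:foo|bar|Python)\b)(?=.*\b(?:me|you|we)\b)')

print(bool(pattern.search("Python works for me")))   # True  (Python + me)
print(bool(pattern.search("foo alone")))             # False (no word from group 2)
print(bool(pattern.search("the fool and me")))       # False ('fool' is not 'foo')
```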
If you want to use regex, you have to construct the regex string in your code from the lists.
You have to string all the words together using the regex alternation operator | - and that might not be the most efficient solution, depending on the length of the word lists.
But let's start with the base regex:
\bAWORD\b
This will match "AWORD". \b means word boundary, meaning we don't match partial words. Instead of AWORD we can use a list here: (word1|word2|...etc).
You can construct this list with Python, like so:
import re

word_list1 = ['foo', 'bar', 'Python']
word_list2 = ['me', 'you', 'we']
words1 = '|'.join(word_list1)
words2 = '|'.join(word_list2)
regex = r'\b(?:{})\b'
test_str = "foo is a me word"
print((re.search(regex.format(words1), test_str) and
       re.search(regex.format(words2), test_str)) is not None)
.format() just inserts the '|'-separated words into the regex in place of '{}'. I am sure there is a more "pythonic" way of doing this, but this is the regex way. :)