Try specifying start and end anchors in your regex:
re.compile(r'^test-\d+$')
Answer from hsz on Stack Overflow
Since Python 3.4 you can use re.fullmatch to avoid adding ^ and $ to your pattern.
>>> import re
>>> p = re.compile(r'\d{3}')
>>> bool(p.match('1234'))
True
>>> bool(p.fullmatch('1234'))
False
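A small sketch of one practical difference between the two approaches, reusing the test-\d+ pattern from the answer above (the sample strings are made up for illustration): $ also matches just before a trailing newline, while fullmatch requires the pattern to consume the entire string.

```python
import re

anchored = re.compile(r'^test-\d+$')   # explicit start/end anchors
bare = re.compile(r'test-\d+')         # same pattern without anchors

# Both approaches accept a clean string.
clean_anchored = bool(anchored.match('test-42'))
clean_full = bool(bare.fullmatch('test-42'))

# With a trailing newline they differ: $ matches just before a final '\n',
# while fullmatch() requires the whole string to be consumed.
newline_anchored = bool(anchored.match('test-42\n'))
newline_full = bool(bare.fullmatch('test-42\n'))

print(clean_anchored, clean_full)      # True True
print(newline_anchored, newline_full)  # True False
```

If a trailing newline must be rejected, fullmatch (or \Z in place of $) is the stricter choice.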
Use re.findall or re.finditer instead.
re.findall(pattern, string) returns a list of matching strings.
re.finditer(pattern, string) returns an iterator over match objects (re.Match).
Example:
re.findall( r'all (.*?) are', 'all cats are smarter than dogs, all dogs are dumber than cats')
# Output: ['cats', 'dogs']
[x.group() for x in re.finditer( r'all (.*?) are', 'all cats are smarter than dogs, all dogs are dumber than cats')]
# Output: ['all cats are', 'all dogs are']
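As a short illustration of why finditer can be preferable, here is a sketch on the same sentence that collects both the captured group and the position of each match (the variable names are chosen only for this example):

```python
import re

text = 'all cats are smarter than dogs, all dogs are dumber than cats'

# Each item yielded by finditer() is a match object, so the captured group
# and the position of the match are both available.
results = [(m.group(1), m.span()) for m in re.finditer(r'all (.*?) are', text)]
print(results)  # [('cats', (0, 12)), ('dogs', (32, 44))]
```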
Another method (a bit in keeping with OP's initial spirit, albeit 13 years later) is to compile the pattern, call search() on the compiled pattern, and advance the start position through the string. This is a bit verbose, but if you don't want a lookahead or you want to search over a string more explicitly, you can use the following function.
import re
def find_all_matches(pattern, string, group=0):
    pat = re.compile(pattern)
    pos = 0
    out = []
    while m := pat.search(string, pos):
        pos = m.start() + 1
        out.append(m[group])
    return out
pat = r'all (.*?) are'
s = 'all cats are smarter than dogs, all dogs are dumber than cats'
find_all_matches(pat, s) # ['all cats are', 'all dogs are']
find_all_matches(pat, s, group=1) # ['cats', 'dogs']
This works for overlapping matches too:
find_all_matches(r'(\w\w)', "hello") # ['he', 'el', 'll', 'lo']
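For comparison, the lookahead alternative mentioned above can be sketched as follows. This is only an illustration of the technique on the same "hello" input, not part of the original answer:

```python
import re

# The lookahead (?=...) matches without consuming characters, so the scan
# advances one position at a time and findall() returns the captured group.
pairs = re.findall(r'(?=(\w\w))', 'hello')
print(pairs)  # ['he', 'el', 'll', 'lo']
```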
Outside of the syntax clarification on re.match, I think I am understanding that you are struggling with taking two or more unknown (user input) regex expressions and classifying which is a more 'specific' match against a string.
Recall for a moment that a Python regex really is a type of computer program. Most modern forms, including Python's regex, are based on Perl. Perl's regexes have recursion, backtracking, and other features that defy trivial inspection. Indeed, a rogue regex can be used as a form of denial-of-service attack.
To see this on your own computer, try:
>>> re.match(r'^(a+)+$','a'*24+'!')
That takes about 1 second on my computer. Now increase the 24 in 'a'*24 to a slightly larger number, say 28. That takes a lot longer. Try 48... You will probably need to press CTRL+C now. The running time grows exponentially with the number of a's.
You can read more about this issue in Russ Cox's wonderful paper 'Regular Expression Matching Can Be Simple And Fast'. Russ Cox is the Google engineer who built Google Code Search in 2006. As Cox observes, consider matching the regex 'a?'*33 + 'a'*33 against the string 'a'*99 with awk and Perl (or Python or PCRE or Java or PHP or ...). Awk matches in 200 microseconds, but Perl would require 10^15 years because of exponential backtracking.
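A minimal timing sketch of the experiment described above. The helper name time_failed_match and the chosen sizes are assumptions made for this illustration; the sizes are kept small so the loop finishes quickly, and the absolute times will vary by machine:

```python
import re
import time

def time_failed_match(n):
    """Time re.match on n a's followed by '!', which can never match."""
    subject = 'a' * n + '!'
    start = time.perf_counter()
    result = re.match(r'^(a+)+$', subject)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Each step adds two characters to the input; watch how the time grows.
timings = {}
for n in (12, 14, 16, 18):
    result, elapsed = time_failed_match(n)
    timings[n] = elapsed
    print(f'n={n:2d}  matched={result is not None}  {elapsed:.4f}s')
```

The match always fails, so all of the time is spent backtracking through the ways of splitting the run of a's between the inner and outer +.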
So the conclusion is: it depends! What do you mean by a more specific match? Look at some of Cox's regex simplification techniques in RE2. If your project is big enough to write your own libraries (or use RE2) and you are willing to restrict the regex grammar used (i.e., no backtracking or recursive forms), I think the answer is that you would classify 'a better match' in a variety of ways.
If you are looking for a simple way to state that (regex_3 < regex_1 < regex_2) when matched against some string using Python or Perl's regex language, I think that the answer is it is very very hard (i.e., this problem is NP Complete)
Edit
Everything I said above is true! However, here is a stab at sorting matching regular expressions based on one form of 'specific': how many edits it takes to get from the regex to the string. The greater the number of edits (that is, the higher the Levenshtein distance), the less 'specific' the regex is.
You be the judge if this works (I don't know what 'specific' means to you for your application):
# Python 2 code (print statements, dict.iteritems)
import re
def ld(a,b):
    "Calculates the Levenshtein distance between a and b."
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space
        a,b = b,a
        n,m = m,n
    current = range(n+1)
    for i in range(1,m+1):
        previous, current = current, [i]+[0]*n
        for j in range(1,n+1):
            add, delete = previous[j]+1, current[j-1]+1
            change = previous[j-1]
            if a[j-1] != b[i-1]:
                change = change + 1
            current[j] = min(add, delete, change)
    return current[n]
s='Mary had a little lamb'
d={}
regs=[r'.*', r'Mary', r'lamb', r'little lamb', r'.*little lamb',r'\b\w+mb',
      r'Mary.*little lamb',r'.*[lL]ittle [Ll]amb',r'\blittle\b',s,r'little']
for reg in regs:
    m=re.search(reg,s)
    if m:
        print "'%s' matches '%s' with sub group '%s'" % (reg, s, m.group(0))
        ld1=ld(reg,m.group(0))   # edits from the regex text to the matched substring
        ld2=ld(m.group(0),s)     # edits from the matched substring to the full string
        score=max(ld1,ld2)
        print " %i edits regex->match(0), %i edits match(0)->s" % (ld1,ld2)
        print " score: ", score
        d[reg]=score
        print
    else:
        print "'%s' does not match '%s'" % (reg, s)
print " ===== %s ===== === %s ===" % ('RegEx'.center(10),'Score'.center(10))
for key, value in sorted(d.iteritems(), key=lambda (k,v): (v,k)):
    print " %22s %5s" % (key, value)
The program takes a list of regexes and matches each one against the string Mary had a little lamb.
Here is the sorted ranking from "most specific" to "least specific":
===== RegEx ===== === Score ===
Mary had a little lamb 0
Mary.*little lamb 7
.*little lamb 11
little lamb 11
.*[lL]ittle [Ll]amb 15
\blittle\b 16
little 16
Mary 18
\b\w+mb 18
lamb 18
.* 22
This is based on the (perhaps simplistic) assumption that specificity can be measured by: a) the number of edits (the Levenshtein distance) to get from the regex itself to the matching substring, which reflects wildcard expansions or replacements; and b) the number of edits to get from the matching substring to the initial string. The score is the larger of the two.
As two simple examples:
- .* (or .*.* or .*?.* etc.) against any string requires a large number of edits to get to the string, in fact equal to the string length. This is the maximum possible number of edits, the highest score, and the least 'specific' regex.
- The regex of the string itself against the string is as specific as possible. No edits are needed to change one into the other, resulting in a score of 0, the lowest.
As stated, this is simplistic. Anchors should increase specificity but they do not in this case. Very short strings don't work because the wildcard may be longer than the string.
Edit 2
I got anchor parsing to work pretty darn well using the undocumented sre_parse module in Python. Type >>> help(sre_parse) if you want to read more...
This is the goto worker module underlying the re module. It has been in every Python distribution since 2001 including all the P3k versions. It may go away, but I don't think it is likely...
Here is the revised listing:
# Python 2 code (print statements, dict.iteritems)
import re
import sre_parse
def ld(a,b):
    "Calculates the Levenshtein distance between a and b."
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space
        a,b = b,a
        n,m = m,n
    current = range(n+1)
    for i in range(1,m+1):
        previous, current = current, [i]+[0]*n
        for j in range(1,n+1):
            add, delete = previous[j]+1, current[j-1]+1
            change = previous[j-1]
            if a[j-1] != b[i-1]:
                change = change + 1
            current[j] = min(add, delete, change)
    return current[n]
s='Mary had a little lamb'
d={}
regs=[r'.*', r'Mary', r'lamb', r'little lamb', r'.*little lamb',r'\b\w+mb',
      r'Mary.*little lamb',r'.*[lL]ittle [Ll]amb',r'\blittle\b',s,r'little',
      r'^.*lamb',r'.*.*.*b',r'.*?.*',r'.*\b[lL]ittle\b \b[Ll]amb',
      r'.*\blittle\b \blamb$','^'+s+'$']
for reg in regs:
    m=re.search(reg,s)
    if m:
        ld1=ld(reg,m.group(0))
        ld2=ld(m.group(0),s)
        score=max(ld1,ld2)
        for t, v in sre_parse.parse(reg):
            if t=='at': # anchor...
                # NOTE: as written, this condition is always true ('at_end' is a
                # non-empty string), so every anchor, including \b, subtracts 1
                # here. The scores listed below were produced with this behavior.
                if v=='at_beginning' or 'at_end':
                    score-=1 # ^ or $, adj 1 edit
                if v=='at_boundary': # all other anchors are 2 char
                    score-=2
        d[reg]=score
    else:
        print "'%s' does not match '%s'" % (reg, s)
print
print " ===== %s ===== === %s ===" % ('RegEx'.center(15),'Score'.center(10))
for key, value in sorted(d.iteritems(), key=lambda (k,v): (v,k)):
    print " %27s %5s" % (key, value)
And the sorted regexes:
===== RegEx ===== === Score ===
Mary had a little lamb 0
^Mary had a little lamb$ 0
.*\blittle\b \blamb$ 6
Mary.*little lamb 7
.*\b[lL]ittle\b \b[Ll]amb 10
\blittle\b 10
.*little lamb 11
little lamb 11
.*[lL]ittle [Ll]amb 15
\b\w+mb 15
little 16
^.*lamb 17
Mary 18
lamb 18
.*.*.*b 21
.* 22
.*?.* 22
It depends on what kind of regular expressions you have; as @carrot-top suggests, if you actually aren't dealing with "regular expressions" in the CS sense, and instead have crazy extensions, then you are definitely out of luck.
However, if you do have traditional regular expressions, you might make a bit more progress. First, we could define what "more specific" means. Say R is a regular expression, and L(R) is the language generated by R. Then we might say R1 is more specific than R2 if L(R1) is a (strict) subset of L(R2) (L(R1) < L(R2)). That only gets us so far: in many cases, L(R1) is neither a subset nor a superset of L(R2), and so we might imagine that the two are somehow incomparable. An example, trying to match "mary had a little lamb", we might find two matching expressions: .*mary and lamb.*.
One non-ambiguous solution is to define specificity via implementation. For instance, convert your regular expression in a deterministic (implementation-defined) way to a DFA and simply count states. Unfortunately, this might be relatively opaque to a user.
Indeed, you seem to have an intuitive notion of how you want two regular expressions to compare, specificity-wise. Why not simply write down a definition of specificity, based on the syntax of regular expressions, that matches your intuition reasonably well?
Totally arbitrary rules follow:
- Characters = 1.
- Character ranges of n characters = n (and let's say \b = 5, because I'm not sure how you might choose to write it out long-hand).
- Anchors are 5 each.
- * divides its argument by 2.
- + divides its argument by 2, then adds 1.
- . = -10.
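The rules above can be sketched as a small scoring function. Everything here is an assumption made for illustration: the name specificity, the tokenizer, and the restriction to a tiny regex subset (literals, ., [...] classes counted by their listed characters, ^, $, \b, and a * or + suffix on a single atom). Groups, alternation, ranges such as a-z, and other escapes are not handled.

```python
import re

# Hypothetical tokenizer for a tiny regex subset: one atom (character class,
# \b, ^, $, '.', or any single character) with an optional * or + suffix.
TOKEN = re.compile(r'(\[[^\]]*\]|\\b|\^|\$|\.|.)([*+]?)', re.S)

def specificity(pattern):
    """Score a pattern with the arbitrary rules listed above (higher = more specific)."""
    total = 0.0
    for atom, quant in TOKEN.findall(pattern):
        if atom in ('^', '$', r'\b'):
            value = 5                 # anchors are 5 each
        elif atom == '.':
            value = -10               # the wildcard is heavily penalised
        elif atom.startswith('['):
            value = len(atom) - 2     # class of n listed characters = n
        else:
            value = 1                 # a literal character
        if quant == '*':
            value = value / 2         # * divides its argument by 2
        elif quant == '+':
            value = value / 2 + 1     # + divides by 2, then adds 1
        total += value
    return total

print(specificity(r'.*'))          # -5.0
print(specificity(r'little'))      # 6.0
print(specificity(r'\blittle\b'))  # 16.0
```

Under these rules the whole-string literal scores highest and bare wildcards score lowest, which matches the intuition the rules were meant to capture; the exact weights remain arbitrary.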
Anyway, just food for thought, as the other answers do a good job of outlining some of the issues you're facing; hope it helps.
Update
I condensed this answer into a Python package to make matching as easy as pip install regex-spm:
import regex_spm
match regex_spm.fullmatch_in("abracadabra"):
    case r"\d+": print("It's all digits")
    case r"\D+": print("There are no digits in the search string")
    case _: print("It's something else")
Original answer
As Patrick Artner correctly points out in the other answer, there is currently no official way to do this. Hopefully the feature will be introduced in a future Python version and this question can be retired. Until then:
PEP 634 specifies that Structural Pattern Matching uses the == operator for evaluating a match. We can override that.
import re
from dataclasses import dataclass
# noinspection PyPep8Naming
@dataclass
class regex_in:
    string: str
    def __eq__(self, other: str | re.Pattern):
        if isinstance(other, str):
            other = re.compile(other)
        assert isinstance(other, re.Pattern)
        # TODO extend for search and match variants
        return other.fullmatch(self.string) is not None
Now you can do something like:
match regex_in(validated_string):
    case r'\d+':
        print('Digits')
    case r'\s+':
        print('Whitespaces')
    case _:
        print('Something else')
Caveat #1 is that you can't pass re.compile'd patterns to the case directly, because then Python wants to match based on class. You have to save the pattern somewhere first.
Caveat #2 is that you can't actually use local variables either, because Python then interprets it as a name for capturing the match subject. You need to use a dotted name, e.g. putting the pattern into a class or enum:
class MyPatterns:
    DIGITS = re.compile(r'\d+')
match regex_in(validated_string):
    case MyPatterns.DIGITS:
        print('This works, it\'s all digits')
Groups
This could be extended even further to provide an easy way to access the re.Match object and the groups.
# noinspection PyPep8Naming
@dataclass
class regex_in:
    string: str
    match: re.Match = None
    def __eq__(self, other: str | re.Pattern):
        if isinstance(other, str):
            other = re.compile(other)
        assert isinstance(other, re.Pattern)
        # TODO extend for search and match variants
        self.match = other.fullmatch(self.string)
        return self.match is not None
    def __getitem__(self, group):
        return self.match[group]
# Note the `as m` in the case specification
match regex_in(validated_string):
    case r'\d(\d)' as m:
        print(f'The second digit is {m[1]}')
        print(f'The whole match is {m.match}')
Clean solution
There is a clean solution to this problem. Just hoist the regexes out of the case-clauses where they aren't supported and into the match-clause which supports any Python object.
The combined regex will also give you better efficiency than could be had by having a series of separate regex tests. Also, the regex can be precompiled for maximum efficiency during the match process.
Example
Here's a worked out example for a simple tokenizer:
import re
from collections import namedtuple
pattern = re.compile(r'(\d+\.\d+)|(\d+)|(\w+)|(".*")')
Token = namedtuple('Token', ('kind', 'value', 'position'))
env = {'x': 'hello', 'y': 10}
for s in ['123', '123.45', 'x', 'y', '"goodbye"']:
    mo = pattern.fullmatch(s)
    # lastindex is the number of the group that matched
    match mo.lastindex:
        case 1:
            tok = Token('NUM', float(s), mo.span())
        case 2:
            tok = Token('NUM', int(s), mo.span())
        case 3:
            tok = Token('VAR', env[s], mo.span())
        case 4:
            tok = Token('TEXT', s[1:-1], mo.span())
        case _:
            raise ValueError(f'Unknown pattern for {s!r}')
    print(tok)
This outputs:
Token(kind='NUM', value=123, position=(0, 3))
Token(kind='NUM', value=123.45, position=(0, 6))
Token(kind='VAR', value='hello', position=(0, 1))
Token(kind='VAR', value=10, position=(0, 1))
Token(kind='TEXT', value='goodbye', position=(0, 9))
Better Example
The code can be improved by writing the combined regex in verbose format for intelligibility and ease of adding more cases. It can be further improved by naming the regex sub patterns:
pattern = re.compile(r"""(?x)
(?P<float>\d+\.\d+) |
(?P<int>\d+) |
(?P<variable>\w+) |
(?P<string>".*")
""")
That can be used in a match/case statement like this:
for s in ['123', '123.45', 'x', 'y', '"goodbye"']:
    mo = pattern.fullmatch(s)
    match mo.lastgroup:
        case 'float':
            tok = Token('NUM', float(s), mo.span())
        case 'int':
            tok = Token('NUM', int(s), mo.span())
        case 'variable':
            tok = Token('VAR', env[s], mo.span())
        case 'string':
            tok = Token('TEXT', s[1:-1], mo.span())
        case _:
            raise ValueError(f'Unknown pattern for {s!r}')
    print(tok)
Comparison to if/elif/else
Here is the equivalent code written using an if-elif-else chain:
for s in ['123', '123.45', 'x', 'y', '"goodbye"']:
    if (mo := re.fullmatch(r'\d+\.\d+', s)):
        tok = Token('NUM', float(s), mo.span())
    elif (mo := re.fullmatch(r'\d+', s)):
        tok = Token('NUM', int(s), mo.span())
    elif (mo := re.fullmatch(r'\w+', s)):
        tok = Token('VAR', env[s], mo.span())
    elif (mo := re.fullmatch(r'".*"', s)):
        tok = Token('TEXT', s[1:-1], mo.span())
    else:
        raise ValueError(f'Unknown pattern for {s!r}')
    print(tok)
Compared to the match/case, the if-elif-else chain is slower because it runs multiple regex matches and because there is no precompilation. Also, it is less maintainable without the case names.
Because all the regexes are separate we have to capture all the match objects separately with repeated use of assignment expressions with the walrus operator. This is awkward compared to the match/case example where we only make a single assignment.
import re
pattern = re.compile("^([A-Z][0-9]+)+$")
pattern.match(string)
One-liner: re.match(r"pattern", string) # No need to compile
>>> import re
>>> if re.match(r"hello[0-9]+", 'hello1'):
...     print('Yes')
...
Yes
You can evaluate it as bool if needed
>>> bool(re.match(r"hello[0-9]+", 'hello1'))
True
Are you sure you need a regex? It seems that you only need to know if a word is present in a string, so you can do:
>>> line = 'This,is,a,sample,string'
>>> "sample" in line
True
The r makes the string a raw string, which doesn't process escape characters (however, since there are none in the string, it is actually not needed here).
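A short sketch of where the raw-string prefix does matter, using \b as the example (the sample text is made up for illustration):

```python
import re

# In a normal string literal '\b' is the backspace character (one character);
# in a raw string r'\b' is a backslash followed by 'b' (two characters),
# which the re module interprets as a word boundary.
print(len('\b'), len(r'\b'))  # 1 2

plain = re.search('\bcat\b', 'a cat sat')   # looks for literal backspaces
raw = re.search(r'\bcat\b', 'a cat sat')    # looks for the word 'cat'
print(plain)        # None
print(raw.group())  # cat
```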
Also, re.match only matches at the beginning of the string. In other words, the pattern must match starting at the first character, although it need not consume the whole string. To match text that could be anywhere in the string, use re.search. See a demonstration below:
>>> import re
>>> line = 'This,is,a,sample,string'
>>> re.match("sample", line)
>>> re.search("sample", line)
<_sre.SRE_Match object at 0x021D32C0>
>>>
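The same comparison as a script, with re.fullmatch added for contrast. This is a sketch on the sample line from above; the variable names are chosen only for this example:

```python
import re

line = 'This,is,a,sample,string'

at_start = re.match('This', line)        # matches: 'This' is at position 0
not_at_start = re.match('sample', line)  # None: 'sample' is not at position 0
anywhere = re.search('sample', line)     # matches wherever the pattern occurs
whole = re.fullmatch('This', line)       # None: pattern must cover the whole string

print(at_start.span())   # (0, 4)
print(not_at_start)      # None
print(anywhere.span())   # (10, 16)
print(whole)             # None
```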
You could create a little class that returns the boolean result of calling match, and retains the matched groups for subsequent retrieval:
import re
class REMatcher(object):
    def __init__(self, matchstring):
        self.matchstring = matchstring
    def match(self,regexp):
        self.rematch = re.match(regexp, self.matchstring)
        return bool(self.rematch)
    def group(self,i):
        return self.rematch.group(i)
for statement in ("I love Mary",
                  "Ich liebe Margot",
                  "Je t'aime Marie",
                  "Te amo Maria"):
    m = REMatcher(statement)
    if m.match(r"I love (\w+)"):
        print "He loves",m.group(1)
    elif m.match(r"Ich liebe (\w+)"):
        print "Er liebt",m.group(1)
    elif m.match(r"Je t'aime (\w+)"):
        print "Il aime",m.group(1)
    else:
        print "???"
Update for Python 3 print as a function, and Python 3.8 assignment expressions - no need for a REMatcher class now:
import re
for statement in ("I love Mary",
                  "Ich liebe Margot",
                  "Je t'aime Marie",
                  "Te amo Maria"):
    if m := re.match(r"I love (\w+)", statement):
        print("He loves", m.group(1))
    elif m := re.match(r"Ich liebe (\w+)", statement):
        print("Er liebt", m.group(1))
    elif m := re.match(r"Je t'aime (\w+)", statement):
        print("Il aime", m.group(1))
    else:
        print("???")
Less efficient, but simpler-looking:
m0 = re.match(r"I love (\w+)", statement)
m1 = re.match(r"Ich liebe (\w+)", statement)
m2 = re.match(r"Je t'aime (\w+)", statement)
if m0:
    print("He loves", m0.group(1))
elif m1:
    print("Er liebt", m1.group(1))
elif m2:
    print("Il aime", m2.group(1))
The problem with the Perl stuff is the implicit updating of some hidden variable. That's simply hard to achieve in Python because you need to have an assignment statement to actually update any variables.
The version with less repetition (and better efficiency) is this:
pats = [
    (r"I love (\w+)", "He loves {0}"),
    (r"Ich liebe (\w+)", "Er liebt {0}"),
    (r"Je t'aime (\w+)", "Il aime {0}"),
]
for p1, p3 in pats:
    m = re.match(p1, statement)
    if m:
        print(p3.format(m.group(1)))
        break
A minor variation that some Perl folk prefer:
pats = {
    r"I love (\w+)": "He loves {0}",
    r"Ich liebe (\w+)": "Er liebt {0}",
    r"Je t'aime (\w+)": "Il aime {0}",
}
for p1 in pats:
    m = re.match(p1, statement)
    if m:
        print(pats[p1].format(m.group(1)))
        break
This is hardly worth mentioning except it does come up sometimes from Perl programmers.