It seems that this bug is related to backtracking. It occurs when a capture group is repeated, and the capture group matches but the pattern after the group doesn't.
An example:
>>> regex.sub(r'(?:(\d{1,3})x)+', r'\1', '123x5')
'5'
For reference, the standard re module gives the expected output:
>>> re.sub(r'(?:(\d{1,3})x)+', r'\1', '123x5')
'1235'
In the first iteration, the capture group (\d{1,3}) consumes the first 3 digits, and x consumes the following "x" character. Then, because of the +, a second iteration is attempted. This time, (\d{1,3}) matches "5", but the x fails to match, so the engine has to back out of that iteration. However, the capture group's value is now reset to the empty string instead of keeping the expected 123.
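To make the expected behavior concrete, here is a small check that uses only the standard re module (so it does not reproduce the bug, it only shows what a correct engine reports). After the failed second iteration, group 1 should still hold the value from the last successful iteration:

```python
import re

# The second iteration starts at "5", fails on the missing "x",
# and the engine falls back to the result of the first iteration.
m = re.match(r'(?:(\d{1,3})x)+', '123x5')

print(m.group(0))  # overall match: '123x'
print(m.group(1))  # capture from the last successful iteration: '123'
```

Running the same two lines against an affected regex release is a quick way to tell whether your installation has the problem.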
As a workaround, we can prevent the capture group from matching in the iteration that is going to fail. In this case, changing it to (\d{2,3}) is enough to bypass the bug (because it no longer matches "5"):
>>> regex.sub(r'(?:(\d{2,3})x)+', r'\1', '123x5')
'1235'
As for the pattern in question, we can use a lookahead assertion; we change (\w{1,3}) to (?=\w{1,3}(?:-|\.\.))(\w{1,3}):
>>> pattern = r"(?i)\b((?=\w{1,3}(?:-|\.\.))(\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})"
>>> regex.sub(pattern, substitute, content)
'"Erm....yes. T-Thank you for that."'
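The same lookahead guard can be tried on the simpler digit example. This is only a sketch: the guarded pattern below is my own adaptation of the idea, not something from the question, and it is run with the standard re module, so it only confirms that the guard does not change the result on a correct engine:

```python
import re

# The lookahead (?=\d{1,3}x) only lets the capture group start
# when digits followed by "x" are actually present, so the group
# is never entered for the trailing "5".
guarded = r'(?:(?=\d{1,3}x)(\d{1,3})x)+'

result = re.sub(guarded, r'\1', '123x5')
print(result)  # '1235'
```

The point of the guard is that the capture group never records a tentative value in an iteration that cannot succeed, so there is nothing to restore afterwards.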
Answer from Aran-Fey on Stack Overflow
Edit: the bug is resolved as of regex 2017.04.23.
I just tested in Python 3.6.1, and the original pattern now works the same in re and regex.
Original workaround: you can use the lazy quantifier +? instead. Note that this is a different regex, and it behaves differently from the original pattern in edge cases such as T...Tha....Thank:
pattern = r"(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+?(\2\w{2,})"
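To see how the lazy variant differs on that edge case, here is a short comparison using the standard re module. The variable names are just for illustration:

```python
import re

greedy = r"(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})"
lazy = r"(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+?(\2\w{2,})"
text = "T...Tha....Thank"

greedy_match = re.search(greedy, text)
lazy_match = re.search(lazy, text)

# Greedy repeats the stammer group as often as possible,
# so the final word is the full "Thank".
print(greedy_match.group(0), greedy_match.group(4))  # T...Tha....Thank Thank

# Lazy stops after the first repetition that lets the tail match,
# so it treats "Tha" as the final word.
print(lazy_match.group(0), lazy_match.group(4))  # T...Tha Tha
```

So the lazy pattern avoids the failed extra iteration, but it also ends the match early whenever an intermediate stammer is long enough to pass as the final word.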
The bug in 2017.04.05 was due to backtracking, something like this:
The unsuccessful longer match leaves an empty \2 group. Conceptually, that failure should trigger backtracking to the shorter match, where the nested group is not empty. Instead, regex seems to "optimize": it does not recompute the shorter match from scratch but reuses some cached values, and forgets to undo the update of the nested match groups.
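The "forgetting to undo" idea can be illustrated with a toy matcher. To be clear, this is not how the regex module is implemented; match_repeated and its restore_on_failure flag are hypothetical names, and the matcher is hard-coded for (?:(\d{1,3})x)+ anchored at the start of the string. It only shows why a capture needs to be restored when an iteration is abandoned:

```python
def match_repeated(text, restore_on_failure=True):
    # Toy matcher for (?:(\d{1,3})x)+ anchored at position 0.
    # Returns (end_of_match, group1), or None if no iteration succeeds.
    pos = 0
    group1 = None
    iterations = 0
    while True:
        saved = group1  # snapshot taken before trying one more iteration
        # Greedy \d{1,3}: count up to three digits at the current position.
        n = 0
        while n < 3 and pos + n < len(text) and text[pos + n].isdigit():
            n += 1
        matched = False
        # Try the longest digit run first, then shorter ones (backtracking).
        for k in range(n, 0, -1):
            group1 = text[pos:pos + k]  # the capture is updated tentatively
            if text[pos + k:pos + k + 1] == "x":
                pos += k + 1
                matched = True
                break
        if not matched:
            if restore_on_failure:
                group1 = saved  # undo the tentative capture
            break
        iterations += 1
    return (pos, group1) if iterations else None


print(match_repeated("123x5"))                            # (4, '123')
print(match_repeated("123x5", restore_on_failure=False))  # (4, '5')
```

In this toy the stale value is "5", whereas the real bug left an empty string, so the details clearly differ. The shape of the mistake is the same, though: the overall match is right, but the group reflects an iteration that was thrown away.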
For example, greedy matching of ((\w{1,3})(\.{2,10})){1,3} first attempts 3 repetitions, then backtracks to fewer:
import re
import regex
content = '"Erm....yes. T..T...Thank you for that."'
base_pattern_template = r'((\w{1,3})(\.{2,10})){%s}'
test_cases = ['1,3', '3', '2', '1']
for tc in test_cases:
    pattern = base_pattern_template % tc
    expected = re.findall(pattern, content)
    actual = regex.findall(pattern, content)
    # TODO: convert to test case, e.g. in pytest
    # assert str(expected) == str(actual), '{}\nexpected: {}\nactual: {}'.format(tc, expected, actual)
    print('expected:', tc, expected)
    print('actual: ', tc, actual)
Output (with the affected regex 2017.04.05):
expected: 1,3 [('Erm....', 'Erm', '....'), ('T...', 'T', '...')]
actual: 1,3 [('Erm....', '', '....'), ('T...', '', '...')]
expected: 3 []
actual: 3 []
expected: 2 [('T...', 'T', '...')]
actual: 2 [('T...', 'T', '...')]
expected: 1 [('Erm....', 'Erm', '....'), ('T..', 'T', '..'), ('T...', 'T', '...')]
actual: 1 [('Erm....', 'Erm', '....'), ('T..', 'T', '..'), ('T...', 'T', '...')]
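Regarding the TODO above, one possible way to turn the loop into a reusable check is sketched below. The names CONTENT, TEMPLATE and mismatches are made up for this example, and the regex import is guarded because it is a third-party package that may not be installed:

```python
import re

CONTENT = '"Erm....yes. T..T...Thank you for that."'
TEMPLATE = r'((\w{1,3})(\.{2,10})){%s}'


def mismatches(findall_a, findall_b, quantifiers=('1,3', '3', '2', '1')):
    # Return the quantifiers for which the two findall callables disagree.
    bad = []
    for q in quantifiers:
        pattern = TEMPLATE % q
        if findall_a(pattern, CONTENT) != findall_b(pattern, CONTENT):
            bad.append(q)
    return bad


try:
    import regex  # third-party; optional for this sketch
except ImportError:
    regex = None

if regex is not None:
    # An empty list means regex agrees with re for every quantifier.
    print(mismatches(re.findall, regex.findall))
```

In a pytest file, the body of a test could simply be assert mismatches(re.findall, regex.findall) == []. Going by the output above, the affected release would be expected to report the '1,3' case.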
I've created a Python module for constructing Regex patterns in a more computer programming-familiar way, so you don't have to re-learn Regex each time you use it!
There does not yet exist a separate documentation page with specific instructions on how to use each class of the module, though all classes are sufficiently documented. There also exists a small example within the repo's README file to get the hang of it.
Here is the link to the repo: https://github.com/manoss96/pregex
Any feedback is welcome!
UPDATE: Thank you all for your comments and feedback; I hope this package helps you get the job done faster! A lot of comments mentioned that having to import everything separately is annoying, and I can understand that. I still think the classes should remain separated into different modules, since each module expresses a different functionality, but I also don't think that importing everything at once is a good thing, so I tried a different approach. All the modules that you'll need are now imported within the package's "__init__.py", each under a short alias. For instance, "quantifiers.py" is imported as "qu". Thus, you can simply write "from pregex import *" at the top of your .py script and then use these aliases. Just be careful: this only works in pregex version >=1.0.2.