findall just returns the captured groups:
>>> re.findall('abc(de)fg(123)', 'abcdefg123 and again abcdefg123')
[('de', '123'), ('de', '123')]
Relevant doc excerpt:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
Answer from Eli Bendersky on Stack Overflow.
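The doc excerpt describes three return shapes depending on how many groups the pattern has. A minimal sketch of all three, reusing the sample string from the answer (the variable names are just for illustration):

```python
import re

text = "abcdefg123 and again abcdefg123"

# No groups: each item is the full match.
no_groups = re.findall(r"abcdefg123", text)

# One group: each item is that group's string.
one_group = re.findall(r"abc(de)fg123", text)

# Two groups: each item is a tuple holding both groups.
two_groups = re.findall(r"abc(de)fg(123)", text)

print(no_groups)   # ['abcdefg123', 'abcdefg123']
print(one_group)   # ['de', 'de']
print(two_groups)  # [('de', '123'), ('de', '123')]
```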
Use groups freely. The matches will be returned as a list of group-tuples:
>>> re.findall('(1(23))45', '12345')
[('123', '23')]
If you want the full match to be included, just enclose the entire regex in a group:
>>> re.findall('(1(23)45)', '12345')
[('12345', '23')]
hey all, curious what would need to be done to have a regex with a capture group show all of the matches.
import re
var = 'Agent Alice and Agent Bob'
ourRegex = re.compile(r'Agent (\w)\w*')
print(ourRegex.findall(var))
This will output ['A', 'B'] and not 'Agent Alice' and 'Agent Bob'.
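Two common ways to get the whole match back, sketched with the same string. The first makes the group non-capturing so findall falls back to returning full matches; the second keeps the capture group and reads both pieces from the match objects that finditer yields. The variable names are illustrative only.

```python
import re

var = "Agent Alice and Agent Bob"

# Option 1: a non-capturing group (?:...) means findall returns whole matches.
full_matches = re.findall(r"Agent (?:\w)\w*", var)

# Option 2: keep the capture group; group(0) is the whole match, group(1) the capture.
pairs = [(m.group(0), m.group(1)) for m in re.finditer(r"Agent (\w)\w*", var)]

print(full_matches)  # ['Agent Alice', 'Agent Bob']
print(pairs)         # [('Agent Alice', 'A'), ('Agent Bob', 'B')]
```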
Take 3, based on a further clarification of the OP's intent in this comment.
Ashwin is correct that findall does not preserve named capture groups (e.g. (?P<name>regex)). finditer to the rescue! It returns the individual match objects one-by-one. Simple example:
import re
data = """34% passed 23% failed 46% deferred"""
# Each match object exposes its named groups via group(name).
for m in re.finditer(r'(?P<percentage>\w+)%\s(?P<word>\w+)', data):
    print(m.group('percentage'), m.group('word'))
As you've identified in your second example, re.findall returns the groups in the original order.
The problem is that, before Python 3.7, the standard dict type did not guarantee any key order. The manual for Python 2.x makes this explicit (https://docs.python.org/2/library/stdtypes.html#dict.items). From Python 3.7 onward, dict preserves insertion order as part of the language, so a plain dict is enough there.
On older versions, use collections.OrderedDict instead:
import re
from collections import OrderedDict as odict
data = """34% passed 23% failed 46% deferred"""
result = odict((key, value) for value, key in re.findall(r'(\w+)%\s(\w+)', data))
print(result)
Output:
OrderedDict([('passed', '34'), ('failed', '23'), ('deferred', '46')])
Notice that you must use the pairwise constructor form (dict((k,v) for k,v in ...)) rather than the dict comprehension form ({k:v for k,v in ...}). That's because the latter constructs a plain dict, which on Python versions before 3.7 cannot be converted to OrderedDict without losing the order of the keys... which is of course what you are trying to preserve in the first place.
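On Python 3.7 and later the built-in dict keeps insertion order, so the same result can be had without OrderedDict. A minimal sketch, assuming that interpreter version and reusing the sample data from the answer:

```python
import re

data = """34% passed 23% failed 46% deferred"""

# Assumes Python 3.7+, where dict preserves insertion order.
# findall yields (percentage, word) tuples; the word becomes the key.
result = {word: pct for pct, word in re.findall(r"(\w+)%\s(\w+)", data)}

print(list(result.items()))
# [('passed', '34'), ('failed', '23'), ('deferred', '46')]
```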
Using Pattern.finditer() then Match.groupdict():
>>> import re
>>> s = "bob sue jon richard harry"
>>> r = re.compile(r'(?P<name>[a-z]+)\s+(?P<name2>[a-z]+)')
>>> [m.groupdict() for m in r.finditer(s)]
[{'name': 'bob', 'name2': 'sue'}, {'name': 'jon', 'name2': 'richard'}]
You could switch to finditer:
>>> import re
>>> text = "bob sue jon richard harry"
>>> pat = re.compile(r'(?P<name>[a-z]+)\s+(?P<name2>[a-z]+)')
>>> for m in pat.finditer(text):
...     print(m.groupdict())
...
{'name': 'bob', 'name2': 'sue'}
{'name': 'jon', 'name2': 'richard'}
With re.findall()
Example:
import re
s = "-ab-cde-fghi-jkl-mn"
re.findall(r'[a-z]+', s)
Output:
['ab', 'cde', 'fghi', 'jkl', 'mn']
In .NET this works the way you want by default, because a repeated group keeps every capture.
Python's re module does not support this, though. The closest behavior you can get in Python is to repeat the match on the captured substring:
>>> match = re.match(r"(?P<all>(?:-(?P<one>\w+))*)","-ab-cde-fghi-jkl-mn")
>>> re.findall(r"-(?P<one>\w+)", match.group("all"))
['ab', 'cde', 'fghi', 'jkl', 'mn']
It could get complicated if the inner pattern is not extremely simple.
You seem to be asking whether a regex can have a variable number of groups. Based on a quick Google search, the answer appears to be no: the regex will match the full pattern, but only the last value is recorded when the same group matches repeatedly.
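A small sketch of that behavior with Python's re module. The pipe-delimited input string is an assumed example, not taken from the question:

```python
import re

# Hypothetical input: one name followed by a variable number of |-separated values.
line = "ccc|500ms|1s"

# The second group sits inside a repeated non-capturing group.
m = re.match(r"(\w+)(?:\|(\w+))*", line)

print(m.group(0))  # ccc|500ms|1s  (the full pattern matched)
print(m.groups())  # ('ccc', '1s') (only the last repetition was kept)
```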
Consider simply doing s.split('|') and then running whatever checks are necessary on each of the substrings instead.
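A minimal sketch of the split approach, assuming the input is a multi-line string with one pipe-delimited record per line (the sample text and the name rows are illustrative):

```python
# Assumed sample input: one record per line, fields separated by '|'.
s = """aaa
bbb|30s
ccc|500ms|1s"""

# Split into lines first, then split each line on the pipe character.
rows = [line.split("|") for line in s.splitlines()]

print(rows)
# [['aaa'], ['bbb', '30s'], ['ccc', '500ms', '1s']]
```

Each row has as many fields as the line actually contains, so any per-field validation can run on the resulting lists.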
import re
s = '''aaa
bbb|30s
ccc|500ms|1s'''
print(re.findall(r'(\w+)\|?(\w+)?\|?(\w+)?', s))
Output:
[('aaa', '', ''), ('bbb', '30s', ''), ('ccc', '500ms', '1s')]