I've had a lot of experience running a compiled regex 1000s of times versus compiling on-the-fly, and have not noticed any perceivable difference. Obviously, this is anecdotal, and certainly not a great argument against compiling, but I've found the difference to be negligible.
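A quick way to sanity-check this yourself is a timing sketch like the one below (the pattern and iteration count are arbitrary choices of mine; exact numbers vary by machine and Python version):

```python
import re
import timeit

pattern = r"\d{3}-\d{4}"
text = "555-1234"

compiled = re.compile(pattern)  # compile once up front

# Pre-compiled pattern object, reused every iteration.
t_compiled = timeit.timeit(lambda: compiled.match(text), number=50_000)

# Module-level function: compiles on first use, then hits re's internal cache.
t_module = timeit.timeit(lambda: re.match(pattern, text), number=50_000)

print(f"pre-compiled: {t_compiled:.3f}s  re.match: {t_module:.3f}s")
```

The two timings are usually close, since the only extra work in the second case is the cache lookup.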
EDIT:
After a quick glance at the actual Python 2.5 library code, I see that Python internally compiles AND CACHES regexes whenever you use them anyway (including calls to re.match()), so you're really only changing WHEN the regex gets compiled, and shouldn't be saving much time at all - only the time it takes to check the cache (a key lookup on an internal dict type).
From module re.py (comments are mine):
def match(pattern, string, flags=0):
    return _compile(pattern, flags).match(string)

def _compile(*key):
    # Does cache check at top of function
    cachekey = (type(key[0]),) + key
    p = _cache.get(cachekey)
    if p is not None:
        return p
    # ...
    # Does actual compilation on cache miss
    # ...
    # Caches compiled regex
    if len(_cache) >= _MAXCACHE:
        _cache.clear()
    _cache[cachekey] = p
    return p
I still often pre-compile regular expressions, but only to bind them to a nice, reusable name, not for any expected performance gain.
Answer from Kenan Banks on Stack Overflow.
For me, the biggest benefit to re.compile is being able to separate definition of the regex from its use.
Even a simple expression such as 0|[1-9][0-9]* (integer in base 10 without leading zeros) can be complex enough that you'd rather not have to retype it, check if you made any typos, and later have to recheck if there are typos when you start debugging. Plus, it's nicer to use a variable name such as num or num_b10 than 0|[1-9][0-9]*.
It's certainly possible to store strings and pass them to re.match; however, that's less readable:
num = "..."
# then, much later:
m = re.match(num, input)
Versus compiling:
num = re.compile("...")
# then, much later:
m = num.match(input)
Though the two are fairly close, the last line of the second version reads more naturally and is simpler when the pattern is used repeatedly.
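Putting those two points together with the integer pattern from earlier (the use of fullmatch here is my addition):

```python
import re

# Base-10 integer without leading zeros, bound to a readable name.
num = re.compile(r"0|[1-9][0-9]*")

# fullmatch() succeeds only if the whole string matches the pattern.
print(num.fullmatch("0") is not None)    # True
print(num.fullmatch("42") is not None)   # True
print(num.fullmatch("007") is not None)  # False: leading zero
```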
Can regex be compiled into efficient machine code? (Programming Language Design and Implementation Stack Exchange)
Most languages that have regex have a regex parsing library that interprets the regex at runtime and matches them to strings.
This is mostly right, but "interpret" isn't really accurate. Regexes in most mainstream languages are compiled to a form which is efficient to check. The compilation happens at runtime, but each regex only needs to be compiled once and can be tested many times. For example, in Java you call Pattern.compile, and in Python it's re.compile; but even APIs which accept regexes as strings often cache the compiled form of the regex, so that it only needs to be compiled once for the whole run of the program.
That said, the compiled form of a regex is still generally not quite as fast as a direct compilation to machine code would be.
what would it take for a language to 'compile' regex at compile time to be as efficient as writing an equivalent string matching function by hand?
The classical way to compile a regular expression is to convert it to a deterministic finite automaton (DFA), i.e. a state machine. Compiling a regular expression to a DFA involves a few steps, but here's a typical procedure:
- First build a non-deterministic finite automaton (NFA) using Thompson's construction,
- Then convert the NFA to a DFA using the powerset construction,
- Then, optionally, convert the DFA into a minimal equivalent DFA.
The resulting DFA can then evaluate the regular expression on a string of length n in O(n) time with a low constant factor (potentially just a single array lookup per character in the input string), regardless of how complicated the regular expression is. However, the DFA itself may take up a lot more memory for more complicated regular expressions.
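To sketch what the end product looks like, here is a hand-derived DFA for the pattern 0|[1-9][0-9]* (a base-10 integer without leading zeros); the state numbering and table encoding are my own illustration, not the output of any particular tool:

```python
# Hand-derived DFA for the regex 0|[1-9][0-9]*.
# States: 0 = start, 1 = saw "0" (accepting),
# 2 = saw a non-zero-leading integer (accepting), 3 = dead state.
DEAD = 3
ACCEPTING = {1, 2}

# Transition table: (state, character class) -> next state.
# Any (state, class) pair not listed goes to the dead state.
TABLE = {
    (0, "zero"): 1,
    (0, "nonzero"): 2,
    (2, "zero"): 2,
    (2, "nonzero"): 2,
}

def classify(ch: str) -> str:
    if ch == "0":
        return "zero"
    if ch in "123456789":
        return "nonzero"
    return "other"

def matches(s: str) -> bool:
    state = 0
    for ch in s:  # one table lookup per input character: O(n) overall
        state = TABLE.get((state, classify(ch)), DEAD)
        if state == DEAD:
            return False  # a DFA never backtracks, so we can bail out early
    return state in ACCEPTING

print(matches("0"), matches("42"), matches("007"))  # True True False
```

Note there is no backtracking anywhere: each input character advances the machine exactly one step.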
Why do most languages not do this?
Because the classical approach has two really significant downsides which make it inapplicable for many real regexes. Firstly, it only works for regexes that are truly regular expressions in the formal sense, i.e. they recognise a regular language. But many features of modern regex engines allow regexes which are not true regular expressions, particularly backreferences. Additionally, in the worst case the resulting DFA is exponentially large in the length of the regex, so while the regex is very efficient to execute, it is not at all efficient in terms of code size.
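For instance, a backreference lets a pattern match "the same word twice", which is provably not a regular language, so no DFA can recognize it (the example regex is mine):

```python
import re

# \1 must match exactly the text that group 1 captured at runtime,
# which no fixed finite automaton can keep track of.
repeated = re.compile(r"^(\w+) \1$")

print(repeated.match("hello hello") is not None)  # True
print(repeated.match("hello world") is not None)  # False
```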
That's not to say that there aren't other potential approaches to compiling regular expressions into machine code, without going via NFAs and DFAs. But
- Those other approaches aren't nearly as well-known;
- The current state of the art (i.e. compiling at runtime to a form which enables efficient execution) can be pretty efficient regardless, so there's probably not that much room for improvement;
- Compiling regexes to machine code at runtime requires some way of hooking that machine code into the program while it's executing, and most language implementations don't have a mechanism for this;
- On the other hand, compiling regexes to machine code at compile-time would require the regex implementation to be part of the compiler, rather than part of the standard library, and wouldn't support dynamic construction of regexes at runtime.
Can regex be compiled into efficient machine code?
In general, and absent any context about a specific language? Yes, it can. The deeper question is really about the context (design goals) and costs (working with tradeoffs).
At the library-level, one example is Hana Dusikova's compile-time regular expressions in C++. See also its WG21 proposal paper, GitHub repo, and website with links to various conference talks. It leverages C++'s compile-time facilities. I think it hits close to what you're talking about. From its proposal paper:
The current std::regex design and implementation are slow, mostly because the RE pattern is parsed and compiled at runtime. Users often don't need a runtime RE parser engine as the pattern is known during compilation in many common use cases. I think this breaks C++'s promise of "don't pay for what you don't use." If the RE is known at compile time, the pattern should be checked during the compilation. The design of std::regex doesn't allow for this as the RE input is a runtime string and syntax errors are reported as exceptions.
Notice a couple of things:
- The proposal addresses a core design goal of the language.
- The language had features powerful enough to support an implementation outside the standard library.
What would it take for a language to 'compile' regex at compile time to be as efficient as writing an equivalent string matching function by hand?
The answer is not so black and white. There can be space and time efficiency tradeoffs. Now that you have contextual information at each site that can be used for specialized code generation, how much code do you inline? What do you inline and what don't you inline? What do you extract to common procedures, and what impact does that have on function-call costs? Inlining isn't always best: there can be cases where more code sharing results in better instruction-cache usage. How much code gets generated? In what way does the runtime data cost relate to the inputs? How do all those costs compare to the costs of a generalized regex engine? At what point does generating optimized code for known-at-compile-time regexes become more costly in code size than just having a regex engine in the runtime environment?
If implemented at the library level, (assuming that you care,) you'd need a way for the user to communicate at the library level what they want to optimize for / what tradeoff they want, and then for the language to provide powerful enough facilities for things living at the library level to estimate the space and time costs, and even at that point, if the language specification and language implementations are more separate, you'd hit some boundary in how accurate those estimates can be with respect to what the implementation actually does (Ex. what it compiles).
Why do most languages not do this?
Not all languages prioritize codegen efficiency as a design goal enough to want this. That's a perfectly valid design choice; you really can't be everything. In fact, the general overarching trend in languages over time seems to be to move away from the hardware, giving up some of its benefits in favor of higher-level, more abstract languages and runtimes.
And a lot of runtime models are not really geared towards doing heavy compile-time optimization / specialization. For example, Java bytecode has a limited set of supported instructions, and that can further limit what kinds of code you can generate. At that point, it can become a conversation about adding deeper features to the language or its components (which has its own costs), or just shrugging your shoulders and leaving it to runtime implementations to detect patterns and optimize what they do under the hood (which has its limitations).
Sometimes it's just a matter of nobody caring enough or having enough time to put the work into implementing or specifying it yet. Things don't just magically appear, and a lot of people who work on language design are not doing that as their day-job. Life can get in the way. For languages that aren't designed by just one person, collaborating on things and making decisions as a group has its own challenges as well, such as just meeting at the same time (timezone things), agreeing on the same priorities, dealing with conflicts in effects on different use-cases, etc.
What would be the advantages or disadvantages of having regex as a core language feature as opposed to a library function?
As stated above, you could have the people implementing the language's compilers / interpreter optimizers implement the time-and-space tradeoff optimizations, instead of having something at the library level try to do those optimizations within its limitations and then be unable to propagate that information to the compilers / interpreter optimizers.