Join the list on the pipe character |, which separates alternatives in a regex.
import re
string_lst = ['fun', 'dum', 'sun', 'gum']
x = "I love to have fun."
print(re.findall(r"(?=(" + '|'.join(string_lst) + r"))", x))
Output: ['fun']
You cannot use re.match here, because it only matches at the start of the string.
re.search returns only the first match, so use re.findall instead.
The lookahead also lets findall report overlapping matches that do not start at the same position.
Answer from vks on Stack Overflow
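As a hedged illustration of the overlapping case, the sketch below compares a plain alternation with the lookahead version. The word list, the sample text and the helper name find_overlapping are made up for this example, and re.escape is added on the assumption that the words should be matched literally.

```python
import re

def find_overlapping(words, text):
    """Return every listed word found in text, including overlapping hits."""
    # The capture group sits inside a zero-width lookahead, so no characters
    # are consumed and the scan can restart one character later.
    alternation = "|".join(map(re.escape, words))
    return re.findall("(?=(" + alternation + "))", text)

words = ["fun", "und", "sun"]
text = "fundamental sun"

# A plain alternation consumes "fun", so the overlapping "und" is skipped.
plain = re.findall("|".join(map(re.escape, words)), text)
overlapping = find_overlapping(words, text)
print(plain)
print(overlapping)
```

Note that the lookahead reports at most one word per starting position, namely the first alternative that matches there.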
The regex module has named lists (sets, actually):
#!/usr/bin/env python
import regex as re # $ pip install regex
p = re.compile(r"\L<words>", words=['fun', 'dum', 'sun', 'gum'])
if p.search("I love to have fun."):
    print('matched')
Here words is just a name; you can use anything you like instead.
The .search() method is used instead of adding .* before and after the named list.
To emulate named lists using stdlib's re module:
#!/usr/bin/env python
import re
words = ['fun', 'dum', 'sun', 'gum']
longest_first = sorted(words, key=len, reverse=True)
p = re.compile(r'(?:{})'.format('|'.join(map(re.escape, longest_first))))
if p.search("I love to have fun."):
    print('matched')
re.escape() is used to escape regex meta-characters such as .*? inside individual words (to match the words literally).
sorted() emulates the regex module's behavior by putting the longest words first among the alternatives. Compare:
>>> import re
>>> re.findall("(funny|fun)", "it is funny")
['funny']
>>> re.findall("(fun|funny)", "it is funny")
['fun']
>>> import regex
>>> regex.findall(r"\L<words>", "it is funny", words=['fun', 'funny'])
['funny']
>>> regex.findall(r"\L<words>", "it is funny", words=['funny', 'fun'])
['funny']
The Machine Learning Lab at the University of Trieste, Italy, wrote a web app to solve this exact problem, available at: Automatic Generation of Regular Expressions from Examples based on "genetic programming" and published a paper about it. Based on the fact that it had to come from the Machine Learning Lab of a university in Italy, and was worth publishing a paper about, it is probably a pretty hard problem to solve. Doesn't sound like the type of question for Code Golf SE, but I'm new here, so I wouldn't be the expert on that.
I would approach this problem by searching for common subsequences in each pair of strings in the list, and then building regular expressions that include those subsequences. This would be somewhat like the Re-Pair algorithm, but it would build a grammar from a set of strings instead of one string.
For instance, these strings:
["red wolf", "red fox", "gray wolf", "gray fox"]
could be combined into two regular expressions:
"red (wolf|fox)", "gray (wolf|fox)"
which can be combined into one regular expression:
"(red|gray) (wolf|fox)"
But this problem is probably NP-complete, since it is similar to the smallest grammar problem.
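A much simpler relative of this idea can be sketched with a prefix tree. This is a swapped-in technique, not the pairwise common-subsequence or Re-Pair approach described above: it only factors shared prefixes, so it reaches the intermediate "red (wolf|fox)" form rather than "(red|gray) (wolf|fox)". The function names build_trie and trie_to_regex are made up for this sketch.

```python
import re

def build_trie(words):
    """Store the words in a nested-dict prefix tree; the key "" marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie

def trie_to_regex(node):
    """Turn a prefix tree into a pattern in which common prefixes are shared."""
    branches = [re.escape(ch) + trie_to_regex(child)
                for ch, child in sorted(node.items()) if ch != ""]
    if not branches:
        return ""  # only the end-of-word marker is left
    ends_here = "" in node  # a shorter word stops at this node
    if len(branches) == 1 and not ends_here:
        return branches[0]
    group = "(?:" + "|".join(branches) + ")"
    return group + "?" if ends_here else group

words = ["red wolf", "red fox", "gray wolf", "gray fox"]
pattern = re.compile(trie_to_regex(build_trie(words)))
print(pattern.pattern)
print([bool(pattern.fullmatch(w)) for w in words])
```

Because the optional group is greedy, a list such as ['fun', 'funny'] still prefers the longer word, which matches the longest-first ordering discussed earlier.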
Assuming you're using POSIX regcomp/regexec, each call to regexec will only find the first match in the string. To find the next, use the end position of the first match (the 0th entry of the regmatch_t array filled by regexec) as an offset to apply to the string before searching it again. Repeat until you have no more matches. You can write a function to do this if you want.
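The loop itself is language-independent, so here is a rough sketch of it in Python rather than C, with the search(string, pos) method of a compiled pattern standing in for regexec. The helper name find_all_by_offset and the sample input are made up. One difference to keep in mind: Python reports offsets relative to the whole string, whereas regexec reports them relative to the pointer you pass in, so in C you add the current offset yourself.

```python
import re

def find_all_by_offset(pattern, text):
    """Collect (start, end, matched text) by re-searching from the end of the previous match."""
    compiled = re.compile(pattern)
    matches = []
    offset = 0
    while offset <= len(text):
        m = compiled.search(text, offset)
        if m is None:
            break  # no more matches
        matches.append((m.start(), m.end(), m.group(0)))
        # Resume after this match; step one extra character after an
        # empty match so that the loop always advances.
        offset = m.end() if m.end() > m.start() else m.end() + 1
    return matches

hits = find_all_by_offset(r"[0-9]+", "a1 b22 c333")
print(hits)
```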
The C standard library (as specified by ISO/IEC 9899) does not include a regular expression module, so you will need to use an external library. Good choices include regex.h from GNU libc, as detailed in /questions/635756 and PCRE, as detailed in /questions/1421785.
You probably want to use a regular expression for this if your patterns are going to be complex.
You could either use a proper regular expression as your filter (e.g. for your specific example it would be new Regex(@"^.*_Test\.txt$")) or you could apply a conversion algorithm.
Either way, you could then just use LINQ to apply the regex.
For example:
var myRegex=new Regex(@"^.*_Test\.txt$");
List<string> resultList=files.Where(myRegex.IsMatch).ToList();
Some people may think the above answer is incorrect, but you can use a method group instead of a lambda. If you want the full lambda, you would use:
var myRegex=new Regex(@"^.*_Test\.txt$");
List<string> resultList=files.Where(f => myRegex.IsMatch(f)).ToList();
or, without LINQ:
List<string> resultList=files.FindAll(delegate(string s) { return myRegex.IsMatch(s);});
If you were converting the filter, a simple conversion would be:
var myFilter="*_Test.txt";
var myRegex=new Regex("^" + myFilter.Replace("*",".*") +"$");
You could then also have filters like "*Test*.txt" with this method.
However, if you went down this conversion route you would need to make sure you escaped all the special regular expression characters, e.g. "." becomes @"\.", "(" becomes @"\(", etc.
Edit -- The example replace is TOO simple because it doesn't convert the ".", so it would also find "fish_Testxtxt". Escape at least the ".".
So:
string myFilter="*_Test.txt";
foreach(char x in @"\+?|{[()^$.#") {
    myFilter = myFilter.Replace(x.ToString(),@"\"+x.ToString());
}
Regex myRegex=new Regex(string.Format("^{0}$",myFilter.Replace("*",".*")));
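The same escape-then-convert idea can be sketched in Python (not the C# of this answer): escape the whole filter first, then turn the escaped "*" back into ".*". The helper name wildcard_to_regex and the file names are made up for illustration.

```python
import re

def wildcard_to_regex(wildcard):
    """Turn a filter such as "*_Test.txt" into an anchored regular expression."""
    # re.escape() makes every special character literal, including "*",
    # which comes back as a backslash followed by "*".
    escaped = re.escape(wildcard)
    return re.compile("^" + escaped.replace(r"\*", ".*") + "$")

pattern = wildcard_to_regex("*_Test.txt")
files = ["fish_Test.txt", "fish_Testxtxt", "notes.txt"]
matching = [name for name in files if pattern.match(name)]
print(matching)
```

Escaping before converting means the "." in ".txt" can no longer stand for an arbitrary character, which is exactly the problem described in the edit above.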
Have you tried LINQ:
List<string> resultList = files.Where(x => x.EndsWith("_Test.txt")).ToList();
or if you are running this on some old/legacy .NET version (< 3.5):
List<string> resultList = files.FindAll(delegate(string s) {
    return s.EndsWith("_Test.txt");
});
Since your capture groups explicitly require one character on either side of the common word, the regex looks for space, word, space; when it doesn't find another space, it fails.
In this case, since you don't want to match all the characters that word boundaries would catch (period, apostrophe, etc.), you need a bit of trickery with lookaheads, lookbehinds, and non-capturing groups. Try this:
(?:^|(?<= ))(one|common|word|or|another)(?:(?= )|$)
http://regex101.com/r/cM9hD8
Word boundaries are still simpler to implement, so for reference, you could also do this (though it would also match next to ', ., etc.).
\b(one|common|word|or|another)\b
You can use (?:[\s]|^)(one|common|word|or|another)(?=[\s]|$) instead.
It will not match one's, someone, etc.
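To see the difference between the two approaches, here is a small Python sketch that runs the space-delimited pattern and the \b pattern over the same text. The sample sentence is made up; the word list is the one from the patterns above.

```python
import re

words = ["one", "common", "word", "or", "another"]
alternation = "|".join(map(re.escape, words))

# A word counts only if it is surrounded by spaces or the string edges.
space_delimited = re.compile(r"(?:^|(?<= ))(" + alternation + r")(?:(?= )|$)")
# \b also treats punctuation such as an apostrophe as a boundary.
word_boundary = re.compile(r"\b(" + alternation + r")\b")

text = "one's common word"
strict = space_delimited.findall(text)
loose = word_boundary.findall(text)
print(strict)
print(loose)
```

The \b version picks up the "one" inside "one's", while the lookaround version skips it because the next character is an apostrophe rather than a space.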
**** RESOLVED ****
Hi,
I’m not sure if this is possible:
I’m looking for specific strings that contain an "a" with this regex: (flavour is c# (.net))
([^\s]+?)a([^\s]+?)\b
but they should only match if the found word is part of a list. Some kind of opposite of negative lookbehind.
So the above regex captures all kind of strings with "a" in them, but it should only match if the string is part of
"fass" or "arbecht" as I need to replace the a by some other string.
example: it should match "verfassen" or "verarbeit" but not "passen"
Best regards,
Pascal
Edit: Solution:
These two versions work fine and credits and many thanks go to:
u/gumnos: \b(?=\S*(?:fass|arbeit))(\S*?)a(\S*)\b
u/rainshifter (with some editing to match what I really need): (?<=(?:\b(?=\w*(?:fass|arbeit))|\G(?<!^))\w*)(\S*?)a(\S*)\b
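For readers who want to try the first solution outside .NET, here is a hedged sketch using Python's re module, which accepts u/gumnos's pattern unchanged. The helper name replace_a, the sample text and the replacement "X" are invented for this example; the second pattern relies on \G and a variable-length lookbehind, which Python's re does not support, so it is not shown.

```python
import re

# Pattern from u/gumnos: the lookahead checks that the word contains one of
# the listed fragments before the "a" is captured.
pattern = re.compile(r"\b(?=\S*(?:fass|arbeit))(\S*?)a(\S*)\b")

def replace_a(text, replacement):
    """Replace the first "a" of every word that contains a listed fragment."""
    # A function is used as the replacement so that backslashes or digits
    # in `replacement` are inserted literally.
    return pattern.sub(lambda m: m.group(1) + replacement + m.group(2), text)

result = replace_a("verfassen passen verarbeit", "X")
print(result)
```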
There are quite a lot of regular expression packages, but yours seems to match the one in POSIX: regcomp() etc.
The two structures it defines in <regex.h> are:
regex_t, containing at least size_t re_nsub, the number of parenthesized subexpressions.
regmatch_t, containing at least regoff_t rm_so, the byte offset from start of string to start of substring, and regoff_t rm_eo, the byte offset from start of string of the first character after the end of substring.
Note that 'offsets' are not pointers but indexes into the character array.
The execution function is:
int regexec(const regex_t *restrict preg, const char *restrict string, size_t nmatch, regmatch_t pmatch[restrict], int eflags);
Your printing code should be:
for (int i = 0; i <= r.re_nsub; i++)
{
    int start = m[i].rm_so;
    int finish = m[i].rm_eo;
    // strcpy(matches[ind], ("%.*s\n", (finish - start), p + start)); // Based on question
    sprintf(matches[ind], "%.*s\n", (finish - start), p + start); // More plausible code
    printf("Storing: %.*s\n", (finish - start), matches[ind]); // Print once
    ind++;
    printf("%.*s\n", (finish - start), p + start); // Why print twice?
}
Note that the code should be upgraded to ensure that the string copy (via sprintf()) does not overflow the target string — maybe by using snprintf() instead of sprintf(). It is also a good idea to mark the start and end of a string in the printing. For example:
printf("<<%.*s>>\n", (finish - start), p + start);
This makes it a whole heap easier to see spaces etc.
[In future, please attempt to provide an MCVE (Minimal, Complete, Verifiable Example) or SSCCE (Short, Self-Contained, Correct Example) so that people can help more easily.]
This is an SSCCE that I created, probably in response to another SO question in 2010. It is one of a number of programs I keep that I call 'vignettes'; little programs that show the essence of some feature (such as POSIX regexes, in this case). I find them useful as memory joggers.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <regex.h>
#define tofind "^DAEMONS=\\(([^)]*)\\)[ \t]*$"
int main(int argc, char **argv)
{
    FILE *fp;
    char line[1024];
    int retval = 0;
    regex_t re;
    regmatch_t rm[2];
    //this file has this line "DAEMONS=(sysklogd network sshd !netfs !crond)"
    const char *filename = "/etc/rc.conf";
    if (argc > 1)
        filename = argv[1];
    if (regcomp(&re, tofind, REG_EXTENDED) != 0)
    {
        fprintf(stderr, "Failed to compile regex '%s'\n", tofind);
        return EXIT_FAILURE;
    }
    printf("Regex: %s\n", tofind);
    printf("Number of captured expressions: %zu\n", re.re_nsub);
    fp = fopen(filename, "r");
    if (fp == 0)
    {
        fprintf(stderr, "Failed to open file %s (%d: %s)\n", filename, errno, strerror(errno));
        return EXIT_FAILURE;
    }
    while ((fgets(line, 1024, fp)) != NULL)
    {
        line[strcspn(line, "\n")] = '\0';
        if ((retval = regexec(&re, line, 2, rm, 0)) == 0)
        {
            printf("<<%s>>\n", line);
            // Complete match
            printf("Line: <<%.*s>>\n", (int)(rm[0].rm_eo - rm[0].rm_so), line + rm[0].rm_so);
            // Match captured in (...) - the \( and \) match literal parenthesis
            printf("Text: <<%.*s>>\n", (int)(rm[1].rm_eo - rm[1].rm_so), line + rm[1].rm_so);
            char *src = line + rm[1].rm_so;
            char *end = line + rm[1].rm_eo;
            while (src < end)
            {
                size_t len = strcspn(src, " ");
                if (src + len > end)
                    len = end - src;
                printf("Name: <<%.*s>>\n", (int)len, src);
                src += len;
                src += strspn(src, " ");
            }
        }
    }
    return EXIT_SUCCESS;
}
This was designed to find a particular line starting DAEMONS= in a file /etc/rc.conf (but you can specify an alternative file name on the command line). You can adapt it to your purposes easily enough.
Since g++ regex is bugged until who knows when, you can use my code instead (License: AGPL, no warranty, your own risk, ...)
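#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/**
 * regexp (License: AGPL3 or higher)
 * @param re extended POSIX regular expression
 * @param nmatch maximum number of matches
 * @param str string to match
 * @return An array of nmatch + 1 char pointers, or NULL on failure or if there is no match. You have to free() the first element (string storage) and the array itself. The second element is the string matching the full regex, then come the submatches.
 */
char **regexp(char *re, int nmatch, char *str) {
    char **result = NULL;
    char *storage, *out;
    regex_t regex;
    regmatch_t *match;
    size_t total = 1; /* never ask malloc() for zero bytes */
    int i;
    match = malloc(nmatch * sizeof(*match));
    if (!match) {
        fprintf(stderr, "Out of memory !");
        return NULL;
    }
    if (regcomp(&regex, re, REG_EXTENDED) != 0) {
        fprintf(stderr, "Failed to compile regex '%s'\n", re);
        free(match);
        return NULL;
    }
    if (regexec(&regex, str, nmatch, match, 0)) {
#ifdef DEBUG
        fprintf(stderr, "String '%s' does not match regex '%s'\n", str, re);
#endif
        goto cleanup;
    }
    /* Each match gets its own copy: terminating matches in place would cut off a match that encloses a shorter submatch. */
    for (i = 0; i < nmatch; ++i)
        if (match[i].rm_so >= 0)
            total += match[i].rm_eo - match[i].rm_so + 1;
    storage = malloc(total);
    result = malloc((nmatch + 1) * sizeof(*result));
    if (!storage || !result) {
        fprintf(stderr, "Out of memory !");
        free(storage);
        free(result);
        result = NULL;
        goto cleanup;
    }
    result[0] = storage;
    out = storage;
    for (i = 0; i < nmatch; ++i) {
        if (match[i].rm_so >= 0) {
            size_t len = match[i].rm_eo - match[i].rm_so;
            memcpy(out, str + match[i].rm_so, len);
            out[len] = 0;
            result[i + 1] = out;
#ifdef DEBUG
            printf("%s\n", out);
#endif
            out += len + 1;
        } else {
            result[i + 1] = "";
        }
    }
cleanup:
    regfree(&regex);
    free(match);
    return result;
}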
I assume your regex_match is some combination of regcomp and regexec. To enable grouping, you need to call regcomp with the REG_EXTENDED flag, but without the REG_NOSUB flag (in the third argument).
regex_t compiled;
regcomp(&compiled, "(match1)|(match2)|(match3)", REG_EXTENDED);
Then allocate space for the groups. The number of groups is stored in compiled.re_nsub. Pass this number to regexec:
size_t ngroups = compiled.re_nsub + 1;
regmatch_t *groups = malloc(ngroups * sizeof(regmatch_t));
regexec(&compiled, str, ngroups, groups, 0);
Now, the first invalid group is the one with a -1 value in both its rm_so and rm_eo fields:
size_t nmatched;
for (nmatched = 0; nmatched < ngroups; nmatched++)
    if (groups[nmatched].rm_so == -1) /* rm_so is a signed regoff_t */
        break;
nmatched is the number of parenthesized subexpressions (groups) matched. Add your own error checking.
You could have them give you an array of strings containing your regexps and test each one of them.
//count is the number of regexps provided
int give_me_number_of_regex_group(const char *needle, const char **regexps, int count){
    for(int i = 0; i < count; ++i){
        if(regex_match(needle, regexps[i])){
            return i;
        }
    }
    return -1; //didn't match any
}
Or am I overlooking something?
Regular expressions actually aren't part of ANSI C. It sounds like you might be talking about the POSIX regular expression library, which comes with most (all?) *nixes. Here's an example of using POSIX regexes in C (based on this):
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
    regex_t regex;
    int reti;
    char msgbuf[100];
    /* Compile regular expression */
    reti = regcomp(&regex, "^a[[:alnum:]]", 0);
    if (reti) {
        fprintf(stderr, "Could not compile regex\n");
        exit(1);
    }
    /* Execute regular expression */
    reti = regexec(&regex, "abc", 0, NULL, 0);
    if (!reti) {
        puts("Match");
    }
    else if (reti == REG_NOMATCH) {
        puts("No match");
    }
    else {
        regerror(reti, &regex, msgbuf, sizeof(msgbuf));
        fprintf(stderr, "Regex match failed: %s\n", msgbuf);
        exit(1);
    }
    /* Free memory allocated to the pattern buffer by regcomp() */
    regfree(&regex);
    return 0;
}
Alternatively, you may want to check out PCRE, a library for Perl-compatible regular expressions in C. The Perl syntax is pretty much that same syntax used in Java, Python, and a number of other languages. The POSIX syntax is the syntax used by grep, sed, vi, etc.
This is an example of using REG_EXTENDED. This regular expression
"^(-)?([0-9]+)((,|\\.)([0-9]+))?\n$"
lets you match decimal numbers written with either the Spanish (comma) or the international (period) decimal separator. :)
#include <regex.h>
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char const *argv[])
{
    regex_t regex;
    char msgbuf[100];
    /* Compile the regular expression once, before the loop; the dot is escaped so that it only matches a literal period */
    int reti = regcomp(&regex, "^(-)?([0-9]+)((,|\\.)([0-9]+))?\n$", REG_EXTENDED);
    if (reti) {
        fprintf(stderr, "Could not compile regex\n");
        exit(1);
    }
    /* Read lines until end of input */
    while (fgets(msgbuf, sizeof(msgbuf), stdin) != NULL) {
        /* Execute regular expression */
        printf("%s\n", msgbuf);
        reti = regexec(&regex, msgbuf, 0, NULL, 0);
        if (!reti) {
            puts("Match");
        }
        else if (reti == REG_NOMATCH) {
            puts("No match");
        }
        else {
            regerror(reti, &regex, msgbuf, sizeof(msgbuf));
            fprintf(stderr, "Regex match failed: %s\n", msgbuf);
            exit(1);
        }
    }
    /* Free memory allocated to the pattern buffer by regcomp() */
    regfree(&regex);
    return 0;
}