Don't forget about names like:
- Mathias d'Arras
- Martin Luther King, Jr.
- Hector Sausage-Hausen
This should do the trick for most things:
/^[a-z ,.'-]+$/i
OR Support international names with super sweet unicode:
/^[a-zA-ZàáâäãåąčćęèéêëėįìíîïłńòóôöõøùúûüųūÿýżźñçčšžæÀÁÂÄÃÅĄĆČĖĘÈÉÊËÌÍÎÏĮŁŃÒÓÔÖÕØÙÚÛÜŲŪŸÝŻŹÑßÇŒÆČŠŽ∂ð ,.'-]+$/u
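A quick way to see where the first pattern stops working is to run it over a few names. This is only a sketch: `isAsciiName` and the extra sample names are made up for illustration and are not part of the answer above.

```javascript
// The ASCII-only pattern from above: letters, space, comma, period,
// apostrophe and hyphen, case-insensitive.
const asciiName = /^[a-z ,.'-]+$/i;

// Hypothetical helper, named here only for the example.
function isAsciiName(name) {
  return asciiName.test(name);
}

const samples = [
  "Mathias d'Arras",
  "Martin Luther King, Jr.",
  "Hector Sausage-Hausen",
  "João", // contains a non-ASCII letter
  "山田", // contains no ASCII letters at all
];

for (const name of samples) {
  console.log(name, isAsciiName(name));
}
```

The three names listed above pass, while the last two samples are rejected, which is the gap the second pattern tries to close.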
You are making false assumptions about the format of first and last names. It is probably better not to validate the name at all, apart from checking that it is not empty.
I sympathize with the need to constrain input in this situation, but I don't believe it is possible - Unicode is vast, expanding, and so is the subset used in names throughout the world.
Unlike email, there's no universally agreed-upon standard for the names people may use, or even which representations they may register as official with their respective governments. I suspect that any regex will eventually fail to pass a name considered valid by someone, somewhere in the world.
Of course, you do need to sanitize or escape input, to avoid the Little Bobby Tables problem. And there may be other constraints on which input you allow as well, such as the underlying systems used to store, render or manipulate names. As such, I recommend that you determine first the restrictions necessitated by the system your validation belongs to, and create a validation expression based on those alone. This may still cause inconvenience in some scenarios, but they should be rare.
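As a rough illustration of validating against system constraints only, here is a sketch. The specific constraints (a 100 code point limit, no control characters) and the name `checkName` are assumptions made up for the example; the real limits would come from whatever stores or renders the name.

```javascript
// Hypothetical check: enforces only what the (assumed) storage layer needs,
// not what a name is "supposed" to look like.
function checkName(input, maxCodePoints = 100) {
  // Normalize so that visually identical names compare equal, then trim.
  const name = input.normalize("NFC").trim();
  if (name.length === 0) {
    return { ok: false, reason: "empty" };
  }
  // Control characters (Unicode category Cc) tend to break rendering and logs.
  if (/\p{Cc}/u.test(name)) {
    return { ok: false, reason: "control character" };
  }
  // Count code points rather than UTF-16 units.
  if ([...name].length > maxCodePoints) {
    return { ok: false, reason: "too long" };
  }
  return { ok: true, name };
}

console.log(checkName("  João "));
console.log(checkName("   "));
console.log(checkName("<xss>"));
```

Note that `<xss>` passes this check on purpose: in this sketch, escaping is left to the layer that outputs the value rather than to validation.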
I'll try to give a proper answer myself:
The only punctuation marks that should be allowed in a name are the full stop, apostrophe and hyphen. I haven't seen any other case in the list of corner cases.
Regarding numbers, there's only one case with an 8. I think I can safely disallow that.
Regarding letters, any letter is valid.
I also want to include space.
This would sum up to this regex:
^[\p{L} \.'\-]+$
This presents one problem, i.e. the apostrophe can be used as an attack vector. It should be encoded.
So the validation code should be something like this (untested):
var name = nameParam.Trim();
if (!Regex.IsMatch(name, @"^[\p{L} \.'\-]+$"))
    throw new ArgumentException("nameParam");
name = name.Replace("'", "&#39;"); // &apos; does not work in IE
Can anyone think of a reason why a name should not pass this test or a XSS or SQL Injection that could pass?
Complete tested solution:
using System;
using System.Text.RegularExpressions;

namespace test
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            var names = new string[]{"Hello World",
                "John",
                "João",
                "タロウ",
                "やまだ",
                "山田",
                "先生",
                "мыхаыл",
                "Θεοκλεια",
                "आकाङ्क्षा",
                "علاء الدين",
                "אַבְרָהָם",
                "മലയാളം",
                "상",
                "D'Addario",
                "John-Doe",
                "P.A.M.",
                "' --",
                "<xss>",
                "\""
            };
            foreach (var nameParam in names)
            {
                Console.Write(nameParam+" ");
                var name = nameParam.Trim();
                // Letters and combining marks from any script, plus apostrophe, space, period, hyphen.
                if (!Regex.IsMatch(name, @"^[\p{L}\p{M}' \.\-]+$"))
                {
                    Console.WriteLine("fail");
                    continue;
                }
                // Encode the apostrophe as a numeric HTML entity.
                name = name.Replace("'", "&#39;");
                Console.WriteLine(name);
            }
        }
    }
}
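The apostrophe encoding can also be done as a general escaping step at output time instead of being tied to validation. A minimal sketch, assuming the name ends up in HTML; `escapeHtml` and `HTML_ESCAPES` are hypothetical names for this example, and SQL injection is a separate concern usually handled with parameterized queries.

```javascript
// Map of the characters that are significant in HTML text and attributes.
const HTML_ESCAPES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Hypothetical helper: replaces each significant character with its entity.
function escapeHtml(text) {
  return text.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

console.log(escapeHtml("D'Addario"));
console.log(escapeHtml("<xss>"));
```

With escaping applied wherever the value is rendered, the validation pattern no longer has to double as a security filter.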
Yes! Let's validate some names with RegEx.
After all, we know that all people must have a first and last name, right? And no single person has more than three or four names total? And no doubt the same person will forever be identifiable by the same name?
Plus, we know that no modern culture uses patronymic naming and people in the same nuclear family must have the same last name, right?
Well, we can at least assume that people do not have single character names, right? And there are no names that use special characters, symbols, or apostrophes?
I think your choice of RegEx to validate names is missing the point: this is a huge unwieldy problem and, even if you massively restrict the scope of names you allow, you will forever suffer the risk of false negatives and you will be turning away people from other cultures and languages. In other words, I don't think that even attempting to validate names is worth your time.
The reason it's breaking on names like McGowan is that your second character class doesn't allow uppercase characters.
Use the regex below to match names with capital letters after the first character.
([A-Z][a-zA-Z]*)
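To illustrate the difference, here is a small sketch. The question's original pattern is not shown here, so `lowercaseTail` is only an assumed stand-in for it, and both patterns are anchored for the comparison.

```javascript
// Assumed shape of the failing pattern: uppercase first, lowercase only after.
const lowercaseTail = /^[A-Z][a-z]*$/;
// The pattern from this answer, anchored so the whole string must match.
const mixedTail = /^[A-Z][a-zA-Z]*$/;

for (const name of ["Smith", "McGowan", "mcgowan"]) {
  console.log(name, lowercaseTail.test(name), mixedTail.test(name));
}
```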
A pattern to recognize variable declarations in C. Looking at a conventional declaration, we see:
int variable;
If that's the case, one should test for the type keyword before anything, to avoid matching something else, like a string or a constant defined with the preprocessor
(?:\w+\s+)([a-zA-Z_][a-zA-Z0-9]+)
The variable name is captured in \1.
The feature you need is look-behind/look-ahead.
UPDATE July 11 2015
The previous regex fails to match some variables with an _ in the middle. It also assumes variable names of two or more characters. To fix both, add _ to the second character class of the capture group and change + to *. This is how it looks after the fix:
(?:\w+\s+)([a-zA-Z_][a-zA-Z0-9_]*)
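The effect of the fix can be checked with a short script. This is a sketch only; `capture` is a hypothetical helper and the sample declarations are made up for the example.

```javascript
// Pattern before the update: no underscore in the tail, and at least
// two characters required.
const before = /(?:\w+\s+)([a-zA-Z_][a-zA-Z0-9]+)/;
// Pattern after the update.
const after = /(?:\w+\s+)([a-zA-Z_][a-zA-Z0-9_]*)/;

// Hypothetical helper: returns capture group 1, or null when nothing matches.
function capture(pattern, source) {
  const m = source.match(pattern);
  return m ? m[1] : null;
}

for (const line of ["int variable;", "int my_var;", "int x;"]) {
  console.log(line, capture(before, line), capture(after, line));
}
```

The old pattern cuts `my_var` off at the underscore and does not match the one-character name at all, while the updated one captures both in full.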
However, this regular expression has many false positives, goto jump; being one of them. Frankly, it is not suitable for the job. Because of that, I decided to create another regex that covers a wider range of cases. It is far from perfect, but here it is:
\b(?:(?:auto\s*|const\s*|unsigned\s*|signed\s*|register\s*|volatile\s*|static\s*|void\s*|short\s*|long\s*|char\s*|int\s*|float\s*|double\s*|_Bool\s*|complex\s*)+)(?:\s+\*?\*?\s*)([a-zA-Z_][a-zA-Z0-9_]*)\s*[\[;,=)]
I've tested this regex with Ruby, Python and JavaScript and it works very well for the common cases, however it fails in some cases. Also, the regex may need some optimizations, though it is hard to do optimizations while maintaining portability across several regex engines.
Test summary
unsignedchar *var; /* FAIL, matches +var+ because the \s* after each keyword lets keywords run together */
goto **label; /* OK, doesn't match */
int function(); /* OK, doesn't match */
char **a_pointer_to_a_pointer; /* OK, matches +a_pointer_to_a_pointer+ */
register unsigned char *variable; /* OK, matches +variable+ */
long long factorial(int n) /* OK, matches +n+ */
int main(int argc, int *argv[]) /* OK, matches +argc+ and +argv+ (needs two passes) */
const * char var; /* OK, matches +var+, however, it doesn't consider +const *+ as part of the declaration */
int i=0, j=0; /* 50%, matches +i+ but it will not match j after the first pass */
int (*functionPtr)(int,int); /* FAIL, doesn't match (too complex) */
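A few of these cases can be reproduced with a short JavaScript sketch. `declaredNames` is a hypothetical helper written for this example; it uses the global flag with `matchAll`, so results for lines with several declarations may differ from a tool that stops at the first match.

```javascript
// The pattern from above, compiled with the global flag so that matchAll
// can scan a whole line in one call.
const DECLARATION = /\b(?:(?:auto\s*|const\s*|unsigned\s*|signed\s*|register\s*|volatile\s*|static\s*|void\s*|short\s*|long\s*|char\s*|int\s*|float\s*|double\s*|_Bool\s*|complex\s*)+)(?:\s+\*?\*?\s*)([a-zA-Z_][a-zA-Z0-9_]*)\s*[\[;,=)]/g;

// Hypothetical helper: returns every name captured by group 1.
function declaredNames(source) {
  return Array.from(source.matchAll(DECLARATION), (m) => m[1]);
}

const lines = [
  "char **a_pointer_to_a_pointer;",
  "goto **label;",
  "int function();",
  "int i=0, j=0;",
];

for (const line of lines) {
  console.log(line, declaredNames(line));
}
```

For `int i=0, j=0;` only `i` is captured: the pattern requires a type keyword directly before each name, and `j` has none.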
False positives
The following case is hard to cover with a portable regular expression; text editors use context to avoid highlighting text inside quotes.
printf("int i=%d", i); /* FAIL, matches i inside quotes */
False positives (syntax errors)
This can be fixed if one tests the syntax of the source file before applying the regular expression. With GCC and Clang one can just pass the -fsyntax-only flag to test the syntax of a source file without compiling it.
int char variable; /* matches +variable+ */
[a-zA-Z_][a-zA-Z0-9_]{0,31}
This validates variable names such as "m_name" and limits them to 32 characters.
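When the pattern is used for validation rather than searching, it needs anchors, otherwise any string that merely contains an identifier will pass. A sketch, with `isIdentifier` as a hypothetical helper name:

```javascript
// Same character classes as above, anchored so the whole string must match:
// one leading letter or underscore, then up to 31 more word characters.
const IDENTIFIER = /^[a-zA-Z_][a-zA-Z0-9_]{0,31}$/;

// Hypothetical helper for the example.
function isIdentifier(text) {
  return IDENTIFIER.test(text);
}

for (const candidate of ["m_name", "_tmp", "9lives", "a".repeat(33)]) {
  console.log(candidate.length, candidate, isIdentifier(candidate));
}
```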
You can wrap the 2nd and 3rd character class in an optional non capture group
^a-zA-Z0-9?$
This will match all criteria:
^\w{1,2}$|^\w+(['\-_]+\w*)+\w+$
First we handle the edge cases of 1-character and 2-character strings, which can only contain word characters (\w, i.e. letters, digits, and underscore).
The second part matches all strings that start with a word character, contain at least one special character, and end with a word character. The (['\-_]+\w*) block matches multiple, not necessarily consecutive, special characters in the string (e.g. a-a-a-a-a-a---aa).
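The two branches can be exercised with a few inputs. This is a sketch; `matchesCriteria` is a hypothetical name and the sample strings, other than the one quoted above, are made up.

```javascript
// The pattern from this answer, unchanged.
const CRITERIA = /^\w{1,2}$|^\w+(['\-_]+\w*)+\w+$/;

// Hypothetical helper for the example.
function matchesCriteria(text) {
  return CRITERIA.test(text);
}

const inputs = ["a", "o'neil", "a-a-a-a-a-a---aa", "abc", "-ab", "ab-"];
for (const text of inputs) {
  console.log(text, matchesCriteria(text));
}
```

Note that a plain string of three or more word characters such as `abc` is rejected, because the second branch requires at least one of the special characters.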