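I’m forcing a field in a UI to match the format: last name, a comma, a whitespace character, then the first name.
This was my original version, until I wanted to add diacritic support:
    var regex = /^[a-zA-Z]+,\s[a-zA-Z]+$/;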
Currently I’m debating one of three methods to add support, all of which I have tested and which work (at least to some extent; I don’t really know what the “extent” of the second approach is). Here they are:
Explicitly listing all accented characters that I would want to accept as valid (lame and overly-complicated):
    var accentedCharacters = "àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ";
    // Build the full regex (the backslash is doubled because this is a string literal)
    var regex = "^[a-zA-Z" + accentedCharacters + "]+,\\s[a-zA-Z" + accentedCharacters + "]+$";
    // Create a RegExp from the string version
    var regexCompiled = new RegExp(regex);
    // regexCompiled = /^[a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+,\s[a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+$/
- This correctly matches a last/first name with any of the supported accented characters in accentedCharacters.
My other approach was to use the . character class, to have a simpler expression:
    var regex = /^.+,\s.+$/;
- This would match just about anything, at least in the form of: something, something. That’s alright I suppose…
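To get a feel for how permissive the . version is, here is a rough sketch that runs it against a few made-up inputs. The sample strings and the names dotPattern, samples and dotResults are my own placeholders, not anything from the actual form:

```javascript
// Sketch only: the permissive pattern from the second approach.
const dotPattern = /^.+,\s.+$/;

// Made-up inputs; only the first one looks like a real name.
const samples = ["Núñez, José", "12345, !!!", "a, b, c, d", ", , ,"];

// Every sample is accepted, because . matches any character except
// line terminators, including digits, punctuation and more commas.
const dotResults = samples.map((s) => dotPattern.test(s));
console.log(dotResults);

// The pattern still requires a whitespace character after a comma.
console.log(dotPattern.test("Smith,John"));
```

So the second approach checks the "something, something" shape and little else.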
The last approach, which I just found, might be simpler:
    var regex = /^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/;
- It matches a range of Unicode characters – tested and working, though I didn’t try anything crazy, just the normal stuff I see in our language department for faculty member names.
Here are my concerns:
- The first solution is far too limiting, and sloppy and convoluted at that. It would need to be changed if I forgot a character or two, and that’s just not very practical.
- The second solution is better and more concise, but it probably matches far more than it actually should. I couldn’t find any real documentation on exactly what . matches, just the generalization of “any character except the newline character” (from a table on the MDN).
- The third solution seems to be the most precise, but are there any gotchas? I’m not very familiar with Unicode, at least in practice, but looking at a code table/continuation of that table, \u00C0-\u017F seems to be pretty solid, at least for my expected input.
- Faculty won’t be submitting forms with their names in their native language (e.g., Arabic, Chinese, Japanese, etc.) so I don’t have to worry about out-of-Latin-character-set characters
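On the "any gotchas?" point, a few can be checked directly. The following is a minimal sketch, assuming the third approach uses the \u00C0-\u017F range in the same name pattern as before; the variable names and sample names are invented for illustration:

```javascript
// Assumed form of the third approach: Basic Latin letters plus U+00C0..U+017F.
const rangePattern = /^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/;

// Gotcha 1: the range also contains the math signs × (U+00D7) and ÷ (U+00F7).
const acceptsMathSigns = rangePattern.test("a\u00D7b, c\u00F7d");

// Gotcha 2: a decomposed accent (base letter + combining mark such as U+0301)
// is outside the range, so the same visible name can fail to match.
const nfdName = "Núñez, José".normalize("NFD");
const matchesDecomposed = rangePattern.test(nfdName);

// Normalizing to NFC first recomposes the letters into single code points.
const matchesAfterNfc = rangePattern.test(nfdName.normalize("NFC"));

// Gotcha 3: apostrophes and hyphens inside names are not in the class at all.
const matchesPunctuatedName = rangePattern.test("O'Brien, Mary-Ann");

console.log({ acceptsMathSigns, matchesDecomposed, matchesAfterNfc, matchesPunctuatedName });
```

Whether any of these matter depends on where the input comes from; text pasted from some systems arrives decomposed, so calling normalize("NFC") before testing is a cheap precaution.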
Which of these three approaches is most suited for the task? Or are there better solutions?
The easier way to accept all accents is this:
    [A-zÀ-ú]           // accepts lowercase and uppercase characters
    [A-zÀ-ÿ]           // as above, but including letters with an umlaut (includes [ \ ] ^ _ ` × ÷)
    [A-Za-zÀ-ÿ]        // as above, but not including [ \ ] ^ _ `
    [A-Za-zÀ-ÖØ-öø-ÿ]  // as above, but not including [ \ ] ^ _ ` × ÷
See https://unicode-table.com/en/ for characters listed in numeric order.
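As a rough sketch of how the last character class could slot into the last-name, first-name check from the question (the names letters and namePattern, and the sample inputs, are placeholders I made up):

```javascript
// Letters only: A-Z, a-z and the Latin-1 letters À..ÿ,
// skipping × (U+00D7) and ÷ (U+00F7).
const letters = "A-Za-zÀ-ÖØ-öø-ÿ";

// The backslash is doubled because the pattern is built from a string.
const namePattern = new RegExp("^[" + letters + "]+,\\s[" + letters + "]+$");

// Accented Latin-1 letters are accepted.
console.log(namePattern.test("Müller, Zoë"));

// The multiplication sign is rejected.
console.log(namePattern.test("a\u00D7b, c"));

// Letters beyond ÿ (U+00FF) are rejected too: Š is U+0160 and ć is U+0107.
console.log(namePattern.test("Šimić, Ana"));
```

Note that this class stops at U+00FF, so letters such as Ÿ, œ, Š or ć would need the range extended (for example up to \u017F, as in the question's third approach).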