I’ve looked on Stack Overflow (replacing characters, how JavaScript doesn’t follow the Unicode standard concerning RegExp, etc.) and haven’t really found a concrete answer to the question “How can JavaScript match accented characters (those with diacritical marks)?”
I’m forcing a field in a UI to match the format last_name, first_name (last [comma space] first), and I want to provide support for diacritics, but evidently in JavaScript it’s a bit more difficult than in other languages/platforms.
This was my original version, until I wanted to add diacritic support:
/^[a-zA-Z]+,\s[a-zA-Z]+$/
Currently I’m debating one of three methods to add support, all of which I have tested and work (at least to some extent, I don’t really know what the “extent” is of the second approach). Here they are:
Explicitly listing all accented characters that I would want to accept as valid (lame and overly-complicated):
var accentedCharacters = "àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ";
// Build the full regex (the backslash must be doubled inside a string literal)
var regex = "^[a-zA-Z" + accentedCharacters + "]+,\\s[a-zA-Z" + accentedCharacters + "]+$";
// Create a RegExp from the string version
var regexCompiled = new RegExp(regex);
// regexCompiled = /^[a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+,\s[a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+$/
- This correctly matches a last/first name with any of the supported accented characters in accentedCharacters.
My other approach was to use the . character class, to have a simpler expression:
var regex = /^.+,\s.+$/;
- This would match just about anything, at least in the form of: something, something. That’s alright I suppose…
The last approach, which I just found might be simpler…
/^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/
- It matches a range of Unicode characters – tested and working, though I didn’t try anything crazy, just the normal stuff I see in our language department for faculty member names.
Here are my concerns:
The first solution is far too limiting, and sloppy and convoluted at that. It would need to be changed if I forgot a character or two, and that’s just not very practical.
The second solution is better and more concise, but it probably matches far more than it actually should. I couldn’t find any real documentation on exactly what . matches, just the generalization of “any character except the newline character” (from a table on the MDN).
The third solution seems to be the most precise, but are there any gotchas? I’m not very familiar with Unicode, at least in practice, but looking at a code table/continuation of that table, \u00C0-\u017F seems to be pretty solid, at least for my expected input.
- Faculty won’t be submitting forms with their names in their native language (e.g., Arabic, Chinese, Japanese, etc.), so I don’t have to worry about out-of-Latin-character-set characters
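The concerns about the second and third patterns can be probed with a few inputs. This is only a sketch: the sample strings are made up for illustration, and the accented letters are written as escapes so the result does not depend on how the file is encoded. One gotcha it surfaces is that \u00C0-\u017F also contains the symbols × (U+00D7) and ÷ (U+00F7).

```javascript
// Second approach: "." matches any character except line terminators.
const dotPattern = /^.+,\s.+$/;
// Third approach: ASCII letters plus the range U+00C0..U+017F.
const rangePattern = /^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/;

// Illustrative inputs (assumptions, not real data).
const samples = [
  "M\u00FCller, Zo\u00EB", // Müller, Zoë
  "123, !!!",              // digits and punctuation
  "Pe\u00D7a, Jo\u00F7e",  // contains × and ÷, which fall inside the range
  "O'Brien, Se\u00E1n",    // apostrophe is not in the character class
];

const results = samples.map((input) => ({
  input,
  dot: dotPattern.test(input),
  range: rangePattern.test(input),
}));

console.log(results);
```

The dot pattern accepts all four inputs, including "123, !!!". The range pattern rejects the digits and the apostrophe but still accepts the × and ÷ input.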
Which of these three approaches is most suited for the task? Or are there better solutions?
Answer
The easier way to accept all accents is this:
[A-zÀ-ú] // accepts lowercase and uppercase characters
[A-zÀ-ÿ] // as above, but including letters with an umlaut (includes [ \ ] ^ _ ` × ÷)
[A-Za-zÀ-ÿ] // as above, but not including [ \ ] ^ _ `
[A-Za-zÀ-ÖØ-öø-ÿ] // as above, but not including [ \ ] ^ _ ` × ÷
See Unicode Character Table for characters listed in numeric order.
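As a rough sketch of how the last character class could be dropped into the "last, first" format from the question: the names `LETTERS`, `latin1Name`, `anyLetterName`, and `isLastFirst` below are hypothetical and not part of the original answer. The second pattern is an alternative that assumes an engine with Unicode property escapes (`\p{L}` with the `u` flag, ES2018 and later), and the `normalize("NFC")` call is an added assumption so that a letter typed as a base character plus a combining accent is treated like its precomposed form.

```javascript
// Character class from the answer: Latin letters without [ \ ] ^ _ ` × ÷
const LETTERS = "A-Za-zÀ-ÖØ-öø-ÿ";
const latin1Name = new RegExp(`^[${LETTERS}]+,\\s[${LETTERS}]+$`);

// Alternative (assumes ES2018+): any Unicode letter or combining mark.
const anyLetterName = /^[\p{L}\p{M}]+,\s[\p{L}\p{M}]+$/u;

// Hypothetical helper: checks "last, first" against the chosen pattern.
function isLastFirst(value, pattern = latin1Name) {
  // NFC turns "e" + U+0301 into the single code point "é" where one exists.
  return pattern.test(value.normalize("NFC"));
}

console.log(isLastFirst("Muñoz, José"));                    // name within À-ÿ
console.log(isLastFirst("Dvořák, Antonín"));                // ř is U+0159, outside À-ÿ
console.log(isLastFirst("Dvořák, Antonín", anyLetterName)); // matched via \p{L}
```

The explicit range stays predictable and works in older engines, while the `\p{L}` variant also covers letters beyond U+00FF at the cost of accepting non-Latin scripts.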