I have to build a RegExp object that searches for words from an array and matches whole words only.
E.g. I have a word array ['יל', 'ילד'], and I want the RegExp to find ‘יל’ or ‘ילד’, but not ‘ילדד’.
This is my code:
var text = 'ילד ילדדד יל';
var matchWords = ['יל','ילד'];
text = text.replace(/\n$/g, '\n\n').replace(new RegExp('\\b(' + matchWords.join('|') + ')\\b','g'), '<mark>$&</mark>');
console.log(text);
What I have tried:
I tried this code:
new RegExp('(יל|ילד)','g');
It works, but it also matches inside longer words like “ילדדדד”. I need to match whole words only.
I tried also this code:
new RegExp('\\b(יל|ילד)\\b','g');
but this regular expression doesn’t find any word!
How should I build my RegExp?
Answer
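The word boundary \b is not Unicode-aware: in JavaScript it is defined in terms of \w, which only covers [A-Za-z0-9_], so it never sees a boundary next to a Hebrew letter. Use XRegExp to build a Unicode word boundary: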
var text = 'ילד ילדדד יל';
var matchWords = ['יל','ילד'];
var re = XRegExp('(^|[^_0-9\\pL])(' + matchWords.join('|') + ')(?![_0-9\\pL])','ig');
text = XRegExp.replace(text.replace(/\n$/g, '\n\n'), re, '$1<mark>$2</mark>');
console.log(text);
<script src="http://cdnjs.cloudflare.com/ajax/libs/xregexp/3.1.1/xregexp-all.min.js"></script>
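Here, (^|[^_0-9\pL]) is capturing group 1. It matches either the start of the string or any char other than a Unicode letter, an ASCII digit or _ (a leading word boundary). (?![_0-9\pL]) is a negative lookahead that fails the match if the word is followed by _, an ASCII digit or a Unicode letter. Since group 1 consumes the char before the word, the replacement puts it back with $1 and wraps only group 2 (the word itself) in the mark tag.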
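To see the difference between the two kinds of boundary in isolation, here is a small sketch in plain JavaScript. It assumes an engine with Unicode property escape support (ES2018+), and the variable names are only illustrative:

```javascript
// \w is [A-Za-z0-9_] only, so Hebrew letters count as "non-word" chars
// and \b finds no boundary around them.
const asciiBoundary = /\bיל\b/.test('יל');
const latinBoundary = /\bkid\b/.test('a kid here');

// Emulated boundary: start of string or a non-word char before the word,
// and no letter, ASCII digit or _ right after it.
const unicodeBoundary = /(^|[^_0-9\p{L}])יל(?![_0-9\p{L}])/u;
const wholeWord = unicodeBoundary.test('ילד יל');
const insideWord = unicodeBoundary.test('ילדדד');

console.log(asciiBoundary, latinBoundary, wholeWord, insideWord);
// prints: false true true false
```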
In environments that support ECMAScript 2018+, you can do the same with a native RegExp and the u flag:
let text = 'ילד ילדדד יל';
const matchWords = ['יל','ילד'];
const re = new RegExp('(^|[^_0-9\\p{L}])(' + matchWords.join('|') + ')(?![_0-9\\p{L}])','igu');
text = text.replace(re, '$1<mark>$2</mark>');
console.log(text);
Another ECMAScript 2018+ compliant solution that fully emulates a Unicode-aware \b construct is explained at Replace certain arabic words in text string using Javascript.
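If lookbehind is available (it is also part of ES2018), the leading boundary can be written as a lookbehind so that nothing before the word is consumed and no $1 is needed. The following is only a minimal sketch under that assumption, and not necessarily the pattern used in the linked answer. escapeRegExp and markWholeWords are hypothetical helper names, and the escaping step is an extra precaution in case a word contains regex metacharacters:

```javascript
// Hypothetical helper: escape regex metacharacters so a word is matched literally.
function escapeRegExp(word) {
  return word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: wrap whole-word matches in <mark>.
// A "word char" here is _, an ASCII digit or any Unicode letter,
// the same definition as in the patterns above.
function markWholeWords(text, words) {
  const alternation = words.map(escapeRegExp).join('|');
  const re = new RegExp(
    '(?<![_0-9\\p{L}])(?:' + alternation + ')(?![_0-9\\p{L}])',
    'giu'
  );
  return text.replace(re, '<mark>$&</mark>');
}

const marked = markWholeWords('ילד ילדדד יל', ['יל', 'ילד']);
console.log(marked);
// prints: <mark>ילד</mark> ילדדד <mark>יל</mark>
```

Both lookarounds are zero-width, so $& is exactly the matched word and adjacent words separated by a single space are still found.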