I have to build a RegExp object that searches for words from an array and matches whole words only.
e.g. I have a words array (‘יל’,’ילד’), and I want the RegExp to find ‘a’ or ‘יל’ or ‘ילד’, but not ‘ילדד’.
This is my code:
var text = 'ילד ילדדד יל';
var matchWords = ['יל','ילד'];
text = text.replace(/\n$/g, '\n\n').replace(new RegExp('\\b(' + matchWords.join('|') + ')\\b','g'), '<mark>$&</mark>');
console.log(text);
What I have tried:
I tried this code:
new RegExp('(יל|ילד)','g');
It works, but it also finds words like “ילדדדד”. I need to match whole words only.
I also tried this code:
new RegExp('\\b(יל|ילד)\\b','g');
but this regular expression doesn’t find any word!
How should I build my RegExp?
Answer
The word boundary \b is not Unicode aware. Use XRegExp to build a Unicode word boundary:
var text = 'ילד ילדדד יל';
var matchWords = ['יל','ילד'];
// Backslashes are doubled because the pattern is built from a string literal
var re = XRegExp('(^|[^_0-9\\pL])(' + matchWords.join('|') + ')(?![_0-9\\pL])','ig');
text = XRegExp.replace(text.replace(/\n$/g, '\n\n'), re, '$1<mark>$2</mark>');
console.log(text);
<script src="http://cdnjs.cloudflare.com/ajax/libs/xregexp/3.1.1/xregexp-all.min.js"></script>
Here, (^|[^_0-9\pL]) is a capturing group with ID=1 that matches either the start of the string or any char other than a Unicode letter, ASCII digit or _ (a leading word boundary), and (?![_0-9\pL]) fails the match if the word is followed by _, an ASCII digit or a Unicode letter.
In environments that support the ECMAScript 2018+ standard, you can use the native u flag and \p{L} instead:
let text = 'ילד ילדדד יל';
const matchWords = ['יל','ילד'];
// Backslashes are doubled because the pattern is built from a string literal
const re = new RegExp('(^|[^_0-9\\p{L}])(' + matchWords.join('|') + ')(?![_0-9\\p{L}])','igu');
text = text.replace(re, '$1<mark>$2</mark>');
console.log(text);
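If the word list can come from user input, it may be worth escaping regex metacharacters before joining the words. The sketch below wraps the same ES2018 pattern in a small function; `escapeRegExp` and `markWholeWords` are hypothetical helper names, not part of any library, and the code assumes a runtime that supports the `u` flag with `\p{L}`.

```javascript
// Hypothetical helper: escape regex metacharacters so each word is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: wrap whole-word occurrences of `words` in <mark> tags.
// A "word character" here is a Unicode letter, an ASCII digit or "_",
// the same definition the pattern above uses.
function markWholeWords(text, words) {
  const alternation = words.map(escapeRegExp).join('|');
  const re = new RegExp(
    '(^|[^_0-9\\p{L}])(' + alternation + ')(?![_0-9\\p{L}])',
    'giu'
  );
  // $1 puts back the boundary character consumed by the first group.
  return text.replace(re, '$1<mark>$2</mark>');
}

console.log(markWholeWords('ילד ילדדד יל', ['יל', 'ילד']));
// → <mark>ילד</mark> ילדדד <mark>יל</mark>
```

The order of the words in the array does not matter here: when the shorter alternative is followed by another letter, the lookahead fails and the engine backtracks to the longer one.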
Another ECMAScript 2018+ compliant solution that fully emulates a Unicode-aware \b construct is explained at "Replace certain arabic words in text string using Javascript".
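One way to approximate a Unicode-aware `\b` without consuming the preceding character is to pair a lookbehind with a lookahead. This is only a sketch and not necessarily the approach used in the linked answer. It assumes that "word characters" are Unicode letters, marks, digits and `_`, that every search word starts and ends with such a character, and that the runtime supports ES2018 lookbehind. `unicodeWordRegex` is a hypothetical name.

```javascript
// Assumption: this class stands in for \w in a Unicode-aware sense.
const WORD_CHAR = '[\\p{L}\\p{M}\\p{N}_]';

// Hypothetical helper: build a regex that matches any of `words` as whole words.
function unicodeWordRegex(words) {
  const body = words
    .map(w => w.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')) // match words literally
    .join('|');
  // (?<!...) : no word character immediately before the match
  // (?!...)  : no word character immediately after the match
  return new RegExp('(?<!' + WORD_CHAR + ')(?:' + body + ')(?!' + WORD_CHAR + ')', 'gu');
}

const wordRe = unicodeWordRegex(['יל', 'ילד']);
console.log('ילד ילדדד יל'.replace(wordRe, '<mark>$&</mark>'));
// → <mark>ילד</mark> ילדדד <mark>יל</mark>
```

Because neither assertion consumes characters, the replacement can use `$&` directly and no boundary character has to be restored.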