Find words from array in string, whole words only (with Hebrew characters)

I have to build a RegExp object that searches for words from an array and matches whole words only.

e.g. I have a words array (‘יל’,’ילד’), and I want the RegExp to find ‘a’ or ‘יל’ or ‘ילד’, but not ‘ילדד’.

This is my code:

var text = 'ילד ילדדד יל';
var matchWords = ['יל','ילד'];
text = text.replace(/\n$/g, '\n\n').replace(new RegExp('\\b(' + matchWords.join('|') + ')\\b','g'), '<mark>$&</mark>');
console.log(text);

What I have tried:

I tried this code:

new RegExp('(יל|ילד)','g');

It works, but it also finds words like “ילדדדד”; I need to match whole words only.

I tried also this code:

new RegExp('\\b(יל|ילד)\\b','g');

but this regular expression doesn’t find any words at all!

How should I build my RegExp?

Answer

The word boundary \b is not Unicode-aware. Use XRegExp to build a Unicode word boundary:

var text = 'ילד ילדדד יל';
var matchWords = ['יל','ילד'];
var re = XRegExp('(^|[^_0-9\\pL])(' + matchWords.join('|') + ')(?![_0-9\\pL])','ig');
text = XRegExp.replace(text.replace(/\n$/g, '\n\n'), re, '$1<mark>$2</mark>');
console.log(text);
<script src="http://cdnjs.cloudflare.com/ajax/libs/xregexp/3.1.1/xregexp-all.min.js"></script>

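Here, (^|[^_0-9\pL]) is capturing group 1, which matches either the start of the string or any character other than a Unicode letter, an ASCII digit or _ (a leading word boundary), and (?![_0-9\pL]) fails the match if the word is followed by _, an ASCII digit or a Unicode letter. Because group 1 consumes the character before the word, the replacement puts it back with $1 and wraps only group 2 in <mark>.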
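
The reason the \b version in the question finds nothing can be checked directly: \b is defined relative to \w, which in JavaScript covers only [A-Za-z0-9_], so a boundary is never detected next to a Hebrew letter. A minimal sketch (the variable names are illustrative, not from the original code):

```javascript
// \b only sees a boundary between an ASCII word char [A-Za-z0-9_] and anything else.
const asciiHit = /\bcat\b/.test('a cat sat');

// Hebrew letters and spaces are all "non-word" for \b, so no boundary exists anywhere.
const hebrewHit = /\bילד\b/.test('ילד ילדדד יל');

console.log(asciiHit, hebrewHit); // true false
```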

In environments that support ECMAScript 2018+, you can use Unicode property escapes with the u flag instead:

let text = 'ילד ילדדד יל';
const matchWords = ['יל','ילד'];
const re = new RegExp('(^|[^_0-9\\p{L}])(' + matchWords.join('|') + ')(?![_0-9\\p{L}])','igu');
text = text.replace(re, '$1<mark>$2</mark>');
console.log(text);

Another ECMAScript 2018+ compliant solution that fully emulates a Unicode-aware \b construct is explained at “Replace certain arabic words in text string using Javascript”.
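
If the word list can contain regex metacharacters (for example, when it comes from user input), joining it with | unescaped may produce a broken or unintended pattern. The following is a sketch of the ES2018 approach wrapped in a helper that escapes each word first; escapeRegExp and markWholeWords are hypothetical names introduced here, not part of the original answer.

```javascript
// Hypothetical helper: escape characters that have special meaning in a regex.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: wrap whole-word occurrences of `words` in <mark> tags.
// A "word character" is assumed to be _, an ASCII digit or any Unicode letter.
function markWholeWords(text, words) {
  const alternation = words.map(escapeRegExp).join('|');
  const re = new RegExp(
    '(^|[^_0-9\\p{L}])(' + alternation + ')(?![_0-9\\p{L}])',
    'giu'
  );
  // $1 restores the character consumed by the leading group.
  return text.replace(re, '$1<mark>$2</mark>');
}

const result = markWholeWords('ילד ילדדד יל', ['יל', 'ילד']);
console.log(result); // <mark>ילד</mark> ילדדד <mark>יל</mark>
```

The middle word is left untouched because the negative lookahead rejects both alternatives when another letter follows.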
