
How to find singular in the plural when some letters change? What is the best approach?

How can I find the singular in the plural when some letters change?

Consider the following situation:

  • The German word Schließfach is a lockbox.
  • The plural is Schließfächer.

As you can see, the letter a has changed to ä. Because of this, the first word is no longer a substring of the second one; they are “regex-technically” different.

Maybe the tags I chose below are off the mark, and regex is not the right tool for this. I’ve seen that naturaljs (natural.NounInflector()) provides this functionality out of the box for English words. Are there similar solutions for German?

What is the best approach for finding the singular within the plural in German?


Answer

I once had to build a text processor that parsed many languages, in registers ranging from very casual to very formal. One of the things it had to identify was whether certain words were related (like a noun in a title that was related to a list of things, sometimes labeled with a plural form).

IIRC, 70-90% of singular and plural word forms across all the languages we supported had a “Levenshtein distance” (the minimum number of single-character insertions, deletions, or substitutions needed to turn one word into the other) of less than 3 or 4. (Eventually several dictionaries were added to improve accuracy, because “distance” alone produced many false positives.) Another interesting finding was that the longer the words, the more likely a distance of 3 or fewer meant a relationship in meaning.

Here’s an example of the libraries we used:

const fastLevenshtein = require('fast-levenshtein');

console.log('Raw distances:');
console.log('Score 1:', fastLevenshtein.get('Schließfächer', 'Schließfach'));
// -> 3
console.log('Score 2:', fastLevenshtein.get('Blumtach', 'Blumtächer'));
// -> 3
console.log('Score 3:', fastLevenshtein.get('schließfächer', 'Schliessfaech'));
// -> 7
console.log('Score 4:', fastLevenshtein.get('not-it', 'Schliessfaech'));
// -> 12
console.log('Score 5:', fastLevenshtein.get('not-it', 'Schiesse'));
// -> 8


/**
 * Additional strategy for dealing with various other languages:
 *   "Deburr" the strings to strip diacritics before checking the distance.
 *   deburr() is applied to each input string before get() compares them.
 */

const deburr = require('lodash.deburr');
console.log('Deburred distances:');
console.log('Score 1:', fastLevenshtein.get(deburr('Schließfächer'), deburr('Schließfach')));
// -> 2
console.log('Score 2:', fastLevenshtein.get(deburr('Blumtach'), deburr('Blumtächer')));
// -> 2
console.log('Score 3:', fastLevenshtein.get(deburr('schließfächer'), deburr('Schliessfaech')));
// -> 4


// Deburring maps ä to a and ß to ss, so umlaut and ß differences no longer count toward the distance.
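
The distance check and the dictionary idea described above could be combined roughly as follows. This is only a dependency-free sketch: `levenshtein`, `normalizeWord`, `findSingular`, the tiny word list, and the default threshold of 3 are illustrative assumptions, not part of fast-levenshtein, lodash, or the original project.

```javascript
// Plain dynamic-programming Levenshtein distance (keeps only two rows).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Lowercase, expand ß to ss, and strip combining diacritics (ä becomes a).
function normalizeWord(word) {
  return word
    .toLowerCase()
    .replace(/ß/g, 'ss')
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '');
}

// Hypothetical helper: return the closest known singular within
// maxDistance of the given plural, or null if none is close enough.
function findSingular(plural, knownSingulars, maxDistance = 3) {
  const target = normalizeWord(plural);
  let best = null;
  let bestDistance = Infinity;
  for (const candidate of knownSingulars) {
    const distance = levenshtein(target, normalizeWord(candidate));
    if (distance <= maxDistance && distance < bestDistance) {
      best = candidate;
      bestDistance = distance;
    }
  }
  return best;
}

// Assumed mini-dictionary, for illustration only.
const singulars = ['Schließfach', 'Schlüssel', 'Haus'];

console.log(findSingular('Schließfächer', singulars)); // -> Schließfach
console.log(findSingular('Häuser', singulars));        // -> Haus
console.log(findSingular('Fenster', singulars));       // -> null
```

Short words are where a fixed threshold breaks down: with this word list, 'Bäume' is also within distance 3 of 'Haus' after normalization, so it would be matched incorrectly. Scaling the threshold with word length, or using a larger dictionary, is one way to reduce such false positives.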