How can I find the singular in the plural when some letters change?
Consider the following situation:
- The German word Schließfach is a lockbox.
- The plural is Schließfächer.
As you can see, the letter a has changed to ä. For this reason, the first word is no longer a substring of the second one; they are “regex-technically” different.
Maybe I’m not on the right track with the tags I chose below, and maybe regex is not the right tool for this. I’ve seen that naturaljs (natural.NounInflector()) provides this functionality out of the box for English words. Are there similar solutions for the German language?
What is the best approach for finding the singular within the plural of a German word?
Answer
I once had to build a text processor that parsed many languages, in registers ranging from very casual to very formal. One of the things it had to identify was whether certain words were related (for example, a noun in the title that was related to a list of things, which was sometimes labeled with a plural form).
IIRC, 70-90% of singular and plural word forms across all the languages we supported had a “Levenshtein distance” of less than 3 or 4. (Eventually several dictionaries were added to improve accuracy, because “distance” alone produced many false positives.) Another interesting finding was that the longer the words, the more likely it was that a distance of 3 or fewer meant a relationship in meaning.
Here’s an example of the libraries we used:
const fastLevenshtein = require('fast-levenshtein');

console.log('Raw Distances:');
console.log('Score 1:', fastLevenshtein.get('Schließfächer', 'Schließfach')); // -> 3
console.log('Score 2:', fastLevenshtein.get('Blumtach', 'Blumtächer')); // -> 3
console.log('Score 3:', fastLevenshtein.get('schließfächer', 'Schliessfaech')); // -> 7
console.log('Score 4:', fastLevenshtein.get('not-it', 'Schliessfaech')); // -> 12
console.log('Score 5:', fastLevenshtein.get('not-it', 'Schiesse')); // -> 8

/**
 * Additional strategy for dealing with other various languages:
 * "Deburr" the strings to omit diacritics before checking the distance.
 * Note that deburr must be applied to each string, not to the numeric result.
 */
const deburr = require('lodash.deburr');

console.log('Deburred Distances:');
console.log('Score 1:', fastLevenshtein.get(deburr('Schließfächer'), deburr('Schließfach'))); // -> 2
console.log('Score 2:', fastLevenshtein.get(deburr('Blumtach'), deburr('Blumtächer'))); // -> 2
console.log('Score 3:', fastLevenshtein.get(deburr('schließfächer'), deburr('Schliessfaech'))); // -> 4
// Deburring removes the a/ä and ß/ss differences, so related forms score lower.
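To make the idea of combining a distance threshold with a dictionary more concrete, here is a minimal sketch, not the code we actually ran. It uses a hand-written Levenshtein function instead of fast-levenshtein so that it runs without dependencies, and the names normalize, findSingular, and the tiny dictionary array are hypothetical, chosen only for illustration. The default threshold of 3 is taken from the rough figure mentioned above and would need tuning against real data.

```javascript
// Plain dynamic-programming Levenshtein distance
// (insertions, deletions, and substitutions each cost 1).
function levenshtein(a, b) {
  const s = Array.from(a);
  const t = Array.from(b);
  let prev = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const curr = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[t.length];
}

// Hypothetical normalizer: lowercase, expand ß to ss, strip combining diacritics.
function normalize(word) {
  return word
    .toLowerCase()
    .replace(/ß/g, 'ss')
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '');
}

// Returns the closest dictionary entry within maxDistance, or null if none qualifies.
function findSingular(plural, candidates, maxDistance = 3) {
  const target = normalize(plural);
  let best = null;
  for (const candidate of candidates) {
    const distance = levenshtein(target, normalize(candidate));
    if (distance <= maxDistance && (best === null || distance < best.distance)) {
      best = { word: candidate, distance };
    }
  }
  return best;
}

// Assumed sample data; a real dictionary of known singular nouns would go here.
const dictionary = ['Schließfach', 'Schließung', 'Fach'];
console.log(findSingular('Schließfächer', dictionary));
// -> { word: 'Schließfach', distance: 2 }
```

Restricting the candidates to known dictionary words is what keeps the false positives down; the distance alone would also accept unrelated words that merely look similar.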