I have a small problem.
I’m using Node.js as a backend. A user has a “biography” field where they can write something about themselves.
Suppose this field has a maxlength of 220, and suppose this is the input:
š¶š»š¦š»š§š»šØš»š©š»š±š»āāļøš±š»š“š»šµš»š²š»š³š»āāļøš³š»š®š»āāļøš®š»š·š»āāļøš·š»šš»āāļøšš»šµš»āāļøš©š»āāļøšØš»āāļøš©š»āš¾šØš»āš¾šØš»āš¾šØš»āš¾šØš»āš¾šØš»āš¾šØš»āš¾šØš»āš¾šØš»āš¾šØš»āš¾šØš»āš¾šØš»āš¾šØš»āš¾šØš»āš¾šØš»āš¾šØš»āš¾
As you can see, there aren’t 220 emojis here (there are 37), but if I run this on my Node.js server:
console.log(bio.length)
where bio is the input text, I get 221. How can I “parse” the input string to get the correct length? Is this a Unicode problem?
SOLVED
I used this library: https://github.com/orling/grapheme-splitter
I tried this:
var Grapheme = require('grapheme-splitter');
var splitter = new Grapheme();
console.log(splitter.splitGraphemes(bio).length);
and the length is 37. It works very well!
Answer
str.length
gives the count of UTF-16 code units. A Unicode-aware way to get the string length in code points is
[...str].length
because the iterable protocol splits the string into code points. A single visible emoji can still consist of several code points, so if we need the length in graphemes (grapheme clusters), we have these native options:
a. Unicode property escapes in RegExp. See, for example: Unicode-aware version of \w or Matching emoji.
b. Intl.Segmenter: coming soon, probably in ES2021. It can be tested behind a flag in recent V8 versions (the implementation was synced with the latest spec in V8 86). Unflagged (shipped) in V8 87.
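To make the difference between the three lengths concrete, here is a minimal sketch that measures one emoji in all three ways. It assumes a runtime where Intl.Segmenter is available and that supports Unicode property escapes. The function names, including isBioWithinLimit and its default limit of 220, are hypothetical and only illustrate the biography check from the question.

```javascript
// Count UTF-16 code units (what str.length reports).
function utf16Length(str) {
  return str.length;
}

// Count code points: string iteration yields one item per code point.
function codePointLength(str) {
  return [...str].length;
}

// Count grapheme clusters (user-perceived characters).
// Assumes Intl.Segmenter exists in the runtime.
function graphemeLength(str, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let count = 0;
  for (const segment of segmenter.segment(str)) {
    count += 1;
  }
  return count;
}

// Hypothetical helper: enforce a limit in graphemes instead of code units.
function isBioWithinLimit(bio, maxGraphemes = 220) {
  return graphemeLength(bio) <= maxGraphemes;
}

// "Man farmer, light skin tone": man + skin tone modifier + ZWJ + sheaf of rice.
const farmer = '\u{1F468}\u{1F3FB}\u200D\u{1F33E}';

console.log(utf16Length(farmer));     // 7
console.log(codePointLength(farmer)); // 4
console.log(graphemeLength(farmer));  // 1

// Option (a): a Unicode property escape can detect pictographic characters.
console.log(/\p{Extended_Pictographic}/u.test(farmer)); // true
```

With a check like isBioWithinLimit, a biography made of 37 such emojis would count as 37 toward the limit rather than as a few hundred code units.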