How to parse and capture any measurement unit

Question

In my application, users can customize measurement units, so if they want to work in decimeters instead of inches or in full-turns instead of degrees, they can. However, I need a way to parse a string containing multiple values and units, such as 1′ 2″ 3/8. I've seen a few regular expression…

Accepted Answer

> My objective is to have the most permissive input box possible.

Careful, more permissive doesn't always mean more intuitive. An ambiguous input should warn the user, not pass silently, as that might lead them to make multiple mistakes before they realize their input wasn't interpreted like they hoped.

> How can I extract multiple value-unit pairs from a string? I guess I could use string manipulation functions to do most of this, but I feel like there must be a simpler way through regex.

Regular expressions are a powerful tool, especially since they work in many programming languages, but be warned. When you're holding a hammer, everything starts to look like a nail. Don't try to use a regular expression to solve every problem just because you recently learned how they work.

Looking at the pseudocode you wrote, you are trying to solve two problems at once: splitting up a string (which we call tokenization) and interpreting input according to a grammar (which we call parsing). You should first split up the input into a list of tokens, or maybe unit-value pairs. You can start making sense of these pairs once you're done with string manipulation. Separation of concerns will spare you a headache, and your code will be much easier to maintain as a result.

> I've never used regex capturing though, so I'm not so sure how I'll manage to extract the values out of this mess.

If a regular expression has the global (g) flag, it can be used to find multiple matches in the same string. That would be useful if you had a regular expression that finds a single unit-value pair. In JavaScript, you can retrieve a list of matches using string.match(regex). However, that function ignores capture groups on global regular expressions.

If you want to use capture groups, you need to call regex.exec(string) inside a loop.

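As a rough sketch of the difference between the two calls, the snippet below compares string.match with an exec loop. The pattern pairRx, the sample input, and the helper extractPairs are made up for illustration and are not part of the question.

```javascript
// Hypothetical pattern: an integer, one space, then a lowercase word.
const pairRx = /(\d+) ([a-z]+)/g;
const input = "2 ft 6 in";

// With the g flag, match() returns only the whole matches;
// the capture groups are dropped.
const wholeMatches = input.match(pairRx);
console.log(wholeMatches);

// Calling exec() in a loop keeps the capture groups of every match.
function extractPairs(rx, text) {
    rx.lastIndex = 0; // start from the beginning in case rx was used before
    const pairs = [];
    let match;
    while ((match = rx.exec(text)) !== null) {
        // match[0] is the whole match, match[1] and match[2] are the groups
        pairs.push({ value: Number(match[1]), unit: match[2] });
    }
    return pairs;
}

const pairs = extractPairs(pairRx, input);
console.log(pairs);
```

In engines that support it, text.matchAll(rx) on a global regex yields the same match arrays without the manual loop.
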
For each successful match, the exec function will return an array where item 0 is the entire match and items 1 and onwards are the captured groups.

For example, /(\d+) ([a-z]+)/g will look for an integer followed by a space and a word. If you made successive calls to regex.exec("1 hour 30 minutes") you would get:

    ["1 hour", "1", "hour"]
    ["30 minutes", "30", "minutes"]
    null

Successive calls work like this because the regex object keeps an internal cursor you can get or set with regex.lastIndex. You should set it back to 0 before using the regex again with a different input.

You've been using parentheses to isolate OR clauses such as a|b and to apply quantifiers to a character sequence such as (abc)+. If you want to do that without creating capture groups, you can use (?: ) instead. This is called a non-capturing group. It does the same thing as regular parentheses in a regex, but what's inside it won't create an entry in the returned array.

> Is there a better way to approach this?

A previous version of this answer concluded with a regular expression even more incomprehensible than the one posted in the question because I didn't know better at the time, but today this would be my recommendation. It's a regular expression that only extracts one token at a time from the input string.

    /
      (\s+)                             // 1 whitespace
    | (\d+)\/(\d+)                      // 2,3 fraction
    | (\d*)([.,])(\d+)                  // 4,5,6 decimal
    | (\d+)                             // 7 integer
    | (km|cm|mm|m|ft|in|pi|po|'|")      // 8 unit
    /gi

I used whitespace and comments to make this more readable, which JavaScript regex literals don't allow. Properly formatted, it becomes:

    /(\s+)|(\d+)\/(\d+)|(\d*)([.,])(\d+)|(\d+)|(km|cm|mm|m|ft|in|pi|po|'|")/gi

This regular expression makes clever use of capture groups separated by OR clauses. Only the capture groups of one type of token will contain anything.

For example, on the string "10 ft", successive calls to exec would return:

    ["10", "", "", "", "", "", "", "10", ""]   (because "10" is an integer)
    [" ", " ", "", "", "", "", "", "", ""]     (because " " is whitespace)
    ["ft", "", "", "", "", "", "", "", "ft"]   (because "ft" is a unit)
    null

(In JavaScript, groups that did not take part in the match are actually undefined. They are shown as "" here for brevity.)

A tokenizer function can then do something like this to treat each individual token:

    // The single-line regular expression shown above.
    const tokenRx = /(\s+)|(\d+)\/(\d+)|(\d*)([.,])(\d+)|(\d+)|(km|cm|mm|m|ft|in|pi|po|'|")/gi;

    // Returns a function that yields the next [type, value] token on each call,
    // or undefined once the input is exhausted.
    function tokenize (input) {
        // Work on a copy so that each tokenizer has its own lastIndex cursor.
        const localTokenRx = new RegExp(tokenRx);
        return function next () {
            const startIndex = localTokenRx.lastIndex;
            if (startIndex >= input.length) {
                // end of input reached
                return undefined;
            }
            const match = localTokenRx.exec(input);
            if (!match) {
                // there is leftover garbage at the end of the input
                localTokenRx.lastIndex = input.length;
                return ["garbage", input.slice(startIndex)];
            }
            if (match.index !== startIndex) {
                // the regex skipped over some garbage; rewind so that the
                // token it found is returned by the following call
                localTokenRx.lastIndex = match.index;
                return ["garbage", input.slice(startIndex, match.index)];
            }
            const [
                text,
                whitespace,
                numerator, denominator,
                integralPart, decimalSeparator, fractionalPart,
                integer,
                unit
            ] = match;
            if (whitespace) {
                // or return next(); if we want to ignore it
                return ["whitespace", undefined];
            }
            if (denominator) {
                return ["fraction", Number(numerator) / Number(denominator)];
            }
            if (decimalSeparator) {
                return ["decimal", Number(integralPart + "." + fractionalPart)];
            }
            if (integer) {
                return ["integer", Number(integer)];
            }
            if (unit) {
                return ["unit", unit];
            }
        };
    }

This function can do all the necessary string manipulation and type conversion in one place, letting another piece of code do proper analysis of the sequence of tokens. But that would be out of scope for this Stack Overflow answer, especially since the question doesn't specify the rules of the grammar we are willing to accept.

That said, this is most likely too generic and complex a solution if all you're trying to do is accept imperial lengths and metric lengths. For that, I'd probably just write a different regular expression for each acceptable format, then test the user's input to see which one matches. If two different expressions match, then the input is ambiguous and we should warn the user.
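
The one-regex-per-format idea could be sketched roughly as follows. The format names, the patterns, and the classify helper are all assumptions made for illustration; the question does not say which formats should be accepted.

```javascript
// Hypothetical formats: names and patterns are assumptions for illustration.
// None of the regexes use the g flag, so test() keeps no state between calls.
const formats = [
    {
        name: "feet and inches",
        rx: /^\s*(\d+)\s*(?:ft|')\s*(\d+)\s*(?:in|")\s*$/i,
    },
    {
        name: "meters, comma as decimal separator",
        rx: /^\s*\d+,\d+\s*m\s*$/i,
    },
    {
        name: "meters, comma as thousands separator",
        rx: /^\s*\d{1,3}(?:,\d{3})+\s*m\s*$/i,
    },
];

// Tests the input against every format and reports how many matched.
function classify(input) {
    const candidates = formats
        .filter((format) => format.rx.test(input))
        .map((format) => format.name);
    if (candidates.length === 0) {
        return { status: "invalid", candidates };
    }
    if (candidates.length > 1) {
        // more than one reading is possible: warn the user instead of guessing
        return { status: "ambiguous", candidates };
    }
    return { status: "ok", candidates };
}

console.log(classify("5' 10\""));   // one format matches
console.log(classify("1,500 m"));   // both comma readings match
console.log(classify("12 parsecs")); // nothing matches
```

Here "1,500 m" is reported as ambiguous because the comma can be read either as a decimal or as a thousands separator, which is the kind of case where a warning is more useful than a silent guess.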

Answer