
How to parse and capture any measurement unit

In my application, users can customize measurement units, so if they want to work in decimeters instead of inches or in full-turns instead of degrees, they can. However, I need a way to parse a string containing multiple values and units, such as 1' 2" 3/8. I’ve seen a few regular expressions on SO and didn’t find any which matched all cases of the imperial system, let alone allowing any kind of unit. My objective is to have the most permissive input box possible.

So my question is: how can I extract multiple value-unit pairs from a string in the most user-friendly way?


I came up with the following algorithm:

  1. Check for illegal characters and throw an error if needed.
  2. Trim leading and trailing spaces.
  3. Split the string into parts every time there’s a non-digit character followed by a digit character, except for .,/ which are used to identify decimals and fractions.
  4. Remove all spaces from parts, check for character misuse (multiple decimal points or fraction bars) and replace '' with ".
  5. Split value and unit-string for each part. If a part has no unit:
    • If it is the first part, use the default unit.
    • Else if it is a fraction, consider it as the same unit as the previous part.
    • Else if it isn’t, consider it as in, cm or mm based on the previous part’s unit.
    • If it isn’t the first part and there’s no way to guess the unit, throw an error.
  6. Check if units mean something, are all of the same system (metric/imperial) and follow a descending order (ft > in > fraction or m > cm > mm > fraction), throw an error if not.
  7. Convert and sum all parts, performing division in the process.

I guess I could use string manipulation functions to do most of this, but I feel like there must be a simpler way through regex.


I came up with a regex:
((\d+('|''|"|m|cm|mm|\s|$) *)+(\d+(\/\d+)?('|''|"|m|cm|mm|\s|$) *)?)|((\d+('|''|"|m|cm|mm|\s) *)*(\d+(\/\d+)?('|''|"|m|cm|mm|\s|$) *))

It only allows fractions at the end and allows spaces between values. I’ve never used regex capturing though, so I’m not sure how I’ll manage to extract the values out of this mess. I’ll work on this again tomorrow.


Answer

My objective is to have the most permissive input box possible.

Careful, more permissive doesn’t always mean more intuitive. An ambiguous input should warn the user, not pass silently, as that might lead them to make multiple mistakes before they realize their input wasn’t interpreted like they hoped.

How can I extract multiple value-unit pairs from a string? I guess I could use string manipulation functions to do most of this, but I feel like there must be a simpler way through regex.

Regular expressions are a powerful tool, especially since they work in many programming languages, but be warned. When you’re holding a hammer everything starts to look like a nail. Don’t try to use a regular expression to solve every problem just because you recently learned how they work.

Looking at the pseudocode you wrote, you are trying to solve two problems at once: splitting up a string (which we call tokenization) and interpreting input according to a grammar (which we call parsing). You should first try to split up the input into a list of tokens, or maybe unit-value pairs. You can start making sense of these pairs once you’re done with string manipulation. Separation of concerns will spare you a headache, and your code will be much easier to maintain as a result.

I’ve never used regex capturing though, so I’m not so sure how I’ll manage to extract the values out of this mess.

If a regular expression has the global (g) flag, it can be used to find multiple matches in the same string. That would be useful if you had a regular expression that finds a single unit-value pair. In JavaScript, you can retrieve a list of matches using string.match(regex). However, that function ignores capture groups on global regular expressions.
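To illustrate that limitation, here is a minimal sketch. The pattern and the sample string are only assumptions chosen for the demonstration (an integer, a space, then a word); they are not meant as a real unit parser.

```javascript
// Hypothetical pattern for illustration: an integer, a space, then a word.
const pairRx = /(\d+) ([a-z]+)/g;
const sample = "1 hour 30 minutes";

// With the g flag, match() returns only the full matches. The two capture
// groups (the number and the word) do not appear anywhere in the result.
const fullMatches = sample.match(pairRx);

console.log(fullMatches); // ["1 hour", "30 minutes"]
```

So match() is enough to count or list the pairs, but not to pull the value and the unit apart.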

If you want to use capture groups, you need to call regex.exec(string) inside a loop. For each successful match, the exec function will return an array where item 0 is the entire match and items 1 and onwards are the captured groups.

For example, /(\d+) ([a-z]+)/g will look for an integer followed by a space and a word. If you made successive calls to regex.exec("1 hour 30 minutes") you would get:

  • ["1 hour", "1", "hour"]
  • ["30 minutes", "30", "minutes"]
  • null

Successive calls work like this because the regex object keeps an internal cursor you can get or set with regex.lastIndex. You should set it back to 0 before using the regex again with a different input.
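As a rough sketch of that loop, the function below collects every value and word from an input. The names durationRx and extractPairs are made up for this example, and converting the value with Number is my own choice.

```javascript
// Same illustrative pattern as above: an integer, a space, then a word.
const durationRx = /(\d+) ([a-z]+)/g;

// Collect a [value, word] pair for every match found in the input.
function extractPairs(input) {
    // Reset the cursor in case a previous input left it somewhere else.
    durationRx.lastIndex = 0;

    const pairs = [];
    let match;
    while ((match = durationRx.exec(input)) !== null) {
        // match[0] is the whole match; match[1] and match[2] are the groups.
        pairs.push([Number(match[1]), match[2]]);
    }
    return pairs;
}

console.log(extractPairs("1 hour 30 minutes")); // [[1, "hour"], [30, "minutes"]]
```

The loop stops when exec returns null, which also resets lastIndex to 0, but resetting it explicitly at the top keeps the function safe if an earlier loop was abandoned halfway.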

You’ve been using parentheses to isolate OR clauses such as a|b and to apply quantifiers to a character sequence such as (abc)+. If you want to do that without creating capture groups, you can use (?: ) instead. This is called a non-capturing group. It does the same thing as regular parentheses in a regex, but what’s inside it won’t create an entry in the returned array.
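A small comparison may help here. Both patterns below are hypothetical and accept the same input; the only difference is whether the unit alternation is captured.

```javascript
// The unit alternation is wrapped in regular (capturing) parentheses...
const capturing = /(\d+)(cm|mm|m)/;
// ...and here in a non-capturing group.
const nonCapturing = /(\d+)(?:cm|mm|m)/;

// Spread the results into plain arrays so only the match and groups remain.
const withGroup = [...capturing.exec("25cm")];
const withoutGroup = [...nonCapturing.exec("25cm")];

console.log(withGroup);    // ["25cm", "25", "cm"]
console.log(withoutGroup); // ["25cm", "25"]
```

Both expressions match the same text, but the second one keeps the result array shorter, which makes the group numbering easier to follow in a long expression.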

Is there a better way to approach this?

A previous version of this answer concluded with a regular expression even more incomprehensible than the one posted in the question because I didn’t know better at the time, but today this would be my recommendation. It’s a regular expression that only extracts one token at a time from the input string.

/ (\s+)                             // 1 whitespace
| (\d+)\/(\d+)                      // 2,3 fraction
| (\d*)([.,])(\d+)                  // 4,5,6 decimal
| (\d+)                             // 7 integer
| (km|cm|mm|m|ft|in|pi|po|'|")      // 8 unit
/gi

I used whitespace and comments to make this more readable, but that is not valid JavaScript. Properly formatted, it becomes:

/(\s+)|(\d+)\/(\d+)|(\d*)([.,])(\d+)|(\d+)|(km|cm|mm|m|ft|in|pi|po|'|")/gi

This regular expression makes clever use of capture groups separated by OR clauses. Only the capture groups of one type of token will contain anything. For example, on the string "10 ft", successive calls to exec would return:

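  • ["10", undefined, undefined, undefined, undefined, undefined, undefined, "10", undefined] (because “10” is an integer)
  • [" ", " ", undefined, undefined, undefined, undefined, undefined, undefined, undefined] (because “ ” is whitespace)
  • ["ft", undefined, undefined, undefined, undefined, undefined, undefined, undefined, "ft"] (because “ft” is a unit)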
  • null
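If you want to see this for yourself, the sketch below reports which alternative matched each token. The groupLabels table and the describeTokens helper are names I made up for the demonstration. Note that this simple loop silently skips anything the expression doesn't recognize.

```javascript
const tokenRx = /(\s+)|(\d+)\/(\d+)|(\d*)([.,])(\d+)|(\d+)|(km|cm|mm|m|ft|in|pi|po|'|")/gi;

// Hypothetical labels, keyed by the first capture group of each alternative.
const groupLabels = { 1: "whitespace", 2: "fraction", 4: "decimal", 7: "integer", 8: "unit" };

function describeTokens(input) {
    // Work on a copy so this call has its own lastIndex cursor.
    const rx = new RegExp(tokenRx);
    const found = [];
    let match;
    while ((match = rx.exec(input)) !== null) {
        // Groups that took no part in the match are undefined, so the first
        // defined group (after index 0) identifies the alternative that matched.
        const group = match.findIndex((value, i) => i > 0 && value !== undefined);
        found.push([groupLabels[group], match[0]]);
    }
    return found;
}

console.log(describeTokens("10 ft"));
// [["integer", "10"], ["whitespace", " "], ["unit", "ft"]]
```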

A tokenizer function can then do something like this to treat each individual token:

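// The token regex from above, written on one line.
const tokenRx = /(\s+)|(\d+)\/(\d+)|(\d*)([.,])(\d+)|(\d+)|(km|cm|mm|m|ft|in|pi|po|'|")/gi;

function tokenize (input) {
    // Work on a private copy so each tokenizer has its own lastIndex cursor.
    const localTokenRx = new RegExp(tokenRx);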

    return function next () {
        const startIndex = localTokenRx.lastIndex;
        if (startIndex >= input.length) {
            // end of input reached
            return undefined;
        }

        const match = localTokenRx.exec(input);

        if (!match) {
            localTokenRx.lastIndex = input.length;
            // there is leftover garbage at the end of the input
            return ["garbage", input.slice(startIndex)];
        }

        if (match.index !== startIndex) {
            localTokenRx.lastIndex = match.index;
            // the regex skipped over some garbage
            return ["garbage", input.slice(startIndex, match.index)];
        }

        const [
            text,
            whitespace,
            numerator, denominator,
            integralPart, decimalSeparator, fractionalPart,
            integer,
            unit
        ] = match;

        if (whitespace) {
            return ["whitespace", undefined];
            // or return next(); if we want to ignore it
        }

        if (denominator) {
            return ["fraction", Number(numerator) / Number(denominator)];
        }

        if (decimalSeparator) {
            return ["decimal", Number(integralPart + "." + fractionalPart)];
        }

        if (integer) {
            return ["integer", Number(integer)];
        }

        if (unit) {
            return ["unit", unit];
        }
    };
}

This function handles all the necessary string manipulation and type conversion in one place, letting another piece of code do the proper analysis of the sequence of tokens. But that would be out of scope for this Stack Overflow answer, especially since the question doesn’t specify the rules of the grammar we are willing to accept.
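Just to give an idea of what that other piece of code could look like, here is a rough sketch. The grammar is purely my assumption, loosely based on the question: every value is followed by its unit, a mixed number such as 2 3/8 is allowed, and a bare trailing fraction reuses the previous unit. The toMillimetres table and the sumLength name are hypothetical, and the table covers only some of the units. The input is an array of [type, value] tokens in the shape returned by the tokenizer.

```javascript
// Hypothetical conversion table covering only a few of the units.
const toMillimetres = { km: 1e6, m: 1000, cm: 10, mm: 1, ft: 304.8, in: 25.4, "'": 304.8, '"': 25.4 };

function sumLength(tokens) {
    let total = 0;
    let pending = null;     // value still waiting for its unit
    let pendingKind = null; // "whole", "fraction" or "mixed"
    let lastFactor = null;  // conversion factor of the most recent unit

    for (const [type, value] of tokens) {
        if (type === "whitespace") continue;
        if (type === "garbage") throw new Error(`Unexpected input: ${value}`);

        if (type === "unit") {
            const factor = toMillimetres[value.toLowerCase()];
            if (factor === undefined) throw new Error(`Unknown unit: ${value}`);
            if (pending === null) throw new Error(`Unit without a value: ${value}`);
            total += pending * factor;
            pending = null;
            lastFactor = factor;
        } else if (pending === null) {
            pending = value;
            pendingKind = type === "fraction" ? "fraction" : "whole";
        } else if (type === "fraction" && pendingKind === "whole") {
            pending += value; // mixed number such as 2 3/8
            pendingKind = "mixed";
        } else {
            throw new Error("Two values in a row");
        }
    }

    if (pending !== null) {
        // Only a bare fraction may borrow the previous unit, as in 2" 3/8.
        if (pendingKind !== "fraction" || lastFactor === null) {
            throw new Error("Value without a unit");
        }
        total += pending * lastFactor;
    }
    return total;
}

// Tokens for the input 1' 2" 3/8
const tokens = [
    ["integer", 1], ["unit", "'"], ["whitespace", undefined],
    ["integer", 2], ["unit", '"'], ["whitespace", undefined],
    ["fraction", 3 / 8],
];
console.log(sumLength(tokens).toFixed(3)); // "365.125"
```

Because the tokenizer already dealt with the string, this function is nothing more than a loop over typed values, and changing the accepted grammar never touches the regular expression.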

But this is most likely too generic and complex a solution if all you’re trying to do is accept imperial lengths and metric lengths. For that, I’d probably just write a different regular expression for each acceptable format, then test the user’s input to see which one matches. If two different expressions match, then the input is ambiguous and we should warn the user.
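A minimal sketch of that idea follows. The three formats, their names, the default units for a bare number and the choice of millimetres as the result are all assumptions I made for the example; the real list depends on what your application wants to accept.

```javascript
// Accept both "1.5" and "1,5" as decimal notation.
const decimal = (text) => Number(text.replace(",", "."));

// Hypothetical formats; each one converts its own match to millimetres.
const formats = [
    {
        name: "feet and inches", // 5' 11" or 5'
        rx: /^(\d+)\s*'\s*(?:(\d+)\s*"?)?$/,
        toMm: (m) => Number(m[1]) * 304.8 + Number(m[2] ?? 0) * 25.4,
    },
    {
        name: "inches", // 11", 11 in, or a bare 11
        rx: /^(\d+(?:[.,]\d+)?)\s*(?:"|in)?$/i,
        toMm: (m) => decimal(m[1]) * 25.4,
    },
    {
        name: "metric", // 1,8 m, 180cm, or a bare 180 (read as mm)
        rx: /^(\d+(?:[.,]\d+)?)\s*(mm|cm|m)?$/i,
        toMm: (m) => decimal(m[1]) * { mm: 1, cm: 10, m: 1000 }[(m[2] ?? "mm").toLowerCase()],
    },
];

function interpret(input) {
    const text = input.trim();
    // Try every format and keep the ones that accept the whole input.
    const hits = formats
        .map((format) => ({ format, match: format.rx.exec(text) }))
        .filter((hit) => hit.match !== null);

    if (hits.length === 0) return { status: "invalid" };
    if (hits.length > 1) {
        // More than one reading is possible: report it instead of guessing.
        return { status: "ambiguous", candidates: hits.map((hit) => hit.format.name) };
    }
    const [{ format, match }] = hits;
    return { status: "ok", format: format.name, millimetres: format.toMm(match) };
}

console.log(interpret("180cm")); // { status: "ok", format: "metric", millimetres: 1800 }
console.log(interpret("12"));    // { status: "ambiguous", candidates: ["inches", "metric"] }
```

Each expression stays short enough to read at a glance, and the ambiguous case falls out naturally: a bare 12 is accepted by two formats, so the user gets a warning rather than a silent guess.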

User contributions licensed under: CC BY-SA