RegExp matching only the first two entries within a capture group (whatever they happen to be)

I’m currently working on an Adobe InDesign script, part of which is a function that finds measurements and picks them apart. I have a set of regexes that are run first using InDesign’s findGrep() (not really relevant here), and then using plain JavaScript exec() (because I need to do things with capture groups).

Now, I know that there are differences between these two regex engines, so I’ve been working to the capabilities of the much more limited JS engine (I think InDesign’s scripting language is based on ECMAScript 3), but I’ve recently hit a problem that I can’t seem to figure out.

Here’s the regex I’m currently testing (I’ve broken it across lines to make it a little easier to read):

  ((?:one|two|three|four|five|six|seven|eight|nine|ten|\d{4,}|\d{1,3}(?:,\d{3})*)(?:\.\d+)?)
  (?=-|‑|\s|°|º|˚|∙|⁰)
  (?:[-\s](thousand|million|billion|trillion))?
  (?:[-\s](cubic|cu\.?|square|sq\.?))?

  • The first line finds numbers formatted in various different ways.
  • The second line is a lookahead that makes sure I’ve reached the end of the numbers.
  • The third line finds any multipliers that refer to that number.
  • The fourth line is supposed to find any modifiers that go before the unit of measurement.
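
As a rough sketch of how those four parts fit together in a standard JavaScript engine: the constant and function names below are made up for illustration, and the backslashes are doubled only because the parts are stored as strings before being passed to new RegExp().

```javascript
// Each constant corresponds to one bullet above; all names are illustrative.
var NUMBER = "((?:one|two|three|four|five|six|seven|eight|nine|ten|\\d{4,}|\\d{1,3}(?:,\\d{3})*)(?:\\.\\d+)?)";
var END_OF_NUMBER = "(?=-|‑|\\s|°|º|˚|∙|⁰)";
var MULTIPLIER = "(?:[-\\s](thousand|million|billion|trillion))?";
var MODIFIER = "(?:[-\\s](cubic|cu\\.?|square|sq\\.?))?";

// Concatenate the parts; "g" lets exec() walk through every match in turn.
function buildMeasurementPattern() {
  return new RegExp(NUMBER + END_OF_NUMBER + MULTIPLIER + MODIFIER, "g");
}

// Collect the three capture groups for every measurement found in the text.
function findMeasurements(text) {
  var patt = buildMeasurementPattern();
  var found = [];
  var res;
  while ((res = patt.exec(text)) !== null) {
    found.push({ number: res[1], multiplier: res[2], modifier: res[3] });
  }
  return found;
}
```

Groups that do not take part in a match come back as undefined, so a line with no multiplier still yields an entry with only number and modifier set.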

This is the sample text I was testing it on.

23 sq metres
45-square-metres
16-cubic metres
96 cu metres
409 cu. metres
12 sq metres
24 sq. metres

Now when I run the regex using InDesign’s findGrep() it works as expected. When I run it using exec(), however, it does something odd. It will match the numbers and the multipliers just fine, but only “cubic” and “cu” get matched; the “square” and “sq” text is ignored.

To make things more baffling, if I reverse the order of these entries in the regex capture group (so it’s (?:[-\s](square|sq\.?|cubic|cu\.?))? instead), then it only matches “square” and “sq” and not “cubic” and “cu”.

Am I missing something really obvious here? I’m a JavaScript newbie, but I’ve been working with regular expressions in XSLT for years.

str = `23 sq metres
45-square-metres
16-cubic metres
96 cu metres
409 cu. metres
12 sq metres
24 sq. metres
`;
  patt = /((?:one|two|three|four|five|six|seven|eight|nine|ten|\d{4,}|\d{1,3}(?:,\d{3})*)(?:\.\d+)?)(?=-|‑|\s|°|º|˚|∙|⁰)(?:[-\s](thousand|million|billion|trillion))?(?:[-\s](cubic|cu\.?|square|sq\.?))?/gm;
  while (res = patt.exec(str)) console.log(res);

EDIT:

So, here’s the code as I’m trying to run it right now.

  str = `23 sq metres
    45-square-metres
    16-cubic metres
    96 cu metres
    409 cu. metres
    12 sq metres
    24 sq. metres
    `;
 var re = '(one|two|three|four|five|six|seven|eight|nine|ten|(?:[0-9]|,|\\.)+)(?:(\\s?(?:-|–)\\s?)(one|two|three|four|five|six|seven|eight|nine|ten|(?:[0-9]|,|\\.)+))?(?:[-\\s](thousand|million|billion|trillion))?(?:[-\\s](cubic|cu\\.?|square|sq\\.?))?';
    
patt = new RegExp(re);

while (res = patt.exec(str)) console.log(res);

If I try to run this on my machine, using the InDesign script, it fails to find anything with “square” or “sq”, and when I run it in the code snippet view here it just freezes up. I’m guessing this is something to do with storing regexes as strings, yes?

Answer

I’m not sure I understand you correctly. If you want your second snippet to behave the same way as the first, you probably just need to pass "gm" to the RegExp constructor:

var patt = new RegExp(re, "gm");

str = `23 sq metres
    45-square-metres
    16-cubic metres
    96 cu metres
    409 cu. metres
    12 sq metres
    24 sq. metres
    `;
var re = '(one|two|three|four|five|six|seven|eight|nine|ten|(?:[0-9]|,|\\.)+)(?:(\\s?(?:-|–)\\s?)(one|two|three|four|five|six|seven|eight|nine|ten|(?:[0-9]|,|\\.)+))?(?:[-\\s](thousand|million|billion|trillion))?(?:[-\\s](cubic|cu\\.?|square|sq\\.?))?';
    
var patt = new RegExp(re, "gm");

while (res = patt.exec(str)) console.log(res[5]);

It gives me this output:

sq
square
cubic
cu
cu.
sq
sq.
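
For what it’s worth, the freeze in the snippet without flags is what a standard engine does when exec() is looped on a non-global pattern: every call starts again at index 0 and returns the same first match, so the while loop never ends. Here is a minimal sketch of that, using a simplified \d+ pattern and a hypothetical allMatches helper rather than the full regex. It also shows why the backslashes have to be doubled once the pattern lives in a string.

```javascript
var text = "23 sq metres\n45-square-metres";

// Without "g", exec() ignores lastIndex and keeps returning the first match.
var once = /\d+/;
var first = once.exec(text)[0];   // "23"
var again = once.exec(text)[0];   // "23" again, hence the endless while loop

// In a string literal an unknown escape such as \d loses its backslash.
var lost = "\d{2}";               // the same string as "d{2}"
var kept = "\\d{2}";              // backslash, d, {2}: what RegExp needs

// Hypothetical helper: compile with "g" so lastIndex advances between calls.
function allMatches(source, input) {
  var patt = new RegExp(source, "g");
  var out = [];
  var res;
  while ((res = patt.exec(input)) !== null) {
    out.push(res[0]);
    // A zero-length match would not move lastIndex, so nudge it forward.
    if (res[0] === "") patt.lastIndex++;
  }
  return out;
}

var numbers = allMatches("\\d+", text);
```

The "m" flag only changes how ^ and $ behave, so for this pattern it is the "g" that stops the loop from hanging.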

Update

I’ve replaced (cubic|cu\.?|square|sq\.?) with (cubic|cu\.|cu|square|sq\.|sq) and it seems to work in InDesign now:

str = "23 sq metres\n45-square-metres\n16-cubic metres\n96 cu metres\n409 cu. metres\n12 sq metres\n24 sq. metres";

var re = '(one|two|three|four|five|six|seven|eight|nine|ten|(?:[0-9]|,|\\.)+)(?:(\\s?(?:-|–)\\s?)(one|two|three|four|five|six|seven|eight|nine|ten|(?:[0-9]|,|\\.)+))?(?:[-\\s](thousand|million|billion|trillion))?(?:[-\\s](cubic|cu\\.|cu|square|sq\\.|sq))?';
    
var patt = new RegExp(re, "gm");

var msg = "";

while (res = patt.exec(str)) msg += res[0] + " : " + res[5] + "\n";

alert(msg);

The ? quantifiers inside an alternation like (foo|bar) are probably more than InDesign’s scripting engine handles correctly.
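
If the explicit form is the one InDesign is happy with, the alternation can be generated instead of typed by hand. This is only a sketch: escapeForRegExp and alternation are hypothetical helpers (neither is part of InDesign’s API), and I have only reasoned about them against standard JavaScript behavior, not inside InDesign.

```javascript
// Escape characters that are special in a regex so "cu." matches a literal dot.
function escapeForRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build "(a|b|c)" with no "?" quantifiers, longest variants first so that
// "cu." is tried before "cu" and "sq." before "sq".
function alternation(words) {
  var sorted = words.slice().sort(function (a, b) {
    return b.length - a.length;
  });
  var parts = [];
  for (var i = 0; i < sorted.length; i++) {
    parts.push(escapeForRegExp(sorted[i]));
  }
  return "(" + parts.join("|") + ")";
}

// Drop-in replacement for the last part of the pattern string.
var MODIFIER = "(?:[-\\s]" +
  alternation(["cubic", "cu.", "cu", "square", "sq.", "sq"]) + ")?";
```

Listing every variant explicitly costs a few extra characters but keeps the quantifier out of the group, which is the construct that seemed to cause trouble.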