Prevent regex from becoming greedy when using optional tokens?

I’m trying to use regex to extract information from different strings.

For example, I have the following JSON:

{
"id": 1,
"title": "test", // comment
"cost": "$10",
}

and want to write a regex that extracts into capture groups (1) the text up to the colon, (2) the text up to the comma, (3) the comma if it exists, and (4) the text after the comma.

Starting with the comma being non-optional, I came up with (.*?): (.*?)(,)(.*?)\n.

This works correctly. However, I then tried to make the comma optional by adding a ? after it: (.*?): (.*?)(,?)(.*?)\n. This breaks down: what should be in capture groups 2 and 3 ends up in group 4 instead.

How can I modify my regex to prevent this from occurring? I would like the modified version to function the same as the original non-optional version when a comma does exist, and when a comma does not exist, shift all text after the colon to group 2.

Answer

Let the second group capture anything that is neither a comma nor a line break:

(.*?): ([^,\n\r]*)(,?)(.*?)\n
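As a rough illustration, here is a minimal Python sketch comparing the question's optional-comma attempt with the negated character class suggested above. The choice of Python and the names TEXT, LAZY and FIXED are assumptions made for this example; the question does not say which regex engine is in use.

```python
import re

# Sample input from the question, treated as plain text (it is not strict JSON).
TEXT = '{\n"id": 1,\n"title": "test", // comment\n"cost": "$10",\n}\n'

# The question's attempt: a lazy group 2 followed by an optional comma.
# Group 2 and group 3 can both match nothing, so group 4 absorbs the rest.
LAZY = re.compile(r'(.*?): (.*?)(,?)(.*?)\n')

# The suggested fix: group 2 is greedy but cannot cross a comma or line break.
FIXED = re.compile(r'(.*?): ([^,\n\r]*)(,?)(.*?)\n')

lazy_rows = LAZY.findall(TEXT)
fixed_rows = FIXED.findall(TEXT)

for row in fixed_rows:
    print(row)

# A line without a comma puts everything after the colon into group 2.
no_comma = FIXED.findall('"name": "x"\n')
print(no_comma)
```

With the lazy version, group 2 is satisfied by the empty string and the optional comma is skipped, which is why the value and the comma land in group 4. The negated class forces group 2 to run up to the first comma or the end of the line.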

Note that your regex requires the line to end with \n. This may be too strict, as the last line of a text might not be terminated by \n, and some texts use \r or \r\n as line breaks. You might want to use the $ anchor instead, which does not consume the line break but only asserts that the line ends there. Use it with the m (multiline) modifier.
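A possible sketch of the $ variant in Python, assuming re.MULTILINE as the equivalent of the m modifier. In Python, $ in multiline mode matches only before \n or at the end of the string, so the optional \r? before $ is an addition made here (not part of the answer's pattern) to keep a carriage return out of group 4 on \r\n input. The names ANCHORED and CRLF_TEXT are hypothetical.

```python
import re

# Same groups as before, but the trailing \n is replaced by \r?$ so that the
# pattern also matches a final line with no line terminator.
# Assumption: \r? is added so Windows-style \r\n endings do not leak into group 4.
ANCHORED = re.compile(r'(.*?): ([^,\n\r]*)(,?)(.*?)\r?$', re.MULTILINE)

# Hypothetical input: \r\n line breaks, and no line break after the last line.
CRLF_TEXT = '"id": 1,\r\n"title": "test", // comment\r\n"cost": "$10"'

anchored_rows = ANCHORED.findall(CRLF_TEXT)
for row in anchored_rows:
    print(row)
```

Other engines treat $ and line terminators differently, so the \r? may be unnecessary or insufficient elsewhere; it is worth checking against the engine actually in use.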

User contributions licensed under: CC BY-SA