Regexp documentation issues

#1

I’d love to see some changes to the regexp documentation. Some parts are misleading, and others would benefit from more detail.

Behaviour of .match()

The documentation says it

Determines whether the target regular expression matches any part of the passed string

but it actually determines whether the regular expression matches the whole of the passed string. For example, regexp("a") will .match() against "a", but not "aa", "ab", or "ba"; regexp.capture() and regexp.search() will both match against all four of these strings. This is true for both regexp and regexp2.

The example code actually directly contradicts the documentation above; it shows that regexp(@"\d+") does not match against "stuff 123 Test.", even though it does match some part of that passed string.
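To make the difference concrete, here is a minimal sketch of the comparison described above. The helper name compareMatchAndSearch is hypothetical, and the logging assumes the imp API's server.log(); only regexp(), .match() and .search() come from the documentation being discussed.

```squirrel
// Hypothetical helper: reports how one pattern behaves against one string.
function compareMatchAndSearch(pattern, str) {
    local re = regexp(pattern);
    return {
        whole = re.match(str),           // true only if the entire string matches
        part  = re.search(str) != null   // true if any sub-string matches
    };
}

// The four strings from the example above.
foreach (s in ["a", "aa", "ab", "ba"]) {
    local r = compareMatchAndSearch("a", s);
    server.log(s + ": match=" + (r.whole ? "true" : "false") +
                   ", search=" + (r.part ? "true" : "false"));
}
```

If the behaviour is as described, only "a" should report match=true, while all four strings should report search=true.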

Behaviour of .capture()

The documentation seems to imply that .capture() finds multiple substrings which each match a regular expression. Actually, the regexp is only matched against the string once (to find the first match), and the multiple substrings returned correspond to multiple capture groups within the single match that is (maybe) found. For example, regexp("(a)").capture() against "aa" will not find both matches for the regexp within the target string, but rather will find one match (the first), and return the span of the whole match and the span of the capture group. This applies to both regexp and regexp2.

It is possible to use .capture() to tabulate multiple matches by using it repeatedly with the startIndex argument, but it doesn’t do that work for you.
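A rough sketch of that repeated-call approach follows. The helper name captureAll is hypothetical, server.log() is assumed to be available, and the code assumes that .capture() returns null when nothing matches and that the begin/end values it reports are offsets into the whole string rather than relative to startIndex.

```squirrel
// Hypothetical helper: calls .capture() repeatedly with a moving start index
// and collects the result of every successful call.
function captureAll(re, str) {
    local results = [];
    local start = 0;
    while (start < str.len()) {
        local caps = re.capture(str, start);
        if (caps == null) break;     // no further match
        results.push(caps);          // caps[0] = whole match, caps[1..] = capture groups
        // Resume after this match, always advancing by at least one character
        // so that an empty match cannot cause an endless loop.
        local next = caps[0].end;
        start = (next > start) ? next : start + 1;
    }
    return results;
}

local all = captureAll(regexp("(a)"), "aa");
foreach (i, caps in all) {
    server.log("match " + i + ": " + caps[0].begin + ".." + caps[0].end);
}
```

Under those assumptions, this should log two matches for "aa", whereas a single .capture() call only reports the first.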

Here are some examples of misleading documentation:

Finds and tabulates sub-strings which match the target regular expression

returns an array of tables, each of which indicates the location of a sub-string whose characters match the regular expression

Compare this method with search() which returns only the first of the matched sub-strings found within the source string.

regexp Greedy Bugs

I’ve recently raised an issue upstream about this. I think it should be added to the known issues for now, and hope it will eventually be fixed.

regexp vs regexp2

At the very least here, I’d like to see a more detailed explanation of the difference between these two. The documentation says:

regexp2 is a considerable improvement upon the standard Squirrel regular expression evaluation functionality provided by regexp

but also that regexp2 has memory usage issues, which leads me to suspect that regexp2 must have additional functionality and/or improved matching speed. This may be partially true, but there are other differences too. For example, the bug with greedy operators in regexp mentioned above does not apply to regexp2, so /a.*bc/ will match against "a b bc" with regexp2 (as expected) but not with regexp. Another bug I’ve recently reported, "Bug with repetitions before regexp capture groups?", also applies to regexp but not to regexp2.
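For anyone who wants to reproduce the greedy-operator difference, something like the following should do. It is only a sketch: the helper name tryBoth is hypothetical, server.log() is assumed to be available, and the choice of .search() is an assumption, since the report above does not depend on a particular method.

```squirrel
// Hypothetical helper: runs the same pattern through both engines and
// reports whether each one found a match.
function tryBoth(pattern, subject) {
    return {
        regexp  = regexp(pattern).search(subject) != null,
        regexp2 = regexp2(pattern).search(subject) != null
    };
}

local r = tryBoth("a.*bc", "a b bc");
server.log("regexp:  " + (r.regexp  ? "matched" : "no match"));
server.log("regexp2: " + (r.regexp2 ? "matched" : "no match"));
```

Running the same pattern through both classes side by side makes it easy to add further test cases as other differences turn up.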

If anyone knows of any other differences between these two, I’d like to hear about them! Please add some more detailed documentation on this.

@smittytone I think you’re responsible for updating documentation? Thanks for taking a look.

#2

I am not a regexpert, so I’ve generally left these parts of the documentation as I found them. But I’ll take a look at the issues you raise with the regex methods.