Skip to Content

Archived discussions are read-only. Learn more about SAP Q&A

Problem with REGEXP_SUBSTR

I am having a problem with REGEXP_SUBSTR(). I am trying to extract "captureThis" from the following string:

Token1 blah, blah, (Token1 blah Token2 ignoreThis blah) blah, blah Token2 captureThis blah blah'

The rules are that it must be preceded by "Token1" followed by arbitrary text, terminated by "Token2". Except that if the same pattern appears in parentheses in the arbitrary text, everthing in parentheses should be ignored.

If I run the following REGEXP_SUBSTR() statement, I get a result that correctly ends in "captureThis"

select REGEXP_SUBSTR('Token1 blah, blah, (Token1 blah Token2 ignoreThis blah) blah, blah Token2 captureThis blah blah',
    'Token1\\s+((\\([^\\)]+\\))|(((?!Token2).)+))*\\s+Token2\\s+\\S*' ,
  1,1);

Result: Token1 blah, blah, (Token1 blah Token2 ignoreThis blah) blah, blah Token2 captureThis

However, if I use a Positive Lookbehind Zero-Width Assertion to filter out the tokens, I get a different result:

select REGEXP_SUBSTR('Token1 blah, blah, (Token1 blah Token2 ignoreThis blah) blah, blah Token2 captureThis blah blah',
'(?<=Token1\\s+((\\([^\\)]+\\))|(((?!Token2).)+))*\\s+Token2\\s+)\\S*'
,  1,1);

Result: ignoreThis (I need it to be "captureThis")

I also found an interesting result by playing with the fourth parameter, occurrence-number. In the first statement above, occurrence-number 1 is a string ending in "captureThis", and occurrence-number 2 is a string ending in "ignoreThis".

However, in the second statement, with the Lookbehind, the order is reversed. Occurrence-number 1 is "ignoreThis", and occurrence-number 2 is "captureThis".

Is there any way to alter the regular expression in the second statement so occurrence-number 1 will be "captureThis"?

Thanks,

Eric

replied

Finally, the light has dawned. Thanks for your help and your patience. My confusion can be explained (or cleared up) by imagining a cursor position as the regular expression engine is scanning the input string.

In my first statement (without the look-behind), the cursor begins on the first instance of Token1, correctly passes the stuff in parentheses as an example of alternative 1, and stops at the first word following the 2nd Token2, which is "captureThis".

I was wrongly expecting the second statement to behave the same way. But because of the look-behind, the cursor begins on the first word following a Token2, which is "ignoreThis". It then scans backwards and finds a match consisting of what's inside the parentheses.

So now it makes sense to me that simply adding a look-behind to an otherwise identical regular expression applied to the identical input string would produce results in reverse order. The first statement uses instances of Token1 to anchor its searches, while the second uses Token2 to anchor its searches.

Unfortunately, I can't use your anchor(^) solution for my problem because there will be text before Token1. But I can use trimmed results from my non-look-behind version to get what I need in the order I need. Also, you're right about (?!Token2). I can take it out.

Thanks,

Eric

1 View this answer in context

Helpful Answer

by
Not what you were looking for? View more on this topic or Ask a question