RegExMatch with WeiDU

Sam. · July 23

I'm trying to write my own RegExMatch PATCH function that wraps other built-in commands to provide functionality I find hard to get using only the base commands. With INDEX_BUFFER or RINDEX_BUFFER I can get the offset of the first match of the entire regular expression. Using REPLACE_EVALUATE I can build an array containing the first match of the regular expression as well as what was matched by each of its subgroups (submatches?). For my purposes I'm ignoring any matches beyond the first. From these I can get the length of the entire match and the length of each subgroup. So I now have what was matched by the RegEx, where it is, and how long it is. I also have what was matched by each subgroup and how long each is.

My question is: Is there a way to reliably retrieve the OFFSET to each subgroup of the matched regular expression?

Edited July 23 by Sam.

Jarno Mikkola · July 24

What type of scenario is this? Or what would this be used in? At what level ...
Because I find it hard to ... think of one ...

Edited July 24 by Jarno Mikkola

paladin84 · July 24

I don't know WeiDU well, so I don't know how to code this in WeiDU (or whether it is even possible), but I think the offset of every subgroup within the whole regexp match (subgroup 0) can be found using a substring search and indexing characters in a string:
1. For every subgroup match string (sms_i):
2. Search for sms_i in sms_0.
3. If only one instance is found, we are done with sms_i: save the left and right boundaries (offset, offset + size(sms_i)) and go to the next sms_i.
4. If more than one instance is found, take the first one that is (a) located after all the left boundaries (offsets) already found for previous subgroups and (b) located after the right boundaries of those subgroups whose parentheses were already closed in the regexp before the opening parenthesis of this subgroup. Save the boundaries and go to the next sms_i.

Subgroups are numbered in the same order as the opening parentheses in the regexp, and the regexp engine takes the first possible match for a subgroup, so the algorithm should work.
For step 4 you need to go through the regexp once and find the indexes of the opening and closing parentheses of every subgroup, ignoring all other characters. So if the parentheses in a regexp look like "((())()())", with the characters indexed 0-9,
the parenthesis map for the subgroups would look like: 1->0,9; 2->1,4; 3->2,3; 4->5,6; 5->7,8. If the algorithm above finds several instances of the substring for, e.g., the 4th subgroup, the right one should be the first instance located to the right of the offsets of subgroups 1-3 and to the right of the right boundaries of subgroups 2 and 3.
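The parenthesis map described above can be sketched in Python with a simple stack. This is only an illustration, not WeiDU code: the function name `paren_spans` is made up, and it assumes the input contains bare "(" and ")" characters as in the example string, so a real regexp (where groups are written `\(` and `\)`, and parentheses may be escaped or sit inside brackets) would need extra handling.

```python
def paren_spans(pattern: str) -> dict[int, tuple[int, int]]:
    """Map subgroup number -> (index of its "(", index of its ")").

    Assumption: every "(" opens a subgroup and every ")" closes one;
    all other characters are ignored.
    """
    spans = {}
    stack = []          # (group number, index of "(") for groups still open
    next_group = 1      # groups are numbered by their opening parenthesis
    for i, ch in enumerate(pattern):
        if ch == "(":
            stack.append((next_group, i))
            next_group += 1
        elif ch == ")":
            group, start = stack.pop()
            spans[group] = (start, i)
    # Sort by group number so the result reads like the map in the post.
    return dict(sorted(spans.items()))


print(paren_spans("((())()())"))
# {1: (0, 9), 2: (1, 4), 3: (2, 3), 4: (5, 6), 5: (7, 8)}
```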

I hope that makes sense. I can write the code in Python if I wasn't clear enough.

Sam. · July 24

I'm not sure I follow. As a purely contrived example, let's say my string is ~IETME~ and my RegEx is ~^I\(t*\).T.*\(.\)$~. My current results (looking for matched values) look something like:

The RegEx '^I\(t*\).T.*\(.\)$' was found at position 0.  The matched pattern is 'IETME'.
RegEx SubPattern 0 is 'IETME'
RegEx SubPattern 1 is ''
RegEx SubPattern 2 is 'E'

I already have the offset for SubPattern 0. The offset for an empty matched SubPattern should probably be -1 (i.e. not found). Can you step me thru how you're suggesting I get the offset to SubPattern 2?
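To make the difficulty concrete, here is a rough Python sketch (not WeiDU) of the plain "search the submatch inside the whole match" step applied to this example. The name `naive_offsets` is invented for illustration, and an empty SubPattern is reported as -1, as suggested above.

```python
def naive_offsets(whole: str, subpatterns: list[str]) -> list[int]:
    """Offset of the first occurrence of each submatch inside the whole match.

    Assumption: an empty submatch is treated as "not found" (-1).
    """
    return [whole.find(s) if s else -1 for s in subpatterns]


whole = "IETME"            # SubPattern 0
subpatterns = ["", "E"]    # SubPatterns 1 and 2 from the output above

print(naive_offsets(whole, subpatterns))
# [-1, 1]
```

The search reports offset 1 for SubPattern 2, but in the RegEx that group sits directly before the `$` anchor, so the 'E' it matched is the last character, at offset 4. The first 'E' in the string was consumed by the `.` in front of `T`.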

paladin84 · July 24

Oh, you're right, my approach is not going to work. Let me think about it more, but it seems that without parsing the regexp completely, you cannot find this offset reliably.

lynx · July 24

Perl has the $-[0] and $+[0] for start/end of the (last) match.
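In Perl the same arrays also cover subgroups: `$-[n]` and `$+[n]` are the start and end of group n. As a small illustration of what an engine that exposes these offsets returns, here is the example from this thread in Python, whose `re` match objects provide `span(n)`. This is not WeiDU code; the helper name `group_spans` is made up, and the pattern is rewritten in Python syntax, where groups use `(` and `)` rather than `\(` and `\)`.

```python
import re


def group_spans(pattern: str, text: str) -> list[tuple[int, int]]:
    """Return (start, end) for group 0 and every subgroup of the first match.

    Assumption: the pattern matches; a group that did not participate
    in the match would be reported by Python as (-1, -1).
    """
    m = re.search(pattern, text)
    if m is None:
        raise ValueError("pattern did not match")
    return [m.span(n) for n in range(m.re.groups + 1)]


print(group_spans(r"^I(t*).T.*(.)$", "IETME"))
# [(0, 5), (1, 1), (4, 5)]
```

Here subgroup 1 matched the empty string at offset 1, and subgroup 2 matched the final 'E' at offset 4.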

Edited July 24 by lynx
