
RegExMatch with WeiDU



I'm trying to write my own RegExMatch PATCH function that wraps other built-in commands to get functionality I find hard to achieve with the base commands alone.  With INDEX_BUFFER or RINDEX_BUFFER I can get the offset of the entire first match of a regular expression.  Using REPLACE_EVALUATE I can build an array containing the first match of the regular expression as well as what was matched by each of its subgroups (submatches?).  For my purposes I'm ignoring any matches beyond the first.  From these I can get the length of the entire match and the length of each subgroup.  So I now have what was matched by the RegEx, where it is, and how long it is.  I also have what was matched by each subgroup and how long each is.

My question is:  Is there a way to reliably retrieve the OFFSET to each subgroup of the matched regular expression?

Edited by Sam.

I don't know WeiDU well, so I don't know how to code this in WeiDU (or whether it is even possible), but I guess the offset of every subgroup within the whole regexp match (subgroup 0) can be found using just a find-substring function and string indexing:
1. For every subgroup match string (sms_i):
2. Search for sms_i in sms_0.
3. If only one instance is found, we are done with sms_i: save its left and right boundaries (offset, offset + size(sms_i)) and go to the next sms_i.
4. If more than one instance is found, take the first one that is (a) located after all the left boundaries (offsets) already found for previous subgroups and (b) located after the right boundaries of every subgroup whose parenthesis was already closed in the regexp before the opening parenthesis of this subgroup. Save its boundaries and go to the next sms_i.

Subgroups are numbered in the same order as the opening parentheses in the regexp, and the regexp engine takes the first possible match for a subgroup, so the algorithm should work.
For step 4 you need to go through the regexp once and find the indexes of the opening and closing parentheses for every subgroup, ignoring all other characters. So if the parentheses in a regexp look like "((())()())", with the characters indexed 0-9,
the parenthesis map for the subgroups would be: 1->0,9; 2->1,4; 3->2,3; 4->5,6; 5->7,8. If the algorithm above finds more than one instance of the substring for, e.g., the 4th subgroup, the right one should be the first instance located to the right of the offsets of subgroups 1-3 and to the right of the right boundaries of subgroups 2 and 3.
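
A rough Python sketch of the steps above, for illustration only. The function names (group_parens, paren_map, guess_offsets) are made up for this example. It assumes groups are spelled \( and \) as in the WeiDU pattern in this thread, ignores bracket expressions, reads "after" a boundary as "at or after", and reports -1 for an empty submatch. It is a heuristic, so it can still pick the wrong instance when the same text occurs more than once.

```python
def group_parens(regex):
    # Keep only the group parentheses of a regexp whose groups are written
    # as backslash-paren. Assumption: no bracket expressions such as [\(].
    kept = []
    i = 0
    while i < len(regex):
        if regex[i] == "\\" and i + 1 < len(regex):
            if regex[i + 1] in "()":
                kept.append(regex[i + 1])
            i += 2  # skip the escaped character
        else:
            i += 1
    return "".join(kept)


def paren_map(parens):
    # Map subgroup number -> (index of its "(", index of its ")").
    spans = {}
    stack = []
    next_group = 1
    for i, ch in enumerate(parens):
        if ch == "(":
            stack.append((next_group, i))
            next_group += 1
        elif ch == ")":
            group, open_index = stack.pop()
            spans[group] = (open_index, i)
    return spans


def guess_offsets(whole, subs, spans):
    # whole is subgroup 0; subs[g - 1] is the text matched by subgroup g.
    # Returns {group: offset within whole}, -1 if empty or not located.
    offsets = {}
    for g in range(1, len(subs) + 1):
        sub = subs[g - 1]
        if sub == "":
            offsets[g] = -1
            continue
        lower = 0
        for p in range(1, g):
            if offsets[p] < 0:
                continue
            # Left boundary of every earlier subgroup.
            lower = max(lower, offsets[p])
            # Right boundary of earlier subgroups already closed in the regexp.
            if spans[p][1] < spans[g][0]:
                lower = max(lower, offsets[p] + len(subs[p - 1]))
        offsets[g] = whole.find(sub, lower)
    return offsets


if __name__ == "__main__":
    print(paren_map("((())()())"))
    spans = paren_map(group_parens(r"\(a\)\(b\)\(a\)\(b\)"))
    print(guess_offsets("abab", ["a", "b", "a", "b"], spans))
```

The returned offsets are relative to the start of the whole match, so the offset of subgroup 0 in the buffer would have to be added to get buffer positions.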

I hope it makes sense. I can write the code in Python if I wasn't clear enough.


I'm not sure I follow.  As a purely contrived example, let's say my string is ~IETME~ and my RegEx is ~^I\(t*\).T.*\(.\)$~.  My current results (looking for matched values) look something like:

The RegEx '^I\(t*\).T.*\(.\)$' was found at position 0.  The matched pattern is 'IETME'.
RegEx SubPattern 0 is 'IETME'
RegEx SubPattern 1 is ''
RegEx SubPattern 2 is 'E'

I already have the offset for SubPattern 0.  The offset for an empty matched SubPattern should probably be -1 (i.e. not found).  Can you step me through how you're suggesting I get the offset to SubPattern 2?
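
For reference, the target values for this contrived example can be checked outside WeiDU. The sketch below uses Python's re module, not WeiDU, so the groups are written ( ) instead of \( \), and group_offsets is a made-up helper name. It only shows what the subgroup offsets should come out as. Note that Python reports the position of the empty SubPattern 1 rather than -1.

```python
import re


def group_offsets(pattern, text):
    # Return [(matched text, start offset)] for group 0 and every subgroup
    # of the first match, or None if the pattern does not match.
    m = re.search(pattern, text)
    if m is None:
        return None
    return [(m.group(n), m.start(n)) for n in range(m.re.groups + 1)]


if __name__ == "__main__":
    # Python spelling of the WeiDU pattern ~^I\(t*\).T.*\(.\)$~
    for n, (value, offset) in enumerate(group_offsets(r"^I(t*).T.*(.)$", "IETME")):
        print(n, repr(value), offset)
    # A plain substring search for SubPattern 2 finds the first 'E' instead.
    print("IETME".find("E"))
```

SubPattern 2 has to be the last character because it is followed by $, so its offset is 4, while a plain search for 'E' in 'IETME' lands on the earlier 'E' at offset 1. This is the kind of case where searching for the submatch text alone is ambiguous.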

