Zugg Software :: View topic

Wizard Joined: 26 Mar 2008 Posts: 1547

So, I've never used {#,#} on my own before, and apparently I'm using it wrong.

SubAdmin Joined: 18 Nov 2001 Posts: 5182

The problem is actually with "[\w'- ]". Inside a character class "[]", a dash signifies a range that includes every character between its two endpoints. For example, "[a-z]" means the same as "[abcdefghijklmnopqrstuvwxyz]". To use a literal dash within a class, you must either escape it or place it as the last character of the class. Correcting that specific item gives either "[\w' -]" or "[\w'\- ]".
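A quick way to check the dash placement is to try the three variants outside the client. This is only a sketch in Python's re module, which is an assumption on my part (the triggers in this thread run in the client's own regex engine), and the sample text is made up.

```python
import re

# Hyphen placed last, or escaped, is a literal character inside the class.
hyphen_last = re.compile(r"[\w' -]+")
hyphen_escaped = re.compile(r"[\w'\- ]+")

sample = "Jack-o'-lantern seed"  # hypothetical text, not from the game
matches_last = hyphen_last.fullmatch(sample) is not None
matches_escaped = hyphen_escaped.fullmatch(sample) is not None

# Hyphen between "'" and " " is read as a range, and that range runs
# backwards (apostrophe sorts after space), so Python refuses to compile it.
try:
    re.compile(r"[\w'- ]+")
    range_rejected = False
except re.error:
    range_rejected = True

print(matches_last, matches_escaped, range_rejected)
```

Other engines may report the bad range differently, so treat the error handling here as Python-specific.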

The usage of {x,y} indicates to repeat the previous sequence a minimum of x times and a maximum of y times. In order for this to be effective you have to remove the space from the class which is covered by "+". The plus modifier for a sequence is the same as {1,}, and means minimum 1, maximum as much as possible. You want to limit your matching to 6 full words as an accelerator, but that can only be done by moving the space out of the class and into the position of the last item controlled by the {x,y} repetition. So far the modification is "([\w'-]+ ){1,6}".
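To see the {1,6} limit at work, the sketch below counts how many leading words the pattern accepts. It uses Python's re module as a stand-in for the client's engine, and the one-letter "words" are placeholders rather than real game output.

```python
import re

# Each repetition is one word followed by one space; {1,6} allows one to six.
limited = re.compile(r"^([\w'-]+ ){1,6}Fireproof has dropped!$")

def word_count_ok(n):
    # Build a hypothetical line with n one-letter words before the fixed text.
    line = "x " * n + "Fireproof has dropped!"
    return limited.match(line) is not None

accepted = [n for n in range(0, 9) if word_count_ok(n)]
print(accepted)  # [1, 2, 3, 4, 5, 6]
```

Zero words and seven or more words are both rejected, which is the bounded behaviour the braces are there to provide.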

Then you want the capture to actually contain all the words matched. As it stands at "([\w'-]+ ){1,6}", when a repetition occurs it dumps the previous capture value. This is because both the opening and closing parentheses for the capture are repeated. To fix this, use a non-capturing group for the repetition and place the capturing group outside of that: "((?:[\w'-]+ ){1,6})".
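The difference between the two groupings is easy to see side by side. A minimal sketch, assuming Python's re module behaves like the client's engine on this point (it also keeps only the last iteration of a repeated capture); the sample text is invented.

```python
import re

text = "Ring of Fire "  # hypothetical words, each followed by a space

# Capturing group is itself repeated: only the last iteration survives.
repeated_capture = re.match(r"([\w'-]+ ){1,6}", text).group(1)

# Non-capturing group repeats inside one outer capture: all words survive.
outer_capture = re.match(r"((?:[\w'-]+ ){1,6})", text).group(1)

print(repr(repeated_capture))  # 'Fire '
print(repr(outer_capture))     # 'Ring of Fire '
```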

Finally, this means the space at the beginning of the solid text " Fireproof has dropped!" is already captured. You could make that space optional, but it is much faster to trim it from your captured data. The final pattern is "^((?:[\w'-]+ ){1,6})Fireproof has dropped!$".
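Here is the final pattern with the trim step applied afterwards. It is a sketch in Python, where rstrip() stands in for whatever trim function your script uses; dropped_item and the sample lines are hypothetical names made up for the illustration.

```python
import re

final = re.compile(r"^((?:[\w'-]+ ){1,6})Fireproof has dropped!$")

def dropped_item(line):
    """Return the captured words without the trailing space, or None."""
    m = final.match(line)
    if m is None:
        return None
    # The capture ends with the space that used to precede "Fireproof".
    return m.group(1).rstrip()

print(dropped_item("Ring of Fire Fireproof has dropped!"))  # Ring of Fire
print(dropped_item("Nothing happened."))                    # None
```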

Why you marked it as a capture with the code you are presenting is beyond me, but since you did I made sure to explain all those details as well. A small speed gain is available whenever a non-capturing group is used instead of a capturing group.

Wizard Joined: 26 Mar 2008 Posts: 1547

Thank you, I think I understand that better now. Time to tackle input triggers!

Wizard Joined: 26 Mar 2008 Posts: 1547

SubAdmin Joined: 18 Nov 2001 Posts: 5182

First change: the inner parentheses should be non-capturing. This improves the speed slightly.
^((?:[\w'-]+){1,2}) pages: .*
Next is to add the additional character; specifically the period.
^((?:[\w.'-]+){1,2}) pages: .*
Now we have to understand what "[\w.'-]+" will match. This is going to cover as many characters as possible from this group "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_.'-abcdefghijklmnopqrstuvwxyz". So it will match "St.", "Croix", "Erin", or "Macquarie", etc. It does not match a space. The space is a separate item and is the entire reason to use the repeat range.
^((?:[\w.'-]+ ){1,2}) pages: .*
Putting the space inside the repeating range means it needs to be removed from the outside.
^((?:[\w.'-]+ ){1,2})pages: .*
Now you can set the repeat numbers to whatever you want; you just have to remember to %trim off the extra space in your code.
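For anyone who wants to try the end result outside the client, here is a rough Python sketch. It keeps the period in the class from the second step, uses rstrip() as a stand-in for %trim, and the function name and sample lines are assumptions invented for the test.

```python
import re

# One or two words (letters, digits, underscore, period, apostrophe, hyphen),
# each followed by a space, then the fixed text.
pages = re.compile(r"^((?:[\w.'-]+ ){1,2})pages: .*")

def pager_name(line):
    m = pages.match(line)
    # Strip the trailing space that the repeated group always captures.
    return m.group(1).rstrip() if m else None

print(pager_name("St. Croix pages: anyone around?"))  # St. Croix
print(pager_name("Erin pages: hello"))                # Erin
print(pager_name("One Two Three pages: too many"))    # None
```

The third line fails because the repeat is capped at two words and the pattern is anchored at the start of the line.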

Wizard Joined: 26 Mar 2008 Posts: 1547

I knew I needed to place a space in there, but I couldn't get it to work right. I'm going to repeat it in my head a little and get it right. Sorry.

Wizard Joined: 26 Mar 2008 Posts: 1547

What wonderful fun, the space is the culprit again, but for a wholly different reason. In this particular pattern I need the space for between multiple words but not at the end, so that the colon is right after the last character of the last word. I cannot think of a way of delimiting the spaces like that, but I'm certain it has to do with how I arrange the space or how I tell it to repeat... does anyone know, or is there some reading that speaks specifically to this situation?

^Long distance to ((?:[\w'-.]+ ){1,3}): .*

Wizard Joined: 22 Mar 2007 Posts: 2320

In that situation, I would probably remove the curly braces and move the space inside the square bracket pattern:
^Long distance to ([\w'. -]+): .*
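A small comparison of the two approaches, sketched in Python rather than in the client. Two assumptions to flag: the sample line is made up, and I have written the hyphen last in each class so it is read as a literal character instead of forming a range.

```python
import re

line = "Long distance to St. Croix: 40 miles"  # hypothetical game output

# Brace form: every word must be followed by a space, so the colon
# directly after the last word can never match.
brace_form = re.compile(r"^Long distance to ((?:[\w'.-]+ ){1,3}): .*")

# Single class with the space inside: spaces between words are consumed,
# and matching stops at the colon because it is not in the class.
class_form = re.compile(r"^Long distance to ([\w'. -]+): .*")

brace_result = brace_form.match(line)
class_result = class_form.match(line).group(1)

print(brace_result)   # None
print(class_result)   # St. Croix
```

No trimming is needed here because the capture ends on the last word, not on a space.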

Wizard Joined: 26 Mar 2008 Posts: 1547

Oh, that would effectively cover multiple words too? What's the point of the curly brackets then? To define a number of words when there isn't a constant following after?

Wizard Joined: 22 Mar 2007 Posts: 2320

+ means one or more of the previous pattern. I'm not very familiar with the curly brace syntax, but it looks to me like it is used when you want to specify a specific number of repeats, or a range of repeats. In your case, it was matching 1 to 3 instances of the pattern. By removing the curly braces and putting the space in the square brackets, the + will match any number of words separated by apostrophes, hyphens, periods, or spaces. At least, according to my understanding. I use regexes only occasionally.

Wizard Joined: 26 Mar 2008 Posts: 1547

Right, so I'm thinking curly is for when there isn't a defined constant to continue the pattern like ':', I was just so focused on that because of past experiences using it that I didn't realize I didn't need it, so thanks, I think I understand both methods better now as a result.

SubAdmin Joined: 18 Nov 2001 Posts: 5182

The use of the brace syntax is for a controlled repeat. The reason to use it is to cause backtracking to give up an entire word at a time. This provides a small speed increase when the repeated section is followed by a wildcard or list. The regex where I first used this with you had a list following the repeated words. There it was a definite speed enhancement. With this particular regex it is not nearly as useful, but I still wanted to explain the syntax.
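To picture the case described here, the sketch below puts a list of alternatives after the repeated words. The alternatives and the sample line are invented for illustration (the original list is not shown in this thread), and Python's re module is only a stand-in for the client's engine. It shows the result of the match, not the speed difference.

```python
import re

# Controlled repeat of whole words, followed by a hypothetical list.
with_list = re.compile(
    r"^((?:[\w'-]+ ){1,6})(Fireproof|Waterproof|Shockproof) has dropped!$"
)

m = with_list.match("Cloak of the Deep Waterproof has dropped!")

# The repeat first grabs up to six words, then gives them back one whole
# word at a time until the list and the fixed text can match.
words = m.group(1).rstrip()
kind = m.group(2)

print(words)  # Cloak of the Deep
print(kind)   # Waterproof
```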