FAQ

I found a terrible bug: TRegExpr raises an Access Violation exception!

Answer

You must create the object before using it. After declaring a variable such as:

r: TRegExpr;

do not forget to create the object instance:

r := TRegExpr.Create;
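
As a minimal sketch of the full create/use/free cycle (the program name, pattern, and input string are illustrative assumptions, not taken from the TRegExpr demos), the object can be wrapped in try..finally so it is always released:

```pascal
program CreateBeforeUse;
{$IFDEF FPC}{$mode delphi}{$ENDIF}

uses
  RegExpr;

var
  r: TRegExpr;
begin
  // Without this call r is only an uninitialised reference, and any
  // access to it may raise an Access Violation.
  r := TRegExpr.Create;
  try
    r.Expression := '\d+';          // illustrative pattern
    if r.Exec('abc 123') then
      WriteLn(r.Match[0]);          // the first run of digits
  finally
    r.Free;                         // release the instance
  end;
end.
```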

Does it support Unicode?

Answer

See "How to use Unicode".

Why does TRegExpr return more than one line?

For example, the r.e. <font .*> returns the first <font, then the rest of the file including the last </html>.

Answer

For backward compatibility, modifier /s is on by default.

Switch it off and . will match any character except line separators, which is exactly what you want here.

By the way, I suggest <font ([^\n>]*)>; the URL will then be in Match[1].
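
The effect of modifier /s can be sketched as follows. The pattern <font .*> and the three-line input are hypothetical, chosen only to illustrate the behaviour; ModifierS is the TRegExpr property that corresponds to /s.

```pascal
program ModifierSDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}

uses
  RegExpr;

const
  // Hypothetical three-line input.
  Html = '<font size=2>'#13#10'text</font>'#13#10'</html>';

var
  r: TRegExpr;
begin
  r := TRegExpr.Create;
  try
    r.Expression := '<font .*>';

    r.ModifierS := True;            // . also matches line separators
    if r.Exec(Html) then
      WriteLn('/s on,  match length: ', r.MatchLen[0]);

    r.ModifierS := False;           // . stops at line separators
    if r.Exec(Html) then
      WriteLn('/s off, match: ', r.Match[0]);  // stays on the first line
  finally
    r.Free;
  end;
end.
```

With /s on, the match runs to the last > in the input; with /s off, it cannot leave the first line.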

Why does TRegExpr return more than I expect?

For example, the r.e. (.+) applied to the string ab returns ab, not a as I expected.

Answer

By default all operators work in greedy mode, so they match as much as possible.

If you want non-greedy matching, you can use non-greedy operators such as +?, or switch all operators into non-greedy mode with modifier g (use the appropriate TRegExpr property or the inline operator (?-g) in the r.e.).
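
The three variants described above can be compared side by side. This is a small sketch using the ab example from the question; the helper procedure Show is a hypothetical name introduced only for this illustration.

```pascal
program GreedyDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}

uses
  RegExpr;

// Hypothetical helper: runs Pattern against 'ab' and prints group 1.
procedure Show(const Pattern: string; Greedy: Boolean);
var
  r: TRegExpr;
begin
  r := TRegExpr.Create;
  try
    r.Expression := Pattern;
    r.ModifierG := Greedy;          // False switches all operators to non-greedy
    if r.Exec('ab') then
      WriteLn(Pattern, ' (g=', Greedy, ') -> ', r.Match[1]);
  finally
    r.Free;
  end;
end;

begin
  Show('(.+)',  True);    // greedy: captures ab
  Show('(.+?)', True);    // non-greedy operator: captures a
  Show('(.+)',  False);   // modifier g switched off: captures a
end.
```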

How do I parse sources like HTML with the help of TRegExpr?

Answer

Sorry folks, but it's nearly impossible!

Of course, you can easily use TRegExpr to extract some information from HTML, as shown in my examples, but if you want accurate parsing you have to use a real parser, not an r.e.

For a full explanation, see for example the Perl Cookbook by Tom Christiansen and Nathan Torkington.

In short, there are many structures that a real parser handles easily but an r.e. cannot handle at all. A real parser is also much faster, because an r.e. does not simply scan the input stream; it performs an optimizing search that can take a lot of time.

Is there a way to get multiple matches of a pattern with TRegExpr?

Answer

You can iterate over matches with the ExecNext method.

For an example, take a look at the implementation of the TRegExpr.Replace method or at the examples for HyperLinksDecorator.
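
A typical iteration loop might look like the sketch below: Exec finds the first match and ExecNext continues from the end of the previous one. The pattern and the input string are illustrative assumptions.

```pascal
program ExecNextDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}

uses
  RegExpr;

var
  r: TRegExpr;
begin
  r := TRegExpr.Create;
  try
    r.Expression := '\d+';
    // Exec finds the first match; ExecNext finds each following one.
    if r.Exec('10 apples, 25 pears, 7 plums') then
      repeat
        WriteLn(r.Match[0], ' at position ', r.MatchPos[0]);
      until not r.ExecNext;
  finally
    r.Free;
  end;
end.
```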

I am checking user input. Why does TRegExpr return `True` for wrong input strings?

Answer

In many cases TRegExpr users forget that a regular expression searches within the input string.

So, for example, the expression \d{4,4} also succeeds on wrong user inputs such as 12345 or any letters 1234.

You have to match from line start to line end to ensure there is nothing else around it: ^\d{4,4}$.
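
The difference between searching and validating can be sketched with a small helper. The function name Matches is a hypothetical wrapper written for this example, not part of TRegExpr.

```pascal
program AnchorDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}

uses
  RegExpr;

// Hypothetical helper: True if Pattern is found anywhere in Input.
function Matches(const Pattern, Input: string): Boolean;
var
  r: TRegExpr;
begin
  r := TRegExpr.Create;
  try
    r.Expression := Pattern;
    Result := r.Exec(Input);
  finally
    r.Free;
  end;
end;

begin
  // Unanchored: succeeds because four digits occur somewhere in the input.
  WriteLn(Matches('\d{4,4}', '12345'));
  WriteLn(Matches('\d{4,4}', 'any letters 1234'));

  // Anchored: succeeds only if the whole input is exactly four digits.
  WriteLn(Matches('^\d{4,4}$', '12345'));
  WriteLn(Matches('^\d{4,4}$', 'any letters 1234'));
  WriteLn(Matches('^\d{4,4}$', '1234'));
end.
```

The first two calls succeed, the next two fail, and the last one succeeds.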

Why do non-greedy iterators sometimes work as in greedy mode?

For example, the r.e. a+?,b+? applied to the string aaa,bbb matches aaa,b. Shouldn't it match a,b because the first iterator is non-greedy?

Answer

This is a consequence of how TRegExpr works. In fact, many other r.e. engines work exactly the same way: they perform only simple search optimization and do not try to find the best possible match.

In some cases that is bad, but in general it is an advantage rather than a limitation, for reasons of performance and predictability.

The main rule: the r.e. first tries to match from the current position, and only if that is completely impossible does it move forward by one character and try again from the next position in the text.

So a,b+? matches a,b. With a+?,b+?, matching more than one a is discouraged (because of the non-greedy modifier) but still possible, so TRegExpr does it.

Like Perl's or Unix's r.e., TRegExpr does not move forward to check whether a later position would give a "better" match, first of all because there is no way to say that one match is better or worse than another.
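
The rule can be observed by printing where each match starts. In this sketch (the helper procedure Show is a hypothetical name), both expressions from the answer are applied to aaa,bbb.

```pascal
program LeftmostDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}

uses
  RegExpr;

// Hypothetical helper: prints the match and its start position in 'aaa,bbb'.
procedure Show(const Pattern: string);
var
  r: TRegExpr;
begin
  r := TRegExpr.Create;
  try
    r.Expression := Pattern;
    if r.Exec('aaa,bbb') then
      WriteLn(Pattern, ' -> ', r.Match[0], ' (starts at ', r.MatchPos[0], ')');
  finally
    r.Free;
  end;
end;

begin
  // A match is possible from position 1, so the engine extends a+? as far
  // as needed to reach the comma instead of moving the start forward.
  Show('a+?,b+?');   // aaa,b starting at 1

  // A single a cannot match at positions 1 or 2, so the start moves forward.
  Show('a,b+?');     // a,b starting at 3
end.
```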

How can I use TRegExpr with Borland C++ Builder?

I have a problem since no header file (.h or .hpp) is available.

Answer

Add RegExpr.pas to your C++ Builder project.
Compile the project. This generates the header file RegExpr.hpp.
Now you can write code that uses the RegExpr unit.
Don't forget to add #include "RegExpr.hpp" where needed.
Don't forget to replace every \ in regular expressions with \\, or redefine the EscChar constant.

Why do many r.e. (including those from the TRegExpr help and demo) work incorrectly in Borland C++ Builder?

Answer

The hint is in the previous question ;) The symbol \ has a special meaning in C++, so you have to escape it (as described in the previous answer). But if you don't like r.e. such as \\w+\\w+\\.\\w+, you can redefine the constant EscChar (in RegExpr.pas), for example EscChar = '/'. Then you can write /w+/w+/./w+, which looks unusual but is more readable.
