This document outlines the ICU Regular Expressions package, which provides robust pattern matching for Unicode strings using a syntax based on Perl and a C++ API similar to Java's java.util.regex. It details the comprehensive support for standard matching, finding, replacing, and splitting operations, while strictly conforming to Unicode Technical Standard #18. Additionally, the guide offers essential performance tips to avoid backtracking issues and highlights key differences from Java's implementation to assist developers in writing efficient, Unicode-aware code.
ICU's Regular Expressions package empowers applications to apply pattern matching to Unicode string data. The behavior and pattern syntax are derived from Perl, making them powerful and flexible. For developers familiar with Java, the C++ programming API is loosely based on the JDK 1.4 package java.util.regex, though it includes specific extensions to function smoothly within a C++ environment. A plain C API is also available for those who need it.
The API supports a wide range of standard operations. You can test for a pattern match, search for patterns within text, and replace matched text. It also supports capture groups, allowing you to identify subranges within a match and use them in replacement text. Additionally, a Perl-inspired split() function is included to help break strings into fields based on delimiters.
ICU Regular Expressions conform to version 19 of Unicode Technical Standard #18 (Unicode Regular Expressions). They support Level 1 and, from Level 2, Default Word Boundaries and Name Properties.
Matching behavior can sometimes be surprising, and the book Mastering Regular Expressions by Jeffrey E. F. Friedl is highly recommended for anyone doing significant work with regular expressions.
The C++ API revolves around two primary classes: RegexPattern and RegexMatcher. RegexPattern represents the compiled regular expression, while RegexMatcher is the engine that associates that pattern with an input string to perform matching. In practice, you mostly interact with RegexMatcher.
To use a regular expression, you typically create a RegexMatcher from a pattern string. This object holds the compiled pattern and a reference to the text you want to check. A key feature is that matchers can be reset and reused with new input strings. This avoids the overhead of creating new objects when performing the same operation on different texts.
Note that matching happens directly in the string supplied by the application. This reduces the overhead when resetting a matcher to an absolute minimum – the matcher need only store a reference to the new string – but it does mean that the application must be careful not to modify or delete the string while the matcher is holding a reference to the string.
There are different ways to test for matches:
matches(): Returns true only if the pattern matches the entire string.
lookingAt(): Returns true if the pattern matches at the start of the string.
find(): Returns true if the pattern matches anywhere within the string.
Once a match is found, you can access detailed information about the result. Functions like start(), end(), and group() allow you to retrieve indices and the actual text matched by the pattern or specific capture groups.
The library supports a massive list of character representations and metacharacters. It includes standard escapes like \d (digits) and \w (word characters), but also deep Unicode support:
\N{UNICODE CHARACTER NAME}: Matches a character by its name.
\p{UNICODE PROPERTY NAME}: Matches any character with a specific property.
\X: Matches a Grapheme Cluster.
ICU supports a comprehensive set of operators for constructing patterns. You have access to standard tools like alternation (|), capturing parentheses (...), and various quantifiers:
*, +, ? (Match as much as possible).
*?, +?, ?? (Match as little as possible).
*+, ++, ?+ (Match as much as possible and do not backtrack).
Possessive quantifiers are particularly useful for performance, as they commit to a match and do not retry with fewer characters if the overall match fails. The library also supports advanced assertions like look-ahead (?= ...) and look-behind (?<= ...) to check context without consuming characters.
Character classes (sets) in ICU are extremely powerful. Beyond standard ranges like [a-z], you can utilize Unicode properties and set arithmetic:
[\p{Letter}&&\p{script=cyrillic}] matches only Cyrillic letters.
[\p{Letter}--\p{script=latin}] matches all letters except Latin ones.
[a-zA-Z0-9] matches letters and digits.
Case insensitivity in Unicode is more complex than in ASCII because changing the case of a character can change the string's length (e.g., "ß" can become "SS"). You can enable case insensitivity using the UREGEX_CASE_INSENSITIVE flag or the (?i) pattern option.
ICU handles this by performing full case folding on literal strings within the pattern. This means a pattern like "fussball" will match "fußball" even though the lengths differ.
With these rules, a match or capturing sub-match can never begin or end in the interior of an input text character that expanded when case folded.
Other useful flags include:
x: Allows whitespace and comments within the pattern.
s: Allows the . character to match line terminators.
m: Allows ^ and $ to match the start and end of individual lines, not just the whole text.
w: Uses Unicode UAX 29 definitions for finding word boundaries (\b), which provides more accurate results than simple character classification.
The split() function allows you to break a string into an array of fields using a regex match as the delimiter. It works similarly to Perl. The result is stored in an array of UnicodeString objects provided by the user.
If the number of fields in a string being split exceeds the capacity of the destination array, the last destination string will contain all of the input string data that could not be split, including any embedded field delimiters.
ICU provides robust tools for modifying text:
replaceFirst(): Replaces the first occurrence.
replaceAll(): Replaces all occurrences.
appendReplacement() & appendTail(): Allow for incremental replacement in a loop.
In the replacement text, you can refer to capture groups from the match using $n (where n is the group number) or ${name} for named groups.
Regular expressions can sometimes encounter severe performance issues, often called "catastrophic backtracking." This happens when the engine tries to match a failing input against a complex pattern with nested quantifiers, such as (A+)+B.
The running time for troublesome patterns is exponential with the length of the input string. Every added character in the input doubles the (non)matching time. It doesn't take a particularly long string for the projected running time to exceed the age of the universe.
To ensure your application runs smoothly:
Avoid nested quantifiers such as (A+)+, which create ambiguity and massive backtracking.
Use possessive quantifiers such as *+ or ++ to stop the engine from backtracking unnecessarily.
Heap Usage: ICU stores its backtracking state on the heap (defaulting to an 8 MB limit) rather than the stack. This design choice prevents stack overflow errors, which are common in other regex implementations when processing complex patterns.
While the API is similar to Java, there are distinct differences to be aware of:
The \p{punct} property in ICU strictly follows Unicode Technical Standard #18, resulting in slightly different matches compared to Java's implementation.
The ICU Regular Expressions package offers a powerful, standards-compliant way to handle text processing in C++. By providing deep integration with Unicode standards, robust performance controls, and a familiar API structure, it enables developers to build globalized applications effectively. However, users should be mindful of the specific differences from Java and follow performance best practices to handle complex patterns efficiently.