The FAST_REGEX_LOG_PARSE works by first decomposing the regular expression into a series of regular expressions, one for each expression inside a group and one for each expression outside a group. Any fixed length portions at the start of any expressions are moved to the end of the previous expression. If any expression is entirely fixed length, it is merged with the previous expression. The series of expressions is then evaluated using lazy semantics with no backtracking. (In regular expression parsing parlance, "lazy" means don't parse more than you need to at each step. "Greedy" means parse as much as you can at each step.)
The columns returned will be COLUMN1 through COLUMNn, where n is the number of groups in the regular expression. The columns will be of type varchar(1024). See sample usage below at First FRLP Example and at Further FRLP Examples.
This topic contains the following subtopics:
FAST_REGEX_LOG_PARSE('input_string', 'fast_regex_pattern')
FAST_REGEX_LOG_PARSER uses a lazy search - it stops at the first match. By contrast, the default java regex_parser is greedy unless possessive quantifiers are used.
FAST_REGEX_LOG_PARSE scans the supplied input string for all the characters specified by the Fast Regex pattern.
Fast Regex uses a different set of character class symbols from the regular regex parser:
Symbol or Construct | Meaning |
---|---|
- | Character range, including endpoints |
[ charclasses ] | Character class |
[^ charclasses ] | Negated character class |
| | Union |
& | Intersection |
? | Zero or one occurrence |
* | Zero or more occurrences |
+ | One or more occurrences |
{n} | n occurrences |
{n,} | n or more occurrences |
{n,m} | n to m occurrences, including both |
. | Any single character |
# | The empty language |
@ | Any string |
"<Unicode string without double-quotes>" | A string) |
( ) | The empty string) |
( unionexp ) | Precedence override |
< <identifier> > | Named pattern |
<n-m> | Numerical interval |
charexp:=<Unicode character> | A single non-reserved character |
\ |
A single character |
We support the following POSIX standard identifiers as named patterns:
This first example uses the following Fast Regex pattern
(.*)_(._.*)_.*
Example:
select t.r."COLUMN1", t.r."COLUMN2" from (values (FAST_REGEX_LOG_PARSE('Mary_had_a_little_lamb', '(.*)_(._.*)_.*'))) t(r);
+------------------------+------------------------+
| COLUMN1 | COLUMN2 |
+------------------------+------------------------+
| Mary_had | a_little |
+------------------------+------------------------+
(.*)
_
(._.*)
means any 3-character string with an underscore in the middle followed by at least one more character. The first occurrence of such a string within the input string is at "a_l". Once the parser finds a group-2 match, it can determine contents of the first column: "Mary_had".
)_.*
This means an underscore followed by any number of other characters. That specification is matched by "_little_lamb". after which the input_string ends. At this point, all the characters in the input string have been accounted for by the parser. The second column contains "a_little_lamb".
The next example uses a "+", which means repeat the last expression 1 or more times ("*" means 0 or more times).
A. In this case, the longest prefix is the first underscore. The first field/column group will match on "Mary" and the second will not match.
select t.r."COLUMN1", t.r."COLUMN2" from (values (FAST_REGEX_LOG_PARSE('Mary_had_a_little_lamb', '(.*)_+(._.*)'))) t(r);
+----------+----------+
| COLUMN1 | COLUMN2 |
+----------+----------+
+----------+----------+
No rows selected
The above example returns no fields because the "+" required there be at least one more underscore-in-a-row; and the input_string does not have that.
B. In the following case, the '+' is superfluous because of the lazy semantics:
select t.r."COLUMN1", t.r."COLUMN2" from (values (FAST_REGEX_LOG_PARSE('Mary____had_a_little_lamb', '(.*)_+(.*)'))) t(r);
+---------------------------+---------------------------+
| COLUMN1 | COLUMN2 |
+---------------------------+---------------------------+
| Mary | ___had_a_little_lamb |
+---------------------------+---------------------------+
The above example succeeds in returning two fields because after finding the multiple underscores required by the "_+" specification, the group-2 specification (.*) accepts all remaining characters in the .input_string. Underscores do not appear trailing "Mary" nor leading "had" because the "_+" specification is not enclosed in parentheses.
As mentioned in the introduction, "lazy" in regular expression parsing parlance means don't parse more than you need to at each step; "Greedy" means parse as much as you can at each step.
The first case in this topic, A, fails because when it gets to the first underscore, the regex processor has no way of knowing without backtracking that it can't use the underscore to match "_+", and FRLP doesn't backtrack, whereas REGEX_LOG_PARSE does.
The search directly above, B, gets turned into three searches:
(.*)_
_*(._
.*)
Notice that the second field group gets split between the second and third searches, also that "_+" is considered the same as "__*". That is, it considers "underscore repeat-underscore-1-or-more-times" the same as "underscore underscore repeat-underscore-0-or-more-times".)
Case A demonstrates the main difference between REGEX_LOG_PARSE and FAST_REGEX_LOG_PARSE, because the search in A would work under REGEX_LOG_PARSE because that function would use backtracking.
C. In the following example, the plus is not superfluous, because the "
select t.r."COLUMN1", t.r."COLUMN2" from (values (FAST_REGEX_LOG_PARSE('Mary____had_a_little_lamb', '(.*)_+(<Alpha>.*)'))) t(r);
+---------------------------+---------------------------+
| COLUMN1 | COLUMN2 |
+---------------------------+---------------------------+
| Mary | had_a_little_lamb |
+---------------------------+---------------------------+
'(.*) +(
'.* '
' *<Alpha>'
'.*$'
Each is matched in turn using lazy semantics.
The columns returned will be COLUMN1 through COLUMNn, where n is the number of groups in the regular expression. The columns will be of type varchar(1024).