44 The Rx Regular Expression Library

[FIXME: this is taken from Gary and Mark's quick summaries and should be reviewed and expanded. Rx is pretty stable, so could already be done!]

The guile-lang-allover package provides an interface to Tom Lord's Rx library (currently only to POSIX regular expressions). Using the library is a two-step process: first compile a regular expression into an efficient structure, then use that structure in any number of string comparisons.

For example, given the regular expression abc. (which matches any string containing abc followed by any single character):

guile> (define r (regcomp "abc."))
guile> r
#<rgx abc.>
guile> (regexec r "abc")
#f
guile> (regexec r "abcd")
#((0 . 4))
guile>

The definitions of regcomp and regexec are as follows:

regcomp pattern [flags] Scheme Procedure
Compile the regular expression pattern using POSIX rules. flags is optional and should be specified using the following symbolic names:

REG_EXTENDED Variable
use extended POSIX syntax

REG_ICASE Variable
use case-insensitive matching

REG_NEWLINE Variable
allow anchors to match after newline characters in the string, and prevent . or [^...] from matching newlines.

The logior procedure can be used to combine multiple flags. The default is to use POSIX basic syntax, which makes + and ? literals and \+ and \? operators. Backslashes in pattern must be escaped when pattern is written as a literal string, e.g., "\\(a\\)\\?".
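The following sketch shows how the flags and the backslash rule might be used together. It relies only on regcomp, logior and the flag variables described above; the variable names (basic-rx and so on) are made up for illustration, and the patterns themselves are arbitrary.

```scheme
;; Basic syntax (the default): \( \) delimit a group and \? is an
;; operator, so every backslash is doubled inside the string literal.
(define basic-rx (regcomp "\\(a\\)\\?"))

;; Extended syntax: ( ) and ? are operators without backslashes.
(define extended-rx (regcomp "(a)?" REG_EXTENDED))

;; Several flags combined with logior: extended syntax plus
;; case-insensitive matching.
(define ci-rx (regcomp "(abc)+d?" (logior REG_EXTENDED REG_ICASE)))
```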

regexec regex string [match-pick] [flags] Scheme Procedure
Match string against the compiled POSIX regular expression regex. match-pick and flags are optional. Possible flags (which can be combined using the logior procedure) are:

REG_NOTBOL Variable
The beginning-of-line operator will not match at the beginning of string (for use when string does not start at the beginning of a line).

REG_NOTEOL Variable
Similar to REG_NOTBOL, but prevents the end of line operator from matching the end of string.

If no match is possible, regexec returns #f. Otherwise match-pick determines the return value:

#t or unspecified: a newly-allocated vector is returned, containing pairs with the indices of the matched part of string and any substrings.

"": a list is returned: the first element contains a nested list with the matched part of string surrounded by the unmatched parts. Remaining elements are matched substrings (if any). All returned substrings share memory with string.

#f: regexec returns #t if a match is made, otherwise #f.

vector: the supplied vector is returned, with the first element replaced by a pair containing the indices of the matched portion of string and further elements replaced by pairs containing the indices of matched substrings (if any).

list: a list will be returned, with each member of the list specified by a code in the corresponding position of the supplied list:

a number: the numbered matching substring (0 for the entire match).

#\<: the beginning of string to the beginning of the part matched by regex.

#\>: the end of the matched part of string to the end of string.

#\c: the "final tag", which seems to be associated with the "cut operator", which doesn't seem to be available through the posix interface.

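For example, (list #\< 0 1 #\>) requests the text before the match, the entire match, the first matched substring, and the text after the match. The returned substrings share memory with string.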
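As a rough illustration of the match-pick forms and of passing flags, the calls below use only the procedures and variables described above. The names r, anchored and s are chosen for this sketch, and return values are deliberately not shown because they have not been checked against a running Rx build.

```scheme
(define r (regcomp "abc."))
(define s "xxabcdyy")

;; match-pick omitted (same as #t): a new vector of index pairs.
(regexec r s)

;; match-pick #f: only report whether a match was made.
(regexec r s #f)

;; match-pick as a list of codes: the text before the match,
;; the entire match (0), and the text after the match.
(regexec r s (list #\< 0 #\>))

;; flags is the fourth argument, so match-pick has to be given
;; explicitly before it.  REG_NOTBOL says that the string does not
;; start a line, so ^ may not match at the beginning of the string.
(define anchored (regcomp "^abc"))
(regexec anchored "abcd" #f REG_NOTBOL)
```

Because the substrings returned for the list form share memory with string, it may be safer to copy them before modifying the original string.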

Here are some other procedures that may be useful when working with regular expressions:

compiled-regexp? obj Scheme Procedure
Test whether obj is a compiled regular expression.
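One possible use of compiled-regexp? is a small helper that accepts either a pattern string or an already compiled expression. The helper names ensure-compiled and matches? are hypothetical and not part of the library; this is only a sketch built on regcomp and regexec as described above.

```scheme
;; Hypothetical helper: return rx unchanged if it is already
;; compiled, otherwise treat it as a pattern string and compile it.
(define (ensure-compiled rx)
  (if (compiled-regexp? rx)
      rx
      (regcomp rx)))

;; Hypothetical helper: test for a match, given a pattern string or
;; a compiled expression.  Passing #f as match-pick asks regexec for
;; a plain true/false answer.
(define (matches? rx str)
  (regexec (ensure-compiled rx) str #f))
```

Compiling once and reusing the result, as ensure-compiled allows, follows the two-step process described at the start of this chapter.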

regexp->dfa regex [flags] Scheme Procedure

dfa-fork dfa Scheme Procedure

reset-dfa! dfa Scheme Procedure

dfa-final-tag dfa Scheme Procedure

dfa-continuable? dfa Scheme Procedure

advance-dfa! dfa string Scheme Procedure