sub, gsub, and gensub
When using sub, gsub, or gensub, and trying to get literal
backslashes and ampersands into the replacement text, you need to remember
that there are several levels of escape processing going on.
First, there is the lexical level, which is when awk reads your program and builds an internal copy of it that can be executed. Then there is the runtime level, which is when awk actually scans the replacement string to determine what to generate.
At both levels, awk looks for a defined set of characters that
can come after a backslash. At the lexical level, it looks for the
escape sequences listed in Escape Sequences.
Thus, for every `\' that awk processes at the runtime
level, type two backslashes at the lexical level.
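The lexical level can be observed on its own by printing a string constant instead of passing it to sub. The following is a minimal sketch, assuming a POSIX shell and any awk on the PATH; the variable names seen_two and seen_four are purely illustrative.

```shell
# Print each string constant after lexical processing only.
# The output is the text that sub or gsub would scan at run time.
seen_two=$(awk 'BEGIN { print "\\&" }')       # typed \\&   -> awk holds \&
seen_four=$(awk 'BEGIN { print "\\\\&" }')    # typed \\\\& -> awk holds \\&
printf '%s\n' "$seen_two" "$seen_four"
```

Each pair of backslashes typed in the program becomes a single backslash in the string that the runtime level receives.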
When a character that is not valid for an escape sequence follows the
`\', Unix awk and gawk both simply remove the initial
`\' and put the next character into the string. Thus, for
example, "a\qb" is treated as "aqb".
At the runtime level, the various functions handle sequences of
`\' and `&' differently. The situation is (sadly) somewhat complex.
Historically, the sub and gsub functions treated the two-character
sequence `\&' specially; this sequence was replaced in
the generated text with a single `&'. Any other `\' within
the replacement string that did not precede an `&' was passed
through unchanged. This is illustrated in Table 8.1.
You type     sub sees     sub generates
--------     --------     -------------
\&           &            the matched text
\\&          \&           a literal `&'
\\\&         \&           a literal `&'
\\\\&        \\&          a literal `\&'
\\\\\&       \\&          a literal `\&'
\\\\\\&      \\\&         a literal `\\&'
\\q          \q           a literal `\q'
Table 8.1: Historical Escape Sequence Processing for sub and gsub
This table shows both the lexical-level processing, where
an odd number of backslashes becomes an even number at the runtime level,
as well as the runtime processing done by sub.
(For the sake of simplicity, the rest of the following tables only show the
case of even numbers of backslashes entered at the lexical level.)
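As a short illustration, the two simplest cases behave the same way under every rule set described in this section: a bare `&' stands for the matched text, and `\\&' typed in the program yields a literal `&'. This sketch assumes a POSIX shell and any awk; the variable names are illustrative only.

```shell
# "[&]" reaches sub as [&], so & is replaced by the matched text.
matched=$(echo abc | awk '{ sub(/b/, "[&]"); print }')
# "[\\&]" reaches sub as [\&], which produces a literal ampersand.
literal=$(echo abc | awk '{ sub(/b/, "[\\&]"); print }')
printf '%s\n' "$matched" "$literal"
# prints a[b]c and then a[&]c
```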
The problem with the historical approach is that there is no way to get a literal `\' followed by the matched text.
The 1992 POSIX standard attempted to fix this problem. That standard
says that sub
and gsub
look for either a `\' or an `&'
after the `\'. If either one follows a `\', that character is
output literally. The interpretation of `\' and `&' then becomes
as shown in Table 8.2.
You type     sub sees     sub generates
--------     --------     -------------
&            &            the matched text
\\&          \&           a literal `&'
\\\\&        \\&          a literal `\', then the matched text
\\\\\\&      \\\&         a literal `\&'
Table 8.2: 1992 POSIX Rules for sub and gsub Escape Sequence Processing
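This appears to solve the problem. Unfortunately, the phrasing of the standard is unusual. It says, in effect, that `\' turns off the special meaning of any following character, but for anything other than `\' and `&', such special meaning is undefined. This wording leads to two problems: backslashes must now be doubled in the replacement string, breaking historical awk programs; and to make sure that an awk program is portable, every character in the replacement string must be preceded with a backslash.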
Because of the problems just listed, in 1996, the gawk maintainer submitted proposed text for a revised standard that reverts to rules that correspond more closely to the original existing practice. The proposed rules have special cases that make it possible to produce a `\' preceding the matched text. This is shown in Table 8.3.
You type     sub sees     sub generates
--------     --------     -------------
\\\\\\&      \\\&         a literal `\&'
\\\\&        \\&          a literal `\', followed by the matched text
\\&          \&           a literal `&'
\\q          \q           a literal `\q'
\\\\         \\           \\
Table 8.3: Proposed rules for sub and backslash
In a nutshell, at the runtime level, there are now three special sequences of characters (`\\\&', `\\&' and `\&') whereas historically there was only one. However, as in the historical case, any `\' that is not part of one of these three sequences is not special and appears in the output literally.
gawk 3.0 and 3.1 follow these proposed POSIX rules for sub and gsub.
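The two special sequences that the historical rules lacked can be tried directly. This is a sketch under the assumption that the awk in use follows either the proposed rules or the POSIX 2001 rules described in this section, which agree on both cases; the variable names are illustrative.

```shell
# "\\\\&" reaches sub as \\& : a literal backslash, then the matched text.
slash_then_match=$(echo abc | awk '{ sub(/b/, "\\\\&"); print }')
# "\\\\\\&" reaches sub as \\\& : a literal backslash and a literal ampersand.
slash_then_amp=$(echo abc | awk '{ sub(/b/, "\\\\\\&"); print }')
printf '%s\n' "$slash_then_match" "$slash_then_amp"
# prints a\bc and then a\&c
```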
The POSIX standard took much longer to be revised than was expected in 1996.
The 2001 standard does not follow the above rules. Instead, the rules
there are somewhat simpler. The results are similar except for one case.
The 2001 POSIX rules state that `\&' in the replacement string produces a literal `&', `\\' produces a literal `\', and `\' followed by anything else is not special; the `\' is placed straight into the output. These rules are presented in Table 8.4.
You type     sub sees     sub generates
--------     --------     -------------
\\\\\\&      \\\&         a literal `\&'
\\\\&        \\&          a literal `\', followed by the matched text
\\&          \&           a literal `&'
\\q          \q           a literal `\q'
\\\\         \\           \
Table 8.4: POSIX 2001 rules for sub
The only case where the difference is noticeable is the last one: `\\\\' is seen as `\\' and produces `\' instead of `\\'.
Starting with version 3.1.4, gawk follows the POSIX rules when --posix is specified (see Options). Otherwise, it continues to follow the 1996 proposed rules, since, as of this writing, that has been its behavior for over seven years.
NOTE: At the next major release, gawk will switch to using the POSIX 2001 rules by default.
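Because the two rule sets differ only in how `\\\\' is handled, that one case can serve as a probe for which rules a given awk applies. The following is a sketch, not a definitive test: the result depends on the awk implementation, its version, and options such as --posix, so no single output is claimed here.

```shell
# "\\\\" reaches sub as \\ ; the number of backslashes that come out
# shows which rule set the awk in use applies.
result=$(echo abc | awk '{ sub(/b/, "\\\\"); print }')
case $result in
    'a\c')  printf '%s\n' 'one backslash: POSIX 2001 rules' ;;
    'a\\c') printf '%s\n' 'two backslashes: 1996 proposed rules' ;;
    *)      printf '%s\n' "unexpected result: $result" ;;
esac
```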
The rules for gensub
are considerably simpler. At the runtime
level, whenever gawk sees a `\', if the following character
is a digit, then the text that matched the corresponding parenthesized
subexpression is placed in the generated output. Otherwise,
no matter what character follows the `\', it
appears in the generated text and the `\' does not,
as shown in Table 8.5.
You type     gensub sees     gensub generates
--------     -----------     ----------------
&            &               the matched text
\\&          \&              a literal `&'
\\\\         \\              a literal `\'
\\\\&        \\&             a literal `\', then the matched text
\\\\\\&      \\\&            a literal `\&'
\\q          \q              a literal `q'
Table 8.5: Escape Sequence Processing for gensub
Because of the complexity of the lexical and runtime level processing
and the special cases for sub and gsub,
we recommend the use of gawk and gensub when you have
to do substitutions.
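A few of the gensub cases can be sketched as follows. Since gensub is a gawk extension, the example assumes that a program named gawk is installed and does nothing otherwise; the variable names are illustrative.

```shell
if command -v gawk >/dev/null 2>&1; then
    # "[\\1]" reaches gensub as [\1]: text of the first parenthesized subexpression.
    group=$(gawk 'BEGIN { print gensub(/(b)/, "[\\1]", "g", "abc") }')
    # "\\&" reaches gensub as \& : a literal ampersand.
    amp=$(gawk 'BEGIN { print gensub(/b/, "\\&", "g", "abc") }')
    # "\\\\&" reaches gensub as \\& : a literal backslash, then the matched text.
    slash=$(gawk 'BEGIN { print gensub(/b/, "\\\\&", "g", "abc") }')
    printf '%s\n' "$group" "$amp" "$slash"
    # prints a[b]c, a&c, and a\bc
fi
```

Unlike sub and gsub, gensub returns the modified string and leaves its target unchanged, which is why the result is printed directly.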
In awk, the `*' operator can match the null string.
This is particularly important for the sub, gsub,
and gensub functions. For example:
     $ echo abc | awk '{ gsub(/m*/, "X"); print }'
     -| XaXbXcX
Although this makes a certain amount of sense, it can be surprising.
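The effect is easier to see next to a pattern that cannot match the null string. This sketch assumes a POSIX shell and any awk, and contrasts `m*' with `m+', which requires at least one `m'; the variable names are illustrative.

```shell
# m* matches the empty string before each character and at the end of the record.
star=$(echo abc | awk '{ gsub(/m*/, "X"); print }')
# m+ needs at least one m, so nothing matches and the record is unchanged.
plus=$(echo abc | awk '{ gsub(/m+/, "X"); print }')
printf '%s\n' "$star" "$plus"
# prints XaXbXcX and then abc
```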