Using library functions in awk can be very beneficial. It encourages code reuse and the writing of general functions. Programs are smaller and therefore clearer. However, using library functions is only easy when writing awk programs; it is painful when running them, requiring multiple -f options. If gawk is unavailable, then so too is the AWKPATH environment variable and the ability to put awk functions into a library directory (see Options). It would be nice to be able to write programs in the following manner:
# library functions
@include getopt.awk
@include join.awk
...

# main program
BEGIN {
    while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1)
        ...
    ...
}
The following program, igawk.sh, provides this service. It simulates gawk's searching of the AWKPATH variable and also allows nested includes; i.e., a file that is included with `@include' can contain further `@include' statements. igawk makes an effort to only include files once, so that nested includes don't accidentally include a library function twice.
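For example (the file names and the csv_line() function here are invented for illustration, and the sketch assumes the join.awk library described elsewhere in this book, whose join() function joins array elements start through end with a separator), a library may itself rely on another library, and a program may name both without harm:

# csv.awk --- hypothetical library that itself uses join.awk
@include join.awk
function csv_line(f, n) { return join(f, 1, n, ",") }

# myprog.awk --- the user's program
@include csv.awk
@include join.awk    # harmless: igawk warns and includes join.awk only once
{ n = split($0, f); print csv_line(f, n) }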
igawk should behave just like gawk externally. This means it should accept all of gawk's command-line arguments, including the ability to have multiple source files specified via -f, and the ability to mix command-line and library source files.
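As a hypothetical illustration (the file and data names are invented), a main program kept free of `@include' lines and run with plain gawk, and the same program rewritten with `@include' lines and run with igawk, are meant to achieve the same effect:

# plain gawk: every library file must be listed explicitly,
# and myprog.awk must not itself contain `@include' lines
gawk -f getopt.awk -f join.awk -f myprog.awk data.txt

# igawk: myprog.awk pulls the libraries in with `@include'
igawk -f myprog.awk data.txt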
The program is written using the POSIX Shell (sh) command language.[1] It works as follows:
This program uses shell variables extensively: for storing command-line arguments, for the text of the awk program that expands the user's program, for the user's original program, and for the expanded program. Doing so removes some potential problems that might arise were we to use temporary files instead, at the cost of making the script somewhat more complicated.
The initial part of the program turns on shell tracing if the first argument is `debug'.
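For example, a hypothetical invocation such as the following makes the shell echo every command of the script as it runs:

igawk debug -f myprog.awk data.txt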
The next part loops through all the command-line arguments. There are several cases of interest:
--
    This ends the arguments to igawk. Anything else is passed on to the user's awk program without being evaluated.

-W
    This indicates that the next option is specific to gawk. To make argument processing easier, the -W is appended to the front of the remaining arguments and the loop continues. (This is an sh programming trick; don't worry about it if you are not familiar with sh.)

-v, -F
    These are saved and passed on to gawk.

-f, --file, --file=, -Wfile=
    The file name is appended to the shell variable program with an `@include' statement. The expr utility is used to remove the leading option part of the argument (e.g., `--file='); see the short example after this list. (Typical sh usage would be to use the echo and sed utilities to do this work. Unfortunately, some versions of echo evaluate escape sequences in their arguments, possibly mangling the program text. Using expr avoids this problem.)

--source, --source=, -Wsource=
    The source text is appended to program.

--version, -Wversion
    igawk prints its version number, runs `gawk --version' to get the gawk version information, and exits.
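The following stand-alone commands (not part of igawk, and using an invented file name) show the expr idiom at work. The pattern matches the leading option text and `\(.*\)' captures what follows it, which expr then prints:

arg="--file=getopt.awk"
f=`expr "$arg" : '-.file=\(.*\)'`
echo "$f"     # prints: getopt.awk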
If none of the -f, --file, -Wfile, --source, or -Wsource arguments are supplied, then the first nonoption argument should be the awk program. If there are no command-line arguments left, igawk prints an error message and exits. Otherwise, the first argument is appended to program.

In any case, after the arguments have been processed, program contains the complete text of the original awk program.
The program is as follows:
#! /bin/sh
# igawk --- like gawk but do @include processing

if [ "$1" = debug ]
then
    set -x
    shift
fi

# A literal newline, so that program text is formatted correctly
n='
'

# Initialize variables to empty
program=
opts=

while [ $# -ne 0 ] # loop over arguments
do
    case $1 in
    --)     shift; break;;

    -W)     shift
            # The ${x?'message here'} construct prints a
            # diagnostic if $x is the null string
            set -- -W"${@?'missing operand'}"
            continue;;

    -[vF])  opts="$opts $1 '${2?'missing operand'}'"
            shift;;

    -[vF]*) opts="$opts '$1'" ;;

    -f)     program="$program$n@include ${2?'missing operand'}"
            shift;;

    -f*)    f=`expr "$1" : '-f\(.*\)'`
            program="$program$n@include $f";;

    -[W-]file=*)
            f=`expr "$1" : '-.file=\(.*\)'`
            program="$program$n@include $f";;

    -[W-]file)
            program="$program$n@include ${2?'missing operand'}"
            shift;;

    -[W-]source=*)
            t=`expr "$1" : '-.source=\(.*\)'`
            program="$program$n$t";;

    -[W-]source)
            program="$program$n${2?'missing operand'}"
            shift;;

    -[W-]version)
            echo igawk: version 2.0 1>&2
            gawk --version
            exit 0 ;;

    -[W-]*) opts="$opts '$1'" ;;

    *)      break;;
    esac
    shift
done

if [ -z "$program" ]
then
    program=${1?'missing program'}
    shift
fi

# At this point, `program' has the program.
The awk program to process `@include' directives is stored in the shell variable expand_prog. Doing this keeps the shell script readable. The awk program reads through the user's program, one line at a time, using getline (see Getline). The input file names and `@include' statements are managed using a stack. As each `@include' is encountered, the current file name is “pushed” onto the stack and the file named in the `@include' directive becomes the current file name. As each file is finished, the stack is “popped,” and the previous input file becomes the current input file again. The process is started by making the original file the first one on the stack.
The pathto function does the work of finding the full path to a file. It simulates gawk's behavior when searching the AWKPATH environment variable (see AWKPATH Variable). If a file name has a `/' in it, no path search is done. Otherwise, the file name is concatenated with the name of each directory in the path, and an attempt is made to open the generated file name. The only way to test if a file can be read in awk is to go ahead and try to read it with getline; this is what pathto does.[2] If the file can be read, it is closed and the file name is returned:
expand_prog='

function pathto(file, i, t, junk)
{
    if (index(file, "/") != 0)
        return file

    for (i = 1; i <= ndirs; i++) {
        t = (pathlist[i] "/" file)
        if ((getline junk < t) > 0) {
            # found it
            close(t)
            return t
        }
    }
    return ""
}
The main program is contained inside one BEGIN rule. The first thing it does is set up the pathlist array that pathto uses. After splitting the path on `:', null elements are replaced with ".", which represents the current directory:
BEGIN {
    path = ENVIRON["AWKPATH"]
    ndirs = split(path, pathlist, ":")
    for (i = 1; i <= ndirs; i++) {
        if (pathlist[i] == "")
            pathlist[i] = "."
    }
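To see the splitting at work, here is a stand-alone check (not part of igawk) using a hypothetical AWKPATH value with a leading `:'; the empty element becomes ".", so the current directory is searched first and the named directory second:

gawk 'BEGIN {
    path = ":/usr/local/share/awk"      # hypothetical AWKPATH value
    ndirs = split(path, pathlist, ":")
    for (i = 1; i <= ndirs; i++)
        if (pathlist[i] == "")
            pathlist[i] = "."
    for (i = 1; i <= ndirs; i++)
        print i, pathlist[i]
}'
# prints:
# 1 .
# 2 /usr/local/share/awk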
The stack is initialized with ARGV[1], which will be /dev/stdin. The main loop comes next. Input lines are read in succession. Lines that do not start with `@include' are printed verbatim. If the line does start with `@include', the file name is in $2. pathto is called to generate the full path. If it cannot, then we print an error message and continue.

The next thing to check is if the file is included already. The processed array is indexed by the full file name of each included file and it tracks this information for us. If the file is seen again, a warning message is printed. Otherwise, the new file name is pushed onto the stack and processing continues.

Finally, when getline encounters the end of the input file, the file is closed and the stack is popped. When stackptr is less than zero, the program is done:
    stackptr = 0
    input[stackptr] = ARGV[1] # ARGV[1] is first file

    for (; stackptr >= 0; stackptr--) {
        while ((getline < input[stackptr]) > 0) {
            if (tolower($1) != "@include") {
                print
                continue
            }
            fpath = pathto($2)
            if (fpath == "") {
                printf("igawk:%s:%d: cannot find %s\n",
                    input[stackptr], FNR, $2) > "/dev/stderr"
                continue
            }
            if (! (fpath in processed)) {
                processed[fpath] = input[stackptr]
                input[++stackptr] = fpath # push onto stack
            } else
                print $2, "included in", input[stackptr],
                    "already included in", processed[fpath] > "/dev/stderr"
        }
        close(input[stackptr])
    }
}' # close quote ends `expand_prog' variable

processed_program=`gawk -- "$expand_prog" /dev/stdin <<EOF
$program
EOF
`
The shell construct `command << marker' is called a here document. Everything in the shell script up to the marker is fed to command as input. The shell processes the contents of the here document for variable and command substitution (and possibly other things as well, depending upon the shell).
The shell construct ``...`' is called command substitution. The output of the command between the two backquotes (grave accents) is substituted into the command line. It is saved as a single string, even if the results contain whitespace.
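A minimal stand-alone demonstration of the two constructs working together (the variable names are invented): the here document is expanded by the shell, fed to cat, and cat's output is captured into a variable by command substitution:

name=world
greeting=`cat <<EOF
Hello, $name
EOF
`
echo "$greeting"     # prints: Hello, world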
The expanded program is saved in the variable processed_program. It's done in these steps:

1. Run gawk with the `@include'-processing program (the value of the expand_prog shell variable) on standard input.

2. Standard input is the contents of the user's program, from the shell variable program. Its contents are fed to gawk via a here document.

3. The results of this processing are saved in the shell variable processed_program by using command substitution.
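A hedged before-and-after sketch (file names and contents are invented): lines that are not `@include' directives are copied verbatim, and each `@include' line is replaced by the text of the file it names, so processed_program ends up looking like this:

# $program, as collected from the command line:
@include double.awk
{ print double($1) }

# ./double.awk, found via pathto():
function double(x) { return 2 * x }

# $processed_program, after expansion:
function double(x) { return 2 * x }
{ print double($1) }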
The last step is to call gawk with the expanded program, along with the original options and command-line arguments that the user supplied.
eval gawk $opts -- '"$processed_program"' '"$@"'
The eval command is a shell construct that reruns the shell's parsing process. This keeps things properly quoted.
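A stand-alone illustration of why the re-parsing matters (the variable and program are invented): the quotes embedded in opts only take effect because eval makes the shell parse the line a second time:

opts=" -v 'msg=hello world'"
eval gawk $opts "'BEGIN { print msg }'"
# prints: hello world
# Without eval, $opts would undergo only word splitting: gawk would
# receive the words -v, 'msg=hello, and world' with the quote
# characters intact, instead of a single msg=hello world assignment.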
This version of igawk represents my fourth attempt at this program. There are four key simplifications that make the program work better:
Using `@include' even for the files named with -f makes building the initial collected awk program simpler; all the `@include' processing can be done once.

Not trying to save the line read with getline in the pathto function when testing for the file's accessibility for use with the main program simplifies things considerably.

Using a getline loop in the BEGIN rule does it all in one place. It is not necessary to call out to a separate loop for processing nested `@include' statements.

Storing the expanded program in a shell variable instead of a temporary file avoids the potential problems mentioned earlier, at the cost of relying on more of the sh language's features.
Also, this program illustrates that it is often worthwhile to combine sh and awk programming. You can usually accomplish quite a lot without having to resort to low-level programming in C or C++, and it is frequently easier to do certain kinds of string and argument manipulation using the shell than it is in awk.
Finally, igawk shows that it is not always necessary to add new features to a program; they can often be layered on top. With igawk, there is no real reason to build `@include' processing into gawk itself.
As an additional example of this, consider the idea of having two files in a directory in the search path:

default.awk
    This file contains a set of default library functions, such as getopt and assert.

site.awk
    This file contains library functions that are specific to a site or installation; i.e., locally developed functions.
One user suggested that gawk be modified to automatically read these files upon startup. Instead, it would be very simple to modify igawk to do this. Since igawk can process nested `@include' directives, default.awk could simply contain `@include' statements for the desired library functions.
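A hedged sketch of what such a default.awk might contain (the particular library names are only examples):

# default.awk --- read automatically by a modified igawk (hypothetical)
@include getopt.awk
@include assert.awk
@include join.awk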
[1] Fully explaining the sh language is beyond the scope of this book. We provide some minimal explanations, but see a good shell programming book if you wish to understand things in more depth.
[2] On some very old versions of awk, the test `getline junk < t' can loop forever if the file exists but is empty. Caveat emptor.