The goal of this project is to develop a loop vectorizer in GCC, based on the tree-ssa framework. This work is taking place in the autovect-branch.
Newly added is support for type conversions: type promotion is supported using the new VEC_UNPACK_HI and VEC_UNPACK_LO tree-codes (and the new vec_unpacks_hi/lo and vec_unpacku_hi/lo optabs), and type demotion is supported using the new VEC_PACK_MOD tree-code (and the new vec_pack_mod optab). Also supported is the detection of computation idioms such as widening-summation (WIDEN_SUM), dot-product (DOT_PROD), widening-multiplication (WIDEN_MULT, VEC_WIDEN_MULT_HI/LO), multiply-highpart (MULT_HI) and sum-of-absolute-differences (SAD).
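For illustration, a loop of the following form exhibits the dot-product idiom (this sketch is not taken from the example list on this page; the function name dot_prod and the array sizes are assumptions):

#define N 256
short sb[N], sc[N];

int
dot_prod ()
{
  int i, sum = 0;
  /* each product of two shorts is computed in int precision and then
     accumulated into an int sum: the dot-product (DOT_PROD) idiom,
     which also exercises the widening-multiplication support.  */
  for (i = 0; i < N; i++)
    sum += sb[i] * sc[i];
  return sum;
}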
Current contributors to this project include Dorit (Naishlos) Nuzman, Ira Rosen, Victor Kaplansky and Devang Patel. This web page is maintained by Dorit (Naishlos) Nuzman <dorit@il.ibm.com>.
Required enhancements and missing features (some of these are in the works or in our near-term plans; other items are open for others to contribute!) are described in the list of enhancements later on this page.
Vectorization is enabled by the flag -ftree-vectorize. To allow vectorization on powerpc* platforms also use -maltivec; on i?86 and x86_64 platforms use -msse/-msse2. The vectorizer test cases demonstrate the current vectorization capabilities; these can be found under gcc/gcc/testsuite/gcc.dg/vect/. Information on which loops were or were not vectorized, and why, can be obtained using the flag -ftree-vectorizer-verbose. For details see http://gcc.gnu.org/ml/gcc-patches/2005-01/msg01247.html.

Example output using -ftree-vectorizer-verbose=5:
vect-1.c:34: note: not vectorized: unsupported use in stmt.
vect-1.c:43: note: not vectorized: nested loop.
vect-1.c:44: note: not vectorized: unsupported use in stmt.
vect-1.c:52: note: LOOP VECTORIZED.
vect-1.c:59: note: LOOP VECTORIZED.
vect-1.c:66: note: not vectorized: complicated access pattern.
vect-1.c:74: note: LOOP VECTORIZED.
vect-1.c:85: note: not vectorized: mixed data-types
vect-1.c:94: note: not vectorized: possible dependence between data-refs a[i_283] and a[i_48]
vect-1.c:14: note: vectorized 3 loops in function.
Examples of loops that can currently be vectorized by the autovect-branch. The "feature" comments indicate the vectorization capabilities demonstrated by each example.
example1:

int a[256], b[256], c[256];

foo () {
  int i;

  for (i=0; i<256; i++){
    a[i] = b[i] + c[i];
  }
}

example2:
int a[256], b[256], c[256];

foo (int n, int x) {
  int i;

  /* feature: support for unknown loop bound  */
  /* feature: support for loop invariants  */
  for (i=0; i<n; i++)
    b[i] = x;

  /* feature: general loop exit condition  */
  /* feature: support for bitwise operations  */
  while (n--){
    a[i] = b[i]&c[i];
    i++;
  }
}

example3:
typedef int aint __attribute__ ((__aligned__(16)));

foo (int n, aint * __restricted__ p, aint * __restricted__ q) {
  /* feature: support for (aligned) pointer accesses.  */
  while (n--){
    *p++ = *q++;
  }
}

example4:
typedef int aint __attribute__ ((__aligned__(16)));
int a[256], b[256], c[256];

foo (int n, aint * __restricted__ p, aint * __restricted__ q) {
  int i, j;

  /* feature: support for (aligned) pointer accesses  */
  /* feature: support for constants  */
  while (n--){
    *p++ = *q++ + 5;
  }

  /* feature: support for read accesses with a compile time known misalignment  */
  for (i=0; i<n; i++){
    a[i] = b[i+1] + c[i+3];
  }

  /* feature: support for if-conversion (only in autovect-branch)  */
  for (i=0; i<n; i++){
    j = a[i];
    b[i] = (j > MAX ? MAX : 0);
  }
}

example5:
struct a { int ca[N]; } s;

for (i = 0; i < N; i++) {
  /* feature: support for alignable struct access  */
  s.ca[i] = 5;
}

example6 (gfortran):
DIMENSION A(1000000), B(1000000), C(1000000)
READ*, X, Y
A = LOG(X); B = LOG(Y); C = A + B
PRINT*, C(500000)
END

example7:
int a[256], b[256];

foo (int x) {
  int i;

  /* feature: support for read accesses with an unknown misalignment  */
  for (i=0; i<N; i++){
    a[i] = b[i+x];
  }
}

example8:
int a[M][N];

foo (int x) {
  int i,j;

  /* feature: support for multidimensional arrays  */
  for (i=0; i<M; i++) {
    for (j=0; j<N; j++) {
      a[i][j] = x;
    }
  }
}

example9:
unsigned int ub[N], uc[N];

foo () {
  int i;

  /* feature: support summation reduction.
     note: in case of floats use -funsafe-math-optimizations  */
  unsigned int udiff = 0;
  for (i = 0; i < N; i++) {
    udiff += (ub[i] - uc[i]);
  }
}

example10:
/* feature: support for data-types of different sizes.
   Currently only a single vector-size per target is supported;
   it can accommodate n elements such that n = vector-size/element-size
   (e.g, 4 ints, 8 shorts, or 16 chars for a vector of size 16 bytes).
   A combination of data-types of different sizes in the same loop
   requires special handling, now present in autovect-branch.
   This also includes support for type conversions.  */
short *sa, *sb, *sc;
int *ia, *ib, *ic;

for (i = 0; i < N; i++) {
  ia[i] = ib[i] + ic[i];
  sa[i] = sb[i] + sc[i];
}

for (i = 0; i < N; i++) {
  ia[i] = (int) sb[i];
}
Examples of loops that currently cannot be vectorized:
example1: uncountable loop:

while (*p != NULL) {
  *q++ = *p++;
}

example2: induction:
for (i = 0; i < N; i++) {
  a[i] = i;
}

example3: strided access - the data elements that can be operated upon in parallel are not consecutive; they are accessed with a stride > 1 (in the example below, the stride is 2). (This is now vectorizable on autovect-branch):
for (i = 0; i < N/2; i++){
  a[i] = b[2*i+1] * c[2*i+1] - b[2*i] * c[2*i];
  d[i] = b[2*i] * c[2*i+1] + b[2*i+1] * c[2*i];
}
Previous status updates: the first versions of the vectorizer in mainline handled only simple loops in which the loop index i is updated from 0 to N in steps of 1, the loop exit condition is of the form i<N, and array accesses are of the form a[i]. Stores of constants and of loop invariants (a[i] = 5, a[i] = x) were supported. Arrays had to be aligned (for example declared extern, when the stack boundary of the target platform allows it), and pointer accesses were supported only for aligned pointers annotated as __restricted__ (at first a new experimental feature). There was no support for reductions (sum += a[i]) or inductions (a[i] = i). Later updates added support for read accesses with a compile-time-known misalignment (p=&a[16]-4B for pointers, a[i+off] for arrays, and accesses such as a[i+1] where array a is aligned and i starts from 0), then for read accesses with an unknown misalignment (a[i+x], where the value of x is unknown); misalignment support for loads was also made more efficient, while stores (memory writes) still had to be aligned. Support for invariants was likewise made more efficient. Each of these updates listed examples of newly vectorizable loops.
The table below outlines the high level vectorization scheme along with a proposal for an implementation scheme, as follows:
The first column ("vectorization driver") lists the tasks that the vectorizer consists of, and briefly describes the expected functionality of each task.
The second column ("basic-vectorizer") describes a
proposal for a basic vectorizer that provides minimal
support for each of these tasks, listing the
restrictions that will be imposed by the basic
vectorizer on candidate loops. Loops that are
considered vectorizable by the basic-vectorizer are of
the form: for(i=0; i<N; i++) {a[i] = b[i] +
c[i]; }.
The "basic vectorizer" was implemented by
Dorit (Naishlos) Nuzman.
The third column ("enhancements") lists possible directions for extending the capabilities of the basic vectorizer. Some of these enhancements aim at improving the quality of the vector code that is generated; others aim at broadening the range of computations that are amenable to vectorization; others focus on improved robustness. Following the table is a complete and detailed list of these enhancements. Next to each item that is already being addressed is the name of the relevant contact person.
vectorization driver | basic vectorizer | enhancements

analyze_loop_CFG(loop)
Checks the control flow properties of the loop (the number of basic blocks it consists of, nesting, single entry/exit, etc.), in order to determine whether the control flow of the loop falls within the range of loop forms that are supported by this vectorizer.

Analyzes the loop termination condition to determine the loop bound and the properties of the loop index (its bounds and step). The functionality of this utility should be largely provided by the information computed by the induction variable analyzer.

Scans the loop statements and checks whether there are any statements that prohibit vectorization (function calls, statements that don't have a mapping to a built-in vector function, etc.).

Analyzes the memory references in the loop, and classifies them according to the access pattern that they exhibit.

Analyzes the alignment of the memory references in the loop. For each memory reference, records its misalignment amount, if it can be resolved at compile time.

Builds the loop dependence graph (for scalar and array references); detects strongly connected components (SCCs) in the graph (statements that are involved in a dependence cycle); performs a topological sort on the reduced graph (in which each SCC is represented by a single node). Only singleton nodes without self-dependences can be vectorized; if other (compound) nodes (which represent SCCs) are present, loop transformations are required.

At this point, it has been determined that the loop is vectorizable. It remains to decide whether it is indeed profitable to vectorize it.

Replaces the scalar statements with the corresponding vector statements (which could be calls to built-in functions), and also changes the loop bound accordingly.
The following is a list of independent directions by which the basic vectorizer can be enhanced. It should be possible for different people to work on different items on this list. Some of these items are already under development, or (partially) supported.
Detect loops, and record some basic control flow information about them (contained basic blocks, loop pre-header, exit and entry, etc.).
Status: Loop detection and control-flow analysis are already supported (cfgloop.c, cfgloopanal.c).
Expose the required target-specific information to the tree level. This includes providing a mapping from scalar operations to the corresponding vector support, which will answer the following questions: (1) is the operation supported by the target in vector form? (2) how is the vector form expressed at the tree-level? (3) what is the cost of the vector form? The general SIMD support in GCC already provides some initial support; for simple operations which can be expressed using existing (scalar) tree-codes (PLUS_EXPR, MULT_EXPR, etc.) the existing infrastructure can provide answers to questions 1 and 2 above; however, the tree-level currently does not have an idea of the cost that this transformation actually entails.
simple operations that fall into the above category. As
the capabilities of the vectorizer are extended, it
will be required to inform the vectorizer of the
advanced capabilities available in the architecture
(for example, support for operations on complex
numbers, reduction, etc.). Such operations cannot be
expressed using existing tree-codes. Possible
solutions: introduce new tree-codes (and corresponding
optabs); introduce new builtins that are exposed to the
compiler; use target hooks to handle these cases (the
hook could return a call to a machine specific builtin
function). Another related design question that needs
to be addressed here is how much information to expose
to the tree-level (is it sufficient to indicate that
conditional vector addition is supported, or do we want
the vectorizer to actually generate the required
masking/predication/select operations depending on the
target? similarly for alignment, multiplication of
integers, etc.).
Status: Open for discussion.
Related discussion:
http://gcc.gnu.org/ml/gcc/2004-08/msg00317.html
Currently the tree optimizers do not know the semantics of target-specific builtin functions, so they do not attempt to optimize them (or to rewrite into SSA form the variables passed as arguments to these functions). Since the vectorizer will probably end up generating calls to target-specific builtin functions, this situation needs to be improved; that is, the semantics of these builtins need to somehow be exposed to the compiler.
Status: Open for discussion.
There is an overhead associated with vectorization: moving data into and out of vector registers before and after the vectorized loop, aligning data accesses, etc. It is necessary to incorporate a cost model into the machine description in order to allow the vectorizer to evaluate whether it is worthwhile to vectorize a given loop. One can also consider using run-time tests to decide which version of the loop to execute (scalar or vectorized); a small sketch of such a profitability check follows the related-discussion link below.
Status: Open for discussion.
Related discussion:
http://gcc.gnu.org/ml/gcc-patches/2003-09/msg00469.html
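A minimal sketch of such a profitability check, assuming hypothetical per-iteration cost figures and a vectorization factor supplied by the machine description (an illustration of the idea, not GCC's actual cost model):

/* All cost figures and the vectorization factor vf are hypothetical
   inputs, assumed to be provided by the machine description.  */
static int
vectorization_profitable (int scalar_cost_per_iter, int vector_cost_per_iter,
                          int outside_cost, int niters, int vf)
{
  int scalar_cost = scalar_cost_per_iter * niters;
  int vector_cost = vector_cost_per_iter * (niters / vf)    /* vector body */
                    + scalar_cost_per_iter * (niters % vf)  /* scalar epilogue */
                    + outside_cost;  /* data movement, alignment setup, etc. */
  return vector_cost < scalar_cost;
}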
Used by the vectorizer to detect the loop bound, analyze access patterns and analyze data dependencies between array references. One option is that the dependence tests would be designed to deal with array references that are expressed in terms of a linear function of the iteration counter (in this case, vectorization will also benefit from optimizations like induction variable substitution and replacement of auxiliary IVs with linear functions of the loop index). A dependence tester based on IVs represented in this form would analyze each subscript of each array reference and apply the appropriate dependence test (SIV, ZIV, MIV, etc.; see dependence testing). Alternatively, an induction variable evolution analyzer could provide a different implementation of the dependence tester. This is the solution that is currently used, based on the induction variable evolution analysis developed by Sebastian Pop.
Status: Using the IV evolution analyzer developed by Sebastian Pop.
Following the classic dependence-based approach for vectorization as described in [1], apply dependence tests to pairs of array references in each loop nest, and analyze the resulting dependence graph. We will start from a dependence analyzer that relies on the array references being expressed in terms of a linear function of the loop index, apply the simplest dependence tests to all pairs of memory read/write and write/write, and gradually extend its capabilities. The scheme follows the algorithm described in [2].
Status: Many of these dependence tests are implemented; the Omega test is in the works. We do not yet build a data dependence graph (DDG) based on the dependence tests.
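For illustration (an assumed example, not one of the page's own), the first loop below has a loop-carried flow dependence of distance 1, so its statement is part of a dependence cycle and cannot be vectorized as is; the second loop has no cross-iteration dependence:

int a[256], b[256];

void
dep (void)
{
  int i;
  for (i = 0; i < 255; i++)
    a[i + 1] = a[i] + b[i];   /* reads the value written one iteration earlier */
}

void
nodep (void)
{
  int i;
  for (i = 0; i < 256; i++)
    a[i] = a[i] + b[i];       /* each iteration touches only its own element */
}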
The memory architecture usually allows only restricted accesses to data in memory; one of the restrictions is that the accessed data must be consecutive in memory. Any other access (strided, for example) requires a special permutation of the data in order to pack the data elements in the right order into a vector register. Support for different access patterns consists of several stages.
Status: Future work.
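As an assumed illustration of a non-consecutive access pattern (not one of the page's own examples), a reversed access requires the loaded elements to be permuted inside the vector register before the store:

int a[256], b[256];

void
reverse_copy (void)
{
  int i;
  for (i = 0; i < 256; i++)
    a[i] = b[255 - i];   /* consecutive iterations read decreasing addresses */
}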
At first, the only computations that will be vectorized are those for which the vectorization process consists of trivially replacing each scalar operation in the loop with its vector counterpart. This includes simple loads, stores and arithmetic operations that operate on the same data type. Some computations require extra code to be generated in order to vectorize. These include:
stores of loop invariants (a[i] = N), which require scalar expansion [done]; and inductions (a[i] = i), which require scalar expansion as well as proper initialization and update code [planned] (a conceptual sketch of the induction case follows below).
Status: Future work.
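A conceptual sketch of scalar expansion for the induction a[i] = i, written by hand using GCC's generic vector extension and assuming a 16-byte vector of four ints with N divisible by 4 (this only illustrates the idea; it is not the code the vectorizer emits):

typedef int v4si __attribute__ ((vector_size (16)));

#define N 256
int a[N] __attribute__ ((aligned (16)));

void
induction (void)
{
  int i;
  v4si vec_iv   = { 0, 1, 2, 3 };   /* the expanded induction values */
  v4si vec_step = { 4, 4, 4, 4 };   /* advance by 4 each vector iteration */

  for (i = 0; i < N; i += 4)
    {
      *(v4si *) &a[i] = vec_iv;     /* store {i, i+1, i+2, i+3} */
      vec_iv = vec_iv + vec_step;
    }
}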
The memory architecture usually allows only restricted accesses to data in memory. One of the restrictions is that data accesses need to be properly aligned on a certain boundary. Even if the architecture supports unaligned accesses, these are usually much more costly than aligned accesses. The work on alignment consists of several stages.
Status: Currently the way we handle unaligned stores is by peeling the loop to force the alignment of the store. This is not always applicable. Vectorizing unaligned stores is in the works.
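A hand-written sketch of the loop-peeling strategy described in the status above, assuming a 16-byte alignment requirement and four ints per vector (an illustration of the transformation, not the vectorizer's actual output):

#include <stdint.h>

void
set_all (int *p, int n, int x)
{
  int i = 0;

  /* prologue: peel scalar iterations until p+i is 16-byte aligned  */
  while (i < n && ((uintptr_t) (p + i) % 16) != 0)
    p[i++] = x;

  /* main loop: four aligned stores per iteration
     (conceptually a single aligned vector store)  */
  for (; i + 4 <= n; i += 4)
    {
      p[i] = x;  p[i + 1] = x;  p[i + 2] = x;  p[i + 3] = x;
    }

  /* epilogue: the remaining scalar iterations  */
  for (; i < n; i++)
    p[i] = x;
}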
It is often the case that complicated computations can be reduced to a simpler, straight-line sequence of operations. These operations may not be directly supported in a scalar form, but are supported by the target in a vector form. Such cases include collapsing a conditional expression (if-then-else) into a single vectorizable operation. For example, in the following code sequence (taken from the SPECint benchmark gzip) the conditional expression can be collapsed into a "subtract and saturate" operation (see http://gcc.gnu.org/ml/gcc/2003-07/msg01355.html):

for (n = 0; n < HASH_SIZE; n++) {
  m = head[n];
  head[n] = (Pos)(m >= 32768 ? m-32768 : 0);
}
Status: Initial support in place.
The general principle we are trying to follow is to
keep the actual code transformation part of the
vectorizer as simple as possible: a simple scan of
straight-line code, and a one-to-one replacement of
each scalar operation with the equivalent vector
operation. To support this scheme in the presence of
conditional execution, we'll need to flatten the loop
body by collapsing if-then-else into a conditional
(scalar) operation (something like transforming 'if (x) {c = PLUS (a,b)}' into 'PLUS_COND(a,b,x)'). These will later be
replaced with a conditional vector operation using
whatever support is available on the target (masking,
predication or select operation). Flattening the loop
body this way will greatly simplify the vectorizer.
Some of the issues to think about here: (a) how to
represent these conditional operations, (b) to what
extent does the tree vectorizer need to be aware of the
specific target support that is available for
conditional vector execution (mask/predicate/select),
and (c) how to allow a simple way to reverse this
transformation if the loop doesn't end up getting
vectorized.
Status: Done. Contact: Devang Patel.
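A scalar C sketch of the flattening idea (an assumed illustration; the target-specific mask/predicate/select form is not shown): the control dependence becomes a data dependence by computing the value unconditionally and then selecting with the condition.

void
flatten (int *a, int *b, int *c, int *x, int n)
{
  int i;
  for (i = 0; i < n; i++)
    {
      /* original form:  if (x[i]) c[i] = a[i] + b[i];  */
      int t = a[i] + b[i];      /* computed unconditionally */
      c[i] = x[i] ? t : c[i];   /* select: maps to a vector mask/blend/select */
    }
}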
Status: Future work.
Status: In the works.
Address the item from the tree-ssa todo list - "SSA information for arrays : The existing implementation treats arrays as an opaque object. A definition to an array location is treated as a definition for the whole array"
Status: Open for discussion.
Address the issue mentioned in http://gcc.gnu.org/ml/gcc/2003-07/msg02013.html, which turns out to be a front end issue.
Status: In the works. See http://gcc.gnu.org/wiki/Add%20MEM_REF%20operation.
Provide utilities that allow performing the following transformation: given a condition and a loop, create 'if (condition) { loop_copy1 } else { loop_copy2 }', where loop_copy1 is the loop transformed in one way, and loop_copy2 is the loop transformed in another way (or unchanged). 'condition' may be a run time test for things that were not resolved by static analysis (overlapping ranges (anti-aliasing), alignment, etc.).
Status: Done. Contact: Devang Patel.
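A sketch of what the versioning utility could produce, using a run-time overlap (anti-aliasing) test as the condition (a hand-written illustration; here loop_copy1 and loop_copy2 contain the same loop, with the first intended for vectorization):

void
copy (int *p, int *q, int n)
{
  int i;
  if (p + n <= q || q + n <= p)   /* run-time test: the two regions do not overlap */
    {
      for (i = 0; i < n; i++)     /* loop_copy1: may be vectorized */
        p[i] = q[i];
    }
  else
    {
      for (i = 0; i < n; i++)     /* loop_copy2: conservative scalar version */
        p[i] = q[i];
    }
}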
These include linear loop transformations (interchange, reversal, skewing and scaling) and related restructurings that bring loops into a vectorizable form.
Status: Linear loop transformations are implemented by Daniel Berlin.
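As an assumed example of such a transformation, loop interchange (one of the linear loop transformations) can turn a strided inner loop into a unit-stride one that the vectorizer can handle:

#define M 64
#define N 64
int a[M][N], b[M][N];

void
interchange (void)
{
  int i, j;

  /* before interchange the inner loop would walk a column (stride N):
       for (j = 0; j < N; j++)
         for (i = 0; i < M; i++)
           a[i][j] = b[i][j] + 1;                                        */

  /* after interchange the inner loop walks a row (stride 1)  */
  for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
      a[i][j] = b[i][j] + 1;
}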
Status: Future work.
Using user hints for different purposes (aliasing, alignment, profitability of vectorizing a loop, etc.).
Status: In the works. See http://gcc.gnu.org/ml/gcc-patches/2005-02/msg01560.html.