Auto-vectorization in GCC

The goal of this project is to develop a loop vectorizer in GCC, based on the tree-ssa framework. This work is taking place in the autovect-branch.

Latest News

2006-02-19
  1. Vectorization of loops that operate on multiple data-types, including type conversions: submitted for incorporation into GCC 4.2.
  2. Detection and vectorization of special idioms, such as dot-product and widening-summation: Incorporated into GCC 4.2.
  3. Vectorization of non-consecutive (non-unit-stride) data-accesses with power-of-2 strides: Incorporated into autovect-branch. To be submitted to GCC 4.2.
2005-10-23
Autovect-branch has been enhanced to support the following features:
  1. Vectorization of loops that operate on multiple data-types, including type promotion (conversion to a wider type) and type demotion (conversion to a narrower type). Type promotion is supported using the new VEC_UNPACK_HI and VEC_UNPACK_LO tree-codes (and the new vec_unpacks_hi/lo and vec_unpacku_hi/lo optabs). Type demotion is supported using the new VEC_PACK_MOD tree-code (and the new vec_pack_mod optab).
  2. Vectorization of idioms that involve type conversion. This allows more efficient vectorization (if specialized target support is available) that avoids the data packing/unpacking that is otherwise required to handle multiple data-types. These idioms include: widening-summation (WIDEN_SUM), dot-product (DOT_PROD), widening-multiplication (WIDEN_MULT, VEC_WIDEN_MULT_HI/LO), multiply-highpart (MULT_HI) and sum-of-absolute-differences (SAD).
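The dot-product idiom, for example, matches loops of the following shape (an illustrative sketch; the array and function names are not from the GCC sources):

```c
short sb[256], sc[256];

/* Dot-product idiom: the short operands are widened to int,
   multiplied, and the products are summed into an int
   accumulator.  With target support (DOT_PROD) this vectorizes
   without the explicit packing/unpacking that mixed data-types
   otherwise require.  */
int dot_prod (int n)
{
  int i, sum = 0;
  for (i = 0; i < n; i++)
    sum += sb[i] * sc[i];
  return sum;
}
```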
2005-08-11
The following enhancements have been incorporated into GCC 4.1:
  1. Vectorization of reduction has been introduced and currently supports summation, and minimum/maximum computation.
  2. Vectorization of conditional code has been introduced.
  3. The mechanism of peeling a loop to force the alignment of data accesses has been improved. We now generate better code when the misalignment of an access is known at compile time, or when different accesses are known to have the same misalignment, even if the misalignment amount itself is unknown.
  4. Dependence distance is considered when checking dependences between data references.
  5. Loop-closed-ssa-form is incrementally updated during vectorization.
  6. Generic parts of the data references analysis were cleaned up and externalized to make this analysis available to other passes.
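A minimal sketch of the maximum-reduction case mentioned in item 1 (the function name is illustrative):

```c
int a[256];

/* Maximum reduction: the loop folds one element per iteration
   into the scalar result.  The vectorized form keeps a vector
   of partial maxima and reduces it to a scalar after the loop.  */
int maxval (int n)
{
  int i, m = a[0];
  for (i = 1; i < n; i++)
    if (a[i] > m)
      m = a[i];
  return m;
}
```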
2005-04-05
Vectorization of reduction on autovect-branch was enhanced to support maximum/minimum computations, and special reduction idioms such as widening summation as a step towards supporting patterns like dot-product, sum of absolute differences and more: ( http://gcc.gnu.org/ml/gcc-patches/2005-04/msg00532.html).
2005-03-01
Vectorization projects for GCC 4.1: See http://gcc.gnu.org/wiki/Autovectorization%20Enhancements.
Vectorization capabilities in GCC 4.0: See 2005-03-01 mainline status.
2005-02-25
New features:
  1. Summation is the first reduction idiom that is vectorized (autovect-branch only).
  2. Verbosity levels for vectorization reports.
  3. Improved line number information.
  4. Revised data-references analysis.

Contributing

Current contributors to this project include Dorit (Naishlos) Nuzman, Ira Rosen, Victor Kaplansky and Devang Patel. This web page is maintained by Dorit (Naishlos) Nuzman <dorit@il.ibm.com>.

Required enhancements and missing features (some of these are in the works or in our near term plans; other items are open for others to contribute!):

  1. Support vectorization of induction (e.g., a[i] = i). [In the works. Contact: Victor Kaplansky]
  2. Support vectorization in the presence of values that are used after the loop (currently this is supported only for reduction). [In the works. Contact: Victor Kaplansky]
  3. #pragma support to guide the autovectorizer. See http://gcc.gnu.org/ml/gcc-patches/2005-02/msg01560.html. [In the works. Contact: Devang Patel]
  4. Loop versioning to eliminate data dependence using run-time dependence tests.
  5. Misaligned stores support.
  6. SLP in loops.
  7. Outer-loop vectorization.
  8. Build a cost model to decide whether it's profitable to vectorize.
  9. Support certain operations on data-types that are not directly supported by a target, where vectorization is nevertheless possible (for example, data movement and bitwise operations on 64-bit data types for altivec).
  10. Support multiple vector sizes.
  11. Vectorize in the presence of library function-calls. [There is an initial implementation by Zdenek Dvorak.]
  12. Vectorize instructions that operate on a sequence of bytes in memory, i.e., instructions that implement semantics corresponding to code containing a loop in C (such as those available on S390).
  13. Improve debug information (mostly line-number information) for code created by the vectorizer (see http://gcc.gnu.org/ml/gcc-patches/2005-02/msg00197.html).
  14. Reuse generic loop peeling utilities in the vectorizer where possible (see http://gcc.gnu.org/ml/gcc-patches/2005-02/msg00165.html).

Using the Vectorizer

Vectorization is enabled by the flag -ftree-vectorize. On powerpc* platforms also use -maltivec; on i?86 and x86_64 platforms use -msse/-msse2. The vectorizer test cases, which demonstrate the current vectorization capabilities, can be found under gcc/gcc/testsuite/gcc.dg/vect/. Information on which loops were or were not vectorized, and why, can be obtained using the flag -ftree-vectorizer-verbose. For details see http://gcc.gnu.org/ml/gcc-patches/2005-01/msg01247.html. Example output using -ftree-vectorizer-verbose=5:

vect-1.c:34: note: not vectorized: unsupported use in stmt.
vect-1.c:43: note: not vectorized: nested loop.
vect-1.c:44: note: not vectorized: unsupported use in stmt.
vect-1.c:52: note: LOOP VECTORIZED.
vect-1.c:59: note: LOOP VECTORIZED.
vect-1.c:66: note: not vectorized: complicated access pattern.
vect-1.c:74: note: LOOP VECTORIZED.
vect-1.c:85: note: not vectorized: mixed data-types
vect-1.c:94: note: not vectorized: possible dependence between data-refs a[i_283] and a[i_48]
vect-1.c:14: note: vectorized 3 loops in function.

Vectorizable Loops

Examples of loops that can currently be vectorized by the autovect-branch. "feature" indicates the vectorization capabilities demonstrated by the example.

example1:
int a[256], b[256], c[256];
foo () {
  int i;

  for (i=0; i<256; i++){
    a[i] = b[i] + c[i];
  }
}
example2:
int a[256], b[256], c[256];
foo (int n, int x) {
   int i;

   /* feature: support for unknown loop bound  */
   /* feature: support for loop invariants  */
   for (i=0; i<n; i++)
      b[i] = x;

   /* feature: general loop exit condition  */
   /* feature: support for bitwise operations  */
   while (n--){
      a[i] = b[i]&c[i]; i++;
   }
}
example3:
typedef int aint __attribute__ ((__aligned__(16)));
foo (int n, aint * __restricted__ p, aint * __restricted__ q) {

   /* feature: support for (aligned) pointer accesses.  */
   while (n--){
      *p++ = *q++;
   }
}
example4:
typedef int aint __attribute__ ((__aligned__(16)));
int a[256], b[256], c[256];
foo (int n, aint * __restricted__ p, aint * __restricted__ q) {
   int i, j;

   /* feature: support for (aligned) pointer accesses  */
   /* feature: support for constants  */
   while (n--){
      *p++ = *q++ + 5;
   }

   /* feature: support for read accesses with a compile time known misalignment  */
   for (i=0; i<n; i++){
      a[i] = b[i+1] + c[i+3];
   }

   /* feature: support for if-conversion (only in autovect-branch)  */
   for (i=0; i<n; i++){
      j = a[i];
      b[i] = (j > MAX ? MAX : 0);
   }
}
example5:
struct a {
  int ca[N];
} s;
for (i = 0; i < N; i++)
  {
    /* feature: support for alignable struct access  */
    s.ca[i] = 5;
  }
example6 (gfortran):
DIMENSION A(1000000), B(1000000), C(1000000)
READ*, X, Y
A = LOG(X); B = LOG(Y); C = A + B
PRINT*, C(500000)
END
example7:
int a[256], b[256];
foo (int x) {
   int i;

   /* feature: support for read accesses with an unknown misalignment  */
   for (i=0; i<N; i++){
      a[i] = b[i+x];
   }
}
example8:
int a[M][N];
foo (int x) {
   int i,j;

   /* feature: support for multidimensional arrays  */
   for (i=0; i<M; i++) {
     for (j=0; j<N; j++) {
       a[i][j] = x;
     }
   }
}
example9:
unsigned int ub[N], uc[N];
foo () {
  int i;

  /* feature: support summation reduction.
     note: in case of floats use -funsafe-math-optimizations  */
  unsigned int udiff = 0;
  for (i = 0; i < N; i++) {
    udiff += (ub[i] - uc[i]);
  }
}
example10:
/* feature: support data-types of different sizes.
   Currently only a single vector-size per target is supported; 
   it can accommodate n elements such that n = vector-size/element-size 
   (e.g., 4 ints, 8 shorts, or 16 chars for a vector of size 16 bytes). 
   A combination of data-types of different sizes in the same loop 
   requires special handling, now present in autovect-branch. 
   This also includes support for type conversions.  */

short *sa, *sb, *sc;
int *ia, *ib, *ic;
for (i = 0; i < N; i++) {
  ia[i] = ib[i] + ic[i];
  sa[i] = sb[i] + sc[i];
}

for (i = 0; i < N; i++) {
  ia[i] = (int) sb[i];
}

Unvectorizable Loops

Examples of loops that currently cannot be vectorized:

example1: uncountable loop:
while (*p != NULL) {
  *q++ = *p++;
}
example2: induction:
for (i = 0; i < N; i++) {
  a[i] = i;
}
example3: strided access - the data elements that can be operated upon in parallel are not consecutive - they are accessed with a stride > 1 (in the example, the stride is 2). (This is now vectorizable on autovect-branch):
for (i = 0; i < N/2; i++){
  a[i] = b[2*i+1] * c[2*i+1] - b[2*i] * c[2*i];
  d[i] = b[2*i] * c[2*i+1] + b[2*i+1] * c[2*i];
}

Previous News and Status

2005-03-01, autovect-branch
Description of vectorizable loops:
  1. Vectorization is restricted to innermost countable loops, in which all operations operate on data types of the same size, and all memory accesses are consecutive.
  2. Certain forms of conditional code.
  3. Unaligned memory accesses are handled using loop peeling, or loop versioning, or direct misalignment support.
  4. Summation reduction is supported (sum += a[i]).
  5. Constants and invariants are supported (a[i] = 5, a[i] = x).
  6. Vectorization of the subtract-and-saturate idiom is supported.
2005-03-01, mainline (final 4.0 status)
Description of vectorizable loops:
  1. Vectorization is restricted to innermost countable loops, in which all operations operate on data types of the same size, and all memory accesses are consecutive.
  2. Unaligned memory write accesses are handled using loop peeling. This allows vectorization when there's only a single unaligned memory-write (or all memory-writes in the loop have the same misalignment). Unaligned memory read accesses are handled using direct misalignment support. This support is currently available for Alpha ev6, mmx, and altivec.
  3. Constants and invariants are supported (a[i] = 5, a[i] = x).
  4. No reductions (sum += a[i]) or inductions (a[i] = i).
  5. -ftree-vectorizer-verbose=[n] controls vectorization reports, with n ranging from 0 (no information reported) to 6 (all information reported).
2005-02-02
New features that are only in autovect-branch:
  1. Incrementally preserve loop closed SSA form during vectorization.
  2. Loop versioning guarded by a runtime alignment test.
  3. Idiom recognition, and vectorization of the subtract-and-saturate idiom.
  4. Incrementally preserve SSA form during vectorization.
  5. Improved handling of misalignment in case it is known at compile time, or in case multiple accesses are known to have the same misalignment.
  6. Vectorization of conditional code.
  7. Consider dependence distance.
2004-10-27, mainline
Description of vectorizable loops:
  1. Innermost loops that consist of a single basic block (i.e., straight-line code, no if-then-else).
  2. New: No restrictions on the loop bound. An epilog loop is generated to handle unknown loop bounds and loop bounds that are not divisible by the vectorization factor.
  3. Supported memory accesses are multidimensional arrays, and pointers that are annotated as __restricted__.
  4. All memory accesses are consecutive (stride=1).
  5. New: Loops with a single unaligned write to memory (store). This is supported by peeling the loop to force the alignment of the store.
  6. Reads from memory (loads) can be unaligned. This support is currently available for Alpha ev6, mmx, and altivec.
  7. All operations operate on data types of the same size.
  8. No reductions (sum += a[i]) or inductions (a[i] = i).
  9. Constants and invariants are supported (a[i] = 5, a[i] = x).
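The epilog scheme in item 2 can be sketched in plain C (illustrative only; the real transformation emits vector statements rather than the unrolled scalar ones shown here):

```c
int a[256], b[256], c[256];

/* Strip-mined form of "a[i] = b[i] + c[i]", with a (hypothetical)
   vectorization factor of 4: the main loop processes 4 elements
   per iteration (a stand-in for one vector add), and an epilog
   loop handles the n % 4 leftover iterations.  */
void vadd (int n)
{
  int i;
  for (i = 0; i + 4 <= n; i += 4)
    {
      a[i]     = b[i]     + c[i];
      a[i + 1] = b[i + 1] + c[i + 1];
      a[i + 2] = b[i + 2] + c[i + 2];
      a[i + 3] = b[i + 3] + c[i + 3];
    }
  for (; i < n; i++)   /* epilog: at most 3 iterations */
    a[i] = b[i] + c[i];
}
```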
2004-11-10
New branch for vectorization development opened: autovect-branch. lno-branch was retired. ( http://gcc.gnu.org/ml/gcc-patches/2004-11/msg00852.html).
2004-10-27
Mainline merge in progress. Last pending vectorization patches for mainline:
Support for vectorization of conditional code ( http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01768.html).
Consider dependence distance ( http://gcc.gnu.org/ml/gcc-patches/2004-10/msg01046.html).
2004-10-14
Support for unknown loop bound ( http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01104.html, vectorizer merge part #2) committed to mainline.
Peeling for alignment ( http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01894.html, vectorizer merge part #5) committed to mainline.
2004-09-27, mainline
Description of vectorizable loops:
  1. Innermost loops that consist of a single basic block (i.e., straight-line code, no if-then-else).
  2. The loop bound (number of iterations) is known and divisible by the vectorization factor.
  3. New: Supported memory accesses are multi-dimensional arrays, and pointers that are annotated as __restricted__.
  4. New: Additional forms of accesses are supported: Restrictions on the initial value of the access function of pointers and array indexes have been relaxed. These can now take a form like p=&a[16]-4B (pointer), and a[i+off] (arrays).
  5. All memory accesses are consecutive (stride=1).
  6. Writes to memory (stores) are aligned.
  7. New: Reads from memory (loads) can be unaligned.
  8. All operations operate on data types of the same size.
  9. No reductions (sum += a[i]) or inductions (a[i] = i).
  10. Constants and invariants are supported (a[i] = 5, a[i] = x).
2004-09-23
Support for unaligned loads ( http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01522.html, vectorizer merge part #4) committed to mainline.
2004-09-19
Support for additional forms of data references ( http://gcc.gnu.org/ml/gcc-patches/2004-09/msg01521.html, vectorizer merge part #3) committed to mainline.
2004-09-13, lno-branch
Description of vectorizable loops:
  1. Innermost loops that consist of a single basic block (i.e., straight-line code, no if-then-else).
  2. The loop has to be countable - i.e., the number of iterations can be evaluated before the loop starts to execute.
  3. Supported memory accesses are multi-dimensional arrays, and pointers that are annotated as __restricted__.
  4. New: Additional forms of accesses are supported: Restrictions on the initial value of the access function of pointers and array indexes have been relaxed. These can now take a form like p=&a[16]-4B (pointer), and a[i+off] (arrays).
  5. All memory accesses are consecutive (stride=1) and aligned.
  6. Writes to memory (stores) are aligned.
  7. New: Reads from memory (loads) can be unaligned.
  8. Loop versioning for alignment is applied to unaligned stores.
  9. All operations operate on data types of the same size.
  10. No reductions (sum += a[i]) or inductions (a[i] = i).
  11. Constants and invariants are supported (a[i] = 5, a[i] = x).
2004-09-02, lno-branch
Description of vectorizable loops:
  1. Innermost loops that consist of a single basic block (i.e., straight-line code, no if-then-else).
  2. The loop has to be countable - i.e., the number of iterations can be evaluated before the loop starts to execute.
  3. New: The loop bound (number of iterations) can be unknown at compile time, or can be known but not divisible by the vectorization factor.
  4. New: Supported memory accesses are multi-dimensional arrays, and pointers that are annotated as __restricted__
  5. All memory accesses are consecutive (stride=1).
  6. New: Loop versioning for alignment: in the presence of unaligned accesses create two versions of the loop controlled by a runtime alignment check. In one version all the accesses are guaranteed to be aligned, and it can therefore be vectorized. In the second version there may be unaligned accesses, and it remains unvectorized.
  7. All operations operate on data types of the same size.
  8. No reductions (sum += a[i]) or inductions (a[i] = i).
  9. Constants and invariants are supported (a[i] = 5, a[i] = x).
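The loop versioning of item 6 can be sketched as follows (an illustrative sketch assuming 16-byte vector alignment; the runtime check is emitted by the compiler, not written by the user):

```c
#include <stdint.h>

/* Loop versioning for alignment: a runtime check selects between
   two copies of the loop.  In the first copy both pointers are
   known to be 16-byte aligned, so it can be vectorized; the
   second copy remains scalar.  */
void vcopy (int *p, int *q, int n)
{
  int i;
  if ((((uintptr_t) p | (uintptr_t) q) & 15) == 0)
    {
      for (i = 0; i < n; i++)   /* aligned version - vectorizable */
        p[i] = q[i];
    }
  else
    {
      for (i = 0; i < n; i++)   /* fallback - remains scalar */
        p[i] = q[i];
    }
}
```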
2004-08-17, mainline
Description of vectorizable loops:
  1. Innermost loops that consist of a single basic block (i.e., straight-line code, no if-then-else).
  2. The loop bound (number of iterations) is known and divisible by the vectorization factor.
  3. Supported memory accesses are one-dimensional arrays, whose alignment can be forced (not extern, and stack boundary of target platform allows it), and aligned pointers that are annotated as __restricted__.
  4. All memory accesses are consecutive (stride=1) and aligned.
  5. Supportable operations include plus/minus/mult, as well as bitwise operations - and/or/xor/1's-complement, according to available vector support on the target platform.
  6. All operations operate on data types of the same size.
  7. No reductions (sum += a[i]) or inductions (a[i] = i).
  8. Constants and invariants are supported (a[i] = 5, a[i] = x).

2004-08-17, lno-branch
Description of vectorizable loops:
  1. Innermost loops that consist of a single basic block (i.e., straight-line code, no if-then-else).
  2. The loop has to be countable - i.e., the number of iterations can be evaluated before the loop starts to execute.
  3. The loop bound (number of iterations) can be unknown.
  4. New: Supported memory accesses are multidimensional arrays, whose alignment can be forced (not extern), and aligned pointers that are annotated as __restricted__.
  5. All memory accesses are consecutive (stride=1) and aligned.
  6. Supportable operations include plus/minus/mult, as well as bitwise operations - and/or/xor/1's-complement, according to available vector support on the target platform.
  7. All operations operate on data types of the same size.
  8. No reductions (sum += a[i]) or inductions (a[i] = i).
  9. Constants and invariants are supported (a[i] = 5, a[i] = x).

2004-08-17
Initial vectorization functionality committed to mainline ( http://gcc.gnu.org/ml/gcc-patches/2004-08/msg01219.html, http://gcc.gnu.org/ml/gcc-patches/2004-07/msg02127.html, vectorizer merge part #1).
2004-07-20, lno-branch
Description of vectorizable loops:
  1. Innermost loops that consist of a single basic block (i.e., straight-line code, no if-then-else).
  2. The loop has to be countable - i.e., the number of iterations can be evaluated before the loop starts to execute.
  3. The loop bound (number of iterations) can be unknown.
  4. New: Supported memory accesses are one-dimensional arrays, whose base can be a struct field, and whose alignment can be forced (not extern arrays), and aligned pointers that are annotated as __restricted__.
  5. All memory accesses are consecutive (stride=1) and aligned. Arrays are alignable.
  6. Supportable operations include plus/minus/mult, as well as bitwise operations - and/or/xor/1's-complement, according to available vector support on the target platform.
  7. All operations operate on data types of the same size.
  8. No reductions (sum += a[i]) or inductions (a[i] = i).
  9. Constants and invariants are supported (a[i] = 5, a[i] = x).
  10. New: first gfortran program vectorized.

2004-06-25, apple-ppc-branch
Description of vectorizable loops:
  1. Innermost loops that consist of a single basic block (i.e., straight-line code, no if-then-else).
  2. The loop has to be countable - i.e., the number of iterations can be evaluated before the loop starts to execute.
  3. The loop bound (number of iterations) can be unknown.
  4. Supported memory accesses are one-dimensional arrays, and pointers. If more than one memory access is present in the loop, any pointers that are used in the loop have to be annotated as __restricted__.
  5. Store (memory-write) accesses have to be aligned.
  6. New: Loads (memory-reads) can be unaligned by an unknown amount (e.g. access a[i+x], where the value of x is unknown). Misalignment support for loads was also made more efficient.
  7. All memory accesses are consecutive (stride=1).
  8. Supportable operations include plus/minus/mult, as well as bitwise operations - and/or/xor/1's-complement, according to available vector support on the target platform.
  9. All operations operate on data types of the same size.
  10. Some forms of if-then-else patterns can be vectorized.
  11. New: Infrastructure for idiom recognition has been added. The first computation idiom that is recognized and vectorized is a multiplication of unsigned chars, whose (unsigned short) product is converted back to unsigned char (a similar computation comes up in pixel blending; we support a simplified version that does not require operating on different data types).
  12. No reductions (sum += a[i]) or inductions (a[i] = i).
  13. Constants and invariants are supported (a[i] = 5, a[i] = x). New: Support for invariants was made more efficient.
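The computation idiom of item 11 corresponds roughly to the following (a simplified, illustrative sketch; names are not from the GCC sources):

```c
unsigned char pa[256], pb[256], pc[256];

/* Recognized idiom: unsigned chars are multiplied as a wider
   (unsigned short) product, which is then converted back to
   unsigned char - a simplified form of the computation that
   comes up in pixel blending.  */
void blend (int n)
{
  int i;
  for (i = 0; i < n; i++)
    pc[i] = (unsigned char) ((unsigned short) (pa[i] * pb[i]));
}
```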

2004-06-23, lno-branch
Description of vectorizable loops:
  1. Innermost loops that consist of a single basic block (i.e., straight-line code, no if-then-else).
  2. The loop has to be countable - i.e., the number of iterations can be evaluated before the loop starts to execute.
  3. The loop bound (number of iterations) can be unknown.
  4. Supported memory accesses are one-dimensional arrays, whose alignment can be forced (not extern arrays).
  5. New: Memory accesses can also be pointer based. If more than one memory access is present in the loop, any pointers that are used in the loop have to be annotated as __restricted__. The pointers have to point to an aligned address.
  6. All memory accesses are consecutive (stride=1) and aligned.
  7. Supportable operations include plus/minus/mult, as well as bitwise operations - and/or/xor/1's-complement, according to available vector support on the target platform.
  8. All operations operate on data types of the same size.
  9. No reductions (sum += a[i]) or inductions (a[i] = i).
  10. Constants and invariants are supported (a[i] = 5, a[i] = x).

2004-06-17, apple-ppc-branch
Description of vectorizable loops:
  1. Innermost loops that consist of a single basic block (i.e., straight-line code, no if-then-else).
  2. The loop has to be countable - i.e., the number of iterations can be evaluated before the loop starts to execute.
  3. The loop bound (number of iterations) can be unknown.
  4. Memory accesses are one-dimensional arrays, whose alignment can be forced (not extern arrays).
  5. New: Memory accesses can also be pointer based. If more than one memory access is present in the loop, any pointers that are used in the loop have to be annotated as __restricted__. (new experimental feature)
  6. New: Loads (memory reads) can be unaligned by a known amount (e.g. access a[i+1], where array a is aligned and i starts from 0). Stores (memory writes) still have to be aligned.
  7. All memory accesses are consecutive (stride=1).
  8. Supportable operations include plus/minus/mult, as well as bitwise operations - and/or/xor/1's-complement, according to available vector support on the target platform.
  9. All operations operate on data types of the same size.
  10. New: Some forms of if-then-else patterns can be vectorized. (new experimental feature).
  11. No reductions (sum += a[i]) or inductions (a[i] = i).
  12. Constants and invariants are supported (a[i] = 5, a[i] = x).

2004-06-04, lno-branch (and apple-ppc-branch)
Description of vectorizable loops:
  1. Innermost loops that consist of a single basic block (i.e., straight-line code, no if-then-else).
  2. The loop has to be countable - i.e., the number of iterations can be evaluated before the loop starts to execute.
  3. New: No restrictions on the form of the loop index and the loop exit condition.
  4. New: The loop bound (number of iterations) can be unknown. Currently, known loop-bounds still need to be divisible by the vectorization factor.
  5. All memory accesses are one-dimensional arrays, whose alignment can be forced (not extern arrays).
  6. All array accesses are consecutive and aligned; i.e., all the array references are of the form a[i], where i is updated from 0 to N in steps of 1.
  7. New: Supportable operations include plus/minus/mult, as well as bitwise operations - and/or/xor/1's-complement, according to available vector support on the target platform.
  8. All operations operate on data types of the same size.
  9. No reductions (sum += a[i]) or inductions (a[i] = i).
  10. New: Constants and invariants are supported (a[i] = 5, a[i] = x).

2004-01-01, lno-branch
Description of vectorizable loops:
  1. Innermost loops that consist of a single basic block (i.e., straight-line code, no if-then-else).
  2. The loop has to be countable - i.e., the number of iterations can be evaluated before the loop starts to execute.
  3. Known (constant) loop bound, divisible by the vectorization factor.
  4. The loop index i is updated from 0 to N in steps of 1, and the loop exit condition is of the form i<N.
  5. All memory accesses are one-dimensional arrays, whose alignment can be forced (not extern arrays).
  6. All array accesses are consecutive (stride=1) and aligned.
  7. Supportable operations include plus/minus/mult, according to available vector support on the target platform.
  8. All operations operate on data types of the same size.
  9. No reductions (sum += a[i]) or inductions (a[i] = i).
  10. No constants or invariants in the loop (a[i] = 5).

References/Documentation

  1. "Auto-Vectorization of Interleaved Data for SIMD", Dorit Nuzman, Ira Rosen and Ayal Zaks. To appear in the ACM SIGPLAN 2006 Conference on Programming Language Design and Implementation (PLDI), Ottawa, Canada, June 10-16, 2006.
  2. "Multi-platform Auto-vectorization", Dorit Nuzman and Richard Henderson. To appear in CGO-4 (The 4th Annual International Symposium on Code Generation and Optimization), March 26-29, 2006, Manhattan, New York.
  3. "Autovectorization in GCC", GCC summit, June 2004. ftp://gcc.gnu.org/pub/gcc/summit/2004/Autovectorization.pdf
  4. "The Software Vectorization Handbook. Applying Multimedia Extensions for Maximum Performance.", Aart Bik, Intel Press, June 2004. http://www.intel.com/intelpress/sum_vmmx.htm
  5. "Vectorization for SIMD Architectures with Alignment Constraints", Alexandre E. Eichenberger, Peng Wu, Kevin O'Brien, PLDI'04, June 9-11 2004.
  6. "Optimizing Compilers for Modern Architectures - A dependence based approach", Randy Allen & Ken Kennedy, Morgan Kaufmann Publishers, San Francisco, San Diego, New York (2001).
  7. "Exploiting Superword Level Parallelism with Multimedia Instruction Sets", Samuel Larsen and Saman Amarasinghe, PLDI 2000.

High-Level Plan of Implementation

The outline below presents the high-level vectorization scheme along with a proposal for an implementation scheme: for each routine of the vectorization driver, a description is followed by the list of restrictions under which the basic vectorizer initially operates. Enhancements that relax these restrictions are listed further below.
analyze_loop_CFG(loop)

Checks the control flow properties of the loop (number of basic-blocks it consists of, nesting, single entry/exit, etc.), in order to determine whether the control flow of the loop falls within the range of loop forms that are supported by this vectorizer.

  • inner-most (single nest) loops.
  • single basic block loops; i.e., no if-then-else constructs, etc. (in practice this means loops that consist of exactly two basic blocks - header+latch).
  • other restrictions (single successor/predecessor, a pre-header block).

analyze_loop_index_and_bound(loop)

Analyzes the loop termination condition to determine the loop bound and properties of the loop index (its bounds and step). The functionality of this utility should be largely provided by the information computed by the Induction Variable Analyzer.

  • handle simple normalized loops (loop index is a trivial IV with step 1, etc.), with a simple termination condition.
  • loop-bound known at compile time.
  • loop-bound >= vector_size and loop-bound % vector_size = 0.

analyze_loop_stmts(loop-stmts)

Scan the loop statements and check whether there are any statements that prohibit vectorization (function calls, statements that don't have a mapping to a built-in vector function, etc.)

  • simple operations for which there's a 1-1 mapping between the scalar and vector operations.
  • no support for scalar expansion, induction variables, reduction operations...
  • no mixture of data types (all statements operate on the same data types).

analyze_access_pattern(loop-mem-refs)

Analyze the memory references in the loop, and classify them according to the access pattern that they exhibit.

  • support only memory accesses which are array references (no pointers...).
  • support only consecutive (unit stride) access pattern.

analyze_alignment(loop-mem-refs)

Analyze the alignment of the memory references in the loop. For each memory reference, record its misalignment amount, if it can be resolved at compile time.

  • misalignment amount for all memory references is known at compile time.
  • misalignment is zero for all references (all references are aligned).
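The loop peeling used elsewhere in the vectorizer to force alignment produces code of roughly this shape (an illustrative sketch, assuming 16-byte vectors; the function name is hypothetical):

```c
#include <stdint.h>

/* Peeling for alignment (illustrative sketch): run scalar
   iterations until the store address p + i reaches a 16-byte
   boundary, then enter the main loop, in which every store is
   aligned and which can therefore be vectorized.  */
void fill (int *p, int n, int x)
{
  int i = 0;
  /* Prolog: peeled scalar iterations.  */
  while (i < n && (((uintptr_t) (p + i)) & 15) != 0)
    p[i++] = x;
  /* Main loop: all stores p[i] are 16-byte aligned.  */
  for (; i < n; i++)
    p[i] = x;
}
```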

analyze_loop_carried_dependences(loop)

Build the loop dependence graph (for scalar and array references); Detect Strongly Connected Components (SCCs) in the graph (statements that are involved in a dependence cycle); Perform a topological sort on the reduced graph (in which each SCC is represented by a single node); Only singleton nodes w/o self dependencies can be vectorized. If other (compound) nodes (which represent SCCs) are present, loop transformations are required.

  • handle only loops that do not contain any SCCs (i.e., no dependence cycles).
  • the only scalar loop-carried dependencies allowed are of IVs which are used in array references or for the loop index (i.e., reduction is not supported).
  • use simplest dependence tests.
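For example, the following loop carries a dependence cycle (a non-singleton SCC) and is therefore rejected under these restrictions (illustrative code):

```c
int a[256], b[256];

/* a[i] depends on a[i-1], computed in the previous iteration: a
   loop-carried dependence cycle (SCC), so the iterations cannot
   be executed in parallel and the basic vectorizer rejects the
   loop.  */
void recur (int n)
{
  int i;
  for (i = 1; i < n; i++)
    a[i] = a[i - 1] + b[i];
}
```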

estimate_vectorization_profitability(loop)

At this point, it has been determined that the loop is vectorizable. It remains to decide whether it is indeed profitable to vectorize it.

  • vectorize all loops with loop_bound >= MIN_LIMIT (?).

vectorize_loop(loop)

Replace the scalar statements with the corresponding vector statements (which could be calls to builtin functions); Also change the loop bound accordingly.

The following is a list of independent directions by which the basic vectorizer can be enhanced. It should be possible for different people to work on different items on this list. Some of these items are already under development, or (partially) supported.

  1. Loop detection and loop CFG analysis

    Detect loops, and record some basic control flow information about them (contained basic blocks, loop pre-header, exit and entry, etc.).

    Status: Loop detection and control flow analysis is already supported (cfgloop.c, cfgloopanal.c).

  2. Modeling the target machine vector capabilities to the tree-level.

    Expose the required target specific information to the tree level. This includes providing a mapping from scalar operations to the corresponding vector support, which will answer the following questions:

    1. Does vector support for this operation exist?
    2. At what cost?
    3. How to express the vector operation at the tree-level?

    The general SIMD support in GCC already provides some initial support: for simple operations that can be expressed using existing (scalar) tree-codes (PLUS_EXPR, MULT_EXPR, etc.), the existing infrastructure can answer questions 1 and 3 above; however, the tree level currently has no notion of the cost that the transformation actually entails. A first basic implementation will support only simple operations that fall into this category.

    As the capabilities of the vectorizer are extended, it will become necessary to inform the vectorizer of the advanced capabilities available in the architecture (for example, support for operations on complex numbers, reduction, etc.). Such operations cannot be expressed using existing tree-codes. Possible solutions: introduce new tree-codes (and corresponding optabs); introduce new builtins that are exposed to the compiler; use target hooks to handle these cases (the hook could return a call to a machine-specific builtin function).

    Another related design question that needs to be addressed here is how much information to expose to the tree level: is it sufficient to indicate that, say, conditional vector addition is supported, or should the vectorizer actually generate the required masking/predication/select operations depending on the target? The same question arises for alignment, multiplication of integers, etc.

    Status: Open for discussion.
    Related discussion: http://gcc.gnu.org/ml/gcc/2004-08/msg00317.html

  3. Enhance the Builtins Support

    Currently the tree optimizers do not know the semantics of target-specific builtin functions, so they do not attempt to optimize them (or to put the variables passed as arguments to these functions into SSA form). Since the vectorizer will probably end up generating calls to target-specific builtin functions, this situation needs to improve: the semantics of these builtins need to somehow be exposed to the compiler.

    Status: Open for discussion.

  4. Cost Model

    There is an overhead associated with vectorization -- moving data into/out of vector registers before/after the vectorized loop, aligning data accesses, etc. A cost model needs to be incorporated into the machine description so that the vectorizer can evaluate whether it is worthwhile to vectorize a given loop. One can also consider using run-time tests to decide which version of the loop to execute (scalar or vectorized).

    Status: Open for discussion.
    Related discussion: http://gcc.gnu.org/ml/gcc-patches/2003-09/msg00469.html
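
The shape of such a model can be sketched as follows. All numbers and names here are hypothetical, purely for illustration: vectorization pays off when the per-iteration savings outweigh the fixed overhead of alignment handling and moving data into vector registers, plus the scalar cost of any leftover iterations.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy profitability check (illustrative only, not GCC's model):
   n           - number of loop iterations
   scalar_cost - cost of one scalar iteration
   vector_cost - cost of one vector iteration
   vf          - vectorization factor (elements per vector)
   overhead    - fixed setup cost (alignment, register moves, ...) */
bool profitable(int n, int scalar_cost, int vector_cost,
                int vf, int overhead) {
    int scalar_total = n * scalar_cost;
    int vector_total = (n / vf) * vector_cost      /* vector body   */
                     + (n % vf) * scalar_cost      /* epilogue      */
                     + overhead;                   /* fixed startup */
    return vector_total < scalar_total;
}
```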

  5. Induction Variable Analysis

    Used by the vectorizer to detect the loop bound, analyze access patterns, and analyze data dependencies between array references. One option is to design the dependence tests to deal with array references that are expressed as a linear function of the iteration counter (in this case, vectorization will also benefit from optimizations like induction variable substitution and replacement of auxiliary IVs with linear functions of the loop index). A dependence tester based on IVs represented in this form would analyze each subscript of each array reference and apply the appropriate dependence test (SIV, ZIV, MIV, etc.; see dependence testing). Alternatively, an induction variable evolution analyzer could provide a different implementation of the dependence tester. The latter is the solution currently used, based on the induction variable evolution analysis developed by Sebastian Pop.

    Status: Using the IV evolution analyzer developed by Sebastian Pop.
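
As a hypothetical illustration of induction variable substitution (names invented for this sketch): the pointer-style counter p below is an auxiliary IV with step 2; rewriting the access as a linear function of the loop index i gives the dependence tester a subscript it can analyze directly.

```c
#include <assert.h>

/* Before substitution: 'p' is an auxiliary IV (p == 2*i). */
int sum_with_aux_iv(const int *a, int n) {
    int s = 0, p = 0;
    for (int i = 0; i < n; i++) {
        s += a[p];
        p += 2;                 /* auxiliary IV with step 2 */
    }
    return s;
}

/* After substitution: the same access expressed as a linear
   function of the loop index, analyzable by SIV/MIV tests. */
int sum_substituted(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[2 * i];
    return s;
}
```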

  6. Dependence Testing

    Following the classic dependence-based approach to vectorization described in [1], apply dependence tests to pairs of array references in each loop nest, and analyze the resulting dependence graph. We will start from a dependence analyzer that relies on the array references being expressed as a linear function of the loop index, apply the simplest dependence tests to all read/write and write/write pairs of memory references, and gradually extend its capabilities. The scheme below follows the algorithm described in [2]:

    Status: Many of the tests above are implemented. The Omega test is in the works. We do not yet build a dependence graph (DDG) based on the results of these tests.
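
One of the simplest of these tests, the strong SIV test, can be sketched as follows (the function name and interface are invented for this illustration). For subscripts x*i + c1 and x*i + c2 in a loop of n iterations, a dependence exists iff (c1 - c2) is divisible by x and the resulting distance lies within the iteration space; x == 0 degenerates to a ZIV test.

```c
#include <assert.h>
#include <stdbool.h>

/* Strong SIV test for the subscript pair x*i + c1 and x*i + c2
   in a loop of n iterations. On success, *dist receives the
   dependence distance. */
bool strong_siv_dependence(int x, int c1, int c2, int n, int *dist) {
    if (x == 0)
        return c1 == c2;        /* ZIV: constant subscripts */
    int d = c1 - c2;
    if (d % x != 0)
        return false;           /* non-integer distance: independent */
    *dist = d / x;
    return *dist > -n && *dist < n;   /* inside the iteration space */
}
```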

  7. Access Pattern Analysis

    The memory architecture usually allows only restricted accesses to data in memory; one of the restrictions is that the accessed data must be consecutive in memory. Any other access (strided, for example) requires a special permutation of the data in order to pack the data elements into a vector register in the right order. Support for different access patterns consists of the following stages:

    Status: Future work.
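
A stride-2 access illustrates the problem. Written out in scalar form (purely as an illustration), the permutation a vectorized a[2*i] access needs amounts to gathering the even-indexed elements of consecutive memory into a packed destination; on most SIMD ISAs this requires explicit extract-even/extract-odd style shuffles.

```c
#include <assert.h>

/* Scalar form of the packing that a stride-2 load requires:
   gather the even-indexed elements of 'src' (2*n elements)
   into the contiguous vector-register layout of 'dst'. */
void extract_even(const int *src, int *dst, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = src[2 * i];
}
```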

  8. Extend the range of supportable operations

    At first, the only computations that will be vectorized are those for which the vectorization process consists of trivially replacing each scalar operation in the loop with its vector counterpart. This includes simple loads, stores, and arithmetic operations that operate on the same data type. Some computations require extra code to be generated in order to be vectorized. These include:

    Status: Future work.
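
A multi-type loop is the canonical example of a computation needing extra code (this scalar sketch is illustrative). Because a vector of shorts holds twice as many elements as a vector of ints, each vector of shorts must be unpacked into two vectors of ints (the VEC_UNPACK_HI/LO operations mentioned in the news above) before the additions, so a one-to-one statement replacement is not enough.

```c
#include <assert.h>

/* Scalar form of a multi-type loop: 16-bit inputs widened to
   32-bit results. Vectorizing it requires unpacking each short
   vector into hi/lo int vectors before adding. */
void widen_add(int *out, const short *a, const short *b, int n) {
    for (int i = 0; i < n; i++)
        out[i] = (int)a[i] + (int)b[i];
}
```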

  9. Alignment

    The memory architecture usually allows only restricted accesses to data in memory. One of the restrictions is that data accesses need to be properly aligned on a certain boundary. Even if the architecture supports unaligned accesses, these are usually much more costly than aligned accesses. The work on alignment consists of several stages:

    Status: Currently we handle unaligned stores by peeling the loop to force the alignment of the store; this is not always applicable. Vectorizing unaligned stores is in the works.
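
The peel count computation can be sketched as follows (the function name is invented; this is not GCC's implementation). Given a pointer, we derive how many scalar iterations must run before the access reaches a 16-byte boundary, after which the remaining accesses can use aligned vector loads/stores; it assumes the pointer is at least element-aligned.

```c
#include <assert.h>
#include <stdint.h>

/* Number of scalar iterations to peel so that 'a' becomes
   16-byte aligned. Assumes 'a' is at least 4-byte (element)
   aligned, so the misalignment is a whole number of ints. */
int peel_count_for_alignment(const int *a) {
    uintptr_t mis = (uintptr_t)a & 15;   /* misalignment in bytes */
    return mis ? (int)((16 - mis) / sizeof(int)) : 0;
}
```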

  10. Idiom Recognition

    It is often the case that complicated computations can be reduced into a simpler, straight-line sequence of operations. These operations may not be directly supported in a scalar form, but are supported by the target in a vector form. Such cases include:

    Status: Initial support in place.
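
The dot-product idiom mentioned in the news above is a representative case. In scalar C it looks as follows; when the target provides support, GCC 4.2 recognizes this shape (a widening short*short multiply accumulated into an int) and maps it to the DOT_PROD tree-code instead of emitting unpack/pack sequences.

```c
#include <assert.h>

/* The dot-product idiom: products of 16-bit elements accumulated
   into a 32-bit sum, recognizable as DOT_PROD. */
int sdot(const short *a, const short *b, int n) {
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}
```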

  11. Conditional Execution

    The general principle we are trying to follow is to keep the actual code transformation part of the vectorizer as simple as possible: a simple scan of straight-line code, and a one-to-one replacement of each scalar operation with the equivalent vector operation. To support this scheme in the presence of conditional execution, we'll need to flatten the loop body by collapsing each if-then-else into a conditional (scalar) operation (something like transforming 'if (x) {c = PLUS (a,b)}' into 'c = PLUS_COND (a,b,x)'). These conditional operations will later be replaced with a conditional vector operation using whatever support is available on the target (masking, predication, or a select operation). Flattening the loop body this way will greatly simplify the vectorizer. Some of the issues to think about here: (a) how to represent these conditional operations; (b) to what extent the tree vectorizer needs to be aware of the specific target support that is available for conditional vector execution (mask/predicate/select); and (c) how to allow a simple way to reverse this transformation if the loop doesn't end up getting vectorized.

    Status: Done. Contact: Devang Patel.
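
The scalar analogue of the PLUS_COND flattening can be written out as follows (illustrative C, not the tree-level representation): the branch disappears, the addition executes unconditionally, and a select chooses between the new value and the old one, which is exactly the form a masked or predicated vector operation implements.

```c
#include <assert.h>

/* If-converted form of: if (x[i]) c[i] = a[i] + b[i];
   The branch is replaced by an unconditional add plus a select. */
void cond_add(int *c, const int *a, const int *b, const int *x, int n) {
    for (int i = 0; i < n; i++) {
        int t = a[i] + b[i];        /* compute unconditionally */
        c[i] = x[i] ? t : c[i];     /* select replaces the branch */
    }
}
```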

  12. Handle Advanced Loop Forms

    1. Support a general loop bound (unknown, or not divisible by the vectorization factor). [done. Contact: Olga Golovanevsky]
    2. Support more complex forms of loop termination condition and loop index update. [done]
    3. Support outer-loop vectorization (unroll and jam). [planned]
    4. Relax other restrictions on the loop form.

    Status: Future work.
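
Item 1 above amounts to the following structure (a hand-written sketch; the inner j loop merely stands in for a single vector operation): a main loop that advances by the vectorization factor, followed by a scalar epilogue that handles the n % 4 leftover iterations, so any loop bound is supported.

```c
#include <assert.h>

/* Handling a loop bound the vectorization factor (4) need not
   divide: vector-width-strided main loop + scalar epilogue. */
void add_any_bound(int *a, const int *b, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4)          /* "vectorized" main loop */
        for (int j = 0; j < 4; j++)     /* stands in for one vector add */
            a[i + j] += b[i + j];
    for (; i < n; i++)                  /* scalar epilogue */
        a[i] += b[i];
}
```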

  13. Handle Pointer Aliasing

    1. Improve aliasing analysis. [various gcc projects deal with improvements to alias analysis].
    2. Generate run-time tests for cases where memory anti-aliasing cannot be resolved at compile time. [Planned. Contact: Keith Besaw]
    3. Support user hints. [in the works. See http://gcc.gnu.org/ml/gcc-patches/2005-02/msg01560.html. Contact: Devang Patel]

    Status: In the works.
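
The planned run-time anti-aliasing test (item 2) boils down to a range-overlap check like the following sketch (function name invented): two accessed regions are disjoint iff each ends at or before the other begins, and the vectorized loop version runs only when the check reports no overlap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Run-time overlap test for the regions [a, a+n) and [b, b+n).
   Addresses are compared as integers to sidestep C's restrictions
   on relational comparison of unrelated pointers. */
bool ranges_overlap(const int *a, const int *b, int n) {
    uintptr_t lo1 = (uintptr_t)a, hi1 = (uintptr_t)(a + n);
    uintptr_t lo2 = (uintptr_t)b, hi2 = (uintptr_t)(b + n);
    return lo1 < hi2 && lo2 < hi1;
}
```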

  14. Aliasing and Virtual def-use Chains

    Address the item from the tree-ssa todo list - "SSA information for arrays : The existing implementation treats arrays as an opaque object. A definition to an array location is treated as a definition for the whole array"

    Status: Open for discussion.

  15. Array addressing Represented as Pointer Arithmetic

    Address the issue mentioned in http://gcc.gnu.org/ml/gcc/2003-07/msg02013.html, which turns out to be a front end issue.

    Status: In the works. See http://gcc.gnu.org/wiki/Add%20MEM_REF%20operation.

  16. Loop versioning

    Provide utilities that allow performing the following transformation: given a condition and a loop, create 'if (condition) { loop_copy1 } else { loop_copy2 }', where loop_copy1 is the loop transformed in one way, and loop_copy2 is the loop transformed in another way (or unchanged). 'condition' may be a run-time test for things that were not resolved by static analysis (overlapping ranges (anti-aliasing), alignment, etc.).

    Status: Done. Contact: Devang Patel.

  17. Loop Transformations to Increase Vectorizability of Loops

    These include:

    Status: Linear loop transformations have been implemented by Daniel Berlin.

  18. Other Optimizations

    1. Exploit data reuse (a la "Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architectures" by Shin, Chame and Hall). Mostly relies on unroll & jam having been applied.
    2. Vectorize loops that can't be vectorized by the classic vectorizer (until the proper loop transformations are developed) by applying SLP vectorization (a la "Exploiting Superword Level Parallelism with Multimedia Instruction Sets" by Larsen and Amarasinghe). This scheme can potentially vectorize partially vectorizable loops, or loops that are already unrolled in the source code, more easily. SLP vectorization could be implemented either at the tree level or at the RTL level as a complementary approach to classic loop vectorization.

    Status: Future work.
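
The already-unrolled case that SLP targets looks like the following. Classic loop vectorization sees no loop to widen here, but SLP can pack the four isomorphic statements into a single vector add, since they access adjacent memory and perform the same operation.

```c
#include <assert.h>

/* Straight-line code already unrolled in the source: four
   isomorphic adds on adjacent elements, an SLP candidate for
   packing into one vector add. */
void unrolled_add(int *a, const int *b, const int *c) {
    a[0] = b[0] + c[0];
    a[1] = b[1] + c[1];
    a[2] = b[2] + c[2];
    a[3] = b[3] + c[3];
}
```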

  19. User Hints

    Using user hints for different purposes (aliasing, alignment, profitability of vectorizing a loop, etc.).

    Status: In the works. See http://gcc.gnu.org/ml/gcc-patches/2005-02/msg01560.html.