The goal of this project is to develop a loop vectorizer in GCC, based on the tree-ssa framework. This work is taking place in the autovect-branch.
Newly added is support for type conversions: type promotion is supported using the new VEC_UNPACK_HI and VEC_UNPACK_LO tree-codes (and the new vec_unpacks_hi/lo and vec_unpacku_hi/lo optabs), and type demotion is supported using the new VEC_PACK_MOD tree-code (and the new vec_pack_mod optab). Also supported is the detection of computation idioms such as widening-summation (WIDEN_SUM), dot-product (DOT_PROD), widening-multiplication (WIDEN_MULT, VEC_WIDEN_MULT_HI/LO), multiply-highpart (MULT_HI) and sum-of-absolute-differences (SAD).
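For illustration, a loop of the following form exhibits the dot-product idiom (this sketch is not taken from the example list on this page; the function name dot_prod and the array sizes are assumptions):

#define N 256
short sb[N], sc[N];

int
dot_prod ()
{
  int i, sum = 0;
  /* each product of two shorts is computed in int precision and then
     accumulated into an int sum: the dot-product (DOT_PROD) idiom,
     which also exercises the widening-multiplication support.  */
  for (i = 0; i < N; i++)
    sum += sb[i] * sc[i];
  return sum;
}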
Current contributors to this project include Dorit (Naishlos) Nuzman, Ira Rosen, Victor Kaplansky and Devang Patel. This web page is maintained by Dorit (Naishlos) Nuzman <dorit@il.ibm.com>.
Required enhancements and missing features (some of these are in the works or in our near-term plans; other items are open for others to contribute!) are described in the list of enhancements later on this page.
Vectorization is enabled by the flag -ftree-vectorize. To allow vectorization on powerpc* platforms also use -maltivec; on i?86 and x86_64 platforms use -msse/-msse2. The vectorizer test cases demonstrate the current vectorization capabilities; these can be found under gcc/gcc/testsuite/gcc.dg/vect/. Information on which loops were or were not vectorized, and why, can be obtained using the flag -ftree-vectorizer-verbose. For details see http://gcc.gnu.org/ml/gcc-patches/2005-01/msg01247.html.

Example output using -ftree-vectorizer-verbose=5:
vect-1.c:34: note: not vectorized: unsupported use in stmt.
vect-1.c:43: note: not vectorized: nested loop.
vect-1.c:44: note: not vectorized: unsupported use in stmt.
vect-1.c:52: note: LOOP VECTORIZED.
vect-1.c:59: note: LOOP VECTORIZED.
vect-1.c:66: note: not vectorized: complicated access pattern.
vect-1.c:74: note: LOOP VECTORIZED.
vect-1.c:85: note: not vectorized: mixed data-types
vect-1.c:94: note: not vectorized: possible dependence between data-refs a[i_283] and a[i_48]
vect-1.c:14: note: vectorized 3 loops in function.
Examples of loops that can currently be vectorized by the autovect-branch. The "feature" comments indicate the vectorization capabilities demonstrated by each example.
example1:

int a[256], b[256], c[256];

foo () {
  int i;

  for (i=0; i<256; i++){
    a[i] = b[i] + c[i];
  }
}

example2:
int a[256], b[256], c[256];

foo (int n, int x) {
  int i;

  /* feature: support for unknown loop bound  */
  /* feature: support for loop invariants  */
  for (i=0; i<n; i++)
    b[i] = x;

  /* feature: general loop exit condition  */
  /* feature: support for bitwise operations  */
  while (n--){
    a[i] = b[i]&c[i];
    i++;
  }
}

example3:
typedef int aint __attribute__ ((__aligned__(16)));

foo (int n, aint * __restricted__ p, aint * __restricted__ q) {
  /* feature: support for (aligned) pointer accesses.  */
  while (n--){
    *p++ = *q++;
  }
}

example4:
typedef int aint __attribute__ ((__aligned__(16)));
int a[256], b[256], c[256];

foo (int n, aint * __restricted__ p, aint * __restricted__ q) {
  int i, j;

  /* feature: support for (aligned) pointer accesses  */
  /* feature: support for constants  */
  while (n--){
    *p++ = *q++ + 5;
  }

  /* feature: support for read accesses with a compile time known misalignment  */
  for (i=0; i<n; i++){
    a[i] = b[i+1] + c[i+3];
  }

  /* feature: support for if-conversion (only in autovect-branch)  */
  for (i=0; i<n; i++){
    j = a[i];
    b[i] = (j > MAX ? MAX : 0);
  }
}

example5:
struct a { int ca[N]; } s;

for (i = 0; i < N; i++) {
  /* feature: support for alignable struct access  */
  s.ca[i] = 5;
}

example6 (gfortran):
DIMENSION A(1000000), B(1000000), C(1000000)
READ*, X, Y
A = LOG(X); B = LOG(Y); C = A + B
PRINT*, C(500000)
END

example7:
int a[256], b[256];

foo (int x) {
  int i;

  /* feature: support for read accesses with an unknown misalignment  */
  for (i=0; i<N; i++){
    a[i] = b[i+x];
  }
}

example8:
int a[M][N];

foo (int x) {
  int i,j;

  /* feature: support for multidimensional arrays  */
  for (i=0; i<M; i++) {
    for (j=0; j<N; j++) {
      a[i][j] = x;
    }
  }
}

example9:
unsigned int ub[N], uc[N];

foo () {
  int i;

  /* feature: support summation reduction.
     note: in case of floats use -funsafe-math-optimizations  */
  unsigned int udiff = 0;
  for (i = 0; i < N; i++) {
    udiff += (ub[i] - uc[i]);
  }
}

example10:
/* feature: support for data-types of different sizes.
   Currently only a single vector-size per target is supported;
   it can accommodate n elements such that n = vector-size/element-size
   (e.g, 4 ints, 8 shorts, or 16 chars for a vector of size 16 bytes).
   A combination of data-types of different sizes in the same loop
   requires special handling, now present in autovect-branch.
   This also includes support for type conversions.  */
short *sa, *sb, *sc;
int *ia, *ib, *ic;

for (i = 0; i < N; i++) {
  ia[i] = ib[i] + ic[i];
  sa[i] = sb[i] + sc[i];
}

for (i = 0; i < N; i++) {
  ia[i] = (int) sb[i];
}
Examples of loops that currently cannot be vectorized:
example1: uncountable loop:

while (*p != NULL) {
  *q++ = *p++;
}

example2: induction:
for (i = 0; i < N; i++) {
  a[i] = i;
}

example3: strided access - the data elements that can be operated upon in parallel are not consecutive; they are accessed with a stride > 1 (in the example below, the stride is 2). (This is now vectorizable on autovect-branch):
for (i = 0; i < N/2; i++){
  a[i] = b[2*i+1] * c[2*i+1] - b[2*i] * c[2*i];
  d[i] = b[2*i] * c[2*i+1] + b[2*i+1] * c[2*i];
}
Previous status updates: the first versions of the vectorizer in mainline handled only simple loops in which the loop index i is updated from 0 to N in steps of 1, the loop exit condition is of the form i<N, and array accesses are of the form a[i]. Stores of constants and of loop invariants (a[i] = 5, a[i] = x) were supported. Arrays had to be aligned (for example declared extern, when the stack boundary of the target platform allows it), and pointer accesses were supported only for aligned pointers annotated as __restricted__ (at first a new experimental feature). There was no support for reductions (sum += a[i]) or inductions (a[i] = i). Later updates added support for read accesses with a compile-time-known misalignment (p=&a[16]-4B for pointers, a[i+off] for arrays, and accesses such as a[i+1] where array a is aligned and i starts from 0), then for read accesses with an unknown misalignment (a[i+x], where the value of x is unknown); misalignment support for loads was also made more efficient, while stores (memory writes) still had to be aligned. Support for invariants was likewise made more efficient. Each of these updates listed examples of newly vectorizable loops.
The table below outlines the high level vectorization scheme along with a proposal for an implementation scheme, as follows:
The first column ("vectorization driver") lists the tasks that the vectorizer consists of, and briefly describes the expected functionality of each task.
The second column ("basic-vectorizer") describes a
proposal for a basic vectorizer that provides minimal
support for each of these tasks, listing the
restrictions that will be imposed by the basic
vectorizer on candidate loops. Loops that are
considered vectorizable by the basic-vectorizer are of
the form: for(i=0; i<N; i++) {a[i] = b[i] +
c[i]; }.
The "basic vectorizer" was implemented by
Dorit (Naishlos) Nuzman.
The third column ("enhancements") lists possible directions for extending the capabilities of the basic vectorizer. Some of these enhancements aim at improving the quality of the vector code that is generated; others aim at broadening the range of computations that are amenable to vectorization; others focus on improved robustness. Following the table is a complete and detailed list of these enhancements. Next to each item that is already being addressed is the name of the relevant contact person.
vectorization driver | basic vectorizer | enhancements

analyze_loop_CFG(loop)
Checks the control flow properties of the loop (the number of basic blocks it consists of, nesting, single entry/exit, etc.), in order to determine whether the control flow of the loop falls within the range of loop forms that are supported by this vectorizer.

Analyzes the loop termination condition to determine the loop bound and the properties of the loop index (its bounds and step). The functionality of this utility should be largely provided by the information computed by the induction variable analyzer.

Scans the loop statements and checks whether there are any statements that prohibit vectorization (function calls, statements that don't have a mapping to a built-in vector function, etc.).

Analyzes the memory references in the loop, and classifies them according to the access pattern that they exhibit.

Analyzes the alignment of the memory references in the loop. For each memory reference, records its misalignment amount, if it can be resolved at compile time.

Builds the loop dependence graph (for scalar and array references); detects strongly connected components (SCCs) in the graph (statements that are involved in a dependence cycle); performs a topological sort on the reduced graph (in which each SCC is represented by a single node). Only singleton nodes without self-dependences can be vectorized; if other (compound) nodes (which represent SCCs) are present, loop transformations are required.

At this point, it has been determined that the loop is vectorizable. It remains to decide whether it is indeed profitable to vectorize it.

Replaces the scalar statements with the corresponding vector statements (which could be calls to built-in functions), and also changes the loop bound accordingly.
The following is a list of independent directions by which the basic vectorizer can be enhanced. It should be possible for different people to work on different items on this list. Some of these items are already under development, or (partially) supported.
Detect loops, and record some basic control flow information about them (contained basic blocks, loop pre-header, exit and entry, etc.).
Status: Loop detection and control-flow analysis are already supported (cfgloop.c, cfgloopanal.c).
Expose the required target-specific information to the tree level. This includes providing a mapping from scalar operations to the corresponding vector support, which will answer the following questions: (1) is the operation supported by the target in vector form? (2) how is the vector form expressed at the tree-level? (3) what is the cost of the vector form? The general SIMD support in GCC already provides some initial support; for simple operations which can be expressed using existing (scalar) tree-codes (PLUS_EXPR, MULT_EXPR, etc.) the existing infrastructure can provide answers to questions 1 and 2 above; however, the tree-level currently does not have an idea of the cost that this transformation actually entails.
simple operations that fall into the above category. As
the capabilities of the vectorizer are extended, it
will be required to inform the vectorizer of the
advanced capabilities available in the architecture
(for example, support for operations on complex
numbers, reduction, etc.). Such operations cannot be
expressed using existing tree-codes. Possible
solutions: introduce new tree-codes (and corresponding
optabs); introduce new builtins that are exposed to the
compiler; use target hooks to handle these cases (the
hook could return a call to a machine specific builtin
function). Another related design question that needs
to be addressed here is how much information to expose
to the tree-level (is it sufficient to indicate that
conditional vector addition is supported, or do we want
the vectorizer to actually generate the required
masking/predication/select operations depending on the
target? similarly for alignment, multiplication of
integers, etc.).
Status: Open for discussion.
Related discussion:
http://gcc.gnu.org/ml/gcc/2004-08/msg00317.html
Currently the tree optimizers do not know the semantics of target-specific builtin functions, so they do not attempt to optimize them (or to rewrite into SSA form the variables passed as arguments to these functions). Since the vectorizer will probably end up generating calls to target-specific builtin functions, this situation needs to be improved; that is, the semantics of these builtins need to somehow be exposed to the compiler.
Status: Open for discussion.
There is an overhead associated with vectorization: moving data into and out of vector registers before and after the vectorized loop, aligning data accesses, etc. It is necessary to incorporate a cost model into the machine description in order to allow the vectorizer to evaluate whether it is worthwhile to vectorize a given loop. One can also consider using run-time tests to decide which version of the loop to execute (scalar or vectorized); a small sketch of such a profitability check follows the related-discussion link below.
Status: Open for discussion.
Related discussion:
http://gcc.gnu.org/ml/gcc-patches/2003-09/msg00469.html
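A minimal sketch of such a profitability check, assuming hypothetical per-iteration cost figures and a vectorization factor supplied by the machine description (an illustration of the idea, not GCC's actual cost model):

/* All cost figures and the vectorization factor vf are hypothetical
   inputs, assumed to be provided by the machine description.  */
static int
vectorization_profitable (int scalar_cost_per_iter, int vector_cost_per_iter,
                          int outside_cost, int niters, int vf)
{
  int scalar_cost = scalar_cost_per_iter * niters;
  int vector_cost = vector_cost_per_iter * (niters / vf)    /* vector body */
                    + scalar_cost_per_iter * (niters % vf)  /* scalar epilogue */
                    + outside_cost;  /* data movement, alignment setup, etc. */
  return vector_cost < scalar_cost;
}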
Used by the vectorizer to detect the loop bound, analyze access patterns and analyze data dependencies between array references. One option is that the dependence tests would be designed to deal with array references that are expressed in terms of a linear function of the iteration counter (in this case, vectorization will also benefit from optimizations like induction variable substitution and replacement of auxiliary IVs with linear functions of the loop index). A dependence tester based on IVs represented in this form would analyze each subscript of each array reference and apply the appropriate dependence test (SIV, ZIV, MIV, etc.; see dependence testing). Alternatively, an induction variable evolution analyzer could provide a different implementation of the dependence tester. This is the solution that is currently used, based on the induction variable evolution analysis developed by Sebastian Pop.
Status: Using the IV evolution analyzer developed by Sebastian Pop.
Following the classic dependence-based approach for vectorization as described in [1], apply dependence tests to pairs of array references in each loop nest, and analyze the resulting dependence graph. We will start from a dependence analyzer that relies on the array references being expressed in terms of a linear function of the loop index, apply the simplest dependence tests to all pairs of memory read/write and write/write, and gradually extend its capabilities. The scheme follows the algorithm described in [2].
Status: Many of these dependence tests are implemented; the Omega test is in the works. We do not yet build a data dependence graph (DDG) based on the dependence tests.
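For illustration (an assumed example, not one of the page's own), the first loop below has a loop-carried flow dependence of distance 1, so its statement is part of a dependence cycle and cannot be vectorized as is; the second loop has no cross-iteration dependence:

int a[256], b[256];

void
dep (void)
{
  int i;
  for (i = 0; i < 255; i++)
    a[i + 1] = a[i] + b[i];   /* reads the value written one iteration earlier */
}

void
nodep (void)
{
  int i;
  for (i = 0; i < 256; i++)
    a[i] = a[i] + b[i];       /* each iteration touches only its own element */
}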
The memory architecture usually allows only restricted accesses to data in memory; one of the restrictions is that the accessed data must be consecutive in memory. Any other access (strided, for example) requires a special permutation of the data in order to pack the data elements in the right order into a vector register. Support for different access patterns consists of several stages.
Status: Future work.
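As an assumed illustration of a non-consecutive access pattern (not one of the page's own examples), a reversed access requires the loaded elements to be permuted inside the vector register before the store:

int a[256], b[256];

void
reverse_copy (void)
{
  int i;
  for (i = 0; i < 256; i++)
    a[i] = b[255 - i];   /* consecutive iterations read decreasing addresses */
}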
At first, the only computations that will be vectorized are those for which the vectorization process consists of trivially replacing each scalar operation in the loop with its vector counterpart. This includes simple loads, stores and arithmetic operations that operate on the same data type. Some computations require extra code to be generated in order to vectorize. These include:
stores of loop invariants (a[i] = N), which require scalar expansion [done]; and inductions (a[i] = i), which require scalar expansion as well as proper initialization and update code [planned] (a conceptual sketch of the induction case follows below).
Status: Future work.
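A conceptual sketch of scalar expansion for the induction a[i] = i, written by hand using GCC's generic vector extension and assuming a 16-byte vector of four ints with N divisible by 4 (this only illustrates the idea; it is not the code the vectorizer emits):

typedef int v4si __attribute__ ((vector_size (16)));

#define N 256
int a[N] __attribute__ ((aligned (16)));

void
induction (void)
{
  int i;
  v4si vec_iv   = { 0, 1, 2, 3 };   /* the expanded induction values */
  v4si vec_step = { 4, 4, 4, 4 };   /* advance by 4 each vector iteration */

  for (i = 0; i < N; i += 4)
    {
      *(v4si *) &a[i] = vec_iv;     /* store {i, i+1, i+2, i+3} */
      vec_iv = vec_iv + vec_step;
    }
}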
The memory architecture usually allows only restricted accesses to data in memory. One of the restrictions is that data accesses need to be properly aligned on a certain boundary. Even if the architecture supports unaligned accesses, these are usually much more costly than aligned accesses. The work on alignment consists of several stages.
Status: Currently the way we handle unaligned stores is by peeling the loop to force the alignment of the store. This is not always applicable. Vectorizing unaligned stores is in the works.
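A hand-written sketch of the loop-peeling strategy described in the status above, assuming a 16-byte alignment requirement and four ints per vector (an illustration of the transformation, not the vectorizer's actual output):

#include <stdint.h>

void
set_all (int *p, int n, int x)
{
  int i = 0;

  /* prologue: peel scalar iterations until p+i is 16-byte aligned  */
  while (i < n && ((uintptr_t) (p + i) % 16) != 0)
    p[i++] = x;

  /* main loop: four aligned stores per iteration
     (conceptually a single aligned vector store)  */
  for (; i + 4 <= n; i += 4)
    {
      p[i] = x;  p[i + 1] = x;  p[i + 2] = x;  p[i + 3] = x;
    }

  /* epilogue: the remaining scalar iterations  */
  for (; i < n; i++)
    p[i] = x;
}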
It is often the case that complicated computations can be reduced to a simpler, straight-line sequence of operations. These operations may not be directly supported in a scalar form, but are supported by the target in a vector form. Such cases include collapsing a conditional expression (if-then-else) into a single vectorizable operation. For example, in the following code sequence (taken from the SPECint benchmark gzip) the conditional expression can be collapsed into a "subtract and saturate" operation (see http://gcc.gnu.org/ml/gcc/2003-07/msg01355.html):

for (n = 0; n < HASH_SIZE; n++) {
  m = head[n];
  head[n] = (Pos)(m >= 32768 ? m-32768 : 0);
}
Status: Initial support in place.
The general principle we are trying to follow is to
keep the actual code transformation part of the
vectorizer as simple as possible: a simple scan of
straight-line code, and a one-to-one replacement of
each scalar operation with the equivalent vector
operation. To support this scheme in the presence of
conditional execution, we'll need to flatten the loop
body by collapsing if-then-else into a conditional
(scalar) operation (something like transforming 'if (x) {c = PLUS (a,b)}' into 'PLUS_COND(a,b,x)'). These will later be
replaced with a conditional vector operation using
whatever support is available on the target (masking,
predication or select operation). Flattening the loop
body this way will greatly simplify the vectorizer.
Some of the issues to think about here: (a) how to
represent these conditional operations, (b) to what
extent does the tree vectorizer need to be aware of the
specific target support that is available for
conditional vector execution (mask/predicate/select),
and (c) how to allow a simple way to reverse this
transformation if the loop doesn't end up getting
vectorized.
Status: Done. Contact: Devang Patel.
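A scalar C sketch of the flattening idea (an assumed illustration; the target-specific mask/predicate/select form is not shown): the control dependence becomes a data dependence by computing the value unconditionally and then selecting with the condition.

void
flatten (int *a, int *b, int *c, int *x, int n)
{
  int i;
  for (i = 0; i < n; i++)
    {
      /* original form:  if (x[i]) c[i] = a[i] + b[i];  */
      int t = a[i] + b[i];      /* computed unconditionally */
      c[i] = x[i] ? t : c[i];   /* select: maps to a vector mask/blend/select */
    }
}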
Status: Future work.
Status: In the works.
Address the item from the tree-ssa todo list - "SSA information for arrays : The existing implementation treats arrays as an opaque object. A definition to an array location is treated as a definition for the whole array"
Status: Open for discussion.
Address the issue mentioned in http://gcc.gnu.org/ml/gcc/2003-07/msg02013.html, which turns out to be a front end issue.
Status: In the works. See http://gcc.gnu.org/wiki/Add%20MEM_REF%20operation.
Provide utilities that allow performing the following transformation: given a condition and a loop, create 'if (condition) { loop_copy1 } else { loop_copy2 }', where loop_copy1 is the loop transformed in one way, and loop_copy2 is the loop transformed in another way (or unchanged). 'condition' may be a run time test for things that were not resolved by static analysis (overlapping ranges (anti-aliasing), alignment, etc.).
Status: Done. Contact: Devang Patel.
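A sketch of what the versioning utility could produce, using a run-time overlap (anti-aliasing) test as the condition (a hand-written illustration; here loop_copy1 and loop_copy2 contain the same loop, with the first intended for vectorization):

void
copy (int *p, int *q, int n)
{
  int i;
  if (p + n <= q || q + n <= p)   /* run-time test: the two regions do not overlap */
    {
      for (i = 0; i < n; i++)     /* loop_copy1: may be vectorized */
        p[i] = q[i];
    }
  else
    {
      for (i = 0; i < n; i++)     /* loop_copy2: conservative scalar version */
        p[i] = q[i];
    }
}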
These include linear loop transformations (interchange, reversal, skewing and scaling) and related restructurings that bring loops into a vectorizable form.
Status: Linear loop transformations are implemented by Daniel Berlin.
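As an assumed example of such a transformation, loop interchange (one of the linear loop transformations) can turn a strided inner loop into a unit-stride one that the vectorizer can handle:

#define M 64
#define N 64
int a[M][N], b[M][N];

void
interchange (void)
{
  int i, j;

  /* before interchange the inner loop would walk a column (stride N):
       for (j = 0; j < N; j++)
         for (i = 0; i < M; i++)
           a[i][j] = b[i][j] + 1;                                        */

  /* after interchange the inner loop walks a row (stride 1)  */
  for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
      a[i][j] = b[i][j] + 1;
}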
Status: Future work.
Using user hints for different purposes (aliasing, alignment, profitability of vectorizing a loop, etc.).
Status: In the works. See http://gcc.gnu.org/ml/gcc-patches/2005-02/msg01560.html.