__dmac Intrinsic

You can force the compiler to generate a DMAC instruction by using the __dmac intrinsic. In this case, you are fully responsible for ensuring that the data addresses are 32-bit aligned.

void __dmac(long *src1, long *src2, long &accum1, long &accum2, int shift);

See Table 7-6 for details about the __dmac intrinsic.

Example 1:

int src1[2*N], src2[2*N]; // src1 and src2 are int arrays of at least size 2*N.
                          // You must ensure that both start on 32-bit
                          // aligned boundaries.
{...}
int i;
long res = 0;
long temp = 0;
for (i=0; i < N; i++)     // N does not have to be a known constant
    __dmac(((long *)src1)[i], ((long *)src2)[i], res, temp, 0);
res += temp;

Example 2:

int *src1, *src2; // src1 and src2 are pointers to int arrays of at
                  // least size 2N. User must ensure that both are
                  // 32-bit aligned addresses.
{...}
int i;
long res = 0;
long temp = 0;
long *ls1 = (long *)src1;
long *ls2 = (long *)src2;
for (i=0; i < N; i++) // N does not have to be a known constant
    __dmac(*ls1++, *ls2++, res, temp, 0);
res += temp;

In these examples, res holds the final sum of a multiply-accumulate operation on int arrays of length 2*N, with two computations being performed at a time.
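For checking results on a host machine, the sum computed by these examples can be modeled in portable C. The following is a minimal sketch, not the intrinsic itself: the function name dmac_reference is hypothetical, it assumes that int is 16 bits and long is 32 bits on the target (modeled here with int16_t and int32_t), that the shift argument is 0, and that each accumulator collects one element of every 32-bit pair. Which half of a pair feeds which accumulator is an assumption; the final sum does not depend on it.

```c
#include <stdint.h>

/* Hypothetical host-side reference model of the loops in Examples 1 and 2.
 * Assumes 16-bit data (int on the target) and 32-bit accumulators (long
 * on the target), with a shift of 0. This is not the __dmac intrinsic. */
static int32_t dmac_reference(const int16_t *src1, const int16_t *src2, int n)
{
    int32_t res = 0;   /* first partial sum  */
    int32_t temp = 0;  /* second partial sum */
    int i;

    /* Each iteration consumes one 32-bit pair from each array,
     * that is, two 16-bit multiply-accumulate operations. */
    for (i = 0; i < n; i++) {
        res  += (int32_t)src1[2 * i + 1] * src2[2 * i + 1];
        temp += (int32_t)src1[2 * i]     * src2[2 * i];
    }

    /* Combine the partial sums, as "res += temp" does in the examples. */
    return res + temp;
}
```

Calling dmac_reference(src1, src2, N) on arrays of length 2*N gives the full dot product, which can be compared against the value of res produced on the target, provided the sums do not overflow 32 bits.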

An optimization level of -O2 or higher must be used to generate efficient code. If the loop body contains nothing else, as in these examples, the compiler generates a RPT || DMAC instruction, which further improves performance.