Example 10. Dot Product From Example 8 Unrolled to Prevent Memory Bank Conflicts

_dotp2: .cproc  a_0, b_0
        .reg    a_4, b_4, sum0, sum1, i
        .reg    val1, val2, prod1, prod2

        ADD     4,a_0,a_4
        ADD     4,b_0,b_4
        MVK     25,i               ; i = 100/4
        ZERO    sum0               ; multiply result = 0
        ZERO    sum1               ; multiply result = 0

        .mptr   a_0,a+0,8
        .mptr   a_4,a+4,8
        .mptr   b_0,b+0,8
        .mptr   b_4,b+4,8

loop:   .trip   25
        LDW     *a_0++[2],val1     ; load a[0-1] bankx
        LDW     *b_0++[2],val2     ; load b[0-1] banky
        MPY     val1,val2,prod1    ; a[0] * b[0]
        MPYH    val1,val2,prod2    ; a[1] * b[1]
        ADD     prod1,sum0,sum0    ; sum0 += a[0] * b[0]
        ADD     prod2,sum1,sum1    ; sum1 += a[1] * b[1]

        LDW     *a_4++[2],val1     ; load a[2-3] bankx+2
        LDW     *b_4++[2],val2     ; load b[2-3] banky+2
        MPY     val1,val2,prod1    ; a[2] * b[2]
        MPYH    val1,val2,prod2    ; a[3] * b[3]
        ADD     prod1,sum0,sum0    ; sum0 += a[2] * b[2]
        ADD     prod2,sum1,sum1    ; sum1 += a[3] * b[3]

 [i]    ADD     -1,i,i             ; i--
 [i]    B       loop               ; if (i != 0) goto loop

        ADD     sum0,sum1,A4       ; compute final result
        .return A4
        .endproc
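As a cross-check on the listing above, the arithmetic it performs can be modeled in C. This is only a sketch of the loop's data flow, not generated or official code: the name dotp2_ref is hypothetical, the array length of 100 is taken from the "100/4" comment, and the mapping of MPY to the even element and MPYH to the odd element follows the listing's own comments.

```c
#include <stdint.h>

/* Hypothetical C model of the unrolled loop: 25 trips, four
 * 16-bit elements per trip, two independent accumulators. */
int32_t dotp2_ref(const int16_t *a, const int16_t *b)
{
    int32_t sum0 = 0;   /* accumulates even-indexed products (MPY)  */
    int32_t sum1 = 0;   /* accumulates odd-indexed products (MPYH)  */

    for (int i = 0; i < 100; i += 4) {
        /* first word pair: a[0-1], b[0-1] */
        sum0 += a[i]     * b[i];
        sum1 += a[i + 1] * b[i + 1];
        /* second word pair: a[2-3], b[2-3] */
        sum0 += a[i + 2] * b[i + 2];
        sum1 += a[i + 3] * b[i + 3];
    }
    return sum0 + sum1; /* matches ADD sum0,sum1,A4 */
}
```

The model ignores memory banks entirely; it is useful only for checking that an assembled version returns the same sum.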

The goal is to find a software pipeline in which the following instructions are in parallel:

        LDW     *a0++[2],val1      ; load a[0-1] bankx
     || LDW     *a2++[2],val2      ; load a[2-3] bankx+2

        LDW     *b0++[2],val1      ; load b[0-1] banky
     || LDW     *b2++[2],val2      ; load b[2-3] banky+2
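The reason these pairings avoid a conflict can be checked with a small C sketch. The bank geometry used here is an assumption chosen to match the "bankx" and "bankx+2" comments (four interleaved banks, each 16 bits wide, so a word load touches two adjacent banks); the actual device may differ, and all function names are hypothetical.

```c
#include <stdint.h>

/* ASSUMED memory geometry, for illustration only. */
enum { NUM_BANKS = 4, BANK_BYTES = 2 };

/* Bit mask of the banks touched by a 32-bit load at byte address addr. */
static unsigned word_load_banks(uint32_t addr)
{
    unsigned first  = (addr / BANK_BYTES) % NUM_BANKS;
    unsigned second = (first + 1) % NUM_BANKS;
    return (1u << first) | (1u << second);
}

/* Returns 1 if two word loads issued in the same cycle share a bank. */
int loads_conflict(uint32_t addr1, uint32_t addr2)
{
    return (word_load_banks(addr1) & word_load_banks(addr2)) != 0;
}

/* Counts conflicts between a pointer at base and one at base+4 when
 * both advance by 8 bytes per trip, as the .mptr directives declare. */
int count_pair_conflicts(uint32_t base, int trips)
{
    int conflicts = 0;
    for (int i = 0; i < trips; i++) {
        uint32_t p0 = base + 8u * i;        /* a[0-1] style pointer */
        uint32_t p4 = base + 4u + 8u * i;   /* a[2-3] style pointer */
        conflicts += loads_conflict(p0, p4);
    }
    return conflicts;
}
```

Under these assumptions the two pointers always sit two banks apart, so count_pair_conflicts reports no conflicts for any word-aligned base, whereas two loads whose addresses differ by a multiple of 8 bytes would land in the same banks.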