Consider the following computation loop $X_{i+1} = aX_i - bY_i$ which is the inner loop in a numerical algorithmic process. For numerical convergence, this loop is supposed to run for a large number of iterations. The constants $a$ and $b$ are initialized in float registers.
Loop: LD F0,0(R1) // Load
MULTF F0,F0,F2 // Multiply
LD F4,0(R2) // Load
MULTF F4,F4,F6 // Multiply
SUBF F0,F0,F4 // Subtract
SD 0(R1),F0 // Store
SUBI R1,R1,8 // Decrement
SUBI R2,R2,8 // Decrement
BNEQZ R1,Loop // Branch not equal
Assume the pipeline data dependent latencies between instructions (WRITE to READ operands) are given by the following Table. For example, LD F0,0(R1) and MULTF F0,F0,F2 have a RAW dependency with latency (load slot) 1 slot. Also, the machine uses float arithmetic units (ADDF and MULTF) which are pipelined and embedded in the instruction pipeline. You may consider 4-stage float pipes. Furthermore, delayed branching is available.
WRITE Instruction READ Instruction 1 Latency in Cycles
FP ALU OP Another FP ALU OP 3
FP ALU OP store double 2
Load double FP ALU OP 1
Load double Store double 0
a) Unroll the above loop as many times as necessary to schedule it without any delays, collapsing the loop overhead instructions.
b) Briefly show the scheduling of the entire loop iterations.
c) Comment if there are other unrollings feasible.