> Even more complicated, SHARC DSPs are 2-delayed: the pipeline fetch pointer is 2
> instructions ahead of the one currently executing, which makes the loop behaviour
> a real bitch, because the loop condition is checked 2 instructions before the
> end of the loop (the registers are fetched 2 insts ahead). So if you change the
> register involved in the loop condition in the last instruction of the loop, it
> will take another iteration to exit. Weird, eh? Fortunately I haven't seen any
> model2 game using that, so that part of the emulation is silently ignored in my
> current emulator :). Still, all branches have to be 2 inst delayed, and the
> register dependencies between them caused me some problems.
It doesn't seem weird to me, as it's just the result of the instruction fetch and decode stages coming before the execute stage. What I consider weird is that both instructions following a delayed branch are fetched, but the instruction in the second delay slot is discarded! On the Aries processors, which have three pipeline stages (plus a fourth writeback stage for loads and vector multiplies), both delay slots are executed. I'm not sure why you would want to discard the second slot, as it's not really that difficult to find useful work to put there. I wonder if the goal was to ease porting code from MIPS (which has a single branch delay slot) to SHARC.
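A toy model of the loop-exit effect described in the quote may make it concrete. This is a simplified sketch, not a cycle-accurate SHARC sequencer: the `Regs` struct, the `op_*` instructions and `run_loop` are all made up for illustration, and the only assumption taken from the quote is that the exit test is sampled when the last instruction of the body is fetched, i.e. two instructions before it executes.

```c
#include <assert.h>

/* Toy register file (hypothetical): r is the loop-condition register. */
typedef struct {
    int r;
    int acc;
} Regs;

typedef void (*Op)(Regs *);

/* Made-up instructions for the example. */
static void op_inc_acc(Regs *s) { s->acc += 1; }
static void op_nop(Regs *s) { (void)s; }
static void op_dec_r(Regs *s) { s->r -= 1; }

/* Runs the loop body until the exit test sees r == 0 and returns the
 * number of iterations. The caller must make sure r actually reaches 0
 * at the sampling point, otherwise this never returns.
 * delayed == 0: the exit test sees the registers after the whole body.
 * delayed != 0: the exit test is sampled when the last instruction is
 *   fetched, i.e. right after body[len - 3] has executed. */
static int run_loop(const Op *body, int len, int r0, int delayed)
{
    assert(len >= 1 && (!delayed || len >= 3));
    Regs s = { r0, 0 };
    int iterations = 0;
    for (;;) {
        int exit_now = 0;
        iterations++;
        for (int i = 0; i < len; i++) {
            body[i](&s);
            if (delayed && i == len - 3)
                exit_now = (s.r == 0);   /* test made 2 instructions early */
        }
        if (!delayed)
            exit_now = (s.r == 0);       /* test made at the end of the body */
        if (exit_now)
            return iterations;
    }
}
```

With the body `{ op_inc_acc, op_nop, op_dec_r }` and `r` starting at 3, `run_loop` returns 3 when the test is made after the whole body and 4 in the delayed model: the decrement in the last instruction lands after the test has already been sampled, so the loop goes around one extra time.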
Delayed branches are pretty easy to deal with using a simple counter. The end of my execute loop looks like this:
pcexec = pcroute;
pcexec = pcfetchnext;
The handlers for the program flow instructions set delayCount to 3 or 1 and set pcfetchnext to the target address. If you compile the code rather than interpret it, you can eliminate a lot of the delay count checks by keeping track of which delay slots each instruction can occupy. If an instruction can only be in delay slot zero, I don't have to do the delay count check at all. If an instruction can be in delay slot one, I only have to emit code to check whether the delay count is non-zero, as the Aries processor inhibits the ECU unit instructions after a taken branch (to allow contiguous conditional branches). If an instruction can be in delay slot two, I emit code for the full check and decrement, and exit back to dispatch if the count is decremented to zero.
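A minimal C sketch of the counter scheme, under stated assumptions. `delayCount`, `pcfetchnext` and `pcexec` are names from the post; `Cpu`, `take_branch`, `end_of_instruction`, `classify_slots` and `check_to_emit` are hypothetical. Reading 3 as "delayed branch" (the branch plus two executed delay slots) and 1 as "non-delayed branch" is my interpretation, and the PC is assumed to advance by one address per instruction. The compile-time helpers only encode the slot-zero / slot-one / slot-two rule above; they do not model the ECU inhibit.

```c
#include <stdint.h>

/* Hypothetical CPU state; only the field names come from the post. */
typedef struct {
    uint32_t pcexec;       /* address of the next instruction to execute */
    uint32_t pcfetchnext;  /* branch target latched by a flow handler */
    int delayCount;        /* 0 means no branch is pending */
} Cpu;

/* Flow-instruction handler (assumed mapping: 3 = delayed, 1 = not delayed). */
static void take_branch(Cpu *cpu, uint32_t target, int delayed)
{
    cpu->pcfetchnext = target;
    cpu->delayCount = delayed ? 3 : 1;
}

/* End of the execute loop: runs after every executed instruction. */
static void end_of_instruction(Cpu *cpu)
{
    if (cpu->delayCount != 0 && --cpu->delayCount == 0)
        cpu->pcexec = cpu->pcfetchnext;  /* count expired: jump to target */
    else
        cpu->pcexec += 1;                /* assumed: one address per instruction */
}

/* Compile-time side: which delay slots an instruction may occupy. */
enum { SLOT0 = 1, SLOT1 = 2, SLOT2 = 4 };

typedef enum { CHECK_NONE, CHECK_NONZERO, CHECK_FULL } CheckKind;

/* Straight-line run of n instructions. The first two entries are treated
 * conservatively because their predecessors are unknown. */
static void classify_slots(const int *is_delayed_branch, unsigned *slots, int n)
{
    for (int i = 0; i < n; i++) {
        unsigned s = 0;
        if (i < 1 || is_delayed_branch[i - 1]) s |= SLOT1;
        if (i < 2 || is_delayed_branch[i - 2]) s |= SLOT2;
        slots[i] = s ? s : SLOT0;  /* SLOT0 alone: never in a delay slot */
    }
}

/* Slot 0 only: no check. Slot 1: test delayCount != 0.
 * Slot 2: full check and decrement, exit to dispatch on zero. */
static CheckKind check_to_emit(unsigned slots)
{
    if (slots & SLOT2) return CHECK_FULL;
    if (slots & SLOT1) return CHECK_NONZERO;
    return CHECK_NONE;
}
```

For example, after `take_branch(&cpu, 200, 1)` at address 100, the next three calls to `end_of_instruction` leave `pcexec` at 101, 102 and then 200, so both delay slots run before the jump; with `delayed` set to 0 the very next call jumps straight to 200.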