The Pipeline
A conventional processor executes instructions one at a time, just as you expect it to when you
write your code. Each execution can be broken down into three stages; anybody who has learned
this stuff at college will have 'fetch, decode, execute' burned into their memory.
In English...
- Fetch
Retrieve the instruction from memory.
Don't get all techie - whether the instruction comes from system memory or from the processor's
cache is irrelevant; the instruction is not loaded 'into' the processor until it is
specifically requested. The cache simply serves to speed things up: by loading chunks of
system memory into the cache, the processor can satisfy many more of its instruction
fetches from the cache rather than from slow system memory. This is necessary because
processors are very fast (StrongARMs at 200MHz+; Pentiums up into the GHz range) and system
memory is not (33, 66, or 133MHz). To see the effect the cache has on your processor, try
*Cache Off.
- Decode
Figure out what the instruction is, and what is supposed to be done.
- Execute
Perform the requested operation.
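To put that into concrete terms, here is a single ARM instruction annotated with what (roughly)
happens at each stage. This is only a sketch; the exact details vary between processors.

    ADD R0, R1, R2     ; R0 = R1 + R2
                       ; Fetch   - the 32-bit word encoding this ADD is read from
                       ;           memory (or the cache) into the processor
                       ; Decode  - the processor works out that it is an ADD, that
                       ;           the operands are R1 and R2, and that the result
                       ;           is to go into R0
                       ; Execute - the ALU adds R1 to R2 and the result is written
                       ;           into R0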
Each of these operations is performed in step with the electronic 'heartbeat' of the processor,
the clock. Example clock rates for several microprocessors used in Acorn products:
BBC microcomputer     | 6502      | 2MHz
Acorn A310-A3000      | ARM 2     | 8MHz
Acorn A5000           | ARM 3     | 25MHz
Acorn A5000/I         | ARM 3     | 30MHz
RiscPC600             | ARM610    | 33MHz
RiscPC700             | ARM710    | 40MHz
Early PC co-processor | 486SXL-40 | 33MHz (not 40!)
RiscPC (StrongARM)    | SA110     | 202MHz - 278MHz+
As seen in the PC world, processors are now running at GHz speeds (1,000,000,000 ticks per
second), which necessitates a lot of speed tweaks (huge amounts of cache, a heavily optimised
pipeline) because there is no way the rest of the system can keep up. Indeed, the rest of the
system is likely to be operating at a quarter of the speed of the processor. The RiscPC was
designed to work, I believe, at 33MHz, which is why people thought the StrongARM wouldn't give
much of a speed boost. However, the small size of ARM programs, coupled with a rather large
cache, made the StrongARM a viable proposition in the RiscPC. It bottlenecked horribly on memory
access, but other factors meant that this wasn't so visible to the end user, so the result was a
system much faster than the ARM710. More recently there is the Kinetic StrongARM processor card,
which attempts to alleviate the bottleneck by installing a big wodge of memory directly on the
processor card and using that. It even goes so far as to copy the entirety of RISC OS into that
memory so you aren't kept waiting for the ROMs (which are slower even than RAM).
There is an obvious improvement. Executed strictly one at a time, each instruction takes three
clock ticks, and while one stage is busy the parts of the processor handling the other two stages
sit idle. Since these three stages (fetch, decode, execute) are fairly independent, would it not
be possible to:
fetch instruction #3
decode instruction #2
execute instruction #1
...then, on the next clock tick...
fetch instruction #4
decode instruction #3
execute instruction #2
...tick...
fetch instruction #5
decode instruction #4
execute instruction #3
In practice, the answer is yes, and this is exactly what a pipeline is. Simply by doing this
you have made your processor roughly three times faster: in the ideal case an instruction now
completes on every clock tick instead of on every third one.
Now, it isn't a perfect solution.
- When it comes to a branch, the pipeline is dumped, as the instructions fetched and decoded
  after the branch are no longer required and the pipeline must refill from the branch target.
  This is why it is preferable to use conditional execution rather than branching around a
  couple of instructions (see the first sketch after this list).
- Next, you have to keep in mind that the program counter is ahead of the instruction that is
  currently being executed (see the second sketch below). So if you see an error reported at
  'x', then the real error is quite possibly at 'x-8' (or 'x-12' for StrongARM).
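As a sketch of the first point, here are two ways of adding one to R1 only when R0 is non-zero,
written in a generic ARM assembler style; the label 'skip' is purely illustrative. The branch
version flushes the pipeline every time the branch is taken, while the conditional version lets
the pipeline keep flowing (the ADDNE simply does nothing when the condition fails):

    ; branch version
        CMP   R0, #0
        BEQ   skip          ; taken branch - the pipeline is dumped and refilled
        ADD   R1, R1, #1
    skip

    ; conditional version
        CMP   R0, #0
        ADDNE R1, R1, #1    ; executed only if R0 was not zero; no branch, no flush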
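And as a sketch of the second point, you can see the pipeline at work simply by reading the
program counter. On an ARM, reading R15 during execution gives the address of the current
instruction plus 8, because that is where the fetch stage has got to:

    here
        MOV   R0, PC        ; R0 now holds the address of 'here' plus 8 - the
                            ; address of the instruction currently being fetched,
                            ; two instructions further on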
Copyright © 2001 Richard Murray