There's memory bandwidth and there's memory transaction thruput
Alan Lovejoy
sourcery at pacbell.net
Wed Feb 17 09:22:42 UTC 1999
sqrmax at cvtci.com.ar wrote:
> The 68K has much more registers than the intel processors. Intel ones (386
> and up) have 4 general use registers named A, B, C and D (with suffix X
> meaning 16 bits, with extra E prefix meaning 32 bits, or with H and L
> suffixes to mean lower 8 bits and higher 8 bits of the first 16 bits),
> then segment and
> offset registers named (add an E prefix as needed for their 32 bit sibling in
> protected or real mode) CS, DS, SS, ES, DI, SI, SP, BP, FS and GS. These last
> registers have limited arithmetic capabilities, especially the ones ending
> in S (for segment). The 68K has 8 32bit data registers (d0 to d7) and 8 32bit
> index registers (a0 to a7) fully arithmetic capable (to my knowing). I think
> the 68K with its variants lacks the raw speed of the Intels, although it's
> much more maneuverable. I'd like to see intel processors with some of the 68K
> characteristics (like asynchronous IO for instance, why do you think intel
> machines need FIFOs everywhere?).
>
> Andres.
The address registers on the 68000 supported add and subtract, but not
multiply, divide, or any bitwise ops (OR, AND, etc). The data registers
could hold an array index (an address register would hold the base
address of the array), but could not otherwise be used for referencing
memory locations. The 68000 was not all that far from qualifying as a
RISC processor, compared to a canonical CISC CPU such as the VAX. Its
failings with respect to RISC were:
1. The 68k used microcode instead of hardwired instructions, so most
   instructions required 2 or more clock cycles to execute. The Intel
   processors always had a lower average CPI (cycles per instruction)
   due to much less reliance on microcoded instructions (especially in
   the 486 and later processors). This was the great advantage of the
   x86 processors over the 68k family.
2. The 68k had too few registers. To satisfy the RISC definition, a
   processor should have (at least) 32 fully general purpose registers.
   The 68000 only had 16 "general purpose" registers--and they weren't
   symmetrically general purpose (although they were much closer to
   this than the x86 line, even today). The 68k would have looked far
   worse relative to its x86 competition but for its much larger set of
   general purpose registers.
3. Most 68k opcodes had addressing modes where at least one operand was
   a memory address. In a RISC processor, only the "load" and "store"
   instructions (what the 68000 would call "MOVE," and the x86 would
   call "MOV") should reference memory locations. This restriction has
   two purposes:
   a. Instructions that don't reference memory can be implemented with
      less circuitry, require less complex instruction decode logic,
      and don't have to worry about cache misses and/or page faults.
   b. Separating memory load/store operations into separate
      instructions makes it easier to optimally schedule instructions
      to make good use of processor parallelism and pipelining. If a
      "load" instruction can be issued a few instructions before the
      data it loads is actually used, then it is more likely that the
      data will have arrived in the register before the first
      instruction that references it has begun to execute, thus
      avoiding a processor stall while it waits for data from memory.
4. The 68k uses 2-operand opcodes. RISC processors generally use
   3-operand opcodes. To compute the sum of two numbers, the 68000
   would use code like the following:
       opcode  operands      comments
       MOVE    (A7)+,D0    ; pop the top of the stack into D0
       ADD     (A7)+,D0    ; pop the top of the stack, add it to D0, store result in D0
       MOVE    D0,-(A7)    ; push D0 onto the stack
   In contrast, the code for the same thing on a RISC would look like
   this:
       opcode  operands            comments
       LD      (R31),+R30,R1     ; R31 += R30; R1 = *R31;
       LD      (R31),+R30,R2     ; R31 += R30; R2 = *R31;
       ADD     R3,R1,R2          ; R3 = R1 + R2;
       ST      (R31),-R30,R3     ; R31 -= R30; *R31 = R3;
   The superiority of the RISC approach may not be obvious (it requires
   one more instruction than the 68k, after all!), until one considers
   what really happens in real code, as opposed to this contrived
   example. Note, for example, that the 68k code overwrites the value
   of one of the addends with the sum, which the RISC code does not do.
   And if the two addends happen to already be in registers, then the
   addition can be done in one instruction--without overwriting one of
   the addends. Also, if one actually counts the number of clock cycles
   required to run either the 68k or RISC code in the above examples,
   one quickly realizes that the time required by the memory loads
   means that 4 clock cycles is the best that can be done, regardless
   of whether this is coded as 1, 2, 3 or 4 instructions (the 68k ADD
   memory-to-register instruction will require at least two clock
   cycles, no matter what).
5. The 68000 has too many addressing modes. The 68020 made this even
   worse. This is the other area in which the x86 processors have
   typically been superior to the 68k: not overdoing the addressing
   modes quite so egregiously.
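
To make the scheduling argument in point 3b concrete, here is a rough
sketch of a toy in-order, single-issue pipeline. Everything in it is an
assumption made for illustration: the 3-cycle load latency, the 1-cycle
ALU latency, the register names, and the count_cycles helper do not
describe any real 68k, x86 or RISC part. It only shows that issuing
loads a few instructions ahead of their first use hides the wait.

```python
# Toy in-order, single-issue pipeline model. The latencies below are
# hypothetical values chosen for illustration, not measurements.
LOAD_LATENCY = 3  # assumed cycles before a loaded value can be used
ALU_LATENCY = 1   # assumed cycles before an ALU result can be used


def count_cycles(program, load_latency=LOAD_LATENCY):
    """Return the cycle at which the last result becomes available.

    Each instruction is (op, dest, sources). At most one instruction
    issues per cycle, and an instruction stalls until all of its source
    registers are ready.
    """
    ready_at = {}  # register name -> first cycle its value is usable
    cycle = 0      # earliest cycle the next instruction may issue
    finish = 0
    for op, dest, sources in program:
        # Stall until every source operand has arrived.
        issue = max([cycle] + [ready_at.get(reg, 0) for reg in sources])
        latency = load_latency if op == "LD" else ALU_LATENCY
        if dest is not None:
            ready_at[dest] = issue + latency
        finish = max(finish, issue + latency)
        cycle = issue + 1
    return finish


# Each load is used by the very next instruction, so each ADD stalls.
naive = [
    ("LD", "R1", ["R31"]),
    ("ADD", "R3", ["R1", "R2"]),
    ("LD", "R4", ["R31"]),
    ("ADD", "R5", ["R4", "R6"]),
]

# Same work, but both loads are issued before either ADD needs its data.
scheduled = [
    ("LD", "R1", ["R31"]),
    ("LD", "R4", ["R31"]),
    ("ADD", "R3", ["R1", "R2"]),
    ("ADD", "R5", ["R4", "R6"]),
]

print(count_cycles(naive), count_cycles(scheduled))
```

Under these assumed latencies the reordered sequence finishes sooner
because the second load overlaps the wait for the first. A
memory-to-register ADD cannot be split up this way, which is the
scheduling freedom that the load/store separation buys.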
Motorola released the 68060 in 1994, which finally did away with most
of the microcode, and permitted most instructions to run in only one
clock cycle (unless they referenced memory). The 68060 maxed out at
60 MHz or so (I stopped following it after it was released), which was
on the low side at the time. It did away with the addressing modes, and
some of the instructions, that had been introduced with the 68020 10
years before. Alas, this was way too late to make any difference for
the 68k family, which has slowly faded into the twilight of embedded
processordom.
So RISC did kill off all the CISC processor families used in general
purpose computer systems, except for two old warhorses whose installed
base and dominance in their market segments precluded their retirement:
the x86 family, and the venerable System/370.
For Smalltalk, the key performance issue is memory access bandwidth,
and the high "non-locality of reference" that characterizes successive
memory accesses in Smalltalk programs. Caches work best when memory is
accessed sequentially, which is unfortunately precisely what Smalltalk
systems tend not to do. Perhaps if the cache logic were smart enough to
understand the structure of Smalltalk objects, it could pre-fetch the
objects referenced by the instance variables of objects loaded into the
cache, to some finite (and configurable) depth...
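
As a rough software sketch of that idea (purely illustrative: the heap
is modelled as a dictionary mapping a hypothetical object id to the ids
held in its instance variables, and prefetch_candidates is an invented
name, not part of any Smalltalk VM or cache hardware), a depth-limited
walk could pick the objects to pre-fetch like this:

```python
from collections import deque


def prefetch_candidates(heap, root, max_depth):
    """List object ids reachable from `root` within `max_depth` references.

    `heap` maps an object id to the ids stored in its instance variables.
    Ids are returned in breadth-first order, nearest objects first.
    """
    seen = {root}
    order = []
    queue = deque([(root, 0)])
    while queue:
        oid, depth = queue.popleft()
        if depth == max_depth:
            continue  # configured depth reached; do not follow further
        for ref in heap.get(oid, ()):
            if ref not in seen:
                seen.add(ref)
                order.append(ref)
                queue.append((ref, depth + 1))
    return order


# A hypothetical rectangle whose two corner points each hold two numbers.
heap = {
    "rect": ["origin", "corner"],
    "origin": ["ox", "oy"],
    "corner": ["cx", "cy"],
}

print(prefetch_candidates(heap, "rect", 1))
print(prefetch_candidates(heap, "rect", 2))
```

The depth limit is what keeps the pre-fetch finite: each extra level can
multiply the number of objects pulled in, so a real design would have to
weigh that against cache capacity.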
--Alan