History of ARM

In 1988, Sophie Wilson commented on the history of the ARM processor (from Usenet, comp.arch, on 2 November 1988):

There have now been enough partially correct postings about the Acorn RISC
Machine (ARM) to justify semi-official comment.

History:

ARM is a key member of a 4 chip set designed by Acorn, beginning in 1984, to
make a low cost, high performance personal computer. Our slogan was/is "MIPs
for the masses". The casting vote in each design decision was to make the
final computer economic.

The chips are (1) ARM: a 32 bit RISC Microprocessor; (2) MEMC: a MMU and
DRAM/ROM controller; (3) VIDC: a video CRTC with on chip DACs and sound; and
(4) IOC: a chip containing I/O bus and interrupt control logic, real time
clocks, serial keyboard link, etc.

The first ARM (that referred to by David Chase @ Menlo Park) was designed at
Acorn and built using VLSI Technology Inc's (VTI) 3 micron double level metal
CMOS process using full custom techniques; samples, working first time, were
obtained on 26th April 1985. The target clock was 4MHz, but it ran at 8. The
timings that David gives are for the ARM Evaluation System, where ARM was run
at 3.3MHz and 6.6MHz (20/3) for initial and page-mode DRAM cycles,
respectively. The ARM comprises 24,000 transistors (circa 8,000 gates). Every
instruction is conditional, but there are neither delayed loads/stores nor
delayed branches (sorry, Martin Hanley). Call is via Branch and Link (same
timing as Branch). All instructions are abortable, to support virtual memory.

The first VIDC was obtained on 22nd Oct 1985, the first MEMC on 25th Feb 1986,
and the first IOC 30th Apr 1986. All were "right first time".

We then redesigned ARM to make it go faster (since, by this time, Acorn had
decided roughly what market to aim the completed machines at and 8MHz minimum
capability was required - but we did continue to develop software on the 3
micron part!). Some more FIQ registers were added, bringing the total to 27
(some of our "must go as fast as possible for real time reasons" code didn't
manage with the smaller set). A multiply instruction (2 bits per cycle,
terminate when multiplier exhausted so that 8xn multiply takes 4 cycles max)
and a set of coprocessor interfaces were added. Scaled indexed by register
shifted by register (i.e. effective address was ra+rb<<rc) was removed from
the instruction set (too hard to compile for) [scaled indexed by register
shifted by constant was NOT removed!].
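
The multiply timing quoted above (2 bits per cycle, stopping once the multiplier is exhausted) can be illustrated with a small C model. This is only a sketch of the described behaviour, not the actual hardware algorithm; the function name `mul_2bits_per_step` and the `steps` counter are hypothetical, and the step count is simply the number of loop iterations in this model.

```c
#include <stdint.h>

/* Hypothetical model: consume the multiplier 2 bits at a time and stop
 * as soon as no set bits remain. Returns the low 32 bits of the product
 * and stores the number of steps taken in *steps. */
uint32_t mul_2bits_per_step(uint32_t multiplicand, uint32_t multiplier,
                            unsigned *steps)
{
    uint32_t acc = 0;
    unsigned n = 0;

    while (multiplier != 0) {
        /* Add multiplicand times the low 2 bits of the multiplier (0..3). */
        acc += multiplicand * (multiplier & 3u);
        multiplicand <<= 2;   /* next pair of bits has 4x the weight */
        multiplier >>= 2;     /* discard the bits just consumed */
        n++;
    }
    *steps = n;
    return acc;
}
```

With an 8 bit multiplier the loop in this model runs at most 4 times, which matches the "8xn multiply takes 4 cycles max" figure.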

The new, 2 micron ARM was right first time on 19th Feb 1987. Its peak
performance was 18MHz; its die size 230x230 mil^2; 25,000 transistors.

VTI were given a license to sell the chips to anyone. They renamed the chips:
VL86C010 (ARM), VL86C110 (MEMC), VL86C310 (VIDC), VL86C410 (IOC).

Acorn released volume machines "Acorn Archimedes" in June 1987. Briefly:
A305: 1/2 MByte, 1MByte floppy, graphics to 640x514x16 colours
A310: ditto, 1MByte
A310M: ditto with PC software emulator (circa a PC XT, if you're interested)
A440: 4MByte, 20MByte hard disc, 1152x896 graphics also.
All machines have ARM at 4/8MHz (circa 5000 dhrystones 1.1), 8 channel sound
synthesiser, proprietary OS, 6502 software emulator, software.... Prices
between 800 and 3000 pounds UK with monitor and mouse and all other useful
bits. Not available in the US, but try Olivetti Canada.

VTI make ARM available as an ASIC cell. Sanyo have taken a second source
license (in April 1988) for the chip set, and make a 32 bit microcomputer
(single chip controller). In "VLSI Systems Design" July 1988, the following
statements are made by VTI: ARM in 1.5 micron (18-20MHz clock), 180x180 mil^2;
future shrink to 1 micron (they are expecting "perhaps 40MHz" and 150 mil
square with the price dropping from $50 to $15); expected sales in 1988
90-100,000 units.

Contact Ron Cates, VTI Application Specific Logic Products Division,
Tempe, Arizona for details (e.g. the "VL86C010 RISC Family Data Manual").

Plug in boards for PCs are available. A controller for Laser printers
with ARM, MEMC, VIDC and 4MBytes DRAM has been sold to Olivetti [Acorn's
parent company as of 1985-6] (contact SWoo...@acorn.co.uk if you want to
know more).


In the Near Future:

We have a Floating Point Coprocessor interface chip working "in the lab" - the
fifth member of the four chip set. It interfaces an ATT WE32206 to ARM's
coprocessor bus. It benchmarks at 95.5 KFlops LINPACK DP FORTRAN Rolled BLAS
(slowest) (11KFlops with a floating point emulator) on an A310. Definitely
have to make our own, some time...

Acorn is about to release UNIX 4.3BSD including TCP/IP, NFS, X Windows and
IXI's X.desktop on the A440. Contact MJe...@acorn.co.uk or
DSl...@acorn.co.uk for more info (and to be told that it isn't available in
the US {yet}).


Operating Systems:

Acorn's proprietary OS "Arthur" is written in machine code: it fills 1/2MByte
of ROM! (yes, writing in RISC machine code is truly wonderful as others have
noted on comp.arch). Its main features are windows, anti-aliased fonts
(wonderful at 90 pixels per inch - I use 8 point all the time) and sound
synthesis. It runs on all Archimedes machines. A 2nd release is due real soon
now and features multitasking, a better desktop and a name change to RISC OS.

VTI are porting VRTX to the ARM; Cambridge (UK) Computer Lab's Tripos has been
ported to A310/A440. UNIX has been ported by Acorn: see above. There are MINIX
ports everywhere one looks (try querying the net...).


Software:

C Compiler: ANSI/pcc; register allocation by graph colouring; code motion;
dead code elimination; tail call elimination; very good local code generation;
CSE and cross-jumping work and will be in the next release. No peepholing (yet
- not much advantage, I'm afraid). Can't turn off most optimisation features.
Also FORTRAN 77, ISO PASCAL, interpreted BASIC (structured BBC BASIC, very
fast), Forth, Algol, APL, Smalltalk 80 (as seen at OOPSLA 88: on an A440 it
approximates a Dorado) and others (LISP, Prolog, ML, Ponder, BCPL....).

Specific applications for Archimedes computers are too numerous to mention!
(though the high speed Mandelbrot calculation has to be seen to be believed -
one iteration of the set in 28 clock ticks [32 bit fixed point]; real time
scroll across the set [calculate row/column in a frame time and move the
picture]).
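
For readers unfamiliar with fixed point Mandelbrot code, one iteration of z := z^2 + c can be sketched in C as below. The Q8.24 format, the names `fix_t`, `fix_mul` and `mandel_step`, and the use of a 64 bit intermediate are all assumptions made for illustration; the post only says "32 bit fixed point" and gives no details of the real routine.

```c
#include <stdint.h>

#define FRAC_BITS 24            /* assumed Q8.24 format, not from the post */

typedef int32_t fix_t;

/* Fixed point multiply using a 64 bit intermediate (truncates toward 0). */
static fix_t fix_mul(fix_t a, fix_t b)
{
    return (fix_t)(((int64_t)a * (int64_t)b) / ((int64_t)1 << FRAC_BITS));
}

/* Perform one step z := z^2 + c on (*zr, *zi).
 * Returns 1 if |z|^2 <= 4 held BEFORE the update, else 0. */
int mandel_step(fix_t *zr, fix_t *zi, fix_t cr, fix_t ci)
{
    fix_t rr = fix_mul(*zr, *zr);   /* re(z)^2 */
    fix_t ii = fix_mul(*zi, *zi);   /* im(z)^2 */
    fix_t ri = fix_mul(*zr, *zi);   /* re(z) * im(z) */

    *zr = rr - ii + cr;
    *zi = 2 * ri + ci;
    return rr + ii <= ((fix_t)4 << FRAC_BITS);
}
```

The three multiplies per iteration are the expensive part, which is presumably where most of the quoted 28 clock ticks go.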

There is a part of the net that talks about Archimedes machines:
(eunet.micro.acorn).


Random Info:

Code density is approximately that of 80x86/68020. Occasionally 30% worse
(usually on very small programs).

The average number of ticks per instruction is 1.895 (claims VTI - we've never
bothered to measure it).

DRAM page mode is controlled by the MEMC, but there is a prediction signal
from the ARM saying "I will use a sequential address in the next cycle" which
helps the timing a great deal! S=125nS, N=250nS with current MEMC and DRAM
(see David Chase's article for instruction timing). Static RAM ARM systems
have been implemented up to 18MHz - S=N=1/18 with these systems.

Approximately 1000 dhrystones 1.1 per MHz if N=S; about 1000/1.895 dhrystones
per MHz if N=2S (i.e. 5K dhrystones for a 4/8MHz system; 18K dhrystones for
an 18/18MHz system).

Most recent features: Electronic Design Jul 28 1988, VLSI Systems Design July
1988.

We had a competition to see who would use "ra := rb op rc shifted by rd" with
all of ra, rb, rc and rd actually different registers, but the graphics people
won it too easily!

ARM's byte sex is as VAX and NS32000 (little endian). The byte sex of a 32 bit
word can be changed in 4 clock ticks by:
EOR R1,R0,R0,ROR #16
BIC R1,R1,#&FF0000
MOV R0,R0,ROR #8
EOR R0,R0,R1,LSR #8
which reverses R0's bytes. Shifting and operating in one instruction is fun.
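
To see why the four instructions above reverse the bytes, here is a line-by-line C transcription. It is only a sketch for checking the logic; the helper `ror32` and the function name `arm_byte_reverse` are made up for this illustration.

```c
#include <stdint.h>

/* Rotate a 32 bit value right by n bits (n taken modulo 32). */
static uint32_t ror32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x >> n) | (x << (32u - n)) : x;
}

/* Mirrors the four ARM instructions, one statement per instruction. */
uint32_t arm_byte_reverse(uint32_t r0)
{
    uint32_t r1;

    r1 = r0 ^ ror32(r0, 16);       /* EOR R1,R0,R0,ROR #16 */
    r1 &= ~(uint32_t)0x00FF0000u;  /* BIC R1,R1,#&FF0000   */
    r0 = ror32(r0, 8);             /* MOV R0,R0,ROR #8     */
    r0 ^= r1 >> 8;                 /* EOR R0,R0,R1,LSR #8  */
    return r0;
}
```

Writing the input as bytes A B C D: after the rotate by 8, R0 holds D A B C, and the final EOR flips its second and fourth bytes by A^C, giving D C B A.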

Shifted 8bit constants (see David Chase's article) catch virtually everything.
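
As a rough illustration of what "shifted 8bit constants" can express, the C check below assumes the usual ARM data-processing immediate rule: an 8 bit value rotated right by an even amount (0 to 30). That rule is not spelled out in this post, so treat it as an assumption; the function name `arm_immediate_encodable` is hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Returns true if value can be written as an 8 bit constant rotated
 * right by an even number of bits (assumed ARM immediate rule). */
bool arm_immediate_encodable(uint32_t value)
{
    unsigned rot;

    for (rot = 0; rot < 32; rot += 2) {
        /* Rotating left by rot undoes a rotate right by rot. */
        uint32_t undone = rot ? (value << rot) | (value >> (32u - rot))
                              : value;
        if (undone <= 0xFFu)
            return true;
    }
    return false;
}
```

Under this rule &FF0000 (used by the BIC in the byte-reversal sequence) is encodable, while a value such as &101 is not and would need two instructions or a load.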

Major use of block register load/save (via bitmask) is procedure entry/exit.
And graphics - you just can't keep those boys down. The C and BCPL compilers
turn some multiple ordinary loads into single block loads.

MEMC's Content Addressable Memory inverted page table contains 128 entries.
This gives rather large pages (32KBytes with 4MBytes of RAM) and one can't
have the same page at two virtual addresses. Our UNIX hackers revolted, but
are now learning to love it (there's a nice bit in the standard kernel which
goes "allocate 31 pages to start a new process"....)
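
The page size figure follows from dividing physical RAM evenly among the 128 table entries. A minimal C sketch of that arithmetic, assuming exactly one physical page per entry (the name `memc_page_bytes` is hypothetical):

```c
#include <stdint.h>

enum { MEMC_CAM_ENTRIES = 128 };   /* entry count quoted in the post */

/* Page size in bytes if RAM is split evenly across the CAM entries
 * (assumes one physical page per entry). */
uint32_t memc_page_bytes(uint32_t ram_bytes)
{
    return ram_bytes / MEMC_CAM_ENTRIES;
}
```

For 4MBytes this gives 32KBytes per page, so "allocate 31 pages" would claim 31 x 32KBytes = 992KBytes, nearly a quarter of the machine, for each new process.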

Data types: byte, word aligned word, and multi-word (usually with a
coprocessor e.g. single, double, double extended floating point).

Neatest trick: compressing all binary images by around a factor of 2. The
decompression is done FASTER than reading the extra data from a 5MBit
winchester!


Enough! (too much?) Specific questions to me, general brickbats to the net.