Today, the ARM family accounts for approximately 75% of all embedded 32-bit RISC CPUs, making it one of the most widely used 32-bit architectures. ARM CPUs are found in most corners of consumer electronics, from portable devices (PDAs, mobile phones, media players, handheld gaming units, and calculators) to computer peripherals (hard drives, desktop routers). However, the architecture no longer has significant penetration as the main processor in the desktop computer market and has never been used in a supercomputer or cluster. Important branches in this family include Marvell's XScale and the Texas Instruments OMAP series.
The ARM design was started in 1983 as a development project at Acorn Computers Ltd to build a compact RISC CPU. The team was led by Sophie Wilson and Steve Furber, and a key design goal was achieving low-latency input/output (interrupt) handling like that of the MOS Technology 6502 used in Acorn's existing computer designs. The 6502's memory access architecture allowed developers to produce fast machines without the use of costly direct memory access hardware. The team completed development samples called ARM1 by April 1985, and the first "real" production systems as ARM2 the following year.
The ARM2 featured a 32-bit data bus, a 26-bit (64 Mbyte) address space and sixteen 32-bit registers. The address space was limited because the program counter was only 26 bits wide: the top 6 bits of the 32-bit register that held it served as status flags. The ARM2 was possibly the simplest useful 32-bit microprocessor in the world, with only 30,000 transistors (compare with Motorola's 68000, six years older, with around 70,000 transistors). Much of this simplicity comes from not having microcode (which represents about one-quarter to one-third of the 68000) and, like most CPUs of the day, not including any cache. This simplicity led to its low power usage, while it still performed better than the Intel 80286. A successor, ARM3, was produced with a 4 KB cache, which further improved performance.
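The sharing of one 32-bit register between the program counter and the status flags can be illustrated with a few masks. The following is a minimal sketch in C, not taken from any Acorn source: the macro and function names are hypothetical, and the use of the two lowest bits for the processor mode is a detail assumed here rather than stated in the text above.

```c
#include <stdint.h>

/* Assumed layout of R15 on 26-bit ARM processors such as the ARM2:
 *   bits 31..26  status flags (six bits)
 *   bits 25..2   word-aligned program counter
 *   bits  1..0   processor mode (assumption, not stated in the text)
 */
#define R15_FLAGS_SHIFT 26u
#define R15_PC_MASK     0x03FFFFFCu
#define R15_MODE_MASK   0x00000003u

/* Byte address of the instruction; always below 64 Mbyte (1 << 26). */
static uint32_t r15_pc(uint32_t r15)    { return r15 & R15_PC_MASK; }

/* The six status bits packed into the top of the register. */
static uint32_t r15_flags(uint32_t r15) { return r15 >> R15_FLAGS_SHIFT; }

/* The two low bits, which are free because instructions are word-aligned. */
static uint32_t r15_mode(uint32_t r15)  { return r15 & R15_MODE_MASK; }
```

Whatever value the register holds, `r15_pc` can never return an address of 64 Mbyte or more, which is the limit described above.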
In the late 1980s Apple Computer and VLSI Technology started working with Acorn on newer versions of the ARM core. The work was so important that Acorn spun off the design team in 1990 into a new company called Advanced RISC Machines Ltd. For this reason, ARM is sometimes expanded as Advanced RISC Machine instead of Acorn RISC Machine. Advanced RISC Machines became ARM Ltd when its parent company, ARM Holdings plc, floated on the London Stock Exchange and NASDAQ in 1998.
The new Apple-ARM work would eventually turn into the ARM6, first released in 1991. Apple used the ARM6-based ARM 610 as the basis for their Apple Newton PDA. In 1994, Acorn used the ARM 610 as the main CPU in their Risc PC computers. DEC licensed the ARM6 architecture (which caused some confusion because they also produced the DEC Alpha) and produced the StrongARM. At 233 MHz this CPU drew only 1 watt of power (more recent versions draw far less). This work was later passed to Intel as a part of a lawsuit settlement, and Intel took the opportunity to supplement their aging i960 line with the StrongARM. Intel later developed its own high performance implementation known as XScale which it has since sold to Marvell.
The ARM core has remained largely the same size throughout these changes. ARM2 had 30,000 transistors, while the ARM6 grew to only 35,000. ARM's business has always been to sell IP cores, which licensees use to create microcontrollers and CPUs based on this core. The most successful implementation has been the ARM7TDMI with hundreds of millions sold in almost every kind of microcontroller equipped device. The idea is that the Original Design Manufacturer combines the ARM core with a number of optional parts to produce a complete CPU, one that can be built on old semiconductor fabs and still deliver substantial performance at a low cost. As of January 2008, over 10 billion ARM cores have been built, and iSuppli predicts that 5 billion a year will ship in 2011.
The common architecture supported on smartphones, Personal Digital Assistants and other handheld devices is ARMv4. XScale and ARM926 processors are ARMv5TE, and are now more numerous in high-end devices than the StrongARM, ARM925T and ARM7TDMI based ARMv4 processors.
|Family||Architecture Version||Core||Feature||Cache (I/D)/MMU||Typical MIPS @ MHz||In application|
|ARM1||ARMv1||ARM1||None||ARM Evaluation System second processor for BBC Micro|
|ARM2||ARMv2||ARM2||Architecture 2 added the MUL (multiply) instruction||None||4 MIPS @ 8 MHz||Acorn Archimedes, Chessmachine|
|ARMv2a||ARM250||Integrated MEMC (MMU), Graphics and IO processor. Architecture 2a added the SWP and SWPB (swap) instructions.||None, MEMC1a||7 MIPS @ 12 MHz||Acorn Archimedes|
|ARM3||ARMv2a||ARM2a||First use of a processor cache on the ARM.||4K unified||12 MIPS @ 25 MHz|
|ARM6||ARMv3||ARM60||v3 architecture first to support addressing 32 bits of memory (as opposed to 26 bits)||None||10 MIPS @ 12 MHz||3DO Interactive Multiplayer, Zarlink GPS Receiver|
|ARM600||Cache and coprocessor bus (for FPA10 floating-point unit).||4K unified||28 MIPS @ 33 MHz|
|ARM610||Cache, no coprocessor bus.||4K unified||17 MIPS @ 20 MHz||Acorn Risc PC 600, Apple Newton 100 series|
|ARM7||ARMv3||ARM700||8 KB unified||40 MHz||Acorn Risc PC prototype CPU card|
|ARM710||8KB unified||40 MHz||Acorn Risc PC 700|
|ARM710a||8 KB unified||40 MHz||Acorn Risc PC 700, Apple eMate 300|
|ARM7100||Integrated SoC.||8 KB unified||18 MHz||Psion Series 5|
|ARM7500||Integrated SoC.||4 KB unified||40 MHz||Acorn A7000|
|ARM7500FE||Integrated SoC. "FE" Added FPA and EDO memory controller.||4 KB unified||56 MHz|
|ARM7TDMI||ARMv4T||ARM7TDMI(-S)||3-stage pipeline, Thumb||none||15 MIPS @ 16.8 MHz||Game Boy Advance, Nintendo DS, iPod, Lego NXT, Atmel AT91SAM7, Juice Box|
|ARM710T||8 KB unified, MMU||36 MIPS @ 40 MHz||Psion Series 5mx, Psion Revo/Revo Plus/Diamond Mako|
|ARM720T||8 KB unified, MMU||60 MIPS @ 59.8 MHz||Zipit Wireless Messenger|
|ARMv5TEJ||ARM7EJ-S||Jazelle DBX, Enhanced DSP instructions, 5-stage pipeline||none|
|StrongARM||ARMv4||SA-110||16 KB/16 KB, MMU||203 MHz||Apple Newton 2x00 series, Acorn Risc PC, Rebel/Corel Netwinder, Chalice CATS, Psion Netbook|
|SA-1110||16 KB/16 KB, MMU||233 MHz||LART, Intel Assabet, Ipaq H36x0, Balloon2, Zaurus SL-5x00, HP Jornada 7xx|
|ARM8||ARMv4||ARM810||5-stage pipeline, static branch prediction, double-bandwidth memory||8 KB unified, MMU||84 MIPS @ 72 MHz||Acorn Risc PC prototype CPU card|
|ARM920T||16 KB/16 KB, MMU||200 MIPS @ 180 MHz||Armadillo, GP32, GP2X (first core), Tapwave Zodiac (Motorola i.MX1), Hewlett-Packard HP-49/50 calculators, Sun SPOT, Cirrus Logic EP9315, Samsung s3c2442 (HTC TyTN, FIC Neo FreeRunner)|
|ARM922T||8 KB/8 KB, MMU|
|ARM940T||4 KB/4 KB, MPU||GP2X (second core), Meizu M6 Mini Player|
|ARM9E||ARMv5TE||ARM946E-S||Enhanced DSP instructions||variable, tightly coupled memories, MPU||Nintendo DS, Nokia N-Gage, Conexant 802.11 chips|
|ARM966E-S||no cache, TCMs||ST Micro STR91xF, includes Ethernet|
|ARM968E-S||no cache, TCMs|
|ARMv5TEJ||ARM926EJ-S||Jazelle DBX, Enhanced DSP instructions||variable, TCMs, MMU||220 MIPS @ 200 MHz||Mobile phones: Sony Ericsson (K, W series); Siemens and Benq (x65 series and newer); Texas Instruments OMAP1710, OMAP1610, OMAP1611, OMAP1612; Qualcomm MSM6100, MSM6125, MSM6225, MSM6245, MSM6250, MSM6255A, MSM6260, MSM6275, MSM6280, MSM6300, MSM6500, MSM6800; Freescale i.MX21, i.MX27, Atmel AT91SAM9, GPH Wiz|
|ARMv5TE||ARM996HS||Clockless processor, Enhanced DSP instructions||no caches, TCMs, MPU|
|ARM10E||ARMv5TE||ARM1020E||(VFP), 6-stage pipeline, Enhanced DSP instructions||32 KB/32 KB, MMU|
|ARM1022E||(VFP)||16 KB/16 KB, MMU|
|ARMv5TEJ||ARM1026EJ-S||Jazelle DBX, Enhanced DSP instructions||variable, MMU or MPU|
|XScale||ARMv5TE||80200/IOP310/IOP315||I/O Processor, Enhanced DSP instructions|
|80219||400/600 MHz||Thecus N2100|
|IOP321||600 BogoMips @ 600 MHz||Iyonix|
|IOP34x||1-2 core, RAID Acceleration||32K/32K L1, 512K L2, MMU|
|PXA210/PXA250||Applications processor, 7-stage pipeline||Zaurus SL-5600, iPAQ H3900|
|PXA255||32 KB/32 KB, MMU||400 BogoMips @ 400 MHz||Gumstix basix & connex, Palm Tungsten E2, Mentor Ranger & Stryder|
|PXA26x||default 400 MHz, up to 624 MHz||Palm Tungsten T3|
|PXA27x||Applications processor||32 KB/32 KB, MMU||800 MIPS @ 624 MHz||Gumstix verdex, HTC Universal, HP hx4700, Zaurus SL-C1000, 3000, 3100, 3200, Dell Axim x30, x50, and x51 series, Motorola Q, Balloon3, Trolltech Greenphone, Palm TX, Motorola Ezx Platform A728, A780, A910, A1200, E680, E680i, E680g, E690, E895, Rokr E2, Rokr E6, Fujitsu Siemens LOOX N560, Toshiba Portégé G500, Trēo 650-755p|
|Monahans||1000 MIPS @ 1.25 GHz|
|PXA900||Blackberry 8700, Blackberry Pearl (8100)|
|IXC1100||Control Plane Processor|
|ARM11||ARMv6||ARM1136J(F)-S||SIMD, Jazelle DBX, (VFP), 8-stage pipeline||variable, MMU||740 @ 532-665 MHz (i.MX31 SoC), 400-528 MHz||Texas Instruments OMAP2420 (Nokia E90, Nokia N93, Nokia N95, Nokia N82), Zune, BUGbase, Nokia N800, Nokia N810, Qualcomm MSM7200 (with integrated ARM926EJ-S coprocessor at 274 MHz, used in Eten Glofish (Kaiser), HTC Nike), Freescale i.MX31 (which was used in the original Zune 30 GB)|
|ARMv6T2||ARM1156T2(F)-S||SIMD, Thumb-2, (VFP), 9-stage pipeline||variable, MPU|
|ARMv6KZ||ARM1176JZ(F)-S||SIMD, Jazelle DBX, (VFP)||variable, MMU+TrustZone||Apple iPhone, Apple iPod touch, Conexant CX2427X, Motorola RIZR Z8, Motorola RIZR Z10|
|ARMv6K||ARM11 MPCore||1-4 core SMP, SIMD, Jazelle DBX, (VFP)||variable, MMU||Nvidia APX 2500|
|Cortex||ARMv7-A||Cortex-A8||Application profile, VFP, NEON, Jazelle RCT, Thumb-2, 13-stage superscalar pipeline||variable (L1+L2), MMU+TrustZone||up to 2000 (2.0 DMIPS/MHz in speed from 600 MHz to greater than 1 GHz)||Texas Instruments OMAP3, Pandora|
|Cortex-A9||Application profile, (VFP), (NEON), Jazelle RCT and DBX, Thumb-2, Out-of-order speculative issue superscalar||MMU+TrustZone||2.0 DMIPS/MHz|
|Cortex-A9 MPCore||As Cortex-A9, 1-4 core SMP||MMU+TrustZone||2.0 DMIPS/MHz|
|ARMv7-R||Cortex-R4(F)||Embedded profile, (FPU)||variable cache, MPU optional||600 DMIPS||Broadcom is a user, TMS570 from Texas Instruments|
|ARMv7-M||Cortex-M3||Microcontroller profile, Thumb-2 only.||no cache, (MPU)||125 DMIPS @ 100 MHz||Luminary Micro microcontroller family, ST Microelectronics STM32|
|ARMv6-M||Cortex-M1||FPGA targeted, Microcontroller profile, Thumb-2 (BL, MRS, MSR, ISB, DSB, and DMB).||None, tightly coupled memory optional.||Up to 136 DMIPS @ 170 MHz (0.8 DMIPS/MHz, MHz achievable FPGA-dependent)||Actel ProASIC3 and Actel Fusion PSC devices will sample in Q3 2007|
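The ARM architecture includes typical RISC features, such as a load/store architecture, fixed-width 32-bit instructions and a uniform file of sixteen 32-bit registers.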
An interesting addition to the ARM design is the use of a 4-bit condition code on the front of every instruction, meaning that execution of every instruction is optionally conditional. Other CPU architectures typically only have condition codes on branch instructions.
This cuts down significantly on the encoding bits available for displacements in memory access instructions, but on the other hand it avoids branch instructions when generating code for small if statements. The standard example of this is the Euclidean algorithm, whose loop body can be coded with conditionally executed subtract instructions, which avoids the branches around the then and else clauses.
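The effect of conditional execution on the Euclidean algorithm can be sketched by simulating it in C. This is an illustration under stated assumptions rather than a reference implementation: the `Flags` type and all function names are hypothetical, `Ri` and `Rj` in the comments are placeholder register names, and both inputs are assumed to be positive values that fit in a signed 32-bit integer. Each line of the loop corresponds to one ARM instruction, and only the last one is a branch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Condition flags as left behind by a compare (CMP) instruction. */
typedef struct { bool n, z, c, v; } Flags;

/* CMP a, b: compute a - b and keep only the flags. */
static Flags cmp(uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    Flags f;
    f.n = (r >> 31) != 0;                    /* result negative */
    f.z = (r == 0);                          /* result zero */
    f.c = (a >= b);                          /* carry set means no borrow */
    f.v = (((a ^ b) & (a ^ r)) >> 31) != 0;  /* signed overflow */
    return f;
}

/* Three of the condition codes selectable by the 4-bit field. */
static bool cond_ne(Flags f) { return !f.z; }
static bool cond_gt(Flags f) { return !f.z && f.n == f.v; }
static bool cond_lt(Flags f) { return f.n != f.v; }

/* Subtractive Euclidean algorithm written as four "instructions". */
uint32_t gcd_conditional(uint32_t i, uint32_t j)
{
    Flags f;
    assert(i > 0 && j > 0 && i <= 0x7FFFFFFFu && j <= 0x7FFFFFFFu);
    do {
        f = cmp(i, j);              /* loop: CMP   Ri, Rj      */
        if (cond_gt(f)) i -= j;     /*       SUBGT Ri, Ri, Rj  */
        if (cond_lt(f)) j -= i;     /*       SUBLT Rj, Rj, Ri  */
    } while (cond_ne(f));           /*       BNE   loop        */
    return i;
}
```

The two subtractions test flags set by the single compare, so neither the "then" nor the "else" path needs a branch of its own; on a processor without conditional execution, each would typically be skipped with a separate branch instruction.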
Another unique feature of the instruction set is the ability to fold shifts and rotates into the "data processing" (arithmetic, logical, and register-register move) instructions, so that, for example, the C statement
a += (j << 2);
could be rendered as a single-word, single-cycle instruction on the ARM:
ADD Ra, Ra, Rj, LSL #2
This results in the typical ARM program being denser than expected with fewer memory accesses; thus the pipeline is used more efficiently. Even though the ARM runs at what many would consider to be low speeds, it nevertheless competes quite well with much more complex CPU designs.
The ARM processor also has some features rarely seen in other RISC architectures, such as PC-relative addressing (indeed, on the ARM the PC is one of its 16 registers) and pre- and post-increment addressing modes.
Another item of note is that the ARM has been around for a while, with the instruction set growing somewhat over time. Some early ARM processors (prior to the ARM7TDMI), for example, have no instruction to store a two-byte quantity. Strictly speaking, it is therefore not possible to generate code for them that behaves the way one would expect for C objects of type "volatile short".
The ARM7 and earlier designs have a three stage pipeline; the stages being fetch, decode, and execute. Higher performance designs, such as the ARM9, have a five stage pipeline. Additional changes for higher performance include a faster adder, and more extensive branch prediction logic.
The architecture provides a non-intrusive way of extending the instruction set using "coprocessors" which can be addressed using MCR, MRC, MRRC and MCRR commands from software. The coprocessor space is divided logically into 16 coprocessors with numbers from 0 to 15, coprocessor 15 (cp15) being reserved for some typical control functions like managing the caches and MMU operation (on processors that have one).
In ARM based machines, peripheral devices are usually attached to the processor by mapping their physical registers into ARM memory space or into the coprocessor space or connecting to another device (a bus) which in turn attaches to the processor. Coprocessor accesses have lower latency so some peripherals (for example XScale interrupt controller) are designed to be accessible in both ways (through memory and through coprocessors).
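To make the coprocessor interface more concrete, the following sketch builds the 32-bit instruction word for an MRC (coprocessor to ARM register) or MCR (ARM register to coprocessor) transfer. It assumes the classic 32-bit ARM encoding of these instructions with the "always" condition; the function and parameter names are illustrative and not part of any ARM toolchain.

```c
#include <stdint.h>

/* Encode an unconditional MRC or MCR instruction word.
 * to_arm != 0 selects MRC (read from coprocessor),
 * to_arm == 0 selects MCR (write to coprocessor).
 * Names are hypothetical; field positions are assumed from the
 * classic ARM coprocessor register transfer format. */
static uint32_t encode_cp_transfer(int to_arm, unsigned coproc, unsigned opc1,
                                   unsigned rt, unsigned crn, unsigned crm,
                                   unsigned opc2)
{
    return (0xEu << 28)                  /* condition field: always */
         | (0xEu << 24)                  /* coprocessor register transfer */
         | ((opc1 & 0x7u) << 21)         /* coprocessor opcode 1 */
         | ((to_arm ? 1u : 0u) << 20)    /* direction bit */
         | ((crn & 0xFu) << 16)          /* coprocessor register CRn */
         | ((rt & 0xFu) << 12)           /* ARM register */
         | ((coproc & 0xFu) << 8)        /* coprocessor number 0..15 */
         | ((opc2 & 0x7u) << 5)          /* coprocessor opcode 2 */
         | (1u << 4)
         | (crm & 0xFu);                 /* coprocessor register CRm */
}
```

For example, `encode_cp_transfer(1, 15, 0, 0, 0, 0, 0)` yields the word for `MRC p15, 0, r0, c0, c0, 0`, a read from coprocessor 15 into r0. The four-bit coprocessor field is what divides the space into the 16 logical coprocessors mentioned above.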
In Thumb, the smaller opcodes have less functionality. For example, only branches can be conditional, and many opcodes are restricted to accessing only half of all of the CPU's general purpose registers. The shorter opcodes give improved code density overall, even though some operations require extra instructions. In situations where the memory port or bus width is constrained to less than 32 bits, the shorter Thumb opcodes allow increased performance compared with 32-bit ARM code, as less program code may need to be loaded into the processor over the constrained memory bandwidth.
Embedded hardware, such as the Game Boy Advance, typically has a small amount of RAM accessible with a full 32-bit datapath; the majority is accessed via a 16-bit or narrower secondary datapath. In this situation, it usually makes sense to compile Thumb code and hand-optimise a few of the most CPU-intensive sections using full 32-bit ARM instructions, placing these wider instructions into the memory accessible over the 32-bit bus.
The first processor with a Thumb instruction decoder was the ARM7TDMI. All ARM9 and later families, including XScale, have included a Thumb instruction decoder.
The new instructions are common in digital signal processor architectures. They are variations on signed multiply-accumulate, saturated add and subtract, and count leading zeros.
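The three kinds of operation named above can be written out in portable C to show what the hardware computes in a single instruction. This is a sketch of the general semantics only: the function names are hypothetical, and the code does not claim to match the exact operand widths or flag behaviour of any particular ARM instruction.

```c
#include <stdint.h>

/* Saturating signed add: clamps to the 32-bit range instead of wrapping. */
static int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* Count leading zeros: number of zero bits above the highest set bit.
 * Returns 32 when the input is zero. */
static unsigned clz32(uint32_t x)
{
    unsigned n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Signed multiply-accumulate: acc + a * b, computed in 64 bits. */
static int64_t mac32(int64_t acc, int32_t a, int32_t b)
{
    return acc + (int64_t)a * (int64_t)b;
}
```

Saturation matters in signal processing because a clipped sample is usually far less objectionable than one that wraps around to the opposite sign; count leading zeros is commonly used to normalise fixed-point values.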
The most prominent use of Jazelle is by manufacturers of mobile phones to increase the execution speed of Java ME games and applications.
A Jazelle-aware Java Virtual Machine (JVM) will attempt to run Java bytecodes in hardware, while returning to the software for more complicated, or lesser-used bytecode operations. ARM claim that approximately 95% of bytecode in typical program usage ends up being directly processed in the hardware.
Jazelle functionality was specified in the ARMv5TEJ architecture and the first processor with Jazelle technology was the ARM926EJ-S: Jazelle is denoted by a 'J' appended to the CPU name.
The published specifications are very incomplete, being only sufficient for writing operating system code that can support a JVM that uses Jazelle. The declared intent is that only the JVM software needs to (or is allowed to) depend on the hardware interface details. This tight binding allows the hardware and JVM to evolve together without affecting other software. In effect, this gives ARM Ltd. considerable control over which JVMs are able to exploit Jazelle.
The Jazelle mode moves JVM interpretation into hardware for the most common simple JVM instructions. This is intended to significantly reduce the cost of interpretation. Among other things, this reduces the need for JIT and other JVM accelerating techniques. JVM instructions that are not implemented in Jazelle hardware cause appropriate routines in the Jazelle-aware JVM implementation to be invoked. Details are not published.
Jazelle mode is entered via the BXJ instruction. A hardware implementation of Jazelle will only cover a subset of JVM bytecodes. For unhandled bytecodes, or if overridden by the operating system, the hardware will invoke the software JVM. The system is designed so that the software JVM does not need to know which bytecodes are implemented in hardware, and a software fallback is provided by the software JVM for the full set of bytecodes.
Employees of ARM have in the past published several white papers that do give some good pointers about the processor extension. Versions of the ARM Architecture Reference Manual available from 2008 have included pseudocode for the 'BXJ' (Branch and eXchange to Java) instruction, but with the finer details being shown as "SUB-ARCHITECTURE DEFINED" and documented elsewhere.
The entire VM state is held within normal ARM registers, allowing compatibility with existing operating systems and interrupt handlers unmodified. Restarting a bytecode (such as following a return from interrupt) will re-execute the complete sequence of related ARM instructions.
Specific registers are designated to hold the most important parts of the JVM state: registers r0-r3 hold an alias of the top of the Java stack, r4 holds Java local operand zero (pointer to *this) and r6 contains the Java stack pointer.
Jazelle reuses the existing Program Counter register r15. A pointer to the next bytecode goes in r14, so the use of the PC is not generally user-visible except during debugging.
Bytecodes are decoded by the hardware in two stages (versus a single stage for Thumb and ARM code), and switching between hardware and software decoding (Jazelle mode and ARM mode) takes about 4 clock cycles.
For entry to the Jazelle hardware state to succeed, the JE (Jazelle Enable) bit in the CP14:c0(c2)[bit 0] register must be set. Clearing the JE bit from a privileged operating system provides a high-level override that prevents application programs from using the hardware Jazelle acceleration. Additionally, the CV (Configuration Valid) bit found in CP14:c0(c1)[bit 1] must be set to show that there is a consistent Jazelle state set up for the hardware to use.
Because the current state is held in the CPSR, the bytecode instruction set is automatically reselected after task-switching and processing of the current Java bytecode is restarted.
Following an entry into the Jazelle state mode, bytecodes can be processed in one of three ways; decoded and executed natively in hardware, handled in software (with optimised ARM/ThumbEE JVM code), or treated as an invalid/illegal opcode. The third case will cause a branch to an ARM exception mode, as will a Java bytecode of 0xff, which is used for setting JVM breakpoints.
Execution will continue in hardware until an unhandled bytecode is encountered, or an exception occurs. Between 134 and 149 bytecodes (out of 203 bytecodes specified in the JVM specification) are translated and executed directly in the hardware.
A "trivial" hardware implementation of Jazelle (as found in the QEMU emulator) is only required to support the BXJ opcode itself (treating BXJ as a normal BX instruction) and to return RAZ (Read-As-Zero) for all of the CP14:c0 Jazelle-related registers.
Thumb-2 also extends both the ARM and Thumb instruction set with yet more instructions, including bit-field manipulation, table branches, and conditional execution.
All ARMv7 chips support the Thumb-2 instruction set. Some chips, such as the Cortex-M3, support only the Thumb-2 instruction set. Other chips in the Cortex and ARM11 series support both "ARM instruction set mode" and "Thumb-2 instruction set mode".
New features provided by ThumbEE include automatic null pointer checks on every load and store instruction, an instruction to perform an array bounds check, access to registers r8-r15 (where the Jazelle/DBX Java VM state is held), and the ability to branch to handlers. Handlers are small sections of frequently called code, commonly used to implement a feature of a high-level language, such as allocating memory for a new object.
In practice, since the specific implementation details of TrustZone are proprietary and have not been disclosed for review, any assumption of or claim to security is flawed in principle, as it would require unverifiable trust in a commercial entity. Without a rational and independent audit of the design and manufacturing process, these and other secure microdevices are inherently untrustworthy.
Like most IP vendors, ARM prices its IP based on perceived value. In architectural terms, the lower performance ARM cores command a lower license cost than the higher performance cores. In terms of silicon implementation, a synthesizable core is more expensive than a hard macro (blackbox) core. Complicating price matters, a merchant foundry who holds an ARM license (such as Samsung and Fujitsu) can offer reduced licensing costs to its fab customers. In exchange for acquiring the ARM core through the foundry's in-house design services, the customer can reduce or eliminate payment of ARM's upfront license fee. Compared to dedicated semiconductor foundries (such as TSMC and UMC) without in-house design services, Fujitsu/Samsung charge 2 to 3 times more per manufactured wafer. For low to mid volume applications, a design service foundry offers lower overall pricing (through subsidization of the license fee). For high volume mass produced parts, the long term cost reduction achievable through lower wafer pricing reduces the impact of ARM's NRE (Non-Recurring Engineering) costs, making the dedicated foundry a better choice.
Many semiconductor or IC design firms hold ARM licenses: Analog Devices, Atmel, Broadcom, Cirrus Logic, Faraday Technology, Freescale, Fujitsu, Intel (through its settlement with Digital Equipment Corporation), IBM, Infineon Technologies, Nintendo, NXP Semiconductors, OKI, Qualcomm, Samsung, Sharp, STMicroelectronics, Texas Instruments and VLSI are some of the many companies who have licensed the ARM in one form or another. Although ARM's license terms are covered by NDA, within the IP industry ARM is widely known to be among the most expensive CPU cores. A single customer product containing a basic ARM core can incur a one-time license fee in excess of US$200,000. Where significant quantity and architectural modification are involved, the license fee can exceed US$10 million.
ARM believes that its base of 200+ semiconductor licensees gives it a chance to succeed in the ongoing controversies regarding the use of ARM or Intel architectures in mobile computers.
ARM's 2006 annual report and accounts state that royalties totalling 88.7 million GBP (164.1 million USD) were the result of licensees shipping 2.45 billion units. This is equivalent to 0.036 GBP (0.067 USD) per unit shipped. However, this is averaged across all cores, including expensive new cores and inexpensive older cores.
In the same year ARM's licensing revenues for processor cores were £65.2 million ($119.5 million), in a year when 65 processor licenses were signed, an average of 1 million GBP (1.84 million USD) per license. Again, this is averaged across both new and old cores.
Given that ARM's 2006 income from processor cores was approximately 60% from royalties and 40% from licenses, ARM makes the equivalent of 0.06 GBP (0.11 USD) per unit shipped including both royalties and licenses. However, as one-off licenses are typically bought for new technologies, unit sales (and hence royalties) are dominated by more established products. Hence, the figures above do not reflect the true cost of any single ARM product.
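The per-unit figures quoted above follow from simple division of the 2006 totals. The working below is a sketch in GBP only, using the numbers given in the text and rounding freely; the combined figure simply adds royalty and license revenue before dividing by units shipped.

```latex
\begin{align*}
\text{royalty per unit}
  &= \frac{\pounds 88.7\ \text{million}}{2450\ \text{million units}}
   \approx \pounds 0.036 \\[4pt]
\text{average license fee}
  &= \frac{\pounds 65.2\ \text{million}}{65\ \text{licenses}}
   \approx \pounds 1.0\ \text{million} \\[4pt]
\text{combined income per unit}
  &= \frac{\pounds (88.7 + 65.2)\ \text{million}}{2450\ \text{million units}}
   \approx \pounds 0.063
\end{align*}
```

On these totals, royalties make up roughly 58% of the combined income (88.7 out of 153.9), which is consistent with the approximate 60/40 split mentioned above.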