By the early 1960s computer designs were approaching the point of diminishing returns. At the time, computer design focussed on adding as many instructions as possible to the machine's CPU, a concept known as "orthogonality", which made programs smaller and more efficient in their use of memory. It also made the computers themselves fantastically complex, and in an era when many CPUs were hand-wired from individual transistors, the cost of additional orthogonality was often very high. Adding instructions could also slow the machine down, as the maximum speed was defined by the signal timing in the hardware, which was in turn a function of the overall size of the machine. The state-of-the-art hardware design techniques of the time used individual transistors to build up logic circuits, so any increase in logic processing meant a larger machine. CPU speeds appeared to be reaching a plateau.
Several solutions to these problems were explored in the 1960s. One, then known as overlap but today known as an instruction pipeline, allows a single CPU to work on small parts of several instructions at a time. Normally the CPU fetches an instruction from memory, "decodes" it, runs the instruction and then writes the results back to memory. While the machine is working on any one stage, say decoding, the other portions of the CPU are not being used. Pipelining allows the CPU to start the fetch and decode stages (for instance) on the "next" instruction while still working on the last one and writing it out. Pipelining was a major feature of Seymour Cray's groundbreaking design, the CDC 6600, which outperformed almost all other machines by about ten times when it was introduced.
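The benefit of overlap can be illustrated with a small cycle count. The sketch below is only a model under stated assumptions: four stages of one clock cycle each, with no stalls or dependencies between instructions. The stage names and function names are illustrative and do not describe the CDC 6600's actual hardware.

```python
# Assumed four-stage model; real machines differ in stage count and timing.
STAGES = ["fetch", "decode", "execute", "write-back"]


def sequential_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles needed when each instruction finishes every stage before the next starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles needed when a new instruction enters the pipeline on every cycle.

    The first instruction needs n_stages cycles; each later one completes
    one cycle after its predecessor.
    """
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def pipeline_schedule(n_instructions, stages=STAGES):
    """Return, for each cycle, the (instruction index, stage name) pairs in flight."""
    schedule = []
    for cycle in range(pipelined_cycles(n_instructions, len(stages))):
        active = []
        for instr in range(n_instructions):
            stage = cycle - instr  # instruction i enters the pipeline at cycle i
            if 0 <= stage < len(stages):
                active.append((instr, stages[stage]))
        schedule.append(active)
    return schedule


if __name__ == "__main__":
    print(sequential_cycles(10), pipelined_cycles(10))
    for cycle, active in enumerate(pipeline_schedule(3)):
        print(cycle, active)
```

In this idealised model ten instructions take 40 cycles without overlap and 13 with it; real pipelines gain less once stalls and dependencies are taken into account.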
Another solution to the problem was parallel computing: building a computer out of a number of general-purpose CPUs. The "computer" as a whole would have to be able to keep all of the CPUs busy, asking each one to work on a small part of the problem and then collecting up the results at the end into a single "answer". Not all tasks can be handled in this fashion, and extracting performance from multiple processors remains a problem even today, yet the concept has the advantage of having no theoretical limit to speed: if you need more performance, simply add more CPUs. General-purpose CPUs were very expensive, however, so any "massively parallel" design would either be too expensive to be worth it, or have to use a much simpler CPU design.
Westinghouse explored the latter solution in a project known as Solomon. Since the highest-performing computers were being used primarily for math processing in science and engineering, the designers decided to focus their CPU design on math alone. They designed a system in which the instruction stream was fetched and decoded by a single CPU, the "control unit" or CU. The CU was attached to an array of processors built to handle floating point math only, the "processing elements", or PEs. Since much of the complexity of a CPU is due to the instruction fetching and decoding process, Solomon's PEs ended up being much simpler than the CU, so many of them could be built without driving up the price. Modern microprocessor designs are quite similar to this layout in general terms, with a single instruction decoder feeding a number of subunits dedicated to processing certain types of data. Where Solomon differed from modern designs was in the number of subunits: a modern CPU might have three or four integer units and a similar number of floating point units, whereas Solomon had 256 PEs, all dedicated to floating point.
Solomon would read instructions from memory, decode them, and then hand them off to the PEs for processing. Each PE had its own memory for holding operands and results, the PE Memory module, or PEM. The CU could access the entire memory via a dedicated memory bus, whereas the PEs could only access their own PEM. Although there are problems, known as embarrassingly parallel, that can be handled by entirely independent units, these problems are generally rare. To allow results from one PE to be used as inputs in another, a separate network connected each PE to its eight closest neighbors. Similar arrangements were common on massively parallel machines in the 1980s.
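The eight-neighbor network can be pictured with a small indexing sketch. The layout here is an assumption made purely for illustration: the 256 PEs are arranged as a 16 by 16 grid, numbered row by row, with edges that wrap around. The text does not specify how Solomon actually numbered its PEs or handled the edges of the array.

```python
GRID_SIDE = 16  # assumption: 256 PEs laid out as a 16 x 16 square


def neighbors(pe, side=GRID_SIDE):
    """Return the indices of the eight PEs surrounding `pe` on a wrapped grid."""
    row, col = divmod(pe, side)
    result = []
    for d_row in (-1, 0, 1):
        for d_col in (-1, 0, 1):
            if d_row == 0 and d_col == 0:
                continue  # skip the PE itself
            n_row = (row + d_row) % side  # assumption: edges wrap around
            n_col = (col + d_col) % side
            result.append(n_row * side + n_col)
    return result


if __name__ == "__main__":
    print(sorted(neighbors(17)))
```

Under these assumptions, a value held by one PE can reach any adjacent PE, including the diagonal ones, in a single transfer.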
Unlike modern designs, Solomon's PEs could only run a single instruction at a time, and every PE had to be running the same instruction. That means the system was only useful when working on data sets that had "wide" arrays that could be spread out over the PEs. These sorts of problems are not uncommon in scientific processing, and are very common today when working with multimedia data. The concept of applying a single instruction to a large number of data elements at once is now common to most microprocessor designs, where it is referred to as SIMD, for Single Instruction, Multiple Data. In Solomon, the CU would normally load up the PEMs with data, scatter the instructions across the PEMs, and then start feeding the instructions to the PEs, one at every clock cycle.
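The lockstep arrangement described above can be modelled in a few lines. This is a toy sketch and not Solomon's instruction set: the class names, the two operations ("add" and "mul") and the dictionary standing in for each PEM are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessingElement:
    # Stands in for the PE's private memory module (PEM).
    memory: dict = field(default_factory=dict)

    def execute(self, op, dest, src_a, src_b):
        """Apply one already-decoded operation to this PE's own operands."""
        a, b = self.memory[src_a], self.memory[src_b]
        if op == "add":
            self.memory[dest] = a + b
        elif op == "mul":
            self.memory[dest] = a * b
        else:
            raise ValueError(f"unknown op: {op}")


class ControlUnit:
    def __init__(self, n_pes):
        self.pes = [ProcessingElement() for _ in range(n_pes)]

    def load(self, name, values):
        """Spread one array across the PEMs, one element per PE."""
        if len(values) != len(self.pes):
            raise ValueError("array width must match the number of PEs")
        for pe, value in zip(self.pes, values):
            pe.memory[name] = value

    def broadcast(self, op, dest, src_a, src_b):
        """Issue the same instruction to every PE.

        The loop runs sequentially here; in hardware all PEs act at once.
        """
        for pe in self.pes:
            pe.execute(op, dest, src_a, src_b)

    def gather(self, name):
        """Collect one value from each PEM back into a single array."""
        return [pe.memory[name] for pe in self.pes]


if __name__ == "__main__":
    cu = ControlUnit(4)
    cu.load("x", [1.0, 2.0, 3.0, 4.0])
    cu.load("y", [10.0, 20.0, 30.0, 40.0])
    cu.broadcast("add", "z", "x", "y")
    print(cu.gather("z"))
```

A single broadcast produces as many results as there are PEs, which is why the approach pays off only when the data is at least as "wide" as the array.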
Under a contract from the US Air Force's RADC research arm, Westinghouse built a breadboard prototype machine in 1964, but the RADC contract ended and Westinghouse decided not to follow it up on its own.
When Solomon ended, the principal investigator, Daniel Slotnick, managed to gain the interest of Burroughs, who at that time were not able to serve the high-end scientific market. However, development of such a machine for an unknown customer base was risky, and Slotnick arranged for the University of Illinois to be both initial customer and development partner. As the performance of the machine was much more than the University could make good use of, it was expected that time on the machine would be rented out to commercial users. In 1964 the University signed a contract with DARPA to fund the effort, which became known as ILLIAC IV, following in line from a number of earlier research machines developed there. Development started in 1965, and a first-pass design was completed in 1966.
In many ways the machine was treated as an experimental design, so it included the most advanced features then available. The logic circuits were based on ECL integrated circuits (ICs), whereas many machines of the era still relied on individual transistors or low-speed ICs. Texas Instruments was contracted for the ECL-based ICs. Each PE was given 2,048 words of 240 ns thin film memory (later replaced with semiconductor memory) for storing results. Burroughs also supplied the specialized hard drives, which featured a separate stationary head for every track, could offer speeds up to 500 Mbit/s and stored about 80 MB per 36" disk. They also provided a Burroughs B6500 mainframe to act as a front-end controller. Connected to the B6500 was a laser optical recording medium, a write-once system that stored up to 1 Tbit on a plastic disk covered with a thin metal film.
The ILLIAC was a 64-bit design, in a pre-ASCII era when 48-bit machines were more common and no word length could be considered "standard". The CU had sixty-four 64-bit registers and another four 64-bit accumulators. The PEs had only six 64-bit registers, each with a special purpose. One of these, RGR, was used for communicating data to neighboring PEs, moving one "hop" per clock cycle. Another, RGD, indicated whether or not that PE was currently active. The PEs had instruction formats for 64-, 32- and 8-bit data, and could be placed into a 32-bit mode that made it appear that there were 128 PEs.
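The roles of the routing and activity registers can be sketched as follows. The function names are hypothetical, and the one-dimensional ring used for routing is a simplifying assumption rather than a description of the machine's real interconnect; the sketch shows only the general idea of masked execution and one-hop data movement.

```python
def masked_add(acc, operand, active):
    """Add `operand` to `acc` only on PEs whose activity flag is set.

    Inactive PEs still see the broadcast instruction but keep their old value.
    """
    return [a + b if on else a for a, b, on in zip(acc, operand, active)]


def route_one_hop(rgr):
    """Move each routing-register value to the next PE along an assumed ring."""
    if not rgr:
        return []
    return [rgr[-1]] + rgr[:-1]


if __name__ == "__main__":
    acc = [1, 2, 3, 4]
    active = [True, False, True, False]  # plays the role of the activity flag
    print(masked_add(acc, [10, 10, 10, 10], active))
    print(route_one_hop(acc))
```

Masking of this kind is how a machine that broadcasts one instruction to every PE can still express data-dependent branches: PEs on the "wrong" side of a condition are simply switched off for those instructions.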
The design goal called for a computer with the ability to process 1 billion floating point operations per second, or in today's terminology, 1 GFLOPS. To do this the basic design would require 256 PEs running on a 13 MHz clock, driven by four CUs. Originally the designers intended to house all 256 PEs in a single large mainframe, but the project quickly ran behind schedule. Instead, the design was modified to divide the PEs into quadrants of 64 with a single CU each, housed in separate cabinets. Eventually it became clear that only one quadrant would become available in any realistic timeframe, reducing performance from 1 GFLOPS to about 200 MFLOPS.
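The figures above can be checked with some simple arithmetic. The calculation below assumes that every PE contributes equally and that performance scales linearly with the number of PEs; both are idealisations, and the function names are illustrative.

```python
TARGET_FLOPS = 1e9   # design goal: 1 GFLOPS
TOTAL_PES = 256      # full machine
QUADRANT_PES = 64    # one quadrant
CLOCK_HZ = 13e6      # 13 MHz clock


def per_pe_rate(target_flops, n_pes):
    """Floating point operations per second each PE must sustain."""
    return target_flops / n_pes


def cycles_per_operation(clock_hz, pe_rate):
    """Average clock cycles available to a PE for each operation."""
    return clock_hz / pe_rate


if __name__ == "__main__":
    rate = per_pe_rate(TARGET_FLOPS, TOTAL_PES)
    print(f"per-PE rate: {rate / 1e6:.2f} MFLOPS")
    print(f"cycles per operation: {cycles_per_operation(CLOCK_HZ, rate):.2f}")
    print(f"one quadrant, ideal: {rate * QUADRANT_PES / 1e6:.0f} MFLOPS")
```

Under these assumptions each PE would need to sustain about 3.9 MFLOPS, or roughly one operation every 3.3 clock cycles, and a single quadrant would ideally deliver a quarter of the target, 250 MFLOPS, which is of the same order as the figure quoted for the reduced machine.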
Work at the University was primarily aimed at ways to efficiently fill the PEs with data. Unless the "problem" being fed into the computer could be parallelized in SIMD fashion, the ILLIAC would be no faster than any other computer, and much slower than designs from companies like Control Data, which featured much higher clock rates. To make parallelization as easy as possible, several new computer languages were created: IVTRAN and TRANQUIL were parallelized versions of FORTRAN, and Glypnir was a similar conversion of ALGOL. Generally these languages provided support for loading arrays of data "across" the PEs to be executed in parallel, and some even supported the unwinding of loops into array operations.
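The idea of unwinding a loop into array operations can be shown with a short sketch. It is written in Python for illustration only and does not reproduce the syntax or behaviour of IVTRAN, TRANQUIL or Glypnir; the function names and the strip-by-strip scheme are assumptions.

```python
def scalar_loop(a, b):
    """Element-by-element sum, as a conventional sequential machine would run it."""
    c = [0.0] * len(a)
    for i in range(len(a)):
        c[i] = a[i] + b[i]
    return c


def array_operation(a, b, n_pes=64):
    """The same sum expressed as whole-strip operations, one strip per PE array.

    Each pass of the outer loop stands for a single broadcast "add" that
    all n_pes processing elements would carry out at the same time.
    """
    c = []
    for start in range(0, len(a), n_pes):
        strip_a = a[start:start + n_pes]
        strip_b = b[start:start + n_pes]
        c.extend(x + y for x, y in zip(strip_a, strip_b))
    return c


if __name__ == "__main__":
    a = [float(i) for i in range(130)]
    b = [2.0 * i for i in range(130)]
    assert scalar_loop(a, b) == array_operation(a, b)
    print("strips needed:", -(-len(a) // 64))
```

With 130 elements and 64 PEs the loop body is issued three times rather than 130, with the final strip leaving most PEs idle; keeping the array full is the problem the University's work set out to address.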
When the computer was being built in the late 1960s, it was met with hostility by protesters who were suspicious of the University's ties with the Department of Defense, and felt that the University had sold out to a conspiracy. The protests reached a boiling point on May 9, 1970, in a day of "Illiaction". Three months after the August 24 bombing at a University of Wisconsin mathematics building, the University of Illinois decided to back out of the project, and have it moved to a more secure location. The work was picked up by NASA, then still cash-flush in the post-Apollo years and interested in almost anything "high tech". NASA formed a new Advanced Computing division, and had the machine moved to Moffett Field, California, home of Ames Research Center.
The move slowed development, and the machine was not completed until 1972. By this time the original $8 million estimated from the first design in 1966 had risen to $31 million, while the performance had dropped even further, from 1 GFLOPS to about 200 MFLOPS to perhaps 100 MFLOPS with peaks of 150. NASA also decided to replace the B6500 with a PDP-10, a machine in common use at Ames, but this required the development of new compilers and support software. When the ILLIAC was finally turned on in 1972 it was found to be barely operable, failing continually. Efforts to improve its reliability allowed it to run its first complete program in 1974, and go into full operation in 1975. Even "full operation" was somewhat limited; the machine was operated only Monday to Friday and had up to 40 hours of planned maintenance a week. The first full application was run on the machine in 1976, the same year the Cray-1 was released with roughly the same performance.
Nevertheless, the ILLIAC was increasingly used over the next few years, and Ames added their own FORTRAN version, CFD. On problems that could be parallelized the machine was still the fastest in the world, outperforming the CDC 7600 by two to six times, and it is generally credited as the fastest machine in the world until 1981. For NASA the machine was "perfect", as its performance was tuned for programs running the same operation on large amounts of data, which is exactly what computational fluid dynamics is all about. The machine was eventually decommissioned in 1982, and NASA's advanced computing division ended with it.
Burroughs was able to use the basic design for only one commercial system, the Parallel Element Processing Ensemble, or PEPE. PEPE was designed to allow high-accuracy tracking of 288 incoming ICBM warheads, each one assigned to a modified PE. Burroughs built only one PEPE system, although a follow-on design was built by Bell Labs.
Although the ILLIAC effort ended in uninspiring results, attempts to understand the reasons for the failure of the ILLIAC IV architecture pushed forward research in parallel computing. During the 1980s a number of companies used the same approach to build even more parallel machines, with compilers that could make better use of the parallelism. The Thinking Machines CM-1 and CM-2 are excellent examples of the "classic" ILLIAC IV concept, although they also included far better interconnectivity between their PEs to avoid the data bottlenecks that had limited the set of problems suitable for the ILLIAC.
Most supercomputers of the era took another approach to higher performance, using a single very high speed vector processor. Similar to the ILLIAC in concept at least, these processor designs loaded up many data elements into a single custom processor instead of a large number of low-powered ones. The classic example of this design is the Cray-1, which had performance similar to the ILLIAC, but was able to provide this high performance on a much wider variety of problems, not just those that were highly parallel. There was more than a little "backlash" against the ILLIAC design as a result, and for some time the supercomputer market looked on massively parallel designs with disdain, even when they were successful. As Seymour Cray famously quipped, "If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?"
But time has proven the ILLIAC approach to be the better one for almost all scientific computing. Today, supercomputers are almost universally made up of large numbers of commodity computers, precisely the concept that the ILLIAC pioneered. Progress in compiler technology explains much of this, although the rapid, and perhaps unexpected, continued improvement in microprocessor design rendered custom vector designs slower in most workloads.