COMPLEXITY TAKES ON RAW POWER AT IEEE CONFERENCE -- RISC face-off coming to Hot Chips
Palo Alto, Calif. - The two dominant schools of RISC-microprocessor performance (architectural complexity and simple, raw power) will stand in the spotlight this week as the IEEE Computer Society hosts the sixth annual Hot Chips conference on the campus of Stanford University.
Architectural sophistication will be represented by the reborn Metaflow Technologies Inc. (La Jolla, Calif.), which will describe its soon-to-ship Thunder Version-8 Sparc CPU. Espousing the virtues of the Spartan approach to power will be NEC Corp.'s System ULSI Research Laboratory (Sagamihara, Japan), which will report on an experimental chip. Both designs achieve record-setting speed, but through very different techniques.
The Metaflow Thunder processor seeks high throughput not from enormous clock frequency but by doing a lot of work on each clock cycle, a path pioneered by Metaflow with its Lightning processor several years ago. Though the disintegration of the relationship between Metaflow and foundry LSI Logic Corp. doomed Lightning never to be built, the architecture nonetheless introduced concepts that have since been taken up by virtually all superscalar adherents, most notably by the designers of the PowerPC 604.
Those concepts include a large number of function units working in parallel, out-of-order instruction issue, and heavy use of speculative execution. In effect, with enough function units sitting around, it makes sense to execute every instruction you can find, just in case you can use the results.
The Thunder CPU goes further down that path than any other announced processor. It comprises eight units: three integer, two floating-point, a branch unit and two memory/bypass units. Three integer, two floating-point and one branch instruction can be issued in a clock cycle.
Since it is unlikely that even optimized code will present the CPU with that combination of instructions on each cycle, elaborate techniques must be used to keep the pipelines full. Those include out-of-order issue, speculative execution and the use of non-blocking functional units. Even time-consuming operations such as square root in the floating-point unit do not block the FPU pipelines. Uniquely, the processor uses a cache and memory protocol that permits it to process cache misses out of order, so that a cache miss doesn't result in the CPU's stopping to wait for memory. Memory transactions may be split so that the response to a load, for instance, doesn't have to come back from memory before the next memory cycle is issued.
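To see why the peak issue rate is hard to sustain, the per-cycle limits quoted above (three integer, two floating-point and one branch instruction) can be modeled as a set of slots. The following is a minimal sketch, not Metaflow's issue logic: the function name, the class labels and the oldest-first selection are assumptions for illustration, and data dependencies and memory operations are ignored.

```python
from collections import Counter

# Per-cycle issue limits quoted in the article. Memory operations are left
# out because the article gives no issue limit for them.
ISSUE_LIMITS = {"int": 3, "fp": 2, "branch": 1}


def issue_one_cycle(window):
    """Pick instructions oldest-first until each class runs out of slots.

    `window` is a list of class labels ("int", "fp", "branch").
    Returns (issued, deferred); dependencies are ignored in this sketch.
    """
    used = Counter()
    issued, deferred = [], []
    for kind in window:
        # An unmodeled class (for example "mem") raises KeyError here.
        if used[kind] < ISSUE_LIMITS[kind]:
            used[kind] += 1
            issued.append(kind)
        else:
            deferred.append(kind)
    return issued, deferred


# Hypothetical integer-heavy window: five integer ops and one FP op.
issued, deferred = issue_one_cycle(["int"] * 5 + ["fp"])
print(len(issued), "issued,", len(deferred), "deferred")
```

With this integer-heavy window only four of the six available slots are filled, which is the kind of mismatch the out-of-order and speculative techniques are meant to hide.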
Thunder controls the execution of instructions with a data-flow model: When the resources that are required by an instruction are available, it will proceed. That permits the machine to issue instructions out of order, in the confidence that they will complete in the correct order after having done the right things to the right data. Precise traps and interrupts are maintained.
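The data-flow idea can be illustrated with a toy scheduler: an instruction starts as soon as its source operands are available, but results retire strictly in program order. This is a sketch under simplifying assumptions (unlimited function units, each register written once, invented latencies and register names); it is not a description of Thunder's actual control logic.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(eq=False)  # compare by identity so list.remove() is unambiguous
class Instr:
    name: str
    dest: str
    srcs: Tuple[str, ...]
    latency: int                   # cycles until the result is available
    done_at: Optional[int] = None  # filled in when the instruction issues


def demo_program():
    # Hypothetical sequence: a slow square root followed by quick integer work.
    return [
        Instr("sqrt", "f1", ("f0",), 10),
        Instr("add", "r3", ("r1", "r2"), 1),
        Instr("mul", "f2", ("f1",), 3),   # must wait for sqrt
        Instr("sub", "r4", ("r3",), 1),   # only needs add
    ]


def simulate(program, initial_regs, max_cycles=1000):
    """Return (issue_order, retire_order) as lists of instruction names."""
    ready_at = {reg: 0 for reg in initial_regs}  # register -> cycle available
    pending = list(program)
    issue_order, retire_order = [], []
    head = 0    # index of the oldest instruction not yet retired
    cycle = 0
    while head < len(program):
        if cycle > max_cycles:
            raise RuntimeError("some operands never became available")
        # Data-flow issue: start anything whose operands are ready now.
        for ins in list(pending):
            if all(ready_at.get(s, float("inf")) <= cycle for s in ins.srcs):
                ins.done_at = cycle + ins.latency
                ready_at[ins.dest] = ins.done_at
                issue_order.append(ins.name)
                pending.remove(ins)
        cycle += 1
        # In-order retirement: only the oldest finished instruction may leave.
        while head < len(program):
            oldest = program[head]
            if oldest.done_at is None or oldest.done_at > cycle:
                break
            retire_order.append(oldest.name)
            head += 1
    return issue_order, retire_order


issue_order, retire_order = simulate(demo_program(), ["f0", "r1", "r2"])
print("issued: ", issue_order)
print("retired:", retire_order)
```

In this example the subtract issues ahead of the multiply, which is still waiting on the square root, yet it retires after it. Holding results until older instructions finish is what makes precise traps possible.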
Keeping the pipelines full also requires heavy use of speculative execution, according to Metaflow vice president of development Bruce Lightner. Like some other recent machines, Thunder predicts the outcome of conditional branches and executes beyond them. If the prediction proves wrong, the machine quickly abandons the results from the speculative execution; if the prediction proves right, the machine uses the results.
But Thunder carries the concept further than other announced machines. It can perform memory renaming, keeping several alternative values for the same memory location while it is engaged in speculative execution. That allows speculative execution beyond store operations, a task that in other machines would force the CPU to wait. In an application of the speculative execution and repair mechanism, the processor is capable of speculative memory reads in a coherent multiprocessing environment. The CPU can read a memory location and begin using the data, only to find out later that what it read from the cache was not the current value of the data. The CPU then repairs the affected state and starts again with the correct data.
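One way to picture memory renaming is a buffer that holds several not-yet-committed values for the same address, each tagged with its position in program order. The sketch below is an illustration only: the class, the sequence-number tags and the squash/commit interface are assumptions, not Metaflow's design, and stores are assumed to arrive in program order.

```python
class RenamedMemory:
    """Toy model: committed memory plus a list of speculative stores."""

    def __init__(self, initial=None):
        self.committed = dict(initial or {})
        self.pending = []  # (seq, addr, value), appended in program order

    def store(self, seq, addr, value):
        # A speculative store does not touch committed memory.
        self.pending.append((seq, addr, value))

    def load(self, seq, addr):
        # Return the youngest pending store that is older than this load;
        # fall back to committed memory if there is none.
        for s, a, v in reversed(self.pending):
            if a == addr and s < seq:
                return v
        return self.committed.get(addr)

    def squash(self, from_seq):
        # Misprediction: discard every speculative store at or after from_seq.
        self.pending = [e for e in self.pending if e[0] < from_seq]

    def commit(self, up_to_seq):
        # Prediction confirmed: make stores up to up_to_seq architectural.
        remaining = []
        for s, a, v in self.pending:
            if s <= up_to_seq:
                self.committed[a] = v
            else:
                remaining.append((s, a, v))
        self.pending = remaining


mem = RenamedMemory({0x100: 1})
mem.store(1, 0x100, 2)       # speculative store
mem.store(3, 0x100, 3)       # second speculative store to the same address
print(mem.load(2, 0x100))    # sees the value from store 1
print(mem.load(4, 0x100))    # sees the value from store 3
mem.squash(3)                # the path containing store 3 was mispredicted
print(mem.load(4, 0x100))    # falls back to the value from store 1
```

Because each speculative store keeps its own copy, execution can continue past a store without waiting to learn whether the store will survive.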
All of that complexity is neither experimental nor marginal. Lightner reported that Metaflow already has silicon, fabricated by VLSI Technology Inc., in 0.8-micron and 0.6-micron CMOS. A cost-reducing move to a new foundry (sources outside Metaflow suggested the foundry is IBM Microelectronics) will produce 0.5-micron, four-metal CMOS chips that Metaflow hopes to have available in samples by the end of the year.
The target for the half-micron silicon is 80 MHz, not a breathtaking frequency by modern standards. But here, Metaflow's architectural work shines. At 80 MHz, according to the company's estimates, the chip set will approach 200 SPECint92 and 350 SPECfp92. ``It is always difficult to make estimates like these before you actually load the Unix code and run it,'' Lightner warned. ``But we believe these figures are accurate.''
The processor will be packaged in ball-grid-array technology as a three-chip set: an integer processor, floating-point processor and cache/memory controller. In addition to on-chip instruction caches, the chip set will employ a 1-Mbyte external cache, fabricated from Pentium-type, 9-ns synchronous SRAMs. ``Using these parts, which are becoming commodity SRAMs, reduces costs considerably compared with using the low-volume SRAMs designed for the Supersparc processor,'' Lightner said.
The three processor chips and cache will be fit together on an MBus CPU board, making Thunder available as a plug-in upgrade to existing Sparcstations. In that market, its very high performance at relatively low clock frequency (and hence moderate system cost) could cause some disruption. By comparison, the PowerPC 604, IBM's latest and most aggressive superscalar processor, is scheduled to develop 160 SPECint92 and 165 SPECfp92 at 100 MHz. The Sun Microsystems Inc./Texas Instruments Inc. Ultrasparc-I, which depends on the revised Sparc Version-9 instruction set for at least some of its speed, is a four-issue machine targeting 250 to 300 SPECint and 300 to 350 SPECfp with a clock in the 150- to 200-MHz range. Thunder and the 604 are expected to sample in the fourth quarter; Ultrasparc is expected in early 1995.
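The comparison is easier to read as performance per megahertz. The short calculation below uses only the figures quoted in this article, which are vendor estimates and targets rather than measurements. For Ultrasparc-I the article gives ranges without pairing scores to clock rates, so the sketch reports only the loosest bounds (lowest score at the highest clock, highest score at the lowest clock) and assumes its SPECint and SPECfp figures are on the same SPEC92 scale as the others.

```python
def per_mhz(score, mhz):
    """Benchmark score divided by clock frequency in MHz."""
    return score / mhz


# Figures as quoted in the article (estimates and targets, not measurements).
chips = {
    "Thunder":     {"specint": 200, "specfp": 350, "mhz": 80},
    "PowerPC 604": {"specint": 160, "specfp": 165, "mhz": 100},
}

for name, c in chips.items():
    print(f"{name}: {per_mhz(c['specint'], c['mhz']):.2f} SPECint92/MHz, "
          f"{per_mhz(c['specfp'], c['mhz']):.2f} SPECfp92/MHz")


def range_bounds(score_lo, score_hi, mhz_lo, mhz_hi):
    """Loosest possible per-MHz bounds when only ranges are known."""
    return per_mhz(score_lo, mhz_hi), per_mhz(score_hi, mhz_lo)


# Ultrasparc-I: 250 to 300 SPECint at 150 to 200 MHz.
ultra_int = range_bounds(250, 300, 150, 200)
print(f"Ultrasparc-I: between {ultra_int[0]:.2f} and {ultra_int[1]:.2f} SPECint/MHz")
```

On these numbers Thunder's estimate works out to 2.5 SPECint92 per MHz against 1.6 for the 604, which is the gap the article attributes to Metaflow's architectural work.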
Metaflow's resurgence is something of a phoenix story. Disabled by loss of foundry and funding when the company split from partner LSI Logic, Metaflow was revived by Hyundai Electronics Industrial Co., a deep-pockets company deeply interested in the Sparc-workstation market. With Hyundai's support, work continued; the Lightning design evolved into the Thunder processor, silicon was prototyped and now the design is moving toward its eventual, 0.5-micron implementation.
While Metaflow, IBM and Sun pursue increasing superscalar complexity and increasing SPECmarks/MHz, researchers at NEC in Japan have been investigating another path: how to make a simple RISC execution unit run as fast as possible. Project engineer Kazumasa Suzuki will detail the results at Hot Chips.
NEC will describe a 500-MHz MIPS-instruction-set processor chip fabricated in NEC's 0.4-micron CMOS process. ``This chip was only an experiment,'' Suzuki noted, ``so we implemented minimal functions on the die.
``We execute the full MIPS instruction set. But we have very small caches: there is only 1 kbyte for instructions and 1 kbyte for data. And we eliminated some of the more complicated controls from the pipeline, like pipeline hold and register forwarding. That made it possible to clock the pipeline at a very high speed.''
Starting with a five-stage pipeline, the NEC researchers expanded several of the stages to give them an extra clock cycle. The result, according to Suzuki, is a streamlined eight-stage pipe that clocks at 500 MHz.
The memory interface is a synchronous SRAM bus designed to run at 125 MHz or 250 MHz. To achieve that speed, NEC implemented a proprietary 1-V I/O system. Special 1-V synchronous SRAMs that can hit the necessary 4-ns cycle time are planned but have not yet been built.
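The 4-ns figure follows directly from the faster bus rate, since cycle time is the reciprocal of frequency. A small sketch of the arithmetic, using only the frequencies quoted in the article (the function and constant names are illustrative):

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


CORE_MHZ = 500          # pipeline clock quoted in the article
BUS_MHZ = (125, 250)    # the two SRAM bus rates quoted in the article

print(f"core: {period_ns(CORE_MHZ):.0f} ns per cycle")
for bus in BUS_MHZ:
    # Number of core clocks that elapse during one bus cycle.
    ratio = CORE_MHZ // bus
    print(f"bus at {bus} MHz: {period_ns(bus):.0f} ns per cycle, "
          f"{ratio} core cycles per bus cycle")
```

At 250 MHz the bus cycle is 4 ns, or two core clocks; at 125 MHz it is 8 ns, or four core clocks.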
Since the work was experimental, Suzuki said, there was no attempt to provide the complex control functions and large primary caches that would boost system performance. He estimated that the current chip would probably only achieve about 50 SPECmarks.
``Because of the small caches, the replacement time on cache misses would be very long,'' Suzuki said.
But he pointed out that the entire CPU die is less than 8 mm x 9 mm, leaving plenty of room for enlarged primary caches or even for multiple processor elements.
Other processor papers at the conference will for the most part fit into either the complexity or simplicity category. Digital Equipment Corp. will discuss internals of the Alpha 21164, an architecture relying heavily on Digital's ability to extract 200-MHz-plus speeds from its proprietary CMOS processes. In contrast, IBM will detail the previously announced 604 and Power2+ CPUs, Motorola will describe the first superscalar 68000 (the 68060) and Intel will unveil more innards of the P54C.
The growing miniRISC movement will also be represented, though the vendors involved might not be comfortable with the name Hot Chips. Intel will describe the P100, the latest member of the 960 family, and Hitachi will detail the workings of the SH-II CPU. The latter, the first announced processor to incorporate a synchronous-DRAM controller, is locked in a close contest with the Advanced RISC Machines ARM-7 architecture for the title of highest Mips/mW.
Specialized sessions this year will cover node processors, connection chips and encryption engines; network and modem chips; and core logic for the new generation of processors. A graphics session will be dominated by talk of 3-D, as 3Dlabs, Apple Computer Inc., Digital and Sun all describe specialized 3-D hardware, mostly for economical rendering at moderate to very high rates.
Video compression will be on display as well, with the programmable (or semi-programmable) codec enthusiasts holding the floor. Array Microsystems will show its digital signal processor (DSP)-based approach to video coding, while Hewlett-Packard Co.'s Ruby Lee will argue the case for multimedia extensions to the PA-RISC family. Integrated Information Technology, one of the first developers of a semi-programmable MPEG-1 codec, will describe an application for transcoding between H.320 and Indeo formats.