Out-of-order execution

Paradigm used in most high-performance central processing units to make use of instruction cycles that would otherwise be wasted.

163 related topics


Instruction pipelining

Technique for implementing instruction-level parallelism within a single processor.

Generic 4-stage pipeline; the colored boxes represent instructions independent of each other
A bubble in cycle 3 delays execution.

The processor can locate other instructions which are not dependent on the current ones and which can be immediately executed without hazards, an optimization known as out-of-order execution.
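
The effect described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the four-instruction program, the register names, the latencies and the `issue_cycles` function are all hypothetical, the machine issues at most one instruction per cycle, and every destination register is written only once so that only true data dependencies matter.

```python
# Hypothetical 4-instruction program: (destination, sources, operation).
# Every destination register is written once, so only true (read-after-write)
# dependencies exist; this is an assumption that keeps the model small.
PROGRAM = [
    ("r1", ("r0",), "load"),       # 0: r1 = memory[r0]
    ("r2", ("r1", "r0"), "add"),   # 1: r2 = r1 + r0   (needs the load)
    ("r3", ("r4", "r5"), "mul"),   # 2: r3 = r4 * r5   (independent)
    ("r6", ("r3", "r4"), "add"),   # 3: r6 = r3 + r4   (needs the mul)
]
LATENCY = {"load": 3, "add": 1, "mul": 2}  # assumed cycles until a result is usable


def issue_cycles(program, latency, out_of_order):
    """Return {instruction index: issue cycle} for a one-issue-per-cycle machine."""
    # Results of not-yet-issued instructions are unavailable;
    # registers that no instruction writes are ready from cycle 0.
    ready_at = {dest: float("inf") for dest, _, _ in program}
    pending = list(range(len(program)))
    issued = {}
    cycle = 0
    while pending:
        # In-order issue may only consider the oldest pending instruction.
        candidates = pending if out_of_order else pending[:1]
        for i in candidates:
            dest, sources, op = program[i]
            if all(ready_at.get(s, 0) <= cycle for s in sources):
                issued[i] = cycle
                ready_at[dest] = cycle + latency[op]
                pending.remove(i)
                break  # at most one instruction issues per cycle
        cycle += 1  # if nothing issued, this cycle was a bubble
    return issued


in_order = issue_cycles(PROGRAM, LATENCY, out_of_order=False)
reordered = issue_cycles(PROGRAM, LATENCY, out_of_order=True)
print(in_order)
print(reordered)
```

In this toy model the in-order machine stalls behind the load and issues its last instruction in cycle 6, while the out-of-order machine fills one of the bubbles with the independent multiply and issues its last instruction in cycle 4.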

Pipeline (computing)

Set of data processing elements connected in series, where the output of one element is the input of the next one.

Instruction pipelines, such as the classic RISC pipeline, are used in central processing units (CPUs) and other microprocessors to allow overlapping execution of multiple instructions with the same circuitry. The circuitry is usually divided into stages, and each stage processes a specific part of one instruction at a time, passing the partial results to the next stage. Examples of stages are instruction decode, arithmetic/logic and register fetch. Instruction pipelines are related to the technologies of superscalar execution, operand forwarding, speculative execution and out-of-order execution.
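
The overlap between stages can be made concrete with a short sketch. The stage names in `STAGES` and the `occupancy` function are assumptions chosen for illustration; the model is an ideal pipeline with no stalls, in which each instruction advances one stage per clock cycle.

```python
# Assumed stage names for a generic four-stage pipeline (illustrative only).
STAGES = ["fetch", "decode", "execute", "write-back"]


def occupancy(num_instructions, stages=STAGES):
    """Return one row per clock cycle.

    row[s] is the index of the instruction currently in stage s,
    or None when that stage is empty.
    """
    total_cycles = num_instructions + len(stages) - 1
    table = []
    for cycle in range(total_cycles):
        row = []
        for s in range(len(stages)):
            instr = cycle - s  # instruction i enters stage s in cycle i + s
            row.append(instr if 0 <= instr < num_instructions else None)
        table.append(row)
    return table


table = occupancy(3)
for cycle, row in enumerate(table):
    print(cycle, row)
```

Each row shows several instructions in flight at once, each in a different stage, which is the sense in which the same circuitry executes multiple instructions in an overlapped fashion.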

Pentium Pro

Sixth-generation x86 microprocessor developed and manufactured by Intel and introduced on November 1, 1995.

Block Diagram of the Pentium Pro's Microarchitecture
200 MHz Pentium Pro with a 512 KB L2 cache in PGA package
200 MHz Pentium Pro with a 1 MB L2 cache in PPGA package.
Uncapped Pentium Pro 256 KB
Pentium II Overdrive with heatsink removed. Flip-chip Deschutes core is on the left. 512 KB cache is on the right.

The Pentium Pro featured out-of-order execution, including speculative execution via register renaming.
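
Register renaming in general can be sketched as follows. This is a generic illustration, not a description of the Pentium Pro's actual hardware: the `rename` function, the three-instruction program and the `p0`, `p1`, ... physical register names are all hypothetical.

```python
from itertools import count


def rename(program):
    """Rewrite (dest, sources) pairs so every write gets a fresh physical register."""
    fresh = (f"p{n}" for n in count())
    mapping = {}  # architectural register -> physical register holding its latest value
    renamed = []
    for dest, sources in program:
        phys_sources = []
        for s in sources:
            if s not in mapping:
                # First use of a register: allocate a physical register for its initial value.
                mapping[s] = next(fresh)
            phys_sources.append(mapping[s])
        # A new physical register per write removes write-after-write and
        # write-after-read conflicts on the architectural name.
        mapping[dest] = next(fresh)
        renamed.append((mapping[dest], tuple(phys_sources)))
    return renamed


program = [
    ("r1", ("r2", "r3")),  # r1 = r2 + r3
    ("r4", ("r1",)),       # r4 = r1        (true dependency on the first write of r1)
    ("r1", ("r5", "r6")),  # r1 = r5 * r6   (reuses the name r1)
]
renamed = rename(program)
for before, after in zip(program, renamed):
    print(before, "->", after)
```

After renaming, the two writes to `r1` target different physical registers, so the third instruction no longer has to wait for the second one to read the old value; only the true dependency between the first two instructions remains.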

Very long instruction word

Very long instruction word (VLIW) refers to instruction set architectures designed to exploit instruction level parallelism (ILP).

One instruction may have several fields, which identify the logical operation, and may also include source and destination addresses and constant values. This is the MIPS "Add Immediate" instruction, which allows selection of source and destination registers and inclusion of a small constant.

The traditional means to improve performance in processors include dividing instructions into substeps so the instructions can be executed partly at the same time (termed pipelining), dispatching individual instructions to be executed independently, in different parts of the processor (superscalar architectures), and even executing instructions in an order different from the program (out-of-order execution).

Parallel computing

Type of computation in which many calculations or processes are carried out simultaneously.

IBM's Blue Gene/P massively parallel supercomputer
A graphical representation of Amdahl's law. The speedup of a program from parallelization is limited by how much of the program can be parallelized. For example, if 90% of the program can be parallelized, the theoretical maximum speedup using parallel computing would be 10 times no matter how many processors are used.
Assume that a task has two independent parts, A and B. Part B takes roughly 25% of the time of the whole computation. By working very hard, one may be able to make this part 5 times faster, but this only reduces the time for the whole computation by a little. In contrast, one may need to perform less work to make part A twice as fast. This will make the computation much faster than by optimizing part B, even though part B's speedup is greater by ratio (5 times versus 2 times).
Taiwania 3 of Taiwan, a parallel supercomputing device that joined COVID-19 research.
A canonical processor without a pipeline. It takes five clock cycles to complete one instruction, so the processor delivers subscalar performance (fewer than one instruction per cycle).
A canonical five-stage pipelined processor. In the best case scenario, it takes one clock cycle to complete one instruction, so the processor delivers scalar performance.
A canonical five-stage pipelined processor with two execution units. In the best case scenario, it takes one clock cycle to complete two instructions, so the processor delivers superscalar performance.
A logical view of a non-uniform memory access (NUMA) architecture. Processors in one directory can access that directory's memory with lower latency than they can access memory in the other directory.
A Beowulf cluster
A cabinet from IBM's Blue Gene/L massively parallel supercomputer
Nvidia's Tesla GPGPU card
The Cray-1 is a vector processor
ILLIAC IV, "the most infamous of supercomputers"
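
The two Amdahl's law captions above can be checked numerically. The sketch below uses the standard form of the law, speedup = 1 / ((1 - p) + p / s), where p is the fraction of run time affected and s is the factor by which that fraction is sped up; the function and variable names are chosen here for illustration.

```python
import math


def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the original run time is made `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# 90% parallelizable, unlimited processors: the serial 10% caps the speedup near 10.
limit = amdahl_speedup(0.90, math.inf)

# Part B is 25% of the run time and becomes 5 times faster.
speedup_b = amdahl_speedup(0.25, 5)

# Part A is the remaining 75% and becomes 2 times faster.
speedup_a = amdahl_speedup(0.75, 2)

print(round(limit, 2), round(speedup_b, 2), round(speedup_a, 2))
```

Making part B five times faster yields an overall speedup of only about 1.25, while making part A twice as fast yields about 1.6, which is why the larger part is the better optimization target.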

These instructions can be re-ordered and combined into groups which are then executed in parallel without changing the result of the program.
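
One simple way to form such groups is sketched below. It assumes a hypothetical five-instruction program in which each register is written at most once, and it places every instruction in the earliest group that comes after the groups producing its inputs; the `parallel_groups` function and the instruction names are illustrative only.

```python
def parallel_groups(program):
    """Group instructions so that members of one group do not depend on each other.

    program: list of (name, dest, sources); each dest is assumed to be written once.
    """
    group_of_reg = {}  # register -> index of the group that produces it
    groups = []
    for name, dest, sources in program:
        # Earliest legal group: one past the latest group producing any source.
        level = max(
            (group_of_reg[s] + 1 for s in sources if s in group_of_reg),
            default=0,
        )
        if level == len(groups):
            groups.append([])
        groups[level].append(name)
        group_of_reg[dest] = level
    return groups


program = [
    ("a", "r1", ()),            # r1 = load x
    ("b", "r2", ()),            # r2 = load y
    ("c", "r3", ("r1", "r2")),  # r3 = r1 + r2
    ("d", "r4", ()),            # r4 = load z
    ("e", "r5", ("r3", "r4")),  # r5 = r3 * r4
]
groups = parallel_groups(program)
print(groups)
```

Instruction `d` is moved ahead of `c` into the first group because it depends on nothing, and the program's result is unchanged because every instruction still runs after the instructions that produce its inputs.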

Intel Atom

Brand name for a line of IA-32 and x86-64 instruction set ultra-low-voltage processors by Intel Corporation, designed to consume and dissipate less power than ordinary processors of the Intel Core series.

Logo since 2020
Intel Atom N2800

This enables relatively good performance with only two integer ALUs, and without any instruction reordering, speculative execution, or register renaming.


AMD K5

AMD's first x86 processor to be developed entirely in-house.

An AMD K5 PR166 microprocessor
K5 core diagram
AMD 5K86-P90 (SSA/5)
AMD K5 PR75 (SSA/5) die shot
AMD K5 PR150 (5k86) die shot

All models had 4.3 million transistors, with five integer units that could process instructions out of order and one floating-point unit.

Central processing unit

Electronic circuitry that executes instructions comprising a computer program.

EDVAC, one of the first stored-program computers
IBM PowerPC 604e processor
Fujitsu board with SPARC64 VIIIfx processors
CPU, core memory and external bus interface of a DEC PDP-8/I, made of medium-scale integrated circuits
Inside of laptop, with CPU removed from socket
Block diagram of a basic uniprocessor-CPU computer. Black lines indicate data flow, whereas red lines indicate control flow; arrows indicate flow directions.
Symbolic representation of an ALU and its input and output signals
A six-bit word containing the binary encoded representation of decimal value 40. Most modern CPUs employ word sizes that are a power of two, for example 8, 16, 32 or 64 bits.
Model of a subscalar CPU, in which it takes fifteen clock cycles to complete three instructions
Basic five-stage pipeline. In the best case scenario, this pipeline can sustain a completion rate of one instruction per clock cycle.
A simple superscalar pipeline. By fetching and dispatching two instructions at a time, a maximum of two instructions per clock cycle can be completed.
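
The three models in the captions above can be compared with a short calculation. This is a best-case sketch that ignores hazards and stalls; the `cycles` function and its parameter names are assumptions made for illustration.

```python
import math


def cycles(num_instructions, stages=5, width=1, pipelined=True):
    """Best-case clock cycles needed to complete `num_instructions`.

    width is the number of instructions fetched and dispatched per cycle.
    """
    if not pipelined:
        # Each instruction occupies the whole processor for `stages` cycles.
        return num_instructions * stages
    # The first group needs `stages` cycles; each later group completes one cycle after the previous.
    return stages + math.ceil(num_instructions / width) - 1


subscalar = cycles(3, pipelined=False)   # the fifteen-cycle case from the caption
scalar = cycles(1000)                    # five-stage pipeline, one instruction per cycle
superscalar = cycles(1000, width=2)      # two instructions fetched and dispatched per cycle

print(subscalar, scalar, superscalar)
print(round(1000 / scalar, 3), round(1000 / superscalar, 3))  # instructions per cycle
```

For long instruction streams the pipeline fill time becomes negligible, so the completion rate approaches one instruction per cycle for the basic pipeline and two for the two-wide superscalar pipeline.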

It also makes hazard-avoiding techniques like branch prediction, speculative execution, register renaming, out-of-order execution and transactional memory crucial to maintaining high levels of performance.

IBM System/360 Model 91

Announced in 1964 as a competitor to the CDC 6600.

System/360 Model 91 Panel at the Goddard Space Flight Center
Front Panel of the Model 91. Currently on display at the Living Computer Museum in Seattle, Washington.

Functionally, the Model 91 ran like any other large-scale System/360, but the internal organization was the most advanced of the System/360 line, and it was the first IBM computer to support out-of-order instruction execution.

PowerPC 600

The first family of PowerPC processors built.

The PowerPC 601 prototype reached first silicon in October 1992
An 80 MHz PowerPC 601
An IBM manufactured 90 MHz PowerPC 601v. Notice the slightly smaller die.
A 100 MHz Motorola PowerPC 603 in a wire bond Quad Flat Package
A 200 MHz Motorola PowerPC 603 in a ceramic Ball Grid Array packaging
IBM PPC603ev, 200 MHz
A 233 MHz Motorola PowerPC 604e mounted on a Phase5 CyberstormPPC processor card for the Commodore Amiga 4000 series computers
A 200 MHz IBM PowerPC 604e processor on the CPU module of an Apple Network Server 700

Six execution units: two simple and one complex integer units, one floating-point unit, one branch-processing unit managing out-of-order execution and one load/store unit.