Modern CPUs use a number of advanced techniques to improve processor performance, including the following:

  • Memory caching
  • Pipelining
  • Branch prediction and speculative execution
  • Multiprocessing

Cache System

The cache memory is a buffer that allows the 80386 to function more efficiently with lower DRAM speeds.

The cache memory improves the overall performance of the memory system for data that is accessed more than once.

  • Cache is a high-speed memory system that is placed between the microprocessor and the DRAM memory system.
  • Cache memory devices are usually static RAM memory components with access times of less than 10ns.
  • Internal cache is on the same IC as the CPU.
  • External cache is on the motherboard.
  • Advanced microprocessors have 3 levels of internal caches.
  • Lower level internal caches => closer to the CPU / smaller size / more expensive / faster

For an 80386-based system, a 256K-byte cache provides the best cost-performance trade-off for the operating speed of the system.

  • A cache hit occurs when the data requested by the CPU is available in the cache.
  • A cache miss occurs when the data requested by the CPU is not available in the cache.

Structure of a cache system

The system memory space is partitioned into a number of blocks.

  • Blocks are grouped into a number of sets
    • the blocks in the same set compete with each other
      • to occupy a fixed number of block buffers (= number of ways) in the cache.
  • The data movement between cache and system memory is block-oriented.

Memory Space:

  • Partition Memory Space into blocks
  • Divide blocks into groups. Each group is called a set.
  • All blocks in the same set from the Memory space compete for a place in the same set in the cache space.

Cache Space:

  • Partition cache space into blocks (same block size). Each block is called a way.
  • Divide the blocks into groups (the same number of groups as in the memory space, i.e., sets).

When accessing a memory location, an address is given.

Tag address (m bits) | Set address (n bits) | Byte address (k bits)

  1. Set address (n bits) tells which set
  2. Tag address (m bits) tells which block in the set
  3. Byte address (k bits) tells which byte in the block

Their lengths determine the structure of the cache system.

$2^m$ blocks in a set

$2^n$ sets in Memory space

$2^{n+m}$ blocks in Memory space

$2^k$ bytes in a block

For example,

32-bit memory address:

  • Tag address: 20 bits => $2^{20}$ = 1M blocks/set
  • Set address: 8 bits => $2^8$ = 256 sets
  • Byte address: 4 bits => $2^4$ = 16 bytes/block
  • System memory size = $2^{20+8} = 2^{28}$ blocks
  • Cache size = $2 \times 2^8$ blocks (2 ways × 256 sets for a 2-way cache)
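As a minimal sketch, assuming the 20/8/4 bit split from this example, the three address fields can be extracted as follows:

```python
# Minimal sketch: split a 32-bit address into tag / set / byte fields.
# Bit widths follow the example above (20-bit tag, 8-bit set, 4-bit byte).
TAG_BITS, SET_BITS, BYTE_BITS = 20, 8, 4

def split_address(addr):
    byte = addr & ((1 << BYTE_BITS) - 1)                      # lowest 4 bits
    set_index = (addr >> BYTE_BITS) & ((1 << SET_BITS) - 1)   # next 8 bits
    tag = addr >> (BYTE_BITS + SET_BITS)                      # remaining 20 bits
    return tag, set_index, byte

tag, set_index, byte = split_address(0x1234_5678)
print(hex(tag), set_index, byte)   # 0x12345, set 103, byte 8
```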

For a 4-way cache system, there are 4 block buffers per set for storing data in the cache.

The fewer the ways, the lower the hit rate, but the simpler the circuitry.

A cache entry = a cache directory entry and corresponding cache memory entry

cache directory entry = bookkeeping information, such as the tag and valid bit, that identifies what is stored in the cache

The actual data itself is stored as a cache memory entry in a buffer.

3-bit LRU entry: determines which buffer in the selected set should be replaced after a cache miss.


How does it work?

To determine if a requested datum is in the cache:

  • Determine the set address and the tag address of the datum.
  • Check the tag fields of all cache directory entries of the corresponding set against the tag address simultaneously.
  • If there is a match in any one of the tag fields and the corresponding valid bit is active, it is a cache hit; otherwise, it is a cache miss.
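A minimal sketch of this lookup, with an illustrative directory layout (the valid/tag field names are assumptions, not a specific processor's format):

```python
# Sketch: check every way of the selected set for a matching, valid tag.
def is_hit(directory, set_index, tag):
    """Return the matching way number on a cache hit, or None on a miss."""
    for way, entry in enumerate(directory[set_index]):
        if entry["valid"] and entry["tag"] == tag:
            return way          # cache hit in this way
    return None                 # cache miss

# Example: a 4-way set where only way 0 holds valid data with tag 0x12345
directory = {5: [{"valid": True, "tag": 0x12345}] + [{"valid": False, "tag": 0}] * 3}
print(is_hit(directory, 5, 0x12345))   # 0    -> hit in way 0
print(is_hit(directory, 5, 0xABCDE))   # None -> miss
```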

Read Policy

If the data is in the cache:

  • Forward it to the CPU.

If the data is not in the cache, either:

  • Load through: forward the requested word to the CPU as the cache line is being filled, or
  • Fill the cache line first and then forward the word.

Write Policy

If the data is in the cache, either:

  • Write through: write the data to both the cache and main memory, or
  • Write back: write the data to the cache only, deferring the main-memory write until the block is flushed.

If the data is not in the cache, either:

  • Write allocate: bring the line into the cache, then update it, or
  • Write no-allocate: update main memory only.
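As a rough sketch of how these choices differ, the fragment below applies each policy on a write; the cache, memory, and dirty dictionaries are invented stand-ins, not a real memory system:

```python
# Illustrative sketch of the write policies above (data structures are made up).
cache, memory, dirty = {}, {}, {}

def write(addr, value, policy="write_back", allocate=True):
    if addr in cache:                      # write hit
        cache[addr] = value
        if policy == "write_through":
            memory[addr] = value           # update both cache and main memory
        else:                              # write back
            dirty[addr] = True             # defer main-memory write until flush
    else:                                  # write miss
        if allocate:                       # write allocate: bring line in, then update
            cache[addr] = memory.get(addr, 0)
            write(addr, value, policy, allocate)
        else:                              # write no-allocate: update memory only
            memory[addr] = value

def flush(addr):
    """Write a dirty block back to main memory when it is evicted."""
    if dirty.pop(addr, False):
        memory[addr] = cache[addr]
```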

Replacement policy

  • Random
  • LIFO – Last-in-first-out
  • FIFO – First-in-first-out
  • LRU – least recently used
  • LFU – least frequently used
  • Ideal – replace the block that will not be used again (or not for the longest time) in the future.
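As a rough sketch of victim selection under some of these policies (the per-way fields load_time, last_used, and use_count are assumed bookkeeping, updated elsewhere on every access):

```python
import random

# Sketch: choose which way of a full set to replace under different policies.
def choose_victim(ways, policy):
    if policy == "random":
        return random.randrange(len(ways))
    if policy == "fifo":                  # oldest loaded block
        return min(range(len(ways)), key=lambda w: ways[w]["load_time"])
    if policy == "lru":                   # least recently used block
        return min(range(len(ways)), key=lambda w: ways[w]["last_used"])
    if policy == "lfu":                   # least frequently used block
        return min(range(len(ways)), key=lambda w: ways[w]["use_count"])
    raise ValueError("unknown policy")

ways = [{"load_time": 3, "last_used": 9, "use_count": 4},
        {"load_time": 1, "last_used": 7, "use_count": 8}]
print(choose_victim(ways, "lru"))   # 1 (least recently used)
```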

Basically:

When accessing a memory location, an address is given.

Tag address (m bits) | Set address (n bits) | Byte address (k bits)

If the tag is found in one of the ways => Hit++, and the usage/frequency counter of that way is updated.

else => Miss++; depending on the replacement policy, the tag in the chosen way is replaced and its counter is reset to 0.

Performance of Cache Memory

The performance of a cache memory is measured by the hit ratio (HR) and the effective access time (EAT).

A cache hit occurs when the data requested by the CPU is available in the cache.

A cache miss occurs when the data requested by the CPU is not available in the cache.

$$HR = \frac{\text{No. of hits}}{\text{Total no. of memory accesses}}$$

$$EAT = \frac{(\text{No. of hits}) \times T_h + (\text{No. of misses}) \times T_m}{\text{Total no. of memory accesses}}$$

Total amount of time = $(\text{No. of hits}) \times T_h + (\text{No. of misses}) \times T_m$

Access time for a hit = $T_h$

Access time for a miss = $T_m$
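A small worked example with invented numbers (900 hits, 100 misses, $T_h$ = 10 ns, $T_m$ = 70 ns):

```python
# Worked example with invented numbers.
hits, misses = 900, 100
t_h, t_m = 10, 70                          # nanoseconds per hit / per miss

total_accesses = hits + misses
hr = hits / total_accesses                 # hit ratio
total_time = hits * t_h + misses * t_m     # total amount of time
eat = total_time / total_accesses          # effective access time

print(hr, eat)   # 0.9, 16.0 ns
```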

Pipelining

Pipelining is a hardware technique that allows many operations to execute concurrently, in an assembly-line fashion.

It is done by separating the operations into stages and providing special hardware for each processing stage.

Successive operations start in sequence but overlap in time, so several operations are in progress at once.

Speeding up the program execution by pipelining

Program execution Pipeline

Consider there are 3 steps in fetching and executing an instruction:

  1. Fetch instruction from memory.
  2. Decode the instruction (and increment the program counter).
  3. Execute instruction in the ALU.
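As a back-of-the-envelope illustration (assuming every stage takes exactly one clock cycle, an idealisation), the speedup from pipelining these three steps can be estimated as follows:

```python
# Idealised timing for an s-stage pipeline executing n instructions,
# assuming every stage takes exactly one clock cycle.
def cycles_without_pipeline(n, s):
    return n * s                 # each instruction passes through all stages alone

def cycles_with_pipeline(n, s):
    return s + (n - 1)           # first result after s cycles, then one per cycle

n, s = 100, 3                    # 100 instructions, 3 stages (fetch/decode/execute)
print(cycles_without_pipeline(n, s))  # 300
print(cycles_with_pipeline(n, s))     # 102 -> speedup close to 3x
```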

Conditional branches lower the efficiency of a program execution pipeline.

Solution 1:

Branch Prediction and Speculative Execution

  • CPU guesses whether a branch condition will be true or false based on past experience.
  • Right guess => gain
  • Speculative execution
    • Execute instructions after the guess but before the final result is known with certainty.

Solution 2:

Simultaneous execution of both program paths

  • CPU assumes that either path is possible and executes instructions from both paths until the conditional BRANCH is finally evaluated.
  • Once the condition is known, further work on the incorrect path is abandoned, and all effort is directed to the correct path.
  • Can be used only when resources are allowed.

Interrupts

Interrupts are particularly useful when interfacing I/O devices that provide or require data at relatively low data transfer rates.

Unlike the polling technique used in programmed I/O, interrupt processing allows the microprocessor to execute other software while there is no request for data transfer to or from the I/O devices.

Interrupt I/O vs Programmed I/O

Programmed I/O

  • the simplest I/O technique for exchanging data between processor and other external device
  • The I/O module is treated as a slow module.

In programmed I/O, the processor executes a program that gives it direct control of the I/O operation. The processor issues a command to the I/O module and waits until the operation is complete, periodically checking the status of the I/O module until it finds that the operation has finished. If the processor is faster than the I/O module, processor time is wasted.

  • The primary disadvantage of programmed I/O is that the CPU spends most of its time in a tight loop waiting for the device to become ready. This is called busy waiting (see the sketch after this list).
  • It is used only in some low-end microcomputers.
  • It has single input and single output instructions.
  • Each instruction selects one I/O device (by number) and transfers a single character (byte).
  • Four registers: input status and character, output status and character.
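The busy-waiting loop looks roughly like the sketch below; read_status_register() and read_data_register() are hypothetical placeholders that simulate a slow device, not a real port-I/O API:

```python
import random

READY = 0x01

def read_status_register():
    return READY if random.random() < 0.01 else 0   # device is rarely ready

def read_data_register():
    return 0x41                                     # pretend a character arrived

def programmed_io_read():
    # The CPU spins in a tight loop ("busy waiting") until the device is ready;
    # all processor time spent in this loop is wasted.
    while not (read_status_register() & READY):
        pass
    return read_data_register()

print(chr(programmed_io_read()))   # 'A'
```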

Interrupt I/O

  • similar to the programmed I/O technique
  • more efficient than programmed I/O, since the processor is not tied up while the device works

With this technique, the processor does not wait until the I/O operation is complete. Instead, it normally carries on with other tasks. When the I/O operation is complete, the I/O module interrupts the processor; the interrupt means that the operation has completed.

  • With interrupt-driven I/O, the CPU starts the device and tells it to generate an interrupt when it is finished.
  • Done by setting interrupt-enable bit in status register.
  • Still requires an interrupt for every character read or written.
  • Interrupting a running process is an expensive business (requires saving context).
  • Requires extra hardware (an interrupt controller chip).

Software Interrupts

Instructions INT, INTO, INT 3 and BOUND cause software interrupts.

Interrupt vectors

  • The interrupt table

    • located in the first 1024 bytes of memory at address 000000H-0003FFH.
    • contains 256 different 4-byte interrupt vectors
  • Each interrupt vector contains:

    • the starting address (offset and segment) of the corresponding interrupt service procedure.

A type n INT means its interrupt vector No. is n.

The starting address of its interrupt service procedure (ISR) is the nth vector in the interrupt vector table.
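As an illustration of the real-mode lookup, the sketch below (with made-up memory contents and an assumed INT 21H vector) computes where vector n lives and converts segment:offset into a physical address:

```python
# Real-mode interrupt vector lookup (illustrative memory contents).
# Vector n lives at address 4*n: 2 bytes of offset followed by 2 bytes of segment.
memory = bytearray(1024)                       # 000000H-0003FFH, 256 x 4-byte vectors

def vector_address(n):
    return 4 * n

def read_vector(n):
    base = vector_address(n)
    offset  = memory[base]     | (memory[base + 1] << 8)
    segment = memory[base + 2] | (memory[base + 3] << 8)
    return segment, offset

# Example: install a made-up vector for INT 21H at segment 0F000H, offset 1234H
memory[4*0x21:4*0x21 + 4] = bytes((0x34, 0x12, 0x00, 0xF0))
segment, offset = read_vector(0x21)
print(hex(segment * 16 + offset))   # physical ISR address: 0xf1234
```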

Interrupt Instructions

Bound

Usage: check array index against bounds, raises software interrupt 5 if test fails

  • compares the content of a register with two words of memory data (lower and upper bounds).
  • A type 5 interrupt occurs if it is out of the bounds.

INT

Usage: generates a software interrupt

  • INT n instruction calls the interrupt service procedure that begins at the address represented in vector number n.
  • INT 3 instruction is a 1-byte instruction and is often used as a breakpoint interrupt.
  • INTO instruction checks the overflow flag. A type 4 interrupt occurs if OF=1.

IRET

Usage:

  • The IRET instruction is a special return instruction used to return from both software and hardware interrupts. It differs from a normal return in that it also pops a copy of the flag register from the stack.
  • The IRETD instruction is a special return instruction used to return from both software and hardware interrupts in protected mode in the 80386 and above microprocessors. It pops the 32-bit EIP and EFLAGS registers.

Normal subroutine vs INT subroutine

Operation of a real mode interrupt

Using the interrupt table

  • located in the first 1024 bytes of memory at address 000000H-0003FFH.

  • contains 256 different 4-byte interrupt vectors

  • Each interrupt vector contains:

    • the starting address (offset and segment) of the corresponding interrupt service procedure.

A type n INT means its interrupt vector No. is n.

The starting address of its interrupt service procedure (ISR) is the nth vector in the interrupt vector table.

Software interrupt

No external device provides the interrupt vector number; the CPU determines it by itself.

  1. Complete executing the current instruction. Check if the interrupt is activated by (1) instruction executions, (2) single-step, (3) NMI, (4) coprocessor segment overrun, (5) INTR, and (6) INT instruction in the order presented.
  2. Push the contents of the flag register onto the stack.
  3. Clear IF and TF to disable the INTR pin and the trap or single-step feature.
  4. Push the contents of CS and IP onto the stack.
  5. Fetch and place the corresponding interrupt vector contents into both IP and CS.
  6. Run the interrupt service procedure (ISR).
  7. Pop IP, CS and flag register contents back to the registers to resume the interrupted programme after running the ISR.
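A rough Python model of steps 2–7 above; the registers, flags, and vector table are simplified stand-ins for illustration, not real hardware state:

```python
# Simplified model of the real-mode interrupt entry/exit sequence above.
stack = []
flags = {"IF": 1, "TF": 0}
cs, ip = 0x1000, 0x0200
vectors = {0x21: (0xF000, 0x1234)}             # illustrative ISR address

def enter_interrupt(n):
    global cs, ip
    stack.append(dict(flags))                  # 2. push the flag register
    flags["IF"] = flags["TF"] = 0              # 3. clear IF and TF
    stack.append((cs, ip))                     # 4. push CS and IP
    cs, ip = vectors[n]                        # 5. load the vector into CS:IP
    # 6. the ISR would run here

def iret():
    global cs, ip
    cs, ip = stack.pop()                       # 7. pop IP and CS ...
    flags.update(stack.pop())                  #    ... and the flag register

enter_interrupt(0x21)
iret()
print(cs, ip, flags)                           # back to the interrupted programme
```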

Hardware interrupt

Once the interface sends an interrupt signal:

  1. Complete executing the current instruction. Check if the interrupt is activated by (1) instruction executions, (2) single-step, (3) NMI, (4) coprocessor segment overrun, (5) INTR, and (6) INT instruction in the order presented.
  2. Push the contents of the flag register onto the stack.
  3. Clear IF and TF to disable the INTR pin and the trap or single-step feature.
  4. Push the contents of CS and IP onto the stack.
  5. Fetch and place the corresponding interrupt vector contents into both IP and CS.
  6. Run the interrupt service procedure (ISR).
  7. Pop IP, CS and flag register contents back to the registers to resume the interrupted programme after running the ISR.

Operation of a protected mode interrupt

Interrupts have exactly the same assignments as in real mode, but the interrupt descriptor table, instead of the interrupt vector table, is accessed to locate the starting addresses of the interrupt service procedures.

  • Protected mode uses a set of 256 interrupt descriptors stored in an interrupt descriptor table (IDT).
  • The starting address of the IDT is specified by the interrupt descriptor table address register (IDTR).

Interrupt Flag Bits

Interrupt flag (IF-bit)

  • When the interrupt flag (IF-bit) is set, it allows the INTR pin to cause an interrupt.

Trap flag (TF-bit)

  • When the trap flag (TF-bit) is set, it causes a trap (single-step) interrupt to occur after each instruction executes.
  • It cannot be directly set or cleared with a single instruction.

Hardware Interrupts

The interrupts of the entire Intel family of microprocessors include two hardware pins that request interrupts (INTR and NMI) and one hardware pin (INTA) that acknowledges the interrupt requested through INTR.

NMI

  • an edge-triggered input that requests an interrupt on the positive edge.
  • often used for parity errors and other major system faults such as power failures.

INTR and INTA

  • INTR is level-sensitive and must be held at a logic 1 level until it’s recognized.
  • The input is automatically disabled once it is accepted by the microprocessor (clear IF) and re-enabled by the IRET instruction (set IF) at the end of the interrupt service procedure.
  • The microprocessor responds to the INTR input by pulsing the INTA output in anticipation of receiving an interrupt vector number on data bus connection D7-D0.
  • Two INTA pulses will be generated by the system and the content on the data bus will only be interpreted as an interrupt vector number during the second pulse.

The Interrupt Structure

Two flag bits, IF (interrupt flag) and TF (trap flag), are used with the interrupt structure and a special return instruction IRET (or IRETD in the 80386, 80486, or Pentium/Pentium Pro).

Using a priority-resolving circuit

  • Priority is established according to the position of the bits in the register.
  • Mask register controls the status of each interrupt request.
  • The priority of each device can be changed flexibly by reprogramming the priority encoder.
  • The priority encoder output is combined with a prefix address, in this case 0000 00 binary, so the resulting vector numbers are 0000 0000, 0000 0001, 0000 0010, 0000 0011, i.e. 0 to 3 in decimal (see the sketch below).
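A sketch of this vector formation, assuming bit 0 has the highest priority and a 2-bit encoder output appended to the 6-bit prefix (both assumptions for illustration):

```python
# Sketch: form an interrupt vector number from a priority encoder output.
# requests and mask are bit masks; bit 0 has the highest priority here,
# and the 6-bit prefix 000000 is prepended to the 2-bit encoder output.
PREFIX = 0b000000

def encode_vector(requests, mask):
    pending = requests & ~mask            # mask register gates each request line
    if pending == 0:
        return None                       # nothing to service
    for line in range(4):                 # positional priority: lowest bit wins
        if pending & (1 << line):
            return (PREFIX << 2) | line   # vector numbers 0..3

print(encode_vector(0b0110, 0b0000))      # 1: line 1 is the highest-priority pending
print(encode_vector(0b0110, 0b0010))      # 2: line 1 masked, so line 2 is serviced
```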