Computer Systems Principle [2] - Cache, Pipelining and Interrupts
Modern CPUs use a number of advanced techniques to improve processor performance, including the following:
- Memory caching
- Pipelining
- Branch prediction and speculative execution
- Multiprocessing
Cache System
The cache memory is a buffer that allows the 80386 to function more efficiently with lower DRAM speeds.
The cache memory improves overall performance of the memory systems for data that is accessed more than once.
- cache is a high-speed memory system that is placed between the microprocessor and the DRAM memory system.
- Cache memory devices are usually static RAM memory components with access times of less than 10ns.
- Internal cache is in the same IC with the CPU
- External cache is on the motherboard.
- Advanced microprocessors have 3 levels of internal caches.
- Lower level internal caches => closer to the CPU / smaller size / more expensive / faster
For an 80386-based system, a 256K-byte cache provides the best cost/performance trade-off for the operating speed of the system.
- A cache hit occurs when the data requested by the CPU is available in the cache.
- A cache miss occurs when the data requested by the CPU is not available in the cache.
Structure of a cache system
The system memory space is partitioned into a number of blocks.
- Blocks are grouped into a number of sets.
- The blocks in the same set compete with each other to occupy a fixed number of block buffers (= number of ways) in the cache.
- The data movement between cache and system memory is block-oriented.
Memory Space:
- Partition Memory Space into blocks
- Divide blocks into groups. Each group is called a set.
- All blocks in the same set from the Memory space compete for a place in the same set in the cache space.
Cache Space:
- Partition cache space into blocks (same block size). Each block is called a way.
- Divide blocks into groups (same number of groups, i.e. sets, as in the memory space).
When accessing a memory location, an address is given.
Tag address (m bits) | Set address (n bits) | Byte address (k bits)
- Set address (n bits) tells which set
- Tag address (m bits) tells which block in the set
- Byte address (k bits) tells which byte in the block
Their lengths determine the structure of the cache system:
- 2^m blocks in a set
- 2^n sets in the memory space
- 2^(m+n) blocks in the memory space
- 2^k bytes in a block
For example, with a 32-bit memory address:
- Tag address: 20 bits => 2^20 = 1M blocks/set
- Set address: 8 bits => 2^8 = 256 sets
- Byte address: 4 bits => 2^4 = 16 bytes/block
- System memory size = 2^(20+8) = 2^28 blocks (2^28 × 16 bytes = 4G bytes)
- Cache size = (No. of ways × 2^8) blocks
For a 4-way cache system, there are 4 block buffers per set for storing data in the cache.
The fewer the ways, the lower the hit rate and the less complex the circuitry is.
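The address decomposition above can be sketched in code; this is an illustrative helper (not from the source), parameterized by the field widths m, n and k:

```python
# Sketch: splitting a memory address into tag / set / byte fields.
# m, n, k are the tag, set and byte field widths of the cache design.

def split_address(addr, m, n, k):
    """Return (tag, set_index, byte_offset) for an (m+n+k)-bit address."""
    byte_offset = addr & ((1 << k) - 1)          # lowest k bits
    set_index = (addr >> k) & ((1 << n) - 1)     # next n bits
    tag = (addr >> (k + n)) & ((1 << m) - 1)     # highest m bits
    return tag, set_index, byte_offset

# 32-bit address with m=20 tag bits, n=8 set bits, k=4 byte bits:
tag, set_index, byte_offset = split_address(0x12345678, 20, 8, 4)
```

Here 0x12345678 decomposes into tag 0x12345, set 0x67 and byte offset 0x8.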
A cache entry = a cache directory entry + the corresponding cache memory entry.
- The cache directory holds bookkeeping information about what is stored in the cache (e.g. tag and valid bit).
- The actual data itself is stored as a cache memory entry in a block buffer.
- A 3-bit LRU entry determines which buffer in the selected set should be replaced after a cache miss.
How does it work?
To determine if a requested datum is in the cache:
- Determine the set address and the tag address of the datum.
- Check the tag fields of all cache directory entries of the corresponding set against the tag address simultaneously.
- If there is a match in any one of the tag fields and the corresponding valid bit is active, there is a cache hit, else a cache miss.
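The lookup procedure above can be sketched as follows; the class and function names are illustrative, not from the source:

```python
# Sketch of a set-associative cache lookup: the tag is compared against
# every way of the selected set (in hardware this happens in parallel;
# here it is a simple loop).

class CacheSet:
    def __init__(self, num_ways):
        self.tags = [None] * num_ways    # tag field of each directory entry
        self.valid = [False] * num_ways  # valid bit of each directory entry

def lookup(cache_sets, tag, set_index):
    """Return the matching way number on a hit, or None on a miss."""
    s = cache_sets[set_index]
    for way, (t, v) in enumerate(zip(s.tags, s.valid)):
        if v and t == tag:
            return way       # cache hit: tag matches and valid bit active
    return None              # cache miss

sets = [CacheSet(4) for _ in range(256)]  # 4-way cache with 256 sets
sets[0x67].tags[2] = 0x12345              # pretend way 2 holds this block
sets[0x67].valid[2] = True
```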
Read Policy
If Data is in the cache
Forward to CPU
If Data is not in the cache:
Load Through:
Forward the word as cache line is filled
or
Fill cache line and then forward word
Write Policy
If Data is in the cache
Write Through:
Write data to both cache and main memory
or
Write Back:
Write data to cache only.
Defer main memory write until block is flushed.
If Data is not in the cache
Write Allocate:
Bring line into cache, then update it
or
Write No-Allocate:
Update main memory only.
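The contrast between write-through and write-back can be sketched as below; the dictionaries standing in for cache and main memory are purely illustrative:

```python
# Sketch contrasting write-through and write-back policies.

memory = {}   # models main memory: block address -> data
cache = {}    # models the cache: block address -> (data, dirty bit)

def write_through(block, data):
    cache[block] = (data, False)   # cache updated...
    memory[block] = data           # ...and main memory updated immediately

def write_back(block, data):
    cache[block] = (data, True)    # cache updated only; block marked dirty

def flush(block):
    data, dirty = cache.pop(block)
    if dirty:
        memory[block] = data       # deferred main-memory write on flush

write_through(0x10, 0xAA)   # main memory sees the write at once
write_back(0x20, 0xBB)      # main memory not yet updated
flush(0x20)                 # now the deferred write happens
```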
Replacement policy
- Random
- LIFO – Last-in-first-out
- FIFO – First-in-first-out
- LRU – least recently used
- LFU – least frequently used
- Ideal – replace the one no longer used in the future.
Basically:
When accessing a memory location, an address is given.
Tag address (m bits) | Set address (n bits) | Byte address (k bits)
- If the tag is found in one of the ways => count a hit and increment that tag's frequency counter.
- Else => count a miss; depending on the policy, replace the tag in a chosen way and reset its counter to 0.
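The counter bookkeeping above can be sketched for a single set, using the LFU policy as the example (empty ways are filled first; all names are illustrative):

```python
# Sketch of hit/miss bookkeeping for one cache set under LFU replacement:
# on a hit, bump that tag's frequency counter; on a miss, fill an empty
# way if one exists, otherwise evict the least frequently used way and
# reset its counter to 0.

def access(ways, counters, tag):
    """ways: list of tags; counters: per-way frequency counts.
    Updates both lists in place and returns 'hit' or 'miss'."""
    if tag in ways:
        counters[ways.index(tag)] += 1
        return 'hit'
    if None in ways:                          # fill an empty way first
        victim = ways.index(None)
    else:
        victim = counters.index(min(counters))  # least frequently used
    ways[victim] = tag
    counters[victim] = 0
    return 'miss'

ways, counters = [None] * 4, [0] * 4
results = [access(ways, counters, t) for t in [5, 5, 7, 5, 9, 11, 13]]
```

In this trace, tag 13 evicts tag 7 (the least frequently used block) once all four ways are occupied.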
Performance of Cache Memory
The performance of a cache memory is measured by the hit ratio (HR) and the effective access time (EAT).
A cache hit occurs when the data requested by the CPU is available in the cache.
A cache miss occurs when the data requested by the CPU is not available in the cache.
Hit ratio: HR = No. of hits / (No. of hits + No. of misses)
Access time for a hit = T_cache (the cache access time)
Access time for a miss = T_main (the main memory access time)
Effective access time: EAT = HR × T_cache + (1 − HR) × T_main
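A quick sanity check of the effective access time formula EAT = HR × T_cache + (1 − HR) × T_main, with illustrative numbers:

```python
# Effective access time: weighted average of hit and miss access times.

def effective_access_time(hit_ratio, t_cache, t_main):
    # EAT = HR * T_cache + (1 - HR) * T_main
    return hit_ratio * t_cache + (1 - hit_ratio) * t_main

# e.g. 90% hit ratio, 10 ns cache, 70 ns main memory:
eat = effective_access_time(0.9, 10, 70)  # 0.9*10 + 0.1*70 = 16 ns
```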
Pipelining
Pipelining is a hardware technique that allows many operations to execute concurrently, in an assembly-line fashion.
It is done by separating the operations into stages and providing special hardware for each processing stage.
Successive operations start, execute, and finish in sequence.
Speeding up the program execution by pipelining
Program execution Pipeline
Consider there are 3 steps in fetching and executing an instruction:
- Fetch instruction from memory.
- Decode instruction and increment the program counter.
- Execute instruction in the ALU.
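With a k-stage pipeline and one cycle per stage, n instructions take k + (n − 1) cycles instead of k × n, assuming no stalls. A minimal sketch (formulas only; the stage count of 3 matches the fetch/decode/execute split above):

```python
# Cycle counts for n instructions with and without a k-stage pipeline
# (assumes one cycle per stage and no stalls or branches).

def unpipelined_cycles(n, k):
    return n * k          # each instruction runs all k stages by itself

def pipelined_cycles(n, k):
    return k + (n - 1)    # k cycles to fill the pipe, then 1 result/cycle

# 3-stage pipeline (fetch, decode, execute), 10 instructions:
speedup = unpipelined_cycles(10, 3) / pipelined_cycles(10, 3)  # 30/12 = 2.5
```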
Conditional branches lower the efficiency of a program execution pipeline.
Solution 1:
Branch Prediction and Speculative Execution
- CPU guesses whether a branch condition will be true or false based on past experience.
- Right guess => gain
- Speculative execution
- Execute instructions after the guess but before the final result is known with certainty.
Solution 2:
Simultaneous execution of both program paths
- CPU assumes that either path is possible and executes instructions from both paths until the conditional BRANCH is finally evaluated.
- Once the condition is known, further work on the incorrect path is abandoned, and all effort is directed to the correct path.
- Can be used only when resources are allowed.
Interrupts
Interrupts are particularly useful when interfacing I/O devices that provide or require data at relatively low data transfer rates.
Unlike the polling technique used in programmed I/O, interrupt processing allows the microprocessor to execute other software while there is no request for data transfer to or from the I/O devices.
Interrupt I/O vs Programmed I/O
Programmed I/O
- The simplest I/O technique for exchanging data between the processor and an external device.
- The I/O module is treated as a slow module.
- In programmed I/O, the processor executes a program that gives it direct control of the I/O operation. The processor issues a command to the I/O module and waits until the operation is complete, periodically checking the status of the I/O module. If the processor is faster than the I/O module, processor time is wasted.
- The primary disadvantage of programmed I/O is that the CPU spends most of its time in a tight loop waiting for the device to become ready. This is called busy waiting.
- It is used only in some low-end microcomputers.
- It has single input and single output instructions.
- Each instruction selects one I/O device (by number) and transfers a single character (byte).
- Four registers: input status and character, output status and character.
Interrupt I/O
- Similar in spirit to the programmed I/O technique.
- The I/O module is faster than a programmed I/O module.
- With this technique, the processor does not wait until the I/O operation is complete. Instead, it normally performs other work. When the I/O operation is complete, the I/O module interrupts the processor; the interrupt means that the operation has completed.
- With interrupt-driven I/O, the CPU starts the device and tells it to generate an interrupt when it is finished.
- This is done by setting the interrupt-enable bit in the status register.
- It still requires an interrupt for every character read or written.
- Interrupting a running process is expensive (it requires saving the context).
- Eliminating these per-character interrupts requires extra hardware (a DMA controller chip).
Software Interrupts
The instructions INT, INTO, INT 3 and BOUND cause software interrupts.
Interrupt vectors
The interrupt table
- located in the first 1024 bytes of memory at address 000000H-0003FFH.
- contains 256 different 4-byte interrupt vectors
Each interrupt vector contains:
- the starting address (offset and segment) of the corresponding interrupt service procedure.
A type n INT means its interrupt vector No. is n.
The starting address of its interrupt service procedure (ISR) is the nth vector in the interrupt vector table.
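The vector lookup above can be sketched as follows; the memory image and its contents are illustrative only:

```python
# Real-mode interrupt vector lookup: vector n occupies the 4 bytes at
# address 4*n in the table (000000H-0003FFH), holding the ISR offset
# (low word) and segment (high word), little-endian.

def vector_address(n):
    return 4 * n

def isr_start(memory, n):
    """memory: bytes-like image of the first 1 KB. Returns (segment, offset)."""
    base = vector_address(n)
    offset = memory[base] | (memory[base + 1] << 8)
    segment = memory[base + 2] | (memory[base + 3] << 8)
    return segment, offset

# e.g. vector 5 (BOUND) lives at address 4*5 = 20 = 014H;
# here we plant offset 1234H, segment F000H as sample values:
mem = bytearray(1024)
mem[20:24] = bytes([0x34, 0x12, 0x00, 0xF0])
```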
Interrupt Instructions
BOUND
Usage: checks an array index against bounds; raises software interrupt 5 if the test fails
- Compares the content of a register with two words of memory data (the lower and upper bounds).
- A type 5 interrupt occurs if the value is out of bounds.
INT
Usage: generates a software interrupt
- The INT n instruction calls the interrupt service procedure that begins at the address held in vector number n.
- The INT 3 instruction is a 1-byte instruction and is often used as a breakpoint interrupt.
- The INTO instruction checks the overflow flag. A type 4 interrupt occurs if OF = 1.
IRET
Usage:
- The IRET instruction is a special return instruction used to return from both software and hardware interrupts. It differs from a normal return in that it also pops a copy of the flag register from the stack.
- The IRETD instruction is the corresponding return instruction for protected mode in the 80386 and above microprocessors. It pops the 32-bit EIP and EFLAGS.
Normal subroutine vs INT subroutine
Operation of a real mode interrupt
A real mode interrupt uses the interrupt vector table (described above) to locate the starting address of the interrupt service procedure.
Software interrupt
The vector number is supplied by the instruction itself; no external device provides it.
- Complete executing the current instruction. Check if the interrupt is activated by (1) instruction executions, (2) single-step, (3) NMI, (4) coprocessor segment overrun, (5) INTR, and (6) INT instruction in the order presented.
- Push the contents of the flag register onto the stack.
- Clear IF and TF to disable the INTR pin and the trap or single-step feature.
- Push the contents of CS and IP onto the stack.
- Fetch and place the corresponding interrupt vector contents into both IP and CS.
- Run the interrupt service procedure (ISR).
- Pop IP, CS and flag register contents back to the registers to resume the interrupted programme after running the ISR.
Hardware interrupt
Once the interface sends an interrupt signal (NMI or INTR), the CPU follows the same sequence of steps as for a software interrupt, except that for INTR the vector number is supplied by external hardware on the data bus during the interrupt acknowledge (INTA) cycle.
Operation of a protected mode interrupt
Protected mode interrupts have exactly the same assignments as in real mode, but the interrupt descriptor table, instead of the interrupt vector table, is accessed to locate the starting addresses of the interrupt service procedures.
- Protected mode uses a set of 256 interrupt descriptors stored in an interrupt descriptor table (IDT).
- The starting address of the IDT is specified by the interrupt descriptor table address register (IDTR).
Interrupt Flag Bits
Interrupt flag (IF-bit)
- When the interrupt flag (IF-bit) is set, it allows the INTR pin to cause an interrupt.
Trap flag (TF-bit)
- When the trap flag (TF-bit) is set, it causes a trap (single-step) interrupt to occur after each instruction executes.
- Cannot be directly set or cleared with a single instruction.
Hardware Interrupts
The interrupts of the entire Intel family of microprocessors include two hardware pins that request interrupts (INTR and NMI) and one hardware pin (INTA) that acknowledges the interrupt requested through INTR.
NMI
- an edge-triggered input that requests an interrupt on the positive edge.
- often used for parity errors and other major system faults such as power failures.
INTR and INTA
- INTR is level-sensitive and must be held at a logic 1 level until it’s recognized.
- The input is automatically disabled once it is accepted by the microprocessor (clear IF) and re-enabled by the IRET instruction (set IF) at the end of the interrupt service procedure.
- The microprocessor responds to the INTR input by pulsing the INTA output in anticipation of receiving an interrupt vector number on data bus connection D7-D0.
- Two INTA pulses will be generated by the system and the content on the data bus will only be interpreted as an interrupt vector number during the second pulse.
The Interrupt Structure
Two flag bits, IF (interrupt flag) and TF (trap flag), are used with the interrupt structure and a special return instruction IRET (or IRETD in the 80386, 80486, or Pentium/Pentium Pro).
Using a priority-resolving circuit
- Priority is established according to the position of the bits in the register.
- Mask register controls the status of each interrupt request.
- The priority of each device can be changed flexibly by reprogramming the priority encoder.
- The priority encoder output is combined with a prefix address, in this case 000000 binary. The resulting vector numbers are therefore 0000 0000, 0000 0001, 0000 0010 and 0000 0011, i.e. 0 to 3 in decimal.
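The priority-resolving scheme above can be sketched as follows; the 4-bit request width, the bit ordering (bit 0 = highest priority) and the function names are assumptions for illustration:

```python
# Sketch of a priority-resolving circuit: the highest-priority pending,
# unmasked request bit is encoded, then combined with a 6-bit prefix
# (here 000000) to form the interrupt vector number.

def resolve_vector(requests, mask, prefix=0b000000):
    """requests/mask: 4-bit values; bit 0 has the highest priority.
    Returns the vector number, or None if nothing is pending."""
    pending = requests & ~mask & 0xF   # masked-off requests are ignored
    for bit in range(4):               # bit 0 checked first = top priority
        if pending & (1 << bit):
            return (prefix << 2) | bit  # prefix || 2-bit encoder output
    return None

# devices 1 and 2 both request; device 1 wins (higher priority):
v = resolve_vector(0b0110, 0b0000)
```

With the 000000 prefix this yields exactly the vector numbers 0 to 3 described above.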