| Both sides previous revisionPrevious revisionNext revision | Previous revision |
| mywiki:hw:mips:barrier_fence [2014/07/28 13:59] – [Pipeline Hazards/Branch Delay] shaoguoh | mywiki:hw:mips:barrier_fence [2022/04/02 17:29] (current) – external edit 127.0.0.1 |
|---|
| **MIPS Pipeline Hazards and Memory Barrier/Fences** | **MIPS Pipeline Hazards, Memory Alignment and Barrier/Fences** |
| ====== Pipeline Hazards/Branch Delay ====== | ====== Pipeline Hazards/Branch Delay ====== |
| | {{:mywiki:hw:mips:800px-mips_architecture_pipelined_.svg.png?600|}} |
| MIPS has explicit pipeline hazards; **the instruction immediately following a branch or jump instruction will always be executed** (this instruction is sometimes referred to as the "b**ranch delay slot**"). If your code was really assembled exactly as you wrote it: | MIPS has explicit pipeline hazards; **the instruction immediately following a branch or jump instruction will always be executed** (this instruction is sometimes referred to as the "b**ranch delay slot**"). If your code was really assembled exactly as you wrote it: |
| <file> | <file> |
| nop #must add nop here | nop #must add nop here |
| </file> | </file> |
| | |
| | ====== Memory Alignment ====== |
| | Most RISC processors will generate **an alignment fault** when a load or store instruction accesses a **misaligned address**. This allows **the operating system to emulate the misaligned access** using other instructions. For example, the alignment fault handler might use byte loads or stores (which are always aligned) to emulate a larger load or store instruction. |
| | |
| | Some architectures like MIPS have special unaligned load and store instructions. One unaligned load instruction gets the bytes from the memory word with the lowest byte address and another gets the bytes from the memory word with the highest byte address. Similarly, store-high and store-low instructions store the appropriate bytes in the higher and lower memory words respectively. |
| | |
| ====== Memory Barrier/Fences ====== | ====== Memory Barrier/Fences ====== |
| |
| CPU cores contain multiple execution units. For example, a modern Intel CPU contains 6 execution units which can do a combination of arithmetic, conditional logic, and memory manipulation. Each execution unit can do some combination of these tasks. These execution units operate in parallel allowing instructions to be executed in parallel. This introduces another level of non-determinism to program order if it was observed from another CPU. | CPU cores contain multiple execution units. For example, a modern Intel CPU contains 6 execution units which can do a combination of arithmetic, conditional logic, and memory manipulation. Each execution unit can do some combination of these tasks. These execution units operate in parallel allowing instructions to be executed in parallel. This introduces another level of non-determinism to program order if it was observed from another CPU. |
| |
| {{ ::cpu.png |}} | {{:mywiki:hw:mips:cpu.png|}} |
| |
| Loads and stores to the caches and main memory are buffered and re-ordered using the load, store, and write-combining buffers. These buffers are associative queues that allow fast lookup. This lookup is necessary when a later load needs to read the value of a previous store that has not yet reached the cache. Figure 1 above depicts a simplified view of a modern multi-core CPU. It shows how the execution units can use the local registers and buffers to manage memory while it is being transferred back and forth from the cache sub-system. | Loads and stores to the caches and main memory are buffered and re-ordered using the load, store, and write-combining buffers. These buffers are associative queues that allow fast lookup. This lookup is necessary when a later load needs to read the value of a previous store that has not yet reached the cache. Figure 1 above depicts a simplified view of a modern multi-core CPU. It shows how the execution units can use the local registers and buffers to manage memory while it is being transferred back and forth from the cache sub-system. |
| Memory barriers are a complex subject. They are implemented very differently across CPU architectures. At one end of the spectrum there is a relatively strong memory model on Intel CPUs that is more simple than say the weak and complex memory model on a DEC Alpha with its partitioned caches in addition to cache layers. Since x86 CPUs are the most common for multi-threaded programming I’ll try and simplify to this level. | Memory barriers are a complex subject. They are implemented very differently across CPU architectures. At one end of the spectrum there is a relatively strong memory model on Intel CPUs that is more simple than say the weak and complex memory model on a DEC Alpha with its partitioned caches in addition to cache layers. Since x86 CPUs are the most common for multi-threaded programming I’ll try and simplify to this level. |
| |
| ====== Store Barrier ====== | ===== Store Barrier ===== |
| |
| A store barrier, “sfence” instruction on x86, **forces all store instructions prior to the barrier to happen before the barrier and have the store buffers flushed to cache for the CPU on which it is issued.** This will make the program state visible to other CPUs so they can act on it if necessary. A good example of this in action is the following simplified code from the BatchEventProcessor in the Disruptor. When the sequence is updated other consumers and producers know how far this consumer has progressed and thus can take appropriate action. All previous updates to memory that happened before the barrier are now visible. | A store barrier, “sfence” instruction on x86, **forces all store instructions prior to the barrier to happen before the barrier and have the store buffers flushed to cache for the CPU on which it is issued.** This will make the program state visible to other CPUs so they can act on it if necessary. A good example of this in action is the following simplified code from the BatchEventProcessor in the Disruptor. When the sequence is updated other consumers and producers know how far this consumer has progressed and thus can take appropriate action. All previous updates to memory that happened before the barrier are now visible. |
| </code> | </code> |
| |
| ====== Load Barrier ====== | ===== Load Barrier ===== |
| |
| A load barrier, “lfence” instruction on x86, **forces all load instructions after the barrier to happen after the barrier and then wait on the load buffer to drain for that CPU.** This makes program state exposed from other CPUs visible to this CPU before making further progress. A good example of this is when the BatchEventProcessor sequence referenced above is read by producers, or consumers, in the corresponding barriers of the Disruptor. | A load barrier, “lfence” instruction on x86, **forces all load instructions after the barrier to happen after the barrier and then wait on the load buffer to drain for that CPU.** This makes program state exposed from other CPUs visible to this CPU before making further progress. A good example of this is when the BatchEventProcessor sequence referenced above is read by producers, or consumers, in the corresponding barriers of the Disruptor. |
| |
| ====== Full Barrier ====== | ===== Full Barrier ===== |
| |
| A full barrier, "mfence" instruction on x86, is **a composite of both load and store barriers** happening on a CPU. | A full barrier, "mfence" instruction on x86, is **a composite of both load and store barriers** happening on a CPU. |
| |
| ====== Java Memory Model ====== | ===== Java Memory Model ===== |
| |
| In the Java Memory Model a **volatile** field has a **store barrier inserted after a write** to it and **a load barrier inserted before a read of it**. Qualified final fields of a class have a store barrier inserted after their initialization to ensure these fields are visible once the constructor completes when a reference to the object is available. | In the Java Memory Model a **volatile** field has a **store barrier inserted after a write** to it and **a load barrier inserted before a read of it**. Qualified final fields of a class have a store barrier inserted after their initialization to ensure these fields are visible once the constructor completes when a reference to the object is available. |
| |
| ====== Atomic Instructions and Software Locks ====== | ===== Atomic Instructions and Software Locks ===== |
| |
| Atomic instructions, such as the “**lock** ...” instructions on x86, are effectively **a full barrier** as they lock the memory sub-system to perform an operation and have guaranteed total order, even across CPUs. Software locks usually employ memory barriers, or atomic instructions, to achieve visibility and preserve program order. | Atomic instructions, such as the “**lock** ...” instructions on x86, are effectively **a full barrier** as they lock the memory sub-system to perform an operation and have guaranteed total order, even across CPUs. Software locks usually employ memory barriers, or atomic instructions, to achieve visibility and preserve program order. |
| |
| ====== Performance Impact of Memory Barriers ====== | ===== Performance Impact of Memory Barriers ===== |
| |
| Memory barriers prevent a CPU from performing a lot of techniques to hide memory latency therefore they have a significant performance cost which must be considered. To achieve maximum performance it is best to model the problem so the processor can do units of work, then have all the necessary memory barriers occur on the boundaries of these work units. Taking this approach allows the processor to optimize the units of work without restriction. There is an advantage to grouping necessary memory barriers in that buffers flushed after the first one will be less costly because no work will be under way to refill them. | Memory barriers prevent a CPU from performing a lot of techniques to hide memory latency therefore they have a significant performance cost which must be considered. To achieve maximum performance it is best to model the problem so the processor can do units of work, then have all the necessary memory barriers occur on the boundaries of these work units. Taking this approach allows the processor to optimize the units of work without restriction. There is an advantage to grouping necessary memory barriers in that buffers flushed after the first one will be less costly because no work will be under way to refill them. |
| | |
| | ====== MIPS Linux barrier ====== |
| | Refer to: Linux/arch/mips/include/asm/barrier.h |
| | <file> |
| | define __sync() \ |
| | __asm__ __volatile__( \ |
| | ".set push\n\t" \ |
| | ".set noreorder\n\t" \ |
| | ".set mips2\n\t" \ |
| | "sync\n\t" \ |
| | ".set pop" \ |
| | : /* no output */ \ |
| | : /* no input */ \ |
| | : "memory") |
| | |
| | |
| | #define __fast_iob() \ |
| | __asm__ __volatile__( \ |
| | ".set push\n\t" \ |
| | ".set noreorder\n\t" \ |
| | "lw $0,%0\n\t" \ |
| | "nop\n\t" \ |
| | ".set pop" \ |
| | : /* no output */ \ |
| | : "m" (*(int *)CKSEG1) \ |
| | : "memory") |
| | |
| | # define fast_wmb() __sync() |
| | # define fast_rmb() __sync() |
| | # define fast_mb() __sync() |
| | |
| | #define wmb() fast_wmb() |
| | #define rmb() fast_rmb() |
| | #define mb() wbflush() |
| | #define iob() wbflush() |
| | |
| | #define set_mb(var, value) ... |
| | #define smp_llsc_mb() ... |
| | |
| | </file> |