Load-Store Unit (LSU)
The L1 data cache and the LSU are shown in the figure. The L1 data cache supports up to two 128-bit reads per cycle, two 64-bit writes per cycle, or a mix of the two.
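As a rough illustration of these figures, the peak L1 data bandwidth for each read/write mix can be worked out as below. This is only a sketch: the port widths come from the text, while the function name and the cap of two operations per cycle for mixed traffic are assumptions made for the example.

```python
# Port widths taken from the text. The cap of two operations per cycle for a
# mixed read/write cycle is an assumed reading of "a mix of the two".
READ_BITS = 128
WRITE_BITS = 64
MAX_OPS_PER_CYCLE = 2


def peak_bytes_per_cycle(reads, writes):
    """Return the peak number of bytes moved in one cycle for a given mix."""
    if reads < 0 or writes < 0 or reads + writes > MAX_OPS_PER_CYCLE:
        raise ValueError("at most two L1 operations per cycle")
    return (reads * READ_BITS + writes * WRITE_BITS) // 8


for reads, writes in [(2, 0), (1, 1), (0, 2)]:
    print(reads, "reads,", writes, "writes:",
          peak_bytes_per_cycle(reads, writes), "bytes/cycle")
```

Under these assumptions the read path peaks at 32 bytes per cycle and the write path at 16 bytes per cycle.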
The LSU has two queues: LS1 with 24 entries and LS2 with 64 entries (12 and 32 in the previous Stars architecture).
The LS1 unit can start two L1 cache operations per clock cycle. These are either reads or tag checks for later writes: because the cache uses a write-allocate policy, the LSU must check whether the line is cached before writing to it. Read operations can start out of order, provided that certain conditions are met.
The LS2 queue holds requests that missed in the L1 cache after the check made by the LS1 unit. Stores, however, are always committed from the LS2 queue, by which point the tag check result is available.
128-bit writes are handled specially: since only 64 bits can be written at a time, each one takes up two slots in LS2.
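The slot accounting for wide stores can be sketched as follows. The 64-entry capacity and the 64-bit write width come from the text. The helper names are hypothetical, and the model assumes LS2 holds only stores, ignoring the loads that share the queue in practice.

```python
LS2_ENTRIES = 64       # LS2 capacity, from the text
STORE_PATH_BITS = 64   # widest single write, from the text


def ls2_slots(store_bits):
    """Slots one store occupies: one per 64-bit chunk, rounded up."""
    return -(-store_bits // STORE_PATH_BITS)


def stores_fit(store_widths):
    """True if the given stores fit in LS2 at once (assumes no loads queued)."""
    return sum(ls2_slots(w) for w in store_widths) <= LS2_ENTRIES


print(ls2_slots(64), ls2_slots(128))
print(stores_fit([128] * 32), stores_fit([128] * 33))
```

In this simplified model, 32 pending 128-bit stores are enough to fill the whole LS2 queue.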
Finally, the LSU ensures that the memory-ordering rules of the x86 architecture are observed.
Write Combining
Llano has four 64-byte buffers (one cache line each) and eight address buffers, which allow it to merge up to 8 writes to 4 different cache lines.
When multiple stores are executed in close succession, it can be useful to combine them before they are written out, in order to improve write efficiency.
This feature is particularly useful when the data is written to external devices connected via the PCI Express bus or to the south bridge.
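The idea of write combining can be illustrated with a toy model. The buffer count and size come from the text; everything else is an assumption made for the example. In particular, the `WriteCombiner` class is hypothetical, the eight address buffers are not modeled, and the flush rules (write out a line once it is full, evict the oldest buffer when a fifth line is needed) are plausible guesses rather than Llano's documented behavior.

```python
from collections import OrderedDict

LINE_BYTES = 64    # buffer size from the text (one cache line)
NUM_BUFFERS = 4    # number of data buffers from the text


class WriteCombiner:
    """Toy model: merges stores to the same 64-byte line before flushing."""

    def __init__(self):
        self.buffers = OrderedDict()  # line address -> set of dirty byte offsets
        self.flushes = []             # (line address, dirty byte count) per flush

    def store(self, addr, size):
        # Assumes the store does not cross a cache-line boundary.
        line = addr - addr % LINE_BYTES
        if line not in self.buffers:
            if len(self.buffers) == NUM_BUFFERS:
                # Assumed policy: evict the oldest buffer to make room.
                self._flush(next(iter(self.buffers)))
            self.buffers[line] = set()
        offset = addr % LINE_BYTES
        self.buffers[line].update(range(offset, offset + size))
        if len(self.buffers[line]) == LINE_BYTES:
            # Assumed policy: a completely filled line is written out at once.
            self._flush(line)

    def _flush(self, line):
        self.flushes.append((line, len(self.buffers.pop(line))))

    def drain(self):
        """Write out every partially filled buffer."""
        for line in list(self.buffers):
            self._flush(line)


wc = WriteCombiner()
for i in range(8):
    wc.store(0x1000 + 8 * i, 8)   # eight 8-byte stores fill one line
print(wc.flushes)                 # a single 64-byte flush instead of eight writes
```

In this model, eight adjacent 8-byte stores leave the buffer as one full-line transaction, which is the efficiency gain that matters most on a comparatively slow external bus.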