Andy Glew's comp-arch.net wiki, http://semipublic.comp-arch.net
If you are reading this elsewhere, e.g. at site waboba.info, it is an unauthorized copy, and probably a malware site.
comp-arch.net wiki on hold from October 17, 2011
Load-linked/store-conditional (LL/SC)
From CompArch
The load-linked/store-conditional instruction pair provide a RISC flavor of atomic RMW synchronization, emphasizing primitives which can be flexibly composed. It can be viewed as a minimal form of transactional memory.
Part of the RISC philosophy was to espouse load/store architecture - instruction sets that separated load and store memory operations from computational or ALU operations such as add or increment. This works fine for single processor operations, but runs into problems for multiprocessor synchronization.
Originally proposed by Jensen, Hagensen, and Broughton for the S-1 AAP multiprocessor at Lawrence Livermore National Laboratory. Per wikipedia, which has a greatly improved description of LL/SC (http://en.wikipedia.org/wiki/Load-link/store-conditional), and from which this page has borrowed.
Contents |
RISC Machines with LL/SC
Load-linked may be called load and reserve or load-locked.
DEC Alpha: ldl_l/stl_c
PowerPC: lwarx/stwcx
MIPS: ll/sc
ARMv6: ldrex/strex
All provide weak LL/SC, and may fail the SC in cases other than an accessto the same location. PowerPC is strongest, allowing loads and stores to other cache lines inside LL/SC.
Pre-LL/SC
After years of evolution, prior to the so-called RISC revolution multiprocessor synchronization instruction sets were converging on simple atomic RMW instructions such as compare-and-swap, atomic increment memory or fetch-and-add and other fetch-and-ops. Now, these atomic RMWs can be seen as composed of fairly simple primitives, for example:
locked increment memory begin-atomic tmp := load(MEM) tmp := tmp+1 store( MEM, tmp ) end-atomic
However, the implementation of begin-atomic/end-atomic is not necessarily simple.
The atomicity can be provided in a simple way:
locked increment memory tmp := load-locked(MEM) tmp := tmp+1 store-unlock( MEM, tmp )
Where load-locked may be
- implemented at the memory module
- locking the entire module
- or a limited number of memory locations at the module
- or potentially an arbitrary number of memory locations, using per-location lock-bit memory metadata, e.g. stolen from ECC
or
- implemented via a bus lock
or
- implemented via a cache protocol
- e.g. an address lock: acquiring ownership of the cache line, and then refusing to respond to snoops or probes until the address lock was released
- or, more primitive: acquiring ownership of the cache line, and then acquiring a cache lock, locking the entire cache or a fraction thereof
Interestingly, implementing locking at the memory module quite possibly came first, since many early multiprocessor systems were not snoopy or bus-based.
So far, so good: load-locked and store-unlocked are somewhat RISCy.
But they have a problem: load-locked and store-unlocked as separate instructions raise certain security, performance, and reliability problems in some (but not necessarily all) implementations.
E.g. what happens if a user uses load-locked to lock the bus, and never unlocks it? That *might* be interpreted as giving one user exclusive access to the bus.
Obviously, one can eliminate these problems by architecture and microarchitecture. But doing so is a source of complexity. Many, probably most, implementations have found it easier to package the atomic RMW so that the primitive operations are not exposed to the user
Load-linked/store-conditional
The basic idea behind LL/SC is that, instead of guaranteeing forward progress with a load-locked that always succeeds, but which may deadlock, load-linked would employ optimistic concurrency control. Software would assume that it had worked, perform the operation that you want to be atomic, and then try to commit the operation using store-conditional.
If the operation was atomic, i.e. if nobody else had accessed the memory location load-link'ed, the store-conditional would proceed.
But if the operation were non-atomic, if somebody else had accessed the memory location, the store-conditional would fail. And the user software would be required to handled that failure, e.g. by looping.
fetch-and-add:
L: oldval := load-linked MEM
newval := oldval+a
store-conditional( MEM, newval )
if SC_failed goto L
return oldval
Typically LL/SC are implemented via a snoopy bus protocol: a memory location is "linked" by putting its address into a snooper. If another location writes to it, the link is broken so that SC will fail.
Some implementations did not implement an address snooper - they might fail if there was ANY other activity on the bus. Needless to say, this does not exhibit good performance on contended systems.
Non-snoopy implementations of LL/SC are possible. E.g. implementing them at a memory module is possible. TBD.
As with more extensive forms of transactional memory, LL/SC has issues with fairness and forward progress. It is not desirable to loop forever in the LL/SC code. This is solvable - through much the same mechanisms as one uses to implement a locked atomic RMW with load-locked.
One way amounts to converting a load-linked into a load-locked after a certain number of failures.
Synthesizing Complex RMWs
The nice thing about LL/SC is that they can be used to implement many different forms of synchronization. E.g. almost any form of fetch-and-op. Say you want floating point fetch-and-add .. you can build that with LL/SC. Whereas if you don't have LL/SC, and just have a fixed repertoire of integer atomic RMWs, you may just be out of luck.
This *is* one of the big advantages of RISC: provide primitives, rather than full solutions.
The problem, of course, is what was mentioned above: "plain" LL/SC may have problems with fairness and forward progress. Mechanisms that solve these problems for LL/SC may be more complicated than they would be for non-LL/SC instructions.
See list of possible atomic RMW operations.
Remote Atomic RMWs
The "most natural" implementation of LL/SC is to pull the data into registers and/or cache of the processor on which the instructions are executing. By "the instructions" I mean those before the LL, the LL itself, between the LL and SC, the SC itself, and after the SC.
This is also the most natural implementation for locked implementations of atomic RMWs:
LOCK INC mem tmp := load-locked M[addr] dstreg := tmp+1 store-unlock M[addr] := dstreg
I call this the "local implementation" of the atomic RMW.
However, many systems have found it useful to create "remote implementations" of the atomic RMW.
For example, special bus or interconnect transactions could be created that indicate that the memory controller should do the atomic RMW operation itself.
For example, the NYU Ultracomputer allowed many fetch-and-add operations to be exported. They could be performed at the ultimate destination, the memory controller. But they could also be handled specially at intermediate nodes in the interconnection fabric, where conflicting atomic RMWs to the same location could be combined, forwarding only their net effect on towards the destination.
You can imagine such remote atomic RMWs as being performed by
LOCK INC mem send the command "fetch-and-add M[addr],1" to the outside system
This is the advantage of CISCy atomic RMW instructions: they can hide the implementation. If the interconnect fabric supports only fetch-and-add but not fetch-and-OR, then the two respective microoperation expansions might be:
LOCK INC mem send the command "fetch-and-add M[addr],1" to the outside system
LOCK FETCH-AND-OR mem oldval := load-locked M[addr] newtmp := oldval OR src1 store-unlock M[addr] := newtmp
For that matter, the CISC implementation migh well use LL/SC optimistic concurrency control within its microflow.
It is more work to create such a remote atomic RMW with LL/SC, since the very strength of LL/SC - that the user can place almost arbitrarily complicated instruction sequences between the LL and SC is also its weakness. If you wanted to create a remote atomic RMW implementation that supported just, say, fetch-and-add, then you would have to create an idiom recognizer that recognized code sequences such as:
fetch-and-add:
L: oldval := load-linked MEM
newval := oldval+a
store-conditional( MEM, newval )
if SC_failed goto L
return oldval
Actually, you might not have to recognize the full sequence, complete with the retry loop. You would only need to recognize the sequence
oldval := load-linked MEM
newval := oldval+a
store-conditional( MEM, newval )
and convert that into whatever bus operations create the remote atomic RMW.
Recognizing a 3-instruction idiom is not THAT hard. But in general it is harder to combine separate instructions than to break up complex instructions in an instruction decoder.
One might even imagine creating a fairly complete set of instructions executable on the interconnect.
The best of both worlds?
Let me end this section by describing a hybrid implementation of the CISC atomic RMWs that combines the best of both worlds with respect to local and remote atomic RMWs.
LOCK FETCH-AND-OP mem oldval := load-locked/fetch-and-op M[addr] newtmp := oldval OP src1 store-unlock/fetch-and-op M[addr] := newtmp return oldval
If M[addr] is exclusive in the local cache, then the load-locked/fetch-and-op and store-unlock/fetch-and-op are equivalent to ordinary load-locked and store-unlock.
If M[addr]] is not present in the local cache, then load-locked/fetch-and-op is equivalent to sending a remote atomic RMW fetch-and-op command to the external network. The store-unlock/fetch-and-op may then become a NOP (although it may be used for bookkeeping purposes).
If M[addr] is present in the local cache in a shared state, then we *COULD* perform the atomic RMW both locally and remotely. The remote version might invalidate or update other copies. However, if the operation is supposed to be serializing, then the local update cannot proceed without coordinating with the remote.
Extending the semantics
PAC:
Because existing LL/SC definitions either fail or have undefined semantics if other memory operations are performed, the semantics can be safely extended without sacrificing compatibility. (While it may be possible in some architectures (TBD: check whether this is the case for Alpha, MIPS, ARM, Power, etc.) that a non-SC memory operation was used by software to cancel an atomic section, such is generally discouraged and unlikely to have been done.) Such extensibility could allow increasingly more extensive forms of transactional memory to be implemented without requiring additional instructions. While an implementation of an older architecture would not generate an illegal instruction, an architecture which guaranteed failure on additional memory accesses would guarantee either deadlock (which would simplify debugging and fixing) or the use of an already in place software fallback (which, while slower, would maintain correctness). In architectures which use a full-sized register to return the success condition and define a single success condition, that register could be used to communicate failure information.
A natural progression for such extensions might be: to initially constrain the transaction to an aligned block of 32 or 64 bytes or the implementation-defined unit of coherence (This last could cause a significant performance incompatibility but would maintain the semantics.), perhaps with a single write, then to expand the semantics to something like those presented in Cliff Click's "IWannaBit!" where any cache miss or eviction cancels the reservation (For many uses, this would require that the cache be warmed up before the transaction can succeed. If the reservation mechanism is expensive, prefetching data outside of the atomic section might be preferred over retrying the transaction.), perhaps extending writes to an entire cache line, and then to an arbitrary number of memory accesses.
The existing definition of LL/SC does not allow an extension to nestable transactions since LL is defined to generate a new reservation. However, it would still be possible to use SC to close a transaction, using a new instruction to initiate a nestable transaction. (A kludge might be possible in a RISC ISA that has a hardwired zero register--e.g., MIPS and ALPHA--; a LL with a zero-register destination could be used to indicate the start of a nestable transaction. This would not allow an implementation of earlier definition of the ISA to run binaries using this extension; such implementations might filter such out as a nop or treat it simply as an ordinary LL. Using a new instruction would be much more reasonable.)
Providing non-transactional memory accesses within the atomic section would be more difficult. It would be possible to define register numbers which when used as base pointers for memory accesses have non-transactional semantics (See Semantic register numbers). This definition would have to be made at the first extension of LL/SC and it might be difficult to establish compiler support.
A non-transactional extension of LL/SC would be to guarantee the success of the store-conditional under certain limited conditions. E.g., a simple register-register operation might be guaranteed to succeed, providing the semantics of most RMW instructions. Such a guaranteed transaction could itself be extended to allow more complex operations, though it might be desirable to allow a transaction that can be guaranteed to fail if lock semantics are not desired under contention. E.g., a thread might work on other tasks rather than wait for the completion of a heavily contended locked operation.
Historic Trends
Although LL/SC have certain putative advantages, one can observe a historical trend: many ISAs that started off with only LL/SC eventually added atomic RMW instructions.
E.g. one of the original RISCs, Sun SPARC, has atomic instructions LDSTUB (Load Store Unsigned Byte) (the classic test-and-set instruction), SWAP (atomic exchange with memory) (deprecated by SPARC v9), and CAS (Compare And Swap). The latter is a synhetic instruction,
E.g MIPS, one of the original RISCs with LL/SC, added new atomic instructions ASET and ACLR in the MIPS M14K processors. ASET sets a bit within a 32 bit memory location. ACLR clears such a bit.
-
A Microprocssor Report article on the MIPS website, http://www.mips.com/media/files/M46_MIPS_Reprint.pdf, says:
New Atomic Instructions
The M14K and M14Kc processors have two additional new instructions that aren’t part of the microMIPS ISA. Both are atomic memory operations. They are small but important additions to the MIPS architecture, which has been hampered by historical limitations in this regard. (In the 1980s, strict RISC liturgy frowned on CISC-like instructions that perform multiple operations directly on data in memory.) The new atomic instructions perform read-modify-write operations on memory and cannot be interrupted.
ASET and ACLR are microcontroller oriented, and "work by automatically disabling interrupts while they’re busy". They may only be "atomic" wrt interrupts, not in a multiproessor sense.
There is no clear tren, just some Brownian motion. LL/SC is better is some circumstances, not in others.
Other Considerations
LL/SC is basically inspired by snoopy bus cache protocols. Although it can work for uncached memory where transactions can be snooped, it does not work quite so well when processors are connected to (uncached) memory by point-to-pont links, i.e. where they cannot snoop other processors' transactions.
Various implementations of LL/SC may fail the SC on various inexaxt conditions, sometimes gong so far as to fail if there is any intervening memory transaction, to completely different addresses. Wikipedia says this is called weak LL/SC, and breaks many theoretical LL/SC operations.
LL/SC is more difficult to emulate than specific atomic RMWs like CAS.
Single stepping through LL/SC code often guarantes that the SC will fail. I.e. debugging can be interesting.
See Also
- locked versus atomic
- Returning the old versus new value (e.g. fetch-and-increment versus increment-memory)
- atomic RMW spinloops and CISCy instructions
References
- Eric H. Jensen, Gary W. Hagensen, and Jeffrey M. Broughton. "A New Approach to Exclusive Data Access in Shared Memory Multiprocessors", Technical Report UCRL-97663, Lawrence Livermore National Laboratory, Nov. 1987. https://e-reports-ext.llnl.gov/pdf/212157.pdf
Variant links ^
Pages that should(!) link here. (Would be better if these links were automatically created: see Link Here versus Link There)
- load-linked/store-conditional
- load-linked/store-conditional (LL/SC)
Compensating for Buggy Hardware
I love this quote from http://www.ibm.com/developerworks/library/pa-atom/:
All the code presented in this article is instructional and not intended for production use. In particular, real-world atomic code tends to have sync instructions sprinkled throughout. This is usually to compensate for buggy hardware. This is why, in general, you'll want to use the atomic functions provided by your operating system or libraries instead of writing your own.
s/atomic functions provided by your operating system or libraries/ -> /atomic functions provided by your instruction set/ to understand one motivation for non-LL/SC code - and for strong memory ordering models.