About Determinism

Luprex uses two different kinds of determinism.

Synchronous Model Determinism. Predictive reexecution uses four world models, including a server-synchronous model and a client-synchronous model. These two models are fed the same events, and must remain in the same state after executing the same events. See the document "Predictive Reexecution" for an explanation of why these models exist. If you were to compare the two models, they would be equal in the Lisp sense of equal, but not in the sense of eq, because corresponding data structures are not at the same memory address.

Replay Log Determinism. The server stores a log of all events it feeds into the Luprex DLL. It can replay a log by feeding the same events into a new copy of the Luprex DLL. When replaying a log, the new copy of the Luprex DLL reproduces the original execution right down to the memory level: every data structure is at the same address, every byte of memory is the same. This is the eq level of equivalence.

These two forms of determinism serve different purposes and impose different costs.
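The equal/eq distinction can be illustrated with a minimal C++ sketch. The `Model` struct and `run_events` function are hypothetical stand-ins, not part of Luprex. Two models built from the same events compare equal by value even though they live at different addresses. The stronger replay guarantee (same addresses across two separate runs) cannot be shown inside a single process, so only the value-versus-identity contrast appears here.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for one world model's state.
struct Model {
    std::vector<std::string> entities;
    bool operator==(const Model& other) const { return entities == other.entities; }
};

// Apply the same (hypothetical) event stream to a fresh model.
std::unique_ptr<Model> run_events(const std::vector<std::string>& events) {
    auto model = std::make_unique<Model>();
    for (const auto& e : events) model->entities.push_back(e);
    return model;
}

// Value-level comparison, analogous to Lisp `equal`.
bool same_value(const Model& a, const Model& b) { return a == b; }

// Identity comparison, analogous to Lisp `eq`: true only for the same object in memory.
bool same_object(const Model& a, const Model& b) { return &a == &b; }
```

Two models produced by `run_events` from the same event list satisfy `same_value` but not `same_object`, which is the level of agreement the synchronous models need.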

Implementing Synchronous Model Determinism

To get the two synchronous models to be deterministic enough, we had to take several steps:

  • Deterministic Lua table iteration. We patch the Lua runtime so that iterating over a table always produces keys in the same order. The order depends only on the order in which the keys were inserted, not on the memory layout.
  • No iterating over C++ unordered maps. Unordered maps produce elements in an unspecified order that can depend on memory addresses (for example, when the keys are pointers). Since addresses differ between the two models, iteration order could differ, breaking value-level determinism. An exception: iterating an unordered map and then immediately sorting the results into a predictable order is allowed, because the randomness is sandboxed.
  • No genuinely random numbers. We do not use truly random numbers in the world model. We do use pseudorandom numbers, but we store the generator's state as part of the world model and maintain it using difference transmission.
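Two of the rules above can be sketched in C++. This is an illustration under stated assumptions, not the engine's actual code: `WorldModel`, `sorted_keys`, and `next_random` are hypothetical names, and the generator is a plain 64-bit linear congruential step (using the multiplier and increment commonly attributed to Knuth's MMIX) chosen only because it is short.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical world-model fragment. The PRNG state is an ordinary field of
// the model, so it is synchronized like any other model data.
struct WorldModel {
    std::unordered_map<std::string, int> scores;
    std::uint64_t rng_state = 1;
};

// The allowed exception: walk the unordered_map, but sort the keys right away
// so callers never observe the map's own iteration order.
std::vector<std::string> sorted_keys(const WorldModel& m) {
    std::vector<std::string> keys;
    keys.reserve(m.scores.size());
    for (const auto& kv : m.scores) keys.push_back(kv.first);
    std::sort(keys.begin(), keys.end());
    return keys;
}

// One linear congruential step. The only state it reads or writes is the
// field stored inside the model.
std::uint64_t next_random(WorldModel& m) {
    m.rng_state = m.rng_state * 6364136223846793005ULL + 1442695040888963407ULL;
    return m.rng_state >> 33;
}
```

Two models that hold the same entries return the same `sorted_keys` result regardless of insertion order, and two models with the same `rng_state` draw the same sequence from `next_random`.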

Bit-Exact Determinism: Replay Debugging

Bit-exact determinism enables replay debugging. It is valuable but expensive, and its cost-benefit tradeoff is an open question.

As the server runs, the driver can write a log of every event it feeds into the driven portion. Later, a new DrivenEngine can be created and fed those same events from the log file. The goal of bit-exact determinism is that during this replay, the DrivenEngine does the exact same thing it did during the live run, right down to every data structure being at the same memory address.

Why does this matter? If the server crashed during the live run, the replay will crash in exactly the same way. You can run the replay inside a debugger, single-step right up to the crash, and examine the exact same pointers and memory layout that existed during the original crash.

Value-level determinism alone is not sufficient for this. If the replay produces the same logical state but at different memory addresses, then pointer-related bugs (buffer overruns, use-after-free, etc.) might not reproduce. Bit-exact determinism ensures they do.
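The record-and-replay loop can be sketched as follows. This is a toy model under stated assumptions: the `Event` layout, the `Driver` class, `replay`, and the checksum logic are all hypothetical, and this `DrivenEngine` is a stand-in that borrows only the name used above. The sketch demonstrates only that replaying the log reproduces the same logical state. Reproducing the same memory addresses additionally depends on the constraints described in the rest of this section.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical event: everything the driven portion learns arrives this way.
struct Event {
    enum class Kind { Tick, Input } kind;
    std::uint64_t value;  // clock reading for Tick, payload for Input
};

// Toy stand-in for the driven portion: no I/O and no clock calls; its state
// changes only in response to fed events.
class DrivenEngine {
public:
    void feed(const Event& e) {
        if (e.kind == Event::Kind::Tick) now_ = e.value;
        else checksum_ = checksum_ * 31 + e.value + now_;
    }
    std::uint64_t checksum() const { return checksum_; }
private:
    std::uint64_t now_ = 0;
    std::uint64_t checksum_ = 0;
};

// Toy driver: records each event before handing it to the engine.
class Driver {
public:
    void deliver(DrivenEngine& engine, const Event& e) {
        log_.push_back(e);
        engine.feed(e);
    }
    const std::vector<Event>& log() const { return log_; }
private:
    std::vector<Event> log_;
};

// Replay: feed the logged events, in order, into a fresh engine.
DrivenEngine replay(const std::vector<Event>& log) {
    DrivenEngine fresh;
    for (const Event& e : log) fresh.feed(e);
    return fresh;
}
```

Because the current time reaches the engine only as a `Tick` event, the replayed engine sees the same clock readings as the live run, even though the replay happens later.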

To implement replay determinism, we took several difficult steps:

  • The Driver/Driven Partition. The Luprex engine is split into an event-driven portion and an event driver. The driven portion contains all the game logic. The driver is mainly for I/O. The driven portion cannot contain any I/O. That includes:

    • Clocks only in the Driver. The driven portion cannot call system functions to obtain the current time. However, the driver can feed the current time into the driven portion as an event.
    • Lua source files only in the Driver. The driven portion cannot read Lua source files. It can, however, enter a state that indicates to the driver that it wants a Lua source file. Then, the driver can feed the Lua source file in as an event.
    • Sockets only in the Driver. The driven portion cannot open TCP/IP sockets. However, it can enter a state that indicates its desire to make a TCP/IP connection, and then the driver can do it and feed the data into the driven portion.
  • The eng::malloc heap. A custom memory allocator positioned at a fixed address, used exclusively by the driven portion. The memory allocator, if asked to perform the same sequence of malloc/free operations, will return the same addresses.

  • No threads in the driven portion. Thread scheduling is nondeterministic at the OS level, so we cannot use threads in the driven portion.
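The allocator property described above (the same sequence of malloc/free operations yields the same addresses) can be illustrated with a toy arena. This is not the eng::malloc implementation: `ToyArena` is a hypothetical first-fit allocator that hands out offsets into a fixed-size region rather than real pointers. Placing such a region at a fixed address is platform-specific and is not shown here.

```cpp
#include <cstddef>
#include <vector>

// Toy arena that hands out offsets into a fixed-size region. Its decisions
// depend only on the sequence of calls, never on the OS heap.
class ToyArena {
    struct Block { std::size_t offset; std::size_t size; bool free; };
    std::vector<Block> blocks_;

public:
    static constexpr std::size_t kNone = static_cast<std::size_t>(-1);

    explicit ToyArena(std::size_t capacity) {
        blocks_.push_back(Block{0, capacity, true});
    }

    // First-fit: take the lowest-offset free block that is large enough,
    // splitting off any unused tail as a new free block.
    std::size_t allocate(std::size_t size) {
        for (std::size_t i = 0; i < blocks_.size(); ++i) {
            if (!blocks_[i].free || blocks_[i].size < size) continue;
            const std::size_t offset = blocks_[i].offset;
            const std::size_t remainder = blocks_[i].size - size;
            blocks_[i].size = size;
            blocks_[i].free = false;
            if (remainder > 0)
                blocks_.insert(blocks_.begin() + i + 1, Block{offset + size, remainder, true});
            return offset;
        }
        return kNone;  // no free block is large enough
    }

    // Mark the block at this offset free again (no coalescing in this sketch).
    void release(std::size_t offset) {
        for (auto& b : blocks_)
            if (b.offset == offset && !b.free) { b.free = true; return; }
    }
};
```

Two `ToyArena` instances given the same sequence of `allocate` and `release` calls return the same offsets at every step. Combined with a fixed base address, that is what would let a replay place every data structure where the live run did.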

Should We Ditch Replay Determinism?

Implementing synchronous model determinism is necessary for predictive reexecution. It is non-negotiable.

On the other hand, replay log determinism is not necessarily required for us to have a usable engine. We could ditch it. It certainly does impose a lot of difficult constraints on the engine.

The driver/driven distinction certainly required us to tie ourselves into knots in some parts of the engine design. But that's pretty baked in at this point; we're probably never going to change it.

However, replay determinism also imposes a no-threads requirement. That is certainly a bummer from a performance perspective.

Lua Scripters Don't Need to Worry

The Lua environment is carefully sandboxed to be deterministic at both levels without any effort from the scripter. Lua's random number generators are seeded pseudorandom generators owned by the driven portion. Table iteration is patched to be deterministic. Lua "threads" (coroutines) are not real OS threads and don't run concurrently. The scripter writes ordinary Lua code and gets determinism for free.