### About Determinism
Luprex uses two different kinds of determinism.
**Synchronous Model Determinism** Predictive reexecution
uses four world models, including a server-synchronous and
client-synchronous model. These two models are fed the same
events, and must remain in the same state after executing
the same events. See the document "Predictive Reexecution"
for an explanation of why these models exist. If you were to
compare the two models, they would be equal in
the Lisp sense of `equal`, but not in the sense of `eq`,
because corresponding data structures are not at the
same memory address.
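
To make the `equal`/`eq` distinction concrete, here is a small C++ sketch. The `Entity` type and the helper names are hypothetical, invented for illustration; they are not Luprex types. Value comparison plays the role of `equal`, and address comparison plays the role of `eq`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a piece of world-model state.
struct Entity {
    int id;
    std::string name;
    std::vector<int> inventory;

    // Value comparison: the analogue of Lisp `equal`.
    bool operator==(const Entity& other) const {
        return id == other.id && name == other.name &&
               inventory == other.inventory;
    }
};

// Analogue of Lisp `equal`: same contents, regardless of address.
bool same_value(const Entity& a, const Entity& b) { return a == b; }

// Analogue of Lisp `eq`: literally the same object in memory.
bool same_object(const Entity& a, const Entity& b) { return &a == &b; }

// The kind of check one might run after both synchronous models have
// processed the same events: contents must match, addresses need not.
void check_in_sync(const Entity& server_side, const Entity& client_side) {
    assert(same_value(server_side, client_side));
}
```
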
**Replay Log Determinism** The server stores a log of all
events it feeds into the Luprex DLL. It can replay a log by
feeding the same events into a new copy of the Luprex DLL.
When replaying a log, the new copy of the Luprex DLL
reproduces the original execution right down to the memory
level: every data structure is at the same address, every
byte of memory is the same. This is the `eq` level of
equivalence.
These two forms of determinism serve different purposes and
impose different costs.
## Implementing Synchronous Model Determinism
To get the two synchronous models to be deterministic
enough, we had to take several steps:
- **Deterministic Lua table iteration.** We patch the Lua
runtime so that iterating over a table always produces
keys in the same order. The order depends only on
the order in which the keys were inserted, not on the
memory layout.
- **No iterating over C++ unordered maps.** Unordered maps
produce elements in an order that depends on memory
addresses. Since addresses differ between the two models,
iteration order would differ, breaking value-level
determinism. An exception: iterating an unordered map and
then immediately sorting the results into a predictable
order is allowed, because the randomness is sandboxed.
- **No genuinely random numbers.** We do not use random
numbers in the world model. We do use pseudorandom
numbers, but we store the generator's state as part of the
world model and maintain it using difference transmission.
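
As a rough illustration of the last two rules, the sketch below shows the sort-after-iterating pattern and a pseudorandom generator whose whole state is a single integer field. The names `sorted_keys` and `ModelRng` are hypothetical, and the linear congruential step is only a stand-in; this document does not say which generator Luprex uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Collect the keys of an unordered map and sort them, so that callers
// never observe the map's unspecified iteration order.
std::vector<std::string> sorted_keys(
        const std::unordered_map<std::string, int>& m) {
    std::vector<std::string> keys;
    keys.reserve(m.size());
    for (const auto& kv : m) keys.push_back(kv.first);  // order unspecified here
    std::sort(keys.begin(), keys.end());                // fixed after sorting
    return keys;
}

// Hypothetical PRNG whose entire state is one plain integer, so it can be
// stored in the world model and synchronized like any other field.
struct ModelRng {
    std::uint64_t state;

    std::uint32_t next() {
        // 64-bit linear congruential step (constants from Knuth's MMIX).
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;
        return static_cast<std::uint32_t>(state >> 32);
    }
};
```

Two models that start from the same `state` and call `next()` the same number of times draw the same numbers and end with the same `state`, which is what value-level determinism requires.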
## Bit-Exact Determinism: Replay Debugging
Bit-exact determinism enables replay debugging. It is
valuable but expensive, and its cost-benefit tradeoff is an
open question.
As the server runs, the driver can write a log of every
event it feeds into the driven portion. Later, a new
DrivenEngine can be created and fed those same events from
the log file. The goal of bit-exact determinism is that
during this replay, the DrivenEngine does the *exact* same
thing it did during the live run, right down to every data
structure being at the same memory address.
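
The logging and replay loop might look roughly like the following C++ sketch. The `Event` layout, the `feed` method, and the helper functions are assumptions made for illustration, the `DrivenEngine` class here is a toy stand-in rather than the real one, and a `std::vector` stands in for the log file. The sketch only shows that replay reproduces the same state value; reproducing the same memory addresses additionally depends on the allocator and partitioning measures described in this document.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical event record; the real Luprex event format is not shown here.
struct Event {
    std::uint64_t time_ms;  // current time, supplied by the driver
    std::string payload;    // e.g. a network message or a Lua source file
};

// Toy stand-in for the driven portion: it has no clocks, files, or sockets,
// so its state is a pure function of the events fed to it.
class DrivenEngine {
public:
    void feed(const Event& e) {
        // Fold the event into the state (FNV-1a style mixing, illustration only).
        mix(e.time_ms);
        for (unsigned char c : e.payload) mix(c);
    }
    std::uint64_t state() const { return state_; }

private:
    void mix(std::uint64_t v) {
        state_ ^= v;
        state_ *= 1099511628211ULL;
    }
    std::uint64_t state_ = 14695981039346656037ULL;
};

// Driver side of a live run: log the event first, then feed it, so that an
// event which crashes the engine is still present in the log.
void feed_and_log(DrivenEngine& engine, std::vector<Event>& log,
                  const Event& e) {
    log.push_back(e);
    engine.feed(e);
}

// Replay: feed the logged events, in order, into a fresh engine.
DrivenEngine replay(const std::vector<Event>& log) {
    DrivenEngine fresh;
    for (const Event& e : log) fresh.feed(e);
    return fresh;
}
```
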
Why does this matter? If the server crashed during the live
run, the replay will crash in exactly the same way. You can
run the replay inside a debugger, single-step right up to
the crash, and examine the exact same pointers and memory
layout that existed during the original crash.
Value-level determinism alone is not sufficient for this. If
the replay produces the same logical state but at different
memory addresses, then pointer-related bugs (buffer
overruns, use-after-free, etc.) might not reproduce.
Bit-exact determinism ensures they do.
To implement replay determinism, we took several
difficult steps:
- **The Driver/Driven Partition.** The Luprex engine is split
into an event-driven portion and an event driver. The driven
portion contains all the game logic. The driver is mainly
for I/O. The driven portion cannot contain any I/O. In
particular:
  - **Clocks only in the Driver.** The driven portion cannot
    call system functions to obtain the current time.
    However, the driver can feed the current time into the
    driven portion as an event.
  - **Lua source files only in the Driver.** The driven
    portion cannot read Lua source files. It can, however,
    enter a state that indicates to the driver that it
    wants a Lua source file. Then the driver can feed
    the Lua source file in as an event.
  - **Sockets only in the Driver.** The driven portion
    cannot open TCP/IP sockets. However, it can enter
    a state that indicates its desire to make a TCP/IP
    connection, and then the driver can make the connection
    and feed the data into the driven portion.
- **The eng::malloc heap.** A custom memory allocator
positioned at a fixed address, used exclusively by the
driven portion. The memory allocator, if asked to
perform the same sequence of malloc/free operations,
will return the same addresses.
- **No threads in the driven portion.** Thread scheduling is
nondeterministic at the OS level, so we cannot use threads
in the driven portion.
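
The allocator property above (same sequence of operations, same addresses) can be sketched as follows. This is not the real `eng::malloc`: the class name, the 16-byte size classes, and the free-list policy are all assumptions made for illustration. The sketch hands out offsets into an arena rather than pointers; if the arena were mapped at a fixed base address, equal offsets would mean equal pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical sketch of a deterministic allocator. Its decisions depend
// only on the sequence of allocate/release calls, never on OS state.
class DeterministicArena {
public:
    explicit DeterministicArena(std::size_t capacity) : capacity_(capacity) {}

    // Returns the offset of a block of at least `size` bytes.
    std::size_t allocate(std::size_t size) {
        assert(size > 0);
        std::size_t rounded = round_up(size);
        std::vector<std::size_t>& free_list = free_[rounded];
        if (!free_list.empty()) {
            // Reuse the most recently released block of this size class.
            std::size_t offset = free_list.back();
            free_list.pop_back();
            return offset;
        }
        assert(top_ + rounded <= capacity_);  // sketch: no out-of-memory path
        std::size_t offset = top_;
        top_ += rounded;
        return offset;
    }

    // Returns a block to the free list for its size class.
    void release(std::size_t offset, std::size_t size) {
        free_[round_up(size)].push_back(offset);
    }

private:
    static std::size_t round_up(std::size_t size) {
        return (size + 15) / 16 * 16;  // 16-byte size classes
    }

    std::size_t capacity_;
    std::size_t top_ = 0;
    // std::map rather than an unordered map: ordering depends only on keys.
    std::map<std::size_t, std::vector<std::size_t>> free_;
};
```

Two arenas given the same calls in the same order return the same offsets, which is the behavior replay relies on. Pinning the arena at a fixed address is platform-specific and is not shown here.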
## Should We Ditch Replay Determinism?
Implementing synchronous model determinism is necessary
for predictive reexecution. It is non-negotiable.
On the other hand, replay log determinism is not necessarily
required for us to have a usable engine. We could ditch it.
It certainly does impose a lot of difficult constraints on
the engine.
The driver/driven distinction certainly required us to tie
ourselves in knots in some parts of the engine design.
But that's pretty baked in at this point, and we're probably
never going to change it.
However, replay determinism also imposes a no-threads
requirement. That is certainly a bummer from a performance
perspective.
## Lua Scripters Don't Need to Worry
The Lua environment is carefully sandboxed to be
deterministic at both levels without any effort from the
scripter. Lua's random number generators are seeded
pseudorandom generators owned by the driven portion. Table
iteration is patched to be deterministic. Lua "threads"
(coroutines) are not real OS threads and don't run
concurrently. The scripter writes ordinary Lua code and gets
determinism for free.