integration/Docs/Difference-Transmission-with-Threads.md

### Concurrent Difference Transmission

Our system difference transmits sequentially. First, it transmits to client 1. Then, it transmits to client 2. It keeps going until it has done all the clients. We came up with the idea that it would be awesome if the system could use threads to difference transmit to all clients simultaneously.

Suppose you were to create one thread per client. Each thread would be responsible for comparing the master model to a single client model. In that case, all of the threads would be reading from the master model at the same time. What if we were to try to do this without any mutexes?

In the abstract, you would not expect that to be a problem. Almost all data structures in C++ are thread-safe for multiple readers at the same time. For example, if I create a std::map, then I can have an unlimited number of threads reading from the map at the same time, and there's no conflict. The only time you need a mutex is if somebody is mutating the data structure. In difference transmission, nobody needs to mutate the master model, so you would think it would be possible for all those threads to read from the master model at the same time, with no mutex.

The following sections will explain why this idea falls apart, and what the problems are. Most of the problems have to do with the Lua runtime, which simply wasn't designed with threads in mind.

Then, the sections after that will tell you some solutions that solve some of the problems, but each solution has its own issues. Currently, there isn't any one obvious solution that would solve everything at the same time.

## Using Threads: The Stack Problem

Let's say you want to read from a Lua table, you want to get the value of table[5]. You would use lua_rawget to do this, lua_rawget reads a key-value pair from a table. It doesn't mutate the table. But to do a lua_rawget, you have to push the key onto the stack. Pushing a value onto a stack is mutating that stack. Then, lua_rawget pops the key from the stack and pushes the result on the stack. So lua_rawget doesn't mutate the table it's reading from, but it *does* mutate the stack.

All of the functions in the Lua API mutate the lua stack. If you're using the Lua API to read parts of the master model, then you're mutating the master model. There's simply no way to "read only" from the master model, because every function in the lua API mutates the lua stack.

However, there is a potential solution for the stack problem. We could create a lua stack for each client thread. Each thread would indeed be mutating its own private lua stack, but since there is only one thread per stack, there's no race conditions. That eliminates the contention over the lua stack.

Each lua stack would have to be sized to make room for "big" recursion. It would not be feasible to enlarge the stacks after the fact.

## Using Threads: The Garbage Collector Problem

The lua garbage collector is, I believe, an incremental garbage collector. It does a little bit of garbage collection work every time you call a Lua API function. So that, unfortunately, makes every single API function a mutator of the lua heap.

I believe that lua has the means to shut down the garbage collector entirely for a time. We could just plain turn it off during difference transmission.

I'm not sure this is as simple as it sounds. Incremental garbage collectors have interesting things they do with marking objects 'white' and 'black', and these flag-setting operations are scattered all over the code.

## Using Threads: The Visited Bit Problem

Our difference transmission code traverses tables recursively. Because tables can contain cycles, ie, two tables can point to each other, the recursive traversal algorithm needs to use a *visited* bit to avoid getting into an infinite loop. Currently, we store the bit in the table itself, because that's fast.

However, it's also mutation of the table. It's not compatible with a read-only approach to scanning the master world model. We could change the code to store the visited set in some other way, it's solvable, but we need to be aware of this.

## Using Threads: The String Problem

Suppose you want to read a table: you want to find out the value of table["foo"]. To do that with the Lua API, you have to do a lua_pushstring("foo"). That, of course, mutates the lua stack, but we've solved the stack problem. Unfortunately, it also allocates dynamic memory in the lua runtime to hold the string "foo", and it also puts that string into Lua's central string table. All of the threads would be mutating the lua memory heap and the central string table every time they do a lua_pushstring.

We could solve this problem with a lua patch. Currently, lua's central string table only contains short strings. We could modify it so that all strings go into the central string table, even long ones. That way, every string used anywhere in the lua runtime will be in the central string table.

Then, we could create an API function lua_pushexistingstring. This pushes a string onto the stack, but only if the string is already present in the lua central string table. This function doesn't ever allocate any new strings. This function would be sufficient for table lookups with string keys.

The routine internshrstr in the lua runtime contains an 'isdead' check. If isdead is true, then I think we need to treat the string as if it's not already present in the central string table.

## Using Threads: The Cache Problem

It is entirely possible that something within the lua runtime uses caching.

In fact, I believe I have found an example of caching: metamethod lookup. When you do a lua_gettable, it has to check to see if the table has a metatable, and if it does, it has to check if the metatable has an "INDEX" metamethod. This is a lot of extra work for every single table lookup. I believe that the lua runtime uses a fairly simple trick to optimize this. After it looks for the metamethod "INDEX", if there is no such metamethod, it stores a flag in the table to indicate, "don't bother looking for an INDEX, there isn't one." The next time you do a table lookup on that table, it skips the whole INDEX check.

Unfortunately, that means that when you do a lua_rawget on a table, you may be mutating the flag bit. The flag bit is probably not a show-stopper, because it's just a bit. If two threads try to set it to "true" at the same time, then it still gets set to true, no problem. There's also no reason that two threads would disagree about whether it should be true or false: either INDEX is there, or it's not. The threads should agree.

The real problem is: what if something else in the lua runtime uses caching in a more complicated way? The Lua codebase is too large to reliably search for examples. I think this is just a significant risk and I don't see a simple solution.

## Using Threads: The Nondeterminism Problem

The code to do difference transmission uses the memory allocator. Of course, there are lots of thread-safe memory allocators that we could use. But even if the memory allocator were thread-safe, having multiple threads accessing it concurrently would lead to a situation where the memory addresses of things in the heap become unpredictable and nondeterministic.

We have worked hard to ensure the determinism of the Luprex library: see the document "The Event-Driven Structure of the Engine" for explanations of why we want determinism, and how threads screw up the determinism. I don't see a simple solution to the determinism problem, other than to not use threads, or to give up on determinism.

## Using Threads: The Unknown Problem

So I've done my damndest to find things in the lua runtime that cause "unexpected mutation." I found the stack problem, the garbage collector problem, the string problem, and the cache problem. But what if there's a fifth problem that I didn't find? We could write a big complicated multithreaded difference transmission system, and it might just crash randomly, and we wouldn't know why.

I don't see any simple solution to this problem. Lua just wasn't written with multiple threads in mind.

## Solution: Using Mutexes

One solution to all this would be to use a mutex to control access to the master model. Obviously, if there was only one thread inside the master lua at a time, then most of the problems above would be solved (not all: the determinism problem would still be an issue).

The thing I worry about is that if we do this, then pretty much, all the threads are just going to do nothing until they get their turn at the master model. I'm afraid that using a mutex isn't a solution, it just turns it back into sequential difference transmission.

We could try to do very fine-grained locking, where we lock the mutex, read one value from one table, and then unlock the mutex again. I'm afraid that the overhead of the locking would exceed the amount of time saved.

## Solution: Writing a Brand-New Lua API

Could we add a brand new API to lua, one that is designed from the ground up to not mutate anything? I think it's conceivably possible, but it would be a LOT of code.

So first of all, to solve the stack problem, the new API would not be passing parameters by pushing them on the lua stack. Instead, we could just pass parameters the regular way, as… well, C parameters. Normally, the garbage collector can't allow us to put lua pointers into C variables, but if we shut down the garbage collector when using this API, that would not be a problem. So this API would be set up so that it only works when the garbage collector is off.

So that part is pretty straightforward. But then, we'd have to actually write the code for the getters. For example, you want to write a replacement for lua_gettable. Our replacement might look like this:

```cpp
LuaValue our_lua_gettable(LuaValue table, LuaValue key);
```

We could write that… but we'd have to totally rewrite the code for table lookup, otherwise we might trigger the caching problem I told you about, or the unknown problem I told you about.

So what I'm saying is: this is certainly possible, but it's a lot of work.

## Solution: Writing a Brand-New Lua Implementation

We could just rewrite lua from scratch. If we did so, we could make it entirely thread-safe for multiple readers. That sounds crazy, but lua's a simple language. I don't think it would take that long to write a new implementation.

If we did that, it might make sense to write a non-incremental garbage collector. That would hugely simplify the whole garbage collection issue.

The big problem, though, is that we've implemented *tons* of C++ functions that hook into lua. All of that code would have to be rewritten.

## Solution: Serializing the Master World Model

Another solution to the contention over the master world model is that before the difference transmission process begins, we could serialize the entire master world model, turning it into the "master world string". Then, we could start a thread for each client. Each thread would deserialize the master world string, so each thread would then have a copy of the master world model all to itself. This solves all of the issues having to do with contention for the master world model.

In fact, you could optimize it further: since it's really just the lua interpreter where the problems reside, you could just serialize the lua part of the master world model. The threads could concurrently read from the C++ parts of the master world model, but they would each have their own copy of the lua parts.

The problem with this is that it requires copying the entire lua interpreter of the master world model using the eris serialization code. I don't think eris is designed to be particularly fast.

## Solution: Memcpy of the Master Lua Interpreter

Suppose that we were to arrange for the master world models' lua interpreter to reside in a custom memory allocator, so that all the lua data structures live in a contiguous region of memory. This is quite feasible, lua allows us to hook the memory allocator. In that case, each thread could make a memcpy of the master world model's lua interpreter by simply block-copying the lua interpreter's memory region. Then, each thread would have its own copy of the master world model's lua interpreter. This would solve all the issues of contention within Lua.

Unfortunately, the copies wouldn't be valid, because all the pointers in the copy would point back at the original. To make it work, you'd have to fix up the pointers. Fixing up the copy would consist of finding all the pointers in the copy, and adding the correct offset to all of them.

The fixup code would be a recursive routine to traverse all of lua's data structures. It would not be especially complicated, but it would be a lot of code. It would also have to be a lua interpreter patch. It would have to be 100% thorough, you would have to make sure you didn't miss a single pointer. Despite these difficulties, this actually seems relatively feasible to me, even if it is a lot of work.

One problem with this approach is that it still takes time proportional to the size of the master world model, though I believe it would be a pretty fast traversal. The other problem is that it still suffers from the thread nondeterminism problem.

Despite these two problems, I think this is one of the most promising approaches.

## Solution: Memcpy of the Master Lua Interpreter with Segmentation Protection

This solution is similar to the previous one: you store the master world model's lua interpreter in a contiguous region of memory, then when it's time for the clients to difference transmit, you make a memcpy for each client. Again, the copies would not be valid because all the pointers in the copies would point back to the original.

But this solution differs in that instead of fixing up the pointers in the copy, we allow them to remain as-is. We then apply an *mprotect* to the original master world model. When any code tries to follow a pointer that's in one of the copies, they will try to read the original master world model, and they will be blocked by the *mprotect*: a segmentation fault will occur. We trap that segmentation fault, and the segmentation fault handler fixes up the access by adding the correct offset at that time.

This could be slow because of a lot of invocations of the segmentation handler, but it might still work.

## Solution: Using Linux Fork

Instead of creating a separate thread for each client, call linux *fork* once per connected client. The child processes are all clones of the parent process, therefore, each child process has a copy of the master world model. The child process would compare one client model to the master model, and generate a difference transmission in the form of a string. It would then hand the difference string back to the parent process, and then the child would shut itself down.

The simplicity of this approach is amazing. We're talking about a couple dozen lines of code. This solution is so clean that it even solves the determinism problem, as long as the master process forks all the children in a deterministic order, and then reads back the solutions in a deterministic order.

But, we have to think a little about what's happening behind the scenes. When you do a fork, Linux sets up a page tables of the parent and child so that they both are sharing 100% of their pages: there's only one copy of everything. But the page-table entries are all flagged as "copy-on-write". If either the parent or child tries to write to any of those pages, then linux makes two copies of the page, one for the parent, one for the child, and then the write is allowed to happen.

It's kind of interesting to think about what happens when difference transmitting the master model. All those problems: the string problem, the stack problem, the garbage collector problem, etc – all of them trigger mutations, and those mutations in turn activate Linux's copy-on-write mechanism. If the size of the mutations are small - just writing a few bytes - then just one or two pages get copied-on-write.

The problem with this approach is that *fork* can be slow. Fork has to set up the page tables for the child process, which takes an amount of time proportional to the memory size of the parent process. Likely, *fork* is too slow for our purposes, but I haven't measured it.

You could improve the situation by just not forking so many children. For example, you could always fork two children, and assign each child half of the clients. That would give about a factor of two speedup, in theory. And two forks might not be so expensive.

It is possible, in Linux, to reconfigure the page size from 4KB to 2MB. Those are the only two sizes that x86-64 supports. Doing this vastly reduces the size of the page tables, with the effect that fork is much faster. But, it also makes the copy-on-write operation much slower. This would dramatically alter the performance of the forking approach, but would it be faster or slower? I don't know.

Of all the solutions here, I think the fork solution is the one most likely to be fruitful.

## Solution: Using Linux Posix_Spawn and MMap

Linux *fork* is slow because it has to create a copy of the parent process. This is usually a waste of time, because 99% of the time that somebody calls *fork*, they then immediately call *exec*, which throws out the copy that took so long to create. To solve this problem, Linux has a newer function, *posix_spawn*, which acts a lot like *fork-and-exec* in one step, but it entirely avoids the copying overhead.

Unfortunately, that doesn't seem very useful for us: we wanted the child to be a copy of the parent process, so that it would have a copy of the master and client world models. But Linux also has ways for two processes to manually share parts of their memory. You can even arrange for two processes to share copy-on-write regions of memory. This is all done through the *mmap* system call.

If we could somehow get the master world model to reside in a contiguous block of memory, we could spawn a child process using the fast posix_spawn, then we could manually hand the child process a copy-on-write copy of that region of memory holding the master world model. This would of course, require the copying of certain page-table pages, but it wouldn't involve copying the entire parent process - only the part of the parent process that holds the master world model. This would probably be a vastly smaller block of memory.

The big problem here is that each model would have to have its own memory heap. That, in turn, would lead to a new type of bug where you might accidentally allocate something using one memory allocator, and free it using a different memory allocator, causing heap corruption. This sounds very hard to navigate to me. I think this is just too error-prone.