MV3D Development Blog

February 8, 2009

Redundancy, check.

Filed under: Uncategorized — SirGolan @ 2:42 pm

After about a week of banging my head against the wall trying to figure out a better way to do redundancy in MV3D, I think I’ve finally got it worked out. I even have code to back it up. What I’ve created is a redundant pool of objects. Servers can join or leave the pool, and objects are replicated to all servers in the pool. One server is the master and the rest are slaves. If the master goes down, one of the slaves picks up the slack. The master can re-join the pool as a slave. It seems to work great.
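To make the idea concrete, here is a minimal in-memory sketch of such a pool. It is not MV3D's actual code: the `RedundantPool` class, its method names, and the rule of promoting the longest-joined slave are all assumptions made for illustration, and real servers would talk over the network rather than share a dict.

```python
import copy


class RedundantPool:
    """Toy model of a redundant object pool (hypothetical, not MV3D's API)."""

    def __init__(self):
        self.servers = {}   # server name -> {object id: object state}
        self.order = []     # join order, used to choose the next master
        self.master = None  # name of the current master, if any

    def join(self, name):
        """Add a server; it starts with a copy of everything in the pool."""
        if name in self.servers:
            raise ValueError("server %r is already in the pool" % name)
        if self.master is None:
            snapshot = {}
        else:
            snapshot = copy.deepcopy(self.servers[self.master])
        self.servers[name] = snapshot
        self.order.append(name)
        if self.master is None:
            self.master = name  # the first server in becomes the master

    def leave(self, name):
        """Remove a server; if it was the master, promote a slave."""
        del self.servers[name]
        self.order.remove(name)
        if name == self.master:
            # Assumption: promote whichever slave joined earliest.
            self.master = self.order[0] if self.order else None

    def put(self, obj_id, state):
        """Write an object and replicate it to every server in the pool."""
        if self.master is None:
            raise RuntimeError("the pool has no servers")
        for store in self.servers.values():
            store[obj_id] = copy.deepcopy(state)

    def get(self, obj_id):
        """Read an object from the current master."""
        return self.servers[self.master][obj_id]


pool = RedundantPool()
pool.join("a")
pool.join("b")
pool.put("sword", {"damage": 5})
pool.leave("a")   # master goes down, "b" picks up the slack
pool.join("a")    # old master comes back as a slave
```

After the last line, "b" is still the master and "a" holds a fresh copy of the sword.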

The next step is writing a million more tests to make sure it acts properly in every situation I can think of. Once that is done, I think I can build on the platform I created to make different types of HA pools. For instance, one that divides objects up into sub-pools.

Anyway, I’m pretty happy with this new method of keeping things redundant. It’ll take a while to integrate it into the rest of the code, but I don’t foresee any major problems. One pretty cool thing: I just implemented a method to move items from one pool to another. This could be used when changing the partitioning around, or (if I do one pool per area) when an item moves from one area to another.
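A rough sketch of what moving an item between pools might look like, with each pool reduced to a plain dict. The `Pool` class and `move_item` function are hypothetical names for illustration, not the real MV3D method. The one design choice shown is adding the item to the destination before removing it from the source, so the item always exists somewhere.

```python
class Pool:
    """Stand-in for an HA pool: just a name and a dict of items."""

    def __init__(self, name):
        self.name = name
        self.items = {}  # item id -> item state


def move_item(item_id, source, dest):
    """Move one item from source to dest, never leaving it in neither."""
    if item_id not in source.items:
        raise KeyError("%r is not in pool %s" % (item_id, source.name))
    if item_id in dest.items:
        raise ValueError("%r is already in pool %s" % (item_id, dest.name))
    # Add to the destination first, then remove from the source.
    dest.items[item_id] = source.items[item_id]
    del source.items[item_id]


forest = Pool("forest")
cave = Pool("cave")
forest.items["torch"] = {"lit": True}
move_item("torch", forest, cave)  # e.g. the torch is carried into the cave
```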

February 7, 2009

Unified theory of high availability.

Filed under: Uncategorized — SirGolan @ 2:57 pm

I’ve been working on ticket #211 for MV3D lately. If you look at the ticket, you’ll see it’s something that’s been kicked around for a bit. The current redundancy and load balancing code is fairly broken in trunk. It did work at some point in the past, but a myriad of changes have caused it to break in various ways. Clearly, one problem is that there weren’t enough unit tests for it, or this would never have happened. The larger problem is that the code was very repetitive and not very organized. I’m setting out to fix this, but it’s not been easy. I’ve written a bunch of code and tests, but I have no idea if it is remotely the right direction to go in. Not to mention that this is the 3rd or 4th major direction change I’ve had since I started on this ticket. Nonetheless, I really want to get these details worked out before MV3D gets much further along. In my experience, the longer you wait on something like this, the harder it gets to fix.

If what I’m talking about makes no sense, read up on MV3D’s Server Architecture. The high level requirements are that there be no single point of failure for any service MV3D provides and that servers below the Directory servers can come and go with little disruption. Starting at the top, a Directory should be split horizontally (by item id) across a number of servers. Each part of the Directory should be replicated to several servers. Asset Groups and Realms should get the same treatment. Realms should organically divide items across Simulation Services based on the Area they are in. All Items should exist on at least two Simulation Services. To further complicate things, some types of Areas can spread themselves across multiple servers, which means the objects within them have to do the same. It is very important to keep objects in the same Area (or piece of an Area) on the same server so that there is no time lag for physics collisions.
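As a rough illustration of "split by item id, then replicate", here is one simple placement rule. It assumes integer item ids and uses plain modulo placement, which is my assumption for the sketch and not something MV3D is known to do; the function name `replicas_for` and the server names are made up.

```python
def replicas_for(item_id, servers, copies=2):
    """Return the servers that should hold a given item.

    The item's home server is chosen by item_id modulo the number of
    servers, and the remaining copies go to the servers that follow it.
    """
    if copies > len(servers):
        raise ValueError("not enough servers for %d copies" % copies)
    start = item_id % len(servers)
    return [servers[(start + i) % len(servers)] for i in range(copies)]


directory_servers = ["dir0", "dir1", "dir2"]
print(replicas_for(7, directory_servers))  # ['dir1', 'dir2']
print(replicas_for(5, directory_servers))  # ['dir2', 'dir0']
```

The catch with plain modulo is that adding or removing a server changes where almost every item lives, which is part of why servers coming and going is the hard bit.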

The current way of doing this makes each level in charge of distributing the load for sub-levels. Since nothing sits above the Directory, this means that there is also no method to get HA Directory Services. The Directory stores a master and a list of slaves for each Realm and Asset Group. Asset Groups also have no redundancy right now and must exist on a single server. The Directory Service is able to promote and demote copies of Realms on various servers from master to slave. Realms manage a list of Areas and what servers they can be simulated on. Each Area manages where its objects are simulated. This generally makes for a big mess without much rhyme or reason as to where anything is served from. Another problem I’ve encountered is that having all HA items be pb.Cacheable makes it hard to switch them from slave to master or back without re-caching them (which would cause confusion to any object that kept a reference to the old version).

I was going to write about my solution, but after I started, it became clear that it was completely wrong, so back to the drawing board.
