- Discover nodes entering or leaving the system
- Have them join the Mnesia database
- Have them take responsibility for a portion of the data
It's actually surprisingly complicated, given that at the root the way to add a node to a running database is to call mnesia:change_config/2 as mnesia:change_config (extra_db_nodes, NodeList). However while adding nodes is easy removing nodes is harder. Detecting nodes to remove is not that hard; we go with the (EC2-centric) strategy of periodically updating a record in a table to "stay alive" (and also a way to mark clean shutdown); nodes that fail to check in eventually are considered dead. To handle multiple node failure we patch mnesia to provide mnesia_schema:del_table_copies/2, the multiple table analog to mnesia:del_table_copy/2. To guard against rejoining the global schema after having been removed but without having removed one's mnesia database directory (this is a no-no), we check with the other database nodes to see if they think we've been removed.
Even with all this complication, we have yet to address the problem of handling network partitioning automatically. We'll see if we have time to address that before it bites us in the butt.
Now available on google code.