Race Conditions and Slony-I

13. Race Conditions and Slony-I

No, this has nothing to do with racial harmony or lack thereof; Wikipedia describes it thus: "A race condition or race hazard is a flaw in a system or process whereby the output of the process is unexpectedly and critically dependent on the sequence or timing of other events." In computing applications, race conditions arise most frequently in distributed or threaded applications where multiple parts of the application depend on some piece of shared state; if this state is not properly managed, confusion (error!) arises. Most commonly, the problem is that the state can change between the time it is checked and the time it is used.
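The check-then-use pattern described above can be illustrated with a small sketch. This is a generic example, not Slony-I code: the Account class and both functions are hypothetical names invented for illustration. The first function replays the problematic ordering step by step so that the outcome is reproducible; the second performs the check and the use under a single lock.

```python
import threading


class Account:
    """Hypothetical piece of shared state, used only for illustration."""

    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def has_funds(self, amount):
        # The "check" step.
        return self.balance >= amount

    def debit(self, amount):
        # The "use" step.
        self.balance -= amount

    def withdraw_atomic(self, amount):
        # Check and use happen under one lock, so no other thread can
        # change the balance between the two steps.
        with self._lock:
            if self.balance >= amount:
                self.balance -= amount
                return True
            return False


def racy_interleaving():
    """Replay the ordering that exposes the race, one step at a time."""
    acct = Account(100)
    a_ok = acct.has_funds(80)  # actor A checks the balance
    b_ok = acct.has_funds(80)  # actor B checks before A has acted
    if a_ok:
        acct.debit(80)
    if b_ok:
        acct.debit(80)         # B acts on a check that is now stale
    return acct.balance


def locked_withdrawals():
    """Run two real threads that each attempt the same withdrawal."""
    acct = Account(100)
    results = []

    def worker():
        results.append(acct.withdraw_atomic(80))

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return acct.balance, sorted(results)
```

In the replayed ordering both actors pass the check and the account is overdrawn; with the lock, exactly one withdrawal succeeds regardless of which thread runs first.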

Slony-I has run into a number of race conditions during its history:

  • During the 1.0 and 1.1 branches, SLONIK MOVE SET had the problem that nodes had no way to hold off processing SYNC events from the new origin node until they had recognized the role change from subscriber to provider. (Until then, their state would cause them to consider the new origin a mere provider, and therefore not a source of replicable data.)

    This was fixed by introducing a new ACCEPT SET event that would be submitted by the new origin; this allowed subscribers to be aware of their need to wait for the MOVE SET event.

  • In a number of places, Slony-I issues the SQL statement "lock table sl_config_lock;" in order to prevent race conditions while changing the sl_log_status sequence value.

  • The slon option slon_conf_sync_interval_timeout is used to prevent a possible race condition in which the action sequence is bumped by the trigger while inserting the log row: the bump is immediately visible to the sync thread, but the resulting log rows are not yet visible.

  • The "snapshot visibility" approach used by Slony-I to determine what replicated data is to be associated with a specific SYNC avoids race conditions that would be associated with trying to purely use timestamps or ID ranges to determine what data is to be replicated.

  • In the 1.2 branch, in versions before 1.2.11, log shipping had a race condition: any time the slon reloaded its configuration (as takes place with a number of events, notably SLONIK SUBSCRIBE SET), there was a risk that the SYNC IDs used to ensure proper ordering and application of log shipping archive log files would be off by one.

    This was resolved in 1.2.11 by moving the ID number from an in-memory variable (susceptible to all sorts of troubles) to being managed, in a transaction-safe fashion, in the subscriber database.

    The problem was never exposed by the test bed framework, nicely demonstrating the common finding that race conditions are frequently highly dependent on patterns of data input or of application timing.
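The "snapshot visibility" item in the list above can be made concrete with a toy model. The sketch below is a simplified illustration, not Slony-I's implementation: ToyDatabase, Snapshot, and both selection functions are hypothetical, and the model assumes transactions receive increasing IDs when they begin, may commit in any order, and never abort. It compares selecting rows by ID range against selecting rows that are visible in the current snapshot but were not visible in the previous one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    xmax: int               # first transaction ID not yet assigned
    in_progress: frozenset  # IDs begun but not committed at snapshot time

    def sees(self, xid):
        return xid < self.xmax and xid not in self.in_progress


class ToyDatabase:
    """Hypothetical stand-in for a database that hands out transaction IDs."""

    def __init__(self):
        self.next_xid = 1
        self.active = set()
        self.committed = set()

    def begin(self):
        xid = self.next_xid
        self.next_xid += 1
        self.active.add(xid)
        return xid

    def commit(self, xid):
        self.active.remove(xid)
        self.committed.add(xid)

    def snapshot(self):
        return Snapshot(self.next_xid, frozenset(self.active))


def rows_by_id_range(db, last_high):
    """Naive approach: committed IDs above the previous high-water mark."""
    picked = {x for x in db.committed if x > last_high}
    new_high = max(db.committed | {last_high})
    return picked, new_high


def rows_by_snapshot(prev, curr):
    """IDs visible in the current snapshot but not in the previous one."""
    return {x for x in range(1, curr.xmax)
            if curr.sees(x) and not prev.sees(x)}


def run_scenario():
    db = ToyDatabase()
    start = db.snapshot()
    t1 = db.begin()
    t2 = db.begin()
    db.commit(t2)                 # the later transaction commits first

    snap_a = db.snapshot()        # first sync point: t1 still in progress
    range_a, high = rows_by_id_range(db, 0)

    db.commit(t1)                 # the earlier transaction commits late

    snap_b = db.snapshot()        # second sync point
    range_b, high = rows_by_id_range(db, high)

    return {
        "id_range": range_a | range_b,
        "snapshot": rows_by_snapshot(start, snap_a)
                    | rows_by_snapshot(snap_a, snap_b),
    }
```

In this scenario the ID-range approach never picks up the first transaction, because the high-water mark has already moved past its ID by the time it commits. The snapshot comparison assigns it to the second sync point, since it was in progress at the first snapshot and visible at the second.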