HA Locking Mechanism
The servers in replicated mode use the HA locking mechanism to determine the server state when one server of the pair is unavailable or when the network fails. A file with read and write permissions is shared on a machine; this file is referred to as the LockFile. The machine hosting the LockFile is referred to as the gateway machine. A server can switch to Active only if it holds a lock over the LockFile.
In HA implementations prior to the locking mechanism, a network link failure between the servers could lead to both servers switching to the Standalone state. Because only one server can hold the lock at a time, the lock prevents both servers from switching to the Active/Standalone state simultaneously.
The locking mechanism makes the state switching of an HA server more deterministic.
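The "only one holder at a time" rule can be illustrated with a minimal sketch. It assumes the lock is modeled by atomically creating a marker file; how FioranoMQ actually locks the LockFile is not described here, and the function names and the path used below are hypothetical.

```python
import os
import tempfile

def try_acquire_lock(lock_path: str, owner: str) -> bool:
    """Try to take the lock by atomically creating the lock file.

    Returns True if this caller now holds the lock, and False if
    another holder has already created the file.
    """
    try:
        # O_CREAT | O_EXCL fails when the file already exists, so at
        # most one caller can succeed.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as lock_file:
        lock_file.write(owner)  # record the holder for diagnostics
    return True

def release_lock(lock_path: str) -> None:
    """Give up the lock so that the other server can acquire it."""
    os.remove(lock_path)

def may_become_active(holds_lock: bool) -> bool:
    """A server may switch to Active only while it holds the lock."""
    return holds_lock

# Hypothetical stand-in for the shared directory on the gateway machine.
lock_path = os.path.join(tempfile.mkdtemp(), "LockFile")
primary_has_lock = try_acquire_lock(lock_path, "primary")
secondary_has_lock = try_acquire_lock(lock_path, "secondary")
```

In this sketch the second attempt fails while the first holder keeps the file, which is the property that stops both servers from becoming Active at once.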
3.1 Fiorano Replicated High Availability Working
The central concept of backchannel replication is that the Active Server (the server which is in the Active State) replicates its data store and state to the Passive Server, thus keeping both servers in sync. This replication channel is supported on a private network dedicated to the synchronization of the broker state and messaging data.
In Replicated HA, the database is replicated from the ACTIVE server to the PASSIVE server. This data includes messages, admin/security data, and status information, including any changes made while the passive server was down. Whenever a server is disconnected from the network and then reconnects, the database of the active server is replicated into the passive server's database. Both servers have their own copy of the database, and any change to the active server's database is propagated to the passive server's database. In this way, the two servers are kept consistent.
The passive server accepts no client connections while in its hot-standby (passive) role, but is prepared to transition immediately to the Active role as soon as it detects that the Active Server is unavailable. If the primary fails, all Fiorano applications fail over from the primary and reconnect to the designated secondary backup broker.
The primary and secondary brokers use the replication channel to routinely check each other's heartbeat and to watch for any interruption in the data flow or connection that requires a state switch. A locking mechanism (as already explained) is also employed to determine the state of the servers.
This hot-failover process is immediate and is completely transparent to all client applications. The Secondary Server in the active role is sensitive to re-establishment of the replication channel. This reconnection may come from a recovery of the Primary Server or from a replacement Primary Server. Once the primary comes up again, it assumes the role of the Passive Server, since the Secondary Server has switched to the Active role.
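Heartbeat-based failure detection can be sketched as follows. This is only an illustration of the general idea: the class name, the timeout value, and the use of explicit timestamps are assumptions, not FioranoMQ settings or APIs.

```python
from dataclasses import dataclass

@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from the peer server (hypothetical)."""
    timeout_seconds: float        # assumed threshold, not a product default
    last_heartbeat: float = 0.0   # time of the most recent heartbeat

    def record_heartbeat(self, now: float) -> None:
        """Call whenever a heartbeat arrives on the replication channel."""
        self.last_heartbeat = now

    def peer_failed(self, now: float) -> bool:
        """Report a failure once the peer has been silent for too long."""
        return (now - self.last_heartbeat) > self.timeout_seconds

monitor = HeartbeatMonitor(timeout_seconds=3.0)
monitor.record_heartbeat(now=10.0)
```

Passing the current time in explicitly keeps the sketch deterministic; a real monitor would read a monotonic clock instead.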
3.1.1 States and State Transition
The servers in replication mode pass through the following states:
- ACTIVE: the normal working state. In this state, the server accepts client connections.
- PASSIVE: the server is in standby mode and monitors its active peer server.
- ACTIVE_TRANSITION_STATE: the server is synchronizing with the standby server while still serving client applications.
- PASSIVE_TRANSITION_STATE: occurs on the standby server while the active server is synchronizing with it.
- WAITING: the server is waiting for the state of the other server or is trying to acquire the lock on the LockFile in order to become active. The server does not accept client connections in this state.
- STANDALONE: the server is actively servicing clients and the other server is disconnected.
- DEAD: the server is down or not present in the network.
The following diagram explains the transition to various states:
Note:
- failure detected – refers to the link between the servers being broken
- sync-complete – database synchronization complete
- Lock Lost – lock over the LockFile is lost
- Lock Obtained – lock obtained over the LockFile
- Resolve to Active/Passive – based on which server obtains the lock
3.1.2 How Do Server States Change?
- On startup, the server enters the WAITING state. In this state, the server waits for its backup server to connect to it. This is the initial synchronization state, which is required to sync up the primary server with the secondary to avoid any message loss. The server changes state when one of the following occurs:
  - Switch to PASSIVE SYNC state: if the HA channel is established and the other server is in the STANDALONE state.
  - Switch to PASSIVE (STANDBY) SYNC or ACTIVE SYNC state: if the HA channel is established and the other server is also in the WAITING state, the servers assume the Active or Passive role depending on the repository timestamps and the remote server status.
- When the server is actively serving clients, holds the lock over the LockFile, and either its backup server is not running or the HA transport channel is broken, the state of the active server is STANDALONE. If the server in the STANDALONE state establishes the HA channel and the other server is in the WAITING state, the STANDALONE server shifts to the ACTIVE SYNC state and then to the ACTIVE state. Likewise, a passive (standby) server can switch to STANDALONE if the other server is not running or the transport channel is broken, and the passive server acquires the lock over the LockFile.
- When the server is in the ACTIVE SYNC state, it synchronizes its data with the backup server, which is in the PASSIVE SYNC state. The server in ACTIVE SYNC continues to serve its clients. Completion of the Runtime Synchronization Protocol moves the ACTIVE SYNC server to the ACTIVE state and the PASSIVE SYNC server to the PASSIVE state.
- Once the ACTIVE SYNC server completes the synchronization, it enters the ACTIVE state and resumes actively transmitting state information and all replication data to the PASSIVE server. At this point, if the ACTIVE server fails, the hot-standby PASSIVE server is ready to move into the STANDALONE state and start accepting requests from the clients.
- An ACTIVE server can switch to WAITING if the transport channel is broken and it loses the lock over the LockFile. Similarly, a STANDALONE server can switch to WAITING if it loses the lock over the LockFile.
- Whenever a server's state changes, it broadcasts its present and previous state to the backup server. A server's transition is a function of its own state, the present and previous states of the backup server, and whether or not it holds the lock over the LockFile.
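The rules above can be collected into a small transition function. This is a sketch under stated assumptions, not the server's implementation: it covers only the transitions spelled out in this section, it treats ACTIVE SYNC and PASSIVE SYNC as the two transition states, and the `wins_resolution` flag is a hypothetical stand-in for the comparison of repository timestamps and remote server status.

```python
from enum import Enum
from typing import Optional

class State(Enum):
    WAITING = "WAITING"
    ACTIVE_SYNC = "ACTIVE SYNC"
    PASSIVE_SYNC = "PASSIVE SYNC"
    ACTIVE = "ACTIVE"
    PASSIVE = "PASSIVE"
    STANDALONE = "STANDALONE"

def next_state(state: State, peer_state: Optional[State],
               channel_up: bool, holds_lock: bool, *,
               sync_complete: bool = False,
               wins_resolution: bool = False) -> State:
    """Return the next state for one server; unchanged if no rule applies."""
    if state is State.WAITING and channel_up:
        if peer_state is State.STANDALONE:
            return State.PASSIVE_SYNC
        if peer_state is State.WAITING:
            # Assumed flag: True if this server resolves to the Active role.
            return State.ACTIVE_SYNC if wins_resolution else State.PASSIVE_SYNC
    elif state is State.STANDALONE:
        if not holds_lock:
            return State.WAITING          # lock lost
        if channel_up and peer_state is State.WAITING:
            return State.ACTIVE_SYNC      # peer is back, start syncing
    elif state is State.ACTIVE_SYNC and sync_complete:
        return State.ACTIVE
    elif state is State.PASSIVE_SYNC and sync_complete:
        return State.PASSIVE
    elif state is State.ACTIVE and not channel_up:
        # Channel broken: keep serving only while the lock is held.
        return State.STANDALONE if holds_lock else State.WAITING
    elif state is State.PASSIVE and not channel_up and holds_lock:
        return State.STANDALONE           # took over after acquiring the lock
    return state
```

For example, `next_state(State.ACTIVE, None, channel_up=False, holds_lock=False)` yields `State.WAITING`, matching the rule that an ACTIVE server without the lock stops serving clients when the channel breaks.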
3.1.3 Which Objects Are Replicated?
The following objects in the database need to be replicated on the machines of the Primary and Secondary servers:
- PTP objects
- ACL objects
- Principal objects
- PubSub objects
- Admin objects
3.2 Fiorano Shared High Availability Working
In this mode of High Availability, the database is shared between the active and passive servers, which do not replicate data over the network. In this mode, only the active server makes changes to the common database. If the Active server fails, all Fiorano applications fail over from the Active server and reconnect to the designated Passive backup broker. The Active and Passive brokers use the network channel between them to routinely check each other's heartbeat and to watch for any break in the connection that requires a state switch. A locking mechanism (as explained in section 3.1) is also employed to determine the state of the servers. The database that is common to both servers is referred to as the shared database.
Note: The shared database connectivity is critical for the servers to function, as the servers store all data in it. It is mandatory for the Shared Database to be always accessible to the servers. Unavailability of the shared database could lead to data loss and data corruption.
3.2.1 States and State Transition
- ACTIVE: the normal working state. In this state, the server accepts connections.
- PASSIVE: the server is in standby mode and monitors its active peer server.
- ACTIVATING: the server is in transition to becoming active and is synchronizing with the database.
The following diagram explains the transition to various states:
Note:
- failure detected – refers to the link between the servers being broken
- Lock Lost – lock over the LockFile is lost
- Lock Obtained – lock obtained over the LockFile
When the server starts up, it tries to acquire a lock on the LockFile. If it acquires the lock successfully, it switches to the ACTIVATING state. It then switches to the ACTIVE state once all its services have been activated. In replicated HA, the servers wait for each other to come up (that is, they stay in the WAITING state). A server in shared mode does not need to wait for its backup server, because both servers share a common database and the database synchronization required in replicated mode is not needed.
After switching to the ACTIVE state, the server keeps trying to connect to its backup server. When the backup server starts up, it switches to the PASSIVE state.
At this point, if the ACTIVE server fails, the hot-standby PASSIVE server is ready to move into the ACTIVE state and start accepting requests from the clients.
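The shared-mode sequence described above can be sketched as a few small functions. Two points are assumptions rather than statements from this section: a server that fails to acquire the lock at startup is taken to become PASSIVE, and a PASSIVE server taking over is taken to pass through ACTIVATING before reaching ACTIVE. All names are hypothetical.

```python
from enum import Enum

class SharedState(Enum):
    PASSIVE = "PASSIVE"
    ACTIVATING = "ACTIVATING"
    ACTIVE = "ACTIVE"

def on_startup(lock_acquired: bool) -> SharedState:
    """Lock holder starts activating; otherwise stand by (assumed)."""
    return SharedState.ACTIVATING if lock_acquired else SharedState.PASSIVE

def on_services_activated(state: SharedState) -> SharedState:
    """ACTIVATING becomes ACTIVE once all services have been activated."""
    return SharedState.ACTIVE if state is SharedState.ACTIVATING else state

def on_peer_failure(state: SharedState, lock_acquired: bool) -> SharedState:
    """A PASSIVE server takes over only if it obtains the lock.

    Passing through ACTIVATING on takeover is an assumption.
    """
    if state is SharedState.PASSIVE and lock_acquired:
        return SharedState.ACTIVATING
    return state

# First server to start acquires the lock; the second one does not.
first = on_services_activated(on_startup(lock_acquired=True))
second = on_startup(lock_acquired=False)
```

No WAITING state appears here because, as noted above, the servers in shared mode do not synchronize databases with each other.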
3.2.2 How Do Failovers Happen?
If the primary server becomes unavailable, all the client applications connected to it are automatically reconnected to the secondary server. The process of shifting from the primary server to the backup server, or vice versa, is transparent to the application, so the client application does not need to implement reconnect logic in its code. This is achieved by connecting to the server through a Durable Connection. If a backup server is available, the Durable Connection connects to the backup server; otherwise, it waits for the server to restart. Further, it stores all the data sent during the disconnected period in a local repository and transfers this data as soon as the connection is re-established, thus making the system highly reliable and robust even in the case of network failures.
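The store-and-forward behavior described for the Durable Connection can be illustrated with a minimal sketch. It is not the FioranoMQ client API: the class and method names are hypothetical, and an in-memory queue stands in for the local repository.

```python
from collections import deque
from typing import Callable, Deque

class BufferingSender:
    """Sends messages directly while connected and buffers them otherwise."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send                      # delivers one message to the server
        self._connected = True
        self._pending: Deque[str] = deque()    # stand-in for the local repository

    def on_disconnect(self) -> None:
        self._connected = False

    def on_reconnect(self) -> None:
        """Flush buffered messages in their original order."""
        self._connected = True
        while self._pending:
            # Remove a message only after it has been sent successfully.
            self._send(self._pending[0])
            self._pending.popleft()

    def publish(self, message: str) -> None:
        if self._connected:
            self._send(message)
        else:
            self._pending.append(message)

delivered = []
sender = BufferingSender(delivered.append)
sender.publish("m1")
sender.on_disconnect()
sender.publish("m2")
sender.publish("m3")
```

The application code only calls `publish`; the buffering and the later flush happen inside the connection object, which is what keeps the failover transparent to the application.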
3.2.3 Database Details
In shared HA mode, the database is shared between the active and passive servers.
3.3 Advantages and Disadvantages / How to Determine Which HA to Use
Both HA implementations have advantages and disadvantages relative to each other. These are pointed out here:
- In the Replicated HA implementation, two copies of the database are available, so if the database fails on one system, an up-to-date database is available on the second system. In the Shared HA implementation, only one database is available, so any corruption of this database leads to degradation of system performance.
- In the Shared HA implementation, each database store operation needs just one disk write. In the Replicated HA implementation, each database operation needs two disk writes (one for each of the databases on the Primary and Secondary servers). Thus, performance is better with the Shared HA implementation.
- In the Shared HA implementation, just one datastore is used, whereas Replicated HA uses two datastores, so the cost of storage is lower in the Shared HA implementation. However, normal operation of FioranoMQ does not need much storage space, so storage cannot be considered a major factor when deciding which HA implementation to use.
From the above discussion, we can infer that the Replicated HA implementation should be used when reliability is preferred over performance, and the Shared HA implementation when performance is preferred over reliability.