This section introduces problems that distributed systems commonly face in real-world deployments.
Client Unable to Connect
A server can be temporarily unable to accept a connection request. One example is socket buffering, which occurs when too many clients try to connect at the same time. The most desirable behavior is for the client to transparently attempt a few reconnections first.
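As an illustration of the reconnection behavior described above, the following is a minimal sketch in Python. The function name connect_with_retry, the default number of attempts, the delay, and the doubling back-off are all assumptions made for this example; they are not taken from any particular product.

```python
import socket
import time


def connect_with_retry(host, port, attempts=3, delay=0.5,
                       connect=socket.create_connection):
    """Try to connect a few times before giving up.

    `attempts` and `delay` are illustrative defaults, not recommended values.
    `connect` is injectable so the retry logic can be exercised without a network.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return connect((host, port))
        except OSError as exc:  # refused, timed out, unreachable, ...
            last_error = exc
            if attempt < attempts - 1:
                # Wait longer after each failure: delay, 2*delay, 4*delay, ...
                time.sleep(delay * (2 ** attempt))
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

The application calls connect_with_retry once and sees either a connection or a single error, so the individual retries stay invisible to it.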
Since the server can be completely down rather than temporarily overloaded, the client needs to be able to connect to alternate backup servers. If this list of backup servers is retrieved as a parameter of a connectionfactory object, the client code can become non-portable.
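A sketch of trying alternate servers in order might look like the following. It assumes the list of servers is supplied from outside the client code (for example from configuration); the name connect_to_any and the connect callable are hypothetical and used only for illustration.

```python
def connect_to_any(servers, connect):
    """Return (server, connection) for the first server that accepts.

    `servers` is an ordered list of addresses: primary first, then backups.
    `connect` is any callable that returns a connection or raises OSError.
    """
    failed = []
    for server in servers:
        try:
            return server, connect(server)
        except OSError:
            failed.append(server)  # remember the failure, try the next one
    raise ConnectionError(f"no server reachable, tried: {failed}")
```

Because the server list is passed in as data, the same client code can run unchanged against a different set of servers.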
Connection to the Server is Lost
This is another important failure that must be handled, and handling it requires client-side persistence. This 'store and forward' feature enables a client to operate in disconnected mode without losing messages. Client-side persistence should be integrated seamlessly: it should be transparent to the application, allow for transacted sessions, and cater for duplicated messages.
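The 'store and forward' idea and the handling of duplicates can be sketched as follows. This is a simplified illustration, not a description of any product's implementation: the class names are hypothetical, the local store is an in-memory queue standing in for a durable one, and duplicates are detected by a client-assigned message id, which is an assumption of this example.

```python
from collections import deque
import itertools


class StoreAndForwardClient:
    def __init__(self, send):
        self._send = send            # callable(msg_id, body); raises OSError when offline
        self._pending = deque()      # in-memory stand-in for a durable local store
        self._ids = itertools.count(1)

    def publish(self, body):
        """Store the message locally first, then try to forward it."""
        self._pending.append((next(self._ids), body))
        self.flush()

    def flush(self):
        """Forward stored messages in order; stop at the first failure."""
        while self._pending:
            msg_id, body = self._pending[0]
            try:
                self._send(msg_id, body)
            except OSError:
                return False         # still disconnected; keep the message
            self._pending.popleft()  # remove only after a successful send
        return True


class DedupReceiver:
    """Receiving side: ignores a message id it has already seen."""

    def __init__(self):
        self.seen = set()
        self.delivered = []

    def receive(self, msg_id, body):
        if msg_id in self.seen:
            return                   # duplicate caused by a resend
        self.seen.add(msg_id)
        self.delivered.append(body)
```

A message is removed from the local store only after the send succeeds, so a connection that drops after delivery but before confirmation leads to a resend. That is why the receiving side needs to recognize duplicates.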
The Server Runs Out of Resources
A shortage of any of several resources can effectively render a server inaccessible: connections, RAM, disk space, threads, file descriptors, sockets, and possibly others. A cluster of servers can provide more resources, distribute requests more evenly (load balancing), and keep servers on 'standby', 'ready to take over' in case of an emergency. To preserve application portability, the cluster should appear as a single (super) server in which load balancing and failover are transparent to clients. At times, the shortage of a resource is temporary, and it can be advisable for a client to first try to reconnect for a while before looking for an alternate server. Again, such an option should be compatible with load balancing.
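The idea of a cluster that appears as a single server can be sketched as a small facade. This example assumes round-robin as the load-balancing policy and models each server as a callable that raises OSError when it cannot serve a request; both choices, and the name ClusterFacade, are assumptions made for illustration.

```python
class ClusterFacade:
    """Presents several servers as one; callers only see request()."""

    def __init__(self, servers):
        self._servers = list(servers)  # each server: callable(payload) -> reply
        self._next = 0                 # round-robin position

    def request(self, payload):
        # Try each server at most once, starting at the round-robin position.
        for _ in range(len(self._servers)):
            server = self._servers[self._next]
            self._next = (self._next + 1) % len(self._servers)
            try:
                return server(payload)
            except OSError:
                continue               # this server is unavailable; try the next
        raise ConnectionError("all servers in the cluster are unavailable")
```

The caller never chooses a server, so both the distribution of requests and the skipping of an exhausted server happen without any change to client code.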
The Server Goes Down Altogether
If a server crashes, clients connected to it need to be able to continue working by connecting to a secondary server. This scenario is termed 'failover'. Once the crashed server recovers, it needs to be reactivated so that it takes over the tasks assigned to it again, restoring the state that existed before the crash. Failover is termed 'hot failover' if processing can continue seamlessly (with almost no delay). This requires that a secondary server is already running and has access to the persistent state and message data.
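One common way to decide when the secondary should take over is a heartbeat with a timeout. The sketch below assumes that mechanism, which the text above does not prescribe; the class name FailoverPair, the timeout value, and the use of explicit timestamps are all hypothetical and chosen to keep the example self-contained.

```python
class FailoverPair:
    """Tracks which of two servers should currently be active."""

    def __init__(self, timeout, now):
        self.timeout = timeout        # seconds without a heartbeat before failover
        self.last_heartbeat = now     # time of the primary's last heartbeat
        self.active = "primary"

    def heartbeat(self, now):
        """Called when the primary reports in; also reactivates it after recovery."""
        self.last_heartbeat = now
        self.active = "primary"

    def check(self, now):
        """Switch to the secondary if the primary has been silent for too long."""
        if now - self.last_heartbeat > self.timeout:
            self.active = "secondary"
        return self.active
```

For example, with a timeout of 3 seconds, a check at 5 seconds after the last heartbeat selects the secondary, and the next heartbeat from the recovered primary makes it active again. The sketch covers only the switching decision; for hot failover the secondary must also already be running with access to the persistent state and message data.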