Automatic failover of a component lease

The following state diagram illustrates the different states in the automatic failover of a component lease. PolicyCenter defines these states in the FailoverState typelist. The FailoverState typelist does not contain the 'Completed' state. The 'Completed' state is implicit after the failover process completes and the lease manager deletes the original lease.


If a component lease expires, a lease manager places the lease in the 'In Progress' failover state. From the 'In Progress' state, the lease failover can move to one of the following three states: 'Postponed', 'Failed', or 'Completed'.

Initially, each component lease starts in the 'Not Started' failover state. If a lease expires, the first lease manager that discovers the expired lease does the following:

  • It sets the lease to the 'In Progress' failover state. After the lease is set to this state, the component associated with the lease cannot run anywhere until the issue that caused the lease to expire is resolved.
  • It sets the Retry Failover field in the Server Tools Cluster Components screen to the following value:
    • CurrentTime + BackgroundTaskFailoverPlugin.FailoverTimeout

If more than one lease manager discovers the expired lease at the same time, only the first lease manager continues the failover handling. The other lease managers detect that their SQL updates do not change anything and do not continue the failover process for that lease.
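
The "first lease manager wins" behavior described above can be illustrated with a conditional SQL update. The following is a minimal sketch only, not the PolicyCenter implementation. The table name component_lease, its columns, the function names, and the five-minute timeout are all hypothetical and chosen for illustration. The key idea is that the UPDATE matches only a lease that is still in the 'Not Started' state, so a second lease manager that runs the same statement changes zero rows and stops.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Assumed stand-in for BackgroundTaskFailoverPlugin.FailoverTimeout (value is hypothetical).
FAILOVER_TIMEOUT = timedelta(minutes=5)


def make_db():
    """Create an in-memory database with a hypothetical lease table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE component_lease ("
        " component TEXT PRIMARY KEY,"
        " failover_state TEXT NOT NULL,"
        " expires_at REAL NOT NULL,"      # epoch seconds
        " retry_failover REAL)"           # epoch seconds, NULL when unset
    )
    return conn


def try_start_failover(conn, component, now):
    """Try to claim an expired lease; return True only for the first claimer."""
    retry_at = now + FAILOVER_TIMEOUT     # CurrentTime + FailoverTimeout
    cur = conn.execute(
        "UPDATE component_lease"
        " SET failover_state = 'In Progress', retry_failover = ?"
        " WHERE component = ?"
        " AND failover_state = 'Not Started'"
        " AND expires_at <= ?",
        (retry_at.timestamp(), component, now.timestamp()),
    )
    conn.commit()
    # A lease manager whose update changed nothing does not continue the failover.
    return cur.rowcount == 1
```

In this sketch, two lease managers that call try_start_failover for the same expired lease get True and False respectively, which mirrors the rule that only the first one continues the failover handling.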

The lease manager that started the failover calls the handleComponentNameFailover method on the BackgroundTaskFailoverPlugin plugin to determine what to do next with the lease. The method returns one of the following actions to handle the component lease failover.

Possible actions and their descriptions

Complete the failover

The BackgroundTaskFailoverPlugin plugin logic confirms the lease failure and instructs the lease manager to complete the failover. In this case, the lease manager completes the failover process by either deleting or expiring the lease.

Postpone the failover

It is possible that the BackgroundTaskFailoverPlugin plugin logic cannot reliably confirm the lease failure. In this case, it can postpone the failover process by returning an associated action to take and the time duration to wait before taking that action. The lease manager updates the Retry Failover field in the Server Tools Cluster Components screen with the following value:

   CurrentTime + FailoverHandlingResult.Duration

After the updated retry failover time expires, the lease manager considers the lease expired and starts the process of lease failover again.

Dismiss the failover

It is possible that the BackgroundTaskFailoverPlugin plugin logic decides that the specified background task did not fail or that this particular task requires some manual action. In this case, the BackgroundTaskFailoverPlugin plugin logic dismisses, or fails, the automatic failover of the lease. The lease, with its FailoverState set to 'Failed', remains in the database until there is some kind of manual intervention. The failover process does not attempt to retry the automatic failover.

Use external tool

The BackgroundTaskFailoverPlugin plugin logic returns a failover handled action. This action instructs the lease manager to do nothing with the lease. An external tool either deletes or renews the lease.

Calling an external tool to complete the failover can happen in any of the following ways:

  • Programmatically calling the SystemToolsAPI.nodeFailed method.
  • Programmatically calling the SystemToolsAPI.completeFailedFailover method.
  • Clicking the Complete Failover button on the Server Tools Cluster Components screen.
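
The four actions in the preceding table can be summarized as a small dispatch routine. This is a sketch under stated assumptions, written in Python for illustration rather than in the plugin's own language. The names Action, HandlingResult, Lease, and apply_result are hypothetical and are not part of the PolicyCenter API. In particular, modeling completion as a deleted flag and setting the state to 'Postponed' on a postpone action are assumptions drawn from the state names in this topic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class Action(Enum):
    COMPLETE = "complete"   # Complete the failover
    POSTPONE = "postpone"   # Postpone the failover
    DISMISS = "dismiss"     # Dismiss the failover
    HANDLED = "handled"     # Use external tool


@dataclass
class HandlingResult:
    action: Action
    duration: Optional[timedelta] = None  # only meaningful for POSTPONE


@dataclass
class Lease:
    component: str
    failover_state: str = "In Progress"
    retry_failover: Optional[datetime] = None
    deleted: bool = False  # stands in for the implicit 'Completed' state


def apply_result(lease, result, now):
    """Apply a (hypothetical) plugin result to a lease that is 'In Progress'."""
    if result.action is Action.COMPLETE:
        lease.deleted = True
    elif result.action is Action.POSTPONE:
        if result.duration is None:
            raise ValueError("a postpone result needs a duration")
        lease.failover_state = "Postponed"
        lease.retry_failover = now + result.duration  # CurrentTime + Duration
    elif result.action is Action.DISMISS:
        lease.failover_state = "Failed"  # stays until manual intervention
    elif result.action is Action.HANDLED:
        pass  # do nothing; an external tool deletes or renews the lease
    return lease
```

For example, applying HandlingResult(Action.POSTPONE, timedelta(minutes=10)) at 12:00 leaves the lease in place with a retry failover time of 12:10.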

If the cluster member that started the failover does not complete the failover within the specified retry failover time, another cluster member detects this condition. The second cluster member then restarts the failover.
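
The check that a second cluster member might perform can be expressed as a simple predicate. This is an illustrative sketch only. The function name and arguments are hypothetical, and treating a retry time equal to the current time as already passed is an assumption. The predicate also covers the 'Postponed' case described earlier, in which the failover starts again after the updated retry failover time expires.

```python
from datetime import datetime, timedelta, timezone


def failover_needs_restart(failover_state, retry_failover, now):
    """Return True if an unfinished failover has passed its retry failover time."""
    if failover_state not in ("In Progress", "Postponed"):
        return False  # 'Not Started' and 'Failed' leases are not retried here
    if retry_failover is None:
        return False  # no retry time recorded, nothing to compare against
    return now >= retry_failover
```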

If at any point the original lease manager for the lease takes action to renew the lease, it does the following:

  • It sets the failover state for the lease to 'Not Started'.
  • It resets the Retry Failover value to null.

At this point, the renewal of the lease resets the automatic failover process and negates any previous failover action undertaken for the renewed lease.
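
The states and transitions described in this topic can be collected into a small model. The sketch below is an interpretation of the text, not a copy of the FailoverState typelist. The enum, the transition table, and the can_move and renew helpers are hypothetical names. COMPLETED appears only to represent the implicit state, and manual intervention on a 'Failed' lease is deliberately left out of the automatic transitions.

```python
from enum import Enum


class FailoverState(Enum):
    NOT_STARTED = "Not Started"
    IN_PROGRESS = "In Progress"
    POSTPONED = "Postponed"
    FAILED = "Failed"
    COMPLETED = "Completed"  # implicit; not in the typelist according to the text


# Automatic transitions only, as read from the description above.
AUTOMATIC_TRANSITIONS = {
    FailoverState.NOT_STARTED: {FailoverState.IN_PROGRESS},
    FailoverState.IN_PROGRESS: {
        FailoverState.POSTPONED,
        FailoverState.FAILED,
        FailoverState.COMPLETED,
    },
    FailoverState.POSTPONED: {FailoverState.IN_PROGRESS},  # retry time expired
    FailoverState.FAILED: set(),     # waits for manual intervention
    FailoverState.COMPLETED: set(),  # the original lease no longer exists
}


def can_move(current, target):
    """Return True if the automatic failover may move from current to target."""
    return target in AUTOMATIC_TRANSITIONS[current]


def renew(lease):
    """Model a renewal by the original lease manager: reset the failover fields."""
    lease["failover_state"] = FailoverState.NOT_STARTED
    lease["retry_failover"] = None
    return lease
```

Keeping renewal as a separate function, rather than as an entry in the transition table, reflects the statement that a renewal can happen at any point and negates earlier failover actions.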