This article is part of a series of blog posts on the customer-managed version of Privilege Service (now part of Centrify Infrastructure Services). The topic is high availability; in customer-managed deployments, Windows Server Failover Clustering (WSFC) is the key enabler for this capability. In this article we'll discuss and demo some failover tests in a lab environment. If you're looking to create a similar test scenario, check out the previous article in the series: Installing Centrify Privilege Service + High-Availability with Windows Failover Clustering.
High availability is all about business continuity in the face of known risks like software or hardware issues. Any computer system aims for 100% uptime, although from a mathematical standpoint this is impossible. Still, it's quite clear that a system that holds passwords should provide not only high-availability capabilities, but also recoverability and disaster recovery. The focus of this lab is the HA scenarios.
This all starts with an assessment of the information system's impact on the organization's business continuity. In the CIA rating (confidentiality-integrity-availability), a significant last number (availability) indicates how important high availability is for the system. For this lab, we've come up with the
Simple Test Form
Windows Failover Clustering - Dependencies and Policies 101
Each "Role" in a Windows Server Failover cluster has dependencies. The CPS generic script (iis_pgsql_cluster) has the following dependencies: network, storage (designated disk), role DNS name, the IIS service and the Centrify Identity Platform Database (cisdb-pgsql).
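As a rough illustration of what a dependency means here, the sketch below models the generic script's dependencies as a plain Python mapping and checks whether the script could come online. This is not WSFC code: the function names, the resource labels, and the assumption that every dependency must be online (an AND relationship) are illustrative only.

```python
# Dependencies of the CPS generic script as listed in this article.
# The labels are descriptive, not actual WSFC resource names.
DEPENDENCIES = {
    "iis_pgsql_cluster": [
        "network",
        "storage",
        "role DNS name",
        "IIS service",
        "cisdb-pgsql",
    ],
}


def missing_dependencies(resource, online_resources, dependencies=DEPENDENCIES):
    """List the dependencies of `resource` that are not currently online."""
    return [d for d in dependencies.get(resource, []) if d not in online_resources]


def can_come_online(resource, online_resources, dependencies=DEPENDENCIES):
    """Return True if every dependency of `resource` is online.

    Assumption: all dependencies are required (AND relationship).
    """
    return not missing_dependencies(resource, online_resources, dependencies)


if __name__ == "__main__":
    # Hypothetical state: everything is up except the database.
    online = {"network", "storage", "role DNS name", "IIS service"}
    print(can_come_online("iis_pgsql_cluster", online))
    print(missing_dependencies("iis_pgsql_cluster", online))
```

In this hypothetical state the script cannot come online, and the only missing dependency reported is the database.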
For more information about dependencies, please review some other examples from Microsoft TechNet articles:
The behavior of the roles in the cluster and their dependencies is highly customizable in WSFC; however, when performing this type of testing it's important to understand whether the system is responding according to policy. For example, let's explore the failover policy for the Centrify Privilege Service role (as configured by default in a 3-node cluster).
This means that the cluster stops recovering the role automatically once it has failed "n-1" times within the period. In this case, this is a 3-node cluster, so the limit is 2 failures in a 6-hour interval. After 2 failures, any further failure within the 6-hour interval leaves the role down until it is restarted manually.
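The role-level policy can be sketched as a small simulation. This is a simplified model, not WSFC's actual implementation: it assumes a sliding 6-hour window, assumes that failures left down still count toward the window, and uses hypothetical names (`role_failover_actions`, `auto-recover`, `stay-down`).

```python
from collections import deque


def role_failover_actions(failure_times_h, max_failures=2, period_h=6.0):
    """Return the action taken for each role failure (times in hours).

    Assumption: up to `max_failures` failures inside the window are
    recovered automatically; any further failure inside the window
    leaves the role down until an operator restarts it.
    """
    recent = deque()  # failure times still inside the sliding window
    actions = []
    for t in failure_times_h:
        # Drop failures that have aged out of the window.
        while recent and t - recent[0] >= period_h:
            recent.popleft()
        recent.append(t)
        if len(recent) <= max_failures:
            actions.append("auto-recover")
        else:
            actions.append("stay-down")
    return actions


if __name__ == "__main__":
    # Three failures in two hours, then one more seven hours later
    # (after a manual restart by the operator).
    print(role_failover_actions([0.0, 1.0, 2.0, 9.0]))
```

With the default values for a 3-node cluster (n-1 = 2 failures, 6 hours), the third failure inside the window is the one that stays down; the failure at hour 9 falls outside the window and is recovered automatically again.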
In addition, you can have a "preferred" active node (perhaps one that has more resources). Let's now inspect the policy for the Centrify generic script:
This means that within a 15-minute period, the first IIS or PostgreSQL failure restarts the corresponding service on the same node; a second failure within the period forces the role to move from the active node to the next best node.
The same kind of policy exists for the disk, network and server name resources.
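To reason about expected results before running a test, the resource-level policy can be modeled in a few lines. This is a minimal sketch under stated assumptions: the 15-minute period starts at the first failure, one restart is allowed per period, and the function and action names are hypothetical rather than anything exposed by WSFC.

```python
def resource_actions(failure_times_min, period_min=15, max_restarts=1):
    """Return the expected action for each resource failure (times in minutes).

    Assumption: the period starts at the first failure; failures beyond
    `max_restarts` inside the period move the role to another node.
    """
    window_start = None
    restarts = 0
    actions = []
    for t in failure_times_min:
        # Start a new period if none is open or the previous one expired.
        if window_start is None or t - window_start >= period_min:
            window_start = t
            restarts = 0
        if restarts < max_restarts:
            restarts += 1
            actions.append("restart-on-same-node")
        else:
            actions.append("fail-over-role")
            window_start = None  # a fresh period begins on the new node
    return actions


if __name__ == "__main__":
    # Failures at T1:00 and T1:05, expressed as minutes.
    print(resource_actions([60, 65]))
    # Failures 20 minutes apart fall into separate periods.
    print(resource_actions([60, 80]))
```

Under these assumptions, two failures five minutes apart produce a restart followed by a node transfer, while two failures twenty minutes apart produce two local restarts.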
Finally, the reason for this explanation has to do with how you evaluate your tests. For example, if I were testing HA failure of a service component (like the database), an excerpt of my test plan would look like this:
| Time | Action | Owner and expected result |
|---|---|---|
| T1:00 | Stop the cisdb-pgsql service. | Owner: node-1. The service will restart on the same node. Up to T1:15, another failure of the web service or database will trigger a node change. |
| T1:05 | Stop the World Wide | The 'vault' role will be moved to the best possible node due to policy. |
The other reason why we're covering this is to encourage critical thinking. The policy settings for a lab or demo environment are very different from those of a production system, where the input goes directly to IT or security operations, operator actions are well documented (as process), and technical personnel are trained.
Testing Cadence (generic)
- Note the details about the test (time, current node in WSFC and in CPS, service status, etc.)
- Have an understanding of the policy and expected results based on timing and environment.
- Cause the test failure
- Record the results. Any deviation from expected results needs to be investigated.
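The cadence above can be captured in a simple record so that deviations stand out. The sketch below is one possible shape for such a record; the class name, field names, and sample values are hypothetical and only mirror the steps in the list.

```python
from dataclasses import dataclass


@dataclass
class FailoverTest:
    """One test entry: details noted up front, then the observed result."""
    time: str            # when the failure was caused, e.g. "T1:00"
    owner_before: str    # active node before the test
    service_status: str  # service state before the test
    action: str          # how the failure was caused
    expected: str        # expected result according to policy
    observed: str = ""   # filled in after the test

    def deviates(self):
        """True when a result was recorded and it differs from the expectation."""
        return self.observed != "" and self.observed != self.expected


def needs_investigation(tests):
    """Return the tests whose observed result deviates from the expected one."""
    return [t for t in tests if t.deviates()]


if __name__ == "__main__":
    # Sample entries with made-up observations, for illustration only.
    tests = [
        FailoverTest("T1:00", "node-1", "running", "Stop the cisdb-pgsql service",
                     expected="restart on same node", observed="restart on same node"),
        FailoverTest("T1:05", "node-1", "running", "Stop the web service",
                     expected="role moved to another node", observed="role stayed on node-1"),
    ]
    for t in needs_investigation(tests):
        print(t.time, t.action, "->", t.observed)
```

Keeping the expectation next to the observation makes the last step of the cadence mechanical: anything returned by `needs_investigation` is a deviation to look into.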
Privilege Service HA Testing (videos)
- Maintaining HA when putting an active node in maintenance mode (paused).
- Maintaining HA while the database is being backed up.
- Maintaining HA during Centrify Privilege Service Upgrade (shut down the active node during upgrade).
- Maintaining HA during Centrify Connector Upgrade.
- Stop the World Wide Web Publishing Service - based on rules, the service is restarted.
- Stop the Centrify Identity Platform Database - based on rules and timing, the active node is transferred.
- Simulate a Network Card failure (in this environment, also causes a Storage failure due to iSCSI)
- Power off a CPS Service Active Node
Privilege Service On Premises - High Availability - Where to next?