April 11, 2019 at 11:50 AM
This article belongs to a series of blogs on the customer-managed version of Privilege Service (now part of Centrify Infrastructure Services). The topic is high availability, and in customer-managed deployments, Windows Server Failover Clustering (WSFC) is the key enabler for this capability. In this article we'll discuss and demo some failover tests in a lab environment to illustrate. If you're looking to create a similar test scenario, check out the previous article in the series: Installing Centrify Privilege Service + High-Availability with Windows Failover Clustering.
Success Criteria
High availability is all about business continuity in the face of known risks like software or hardware failures. Every computer system aims for 100% uptime, but from a mathematical standpoint that is impossible. What is clear is that a system that holds passwords should provide not only high-availability capabilities, but also recoverability and disaster recovery. The focus of this lab is the HA scenarios.
This all starts with an impact assessment of the information system against the organization's business continuity. In the CIA rating (confidentiality-integrity-availability), a high availability score signals how important this capability is to the business. For this lab, we've come up with the simple test form below.
Simple Test Form
Test Environment
Windows Failover Clustering - Dependencies and Policies 101
Each "Role" in a Windows Server Failover cluster has dependencies. The CPS generic script (iis_pgsql_cluster) has the following dependencies: network, storage (designated disk), role DNS name, the IIS service and the Centrify Identity Platform Database (cisdb-pgsql).
For more information about dependencies, please review additional examples in Microsoft's TechNet articles.
The behavior of the roles in the cluster and their dependencies is highly customizable in WSFC; however, when performing this type of testing it's important to understand whether the system is responding according to policy. For example, let's explore the failover policy for the Centrify Privilege Service role (as configured by default in a 3-node cluster).
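If you'd rather read the policy from PowerShell than from the role's Properties dialog, the failover settings live on the cluster group itself. A quick sketch, again assuming the role is named 'vault':

```powershell
# FailoverThreshold = number of failovers tolerated within FailoverPeriod (hours).
# On a 3-node cluster the default threshold is n-1 = 2 within a 6-hour period.
Get-ClusterGroup -Name "vault" |
    Format-List Name, OwnerNode, State, FailoverThreshold, FailoverPeriod
```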
This means the cluster will stop attempting failover once the role fails "n-1" times within the failover period. In this 3-node cluster, after 2 failures within the 6-hour interval, any further failure leaves the role offline, and it must be brought back online manually.
In addition, you can have a "preferred" active node (perhaps one that has more resources). Let's now inspect the policy for the Centrify generic script:
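Both settings can be inspected or changed from PowerShell as well. A hedged sketch, using the node and resource names from this lab (adjust to your environment):

```powershell
# Declare a preferred owner order for the "vault" role (node names are this lab's)
Set-ClusterOwnerNode -Group "vault" -Owners "node-1","node-2","node-3"

# Inspect the restart policy on the generic script resource.
# RestartPeriod is expressed in milliseconds (900000 ms = 15 minutes);
# RestartThreshold is how many restarts are attempted before failing over.
Get-ClusterResource -Name "iis_pgsql_cluster" |
    Format-List Name, RestartAction, RestartDelay, RestartPeriod, RestartThreshold
```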
This means that within a 15-minute period, the first failure of IIS or PostgreSQL will restart the corresponding service, and a second failure within that period will move the role from the active node to the next best node.
The same policies exist for the disk, network, and server name resources.
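You can verify that claim in one pass by listing the restart settings for every resource in the role:

```powershell
# Compare the restart policy across the disk, network name, and IP resources
Get-ClusterResource | Where-Object { $_.OwnerGroup -eq "vault" } |
    Format-Table Name, ResourceType, RestartPeriod, RestartThreshold -AutoSize
```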
Finally, the reason for this explanation has to do with evaluating your tests. For example, if I were testing the HA failure of a service component (like the database), an excerpt of my test form would look like this (a scripted version of the first row follows the table):
| Time | Action | Expected Result | Results |
|---|---|---|---|
| T1:00 | Stop the cisdb-pgsql service. | Owner: node-1. The service will restart on the same node. Up to T1:15, another failure of the web service or database will trigger a node change. | PASS |
| T1:05 | Stop the World Wide Web Publishing Service. | Owner: node-2. The 'vault' role will be moved to the best possible node due to policy. | PASS |
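If you want to script a test row like the first one above instead of clicking through services.msc, something along these lines works. This is a sketch, not the article's method: it assumes PowerShell remoting is enabled between your workstation and the cluster nodes, and it reuses the role and service names from this lab.

```powershell
# Record the current owner of the role before injecting the failure
$role   = Get-ClusterGroup -Name "vault"
$before = $role.OwnerNode.Name

# T1:00 - stop the database service on the active node (requires PS remoting)
Invoke-Command -ComputerName $before -ScriptBlock {
    Stop-Service -Name "cisdb-pgsql"
}

# Give the cluster time to detect the failure and apply its restart policy
Start-Sleep -Seconds 60

# Expected per policy: same owner, role back Online (first failure restarts in place)
$role = Get-ClusterGroup -Name "vault"
"Owner before: $before  after: $($role.OwnerNode.Name)  state: $($role.State)"
```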
The other reason we're covering this is to encourage critical thinking. The policy settings for a lab or demo environment are very different from those of a production system; production settings should come directly from IT or security operations, and they should be paired with well-documented operator procedures and trained technical personnel.
Testing Cadence (generic)
Privilege Service HA Testing (videos)
Administrative
Software
Hardware
Privilege Service On Premises - High Availability - Where to next?