[Labs] Testing High-Availability for Centrify Privilege Service and Windows Failover Clustering

11 April,19 at 11:50 AM

This article belongs to a series of blogs on the customer-managed version of Privilege Service (now part of Centrify Infrastructure Services). The topic is high availability; in customer-managed deployments, Windows Server Failover Clustering (WSFC) is the key enabler for this capability. In this article we'll discuss and demo some failover tests in a lab environment. If you're looking to create a similar test scenario, check out the previous article in the series: Installing Centrify Privilege Service + High-Availability with Windows Failover Clustering.

 

Success Criteria

High availability is all about business continuity in the face of known risks like software or hardware issues. Every computer system aims for 100% uptime, but from a mathematical standpoint this is impossible. Still, it's quite clear that a system that holds passwords should provide not only high-availability capabilities, but also recoverability and disaster recovery. The focus of this lab is the HA scenarios.
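To make the uptime point concrete, here is a minimal sketch that converts an availability target into the downtime it allows per year. The targets used (99%, 99.9%, 99.99%) are illustrative assumptions, not figures from this lab.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime implied by an availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Illustrative targets only; pick the ones that match your own requirements.
for target in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(target)
    print(f"{target}% availability allows about {hours:.2f} hours of downtime per year")
```

Even a very high target still leaves a non-zero downtime budget, which is why recoverability and disaster recovery matter alongside HA.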

This all starts with an assessment of the information system's impact on the organization's business continuity. In a CIA (confidentiality-integrity-availability) rating, a significant last number indicates that availability is important for the system. For this lab, we've come up with the simple test form below.

 

Simple Test Form

tests.PNG

Test Environment

lab.png

 

Windows Failover Clustering - Dependencies and Policies 101

Each "Role" in a Windows Server Failover cluster has dependencies.  The CPS generic script (iis_pgsql_cluster) has the following dependencies: network, storage (designated disk), role DNS name, the IIS service and the Centrify Identity Platform Database (cisdb-pgsql).
dep-rep.png
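The dependency list above can be sketched as a simple check: the role can only come online when every resource it depends on is online. This is a simplified model, assuming all dependencies are combined with AND; the state values are made-up sample data and the function name is hypothetical, not part of any WSFC or Centrify API.

```python
from __future__ import annotations

# Dependency names taken from the article; the "online"/"offline" states
# used below are sample data, not output from a real cluster.
DEPENDENCIES = [
    "network",
    "storage (designated disk)",
    "role DNS name",
    "IIS service",
    "cisdb-pgsql",
]


def can_come_online(states: dict[str, str]) -> tuple[bool, list[str]]:
    """Return (ok, blockers); ok is True only if every dependency is online."""
    blockers = [name for name in DEPENDENCIES if states.get(name) != "online"]
    return (not blockers, blockers)


# Simulate a database outage while everything else is healthy.
states = {name: "online" for name in DEPENDENCIES}
states["cisdb-pgsql"] = "offline"

ok, blockers = can_come_online(states)
print("role can come online:", ok)
print("blocked by:", blockers)
```

Listing the blockers, rather than returning a bare boolean, makes it easier to note in a test record which dependency caused the role to go down.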

For more information about dependencies, please review some other examples from Microsoft TechNet articles:

https://support.microsoft.com/en-us/help/835185/windows-failover-cluster-resource-dependencies-in-sql-server

 

The behavior of the roles in the cluster and their dependencies is highly customizable in WSFC; however, when performing this type of testing it's important to understand whether the system is responding according to policy.  For example, let's explore the failover policy for the Centrify Privilege Service role (as configured by default in a 3-node cluster).
vault-poliy.png
This means that the cluster stops recovering the role automatically once it has failed "n-1" times within the period.  In this case, this is a 3-node cluster, so after 2 failures within the 6-hour interval, each further failure in that interval leaves the role down until it is restarted manually.
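As a rough illustration of this policy, the sketch below labels a series of failures as either recovered automatically or left down. It assumes a threshold of n-1 failures and a sliding 6-hour window counted back from each failure; the real WSFC accounting may differ in detail, and the function names are hypothetical.

```python
from __future__ import annotations

PERIOD_HOURS = 6.0  # period from the role policy described above


def max_failures(node_count: int) -> int:
    """Threshold described in the article: n-1 for an n-node cluster."""
    return node_count - 1


def evaluate_failures(failure_times: list[float], node_count: int) -> list[str]:
    """Label each failure as recovered automatically or left down.

    failure_times are hours since the start of the test, in ascending order.
    Assumption: the window slides and ends at the failure being evaluated.
    """
    limit = max_failures(node_count)
    outcomes = []
    for i, now in enumerate(failure_times):
        # Count failures inside the window, including the current one.
        recent = [t for t in failure_times[: i + 1] if now - t < PERIOD_HOURS]
        outcomes.append("auto-recover" if len(recent) <= limit else "manual restart")
    return outcomes


# Three failures in quick succession, then one after the window has passed.
print(evaluate_failures([0.5, 1.0, 2.0, 9.0], node_count=3))
```

When planning tests, a model like this helps decide how far apart to space induced failures so that an earlier test does not change the expected result of a later one.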

In addition, you can have a "preferred"  active node (perhaps one that has more resources).  Let's now inspect the policy for the Centrify generic script:

iis-pgsql.png

This means that, on an IIS or PostgreSQL failure, the first failure restarts the corresponding service on the current node.  A second failure within the same 15-minute period forces the role to move from the active node to the next best node.
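The restart-then-move behavior can be sketched as a small function that maps failure times to expected actions. This is a simplified model under stated assumptions: the 15-minute period starts at the first failure, and the counter resets after the role moves. Neither detail is confirmed by the policy screenshot, and the function name is hypothetical.

```python
from __future__ import annotations

PERIOD_MINUTES = 15  # period from the generic script policy described above


def resource_actions(failure_minutes: list[float]) -> list[str]:
    """Return the expected action for each failure, in order.

    failure_minutes are minutes since the start of the test, ascending.
    """
    actions = []
    window_start = None
    for t in failure_minutes:
        if window_start is None or t - window_start >= PERIOD_MINUTES:
            # First failure in a fresh period: restart in place.
            window_start = t
            actions.append("restart service on current node")
        else:
            # Second failure inside the period: fail over.
            actions.append("move role to next best node")
            window_start = None  # assumption: counters reset after the move
    return actions


# Two failures 5 minutes apart, then two failures 20 minutes apart.
print(resource_actions([0, 5]))
print(resource_actions([0, 20]))
```

Writing down the expected action before inducing each failure gives you something concrete to compare the observed cluster behavior against.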

The same kind of policy exists for the disk, network and server name resources.

 

Finally, the reason for this explanation has to do with the evaluation of your tests.  For example, if I were testing the HA behavior when a service component (like the database) fails, an excerpt of my test log would look like this:

 

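| Time  | Action                                      | Expected Result                                                                                                                                      | Results |
|-------|---------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|---------|
| T1:00 | Stop the cisdb-pgsql service.               | Owner: node-1. The service will restart in the same node. Up to T1:15, another failure of the web service or database will trigger a node change.    | PASS    |
| T1:05 | Stop the World Wide Web Publishing Service. | Owner: node-2. The 'vault' role will be moved to the best possible node due to policy.                                                               | PASS    |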

 

The other reason why we're covering this is to encourage critical thinking.  The policy settings for a lab or demo environment are very different from those of a production system.  In production, the input goes directly to IT or security operations, and it needs to be backed by well-documented operator actions (as process) and trained technical personnel.

 

Testing Cadence (generic)

  1. Note the details about the test (time, current node in WSFC and in CPS, service status, etc.).
  2. Have an understanding of the policy and expected results based on timing and environment.
  3. Cause the test failure.
  4. Record the results.  Any deviation from expected results needs to be investigated.
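The cadence above can be captured in a small record type so that every test is logged the same way. This is a minimal sketch: the class, its fields and the "INVESTIGATE" label are hypothetical, and comparing only the owner node is a simplification of what a full test form would capture.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class FailoverTest:
    """One row of a test form: what was done, what was expected, what happened."""

    action: str                     # step 3: the failure to cause
    owner_before: str               # step 1: active node before the test
    expected_owner: str             # step 2: owner expected per policy
    started_at: datetime = field(default_factory=datetime.now)
    owner_after: Optional[str] = None

    def record(self, owner_after: str) -> None:
        """Step 4: store the observed owner node after the failure."""
        self.owner_after = owner_after

    @property
    def result(self) -> str:
        if self.owner_after is None:
            return "NOT RUN"
        # Any deviation from the expected result needs to be investigated.
        return "PASS" if self.owner_after == self.expected_owner else "INVESTIGATE"


test = FailoverTest(
    action="Stop the cisdb-pgsql service",
    owner_before="node-1",
    expected_owner="node-1",
)
test.record(owner_after="node-1")
print(test.started_at.isoformat(timespec="minutes"), test.action, test.result)
```

Keeping the expected owner in the record forces step 2 to happen before step 3, which is the point of the cadence.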

 

Privilege Service HA Testing (videos)

Administrative

  • Maintaining HA when putting an active node in maintenance mode (paused).
  • Maintaining HA while the database is being backed up.
  • Maintaining HA during Centrify Privilege Service Upgrade (shut down the active node during upgrade).
  • Maintaining HA during Centrify Connector Upgrade.

Software

  • Stop the World Wide Web Publishing Service - based on rules, the service is restarted.
  • Stop the Centrify Identity Platform Database - based on rules and timing, the active node is transferred.

Hardware

  • Simulate a Network Card failure (in this environment, this also causes a Storage failure due to iSCSI).
  • Power off a CPS Active Node.

Privilege Service On Premises - High Availability - Where to next?
