11 April 2019 at 11:50 AM
This article is part of a series of blog posts about the customer-managed version of Centrify Privilege Service. In this entry we focus on strategies for Disaster Recovery. If this is the first article you are reading, I recommend starting with the High Availability with Windows Server Failover Clustering post.
A quick primer on Disaster Recovery
Based on Wikipedia's definition, Disaster recovery (DR) involves a set of policies, tools and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster.
DR differs from high availability in one fundamental aspect: in a disaster, depending on the environment, the type of issue and the recovery strategies, there is a high probability of data loss. For example, if the suitable recovery is from a backup or a low-latency replicated file, the data loss falls in the gap between backup and restoration (or between replication and recovery). Many strategies are situation-dependent; perhaps a natural disaster has left the main site unable to function, and the DR systems take over while the primary site is brought back online.
Security controls break down into preventive, detective and corrective controls:
Preventive Controls - High-availability mechanisms
See these posts that discuss topics around HA, backups and replication.
Detective Controls
Privilege Service can be monitored via application monitoring, Windows Cluster Service events and connector monitoring.
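As an illustration of the second item, cluster-level failures can be surfaced with a quick query against the standard Failover Clustering event channel. This is a generic Windows sketch, not a Centrify-specific tool; the 24-hour window is an arbitrary example:

```powershell
# List recent Failover Clustering errors and warnings from the last 24 hours.
# Run in an elevated PowerShell session on a cluster node.
Get-WinEvent -FilterHashtable @{
    LogName   = 'Microsoft-Windows-FailoverClustering/Operational'
    Level     = 2, 3                      # 2 = Error, 3 = Warning
    StartTime = (Get-Date).AddDays(-1)
} | Select-Object TimeCreated, Id, LevelDisplayName, Message |
    Format-Table -AutoSize
```

A query like this is easy to wire into whatever application-monitoring pipeline you already run, so cluster events and connector health land in the same pane of glass.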
Corrective Controls
This is the core of this blog post. Here we describe the people-process-technology requirements for effective disaster recovery of Privilege Service.
Recoverability
Security practitioners agree that in DR scenarios some infrastructure systems may not be recoverable at full capacity, either because of the circumstances or because of the infrastructure design of the disaster recovery site. The DR site rarely has the same specifications as the primary data center (with the exception of ISPs or public cloud providers).
Finally, your disaster recovery strategy has to align with the business continuity plan and have clear recovery point and recovery time objectives. Success in DR scenarios depends on having clear recovery point and time goals, planning, policy, procedures and well-trained people who have practiced recovery.
Planning for Disaster Recovery
Here are some discussion and planning topics:
- Do you understand the DR site connectivity and infrastructure?
- Is Active Directory present in the DR site?
- Is DNS present in the DR site?
These are basic questions. A baseline infrastructure is needed for the recovery to work.
- Are there any core infrastructure components (like core switches, routers, systems, databases, secrets) that are crucial in a DR scenario?
If you answer yes to this question, you should have a sealed envelope (or other electronic means) to get access to key credentials. This is because the vault needs storage, network, AD and DNS to work; but depending on the disaster recovery strategy, you may need credentials to get to the point of recovering the vault.
- Do you have a stand-by server or remote cluster pre-staged in the DR site?
This will speed up the recovery. Once storage, network, Active Directory and DNS are up and running, CPS recovery can commence.
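Before starting the CPS restore on the stand-by server, it helps to script a quick pre-flight check of those dependencies. A minimal sketch — the hostnames below are placeholders for your environment, not product values:

```powershell
# Pre-flight checks before starting a CPS restore on the stand-by server.
# Replace the placeholder names with your DR-site values.
$domainController = 'dc1.dr.example.com'   # assumption: a DR-site domain controller
$dnsRecord        = 'cps.example.com'      # assumption: the vault's DNS name

# 1. Network reachability to the DC (LDAP)
Test-NetConnection -ComputerName $domainController -Port 389

# 2. DNS resolution in the DR site
Resolve-DnsName -Name $dnsRecord -ErrorAction Stop

# 3. Secure channel from this machine to Active Directory
Test-ComputerSecureChannel -Verbose
```

If any of the three checks fails, fix the underlying storage/network/AD/DNS issue first; starting the vault restore earlier only wastes time.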
- Do you have the right people in place, with clear instructions on how to communicate and coordinate during the recovery? Are the designated leads familiar and capable of restoring Privilege Service?
This topic should be self-evident. DR exercises can flush out issues in these areas.
- Data layer - what's the strategy: backups or a replicated database file? How frequently are backups made and replicated to the DR site (or how frequently is the database file copy replicated)?
This consideration is all about speed of recovery.
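One common way to keep the backup-to-DR gap small is a scheduled copy job. A sketch of the idea, assuming the backup folder and DR share paths below — they are illustrative, not product defaults:

```powershell
# Illustrative scheduled task that mirrors the local backup folder to a DR
# share every hour. e:\backups and \\drsite\cps-backups are placeholder paths.
$action  = New-ScheduledTaskAction -Execute 'robocopy.exe' `
           -Argument 'e:\backups \\drsite\cps-backups /MIR /Z /R:2 /W:5 /LOG+:e:\backups\replicate.log'
$trigger = New-ScheduledTaskTrigger -Once -At (Get-Date) `
           -RepetitionInterval (New-TimeSpan -Hours 1)
Register-ScheduledTask -TaskName 'CPS-Backup-Replication' -Action $action -Trigger $trigger
```

The repetition interval is your knob: the shorter it is, the smaller the worst-case data loss, at the cost of more WAN traffic to the DR site.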
- Is the cluster configuration file available?
This is quite important. If you need to restore from backup or create your additional node, the configuration file has crucial information required during recovery.
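Since the recovery steps need that file, keeping a current copy off the cluster (ideally in the DR site) is cheap insurance. The source path below is an assumption — point it at wherever your cluster configuration file actually lives:

```powershell
# Keep an off-site copy of the cluster configuration file after any change.
# Both paths are placeholders for your environment.
$configFile = 'C:\Program Files\Centrify\ClusterConfig\cluster.cfg'   # assumed location
$drShare    = '\\drsite\cps-config'                                   # placeholder DR share

Copy-Item -Path $configFile -Destination $drShare -Force
Get-FileHash -Path $configFile -Algorithm SHA256   # record the hash for integrity checks
```

Recording the hash alongside the copy lets you confirm during a recovery that the file you are about to use matches the last known-good version.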
- How is the system being used?
In environments where secure session access is more prevalent than password checkouts, there will be less data loss because the passwords are mostly unknown (except when break glass is needed).
In this exercise, we cover setting up a stand-by recovery server and two restore paths: from backup and from a replicated database file.
Setting Up a Stand-by Recovery Server
Restoring from Backup
Restoring from backup is quite simple with CPS; the only consideration is data loss, which depends on how the system is used and on the backup frequency. Time to recovery depends on the recovery dependencies (such as storage, network and systems) and on how familiar the tech leads are with the restore process.
What you need
Start-Service W3SVC
Start-Service cisdb-pgsql
In the scripts folder, run: .\pg_restore.ps1 -SourceDir e:\backups -initdb -verbose
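Putting those commands together, a restore run on the stand-by server looks roughly like this — a sketch, assuming the scripts-folder location, which is environment-specific:

```powershell
# Sketch of the restore sequence on the stand-by server.
Set-Location 'C:\Program Files\Centrify\Scripts'    # assumed scripts folder

# Bring up the web front end and the database service.
Start-Service W3SVC
Start-Service cisdb-pgsql

# Restore the database from the replicated backup directory.
.\pg_restore.ps1 -SourceDir e:\backups -initdb -verbose

# Quick sanity check that both services are still running.
Get-Service W3SVC, cisdb-pgsql | Format-Table Name, Status
```

The final `Get-Service` check is just a convenience before moving on to the functional verification described next.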
Now it's time to verify that the system can authenticate local users (Centrify Directory users), followed by adding Connectors.
Restoring from a Replicated Database File (or Orphaned File)
The consideration continues to be data loss. However, in this case, depending on the latency and replication frequency, this may be a very close copy of what was in production before the disaster struck. The database engine and transaction logs take care of data integrity.
What you need
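One way this typically plays out — a sketch only, and both data-directory paths below are assumptions for illustration: stop the database service, swap the replicated files into place, and let the engine replay its transaction log on startup.

```powershell
# Sketch: bring up CPS from a replicated database data directory.
# Both paths below are placeholders for your environment.
$replicatedData = 'e:\replica\pgsql\data'                  # replicated copy from the primary
$liveData       = 'C:\Program Files\Centrify\pgsql\data'   # assumed service data directory

Stop-Service cisdb-pgsql -ErrorAction SilentlyContinue

# Mirror the replicated files into place; on startup the engine replays its
# transaction log to bring the data to a consistent state.
Robocopy $replicatedData $liveData /MIR

Start-Service cisdb-pgsql
Start-Service W3SVC
```

As with the backup path, finish with the same verification: local (Centrify Directory) authentication first, then connectors.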
Restoring Normal Operations
The guidance in this section depends on the circumstances.
Centrify Connectors and Disaster Recovery
Centrify connectors provide many services, and for CPS they are absolutely required. This means you should either have connectors in your DR site (even if they are underutilized) or be able to spin them up quickly after the CPS server infrastructure is back up.
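A minimal readiness check for pre-staged DR connectors might look like the sketch below. The hostnames and the service display name are assumptions — verify the actual service name on your connector hosts before relying on it:

```powershell
# Check that pre-staged connectors in the DR site are up.
# Hostnames and the service display name are placeholders for your environment.
$connectors  = 'conn1.dr.example.com', 'conn2.dr.example.com'
$serviceName = 'Centrify Connector'    # assumed display name; verify on your hosts

foreach ($c in $connectors) {
    # Get-Service -ComputerName requires Windows PowerShell (5.1) and remote access rights.
    Get-Service -ComputerName $c -DisplayName $serviceName |
        Select-Object @{n = 'Host'; e = { $c }}, DisplayName, Status
}
```

Running a loop like this as part of the DR exercise confirms the connectors will be ready the moment the vault comes back.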
Videos (wip)
Privilege Service On Premises - High Availability and Disaster Recovery - Where to next?