[Labs] Testing the Disaster Recovery Options for Customer-managed Privilege Service

11 April,19 at 11:50 AM

This article belongs to a series of blog posts about the customer-managed version of Centrify Privilege Service. This entry focuses on strategies for Disaster Recovery. If this is the first article in the series you are reading, I recommend starting with the High Availability with Windows Server Failover Clustering post.

 

A quick primer on Disaster Recovery

Based on Wikipedia's definition, disaster recovery (DR) involves a set of policies, tools and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster.

 

DR differs from high availability in one fundamental respect: in a disaster, depending on the environment, the type of issue and the recovery strategy, there is a high probability of data loss. For example, if recovery is from a backup or from a low-latency replicated file, the data lost is whatever changed in the gap between the last backup and the restoration (or between the last replication and the recovery). Many of the strategies are situation-dependent; perhaps a natural disaster has left the main site unable to function, and the DR systems take over while the main site is brought back online.
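
To make that gap concrete, here is a minimal Python sketch of the arithmetic. The function names, timestamps and intervals are illustrative assumptions, not values taken from Privilege Service.

```python
from datetime import datetime, timedelta


def data_loss_window(last_copy: datetime, disaster: datetime) -> timedelta:
    """Changes made after the last backup (or replication) are not in the copy."""
    if disaster < last_copy:
        raise ValueError("the last copy cannot be newer than the disaster")
    return disaster - last_copy


def worst_case_loss(copy_interval: timedelta, transfer_delay: timedelta) -> timedelta:
    """Worst case: the disaster strikes just before the next copy reaches the DR site."""
    return copy_interval + transfer_delay


# Hypothetical example: nightly backup at 01:00, disaster at 14:30 the same day.
backup_loss = data_loss_window(datetime(2019, 4, 11, 1, 0), datetime(2019, 4, 11, 14, 30))
# Hypothetical example: database file replicated every 5 minutes, 1 minute to transfer.
replica_worst = worst_case_loss(timedelta(minutes=5), timedelta(minutes=1))

print(backup_loss)    # 13:30:00
print(replica_worst)  # 0:06:00
```

With these assumed numbers, the backup strategy loses up to half a day of changes while the replicated file loses minutes, which is the trade-off the rest of this post works through.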

Security controls break down into preventive, detective and corrective controls:

 

Preventive Controls - High-availability mechanisms

See these posts that discuss topics around HA, backups and replication.

Detective Controls 

Privilege Service can be monitored via application monitoring, Windows Cluster Service events and connector monitoring.

Corrective Controls 

This is the core of this blog post. We will try to describe the people, process and technology needs for effective disaster recovery of Privilege Service.


 

Recoverability

Security practitioners agree that in DR scenarios some infrastructure systems may not be recoverable at full capacity, either because of the circumstances or because of the infrastructure design of the disaster recovery site. The DR site rarely has the same specs as the real data center site (with the exception of ISPs or public cloud providers).

Finally, your disaster recovery strategy has to align with the business continuity plan and have clear recovery point and recovery time objectives. Success in DR scenarios depends on clear recovery point and time goals, planning, policy, procedures and well-trained people who have practiced recovery.
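
As a rough illustration of checking a practice run against those goals, the Python sketch below compares recorded timings with a recovery point objective and a recovery time objective. The objectives, step names and durations are all hypothetical.

```python
from datetime import timedelta

# Hypothetical objectives from a business continuity plan.
RPO = timedelta(hours=1)  # maximum tolerable data loss
RTO = timedelta(hours=4)  # maximum tolerable downtime

# Hypothetical timings recorded during a DR exercise.
exercise_steps = {
    "storage and network available": timedelta(minutes=45),
    "Active Directory and DNS available": timedelta(minutes=30),
    "database restored": timedelta(minutes=40),
    "service verified": timedelta(minutes=20),
}
measured_data_loss = timedelta(minutes=25)


def evaluate_exercise(steps, data_loss, rpo, rto):
    """Compare a DR exercise against the recovery point and time objectives."""
    downtime = sum(steps.values(), timedelta())
    return {
        "downtime": downtime,
        "meets_rto": downtime <= rto,
        "meets_rpo": data_loss <= rpo,
        "slowest_step": max(steps, key=steps.get),
    }


result = evaluate_exercise(exercise_steps, measured_data_loss, RPO, RTO)
print(result)
```

Tracking the slowest step from each exercise gives the team a concrete place to focus before the next one.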

 

Planning for Disaster Recovery

Here are some discussion and planning topics:


- Do you understand the DR site connectivity and infrastructure?
- Is Active Directory present in the DR site?
- Is DNS present in the DR site?
These are basic questions. A baseline infrastructure is needed for the recovery to work.

 

- Are there any core infrastructure components (like core switches, routers, systems, databases, secrets) that are crucial in a DR scenario?
If you answer yes to this question, you should have a sealed envelope (or an electronic equivalent) that gives access to key credentials. This is because the vault needs storage, network, AD and DNS to work; depending on the disaster recovery strategy, you may need credentials just to get to the point of recovering the vault.

 

- Do you have a stand-by server or remote cluster pre-staged in the DR site?
This will speed up the recovery. Once storage, network, Active Directory and DNS are up and running, CPS recovery can commence.

 

- Do you have the right people in place, with clear instructions on how to communicate and coordinate during the recovery? Are the designated leads familiar with, and capable of, restoring Privilege Service?
This topic should be self-evident. DR exercises can flush out issues in these areas.

- Data layer: what's the strategy? Backups or a replicated database file? How frequently are backups made and replicated to the DR site (or how frequently is the database file copy replicated)?
This consideration is all about speed of recovery.

 

- Is the cluster configuration file available?
This is quite important. If you need to restore from backup or create an additional node, the configuration file has crucial information required during recovery.

 

- How is the system being used?
In environments where secure session access is more prevalent than password checkouts, there will be less data loss because the passwords are mostly unknown (except when break glass is needed).
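
One way to keep track of the planning questions above is a simple checklist review, sketched here in Python. The key names are illustrative labels for the questions, and the split between items that block recovery and items that only slow it down is an assumption you should adapt to your own environment.

```python
# Answers gathered during planning; the keys are illustrative labels for the
# questions above, not settings of any product.
answers = {
    "active_directory_in_dr_site": True,
    "dns_in_dr_site": True,
    "break_glass_credentials_available": False,
    "standby_server_prestaged": True,
    "recovery_leads_trained": True,
    "backups_or_replica_reach_dr_site": True,
    "cluster_configuration_file_available": False,
}

# Assumed classification: without these, recovery cannot start at all.
BLOCKING = {
    "active_directory_in_dr_site",
    "dns_in_dr_site",
    "backups_or_replica_reach_dr_site",
    "cluster_configuration_file_available",
}


def review(answers):
    """Split negative answers into blockers and items that only slow recovery."""
    gaps = [name for name, ok in answers.items() if not ok]
    blockers = sorted(g for g in gaps if g in BLOCKING)
    slowdowns = sorted(g for g in gaps if g not in BLOCKING)
    return blockers, slowdowns


blockers, slowdowns = review(answers)
print("Blocks recovery:", blockers)
print("Slows recovery:", slowdowns)
```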

 

In this exercise

  1. We have a stand-by Server in our DR site.
    Strategies may vary: you may need a cluster, but you can have a "single master" with your secrets to support the recovery effort. This system may be part of the cluster (if it spans multiple sites), or simply a single-master server that has been configured to participate in the current installation (by using the configuration file).
  2. Choose your strategy based on your environment (or scenario)
    E.g. restore from backup or restore from file.
  3. Perform the recovery
  4. Test your results.
  5. Make any follow-up adjustments (generate report, update accounts).

 

Setting up a Stand-by Recovery Server

  1. Install Windows Server 2012 R2 based on your corporate image.
  2. Configure the logical disks to match the original installation and create the folder structure.
  3. Set up Centrify Privilege Service.
  4. Configure Privilege Service with the cluster configuration file.
  5. Optional:  Add to Windows Clustering.

 

Restoring from Backup

Restoring from backup is quite simple with CPS; the only consideration is data loss, which depends on how the system is used and on the backup frequency. Time to recovery depends solely on the dependencies (such as storage, network and systems for recovery) and on how familiar the tech leads are with the restore process.

 

What you need

  • backup files
  • configuration file (if no stand-by system is available)
  • access to DNS system

  1. Log in to your DR Windows Server 2012 R2 system (with administrative credentials)
  2. Make sure the system has the same logical letter layout as the cluster nodes (e.g. if the database was configured with the e:\cps-db folder, this must be consistent).
  3. Optional: If you don't have a stand-by system, read the section above.
  4. In DNS, set up a CNAME record for the CPS service address (e.g. vault.example.com) and point it to your recovery server.
  5. Start the web and database services (they are set to manual start because WSFC normally controls them).
    Start-Service W3SVC
    Start-Service cisdb-pgsql
  6. Run the restore program using the backup files.
    In the scripts folder, type:
    .\pg_restore.ps1 -SourceDir e:\backups -initdb -verbose



Now it's time to verify that the system can authenticate local (Centrify Directory) users, and then to add Connectors.
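
Before running a restore like the one above, a quick pre-flight check can confirm that the folder layout and the backup files are in place. The following is a minimal Python sketch, not part of the product tooling; the function name is hypothetical, and the paths are the example paths used in this article.

```python
import os


def restore_preflight(db_dir, backup_dir):
    """Return a list of problems that would stop a restore; empty means ready."""
    problems = []
    if not os.path.isdir(db_dir):
        problems.append(f"database folder missing: {db_dir}")
    if not os.path.isdir(backup_dir):
        problems.append(f"backup folder missing: {backup_dir}")
    elif not os.listdir(backup_dir):
        problems.append(f"backup folder is empty: {backup_dir}")
    return problems


# Example paths from this article; adjust them to match your cluster nodes.
for problem in restore_preflight(r"e:\cps-db", r"e:\backups"):
    print(problem)
```

Nothing is printed when both folders exist and the backup folder contains at least one file.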

 

Restoring from Replicated Database file (or orphaned file)

The consideration continues to be data loss. However, in this case, depending on the latency and replication frequency, this may be a very close copy of what was in production before the disaster struck.  The database engine and transaction logs will take care of the integrity of the data.

 

What you need

  • replicated file (or left-behind database)
  • configuration file (if no stand-by system is available)
  • access to DNS system

  1. Log in to your DR Windows Server 2012 R2 system (with administrative credentials)
  2. Make sure the system has the same logical letter layout as the cluster nodes (e.g. if the database was configured with the e:\cps-db folder, this must be consistent).
  3. Copy the replicated file to the destination that is expected based on the installation (e.g. E:\cps-db).
  4. Optional: If you don't have a stand-by system, see the stand-by recovery server section above.
  5. In DNS, set up a CNAME record for the CPS service address (e.g. vault.example.com) and point it to your recovery server.
  6. Start the web and database services (they are set to manual start because WSFC normally controls them; you may have to adjust this depending on the length of the disaster).
  7. Verify that the system is working as expected
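
For the copy in step 3, one optional safeguard is to compare checksums of the replicated file and the copied file before starting the services. This is a minimal Python sketch under the assumption that the database is copied as individual files while the database service is stopped; the function names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large database files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_is_intact(source_path, destination_path):
    """True when the copied file is byte-for-byte identical to the replicated file."""
    return sha256_of(source_path) == sha256_of(destination_path)
```

Call `copy_is_intact` once per copied file; a mismatch means the copy should be repeated before the database service is started.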

Restoring Normal Operations

The steps in this section depend on the circumstances.

  • If the main site is unrecoverable, then you must plan to complete the cluster and add connectors to service the remaining sites.
  • If the main site is recoverable, then the next steps depend on how you used the system during the outage. Any accounts that were checked out will have had their passwords rotated after use; those passwords will need to be updated once the original CPS cluster is back online.
  • Once everything is ready to resume normal operations, delete the CNAME record created during recovery and set the services back to manual start (if they were changed).
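
To work out which passwords need updating on the original cluster, you can filter a checkout report by the period in which the recovery server was in charge. The Python sketch below assumes a hypothetical record format (account name plus check-in time); it does not reflect the actual report layout of Privilege Service.

```python
from datetime import datetime

# Hypothetical checkout records exported from the recovery server.
checkouts = [
    {"account": "root@db01", "checked_in": datetime(2019, 4, 11, 15, 10)},
    {"account": "admin@sw-core", "checked_in": datetime(2019, 4, 12, 9, 5)},
    {"account": "svc-backup@fs02", "checked_in": datetime(2019, 4, 10, 8, 0)},
]


def accounts_to_update(checkouts, dr_start, dr_end):
    """Accounts whose passwords were rotated while the DR server was in charge."""
    return sorted(
        {c["account"] for c in checkouts if dr_start <= c["checked_in"] <= dr_end}
    )


dr_start = datetime(2019, 4, 11, 14, 30)
dr_end = datetime(2019, 4, 13, 18, 0)
print(accounts_to_update(checkouts, dr_start, dr_end))
# ['admin@sw-core', 'root@db01']
```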

Centrify Connectors and Disaster Recovery

Centrify connectors provide many services, and for CPS they are absolutely needed. This means that you should either have connectors in your DR site (even if they are underutilized) or have the ability to spin them up quickly once the CPS server infrastructure is back up.

 

