KB-4698: hadoop fs -ls /user command fails with error message

Centrify DirectControl

12 April,16 at 11:13 AM

Applies to:
 
All versions of Centrify DirectControl
 
Problem:
 
Kerberos Authentication Issues for Long Jobs
 
Description
 
After the kdc_timeout setting was adjusted in /etc/krb5.conf (note: this setting does not exist by default) and the KDCs were reordered in the same file
(note: Centrify manages /etc/krb5.conf by default and does not require it for its own operation), any Hadoop job running longer than 20 minutes failed with an error string like the following:
 
GSS initiate failed [Caused by GSSException: No valid credentials provided
(Mechanism level: Fail to create credential. (63) - No service creds)]
 
Cause:
 
Centrify was brought in to assist with troubleshooting this problem. Centrify creates a new Kerberos credentials cache for each session the user has open
(note: this is the default and by design), so the issue was first tested with only a single session established.
A loop that called hdfs dfs -ls / every 5 seconds was used as the test job. Even with only a single session active, the issue recurred at the 20 minute mark.
After the 20 minute mark, another Kerberized service was accessed using ldapsearch. That request succeeded, and succeeded again after waiting
another 20 minutes. This ruled out a Centrify issue.
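
The reproduction loop described above could be sketched roughly as follows. The helper name probe_loop and its arguments are hypothetical and not part of the original test, which simply ran hdfs dfs -ls / every 5 seconds from a single session and watched for the first failure around the 20 minute mark.

```shell
#!/bin/sh
# probe_loop INTERVAL COUNT COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND every INTERVAL seconds, up to COUNT times,
# and stops at the first failure, reporting how many seconds had elapsed.
probe_loop() {
    interval=$1
    count=$2
    shift 2
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$count" ]; do
        if ! "$@" >/dev/null 2>&1; then
            now=$(date +%s)
            echo "FAILED after $((now - start)) seconds"
            return 1
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    echo "OK: $count runs succeeded"
    return 0
}

# Example (not run here): roughly 25 minutes of probes, 5 seconds apart.
# probe_loop 5 300 hdfs dfs -ls /
```

Running the same helper against a second Kerberized service (for example an ldapsearch command line) in the same session gives a comparison point like the one used in this investigation.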
 
A tcpdump capture was obtained during both the healthy period and the failing period.
 
Additionally, Kerberos debug was enabled using the HADOOP_OPTS environment variable. 
The tcpdump revealed that a service principal named krbtgt/YOURCOMPANY.COM was being requested that did not exist. The Kerberos debug showed that,
at the 20 minute mark, Kerberos was trying to revalidate the TGT with the KDC and the connection to each KDC was failing, even though the
KDCs (domain controllers) were active and responding to new kinit commands.
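
As a rough sketch of how this kind of diagnostic data can be collected: the article does not say which option was passed through HADOOP_OPTS, so the JVM system property sun.security.krb5.debug used below is an assumption, as are the capture file path and the port 88 filter in the commented tcpdump line.

```shell
#!/bin/sh
# Append the JVM Kerberos debug flag to whatever HADOOP_OPTS already holds.
# Assumption: sun.security.krb5.debug is the property that was enabled.
HADOOP_OPTS="${HADOOP_OPTS:+$HADOOP_OPTS }-Dsun.security.krb5.debug=true"
export HADOOP_OPTS

# Then rerun the failing command in this shell (not run here):
# hdfs dfs -ls /

# In a separate terminal, as root, capture Kerberos traffic (assumed filter
# on the standard Kerberos port 88; the output path is only an example):
# tcpdump -i any -w /tmp/krb5.pcap port 88
```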
 
Because the issue occurred only with Hadoop Kerberos calls, which are made through Java, and not with other Kerberos calls (which are made
directly using the MIT Kerberos client), it was suspected that the kdc_timeout setting was set incorrectly.

This setting is undocumented, so the Java source code had to be inspected to determine that kdc_timeout
should be specified in milliseconds, not seconds. As a result, calls to the KDC timed out much faster than anticipated, and none could succeed because the timeout was too short.
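
To make the unit mismatch concrete, here is a minimal sketch of the conversion. The helper name seconds_to_kdc_timeout is hypothetical; it only illustrates that an intended timeout in seconds has to be multiplied by 1000 before it is written as kdc_timeout for the Java client.

```shell
#!/bin/sh
# seconds_to_kdc_timeout SECONDS
# Hypothetical helper: convert an intended timeout in whole seconds into the
# millisecond value the Java Kerberos client expects for kdc_timeout.
seconds_to_kdc_timeout() {
    echo $(( $1 * 1000 ))
}

seconds_to_kdc_timeout 2    # prints 2000
```

A value of 2 written directly into the file is therefore read by Java as 2 milliseconds, not 2 seconds.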
 
The 20 minute interval is related to the fact that AD is being used as the KDC in this environment. Microsoft only addresses revocation of a user
account's access to new resources within the same domain, and only for periods longer than 20 minutes.
 
Kerberos V5 does not enforce revocation of accounts prior to the expiration of issued tickets. If the POLICY_KERBEROS_VALIDATE_CLIENT bit is set in the AuthenticationOptions setting on the KDC, then KILE (Windows's implementation of Kerberos) will enforce revocation on the account KDCs.

When this property is set on the account KDC for the client's domain, and the TGT is older than an implementation-specific time (20 minutes), the account KDC MUST verify that the account is still in good standing. Good standing means the account has not expired, been locked out, been disabled, or otherwise
is not allowed to log on. If the KDC receiving the session ticket request is not in the user account's domain, then the check cannot be made.
 
Resolution:
 
It is recommended to follow these steps to prevent further issues regarding invalid kdc_timeout configuration:
 
1) In /etc/centrifydc/centrifydc.conf, turn off the autoedit parameter for krb5.conf. The customer will then be managing /etc/krb5.conf on their own,
and Centrify will not revert the changes below when adclient restarts.
 
adclient.autoedit.krb5: false
 
2) Change the /etc/krb5.conf file to specify the kdc_timeout setting in milliseconds, rather than seconds.
 
After adjusting the kdc_timeout setting in the krb5.conf file to 2000 (for 2 seconds), it was observed that jobs longer than 20 minutes are executing successfully. 
 
It was also verified that the problem recurs when the timeout is set back to 2 (2 milliseconds).
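
A quick sanity check of the edited file might look like the sketch below. The function name check_kdc_timeout and the 1000 ms threshold are hypothetical choices for illustration, and the sample fragment in the comment assumes the setting is placed under [libdefaults]; adjust both to match the actual environment.

```shell
#!/bin/sh
# check_kdc_timeout FILE [MIN_MS]
# Hypothetical check: prints the first kdc_timeout value found in FILE and
# flags values below MIN_MS (default 1000) as probably written in seconds.
#
# Assumed shape of the relevant krb5.conf fragment:
#   [libdefaults]
#    kdc_timeout = 2000
check_kdc_timeout() {
    file=$1
    min_ms=${2:-1000}
    value=$(sed -n 's/^[[:space:]]*kdc_timeout[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$file" | head -n 1)
    if [ -z "$value" ]; then
        echo "kdc_timeout not set"
        return 0
    fi
    if [ "$value" -lt "$min_ms" ]; then
        echo "kdc_timeout=$value looks too small (read as milliseconds)"
        return 1
    fi
    echo "kdc_timeout=$value ok"
}

# Example (not run here):
# check_kdc_timeout /etc/krb5.conf
```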
 
