System Health Monitoring
Key Manager monitors the internal condition of its critical processes. The processes have self-recovery mechanisms, but in the rare cases where an unrecoverable fatal error occurs, the system watchdog sends an e-mail alert about it. For more information about the watchdog, see Key Manager Watchdog.
The system health monitoring processes also monitor the periodic scan jobs. When an interval has been set for the periodic scan jobs, the health monitoring processes check the last run times of at least half the jobs, and whether those run times are too far in the past. The system considers a last run time to be too far in the past when at least 150% of the set interval has elapsed since it. The interval settings for the periodic scan jobs are on the Settings→General→Host page, and are the following:
- Full-scan interval
- Authorized key-scan interval
- Configuration-scan interval
- Key-activity-scan interval
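The 150% rule can be illustrated with a minimal Python sketch. This is not Key Manager code: the function name `is_overdue` and the use of `None` for "no interval set" are assumptions made for illustration only.

```python
from datetime import datetime, timedelta
from typing import Optional

# From the text: a run is "too far in the past" at 150% of the set interval.
STALE_FACTOR = 1.5


def is_overdue(last_run: datetime, interval: Optional[timedelta], now: datetime) -> bool:
    """Return True if at least 150% of the interval has elapsed since last_run."""
    if interval is None:
        # Assumption: None stands for "no interval set", so nothing is checked.
        return False
    return now - last_run >= interval * STALE_FACTOR


now = datetime(2024, 1, 1, 12, 0)
interval = timedelta(hours=4)  # limit is 4 h * 1.5 = 6 h

print(is_overdue(datetime(2024, 1, 1, 7, 0), interval, now))   # 5 h elapsed
print(is_overdue(datetime(2024, 1, 1, 5, 30), interval, now))  # 6.5 h elapsed
```

With a four-hour interval, a job that last ran five hours ago is still within the limit, while one that last ran six and a half hours ago is flagged.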
Key Manager Watchdog
Key Manager watchdog is a set of processes that alert you to notable failures in internal Key Manager functions, such as failures to run necessary services or failures in running periodic jobs. The watchdog logs messages into syslog in CEF format, and can be set to send e-mail warnings about errors it detects.
The watchdog runs on only one back end at a time. If the Key Manager servers do not detect a running watchdog on any of the back ends, one of the back-end servers is selected to start a new set of watchdog processes.
You can set the e-mail recipients for watchdog alerts on the Settings→Alerts page, in the Recipient(s) of PKM health (watchdog) alerts setting. You can also set the e-mail format in Settings→Alerts→E-mail templates with the Watchdog event notification e-mail template setting.
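The single-instance behavior can be sketched as follows. This is only an illustration: the function `pick_watchdog_host` is hypothetical, and the selection rule used here (alphabetically first back-end name) is an assumption, since the actual rule is not described in this section.

```python
from typing import Dict, Optional


def pick_watchdog_host(backends: Dict[str, bool]) -> Optional[str]:
    """Return the back end that should start a watchdog, or None if none is needed.

    `backends` maps a back-end name to whether a watchdog is running on it.
    """
    if not backends or any(backends.values()):
        # No back ends known, or a watchdog is already running somewhere.
        return None
    # Assumption: choose the alphabetically first name. The real rule may differ.
    return min(backends)


print(pick_watchdog_host({"backend-a": False, "backend-b": True}))
print(pick_watchdog_host({"backend-b": False, "backend-a": False}))
```

The first call returns `None` because a watchdog is already running on one back end; the second selects a back end to start one.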
Watchdog Event Codes
The following are event codes that indicate something is wrong with the watchdog:
5000
Watchdog worker not responding
5001
Watchdog worker exited
5002
Watchdog worker sent malformed message to master
5003
Could not get workers running or all workers exited
5050
Master watchdog has encountered an exception
5051
Watchdog worker has encountered an exception
5052
Master watchdog failed to send e-mail. This event does not appear in any e-mails, because the e-mails cannot be sent.
5053
Master watchdog could not start a worker
The following event codes are related to job execution:
5100
Too much time has elapsed since the last successful execution
5101
A job of a particular type has not been executed at all
The following event codes are internal and will not appear in any e-mails:
75000
Master watchdog operating normally. This event is sent approximately once per hour if no other events have been generated.
75001
Watchdog worker operating normally
75002
Timeout occurred inside watchdog
75003
Watchdog abort requested
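When processing watchdog messages from syslog, a lookup table for the codes above can be convenient. The following Python sketch is illustrative only: the names `WATCHDOG_EVENTS`, `NOT_EMAILED`, `describe`, and `appears_in_email` are hypothetical and not part of Key Manager. The codes and descriptions are taken from the lists above.

```python
# Event codes and descriptions as listed in this section.
WATCHDOG_EVENTS = {
    5000: "Watchdog worker not responding",
    5001: "Watchdog worker exited",
    5002: "Watchdog worker sent malformed message to master",
    5003: "Could not get workers running or all workers exited",
    5050: "Master watchdog has encountered an exception",
    5051: "Watchdog worker has encountered an exception",
    5052: "Master watchdog failed to send e-mail",
    5053: "Master watchdog could not start a worker",
    5100: "Too much time has elapsed since the last successful execution",
    5101: "A job of a particular type has not been executed at all",
    75000: "Master watchdog operating normally",
    75001: "Watchdog worker operating normally",
    75002: "Timeout occurred inside watchdog",
    75003: "Watchdog abort requested",
}

# Per the text: 5052 cannot be e-mailed, and the internal codes never are.
NOT_EMAILED = {5052, 75000, 75001, 75002, 75003}


def describe(code: int) -> str:
    """Return the description for a code, or a placeholder if it is not listed."""
    return WATCHDOG_EVENTS.get(code, "Unknown watchdog event code")


def appears_in_email(code: int) -> bool:
    """Return True if the code is listed and can appear in a watchdog e-mail."""
    return code in WATCHDOG_EVENTS and code not in NOT_EMAILED


print(describe(5100))
print(appears_in_email(5052))
```

Codes outside the table are reported as unknown rather than guessed, so new or undocumented codes remain visible during log review.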