System Health Monitoring

Key Manager monitors the internal condition of its critical processes. The processes have self-recovery mechanisms, but in the rare cases where an unrecoverable fatal error occurs, the system watchdog sends an e-mail alert about the error. For more information about the watchdog, see Key Manager Watchdog.

The system health monitoring processes also monitor the periodic scan jobs. When an interval has been set for the periodic scan jobs, the health monitoring processes check the last run times of at least half the jobs and whether those runs occurred too far in the past. A last run is considered too far in the past when at least 150% of the set interval has elapsed since it. The interval settings for the periodic scan jobs are on the Settings→General→Host page, and are the following:

  • Full-scan interval

  • Authorized key-scan interval

  • Configuration-scan interval

  • Key-activity-scan interval
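
The 150% rule described above can be sketched in Python as follows. This is a minimal illustration, not Key Manager's actual implementation: the function name `is_overdue`, the constant `OVERDUE_FACTOR`, and the treatment of a zero interval as "no periodic scan configured" are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Threshold from the text above: a run is "too far in the past" once
# at least 150% of the configured interval has elapsed.
OVERDUE_FACTOR = 1.5


def is_overdue(last_run: datetime, interval: timedelta, now: datetime) -> bool:
    """Return True when the time since last_run is at least 150% of interval."""
    if interval <= timedelta(0):
        # Assumption: a zero (or negative) interval means the periodic
        # scan is not scheduled, so there is nothing to check.
        return False
    return (now - last_run) >= interval * OVERDUE_FACTOR
```

With a 24-hour full-scan interval, for example, this check would flag a job whose last run finished 36 hours ago or earlier, while a job that last ran 24 hours ago would still pass.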

Key Manager Watchdog

Key Manager watchdog is a set of processes that raise alerts in case of notable failures in internal Key Manager functions, such as failures to run necessary services or failures in running periodic jobs. The watchdog logs messages into syslog in CEF (Common Event Format), and can be set to send e-mail warnings about the errors it detects.

The watchdog runs on only one back end at a time. If the Key Manager servers do not detect a running watchdog on any of the back ends, one of the back-end servers is selected to start a new set of watchdog processes.

You can set the e-mail recipients for watchdog alerts on the Settings→Alerts page, in the Recipient(s) of PKM health (watchdog) alerts setting. You can also set the e-mail format in Settings→Alerts→E-mail templates with the Watchdog event notification e-mail template setting.

Watchdog Event Codes

The following are event codes that indicate something is wrong with the watchdog:

5000

Watchdog worker not responding

5001

Watchdog worker exited

5002

Watchdog worker sent malformed message to master

5003

Could not get workers running or all workers exited

5050

Master watchdog has encountered an exception

5051

Watchdog worker has encountered an exception

5052

Master watchdog failed to send e-mail. This event will not appear in any e-mails, because the e-mails cannot be sent.

5053

Master watchdog could not start a worker


The following event codes are related to job execution:

5100

Too much time has elapsed since the last successful execution

5101

A job of a particular type has not been executed at all


The following event codes are internal and will not appear in any e-mails:

75000

Master watchdog operating normally - this event is sent approximately once per hour, if no other events have been generated

75001

Watchdog worker operating normally

75002

Timeout occurred inside watchdog

75003

Watchdog abort requested
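
When filtering watchdog messages in a log pipeline, the event codes listed above can be kept in a small lookup table. The following Python sketch is only an illustration: the names `WatchdogEvent`, `EVENTS`, and `describe` are hypothetical and not part of Key Manager, and the `may_appear_in_email` flag simply restates what this page says about which codes never appear in e-mails.

```python
from typing import NamedTuple


class WatchdogEvent(NamedTuple):
    category: str              # "watchdog", "execution", or "internal"
    description: str
    may_appear_in_email: bool  # False for codes this page says are never e-mailed


# Codes and categories as listed on this page.
EVENTS = {
    5000: WatchdogEvent("watchdog", "Watchdog worker not responding", True),
    5001: WatchdogEvent("watchdog", "Watchdog worker exited", True),
    5002: WatchdogEvent("watchdog", "Watchdog worker sent malformed message to master", True),
    5003: WatchdogEvent("watchdog", "Could not get workers running or all workers exited", True),
    5050: WatchdogEvent("watchdog", "Master watchdog has encountered an exception", True),
    5051: WatchdogEvent("watchdog", "Watchdog worker has encountered an exception", True),
    # 5052 cannot be e-mailed, because it reports that sending e-mail failed.
    5052: WatchdogEvent("watchdog", "Master watchdog failed to send e-mail", False),
    5053: WatchdogEvent("watchdog", "Master watchdog could not start a worker", True),
    5100: WatchdogEvent("execution", "Too much time has elapsed since the last successful execution", True),
    5101: WatchdogEvent("execution", "A job of a particular type has not been executed at all", True),
    75000: WatchdogEvent("internal", "Master watchdog operating normally", False),
    75001: WatchdogEvent("internal", "Watchdog worker operating normally", False),
    75002: WatchdogEvent("internal", "Timeout occurred inside watchdog", False),
    75003: WatchdogEvent("internal", "Watchdog abort requested", False),
}


def describe(code: int) -> str:
    """Return a one-line summary for an event code, or a fallback for unknown codes."""
    event = EVENTS.get(code)
    if event is None:
        return f"{code}: unknown event code"
    return f"{code}: {event.description} ({event.category})"
```

A monitoring script could, for instance, call `describe(5100)` to label an overdue-job event, or use `may_appear_in_email` to decide which codes need to be watched in syslog because they never arrive by e-mail.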