The health of the bots and hubs is monitored by jaiabot_health, which uses the health framework from Goby (whose per-application results are aggregated by goby_coroner).
Every 10 seconds, goby_coroner publishes goby::health::request (an empty message), which is subscribed to by all Goby MultiThreadApplications and SingleThreadApplications. For MultiThreadApplications, the main thread then queries all of its threads internally in a similar fashion.
Each application or thread can override the virtual method health to implement its response to this request.
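As a rough illustration of the override pattern, the sketch below uses stand-in types rather than the real Goby classes. `AppBase`, `SensorDriver`, `query_state`, and the simplified `ThreadHealth` struct are all hypothetical and only mimic the shape described here; the real method signature and message fields should be taken from the Goby headers.

```cpp
#include <cassert>
#include <string>

// Stand-in types (assumptions): these only mimic the shape of the Goby
// ThreadHealth message and application base class; they are NOT the Goby API.
enum class HealthState { HEALTH_OK, HEALTH_DEGRADED, HEALTH_FAILED };

struct ThreadHealth
{
    std::string name;
    HealthState state = HealthState::HEALTH_OK;
};

class AppBase
{
  public:
    virtual ~AppBase() = default;

    // Default response: report OK, which acts as a heartbeat.
    virtual void health(ThreadHealth& health) { health.state = HealthState::HEALTH_OK; }
};

// Hypothetical driver that reports DEGRADED while its sensor is disconnected.
class SensorDriver : public AppBase
{
  public:
    explicit SensorDriver(bool sensor_connected) : sensor_connected_(sensor_connected) {}

    void health(ThreadHealth& health) override
    {
        health.name = "sensor_driver";
        health.state = sensor_connected_ ? HealthState::HEALTH_OK
                                         : HealthState::HEALTH_DEGRADED;
    }

  private:
    bool sensor_connected_;
};

// Helper for the example: ask an app for its health and return the state.
HealthState query_state(AppBase& app)
{
    ThreadHealth response;
    app.health(response);
    return response.state;
}

// Small usage example: a disconnected sensor yields a DEGRADED response.
void example()
{
    SensorDriver driver(false);
    assert(query_state(driver) == HealthState::HEALTH_DEGRADED);
}
```

An app that does not override `health` in this sketch falls back to the base-class response, which mirrors the heartbeat behavior of the default.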
The parameter health (a ThreadHealth Protobuf message) can be modified to include the desired response. At a minimum, the state field should be set to one of these:

- HEALTH_OK: Functioning nominally.
- HEALTH_DEGRADED: Something is going wrong, but it isn't critical.
- HEALTH_FAILED: Something critical has gone wrong.

If the health method isn't overridden, each thread responds with HEALTH_OK, which serves as a "heartbeat" for all apps and threads. However, where possible, applications should provide more detailed information (e.g., the state of connected sensors for drivers). The aggregate health data from all threads is published interprocess as goby::health::response.
goby_coroner batches the goby::health::response messages from all the apps that it is configured to watch (--expected_name) and puts them into a report (goby::health::report, type VehicleHealth). The report includes a top-level state that is the worst of the reported app states: if any app is DEGRADED, the system is considered DEGRADED; if any app is FAILED, the system is considered FAILED. Only if all apps are OK is the system considered OK.
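The "worst state wins" rule can be sketched as follows. This is not goby_coroner's code; the enum is a stand-in, and `worst_state` is a hypothetical helper. Treating an empty list as OK is an assumption of this sketch only.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Stand-in enum (assumption), ordered from best to worst so that std::max
// selects the worse of two states.
enum class HealthState { HEALTH_OK = 0, HEALTH_DEGRADED = 1, HEALTH_FAILED = 2 };

// Returns the worst state among all reported app states.
// Assumption for this sketch: an empty list is treated as HEALTH_OK.
HealthState worst_state(const std::vector<HealthState>& app_states)
{
    HealthState worst = HealthState::HEALTH_OK;
    for (HealthState s : app_states) worst = std::max(worst, s);
    return worst;
}

// Small usage example: one DEGRADED app makes the whole system DEGRADED.
void example()
{
    assert(worst_state({HealthState::HEALTH_OK, HealthState::HEALTH_DEGRADED}) ==
           HealthState::HEALTH_DEGRADED);
}
```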
jaiabot_health subscribes to the goby::health::report produced by goby_coroner and uses it as the basis for jaiabot-specific reporting.
Since goby::health::report is a large and variable message containing strings, we also want to produce a specialized set of enumerated errors (which correspond to the HEALTH_FAILED state) and warnings (which correspond to the HEALTH_DEGRADED state). These are defined in health.proto as a Protobuf extension to goby::middleware::protobuf::ThreadHealth (the message used in the virtual health() method).
Different apps set the enumerations that are appropriate for that app's function. These are grouped in rough "families":
- ERROR__FAILED__* (not yet implemented): The systemd service for this app failed.
- ERROR__NOT_RESPONDING__*: The Goby app did not respond to the last goby_coroner request. (This often overlaps with ERROR__FAILED__*, but not necessarily; e.g., an app may still be running but hung.)
- ERROR|WARNING__MISSING_DATA__* (not yet implemented): A particular required or expected data stream is missing at jaiabot_fusion.
- ERROR|WARNING__COMMS__* (not yet implemented): Communications-related errors or warnings.
- ERROR__MOOS__* (not yet implemented): MOOS app-related errors.
- ERROR|WARNING__SYSTEM__* (not yet implemented): System-related errors or warnings (memory, disk, CPU, etc.).
- ERROR|WARNING__VEHICLE__* (not yet implemented): Vehicle-level errors or warnings (low battery, thruster, etc.).

jaiabot_health can be set to automatically restart all the jaiabot services if a period of time elapses with no HEALTH_OK report:
Since jaiabot_health is run as root to allow it to restart the services, it is also the place that handles system-level power state changes (reboot or shutdown).
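The automatic restart decision described above amounts to a timeout on the last HEALTH_OK report. The sketch below shows one way to express that logic. `RestartWatchdog` and its methods are hypothetical names, the 20-second timeout in the example is an arbitrary value, and none of this reflects the actual jaiabot_health implementation or its configuration fields.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical watchdog: tracks the time of the last HEALTH_OK report and
// signals when the configured timeout has elapsed without one.
class RestartWatchdog
{
  public:
    using Clock = std::chrono::steady_clock;

    RestartWatchdog(Clock::duration timeout, Clock::time_point start)
        : timeout_(timeout), last_ok_(start)
    {
    }

    // Record the time of a report whose top-level state was HEALTH_OK.
    void report_ok(Clock::time_point now) { last_ok_ = now; }

    // True once more than `timeout` has elapsed since the last OK report.
    bool restart_due(Clock::time_point now) const { return now - last_ok_ > timeout_; }

  private:
    Clock::duration timeout_;
    Clock::time_point last_ok_;
};

// Small usage example with an arbitrary 20-second timeout.
void example()
{
    using std::chrono::seconds;
    RestartWatchdog::Clock::time_point t0{};
    RestartWatchdog watchdog(seconds(20), t0);
    assert(!watchdog.restart_due(t0 + seconds(10)));
    assert(watchdog.restart_due(t0 + seconds(21)));
}
```

Passing the current time in explicitly, rather than reading the clock inside the class, keeps the decision easy to test without waiting for real time to pass.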
For simulation purposes, to avoid powering off the development computer, this power state handling is disabled using:
The enumerations written by jaiabot_health and other jaiabot apps are well suited for inclusion in the BotStatus DCCL message, and up to 5 each of errors and warnings are included by jaiabot_fusion for display in Central Command. If more errors or warnings exist, the excess ones are omitted and ERROR__TOO_MANY_ERRORS_TO_REPORT_ALL and/or WARNING__TOO_MANY_WARNINGS_TO_REPORT_ALL are added to the list.
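A minimal sketch of this truncation follows, using strings in place of the real Protobuf enumerations. `limit_for_status` is a hypothetical helper. Whether the overflow marker counts toward the limit of 5 is not stated here, so this sketch assumes it is appended after the first 5 entries.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Keeps at most `max_items` entries; if any were dropped, appends the
// overflow marker. Assumption: the marker is added in addition to the
// retained entries rather than replacing one of them.
std::vector<std::string> limit_for_status(std::vector<std::string> items,
                                          std::size_t max_items,
                                          const std::string& overflow_marker)
{
    if (items.size() > max_items)
    {
        items.resize(max_items);
        items.push_back(overflow_marker);
    }
    return items;
}

// Small usage example: seven placeholder errors are cut to five plus the marker.
void example()
{
    std::vector<std::string> errors{"E1", "E2", "E3", "E4", "E5", "E6", "E7"};
    auto reported = limit_for_status(errors, 5, "ERROR__TOO_MANY_ERRORS_TO_REPORT_ALL");
    assert(reported.size() == 6);
    assert(reported.back() == "ERROR__TOO_MANY_ERRORS_TO_REPORT_ALL");
}
```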