Low CBF Health Monitoring
=========================

Philosophy
**********

* Health means a device's ability to perform its function.
* The ``healthState`` attribute is used to report the health state of Low CBF
  Controller, Subarray, Processor and Connector devices.
* Controller and Subarray aggregate the health of other Tango devices.
* Processor and Connector report the health of their underlying hardware as
  well as their firmware/programming and control software.
* The information used by each device to evaluate its health state will be
  exposed as Tango attributes for use in GUIs or as alarms.

Purpose
*******

*Why do all SKA Tango devices have a healthState attribute?*

To pinpoint faulty pieces of the telescope.

Health State Definitions
************************

* **0 - OK**: fully functional
* **1 - Degraded**: partial (not full) function available [optional state]
* **2 - Failed**: unable to perform core function
* **3 - Unknown**: initial state / we can’t determine an answer

See also the documentation for :py:class:`ska_control_model.HealthState`.

Evaluation is based on the component itself, not the previous/next piece of
signal chain or neighbours. To use an analogy: we evaluate “our car”, not the
road, bridge, or traffic lights.

.. include:: alarms_vs_health.rst

Health Aggregation
******************

The Low CBF Controller and Subarray devices will aggregate the health status
of their underlying hardware devices. This means:

* The health of a Subarray device will be an aggregation of the health of the
  Processor and Connector devices that are participating in the subarray.
* The health of the Controller device will be an aggregation of the health of
  all Low CBF Processors and Connectors.


In principle, these aggregated health states should be:

* **OK** if the underlying hardware is sufficiently healthy to perform full
  functionality (there may be a FAILED state in some redundant piece of
  hardware)
* **DEGRADED** if there are degradations or failures in underlying hardware
  that reduce overall functionality (perhaps beyond some threshold of
  acceptable degradation)
* **FAILED** if no functionality is possible (either due to a single point of
  failure or to a combination of multiple failures)

In practice, this is very difficult to codify! There will be pathological
combinations of degraded hardware that can result in an overall failure.
Rather than aim for a perfection that we fail to achieve, we will implement
the following "close enough" algorithm:

.. code-block:: text

    for each type of hardware (Connectors, Processors):
        if number of devices OK >= number required for full function:
            device_type_health = OK
        else if all devices FAILED:
            device_type_health = FAILED
        else:
            device_type_health = DEGRADED
    aggregated health = worst case of all device_type_health values

If the **ENGINEERING_MODE_IGNORE_HEALTH** environment variable is defined and
set to ``True`` (case ignored), the Subarray will ignore the health state of
external devices (switches, processors) when its ``adminMode`` is
``ENGINEERING``. This would prevent the ``healthState`` from propagating to
``LowCbfController`` and possibly confusing the operator.

Controller Health Aggregation
-----------------------------

As mentioned above, the ``healthState`` attribute of the Low CBF Controller
will reflect the aggregated health of all Low CBF hardware devices.
The Controller device searches the Tango database for all Processor and
Connector devices when its ``adminMode`` is switched to ``ONLINE``. Any
devices not defined in the database at this time will not be detected.

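
The "close enough" algorithm above could be sketched in Python as follows.
This is a minimal illustration, not the actual Low CBF implementation: the
local ``HealthState`` enum is a stand-in for
:py:class:`ska_control_model.HealthState`, and the names
``device_type_health``, ``aggregate_health`` and ``required_ok`` are
hypothetical.

.. code-block:: python

    from enum import IntEnum


    class HealthState(IntEnum):
        """Stand-in for ska_control_model.HealthState (values as listed above)."""

        OK = 0
        DEGRADED = 1
        FAILED = 2
        UNKNOWN = 3


    def device_type_health(states, required_ok):
        """Health of one hardware type (hypothetical helper).

        ``required_ok`` is the number of OK devices needed for full function.
        """
        if sum(state == HealthState.OK for state in states) >= required_ok:
            return HealthState.OK
        # Assumption: an empty device list (with required_ok > 0) counts as
        # FAILED, because all() is True for an empty sequence.
        if all(state == HealthState.FAILED for state in states):
            return HealthState.FAILED
        # Anything else, including UNKNOWN devices, is treated as DEGRADED here.
        return HealthState.DEGRADED


    def aggregate_health(states_by_type, required_by_type):
        """Worst case over all hardware types (needs at least one type)."""
        return max(
            device_type_health(states, required_by_type[name])
            for name, states in states_by_type.items()
        )


    states = {
        "connectors": [HealthState.OK, HealthState.OK],
        "processors": [HealthState.OK, HealthState.FAILED, HealthState.OK],
    }
    required = {"connectors": 2, "processors": 3}
    print(aggregate_health(states, required).name)  # DEGRADED

Because ``device_type_health`` only ever returns OK, DEGRADED or FAILED,
taking ``max`` of the integer values gives the worst case.
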

**Beware** that the way we deploy Tango devices for development and testing
involves dynamically reconfiguring the Tango database, so there is a chance we
may not discover devices (i.e. any that are late to inject themselves into the
database). We hope that the Tango database used for the real operational
telescope will have a fixed definition, thus avoiding this risk.

The health state of all Processor & Connector devices will be available at the
Controller via Tango array attributes ``health_processors`` &
``health_connectors``.

Health of Low CBF Hardware Devices
**********************************

As well as the ``healthState`` attribute, Low CBF hardware-related devices
(Connector & Processor) will expose three health “category” attributes. These
categories are intended to help with triaging faults (e.g. a hardware fault
likely needs a technician on site, but a processing fault might be remedied by
restarting the scan). Like ``healthState``, these health category attributes
also use the :py:class:`ska_control_model.HealthState` enumeration data type.

The three attributes are:

* ``health_hardware`` to summarise the health of the hardware layer. For
  example: QSFPs, power supplies, temperatures.
* ``health_function`` to summarise the health of the 'functional' layer. This
  includes things like driver interfaces, loading firmware, or routing rules.
* ``health_process`` to summarise the health of the 'processing' layer. This
  means dynamic conditions, including other Tango connections (e.g. FPGA error
  registers, P4 switch queue overflow).

The ``healthState`` attribute will report the worst case of the three
categories.

The individual parameters that contribute to these three summary attributes
will also be exposed as individual Tango attributes. In other words, **all**
the pieces of information that are used to assess health will be available as
separate Tango attributes.

* Each parameter will be exposed as a single attribute (numeric or boolean).
  Aggregations (e.g. using JSON structures) or Tango array types **will not be
  used**, as these cannot be unpacked by an ``AlarmHandler``.
* We expose these individual attributes for the purposes of: helping
  troubleshooters pinpoint an active failure mode, displaying on user
  interfaces, and allowing individual alarms to be configured using an
  ``AlarmHandler`` if desired. Using an ``AlarmHandler``, an alarm could be
  added to a particular parameter for early warning of impending failure, or
  logic could be used to look at the same parameter across multiple devices -
  whatever is useful to operations & maintenance.
* Each attribute will be associated with its category via a naming convention:

  * ``hardware_`` for those contributing to ``health_hardware``
  * ``function_`` for those contributing to ``health_function``
  * ``process_`` for those contributing to ``health_process``

* When evaluation of an individual parameter is not possible (e.g. FPGA uptime
  cannot be evaluated when the FPGA is un-programmed), its attribute will
  report ``INVALID`` quality using the standard Tango
  :py:class:`~tango.AttrQuality` mechanism.
* Invalid individual parameters will not contribute to their health category
  summary.
* Invalid health category attributes will not contribute to the overall
  ``healthState``.

Below is an example of the health evaluation scheme in action, represented as
a table where each cell aggregates its neighbours on the right. Tango
:py:class:`~tango.AttrQuality` is shown in *italics*.


+------------------+--------------------+-----------------------------------------+
| Overall Health   | Health Category    | Individual Parameters                   |
+==================+====================+=========================================+
| ``healthState``  | ``health_hardware``| ``hardware_12v`` 11.9 *VALID*           |
|                  |                    +-----------------------------------------+
| FAILED *VALID*   | OK *VALID*         | ``hardware_12v_aux`` 12.1 *VALID*       |
|                  |                    +-----------------------------------------+
|                  |                    | ``hardware_qsfp_temperature`` *INVALID* |
|                  +--------------------+-----------------------------------------+
|                  | ``health_function``| ``function_driver_ok`` False *ALARM*    |
|                  |                    +-----------------------------------------+
|                  | FAILED *VALID*     | ``function_firmware_loaded`` *INVALID*  |
|                  |                    +-----------------------------------------+
|                  |                    | ``function_rules_valid`` True *VALID*   |
|                  +--------------------+-----------------------------------------+
|                  | ``health_process`` | ``process_overflow_error`` *INVALID*    |
|                  |                    +-----------------------------------------+
|                  | *INVALID*          | ``process_subscription_ok`` *INVALID*   |
+------------------+--------------------+-----------------------------------------+

Implementation Details
----------------------

A YAML configuration file will provide settings that control how each
attribute contributes to health state (this is an indicative concept, not a
prescriptive specification).

.. code-block:: yaml

    # top level keys are names of attributes that contribute to health assessment
    hardware_numeric_example:
      # trigger FAILED when outside the interval (-20,100)
      # i.e. "not -20 < value < 100"
      fail_limits: [-20, 100]
      # set DEGRADED when outside this interval (but not the fail interval)
      degrade_limits: [0, 50]
      # if any "limits" value is 'null' (=> None in Python), that limit does not apply
      # if we are not outside any fail/degrade limit then this parameter is OK
    hardware_boolean_example:
      # make health_hardware go to FAILED state if attribute value is True
      fail_state: true
    hardware_boolean_example_two:
      # DEGRADED health_hardware if our value is False
      degrade_state: false

Using this, we configure the ``AttributeInfoEx`` Tango structure for each
attribute, so the settings are visible to (and modifiable by) clients. Tango
will then automatically drive the ``WARNING`` and ``ALARM`` status
(:py:class:`~tango.AttrQuality`).

.. include:: attrquality_alarm_note.rst

``health_hardware`` is determined by seeing if any ``hardware_*`` attribute is
in ``WARNING``/``ALARM`` state (which implies that it's past its degrade/fail
threshold). Likewise for ``health_function`` & ``health_process``.

``healthState`` evaluation uses a very simple algorithm:

.. code-block:: python

    max(health_hardware, health_function, health_process)

Implemented health attributes
-----------------------------

As of Jul-2024 the following health related ``LowCbfProcessor`` Tango
attributes are implemented (subject to change upon review):

+--------------------+-----------------------------------------+---------------------------------------+-------------------------------------------+
| Category           | Tango Attributes                        | Attribute's ``AttrQuality`` value     | Description                               |
+====================+=========================================+=======================================+===========================================+
| ``health_hardware``| ``hardware_fpga_temperature``           |                                       | FPGA core temperature in degrees C        |
|                    +-----------------------------------------+---------------------------------------+-------------------------------------------+
|                    | ``hardware_fpga_power``                 |                                       | FPGA power consumption in Watts           |
|                    +-----------------------------------------+---------------------------------------+-------------------------------------------+
|                    | ``hardware_hbm_temperature``            |                                       | FPGA memory temperature in degrees C      |
|                    +-----------------------------------------+---------------------------------------+-------------------------------------------+
|                    | ``hardware_power_supply_12v_voltage``   |                                       | Auxiliary power supply voltage in Volts   |
|                    +-----------------------------------------+---------------------------------------+-------------------------------------------+
|                    | ``hardware_power_supply_12v_current``   |                                       | Auxiliary power supply current in Amperes |
|                    +-----------------------------------------+---------------------------------------+-------------------------------------------+
|                    | ``hardware_pcie_12v_voltage``           |                                       | PCIe bus power supply voltage in Volts    |
|                    +-----------------------------------------+---------------------------------------+-------------------------------------------+
|                    | ``hardware_pcie_12v_current``           |                                       | PCIe bus power supply current in Amperes  |
+--------------------+-----------------------------------------+---------------------------------------+-------------------------------------------+
| ``health_function``| ``function_firmware_loaded``            |                                       | FPGA firmware loaded indicator (boolean)  |
|                    +-----------------------------------------+---------------------------------------+-------------------------------------------+
|                    | ``function_driver_ok``                  | ``ATTR_INVALID`` when                 | FPGA device driver is operational         |
|                    |                                         | ``function_firmware_loaded`` == False | (boolean)                                 |
+--------------------+-----------------------------------------+---------------------------------------+-------------------------------------------+
| ``health_process`` | ``process_delay_poly_valid``            | ``ATTR_INVALID`` when subarray is not | Delay polynomials valid indicator         |
|                    |                                         | scanning                              | (boolean)                                 |
|                    +-----------------------------------------+---------------------------------------+-------------------------------------------+
|                    | ``process_delay_subscription_ok``       | ``ATTR_INVALID`` when subarray is not | Delay polynomials subscription valid      |
|                    |                                         | scanning                              | indicator (boolean)                       |
|                    +-----------------------------------------+---------------------------------------+-------------------------------------------+
|                    | ``process_spead_packets_ok``            | ``ATTR_INVALID`` when subarray is not | SPS SPEAD packets are arriving at FPGA    |
|                    |                                         | scanning                              | input (boolean)                           |
+--------------------+-----------------------------------------+---------------------------------------+-------------------------------------------+

Overrides for Testing
*********************

We think that overriding the ``healthState`` attribute and the other
attributes that contribute to health evaluation will be useful for testing
(e.g. to test Low CBF health aggregation logic, or CSP LMC health aggregation
logic). The override will be controlled by the ``testMode`` attribute (see
:py:class:`ska_control_model.TestMode`).
Any override configuration will be cleared when ``testMode`` is changed to
*NONE* (i.e. test mode off), to minimise the chance of accidentally overriding
something.

The override configuration will be set via the ``test_mode_overrides``
attribute, using a (JSON encoded) dictionary with attribute names as keys and
their desired state as values. Any attribute not listed in the overrides
dictionary will operate as normal.

Examples:

To force the ``healthState`` attribute to *FAILED*

.. code-block:: python

    {"healthState": "FAILED"}

To force the ``hardware_12v`` value to 13.8, as well as the
``health_hardware`` to *OK*

.. code-block:: python

    {"health_hardware": "OK", "hardware_12v": 13.8}

-----------

.. note:: *All content below here was written for an older health scheme and needs revision!*

.. image:: ../diagrams/health_reporting_hierarchy.png
    :alt: reporting hierarchy block diagram

Attribute Subscription
**********************

The Controller Tango device subscribes to changes in the ``healthState``
attribute of all constituent Subarrays; it uses the Tango database to retrieve
the list of Subarrays:

.. image:: ../diagrams/health_controller_subarray.png
    :alt: UML sequence diagram

Each Subarray Tango device subscribes to changes in the ``healthState``
attribute of all Connector devices and all Processors allocated to the
Subarray. The list of Connector Tango devices is retrieved from the Tango
database. The list of Processors assigned to the Subarray is reported by the
Allocator.

.. image:: ../diagrams/health_subarray_connector.png
    :alt: UML sequence diagram

Processor health
****************

If the **NO_HEALTH_ROLLUP** environment variable is defined, the Subarray will
not include the health of external devices (switches, processors) in its
roll-up. This allows for tests in which switch or processor devices are not
present.