Monitoring P4 switch

Monitoring of the P4 switch is multifaceted and can be done in several ways. In particular, we can identify the following categories:

P4 Tables configuration: representing the configuration of the various tables of the P4 switch (SPEAD routing, PSR…)
Port health monitoring: giving information about the current state of every data port including but not limited to:
- port configuration such as Speed (100G, 25G…), FEC configuration, Auto-negotiation
- port status such as Up (T/F) and Enable
- global statistics such as the number of bytes/packets received and sent through this port, and the number of errors on reception/transmission
- number of packets/bytes per type of traffic (PTP, ARP, SPEAD, SDP…)
Live traffic monitoring: the advanced telemetry gives access to precise real-time monitoring of SKA-specific traffic, which we can then expose:
- SPEAD monitoring (SPS to CBF)
- SPEAD monitoring (CBF to SDP)
- PSR monitoring (CBF to PSS/PST)
Tango Health State: follows the pre-defined SKA Tango Health State

Access to this monitoring information is not yet fully automated, but it is available through a few Tango attributes and methods.

Health State

Following the SKA Control Model, the Connector reports and publishes its own health state. This state is currently fairly simple; the switch is considered:

OK = 0, when all configured ports are Up and Enabled
Degraded = 1, when one or more configured ports are Down and/or Disabled
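The rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `HealthState` enum below defines just the two values listed here, and `compute_health_state` and the shape of the `ports` mapping are hypothetical names, not the Connector's actual code.

```python
from enum import IntEnum


class HealthState(IntEnum):
    """Only the two values described in this section."""
    OK = 0
    DEGRADED = 1


def compute_health_state(ports):
    """Derive the health state from the configured ports.

    `ports` is a hypothetical mapping of port name to a dict holding the
    boolean "Enable" and "Up" flags of that configured port.
    """
    for status in ports.values():
        # A single configured port that is down or disabled degrades the switch.
        if not (status["Enable"] and status["Up"]):
            return HealthState.DEGRADED
    return HealthState.OK
```

For instance, a switch with one configured port that is enabled but not up would be reported as `HealthState.DEGRADED`.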

Health Status

The Health Status is a Tango attribute that provides a deeper view of the current state of the various ports. In particular, the health status reports:

Port Status
- “Enable”: if the port is enabled,
- “Up”: if the port is up (i.e. synchronisation is done on the transceiver),
- “Speed”: the port speed configuration,
- “Rx”: the number of packets received on this port,
- “Tx”: the number of packets sent from this port,
- “FCS”: the number of frames with FCS error,
- “Rx_errors”: the number of packets received on this port with errors,
- “Tx_errors”: the number of packets sent from this port generating an error,
- “TX_PPS”: packets sent per second (calculated with a period of 1 second),
- “RX_PPS”: packets received per second (calculated with a period of 1 second),
- “TX_RATE”: bytes sent per second (calculated with a period of 1 second),
- “RX_RATE”: bytes received per second (calculated with a period of 1 second).
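As an illustration of how the per-second figures relate to the raw counters, the sketch below derives “TX_PPS” and “RX_PPS” from two successive readings of the “Tx” and “Rx” counters. The function name `packet_rates` and the dictionary layout are assumptions made for this example; the byte rates would be obtained the same way from byte counters, which are not named in the list above.

```python
def packet_rates(previous, current, period_s=1.0):
    """Return TX_PPS and RX_PPS from two readings of the packet counters.

    `previous` and `current` are hypothetical dicts holding the "Tx" and
    "Rx" packet counters read `period_s` seconds apart.
    """
    if period_s <= 0:
        raise ValueError("period_s must be positive")
    return {
        "TX_PPS": (current["Tx"] - previous["Tx"]) / period_s,
        "RX_PPS": (current["Rx"] - previous["Rx"]) / period_s,
    }
```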

This Health Status attribute is published every second so that EDA or any other monitoring system can display and store it.

Table Counters

For the various routing tables, we have activated the recording of counters for every configured entry. Each entry has two counters: a byte counter and a packet counter. They are incremented every time a packet that matches the entry is received on any port. Note that these counters are 32 bits wide and will therefore roll over after some time.
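Because the counters are 32 bits wide, a consumer comparing two readings has to account for roll-over. A minimal sketch, assuming the counter wraps at most once between the two readings (`counter_delta` is a hypothetical helper, not part of the device):

```python
COUNTER_BITS = 32
COUNTER_MODULUS = 1 << COUNTER_BITS  # 2**32


def counter_delta(previous, current):
    """Increase between two readings of a 32-bit counter.

    Modular subtraction gives the right answer whether or not the counter
    wrapped, provided it wrapped at most once between the two readings.
    """
    return (current - previous) % COUNTER_MODULUS
```

For scale, an entry matching a sustained 100 Gb/s (12.5 GB/s) would wrap a 32-bit byte counter in roughly 0.34 s, so the single-wrap assumption only holds if the readings are close enough together for the traffic in question.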

Currently those counters operate on a pull basis.

Advanced Telemetry

The final type of monitoring is detailed traffic monitoring for the SKA. It no longer pulls data from the ASIC routing tables; instead, it uses a push mechanism in which telemetry packets are constructed within the ASIC itself.

In particular, the advanced telemetry works as follows:

Update the register counter for the given SKA packet type (ska_a)
If the counter equals the reporting packet number:
- calculate the number of packets since the last report
- calculate the number of bytes since the last report
- prepare the telemetry header
- instruct the ASIC to generate a telemetry packet for the Operating System
- send the telemetry packet
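The steps above can be modelled in software as follows. This is a plain Python simulation for illustration only, not the P4 program running on the ASIC: the class name `TelemetryCounter`, the `report_every` parameter, and the choice to reset the counters after each report are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class TelemetryCounter:
    """Software model of the per-packet-type telemetry logic."""
    report_every: int              # hypothetical "reporting packet number"
    packet_count: int = 0          # packets seen since the last report
    byte_count: int = 0            # bytes seen since the last report
    reports: list = field(default_factory=list)

    def on_packet(self, length):
        """Account for one matching packet of `length` bytes."""
        self.packet_count += 1
        self.byte_count += length
        if self.packet_count == self.report_every:
            # Stand-in for building the telemetry header and sending the
            # telemetry packet to the operating system.
            self.reports.append(
                {"packets": self.packet_count, "bytes": self.byte_count}
            )
            # Assumption: counters restart from zero after each report.
            self.packet_count = 0
            self.byte_count = 0
```

Feeding four packets of 100, 200, 300 and 400 bytes into `TelemetryCounter(report_every=3)` produces a single report covering the first three packets, with the fourth still pending.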

In the Tango device, we leverage the eBPF framework, and in particular the BCC package, to extract the telemetry information and update the throughput figures.