Monitoring the P4 switch

The P4 switch can be monitored in several ways, which fall into the following categories:

  • P4 Tables configuration: representing the configuration of the various tables of the P4 switch (SPEAD routing, PSR…)

  • Port health monitoring: giving information about the current state of every data port including but not limited to:
    • port configuration such as Speed (100G, 25G…), FEC configuration, Auto-negotiation

    • port status such as Up (true/false) and Enable

    • global statistics such as the number of bytes/packets received and sent through this port, and the number of errors on reception/transmission

    • number of packets/bytes per type of traffic (PTP, ARP, SPEAD, SDP…)

  • Live traffic monitoring: advanced telemetry exposes precise real-time monitoring of SKA-specific traffic
    • SPEAD monitoring (SPS to CBF)

    • SPEAD monitoring (CBF to SDP)

    • PSR monitoring (CBF to PSS/PST)

  • Tango Health State, following the pre-defined SKA Tango Health State

Access to this monitoring information is not yet fully automated, but it is available through a few Tango attributes and methods.

Health State

Following the SKA Control Model, the Connector reports and publishes its own health state. This state is currently fairly simple, with the switch considered:

  • OK = 0, when all configured ports are Up and enabled

  • Degraded = 1, when one or more configured ports are Down and/or disabled
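
The rule above can be sketched in a few lines of Python. This is only an illustration: the HealthState enum, the compute_health_state function and the layout of the ports mapping are hypothetical names chosen for the example, not the device's actual code.

```python
from enum import IntEnum


class HealthState(IntEnum):
    """Only the two values described above."""
    OK = 0
    DEGRADED = 1


def compute_health_state(ports):
    """Return OK when every configured port is both up and enabled.

    `ports` is a hypothetical mapping: port name -> {"Up": bool, "Enable": bool}.
    """
    for status in ports.values():
        if not (status["Up"] and status["Enable"]):
            return HealthState.DEGRADED
    return HealthState.OK


ports = {
    "1/0": {"Up": True, "Enable": True},
    "2/0": {"Up": False, "Enable": True},
}
print(compute_health_state(ports).name)  # DEGRADED
```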

Health Status

The Health Status is a Tango attribute that provides a deeper view of the current state of the various ports. In particular, the Health Status reports:

  • Port Status
    • “Enable”: if the port is enabled,

    • “Up”: if the port is Up (i.e. synchronisation is done on the transceiver),

    • “Speed”: the port speed configuration,

    • “Rx”: the number of packets received on this port,

    • “Tx”: the number of packets sent from this port,

    • “FCS”: the number of frames with FCS error,

    • “Rx_errors”: the number of packets received on this port with errors,

    • “Tx_errors”: the number of packets sent from this port generating an error,

    • “TX_PPS”: packets sent per second (calculated with a period of 1 second),

    • “RX_PPS”: packets received per second (calculated with a period of 1 second),

    • “TX_RATE”: bytes sent per second (calculated with a period of 1 second),

    • “RX_RATE”: bytes received per second (calculated with a period of 1 second).

This Health Status attribute is published every second so that EDA or any other monitoring system can display and store it.
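
As an illustration of how the per-second packet rates relate to the cumulative “Rx” and “Tx” counters, the sketch below derives them from two consecutive snapshots. The function name and the dictionary layout are assumptions made for the example; the device may compute these values differently.

```python
def packets_per_second(previous, current, period_s=1.0):
    """Derive TX_PPS / RX_PPS from two snapshots of the cumulative packet counters."""
    return {
        "TX_PPS": (current["Tx"] - previous["Tx"]) / period_s,
        "RX_PPS": (current["Rx"] - previous["Rx"]) / period_s,
    }


# Two hypothetical snapshots taken one second apart.
previous = {"Rx": 1_000, "Tx": 4_000}
current = {"Rx": 1_250, "Tx": 4_100}
print(packets_per_second(previous, current))  # {'TX_PPS': 100.0, 'RX_PPS': 250.0}
```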

Table Counters

For each of the routing tables, we have activated the recording of counters for every configured entry. Each entry has two counters: a byte counter and a packet counter. These counters are incremented every time a packet that matches the entry is received on any port. Note that these counters are 32 bits wide and will therefore roll over after some time.

Currently those counters operate on a pull basis.
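
Because the counters are 32 bits wide and are read on a pull basis, a client has to allow for rollover between two reads. The sketch below shows one common way to do this with modular arithmetic, together with a rough estimate of how quickly a byte counter wraps. The function names are hypothetical, and the 100 Gb/s figure is only an example rate; the delta is correct only if the counter wrapped at most once between the two reads.

```python
COUNTER_BITS = 32
COUNTER_MODULUS = 1 << COUNTER_BITS  # 2**32 distinct counter values


def counter_delta(previous, current):
    """Increment between two reads, correct if the counter wrapped at most once."""
    return (current - previous) % COUNTER_MODULUS


def seconds_to_rollover(bytes_per_second):
    """Time for a byte counter starting at zero to wrap at a constant rate."""
    return COUNTER_MODULUS / bytes_per_second


# The counter wrapped between the two reads: 296 bytes up to the wrap, then 704.
print(counter_delta(4_294_967_000, 704))  # 1000

# An entry matching a sustained 100 Gb/s (12.5e9 bytes per second) of traffic.
print(round(seconds_to_rollover(100e9 / 8), 3))  # 0.344
```

At such rates the byte counter wraps several times per second, so a wrap-aware delta alone cannot recover the true byte count unless the polling period is shorter than the rollover time.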

Advanced Telemetry

The final type of monitoring relates to detailed traffic monitoring for the SKA. This monitoring no longer operates on a pull basis from the ASIC routing tables; instead it uses a push mechanism in which telemetry packets are constructed within the ASIC itself.

In particular, the advanced telemetry works as follows:

  • update the register counter for the given SKA packet type (ska_a)

  • if the counter equals the reporting packet number:
    • calculate the number of packets since the last report

    • calculate the number of bytes since the last report

    • prepare the telemetry header

    • instruct the ASIC to generate a telemetry packet to the Operating System

    • send the telemetry packet
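
The steps above can be modelled in software as follows. This is a simplified Python simulation for illustration only; the real logic runs in the ASIC. The class name, the report_every parameter (standing in for the "reporting packet number") and the choice to reset the counters after each report are assumptions, and the report dictionary stands in for the telemetry packet.

```python
from dataclasses import dataclass, field


@dataclass
class TelemetryCounter:
    """Software model of the per-traffic-type register logic (illustrative only)."""
    report_every: int  # hypothetical name for the "reporting packet number"
    packet_count: int = 0
    byte_count: int = 0
    reports: list = field(default_factory=list)

    def on_packet(self, length):
        # Update the register counters for this packet.
        self.packet_count += 1
        self.byte_count += length
        if self.packet_count == self.report_every:
            # Packets and bytes accumulated since the last report.
            self.reports.append(
                {"packets": self.packet_count, "bytes": self.byte_count}
            )
            # Start a new reporting interval.
            self.packet_count = 0
            self.byte_count = 0


counter = TelemetryCounter(report_every=3)
for length in (100, 200, 300, 400):
    counter.on_packet(length)
print(counter.reports)  # [{'packets': 3, 'bytes': 600}]
```

With this scheme the reporting rate scales with the traffic rate: one report is produced per report_every packets, regardless of how much time has elapsed.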

In the Tango device, we leverage the eBPF framework, and in particular the BCC package, to extract this information and update the throughput information.