7. Monitoring and alerts

7.1. Monitoring

Prometheus is used for monitoring the vPOD infrastructure and is presented in a Grafana instance for each vPOD. Prometheus is then federated from multiple vPODs into a Prometheus collector and Grafana instance for whole-cloud monitoring.

Prometheus exporters are installed and federated as follows:

Prometheus in the OpenStack infrastructure is enabled through Kayobe which automatically deploys a containerised HA instance of Prometheus/Grafana across the control plane hosts.

Each hypervisor has a containerised node-exporter instance providing monitoring of each hypervisor. This is also deployed through Kayobe.

Every VM instance which uses a provided OS image contains an OS-standard node-exporter.

On each vPOD:

IPU metrics are served via the IPU-Gateway interface and gathered from http://<IPU-Machine IPU-Gateway IP>:2112/metrics.

The Poplar VMs use node-exporter.

Control VMs federate Poplar VM data with IPU-Machine data together on a single Prometheus instance.

A single vpod-management VM is created which federates all vPOD Prometheus data together.

Ceph clusters (where deployed) have a built-in Prometheus exporter.

_images/prometheus.png — Fig. 7.1 Monitoring with Prometheus

7.2. Alerting

Alerts for critical and upcoming issues are configured in the federated Prometheus monitoring and can be sent to email, slack, and so on.

7.2.1. Log aggregation

Each control VM (within each vPOD) aggregates the syslog from each IPU-Machine in the vPOD.

Each IPU-Machine BMC is configured to deliver syslog using rsyslogd configuration.

The IPU-Gateway in each IPU-Machine is configured to deliver syslog using rsyslogd configuration.

The rack_tool application is used to set up both these configurations either during or after IPU-Machine software updates.

The syslogs from all hypervisors and VMs (including aggregated vPODs) are collected by a central log analysis service (for example Elastic Stack) to allow for deep-log checking, trend analysis and alerting.

Search help

7. Monitoring and alerts

7.1. Monitoring

7.2. Alerting

7.2.1. Log aggregation