Overview:

Monitoring AKS clusters is critical for maintaining the availability, performance, and operation of applications and business processes relying on Azure resources. The integration of AKS with Azure Monitor provides a robust set of tools for detailed monitoring at various levels, from platform metrics to container insights. Utilizing these tools effectively allows for a comprehensive view of the health and performance of AKS clusters, ensuring that any issues can be quickly assessed, investigated, and resolved.

This post explores the monitoring data that AKS generates and how to analyze it with Azure Monitor.

Monitoring data

AKS generates the same kinds of monitoring data as other Azure resources, as described in Monitoring data from Azure resources. See Monitoring AKS data reference for detailed information on the metrics and logs created by AKS. Other Azure services and features collect additional data and enable other analysis options, as summarized in the following table.

| Source | Description |
| --- | --- |
| Platform metrics | Platform metrics are automatically collected for AKS clusters at no cost. You can analyze these metrics with metrics explorer or use them for metric alerts. |
| Prometheus metrics | When you enable metric scraping for your cluster, Prometheus metrics are collected by Azure Monitor managed service for Prometheus and stored in an Azure Monitor workspace. Analyze them with prebuilt dashboards in Azure Managed Grafana and with Prometheus alerts. |
| Activity logs | The Activity log is collected automatically for AKS clusters at no cost. These logs track information such as when a cluster is created or has a configuration change. Send the Activity log to a Log Analytics workspace to analyze it with your other log data. |
| Resource logs | Control plane logs for AKS are implemented as resource logs. Create a diagnostic setting to send them to a Log Analytics workspace, where you can analyze and alert on them with log queries in Log Analytics. |
| Container insights | Container insights collects various logs and performance data from a cluster, including stdout/stderr streams, and stores them in a Log Analytics workspace and Azure Monitor Metrics. Analyze this data with the views and workbooks included with Container insights or with Log Analytics and metrics explorer. |

Monitoring overview page in Azure portal

The Monitoring tab on the Overview page offers a quick way to get started viewing monitoring data in the Azure portal for each AKS cluster. This includes graphs with common metrics for the cluster separated by node pool. Click on any of these graphs to further analyze the data in metrics explorer.

The Overview page also includes links to Managed Prometheus and Container insights for the current cluster. If you haven’t already enabled these tools, you are prompted to do so. You may also see a banner at the top of the screen recommending that you enable other features to improve monitoring of your cluster.

Integrations

The following Azure services and features of Azure Monitor can be used for extra monitoring of your Kubernetes clusters. You can enable these features when you create an AKS cluster, either from the Integrations tab in the Azure portal or with Azure CLI, Terraform, or Azure Policy, or you can onboard your cluster to them later. Each of these features may incur cost, so refer to the pricing information for each before you enable it.

| Service / Feature | Description |
| --- | --- |
| Container insights | Uses a containerized version of the Azure Monitor agent to collect stdout/stderr logs and Kubernetes events from each node in your cluster, supporting a variety of monitoring scenarios for AKS clusters. You can enable monitoring for an AKS cluster when it’s created by using Azure CLI, Azure Policy, the Azure portal, or Terraform. If you don’t enable Container insights when you create your cluster, see Enable Container insights for Azure Kubernetes Service (AKS) cluster for other options to enable it. Container insights stores most of its data in a Log Analytics workspace, and you’ll typically use the same Log Analytics workspace as the resource logs for your cluster. See Design a Log Analytics workspace architecture for guidance on how many workspaces you should use and where to locate them. |
| Azure Monitor managed service for Prometheus | Prometheus is a cloud-native metrics solution from the Cloud Native Computing Foundation and the most common tool used for collecting and analyzing metric data from Kubernetes clusters. Azure Monitor managed service for Prometheus is a fully managed Prometheus-compatible monitoring solution in Azure. If you don’t enable managed Prometheus when you create your cluster, see Collect Prometheus metrics from an AKS cluster for other options to enable it. Azure Monitor managed service for Prometheus stores its data in an Azure Monitor workspace, which is linked to a Grafana workspace so that you can analyze the data with Azure Managed Grafana. |
| Azure Managed Grafana | Fully managed implementation of Grafana, which is an open-source data visualization platform commonly used to present Prometheus data. Multiple predefined Grafana dashboards are available for monitoring Kubernetes and full-stack troubleshooting. If you don’t enable managed Grafana when you create your cluster, see Link a Grafana workspace for details on linking it to your Azure Monitor workspace so it can access Prometheus metrics for your cluster. |

Logs

AKS control plane/resource logs

Control plane logs for AKS clusters are implemented as resource logs in Azure Monitor. Resource logs aren’t collected and stored until you create a diagnostic setting to route them to one or more locations. You’ll typically send them to a Log Analytics workspace, which is where most of the data for Container insights is stored.

See Create diagnostic settings for the detailed process for creating a diagnostic setting using the Azure portal, CLI, or PowerShell. When you create a diagnostic setting, you specify which categories of logs to collect. The categories for AKS are listed in AKS monitoring data reference.
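To make the category selection concrete, here is a minimal Python sketch that builds the request body for a diagnostic setting sending selected control plane categories to a Log Analytics workspace. It rests on a few assumptions: the category names listed and the `properties.workspaceId` / `properties.logs` shape are written from memory of the Azure Monitor diagnostic settings REST API and should be checked against the AKS monitoring data reference, the workspace resource ID is a placeholder, and `build_diagnostic_setting` is a hypothetical helper rather than part of any Azure SDK.

```python
import json

# Assumed AKS resource log categories; verify them against the
# AKS monitoring data reference before relying on this list.
DEFAULT_CATEGORIES = (
    "kube-apiserver",
    "kube-audit-admin",
    "kube-controller-manager",
    "kube-scheduler",
    "cluster-autoscaler",
    "guard",
)


def build_diagnostic_setting(workspace_id, categories=DEFAULT_CATEGORIES):
    """Return a diagnostic setting request body as a plain dict (hypothetical helper)."""
    if not workspace_id:
        raise ValueError("workspace_id is required")
    return {
        "properties": {
            # Destination Log Analytics workspace (full resource ID).
            "workspaceId": workspace_id,
            # One entry per log category to collect.
            "logs": [{"category": name, "enabled": True} for name in categories],
        }
    }


if __name__ == "__main__":
    # Placeholder resource ID, not a real workspace.
    workspace = (
        "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
        "/providers/Microsoft.OperationalInsights/workspaces/<workspace-name>"
    )
    print(json.dumps(build_diagnostic_setting(workspace), indent=2))
```

Keeping the category list explicit makes the cost trade-off visible: audit categories in particular can produce a large log volume, so it is worth choosing them deliberately rather than enabling everything.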

AKS data plane/Container Insights logs

Container insights collects various types of telemetry data from containers and Kubernetes clusters to help you monitor, troubleshoot, and gain insights into the containerized applications running in your AKS clusters. For a list of the tables used by Container insights and their detailed descriptions, see the Azure Monitor table reference. All these tables are available for log queries.
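As an illustration of what a log query against these tables can look like, the following Python sketch assembles a KQL query string for recent container logs in one namespace. The table name `ContainerLogV2` and the columns `TimeGenerated`, `PodNamespace`, `PodName`, `ContainerName`, and `LogMessage` are assumptions based on the Container insights schema and should be confirmed in the Azure Monitor table reference; `build_container_log_query` is a hypothetical helper that only builds text and does not run the query.

```python
def build_container_log_query(namespace, hours=1, limit=50):
    """Build a KQL query string for recent container logs in one namespace."""
    if hours <= 0 or limit <= 0:
        raise ValueError("hours and limit must be positive")
    # Escape backslashes and double quotes so the value stays a valid
    # KQL string literal.
    safe_namespace = namespace.replace("\\", "\\\\").replace('"', '\\"')
    return "\n".join([
        "ContainerLogV2",  # assumed table name, see lead-in
        f"| where TimeGenerated > ago({hours}h)",
        f'| where PodNamespace == "{safe_namespace}"',
        "| project TimeGenerated, PodName, ContainerName, LogMessage",
        "| order by TimeGenerated desc",
        f"| take {limit}",
    ])


if __name__ == "__main__":
    print(build_container_log_query("kube-system", hours=2, limit=10))
```

The resulting text can be pasted into Log Analytics in the Azure portal, which is usually the quickest way to check that the table and column names match your workspace.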

Azure Managed Grafana

The most common way to analyze and present Prometheus data is with a Grafana dashboard. Azure Managed Grafana includes prebuilt dashboards for monitoring Kubernetes clusters, including several that present information similar to the Container insights views. There are also various community-created dashboards that visualize multiple aspects of a Kubernetes cluster from the metrics collected by Prometheus.

Workbooks

Azure Monitor Workbooks is a feature in Azure Monitor that provides a flexible canvas for data analysis and the creation of rich visual reports. Azure provides built-in workbooks for each service, including Azure Kubernetes Service (AKS), and the reports in Container insights are recommended out-of-the-box workbooks. To open them, select Containers on the Azure Monitor menu in the Azure portal. In the Monitoring section, select Insights, choose a particular cluster, and then select the Reports tab. You can also view them from the workbook gallery in Azure Monitor.

Alerts

Azure Monitor alerts help you detect and address issues before users notice them by proactively notifying you when data collected by Azure Monitor indicates there might be a problem with your cloud infrastructure or application. You can set alerts on metrics, logs, and the activity log. Different types of alerts have benefits and drawbacks.

Container insights uses two types of metric alert rules, based on either Prometheus metrics or platform metrics.

Prometheus metrics based alerts

When you enable collection of Prometheus metrics for your cluster, you can download a collection of recommended Prometheus alert rules.

The collection includes the following rules:

| Level | Alerts |
| --- | --- |
| Pod level | KubePodCrashLooping; Job didn’t complete in time; Pod container restarted in last 1 hour; Ready state of pods is less than 80%; Number of pods in failed state are greater than 0; KubePodNotReadyByController; KubeStatefulSetGenerationMismatch; KubeJobNotCompleted; KubeJobFailed; Average CPU usage per container is greater than 95%; Average Memory usage per container is greater than 95%; KubeletPodStartUpLatencyHigh |
| Cluster level | Average PV usage is greater than 80%; KubeDeploymentReplicasMismatch; KubeStatefulSetReplicasMismatch; KubeHpaReplicasMismatch; KubeHpaMaxedOut; KubeCPUQuotaOvercommit; KubeMemoryQuotaOvercommit; KubeVersionMismatch; KubeClientErrors; CPUThrottlingHigh; KubePersistentVolumeFillingUp; KubePersistentVolumeInodesFillingUp; KubePersistentVolumeErrors |
| Node level | Average node CPU utilization is greater than 80%; Working set memory for a node is greater than 80%; Number of OOM killed containers is greater than 0; KubeNodeUnreachable; KubeNodeNotReady; KubeNodeReadinessFlapping; KubeContainerWaiting; KubeDaemonSetNotScheduled; KubeDaemonSetMisScheduled; KubeletPlegDurationHigh; KubeletServerCertificateExpiration; KubeletClientCertificateRenewalErrors; KubeletServerCertificateRenewalErrors; KubeQuotaAlmostFull; KubeQuotaFullyUsed; KubeQuotaExceeded |
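Each of these rules is, at heart, a PromQL expression that is evaluated on a schedule. To try an expression by hand before turning it into an alert, you can send it to the standard Prometheus HTTP API instant query endpoint (`/api/v1/query`). The Python sketch below only builds the request URL. The PromQL shown is an illustrative approximation of "Pod container restarted in last 1 hour" using the kube-state-metrics metric `kube_pod_container_status_restarts_total`, not the exact expression shipped in the recommended rules, and the base URL is a placeholder; a real Azure Monitor workspace query endpoint also requires Microsoft Entra authentication, which is left out here.

```python
from urllib.parse import urlencode

# Illustrative PromQL only; the recommended rule may use a different expression.
RESTARTS_LAST_HOUR = (
    "sum by (namespace, pod) "
    "(increase(kube_pod_container_status_restarts_total[1h])) > 0"
)


def instant_query_url(base_url, promql):
    """Build a Prometheus HTTP API instant-query URL (GET /api/v1/query)."""
    # urlencode percent-encodes the expression so it is safe in a query string.
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


if __name__ == "__main__":
    # Placeholder endpoint, not a real workspace.
    print(instant_query_url("https://example.invalid", RESTARTS_LAST_HOUR))
```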

Platform metric based alerts

The following table lists the recommended metric alert rules for AKS clusters. These alerts are based on platform metrics for the cluster.

| Condition | Description |
| --- | --- |
| CPU Usage Percentage > 95 | Fires when the average CPU usage across all nodes exceeds the threshold. |
| Memory Working Set Percentage > 100 | Fires when the average working set across all nodes exceeds the threshold. |
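The condition behind both rules is the same: average a percentage across all nodes and compare it with a threshold. The short Python sketch below shows that logic under simplifying assumptions. It takes one sample per node and ignores the evaluation window and aggregation granularity that a real Azure Monitor metric alert applies, and `alert_fires` is a hypothetical helper used only for illustration.

```python
from statistics import mean

# Thresholds from the table above.
CPU_THRESHOLD = 95.0
MEMORY_WORKING_SET_THRESHOLD = 100.0


def alert_fires(node_percentages, threshold):
    """Return True when the average across nodes is strictly above the threshold."""
    if not node_percentages:
        return False  # no data points, nothing to evaluate
    return mean(node_percentages) > threshold


if __name__ == "__main__":
    cpu_by_node = [99.0, 97.0, 92.0]  # made-up sample values
    print(alert_fires(cpu_by_node, CPU_THRESHOLD))
```

Because the rule looks at the average, a single saturated node can be masked by idle ones, which is one reason to pair these cluster-wide alerts with the node level Prometheus rules.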

Network Observability

Network observability is an important part of maintaining a healthy and performant Kubernetes cluster. By collecting and analyzing data about network traffic, you can gain insights into how your cluster is operating and identify potential problems before they cause outages or performance degradation.

When the Network Observability add-on is enabled, it collects useful metrics and converts them into Prometheus format so that they can be visualized in Grafana. The collected metrics are automatically ingested into Azure Monitor managed service for Prometheus. A Grafana dashboard is available in the Grafana public dashboard repo to visualize the network observability metrics collected by Prometheus. For detailed instructions, see Network Observability setup.

Conclusion

The best practices for monitoring Kubernetes with Azure Monitor are designed to ensure that your AKS and Azure Arc-enabled Kubernetes clusters operate efficiently and securely. By adhering to the principles of the Azure Well-Architected Framework and implementing the recommended monitoring strategies, you can achieve a resilient and optimized monitoring environment that supports the reliability and security of your Kubernetes deployments.

References:

https://learn.microsoft.com/en-us/azure/aks/monitor-aks

https://learn.microsoft.com/en-us/azure/aks/monitor-aks-reference
