Last month, there was an incident that broke AKS node pools when users attempted to upgrade AKS from an older version to 1.31.1. The root cause was simple: AKS did not unset the deprecated flag during the upgrade process, which caused the new nodes to fail to register with the API server.
However, the symptom of this incident can mislead people when they debug the issue, because the following error pops up:
(UpgradeFailed) Drain node aks-agentpool-12345678-vmss000000 failed when evicting pod metrics-server-0c40b8d296-ej9un failed with Too Many Requests error. This is often caused by a restrictive Pod Disruption Budget (PDB) policy. See http://aka.ms/aks/debugdrainfailures. Original error: Cannot evict pod as it would violate the pod's disruption budget.. PDB debug info: kube-system/metrics-server-0c40b8d296-ej9un blocked by pdb metrics-server-pdb with 1 unready pods: [kube-system/metrics-server-0c40b8d296-ej9un].
Code: UpgradeFailed
Message: Drain node aks-agentpool-12345678-vmss000000 failed when evicting pod metrics-server-0c40b8d296-ej9un failed with Too Many Requests error. This is often caused by a restrictive Pod Disruption Budget (PDB) policy. See http://aka.ms/aks/debugdrainfailures. Original error: Cannot evict pod as it would violate the pod's disruption budget.. PDB debug info: kube-system/metrics-server-0c40b8d296-ej9un blocked by pdb metrics-server-pdb with 1 unready pods: [kube-system/metrics-server-0c40b8d296-ej9un].
This further misleads them into drawing the wrong conclusions. Let's take a closer look.
Simulation
In the article Convert public AKS cluster to private cluster, I wrote the following note:
You cannot combine steps 4 and 5. The metrics-server will fail due to readiness probe failures caused by failure to refresh the API server address, and the Pod will not start automatically. This further triggers PDB failure.
At the time, I realized that this behavior could be used to simulate the process of this incident, as both produce the same output from similar causes when trying to update/upgrade the AKS cluster.
Therefore, based on this behavior, we can build a reproduction:
- Initialize resources
ranNum=$(echo $RANDOM)
rG=aks-${ranNum}
aks=aks-${ranNum}
vnet=aks-${ranNum}-vnet
uami=${aks}-identity
location=southeastasia
az group create -n ${rG} -l ${location} -o none
az network vnet create -g ${rG} -n ${vnet} \
--address-prefixes 10.208.0.0/12 -o none
az network vnet subnet create -n akssubnet --no-wait \
--vnet-name ${vnet} -g ${rG} --address-prefixes 10.208.0.0/23
az network vnet subnet create -n apisubnet -o none \
--vnet-name ${vnet} -g ${rG} --address-prefixes 10.209.0.0/28
vnetId=$(az resource list -n ${vnet} -g ${rG} \
--resource-type Microsoft.Network/virtualNetworks \
--query [0].id -o tsv)
az identity create -n ${uami} -g ${rG} -o none
uamiId=$(az resource list -n ${uami} -g ${rG} \
--resource-type Microsoft.ManagedIdentity/userAssignedIdentities \
--query [0].id -o tsv)
uamiIdentityId=$(az identity show --ids $uamiId \
--query principalId -o tsv)
az role assignment create --role "Network Contributor" \
--assignee-object-id ${uamiIdentityId} -o none \
--scope ${vnetId}/subnets/akssubnet --assignee-principal-type ServicePrincipal
az role assignment create --role "Network Contributor" \
--assignee-object-id ${uamiIdentityId} -o none \
--scope ${vnetId}/subnets/apisubnet --assignee-principal-type ServicePrincipal
az aks create -n ${aks} -g ${rG} \
--nodepool-name agentpool \
--node-count 1 --no-ssh-key \
--node-vm-size Standard_A4_v2 \
--network-plugin azure -o none \
--assign-identity ${uamiId} \
--vnet-subnet-id ${vnetId}/subnets/akssubnet
- Set the node drain timeout to speed up triggering the error
az aks nodepool update -n agentpool -o none \
--cluster-name ${aks} -g ${rG} --drain-timeout 5
- Trigger error
az aks update -n ${aks} -g ${rG} -o none \
--enable-apiserver-vnet-integration --enable-private-cluster \
--apiserver-subnet-id ${vnetId}/subnets/apisubnet
NOTE: It takes about 30 minutes for the error to be triggered.
The output is as below:
(UpgradeFailed) Drain node aks-agentpool-12345678-vmss000000 failed when evicting pod metrics-server-0c40b8d296-kanfw failed with Too Many Requests error. This is often caused by a restrictive Pod Disruption Budget (PDB) policy. See http://aka.ms/aks/debugdrainfailures. Original error: Cannot evict pod as it would violate the pod's disruption budget.. PDB debug info: kube-system/metrics-server-0c40b8d296-kanfw blocked by pdb metrics-server-pdb with 2 unready pods: [kube-system/metrics-server-0c40b8d296-ej9un,kube-system/metrics-server-0c40b8d296-kanfw].
Code: UpgradeFailed
Message: Drain node aks-agentpool-12345678-vmss000000 failed when evicting pod metrics-server-0c40b8d296-kanfw failed with Too Many Requests error. This is often caused by a restrictive Pod Disruption Budget (PDB) policy. See http://aka.ms/aks/debugdrainfailures. Original error: Cannot evict pod as it would violate the pod's disruption budget.. PDB debug info: kube-system/metrics-server-0c40b8d296-kanfw blocked by pdb metrics-server-pdb with 2 unready pods: [kube-system/metrics-server-0c40b8d296-ej9un,kube-system/metrics-server-0c40b8d296-kanfw].
Analysis
Typically, we would visit the TSG link given in the error, http://aka.ms/aks/debugdrainfailures, which provides a lot of information on how to fix PDB problems. But now that I've written this, I'm sure you know it's not the right direction.
The TSG is written with the assumption that the Pod is currently running, so the problem can only be caused by an incorrect PDB configuration. That assumption does not apply to the current scenario. In this demonstration, if we access AKS and collect the metrics-server logs after the error pops up, we see the following:
2024-11-14T06:46:16.583706508Z stderr F E1114 06:46:16.583463 1 errors.go:77] Post "https://demoaks.hcp.region.azmk8s.io:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2024-11-14T06:46:43.819556951Z stderr F E1114 06:46:43.818816 1 scraper.go:140] "Failed to scrape node" err="request failed, status: \"500 Internal Server Error\"" node="aks-agentpool-12345678-vmss000000"
2024-11-14T06:53:48.611847047Z stderr F I1114 06:53:48.611401 1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
2024-11-14T06:53:58.511488254Z stderr F I1114 06:53:58.511359 1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
2024-11-14T06:54:08.425805684Z stderr F I1114 06:54:08.425661 1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
2024-11-14T06:54:18.513209659Z stderr F I1114 06:54:18.513101 1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
2024-11-14T06:54:28.421972759Z stderr F I1114 06:54:28.421722 1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
2024-11-14T06:54:38.423495511Z stderr F I1114 06:54:38.423271 1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
2024-11-14T06:54:43.822708663Z stderr F E1114 06:54:43.822551 1 scraper.go:140] "Failed to scrape node" err="request failed, status: \"401 Unauthorized\"" node="aks-agentpool-12345678-vmss000000"
2024-11-14T06:48:51.784437577Z stderr F E1114 06:48:51.784207 1 nanny_lib.go:153] Get "https://demoaks.hcp.region.azmk8s.io:443/api/v1/nodes?resourceVersion=0": dial tcp 10.209.0.4:443: i/o timeout
2024-11-14T06:52:21.829049299Z stderr F E1114 06:52:21.828887 1 nanny_lib.go:153] Get "https://demoaks.hcp.region.azmk8s.io:443/api/v1/nodes?resourceVersion=0": dial tcp 10.209.0.4:443: i/o timeout
2024-11-14T06:55:51.855258613Z stderr F E1114 06:55:51.855088 1 nanny_lib.go:153] Get "https://demoaks.hcp.region.azmk8s.io:443/api/v1/nodes?resourceVersion=0": dial tcp 10.209.0.4:443: i/o timeout
2024-11-14T06:59:21.895245729Z stderr F E1114 06:59:21.895025 1 nanny_lib.go:153] Get "https://demoaks.hcp.region.azmk8s.io:443/api/v1/nodes?resourceVersion=0": dial tcp 10.209.0.4:443: i/o timeout
2024-11-14T07:02:51.907166383Z stderr F E1114 07:02:51.907022 1 nanny_lib.go:153] Get "https://demoaks.hcp.region.azmk8s.io:443/api/v1/nodes?resourceVersion=0": dial tcp: lookup demoaks.hcp.region.azmk8s.io: i/o timeout
2024-11-14T07:05:51.966886071Z stderr F E1114 07:05:51.966724 1 nanny_lib.go:153] Unauthorized
2024-11-14T07:08:52.0017319Z stderr F E1114 07:08:52.001563 1 nanny_lib.go:153] Unauthorized
2024-11-14T07:11:52.093546879Z stderr F E1114 07:11:52.093399 1 nanny_lib.go:153] Unauthorized
2024-11-14T07:14:52.130508186Z stderr F E1114 07:14:52.130294 1 nanny_lib.go:153] Unauthorized
2024-11-14T07:17:52.20994178Z stderr F E1114 07:17:52.209743 1 nanny_lib.go:153] Unauthorized
2024-11-14T07:20:52.24349834Z stderr F E1114 07:20:52.243306 1 nanny_lib.go:153] Unauthorized
2024-11-14T07:23:52.334233745Z stderr F E1114 07:23:52.334048 1 nanny_lib.go:153] Unauthorized
2024-11-14T07:26:52.366289969Z stderr F E1114 07:26:52.366152 1 nanny_lib.go:153] Unauthorized
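These are on-disk container log files, so one way to collect them when kubectl cannot reach the API server is to read them straight from the node through VMSS run-command. The following is only a sketch; the node resource group, VMSS name, instance ID, and log path are assumptions that may differ in your environment.
# Look up the node resource group and the VMSS backing the node pool
nodeRg=$(az aks show -n ${aks} -g ${rG} --query nodeResourceGroup -o tsv)
vmss=$(az vmss list -g ${nodeRg} --query [0].name -o tsv)
# Read the metrics-server container logs directly from the first instance
az vmss run-command invoke -g ${nodeRg} -n ${vmss} --instance-id 0 \
  --command-id RunShellScript \
  --scripts "tail -n 50 /var/log/containers/metrics-server-*.log"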
Checking the events, we can see that the readiness probe is failing:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning NodeNotReady 13m node-controller Node is not ready
Warning Unhealthy 66s (x71 over 11m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 500
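For reference, the events above come from describing the unready Pod. Assuming kubectl can still reach the (now private) API server, for example from inside the VNet or through az aks command invoke, something like the following works (the Pod name is just the one from this run):
# Describe the unready metrics-server Pod to see the probe failures
kubectl describe pod -n kube-system metrics-server-0c40b8d296-kanfw
# Or list recent warning events in kube-system directly
kubectl get events -n kube-system --field-selector type=Warning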
Connecting to the node and checking the syslog, we find the following:
Nov 14 06:45:53 aks-agentpool-12345678-vmss000000 kubelet[2789]: E1114 06:45:53.935841 2789 controller.go:195] "Failed to update lease" err="Put \"https://demoaks.hcp.region.azmk8s.io:443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/aks-agentpool-12345678-vmss000000?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Nov 14 06:46:02 aks-agentpool-12345678-vmss000000 kubelet[2789]: E1114 06:46:02.430836 2789 kubelet_node_status.go:544] "Error updating node status, will retry" err="error getting node \"aks-agentpool-12345678-vmss000000\": Get \"https://demoaks.hcp.region.azmk8s.io:443/api/v1/nodes/aks-agentpool-12345678-vmss000000?resourceVersion=0&timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Nov 14 07:01:08 aks-agentpool-12345678-vmss000000 kubelet[2789]: I1114 07:01:08.450964 2789 status_manager.go:853] "Failed to get status for pod" podUID="e73f86b4-8c24-4af6-8b9d-dfe3286c5ed6" pod="kube-system/metrics-server-0c40b8d296-ej9un" err="Get \"https://demoaks.hcp.region.azmk8s.io:443/api/v1/namespaces/kube-system/pods/metrics-server-0c40b8d296-ej9un\": dial tcp 10.209.0.4:443: i/o timeout"
Nov 14 07:01:08 aks-agentpool-12345678-vmss000000 kubelet[2789]: I1114 07:01:08.784975 2789 prober.go:107] "Probe failed" probeType="Readiness" pod="kube-system/metrics-server-0c40b8d296-kanfw" podUID="0e26fe43-b4bb-45b2-a9d7-c51a57bc6737" containerName="metrics-server" probeResult="failure" output="HTTP probe failed with statuscode: 500"
Nov 14 07:01:09 aks-agentpool-12345678-vmss000000 kubelet[2789]: E1114 07:01:09.517425 2789 server.go:310] "Unable to authenticate the request due to an error" err="invalid bearer token"
Nov 14 07:01:09 aks-agentpool-12345678-vmss000000 kubelet[2789]: I1114 07:01:09.522043 2789 kubelet_node_status.go:679] "Recording event message for node" node="aks-agentpool-12345678-vmss000000" event="NodeHasSufficientMemory"
Nov 14 07:01:09 aks-agentpool-12345678-vmss000000 kubelet[2789]: I1114 07:01:09.522357 2789 kubelet_node_status.go:679] "Recording event message for node" node="aks-agentpool-12345678-vmss000000" event="NodeHasNoDiskPressure"
Nov 14 07:01:09 aks-agentpool-12345678-vmss000000 kubelet[2789]: I1114 07:01:09.522568 2789 kubelet_node_status.go:679] "Recording event message for node" node="aks-agentpool-12345678-vmss000000" event="NodeHasSufficientPID"
Nov 14 07:01:09 aks-agentpool-12345678-vmss000000 kubelet[2789]: I1114 07:01:09.522597 2789 kubelet_node_status.go:679] "Recording event message for node" node="aks-agentpool-12345678-vmss000000" event="NodeReady"
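The kubelet entries above can be pulled with the same run-command approach, reusing the placeholder variables from the earlier sketch:
# Grab recent kubelet lines from the node's syslog
az vmss run-command invoke -g ${nodeRg} -n ${vmss} --instance-id 0 \
  --command-id RunShellScript \
  --scripts "grep kubelet /var/log/syslog | tail -n 50"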
NOTE: In this demonstration, the operation ran between 2024-11-14T06:43:11Z and 2024-11-14T07:11:17Z.
Based on the above information, we can conclude the following:
- Enabling API Server VNet Integration causes the API server address to change (a quick check is sketched after this list).
- For some reason, when the feature is enabled together with private cluster mode, the API server rejects the existing nodes' registration for a long time.
- Since all metrics-server Pods are down, Kubernetes cannot evict them without violating the PDB, which is what surfaces the misleading PDB error.
- Even when the operation times out, metrics-server still cannot authenticate with the API server.
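As a quick check for the first point, comparing the cluster FQDNs before and after the operation shows the address change. This is just a sketch using the variables from the reproduction above:
# Before the operation the cluster exposes a public FQDN; afterwards a private one
az aks show -n ${aks} -g ${rG} --query fqdn -o tsv
az aks show -n ${aks} -g ${rG} --query privateFqdn -o tsv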
The mitigation methods are:
- Wait for a while after the operation fails and try to reconcile AKS.
- Reimage, shut down, or delete the old nodes from the VMSS panel; the Pods will be forced onto the new nodes, and then reconcile AKS (a rough sketch follows below).
AKS will return to the Running status after that.
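A rough sketch of the second option, again with the placeholder variables from the earlier sketches (the instance ID 0 is only an example):
# Delete the old instance so its Pods are forced onto the new nodes
az vmss delete-instances -g ${nodeRg} -n ${vmss} --instance-ids 0
# Then reconcile the cluster by re-running the update without new settings
az aks update -n ${aks} -g ${rG}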
Conclusion
This is a good example of how a PDB error can mislead an investigation: both the real incident and the simulation involved failures to register with the API server, leaving core components unable to operate.
Users are often misled by the TSG because of the error code and try to fix the issue in the wrong direction (e.g., by deleting PDB resources). This is especially true for the built-in components, which users do not even have the option to edit. A PDB error does not necessarily mean the PDB configuration is wrong; it is just as important to focus on why the Pod behind it is not working.
It is dangerous to jump to the conclusion that “a PDB error causing node drain failures is by design” or that “Allowed Disruptions does not meet the requirement”. In the simulation, at least the new node was available for use. But in the “unable to upgrade to AKS 1.31.1” incident, the new node never even came up, which means the steps in the TSG would never work. Simply saying “the root cause was the PDB” is not helpful and may steer the operator away from resolving the issue.
Epilogue
This is a side story from when I was writing the article Convert public AKS cluster to private cluster. At the time, the “unable to upgrade to AKS 1.31.1” incident was an ongoing issue. While debugging it, I found how this way of thinking can mislead an investigation, and coincidentally I found another way to present the problem while writing that article.
That is how this journal came about.