Last month, there was an incident that caused AKS node pools to break when users attempted to upgrade AKS from an older version to 1.31.1. The root cause was simple: AKS did not unset a deprecated flag during the upgrade process, so new nodes failed to register with the API server.
However, the symptom of this incident can mislead people who debug it, because the following error pops up:
(UpgradeFailed) Drain node aks-agentpool-12345678-vmss000000 failed when evicting pod metrics-server-0c40b8d296-ej9un failed with Too Many Requests error. This is often caused by a restrictive Pod Disruption Budget (PDB) policy. See http://aka.ms/aks/debugdrainfailures. Original error: Cannot evict pod as it would violate the pod's disruption budget.. PDB debug info: kube-system/metrics-server-0c40b8d296-ej9un blocked by pdb metrics-server-pdb with 1 unready pods: [kube-system/metrics-server-0c40b8d296-ej9un].
Code: UpgradeFailed
Message: Drain node aks-agentpool-12345678-vmss000000 failed when evicting pod metrics-server-0c40b8d296-ej9un failed with Too Many Requests error. This is often caused by a restrictive Pod Disruption Budget (PDB) policy. See http://aka.ms/aks/debugdrainfailures. Original error: Cannot evict pod as it would violate the pod's disruption budget.. PDB debug info: kube-system/metrics-server-0c40b8d296-ej9un blocked by pdb metrics-server-pdb with 1 unready pods: [kube-system/metrics-server-0c40b8d296-ej9un].
This further misleads them into drawing the wrong conclusions. Let's review it.
Simulation
In the article Convert public AKS cluster to private cluster, I wrote the following note:
You cannot combine steps 4 and 5. The metrics-server will fail due to readiness probe failures caused by failure to refresh the API server address, and the Pod will not start automatically. This further triggers PDB failure.
At the time, I realized that this behavior could be used to simulate the incident, as both produce the same output with similar causes when updating/upgrading the AKS cluster.
Therefore, based on this behavior, we can reproduce it:
- Initializing resources
ranNum=$(echo $RANDOM)
rG=aks-${ranNum}
aks=aks-${ranNum}
vnet=aks-${ranNum}-vnet
uami=${aks}-identity
location=southeastasia
az group create -n ${rG} -l ${location} -o none
az network vnet create -g ${rG} -n ${vnet} \
--address-prefixes 10.208.0.0/12 -o none
az network vnet subnet create -n akssubnet --no-wait \
--vnet-name ${vnet} -g ${rG} --address-prefixes 10.208.0.0/23
az network vnet subnet create -n apisubnet -o none \
--vnet-name ${vnet} -g ${rG} --address-prefixes 10.209.0.0/28
vnetId=$(az resource list -n ${vnet} -g ${rG} \
--resource-type Microsoft.Network/virtualNetworks \
--query [0].id -o tsv)
az identity create -n ${uami} -g ${rG} -o none
uamiId=$(az resource list -n ${uami} -g ${rG} \
--resource-type Microsoft.ManagedIdentity/userAssignedIdentities \
--query [0].id -o tsv)
uamiIdentityId=$(az identity show --ids $uamiId \
--query principalId -o tsv)
az role assignment create --role "Network Contributor" \
--assignee-object-id ${uamiIdentityId} -o none \
--scope ${vnetId}/subnets/akssubnet --assignee-principal-type ServicePrincipal
az role assignment create --role "Network Contributor" \
--assignee-object-id ${uamiIdentityId} -o none \
--scope ${vnetId}/subnets/apisubnet --assignee-principal-type ServicePrincipal
az aks create -n ${aks} -g ${rG} \
--nodepool-name agentpool \
--node-count 1 --no-ssh-key \
--node-vm-size Standard_A4_v2 \
--network-plugin azure -o none \
--assign-identity ${uamiId} \
--vnet-subnet-id ${vnetId}/subnets/akssubnet
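To run the kubectl commands used later in the analysis, fetch the cluster credentials as well. This step is not part of the original reproduction, and note that once the cluster has been converted to private, you will need connectivity to the private endpoint (or az aks command invoke) for kubectl to keep working:
az aks get-credentials -n ${aks} -g ${rG} --overwrite-existing
# After the conversion, kubectl can alternatively be run through the managed command channel:
az aks command invoke -n ${aks} -g ${rG} --command "kubectl get pods -n kube-system"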
- Set the node drain timeout to speed up triggering the error
az aks nodepool update -n agentpool -o none \
--cluster-name ${aks} -g ${rG} --drain-timeout 5
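Optionally, confirm the new timeout took effect; a small sketch, assuming the value is exposed under the node pool's upgradeSettings:
# Should print 5
az aks nodepool show -n agentpool --cluster-name ${aks} -g ${rG} \
  --query upgradeSettings.drainTimeoutInMinutes -o tsv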
- Trigger error
az aks update -n ${aks} -g ${rG} -o none \
--enable-apiserver-vnet-integration --enable-private-cluster \
--apiserver-subnet-id ${vnetId}/subnets/apisubnet
NOTE: It takes about 30 minutes for the error to be triggered.
The output is as below:
(UpgradeFailed) Drain node aks-agentpool-12345678-vmss000000 failed when evicting pod metrics-server-0c40b8d296-kanfw failed with Too Many Requests error. This is often caused by a restrictive Pod Disruption Budget (PDB) policy. See http://aka.ms/aks/debugdrainfailures. Original error: Cannot evict pod as it would violate the pod's disruption budget.. PDB debug info: kube-system/metrics-server-0c40b8d296-kanfw blocked by pdb metrics-server-pdb with 2 unready pods: [kube-system/metrics-server-0c40b8d296-ej9un,kube-system/metrics-server-0c40b8d296-kanfw].
Code: UpgradeFailed
Message: Drain node aks-agentpool-12345678-vmss000000 failed when evicting pod metrics-server-0c40b8d296-kanfw failed with Too Many Requests error. This is often caused by a restrictive Pod Disruption Budget (PDB) policy. See http://aka.ms/aks/debugdrainfailures. Original error: Cannot evict pod as it would violate the pod's disruption budget.. PDB debug info: kube-system/metrics-server-0c40b8d296-kanfw blocked by pdb metrics-server-pdb with 2 unready pods: [kube-system/metrics-server-0c40b8d296-ej9un,kube-system/metrics-server-0c40b8d296-kanfw].
Analysis
Typically, we would visit the given TSG link http://aka.ms/aks/debugdrainfailures, which gives a lot of information on how to fix PDB problems. But now that I have written this far, I'm sure you know that is not the right direction.
The TSG is written under the assumption that the Pod is currently running, so the problem could only be caused by an incorrect PDB configuration. That assumption does not apply to this scenario.
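Before following the TSG, it is worth checking whether the Pods behind the PDB are actually Running and Ready; in this scenario they are not, which already points away from a PDB misconfiguration. A quick check could look like the following (the k8s-app=metrics-server label selector is an assumption; adjust it to whatever labels the Deployment actually uses):
# Allowed disruptions will show 0 because the Pods backing the PDB are unready
kubectl get pdb metrics-server-pdb -n kube-system
# Unready or crash-looping Pods mean the PDB is doing its job, not that it is misconfigured
kubectl get pods -n kube-system -l k8s-app=metrics-server -o wide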
In this demonstration, if we access the cluster and collect the metrics-server logs after the error pops up, we will see the following:
2024-11-14T06:46:16.583706508Z stderr F E1114 06:46:16.583463 1 errors.go:77] Post "https://demoaks.hcp.region.azmk8s.io:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2024-11-14T06:46:43.819556951Z stderr F E1114 06:46:43.818816 1 scraper.go:140] "Failed to scrape node" err="request failed, status: \"500 Internal Server Error\"" node="aks-agentpool-12345678-vmss000000"
2024-11-14T06:53:48.611847047Z stderr F I1114 06:53:48.611401 1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
2024-11-14T06:53:58.511488254Z stderr F I1114 06:53:58.511359 1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
2024-11-14T06:54:08.425805684Z stderr F I1114 06:54:08.425661 1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
2024-11-14T06:54:18.513209659Z stderr F I1114 06:54:18.513101 1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
2024-11-14T06:54:28.421972759Z stderr F I1114 06:54:28.421722 1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
2024-11-14T06:54:38.423495511Z stderr F I1114 06:54:38.423271 1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
2024-11-14T06:54:43.822708663Z stderr F E1114 06:54:43.822551 1 scraper.go:140] "Failed to scrape node" err="request failed, status: \"401 Unauthorized\"" node="aks-agentpool-12345678-vmss000000"
2024-11-14T06:48:51.784437577Z stderr F E1114 06:48:51.784207 1 nanny_lib.go:153] Get "https://demoaks.hcp.region.azmk8s.io:443/api/v1/nodes?resourceVersion=0": dial tcp 10.209.0.4:443: i/o timeout
2024-11-14T06:52:21.829049299Z stderr F E1114 06:52:21.828887 1 nanny_lib.go:153] Get "https://demoaks.hcp.region.azmk8s.io:443/api/v1/nodes?resourceVersion=0": dial tcp 10.209.0.4:443: i/o timeout
2024-11-14T06:55:51.855258613Z stderr F E1114 06:55:51.855088 1 nanny_lib.go:153] Get "https://demoaks.hcp.region.azmk8s.io:443/api/v1/nodes?resourceVersion=0": dial tcp 10.209.0.4:443: i/o timeout
2024-11-14T06:59:21.895245729Z stderr F E1114 06:59:21.895025 1 nanny_lib.go:153] Get "https://demoaks.hcp.region.azmk8s.io:443/api/v1/nodes?resourceVersion=0": dial tcp 10.209.0.4:443: i/o timeout
2024-11-14T07:02:51.907166383Z stderr F E1114 07:02:51.907022 1 nanny_lib.go:153] Get "https://demoaks.hcp.region.azmk8s.io:443/api/v1/nodes?resourceVersion=0": dial tcp: lookup demoaks.hcp.region.azmk8s.io: i/o timeout
2024-11-14T07:05:51.966886071Z stderr F E1114 07:05:51.966724 1 nanny_lib.go:153] Unauthorized
2024-11-14T07:08:52.0017319Z stderr F E1114 07:08:52.001563 1 nanny_lib.go:153] Unauthorized
2024-11-14T07:11:52.093546879Z stderr F E1114 07:11:52.093399 1 nanny_lib.go:153] Unauthorized
2024-11-14T07:14:52.130508186Z stderr F E1114 07:14:52.130294 1 nanny_lib.go:153] Unauthorized
2024-11-14T07:17:52.20994178Z stderr F E1114 07:17:52.209743 1 nanny_lib.go:153] Unauthorized
2024-11-14T07:20:52.24349834Z stderr F E1114 07:20:52.243306 1 nanny_lib.go:153] Unauthorized
2024-11-14T07:23:52.334233745Z stderr F E1114 07:23:52.334048 1 nanny_lib.go:153] Unauthorized
2024-11-14T07:26:52.366289969Z stderr F E1114 07:26:52.366152 1 nanny_lib.go:153] Unauthorized
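For reference, the lines above are CRI-formatted container logs collected directly on the node (the nanny_lib.go entries likely come from the second container in the metrics-server Pod). If kubectl logs is unavailable because the API server cannot be reached, they can be read from the node itself; a sketch, assuming the default containerd log layout:
# Container logs live under /var/log/pods/<namespace>_<pod>_<uid>/<container>/
ls -d /var/log/pods/kube-system_metrics-server-*/*/
cat /var/log/pods/kube-system_metrics-server-*/metrics-server/*.log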
Check the Pod events, and we can see the readiness probe failing:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning NodeNotReady 13m node-controller Node is not ready
Warning Unhealthy 66s (x71 over 11m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 500
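These events can be pulled with a describe on the affected Pod (the Pod name below is taken from the earlier error message and is only a placeholder):
kubectl describe pod -n kube-system metrics-server-0c40b8d296-ej9un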
Connect to the node, check the syslog, and we find the following:
Nov 14 06:45:53 aks-agentpool-12345678-vmss000000 kubelet[2789]: E1114 06:45:53.935841 2789 controller.go:195] "Failed to update lease" err="Put \"https://demoaks.hcp.region.azmk8s.io:443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/aks-agentpool-12345678-vmss000000?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Nov 14 06:46:02 aks-agentpool-12345678-vmss000000 kubelet[2789]: E1114 06:46:02.430836 2789 kubelet_node_status.go:544] "Error updating node status, will retry" err="error getting node \"aks-agentpool-12345678-vmss000000\": Get \"https://demoaks.hcp.region.azmk8s.io:443/api/v1/nodes/aks-agentpool-12345678-vmss000000?resourceVersion=0&timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Nov 14 07:01:08 aks-agentpool-12345678-vmss000000 kubelet[2789]: I1114 07:01:08.450964 2789 status_manager.go:853] "Failed to get status for pod" podUID="e73f86b4-8c24-4af6-8b9d-dfe3286c5ed6" pod="kube-system/metrics-server-0c40b8d296-ej9un" err="Get \"https://demoaks.hcp.region.azmk8s.io:443/api/v1/namespaces/kube-system/pods/metrics-server-0c40b8d296-ej9un\": dial tcp 10.209.0.4:443: i/o timeout"
Nov 14 07:01:08 aks-agentpool-12345678-vmss000000 kubelet[2789]: I1114 07:01:08.784975 2789 prober.go:107] "Probe failed" probeType="Readiness" pod="kube-system/metrics-server-0c40b8d296-kanfw" podUID="0e26fe43-b4bb-45b2-a9d7-c51a57bc6737" containerName="metrics-server" probeResult="failure" output="HTTP probe failed with statuscode: 500"
Nov 14 07:01:09 aks-agentpool-12345678-vmss000000 kubelet[2789]: E1114 07:01:09.517425 2789 server.go:310] "Unable to authenticate the request due to an error" err="invalid bearer token"
Nov 14 07:01:09 aks-agentpool-12345678-vmss000000 kubelet[2789]: I1114 07:01:09.522043 2789 kubelet_node_status.go:679] "Recording event message for node" node="aks-agentpool-12345678-vmss000000" event="NodeHasSufficientMemory"
Nov 14 07:01:09 aks-agentpool-12345678-vmss000000 kubelet[2789]: I1114 07:01:09.522357 2789 kubelet_node_status.go:679] "Recording event message for node" node="aks-agentpool-12345678-vmss000000" event="NodeHasNoDiskPressure"
Nov 14 07:01:09 aks-agentpool-12345678-vmss000000 kubelet[2789]: I1114 07:01:09.522568 2789 kubelet_node_status.go:679] "Recording event message for node" node="aks-agentpool-12345678-vmss000000" event="NodeHasSufficientPID"
Nov 14 07:01:09 aks-agentpool-12345678-vmss000000 kubelet[2789]: I1114 07:01:09.522597 2789 kubelet_node_status.go:679] "Recording event message for node" node="aks-agentpool-12345678-vmss000000" event="NodeReady"
NOTE: In this demonstration, the operation ran between 2024-11-14T06:43:11Z and 2024-11-14T07:11:17Z.
Based on the above information, we can conclude the following:
- Enabling API Server VNet Integration causes the API server address to change.
- Somehow, when the feature is enabled together with private cluster mode, the API server rejects the current nodes' registration for a while. (See the error "Unable to authenticate the request due to an error" err="invalid bearer token".)
- Even by the time the operation times out, the metrics-server still cannot authenticate with the API server. (See the error "[xxx nanny_lib.go:153] Unauthorized".)
- Since all metrics-server Pods are down, Kubernetes cannot drain the node due to the Pod Disruption Budget (PDB) configuration, which causes AKS to fall into a failed state.
NOTE: An additional analysis was performed after publishing this journal and identified the likely root cause of this issue, as below. This issue should be fixed soon and should no longer be reproducible.
The node name, API server FQDN, and timestamps below differ somewhat from the earlier output.
On the old node, get the metrics-server service account token via /proc and use it to access the API server:
curl -X GET https://demoaks.hcp.region.azmk8s.io/api --header "Authorization: Bearer $(cat /proc/6885/root/var/run/secrets/kubernetes.io/serviceaccount/token)" --insecure
Output:
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {},
"status": "Failure",
"message": "Unauthorized",
"reason": "Unauthorized",
"code": 401
}
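In case you are wondering where the PID in /proc/6885 comes from: it is the metrics-server container process on the node. One way to look it up is with crictl; a sketch, assuming crictl is configured against containerd on the node:
# Find the metrics-server container ID, then read its PID from the runtime info
crictl ps --name metrics-server -q
crictl inspect --output go-template --template '{{.info.pid}}' <container-id>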
Let's copy this Pod to another Pod, and we should be able to get a new token for the same service account from the new Pod:
kubectl debug metrics-server-xxxx-xxxx -n kube-system -it --image=alpine --copy-to=metrics-server-debugger
Install kubectl, curl and jq:
apk add curl kubectl jq
Check the user the Pod is currently using:
kubectl auth whoami
ATTRIBUTE VALUE
Username system:serviceaccount:kube-system:metrics-server
UID f741d53b-de2c-4c4b-b8f8-78e2544361d9
Groups [system:serviceaccounts system:serviceaccounts:kube-system system:authenticated]
Extra: authentication.kubernetes.io/credential-id [JTI=040128e8-6648-4e89-988c-39d1493d9069]
Extra: authentication.kubernetes.io/node-name [aks-agentpool-12345678-vmss000001]
Extra: authentication.kubernetes.io/node-uid [6c8a75ef-e6c0-480a-a598-35b8468300b9]
Extra: authentication.kubernetes.io/pod-name [my-debugger2]
Extra: authentication.kubernetes.io/pod-uid [57e11a19-3a47-43da-810f-1758682d1e90]
Check if the new token works:
curl -X GET https://demoaks.hcp.region.azmk8s.io/api --header "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" --insecure
Output:
{
"kind": "APIVersions",
"versions": [
"v1"
],
"serverAddressByClientCIDRs": [
{
"clientCIDR": "0.0.0.0/0",
"serverAddress": "demoaks.hcp.region.azmk8s.io:443"
}
]
}
So it looks like the old tokens no longer work. Decode the token by using the following command:
cat /var/run/secrets/kubernetes.io/serviceaccount/token | jq -R 'split(".") | .[0],.[1] | @base64d | fromjson'
Compare the content of the two tokens. The old token is below:
{
"alg": "RS256",
"kid": "TXkUmbcfCg5E99ebvXNWfiLBX1tXd42v1cvU0AH-nMQ"
}
{
"aud": [
"https://demoaks.hcp.region.azmk8s.io",
"\"demoaks.hcp.region.azmk8s.io\""
],
"exp": 1767857112,
"iat": 1736321112,
"iss": "https://demoaks.hcp.region.azmk8s.io",
"jti": "87d12d0e-e6d4-4809-8446-63ec035d843d",
"kubernetes.io": {
"namespace": "kube-system",
"node": {
"name": "aks-agentpool-12345678-vmss000000",
"uid": "49986760-d5b7-4777-9334-86e8f52898d5"
},
"pod": {
"name": "metrics-server-xxxx-xxxx",
"uid": "a4e9ee55-afb4-419a-967e-9afcac23f860"
},
"serviceaccount": {
"name": "metrics-server",
"uid": "f741d53b-de2c-4c4b-b8f8-78e2544361d9"
},
"warnafter": 1736324719
},
"nbf": 1736321112,
"sub": "system:serviceaccount:kube-system:metrics-server"
}
New token:
{
"alg": "RS256",
"kid": "TXkUmbcfCg5E99ebvXNWfiLBX1tXd42v1cvU0AH-nMQ"
}
{
"aud": [
"https://demoaks.bfbc4fbf-2b58-4310-91f2-2bb1c9445f29.private.region.azmk8s.io",
"\"demoaks.bfbc4fbf-2b58-4310-91f2-2bb1c9445f29.private.region.azmk8s.io\""
],
"exp": 1767858957,
"iat": 1736322957,
"iss": "https://demoaks.bfbc4fbf-2b58-4310-91f2-2bb1c9445f29.private.region.azmk8s.io",
"jti": "d12cb715-563d-447b-bc83-b3350f22168a",
"kubernetes.io": {
"namespace": "kube-system",
"node": {
"name": "aks-agentpool-12345678-vmss000001",
"uid": "6c8a75ef-e6c0-480a-a598-35b8468300b9"
},
"pod": {
"name": "metrics-server-debugger",
"uid": "57e11a19-3a47-43da-810f-1758682d1e90"
},
"serviceaccount": {
"name": "metrics-server",
"uid": "f741d53b-de2c-4c4b-b8f8-78e2544361d9"
},
"warnafter": 1736326564
},
"nbf": 1736322957,
"sub": "system:serviceaccount:kube-system:metrics-server"
}
We see that the issuers are different.
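If you only care about the issuer claim, it can be extracted directly with the same jq approach used above; for example, for the old token on the node:
# Print only the issuer of the old token
cat /proc/6885/root/var/run/secrets/kubernetes.io/serviceaccount/token | \
  jq -R 'split(".") | .[1] | @base64d | fromjson | .iss'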
In my scenario, the old token was about 10 minutes away from expiring when we found this. So we wait for it to expire, and once the old Pod gets a new token we try to access the API server again.
curl -X GET https://demoaks.hcp.region.azmk8s.io/api --header "Authorization: Bearer $(cat /proc/6885/root/var/run/secrets/kubernetes.io/serviceaccount/token)" --insecure
Output:
{
"kind": "APIVersions",
"versions": [
"v1"
],
"serverAddressByClientCIDRs": [
{
"clientCIDR": "0.0.0.0/0",
"serverAddress": "demoaks.hcp.region.azmk8s.io:443"
}
]
}
cat /proc/6885/root/var/run/secrets/kubernetes.io/serviceaccount/token | jq -R 'split(".") | .[0],.[1] | @base64d | fromjson'
Output:
{
"alg": "RS256",
"kid": "TXkUmbcfCg5E99ebvXNWfiLBX1tXd42v1cvU0AH-nMQ"
}
{
"aud": [
"https://demoaks.bfbc4fbf-2b58-4310-91f2-2bb1c9445f29.private.region.azmk8s.io",
"\"demoaks.bfbc4fbf-2b58-4310-91f2-2bb1c9445f29.private.region.azmk8s.io\""
],
"exp": 1767862949,
"iat": 1736326949,
"iss": "https://demoaks.bfbc4fbf-2b58-4310-91f2-2bb1c9445f29.private.region.azmk8s.io",
"jti": "64d5947b-afc9-41b8-93cf-f490e997428a",
"kubernetes.io": {
"namespace": "kube-system",
"node": {
"name": "aks-agentpool-12345678-vmss000000",
"uid": "49986760-d5b7-4777-9334-86e8f52898d5"
},
"pod": {
"name": "metrics-server-xxxx-xxxx",
"uid": "a4e9ee55-afb4-419a-967e-9afcac23f860"
},
"serviceaccount": {
"name": "metrics-server",
"uid": "f741d53b-de2c-4c4b-b8f8-78e2544361d9"
},
"warnafter": 1736330556
},
"nbf": 1736326949,
"sub": "system:serviceaccount:kube-system:metrics-server"
The new token is now having new issuer.
Let’s check if all Pods are working properly:
kubectl get po -o wide -n kube-system | grep metrics-server
Output:
metrics-server-5dfc656944-4rmtg 2/2 Running 0 128m 10.208.0.18 aks-agentpool-12345678-vmss000000 <none> <none>
metrics-server-5dfc656944-nkvrh 2/2 Running 0 128m 10.208.0.14 aks-agentpool-12345678-vmss000000 <none> <none>
metrics-server-debugger 3/3 Running 0 98m 10.208.0.55 aks-agentpool-12345678-vmss000001 <none> <none>
I think we found the root cause and can draw the following conclusions:
- Because we converted a public AKS cluster to a private cluster, the issuer (which is also the API server) FQDN in the control plane was updated, and it is not compatible with tokens issued for the old FQDN.
- When we split the parameters --enable-apiserver-vnet-integration and --enable-private-cluster into two separate commands, this error is not triggered: in the first step the API server FQDN only starts pointing to the new IP that the metrics-server uses to connect, and the FQDN itself does not change until the second step is run (see the sketch below).
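For clarity, the two-step variant that avoids the error looks roughly like this, reusing the variables and flags from the reproduction above (a sketch, not the exact commands from the original article):
# Step 1: enable API Server VNet Integration only; the FQDN stays the same, only the backing address changes
az aks update -n ${aks} -g ${rG} -o none \
  --enable-apiserver-vnet-integration \
  --apiserver-subnet-id ${vnetId}/subnets/apisubnet
# Step 2: convert to a private cluster afterwards; this is when the FQDN (and thus the token issuer) changes
az aks update -n ${aks} -g ${rG} -o none --enable-private-cluster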
The mitigation methods are:
- Wait for a while after the operation fails and then try to reconcile AKS, as the old Pods will automatically get new tokens issued by the new issuer.
- Reimage, shut down, or delete the old nodes from the VMSS panel so that the Pods are forced onto the new nodes, and then reconcile AKS (see the CLI sketch below).
AKS will return to the Running status afterwards.
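Both mitigations can also be driven from the CLI; a rough sketch (the node resource group, VMSS name, and instance ID are placeholders, flags may vary by CLI version, and the reconcile here is simply an az aks update run without changing any settings):
# Option 1: wait for the old tokens to be re-issued by the new issuer, then reconcile the cluster
az aks update -n ${aks} -g ${rG}
# Option 2: reimage the old instance so its Pods are rescheduled onto the new nodes, then reconcile
az vmss reimage -g <node-resource-group> -n <vmss-name> --instance-ids <instance-id>
az aks update -n ${aks} -g ${rG}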
Conclusion
This is a good example of how a PDB error can mislead an investigation: both incidents involved failures to register or authenticate with the API server, which left core components unable to operate.
Often, users are misled by the TSG because of the error code and try to fix the issue in the wrong direction (e.g., trying to delete PDB resources). This is especially true for built-in components, where users do not even have the option to edit the PDB. A PDB error does not necessarily mean the problem is a wrong PDB configuration; it is just as important to focus on the reason why the Pods behind it are not working.
It is dangerous to jump to conclusions such as "PDB errors causing node drain failures are by design" or "Allowed Disruptions does not meet the requirement". In the simulation, at least the new node was available for use. But in the "unable to upgrade to AKS 1.31.1" incident, the new node did not even come up, which means that what the TSG suggests would never work. Simply saying "the root cause was PDB" does not help and may drive the operator away from resolving the issue.
Epilogue
This is a side story from when I was writing the article Convert public AKS cluster to private cluster. At the time, the "unable to upgrade to AKS 1.31.1" incident was an ongoing issue. During debugging, I found that this way of thinking can mislead the investigation, and coincidentally, I found another way to present the problem while writing the article.
That is how this journal came about.
Special Thanks to
Ping He: For providing further insights on root cause analysis.