How a Pod Disruption Budget (PDB) error can mislead an investigation

Last month, there was an incident that broke AKS node pools when users attempted to upgrade AKS from an older version to 1.31.1. The root cause was simple: AKS did not unset the deprecated flag during the upgrade process, which caused new nodes to fail to register with the API server.
However, the symptom of this incident can mislead people when debugging, because the following error pops up:

(UpgradeFailed) Drain node aks-agentpool-12345678-vmss000000 failed when evicting pod metrics-server-0c40b8d296-ej9un failed with Too Many Requests error. This is often caused by a restrictive Pod Disruption Budget (PDB) policy. See http://aka.ms/aks/debugdrainfailures. Original error: Cannot evict pod as it would violate the pod's disruption budget.. PDB debug info: kube-system/metrics-server-0c40b8d296-ej9un blocked by pdb metrics-server-pdb with 1 unready pods: [kube-system/metrics-server-0c40b8d296-ej9un].
 
Code: UpgradeFailed
Message: Drain node aks-agentpool-12345678-vmss000000 failed when evicting pod metrics-server-0c40b8d296-ej9un failed with Too Many Requests error. This is often caused by a restrictive Pod Disruption Budget (PDB) policy. See http://aka.ms/aks/debugdrainfailures. Original error: Cannot evict pod as it would violate the pod's disruption budget.. PDB debug info: kube-system/metrics-server-0c40b8d296-ej9un blocked by pdb metrics-server-pdb with 1 unready pods: [kube-system/metrics-server-0c40b8d296-ej9un].

This further misleads them into drawing the wrong conclusions. Let’s review it.

Simulation#

In the article Convert public AKS cluster to private cluster, I wrote the following note:

You cannot combine steps 4 and 5. The metrics-server will fail its readiness probe because it cannot refresh the API server address, and the Pod will not start automatically. This in turn triggers a PDB failure.

At the time, I realized that this behavior could be used to simulate the incident, as both produce the same output with similar causes when updating/upgrading the AKS cluster.

Therefore, based on this behavior, we can build a reproduction:

  1. Initialize resources
ranNum=$(echo $RANDOM)
rG=aks-${ranNum}
aks=aks-${ranNum}
vnet=aks-${ranNum}-vnet
uami=${aks}-identity
location=southeastasia

az group create -n ${rG} -l ${location} -o none

az network vnet create -g ${rG} -n ${vnet} \
--address-prefixes 10.208.0.0/12 -o none
az network vnet subnet create -n akssubnet --no-wait \
--vnet-name ${vnet} -g ${rG} --address-prefixes 10.208.0.0/23
az network vnet subnet create -n apisubnet -o none \
--vnet-name ${vnet} -g ${rG} --address-prefixes 10.209.0.0/28
vnetId=$(az resource list -n ${vnet} -g ${rG} \
--resource-type Microsoft.Network/virtualNetworks \
--query [0].id -o tsv)

az identity create -n ${uami} -g ${rG} -o none
uamiId=$(az resource list -n ${uami} -g ${rG} \
--resource-type Microsoft.ManagedIdentity/userAssignedIdentities \
--query [0].id -o tsv)
uamiIdentityId=$(az identity show --ids $uamiId \
--query principalId -o tsv)

az role assignment create --role "Network Contributor" \
--assignee-object-id ${uamiIdentityId} -o none \
--scope ${vnetId}/subnets/akssubnet --assignee-principal-type ServicePrincipal
az role assignment create --role "Network Contributor" \
--assignee-object-id ${uamiIdentityId} -o none \
--scope ${vnetId}/subnets/apisubnet --assignee-principal-type ServicePrincipal

az aks create -n ${aks} -g ${rG} \
--nodepool-name agentpool \
--node-count 1 --no-ssh-key \
--node-vm-size Standard_A4_v2 \
--network-plugin azure -o none \
--assign-identity ${uamiId} \
--vnet-subnet-id ${vnetId}/subnets/akssubnet
  2. Set the node drain timeout value to speed up triggering the error
az aks nodepool update -n agentpool -o none \
--cluster-name ${aks} -g ${rG} --drain-timeout 5
  3. Trigger the error
az aks update -n ${aks} -g ${rG} -o none \
--enable-apiserver-vnet-integration --enable-private-cluster \
--apiserver-subnet-id ${vnetId}/subnets/apisubnet
NOTE

It takes about 30 minutes for the error to be triggered.
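
While waiting, one way to keep an eye on the operation is to poll the cluster's provisioning state (a minimal sketch using the variables defined above; not part of the original repro steps):

# Shows "Updating" while the operation runs and "Failed" once the error triggers.
az aks show -n ${aks} -g ${rG} --query provisioningState -o tsv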

The output is as below:

(UpgradeFailed) Drain node aks-agentpool-12345678-vmss000000 failed when evicting pod metrics-server-0c40b8d296-kanfw failed with Too Many Requests error. This is often caused by a restrictive Pod Disruption Budget (PDB) policy. See http://aka.ms/aks/debugdrainfailures. Original error: Cannot evict pod as it would violate the pod's disruption budget.. PDB debug info: kube-system/metrics-server-0c40b8d296-kanfw blocked by pdb metrics-server-pdb with 2 unready pods: [kube-system/metrics-server-0c40b8d296-ej9un,kube-system/metrics-server-0c40b8d296-kanfw].
 
Code: UpgradeFailed
Message: Drain node aks-agentpool-12345678-vmss000000 failed when evicting pod metrics-server-0c40b8d296-kanfw failed with Too Many Requests error. This is often caused by a restrictive Pod Disruption Budget (PDB) policy. See http://aka.ms/aks/debugdrainfailures. Original error: Cannot evict pod as it would violate the pod's disruption budget.. PDB debug info: kube-system/metrics-server-0c40b8d296-kanfw blocked by pdb metrics-server-pdb with 2 unready pods: [kube-system/metrics-server-0c40b8d296-ej9un,kube-system/metrics-server-0c40b8d296-kanfw].

Analysis#

Typically, we would visit the troubleshooting guide (TSG) link http://aka.ms/aks/debugdrainfailures, which offers plenty of guidance on fixing PDB problems. But since I am writing this post, you can already guess that it is not the right direction.
The TSG is written on the assumption that the Pod is currently running, so the problem could only be caused by an incorrect PDB configuration. That assumption does not hold in this scenario.
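
Before following the TSG, a quick sanity check of whether the protected Pods are actually Ready usually points to the real direction (a sketch; it assumes you can still reach the API server, and the k8s-app=metrics-server label selector is an assumption):

# Fetch credentials, then inspect the PDB and the Pods it protects.
az aks get-credentials -n ${aks} -g ${rG}
kubectl get pdb metrics-server-pdb -n kube-system
kubectl get pods -n kube-system -l k8s-app=metrics-server -o wide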

In this demonstration, if we access AKS and collect the logs after the error pops up, we will see the following:

2024-11-14T06:46:16.583706508Z stderr F E1114 06:46:16.583463       1 errors.go:77] Post "https://demoaks.hcp.region.azmk8s.io:443/apis/authorization.k8s.io/v1/subjectaccessreviews?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2024-11-14T06:46:43.819556951Z stderr F E1114 06:46:43.818816       1 scraper.go:140] "Failed to scrape node" err="request failed, status: \"500 Internal Server Error\"" node="aks-agentpool-12345678-vmss000000"
2024-11-14T06:53:48.611847047Z stderr F I1114 06:53:48.611401       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
2024-11-14T06:53:58.511488254Z stderr F I1114 06:53:58.511359       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
2024-11-14T06:54:08.425805684Z stderr F I1114 06:54:08.425661       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
2024-11-14T06:54:18.513209659Z stderr F I1114 06:54:18.513101       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
2024-11-14T06:54:28.421972759Z stderr F I1114 06:54:28.421722       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
2024-11-14T06:54:38.423495511Z stderr F I1114 06:54:38.423271       1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
2024-11-14T06:54:43.822708663Z stderr F E1114 06:54:43.822551       1 scraper.go:140] "Failed to scrape node" err="request failed, status: \"401 Unauthorized\"" node="aks-agentpool-12345678-vmss000000"
2024-11-14T06:48:51.784437577Z stderr F E1114 06:48:51.784207       1 nanny_lib.go:153] Get "https://demoaks.hcp.region.azmk8s.io:443/api/v1/nodes?resourceVersion=0": dial tcp 10.209.0.4:443: i/o timeout
2024-11-14T06:52:21.829049299Z stderr F E1114 06:52:21.828887       1 nanny_lib.go:153] Get "https://demoaks.hcp.region.azmk8s.io:443/api/v1/nodes?resourceVersion=0": dial tcp 10.209.0.4:443: i/o timeout
2024-11-14T06:55:51.855258613Z stderr F E1114 06:55:51.855088       1 nanny_lib.go:153] Get "https://demoaks.hcp.region.azmk8s.io:443/api/v1/nodes?resourceVersion=0": dial tcp 10.209.0.4:443: i/o timeout
2024-11-14T06:59:21.895245729Z stderr F E1114 06:59:21.895025       1 nanny_lib.go:153] Get "https://demoaks.hcp.region.azmk8s.io:443/api/v1/nodes?resourceVersion=0": dial tcp 10.209.0.4:443: i/o timeout
2024-11-14T07:02:51.907166383Z stderr F E1114 07:02:51.907022       1 nanny_lib.go:153] Get "https://demoaks.hcp.region.azmk8s.io:443/api/v1/nodes?resourceVersion=0": dial tcp: lookup demoaks.hcp.region.azmk8s.io: i/o timeout
2024-11-14T07:05:51.966886071Z stderr F E1114 07:05:51.966724       1 nanny_lib.go:153] Unauthorized
2024-11-14T07:08:52.0017319Z stderr F E1114 07:08:52.001563       1 nanny_lib.go:153] Unauthorized
2024-11-14T07:11:52.093546879Z stderr F E1114 07:11:52.093399       1 nanny_lib.go:153] Unauthorized
2024-11-14T07:14:52.130508186Z stderr F E1114 07:14:52.130294       1 nanny_lib.go:153] Unauthorized
2024-11-14T07:17:52.20994178Z stderr F E1114 07:17:52.209743       1 nanny_lib.go:153] Unauthorized
2024-11-14T07:20:52.24349834Z stderr F E1114 07:20:52.243306       1 nanny_lib.go:153] Unauthorized
2024-11-14T07:23:52.334233745Z stderr F E1114 07:23:52.334048       1 nanny_lib.go:153] Unauthorized
2024-11-14T07:26:52.366289969Z stderr F E1114 07:26:52.366152       1 nanny_lib.go:153] Unauthorized
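
For reference, these container logs can also be pulled with kubectl once the API server is reachable again (a sketch; the label selector is an assumption):

# --all-containers covers the metrics-server container and its sidecar;
# --tail=-1 removes the default line limit applied when a selector is used.
kubectl logs -n kube-system -l k8s-app=metrics-server --all-containers --timestamps --tail=-1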

Check the events, and we can see that the readiness probe failed.

Events:  
Type     Reason        Age                 From             Message  
----     ------        ----                ----             -------  
Warning  NodeNotReady  13m                 node-controller  Node is not ready  
Warning  Unhealthy     66s (x71 over 11m)  kubelet          Readiness probe failed: HTTP probe failed with statuscode: 500
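
One way to view these events is to describe the affected Pod (a sketch; the Pod name is taken from the error message above):

kubectl describe pod metrics-server-0c40b8d296-ej9un -n kube-system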

Connect to the node (one way to do this is sketched after the log excerpt), get the syslog, and find the following:

Nov 14 06:45:53 aks-agentpool-12345678-vmss000000 kubelet[2789]: E1114 06:45:53.935841    2789 controller.go:195] "Failed to update lease" err="Put \"https://demoaks.hcp.region.azmk8s.io:443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/aks-agentpool-12345678-vmss000000?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Nov 14 06:46:02 aks-agentpool-12345678-vmss000000 kubelet[2789]: E1114 06:46:02.430836    2789 kubelet_node_status.go:544] "Error updating node status, will retry" err="error getting node \"aks-agentpool-12345678-vmss000000\": Get \"https://demoaks.hcp.region.azmk8s.io:443/api/v1/nodes/aks-agentpool-12345678-vmss000000?resourceVersion=0&timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Nov 14 07:01:08 aks-agentpool-12345678-vmss000000 kubelet[2789]: I1114 07:01:08.450964    2789 status_manager.go:853] "Failed to get status for pod" podUID="e73f86b4-8c24-4af6-8b9d-dfe3286c5ed6" pod="kube-system/metrics-server-0c40b8d296-ej9un" err="Get \"https://demoaks.hcp.region.azmk8s.io:443/api/v1/namespaces/kube-system/pods/metrics-server-0c40b8d296-ej9un\": dial tcp 10.209.0.4:443: i/o timeout"
Nov 14 07:01:08 aks-agentpool-12345678-vmss000000 kubelet[2789]: I1114 07:01:08.784975    2789 prober.go:107] "Probe failed" probeType="Readiness" pod="kube-system/metrics-server-0c40b8d296-kanfw" podUID="0e26fe43-b4bb-45b2-a9d7-c51a57bc6737" containerName="metrics-server" probeResult="failure" output="HTTP probe failed with statuscode: 500"
Nov 14 07:01:09 aks-agentpool-12345678-vmss000000 kubelet[2789]: E1114 07:01:09.517425    2789 server.go:310] "Unable to authenticate the request due to an error" err="invalid bearer token"
Nov 14 07:01:09 aks-agentpool-12345678-vmss000000 kubelet[2789]: I1114 07:01:09.522043    2789 kubelet_node_status.go:679] "Recording event message for node" node="aks-agentpool-12345678-vmss000000" event="NodeHasSufficientMemory"
Nov 14 07:01:09 aks-agentpool-12345678-vmss000000 kubelet[2789]: I1114 07:01:09.522357    2789 kubelet_node_status.go:679] "Recording event message for node" node="aks-agentpool-12345678-vmss000000" event="NodeHasNoDiskPressure"
Nov 14 07:01:09 aks-agentpool-12345678-vmss000000 kubelet[2789]: I1114 07:01:09.522568    2789 kubelet_node_status.go:679] "Recording event message for node" node="aks-agentpool-12345678-vmss000000" event="NodeHasSufficientPID"
Nov 14 07:01:09 aks-agentpool-12345678-vmss000000 kubelet[2789]: I1114 07:01:09.522597    2789 kubelet_node_status.go:679] "Recording event message for node" node="aks-agentpool-12345678-vmss000000" event="NodeReady"
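
One way to connect to the node and read the syslog (a sketch; kubectl debug needs a working path through the API server, so az vmss run-command is an alternative, and the image choice is arbitrary):

# Start a debug Pod on the node, then chroot into the host filesystem.
kubectl debug node/aks-agentpool-12345678-vmss000000 -it --image=ubuntu
chroot /host
grep kubelet /var/log/syslog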
NOTE

In this demonstration, the operation ran between 2024-11-14T06:43:11Z and 2024-11-14T07:11:17Z.

Based on the above information, we can conclude the following:

  • Enabling API Server VNet Integration changes the API server address.
  • Somehow, when the feature is enabled together with private cluster mode, the API server rejects the existing nodes' requests for a while (refer to the error “Unable to authenticate the request due to an error” err=“invalid bearer token”).
  • Even by the time the operation times out, the metrics-server still cannot authenticate with the API server (refer to the error “[xxx nanny_lib.go:153] Unauthorized”).
  • Since all metrics-server Pods are down, Kubernetes cannot drain the node because of the Pod Disruption Budget (PDB) configuration, which causes AKS to fall into a failed state.
NOTE

Additional analysis was performed after this journal was published, and it identified the likely root cause of this issue, as described below. The issue should be fixed soon and will then no longer be reproducible.
The node names, API server FQDN, and timestamps below differ slightly from the earlier output.

Get the metrics-server service account token (read through /proc/<pid>/root/ on the node, where the PID belongs to the metrics-server container) and use it to access the API server:

curl -X GET https://demoaks.hcp.region.azmk8s.io/api --header "Authorization: Bearer $(cat /proc/6885/root/var/run/secrets/kubernetes.io/serviceaccount/token)" --insecure

Output:

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "Unauthorized",
  "reason": "Unauthorized",
  "code": 401
}

Let’s copy this Pod to a new Pod, and we should be able to get a new token for the same service account from the new Pod:

kubectl debug metrics-server-xxxx-xxxx -n kube-system -it --image=alpine --copy-to=metrics-server-debugger

Install kubectl, curl and jq:

apk add curl kubectl jq

Check which user the Pod is currently using:

kubectl auth whoami
ATTRIBUTE                                           VALUE
Username                                            system:serviceaccount:kube-system:metrics-server
UID                                                 f741d53b-de2c-4c4b-b8f8-78e2544361d9
Groups                                              [system:serviceaccounts system:serviceaccounts:kube-system system:authenticated]
Extra: authentication.kubernetes.io/credential-id   [JTI=040128e8-6648-4e89-988c-39d1493d9069]
Extra: authentication.kubernetes.io/node-name       [aks-agentpool-12345678-vmss000001]
Extra: authentication.kubernetes.io/node-uid        [6c8a75ef-e6c0-480a-a598-35b8468300b9]
Extra: authentication.kubernetes.io/pod-name        [my-debugger2]
Extra: authentication.kubernetes.io/pod-uid         [57e11a19-3a47-43da-810f-1758682d1e90]

Check if the new token works:

curl -X GET https://demoaks.hcp.region.azmk8s.io/api --header "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" --insecure

Output:

{
  "kind": "APIVersions",
  "versions": [
    "v1"
  ],
  "serverAddressByClientCIDRs": [
    {
      "clientCIDR": "0.0.0.0/0",
      "serverAddress": "demoaks.hcp.region.azmk8s.io:443"
    }
  ]
}

So it looks like the old tokens no longer work. The token is a JWT (header.payload.signature), so decode its header and payload with the following command:

cat /var/run/secrets/kubernetes.io/serviceaccount/token | jq -R 'split(".") | .[0],.[1] | @base64d | fromjson'

Compare the contents of the two tokens. The old token is below:

{
  "alg": "RS256",
  "kid": "TXkUmbcfCg5E99ebvXNWfiLBX1tXd42v1cvU0AH-nMQ"
}
{
  "aud": [
    "https://demoaks.hcp.region.azmk8s.io",
    "\"demoaks.hcp.region.azmk8s.io\""
  ],
  "exp": 1767857112,
  "iat": 1736321112,
  "iss": "https://demoaks.hcp.region.azmk8s.io",
  "jti": "87d12d0e-e6d4-4809-8446-63ec035d843d",
  "kubernetes.io": {
    "namespace": "kube-system",
    "node": {
      "name": "aks-agentpool-12345678-vmss000000",
      "uid": "49986760-d5b7-4777-9334-86e8f52898d5"
    },
    "pod": {
      "name": "metrics-server-xxxx-xxxx",
      "uid": "a4e9ee55-afb4-419a-967e-9afcac23f860"
    },
    "serviceaccount": {
      "name": "metrics-server",
      "uid": "f741d53b-de2c-4c4b-b8f8-78e2544361d9"
    },
    "warnafter": 1736324719
  },
  "nbf": 1736321112,
  "sub": "system:serviceaccount:kube-system:metrics-server"
}

New token:

{
  "alg": "RS256",
  "kid": "TXkUmbcfCg5E99ebvXNWfiLBX1tXd42v1cvU0AH-nMQ"
}
{
  "aud": [
    "https://demoaks.bfbc4fbf-2b58-4310-91f2-2bb1c9445f29.private.region.azmk8s.io",
    "\"demoaks.bfbc4fbf-2b58-4310-91f2-2bb1c9445f29.private.region.azmk8s.io\""
  ],
  "exp": 1767858957,
  "iat": 1736322957,
  "iss": "https://demoaks.bfbc4fbf-2b58-4310-91f2-2bb1c9445f29.private.region.azmk8s.io",
  "jti": "d12cb715-563d-447b-bc83-b3350f22168a",
  "kubernetes.io": {
    "namespace": "kube-system",
    "node": {
      "name": "aks-agentpool-12345678-vmss000001",
      "uid": "6c8a75ef-e6c0-480a-a598-35b8468300b9"
    },
    "pod": {
      "name": "metrics-server-debugger",
      "uid": "57e11a19-3a47-43da-810f-1758682d1e90"
    },
    "serviceaccount": {
      "name": "metrics-server",
      "uid": "f741d53b-de2c-4c4b-b8f8-78e2544361d9"
    },
    "warnafter": 1736326564
  },
  "nbf": 1736322957,
  "sub": "system:serviceaccount:kube-system:metrics-server"
}

We see that the issuers are different.
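
To compare just the issuers side by side, the same jq trick can be reused (a sketch; run the first command inside the debug Pod and the second on the node):

# Issuer of the new Pod's token (inside the debug Pod):
cat /var/run/secrets/kubernetes.io/serviceaccount/token | jq -R 'split(".")[1] | @base64d | fromjson | .iss'
# Issuer of the old Pod's token (on the node):
cat /proc/6885/root/var/run/secrets/kubernetes.io/serviceaccount/token | jq -R 'split(".")[1] | @base64d | fromjson | .iss'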

In my scenario, the old token was due to expire about 10 minutes after we found this. So we wait for it to expire, and once the old Pod gets a new token, try to access the server again.

curl -X GET https://demoaks.hcp.region.azmk8s.io/api --header "Authorization: Bearer $(cat /proc/6885/root/var/run/secrets/kubernetes.io/serviceaccount/token)" --insecure   

Output:

{
  "kind": "APIVersions",
  "versions": [
    "v1"
  ],
  "serverAddressByClientCIDRs": [
    {
      "clientCIDR": "0.0.0.0/0",
      "serverAddress": "demoaks.hcp.region.azmk8s.io:443"
    }
  ]
}
Decode the refreshed token from the old Pod:

cat /proc/6885/root/var/run/secrets/kubernetes.io/serviceaccount/token | jq -R 'split(".") | .[0],.[1] | @base64d | fromjson'

Output:

{
  "alg": "RS256",
  "kid": "TXkUmbcfCg5E99ebvXNWfiLBX1tXd42v1cvU0AH-nMQ"
}
{
  "aud": [
    "https://demoaks.bfbc4fbf-2b58-4310-91f2-2bb1c9445f29.private.region.azmk8s.io",
    "\"demoaks.bfbc4fbf-2b58-4310-91f2-2bb1c9445f29.private.region.azmk8s.io\""
  ],
  "exp": 1767862949,
  "iat": 1736326949,
  "iss": "https://demoaks.bfbc4fbf-2b58-4310-91f2-2bb1c9445f29.private.region.azmk8s.io",
  "jti": "64d5947b-afc9-41b8-93cf-f490e997428a",
  "kubernetes.io": {
    "namespace": "kube-system",
    "node": {
      "name": "aks-agentpool-12345678-vmss000000",
      "uid": "49986760-d5b7-4777-9334-86e8f52898d5"
    },
    "pod": {
      "name": "metrics-server-xxxx-xxxx",
      "uid": "a4e9ee55-afb4-419a-967e-9afcac23f860"
    },
    "serviceaccount": {
      "name": "metrics-server",
      "uid": "f741d53b-de2c-4c4b-b8f8-78e2544361d9"
    },
    "warnafter": 1736330556
  },
  "nbf": 1736326949,
  "sub": "system:serviceaccount:kube-system:metrics-server"

The new token is now having new issuer.

Let’s check if all Pods are working properly:

kubectl get po -o wide -n kube-system | grep metrics-server

Output:

metrics-server-5dfc656944-4rmtg      2/2     Running   0             128m   10.208.0.18   aks-agentpool-12345678-vmss000000   <none>           <none>
metrics-server-5dfc656944-nkvrh      2/2     Running   0             128m   10.208.0.14   aks-agentpool-12345678-vmss000000   <none>           <none>
metrics-server-debugger              3/3     Running   0             98m    10.208.0.55   aks-agentpool-12345678-vmss000001   <none>           <none>
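
As an extra check that the metrics pipeline has recovered, a quick sketch:

kubectl top nodes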

I think we have found the root cause and can draw the following conclusions:

  • Because we converted a public AKS cluster to a private cluster, the issuer (which is also the API server) FQDN in the control plane was updated, and tokens issued against the old FQDN were no longer compatible with it.
  • When we split --enable-apiserver-vnet-integration and --enable-private-cluster into two separate commands, this error is not triggered: the first step only changes the IP that the metrics-server uses to reach the API server, and the FQDN does not change until the second step runs.

The mitigation methods are:

  • Wait for a while after the operation fails and then try to reconcile AKS; the old Pods will automatically get new tokens issued by the new issuer.
  • Reimage, shut down, or delete the old nodes from the VMSS panel so the Pods are forced onto the new nodes, and then reconcile AKS.

The AKS cluster will return to the Running status afterwards.
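
For reference, a reconcile can typically be triggered by re-running the update operation without new flags (a hedged sketch; exact behavior may vary between az CLI versions):

# Reconcile the cluster back to its goal state.
az aks update -n ${aks} -g ${rG}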

Conclusion#

This is a good example of how a PDB error can mislead an investigation, as both cases involved failures to register or authenticate with the API server, leaving core components unable to operate.

Users are often misled by the TSG because of the error code and try to fix the issue in the wrong direction (e.g. by deleting PDB resources). When it comes to built-in components, users do not even have the option to edit them. A PDB error does not necessarily mean that the PDB configuration is wrong; it is just as important to focus on why the protected Pods are not working.

It is dangerous to jump to conclusions such as “a PDB error causing node drain failures is by design” or “Allowed Disruptions does not meet the requirement”. In the simulation, at least the new node was available for use. But in the “unable to upgrade to AKS 1.31.1” incident, the new node never even came up - which means the steps in the TSG could never work. Simply saying “the root cause was the PDB” does not help and may drive the operator away from actually resolving the issue.

Epilogue#

This is a side story from when I was writing the article Convert public AKS cluster to private cluster. At the time, the “unable to upgrade to AKS 1.31.1” incident was an ongoing issue. While debugging, I found how this line of thinking can mislead the investigation, and coincidentally I found another way to demonstrate the problem while writing that article.

This is how this journal came about.

Special Thanks to#

Ping He: For providing further insights on root cause analysis.
