
Kubernetes (K3S) pod gets “ENOTFOUND” after 5-20 hours of uptime

I’m running my backend on Kubernetes, on around 250 pods across 15 deployments. The backend is written in Node.js.

Sometimes, after X hours (5 < X < 30), one of the pods starts failing with ENOTFOUND, as follows:

{
  "name": "main",
  "hostname": "entrypoint-sdk-54c8788caa-aa3cj",
  "pid": 19,
  "level": 50,
  "error": {
    "errno": -3008,
    "code": "ENOTFOUND",
    "syscall": "getaddrinfo",
    "hostname": "employees-service"
  },
  "msg": "Failed calling getEmployee",
  "time": "2022-01-28T13:44:36.549Z",
  "v": 0
}

I’m running a stress test against the backend at YY users per second. I keep the load steady and do not change it, yet the error appears out of nowhere, with no apparent trigger.

Kubernetes is K3S Server Version: v1.21.5+k3s2

Any idea what might cause this weird ENOTFOUND?


Answer

I already saw the same question from you on GitHub, along with a reference to “getaddrinfo ENOTFOUND with newest versions”.

According to the comments there, the issue does not appear in k3s 1.21, which is one version below yours. I know it is almost impossible, but is there any chance you could try a similar setup on that version?

The error itself seems to come from node/lib/dns.js:

function errnoException(err, syscall, hostname) {
  // FIXME(bnoordhuis) Remove this backwards compatibility nonsense and pass
  // the true error to the user. ENOTFOUND is not even a proper POSIX error!
  if (err === uv.UV_EAI_MEMORY ||
      err === uv.UV_EAI_NODATA ||
      err === uv.UV_EAI_NONAME) {
    err = 'ENOTFOUND';
  }
  // ... rest of the function omitted in this excerpt
}
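
Since the excerpt above shows several distinct resolver failures collapsing into the single ENOTFOUND code, the calling code cannot tell a transient failure from a genuinely missing name. Below is a minimal sketch of one possible mitigation, assuming a retry is acceptable for the failing call. The helper name retryOnNotFound, its option names, and the default values are hypothetical, not taken from the question or from Node. It only masks transient failures and does not fix the underlying DNS problem.

```javascript
// Hypothetical helper: runs an async function and retries it with
// exponential backoff, but only when it fails with code ENOTFOUND.
async function retryOnNotFound(fn, options = {}) {
  const {
    retries = 3,       // assumed number of extra attempts
    baseDelayMs = 100, // assumed initial backoff delay
  } = options;

  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Any other error is rethrown immediately, without retrying.
      if (!err || err.code !== 'ENOTFOUND') throw err;
      lastError = err;
      if (attempt < retries) {
        const delayMs = baseDelayMs * 2 ** attempt;
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // All attempts failed with ENOTFOUND: surface the last error.
  throw lastError;
}
```

A call site could then look like `await retryOnNotFound(() => getEmployee(id))`, where getEmployee stands for whatever function issues the request to employees-service.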

What I would suggest is to check the article “Solving DNS lookup failures in Kubernetes”. It describes the long, hard way of tracking down the same error you have, which also bothered its author from time to time.

After investigating all the metrics, logs, etc., the solution there was to install a Kubernetes cluster add-on called Node Local DNS cache, which
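
To catch the failure in the act, it may help to measure resolution from inside an affected pod. The following is a rough sketch, not something from the article: probeDns and the shape of its summary object are hypothetical, while dns.promises.lookup is the standard Node API that goes through getaddrinfo, the same syscall named in the error above.

```javascript
const dns = require('dns');

// Hypothetical probe: resolves `hostname` `count` times in sequence and
// summarizes successes, failures per error code, and the slowest lookup.
// `lookup` is injectable so the probe can be exercised without a network.
async function probeDns(hostname, count, lookup = dns.promises.lookup) {
  const summary = { ok: 0, failed: 0, maxMs: 0, errors: {} };
  for (let i = 0; i < count; i++) {
    const start = process.hrtime.bigint();
    try {
      await lookup(hostname);
      summary.ok++;
    } catch (err) {
      summary.failed++;
      const code = (err && err.code) || 'UNKNOWN';
      summary.errors[code] = (summary.errors[code] || 0) + 1;
    }
    // Elapsed wall time for this lookup, in milliseconds.
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    summary.maxMs = Math.max(summary.maxMs, elapsedMs);
  }
  return summary;
}
```

Running something like `probeDns('employees-service', 1000).then(console.log)` periodically would show whether failures are isolated or come in bursts, and whether slow lookups precede them.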

improves Cluster DNS performance by running a dns caching agent on cluster nodes as a DaemonSet. In today’s architecture, Pods in ClusterFirst DNS mode reach out to a kube-dns serviceIP for DNS queries. This is translated to a kube-dns/CoreDNS endpoint via iptables rules added by kube-proxy. With this new architecture, Pods will reach out to the dns caching agent running on the same node, thereby avoiding iptables DNAT rules and connection tracking. The local caching agent will query kube-dns service for cache misses of cluster hostnames (cluster.local suffix by default).

Motivation

  • With the current DNS architecture, it is possible that Pods with the highest DNS QPS have to reach out to a different node, if there is no local kube-dns/CoreDNS instance. Having a local cache will help improve the latency in such scenarios.
  • Skipping iptables DNAT and connection tracking will help reduce conntrack races and avoid UDP DNS entries filling up conntrack table.
  • Connections from local caching agent to kube-dns service can be upgraded to TCP. TCP conntrack entries will be removed on connection close in contrast with UDP entries that have to timeout (default nf_conntrack_udp_timeout is 30 seconds)
  • Upgrading DNS queries from UDP to TCP would reduce tail latency attributed to dropped UDP packets and DNS timeouts usually up to 30s (3 retries + 10s timeout). Since the nodelocal cache listens for UDP DNS queries, applications don’t need to be changed.
  • Metrics & visibility into dns requests at a node level.
  • Negative caching can be re-enabled, thereby reducing number of queries to kube-dns service.
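
The hit, miss, and negative-caching flow described above can be illustrated with a small in-process sketch. This is purely conceptual: the real add-on is a node-level agent running as a DaemonSet, not application code, and the class name TtlDnsCache, its options, and the TTL values are all made up for the illustration.

```javascript
// Conceptual sketch only: a tiny TTL cache in front of an upstream resolver.
// Successful answers are cached for `ttlMs`; failures are cached for the
// shorter `negativeTtlMs` (negative caching), so repeated lookups of a
// failing name do not hit the upstream every time.
class TtlDnsCache {
  constructor(resolve, { ttlMs = 30000, negativeTtlMs = 5000, now = Date.now } = {}) {
    this.resolve = resolve;   // upstream resolver: async (name) => answer
    this.ttlMs = ttlMs;
    this.negativeTtlMs = negativeTtlMs;
    this.now = now;           // injectable clock, useful for testing
    this.entries = new Map();
    this.upstreamQueries = 0; // how many lookups actually went upstream
  }

  async lookup(name) {
    const hit = this.entries.get(name);
    if (hit && hit.expires > this.now()) {
      if (hit.error) throw hit.error; // negative cache hit
      return hit.value;               // positive cache hit
    }
    this.upstreamQueries++;           // cache miss: ask the upstream
    try {
      const value = await this.resolve(name);
      this.entries.set(name, { value, expires: this.now() + this.ttlMs });
      return value;
    } catch (error) {
      this.entries.set(name, { error, expires: this.now() + this.negativeTtlMs });
      throw error;
    }
  }
}
```

With such a cache in front, only the first lookup of a name within its TTL reaches the upstream, which is the effect the quoted motivation attributes to the node-local agent.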