Kubernetes (K3s) pod gets “ENOTFOUND” after 5-20 hours of uptime

Question

I'm running my backend on Kubernetes, on around 250 pods across 15 deployments; the backend is written in Node.js. Sometimes, after X hours (5<X<30), I get ENOTFOUND in one of the pods, as follows: I'm running a stress test on the backend at YY users per second, but I&…

Accepted Answer

I already saw the same question from you on GitHub, with a reference to "getaddrinfo ENOTFOUND with newest versions". According to the comments there, this issue does not appear in k3s 1.21, which is one version below yours. I know it is almost impossible, but is there any chance you could try a similar setup on that version?

The error seems to come from node/lib/dns.js (excerpt):

    function errnoException(err, syscall, hostname) {
      // FIXME(bnoordhuis) Remove this backwards compatibility nonsense and pass
      // the true error to the user. ENOTFOUND is not even a proper POSIX error!
      if (err === uv.UV_EAI_MEMORY ||
          err === uv.UV_EAI_NODATA ||
          err === uv.UV_EAI_NONAME) {
        err = 'ENOTFOUND';
      }

What I wanted to suggest is that you check the article "Solving DNS lookup failures in Kubernetes". It describes the long, hard way of catching the same error you have, which also bothered its authors from time to time.

After investigating all the metrics, logs, etc., their solution was to install a K8s cluster add-on called Node Local DNS cache, which improves cluster DNS performance by running a DNS caching agent on cluster nodes as a DaemonSet. In today's architecture, Pods in ClusterFirst DNS mode reach out to a kube-dns serviceIP for DNS queries. This is translated to a kube-dns/CoreDNS endpoint via iptables rules added by kube-proxy. With this new architecture, Pods will reach out to the DNS caching agent running on the same node, thereby avoiding iptables DNAT rules and connection tracking. The local caching agent will query the kube-dns service for cache misses of cluster hostnames (cluster.local suffix by default).

Motivation

- With the current DNS architecture, it is possible that Pods with the highest DNS QPS have to reach out to a different node if there is no local kube-dns/CoreDNS instance. Having a local cache will help improve the latency in such scenarios.
- Skipping iptables DNAT and connection tracking will help reduce conntrack races and avoid UDP DNS entries filling up the conntrack table.
- Connections from the local caching agent to the kube-dns service can be upgraded to TCP. TCP conntrack entries will be removed on connection close, in contrast with UDP entries that have to time out (the default nf_conntrack_udp_timeout is 30 seconds).
- Upgrading DNS queries from UDP to TCP would reduce tail latency attributed to dropped UDP packets and DNS timeouts, usually up to 30s (3 retries + 10s timeout). Since the nodelocal cache listens for UDP DNS queries, applications don't need to be changed.
- Metrics and visibility into DNS requests at a node level.
- Negative caching can be re-enabled, thereby reducing the number of queries to the kube-dns service.
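
While investigating, it may help to log exactly when lookups fail from inside an affected pod, so the failures can be lined up with CoreDNS and conntrack metrics. The following is only a minimal sketch of such a probe, not something taken from the article or from Node's source: the helper names (describeDnsError, probeOnce, startProbe) are hypothetical, and treating EAI_AGAIN alongside ENOTFOUND as a "DNS failure" is my own assumption. It uses dns.promises.lookup, which resolves through getaddrinfo, the call named in the "getaddrinfo ENOTFOUND" error.

```javascript
const dns = require('node:dns');

// Assumption: these two codes are treated as resolver-side failures.
// ENOTFOUND is listed because, per the dns.js excerpt above, Node folds
// EAI_MEMORY, EAI_NODATA and EAI_NONAME into it.
const DNS_FAILURE_CODES = new Set(['ENOTFOUND', 'EAI_AGAIN']);

// Turn an error from dns.lookup into a flat, timestamped log record.
function describeDnsError(err, now = new Date()) {
  return {
    time: now.toISOString(),
    dnsFailure: DNS_FAILURE_CODES.has(err.code),
    code: err.code,
    syscall: err.syscall,
    hostname: err.hostname,
  };
}

// Resolve `hostname` once and report either the address and latency,
// or a record describing the failure.
async function probeOnce(hostname) {
  const started = process.hrtime.bigint();
  try {
    const { address } = await dns.promises.lookup(hostname);
    const ms = Number(process.hrtime.bigint() - started) / 1e6;
    return { ok: true, hostname, address, ms };
  } catch (err) {
    return { ok: false, ...describeDnsError(err) };
  }
}

// Probe every `intervalMs` milliseconds and log only the failures.
// Returns a function that stops the probe.
function startProbe(hostname, intervalMs = 1000, log = console.error) {
  const timer = setInterval(async () => {
    const result = await probeOnce(hostname);
    if (!result.ok) log(JSON.stringify(result));
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Calling startProbe with the service hostname your backend resolves under load would print one JSON line per failed lookup; calling the returned function stops it. If the failures disappear once the node-local cache is installed, that points at the cluster DNS path rather than at the application.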

Answer