We are repeatedly receiving "Temporary failure in name resolution" messages from specific worker nodes. There seems to be a problem resolving the DNS name to an IP address, and it keeps recurring on one particular node. When we try to stop the Docker container, the following error appears: "An HTTP request took too long to complete."
Here is the more detailed error:
"worker MaxRetryError - Max retries exceeded with url: /api/1/store/ (Caused by NewConnectionError: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution)"
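For context, this is roughly how that error chain arises on the client side. A minimal sketch, assuming a plain `requests` client and a placeholder store URL (not our real host):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Placeholder ingest host; the real hostname in our deployment differs.
STORE_URL = "http://sentry.internal.example/api/1/store/"

session = requests.Session()
# Retry a few times with backoff before giving up.
session.mount("http://", HTTPAdapter(max_retries=Retry(total=3, backoff_factor=1)))

try:
    session.post(STORE_URL, json={"message": "ping"}, timeout=5)
except requests.exceptions.ConnectionError as exc:
    # A failed DNS lookup raises NewConnectionError
    # ("[Errno -3] Temporary failure in name resolution"); once the
    # retries are exhausted, urllib3 wraps it in MaxRetryError, which
    # requests re-raises as ConnectionError.
    print(f"could not reach the store endpoint: {exc}")
```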
When this happens, the worker turns into a zombie and stops doing any work: CPU usage drops while memory usage rises rapidly.
We are not sure whether it is related, but the number of Redis keys also grows rapidly, driving memory usage close to 100%. (The two phenomena do not coincide in time, though.)
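To line the Redis growth up against the worker stalls, we log key counts and memory over time with a small watcher like this (a sketch assuming a local Redis on the default port; adjust host/port/db to your deployment):

```python
import time

import redis

# Assumed connection details; point this at the Redis your workers use.
r = redis.Redis(host="localhost", port=6379, db=0)

# Log key count and memory usage once a minute so the two symptoms
# (key growth vs. worker zombie state) can be compared on a timeline.
while True:
    mem = r.info("memory")
    print(f"{time.strftime('%H:%M:%S')} keys={r.dbsize()} "
          f"used_memory={mem['used_memory_human']}")
    time.sleep(60)
```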
That makes sense, as it looks like the worker keeps retrying. Have you checked your DNS server, as the error message suggests?
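A quick way to see whether resolution fails constantly or only intermittently on that node is to probe it from inside the worker container. A sketch, with the hostname as a placeholder:

```python
import socket
import time

# Placeholder for the hostname the worker cannot resolve.
HOSTNAME = "sentry.internal.example"

# Probe resolution repeatedly; intermittent failures point at a flaky
# or overloaded DNS server rather than a bad /etc/resolv.conf entry.
for _ in range(20):
    try:
        addrs = socket.getaddrinfo(HOSTNAME, 443)
        print("resolved to:", sorted({a[4][0] for a in addrs}))
    except socket.gaierror as exc:
        # gaierror with errno -3 is "Temporary failure in name resolution".
        print("resolution failed:", exc)
    time.sleep(5)
```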
You could also run multiple workers to avoid blockages like this. Check out this topic to see how to run multiple dedicated workers for different queues: How to clear backlog and monitor it - #15 by amit1
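Since Sentry's workers are Celery-based, the idea looks roughly like this in generic Celery terms. The broker URL, task names, and queue names below are illustrative only, not Sentry's actual configuration:

```python
from celery import Celery

# Illustrative broker URL; Sentry points its workers at its own broker.
app = Celery("tasks", broker="redis://localhost:6379/0")

# Route heavy and light tasks to separate queues...
app.conf.task_routes = {
    "tasks.process_event": {"queue": "events"},
    "tasks.send_email": {"queue": "emails"},
}

@app.task
def process_event(payload):
    ...

@app.task
def send_email(address):
    ...

# ...then start one dedicated worker per queue, so a backlog in one
# queue cannot starve the other:
#   celery -A tasks worker -Q events -c 4
#   celery -A tasks worker -Q emails -c 1
```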
We finally solved this problem. It turned out we had deployed too many containers on a single node.
That generated too much traffic, network I/O, and CPU usage, which we suspect caused the DNS resolution failures. Once we moved the worker off the node running the snuba & sentry consumers and onto a separate node, the problem went away. The Kafka offset issue and the excessive Redis keys turned out to be related to this as well, and they were resolved too.