We are trying to understand why messages sometimes appear in the system only an hour after they appear in the source.
The first thing I found is that at those moments there is a large backlog in the counters-0 and events.preprocess_event queues. I tried increasing the number of workers from 6 to 36 (6 processes with a concurrency of 6 each), but this did not help. Every day I still observe surges of up to 20,000 tasks in counters-0 and up to several thousand in events.preprocess_event.
To investigate, I turned on the internal metrics and sent them to Prometheus via StatsD and StatsD-exporter, but they did not help because I do not know what each metric measures.
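As a first step toward making sense of the metrics, I could at least list which metric names are being emitted and how often. Below is a minimal sketch, assuming the payload uses the plain StatsD text format `name:value|type`. The function names and the sample metric names are hypothetical and are not taken from the actual system.

```python
from collections import Counter


def parse_statsd_line(line):
    """Split one StatsD line 'name:value|type[|@rate]' into (name, value, type)."""
    name, _, rest = line.partition(":")
    fields = rest.split("|")
    if not name or len(fields) < 2:
        raise ValueError(f"not a StatsD line: {line!r}")
    return name, fields[0], fields[1]


def tally_metric_names(payload):
    """Count how often each metric name occurs in a newline-separated payload."""
    counts = Counter()
    for line in payload.splitlines():
        line = line.strip()
        if line:
            name, _value, _type = parse_statsd_line(line)
            counts[name] += 1
    return counts


# Hypothetical sample payload; these metric names are made up for illustration.
sample = (
    "example.queue.size:20000|g\n"
    "example.task.duration:12|ms\n"
    "example.queue.size:150|g"
)
print(tally_metric_names(sample).most_common())
```

Feeding it a capture of the real StatsD traffic would give a list of names to look up, which is still only a starting point without a description of what each one means.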
My first question: is there a description of the internal metrics anywhere?
My second question: can anyone advise how to track down the cause of the problem?