OnPremise: events not showing up but getting 200 to store

I’ve been running On-Premise 20.8.0 since yesterday.
For the first hour everything was fine, my events got processed and showed up in my project.

However, that is not the case anymore.

Some logs:

nginx:

nginx_1                        | 10.10.2.65 - - [10/Sep/2020:07:45:11 +0000] "POST /api/2/store/ HTTP/1.0" 200 41 "-" "sentry-php/1.10.0"
nginx_1                        | 172.20.0.1 - - [10/Sep/2020:07:45:12 +0000] "POST /api/2/store/ HTTP/1.0" 200 41 "-" "sentry-php/1.10.0"
nginx_1                        | 10.10.2.65 - - [10/Sep/2020:07:45:13 +0000] "POST /api/2/store/ HTTP/1.0" 200 41 "-" "sentry-php/1.10.0"
nginx_1                        | 172.20.0.1 - - [10/Sep/2020:07:45:14 +0000] "POST /api/2/store/ HTTP/1.0" 200 41 "-" "sentry-php/1.10.0"
nginx_1                        | 10.10.2.65 - - [10/Sep/2020:07:45:14 +0000] "POST /api/2/store/ HTTP/1.0" 200 41 "-" "sentry-php/1.10.0"
nginx_1                        | 172.20.0.1 - - [10/Sep/2020:07:45:14 +0000] "POST /api/2/store/ HTTP/1.0" 200 41 "-" "sentry-php/1.10.0"
nginx_1                        | 10.10.2.65 - - [10/Sep/2020:07:45:14 +0000] "POST /api/2/store/ HTTP/1.0" 200 41 "-" "sentry-php/1.10.0"
nginx_1                        | 172.20.0.1 - - [10/Sep/2020:07:45:14 +0000] "POST /api/2/store/ HTTP/1.0" 200 41 "-" "sentry-php/1.10.0"
nginx_1                        | 10.10.2.65 - - [10/Sep/2020:07:45:14 +0000] "POST /api/2/store/ HTTP/1.0" 200 41 "-" "sentry-php/1.10.0"

etc. This seems to indicate events are being accepted?

ingest-consumer:

ingest-consumer_1              | 07:45:43 [INFO] batching-kafka-consumer: Worker flush took 171ms
ingest-consumer_1              | 07:45:44 [INFO] batching-kafka-consumer: Flushing 6 items (from {(u'ingest-events', 0): [1377L, 1382L]}): forced:False size:False time:True
ingest-consumer_1              | 07:45:44 [INFO] batching-kafka-consumer: Worker flush took 87ms
ingest-consumer_1              | 07:45:45 [INFO] batching-kafka-consumer: Flushing 14 items (from {(u'ingest-events', 0): [1383L, 1396L]}): forced:False size:False time:True
ingest-consumer_1              | 07:45:46 [INFO] batching-kafka-consumer: Worker flush took 642ms

So it seems to be doing some ingesting?

Now if I docker-compose logs -f I see no other output whatsoever (except for redis occasionally writing its data to disk), which makes me think one of the other parts is not doing its job?
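To narrow down which service has gone quiet, it can help to tail only the suspect containers instead of the whole stack. A minimal sketch using standard docker-compose flags; the service names are assumed to match the on-premise docker-compose.yml:

```shell
# Tail just the ingestion pipeline services to see which one stops
# producing output. Service names (relay, ingest-consumer, worker, kafka)
# are assumptions based on a default on-premise compose file.
docker-compose logs -f --tail=50 relay ingest-consumer worker kafka
```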

The last part of the relay log:

relay_1                        |   caused by: failed to fetch project state from upstream
relay_1                        | 2020-09-10T07:40:42Z [relay_server::actors::events] ERROR: error processing event: failed to resolve project information
relay_1                        |   caused by: failed to fetch project state from upstream
relay_1                        | 2020-09-10T07:40:42Z [relay_server::actors::events] ERROR: error processing event: failed to resolve project information
relay_1                        |   caused by: failed to fetch project state from upstream
(the last two lines repeat many more times)

This is from when I restarted the stack. After that, nothing.
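The "failed to fetch project state from upstream" errors mean Relay could not reach the web service to resolve project configuration. A quick connectivity check is a reasonable first step (a sketch; port 9000 and the /_health/ endpoint are assumptions based on a default on-premise install):

```shell
# Check that the Sentry web service responds at all
# (9000 is the default on-premise port).
curl -sf http://localhost:9000/_health/ && echo "web is up"

# Then see whether Relay recovered after the web service came back.
docker-compose logs --tail=20 relay
```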

The last part of the kafka log:

kafka_1                        | [main] INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=zookeeper:2181 sessionTimeout=40000 watcher=io.confluent.admin.utils.ZookeeperConnectionWatcher@cc34f4d
kafka_1                        | [main] INFO org.apache.zookeeper.common.X509Util - Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation
kafka_1                        | [main] INFO org.apache.zookeeper.ClientCnxnSocket - jute.maxbuffer value is 4194304 Bytes
kafka_1                        | [main] INFO org.apache.zookeeper.ClientCnxn - zookeeper.request.timeout value is 0. feature enabled=
kafka_1                        | [main-SendThread(zookeeper:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server zookeeper/172.20.0.19:2181. Will not attempt to authenticate using SASL (unknown error)
kafka_1                        | [main-SendThread(zookeeper:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /172.20.0.11:38414, server: zookeeper/172.20.0.19:2181
kafka_1                        | [main-SendThread(zookeeper:2181)] INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server zookeeper/172.20.0.19:2181, sessionid = 0x1000000f2560000, negotiated timeout = 40000
kafka_1                        | [main-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x1000000f2560000
kafka_1                        | [main] INFO org.apache.zookeeper.ZooKeeper - Session: 0x1000000f2560000 closed
kafka_1                        | ===> Launching ...
kafka_1                        | ===> Launching kafka ...
kafka_1                        | [2020-09-10 07:40:50,319] INFO Registered kafka:type=kafka.Log4jController MBean (kafka.utils.Log4jControllerRegistration$)
kafka_1                        | [2020-09-10 07:40:52,089] WARN The package io.confluent.support.metrics.collectors.FullCollector for collecting the full set of support metrics could not be loaded, so we are reverting to anonymous, basic metric collection. If you are a Confluent customer, please refer to the Confluent Platform documentation, section Proactive Support, on how to activate full metrics collection. (io.confluent.support.metrics.KafkaSupportConfig)
kafka_1                        | [2020-09-10 07:40:52,090] WARN The support metrics collection feature ("Metrics") of Proactive Support is disabled. (io.confluent.support.metrics.SupportedServerStartable)
kafka_1                        | [2020-09-10 07:40:55,802] INFO Starting the log cleaner (kafka.log.LogCleaner)
kafka_1                        | [2020-09-10 07:40:56,215] INFO [kafka-log-cleaner-thread-0]: Starting (kafka.log.LogCleaner)
kafka_1                        | [2020-09-10 07:40:57,514] INFO Awaiting socket connections on 0.0.0.0:9092. (kafka.network.Acceptor)
kafka_1                        | [2020-09-10 07:40:57,598] INFO [SocketServer brokerId=1001] Created data-plane acceptor and processors for endpoint : EndPoint(0.0.0.0,9092,ListenerName(PLAINTEXT),PLAINTEXT) (kafka.network.SocketServer)
kafka_1                        | [2020-09-10 07:40:57,601] INFO [SocketServer brokerId=1001] Started 1 acceptor threads for data-plane (kafka.network.SocketServer)
kafka_1                        | [2020-09-10 07:40:57,905] INFO Creating /brokers/ids/1001 (is it secure? false) (kafka.zk.KafkaZkClient)
kafka_1                        | [2020-09-10 07:40:57,953] INFO Stat of the created znode at /brokers/ids/1001 is: 288,288,1599723657931,1599723657931,1,0,0,72057598103650305,180,0,288
kafka_1                        |  (kafka.zk.KafkaZkClient)
kafka_1                        | [2020-09-10 07:40:57,954] INFO Registered broker 1001 at path /brokers/ids/1001 with addresses: ArrayBuffer(EndPoint(kafka,9092,ListenerName(PLAINTEXT),PLAINTEXT)), czxid (broker epoch): 288 (kafka.zk.KafkaZkClient)
kafka_1                        | [2020-09-10 07:41:00,235] INFO [/config/changes-event-process-thread]: Starting (kafka.common.ZkNodeChangeNotificationListener$ChangeEventProcessThread)
kafka_1                        | [2020-09-10 07:41:00,509] INFO [SocketServer brokerId=1001] Started data-plane processors for 1 acceptors (kafka.network.SocketServer)

Maybe this zookeeper log is interesting:

zookeeper_1                    | ===> Launching ...
zookeeper_1                    | ===> Launching zookeeper ...
zookeeper_1                    | [2020-09-10 07:39:55,044] WARN Either no config or no quorum defined in config, running  in standalone mode (org.apache.zookeeper.server.quorum.QuorumPeerMain)
zookeeper_1                    | [2020-09-10 07:39:56,423] WARN o.e.j.s.ServletContextHandler@4d95d2a2{/,null,UNAVAILABLE} contextPath ends with /* (org.eclipse.jetty.server.handler.ContextHandler)
zookeeper_1                    | [2020-09-10 07:39:56,423] WARN Empty contextPath (org.eclipse.jetty.server.handler.ContextHandler)
zookeeper_1                    | [2020-09-10 07:39:57,217] WARN Exception causing close of session 0x0: ZooKeeperServer not running (org.apache.zookeeper.server.NIOServerCnxn)
zookeeper_1                    | [2020-09-10 07:39:57,219] WARN Unexpected exception (org.apache.zookeeper.server.WorkerService)
zookeeper_1                    | java.lang.NullPointerException
zookeeper_1                    | 	at org.apache.zookeeper.server.ZooKeeperServer.removeCnxn(ZooKeeperServer.java:135)
zookeeper_1                    | 	at org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:602)
zookeeper_1                    | 	at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:375)
zookeeper_1                    | 	at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:530)
zookeeper_1                    | 	at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:155)
zookeeper_1                    | 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
zookeeper_1                    | 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
zookeeper_1                    | 	at java.lang.Thread.run(Thread.java:748)

I first thought my system didn’t have enough RAM (4 GB), since memory was almost full all the time, so I increased it to 8 GB. Now memory is no longer full, but the same thing happens.

I can give logs of other containers if needed.

Does anyone have any clue what is going on?

It seems like Relay somehow lost the project information (or it is being passed an incorrect DSN or something). I’m assuming you haven’t changed anything on the SDK side, right?

@jauer - any thoughts?

Meanwhile I did some more digging.

I also increased my CPU count from 1 to 4, after which it worked for a whole day (last time it only lasted an hour), but it stopped working again 2 hours ago.

I restarted each docker container individually to see if I could find anything. I went through almost the whole list and ended up at kafka. As soon as I restarted that one, the CPU usage of celeryd (which I believe is the worker container) spiked to 50% on all 4 cores, and the events that had been missing for the last two hours came back in a constant influx. I assume it will also keep processing new events as they arrive, but I also expect it to stop doing that again at some point.

This is the CPU graph btw, I think you can guess when the worker started processing again…

What’s the best thing I can do next time? I can try restarting only kafka and see if that helps.
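For a targeted experiment the next time it wedges, restarting just the kafka container (rather than the whole stack) and then watching the worker is a minimal sketch:

```shell
# Restart only Kafka, then watch whether the worker resumes consuming.
docker-compose restart kafka
docker-compose logs -f --tail=20 worker
```

If the worker only recovers after a kafka restart, that points at the broker (or the worker's connection to it) rather than at the worker itself.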

This is the worker log at the moment it seems to have stopped:

worker_1                       | 10:21:37 [INFO] sentry.tasks.update_user_reports: update_user_reports.records_updated (reports_with_event=0 updated_reports=0 reports_to_update=0)
worker_1                       | Traceback (most recent call last):
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 316, in start
worker_1                       |     blueprint.start(self)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/bootsteps.py", line 119, in start
worker_1                       |     step.start(parent)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/consumer/consumer.py", line 592, in start
worker_1                       |     c.loop(*c.loop_args())
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/celery/worker/loops.py", line 91, in asynloop
worker_1                       |     next(loop)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/asynchronous/hub.py", line 276, in create_loop
worker_1                       |     tick_callback()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 1040, in on_poll_start
worker_1                       |     cycle_poll_start()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 322, in on_poll_start
worker_1                       |     self._register_BRPOP(channel)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 308, in _register_BRPOP
worker_1                       |     channel._brpop_start()
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 714, in _brpop_start
worker_1                       |     self.client.connection.send_command('BRPOP', *keys)
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 610, in send_command
worker_1                       |     self.send_packed_command(self.pack_command(*args))
worker_1                       |   File "/usr/local/lib/python2.7/site-packages/redis/connection.py", line 603, in send_packed_command
worker_1                       |     (errno, errmsg))
worker_1                       | ConnectionError: Error 32 while writing to socket. Broken pipe.
worker_1                       | 10:21:55 [WARNING] celery.worker.consumer.consumer: consumer: Connection to broker lost. Trying to re-establish the connection...
worker_1                       | Restoring 1 unacknowledged message(s)
worker_1                       |
worker_1                       | worker: Warm shutdown (MainProcess)
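Error 32 (broken pipe) means Redis closed the connection while the worker was still writing to it. One thing worth ruling out is an idle-connection timeout on the Redis side, since a non-zero timeout would close idle broker connections. A sketch using standard redis-cli commands:

```shell
# Check whether Redis is configured to drop idle client connections.
# A value of "0" means connections are never closed for idleness.
docker-compose exec redis redis-cli CONFIG GET timeout

# Also look at current client connection stats for anything unusual.
docker-compose exec redis redis-cli INFO clients
```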

Can you also show the redis logs from around the same time?

Sure:

redis_1                        | 1327:C 11 Sep 2020 10:19:26.070 * DB saved on disk
redis_1                        | 1327:C 11 Sep 2020 10:19:26.071 * RDB: 0 MB of memory used by copy-on-write
redis_1                        | 1:M 11 Sep 2020 10:19:26.130 * Background saving terminated with success
redis_1                        | 1:M 11 Sep 2020 10:20:27.031 * 10000 changes in 60 seconds. Saving...
redis_1                        | 1:M 11 Sep 2020 10:20:27.033 * Background saving started by pid 1328
redis_1                        | 1328:C 11 Sep 2020 10:20:27.063 * DB saved on disk
redis_1                        | 1328:C 11 Sep 2020 10:20:27.064 * RDB: 0 MB of memory used by copy-on-write
redis_1                        | 1:M 11 Sep 2020 10:20:27.133 * Background saving terminated with success
redis_1                        | 1:M 11 Sep 2020 10:21:28.023 * 10000 changes in 60 seconds. Saving...
redis_1                        | 1:M 11 Sep 2020 10:21:28.026 * Background saving started by pid 1329
redis_1                        | 1329:C 11 Sep 2020 10:21:28.064 * DB saved on disk
redis_1                        | 1329:C 11 Sep 2020 10:21:28.065 * RDB: 0 MB of memory used by copy-on-write
redis_1                        | 1:M 11 Sep 2020 10:21:28.126 * Background saving terminated with success
redis_1                        | 1:M 11 Sep 2020 10:22:29.006 * 10000 changes in 60 seconds. Saving...
redis_1                        | 1:M 11 Sep 2020 10:22:29.008 * Background saving started by pid 1330
redis_1                        | 1330:C 11 Sep 2020 10:22:29.049 * DB saved on disk
redis_1                        | 1330:C 11 Sep 2020 10:22:29.050 * RDB: 1 MB of memory used by copy-on-write
redis_1                        | 1:M 11 Sep 2020 10:22:29.108 * Background saving terminated with success

Doesn’t seem strange, does it?

Yup, just a bit bursty, which might be the cause of all these troubles.
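If the bursty RDB snapshots do turn out to be the problem, the save cadence can be relaxed in redis.conf. A sketch; the thresholds below are illustrative, not a recommendation from this thread:

```
# redis.conf — default snapshot triggers:
#   save 900 1       (snapshot after 900 s if at least 1 key changed)
#   save 300 10
#   save 60 10000    <- the "10000 changes in 60 seconds" rule firing in the log above
# Relaxing or dropping the most aggressive rule reduces fork/IO bursts:
save 900 1
save 300 100
```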