I have the onpremise installation working on a T2.medium EC2 instance, but every week I have to restart it because the nginx server becomes unresponsive. I have a load balancer set up, and the health checks start failing every Sunday night / Monday morning. Restarting the EC2 instance and resetting the target group for the load balancer seems to get things back to normal. Has anyone else running onpremise with a similar AWS setup seen this issue? One of the challenges is that I can't SSH onto the instance to view the logs; I have to restart it just to be able to get in via SSH.
The only regular job that runs weekly is this:
And I don’t think it would cause such a hang. @matt, any ideas?
Is this the task that sends the weekly report via email? I can confirm that job did not run this morning. If there is a queue that needs to get flushed, that didn't happen after restarting the server either. Is there something I need to do to make sure everything is up and running after restarting the instance? I did SSH in and check that the services were up, so I had assumed docker-compose took care of restarting everything, but if there is a way to verify everything is running as expected, that would be great.
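So far the only check I've done after a restart is roughly this, from the directory where the onpremise repo is cloned (the path below is just an example from my setup):

# From the directory containing the onpremise docker-compose.yml
cd ~/onpremise
# List every service and its state; anything not "Up" stands out here
docker-compose ps
# Tail recent output from a specific service, e.g. the web container
docker-compose logs --tail=100 web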
I got an alert that the server had gone down. When I checked it in the AWS console, I saw there had been a CPU spike, and when I logged into the instance some of the services were restarting:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
216441487b43 nginx:1.16 "nginx -g 'daemon of…" 5 days ago Up 31 hours 0.0.0.0:9000->80/tcp sentry_onpremise_nginx_1
1bedf79128cd sentry-onpremise-local "/bin/sh -c 'exec /d…" 5 days ago Up 31 hours 9000/tcp sentry_onpremise_web_1
62746ff247a8 sentry-onpremise-local "/bin/sh -c 'exec /d…" 5 days ago Restarting (1) 27 seconds ago sentry_onpremise_post-process-forwarder_1
c05b12a37239 sentry-onpremise-local "/bin/sh -c 'exec /d…" 5 days ago Up 31 hours 9000/tcp sentry_onpremise_cron_1
a628c36faf3e sentry-onpremise-local "/bin/sh -c 'exec /d…" 5 days ago Up 31 hours 9000/tcp sentry_onpremise_worker_1
bab95e4a7a91 sentry-cleanup-onpremise-local "/entrypoint.sh '0 0…" 5 days ago Up 31 hours 9000/tcp sentry_onpremise_sentry-cleanup_1
2e3609305b47 sentry-onpremise-local "/bin/sh -c 'exec /d…" 5 days ago Restarting (1) 27 seconds ago sentry_onpremise_subscription-consumer-events_1
d0815694291e sentry-onpremise-local "/bin/sh -c 'exec /d…" 5 days ago Up 31 hours 9000/tcp sentry_onpremise_ingest-consumer_1
a6fec2782b00 sentry-onpremise-local "/bin/sh -c 'exec /d…" 5 days ago Up 31 hours 9000/tcp sentry_onpremise_subscription-consumer-transactions_1
50285aaae227 snuba-cleanup-onpremise-local "/entrypoint.sh '*/5…" 5 days ago Up 31 hours 1218/tcp sentry_onpremise_snuba-cleanup_1
0c75027c98ba getsentry/relay:20.11.1 "/bin/bash /docker-e…" 5 days ago Up 31 hours 3000/tcp sentry_onpremise_relay_1
d8e128d95fc9 getsentry/snuba:20.11.1 "./docker_entrypoint…" 5 days ago Restarting (1) 29 seconds ago sentry_onpremise_snuba-subscription-consumer-transactions_1
5caf801d3c28 getsentry/snuba:20.11.1 "./docker_entrypoint…" 5 days ago Restarting (1) 27 seconds ago sentry_onpremise_snuba-subscription-consumer-events_1
d657ee33e6ec symbolicator-cleanup-onpremise-local "/entrypoint.sh '55 …" 5 days ago Up 31 hours 3021/tcp sentry_onpremise_symbolicator-cleanup_1
afdc0966954f getsentry/snuba:20.11.1 "./docker_entrypoint…" 5 days ago Restarting (1) 28 seconds ago sentry_onpremise_snuba-transactions-consumer_1
0bc7e6e76e3f getsentry/snuba:20.11.1 "./docker_entrypoint…" 5 days ago Up 31 hours 1218/tcp sentry_onpremise_snuba-consumer_1
79c87b015545 getsentry/snuba:20.11.1 "./docker_entrypoint…" 5 days ago Restarting (1) 29 seconds ago sentry_onpremise_snuba-replacer_1
5e3424dcbb06 getsentry/snuba:20.11.1 "./docker_entrypoint…" 5 days ago Up 31 hours 1218/tcp sentry_onpremise_snuba-outcomes-consumer_1
9a842f0bdef8 getsentry/snuba:20.11.1 "./docker_entrypoint…" 5 days ago Up 31 hours 1218/tcp sentry_onpremise_snuba-api_1
a3f98b91574d getsentry/snuba:20.11.1 "./docker_entrypoint…" 5 days ago Up 31 hours 1218/tcp sentry_onpremise_snuba-sessions-consumer_1
5b912a6e80a2 memcached:1.5-alpine "docker-entrypoint.s…" 5 days ago Up 31 hours 11211/tcp sentry_onpremise_memcached_1
307ed037b6fa postgres:9.6 "docker-entrypoint.s…" 5 days ago Up 31 hours 5432/tcp sentry_onpremise_postgres_1
794dbc10e0c7 getsentry/symbolicator:0.3.0 "/bin/bash /docker-e…" 5 days ago Up 31 hours 3021/tcp sentry_onpremise_symbolicator_1
6c6aed46ae37 tianon/exim4 "docker-entrypoint.s…" 5 days ago Up 31 hours 25/tcp sentry_onpremise_smtp_1
891965a15529 confluentinc/cp-kafka:5.5.0 "/etc/confluent/dock…" 5 days ago Up 31 hours 9092/tcp sentry_onpremise_kafka_1
a6c1560e9a4c redis:5.0-alpine "docker-entrypoint.s…" 5 days ago Up 31 hours 6379/tcp sentry_onpremise_redis_1
a3826b6a2e99 confluentinc/cp-zookeeper:5.5.0 "/etc/confluent/dock…" 5 days ago Up 31 hours 2181/tcp, 2888/tcp, 3888/tcp sentry_onpremise_zookeeper_1
a70cf55b1a15 yandex/clickhouse-server:20.3.9.70 "/entrypoint.sh" 5 days ago Up 31 hours 8123/tcp, 9000/tcp, 9009/tcp sentry_onpremise_clickhouse_1
Currently, my site isn't working at all; it just shows "Internal error".
I’m not sure what happened or where to start looking. Any help would be appreciated!
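I'm guessing the first thing to pull is the logs from the containers that are stuck restarting, something like this (container names taken from the listing above), but I'm not sure what to make of them yet:

# Inspect why a container is restart-looping; names come from the docker ps output above
docker logs --tail=100 sentry_onpremise_snuba-replacer_1
docker logs --tail=100 sentry_onpremise_subscription-consumer-events_1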
Blech, no fun. Sorry.
I got an alert that the server had gone down.
This happened overnight Sunday into Monday, as before?
What version of Sentry are you running?
Here's the code underneath prepare_reports (in the latest stable release).
How many organizations do you have in your installation?
How many members max per organization?
How many max projects per organization?
Anything weird going on with Redis? Connection issues? Looks like reports pull from there.
Looks like heavy lifting gets us down into tsdb/base.py. My hunch is that you’ve got some extreme data scenario that is choking the report infra.
Any leads anywhere in there?
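If you want a quick way to poke at Redis, something along these lines should do it (assuming the service is named redis, as the container list above suggests):

# Basic liveness check against the bundled Redis container; expect "PONG"
docker-compose exec redis redis-cli ping
# Rough view of connection count and memory pressure
docker-compose exec redis redis-cli info clients
docker-compose exec redis redis-cli info memory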
I’m running the latest version 20.11.1 (4468076).
I run it for only one organization, and we only have three members. There are five projects total but only four of them are active.
I’m not seeing anything weird with Redis at first glance, but something does appear to be going on with Kafka.
At present, the sentry_onpremise_snuba-subscription-consumer-events_1 container has exited with code 1:
+ exec gosu snuba snuba subscriptions --auto-offset-reset=latest --consumer-group=snuba-events-subscriptions-consumers --topic=events --result-topic=events-subscription-results --dataset=events --commit-log-topic=snuba-commit-log --commit-log-group=snuba-consumers --delay-seconds=60 --schedule-ttl=60
2020-11-30 06:35:25,743 New partitions assigned: {Partition(topic=Topic(name='events'), index=0): 11383}
2020-11-30 06:35:25,744 Caught OffsetOutOfRange('KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}'), shutting down...
Traceback (most recent call last):
File "/usr/local/bin/snuba", line 33, in <module>
sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/usr/src/snuba/snuba/cli/subscriptions.py", line 224, in subscriptions
batching_consumer.run()
File "/usr/src/snuba/snuba/utils/streams/processing/processor.py", line 109, in run
self._run_once()
File "/usr/src/snuba/snuba/utils/streams/processing/processor.py", line 139, in _run_once
self.__message = self.__consumer.poll(timeout=1.0)
File "/usr/src/snuba/snuba/subscriptions/consumer.py", line 120, in poll
message = self.__consumer.poll(timeout)
File "/usr/src/snuba/snuba/utils/streams/synchronized.py", line 217, in poll
message = self.__consumer.poll(timeout)
File "/usr/src/snuba/snuba/utils/streams/backends/kafka.py", line 400, in poll
raise OffsetOutOfRange(str(error))
snuba.utils.streams.backends.abstract.OffsetOutOfRange: KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}
In the Kafka logs, I see this:
[2020-11-30 12:08:36,254] WARN Unable to reconnect to ZooKeeper service, session 0x10006b24970009b has expired (org.apache.zookeeper.ClientCnxn)
[2020-11-30 12:08:36,263] INFO Creating /brokers/ids/1001 (is it secure? false) (kafka.zk.KafkaZkClient)
[2020-11-30 12:08:36,271] INFO Stat of the created znode at /brokers/ids/1001 is: 3505,3505,1606738116270,1606738116270,1,0,0,72064956843950236,180,0,3505 (kafka.zk.KafkaZkClient)
[2020-11-30 12:08:36,271] INFO Registered broker 1001 at path /brokers/ids/1001 with addresses: ArrayBuffer(EndPoint(kafka,9092,ListenerName(PLAINTEXT),PLAINTEXT)), czxid (broker epoch): 3505 (kafka.zk.KafkaZkClient)
Zookeeper logs:
[2020-11-28 16:02:25,980] WARN CancelledKeyException causing close of session 0x10006b24970008b (org.apache.zookeeper.server.NIOServerCnxn)
[2020-11-28 16:02:33,763] WARN fsync-ing the write ahead log in SyncThread:0 took 3807ms which will adversely effect operation latency. File size is 67108880 bytes. See the ZooKeeper troubleshooting guide (org.apache.zookeeper.server.persistence.FileTxnLog)
[2020-11-28 16:05:09,681] WARN fsync-ing the write ahead log in SyncThread:0 took 1039ms which will adversely effect operation latency. File size is 67108880 bytes. See the ZooKeeper troubleshooting guide (org.apache.zookeeper.server.persistence.FileTxnLog)
[2020-11-28 16:07:53,556] WARN Unable to read additional data from client sessionid 0x10006b24970008d, likely client has closed socket (org.apache.zookeeper.server.NIOServerCnxn)
[2020-11-28 16:21:43,015] WARN fsync-ing the write ahead log in SyncThread:0 took 3774ms which will adversely effect operation latency. File size is 67108880 bytes. See the ZooKeeper troubleshooting guide (org.apache.zookeeper.server.persistence.FileTxnLog)
[2020-11-28 16:50:23,266] WARN fsync-ing the write ahead log in SyncThread:0 took 4089ms which will adversely effect operation latency. File size is 67108880 bytes. See the ZooKeeper troubleshooting guide (org.apache.zookeeper.server.persistence.FileTxnLog)
[2020-11-28 17:05:03,746] WARN fsync-ing the write ahead log in SyncThread:0 took 1158ms which will adversely effect operation latency. File size is 67108880 bytes. See the ZooKeeper troubleshooting guide (org.apache.zookeeper.server.persistence.FileTxnLog)
[2020-11-30 06:23:01,143] WARN Unable to read additional data from client sessionid 0x10006b249700096, likely client has closed socket (org.apache.zookeeper.server.NIOServerCnxn)
[2020-11-30 06:51:09,687] WARN Unable to read additional data from client sessionid 0x10006b249700098, likely client has closed socket (org.apache.zookeeper.server.NIOServerCnxn)
[2020-11-30 06:51:09,756] WARN Unable to read additional data from client sessionid 0x10006b249700098, likely client has closed socket (org.apache.zookeeper.server.NIOServerCnxn)
[2020-11-30 06:51:09,756] WARN Unable to read additional data from client sessionid 0x10006b249700098, likely client has closed socket (org.apache.zookeeper.server.NIOServerCnxn)
[2020-11-30 06:51:09,756] WARN Unable to read additional data from client sessionid 0x10006b249700098, likely client has closed socket (org.apache.zookeeper.server.NIOServerCnxn)
[2020-11-30 06:51:09,757] WARN Unable to read additional data from client sessionid 0x10006b249700098, likely client has closed socket (org.apache.zookeeper.server.NIOServerCnxn)
[2020-11-30 06:51:09,757] WARN Unable to read additional data from client sessionid 0x10006b249700098, likely client has closed socket (org.apache.zookeeper.server.NIOServerCnxn)
[2020-11-30 07:10:31,989] WARN Unable to read additional data from client sessionid 0x10006b24970009a, likely client has closed socket (org.apache.zookeeper.server.NIOServerCnxn)
So if anything, it looks like the ZooKeeper write-ahead log (those slow fsync warnings) is the problem point.
I'm not sure how to resolve this. I can upload my latest install script log if that helps at all.
Googling turned up this:
Wanna see if that’s your issue and give the workaround a try?
P.S. What I googled was “kafka broker offset out of range.”
Cool, thanks for the link. I ran the command to reset the offsets for snuba-events-subscriptions-consumers and restarted the service that had exited. I'll give it a few days and see how it goes.
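For the record, the reset was roughly along these lines; I'm reconstructing it from memory, so double-check the flags against your Kafka version (the group and topic names come from the consumer's command line above, and the service name comes from the container list):

# Run against the Kafka broker inside the compose network
docker-compose exec kafka kafka-consumer-groups \
  --bootstrap-server kafka:9092 \
  --group snuba-events-subscriptions-consumers \
  --topic events \
  --reset-offsets --to-latest --execute
# Bring the exited consumer back up
docker-compose up -d snuba-subscription-consumer-events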
Sorry for the trouble @avio_taylor. Good luck and let us know how it goes!
The server has stayed up since I made the change, but it looks like it is no longer recording events:
I suppose it's possible nothing was triggered last week, but since I set this up I don't think I've ever had a week of zero errors. Did I screw something up?
Good news on the server staying up! Bad news on the lack of events. I mean, good news if it’s accurate, but I’m with you … seems highly unlikely. :-/
Do you have a known error or test endpoint in your app, or other means of triggering an error explicitly to see if it shows up in your Sentry?
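If you don't have a handy test route, one low-tech option is sentry-cli pointed at one of your project DSNs, roughly like this (the DSN below is a placeholder):

# Send a throwaway event straight at the installation; grab a real DSN from a project's client keys
export SENTRY_DSN="https://<public_key>@your-sentry-host/<project_id>"
sentry-cli send-event -m "manual test event"

Then watch for it in that project's issue stream.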
I've tried triggering an error manually from my app running locally, but I'm not seeing it show up in our Sentry dashboard. Is there a way to check the Docker service logs to see whether the event is being sent from the client app?
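My guess for where to look is tailing the ingest path while sending a test event, something like this (service names match the containers listed earlier), but let me know if there's a better place:

# Follow the front of the ingest path: nginx -> relay -> web / ingest-consumer
docker-compose logs -f --tail=50 nginx relay web ingest-consumer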
Appreciate all your help with these issues!
Unfortunately, the same issue seems to have cropped up again last night. It seems to have started at about midnight Eastern time and has continued through to this morning. I'm unable to connect to the instance via SSH.
When this happened before, I had to reboot the instance from the EC2 dashboard in order to get SSH working again. If I can connect to the instance, will the logs persist from the previous session? If I can’t get at the previous logs, I’m not sure what I can look at to try to track down what’s going on with this setup.
It appears that resetting the offsets didn't stick, and I'm back to the instance going down. It's not reachable via SSH at the moment (it just hangs after establishing a connection), but the CPU appears to be spiking, which is probably why I can't tunnel in.
I won't be able to SSH in until I restart the instance. What should I try when I can get in to take a look at what's going on? What could be causing the CPU to spike like that?
I think you are having event spikes and your hardware resources (probably memory) cannot keep up with this volume.
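One way to check that theory once you can get back onto the box is to look for OOM-killer activity and a point-in-time resource picture; nothing Sentry-specific, just stock tooling:

# Any sign the kernel OOM killer has been shooting processes?
dmesg -T | grep -iE "out of memory|killed process" | tail -n 20
# Current memory picture and per-container usage
free -h
docker stats --no-stream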
Thanks for the reply. Are there any detailed hardware requirements for onpremise documented anywhere? The only mention in the repo is 2400 MB of RAM, but I think part of my problem might be slow writes to disk. I'll look around on the forums for recommendations from other people running onpremise on AWS. Thanks again.
This is the correct solution; I must have either passed the wrong group name or not performed this step for all of the groups that were erroring.
I did run docker-compose down and docker-compose up -d kafka before running the commands in the kafka shell to make sure everything stuck.
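For anyone hitting this later, the overall sequence was roughly this (reconstructed from my shell history, so treat it as a sketch):

# Stop the stack, then bring up only Kafka so nothing is consuming while offsets get reset
docker-compose down
docker-compose up -d kafka
# List the consumer groups so none of the erroring ones get missed
docker-compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --list
# ...reset each affected group as in the earlier command, then bring everything back up
docker-compose up -d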
Thanks again for your help with tracking this down.