Have to restart EC2 instance every week

I have the onpremise installation working on a t2.medium EC2 instance, but every week I have to restart it because the nginx server becomes unresponsive. I have a load balancer set up, and the health checks start failing every Sunday night / Monday morning. Restarting the EC2 instance and re-setting the target group for the load balancer seems to get things back to normal. Has anyone else running onpremise with a similar AWS setup seen this issue? One of the challenges is that I can’t SSH onto the instance to view the logs; I have to restart it to even be able to get in via SSH.

The only regular job that runs weekly is this:

And I don’t think it would cause such a hang. @matt, any ideas?

Is this the task that sends the weekly report via email? I can confirm that job did not run this morning. If there is a queue that needs to get flushed, that didn’t happen after restarting the server either. Is there something I need to do to make sure everything is up and running after restarting the instance? I did SSH in and make sure the services were up, so I had assumed docker-compose took care of restarting, but if there is a way I can verify everything is running as expected, that would be great.
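
(A quick way to sanity-check that after a restart, run from the onpremise directory, is to list the service states and spot-check recent output; the service names below are the standard onpremise ones:)

docker-compose ps                               # anything stuck in "Restarting" or "Exit" state will show up here
docker-compose logs --tail=50 web worker cron   # recent output from the core services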

I got an alert that the server had gone down. When I checked it in the AWS console, I saw there had been a CPU spike, and when I logged into the instance some of the services were restarting:

CONTAINER ID        IMAGE                                  COMMAND                  CREATED             STATUS                          PORTS                          NAMES
216441487b43        nginx:1.16                             "nginx -g 'daemon of…"   5 days ago          Up 31 hours                     0.0.0.0:9000->80/tcp           sentry_onpremise_nginx_1
1bedf79128cd        sentry-onpremise-local                 "/bin/sh -c 'exec /d…"   5 days ago          Up 31 hours                     9000/tcp                       sentry_onpremise_web_1
62746ff247a8        sentry-onpremise-local                 "/bin/sh -c 'exec /d…"   5 days ago          Restarting (1) 27 seconds ago                                  sentry_onpremise_post-process-forwarder_1
c05b12a37239        sentry-onpremise-local                 "/bin/sh -c 'exec /d…"   5 days ago          Up 31 hours                     9000/tcp                       sentry_onpremise_cron_1
a628c36faf3e        sentry-onpremise-local                 "/bin/sh -c 'exec /d…"   5 days ago          Up 31 hours                     9000/tcp                       sentry_onpremise_worker_1
bab95e4a7a91        sentry-cleanup-onpremise-local         "/entrypoint.sh '0 0…"   5 days ago          Up 31 hours                     9000/tcp                       sentry_onpremise_sentry-cleanup_1
2e3609305b47        sentry-onpremise-local                 "/bin/sh -c 'exec /d…"   5 days ago          Restarting (1) 27 seconds ago                                  sentry_onpremise_subscription-consumer-events_1
d0815694291e        sentry-onpremise-local                 "/bin/sh -c 'exec /d…"   5 days ago          Up 31 hours                     9000/tcp                       sentry_onpremise_ingest-consumer_1
a6fec2782b00        sentry-onpremise-local                 "/bin/sh -c 'exec /d…"   5 days ago          Up 31 hours                     9000/tcp                       sentry_onpremise_subscription-consumer-transactions_1
50285aaae227        snuba-cleanup-onpremise-local          "/entrypoint.sh '*/5…"   5 days ago          Up 31 hours                     1218/tcp                       sentry_onpremise_snuba-cleanup_1
0c75027c98ba        getsentry/relay:20.11.1                "/bin/bash /docker-e…"   5 days ago          Up 31 hours                     3000/tcp                       sentry_onpremise_relay_1
d8e128d95fc9        getsentry/snuba:20.11.1                "./docker_entrypoint…"   5 days ago          Restarting (1) 29 seconds ago                                  sentry_onpremise_snuba-subscription-consumer-transactions_1
5caf801d3c28        getsentry/snuba:20.11.1                "./docker_entrypoint…"   5 days ago          Restarting (1) 27 seconds ago                                  sentry_onpremise_snuba-subscription-consumer-events_1
d657ee33e6ec        symbolicator-cleanup-onpremise-local   "/entrypoint.sh '55 …"   5 days ago          Up 31 hours                     3021/tcp                       sentry_onpremise_symbolicator-cleanup_1
afdc0966954f        getsentry/snuba:20.11.1                "./docker_entrypoint…"   5 days ago          Restarting (1) 28 seconds ago                                  sentry_onpremise_snuba-transactions-consumer_1
0bc7e6e76e3f        getsentry/snuba:20.11.1                "./docker_entrypoint…"   5 days ago          Up 31 hours                     1218/tcp                       sentry_onpremise_snuba-consumer_1
79c87b015545        getsentry/snuba:20.11.1                "./docker_entrypoint…"   5 days ago          Restarting (1) 29 seconds ago                                  sentry_onpremise_snuba-replacer_1
5e3424dcbb06        getsentry/snuba:20.11.1                "./docker_entrypoint…"   5 days ago          Up 31 hours                     1218/tcp                       sentry_onpremise_snuba-outcomes-consumer_1
9a842f0bdef8        getsentry/snuba:20.11.1                "./docker_entrypoint…"   5 days ago          Up 31 hours                     1218/tcp                       sentry_onpremise_snuba-api_1
a3f98b91574d        getsentry/snuba:20.11.1                "./docker_entrypoint…"   5 days ago          Up 31 hours                     1218/tcp                       sentry_onpremise_snuba-sessions-consumer_1
5b912a6e80a2        memcached:1.5-alpine                   "docker-entrypoint.s…"   5 days ago          Up 31 hours                     11211/tcp                      sentry_onpremise_memcached_1
307ed037b6fa        postgres:9.6                           "docker-entrypoint.s…"   5 days ago          Up 31 hours                     5432/tcp                       sentry_onpremise_postgres_1
794dbc10e0c7        getsentry/symbolicator:0.3.0           "/bin/bash /docker-e…"   5 days ago          Up 31 hours                     3021/tcp                       sentry_onpremise_symbolicator_1
6c6aed46ae37        tianon/exim4                           "docker-entrypoint.s…"   5 days ago          Up 31 hours                     25/tcp                         sentry_onpremise_smtp_1
891965a15529        confluentinc/cp-kafka:5.5.0            "/etc/confluent/dock…"   5 days ago          Up 31 hours                     9092/tcp                       sentry_onpremise_kafka_1
a6c1560e9a4c        redis:5.0-alpine                       "docker-entrypoint.s…"   5 days ago          Up 31 hours                     6379/tcp                       sentry_onpremise_redis_1
a3826b6a2e99        confluentinc/cp-zookeeper:5.5.0        "/etc/confluent/dock…"   5 days ago          Up 31 hours                     2181/tcp, 2888/tcp, 3888/tcp   sentry_onpremise_zookeeper_1
a70cf55b1a15        yandex/clickhouse-server:20.3.9.70     "/entrypoint.sh"         5 days ago          Up 31 hours                     8123/tcp, 9000/tcp, 9009/tcp   sentry_onpremise_clickhouse_1

Currently, my site isn’t working at all; it just shows “Internal error”.

I’m not sure what happened or where to start looking. Any help would be appreciated!

Blech, no fun. Sorry. :frowning:

I got an alert that the server had gone down.

This happened overnight Sunday into Monday, as before?

What version of Sentry are you running?

Here’s the code underneath prepare_reports (in the latest stable release).

How many organizations do you have in your installation?
How many members max per organization?
How many max projects per organization?
Anything weird going on with Redis? Connection issues? Looks like reports pull from there.

Looks like heavy lifting gets us down into tsdb/base.py. My hunch is that you’ve got some extreme data scenario that is choking the report infra.

Any leads anywhere in there?


I’m running the latest version 20.11.1 (4468076).

I run it for only one organization, and we only have three members. There are five projects total but only four of them are active.

I’m not seeing anything weird with Redis at first glance, but something does appear to be going on with Kafka.

At present, the sentry_onpremise_snuba-subscription-consumer-events_1 container has exited with code 1:

+ exec gosu snuba snuba subscriptions --auto-offset-reset=latest --consumer-group=snuba-events-subscriptions-consumers --topic=events --result-topic=events-subscription-results --dataset=events --commit-log-topic=snuba-commit-log --commit-log-group=snuba-consumers --delay-seconds=60 --schedule-ttl=60
2020-11-30 06:35:25,743 New partitions assigned: {Partition(topic=Topic(name='events'), index=0): 11383}
2020-11-30 06:35:25,744 Caught OffsetOutOfRange('KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}'), shutting down...
Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 33, in <module>
    sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/src/snuba/snuba/cli/subscriptions.py", line 224, in subscriptions
    batching_consumer.run()
  File "/usr/src/snuba/snuba/utils/streams/processing/processor.py", line 109, in run
    self._run_once()
  File "/usr/src/snuba/snuba/utils/streams/processing/processor.py", line 139, in _run_once
    self.__message = self.__consumer.poll(timeout=1.0)
  File "/usr/src/snuba/snuba/subscriptions/consumer.py", line 120, in poll
    message = self.__consumer.poll(timeout)
  File "/usr/src/snuba/snuba/utils/streams/synchronized.py", line 217, in poll
    message = self.__consumer.poll(timeout)
  File "/usr/src/snuba/snuba/utils/streams/backends/kafka.py", line 400, in poll
    raise OffsetOutOfRange(str(error))
snuba.utils.streams.backends.abstract.OffsetOutOfRange: KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}

In the Kafka logs, I see this:

[2020-11-30 12:08:36,254] WARN Unable to reconnect to ZooKeeper service, session 0x10006b24970009b has expired (org.apache.zookeeper.ClientCnxn)
[2020-11-30 12:08:36,263] INFO Creating /brokers/ids/1001 (is it secure? false) (kafka.zk.KafkaZkClient)
[2020-11-30 12:08:36,271] INFO Stat of the created znode at /brokers/ids/1001 is: 3505,3505,1606738116270,1606738116270,1,0,0,72064956843950236,180,0,3505
 (kafka.zk.KafkaZkClient)
[2020-11-30 12:08:36,271] INFO Registered broker 1001 at path /brokers/ids/1001 with addresses: ArrayBuffer(EndPoint(kafka,9092,ListenerName(PLAINTEXT),PLAINTEXT)), czxid (broker epoch): 3505 (kafka.zk.KafkaZkClient)

Zookeeper logs:

[2020-11-28 16:02:25,980] WARN CancelledKeyException causing close of session 0x10006b24970008b (org.apache.zookeeper.server.NIOServerCnxn)
[2020-11-28 16:02:33,763] WARN fsync-ing the write ahead log in SyncThread:0 took 3807ms which will adversely effect operation latency. File size is 67108880 bytes. See the ZooKeeper troubleshooting guide (org.apache.zookeeper.server.persistence.FileTxnLog)
[2020-11-28 16:05:09,681] WARN fsync-ing the write ahead log in SyncThread:0 took 1039ms which will adversely effect operation latency. File size is 67108880 bytes. See the ZooKeeper troubleshooting guide (org.apache.zookeeper.server.persistence.FileTxnLog)
[2020-11-28 16:07:53,556] WARN Unable to read additional data from client sessionid 0x10006b24970008d, likely client has closed socket (org.apache.zookeeper.server.NIOServerCnxn)
[2020-11-28 16:21:43,015] WARN fsync-ing the write ahead log in SyncThread:0 took 3774ms which will adversely effect operation latency. File size is 67108880 bytes. See the ZooKeeper troubleshooting guide (org.apache.zookeeper.server.persistence.FileTxnLog)
[2020-11-28 16:50:23,266] WARN fsync-ing the write ahead log in SyncThread:0 took 4089ms which will adversely effect operation latency. File size is 67108880 bytes. See the ZooKeeper troubleshooting guide (org.apache.zookeeper.server.persistence.FileTxnLog)
[2020-11-28 17:05:03,746] WARN fsync-ing the write ahead log in SyncThread:0 took 1158ms which will adversely effect operation latency. File size is 67108880 bytes. See the ZooKeeper troubleshooting guide (org.apache.zookeeper.server.persistence.FileTxnLog)
[2020-11-30 06:23:01,143] WARN Unable to read additional data from client sessionid 0x10006b249700096, likely client has closed socket (org.apache.zookeeper.server.NIOServerCnxn)
[2020-11-30 06:51:09,687] WARN Unable to read additional data from client sessionid 0x10006b249700098, likely client has closed socket (org.apache.zookeeper.server.NIOServerCnxn)
[2020-11-30 06:51:09,756] WARN Unable to read additional data from client sessionid 0x10006b249700098, likely client has closed socket (org.apache.zookeeper.server.NIOServerCnxn)
[2020-11-30 06:51:09,756] WARN Unable to read additional data from client sessionid 0x10006b249700098, likely client has closed socket (org.apache.zookeeper.server.NIOServerCnxn)
[2020-11-30 06:51:09,756] WARN Unable to read additional data from client sessionid 0x10006b249700098, likely client has closed socket (org.apache.zookeeper.server.NIOServerCnxn)
[2020-11-30 06:51:09,757] WARN Unable to read additional data from client sessionid 0x10006b249700098, likely client has closed socket (org.apache.zookeeper.server.NIOServerCnxn)
[2020-11-30 06:51:09,757] WARN Unable to read additional data from client sessionid 0x10006b249700098, likely client has closed socket (org.apache.zookeeper.server.NIOServerCnxn)
[2020-11-30 07:10:31,989] WARN Unable to read additional data from client sessionid 0x10006b24970009a, likely client has closed socket (org.apache.zookeeper.server.NIOServerCnxn)

So if anything, it looks like the slow fsyncs on that ZooKeeper write-ahead log file are the problem point.

I’m not sure how to resolve this. I can upload my latest install script log if that helps at all.

Googling turned up this:

Wanna see if that’s your issue and give the workaround a try?

P.S. What I googled was “kafka broker offset out of range.”

Cool, thanks for the link. I ran the command to reset the offsets for snuba-events-subscriptions-consumers and restarted the service that had exited. I’ll give it a few days and see how it goes.
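
(For anyone following along, that reset is roughly the following, run from the onpremise directory; the group and topic names come from the traceback above, and the group must have no active members for the reset to apply:)

docker-compose exec kafka bash
# inside the kafka container:
kafka-consumer-groups --bootstrap-server localhost:9092 \
  --group snuba-events-subscriptions-consumers --topic events \
  --reset-offsets --to-latest --execute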

Sorry for the trouble @avio_taylor. Good luck and let us know how it goes!

The server has stayed up since I made the change :partying_face:

But it looks like it is no longer recording events:

I suppose it is possible there was nothing triggered last week, but since I set this up I don’t think I’ve ever had a week of zero errors. Did I screw something up?


Good news on the server staying up! Bad news on the lack of events. I mean, good news if it’s accurate, but I’m with you … seems highly unlikely. :-/

Do you have a known error or test endpoint in your app, or other means of triggering an error explicitly to see if it shows up in your Sentry?

I’ve tried triggering an error manually from my app running locally, but I’m not seeing it show up in our Sentry dashboard. Is there a way to check the Docker service logs to see whether the event from the client app is actually reaching the server?
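
(One way to watch the ingestion path while triggering a test error, assuming the standard onpremise service names, is to tail the relevant containers and look for the incoming POST to /api/<project-id>/store/ or /api/<project-id>/envelope/ in the nginx or relay output:)

docker-compose logs --tail=100 -f nginx relay web ingest-consumer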

Appreciate all your help with these issues!

Unfortunately, the same issue seems to have cropped up again last night. It appears to have started at about midnight Eastern time and has continued through to this morning. I’m unable to connect to the instance via SSH.

When this happened before, I had to reboot the instance from the EC2 dashboard in order to get SSH working again. If I can connect to the instance, will the logs persist from the previous session? If I can’t get at the previous logs, I’m not sure what I can look at to try to track down what’s going on with this setup.
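
(For what it’s worth, with Docker’s default json-file log driver the container logs do survive a host reboot as long as the containers aren’t removed or recreated, so the pre-crash output should still be retrievable afterwards from the onpremise directory; the service names here are just examples:)

docker-compose logs --timestamps --tail=200 kafka zookeeper clickhouse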

It appears that resetting the offsets didn’t stick, and I’m back to the instance going down. It’s not reachable via SSH at the moment (it just hangs after establishing a connection), but the CPU is spiking, which is probably why I can’t tunnel in.

I won’t be able to SSH until I restart the instance. What should I try when I can get in to take a look at what’s going on? What could be causing the CPU to spike like that?

I think you are having event spikes and your hardware resources (probably memory) cannot keep up with this volume.
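
A quick way to confirm that once a shell is available (iostat assumes the sysstat package is installed):

free -m                    # overall memory and swap headroom
docker stats --no-stream   # per-container CPU and memory usage
iostat -x 5 3              # extended disk stats: utilization and write latency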

Thanks for the reply. Are there any detailed hardware requirements for onpremise documented anywhere? The only mention in the repo is 2400 MB of RAM, but I think part of my problem might be slow writes to disk. I’ll look around on the forums for recommendations from other people running onpremise on AWS. Thanks again.

Resetting the Kafka offsets is the correct solution; I must have either passed the wrong group name or not performed this step for all of the groups that were erroring.

I did run docker-compose down and docker-compose up -d kafka before running the commands in the kafka shell to make sure everything stuck.
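
(A sketch of that full sequence, with the consumer group left as a placeholder; repeat the reset for each group that was erroring, and note the exact flags may differ by Kafka version:)

docker-compose down
docker-compose up -d kafka
docker-compose exec kafka bash
# inside the kafka container:
kafka-consumer-groups --bootstrap-server localhost:9092 --list
kafka-consumer-groups --bootstrap-server localhost:9092 \
  --group <group-name> --all-topics \
  --reset-offsets --to-latest --execute
# exit the container, then bring the rest of the stack back up:
docker-compose up -d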

Thanks again for your help with tracking this down.


This topic was automatically closed 15 days after the last reply. New replies are no longer allowed.