Snuba stuck restarting: "no such option: --dataset"

I have Sentry 10.1.0.dev0 installed on a Debian 9 box via the install script, and it stopped receiving events today. This is separate from, but related to, my earlier thread, since everything had been running fine before. I did upgrade my Sentry server using install.sh; if you need the install log, let me know (it’s too big to post).

Ran docker-compose ps:

               Name                                 Command                 State                 Ports
-----------------------------------------------------------------------------------------------------------------------
sentry_onpremise_clickhouse_1                /entrypoint.sh                   Up           8123/tcp, 9000/tcp, 9009/tcp
sentry_onpremise_cron_1                      /bin/sh -c exec /docker-en ...   Up           9000/tcp
sentry_onpremise_kafka_1                     /etc/confluent/docker/run        Up           9092/tcp
sentry_onpremise_memcached_1                 docker-entrypoint.sh memcached   Up           11211/tcp
sentry_onpremise_post-process-forwarder_1    /bin/sh -c exec /docker-en ...   Up           9000/tcp
sentry_onpremise_postgres_1                  docker-entrypoint.sh postgres    Up           5432/tcp
sentry_onpremise_redis_1                     docker-entrypoint.sh redis ...   Up           6379/tcp
sentry_onpremise_sentry-cleanup_1            /entrypoint.sh 0 0 * * * g ...   Up           9000/tcp
sentry_onpremise_smtp_1                      docker-entrypoint.sh exim  ...   Up           25/tcp
sentry_onpremise_snuba-api_1                 ./docker_entrypoint.sh api       Up           1218/tcp
sentry_onpremise_snuba-cleanup_1             /entrypoint.sh */5 * * * * ...   Up           1218/tcp
sentry_onpremise_snuba-consumer_1            ./docker_entrypoint.sh con ...   Restarting
sentry_onpremise_snuba-outcomes-consumer_1   ./docker_entrypoint.sh con ...   Restarting
sentry_onpremise_snuba-replacer_1            ./docker_entrypoint.sh rep ...   Up           1218/tcp
sentry_onpremise_symbolicator-cleanup_1      /entrypoint.sh 55 23 * * * ...   Up           3021/tcp
sentry_onpremise_symbolicator_1              /bin/bash /docker-entrypoi ...   Up           3021/tcp
sentry_onpremise_web_1                       /bin/sh -c exec /docker-en ...   Up           127.0.0.1:9000->9000/tcp
sentry_onpremise_worker_1                    /bin/sh -c exec /docker-en ...   Up           9000/tcp
sentry_onpremise_zookeeper_1                 /etc/confluent/docker/run        Up           2181/tcp, 2888/tcp, 3888/tcp

Ran docker-compose logs snuba-consumer:

snuba-consumer_1           | + '[' c = - ']'
snuba-consumer_1           | + snuba consumer --help
snuba-consumer_1           | + set -- snuba consumer --dataset events --auto-offset-reset=latest --max-batch-time-ms 750
snuba-consumer_1           | + set gosu snuba snuba consumer --dataset events --auto-offset-reset=latest --max-batch-time-ms 750
snuba-consumer_1           | + exec gosu snuba snuba consumer --dataset events --auto-offset-reset=latest --max-batch-time-ms 750
snuba-consumer_1           | Error: no such option: --dataset
It repeats for a while...

Ran docker-compose logs snuba-outcomes-consumer:

snuba-outcomes-consumer_1  | + snuba consumer --help
snuba-outcomes-consumer_1  | + set -- snuba consumer --dataset outcomes --auto-offset-reset=earliest --max-batch-time-ms 750
snuba-outcomes-consumer_1  | + set gosu snuba snuba consumer --dataset outcomes --auto-offset-reset=earliest --max-batch-time-ms 750
snuba-outcomes-consumer_1  | + exec gosu snuba snuba consumer --dataset outcomes --auto-offset-reset=earliest --max-batch-time-ms 750
snuba-outcomes-consumer_1  | Error: no such option: --dataset
Also repeats for a while...

Any help would be appreciated.

You need to pull from the on-premise repo. Specifically, you need this PR merged in:
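For reference, updating an existing checkout and re-running the installer looks roughly like this (a sketch, assuming a standard clone of the on-premise repo):

cd onpremise          # or wherever the repo was cloned
git pull              # picks up the fixed service commands in docker-compose.yml
./install.sh          # pulls new images and re-runs the bootstrap/migrations
docker-compose up -d  # restarts the stack with the new definitions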

Ok, I’ll try that out. Also, I get a lot of these messages during install. Is this normal?

cimpl.KafkaException: KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"}
2020-05-11 20:37:19,357 Connection to Kafka failed (attempt 15)
Traceback (most recent call last):
  File "/usr/src/snuba/snuba/cli/bootstrap.py", line 56, in bootstrap
    client.list_topics(timeout=1)

EDIT: I did pull the latest version with git and ran the install again; it went through 59 of the above errors and ended with this:

cimpl.KafkaException: KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"}
Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 11, in <module>
    load_entry_point('snuba', 'console_scripts', 'snuba')()
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/src/snuba/snuba/cli/bootstrap.py", line 56, in bootstrap
    client.list_topics(timeout=1)
cimpl.KafkaException: KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"}
Cleaning up...
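Judging by the traceback, the bootstrap step just keeps retrying client.list_topics() against the broker, so the next thing to check is whether Kafka itself ever came up. Standard docker-compose commands are enough for that:

docker-compose ps kafka zookeeper        # are the containers up or restarting?
docker-compose logs --tail=50 kafka      # last lines of the broker log
docker-compose logs --tail=50 zookeeper  # last lines of the zookeeper log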

I gathered some more logs from various containers:

docker-compose logs kafka

kafka_1                    | [main-SendThread(zookeeper:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server zookeeper/172.19.0.3:2181. Will not attempt to authenticate using SASL (unknown error)
kafka_1                    | [main-SendThread(zookeeper:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /172.19.0.10:37118, server: zookeeper/172.19.0.3:2181
kafka_1                    | [main-SendThread(zookeeper:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x0 for server zookeeper/172.19.0.3:2181, unexpected error, closing socket connection and attempting reconnect
kafka_1                    | java.io.IOException: Connection reset by peer
kafka_1                    |    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
kafka_1                    |    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
kafka_1                    |    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
kafka_1                    |    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
kafka_1                    |    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
kafka_1                    |    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:75)
kafka_1                    |    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:363)
kafka_1                    |    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1223)

docker-compose logs zookeeper

zookeeper_1                | ===> Launching ...
zookeeper_1                | ===> Launching zookeeper ...
zookeeper_1                | [2020-05-12 15:21:46,729] WARN Either no config or no quorum defined in config, running  in standalone mode (org.apache.zookeeper.server.quorum.QuorumPeerMain)
zookeeper_1                | [2020-05-12 15:21:46,842] WARN o.e.j.s.ServletContextHandler@4d95d2a2{/,null,UNAVAILABLE} contextPath ends with /* (org.eclipse.jetty.server.handler.ContextHandler)
zookeeper_1                | [2020-05-12 15:21:46,842] WARN Empty contextPath (org.eclipse.jetty.server.handler.ContextHandler)
zookeeper_1                | [2020-05-12 15:21:46,938] ERROR Unexpected exception, exiting abnormally (org.apache.zookeeper.server.ZooKeeperServerMain)
zookeeper_1                | java.io.IOException: No snapshot found, but there are log entries. Something is broken!
zookeeper_1                |    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:240)
zookeeper_1                |    at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
zookeeper_1                |    at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:290)
zookeeper_1                |    at org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:450)
zookeeper_1                |    at org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:764)
zookeeper_1                |    at org.apache.zookeeper.server.ServerCnxnFactory.startup(ServerCnxnFactory.java:98)
zookeeper_1                |    at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:144)
zookeeper_1                |    at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:106)
zookeeper_1                |    at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:64)
zookeeper_1                |    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:128)
zookeeper_1                |    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:82)

Literally all I’ve done is git pull and run install.sh, although at first I did run the install script before pulling anything down, in case that could have messed something up. At this point it seems like zookeeper is broken.

Yup, this is a known issue with zookeeper unfortunately. We’ll be adding an automated fix soon, but until then you can run docker volume rm sentry-zookeeper && docker volume create --name sentry-zookeeper to fix it.
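If you want to double-check that the recreated volume really is empty before bringing things back up, plain docker is enough:

docker run --rm -v sentry-zookeeper:/data alpine ls -la /data   # should list an empty directory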


Thank you good sir, it’s working again.


Hey,

I just updated the on-premise Sentry instance on our Debian box to Sentry 10.1.0.dev0 by running ./install.sh. I am getting the exact same errors that you posted above.

So I did a git pull to get the latest changes and ran ./install.sh again, which also printed the KafkaException: KafkaError{code=_TRANSPORT [...] multiple times for me.

Unfortunately docker volume rm sentry-zookeeper && docker volume create --name sentry-zookeeper does not do the trick for me.

docker volume ls gives me the following output:

DRIVER              VOLUME NAME
local               0a2cb31dab08a438a508b77b51fd144d7ee34226444db254af7296228b69ed61
local               0fbe9c58a7ab7617851c588404a6d71456f5edf977aa52d10d2c86d15910614f
local               3d08e4b57d68dab696e1019bf4e726f093fa49dababe09496e89f4004e35c224
local               4e457d766ecc8d6a78475e08fe658058023d7979defd8da3292382f16ea7a977
local               6a6a487e144fb9c72e325e09d75cb06595692e43fbdf1e73a0bbf8563ab83c4c
local               9f75bf1599442ac7fae2f99e1e2ef0a805bee49c49dcc7a0f212ae3d6cc324a2
local               4478e387ba9b4c32dceca5b600b1d0e5e27f03af6076079fafe19d4a2495307e
local               77209715264f3b3a0cc19030be5d37cdd8c34e1c2c1a0608e658ff4ce807079b
local               ad8109ba3de00d22efdaf0b31864e9903b15b61049ff9e82e568ddf0945142e1
local               b856f5fca4f2338a9dc10a3f84a086ddbca509df216fdd6197996b051c729020
local               d6f19963dddb57d10abaec575802a5e160b08f146a9709bafa276d4e6512a9a7
local               dfa20d838e74a27e1bf23163df455312097951e59cc11b2473248f5a86bfa6a5
local               dff80ad769afea9ce08e6ac4b406844418107d4973bd0228cb2b833a124fda03
local               e8b2ce88ad2613ee77053a4531d17704d0be45aea939bb104951a116ce4da9cb
local               nextcloud_db
local               nextcloud_nextcloud
local               sentry-clickhouse
local               sentry-data
local               sentry-kafka
local               sentry-postgres
local               sentry-redis
local               sentry-symbolicator
local               sentry-zookeeper
local               sentry_onpremise_sentry-clickhouse-log
local               sentry_onpremise_sentry-kafka-log
local               sentry_onpremise_sentry-secrets
local               sentry_onpremise_sentry-smtp
local               sentry_onpremise_sentry-smtp-log
local               sentry_onpremise_sentry-zookeeper-log

After running docker volume rm sentry-zookeeper && docker volume create --name sentry-zookeeper I started all containers using docker-compose up, but still got the java.io.IOException: No snapshot found, but there are log entries. Something is broken! message for zookeeper_1.

What am I doing wrong?

Can you also try docker volume rm sentry_onpremise_sentry-zookeeper-log? This should be done while the system is down, so run the following:

docker-compose stop
docker volume rm sentry_onpremise_sentry-zookeeper-log
docker volume rm sentry-zookeeper && docker volume create --name sentry-zookeeper
docker-compose up -d
docker-compose logs -f zookeeper

Hi BYK,

Thank you, that works! I had to:

docker-compose stop
docker volume rm sentry_onpremise_sentry-zookeeper-log
docker volume rm sentry-zookeeper
docker volume rm sentry-kafka
docker volume rm sentry_onpremise_sentry-kafka-log

And then recreate them with docker volume create --name <volume_name>
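Concretely, the recreation step was just one docker volume create per removed volume:

docker volume create --name sentry-zookeeper
docker volume create --name sentry-kafka
docker volume create --name sentry_onpremise_sentry-zookeeper-log
docker volume create --name sentry_onpremise_sentry-kafka-log

(As far as I can tell, the sentry_onpremise_* ones are project-scoped and docker-compose up would have recreated them on its own anyway; the plain sentry-* ones are external volumes and do need manual creation.)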

Now Kafka and Zookeeper appear to be working again (they were both in a restart loop before). However, relay is giving me issues now:

relay_1 | 2020-05-15T11:20:23Z [relay::cli] ERROR: relay has no credentials, which are required in managed mode. Generate some with "relay credentials generate" first.

I realized there is now a relay folder within the onpremise folder, which contains a config.yml. I’m not quite sure how to generate credentials; could you help me out with that one?

Edit:
I tried to generate the relay credentials by running:

docker exec -it sentry_onpremise_relay_1 /bin/bash
root@b367d6f99467:/work# relay credentials generate
 ERROR relay::cli > could not write config file (file /work/.relay/credentials.json)
 caused by: Read-only file system (os error 30)

I assume I somehow have to generate a credentials.json within the onpremise/relay/ directory?
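One approach that might work, judging by the /work/.relay path in the error above (a sketch; it assumes the getsentry/relay image’s entrypoint is the relay binary itself):

# run from the onpremise directory so ./relay is mounted writable at /work/.relay
docker run --rm -v "$(pwd)/relay:/work/.relay" getsentry/relay credentials generate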

Edit 2:
I managed to create the credentials.json by setting the mode in relay/config.yml to proxy, then running relay credentials generate in the container, and then setting the mode back to managed. After starting the container I got:

relay_1                    |   caused by: Permission denied (os error 13)
relay_1                    | error: could not open config file (file /work/.relay/credentials.json)

So I chmodded onpremise/relay/credentials.json to 777, and now I get the following error:

relay_1                    | 2020-05-15T11:57:30Z [relay_server::actors::upstream] ERROR: authentication encountered error: upstream request returned error 401 Unauthorized

Where do I have to put the credentials in order to get this working?

Running ./install.sh should take care of generating the credentials and putting them in the right place for you. Did that not work for some reason? Relevant lines are:

I’d recommend deleting the credentials file, undoing your permission changes (444 should suffice), and running ./install.sh to see if it helps or not. Make sure you have the latest version of the on-premise repo before running, though.
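In shell terms, roughly (assuming the repo lives in a directory called onpremise):

cd onpremise
git pull                   # make sure the on-premise repo is current first
rm relay/credentials.json  # drop the hand-made credentials
./install.sh               # should regenerate credentials.json in the right place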

I tried running ./install.sh with the latest version of the on-premise repo yesterday and it somehow did not work correctly.

The odd part is that, for some reason I still haven’t figured out, the install script had not generated the credentials file. I removed the self-generated credentials.json and re-ran ./install.sh yesterday, and this time it generated the credentials.json and everything fired up correctly (to my knowledge).

At first, at least one project was not able to receive events (I have no idea why), but I re-ran ./install.sh after rebooting the machine and now everything seems to be working again. I’ve just fired a test exception and it is being logged to Sentry.

3/10 projects have received events so far; I am going to test the other projects on Monday if they do not show any log entries by then.

Thank you for your help!

The script does not generate a new credentials file if it finds one already in place; it tries to use that one instead. Maybe this is the reason?
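Paraphrasing, the guard behaves something like this (a sketch of the logic, not the literal install.sh contents):

# reuse an existing credentials file; only generate when none is found
if [ ! -f relay/credentials.json ]; then
  docker run --rm -v "$(pwd)/relay:/work/.relay" getsentry/relay credentials generate
fi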

I can’t tell, as I don’t remember exactly which steps we took.

At least all projects seem to be working again. Thank you for the help.
