Snuba stuck restarting: "no such option: --dataset"

I have Sentry 10.1.0.dev0 installed on a Debian 9 box via the install script, and it stopped receiving events today. This is separate from, but related to, my earlier thread, since everything had been running fine before. I did upgrade my Sentry server using install.sh; if you need the install log, let me know (it’s too big to post).

Ran docker-compose ps:

               Name                                 Command                 State                 Ports
-----------------------------------------------------------------------------------------------------------------------
sentry_onpremise_clickhouse_1                /entrypoint.sh                   Up           8123/tcp, 9000/tcp, 9009/tcp
sentry_onpremise_cron_1                      /bin/sh -c exec /docker-en ...   Up           9000/tcp
sentry_onpremise_kafka_1                     /etc/confluent/docker/run        Up           9092/tcp
sentry_onpremise_memcached_1                 docker-entrypoint.sh memcached   Up           11211/tcp
sentry_onpremise_post-process-forwarder_1    /bin/sh -c exec /docker-en ...   Up           9000/tcp
sentry_onpremise_postgres_1                  docker-entrypoint.sh postgres    Up           5432/tcp
sentry_onpremise_redis_1                     docker-entrypoint.sh redis ...   Up           6379/tcp
sentry_onpremise_sentry-cleanup_1            /entrypoint.sh 0 0 * * * g ...   Up           9000/tcp
sentry_onpremise_smtp_1                      docker-entrypoint.sh exim  ...   Up           25/tcp
sentry_onpremise_snuba-api_1                 ./docker_entrypoint.sh api       Up           1218/tcp
sentry_onpremise_snuba-cleanup_1             /entrypoint.sh */5 * * * * ...   Up           1218/tcp
sentry_onpremise_snuba-consumer_1            ./docker_entrypoint.sh con ...   Restarting
sentry_onpremise_snuba-outcomes-consumer_1   ./docker_entrypoint.sh con ...   Restarting
sentry_onpremise_snuba-replacer_1            ./docker_entrypoint.sh rep ...   Up           1218/tcp
sentry_onpremise_symbolicator-cleanup_1      /entrypoint.sh 55 23 * * * ...   Up           3021/tcp
sentry_onpremise_symbolicator_1              /bin/bash /docker-entrypoi ...   Up           3021/tcp
sentry_onpremise_web_1                       /bin/sh -c exec /docker-en ...   Up           127.0.0.1:9000->9000/tcp
sentry_onpremise_worker_1                    /bin/sh -c exec /docker-en ...   Up           9000/tcp
sentry_onpremise_zookeeper_1                 /etc/confluent/docker/run        Up           2181/tcp, 2888/tcp, 3888/tcp

Ran docker-compose logs snuba-consumer:

snuba-consumer_1           | + '[' c = - ']'
snuba-consumer_1           | + snuba consumer --help
snuba-consumer_1           | + set -- snuba consumer --dataset events --auto-offset-reset=latest --max-batch-time-ms 750
snuba-consumer_1           | + set gosu snuba snuba consumer --dataset events --auto-offset-reset=latest --max-batch-time-ms 750
snuba-consumer_1           | + exec gosu snuba snuba consumer --dataset events --auto-offset-reset=latest --max-batch-time-ms 750
snuba-consumer_1           | Error: no such option: --dataset
It repeats for a while...

Ran docker-compose logs snuba-outcomes-consumer:

snuba-outcomes-consumer_1  | + snuba consumer --help
snuba-outcomes-consumer_1  | + set -- snuba consumer --dataset outcomes --auto-offset-reset=earliest --max-batch-time-ms 750
snuba-outcomes-consumer_1  | + set gosu snuba snuba consumer --dataset outcomes --auto-offset-reset=earliest --max-batch-time-ms 750
snuba-outcomes-consumer_1  | + exec gosu snuba snuba consumer --dataset outcomes --auto-offset-reset=earliest --max-batch-time-ms 750
snuba-outcomes-consumer_1  | Error: no such option: --dataset
Also repeats for a while...

Any help would be appreciated.

You need to pull from the on-premise repo. Specifically, you need this PR merged in:
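For reference, updating an existing checkout and re-running the installer looks roughly like this (a sketch, assuming a standard clone of the on-premise repo):

cd onpremise          # or wherever the repo was cloned
git pull              # picks up the fixed service commands in docker-compose.yml
./install.sh          # pulls new images and re-runs the bootstrap/migrations
docker-compose up -d  # restarts the stack with the new definitions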

Ok, I’ll try that out. Also, I get a lot of these messages during install. Is this normal?

cimpl.KafkaException: KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"}
2020-05-11 20:37:19,357 Connection to Kafka failed (attempt 15)
Traceback (most recent call last):
  File "/usr/src/snuba/snuba/cli/bootstrap.py", line 56, in bootstrap
    client.list_topics(timeout=1)

EDIT: I did pull the latest version with git and ran the install again; it went through 59 of the above errors and ended with this:

cimpl.KafkaException: KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"}
Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 11, in <module>
    load_entry_point('snuba', 'console_scripts', 'snuba')()
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/usr/src/snuba/snuba/cli/bootstrap.py", line 56, in bootstrap
    client.list_topics(timeout=1)
cimpl.KafkaException: KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"}
Cleaning up...
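Judging by the traceback, the bootstrap step just keeps retrying client.list_topics() against the broker, so the next thing to check is whether Kafka itself ever came up. Standard docker-compose commands are enough for that:

docker-compose ps kafka zookeeper        # are the containers up or restarting?
docker-compose logs --tail=50 kafka      # last lines of the broker log
docker-compose logs --tail=50 zookeeper  # last lines of the zookeeper log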

I gathered some more logs from various containers:

docker-compose logs kafka

kafka_1                    | [main-SendThread(zookeeper:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server zookeeper/172.19.0.3:2181. Will not attempt to authenticate using SASL (unknown error)
kafka_1                    | [main-SendThread(zookeeper:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /172.19.0.10:37118, server: zookeeper/172.19.0.3:2181
kafka_1                    | [main-SendThread(zookeeper:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x0 for server zookeeper/172.19.0.3:2181, unexpected error, closing socket connection and attempting reconnect
kafka_1                    | java.io.IOException: Connection reset by peer
kafka_1                    |    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
kafka_1                    |    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
kafka_1                    |    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
kafka_1                    |    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
kafka_1                    |    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
kafka_1                    |    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:75)
kafka_1                    |    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:363)
kafka_1                    |    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1223)

docker-compose logs zookeeper

zookeeper_1                | ===> Launching ...
zookeeper_1                | ===> Launching zookeeper ...
zookeeper_1                | [2020-05-12 15:21:46,729] WARN Either no config or no quorum defined in config, running  in standalone mode (org.apache.zookeeper.server.quorum.QuorumPeerMain)
zookeeper_1                | [2020-05-12 15:21:46,842] WARN o.e.j.s.ServletContextHandler@4d95d2a2{/,null,UNAVAILABLE} contextPath ends with /* (org.eclipse.jetty.server.handler.ContextHandler)
zookeeper_1                | [2020-05-12 15:21:46,842] WARN Empty contextPath (org.eclipse.jetty.server.handler.ContextHandler)
zookeeper_1                | [2020-05-12 15:21:46,938] ERROR Unexpected exception, exiting abnormally (org.apache.zookeeper.server.ZooKeeperServerMain)
zookeeper_1                | java.io.IOException: No snapshot found, but there are log entries. Something is broken!
zookeeper_1                |    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:240)
zookeeper_1                |    at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:240)
zookeeper_1                |    at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:290)
zookeeper_1                |    at org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:450)
zookeeper_1                |    at org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:764)
zookeeper_1                |    at org.apache.zookeeper.server.ServerCnxnFactory.startup(ServerCnxnFactory.java:98)
zookeeper_1                |    at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:144)
zookeeper_1                |    at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:106)
zookeeper_1                |    at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:64)
zookeeper_1                |    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:128)
zookeeper_1                |    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:82)

Literally all I’ve done is git pull and run install.sh, although at first I did run the install script before pulling anything down, in case that could have messed something up. At this point it seems like zookeeper is broken.

Yup, this is a known issue with zookeeper unfortunately. We’ll be adding an automated fix soon, but until then you can run docker volume rm sentry-zookeeper && docker volume create --name sentry-zookeeper to fix it.
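If you want to double-check that the recreated volume really is empty before bringing things back up, plain docker is enough:

docker run --rm -v sentry-zookeeper:/data alpine ls -la /data   # should list an empty directory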


Thank you good sir, it’s working again.


Hey,

I just updated the on-premise Sentry instance on our Debian box to Sentry 10.1.0.dev0 by running ./install.sh. I am getting the exact same errors that you posted above.

So I did a git pull to get the latest changes and ran ./install.sh again, which also printed the KafkaException: KafkaError{code=_TRANSPORT [...] multiple times for me.

Unfortunately docker volume rm sentry-zookeeper && docker volume create --name sentry-zookeeper does not do the trick for me.

docker volume ls gives me the following output:

DRIVER              VOLUME NAME
local               0a2cb31dab08a438a508b77b51fd144d7ee34226444db254af7296228b69ed61
local               0fbe9c58a7ab7617851c588404a6d71456f5edf977aa52d10d2c86d15910614f
local               3d08e4b57d68dab696e1019bf4e726f093fa49dababe09496e89f4004e35c224
local               4e457d766ecc8d6a78475e08fe658058023d7979defd8da3292382f16ea7a977
local               6a6a487e144fb9c72e325e09d75cb06595692e43fbdf1e73a0bbf8563ab83c4c
local               9f75bf1599442ac7fae2f99e1e2ef0a805bee49c49dcc7a0f212ae3d6cc324a2
local               4478e387ba9b4c32dceca5b600b1d0e5e27f03af6076079fafe19d4a2495307e
local               77209715264f3b3a0cc19030be5d37cdd8c34e1c2c1a0608e658ff4ce807079b
local               ad8109ba3de00d22efdaf0b31864e9903b15b61049ff9e82e568ddf0945142e1
local               b856f5fca4f2338a9dc10a3f84a086ddbca509df216fdd6197996b051c729020
local               d6f19963dddb57d10abaec575802a5e160b08f146a9709bafa276d4e6512a9a7
local               dfa20d838e74a27e1bf23163df455312097951e59cc11b2473248f5a86bfa6a5
local               dff80ad769afea9ce08e6ac4b406844418107d4973bd0228cb2b833a124fda03
local               e8b2ce88ad2613ee77053a4531d17704d0be45aea939bb104951a116ce4da9cb
local               nextcloud_db
local               nextcloud_nextcloud
local               sentry-clickhouse
local               sentry-data
local               sentry-kafka
local               sentry-postgres
local               sentry-redis
local               sentry-symbolicator
local               sentry-zookeeper
local               sentry_onpremise_sentry-clickhouse-log
local               sentry_onpremise_sentry-kafka-log
local               sentry_onpremise_sentry-secrets
local               sentry_onpremise_sentry-smtp
local               sentry_onpremise_sentry-smtp-log
local               sentry_onpremise_sentry-zookeeper-log

After running docker volume rm sentry-zookeeper && docker volume create --name sentry-zookeeper I started all containers using docker-compose up, but still got the java.io.IOException: No snapshot found, but there are log entries. Something is broken! message for zookeeper_1.

What am I doing wrong?

Can you also try docker volume rm sentry_onpremise_sentry-zookeeper-log? This should be done while the system is down, so run the following:

docker-compose stop
docker volume rm sentry_onpremise_sentry-zookeeper-log
docker volume rm sentry-zookeeper && docker volume create --name sentry-zookeeper
docker-compose up -d
docker-compose logs -f zookeeper

Hi BYK,

Thank you, that works! I had to:

docker-compose stop
docker volume rm sentry_onpremise_sentry-zookeeper-log
docker volume rm sentry-zookeeper
docker volume rm sentry-kafka
docker volume rm sentry_onpremise_sentry-kafka-log

And then recreate them with docker volume create --name <volume_name>
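Concretely, the recreation step was just one docker volume create per removed volume:

docker volume create --name sentry-zookeeper
docker volume create --name sentry-kafka
docker volume create --name sentry_onpremise_sentry-zookeeper-log
docker volume create --name sentry_onpremise_sentry-kafka-log

(As far as I can tell, the sentry_onpremise_* ones are project-scoped and docker-compose up would have recreated them on its own anyway; the plain sentry-* ones are external volumes and do need manual creation.)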

Now Kafka and Zookeeper appear to be working again (they were both in a restart loop before). However, relay is giving me issues now:

relay_1 | 2020-05-15T11:20:23Z [relay::cli] ERROR: relay has no credentials, which are required in managed mode. Generate some with "relay credentials generate" first.

I realized there is now a relay folder within the onpremise folder, which contains a config.yml. I’m not quite sure how to generate credentials; could you help me out with that one?

Edit:
I tried to generate the relay credentials by running:

docker exec -it sentry_onpremise_relay_1 /bin/bash
root@b367d6f99467:/work# relay credentials generate
 ERROR relay::cli > could not write config file (file /work/.relay/credentials.json)
 caused by: Read-only file system (os error 30)

I assume I somehow have to generate a credentials.json within the onpremise/relay/ directory?
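One approach that might work, judging by the /work/.relay path in the error above (a sketch; it assumes the getsentry/relay image’s entrypoint is the relay binary itself):

# run from the onpremise directory so ./relay is mounted writable at /work/.relay
docker run --rm -v "$(pwd)/relay:/work/.relay" getsentry/relay credentials generate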

Edit 2:
I managed to create the credentials.json by setting the mode in relay/config.yml to proxy, then running relay credentials generate in the container, and then setting the mode back to managed. After starting the container I got:

relay_1                    |   caused by: Permission denied (os error 13)
relay_1                    | error: could not open config file (file /work/.relay/credentials.json)

So I chmodded onpremise/relay/credentials.json to 777, and now I get the following error:

relay_1                    | 2020-05-15T11:57:30Z [relay_server::actors::upstream] ERROR: authentication encountered error: upstream request returned error 401 Unauthorized

Where do I have to put the credentials in order to get this working?

Running ./install.sh should take care of generating the credentials and putting them in the right place for you. Did that not work for some reason? Relevant lines are:

I’d recommend deleting the credentials file, undoing your permission changes (444 should suffice), and running ./install.sh to see if it helps or not. Make sure you have the latest version of the on-premise repo before running, though.
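In shell terms, roughly (assuming the repo lives in a directory called onpremise):

cd onpremise
git pull                   # make sure the on-premise repo is current first
rm relay/credentials.json  # drop the hand-made credentials
./install.sh               # should regenerate credentials.json in the right place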

I tried running ./install.sh with the latest version of the on-premise repo yesterday and it somehow did not work correctly.

The odd part is that, for some reason I still haven’t figured out, the install script had not generated the credentials file. I removed the self-generated credentials.json and re-ran ./install.sh yesterday, and this time it generated the credentials.json and everything fired up correctly (to my knowledge).

At first, at least one project was not able to receive events (I have no idea why), but I re-ran ./install.sh after rebooting the machine and now everything seems to be working again. I’ve just fired a test exception and it is being logged to Sentry.

3/10 projects have received events so far; I am going to test the other projects on Monday if they do not show any log entries by then.

Thank you for your help!

The script does not generate a new credentials file if it finds one already in place; it tries to use that one instead. Maybe this is the reason?
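Paraphrasing, the guard behaves something like this (a sketch of the logic, not the literal install.sh contents):

# reuse an existing credentials file; only generate when none is found
if [ ! -f relay/credentials.json ]; then
  docker run --rm -v "$(pwd)/relay:/work/.relay" getsentry/relay credentials generate
fi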

I can’t tell, as I don’t remember exactly which steps we took.

At least all projects seem to be working again. Thank you for the help.
