Create sample event: Failed to load sample event

Hello,

When I click the “Create sample event” button, I get a “Failed to load sample event” pop-up after a few seconds. In the Sentry logs I see:

May 22 01:12:56 sentry sentry[97661]: 192.168.0.1 - - [21/May/2020:20:12:54 +0000] "POST /api/0/projects/fmk/emz-php-test/create-sample/ HTTP/1.0" 200 6457 "https://sentry.findmykids.org/organizations/fmk/issues/?project=3" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0"
May 22 01:12:56 sentry sentry[97661]: 192.168.0.1 - - [21/May/2020:20:12:56 +0000] "POST /api/2/store/ HTTP/1.0" 200 680 "-" "Go-http-client/2.0"
May 22 01:12:56 sentry syslogd: last message repeated 6 times
May 22 01:12:57 sentry sentry[97661]: 192.168.0.1 - - [21/May/2020:20:12:57 +0000] "POST /api/2/store/ HTTP/1.0" 200 680 "-" "Go-http-client/2.0"
May 22 01:12:57 sentry syslogd: last message repeated 15 times
May 22 01:12:57 sentry sentry[97661]: 20:12:57 [WARNING] django.request: Not Found: /api/0/issues/64/events/latest/ (status_code=404 request=<WSGIRequest: GET u'/api/0/issues/64/events/latest/'>)
May 22 01:12:57 sentry sentry[97661]: 192.168.0.1 - - [21/May/2020:20:12:57 +0000] "POST /api/2/store/ HTTP/1.0" 200 680 "-" "Go-http-client/2.0"
May 22 01:12:58 sentry syslogd: last message repeated 3 times
May 22 01:12:58 sentry sentry[97661]: 192.168.0.1 - - [21/May/2020:20:12:57 +0000] "GET /api/0/issues/64/events/latest/ HTTP/1.0" 404 1195 "https://sentry.findmykids.org/organizations/fmk/issues/?project=3" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0"
May 22 01:12:58 sentry sentry[97661]: 192.168.0.1 - - [21/May/2020:20:12:58 +0000] "POST /api/2/store/ HTTP/1.0" 200 680 "-" "Go-http-client/2.0"
May 22 01:12:58 sentry syslogd: last message repeated 17 times
May 22 01:12:58 sentry sentry[97661]: 20:12:58 [WARNING] django.request: Not Found: /api/0/issues/64/events/latest/ (status_code=404 request=<WSGIRequest: GET u'/api/0/issues/64/events/latest/'>)
May 22 01:12:58 sentry sentry[97661]: 192.168.0.1 - - [21/May/2020:20:12:58 +0000] "GET /api/0/issues/64/events/latest/ HTTP/1.0" 404 1195 "https://sentry.findmykids.org/organizations/fmk/issues/?project=3" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0"
May 22 01:12:58 sentry sentry[97661]: 192.168.0.1 - - [21/May/2020:20:12:58 +0000] "POST /api/2/store/ HTTP/1.0" 200 680 "-" "Go-http-client/2.0"
May 22 01:12:59 sentry syslogd: last message repeated 1 times
May 22 01:12:59 sentry sentry[97661]: 192.168.0.1 - - [21/May/2020:20:12:59 +0000] "POST /api/2/store/ HTTP/1.0" 200 680 "-" "Go-http-client/2.0"
May 22 01:13:00 sentry syslogd: last message repeated 19 times
May 22 01:13:00 sentry sentry[97661]: 192.168.0.1 - - [21/May/2020:20:13:00 +0000] "POST /api/2/store/ HTTP/1.0" 200 680 "-" "Go-http-client/2.0"
May 22 01:13:01 sentry syslogd: last message repeated 21 times
May 22 01:13:01 sentry sentry[97661]: 192.168.0.1 - - [21/May/2020:20:13:01 +0000] "POST /api/2/store/ HTTP/1.0" 200 680 "-" "Go-http-client/2.0"
May 22 01:13:01 sentry syslogd: last message repeated 16 times
May 22 01:13:01 sentry sentry[97661]: 20:13:01 [WARNING] django.request: Not Found: /api/0/issues/64/events/latest/ (status_code=404 request=<WSGIRequest: GET u'/api/0/issues/64/events/latest/'>)

I also get the 404 JSON reply in the browser, with the payload “No events found for group”.

Clearly something is not going well, but at this point I have no idea where to investigate further. I’ve checked the Relay logs (it seems to be running, but there are no messages at all after it registers) and Snuba - there don’t seem to be any errors in either.

Could you please point me in the right direction?

Thanks.

Can you check your post-process-forwarder service logs along with Kafka and Zookeeper logs?

Sure, and it seems to quit right after starting (run as sentry run post-process-forwarder):

May 22 11:46:13 sentry sentry[83900]: The configured event stream backend does not need a forwarder process to enqueue post-process tasks. Exiting...

I have this in my config exactly as it is mentioned in the Snuba docs:

SENTRY_SEARCH = 'sentry.search.snuba.EventsDatasetSnubaSearchBackend'
SENTRY_TSDB = 'sentry.tsdb.redissnuba.RedisSnubaTSDB'
SENTRY_EVENTSTREAM = 'sentry.eventstream.snuba.SnubaEventStream'

I’ve tried running Sentry with the autogenerated config, but then it keeps telling me in the logs that my Sentry/Snuba stack is misconfigured.

The Kafka and Zookeeper logs don’t seem to have any disturbing records (although I’m not that familiar with Kafka). Zookeeper does have some WARN records about sockets being closed by the remote side, but other than that - really nothing. If you could tell me what to search for, I will examine them more thoroughly.

I also have lots of these in the Sentry log:

May 22 17:52:51 sentry sentry[42708]: 12:52:51 [ERROR] sentry.errors.events: preprocess.failed.empty (cache_key=u'e:e44fc7c52ab04396b565d3d62f6ca97b:2')
May 22 17:52:51 sentry sentry[42708]: 12:52:51 [ERROR] sentry.errors.events: preprocess.failed.empty (cache_key=u'e:575c9f863a4b40a682bb171a14d1b52c:2')
May 22 17:52:51 sentry sentry[42708]: 12:52:51 [ERROR] sentry.errors.events: preprocess.failed.empty (cache_key=u'e:1b4d0c6905484b2590460f3ab63d40c8:2')

(I don’t know whether it’s related, just in case.)

What is more disturbing is that the Relay log is silent. It launches, registers successfully and then does nothing at all. I enabled the DEBUG level, but it simply rescans the project configs (I do have one) and that’s all - no reaction to the events. Is there a way to verify, from the Sentry side, that Sentry is communicating with it? Am I right in assuming that a Relay misconfiguration leads to this inability to process events?

This needs to be SENTRY_EVENTSTREAM = "sentry.eventstream.kafka.KafkaEventStream" and there is one more line you need to add. See the updated configuration:
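(The quoted block didn’t survive in this thread; as a rough, minimal sketch, the relevant part of the on-premise sentry.conf.py looks something like the following - the broker address is only an example and should point at your actual Kafka host:)

# sentry.conf.py (sketch; adjust the broker address for your setup)
SENTRY_EVENTSTREAM = "sentry.eventstream.kafka.KafkaEventStream"

# Sentry must reach the same Kafka cluster that Snuba and Relay use
KAFKA_CLUSTERS = {
    "default": {
        "bootstrap.servers": "kafka:9092",
    },
}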

Can you detail your installation method, as this should not happen with the on-premise repo?

Relay should be contacted directly for event ingestion; that’s why we now have Nginx in front of everything, directing event requests to Relay and the rest to Sentry. If you don’t have this, it is normal to see Relay doing nothing.
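Roughly, the split the bundled nginx.conf makes looks like this (upstream names and ports are examples; the point is that project-scoped /api/<project_id>/... requests go to Relay, while /api/0/ - the Web API - stays with Sentry). The upstream blocks live in the http{} context, the locations in your vhost’s server{} block:

# upstream addresses are examples - adjust to where Relay and sentry web listen
upstream relay      { server 127.0.0.1:3000; }
upstream sentry_web { server 127.0.0.1:9000; }

# SDK store endpoint without a project id in the path
location /api/store/ {
    proxy_pass http://relay;
}
# project-scoped ingestion endpoints: project ids start at 1,
# so /api/0/ (the Web API) never matches this regexp
location ~ ^/api/[1-9]\d*/ {
    proxy_pass http://relay;
}
# everything else (UI, /api/0/ Web API) goes to Sentry
location / {
    proxy_pass http://sentry_web;
}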

Thanks a lot, it seems that at some point I switched to the generated Sentry config instead of the one provided by the on-premise repo. I’ve put these lines into the actual configuration, along with the ones you quoted (and did a thorough review comparing my config to the repo one), but when it comes to the events, nothing really changed - I still get these (only this time the POST /api/2/store/ requests go to Relay, where they really should):

May 22 19:36:56 sentry sentry[38357]: 14:36:56 [WARNING] django.request: Not Found: /api/0/issues/79/events/latest/ (status_code=404 request=<WSGIRequest: GET u'/api/0/issues/79/events/latest/'>)
May 22 19:36:57 sentry sentry[38357]: 14:36:57 [WARNING] django.request: Not Found: /api/0/issues/79/events/latest/ (status_code=404 request=<WSGIRequest: GET u'/api/0/issues/79/events/latest/'>)
May 22 19:36:58 sentry sentry[38357]: 14:36:58 [WARNING] django.request: Not Found: /api/0/issues/79/events/latest/ (status_code=404 request=<WSGIRequest: GET u'/api/0/issues/79/events/latest/'>)

Sure, the thing is that I run some of the Sentry stack services without docker, so… things got a bit complicated.

Cool, now that you mention it, I realized I should take a look at nginx.conf and put those locations straight into our main reverse-proxy vhost config. So now this enormous amount of POST /api/2/store/ requests goes to Relay (getting 200s, although the Relay log still doesn’t mention them, and I’m quite sure about this because I’m observing the traffic in extended nginx logs and in tcpdump). But the browser still gets a 404 when XHR’ing /api/0/issues/79/events/latest/, and that goes back to Sentry, because 0 is excluded from Relay’s nginx regexp location. So, if the browser sends a POST to the Sentry instance and then tries to GET the event from another location in Sentry, what can cause a 404 there?

Thanks.

This still doesn’t provide much detail :smiley:

I’m not sure how chatty Relay is at its default INFO logging level, but you may want to increase that first to see whether it is doing something and succeeding, or doing nothing at all: https://getsentry.github.io/relay/options/#logging
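(For reference, the level lives in Relay’s config.yml, along these lines:)

# .relay/config.yml - raise the level to see per-event activity
logging:
  level: trace    # default is info; debug/trace show individual events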

All of this makes me think there’s something wrong with your Kafka setup or with the other services’ communication with Kafka. What does your post-process-forwarder service say after your config changes?

Well, all of those services are on an external host: ClickHouse, Kafka, Redis, PostgreSQL, memcached. Yup, I am aware of how name resolution works in Docker networking, so I not only changed the corresponding directives in the various configs, but also provided hosts-file aliases for the services. So far there’s no reason to assume that some service completely lacks connectivity. Although there may be edge cases…

Sentry, Snuba and Relay are compiled from source, all of them from the latest master.

As for Relay: it was running at the DEBUG log level, I switched it to TRACE, and now - yup, it seems to be about as chatty as the rate of POST /api/*/store/ requests coming from Sentry. Relay seems to be communicating with Kafka, as I observe loads of these going in cycles:

May 23 00:15:17 sentry relay[53726]: 2020-05-22T19:15:17Z [relay_server::actors::events] TRACE: queued event
May 23 00:15:17 sentry relay[53726]: 2020-05-22T19:15:17Z [relay_server::actors::events] TRACE: processing json event
May 23 00:15:17 sentry relay[53726]: 2020-05-22T19:15:17Z [relay_server::actors::events] TRACE: sending envelope to kafka
May 23 00:15:17 sentry relay[53726]: 2020-05-22T19:15:17Z [relay_server::actors::store] TRACE: Sending event item of envelope to kafka

These go through without any errors.

As for the post-process-forwarder service - yup, you’re absolutely right, there’s something wrong with it in my setup: although it’s not exiting now, it seems to just stay idle and does absolutely nothing. I’m currently trying to get to the cause of this behavior. You’ve already helped me a lot, and I really appreciate it, but if you have any idea what it may be, please let me know.

This is the full log of the sentry post-process-forwarder launched in the foreground:

[sentry@sentry:~]> /usr/local/bin/sentry run post-process-forwarder --commit-batch-size 1
19:10:18 [WARNING] sentry.utils.geo: settings.GEOIP_PATH_MMDB not configured.
19:10:38 [INFO] sentry.plugins.github: apps-not-configured
%4|1590174641.395|CONFWARN|rdkafka#consumer-1| [thrd:app]: Configuration property produce.offset.report is deprecated: No longer used.
%4|1590174641.395|CONFWARN|rdkafka#consumer-2| [thrd:app]: Configuration property produce.offset.report is deprecated: No longer used.

Thanks.


Oooh, my bad. It seems I found it. I was running plain Apache Kafka instead of the Confluent Kafka distribution.


I redid the Kafka part of the stack and I’m now running Confluent, but, unluckily for me, nothing has changed. I can see the topics in the Confluent web interface, but it seems that only one of the three has consumers:

Am I right? It seems the outcomes and events queues aren’t being handled. Is Snuba supposed to do this? At the same time, in the Snuba logs I only see various 200’d POSTs to /query and nothing else (except a dogstatsd error which seems to be harmless).
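(If it helps, the same thing can be checked with the CLI that ships with Kafka/Confluent; the consumer group name below is only an example of what I’d expect, not necessarily what a given setup uses:)

# list every consumer group the broker knows about
kafka-consumer-groups --bootstrap-server kafka:9092 --list

# show which topics/partitions a group reads and its current lag
# ("snuba-consumers" is an example group name)
kafka-consumer-groups --bootstrap-server kafka:9092 --describe --group snuba-consumers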

Solved that too: it seems the Kafka/Confluent hostname MUST also be resolvable, because Confluent hands it out inside the connect descriptor. By the way, the most frustrating part is that Snuba doesn’t report this error; I found it using strace. :confused: It really should report that in its logs…

But events still aren’t streaming, or even being created… :confused:

I’m also getting quite a few of these:

19:33:55 [WARNING] sentry.eventstore.snuba.backend: eventstore.missing-snuba-event (event_id=u'410270b14ce046f2a0821c6ef4eb2d4d' project_id=2L)

Jeez, guys…

The solution is right there:

So, basically there are two kinds of issues:

  • Snuba doesn’t report Kafka/Confluent connectivity issues (nope, it does not).
  • Snuba has the UTC timezone hardcoded, so ClickHouse must run in UTC too. It would be nice to at least have a config knob for that!

This is quite weird, as we have had reports mentioning connectivity issues from Snuba. Could it be your logging level or something else?

I was running Snuba with SNUBA_LOG_LEVEL="WARNING", so I doubt that (but it still may be, if Snuba reports this at NOTICE/INFO/DEBUG).

There’s actually one thing: I believe Snuba will report a primary connectivity issue with DEFAULT_BROKERS, but in my case that connectivity was okay. What was not okay is that Kafka/Confluent was running on a host named, say, foobar. After the initial contact, Kafka/Confluent returned its hostname as part of the connect descriptor for Snuba, and then the connection stalled. As soon as I fixed direct name resolution for foobar, Snuba started processing the event flow.
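For anyone hitting the same thing: what the broker hands back after the initial contact is its advertised listener address, so either that name has to resolve on the Snuba/Sentry/Relay hosts (DNS or hosts file), or the broker should advertise an address they can actually reach. In a plain Confluent install this lives in server.properties; the values below are only examples:

# /etc/kafka/server.properties (example values)
listeners=PLAINTEXT://0.0.0.0:9092
# this is the address clients receive in the metadata response, so it must
# be resolvable/reachable from Snuba, Sentry and Relay
advertised.listeners=PLAINTEXT://foobar:9092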

I don’t quite understand this part. Can you elaborate a bit more please so we may add warnings/errors for this?

The thing is, all of these issues are masked in the provided getsentry/onpremise bundle:

  • name resolution is handled by the Docker internal resolver, so the Confluent hostname is always resolvable by default.
  • ClickHouse runs in UTC because you don’t pass it a TZ env variable in the stock docker-compose.yml, so theoretically you may not even be fully aware of UTC being hardcoded inside Snuba (I actually found this in the code, but I’m too lazy to quote it now - it’s greppable, so I hope you will forgive me). A sketch of how to pin the timezone follows below.
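For completeness, this is roughly how I’d pin a bare-metal ClickHouse to UTC (paths and the drop-in approach are the stock ones as far as I know; in the dockerized setup the same effect would come from passing TZ=UTC to the clickhouse container):

<!-- e.g. /etc/clickhouse-server/config.d/timezone.xml - force the server
     timezone to UTC, since Snuba assumes UTC when querying ClickHouse -->
<yandex>
    <timezone>UTC</timezone>
</yandex>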

As for the error I mentioned in my previous post - the L7 handshake, as I see it from strace, looks like this:

  1. First, the usual TCP handshake happens, with the usual name resolution before it. Snuba takes the host name from its config. SYN - SYN+ACK - ACK, you know, the usual stuff.
  2. Then the protocol handshake happens. Kafka says something like (I’m not familiar with Kafka, so I’m describing my findings from strace/tcpdump): “Okay, man, my name is <the hostname Kafka figured out from syscalls>, meet me there.”
  3. This is where the loss of connectivity happens: if Snuba is unable to resolve the hostname mentioned, it is now stuck. The sad thing is that at this stage it never logs that! But in the Docker environment this [almost] never happens. Or, if it still happens (sometimes, as we both know, 127.0.0.11 just silently dies), this explains the cries for help in similar cases.

As for why I don’t use the dockerized environment (I am fully aware that you will probably disagree):

  • For production use, the database layer should be decontainerized (as this makes it simpler to manage). That takes out ClickHouse, Redis, memcached and Postgres.
  • For production use, Confluent should be decontainerized (same reason; furthermore, it’s installable from the YUM/APT repos, which also makes it simpler to update).
  • So at this point all that’s left is Snuba, Relay and Sentry itself (yeah, I’m aware of minidumps, but we don’t develop that kind of software).

So this is why I decontainerized all of the stuff.
