Relay errors in fresh new on premise install

The plot thickens!

relay_1 | 2021-01-07T12:06:47Z [actix::actors::resolver] WARN: Can not create system dns resolver: io error

This log line indicates that the underlying library will not use the system resolver (docker’s one!), but will fall back to a default which will be Google’s DNS which obviously can’t resolve Docker compose internal web.

Now the question is why the system resolver is not used. The error logging comes from here and the reading/parsing happens here.

The contents of /etc/resolv.conf in the relay container:

nameserver 127.0.0.11
options edns0 trust-ad ndots:0

So either reading /etc/resolv.conf or parsing throws an error. I’ll try to debug this further.

Like pea soup!

Good luck @stefanw, let us know what you find. And good lead, @JanMalte. :slight_smile:

I read lots of Rust code, even built some Rust to test things out, but could not figure out how the io error from the logs is triggered.

But I found that relay supports multiple http client libraries as it is moving from actix-web-client to reqwest. The DNS problems come from the resolver that actix-web uses so I configured relay with the other library like this in relay/config.yml:

http:
  _client: "reqwest"

Docker compose down/up, then relay uses the other http library and (after a couple of seconds) it works!

Changing the http client library actually resolves the DNS AAAA request issue, but the DNS resover issue still appears:

[actix::actors::resolver] WARN: Can not create system dns resolver: io error

And the relay service still does not operate well. The error message changed to:

[relay_server::actors::upstream] ERROR: authentication encountered error: upstream request returned error 400 Bad Request
    caused by: No valid json body

This error did not happened before, so this might be a different issue or gets caused by changing the http client library, I do not know.

Full log with logging set to DEBUG. The following lines appear over and over again.

[relay_server::actors::upstream] INFO: registering with upstream (http://web:9000/)
[relay_server::actors::upstream] DEBUG: Recovering from network outage.
[relay_server::actors::upstream] ERROR: authentication encountered error: upstream request returned error 400 Bad Request
    caused by: No valid json body

btw: do I need to add the credentials from relay/credentials.json anywhere?

Today I stumbled across this issue while migrating our on-prem installation onto another system and upgrading from 9.0 to 20.12.1.

I got the exact same error as shown in JanMalte’s post.

The key here is this line:

relay_1 | 2021-01-07T12:06:47Z [actix::actors::resolver] WARN: Can not create system dns resolver: io error

After having spent most of the day debugging this issue, to me it looks like this:

relay version 20.12.1 uses actix version 0.7.9 which in turn uses trust-dns-resolver(1) 0.10.0. This version doesn’t recognize newer resolver options in resolv.conf - in my case it was the trust-ad option that led to an exception InvalidOption(17). That exception is unfortunately masked by actix and only leads to the generic Can not create system dns resolver: io error message. I compiled and ran an example resolver(2) provided by trust-dns to debug this issue to confirm that removing the problematic option makes the error go away.

To workaround this issue, I have added this section to my docker-compose.yml:

  relay:
    entrypoint:
      - sh
      - -c
      - |
        # NOTE: Using temp file since `sed -i` fails with `Device or resource busy`.
        tmp_resolv_conf="$$(mktemp)"
        sed '/^options /s/\<trust-ad\>//g' /etc/resolv.conf >"$$tmp_resolv_conf"
        cp "$$tmp_resolv_conf" /etc/resolv.conf
        rm -f "$$tmp_resolv_conf"
        exec bash /docker-entrypoint.sh run

This simply removes the trust-ad option from resolv.conf and this way makes the resolver errors go away and event ingestion working again.

A real fix would involve updating trust-dns-resolver to a newer version. I tested the latest version, 0.20.0, using the CLI client(3) provided with this version and confirmed it parses resolv.conf correctly even with trust-ad option enabled.

But 0.20.0 is only provided with actix starting from 0.11.0-beta.1 (the latest at the time of writing). I don’t know if there’s an earlier version of trust-dns-resolver that already has a fix and is shipped with earlier versions of actix…

Links:

  1. Github: /actix/actix/blob/0.7/Cargo.toml#L67
  2. Github: /bluejekyll/trust-dns/blob/r0.10.0/crates/resolver/examples/global_resolver.rs
  3. Github: /bluejekyll/trust-dns/tree/v0.20.0#using-the-trust-dns-resolver-cli

(Sorry for the obfuscated links, but I was greeted with a friendly “new users are not allowed to post more than 2 links” message…)

3 Likes

I can confirm that @thomasw 's solution worked for me. Swapping over to the reqwest http client as @stefanw suggested did not however. I continually got the No valid json body error as JanMalte reported.

I’m not sure what the best upstream fix is, but this works for me for now.

Thank you so much for all this debugging @thomasw! I relayed this information to our Relay team (pun inteded) and hoping to get a fix out soon.

In the meantime, maybe we can patch the Docker image or the compose file with your fixes?

1 Like

@BYK Adding the workaround upstream would be very welcome so I can get rid of the modification in my docker-compose.yml. I’d think adding it to the docker image’s entrypoint script should be most appropriate.

Also, I dug a little bit deeper and figured that trust-dns-resolver uses another crate, resolv-conf, to parse resolv.conf. This added support for trust-ad in version 0.7.0, which is the latest at the time of writing. In addition it also added support for the no-reload option which I haven’t seen in the wild yet, but might good to know about nevertheless. Version 0.7.0 is used by trust-dns-resolver starting from 0.20.0 (which is also the latest) which in turn is used as already said by actix starting from 0.11.0-beta.1 (also the latest). So, given it might take a while until a version of actix containing a fix is ready for prime time it might indeed be worth adding the workaround.

1 Like

@thomasw Do you know you can override most docker-compose settings in a separate docker-compose.override.yml file?
This i the was I did for applying the workaround without messing with the upstream docker-compose.yml.

docker-compose.override.yml

version: '3.4'
services:
  relay:
    sysctls:
      net.ipv6.conf.all.disable_ipv6: 1
      net.ipv6.conf.default.disable_ipv6: 1
      net.ipv6.conf.lo.disable_ipv6: 1
    entrypoint:
        - sh
        - -c
        - |
          # NOTE: Using temp file since `sed -i` fails with `Device or resource busy`.
          tmp_resolv_conf="$$(mktemp)"
          sed '/^options /s/\<trust-ad\>//g' /etc/resolv.conf >"$$tmp_resolv_conf"
          cp "$$tmp_resolv_conf" /etc/resolv.conf
          rm -f "$$tmp_resolv_conf"
          exec bash /docker-entrypoint.sh run
1 Like

Gotcha! Didn’t know that. Thanks a lot for the hint!
But still, would prefer not having to maintain local modifications. :slight_smile:

1 Like

@thomasw @JanMalte I opted to force Relay to use reqwest as we are already planning to switch over to that. This should resolve the issue as @stefanw mentioned above:

Just modify your relay/config.yml file accordingly and let us know how it goes please. If this causes any headaches, I’ll revert back to resolv.conf modification path.

Thank you all for helping us figure this out! :heart:

@BYK I did as you suggested, reverted my modifications to docker-compose.yml and modified relay/config.yml so it uses reqwest.

Unfortunately I get largely the same results as @JanMalte reported previously.

Full log of relay server:

2021-01-14T16:40:19Z [rdkafka::client] ERROR: librdkafka: FAIL [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.12:9092 failed: Connection refused (after 2ms in state CONNECT)
2021-01-14T16:40:19Z [rdkafka::client] ERROR: librdkafka: Global error: BrokerTransportFailure (Local: Broker transport failure): kafka:9092/bootstrap: Connect to ipv4#172.18.0.12:9092 failed: Connection refused (after 2ms in state CONNECT)
2021-01-14T16:40:19Z [rdkafka::client] ERROR: librdkafka: Global error: AllBrokersDown (Local: All broker connections are down): 1/1 brokers are down
2021-01-14T16:40:19Z [actix::actors::resolver] WARN: Can not create system dns resolver: io error
2021-01-14T16:40:19Z [rdkafka::client] ERROR: librdkafka: FAIL [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.12:9092 failed: Connection refused (after 57ms in state CONNECT)
2021-01-14T16:40:19Z [rdkafka::client] ERROR: librdkafka: Global error: BrokerTransportFailure (Local: Broker transport failure): kafka:9092/bootstrap: Connect to ipv4#172.18.0.12:9092 failed: Connection refused (after 57ms in state CONNECT)
2021-01-14T16:40:19Z [rdkafka::client] ERROR: librdkafka: Global error: AllBrokersDown (Local: All broker connections are down): 1/1 brokers are down
2021-01-14T16:40:19Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
  caused by: could not send request using reqwest
  caused by: error sending request for url (http://web:9000/api/0/relays/register/challenge/): error trying to connect: dns error: failed to lookup address information: Temporary failure in name resolution
2021-01-14T16:40:19Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
  caused by: could not send request using reqwest
  caused by: error sending request for url (http://web:9000/api/0/relays/register/challenge/): error trying to connect: dns error: failed to lookup address information: Temporary failure in name resolution
2021-01-14T16:40:20Z [rdkafka::client] ERROR: librdkafka: FAIL [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.12:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
2021-01-14T16:40:20Z [rdkafka::client] ERROR: librdkafka: Global error: BrokerTransportFailure (Local: Broker transport failure): kafka:9092/bootstrap: Connect to ipv4#172.18.0.12:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
2021-01-14T16:40:20Z [rdkafka::client] ERROR: librdkafka: FAIL [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.12:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
2021-01-14T16:40:20Z [rdkafka::client] ERROR: librdkafka: Global error: BrokerTransportFailure (Local: Broker transport failure): kafka:9092/bootstrap: Connect to ipv4#172.18.0.12:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
2021-01-14T16:40:23Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
  caused by: could not send request using reqwest
  caused by: error sending request for url (http://web:9000/api/0/relays/register/challenge/): error trying to connect: operation timed out
2021-01-14T16:40:25Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
  caused by: could not send request using reqwest
  caused by: error sending request for url (http://web:9000/api/0/relays/register/challenge/): error trying to connect: tcp connect error: Connection refused (os error 111)
2021-01-14T16:40:27Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
  caused by: could not send request using reqwest
  caused by: error sending request for url (http://web:9000/api/0/relays/register/challenge/): error trying to connect: tcp connect error: Connection refused (os error 111)
2021-01-14T16:40:30Z [relay_server::actors::upstream] WARN: Network outage, scheduling another check in 0ns
2021-01-14T16:40:30Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
  caused by: could not send request using reqwest
  caused by: error sending request for url (http://web:9000/api/0/relays/register/challenge/): error trying to connect: tcp connect error: Connection refused (os error 111)
2021-01-14T16:40:30Z [relay_server::actors::upstream] WARN: Network outage, scheduling another check in 1s
2021-01-14T16:40:31Z [relay_server::actors::upstream] WARN: Network outage, scheduling another check in 1.5s
2021-01-14T16:40:33Z [relay_server::actors::upstream] WARN: Network outage, scheduling another check in 2.25s
2021-01-14T16:40:35Z [relay_server::actors::upstream] WARN: Network outage, scheduling another check in 3.375s
2021-01-14T16:40:36Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
  caused by: could not send request using reqwest
  caused by: error sending request for url (http://web:9000/api/0/relays/register/challenge/): error trying to connect: tcp connect error: Connection refused (os error 111)
2021-01-14T16:40:39Z [relay_server::actors::upstream] WARN: Network outage, scheduling another check in 5.0625s
2021-01-14T16:40:43Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
  caused by: could not send request using reqwest
  caused by: error sending request for url (http://web:9000/api/0/relays/register/challenge/): error trying to connect: tcp connect error: Connection refused (os error 111)
2021-01-14T16:40:44Z [relay_server::actors::upstream] WARN: Network outage, scheduling another check in 7.59375s
2021-01-14T16:40:50Z [rdkafka::client] ERROR: librdkafka: FAIL [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.12:9092 failed: Connection refused (after 0ms in state CONNECT, 30 identical error(s) suppressed)
2021-01-14T16:40:50Z [rdkafka::client] ERROR: librdkafka: Global error: BrokerTransportFailure (Local: Broker transport failure): kafka:9092/bootstrap: Connect to ipv4#172.18.0.12:9092 failed: Connection refused (after 0ms in state CONNECT, 30 identical error(s) suppressed)
2021-01-14T16:40:50Z [rdkafka::client] ERROR: librdkafka: FAIL [thrd:kafka:9092/bootstrap]: kafka:9092/bootstrap: Connect to ipv4#172.18.0.12:9092 failed: Connection refused (after 0ms in state CONNECT, 30 identical error(s) suppressed)
2021-01-14T16:40:50Z [rdkafka::client] ERROR: librdkafka: Global error: BrokerTransportFailure (Local: Broker transport failure): kafka:9092/bootstrap: Connect to ipv4#172.18.0.12:9092 failed: Connection refused (after 0ms in state CONNECT, 30 identical error(s) suppressed)
2021-01-14T16:40:51Z [relay_server::actors::upstream] WARN: Network outage, scheduling another check in 11.390625s
2021-01-14T16:40:55Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
  caused by: could not send request using reqwest
  caused by: error sending request for url (http://web:9000/api/0/relays/register/challenge/): error trying to connect: tcp connect error: Connection refused (os error 111)
2021-01-14T16:41:03Z [relay_server::actors::upstream] WARN: Network outage, scheduling another check in 17.0859375s
2021-01-14T16:41:17Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
  caused by: could not send request using reqwest
  caused by: error sending request for url (http://web:9000/api/0/relays/register/challenge/): operation timed out
2021-01-14T16:41:25Z [relay_server::actors::upstream] WARN: Network outage, scheduling another check in 25.62890625s
2021-01-14T16:41:47Z [relay_server::actors::upstream] ERROR: authentication encountered error: could not send request to upstream
  caused by: could not send request using reqwest
  caused by: error sending request for url (http://web:9000/api/0/relays/register/challenge/): operation timed out
2021-01-14T16:41:55Z [relay_server::actors::upstream] WARN: Network outage, scheduling another check in 38.443359375s
2021-01-14T16:42:26Z [relay_server::actors::upstream] ERROR: authentication encountered error: upstream request returned error 400 Bad Request
  caused by: No valid json body
2021-01-14T16:43:23Z [relay_server::actors::upstream] ERROR: authentication encountered error: upstream request returned error 400 Bad Request
  caused by: No valid json body
2021-01-14T16:44:23Z [relay_server::actors::upstream] ERROR: authentication encountered error: upstream request returned error 400 Bad Request
  caused by: No valid json body
2021-01-14T16:45:23Z [relay_server::actors::upstream] ERROR: authentication encountered error: upstream request returned error 400 Bad Request
  caused by: No valid json body
2021-01-14T16:46:23Z [relay_server::actors::upstream] ERROR: authentication encountered error: upstream request returned error 400 Bad Request
  caused by: No valid json body
2021-01-14T16:47:23Z [relay_server::actors::upstream] ERROR: authentication encountered error: upstream request returned error 400 Bad Request
  caused by: No valid json body
2021-01-14T16:48:23Z [relay_server::actors::upstream] ERROR: authentication encountered error: upstream request returned error 400 Bad Request
  caused by: No valid json body
2021-01-14T16:49:23Z [relay_server::actors::upstream] ERROR: authentication encountered error: upstream request returned error 400 Bad Request
  caused by: No valid json body
2021-01-14T16:50:23Z [relay_server::actors::upstream] ERROR: authentication encountered error: upstream request returned error 400 Bad Request
  caused by: No valid json body

As can be seen in the logs it uses reqwest as intended but upstream requests fail with No valid json body resulting in event ingestion to fail…

Which version of the relay docker image is this? There are some bugfixes in master that will go out in the scheduled release tomorrow.

It is getsentry/relay:20.12.1. I’ll happily test against a newer version when available.

Can you just try with getsentry/relay:nightly as I’ll pull this back before the release tomorrow if it is still failing with this.

@BYK nightly works for me with reqwest – without any problems! Thumbs up!

1 Like