Sentry brought PHP app down…help with mitigation

timkelty · April 9, 2020, 2:18pm

We recently started using Sentry (self-hosted) for logging of a Yii2 (Craft CMS) PHP app.
Perhaps naïvely, I set up errors with a minimum level of “warning” to log to Sentry.

This was deployed and seemingly working fine, though we did notice the data sent to Sentry seemed to be a total firehose. Specifically, there was a particular warning getting logged, sometimes multiple times per page request. I knew I should probably either fix the warning or dial in the logging a bit.

Before I had a chance to do that, there was a traffic spike which seemed to overwhelm resources all over our stack and bring the site down. The traffic was nothing the site can’t normally handle, and the only thing that had been deployed recently was Sentry. We frantically removed the Sentry logging and things came back.

Based on what we saw in logs, our guess as to what happened is:

1. The traffic spike caused TONS of Sentry requests.
2. In turn, this overwhelmed our Sentry server, causing the requests to just hang and/or time out.
3. The pending requests caused a high IO wait condition that locked everything up and brought our servers down.

So it seems we may have been naïve in our implementation. We probably rushed things, thinking a dev/logging tool would have more protection against something like this happening.

Our Sentry server is hosted on AWS and well above the recommended specs.

So my questions are:

1. Does our diagnosis sound plausible?
2. Is logging warnings in production like we were doing just a bad idea in general? I’m leaning towards a min level of error in prod, and warning for other envs (staging).
3. What is the right way to mitigate such things? Clearly the app shouldn’t have been spewing warnings like that to begin with, but worst-case scenario, if that happened again – what is the right way to not interfere with the app?
4. I’m using the default HttpTransport with the php sdk. It looks like SpoolTransport might have helped in this situation?
5. Can/should the default HttpTransport be configured with a lower timeout?

I was also surprised this happened right off the bat, and that there wasn’t much warning me against the potential of this happening. I haven’t yet found other horror stories like ours, so I’m wondering if this is happening to others, and if not, what is so unique about what we’re doing. I have to imagine an app inadvertently spewing warnings like this isn’t unheard of.

Here is my Yii2 log component:

        'log' => [
            'targets' => [
                function () {
                    $minLevel = \Monolog\Logger::WARNING;
                    $logger = new \Monolog\Logger('craftcms');
                    $logger->pushHandler(new \Monolog\Handler\StreamHandler('php://stderr', $minLevel));

                    if (!\Craft::$app->getRequest()->isConsoleRequest) {
                        $logger->pushHandler(new \Monolog\Handler\StreamHandler('php://stdout', \Monolog\Logger::DEBUG));
                    }

                    if (CRAFT_ENVIRONMENT !== 'local') {
                        $sentryClient = \Sentry\ClientBuilder::create([
                            'dsn' => getenv('SENTRY_DSN'),
                            'environment' => CRAFT_ENVIRONMENT,
                        ])->getClient();
                        $logger->pushHandler(new \Sentry\Monolog\Handler(new \Sentry\State\Hub($sentryClient), $minLevel));
                    }

                    return \Craft::createObject([
                        'class' => \samdark\log\PsrTarget::class,
                        'except' => ['yii\web\HttpException:40*'],
                        'logVars' => [],
                        'logger' => $logger,
                    ]);
                },
            ]
        ]

stayallive · April 11, 2020, 11:27am

Almost. The Sentry server crashing and not accepting events in a timely manner is certainly the cause of your app going down.

But it’s more likely that all available request handlers (php-fpm workers or php-cgi threads) were saturated by hanging requests waiting for Sentry to either time out or accept the event, which caused the server to stop accepting new requests (or to take a long time accepting them). I doubt much I/O happens when sending Sentry events, since the SDK doesn’t use the disk to write or send them.
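To put rough numbers on the saturation, here is a back-of-the-envelope sketch. All figures (pool size, response time, hang time) are made up for illustration and are not measurements from your setup.

```php
<?php
// Rough capacity of a fixed pool of PHP workers (e.g. php-fpm children).
// Each worker handles one request at a time, so the pool can finish at
// most workers / secondsPerRequest requests per second.
function maxRequestsPerSecond(int $workers, float $secondsPerRequest): float
{
    return $workers / $secondsPerRequest;
}

// All figures below are hypothetical, for illustration only.
$workers = 50;         // size of the worker pool
$normalSeconds = 0.2;  // a request when Sentry answers quickly
$hangSeconds = 30.0;   // extra wait when the Sentry call hangs

$healthy = maxRequestsPerSecond($workers, $normalSeconds);
$degraded = maxRequestsPerSecond($workers, $normalSeconds + $hangSeconds);

printf("healthy: %.1f req/s, degraded: %.2f req/s\n", $healthy, $degraded);
```

With these example numbers the pool drops from about 250 requests per second to fewer than 2, which would explain a site going down under traffic it normally handles.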

No. But it is bad that warnings occur in a “normal” request. My advice is that warnings should not happen in production (the same goes for notices and any other error level). I would say your app should not generate any Sentry events under normal circumstances.

Correct, it shouldn’t. But if that’s the “normal” for your app, you could either stop logging warnings until they’re resolved or apply sampling so that not all events are sent to Sentry.
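For reference, sampling is a client option. A minimal sketch based on the client setup from the first post: `sample_rate` is an option of the PHP SDK, and the 0.25 value is just an arbitrary example, not a recommendation.

```php
<?php
// Same client as in the original config, plus event sampling.
$sentryClient = \Sentry\ClientBuilder::create([
    'dsn' => getenv('SENTRY_DSN'),
    'environment' => CRAFT_ENVIRONMENT,
    // Send roughly 25% of events and drop the rest (example value).
    'sample_rate' => 0.25,
])->getClient();
```

Note that this rate applies to every event the client sees, errors included, so it trades completeness for load.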

I doubt it, since the HttpTransport already waits until the end of the request to send events, and the SpoolTransport would not send fewer events or batch them, so I don’t think it would have helped.

It can, but that requires you to create and initialise the transport yourself; there are no options you can set.

We did recently merge a PR to set more sensible timeouts by default, but it looks like that has not been released yet, so keep an eye out for that or replace the transport so you can set your own timeouts.
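To illustrate what a timeout buys you, independent of how the SDK wires up its transport, here is a sketch using plain cURL. This is not the Sentry transport API; the function names and the timeout values are hypothetical.

```php
<?php
// Build cURL options for a POST that gives up instead of hanging.
// Hypothetical helper for illustration; not part of the Sentry SDK.
function boundedPostOptions(string $url, string $body, int $connectSeconds, int $totalSeconds): array
{
    return [
        CURLOPT_URL => $url,
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => $body,
        CURLOPT_RETURNTRANSFER => true,
        // Maximum time allowed to establish the connection.
        CURLOPT_CONNECTTIMEOUT => $connectSeconds,
        // Maximum time allowed for the whole transfer, connection included.
        CURLOPT_TIMEOUT => $totalSeconds,
    ];
}

// Try to deliver a payload; report failure instead of blocking the worker.
function sendOrDrop(string $url, string $body): bool
{
    $handle = curl_init();
    curl_setopt_array($handle, boundedPostOptions($url, $body, 2, 5));

    // curl_exec() returns false on a timeout or any other failure,
    // so the caller can drop the event and carry on with the request.
    return curl_exec($handle) !== false;
}
```

With bounds like these, a dead or overloaded server costs each worker a few seconds at most rather than tying it up indefinitely.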

Sending many events from the PHP client is usually not good for performance. We can’t really get around that: PHP is single-threaded, so we can only do so much to keep requests fast while still transmitting all events without using something like an external queue (which we don’t want to use). Sentry should let you know when something is wrong, so those warnings are probably a good thing to receive, but they are also a good thing to fix. If that firehose of warnings is resolved, Sentry will likely work much better for you.

I hope my answers help a bit. Let me know if you have follow-up questions.

timkelty · April 13, 2020, 1:04pm

I totally agree. The ecosystem we’re in involves a fair amount of 3rd party plugins, which can make that hard to enforce. It’s a bit of a catch-22, as I want to know about these warnings so I can fix/report them, but also don’t want the server to suffer.

Sampling looks interesting…perhaps I could set up 2 loggers – errors only with 1.0, and the other, logging warnings with a lower sample rate?
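Thinking out loud, the same idea might work without two separate loggers, as a per-level sampling decision. Everything here is hypothetical (the function name, the rates), and it would still need wiring into whatever hook filters events before they are sent, for example the SDK’s `before_send` option.

```php
<?php
// Decide whether to keep an event, given per-level sample rates.
// $roll is a random number in [0, 1); passing it in keeps the function testable.
function shouldSend(string $level, array $ratesByLevel, float $roll): bool
{
    // Levels without an explicit rate are always sent.
    $rate = $ratesByLevel[$level] ?? 1.0;

    return $roll < $rate;
}

// Hypothetical rates: keep every error, about 10% of warnings.
$rates = ['error' => 1.0, 'warning' => 0.1];

// Draw a fresh roll per event.
$roll = mt_rand() / (mt_getrandmax() + 1);
$keepWarning = shouldSend('warning', $rates, $roll);
```

That would keep errors at full fidelity while thinning out the warning firehose.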

Ultimately I think if we had let this marinate on the staging environment longer, we could have seen and fixed the warning deluge. It also looks like the PR “Enforce a timeout for connecting to the server and for the requests instead of waiting indefinitely” (by ste93cry, Pull Request #979, getsentry/sentry-php) will ship soon, which may have helped keep the server afloat.

Thanks so much for the help!
