We recently started using Sentry (self-hosted) for logging of a Yii2 (Craft CMS) PHP app.
Perhaps naïvely, I set up errors with a minimum level of “warning” to log to Sentry.
This was deployed and seemingly working fine, though we did notice that the data sent to Sentry was a total firehose. Specifically, one particular warning was getting logged, sometimes multiple times per page request. I knew I should either fix the warning or dial in the logging a bit.
Before I had a chance to do that, there was a traffic spike which seemed to overwhelm resources all over our stack and bring the site down. The traffic was nothing the site can’t normally handle, and the only thing that had been deployed recently was Sentry. We frantically removed the Sentry logging and things came back.
Based on what we saw in logs, our guess as to what happened is:
- Traffic spike caused TONS of Sentry requests
- In turn, this overwhelmed our Sentry server, causing the requests to just hang and/or time out
- The pending requests caused a high IO wait condition that locked everything up and brought our servers down
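To put rough numbers on that guess, here is a back-of-envelope sketch. Every figure in it is made up for illustration (I don't have exact measurements), and it assumes each send to Sentry stays open until it completes or times out. It just applies Little's law: average in-flight sends = arrival rate × time each send stays open.

```php
<?php
// Back-of-envelope estimate only; all inputs below are hypothetical.
// Little's law: in-flight sends = requests/s * warnings/request * seconds/send.
function inFlightSends(float $requestsPerSecond, float $warningsPerRequest, float $secondsPerSend): float
{
    return $requestsPerSecond * $warningsPerRequest * $secondsPerSend;
}

// Healthy Sentry server: 50 req/s, 2 warnings each, 0.05 s per send.
echo inFlightSends(50, 2, 0.05), PHP_EOL;

// Overwhelmed Sentry server: same traffic, but each send hangs for 5 s.
echo inFlightSends(50, 2, 5.0), PHP_EOL;
```

With the same traffic, the number of sends open at once scales linearly with how long each one hangs, which is why a slow Sentry server could plausibly turn an ordinary spike into an outage.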
So it seems we may have been naïve in our implementation. We probably rushed things, thinking a dev/logging tool would have more protection against something like this happening.
Our Sentry server is hosted on AWS and well above the recommended specs.
So my questions are:
- Does our diagnosis sound plausible?
- Is logging warnings in production like we were doing just a bad idea in general? I’m leaning towards a min level of error in prod, and warning for other envs (staging).
- What is the right way to mitigate such things? Clearly the app shouldn’t have been spewing warnings like that to begin with, but worst-case scenario, if that happened again – what is the right way to not interfere with the app?
- I’m using the default HttpTransport with the PHP SDK. It looks like SpoolTransport might have helped in this situation?
- Can/should the default HttpTransport be configured with a lower timeout?
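For the per-environment minimum level I mentioned above, this is a minimal sketch of what I have in mind. The helper name sentryMinLevel and the environment name 'production' are my own assumptions, and the constants just mirror Monolog's numeric levels (WARNING = 300, ERROR = 400) so the snippet runs on its own.

```php
<?php
// Stand-ins for \Monolog\Logger::WARNING and \Monolog\Logger::ERROR so the
// sketch runs without Monolog installed.
const LEVEL_WARNING = 300;
const LEVEL_ERROR = 400;

// Hypothetical helper: production only sends errors and above to Sentry,
// every other environment (e.g. staging) still sends warnings.
// 'production' is an assumed environment name.
function sentryMinLevel(string $environment): int
{
    return $environment === 'production' ? LEVEL_ERROR : LEVEL_WARNING;
}
```

In my config, the result would replace $minLevel in the Sentry handler only, e.g. sentryMinLevel(CRAFT_ENVIRONMENT), leaving the stderr handler at warning.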
I was also surprised this happened right off the bat, and that there wasn’t much warning me against the potential of this happening. I haven’t yet found other horror stories like ours, so I’m wondering if this is happening to others, and if not, what is so unique about what we’re doing. I have to imagine an app inadvertently spewing warnings like this isn’t unheard of.
Here is my Yii2 log component:
'log' => [
    'targets' => [
        function () {
            $minLevel = \Monolog\Logger::WARNING;
            $logger = new \Monolog\Logger('craftcms');

            // Warnings and above always go to stderr.
            $logger->pushHandler(new \Monolog\Handler\StreamHandler('php://stderr', $minLevel));

            // Web requests additionally log everything (debug and above) to stdout.
            if (!\Craft::$app->getRequest()->isConsoleRequest) {
                $logger->pushHandler(new \Monolog\Handler\StreamHandler('php://stdout', \Monolog\Logger::DEBUG));
            }

            // Every environment except local also sends warnings and above to Sentry.
            if (CRAFT_ENVIRONMENT !== 'local') {
                $sentryClient = \Sentry\ClientBuilder::create([
                    'dsn' => getenv('SENTRY_DSN'),
                    'environment' => CRAFT_ENVIRONMENT,
                ])->getClient();
                $logger->pushHandler(new \Sentry\Monolog\Handler(new \Sentry\State\Hub($sentryClient), $minLevel));
            }

            return \Craft::createObject([
                'class' => \samdark\log\PsrTarget::class,
                'except' => ['yii\web\HttpException:40*'],
                'logVars' => [],
                'logger' => $logger,
            ]);
        },
    ]
]
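One mitigation I'm considering, independent of which transport is used, is capping how many events a single request may send to Sentry. Below is a minimal, untested-in-production sketch of that idea. EventBudget is a hypothetical class of my own, not part of the Sentry SDK or Monolog, and it assumes PHP 7.4+ for typed properties. How best to wire it in front of the Sentry handler is exactly the kind of thing I'd like advice on.

```php
<?php
// Hypothetical per-request guard: allows at most $limit events, then refuses.
// Nothing here is part of Sentry's or Monolog's API.
final class EventBudget
{
    private int $remaining;

    public function __construct(int $limit)
    {
        // Negative limits are treated as "allow nothing".
        $this->remaining = max(0, $limit);
    }

    // Returns true and spends one unit while budget remains; false afterwards.
    public function allow(): bool
    {
        if ($this->remaining === 0) {
            return false;
        }
        $this->remaining--;
        return true;
    }
}

// Usage: with a budget of 2, the third event in the same request is dropped.
$budget = new EventBudget(2);
var_dump($budget->allow(), $budget->allow(), $budget->allow());
```

The point of the design is that a warning firing repeatedly within one request could cost at most a fixed number of Sentry sends, no matter how noisy the app gets.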