Extremely large initial download. Login takes 1+ minutes. What is this?

dginther · July 6, 2021, 11:06pm

When login works, it sometimes takes 2 minutes. Why is it downloading so much, just to log in?

Here is a failed attempt.
It times out on the gateway.
[Screenshot: Screen Shot 2021-07-06 at 5.05.43 PM]

Why in the world am I downloading over 500MB just to log into the server? Anytime I click a link from outside, it tries to do this as well. Help!

priscilawebdev · July 9, 2021, 9:00am

Thank you! We will look into that.

chadwhitacre · July 13, 2021, 1:29pm

What version are you running, @dginther?

Does this look like a known issue to you @priscilawebdev? What do you think might be going on?

dginther · July 13, 2021, 2:24pm

@chadwhitacre 21.6.1, currently

priscilawebdev · July 15, 2021, 2:41pm

I hope to work on this soon. We have been updating a bunch of config and libraries so it could be a bug on our side.

dginther · July 15, 2021, 6:19pm

We have been doing some more work on troubleshooting this. We enabled some APM on this and this is the result we came up with:

We have increased all the timeouts we can find, and we can now log in more consistently without a failure, but the sentry-web pods now consistently show what I would call ridiculous memory usage:

which is causing evictions of the pods due to resource limitations.

I understand that on-premise support is ‘best effort’ and that the solution Sentry offers is “Use our cloud offering,” but we are unable to use your cloud offering because of ATO/Compliance issues, as well as cost issues. This is a pretty frustrating place to be in, to be honest.

dginther · July 16, 2021, 4:47pm

Digging in further, we have found the following:
This was happening because requests were hitting timeouts. We started by raising the sentry-web timeout itself from the default 30 seconds to 300. This helped tremendously once you were logged into the interface, if you could get that far. Unfortunately, we were still seeing a lot of 504 errors.

This was due to the nginx proxy that sits in front of the sentry-web instances having its own timeout of 60 seconds. Raising this one as well definitely improved our ability to log in, but not the speed, and 502 errors are still occurring. I believe this is due to the instability of the sentry-web instances and the resources they are demanding. Ultimately, their memory usage runs away until the request dies or k8s evicts the pod because it is using too many resources.

Memory usage for sentry-web skyrocketing - note that this is only one instance at the moment but when multiple users are trying to access the gui this will multiply.

We then brought up another Sentry instance, with no data, to compare. This is the production login:

and this is the non-production login:

We enabled APM tracing to see what was happening and we see the query that seems to be taking a long while:

We then ran that query against the postgres (in RDS) database:

and you can see it is 25 million rows and just over 1GB of data.

Admittedly, we have a lot of events; I’m just trying to understand how we might make this a better experience for our users.

ElijahLynn · July 20, 2021, 11:09pm

I work with Demian and am helping out on this issue. In the Chrome network panel, I was able to “copy as curl” the network request to our instance at http://sentry.vfs.va.gov/api/0/organizations/vsp/?detailed=0. I saved the response to a local file, which was 256MB of JSON, then ran cat sentry.output | python -m json.tool. Most of the results are of this type, with user: null:

{
    "task": "setup_user_context",
    "status": "complete",
    "user": null,
    "completionSeen": null,
    "dateCompleted": "2021-07-19T18:43:55.286403Z",
    "data": {}
},
{
    "task": "setup_user_context",
    "status": "complete",
    "user": null,
    "completionSeen": null,
    "dateCompleted": "2021-07-19T18:43:55.337710Z",
    "data": {}
},
{
    "task": "setup_user_context",
    "status": "complete",
    "user": null,
    "completionSeen": null,
    "dateCompleted": "2021-07-19T18:43:55.417010Z",
    "data": {}
},
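
To see which task types dominate a saved response like this one, the entries can be tallied with a short script. This is only a minimal sketch and not part of Sentry: `count_tasks` and `count_tasks_in_file` are hypothetical helper names, and the code walks the whole JSON document looking for objects with a `task` key, so it assumes nothing about where in the response the entries live. It loads the entire file into memory, which for a 256MB file may need considerably more RAM than the file size.

```python
import json
from collections import Counter


def count_tasks(node, counts=None):
    """Recursively tally every JSON object that has a "task" key.

    Returns a Counter keyed by (task, status).
    """
    if counts is None:
        counts = Counter()
    if isinstance(node, dict):
        if "task" in node:
            counts[(node["task"], node.get("status"))] += 1
        for value in node.values():
            count_tasks(value, counts)
    elif isinstance(node, list):
        for item in node:
            count_tasks(item, counts)
    return counts


def count_tasks_in_file(path):
    """Load a saved API response (hypothetical path) and tally its task entries."""
    with open(path) as fh:
        return count_tasks(json.load(fh))


# Example (file name taken from the post above):
#   for (task, status), n in count_tasks_in_file("sentry.output").most_common():
#       print(n, task, status)
```

If one or two task names account for nearly all entries, that points at duplicated rows rather than a genuinely large organization payload.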

The source code appears to be organization.tsx in getsentry/sentry on GitHub, at commit ff0b7418318b3cecb30f6460c2028204e35fb019.

async function fetchOrg(
  api: Client,
  slug: string,
  detailed: boolean,
  isInitialFetch?: boolean
): Promise<Organization> {
  const detailedQueryParam = detailed ? 1 : 0;
  const org = await getPreloadedDataPromise(
    `organization?detailed=${detailedQueryParam}`,
    slug,
    () =>
      // This data should get preloaded in static/sentry/index.ejs
      // If this url changes make sure to update the preload
      api.requestPromise(`/organizations/${slug}/`, {
        query: {detailed: detailedQueryParam},
      }),
    isInitialFetch
  );

  if (!org) {
    throw new Error('retrieved organization is falsey');
  }

  OrganizationActions.update(org, {replace: true});
  setActiveOrganization(org);

  return org;
}

That’s all I’ve got for now, to keep the momentum going on this. As Demian mentioned, those of us working on the VA.gov project can’t choose your SaaS offering right now, as we are restricted by US government constraints. I see you don’t offer a paid consult; however, if you do have a way to do that, please let us know whether it could be an option.

Thanks

ElijahLynn · July 20, 2021, 11:22pm

I see one more type. Apologies, this is reversed, as I used tac on the output to get the beginning:

 },
     "data": {}
     "dateCompleted": "2021-06-23T16:12:35.600144Z",
     "completionSeen": null,
     "user": null,
     "status": "complete",
     "task": "setup_release_tracking",
 {

ElijahLynn · July 21, 2021, 7:28pm

I haven’t dug into this too much more yet, but worth noting our GitHub organization is very, very large. We have:

2,700+ people
2,300+ teams
1,900+ repositories

UPDATE (can’t post more than 3 consecutive replies):
I created an issue on GitHub for better visibility: “Large response for fetchOrg() and timeouts (org is 2,700+ people and 2,300+ teams)”, Issue #27677 in getsentry/sentry.

wedamija · July 22, 2021, 7:04pm

Hi @ElijahLynn, could you run
select count(*) from sentry_organizationonboardingtask where task in (5, 6) and organization_id = <your_org_id>?

I’m expecting there to be two rows here, but want to confirm.

dginther · July 22, 2021, 7:53pm

looks like 1 row, but that also seems to coordinate with the amount of records we’re requesting on login?

wedamija · July 22, 2021, 8:16pm

That’s… very strange. There should be a unique index on organization_id, task. Could you dump your table schema for that table?

I’m assuming the unique index isn’t there. I’m not totally sure how your install got into this state. What has your upgrade path looked like in general?

One last thing you can do is:

select user_id, status, project_id, count(*) 
from sentry_organizationonboardingtask 
where task in (5, 6) and organization_id = 1
group by user_id, status, project_id

I suspect there won’t be much variation here, but want to be sure. It’s likely you can just delete all but 1 of these rows, but let’s see what’s in the table first. This probably won’t entirely fix your problems, since the environmentproject queries are still a problem, but hopefully we can at least solve one of your issues.

dginther · July 22, 2021, 8:53pm

the schema looks like this:

dginther · July 22, 2021, 8:54pm

and the answer to your last part:
[Image: image (4)]

wedamija · July 22, 2021, 9:53pm

Ok, so looking at our code:

github.com

getsentry/sentry/blob/ee22e171f2beffa93cf9928ee7ee2fe5e24f00f9/src/sentry/models/organizationonboardingtask.py#L60

    #   USER_CONTEXT:    User has added user context to sdk
    #   ISSUE_TRACKER:   Tracker added, issue not yet created


class OrganizationOnboardingTaskManager(BaseManager):
    def record(self, organization_id, task, **kwargs):
        cache_key = f"organizationonboardingtask:{organization_id}:{task}"
        if cache.get(cache_key) is None:
            try:
                with transaction.atomic():
                    self.create(organization_id=organization_id, task=task, **kwargs)
                    return True
            except IntegrityError:
                pass

            # Store marker to prevent running all the time
            cache.set(cache_key, 1, 3600)

        return False

This is where we write these rows. We rely entirely on an integrity error being thrown here to prevent duplicates, and since you don’t have the unique index on (organization_id, task), you’re getting duplicate rows. I’m confused about how this could be missing; the main options are:

Is there a chance someone removed this on your install?
If not, then what version of Sentry did you start with, what was your upgrade path to your current version, etc? I want to make sure that this isn’t affecting other on-premise users. It’d be helpful to know your upgrade path so that we could try and reproduce this ourselves.
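
The effect of the missing index can be reproduced in miniature. The sketch below is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for Postgres and Django, the table `onboardingtask`, the index name, and both function names are hypothetical, and the cache marker from the real code is left out. It shows the same pattern of inserting and relying on an integrity error to reject duplicates.

```python
import sqlite3


def record(conn, organization_id, task):
    """Insert one onboarding row; return True if stored, False if rejected."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO onboardingtask (organization_id, task) VALUES (?, ?)",
                (organization_id, task),
            )
        return True
    except sqlite3.IntegrityError:
        # Only raised when a unique index or constraint exists to enforce it.
        return False


def rows_after_repeated_record(with_unique_index, attempts=5):
    """Call record() repeatedly for the same (org, task); return the row count."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE onboardingtask ("
        "id INTEGER PRIMARY KEY, organization_id INTEGER, task INTEGER)"
    )
    if with_unique_index:
        conn.execute(
            "CREATE UNIQUE INDEX onboardingtask_org_task_uniq "
            "ON onboardingtask (organization_id, task)"
        )
    for _ in range(attempts):
        record(conn, 1, 5)
    (count,) = conn.execute("SELECT count(*) FROM onboardingtask").fetchone()
    conn.close()
    return count
```

With the index in place, five attempts leave one row; without it, every attempt succeeds and five rows accumulate. As far as the quoted snippet shows, the cache marker is only written after an IntegrityError, so without the index it would not slow the duplicates down either.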

What you want to do here is correct your data so that there’s only one row per task per org, then you can add the unique constraint back in place.

You can identify duplicate rows per task with this. I’ll leave it up to you to write a delete here:

select *
from sentry_organizationonboardingtask
where organization_id = 1 and task = 5
and id != (select min(id) from sentry_organizationonboardingtask where organization_id = 1 and task = 5)

You’ll need to repeat this for each task in your system that has duplicates.

Then this recreates the unique index:

CREATE UNIQUE INDEX sentry_organizationonboar_organization_id_47e98e05cae29cf3_uniq ON public.sentry_organizationonboardingtask USING btree (organization_id, task)

This will fail unless you remove all duplicate rows, but it’s important to get this back in place, otherwise you’ll hit this issue again.
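
One possible shape for the cleanup, keeping the lowest id per (organization_id, task) and then adding the index, is sketched below. It is an untested-against-Postgres illustration using Python's built-in sqlite3 module with a simplified three-column table; `dedupe_and_index`, `demo`, and the index name are hypothetical, and on a real install the index name from the statement above would be used. On a very large table a single `NOT IN` delete may be slow, so take a backup first and consider deleting in batches.

```python
import sqlite3

# Delete every row that is not the lowest-id row of its (org, task) group.
DEDUPE_SQL = """
DELETE FROM sentry_organizationonboardingtask
WHERE id NOT IN (
    SELECT min(id)
    FROM sentry_organizationonboardingtask
    GROUP BY organization_id, task
)
"""

# Hypothetical index name for this demo table.
CREATE_INDEX_SQL = """
CREATE UNIQUE INDEX onboardingtask_org_task_uniq
ON sentry_organizationonboardingtask (organization_id, task)
"""


def dedupe_and_index(conn):
    """Remove duplicates, then add the unique index; return rows deleted."""
    with conn:  # one transaction: rolled back if index creation fails
        deleted = conn.execute(DEDUPE_SQL).rowcount
        conn.execute(CREATE_INDEX_SQL)
    return deleted


def demo():
    """Build a small table with duplicates and clean it up."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sentry_organizationonboardingtask ("
        "id INTEGER PRIMARY KEY, organization_id INTEGER, task INTEGER)"
    )
    rows = [(1, 5)] * 4 + [(1, 6)] * 3 + [(2, 5)]
    conn.executemany(
        "INSERT INTO sentry_organizationonboardingtask (organization_id, task) "
        "VALUES (?, ?)",
        rows,
    )
    conn.commit()
    deleted = dedupe_and_index(conn)
    remaining = conn.execute(
        "SELECT organization_id, task FROM sentry_organizationonboardingtask "
        "ORDER BY organization_id, task"
    ).fetchall()
    conn.close()
    return deleted, remaining
```

Grouping by (organization_id, task) handles every task in one statement instead of repeating the query per task.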

There might be some timing issues here with more duplicate rows getting added. One option to avoid that is to stop ingestion for your Sentry instance while you are fixing data and adding this index in place.

dginther · July 22, 2021, 9:56pm

If it was removed on our install, then it was entirely unintentional. We started with Sentry 9, we used Sentry 10 for a while and then we moved the data from that install to this install of Sentry 20, but I didn’t do the migration so I don’t know all the details. I will see if the engineer who did that work can give me a better idea of that.

ElijahLynn · July 29, 2021, 8:53pm

I want to follow up and let you all know that there was some progress made with the solution posted by @wedamija, so thank you for that. I’ll try to get the engineer to post the results here, but until then, please know that we have fast login now!!

mleclerc001 · August 10, 2021, 8:13pm

Sorry for the late reply @wedamija

I don’t believe this is affecting other on-premise users and is likely just due to something that was broken with our install.

Unfortunately, when it comes to details about our upgrade path, I don’t have any really valuable information. A handful of people have interacted with it, and I have only been around long enough to inherit managing the instance from version 10.0.1.

From there I upgraded to the latest version at the time, which I believe was 21.4.2. That said, I have experimented with other test instances, which work fine. I also did a fresh install of 10.0.1 and confirmed that all of the database schemas line up with the information given. This means something got messed up prior to even attempting this upgrade. What is curious to me, however, is that I don’t believe these tables should ever have had the compound keys missing, even going back to earlier versions of on-premise.

wedamija · August 17, 2021, 6:14pm

Thanks for the update @mleclerc001 , good to confirm that it shouldn’t be affecting other on-prem users. I’ve replied to your post in the other thread, we can probably leave this one for now, since this specific issue is resolved.
