Extremely large initial download. Login takes 1+ minutes. What is this?

Thank you! We will look into that :wink:

What version are you running, @dginther?

Does this look like a known issue to you @priscilawebdev? What do you think might be going on?

@chadwhitacre 21.6.1, currently

I hope to work on this soon. We have been updating a bunch of config and libraries so it could be a bug on our side.


We have done some more troubleshooting on this. We enabled some APM and this is the result we came up with:

We have increased all the timeouts we can find, and we can now log in consistently without failures, but the sentry-web pods are consistently showing what I would call ridiculous memory usage:


which is causing evictions of the pods due to resource limitations.

I understand that on-premise support is ‘best effort’ and that the solution Sentry offers is “Use our cloud offering,” but we are unable to use your cloud offering because of ATO/Compliance issues, as well as cost issues. This is a pretty frustrating place to be in, to be honest.

Digging in further, we have found the following:
We found this was happening because requests were hitting timeouts. We started by raising the sentry-web timeout itself from the default 30 seconds to 300 seconds. Once you could actually get logged into the interface, this seemed to help tremendously, but unfortunately we were still seeing a lot of 504 errors.

This was due to the nginx proxy that sits in front of the sentry-web instances having its own timeout of 60 seconds. Raising that one as well certainly helped with the ability to log in, but not with the speed, and 502 errors still occur. I believe this is due to the instability of the sentry-web instances and the resources they are demanding: ultimately they run away with memory usage until the request dies out or k8s evicts the pod because it is using too many resources.

Memory usage for sentry-web skyrocketing. Note that this is only one instance at the moment; when multiple users try to access the GUI, this will multiply.

We then brought up another Sentry instance, with no data, to compare. This is the production login:

and this is the non-production login:

We enabled APM tracing to see what was happening, and we can see the query that seems to be taking a long time:

We then ran that query against the Postgres database (in RDS):


and you can see it is 25 million rows and just over 1GB of data.
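For reference, a generic way to check a table’s row count and on-disk size in Postgres looks something like this (sentry_organizationonboardingtask is used here only as an example name; substitute the table from the slow query):

-- Row count and total on-disk size (table + indexes + TOAST) for one table.
select count(*) from sentry_organizationonboardingtask;
select pg_size_pretty(pg_total_relation_size('sentry_organizationonboardingtask'));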

Admittedly, we have a lot of events; I’m just trying to understand how we might make this a better experience for our users.


I work with Demian and am helping out on this issue. In the Chrome network panel, I was able to “copy as cURL” the network request to our instance at http://sentry.vfs.va.gov/api/0/organizations/vsp/?detailed=0. I saved the response to a local file, which was 256MB of JSON, then ran cat sentry.output | python -m json.tool, and most of the results are entries of this type, with "user": null.

{
    "task": "setup_user_context",
    "status": "complete",
    "user": null,
    "completionSeen": null,
    "dateCompleted": "2021-07-19T18:43:55.286403Z",
    "data": {}
},
{
    "task": "setup_user_context",
    "status": "complete",
    "user": null,
    "completionSeen": null,
    "dateCompleted": "2021-07-19T18:43:55.337710Z",
    "data": {}
},
{
    "task": "setup_user_context",
    "status": "complete",
    "user": null,
    "completionSeen": null,
    "dateCompleted": "2021-07-19T18:43:55.417010Z",
    "data": {}
},

The source code appears to be at sentry/organization.tsx at ff0b7418318b3cecb30f6460c2028204e35fb019 · getsentry/sentry · GitHub.

async function fetchOrg(
  api: Client,
  slug: string,
  detailed: boolean,
  isInitialFetch?: boolean
): Promise<Organization> {
  const detailedQueryParam = detailed ? 1 : 0;
  const org = await getPreloadedDataPromise(
    `organization?detailed=${detailedQueryParam}`,
    slug,
    () =>
      // This data should get preloaded in static/sentry/index.ejs
      // If this url changes make sure to update the preload
      api.requestPromise(`/organizations/${slug}/`, {
        query: {detailed: detailedQueryParam},
      }),
    isInitialFetch
  );

  if (!org) {
    throw new Error('retrieved organization is falsey');
  }

  OrganizationActions.update(org, {replace: true});
  setActiveOrganization(org);

  return org;
}

That’s all I have for right now, to keep the momentum going on this. As Demian mentioned, we engineers working on the VA.gov project don’t have the option of using your SaaS offering right now, as we are restricted by US government constraints. I see you don’t have an option to offer a paid consult; however, if you do have a way to do that, please let us know if that could be an option.

Thanks

I see one more task type near the beginning of the output (found by running tac on it):

{
    "task": "setup_release_tracking",
    "status": "complete",
    "user": null,
    "completionSeen": null,
    "dateCompleted": "2021-06-23T16:12:35.600144Z",
    "data": {}
},

I haven’t dug into this too much more yet, but it’s worth noting that our GitHub organization is very, very large. We have:

  • 2,700+ people
  • 2,300+ teams
  • 1,900+ repositories

UPDATE (can’t post more than 3 consecutive replies):
I created another issue in GitHub Issues for better visibility: Large response for fetchOrg() and timeouts (org is 2,700+ people and 2,300+ teams) · Issue #27677 · getsentry/sentry · GitHub


Hi @ElijahLynn, could you run the following?

select count(*) from sentry_organizationonboardingtask where task in (5, 6) and organization_id = <your_org_id>;

I’m expecting the count to be 2 here (one row per task), but I want to confirm.


Looks like 1 row, but the count also seems to line up with the number of records we’re requesting on login?

That’s… very strange. There should be a unique index on (organization_id, task). Could you dump the schema for that table?

I’m assuming the unique index isn’t there. I’m not totally sure how your install got into this state; what has your upgrade path looked like in general?
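For reference, one way to check directly in Postgres whether that unique index exists is a generic query against the pg_indexes view (assuming direct database access):

-- Lists every index on the table along with its definition; the unique
-- (organization_id, task) index should appear here if it is present.
select indexname, indexdef
from pg_indexes
where tablename = 'sentry_organizationonboardingtask';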

One last thing you can do is:

select user_id, status, project_id, count(*) 
from sentry_organizationonboardingtask 
where task in (5, 6) and organization_id = 1
group by user_id, status, project_id

I suspect there won’t be much variation here, but want to be sure. It’s likely you can just delete all but 1 of these rows, but let’s see what’s in the table first. This probably won’t entirely fix your problems, since the environmentproject queries are still a problem, but hopefully we can at least solve one of your issues.

The schema looks like this:

and the answer to your last part:

Ok, so looking at our code:

This is where we write these rows. We rely entirely on an integrity error being thrown here to prevent duplicates, and since you don’t have the unique index on (organization_id, task), you’re getting duplicate rows here. I’m confused about how this could be missing; the main options are:

  • Is there a chance someone removed this on your install?
  • If not, then what version of Sentry did you start with, and what was your upgrade path to your current version? I want to make sure that this isn’t affecting other on-premise users, and knowing your upgrade path would help us try to reproduce this ourselves.

What you want to do here is correct your data so that there’s only one row per task per org, then you can add the unique constraint back in place.

You can identify duplicate rows per task with this. I’ll leave it up to you to write a delete here:

select *
from sentry_organizationonboardingtask
where organization_id = 1 and task = 5
and id != (select min(id) from sentry_organizationonboardingtask where organization_id = 1 and task = 5)

You’ll need to repeat this for each task in your system that has duplicates.
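For illustration, the delete implied by the query above could look something like the sketch below, assuming you want to keep the lowest id per task for the org (take a backup of the table first):

-- Keeps the row with the lowest id for each task in organization 1 and
-- removes everything else.
delete from sentry_organizationonboardingtask
where organization_id = 1
and id not in (
  select min(id)
  from sentry_organizationonboardingtask
  where organization_id = 1
  group by task
);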

Then this recreates the unique index:

CREATE UNIQUE INDEX sentry_organizationonboar_organization_id_47e98e05cae29cf3_uniq ON public.sentry_organizationonboardingtask USING btree (organization_id, task)

This will fail unless you remove all duplicate rows, but it’s important to get this back in place, otherwise you’ll hit this issue again.
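A quick way to confirm the cleanup is complete before creating the index is a generic duplicate check along these lines:

-- Should return zero rows once all duplicates have been removed.
select organization_id, task, count(*)
from sentry_organizationonboardingtask
group by organization_id, task
having count(*) > 1;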

There might be some timing issues here, with more duplicate rows getting added while you work. One option to avoid that is to stop ingestion on your Sentry instance while you fix the data and put the index in place.
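Another option, if a brief write block on that table is acceptable, is to do the cleanup and index creation inside a single transaction that locks the table. A rough sketch:

begin;
-- Blocks concurrent inserts/updates/deletes on the table until commit, so no
-- new duplicate rows can appear between the cleanup and the index creation.
lock table sentry_organizationonboardingtask in share row exclusive mode;
-- ... run the duplicate cleanup from above here ...
-- ... then the CREATE UNIQUE INDEX statement from above ...
commit;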


If it was removed on our install, then it was entirely unintentional. We started with Sentry 9, used Sentry 10 for a while, and then moved the data from that install to this Sentry 20 install. I didn’t do the migration, so I don’t know all the details; I will see if the engineer who did that work can give me a better idea.


I want to follow up and let you all know that there was some progress made with the solution posted by @wedamija, so thank you for that. I’ll try to get the engineer to post the results here, but until then, please know that we have fast login now!!


Sorry for the late reply @wedamija

I don’t believe this is affecting other on-premise users; it is likely just due to something that was broken with our install.

Unfortunately, when it comes to details about our upgrade path, I don’t have much valuable information. A handful of people have interacted with the instance over time, and I have only been around long enough to inherit managing it from version 10.0.1.

From there I upgraded to the latest version at the time, which I believe was 21.4.2. That said, I have messed around with other test instances, which work fine. I also did a fresh install of 10.0.1 to confirm that the database schema lined up with the information given (it does), meaning something got messed up before this upgrade was even attempted. What is curious to me, though, is that I don’t believe these tables should ever have been missing the compound keys, even going back to earlier versions of on-premise.


Thanks for the update @mleclerc001, good to confirm that it shouldn’t be affecting other on-prem users. I’ve replied to your post in the other thread; we can probably leave this one for now, since this specific issue is resolved.

This topic was automatically closed 15 days after the last reply. New replies are no longer allowed.