Try to tune project listing query again #11620

Draft: wants to merge 3 commits into base: main

agjohnson
Contributor

  • Don't subquery for builds in project listing prefetch

This seems like it's unnecessary, but the prefetch is not accurate
using Build.objects first.
.values_list("id", flat=True)[:1]
# Get most recent and recent successful builds
builds_latest = (
    Build.internal.filter(project__in=self)
Contributor Author

It's not clear why Build.internal is needed in both the inner query and the prefetch query. It seems like one of them could be Build.objects at the very least? The performance is a little better with Build.objects.

Member

As I understand it, this is to avoid getting builds from PRs (external versions), and I'd say that .internal should perform better than .objects since it removes a lot of builds from consideration, and they should be removed using an index on Build.type. That's the theory, tho 😄

Contributor Author

Heh same, that was my thought initially. There is some complexity added by this method though, which I think ultimately annoys the query planner.

    .annotate(latest=Max("pk"))
    .values_list("latest", flat=True)
)
builds_success = (
Contributor Author

This feels like it could be combined into the query above, saving ~400ms.
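
As a rough illustration of the "combine into one query" idea, here is a minimal sketch using plain SQL through Python's sqlite3 module rather than the project's ORM code. The `build` table, its columns, and the sample rows are hypothetical stand-ins, and whether this shape actually helps the Postgres planner on the real schema has not been measured here. The conditional `MAX(CASE ...)` computes the latest build and the latest successful build per project in a single pass; in Django this would presumably map to something like `Max("pk", filter=Q(success=True))`.

```python
import sqlite3

# Hypothetical, simplified stand-in for the builds table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE build (
        id INTEGER PRIMARY KEY,
        project_id INTEGER NOT NULL,
        success INTEGER NOT NULL
    );
    INSERT INTO build (id, project_id, success) VALUES
        (1, 10, 1), (2, 10, 1), (3, 10, 0),
        (4, 20, 0), (5, 20, 0),
        (6, 30, 1);
    """
)

# One pass over the builds: MAX(id) is the latest build per project, and the
# CASE expression only feeds successful builds into the second MAX.
rows = conn.execute(
    """
    SELECT project_id,
           MAX(id) AS latest,
           MAX(CASE WHEN success = 1 THEN id END) AS latest_success
    FROM build
    GROUP BY project_id
    ORDER BY project_id
    """
).fetchall()

# The prefetch only needs the set of build ids, so merge both columns.
latest_ids = {latest for _, latest, _ in rows}
success_ids = {success for _, _, success in rows if success is not None}
build_ids = sorted(latest_ids | success_ids)

print(rows)
print(build_ids)
```

A project with no successful build simply yields NULL (None) in the second column, so it has to be dropped before the ids are used in a lookup.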

@agjohnson
Contributor Author

This reduced the time needed for the prefetch from 12s to 3s, but that is still not usable. It's really not clear why, but the planner was previously falling apart on this query and triggered a sequential scan on builds_version.

This is most easily testable against one of our accounts:

In [1]: %time list(Project.objects.dashboard(User.objects.filter(is_staff=True).first()))
CPU times: user 28.8 ms, sys: 45 μs, total: 28.9 ms
Wall time: 2.44 s

The current explain looks different than it did (specifically, it no longer does a sequential scan on builds_version), but it is still overly complex:

Sort  (cost=32709.47..32709.91 rows=179 width=339) (actual time=1304.421..1304.425 rows=15 loops=1)
  Sort Key: builds_build.date DESC
  Sort Method: quicksort  Memory: 31kB
  ->  Nested Loop Left Join  (cost=30818.25..32702.77 rows=179 width=339) (actual time=1304.218..1304.392 rows=15 loops=1)
        Filter: (((builds_version.type)::text <> 'external'::text) OR (builds_version.type IS NULL))
        ->  Nested Loop  (cost=30817.82..32514.43 rows=185 width=339) (actual time=1304.207..1304.307 rows=15 loops=1)
              ->  HashAggregate  (cost=30817.39..30819.39 rows=200 width=4) (actual time=1304.175..1304.183 rows=19 loops=1)
                    Group Key: max(v0.id)
                    ->  GroupAggregate  (cost=78.79..30715.36 rows=8162 width=8) (actual time=2.336..1304.159 rows=19 loops=1)
                          Group Key: v0.project_id
                          ->  Nested Loop Left Join  (cost=78.79..30592.93 rows=8162 width=8) (actual time=0.152..1261.318 rows=450264 loops=1)
                                Filter: (((v1.type)::text <> 'external'::text) OR (v1.type IS NULL))
                                Rows Removed by Filter: 1898
                                ->  Nested Loop  (cost=78.36..26593.19 rows=8462 width=12) (actual time=0.136..442.442 rows=452162 loops=1)
                                      ->  Unique  (cost=77.80..77.86 rows=11 width=4) (actual time=0.121..0.140 rows=19 loops=1)
                                            ->  Sort  (cost=77.80..77.83 rows=11 width=4) (actual time=0.121..0.129 rows=19 loops=1)
                                                  Sort Key: u0.id
                                                  Sort Method: quicksort  Memory: 25kB
                                                  ->  Nested Loop  (cost=0.85..77.61 rows=11 width=4) (actual time=0.019..0.108 rows=19 loops=1)
                                                        ->  Index Scan using projects_project_users_user_id on projects_project_users u1  (cost=0.42..24.74 rows=11 width=4) (actual time=0.007..0.027 rows=19 loops=1)
                                                              Index Cond: (user_id = 14481)
                                                        ->  Index Only Scan using projects_project_pkey on projects_project u0  (cost=0.42..4.81 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=19)
                                                              Index Cond: (id = u1.project_id)
                                                              Heap Fetches: 3
                                      ->  Index Scan using builds_build_project_id on builds_build v0  (cost=0.56..2402.78 rows=769 width=12) (actual time=0.010..20.851 rows=23798 loops=19)
                                            Index Cond: (project_id = u0.id)
                                ->  Index Only Scan using idx_builds_version_id_type on builds_version v1  (cost=0.43..0.46 rows=1 width=9) (actual time=0.001..0.001 rows=1 loops=452162)
                                      Index Cond: (id = v0.version_id)
                                      Heap Fetches: 410238
              ->  Index Scan using builds_build_pkey on builds_build  (cost=0.44..8.47 rows=1 width=339) (actual time=0.006..0.006 rows=1 loops=19)
                    Index Cond: (id = (max(v0.id)))
                    Filter: (project_id = ANY ('{487639,74581,689368,24458,170010,613422,714226,521174,256207,815321,527062,451683,17662,233368,489923}'::integer[]))
                    Rows Removed by Filter: 0
        ->  Index Only Scan using idx_builds_version_id_type on builds_version  (cost=0.43..1.01 rows=1 width=9) (actual time=0.005..0.005 rows=1 loops=15)
              Index Cond: (id = builds_build.version_id)
              Heap Fetches: 12
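
To make the plan a bit easier to read: in EXPLAIN ANALYZE output, `rows` and `actual time` are per-loop averages, so the real work of a node is roughly that figure multiplied by `loops`. The sketch below is a throwaway helper, not part of this PR, and its regex only handles node lines shaped like the three copied from the plan above. It shows that the index scan on builds_build returns 19 × 23,798 = 452,162 rows in total, against an estimate of 769 per loop, and that the builds_version lookup then runs once for each of those rows.

```python
import re

# Three node lines copied verbatim from the plan above.
plan_lines = [
    "->  Nested Loop  (cost=78.36..26593.19 rows=8462 width=12) (actual time=0.136..442.442 rows=452162 loops=1)",
    "->  Index Scan using builds_build_project_id on builds_build v0  (cost=0.56..2402.78 rows=769 width=12) (actual time=0.010..20.851 rows=23798 loops=19)",
    "->  Index Only Scan using idx_builds_version_id_type on builds_version v1  (cost=0.43..0.46 rows=1 width=9) (actual time=0.001..0.001 rows=1 loops=452162)",
]

NODE = re.compile(
    r"->\s+(?P<node>.+?)\s+"
    r"\(cost=\d+\.\d+\.\.\d+\.\d+ rows=(?P<estimated>\d+) width=\d+\) "
    r"\(actual time=\d+\.\d+\.\.(?P<end_ms>\d+\.\d+) rows=(?P<rows>\d+) loops=(?P<loops>\d+)\)"
)


def summarize(line):
    """Scale the per-loop figures of one plan node by its loop count."""
    match = NODE.search(line)
    if match is None:
        raise ValueError(f"unrecognized plan line: {line!r}")
    loops = int(match["loops"])
    return {
        "node": match["node"],
        "estimated_rows_per_loop": int(match["estimated"]),
        "actual_rows_per_loop": int(match["rows"]),
        "total_rows": int(match["rows"]) * loops,
        # Approximate: the per-loop time in the plan is already rounded.
        "total_ms": float(match["end_ms"]) * loops,
    }


summaries = [summarize(line) for line in plan_lines]
for summary in summaries:
    print(
        f"{summary['node']}: total rows={summary['total_rows']}, "
        f"estimated per loop={summary['estimated_rows_per_loop']}, "
        f"approx total ms={summary['total_ms']:.1f}"
    )
```

Read this way, nearly all of the time goes into walking every build of every project just to take one Max() per project.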

Member

humitos left a comment

One thing to note is that we are using .prefetch_related here, which does the joining on the Python side. That could explain why some of these queries are fast when testing them in the DB, but slow when using Django to access these views:

prefetch_related, on the other hand, does a separate lookup for each relationship, and does the ‘joining’ in Python

(from https://docs.djangoproject.com/en/5.0/ref/models/querysets/#prefetch-related)

Since we are using Max() to get the latest build and the latest successful build for each project, we could probably use select_related here instead, which would do everything in the DB.
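
For reference, the "joining in Python" that the docs describe looks roughly like the sketch below. This is a simplified illustration with plain sqlite3, not Django's actual implementation; the `project` and `build` tables and their rows are made up for the example. The shape is one query for the parent rows, one `IN (...)` query for all related rows, then a dictionary lookup to attach them.

```python
import sqlite3
from collections import defaultdict

# Hypothetical, simplified tables for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE project (id INTEGER PRIMARY KEY, slug TEXT NOT NULL);
    CREATE TABLE build (
        id INTEGER PRIMARY KEY,
        project_id INTEGER NOT NULL REFERENCES project (id)
    );
    INSERT INTO project (id, slug) VALUES (1, 'docs'), (2, 'api'), (3, 'empty');
    INSERT INTO build (id, project_id) VALUES (1, 1), (2, 1), (3, 2);
    """
)

# Query 1: the parent rows.
projects = [
    {"id": pk, "slug": slug, "builds": []}
    for pk, slug in conn.execute("SELECT id, slug FROM project ORDER BY id")
]

# Query 2: every related row for those parents, in a single IN (...) lookup.
project_ids = [project["id"] for project in projects]
placeholders = ", ".join("?" for _ in project_ids)
related = conn.execute(
    f"SELECT id, project_id FROM build "
    f"WHERE project_id IN ({placeholders}) ORDER BY id",
    project_ids,
).fetchall()

# The "join" happens here, in Python, not in the database.
builds_by_project = defaultdict(list)
for build_id, project_id in related:
    builds_by_project[project_id].append(build_id)

for project in projects:
    project["builds"] = builds_by_project.get(project["id"], [])

print(projects)
```

The Python side is cheap as long as the second query returns few rows, which is why the cost of the prefetch query itself matters most here.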

.values_list("id", flat=True)[:1]
# Get most recent and recent successful builds
builds_latest = (
    Build.internal.filter(project__in=self)
Member

Instead of __in, can't we just use project__pk=self.pk here as I did in another PR? That worked pretty well there.

Contributor Author

self is a Project.objects queryset, not an individual model instance.

# Get most recent and recent successful builds
builds_latest = (
    Build.internal.filter(project__in=self)
    .values("project")
Member

Do we need this line here? I understand the project value is not used.

Contributor Author

We might not, but this is for grouping by project. The latest build day per project is what is needed here. If there is a different way to group this, we don't need the second query at all.
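
To show what the grouping buys us: in Django, calling `.values("project")` before `.annotate(...)` turns the aggregate into a GROUP BY on that column. The sketch below reproduces the effect with plain SQL through sqlite3; the `build` table and its rows are hypothetical. Without the GROUP BY, Max() collapses to a single global id; with it, there is one id per project, which the outer query then uses to load the full rows.

```python
import sqlite3

# Hypothetical, simplified stand-in for the builds table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE build (
        id INTEGER PRIMARY KEY,
        project_id INTEGER NOT NULL
    );
    INSERT INTO build (id, project_id) VALUES
        (1, 10), (2, 10), (3, 20), (4, 10), (5, 30);
    """
)

# With grouping: one MAX(id) per project.
grouped = [
    row[0]
    for row in conn.execute(
        "SELECT MAX(id) FROM build GROUP BY project_id ORDER BY project_id"
    )
]

# Without grouping: a single MAX(id) across all projects.
ungrouped = [row[0] for row in conn.execute("SELECT MAX(id) FROM build")]

# The outer query then loads full rows for the ids the grouped query found.
latest_builds = conn.execute(
    """
    SELECT id, project_id
    FROM build
    WHERE id IN (SELECT MAX(id) FROM build GROUP BY project_id)
    ORDER BY project_id
    """
).fetchall()

print(grouped, ungrouped, latest_builds)
```

So the project value itself is never read, but dropping the line would change which ids come back.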

@humitos
Member

humitos commented Sep 26, 2024

"Since we are using Max() to get the latest build and the latest successful build for each project, we could probably use select_related here instead, which would do everything in the DB."

I quickly tested this and it's slower 😄
