Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spawn scanner processes concurrently #54

Merged
merged 9 commits into from
Nov 29, 2023
Merged

Spawn scanner processes concurrently #54

merged 9 commits into from
Nov 29, 2023

Conversation

DMRobertson
Copy link
Contributor

@DMRobertson DMRobertson commented Nov 28, 2023

The content scanner uses subprocess.run to invoke the scanner script. This is synchronous and does not yield to the event loop. Replace it with the equivalent asyncio call so we can spawn multiple processes and concurrently await their results.

To demonstrate the improvement, I have wired up a rudimentary test script, which downloads 182 avatars from #synapse-dev through the content scanner. Before:

$ git rev-parse HEAD
bbbff949b5ca122a7dbbbdf7799c0dc14fae5f23

$ python perf/scanner_perf.py
number of URLs: 182
......................................................................................................................................................................................
Status codes from scanner server: Counter({200: 174, 403: 4, 502: 4})
Succeeded in 175.86s
......................................................................................................................................................................................
Status codes from scanner server: Counter({200: 178, 403: 4})
Succeeded in 4.26s
Server return code: None

After:

2023-11-28 15:49:57 ✔  $ git rev-parse HEAD
0ba77df7e2056f39fe90d06638e3f39de9a3bfa7

$ python perf/scanner_perf.py
number of URLs: 182
......................................................................................................................................................................................
Status codes from scanner server: Counter({200: 178, 403: 4})
Succeeded in 10.93s
......................................................................................................................................................................................
Status codes from scanner server: Counter({200: 178, 403: 4})
Succeeded in 0.15s
Server return code: None

@DMRobertson DMRobertson marked this pull request as ready for review November 28, 2023 16:10
@DMRobertson DMRobertson requested a review from a team as a code owner November 28, 2023 16:10
@DMRobertson
Copy link
Contributor Author

After:

I was surprised that the second run (with caches warmed up) still benefited from this change. But maybe that's because the Before run hit 502s and ended up retrying them on the second run.

In any case, I think the main result here is the drop from ~3minutes to ~10 seconds.

perf/pickle_path Outdated Show resolved Hide resolved
src/matrix_content_scanner/scanner/scanner.py Show resolved Hide resolved
perf/.gitignore Show resolved Hide resolved
@DMRobertson DMRobertson merged commit e1f82a0 into main Nov 29, 2023
4 checks passed
@DMRobertson DMRobertson deleted the dmr/perf branch November 29, 2023 14:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants