This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

[Presence] Huge spike in CPU usage/Federation traffic approximately every 25 minutes #15878

Open
realtyem opened this issue Jul 5, 2023 · 4 comments
Labels
A-Presence O-Occasional Affects or can be seen by some users regularly or most users rarely S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@realtyem
Contributor

realtyem commented Jul 5, 2023

Synapse sends a ping over federation every 25 minutes to keep a local user from being marked 'offline' by remote servers before their 30-minute federation timeout hits.
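For reference, the two intervals described above relate like this. The constant names mirror Synapse's presence handler, but treat the sketch itself as illustrative:

```python
# Values from the issue text: a 25-minute ping interval and a 30-minute
# timeout, both in milliseconds.
FEDERATION_PING_INTERVAL = 25 * 60 * 1000  # re-send presence to remote servers
FEDERATION_TIMEOUT = 30 * 60 * 1000        # remotes mark a user offline after this

# The ping interval must be shorter than the timeout, otherwise remote
# servers would time the user out before the keep-alive ping arrives.
assert FEDERATION_PING_INTERVAL < FEDERATION_TIMEOUT
slack_ms = FEDERATION_TIMEOUT - FEDERATION_PING_INTERVAL
print(slack_ms // 60000)  # -> 5 (minutes of slack)
```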

[Image: Federation spike 1]

This appears to be the replication notifier system ramping up and queueing a batch of federation sending requests over roughly one minute (give or take a few seconds).

Images:

[Image: Federation spike 2]

[Image: Federation spike 3]

[Image: Federation spike 4]

There is a database hit during this to get_current_hosts_in_room(). I'm not personally convinced it's contributing to the seriousness of this situation, but it is included here for completeness.

Images:

[Image: Federation spike 5]

UPDATE: Additional information on the other side of the slash in the title (the federation traffic).

The large spike in traffic caused by queueing and then sending all those requests looks like this:
[Image: Federation spike 7]

@H-Shay
Contributor

H-Shay commented Jul 5, 2023

I think this is essentially a dupe of #9478 - I am going to close this in favor of that.

@H-Shay H-Shay closed this as completed Jul 5, 2023
@realtyem
Contributor Author

realtyem commented Jul 6, 2023

I postulate that this is different and more specific (setting aside that #9478 is actually a meta issue, much like the one this report sits under). I believe this specific behavior is caused by the presence prefill that is set up when Synapse starts.

Look here at the part of the __init__() function where the wheel timers that decide when things need to be timed out are initially set. In the case of federation pings, this is set to 25 minutes. Specifically, line 683 is where the relevant timer is calculated (we will ignore the other federation timer that is set, as it is for remote users and we are not responsible for sending presence for them).

```python
now = self.clock.time_msec()
if self._presence_enabled:
    for state in self.user_to_current_state.values():
        self.wheel_timer.insert(
            now=now, obj=state.user_id, then=state.last_active_ts + IDLE_TIMER
        )
        self.wheel_timer.insert(
            now=now,
            obj=state.user_id,
            then=state.last_user_sync_ts + SYNC_ONLINE_TIMEOUT,
        )
        if self.is_mine_id(state.user_id):
            self.wheel_timer.insert(
                now=now,
                obj=state.user_id,
                then=state.last_federation_update_ts + FEDERATION_PING_INTERVAL,
            )
        else:
            self.wheel_timer.insert(
                now=now,
                obj=state.user_id,
                then=state.last_federation_update_ts + FEDERATION_TIMEOUT,
            )
```

Slightly further down (in the same function) is where a looping call is set up to run every 5 seconds after an initial 30-second wait (so it doesn't interfere with the rest of Synapse starting up):

```python
if self._presence_enabled:
    # Start a LoopingCall in 30s that fires every 5s.
    # The initial delay is to allow disconnected clients a chance to
    # reconnect before we treat them as offline.
    def run_timeout_handler() -> Awaitable[None]:
        return run_as_background_process(
            "handle_presence_timeouts", self._handle_timeouts
        )

    self.clock.call_later(
        30, self.clock.looping_call, run_timeout_handler, 5000
    )
```

After this initial run, all the timers share the same base time, and they are reset to the same time again when the timeouts are handled (in this instance, in about 25 minutes), leading to a perpetual repeat of this situation.

I speculate that spreading the initial set of these timers over the 25-minute interval (evenly or randomly) would break up this load and prevent the spikes. Subsequent timers would then also be spread out.
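The proposed spreading could look something like the sketch below. This is an assumption about a possible fix, not actual Synapse code; the helper name is hypothetical:

```python
import random

FEDERATION_PING_INTERVAL = 25 * 60 * 1000  # ms

def jittered_ping_time(last_federation_update_ts: int) -> int:
    # Spread first expiries uniformly across the whole interval instead of
    # scheduling them all at last_federation_update_ts + FEDERATION_PING_INTERVAL.
    # Firing early is safe: it just re-sends presence sooner than necessary,
    # and subsequent cycles inherit the offset, staying spread out.
    jitter = random.randrange(FEDERATION_PING_INTERVAL)
    return last_federation_update_ts + jitter

# With 10,000 users sharing one timestamp, expiries now span the interval
# rather than piling up at a single point 25 minutes out.
times = [jittered_ping_time(0) for _ in range(10_000)]
assert all(0 <= t < FEDERATION_PING_INTERVAL for t in times)
print(len(set(t // 5000 for t in times)) > 1)  # -> True (many distinct buckets)
```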

@H-Shay
Contributor

H-Shay commented Jul 6, 2023

Fair enough! Thanks for adding an explanation; that makes what you are getting at much clearer. I will re-open.

@H-Shay H-Shay reopened this Jul 6, 2023
@H-Shay H-Shay added the A-Presence, T-Defect, O-Occasional, and S-Minor labels Jul 6, 2023
@realtyem
Contributor Author

realtyem commented Jul 6, 2023

These images/screenshots of metrics are for a single user (for context).

So, what happens when the number of local users climbs to hundreds of thousands? In theory, a correspondingly huge increase in traffic.

I'm envisioning something like a deduplicating, bucket-like system with an arbitrary timer (say 5 or 10 seconds) that accumulates presence data for multiple users (or updates it if a change comes along within that window) to 'append' to outgoing federation traffic. Then, if another need to send data over federation comes up, it can check that bucket and include its contents. If nothing else arrives within the timer window, send the presence data by itself. Or some such.
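The bucket idea above could be sketched roughly as follows. All names here are hypothetical (none of this is a Synapse API); it just demonstrates the accumulate-dedup-flush shape:

```python
import time
from typing import Callable, Dict, Optional

class PresenceBatcher:
    """Accumulate per-user presence updates for a short window, then flush
    them as one batch instead of one federation send per update."""

    def __init__(self, flush: Callable[[Dict[str, str]], None],
                 window_secs: float = 5.0):
        self._pending: Dict[str, str] = {}  # user_id -> latest presence state
        self._flush = flush
        self._window = window_secs
        self._deadline: Optional[float] = None

    def update(self, user_id: str, state: str) -> None:
        # Later updates within the window overwrite earlier ones (dedup).
        self._pending[user_id] = state
        if self._deadline is None:
            self._deadline = time.monotonic() + self._window

    def tick(self) -> None:
        # Called periodically; sends the accumulated batch once the window closes.
        if self._deadline is not None and time.monotonic() >= self._deadline:
            batch, self._pending = self._pending, {}
            self._deadline = None
            self._flush(batch)

# Usage: two rapid updates for one user collapse into a single flushed entry.
sent = []
batcher = PresenceBatcher(flush=sent.append, window_secs=0.0)
batcher.update("@alice:example.com", "online")
batcher.update("@alice:example.com", "unavailable")  # overwrites, not appended
batcher.tick()
print(sent)  # -> [{'@alice:example.com': 'unavailable'}]
```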

I'm open to clever ideas, if anyone would like to suggest something.
