This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

[Presence] Huge spike in CPU usage/Federation traffic approximately every 25 minutes #15878

Open
realtyem opened this issue Jul 5, 2023 · 4 comments
Labels
A-Presence O-Occasional Affects or can be seen by some users regularly or most users rarely S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@realtyem
Contributor

realtyem commented Jul 5, 2023

Synapse sends a ping over federation every 25 minutes to keep a local user from being marked 'offline' by remote servers before their 30-minute federation timeout hits.
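For reference, the two intervals described above relate like this. The constant names mirror Synapse's presence handler, but treat the sketch itself as illustrative:

```python
# Values from the issue text: a 25-minute ping interval and a 30-minute
# timeout, both in milliseconds.
FEDERATION_PING_INTERVAL = 25 * 60 * 1000  # re-send presence to remote servers
FEDERATION_TIMEOUT = 30 * 60 * 1000        # remotes mark a user offline after this

# The ping interval must be shorter than the timeout, otherwise remote
# servers would time the user out before the keep-alive ping arrives.
assert FEDERATION_PING_INTERVAL < FEDERATION_TIMEOUT
slack_ms = FEDERATION_TIMEOUT - FEDERATION_PING_INTERVAL
print(slack_ms // 60000)  # -> 5 (minutes of slack)
```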

[Image: Federation spike 1]

This appears to be the replication notifier system ramping up and queueing a batch of federation sending requests over roughly one minute (give or take a few seconds).

Images:

[Image: Federation spike 2]

[Image: Federation spike 3]

[Image: Federation spike 4]

There is a database hit during this to get_current_hosts_in_room(). I'm not personally convinced it's contributing to the seriousness of this situation, but it is included here for completeness.

Images:

[Image: Federation spike 5]

UPDATE: Additional information on the other side of the slash in the title (the federation traffic).

The large spike in traffic caused by queueing and then sending all those requests looks like this:
[Image: Federation spike 7]

@H-Shay
Contributor

H-Shay commented Jul 5, 2023

I think this is essentially a dupe of #9478 - I am going to close this in favor of that.

@H-Shay H-Shay closed this as completed Jul 5, 2023
@realtyem
Contributor Author

realtyem commented Jul 6, 2023

I postulate that this is different and more specific (setting aside that #9478 is actually a meta issue, much like the one this report sits under). I believe this specific behavior is caused by the presence prefill that is set up when Synapse starts.

Look here at the part of the __init__() function where the wheel timers that decide when things need to be timed out are initially set. In the case of federation pings, this is set to 25 minutes. Specifically, line 683 is where the relevant timer is calculated (we will ignore the other federation timer that is set, as it is for remote users and we are not responsible for sending presence for them).

```python
now = self.clock.time_msec()
if self._presence_enabled:
    for state in self.user_to_current_state.values():
        self.wheel_timer.insert(
            now=now, obj=state.user_id, then=state.last_active_ts + IDLE_TIMER
        )
        self.wheel_timer.insert(
            now=now,
            obj=state.user_id,
            then=state.last_user_sync_ts + SYNC_ONLINE_TIMEOUT,
        )
        if self.is_mine_id(state.user_id):
            self.wheel_timer.insert(
                now=now,
                obj=state.user_id,
                then=state.last_federation_update_ts + FEDERATION_PING_INTERVAL,
            )
        else:
            self.wheel_timer.insert(
                now=now,
                obj=state.user_id,
                then=state.last_federation_update_ts + FEDERATION_TIMEOUT,
            )
```

Slightly further down (in the same function) is where a looping call is set up to run every 5 seconds after an initial 30-second wait (so it doesn't interfere with the rest of Synapse starting up):

```python
if self._presence_enabled:
    # Start a LoopingCall in 30s that fires every 5s.
    # The initial delay is to allow disconnected clients a chance to
    # reconnect before we treat them as offline.
    def run_timeout_handler() -> Awaitable[None]:
        return run_as_background_process(
            "handle_presence_timeouts", self._handle_timeouts
        )

    self.clock.call_later(
        30, self.clock.looping_call, run_timeout_handler, 5000
    )
```

After this initial run, all the timers share the same base time, and they are reset to the same time again when the timeouts are handled (in this instance, in about 25 minutes), leading to a perpetual repeat of this situation.

I speculate that spreading the initial set of these timers over the 25-minute interval (evenly or randomly) would break up this load and prevent the spikes. Subsequent timers would then also be spread out.
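The proposed spreading could look something like the sketch below. This is an assumption about a possible fix, not actual Synapse code; the helper name is hypothetical:

```python
import random

FEDERATION_PING_INTERVAL = 25 * 60 * 1000  # ms

def jittered_ping_time(last_federation_update_ts: int) -> int:
    # Spread first expiries uniformly across the whole interval instead of
    # scheduling them all at last_federation_update_ts + FEDERATION_PING_INTERVAL.
    # Firing early is safe: it just re-sends presence sooner than necessary,
    # and subsequent cycles inherit the offset, staying spread out.
    jitter = random.randrange(FEDERATION_PING_INTERVAL)
    return last_federation_update_ts + jitter

# With 10,000 users sharing one timestamp, expiries now span the interval
# rather than piling up at a single point 25 minutes out.
times = [jittered_ping_time(0) for _ in range(10_000)]
assert all(0 <= t < FEDERATION_PING_INTERVAL for t in times)
print(len(set(t // 5000 for t in times)) > 1)  # -> True (many distinct buckets)
```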

@H-Shay
Contributor

H-Shay commented Jul 6, 2023

Fair enough! Thanks for adding an explanation; that makes what you are getting at much clearer. I will re-open.

@H-Shay H-Shay reopened this Jul 6, 2023
@H-Shay H-Shay added the A-Presence, T-Defect, O-Occasional, and S-Minor labels Jul 6, 2023
@realtyem
Contributor Author

realtyem commented Jul 6, 2023

These images/screenshots of metrics are for a single user (for context).

So, what happens when the number of local users climbs to hundreds of thousands? In theory, a correspondingly huge increase in traffic.

I'm envisioning something like a deduplicating, bucket-like system with an arbitrary timer (say 5 or 10 seconds) that accumulates presence data for multiple users (or updates it if a change comes along within that window) to 'append' to outgoing federation traffic. Then, if another need to send data over federation comes up, it can check that bucket and include its contents. If nothing else arrives within the timer window, send the presence data by itself. Or some such.
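The bucket idea above could be sketched roughly as follows. All names here are hypothetical (none of this is a Synapse API); it just demonstrates the accumulate-dedup-flush shape:

```python
import time
from typing import Callable, Dict, Optional

class PresenceBatcher:
    """Accumulate per-user presence updates for a short window, then flush
    them as one batch instead of one federation send per update."""

    def __init__(self, flush: Callable[[Dict[str, str]], None],
                 window_secs: float = 5.0):
        self._pending: Dict[str, str] = {}  # user_id -> latest presence state
        self._flush = flush
        self._window = window_secs
        self._deadline: Optional[float] = None

    def update(self, user_id: str, state: str) -> None:
        # Later updates within the window overwrite earlier ones (dedup).
        self._pending[user_id] = state
        if self._deadline is None:
            self._deadline = time.monotonic() + self._window

    def tick(self) -> None:
        # Called periodically; sends the accumulated batch once the window closes.
        if self._deadline is not None and time.monotonic() >= self._deadline:
            batch, self._pending = self._pending, {}
            self._deadline = None
            self._flush(batch)

# Usage: two rapid updates for one user collapse into a single flushed entry.
sent = []
batcher = PresenceBatcher(flush=sent.append, window_secs=0.0)
batcher.update("@alice:example.com", "online")
batcher.update("@alice:example.com", "unavailable")  # overwrites, not appended
batcher.tick()
print(sent)  # -> [{'@alice:example.com': 'unavailable'}]
```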

I'm open to clever ideas, if anyone would like to suggest something.
