pidgin.im

Fri Jan 18 10:51:54 EST 2008

On Fri, Jan 18, 2008 at 10:34:45AM -0500, Luke Schierer wrote:
> Thursday afternoon, our load average jumped from between 0 and 15, with
> the usual being between 0 and 10, to something more like 20.  The server
> appears to have mostly handled this.  I suspect this was caused by the
> massive growth of "committed" ram, which grew to about 8.5G between
> Thursday 10:00 and Thursday 20:00.  At approximately 23:00, also on
> Thursday, either someone noticed this, and restarted something (probably
> lighttpd), or the problem fixed itself.
> 
> However, today, starting at about 01:00, the load average started going
> MUCH MUCH higher.  This caused the server to start to fall over, as
> various processes, notably the munin monitoring and the logcheck emails,
> starved for resources.  At about 05:15, the load fell from about 55 to
> 15-20, and munin was again able to run for the next 3 hours.  Then the
> load climbed back up.
> 
> When I noticed this on reaching a computer here at my office, the load
> average was up over 100, and trac.fcgi was using a steady 99% of the
> CPU.  Memory did not seem to be an issue, though munin tells me that the
> amount committed was growing again during that 3 hour block this morning
> while it could run.
> 
> I halted both lighttpd and usher, and killed all mtn jobs to let the
> system recover.  Once the load was again below 10, which happened
> relatively rapidly once I was able to kill trac.fcgi and the mtn jobs, I
> restarted both usher and lighttpd.
> 
> This is the second time since the 14th that the committed memory has
> spiked, something that used to happen regularly but other than one spike
> about a week ago, had not happened since this past summer.  Does anyone
> have any idea what has changed recently that might be leaking?
> 
> luke

Within half an hour of restarting lighttpd, trac.fcgi was again using
99% of the CPU steadily.  The load average started climbing.  I started
by nicing the process to 10, then 15, then 20, but the load continued to
climb, and I started seeing evidence of processes starving for CPU.

Notably:

* the email I am now replying to did not make it to my own inbox.
* munin's monitoring update failed.

Email and munin have in the past been key indicators of homing's overall
health.  Something is slamming trac, and I have no clue what.
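
For anyone who wants to keep an eye on this between munin runs, here is
a rough Python sketch of the check-and-renice loop I was doing by hand.
It is only an illustration: the function names are made up, the
threshold of 20 is just my reading of "abnormal" from the numbers above,
and the nice steps mirror what I tried manually.

```python
import os

def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from
    /proc/loadavg-style text (the first three fields)."""
    fields = text.split()
    return tuple(float(f) for f in fields[:3])

def is_overloaded(load1, threshold=20.0):
    """True when the 1-minute load exceeds the threshold.
    The default of 20 is an assumption, not a tuned value."""
    return load1 > threshold

def next_nice(current, steps=(10, 15, 19)):
    """Return the next, gentler nice value after `current`.
    Nice values top out at 19 on Linux, so the last step is 19;
    once there, the value is returned unchanged."""
    for step in steps:
        if step > current:
            return step
    return current

if __name__ == "__main__":
    # os.getloadavg() reports the same three averages on Unix systems.
    load1, load5, load15 = os.getloadavg()
    print("load averages:", load1, load5, load15)
    print("overloaded:", is_overloaded(load1))
```

As I found out the hard way, stepping the nice value up does not help
much if the process is the only thing that wants the CPU, so treat
next_nice() as a stopgap rather than a fix.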

I've killed it for now; hopefully Daniel can look at it soon.

luke