pidgin.im

Fri Jan 18 10:34:45 EST 2008

Thursday afternoon, our load average jumped from its normal range of 0 to
15 (usually 0 to 10) to something more like 20.  The server
appears to have mostly handled this.  I suspect this was caused by the
massive growth of "committed" RAM, which grew to about 8.5G between
Thursday 10:00 and Thursday 20:00.  At approximately 23:00, also on
Thursday, either someone noticed this and restarted something (probably
lighttpd), or the problem fixed itself.

However, starting about 01:00 today, the load average began going MUCH
MUCH higher.  This caused the server to start to fall over, as various
processes, notably the munin monitoring and logcheck emails, starved
for resources.  At about 05:15, the load fell from about 55 to 15-20,
and munin was again able to run for the next 3 hours.  Then the load
climbed back up.

When I noticed this on reaching a computer here at my office, the load
average was up over 100, and trac.fcgi was using a steady 99% of the
CPU.  Memory did not seem to be an issue, though munin tells me that the
amount committed was growing again during that 3-hour block this morning
while it could run.

I halted both lighttpd and usher, and killed all mtn jobs to let the
system recover.  Once the load was again below 10, which happened
relatively rapidly once I was able to kill trac.fcgi and the mtn jobs, I
restarted both usher and lighttpd.

This is the second time since the 14th that the committed memory has
spiked, something that used to happen regularly but, other than one spike
about a week ago, had not happened since this past summer.  Does anyone
have any idea what has changed recently that might be leaking?

luke