pidgin.im
Luke Schierer
lschiere at pidgin.im
Fri Jan 18 10:51:54 EST 2008
On Fri, Jan 18, 2008 at 10:34:45AM -0500, Luke Schierer wrote:
> Thursday afternoon, our load average jumped from between 0 and 15, with
> the usual being between 0 and 10, to something more like 20. The server
> appears to have mostly handled this. I suspect this was caused by the
> massive growth of "committed" ram, which grew to about 8.5G between
> Thursday 10:00 and Thursday 20:00. At approximately 23:00, also on
> Thursday, either someone noticed this, and restarted something (probably
> lighttpd), or the problem fixed itself.
>
> However, today, starting at about 01:00, the load average started going
> MUCH MUCH higher. This caused the server to start to fall over, as
> various processes, notably the munin monitoring and the logcheck emails,
> began to starve for resources. At about 05:15, the load fell from about
> 55 to 15-20, and munin was again able to run for the next 3 hours. Then
> the load climbed back up.
>
> When I noticed this on reaching a computer here at my office, the load
> average was up over 100, and trac.fcgi was using a steady 99% of the
> CPU. Memory did not seem to be an issue, though munin tells me that the
> amount committed was growing again during that 3 hour block this morning
> while it could run.
>
> I halted both lighttpd and usher, and killed all mtn jobs to let the
> system recover. Once the load was again below 10, which happened
> relatively rapidly once I was able to kill trac.fcgi and the mtn jobs, I
> restarted both usher and lighttpd.
>
> This is the second time since the 14th that the committed memory has
> spiked, something that used to happen regularly but, other than one spike
> about a week ago, had not happened since this past summer. Does anyone
> have any idea what has changed recently that might be leaking?
>
> luke
Within half an hour of restarting lighttpd, trac.fcgi was again using
99% of the CPU steadily. The load average started climbing. I started
by nicing the process to 10, then 15, then 20, but the load continued to
climb, and I started seeing evidence of processes starving for CPU.
Notably:
* the email I am now replying to did not make it to my own inbox.
* munin's monitoring update failed.
Email and munin have in the past been key indicators of homing's overall
health. Something is slamming trac, and I have no clue what.
I've killed it for now; hopefully Daniel can look at it soon.
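
For anyone who wants to catch this sort of climb before munin and email
start failing, here is a minimal sketch of a load check. It assumes a
Linux host with the usual /proc/loadavg format. The function names and
the thresholds (10 as the top of the normal range, 20 as the level of
Thursday's spike) are illustrative assumptions taken from the numbers
above, not anything actually deployed on the server.

```python
def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    fields = text.split()
    return tuple(float(value) for value in fields[:3])


def load_status(one_minute, warn=10.0, critical=20.0):
    """Classify a 1-minute load average against assumed thresholds."""
    if one_minute >= critical:
        return "critical"
    if one_minute >= warn:
        return "warn"
    return "ok"


if __name__ == "__main__":
    # Linux-specific path; other systems expose load averages differently.
    with open("/proc/loadavg") as handle:
        one, five, fifteen = parse_loadavg(handle.read())
    print(f"load {one:.2f} {five:.2f} {fifteen:.2f}: {load_status(one)}")
```

Run from cron at a low nice value, a check like this could at least log
when the load leaves the normal range, although it would starve along
with everything else once the machine is fully saturated.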
luke