I wrote this text because it is quite hard to articulate a fault-finding process. From an initially unknown problem, there are many different vectors to investigate. The failure conditions listed are all things that have happened to me; I haven't googled anything, just written them up with better structure.

Please note, I can use step-through debuggers, but I generally feel that they are a slow solution. They are useful for looking at an unknown segfault that you can't spot, but as most errors are reported under Unix systems, they offer little additional information. I use them when stuck, and not before.

Situation:

Some of your company's code is failing a percentage of the time. Instead of a page, a blank white screen is returned.

My Process:

If you have a well set up system, working through this list is fast.

  • The first thing is a binary search on the platform to partition what is failing.
    • Does the fault occur in all browsers?
    • If I look at the client diagnostic tools (e.g. Firebug or Venkman), am I getting valid output?
    • Are there any faults written in the web server's error log (or the general syslog, if relevant)?
    • Look at the log book: what has changed recently?
    • On a test server, run the code without resource constraints, and see if the issue still occurs. If this stops it breaking, maintain service availability by applying this change whilst you look for the actual fault.
    • Assuming the system has user profiles, does the fault occur for all types of user?
    • How many features have this fault (what area of the code)?
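
The partitioning questions above can often be answered from request logs rather than by hand. Here is a minimal sketch in Python, assuming each request has already been parsed into a record; the field names `browser`, `user_type`, `feature` and `ok`, and the sample data, are hypothetical and not taken from any real log format.

```python
from collections import defaultdict

def failure_rates(records, dimension):
    """Return {value: (failures, total, rate)} for one dimension of the records."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for rec in records:
        value = rec[dimension]
        totals[value] += 1
        if not rec["ok"]:
            failures[value] += 1
    return {
        value: (failures[value], totals[value], failures[value] / totals[value])
        for value in totals
    }

# Hypothetical sample data, invented for illustration only.
SAMPLE = [
    {"browser": "firefox", "user_type": "admin", "feature": "reports", "ok": False},
    {"browser": "msie", "user_type": "admin", "feature": "reports", "ok": False},
    {"browser": "firefox", "user_type": "guest", "feature": "reports", "ok": True},
    {"browser": "msie", "user_type": "guest", "feature": "search", "ok": True},
]

if __name__ == "__main__":
    for dim in ("browser", "user_type", "feature"):
        print(dim, failure_rates(SAMPLE, dim))
```

In this invented sample the failures split evenly across browsers but all fall on the admin user type, which would point the search at the user profile code rather than the client.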

Possible causes:

  • Sometimes high usage will lead to database “gone away” errors (literal text). This can be resolved by altering how your interpreter uses DB connections, or by increasing the number of DB workers.
  • Sometimes congestion will lead to a saturated DB, where some queries hang rather than execute. This can be resolved by improving your SQL. Frozen queries should be aborted (via tools) to improve the service for everything else.
  • It is possible that the number of clients has exceeded the number of Apache workers. To resolve, add more workers, or look at why your asset processing is taking so long.
  • A long-running SPA may exhaust RAM on the client. This is mostly an MSIE problem. Use the MSIEleaks tool and fix your memory leaks, or build your client-side scripts so that they pass id strings, not DOM objects.
  • If you are starting and stopping XML parsers a lot, it is possible to exhaust handles on Expat. They are still exhausted if you kill all PHP processes; to resolve this you have to reboot. Ensure your code deallocates all the Expat resources that it uses, specifically in the error states.
  • With massive concurrency, it is possible to exhaust IO handles (as allocated by the kernel). To fix, don't open handles that you don't need, and look at your architecture; there is probably something that should be changed.
  • Don't forkbomb.
  • Be careful about creating Perl processes; the maximum limit for Perl is lower than for other process types.
  • On some filesystems, or via some tools, writing more than 2GB per file leads to problems. Analyse whether you need this (yes for DBs, which are written specifically to deal with this...).
  • See if a complex computation could be performed by a different library, written in a higher-performance language.
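
Several of the causes above come down to handles that are never released on an error path. The following is a minimal sketch in Python (not the PHP and Expat code described above) of releasing a resource on every exit path. `CountingResource` and `parse_record` are hypothetical names made up for illustration; the counter stands in for whatever scarce handle your system allocates.

```python
import io

class CountingResource:
    """Stand-in for a scarce handle (parser, file, socket); counts live instances."""
    open_count = 0

    def __init__(self, data):
        self._stream = io.StringIO(data)
        CountingResource.open_count += 1

    def read(self):
        return self._stream.read()

    def close(self):
        # Guard against double close so the counter stays accurate.
        if not self._stream.closed:
            self._stream.close()
            CountingResource.open_count -= 1

def parse_record(data):
    """Parse 'key=value' text; the handle is released on success and on failure."""
    res = CountingResource(data)
    try:
        key, sep, value = res.read().partition("=")
        if not sep:
            raise ValueError("malformed record: %r" % data)
        return key, value
    finally:
        res.close()  # runs in the error state as well
```

The point of the `finally` clause is that a malformed record raises an error but still leaves `CountingResource.open_count` at zero, so repeated failures cannot slowly exhaust the pool.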

Actions:

  • For the above list of items, was any specific data returned? If so, resolution should be easy.
  • If this is practical, run all the test cases, specifically the performance ones on complete subsystems.
  • Turn on crash reporting features (to write to a log). This will tell you more about the failures. I have crash handlers etc.
  • If you have no useful analysis, fall back to xdebug. I repeat everyone else's notes about not running it on a production machine. This will provide better data about segfaults, if other mechanisms failed. Alter what the xdebug trace tells you.
  • Run xdebug in interactive mode; this can't be done on a production server. I don't need to use this approach very often, as I can read code quite well, and the earlier mechanisms generally tell me what I need to know. I note that Zend likes people using the debugger, as it encourages sales of their IDE.
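
As an illustration of the crash reporting step, here is a minimal sketch in Python of a crash handler that writes uncaught exceptions to a log. It is not the handler I use in PHP; `install_crash_logger` is a hypothetical helper name, and it relies only on the standard `sys.excepthook`, `traceback` and `logging` facilities.

```python
import logging
import sys
import traceback

def install_crash_logger(logger):
    """Route uncaught exceptions to `logger`, then defer to the previous hook."""
    previous_hook = sys.excepthook

    def hook(exc_type, exc_value, exc_tb):
        # Format the full traceback so the log entry is self-contained.
        text = "".join(traceback.format_exception(exc_type, exc_value, exc_tb))
        logger.critical("uncaught exception\n%s", text)
        previous_hook(exc_type, exc_value, exc_tb)

    sys.excepthook = hook
    return hook
```

Calling `install_crash_logger(logging.getLogger("crash"))` once at startup, with that logger configured to write to a file, should leave a traceback in the log for every crash that would otherwise only reach stderr. For segfaults in particular, Python's standard `faulthandler.enable()` serves a similar purpose.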

Fault Analysis
