SE Radio 282: Donny Nadolny on Debugging Distributed Systems

Venue: Internet
Donny Nadolny of PagerDuty joins Robert Blumen to tell the story of debugging an issue that PagerDuty encountered when they set up a ZooKeeper cluster spanning two geographically separated datacenters in different regions. The debugging process took them through multiple levels of the stack, starting with their application and continuing through the implementation of the ZooKeeper cluster, the Linux kernel, and the TCP stack. Donny explains how they identified problems at each layer and how they finally gained a complete understanding of the issue as the interaction between multiple bugs, incorrect assumptions, and less well-known behaviors of TCP. Robert and Donny spend the final part of the show reflecting on lessons learned from this bug, including the need to question what your tools tell you, the importance of persistence in debugging, and how to implement more useful monitoring.

1 comment
  • Good discussion on an excellent topic! Real world debugging is a topic that is not covered very often because the cases are usually very specific and hard to simplify enough to allow non-experts to follow along.