Venue: Internet
Donny Nadolny of PagerDuty joins Robert Blumen to tell the story of debugging an issue that PagerDuty encountered when they set up a ZooKeeper cluster that spanned two geographically separated datacenters in different regions. The debugging process took them through multiple levels of the stack: their application, the implementation of the ZooKeeper cluster, the Linux kernel, and the TCP stack. Donny explains how they identified problems at each layer and how they finally gained a complete understanding of the issue as the interaction between multiple bugs, incorrect assumptions, and lesser-known behaviors of TCP. Robert and Donny spend the final part of the show reflecting on lessons learned from this bug, including the need to question what your tools tell you, the importance of persistence in debugging, and how to implement more useful monitoring.
Show Notes
Related Links
- SE-Radio Episode 229: Flavio Junqueira on Distributed Coordination with Apache ZooKeeper (https://www.se-radio.net/2015/06/episode-229-flavio-junqueira-on-distributed-coordination-with-apache-zookeeper/)
- Blog post: The Discovery of Apache ZooKeeper’s Poison Packet (https://www.pagerduty.com/blog/the-discovery-of-apache-zookeepers-poison-packet/)
- Donny Nadolny's SlideShare deck from his Velocity 2016 talk on Debugging Distributed Systems (http://www.slideshare.net/DonnyNadolny/debugging-distributed-systems-velocity-santa-clara-2016)
- TCP Puzzlers (https://www.joyent.com/blog/tcp-puzzlers)
- Wikipedia entry on TCP (https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Protocol_operation)
- Detection of Half-Open (Dropped) Connections (http://blog.stephencleary.com/2009/05/detection-of-half-open-dropped.html)
- Each necessary, but only jointly sufficient (http://www.kitchensoap.com/2012/02/10/each-necessary-but-only-jointly-sufficient/)
- Myth of the Root Cause (http://blog.scalyr.com/2016/10/the-myth-of-the-root-cause/)
Good discussion on an excellent topic! Real world debugging is a topic that is not covered very often because the cases are usually very specific and hard to simplify enough to allow non-experts to follow along.