Venue: Internet
John Allspaw, CTO of Etsy, speaks with Robert Blumen about systemic failures and outages. Topics include: how systems are defended against outages; why they fail anyway; why failures are not entirely preventable; why outages involve multiple failures; the time that Etsy identified its own office as a potential source of fraud; the human as part of the system; whether human error is an important component of failure; understanding human action during failures; what we can learn from outages; effective post-mortems; testing as a way of preventing failure; the limitations of testing; and testing in production.
Show Notes
Related Links
- Richard Cook MD: How complex systems fail http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf
- Each necessary, but only jointly sufficient http://www.kitchensoap.com/2012/02/10/each-necessary-but-only-jointly-sufficient/
- The Myth of the Root Cause: How Complex Web Systems Fail http://blog.scalyr.com/2016/10/the-myth-of-the-root-cause/
- Irreversible Failures: Lessons from the DynamoDB Outage http://blog.scalyr.com/2015/09/irreversible-failures-lessons-from-the-dynamodb-outage/
- Fault Injection in Production http://queue.acm.org/detail.cfm?id=2353017
- Hindsight bias https://en.wikipedia.org/wiki/Hindsight_bias
- Chaos engineering http://principlesofchaos.org/
- Noah Sussman on Twitter https://twitter.com/NoahSussman
- John Allspaw’s blog http://www.kitchensoap.com/
- Book: Web Operations: Keeping the Data on Time by John Allspaw https://www.amazon.com/Web-Operations-Keeping-Data-Time-ebook/dp/B0043M4Z34/
- Book: The Art of Capacity Planning: Scaling Web Resources by John Allspaw https://www.amazon.com/Art-Capacity-Planning-Scaling-Resources/dp/0596518579