Trade-offs under pressure: heuristics and observations of teams resolving internet service outages (Part II)
Trade-offs under pressure: heuristics and observations of teams resolving internet service outages, Allspaw, Master's thesis, Lund University, 2015
This is part 2 of our look at Allspaw’s 2015 master’s thesis (here’s part 1). Today we’ll be digging into the analysis of an incident that took place at Etsy on December 4th, 2014.
- 1:00pm Eastern Standard Time the Personalisation / Homepage Team for Etsy are in a conference room kicking off a lunch-and-learn session on the personalised feed feature on the Etsy.com homepage
- 1:06pm reports of the personalised homepage having issues start appearing from multiple sources. Instead of the personalised feed, the site has fallen back to serving a generic ‘trending items’ feed. This is a big deal during the important holiday shopping season. Members of the team begin diagnosing the issue using the #sysops and #warroom internal IRC channels.
- 1:18pm a key observation is made: an API call to populate the homepage sidebar shows a huge jump in latency
- 1:28pm an engineer reports that the profile of errors for a specific API method matches the pattern of sidebar errors
- 1:32pm the API errors are narrowed down to requests for data on a single specific shop.