The registry was mostly down for 60 minutes and partially down for another 17 minutes, starting at 5:03 AM Pacific (12:03 UTC). The root cause was a failure at our caching provider (Fastly): instead of the usual 5-10% or so of requests hitting our servers, 100% of them did. We have a lot of servers, but this 5x-10x spike in load was too much for them. They all became overloaded at once and could not keep up, so about 90% of requests failed.
The cache failure at the CDN was a human-caused accident. Fastly have already given us a detailed explanation of what happened and of the changes they are putting in place to prevent it from happening again. We are working with them on further changes to our own architecture, including cache configuration changes and additional hardware on our side, so that we can withstand a sudden burst of traffic like this in future.
www was also down for the length of the outage. www is a client of the registry just like everybody else, so when the registry is down, www cannot serve package information.
Posted Jun 04, 2014 - 14:25 UTC