


Just before Halloween 2021, Roblox engineers experienced a horror story: a service outage that also took down critical monitoring systems. It seemed like the issue was a hardware problem, but it wasn’t. Users were frustrated, and the clock was ticking. After three full days of downtime, service was finally restored on Halloween day.

While the incident itself was an IT nightmare, Roblox’s detailed technical post-mortem several months later was an excellent way to bounce back. The post-mortem was a pleasant, late Christmas gift to users and SysAdmins everywhere. Thanks to transparency from Roblox, we have an outage case study to learn from. In this post, we’ll summarize the scope and root cause of the outage, explain what other ITOps teams can learn from it, and consider how SolarWinds® Pingdom® can help you reduce the risk of extended downtime in your environment.

The outage began on October 28, 2021, and was resolved 73 hours later on October 31. The outage affected 50 million Roblox users. While we don’t have a specific dollar cost for the outage, it was a significant incident for Roblox and HashiCorp.

Roblox uses HashiCorp’s HashiStack to manage its global infrastructure:

- Nomad: Schedules containers on specific hardware nodes and checks container health.
- Vault: A secrets management solution for securing sensitive data like credentials.
- Consul: An identity-based networking solution that provides service discovery, health checks, and session locking.

Roblox was one of HashiCorp’s hallmark customers. Therefore, the Roblox outage impacted the reputation of HashiCorp as well. Fortunately, HashiCorp engineers worked with Roblox to troubleshoot and triage the issue, demonstrating their commitment to customer success even when the going gets tough.

The root cause of the outage

From a technical perspective, two issues together formed the root cause of the Roblox outage:

- Roblox enabled a new streaming feature on Consul at a time when database reads and writes were unusually high.
- The BoltDB database, used by Consul, experienced a performance issue.

While the above two factors made up the root cause, several other factors contributed to the length of the outage.
