On Tuesday, we sent a secure notification message to all of our customers to announce the release of the new Goals feature. Many of them attempted to sign in within a short period of time to read the message; from 10:30 to 12:30 PDT, our user service was unable to handle the load. We’d like to describe this incident in more detail, explain the factors that contributed to the failure, and outline our plans to prevent similar issues in the future.
Our secure messaging service relies on the user service to provide details like the customer’s email address and name when it sends a secure message. When the messaging service requested this information for all of our customers, the user service began to fall behind. Since we messaged all of our customers at once, many of them attempted to sign in at the same time. The sign-in process depends on our user service, too, so the user service fell even further behind and sign-in attempts began to fail.
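The compounding effect described above can be illustrated with a toy queue model. This is only a sketch: the capacity and request rates are made-up numbers chosen for illustration, not measurements from the incident, and `simulate_backlog` is a hypothetical helper rather than anything in our services.

```python
def simulate_backlog(arrivals_per_sec, capacity_per_sec):
    """Return the queue depth after each second for a fixed-capacity service."""
    backlog = 0
    depths = []
    for arrivals in arrivals_per_sec:
        # Requests beyond capacity wait in the queue; depth never drops below zero.
        backlog = max(0, backlog + arrivals - capacity_per_sec)
        depths.append(backlog)
    return depths


# Hypothetical numbers, for illustration only.
CAPACITY = 100                      # requests the service can finish per second
bulk_lookups = [120] * 5            # lookups issued by the messaging service
sign_ins = [0, 0, 40, 40, 40]       # sign-ins begin once messages arrive
combined = [b + s for b, s in zip(bulk_lookups, sign_ins)]

lookups_only = simulate_backlog(bulk_lookups, CAPACITY)
with_sign_ins = simulate_backlog(combined, CAPACITY)

print(lookups_only)    # backlog from bulk lookups alone
print(with_sign_ins)   # backlog once sign-ins pile on top
```

In this model the bulk lookups alone already exceed capacity, so the backlog grows steadily; once sign-ins arrive on top of them, it grows faster, which mirrors the sequence of events in the incident.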
The Simple web application is designed to provide as little information as possible when a sign-in attempt fails. Though this is a valuable security feature, it obscured the real cause of the failure, leading many customers to believe that they had entered an incorrect passphrase. Many customers then attempted to reset their passphrases, but our already overloaded user service could not handle all of their requests.
Our monitoring alerted us to the high load on the user service shortly after sign-in attempts began to fail. Our backend engineers started to debug the problem immediately, looking for slow queries in the user service code. Simultaneously, our operations engineers began deploying more capacity to handle the increased volume. Within 10 minutes, we increased limits in our database to support the increased demand; within the hour, we deployed upgrades to the user service and began to add more capacity. By 12:30 PDT, we no longer observed errors.
We’re taking several steps to prevent similar incidents in the future. First, we’ve already deployed performance fixes and additional capacity for our user service. We’re also adjusting our monitoring thresholds so that we can add capacity before customers are impacted. Finally, we’re examining all of our services to make sure that they can serve our new customers as Simple continues to grow.
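As a rough illustration of alerting before customers are affected, a monitoring check can separate an early warning level from the level at which requests are likely to fail. The sketch below is an assumption for illustration only: the function name `classify_load` and the 70% and 90% thresholds are hypothetical and do not describe our actual monitoring configuration.

```python
def classify_load(utilization, warn_at=0.70, page_at=0.90):
    """Map a utilization ratio (0.0 = idle, 1.0 = saturated) to an alert level.

    The thresholds are hypothetical defaults, not production values.
    """
    if utilization < 0:
        raise ValueError("utilization must be non-negative")
    if utilization >= page_at:
        return "page"   # close to saturation: wake someone up
    if utilization >= warn_at:
        return "warn"   # early signal: time to add capacity
    return "ok"


for sample in (0.45, 0.75, 0.95):
    print(sample, classify_load(sample))
```

The point of the lower threshold is lead time: a warning well below saturation leaves room to add capacity before sign-ins start to fail.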
- Will Maier, operations engineer
Would you like to help us design, build and monitor Simple? We’re hiring.