On Thursday, November 15th, 2012, the services behind the Simple web and mobile applications failed to handle higher-than-usual load caused by a bug in our transaction ingest pipeline. Between 5:30 AM and 11:00 AM PST, customers experienced intermittent failures to load or update their Activity feeds, and we temporarily disabled our Send Money feature. Ingestion is separate from the systems that approve or reject swipes, deposits, and other transactions; these were unaffected, and card swipes, bill payments, and deposits were all handled quickly and correctly throughout the morning.
At approximately 6:30 AM PST, we noticed that some Activity feeds were loading slowly and, in many cases, failed to load completely. We immediately paged our engineers and they began to investigate the cause.
Simple is built on a system of separate services. We partition these services as much as possible and avoid synchronous inter-service communication so that we can contain failures with a system of rate limiters and retries. However, our web and mobile applications present information from many individual services, so they must rely on synchronous requests to render Activity, Goals, and support messages.
We traced the problem we observed on Thursday to the bill payment service behind our Send Money feature, which Activity contacts to load scheduled and upcoming payments. Our engineers discovered unoptimized database queries that caused the service to respond slowly. We temporarily disabled the Send Money feature while we worked on fixes to the underlying service. These fixes were deployed to production at 8:19 AM PST; while we saw response times improve significantly, customers continued to experience problems.
Our traffic patterns follow a common daily arc, with the heaviest load happening around 6:00 AM PST and peaking between 12:00 PM and 1:00 PM PST. On the 1st and 15th of each month, many customers sign in to check their direct deposits and pay their bills. However, we discovered that the load on the bill payment service was nearly five times greater than we expected based on past traffic. We also noticed that while Activity was now loading correctly, its results were out of date.
When we receive a new transaction from our partners, we transform the incoming data to make it more useful to our customers. Raw bill payments contain a contact ID rather than information about the contact itself. We correct this by looking up the contact ID and inserting the contact's details into the transaction before presenting it in a customer's Activity feed. Our transaction service needs to talk to the payment contacts service to perform this transformation. On Thursday, several bugs in these services came together to amplify the already high traffic.
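To make the transformation concrete, here is a minimal sketch of that enrichment step. The function name, fields, and contact data are all illustrative assumptions, not Simple's actual schema:

```python
# Hypothetical sketch of the enrichment step described above; all names
# and fields are illustrative, not Simple's actual schema.

CONTACTS = {"c-123": {"name": "City Power & Light", "account": "****4821"}}

def enrich_payment(raw_txn, contacts=CONTACTS):
    """Replace a raw payment's contact ID with the contact's details."""
    txn = dict(raw_txn)                      # don't mutate the raw record
    contact_id = txn.pop("contact_id")       # raw payments carry only an ID
    txn["contact"] = contacts.get(contact_id)
    return txn

enriched = enrich_payment({"amount_cents": 5000, "contact_id": "c-123"})
```

In production, the lookup is a synchronous request to the payment contacts service rather than a local dictionary, which is why a slowdown in that service can stall transaction processing.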
First, the transaction service's request rate limiter broke when the payments service failed to respond to its requests in time. Second, the transaction service was configured to retry many times after a single failed request, increasing the load on an already beleaguered service. Lastly, the transaction service was configured with a short timeout, so many slow-yet-successful responses from the payments service were treated as failures by the transaction service.
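The arithmetic behind the amplification is worth spelling out. This sketch uses hypothetical numbers (the post does not state Simple's actual timeout or retry count) to show how a short timeout plus aggressive retries multiplies load on a slow downstream service:

```python
# Illustrative arithmetic only: how a short timeout plus aggressive retries
# multiplies load on a slow downstream service. The timeout and retry
# values are hypothetical, not Simple's actual configuration.

def attempts_per_request(response_time_s, timeout_s, max_retries):
    """If the service responds slower than the timeout, every attempt is
    treated as a failure and retried up to max_retries more times."""
    if response_time_s <= timeout_s:
        return 1                   # fast response: one attempt suffices
    return 1 + max_retries         # slow-yet-successful: every attempt "fails"

# A service answering within the timeout costs one request per transaction,
# but one answering just past it costs (1 + max_retries) requests.
fast = attempts_per_request(0.05, timeout_s=0.5, max_retries=4)  # 1
slow = attempts_per_request(2.0, timeout_s=0.5, max_retries=4)   # 5
```

Under these assumed settings, a service that slows from 50 ms to 2 s sees its request volume multiplied fivefold, which slows it further and feeds the loop.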
Together, these bugs turned a performance problem in the payments database into a persistent problem that blocked our transaction processors. While the transaction processors were spinning in retry loops, nearly 2,600 transactions entered a backlog in our queueing system.
At 9:49 AM PST, our engineers deployed a fix which implemented backoffs in the case of some connection errors. This gave the transaction service enough space to process all but 700 backlogged transactions by 10:00 AM PST. Soon, though, other errors in the same code began to cause the retry problem to recur, and processing of the transactions queue stalled again. Additional fixes were deployed at 10:09 AM PST and the queue was completely processed soon after. With the load removed from the payments service, we re-enabled bill pay.
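A minimal sketch of the retry-with-backoff pattern that fix introduced, assuming a hypothetical `send` callable that raises `ConnectionError` on failure; this is an illustration of the technique, not Simple's actual code:

```python
import random
import time

def call_with_backoff(send, max_retries=5, base_delay=0.1, sleep=time.sleep):
    """Call send(), retrying on connection errors with exponential backoff.

    Hypothetical sketch: `send` and the parameter values are assumptions,
    not Simple's production configuration.
    """
    for attempt in range(max_retries + 1):
        try:
            return send()
        except ConnectionError:
            if attempt == max_retries:
                raise
            # Exponential backoff with jitter gives the downstream service
            # room to recover instead of hammering it in a tight retry loop.
            sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```

Each failed attempt roughly doubles the wait before the next one, so a struggling service sees its inbound retry traffic fall off quickly rather than pile up.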
We are taking immediate steps to address the issues we’ve discovered during this incident. We have already deployed fixes for the misbehaving code and will standardize our approach to similar areas of the system to better control performance under future traffic peaks. We are also extending our monitoring systems to alert us more quickly based on anomalous traffic patterns, errors logged, and queue sizes.
Since this incident occurred on the 15th of the month, a common payday, several of our customers experienced delays in seeing their paychecks appear in their Activity, although the funds had been deposited as expected. We know this situation was especially stressful for many of our customers, and we apologize. As always, we strive to be as transparent as possible when we have problems, and we are thankful for our customers' trust.
– Ian Eure, backend engineer