Monthly Archives: August 2012

Delayed and failed payments

We’re investigating delayed and failed payments scheduled by our customers. (^Will, 11:30 PDT 2012-08-27)

12:50 PDT 2012-08-28: Our test payments now succeed and this issue is resolved. Please send a support message if you have questions about your own payments that may have failed. Thanks for your patience! (^Will)

Activity, password reset and messaging postmortem

On Tuesday, we sent a secure notification message to all of our customers to announce the release of the new Goals feature. Many of our customers attempted to sign in to read the message in a short period of time; from 10:30 to 12:30 PDT, our user service was unable to handle the load. We’d like to describe this incident in more detail, explain the factors that contributed to the failure, and outline our plans to prevent similar issues in the future.

Our secure messaging service relies on the user service to provide details like the customer’s email address and name when it sends a secure message. When the messaging service requested this information for all of our customers, the user service began to fall behind. Since we messaged all of our customers at once, many of them attempted to sign in at the same time. The sign-in process depends on our user service, too, so it fell even further behind and sign-in attempts began to fail.

The Simple web application is designed to provide as little information as possible when a sign-in attempt fails. Though this is a valuable security feature, it obscured the real cause of the failure, leading many customers to believe that they had entered an incorrect passphrase. Many customers then attempted to reset their passphrases, but our already overloaded user service could not handle all of their requests.

Our monitoring alerted us to the high load on the user service shortly after sign-in attempts began to fail. Our backend engineers started to debug the problem immediately, looking for slow queries in the user service code. Simultaneously, our operations engineers began deploying more capacity to handle the increased volume. Within 10 minutes, we raised limits in our database to support the increased demand; within the hour, we deployed upgrades to the user service and began to add more capacity. By 12:30 PDT, we no longer observed errors.

We’re taking several steps to prevent similar incidents in the future. First, we’ve already deployed performance fixes and additional capacity for our user service. We’re also adjusting our monitoring thresholds so that we can add capacity before customers are impacted. Finally, we’re examining all of our services to make sure that they can serve our new customers as Simple continues to grow.
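To illustrate the idea of alerting before customers are impacted, here is a minimal sketch of a two-level load check. It is not our actual monitoring code: the names `LoadThresholds` and `classify_load`, and the example threshold values, are assumptions made for illustration. The point is only that a lower "warn" level gives operators time to add capacity before the "critical" level is reached.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadThresholds:
    # Hypothetical values: fractions of known capacity, not real settings.
    warn: float      # add capacity when utilization reaches this level
    critical: float  # customers are likely affected at this level


def classify_load(requests_per_second: float, capacity_rps: float,
                  thresholds: LoadThresholds) -> str:
    """Return 'ok', 'warn', or 'critical' for the current utilization."""
    if capacity_rps <= 0:
        raise ValueError("capacity_rps must be positive")
    utilization = requests_per_second / capacity_rps
    if utilization >= thresholds.critical:
        return "critical"
    if utilization >= thresholds.warn:
        return "warn"
    return "ok"
```

With thresholds of 0.6 and 0.9, for example, a service running at 70% of capacity would be reported as "warn", well before sign-in attempts start to fail.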

- Will Maier, operations engineer

Would you like to help us design, build and monitor Simple? We’re hiring.

Web app offline

We’re debugging issues with customer logins and simple.com/activity. We’ll update here shortly. Thanks for your patience! (^Will, 10:57 PDT)

Update 12:16 PDT: We’ve resolved the load issues that disrupted site access earlier today. We’re continuing to investigate the underlying cause. Thanks again for your patience! (^Will)

Disappearing transactions

We’ve been tracking a bug that causes transaction holds to appear briefly and then disappear in customers’ Activity views. Transactions affected by this bug will appear once they are finally settled by the merchant, and we’re currently testing fixes for the underlying issue. Please contact support if you have any questions. Thanks! (Will, 12:00 PDT)

2:30 PDT: This issue is resolved. (Will)

Transaction, Onboarding, Activity Post-Mortem

On Monday and continuing through much of Tuesday, the team here at Simple noticed several interruptions to our services. Simple customers experienced long delays before their card swipes and other transactions appeared in their Activity view. Additionally, for several hours on Tuesday, people were unable to redeem their invites to join Simple. And for 15 minutes on Tuesday, customers could not view their accounts in the simple.com web application. We’d like to explain the factors that contributed to these incidents and describe the actions we’re taking to provide even more reliable service to our customers.

Simple runs several services that communicate with our banking and transaction processing partners. These services order cards for new customers, clean data from card swipes, and connect our web and mobile applications to our customers’ data.

On Monday, our registration service began to receive unexpected data from our partner when it tried to create accounts and order cards for new Simple customers. This data then spread without detection from the registration service to the service responsible for fetching data about customer transactions. While the registration service was able to continue running without interruption, the transaction service began to fail whenever it encountered this data.

Our engineers noticed the failures in the transaction service and restarted it on Monday evening, observing that transaction ingestion resumed after the restart. Our monitoring did not alert, however, when transaction ingestion failed again soon after the restart. The service did not successfully process any new transactions that night. 

The next day, customers continued to report that transactions had not shown up as expected in their Activity. Our engineers began debugging the service, looking for errors that may have been introduced in a recent deploy.

Around this time, our engineers also discovered that the unexpected data returned to the registration service had caused the creation of incorrect account records for a small number of new customers. To limit the damage, engineers decided to shut down the registration service until the source of the problem was understood.

When the registration service was turned off, a new and unexpected dependency between the customer web application and the registration service caused attempts to reach simple.com/activity to fail. Our engineering team discovered this 15 minutes later and turned the registration service back on to restore customer access. We then blocked further registrations but allowed the underlying service to continue to run.

An unrelated investigation into delays in the initial funding of some customer accounts revealed log messages generated by the registration service that included the unexpected data from our partner. Engineers then traced the spread of the data to the transaction ingestion service, and deployed a fix at 12:15 PM PDT.

Though all transaction data had been retrieved from our partner transaction processor at this time, the volume of transactions ingested caused the workers responsible for processing this data to fall behind. Another fix was deployed to speed up processing of the transaction backlog. All transactions had been processed by 12:35 PM PDT, and registration was allowed to resume at 4:10 PM PDT.

Our response to these failures was delayed because our automated monitoring did not catch the abnormally low transaction ingestion rate, the incorrect records for some new customers or the increased error rates on simple.com/activity. Though we already collect hundreds of thousands of data points about our service, we had not identified these important metrics or ensured that changes in their behavior would alert our engineering team. We’re adding more metrics and live tests to our services to help catch these and similar issues more quickly in the future. 
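As a rough illustration of the kind of check that would have caught the stalled ingestion, the sketch below counts recently ingested transactions in a sliding time window and reports a stall when the count drops below a minimum. The class name `IngestionRateMonitor`, its parameters and the window sizes are hypothetical; this is a sketch under those assumptions, not a description of our monitoring system.

```python
from collections import deque
from typing import Deque


class IngestionRateMonitor:
    """Track ingestion timestamps and flag an abnormally low rate."""

    def __init__(self, window_seconds: float, min_events: int) -> None:
        self.window_seconds = window_seconds
        self.min_events = min_events
        self._timestamps: Deque[float] = deque()

    def record(self, timestamp: float) -> None:
        # Call once per successfully ingested transaction.
        self._timestamps.append(timestamp)

    def is_stalled(self, now: float) -> bool:
        # Drop events older than the window, then compare with the minimum.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) < self.min_events
```

Note that a monitor with no recorded events reports a stall. That is deliberate in this sketch: a service that ingests nothing after a restart is exactly the case an alert should cover.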

We’ve also updated our services to explicitly check for the problematic data we received, and are examining all of our services to ensure that they properly validate all data. We had already begun to modify our web application to gracefully handle failures in our own services when these incidents happened, but we will continue to improve the application’s resilience. Finally, our customer relations team is working with our partner and our customers to correct any errors that resulted from the unexpected data. 
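The two ideas in this paragraph, validating data at the point where it enters a service and degrading gracefully when a dependency is unavailable, can be sketched as follows. Everything here is illustrative: the field names (`account_id`, `card_status`), the allowed status values, and the helpers `parse_account` and `call_with_fallback` are assumptions, and the real partner data format is not shown.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping, TypeVar

T = TypeVar("T")

# Hypothetical set of values; the real partner format is not shown here.
ALLOWED_CARD_STATUSES = {"ordered", "shipped", "active"}


class InvalidPartnerData(ValueError):
    """Raised when a partner response fails validation."""


@dataclass(frozen=True)
class AccountRecord:
    account_id: str
    card_status: str


def parse_account(payload: Mapping[str, Any]) -> AccountRecord:
    """Reject unexpected data before it can spread to other services."""
    account_id = payload.get("account_id")
    card_status = payload.get("card_status")
    if not isinstance(account_id, str) or not account_id.strip():
        raise InvalidPartnerData("account_id missing or not a non-empty string")
    if not isinstance(card_status, str) or card_status not in ALLOWED_CARD_STATUSES:
        raise InvalidPartnerData(f"unexpected card_status: {card_status!r}")
    return AccountRecord(account_id=account_id, card_status=card_status)


def call_with_fallback(fetch: Callable[[], T], fallback: T) -> T:
    """Return fetch(), or the fallback if the dependency is unreachable."""
    try:
        return fetch()
    except ConnectionError:
        return fallback
```

In this sketch, a page could call `call_with_fallback` around a request to a non-essential service and render with a default value instead of failing outright when that service is down.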

It takes many moving parts to make banking easier for our customers. Our own services, our partners’ services and the infrastructure that ties them all together are complex, and we work hard to engineer systems that can continue to operate even under adverse or unpredictable conditions. We’re using what we’ve learned to improve the reliability and performance of the systems that make Simple possible.

- Will Maier, operations engineer

Onboarding Delays

We’re investigating errors during the customer onboarding process. Until we resolve this issue, customers will not be able to redeem their invites. Please message our customer support team if you have any questions. Thanks for your patience! (Will, 11:43 PDT)

17:19 PDT: We’ve fixed the issue preventing users from claiming their invites and have re-enabled onboarding. New monitoring checks are in place to help identify this issue more quickly. Thanks again for hanging with us! (Will)

Delays displaying transactions

Since yesterday evening, purchases made on Simple cards have taken longer than usual to appear in our customers’ transaction histories. We’re working on a fix for this now, but please message our customer relations team if you have questions. Thanks for your patience! (Will, 10:49 PDT)

13:12 PDT: We’ve identified the underlying issue and are processing the backlog of transactions. (Will)

13:41 PDT: The transaction backlog has been processed. We’re adding additional monitoring to catch this problem in the future. Thanks again for hanging with us! (Will)