On 12/31/2012, shortly after 4:00 PM PST (00:00 UTC), we started receiving alerts from our monitoring system that we were unable to reach our transaction processor (hereafter referred to as “processor”). The first alerts came in at 4:02 PM PST, indicating that servers on their end were unreachable. Customer Relations received a call at 4:05 PM PST about a declined transaction, and immediately notified engineering. We also observed that our ingest service, which consumes data from this partner, had stopped. At 4:09 PM PST we contacted our processor to escalate the issue. They confirmed that they were aware of the problem and taking action. We also notified our partner bank, The Bancorp Bank, though they were not affected.
The root cause of this outage was a “leap second” added by an upstream NTP server (time.apple.com). That server has since been removed from the pool of NTP servers that our processor uses. A similar leap second issue was widely reported in the news last summer. Leap seconds were introduced into timekeeping in 1972, and some technology companies have techniques for dealing with these changes: Google uses what they call the “leap smear” to spread out the added second over a longer period of time. When a leap second is added, it is usually at the end of June or December.
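To make the smear idea concrete, here is a minimal sketch of a linear smear: instead of inserting a whole extra second at once, a clock absorbs it gradually over a window leading up to the leap. This is an illustration only, not our processor's or Google's implementation. The function name `smear_offset`, the 20-hour window, and the linear shape are all assumptions chosen for the example; the leap instant used is last summer's leap second at the end of June 30, 2012 UTC.

```python
from datetime import datetime, timedelta, timezone

# Illustrative values only (assumptions, not anyone's production settings).
# LEAP_INSTANT is the moment just after the inserted second has elapsed.
LEAP_INSTANT = datetime(2012, 7, 1, 0, 0, 0, tzinfo=timezone.utc)
SMEAR_WINDOW = timedelta(hours=20)


def smear_offset(now, leap_instant=LEAP_INSTANT, window=SMEAR_WINDOW):
    """Return how much of the extra second (0.0 to 1.0) has been absorbed at `now`.

    Hypothetical helper: a smearing time server would run its clock slightly
    slow during the window, so the full second is absorbed by `leap_instant`
    and no client ever sees a sudden one-second step.
    """
    start = leap_instant - window
    if now <= start:
        return 0.0  # smear has not started yet
    if now >= leap_instant:
        return 1.0  # the whole leap second has been absorbed
    # Linear ramp between the start of the window and the leap instant.
    return (now - start) / window


if __name__ == "__main__":
    for hours_before in (24, 20, 10, 1, 0):
        t = LEAP_INSTANT - timedelta(hours=hours_before)
        print(f"{hours_before:>2}h before leap: {smear_offset(t):.3f} s absorbed")
```

With a ramp like this, the largest correction any client sees between two consecutive syncs is a tiny fraction of a second, which is the property that makes smearing attractive for systems that do not handle a repeated or 61st second gracefully.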
Around 4:15 PM PST these events were followed by reports that card swipes, ATM withdrawals, and our card activation line were also affected. Customer Relations staff confirmed these reports by testing transactions with Square and attempting withdrawals at nearby ATMs. We updated our status blog at 4:22 PM PST and continued to handle a high volume of customer contacts via phone, email, support chat, and Twitter.
At 4:22 PM PST our processor began a full shutdown and restart of the systems that handle Visa (credit) and Star (ATM/debit) transactions for Simple. At 4:29 PM PST we decided to pause signups for new customers, since our ability to order new cards was affected.
Our processor began bringing their systems back online at 5:10 PM PST, followed by a restoration of all customer data. Restoration was completed at 6:15 PM PST, all services were re-enabled, and the first successful Visa transaction came in one minute later. At 6:20 PM PST we re-enabled signups. Staff continued testing card swipes and ATM withdrawals, and by 6:50 PM PST it appeared that most things were back to normal.
Staff still observed some intermittent swipe failures with PIN transactions for a small number of customers. At 7:52 PM PST, our processor resolved all transaction issues by performing another restore of customer data, and we confirmed that no more customers were affected. We posted the all-clear on our status blog at 8:09 PM PST.
To all our customers who were affected by the outage, please accept our sincere apologies. We are making changes to our processing architecture that we expect will mitigate this kind of outage in the future. If you enjoy solving hard problems and want to help build the future of banking, check out our job openings. We’d love to talk.
- Chris Brentano, operations engineer