Josh Fruhlinger
Contributing writer

8 biggest IT disasters this year

Feature
28 Dec 2021 · 12 mins
Technology Industry

From security flaws to software engineering fails, these high-profile IT disasters wreaked real-world havoc this year. Let them serve as cautionary tales.


IT is synonymous with business operations for just about any company of any size. So when tech goes down, the company can go down with it.

IT failure, whether of a complex system or of a project, is increasingly shooting to the top of the business news section, where its impact can become even more detrimental and more embarrassing.

We’ve gathered eight of the biggest tech crises of 2021 to spotlight the kinds of near-catastrophic IT issues that can not only arise but also have an outsize impact on your business. Beyond schadenfreude, we hope these tales of IT disaster have lessons for you, even if your organization is nowhere near as big or the stakes aren’t as high as they were for the protagonists of these tales.

Why you should design better UIs (and not make your creditors mad)

Many companies tend to take an “if it ain’t broke, don’t fix it” attitude toward their IT tools, and if you’ve ever been part of a botched upgrade or rollout, you know why. But that can result in some truly outdated systems in production use with UIs dating from the earliest days of the software industry — which in turn can mean usability problems with real-world consequences.

One of Citibank’s back-end systems is a good example of this trend, and is one of the main causes of a half-billion dollar screwup. The story goes like this: Citibank was attempting to send a $7.8 million interest payment on behalf of Revlon, one of its customers, to several of Revlon’s creditors. Doing that in Flexcube, an ancient piece of in-house Citibank software, was a particularly clunky process: Citibank’s employees had to set up a transaction as if they were paying off the whole loan so that the interest could be calculated correctly, then check multiple boxes in order to send the bulk of the payment to an internal Citibank account while only the interest portion went out to creditors. Despite the fact that three different people signed off on this transaction for Revlon, it went through without all the proper boxes checked, and $900 million, most of which wasn’t due to creditors until 2023, was sent out.

You may find it surprising that this sort of mistake isn’t unheard of, and that the benefiting party usually returns the money to the company that made the goof. But this time around things went differently: More than half the money sent out went to various hedge funds still bitter that the terms of the loan had been previously renegotiated to Revlon’s advantage. They said they regarded the money as an early payment of the debt they were owed, and this year a judge ruled that they didn’t have to give it back.

The big lesson here is to at least modernize your UIs so that employees can perform their duties in a streamlined, coherent fashion. A second lesson: mistakes are less painful when the people on the receiving end aren’t mad enough at you to take advantage of them.
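One guard against the kind of slip described above is to make the operator state the amount they intend to send out and have the software refuse any transaction whose outgoing total differs. The following is a minimal sketch of that idea only. It is not how Flexcube works; the class, field, and function names are hypothetical, and the figures are the rounded ones quoted in this article.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class PaymentInstruction:
    """Hypothetical payment instruction; all field names are illustrative."""
    intended_external: Decimal  # what the operator means to send to creditors
    total_booked: Decimal       # full amount the transaction is set up for
    internal_portion: Decimal   # amount routed to an internal account


def external_amount(p: PaymentInstruction) -> Decimal:
    """Amount that would actually leave the bank."""
    return p.total_booked - p.internal_portion


def validate(p: PaymentInstruction) -> None:
    """Raise ValueError if the outgoing amount differs from the stated intent."""
    actual = external_amount(p)
    if actual != p.intended_external:
        raise ValueError(
            f"outgoing amount {actual} does not match intended {p.intended_external}"
        )


# Boxes "checked": the principal stays in-house, only interest goes out.
ok = PaymentInstruction(Decimal("7800000"), Decimal("900000000"), Decimal("892200000"))
validate(ok)

# Boxes "unchecked": nothing is routed internally, so the whole sum would leave.
bad = PaymentInstruction(Decimal("7800000"), Decimal("900000000"), Decimal("0"))
try:
    validate(bad)
except ValueError as err:
    print(err)
```

The point of the design is that the check compares one number against another rather than relying on reviewers to notice which boxes are ticked.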

Sacre bleu! French bank customers see each other’s accounts

Customers of the French bank LCL logged in to their banking app on Feb. 23 only to find that they were looking at someone else’s information. The word quickly spread on Twitter and many speculated that this could have been the result of a cyberattack. But according to the bank itself, it was actually the upshot of a software error that was corrected within a day.

Of course, these sorts of development mistakes are a sign of internal failures at the companies where they occur, and they especially shouldn’t occur in the banking industry. The fallout illustrated the typical dance that follows on from these kinds of mistakes, with the company at fault minimizing matters: LCL said that no personal information was revealed, that customers could only see other customers’ accounts but not transfer money, and perhaps only a few hundred customers were affected. Others pointed out that transaction information could’ve been used to suss out customer identities, and potentially tens of thousands of users were logging in while the bug was running on live code. In the end, LCL had to scramble to avoid a massive fine from European privacy regulators.

When software keeps the cell door locked

In 2019, the Arizona Legislature passed a law to allow certain prison inmates convicted of nonviolent offenses to complete programming in state prisons that would accelerate their release. But whistleblowers in February revealed that, more than a year later, the software that keeps track of prisoner release eligibility still hadn’t been updated to accommodate the new law. While the state insists eligible prisoners can and do have their sentences recalculated manually, the truth is that many may not know they’re eligible for release, or don’t have advocates on the outside to press their case, and so are languishing in prison when by law they have the right to go free.

There are several lessons for IT here. One is the importance of building flexibility and extensibility into any system. Another is that software isn’t just software: It has real and profound impacts on human lives. Finally, there’s the question of how law can be implemented in the form of code — and whether the algorithms for enforcing the law should be developed during the legislative process rather than being left to be written after it’s already on the books.

Maine’s ancient HR system limps on

The state of Maine’s HR and payroll is, as the Portland Press Herald describes it, run by a “40-year-old system programmed in an obsolete language only one state employee knows how to use.” The system had already outlasted a 2016 attempt to replace it that flopped; another attempt, which was supposed to wrap up in 2020, imploded in mutual acrimony this past March, as Workday, the company hired to roll out a new cloud-based system for Maine, walked away from the project.

Rollouts of ERP systems and similar platforms are notoriously disaster-prone, and Maine’s payroll needs were devilishly complex (state police were paid different hourly rates if they carried a weapon, worked with a K9, or wore scuba gear, for instance). At the core of the dispute is a story that should sound familiar to anyone who’s been involved in a big project like this: Maine says that the system came online with a 50% error rate, and Workday says Maine’s data as imported into the system was hopelessly riddled with errors. More fundamentally, it seems that Maine was hiring staffers to work on the project who didn’t have the needed skills, and the state wasn’t willing to pay enough to find workers who could make the grade. Throw in some accusations of nepotism and sexual harassment and you have a real IT management mess. Maine is still using its 40-year-old HR system.

Amazon’s leave problems

If your takeaway from those previous two items is that government is incapable of competent project management, we regret to inform you that a not dissimilar crisis came to light this year in a private sector company — and not just any private sector company, but Amazon, the archetype of the hyperefficient new economy that IT and the web made possible.

A New York Times investigation revealed that Amazon’s internal processes for offering various types of leave to its employees are extremely broken. This has resulted in a litany of horror stories affecting white- and blue-collar workers alike, such as employees being fired for not showing up to work even though they’re on approved leave, new mothers on maternity leave seeing mysterious cuts in their paychecks, and an injured worker on disability forced to sell his wedding ring for cash because his checks simply stopped showing up.

It turns out Amazon manages its leave system using multiple software products from a variety of vendors, a legacy of its rapid initial growth, so perhaps the lesson here is that the choices you make early in a company’s history may reverberate years or decades later. Like the Arizona prison system, Amazon tries to make up for IT dysfunction with human labor: 67 full-time employees are dedicated to inputting data on employee leave, a job so stressful that many end up needing to take leaves of absence themselves.

Eating too much of your own dog food

On Oct. 4, people all over the world were unable to access Facebook, Instagram, or WhatsApp, as all the services run by the company now known as Meta were disconnected from the internet. We won’t get too deep into the actual cause of the crisis, which involved a Border Gateway Protocol error that essentially severed Facebook’s services, including its DNS servers, from the rest of the internet. Instead, we want to focus on one detail that might be relevant to any IT shop, even those that aren’t part of one of the largest tech companies in the world.

Early in the outage, New York Times tech reporter Sheera Frenkel reported that Facebook employees couldn’t enter company HQ because their ID badges no longer opened the doors. This in turn prevented techs from getting physical access to the servers they needed to fix the overall problem. Improbably, Facebook’s electronic door locks were powered by … Facebook. It seems that Facebook is rather obsessed with running all its internal systems on Facebook’s own infrastructure, which meant its in-house communications system was also down and unable to deal with the crisis. The industry term for a company that does this is “eating its own dogfood,” and it’s generally seen as a vote of confidence in your own products, but Facebook’s disaster goes to show that you need a backup food supply handy.

A lurking bug takes down Fastly

On June 8, millions of internet users trying to access sites ranging from Reddit to important UK government departments found themselves confronted by 503 error codes, indicating that the server hosting the website wasn’t able to handle the request. (Twitter was still working but, tragically, it could no longer display emojis.) How could so many different sites go offline at once? The answer, it turns out, is related to the rise of content delivery networks, which deploy proxy servers at strategic points across the internet for their clients to ensure superfast load times. Nearly every big content site uses CDNs these days, and there aren’t that many players in this space, so when one goes down, it can take a big chunk of the internet with it.

In this case, the single point of failure was Fastly, an edge computing provider with a booming CDN business. Fastly rolled out a software update on May 12 that included a bug that could be triggered by a specific customer configuration under just the right conditions. On June 8, a customer unwittingly updated their configuration and caused a crisis that lay at the intersection of software development and industry consolidation.

Shooting the messenger

In October, a reporter from the St. Louis Post-Dispatch, working with security expert Shaji Khan, discovered that a Missouri state website that allowed the public to search teachers’ certifications and credentials also inadvertently revealed those teachers’ Social Security numbers. While the numbers weren’t actually displayed on the search results page itself, they were in clear text in the HTML for the page, making them trivially easy to find. The Post-Dispatch informed the state education department about the flaw before the story was published, giving them time to correct it, and if matters had stood there we probably wouldn’t be talking about this story now.

But two days after an Education Department spokesperson started crafting a (never sent) statement thanking the media for bringing the matter to their attention, the governor publicly accused the paper of hiring “hackers” to embarrass him and the state government and promised to launch a criminal investigation. After doubling down, he faced backlash and ridicule, including blowback from members of his own political party, and we definitely are talking about the story now. So maybe the lesson here is that how you deal with the fallout from an IT disaster is almost as important as the disaster itself.
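As for the underlying flaw, hiding a value on the rendered page does nothing if the value is still in the HTML that the server sends. A common defense is an allow-list: only fields explicitly marked as public are ever copied into the response. The sketch below illustrates that idea under stated assumptions. It is not the state's actual code, the article does not say how the numbers ended up in the page, and the field names and the sample record are invented.

```python
import html

# Hypothetical allow-list: only these fields may ever reach the browser.
PUBLIC_FIELDS = ("name", "certificate", "status")


def public_view(record: dict) -> dict:
    """Copy only allow-listed fields; anything else (such as an SSN) is dropped."""
    return {key: record[key] for key in PUBLIC_FIELDS if key in record}


def render_row(record: dict) -> str:
    """Build one HTML table row from the public view of a record."""
    safe = public_view(record)
    cells = "".join(
        f"<td>{html.escape(str(safe.get(key, '')))}</td>" for key in PUBLIC_FIELDS
    )
    return f"<tr>{cells}</tr>"


# Invented sample record; the SSN is a placeholder, not real data.
teacher = {
    "name": "Jane Doe",
    "certificate": "Elementary Education",
    "status": "Active",
    "ssn": "000-00-0000",
}
print(render_row(teacher))
# <tr><td>Jane Doe</td><td>Elementary Education</td><td>Active</td></tr>
```

Because the sensitive field is removed on the server, nothing in the page source can leak it, whatever the page's scripts or styles choose to display.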