The Internet Avoided a Minor Disaster Last Week

A tiny backend bug at Let’s Encrypt almost broke millions of websites. A five-day scramble ensured it didn’t.

This is a story about something that could have gone wrong on the internet last week but instead turned out mostly OK. How often can you say that?

Around 9 o’clock on the East Coast on Friday, February 28, bad news arrived on the doorstep of Let’s Encrypt. An arm of the nonprofit Internet Security Research Group, Let’s Encrypt is a so-called certificate authority that lets websites implement encrypted connections at no cost. A CA parcels out digital certificates that essentially vouch that a website isn’t an imposter. That cryptographic guarantee is the backbone of HTTPS, the encrypted connections that keep anyone from intercepting or spying on your interactions with websites.

Those certificates expire after a set amount of time; Let’s Encrypt caps its certificates at 90 days, at which point a site operator has to renew. It’s a largely automated process, but if a site doesn’t have an active certificate, your browser will notice and may not load the page you’re trying to visit at all.

Think of it sort of like updating the registration on your car every year. If your tags expire, you’ll get pulled over.

Let’s Encrypt’s work is technical and happens in the background. But in a few short years it has helped make the internet much more secure on a fundamental level. Plenty of companies offer security certificates; Let’s Encrypt just took the audacious step of making them free. A week ago, it issued its billionth certificate.

But that ubiquity also means that when a pebble drops in the middle of Let’s Encrypt’s pond, the ripples can travel a long way. On February 28, the pebble was a bug that threatened to effectively render 3 million sites nonfunctional in a matter of days.

The flaw itself? Relatively minor in the grand scheme of the internet. Let’s Encrypt uses software called Boulder to make sure that it’s allowed to issue a certificate to a site. (Some high-value targets, like banks, specify that they’ll only accept certificates from a particular CA, a preference they publish in DNS records known as Certification Authority Authorization, or CAA, records. Let’s Encrypt has solid security, but some paid certificate authorities offer warranties in the event anything goes wrong, as well as other upgrades. It’s the difference between, say, having a strong deadbolt and adding renter’s insurance.) Boulder confirms that Let’s Encrypt is honoring those preferences when it first issues a certificate and again 30 days later. Or at least, it’s supposed to; the bug meant it was skipping the second check. And that’s a big no-no.
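
For readers who want to see what honoring those preferences might look like, here is a rough sketch in Python. It is not Boulder's actual code (Boulder is written in Go and handles many more cases); the function name `ca_may_issue`, the `(tag, value)` record format, and the sample records are all assumptions made for illustration. It covers only the basic rule: if a domain publishes no "issue" preference, any CA may issue; otherwise the CA must be named.

```python
def ca_may_issue(caa_records, ca_domain):
    """Return True if `ca_domain` may issue for a site, given its CAA records.

    `caa_records` is a list of (tag, value) tuples, for example
    ("issue", "letsencrypt.org"). This is a simplification: it ignores
    wildcard rules, critical flags, and walking up the DNS tree.
    """
    # Keep only the CA name from each "issue" record, dropping any
    # parameters that follow a semicolon.
    allowed = [
        value.split(";")[0].strip().lower()
        for tag, value in caa_records
        if tag.lower() == "issue"
    ]
    if not allowed:
        # No "issue" records at all: the site has stated no preference.
        return True
    # A bare ";" value yields an empty name, which matches no CA.
    return ca_domain.lower() in allowed


# Hypothetical examples
print(ca_may_issue([], "letsencrypt.org"))
print(ca_may_issue([("issue", "some-paid-ca.example")], "letsencrypt.org"))
```

The first call reflects a site with no stated preference, so issuance is allowed; the second reflects a site that names a different CA, so it is not.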

The actual security implications of that backend hiccup were minimal, says ISRG executive director Josh Aas. At the same time, Let’s Encrypt couldn’t let a bug that affected 2.6 percent of its active certificates—3,048,289 in all, when it confirmed the issue—linger indefinitely. “The severity of the bug here is not very high,” says Aas. “But these 3 million certificates were issued in a noncompliant way. We have an obligation to revoke them.”
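
Those two figures also hint at the overall scale. As a back-of-the-envelope sketch in Python (the variable names are illustrative, and 2.6 percent is a rounded figure, so the result is only approximate), they imply roughly 117 million active certificates at the time:

```python
# Figures quoted above; the share is rounded, so treat the result as a rough estimate.
affected_certificates = 3_048_289
affected_share = 0.026  # 2.6 percent of active certificates

estimated_active = affected_certificates / affected_share
print(f"Estimated active certificates: about {estimated_active / 1e6:.0f} million")
```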

That obligation stems from the Certification Authority Browser Forum, or CA/B, an industry group that sets strict standards about the use of certificates. In this case, those standards gave Let’s Encrypt a five-day window to come back into compliance, which would entail revoking every certificate that was affected by the bug. The alternative for Let’s Encrypt was ignoring the CA/B and letting it slide, but that was really no option at all.

“They did the right thing. The CA/B sets these rules and has fairly strict requirements, which you want. When a person or computer talks to another computer, you want to make sure they’ve met some identity criterion,” says Kenneth White, security principal at MongoDB, a massive database provider that uses Let’s Encrypt. “You can’t be mostly correct. You’ve got to follow the guidelines for how to enforce these things.”

The impact of pulling those certificates would be swift and severe. Once browsers like Chrome and Firefox found them missing, they would flash warnings to any visitors that the sites weren’t safe. Some browsers would block access altogether. A not insignificant chunk of the internet would effectively be taken out of commission. All because of this one small flaw in one niche corner of the Let’s Encrypt operation.

Within two minutes of confirming the bug, the Let’s Encrypt team stopped issuing any new certificates in a bid to stanch the bleeding. A little over two hours after that, they fixed the bug itself. And then they let everyone know what was coming.

“We can’t contact everybody, so we started contacting the largest subscribers, telling them about the situation, getting them as informed as possible,” says Aas. “And then we worked with them to get them to replace their certificates as quickly as possible.”

Once a site operator renewed a certificate, Let’s Encrypt could safely revoke the old one. No harm would befall the site. Which sounds like a simple enough solution—but nothing’s simple at this kind of scale.

Bigger organizations had an easier time fixing the problem, because they generally have the resources to monitor any signs of trouble that surface and the tools to automate the renewal process. “If you’ve got a dozen or two dozen servers or something, that’s some poor sleepy-eyed soul in the middle of the night at a keyboard,” says MongoDB’s White. “We reissued a little over 15,000 certificates [for clients], and we did it in a few hours. There was some work involved, but it wasn’t catastrophic. We had measures in place to be able to rotate quickly.”

Smaller sites got a big assist from the Electronic Frontier Foundation, which operates Certbot, a free software tool that automatically adds Let’s Encrypt certificates to sites and renews them every 60 days. In the last two months alone, Certbot has generated certificates for 19.2 million unique sites. “Fortunately we had anticipated the need to check revoked certificates for renewal in 2015,” says EFF engineering director Max Hunter. “Because Let’s Encrypt communicated the issue early, and the code path for the query was already in place, our work was relatively straightforward.” By Tuesday a team from EFF, along with volunteers in Paris and Finland, had updated Certbot to renew any revoked certificates.

Meanwhile, Let’s Encrypt sent an email to every address it had on file. It created a searchable database of every affected domain so that hosting companies could see if they needed to act. “We marked those certificates as expired in our internal system, and then our normal automated processes kicked in to generate and deploy new certificates,” says Justin Samuel, CEO of Less Bits, a startup that operates hosting company ServerPilot.

On Tuesday night, 30 minutes before the deadline, Let’s Encrypt made another announcement. Of the 3 million potentially impacted sites, 1.7 million had managed to renew their certificates, an astonishing number given the short window of time. “No other CA comes close to making large-scale cert reissuing not only feasible but also fast,” says Samuel.

That success also emboldened Aas to make a difficult call. Let’s Encrypt would let the remaining certificates slide. “We made the decision that instead of breaking more than a million websites, potentially, we just aren’t going to revoke them by the deadline,” says Aas. “We think it’s the right decision for the health of the internet.”

It was the internet equivalent of a call from the governor minutes before midnight. Let’s Encrypt will continue to revoke certificates if it can confirm that the sites have renewed them, but otherwise it is content to leave them be in their slightly broken form. The security risk is small, Aas says, and since Let’s Encrypt certificates are only viable for 90 days to begin with, any stragglers will have washed out of the ecosystem by summertime at the latest.
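
The "by summertime" math is easy to check. A minimal sketch in Python, assuming the year is 2020 (the article gives only "Friday, February 28") and treating that day as the last on which a flawed certificate could have been issued; the function name `latest_expiry` is made up for this example:

```python
from datetime import date, timedelta


def latest_expiry(last_issued, lifetime_days=90):
    """Return the date a certificate issued on `last_issued` expires."""
    return last_issued + timedelta(days=lifetime_days)


# Assumption: 2020, the year in which February 28 fell on a Friday.
last_flawed_issuance = date(2020, 2, 28)
print(latest_expiry(last_flawed_issuance))
```

Under those assumptions, the last straggler expires in late May, before summer begins.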

“If anything, this just reinforces that they are one of the most transparent, modern certificate authorities in the world,” says MongoDB’s White, who points to previous certificate snafus that for-profit companies like Symantec have badly mishandled. “It’s easy to armchair quarterback. But I think if people are overly critical that’s misplaced.”

The intricacies of internet infrastructure are generally ignored until something goes terribly wrong. This time, though, it’s useful to reflect on what went right. For once, the story is that nothing broke.