By: Sven Delmas
Read Time: 16 min
We had a partial production outage the weekend of 30 May 2020, and we missed a few things, outlined in the next few sections. Since others may encounter a similar incident in the future, we thought it would be worthwhile to share our experience and help others learn as much as we have. We’re addressing where we failed through better customer communication, improved planning for future workarounds, accelerated completion of our CI/CD improvements, and stronger endpoint alerting.
Note: All times listed are in UTC.
On 30 May 2020 at 13:04, a customer alerted us in our public Slack that their LogDNA agents suddenly stopped shipping data. Our on-call SRE began an investigation immediately.
The certificate chain for our systems relied on the AddTrust External CA Root. That root certificate expired on 30 May 2020 at 10:48, which caused a certificate expiry error that broke the TLS handshake between some of our customers’ agents and libraries and our systems. Any client that used an older certificate authority store or an older TLS implementation failed to connect when it attempted a new TLS handshake with one of our endpoints.
Along with our libraries and direct endpoints, we have two separate agents we’re currently supporting: v1 agent, which is written in NodeJS, and v2 agent, which is written in Rust. The v1 agent has been kept on an older version of NodeJS to provide compatibility with older operating systems. That older version of NodeJS uses a default list of trusted certificates for certificate authorities that did not include the new certificate, and NodeJS overall does not read from the local system’s trusted certificate authority list.
NodeJS, similarly to Java, ships with a bundled list of trusted certificates to ensure the security of TLS calls. Browsers do the same thing except they manage their own lists; NodeJS uses Mozilla’s list as the core developers deferred to Mozilla’s “well-defined policy” to ensure the list stays current. In the past, there were a number of calls for NodeJS to enable teams to add new certificates or otherwise enable better management of the certificate store. As of version 7.3.0 (coupled with LTS 6.10.0 and LTS 4.8.0), the core developers of NodeJS added the ability to include new certificates in that trusted list. Before that release, end-user developers and ops teams would have to recompile NodeJS with their own certificate additions or patches. Coincidentally, the NodeJS community raised a request to remove the AddTrust certificate and put the proper certificate in, and that fix landed on 01 June 2020. We discovered this change in the source code itself during the postmortem phase of this incident.
NodeJS also ran into an issue with how it vendored OpenSSL. Different versions of NodeJS used different versions of OpenSSL, and some older versions of OpenSSL gave up when they found an invalid certificate in a given path instead of trying alternative paths.
How does that work, exactly? To ensure we’re all on the same page, let’s talk about certificate chains. The basic chain involves three pieces: a root certificate from a certificate authority, an intermediate certificate that is verified by the root certificate, and a leaf certificate that is verified by the intermediate certificate. The root certificates, as noted, are generally coded into operating systems, browsers, and other local certificate stores to ensure no one can (easily) impersonate a certificate authority. Leaf certificates are pretty familiar to most technical teams as they are the certificates that a team receives as a result of their request to a certificate authority. Intermediate certificates, on the other hand, serve a few functions. The most important function is that they add a layer of security between the all-powerful root certificate with its private keys and the external world. The function most relevant here, however, is bridging between old root certificates and new ones, which is known as cross-signing. Certificate authorities release two intermediate certificates, one for each root certificate, that then both validate a leaf certificate.
Older versions of OpenSSL, specifically 1.0.x and older, have issues with this system where they follow the certificate path up the chain once and then fail if that path leads to an expired root certificate. In newer versions, OpenSSL attempts to follow an alternate chain when one is available, such as in this case where there was an additional cross-signed intermediate available that pointed to the new root. This last issue caused problems when we built our systems on different versions of NodeJS that ship with older versions of OpenSSL, and it also caused problems with other systems because of the next point.
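To make the difference between the two path-building behaviors concrete, here is a minimal sketch in Python. It is a toy model, not OpenSSL's actual algorithm: the `Cert` class, the `build_path` function, and the certificate names are all hypothetical, and every date except the old root's expiry (taken from this incident) is made up for illustration. With `try_alternatives=False` the builder commits to the first issuer it finds, roughly like the older behavior described above; with `try_alternatives=True` it backs up and tries the other cross-signed intermediate.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Cert:
    """Toy certificate: only the fields needed to illustrate path building."""
    subject: str
    issuer: str
    not_after: datetime


def utc(*parts: int) -> datetime:
    return datetime(*parts, tzinfo=timezone.utc)


def build_path(
    leaf: Cert,
    intermediates: List[Cert],
    roots: Dict[str, Cert],
    now: datetime,
    try_alternatives: bool,
) -> Optional[List[Cert]]:
    """Return a path from leaf to an unexpired trusted root, or None."""

    def walk(cert: Cert, path: List[Cert]) -> Optional[List[Cert]]:
        if cert.not_after <= now:
            return None  # an expired certificate invalidates this path
        root = roots.get(cert.issuer)
        if root is not None:
            return path + [root] if root.not_after > now else None
        for candidate in intermediates:
            if candidate.subject != cert.issuer or candidate in path:
                continue
            found = walk(candidate, path + [candidate])
            # Strict mode commits to the first candidate, even if it fails.
            if found is not None or not try_alternatives:
                return found
        return None

    return walk(leaf, [leaf])


# Hypothetical chain: one leaf, two cross-signed intermediates, two roots.
OLD_ROOT = Cert("Old Root", "Old Root", utc(2020, 5, 30, 10, 48))
NEW_ROOT = Cert("New Root", "New Root", utc(2038, 1, 1))
ROOTS = {c.subject: c for c in (OLD_ROOT, NEW_ROOT)}
INTERMEDIATES = [
    Cert("Intermediate CA", "Old Root", utc(2021, 1, 1)),  # listed first
    Cert("Intermediate CA", "New Root", utc(2030, 1, 1)),
]
LEAF = Cert("logs.example.com", "Intermediate CA", utc(2021, 1, 1))

after_expiry = utc(2020, 5, 30, 13, 4)
strict = build_path(LEAF, INTERMEDIATES, ROOTS, after_expiry, try_alternatives=False)
lenient = build_path(LEAF, INTERMEDIATES, ROOTS, after_expiry, try_alternatives=True)
print(strict)                         # None
print([c.subject for c in lenient])   # ['logs.example.com', 'Intermediate CA', 'New Root']
```

In this model, both modes succeed the day before the old root expires, which is why nothing looked wrong until the expiry time passed.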
When we were updating certificates in the past, we only updated the leaf certificate (the bottommost part of the certificate chain) rather than including the intermediate certificate. Sectigo has offered intermediate certificates that cross-signed the AddTrust certificate with the new USERTrust RSA certificate to ensure that older systems remained supported. Because we didn’t add the intermediate certificate during our update process, we also left out the cross-signed intermediate certificate. As noted, this omission caused errors with OpenSSL and GnuTLS on other older systems such as older Debian or Fedora builds, not just our NodeJS build.
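One way to catch a leaf-only update before it ships is a small pre-deploy check on the certificate bundle. The sketch below assumes the bundle is a PEM file with the leaf first and intermediates after it; the function names are hypothetical, and the check only counts certificate blocks rather than validating them.

```python
import re
from typing import List

# Matches one PEM-encoded certificate block, including its markers.
_PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.+?-----END CERTIFICATE-----",
    re.DOTALL,
)


def split_pem_bundle(bundle: str) -> List[str]:
    """Return each certificate block found in a PEM bundle, in order."""
    return _PEM_BLOCK.findall(bundle)


def missing_intermediates(bundle: str) -> bool:
    """True when the bundle holds fewer than two certificates,
    that is, at most a leaf with no intermediate alongside it."""
    return len(split_pem_bundle(bundle)) < 2


def fake_pem(body: str) -> str:
    """Build a placeholder PEM block for demonstration (not a real certificate)."""
    return f"-----BEGIN CERTIFICATE-----\n{body}\n-----END CERTIFICATE-----\n"


leaf_only = fake_pem("bGVhZg==")
full_chain = leaf_only + fake_pem("aW50ZXJtZWRpYXRl")

print(missing_intermediates(leaf_only))   # True
print(missing_intermediates(full_chain))  # False
```

A check like this would not prove the intermediates are the right ones, but it would flag the specific mistake described above.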
Finally, and possibly most importantly, we did not have any external endpoint monitoring to alert us when this certificate chain broke. We ended up relying on customers flagging the incident for us.
We weren’t as prompt, nor as expansive, as we should have been in communicating with our customers. That is changing immediately, starting with this public postmortem. We are discussing steps to ensure that we have the right required channel of communication for customers based on incident severity, and we will implement them as quickly as possible. Here’s the most up-to-date version of our plan:
Ultimately, if there is an incident, we owe it to every impacted customer to notify them as early as possible and to keep the status page consistently updated. Our customers can expect effective communication moving forward.
We know customers rely on us every day to help with troubleshooting, security, compliance, and other critical tasks. We were not fast enough in providing potential workarounds for the full problem. We are starting work on documenting temporary workarounds for future use so we can respond faster during an incident.
We know some customers need a backfill solution for security or compliance. We are aware of the problem and are in the extremely early stages of defining solutions and workarounds for future use. Given the current complexities, we do not recommend attempting to backfill missing logs from this recent incident. Some of the difficulties we need to address for a backfill solution are identifying when the problem actually requires backfill, minimizing impact of any potential option on throughput to livetail, reducing duplicate entries, and ensuring that timestamps are preserved and placed in order. For example, any log management system can see occasional delays when coupled with slower delivery from external systems, and it might appear that a batch of log lines was not ingested when, in fact, they are in the pipeline and would appear on their own without any intervention. As these difficulties could seriously warp or destroy data if not handled properly, we are starting with small, noncritical experiments and ensuring that any process we eventually recommend for any workaround our customers need is thoroughly tested, repeatable, and reliable.
We’re improving our CI/CD systems to speed up releases when we need to. We were already in the middle of this work when the incident happened, which delayed the release of an initial patch as we discovered issues that we needed to fix.
Our current monitoring solution does not include external endpoint monitoring. To increase our level of proactive alerting and notifications, we have already started talking to vendors to supplement our current monitoring so we are notified the moment any of our endpoints or one of our own systems goes down. We are also working to identify the right automated solution to check certificates that includes the entire certificate chain rather than just the tail end of the certificate chain.
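As a rough illustration of why checking the whole chain matters, the sketch below flags every certificate in a chain that expires within a warning window, not just the leaf. It assumes the expiry dates have already been extracted by some other tool; the `expiring_certs` function, the certificate labels, and all dates other than the AddTrust root's expiry are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple


def expiring_certs(
    chain_expiry: Dict[str, datetime],
    now: datetime,
    warn_within: timedelta,
) -> List[Tuple[str, timedelta]]:
    """Return (name, time_left) for each certificate in the chain that
    expires within warn_within of now, soonest first."""
    flagged = [
        (name, not_after - now)
        for name, not_after in chain_expiry.items()
        if not_after - now <= warn_within
    ]
    return sorted(flagged, key=lambda item: item[1])


# Illustrative chain: only the root's expiry comes from this incident.
chain = {
    "leaf": datetime(2021, 3, 1, tzinfo=timezone.utc),
    "intermediate": datetime(2030, 1, 1, tzinfo=timezone.utc),
    "root (AddTrust External CA Root)": datetime(2020, 5, 30, 10, 48, tzinfo=timezone.utc),
}
now = datetime(2020, 5, 1, tzinfo=timezone.utc)

for name, time_left in expiring_certs(chain, now, timedelta(days=30)):
    print(f"{name} expires in {time_left.days} days")
# prints: root (AddTrust External CA Root) expires in 29 days
```

A leaf-only check run on the same date would have reported nothing, since the leaf in this example is valid for most of another year.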
While this certificate expiration incident affected multiple software providers over the weekend, we should have known of the issues proactively, we should have been more actively communicating updates during the incident, we should have been faster at identifying and providing workarounds, and ultimately, we should have resolved the issue much faster. Our commitment to you is that we are going to do better, starting with the fixes outlined above.