By: Sven Delmas

We had a partial production outage the weekend of 30 May 2020, and we missed a few things, outlined in the next few sections. Since others may encounter a similar incident in the future, we thought it would be worthwhile to share our experience and help others learn as much as we have. We’re addressing where we failed through better customer communication, improved planning for future workarounds, accelerated completion of our CI/CD improvements, and stronger endpoint alerting.

Note: All times listed are in UTC.

On 30 May 2020 at 13:04, a customer alerted us in our public Slack that their LogDNA agents suddenly stopped shipping data. Our on-call SRE began an investigation immediately.

What Happened

The certificate chain for our systems relied on the AddTrust External CA Root. That root certificate expired on 30 May 2020 at 10:48, which caused a certificate expiry error that broke the TLS handshake between some of our customers’ agents and libraries and our systems. Any client that attempted a new TLS handshake with one of our endpoints while using an older certificate authority store or an older TLS implementation failed to connect.

Detailed Times (in UTC)

2020-05-30 10:48
The certificate expired as noted.
2020-05-30 12:59
We received our first customer ticket reporting something wasn’t working correctly.
2020-05-30 13:04
We first received word from a customer in our public Slack that their agents stopped shipping data.
2020-05-30 13:13
The SRE on-call finished verifying the customer report and opened an incident.
2020-05-30 13:52
In analyzing the data coming in for ingestion, we misunderstood the impact of the certificate expiration due to a discrepancy between two datasets. Our data did not show a complete loss of ingestion traffic, so although we knew the problem was severe, we thought it was isolated to one version of our agent. Based on customer reports, we thought that only the Docker-based image of our v1 agent was affected. As such, we focused on releasing a new v1 agent build and identifying a way for customers running older operating systems to update their certificate lists.
2020-05-30 15:22
We identified the patch for Debian-based systems and attempted to apply the same patch to our v1 agent image. Then, we realized that the problem with the v1 agent was due to how NodeJS manages certificates (described under NodeJS Certificate Management). We focused our efforts on rebuilding the v1 agent with the correct certificate store for both Docker and Debian-based systems.
2020-05-30 19:10
We started validating packages internally before shipping to customers. We also thought we had found the explanation for the lack of any drop in ingestion traffic: existing agents did not need to attempt a new TLS handshake.
2020-05-30 19:13
We pushed a new v1 agent build that allowed customers to restart ingestion on most platforms and documented hotfixes for older Debian-based systems in our public Slack. We did not realize our build had failures on AWS due to a hiccup in our CI/CD process.
2020-05-31
We continued to think that the patched image solved the problem and pointed customers to the new 1.6.3 image as needed.
2020-06-01 13:07
We received reports from multiple customers with data that the new 1.6.3 image had issues that would cause CrashLoopBackOff. We discovered a bug in our release chain that prevented us from releasing a new image. We were back at square one.
2020-06-01 15:00
We fixed the bug in our release chain and started generating new packages. An intermittent hiccup in service with our CI/CD provider caused a further delay.
2020-06-01 17:14
We released the new packages and started exploring the option of switching to a new certificate authority. We wanted to remove the need for customer intervention and to avoid issues with how NodeJS vendored OpenSSL (described under NodeJS Certificate Management and Certificate Update Process).
2020-06-01 18:13
We began the process of updating our certificates on testing environments, pushing the changes out to higher environments one by one after testing was complete.
2020-06-01 21:46
We completed a switchover to a new certificate authority and certificate chain. This fix meant all systems except those that would require manual certificate chain installation, such as those using Syslog ingestion, would immediately begin ingestion again without any further customer action. Those systems with manual certificate chain installation were provided with further instructions and pointed to the new chain on the CDN.

Key Factors

NodeJS Certificate Management

Along with our libraries and direct endpoints, we have two separate agents we’re currently supporting: v1 agent, which is written in NodeJS, and v2 agent, which is written in Rust. The v1 agent has been kept on an older version of NodeJS to provide compatibility with older operating systems. That older version of NodeJS uses a default list of trusted certificates for certificate authorities that did not include the new certificate, and NodeJS overall does not read from the local system’s trusted certificate authority list.

NodeJS, similarly to Java, ships with a bundled list of trusted certificates to ensure the security of TLS calls. Browsers do the same thing except they manage their own lists; NodeJS uses Mozilla’s list as the core developers deferred to Mozilla’s “well-defined policy” to ensure the list stays current. In the past, there were a number of calls for NodeJS to enable teams to add new certificates or otherwise enable better management of the certificate store. As of version 7.3.0 (coupled with LTS 6.10.0 and LTS 4.8.0), the core developers of NodeJS added the ability to include new certificates in that trusted list. Before that release, end-user developers and ops teams would have to recompile NodeJS with their own certificate additions or patches. Coincidentally, the NodeJS community raised a request to remove the AddTrust certificate and put the proper certificate in, and that fix landed on 01 June 2020. We discovered this change in the source code itself during the postmortem phase of this incident.

NodeJS also ran into an issue with how it vendored OpenSSL. Different versions of NodeJS used different versions of OpenSSL, and some older versions of OpenSSL gave up when they found an invalid certificate in a given path instead of trying alternatives.

How does that work, exactly? To ensure we’re all on the same page, let’s talk about certificate chains. The basic chain involves three pieces: a root certificate from a certificate authority, an intermediate certificate that is verified by the root certificate, and a leaf certificate that is verified by the intermediate certificate. The root certificates, as noted, are generally coded into operating systems, browsers, and other local certificate stores to ensure no one can (easily) impersonate a certificate authority. Leaf certificates are pretty familiar to most technical teams as they are the certificates that a team receives as a result of their request to a certificate authority. Intermediate certificates, on the other hand, serve a few functions. The most important is that they add a layer of security between the all-powerful root certificate, with its private keys, and the external world. Another function, and the one most relevant here, is bridging old root certificates and new ones, which is known as cross-signing. Certificate authorities release two intermediate certificates, one for each root certificate, that then both validate a leaf certificate.

Older versions of OpenSSL, specifically 1.0.x and older, have issues with this system where they follow the certificate path up the chain once and then fail if that path leads to an expired root certificate. In newer versions, OpenSSL attempts to follow an alternate chain when one is available, such as in this case where there was an additional cross-signed intermediate available that pointed to the new root. This last issue caused problems when we built our systems on different versions of NodeJS that ship with older versions of OpenSSL, and it also caused problems with other systems because of the next point.
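The difference between the two path-building behaviors can be sketched with a toy model. This is a simplified illustration, not how OpenSSL is implemented: it ignores signatures entirely, matches issuers by name only, and the `Cert` type, the `find_trusted_path` function, the host name, the intermediate name, and every date other than the AddTrust expiry are assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Cert:
    subject: str
    issuer: str
    not_after: datetime


def find_trusted_path(cert: Cert, server_sent: Sequence[Cert],
                      trust_store: Sequence[Cert], now: datetime,
                      try_alternates: bool, depth: int = 0) -> Optional[List[str]]:
    """Return the subjects from cert up to a trusted root, or None."""
    if now > cert.not_after or depth > 8:
        return None  # expired certificate (or runaway chain): this path is dead
    if cert in trust_store:
        return [cert.subject]
    # Every known certificate whose subject matches this certificate's issuer.
    candidates = [c for c in list(server_sent) + list(trust_store)
                  if c.subject == cert.issuer and c != cert]
    if not try_alternates:
        candidates = candidates[:1]  # older behavior: commit to the first path
    for issuer in candidates:
        rest = find_trusted_path(issuer, server_sent, trust_store, now,
                                 try_alternates, depth + 1)
        if rest is not None:
            return [cert.subject] + rest
    return None


UTC = timezone.utc
OLD_ROOT = Cert("AddTrust External CA Root", "AddTrust External CA Root",
                datetime(2020, 5, 30, 10, 48, tzinfo=UTC))
NEW_ROOT = Cert("USERTrust RSA", "USERTrust RSA", datetime(2038, 1, 1, tzinfo=UTC))
# Two intermediates with the same subject, one per root (hypothetical names/dates).
VIA_OLD = Cert("Example Intermediate CA", "AddTrust External CA Root",
               datetime(2030, 1, 1, tzinfo=UTC))
VIA_NEW = Cert("Example Intermediate CA", "USERTrust RSA",
               datetime(2030, 1, 1, tzinfo=UTC))
LEAF = Cert("logs.example.com", "Example Intermediate CA",
            datetime(2021, 1, 1, tzinfo=UTC))

SERVER_SENT = [VIA_OLD, VIA_NEW]
TRUST_STORE = [OLD_ROOT, NEW_ROOT]
BEFORE_EXPIRY = datetime(2020, 5, 1, tzinfo=UTC)
AFTER_EXPIRY = datetime(2020, 6, 1, tzinfo=UTC)

if __name__ == "__main__":
    for label, alternates in (("first path only", False), ("with alternates", True)):
        path = find_trusted_path(LEAF, SERVER_SENT, TRUST_STORE,
                                 AFTER_EXPIRY, try_alternates=alternates)
        print(label, "->", path)
```

In this model, a verifier that commits to the first path reaches the expired root and gives up, while one that backtracks finds the second intermediate and succeeds, even though both are looking at the same certificates.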

Certificate Update Process

When we updated certificates in the past, we only updated the leaf certificate (the bottommost part of the certificate chain) rather than including the intermediate certificate. Sectigo has offered intermediate certificates that cross-signed the AddTrust certificate with the new USERTrust RSA certificate to ensure that older systems were supported. Since we didn’t add the intermediate certificate during our update process, we missed adding the cross-signed intermediate certificate. As noted, this omission caused errors with OpenSSL and GnuTLS on other older systems such as older Debian or Fedora builds, not just our NodeJS build.
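One way to guard against shipping a leaf-only update is a small pre-deployment check on the PEM bundle that will be served. The sketch below is an assumption about what such a check could look like, not a description of our tooling; the function names are hypothetical, and it only counts certificates rather than validating that they chain together.

```python
import re
from typing import List

# A PEM certificate block, including its BEGIN/END armor lines.
PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----.+?-----END CERTIFICATE-----", re.DOTALL
)


def split_pem_bundle(bundle: str) -> List[str]:
    """Return each PEM certificate block found in the bundle, in order."""
    return PEM_CERT.findall(bundle)


def count_certificates(bundle: str) -> int:
    return len(split_pem_bundle(bundle))


def includes_intermediate(bundle: str) -> bool:
    """True when the bundle holds more than just the leaf certificate."""
    return count_certificates(bundle) >= 2
```

A release step could refuse to deploy a bundle for which `includes_intermediate` returns False. A fuller check would also parse each certificate and confirm that every issuer matches the next subject in the chain.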

Endpoint Monitoring

Finally, and possibly most importantly, we did not have any external endpoint monitoring to alert us when this certificate chain broke. We ended up relying on customers flagging the incident for us.

Next Steps

Outage and Customer Communications

We weren’t as prompt, nor as expansive, as we should have been in communicating with our customers. That is changing immediately, starting with this public postmortem. We are discussing steps to ensure that we have the right required channel of communication for customers based on incident severity, and we will implement them as quickly as possible. Here’s the most up-to-date version of our plan:

Ultimately, if there is an incident, we owe it to every impacted customer to notify them as early as possible and to keep the status page consistently updated. Our customers can expect effective communication moving forward.

Workarounds and Solutions

We know customers rely on us every day to help with troubleshooting, security, compliance, and other critical tasks. We were not fast enough in providing potential workarounds for the full problem. We are starting work on documenting temporary workarounds for future use so we can respond faster during an incident.

We know some customers need a backfill solution for security or compliance. We are aware of the problem and are in the extremely early stages of defining solutions and workarounds for future use. Given the current complexities, we do not recommend attempting to backfill missing logs from this recent incident. Some of the difficulties we need to address for a backfill solution are identifying when the problem actually requires backfill, minimizing impact of any potential option on throughput to livetail, reducing duplicate entries, and ensuring that timestamps are preserved and placed in order. For example, any log management system can see occasional delays when coupled with slower delivery from external systems, and it might appear that a batch of log lines was not ingested when, in fact, they are in the pipeline and would appear on their own without any intervention. As these difficulties could seriously warp or destroy data if not handled properly, we are starting with small, noncritical experiments and ensuring that any process we eventually recommend for any workaround our customers need is thoroughly tested, repeatable, and reliable.
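To make two of those difficulties concrete (duplicate entries and timestamp ordering), here is a minimal in-memory sketch. It is purely illustrative and is not a backfill procedure we recommend or support; the `LogLine` shape and the `merge_backfill` function are assumptions invented for the example.

```python
from typing import Iterable, List, Tuple

# Hypothetical log entry shape: (timestamp in seconds, message).
LogLine = Tuple[float, str]


def merge_backfill(ingested: Iterable[LogLine],
                   backfill: Iterable[LogLine]) -> List[LogLine]:
    """Merge backfilled lines into ingested ones without duplicating entries."""
    merged = list(ingested)
    seen = set(merged)
    for entry in backfill:
        if entry not in seen:  # skip lines that did arrive on their own
            seen.add(entry)
            merged.append(entry)
    # sort() is stable, so lines sharing a timestamp keep their relative order.
    merged.sort(key=lambda entry: entry[0])
    return merged
```

Even this toy version shows the risk: two genuinely distinct log lines with the same timestamp and text in the backfill would be collapsed into one, which is the kind of data distortion that needs thorough testing before any real process is recommended.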

Technical Initiatives

We’re improving our CI/CD systems to speed up releases when we need to. We actually were in the process of doing this very task, which caused some of the delays in releasing an initial patch immediately as we discovered issues that we needed to fix.

Our current monitoring solution does not include external endpoint monitoring. To increase our level of proactive alerting and notifications, we have already started talking to vendors to supplement our current monitoring so we are notified the moment any of our endpoints or one of our own systems goes down. We are also working to identify the right automated solution to check certificates that includes the entire certificate chain rather than just the tail end of the certificate chain.
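As a rough idea of what an external probe does, the sketch below uses Python's standard `ssl` module to complete a handshake and report how many days remain on the certificate the endpoint presents. It is an assumption for illustration rather than the solution we will adopt: the function names are hypothetical, and it reads only the leaf certificate's expiry, not the whole chain.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before not_after, given in the getpeercert() date format,
    for example 'May 30 10:48:00 2020 GMT'. Negative once expired."""
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0


def probe_endpoint(host: str, port: int = 443, timeout: float = 10.0) -> float:
    """Handshake with host and return the days left on its leaf certificate.

    Raises ssl.SSLCertVerificationError when the chain cannot be verified
    against the trust store of the machine running the probe.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return days_until_expiry(cert["notAfter"])
```

Note the limitation: a probe running on an up-to-date host verifies against a current trust store and TLS library, so it may succeed while older clients fail. Catching an incident like this one would likely also require probing from environments that resemble the oldest supported clients.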

Wrap Up

While this certificate expiration incident affected multiple software providers over the weekend, we should have known of the issues proactively, we should have been more actively communicating updates during the incident, we should have been faster at identifying and providing workarounds, and ultimately, we should have resolved the issue much faster. Our commitment to you is that we are going to do better, starting with the fixes outlined above.
