Ever been stuck with a system that just can’t heal? A system that continuously falls over or fails spectacularly at seemingly random moments? Working with modern systems, especially containerized systems distributed across many clouds, can be difficult and frustrating for anyone on call when something goes wrong. I’ve certainly been there. Let’s dig into where you can gather data from a broken system, how to get data if you’re not lucky enough to have logs, how you can figure out what’s happening using that data, and how best to act on it. We’ll also explore common trouble spots that might be hidden in that data for you to find. Finally, we’ll look specifically at common issues with containers and when they appear, so they’re easier to spot.
Logging is deceptively simple. You import a library, pass strings to it, and BAM, you have logs. However, getting decent, machine-parsable logs is hard because most people don’t understand the value of good logging. Logging is an underutilized tool in the developer’s toolbox, and it is often misunderstood as just another unnecessary debugging aid. In reality, logging is a boon to the people who will be working on a system later down the line, and making fantastic logs really is a team sport. How do you convince the rest of your team to log better? In this talk, we’ll review the basics of logging to set a baseline:

- the importance of logs
- the different log levels and types
- common misconceptions around logging
- the reason behind the shift from text-based logs to structured, machine-parsable logs

Then, we’ll use those basics to understand the pain points of proper log adoption, answer arguments against extensive logging, and review a practical framework for adding the necessary logs to an application that doesn’t have sufficient logs.
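To make the shift from text-based to structured logs concrete, here is a minimal sketch in Python using only the standard `logging` and `json` modules. The `JsonFormatter` class and the `user_id` field are hypothetical names for illustration, not part of any particular library:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object instead of free-form text,
    so downstream tools can parse fields without regexes."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=...` become attributes on the record;
        # pick up the ones we care about (hypothetical example field).
        if hasattr(record, "user_id"):
            payload["user_id"] = record.user_id
        return json.dumps(payload)


logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits one JSON line, e.g. {"time": "...", "level": "INFO", ...}
logger.info("order placed", extra={"user_id": 42})
```

The payoff is that a log aggregator can now filter on `user_id` or `level` as real fields, rather than grepping through prose that every developer formats slightly differently.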