One of the greatest threats to a log management solution is load. As log volume increases, a solution’s ability to process each event as it arrives decreases. Given enough load, this will result in dropped messages and data loss.
This problem becomes apparent when using the Elastic Stack in a high-volume logging environment. If an Elasticsearch node is busy processing events, Logstash buffers new events in a queue until Elasticsearch becomes available again. This is a good enough solution for small deployments, but it presents a problem in enterprise deployments.
The main problem is in how Logstash buffers events. Logstash stores incoming events in an in-memory queue with a fixed size. The queue is kept small for reliability: if Logstash fails, the contents of its queue are lost, so a fixed queue size limits how much data a crash can destroy. The trade-off is that once the queue fills up, Logstash stops accepting new events, which slows down the entire stack.
To prevent this, persistent queues were added to Logstash. However, these also have limitations, including:
- No guarantee against data loss for inputs that have no way to acknowledge receipt to the sender, such as the TCP, UDP, and ZeroMQ push/pull inputs
- No data replication, which means no protection from system failures
- Once the queue is full, Logstash puts back pressure on log generating components to slow down the production of new logs
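The fixed-size queue behavior described above can be illustrated with a minimal sketch. It uses Python’s standard `queue.Queue` as a stand-in for a bounded buffer; the `offer` helper, the queue size, and the event names are assumptions made for illustration and are not part of Logstash.

```python
import queue


def offer(buffer: queue.Queue, event: str, block: bool, timeout: float = 0.05) -> bool:
    """Try to enqueue an event; return False if the buffer stayed full."""
    try:
        # With block=False the timeout is ignored and a full queue fails at once.
        buffer.put(event, block=block, timeout=timeout if block else None)
        return True
    except queue.Full:
        return False


# A bounded in-memory buffer (hypothetical size of 3 events).
buffer = queue.Queue(maxsize=3)

# Policy 1: drop on overflow. Events 3 and 4 are lost once the buffer is full.
accepted = [offer(buffer, f"event-{i}", block=False) for i in range(5)]
dropped = accepted.count(False)

# Policy 2: back pressure. The producer waits for room instead of dropping.
stalled = not offer(buffer, "event-5", block=True)  # nobody is consuming, so it times out
buffer.get()                                        # a consumer drains one event
resumed = offer(buffer, "event-5", block=True)      # now there is room again
```

Either policy has a cost: dropping loses data, while blocking pushes the slowdown back onto whatever is producing the logs.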
What we need is a way to queue messages that won’t result in dropped logs. This is where message brokering systems like Apache Kafka come into play. By handling and redistributing log events quickly and reliably, message brokers are an ideal solution for log management solutions where data integrity is paramount.
We’ll look at three solutions commonly used to broker log messages:
- RabbitMQ
- Redis
- Apache Kafka
RabbitMQ is a general purpose message broker. It uses a smart broker / dumb consumer model, where the broker tracks which messages have been delivered to which consumers. In other words, RabbitMQ waits for Logstash or Elasticsearch to download messages from its queue, and only deletes the message when the consumer acknowledges a successful download. This ensures that messages are only deleted when they’re guaranteed to have been processed by the consumer.
The downside to this model is that its speed is limited by the size of the message queue and by the overhead of tracking the state of each consumer. If a publisher sends too many messages too quickly, RabbitMQ automatically throttles that publisher’s connection. As with Logstash, this can lead to latency problems further up the stack if either RabbitMQ or its consumers can’t process messages quickly enough. If you’re handling no more than tens of thousands of log messages per second, this broker will work for you. At hundreds of thousands of log messages per second or more, RabbitMQ will not be able to keep up.
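To make the smart broker idea concrete, here is a toy in-memory sketch in which the broker tracks every delivery and deletes a message only after it is acknowledged. The `SmartBroker` class and its method names are hypothetical and do not reflect RabbitMQ’s actual API or internals.

```python
from collections import deque


class SmartBroker:
    """Toy broker that tracks delivery state for every message (illustrative only)."""

    def __init__(self):
        self._ready = deque()   # messages waiting to be delivered
        self._unacked = {}      # delivery tag -> message awaiting acknowledgment
        self._next_tag = 1

    def publish(self, message):
        self._ready.append(message)

    def deliver(self):
        """Hand the next message to a consumer, but keep it until it is acked."""
        if not self._ready:
            return None
        tag = self._next_tag
        self._next_tag += 1
        self._unacked[tag] = self._ready.popleft()
        return tag, self._unacked[tag]

    def ack(self, tag):
        """Consumer confirmed processing: only now is the message deleted."""
        self._unacked.pop(tag)

    def nack(self, tag):
        """Consumer failed: put the message back at the front of the queue."""
        self._ready.appendleft(self._unacked.pop(tag))

    def pending(self):
        return len(self._ready) + len(self._unacked)


broker = SmartBroker()
broker.publish("log line A")
broker.publish("log line B")

tag, msg = broker.deliver()    # "log line A" goes out
broker.nack(tag)               # processing failed, so it is requeued
tag, msg = broker.deliver()    # "log line A" is delivered again
broker.ack(tag)                # success: the broker forgets it
remaining = broker.pending()   # only "log line B" is left
```

The per-message bookkeeping in `_unacked` is what gives the delivery guarantee, and it is also the kind of state tracking that limits throughput as volume grows.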
Redis is an in-memory data store designed for data processing. While it can stream large volumes of data quickly and efficiently, it’s commonly used as a caching service. Redis can recognize and process different data types including strings, hashes, lists, and sorted sets, making it useful as both a queue and a real-time data processing service.
Although Redis’s main draw is its speed as an in-memory data store, this is also its greatest weakness when memory is full: log data will be dropped any time the Redis queues are full. Redis does offer options for disk-based data persistence, which help reduce the risk of data loss in case of a failure, but at the cost of increased latency. In addition, Redis runs as a mostly single-threaded process. While it can perform simple operations quickly, advanced operations can introduce latency and delays throughout the stack. Using in-memory-only queues introduces a risk of data loss, and we moved away from this implementation quickly. The lack of parallelism also makes it difficult to re-index and re-process log lines into Elasticsearch.
Apache Kafka is a distributed messaging queue that uses a publisher/subscriber (pub/sub) model. Publishers (e.g., a log generating component) send messages to Kafka, and subscribers (e.g., Logstash or Elasticsearch) can pull messages as long as they have the capacity to do so. Subscribers are responsible for tracking their own position in the queue, while Kafka itself simply retains messages for a configurable period of time. This allows Kafka to store more messages, serve more consumers, and use fewer resources than smart brokers.
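The division of labor between broker and subscriber can be sketched with a toy append-only log. The `LogBroker` and `Consumer` classes below are hypothetical illustrations of the model, not Kafka’s client API, and they leave out partitions, retention, and replication entirely.

```python
class LogBroker:
    """Toy append-only log: the broker stores messages and nothing else."""

    def __init__(self):
        self._log = []

    def append(self, message) -> int:
        """Store a message and return its offset in the log."""
        self._log.append(message)
        return len(self._log) - 1

    def fetch(self, offset: int, max_messages: int = 10):
        """Return up to max_messages starting at offset; nothing is deleted."""
        return self._log[offset:offset + max_messages]


class Consumer:
    """Each consumer remembers its own offset and pulls at its own pace."""

    def __init__(self, broker: LogBroker):
        self.broker = broker
        self.offset = 0

    def poll(self, max_messages: int = 10):
        batch = self.broker.fetch(self.offset, max_messages)
        self.offset += len(batch)   # the consumer, not the broker, tracks progress
        return batch


broker = LogBroker()
for i in range(5):
    broker.append(f"event-{i}")

indexer = Consumer(broker)    # e.g. a slow consumer
archiver = Consumer(broker)   # e.g. a fast consumer reading the same log

first = indexer.poll(max_messages=2)   # pulls only what it can handle
rest = indexer.poll()                  # picks up where it left off
everything = archiver.poll()           # unaffected by the other consumer

indexer.offset = 0                     # rewinding the offset replays the log
replayed = indexer.poll()
```

Because the broker keeps no per-consumer state, adding another consumer costs it almost nothing, and a consumer can re-process old messages simply by resetting its own offset.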
Perhaps Kafka’s biggest strength is its speed. Its transmission rate is limited mostly by how quickly consumers can pull messages. It uses a disk-based queuing system, and while this can add latency, it also provides built-in fault tolerance and replication. It also offers failure safety by sending acknowledgments to publishers once messages are fully replicated, ensuring data persistence in case of a failure.
Kafka’s main challenge is that it’s more difficult to set up properly than other brokers. Kafka uses Apache ZooKeeper to track and maintain cluster state, adding complexity and overhead to Kafka deployments. Unlike Redis, Kafka nodes are unaware of the data being streamed, and unlike RabbitMQ, Kafka relies on each consumer to track its own progress in the queue. Despite these limitations, Kafka offers a high level of throughput and has been shown to support over 2.5 million records per second.
One of the most common issues users run into when deploying their own ELK stack into production is dropping log lines before Elasticsearch has a chance to index them. Ultimately, the question of which broker to use comes down to your performance and scalability requirements, including:
- Infrastructure size and the number of log-generating components
- Logging volume in terms of events per second and event size
- Tolerance for dropped events
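As a rough way to turn these factors into a number, the sketch below estimates how much buffer a broker would need to ride out a downstream stall without dropping events. The function name and the example figures (50,000 events per second, 500-byte events, a 10-minute stall) are hypothetical and chosen only to show the arithmetic.

```python
def required_buffer_bytes(events_per_second: int, avg_event_bytes: int, outage_seconds: int) -> int:
    """Bytes that accumulate while downstream consumers are stalled."""
    return events_per_second * avg_event_bytes * outage_seconds


# Hypothetical workload: 50,000 events/s, 500 bytes each, 10-minute stall.
needed = required_buffer_bytes(50_000, 500, 600)
needed_gb = needed / 1e9   # 15.0 GB of backlog to absorb
```

If that figure is larger than the memory you can dedicate to a queue, an in-memory broker will drop events during such a stall, which points toward a disk-backed option.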
Brokers like RabbitMQ, Redis, and Kafka allow Logstash, Elasticsearch, and other downstream systems to process high volumes of log data without slowing down the rest of your stack.
If you’re growing faster than your ELK stack and find yourself throwing more engineering resources at your in-house logging solution, contact LogDNA today!
Next Thursday, be sure to check our blog and read about our experience scaling past Kafka and its limitations.