With great power comes great responsibility, and with big data comes the necessity to query it effectively. ElasticSearch has been around for over seven years and changed the game in terms of running complex queries on big data (petabyte scale). Tasks like e-commerce product search, real-time log analysis for troubleshooting or generally anything that involves querying big data is considered “data intensive”. ElasticSearch is a distributed, full-text search engine, or database, and the key word here is “distributed”.
A lot of small problems are much easier to deal with than a few big ones, and DevOps is all about spreading out dependencies and responsibility so it’s easier on everyone. ElasticSearch uses the same concept to help query big data; it’s also highly scalable and open-source. Imagine you need to setup an online store, a private “Google search box” that your customers could use to search for anything in your inventory. That’s exactly what ElasticSearch can do for your application monitoring and logging data. It stores all your data, or in the context of our post, all your logging data in nodes that make up several clusters.
ElasticSearch and Data Analysis
Staying with the online store example, modern day queries can get pretty technical and a customer could, for example, be looking for only products in a certain price range, or a certain color, or a certain anything. Things can get more complicated if you’re also running a price alerting system that lets customers set alerts if things on their wish list drop below a certain price. ElasticSearch gives you those full-text search and analytics capabilities by breaking data down into nodes, clusters, indexes, types, documents, shards and replicas. This is how it allows you to store, search, and analyze big data quickly and in “near” real time (NRT).
The architecture and design of ElasticSearch is based on the assumption that all nodes are located on a local network. This is the use case that is extensively tested for and in a lot of cases is the environment that users operate in. However, monitoring data can be stored on different servers and clusters and to query them, ElasticSearch needs to run across clusters. If your clusters are at different remote locations, this is where ElasticSearch’s assumption that all nodes are on the same network starts working against you. When data is stored across multiple ElasticSearch clusters, querying it becomes harder.
Network disruptions are much more common between distributed infrastructure (even with a dedicated link) along with a host of other problems. Workarounds are what adapting to new technology is all about, and there have been a number of them — one of the most recent and effective ones being tribe nodes. ElasticSearch has a number of different use cases within organizations and are spread across departments. It could be used for logging visitor data in one, analyzing financial transactions in another, and deriving insights from social media data across a third.
Since data resides across the cluster on different nodes, some complex queries need to get data from multiple nodes and process them; for that you need to query multiple clusters. If all these clusters are not at the same physical location, tribe node connects them and lets you deal with them like one big cluster. What makes the tribe node unique is that it doesn’t impose any restrictions on core APIs like the cross-cluster search. The tribe node supports almost all APIs, with the exception of meta-level APIs like Create Index, for example, which must be executed on each cluster separately.
The tribe node works by executing search requests across multiple clusters and merging the results from each cluster into a single global cluster result. It does this by actually joining each cluster as a node that keeps updating itself on the state of the cluster. This uses considerable resources, as the node has to acknowledge every single cluster state update from every remote cluster.
Additionally, with tribe nodes, the node that receives the request (the corresponding node) basically does all the work. This means the node that receives the request identifies which indices, shards, and nodes the search has to be executed against. It sends requests to all relevant nodes, decides what the top N-hits that need to be fetched are and then actually fetches them.
The tribe node is also very hard to maintain code-wise over time — especially since it’s the only exception to ElasticSearch’s rule that a node must belong to one and only one cluster.
If DevOps is about spreading the load around, it’s pretty obvious what the problem is with tribe node. One node is being taxed with all the processing work while the nodes not relevant to the query standby idle. With cross-cluster search, you’re actually remotely querying each cluster with its own _search APIs, so no additional nodes that need to be constantly updated would join the cluster and slow it down. When a search request is executed on a node, instead of doing everything itself, the node forwards the indices at a rate of one _search_shard request per cluster.
The _search API allows ElasticSearch to execute searches, queries, aggregations, suggestions, and more against multiple indices which are in turn broken down into shards. The concept is that instead of having a huge Database A, Database B, Database C and so on, it merges everything into one giant solid block of data. The next step is to break it down into bits (shards) and give every node a piece to look after, worry about, care for, maintain, and query when required. This makes it a lot easier to query since the load is spread evenly across all the nodes.
Now, unlike in tribe nodes where the first node would wait for every node to reply, do the math and fetch the documents, with cross-cluster search the initial node has done it’s job already. Once the shards are sent to all clusters for comparison, all further processing is done locally on the relevant clusters. Further time and processing power is saved by sending shards to only 3 nodes per cluster by default; you can also choose how many nodes per cluster you would like discovered.
Now traffic flows only one way in cross-cluster searches and that means the corresponding node just passes the message on to the three default nodes that carry on the process. You can also choose which nodes you would like to act as gateways and which nodes you would actually like to store data on. This gives you a lot more control over the traffic going in and out of your cluster.
Again, unlike tribe nodes that require an actual additional node in each of your clusters, with cross-cluster search, no additional or special nodes are required for cross cluster searches and it isn’t tied to any specific API. Any node can act as a corresponding node and you can control which nodes get to be corresponding nodes and which nodes don’t. Furthermore, when merging clusters, tribe nodes can’t keep two indices with the same name even if they’re from different clusters. Cross-cluster search aims to fix this limitation by being able to dynamically register, update, and remove remote clusters.
The Need For Logging
There are also commercial algorithms built on ElasticSearch to make life and logging even easier. LogDNA is a good example, and we’ve been known to talk about our product in context of it being the “Apple of logging”. Along with predictive intelligence and machine learning, LogDNA allows users to aggregate, search and filter from all hosts and apps. LogDNA also features automatic parsing of fields from common log formats, such as weblogs, Mongo, Postgres and JSON. Additionally, we offer a live-streaming tail using a web interface or command line interface (CLI).
LogDNA provides the power to query your log data end-to-end without having to worry about clusters, nodes, or indices. It does all the heavy lifting behind the scenes so you enjoy an intuitive and intelligent experience when analyzing your log data. As we discussed earlier, maintaining your own ElasticSearch stack and stitching together all the infrastructure and dependencies at every level is a pain. Instead, you’re better off saving all that time and opting for a third-party tool that abstracts away low-lying challenges and lets you deal with your log data directly. That’s what tools like LogDNA do.
From not being able to query across clusters, to querying through a special node, to finally remotely querying across clusters, ElasticSearch is certainly making progress. In an age where data is “big”, nothing is as important as the ability to make it work for you and we can sure expect ElasticSearch to continue to make this feature better. However, if you’d rather save yourself all the effort of managing multi-cluster querying, and instead analyze and derive value from your log data, LogDNA is the way to go.