Cloud Monitoring – Best Practices

Moving your applications into the cloud (whether your own private cloud or a public cloud like AWS, Azure or Google Cloud) forces you to change how you approach development and operational support. A critical part of supporting a cloud platform is how you handle your cloud monitoring. 

In this article, we’re going to discuss cloud monitoring best practices for an effective monitoring strategy. We’ll also talk about how and why you should leverage the power of monitoring to provide better support while reducing the amount of support your development teams need to provide.

What is Cloud Monitoring?

Cloud monitoring is the process of reviewing, evaluating, and managing the health and performance of cloud-based systems, services, applications, and IT infrastructure so that they run reliably and efficiently.

How to Get Started with a Monitoring Plan

Engineers like to design and create. Unfortunately, the urge to build a prototype or start developing functionality often relegates monitoring to the list of things to be done at some point in the future.

You need to treat your monitoring plan and its implementation as first-class citizens within the development lifecycle. Postponing development until you have a monitoring plan in place will save you and your team a great deal of frustration and rework. The upside of this approach is that you’ll be able to monitor your applications and services as soon as they are deployed.

Leverage the Experience of Experts

Consider using a product that is designed explicitly for monitoring cloud resources. Homegrown solutions require development time and ongoing maintenance. Investing in a monitoring tool will save you many headaches and give you and your engineers time to invest in improving your core functionality.

Select a provider that can interact with your cloud platform and that will simplify the process of monitoring and automating as much of your support needs as possible.

Consistent and Descriptive Logging

System and application logs are your primary source of information about how your system is performing. After an incident, they let your engineers understand what went wrong, determine how to resolve bugs, and make your system more resilient going forward.

Define and publish a standard logging format for all your engineers to use in their applications. Many log aggregation services support indexing based on key-value pairs, which makes logs far more useful for triaging problems. Log statements should be as descriptive as possible, including the process, thread, and data involved in the operation. A log statement that meets these requirements might look similar to the example below.

logger.info("event=addItemToCart userId={} itemId={} cartId={}", userId, itemId, cartId);

Fig. 1 Example of a Descriptive and Well-Formatted Log Statement
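One way to keep key-value formatting consistent across teams is to route it through a small shared helper. The sketch below is only an illustration: the LogFormat class, its method names, and the quoting rule are assumptions made for this example and are not part of any logging library.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper (not from any library): renders log messages
// as space-separated key=value pairs in one consistent format.
final class LogFormat {
    private LogFormat() {}

    /** Renders an event name followed by its fields as key=value pairs. */
    static String format(String event, Map<String, ?> fields) {
        String pairs = fields.entrySet().stream()
                .map(e -> e.getKey() + "=" + quoteIfNeeded(String.valueOf(e.getValue())))
                .collect(Collectors.joining(" "));
        return pairs.isEmpty() ? "event=" + event : "event=" + event + " " + pairs;
    }

    // Assumed convention: values containing whitespace are wrapped in double
    // quotes so that each pair can still be split on spaces by an indexer.
    private static String quoteIfNeeded(String value) {
        return value.chars().anyMatch(Character::isWhitespace)
                ? "\"" + value.replace("\"", "\\\"") + "\""
                : value;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("userId", 42);
        fields.put("itemId", "sku-123");
        fields.put("cartId", "c-9");
        System.out.println(format("addItemToCart", fields));
    }
}
```

The resulting string can be passed to whatever logger your applications already use. Centralizing the format in one place means a change to the standard only has to be made once.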

Distributed Tracing

Directly related to logging standards is the implementation of distributed tracing. The current trend in application development is to compose your application using microservices, containers, or other components. This approach enhances the reusability, maintainability, and scalability of an application. Implementing trace IDs, which are unique to a transaction and passed to all involved services, enables your operations and support personnel to trace problematic transactions through the system and quickly identify the source of the problem.
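The trace ID hand-off described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the TraceContext class and the X-Trace-Id header name are hypothetical, and headers are modeled as a plain map rather than a real HTTP client or server API. Production systems often rely on a tracing library and a standardized header such as the W3C Trace Context traceparent header instead.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Sketch of trace ID propagation. The class and header name are
// illustrative assumptions, not part of any framework.
final class TraceContext {
    static final String TRACE_HEADER = "X-Trace-Id"; // hypothetical header name

    private TraceContext() {}

    /** Reuses the caller's trace ID when present; otherwise starts a new trace. */
    static String resolveTraceId(Map<String, String> incomingHeaders) {
        String traceId = incomingHeaders.get(TRACE_HEADER);
        return (traceId == null || traceId.isBlank())
                ? UUID.randomUUID().toString()
                : traceId;
    }

    /** Builds the headers for a call to a downstream service, carrying the trace ID along. */
    static Map<String, String> outgoingHeaders(String traceId) {
        Map<String, String> headers = new HashMap<>();
        headers.put(TRACE_HEADER, traceId);
        return headers;
    }

    public static void main(String[] args) {
        // Edge service: the request has no trace ID yet, so one is generated.
        String traceId = resolveTraceId(new HashMap<>());
        // Downstream service: receives the header and reuses the same ID.
        String downstream = resolveTraceId(outgoingHeaders(traceId));
        System.out.println("traceId=" + traceId + " propagated=" + traceId.equals(downstream));
    }
}
```

Including the resolved trace ID as a key-value pair in every log statement lets support personnel pull up all entries for a single transaction with one search. Note that this sketch treats header names as case-sensitive map keys, which real HTTP headers are not.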

Log Aggregation in a Central System

The benefits of the cloud, specifically auto-scaling and self-healing application groups, enhance the user experience, but they can be problematic when the logs you need to view were on an instance that has already been replaced. Continually exporting your log files to a centralized log management service eliminates this problem. Additionally, support engineers can search logs and trace problems across multiple services within a single portal. These systems can also be used for reporting, as well as for identifying problems and automating the response to them.

Avoid Support Burnout with Automation

You want your engineers enhancing your product, not getting burned out manually checking on the health of your services and trying to find problems before they affect your customers. A comprehensive monitoring system allows you to identify critical metrics and determine thresholds for optimal performance. You can set up alerts based on these thresholds that either trigger an automated response or notify support personnel with actionable information about identified problems.
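As a rough illustration of threshold-based alerting, the sketch below classifies a metric reading against warning and critical limits. Everything in it is an assumption made for the example: the ThresholdAlerts class, the metric name, and the limit values are invented, and a real monitoring product would provide its own way to define thresholds and route alerts.

```java
// Sketch of threshold-based alerting. The class, metric name, and
// limits are made-up examples, not taken from any monitoring product.
final class ThresholdAlerts {
    private ThresholdAlerts() {}

    enum Severity { OK, WARNING, CRITICAL }

    /** A metric with warning and critical limits; higher readings are worse. */
    static final class Threshold {
        final String metric;
        final double warning;
        final double critical;

        Threshold(String metric, double warning, double critical) {
            if (warning > critical) {
                throw new IllegalArgumentException("warning limit must not exceed critical limit");
            }
            this.metric = metric;
            this.warning = warning;
            this.critical = critical;
        }
    }

    /** Classifies a reading: at or above a limit counts as crossing it. */
    static Severity evaluate(Threshold threshold, double value) {
        if (value >= threshold.critical) return Severity.CRITICAL;
        if (value >= threshold.warning) return Severity.WARNING;
        return Severity.OK;
    }

    public static void main(String[] args) {
        Threshold latency = new Threshold("p95LatencyMs", 500, 1000); // hypothetical limits
        double[] readings = {120, 640, 1500};
        for (double reading : readings) {
            Severity severity = evaluate(latency, reading);
            // A real system would trigger an automated response or page
            // support here; this sketch only prints the decision.
            System.out.println("metric=" + latency.metric + " value=" + reading + " severity=" + severity);
        }
    }
}
```

Separating the warning and critical levels is one possible design choice: a warning can trigger an automated response such as scaling out, while a critical reading notifies a person.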

The best thing you can do for your application and your organization is to automate, automate, automate. Investing in a comprehensive monitoring solution and automation while your application is in its infancy pays increasing dividends as your offerings expand.

Written By Mike Mackrory

 
