Automatic discovery and monitoring

Automatic discovery and monitoring offers efficient and adaptive system management. By eliminating manual configuration, it adapts seamlessly to changes in large-scale applications and ensures comprehensive coverage across physical, logical, and business components. It detects anomalies in real time to minimize the impact of failures. Such a monitoring solution provides the thorough, adaptive approach to system management that is crucial for maintaining the reliability and performance of modern, dynamic applications.

“And at our scale, humans cannot continuously monitor the status of all of our systems.” – Netflix

This is true for traditional APM tools, which are primarily used by performance-tuning experts to manually analyze and correlate information to identify bottlenecks and errors in production. At higher scale and dynamics, this task is like finding a needle in a haystack: too many moving parts and metrics need to be correlated.

If a machine-intelligence approach is applied to system management, the core model and data set must be impeccable. Microservice applications are made of hundreds to thousands of building blocks and are constantly evolving. Therefore, it is necessary to understand all the blocks and their dependencies, which demands an advanced approach to discovery.

Components of automatic discovery and monitoring

Automatic discovery and monitoring covers physical components, logical components, and business components.

Physical components

  • Data centers or availability zones – Zones exist in different continents and regions. They can fail or have different performance characteristics.
  • Hosts/Machines – Either physical, virtual, or delivered as a service. Each host has resources such as CPU, memory, and I/O that can become a bottleneck. Each host runs in one zone.
  • Containers – Run on top of a host and can be managed by a scheduler such as Kubernetes or Apache Mesos.
  • Processes – Run within a container or directly on the host. Processes can include runtime environments such as Java or PHP, and middleware such as Tomcat, Oracle, or Elasticsearch.
  • Clusters – Many services can act as a group or cluster so that they appear as a unified distributed process to the outside world. The number of instances within a cluster can change and can affect the cluster's performance.
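The containment relationships between these physical components can be sketched as a simple data model. This is an illustrative sketch only; the class and field names are assumptions, not Instana's actual data model.

```python
from dataclasses import dataclass, field
from typing import List

# Minimal model of the physical hierarchy: zone -> host -> container -> process.
# All names here are illustrative.

@dataclass
class Process:
    name: str          # e.g. "java", "nginx"

@dataclass
class Container:
    container_id: str
    processes: List[Process] = field(default_factory=list)

@dataclass
class Host:
    hostname: str
    zone: str          # each host runs in exactly one zone
    containers: List[Container] = field(default_factory=list)

def processes_in_zone(hosts: List[Host], zone: str) -> List[str]:
    """List every process name that runs in a given availability zone."""
    return [p.name
            for h in hosts if h.zone == zone
            for c in h.containers
            for p in c.processes]

hosts = [
    Host("web-1", "eu-west-1a", [Container("c1", [Process("nginx")])]),
    Host("db-1", "us-east-1b", [Container("c2", [Process("postgres")])]),
]
print(processes_in_zone(hosts, "eu-west-1a"))  # ['nginx']
```

A monitoring backend needs exactly this kind of containment knowledge to answer questions such as "what is affected if zone eu-west-1a degrades?".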

Logical components

  • Services – Logical units of work that can have many instances and different versions, running on top of the previously mentioned physical building blocks.
  • Endpoints – The public API of a service that exposes specific commands to the rest of the system.
  • Application perspectives or Applications – A view of a set of services and endpoints that share a common context, which is declared by using tags.
  • Traces – Sequences of synchronous and asynchronous communications between services. Services communicate with each other to deliver a result for a user request; transforming data in a data flow might involve many services.
    • Calls – A request between two services. A trace is composed of one or more calls.
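The trace-and-call relationship can be sketched as follows. This is a hedged, minimal sketch; the field names are assumptions for illustration, not a real tracing schema.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Call:
    source: str        # calling service
    target: str        # called service
    endpoint: str      # endpoint invoked on the target, e.g. "GET /orders"
    duration_ms: float

@dataclass
class Trace:
    trace_id: str
    calls: List[Call] = field(default_factory=list)  # a trace is one or more calls

    def services(self) -> Set[str]:
        """All services that participate in this trace."""
        involved: Set[str] = set()
        for c in self.calls:
            involved.update((c.source, c.target))
        return involved

t = Trace("abc123", [
    Call("frontend", "orders", "GET /orders", 12.5),
    Call("orders", "postgres", "SELECT", 3.1),
])
print(sorted(t.services()))  # ['frontend', 'orders', 'postgres']
```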

Business components

  • Business services – Compositions of services and applications that deliver unique business value.
  • Business processes – Combinations of technical traces that form a process. For example, a process might represent the "buying" trace in e-commerce, followed by an order trace in the ERP system, followed by a trace of FedEx's logistics for delivery to the customer.

It is common for thousands of service instances, in different versions, to run on hundreds of hosts in different zones on more than one continent to provide an application to its users. This creates a network of dependencies between components that must work together perfectly so that the service quality of the application is ensured and the business value is delivered. A traditional monitoring tool might alert when a single component crosses a threshold. However, the failure of one or many of these components does not necessarily mean that the quality of the application is affected. Therefore, a modern monitoring tool must understand the whole network of components and their dependencies to monitor, analyze, and predict the quality of service.
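Reasoning over such a dependency network typically means traversing it in reverse: given a failed component, find every service that transitively depends on it. The following is a minimal sketch of that idea, assuming a simple adjacency-map representation; it is not Instana's actual analysis.

```python
from collections import deque
from typing import Dict, List, Set

def affected_services(deps: Dict[str, List[str]], failed: str) -> Set[str]:
    """Return every service that transitively depends on the failed component.

    `deps` maps a service to the components it depends on; the traversal
    walks the inverted graph (dependents of dependents) breadth-first.
    """
    dependents: Dict[str, List[str]] = {}
    for svc, targets in deps.items():
        for t in targets:
            dependents.setdefault(t, []).append(svc)

    affected: Set[str] = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for svc in dependents.get(node, []):
            if svc not in affected:
                affected.add(svc)
                queue.append(svc)
    return affected

deps = {"frontend": ["orders"], "orders": ["postgres"], "reports": ["postgres"]}
print(sorted(affected_services(deps, "postgres")))  # ['frontend', 'orders', 'reports']
```

Note that "affected" here means "potentially affected": as the text says, a component failure does not necessarily degrade application quality, which is why the actual analysis must weigh the whole network rather than alert on single thresholds.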

Identifying and cataloging changes

As described, the number of services and their dependencies is ten to a hundred times higher than in SOA-based applications, which poses a challenge for monitoring tools. The situation is compounded by continuous delivery methodology, automation tools, and container platforms, which exponentially increase the rate of change of applications. This dynamic environment makes it impractical to manually keep up with the changes or to continuously configure monitoring tools for newly deployed blocks, such as a new container that is spun up by an orchestration tool. Therefore, a modern monitoring solution must automatically and immediately discover each and every block before it can analyze and understand them.

Subsequent changes must be linked and persisted so that a model of the system can be reconstructed at any point in time to investigate incidents.
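The idea of reconstructing a past state from a linked change history can be sketched with an append-only log. This is an illustrative sketch under simple assumptions (one state string per component), not Instana's persistence model.

```python
from typing import Dict, List, Tuple

class ChangeLog:
    """Append-only log of component changes; any past state can be rebuilt."""

    def __init__(self) -> None:
        # each entry is (timestamp, component, state)
        self._entries: List[Tuple[float, str, str]] = []

    def record(self, ts: float, component: str, state: str) -> None:
        self._entries.append((ts, component, state))

    def snapshot(self, ts: float) -> Dict[str, str]:
        """Reconstruct the last known state of every component at time `ts`."""
        state: Dict[str, str] = {}
        for t, component, s in sorted(self._entries):
            if t <= ts:
                state[component] = s
        return state

log = ChangeLog()
log.record(100.0, "orders-v1", "online")
log.record(200.0, "orders-v1", "offline")
log.record(200.0, "orders-v2", "online")
print(log.snapshot(150.0))  # {'orders-v1': 'online'}
```

Replaying the log up to an incident's timestamp yields exactly the system model that existed at that moment, which is what makes after-the-fact investigation possible.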

The changes can happen in any of the building blocks at any time as shown in the following image:

Changes

Instana's comprehensive discovery process

A key ingredient of the Instana Dynamic APM solution is the Instana agent architecture, specifically its use of sensors. Sensors are mini agents: small programs that are designed to attach to and monitor specific entities. They are managed automatically by a single agent (one per host), which is deployed either as a stand-alone process on the host or as a container through the container scheduler.

The agent automatically detects physical components such as AWS zones; Docker containers that run on the host or on Kubernetes; processes such as HAProxy, Nginx, JVM, Spring Boot, Postgres, Cassandra, or Elasticsearch; and clusters of these processes, such as a Cassandra cluster. For each detected component, the agent collects its configuration data and starts monitoring for changes. It also sends the important metrics for each component every second. The agent automatically detects and uses metrics that are provided by the services, for example through JMX or Dropwizard.
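At its core, process-level detection amounts to matching what is running on the host against a catalog of known technologies and loading the corresponding sensors. The table and function below are illustrative assumptions; the real agent's detection logic is considerably more involved.

```python
# Map process command names to the sensor that monitors them (illustrative only).
KNOWN_SENSORS = {
    "java": "jvm",
    "nginx": "nginx",
    "haproxy": "haproxy",
    "postgres": "postgresql",
    "cassandra": "cassandra",
}

def sensors_to_start(running_commands):
    """Given the commands of processes running on a host, pick the sensors to load."""
    return sorted({KNOWN_SENSORS[cmd] for cmd in running_commands
                   if cmd in KNOWN_SENSORS})

# Processes the agent might observe on a host; "bash" matches no sensor.
print(sensors_to_start(["java", "postgres", "bash"]))  # ['jvm', 'postgresql']
```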

Agent

Next, the agent injects trace functions into the service code. For example, it intercepts HTTP calls, database calls, and queries to Elasticsearch, and captures the context of each call, such as stack traces or payloads.
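The interception pattern can be sketched as a wrapper around a service call that records timing and error context without changing the call's behavior. This is a minimal Python sketch of the general technique; real agents instrument at the bytecode or library level rather than with decorators.

```python
import functools
import time
import traceback

SPANS = []  # collected call records; a real agent would forward these

def traced(fn):
    """Wrap a service call so that timing and error context are captured."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        span = {"call": fn.__name__, "start": time.time()}
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            span["error"] = repr(exc)
            span["stack"] = traceback.format_exc()
            raise  # the wrapper observes the failure but does not swallow it
        finally:
            span["duration_ms"] = (time.time() - span["start"]) * 1000
            SPANS.append(span)
    return wrapper

@traced
def fetch_orders(user_id):
    # stand-in for an HTTP or database call
    return [{"user": user_id, "order": 1}]

fetch_orders(42)
print(SPANS[0]["call"])  # fetch_orders
```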

The intelligence that combines this data into traces, discovers dependencies and services, and detects changes and issues is done on the server. Therefore, the agent is lightweight and can be injected into thousands of hosts.

Instana is designed for an automatic, immediate, and continuous discovery for the new generation of monitoring solutions.

Collecting data

Instana uses a single agent with multiple sensors and currently supports hundreds of technologies. For more information, see Configuring and monitoring supported technologies. These sensors are not extensions; they are updated, loaded, and unloaded by the agent. The optional command-line interface provides access to the agent state, individual sensors, and agent logs.

A sensor is designed to automatically discover and monitor a specific technology and pass its data to the agent, which manages all communication to the Instana Service Quality Engine. After discovery, the sensor collects the details and metric data needed to provide an accurate representation of the component's state, tailoring what it gathers to the technology at hand.

The sensor collects the following data:

  • Configuration: Catalogs current settings and states to track any changes.
  • Events: Initial discovery, state changes (online and offline), built-in events that trigger issues or incidents based on failing health rules on entities, and custom events that trigger issues or incidents based on thresholds for an individual metric of any entity.
  • Traces: Calls captured based on the programming language or platform.
  • Metrics: Quantitative attributes of the technology that indicate performance.
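A custom threshold event of the kind described above can be sketched as a rule over a metric's recent samples. The function below is an illustrative assumption, not Instana's rule engine; requiring several consecutive breaches is one common way to avoid alerting on a single spike.

```python
def threshold_breached(samples, threshold, window=3):
    """Trigger a custom event when a metric exceeds `threshold` for
    `window` consecutive samples."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= window:
            return True
    return False

# e.g. CPU-usage samples reported once per second
print(threshold_breached([40, 95, 96, 97], threshold=90))  # True
print(threshold_breached([40, 95, 40, 96], threshold=90))  # False
```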

In addition, discovery is recursive within a sensor. For example, the Java Virtual Machine sensor continues up the stack and discovers frameworks such as Tomcat or Spring Boot, and then helps the agent load the appropriate additional sensors.
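Recursive discovery can be sketched as a walk over a table of which child technologies each sensor knows how to detect. The table contents and names are illustrative assumptions.

```python
# Which child technologies each sensor can discover (illustrative only).
DISCOVERABLE = {
    "host": ["jvm", "postgresql"],
    "jvm": ["tomcat", "spring-boot"],  # the JVM sensor continues up the stack
}

def discover(root):
    """Walk the stack recursively, collecting every sensor the agent should load."""
    found = [root]
    for child in DISCOVERABLE.get(root, []):
        found.extend(discover(child))
    return found

print(discover("host"))
# ['host', 'jvm', 'tomcat', 'spring-boot', 'postgresql']
```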


Pipeline

The Instana backend uses streaming technology that is capable of processing millions of events per second streamed from the agents. This streaming engine is effectively real time: it takes only 3 seconds to process a situation and display it to the user.
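One basic operation in such a pipeline is bucketing an event stream into the per-second granularity at which metrics are reported. The following is a minimal sketch of that aggregation step under simple assumptions; it does not represent Instana's actual streaming engine.

```python
from collections import defaultdict

def per_second_average(events):
    """Aggregate a stream of (timestamp, value) events into per-second averages."""
    buckets = defaultdict(list)
    for ts, value in events:
        buckets[int(ts)].append(value)   # group by whole second
    return {second: sum(vals) / len(vals) for second, vals in buckets.items()}

events = [(10.1, 100), (10.6, 200), (11.2, 300)]
print(per_second_average(events))  # {10: 150.0, 11: 300.0}
```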