“And at our scale, humans cannot continuously monitor the status of all of our systems.” - Netflix
This is especially true for traditional APM tools, which have primarily been used by performance tuning experts to manually analyze and correlate information in order to identify bottlenecks and errors in production. At higher scale and with more dynamic environments, this task is like finding a needle in a haystack: there are simply too many moving parts and metrics to correlate.
If we are to apply a machine intelligence approach to system management, the core model and data set must be impeccable. Microservice applications are made of hundreds to thousands of building blocks, all constantly evolving. It is therefore necessary to understand all of the blocks and their dependencies, which demands an advanced approach to discovery.
The building blocks that application monitoring needs to cover are:
- Datacenter/Availability Zones – Zones can be in different continents and regions. They can fail or have different performance characteristics.
- Hosts/Machines – Either physical, virtual, or “as a service”. Each host has resources like CPU, memory, and IO that can be a bottleneck. Each host runs in one zone.
- Containers – Running on top of a host and can be managed by a scheduler like Kubernetes or Mesos.
- Processes – Running in the container (usually one per container) or on the host. Can be runtime environments like Java or PHP, but also middleware like Tomcat, Oracle, or Elasticsearch.
- Clusters – Many services can act as a group or cluster, so that they appear as one distributed process to the outside world. The number of instances within a cluster can change, which can affect the cluster's performance.
- Services – Logical units of work that can have many instances and different versions running on top of the previously mentioned physical building blocks.
- Endpoints – Public API of a service, to expose specific commands to the rest of the system.
- Application Perspectives (also called Applications) – A perspective on a set of services and endpoints defined by a common context (declared using tags).
- Traces – A trace is the sequence of synchronous and asynchronous communications between services. Services talk to each other and deliver a result for a user request. Transforming data in a data flow can involve many services.
- Calls – Describes a request between two services. A trace is composed of one or more calls.
- Business Services – Compositions of services and applications that together deliver distinct business value.
- Business Process – A combination of technical traces that form a process. As an example, it could be the “buying” trace in e-commerce, followed by an order trace in ERP, followed by a trace of FedEx’s logistics for delivery to the customer.
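To make the relationships between some of these building blocks concrete, here is a minimal sketch of how they could be modeled. The class and field names (`Host`, `Process`, `Call`, `Trace`, and so on) are assumptions chosen for illustration and do not describe Instana's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Process:
    name: str                        # e.g. "tomcat"
    container: Optional[str] = None  # None when the process runs directly on the host


@dataclass
class Host:
    name: str
    zone: str                        # each host runs in one zone
    processes: List[Process] = field(default_factory=list)


@dataclass
class Call:
    source: str    # calling service
    target: str    # called service
    endpoint: str  # endpoint invoked on the target


@dataclass
class Trace:
    trace_id: str
    calls: List[Call] = field(default_factory=list)  # a trace has one or more calls

    def services(self) -> Set[str]:
        """All services that take part in this trace."""
        return {c.source for c in self.calls} | {c.target for c in self.calls}


# Hypothetical example data
host = Host("host-1", zone="eu-west", processes=[Process("tomcat", container="c-42")])
trace = Trace("t-1", [Call("shop", "cart", "GET /cart"), Call("cart", "db", "SELECT")])
print(sorted(trace.services()))
```

Keeping the physical blocks (zone, host, container, process) separate from the logical ones (service, endpoint, trace, call) mirrors the distinction the list draws between where something runs and what work it does.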
It’s not uncommon for thousands of service instances in different versions to run on hundreds of hosts in different zones on more than one continent to provide an application to its users. This creates a network of dependencies between components that must work together so that the service quality of the application is ensured and the business value is delivered. A traditional monitoring tool would alert when a single component crosses a threshold. However, the failure of one or even many of these components does not necessarily mean that the quality of the application is affected. A modern monitoring tool must therefore understand the whole network of components and their dependencies in order to monitor, analyze, and predict the quality of service.
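The difference between a per-component threshold alert and a dependency-aware view can be sketched as follows. This is an illustrative simplification, not Instana's algorithm: the function names and the rule that a service counts as down only when all of its instances are on failed hosts are assumptions.

```python
def service_status(instance_hosts, failed_hosts):
    """Classify a service by how many of its instances still run on healthy hosts."""
    healthy = [h for h in instance_hosts if h not in failed_hosts]
    if not healthy:
        return "down"
    if len(healthy) < len(instance_hosts):
        return "degraded"
    return "healthy"


def impacted_services(depends_on, instances, failed_hosts):
    """Return services that are down or transitively depend on a service that is down."""
    impacted = {
        svc for svc, hosts in instances.items()
        if service_status(hosts, failed_hosts) == "down"
    }
    changed = True
    while changed:  # propagate impact upstream until nothing new is found
        changed = False
        for svc, deps in depends_on.items():
            if svc not in impacted and impacted.intersection(deps):
                impacted.add(svc)
                changed = True
    return impacted


# Hypothetical topology: shop -> cart -> db
instances = {"shop": ["h1", "h2"], "cart": ["h2", "h3"], "db": ["h4"]}
depends_on = {"shop": ["cart"], "cart": ["db"], "db": []}

print(impacted_services(depends_on, instances, {"h2"}))  # redundant instances absorb the failure
print(impacted_services(depends_on, instances, {"h4"}))  # the only db host fails
```

In this toy topology, losing host `h2` degrades two services but takes none of them down, whereas losing `h4` propagates from `db` up to `shop`. A threshold on a single host cannot distinguish the two cases.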
As described, the number of services and their dependencies is 10-100x higher than in SOA-based applications, which poses a challenge for monitoring tools. And the situation is getting worse: continuous delivery, automation tools, and container platforms sharply increase the rate of change in applications, making it impossible for humans to keep up or to continuously configure monitoring tools for newly deployed blocks (e.g. a new container just spun up by an orchestration tool). A modern monitoring solution is therefore required to discover each and every block automatically and immediately, before analyzing and understanding it.
The changes then need to be linked to the previous snapshot so that history is preserved and a model can be reconstructed for any given point in time to investigate incidents.
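One way to picture snapshots linked to their predecessors is a timeline in which each change produces a new snapshot derived from the previous one, so the model at any past moment can be looked up. The `ModelTimeline` class below is a hypothetical sketch under that assumption; it copies the whole model on every change for simplicity, which a real system would likely avoid.

```python
import bisect


class ModelTimeline:
    """Keeps one model snapshot per recorded change, ordered by timestamp."""

    def __init__(self):
        self._times = []      # sorted change timestamps
        self._snapshots = []  # snapshot valid from the matching timestamp onward

    def record_change(self, timestamp, component, config):
        """Derive a new snapshot from the previous one; config=None removes the component."""
        if self._times and timestamp < self._times[-1]:
            raise ValueError("changes must be recorded in time order")
        snapshot = dict(self._snapshots[-1]) if self._snapshots else {}
        if config is None:
            snapshot.pop(component, None)
        else:
            snapshot[component] = config
        self._times.append(timestamp)
        self._snapshots.append(snapshot)

    def at(self, timestamp):
        """Return the model as it was at the given point in time."""
        index = bisect.bisect_right(self._times, timestamp)
        return dict(self._snapshots[index - 1]) if index else {}


# Hypothetical change history for one container
timeline = ModelTimeline()
timeline.record_change(10, "container-a", {"image": "shop:1.0"})
timeline.record_change(20, "container-a", {"image": "shop:1.1"})
timeline.record_change(30, "container-a", None)  # container was stopped

print(timeline.at(15))  # state during the first deployment
```

When investigating an incident that occurred at time 25, `timeline.at(25)` would return the model with version `shop:1.1` deployed, even though the container no longer exists.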
Changes can happen in any of the building blocks at any time. See this graphic for examples of changes in each component:
A key ingredient of the Instana Dynamic APM solution is our agent architecture and, specifically, our use of sensors. Sensors are mini agents: small programs specifically designed to attach to and monitor one thing. They are automatically managed by our single agent (one per host), which is deployed either as a standalone process on the host or as a container via the container scheduler (learn more about our agent and sensors in data collection).
The agent first automatically detects the physical components, such as zones in AWS, Docker containers running on the host or Kubernetes, processes like HAProxy, Nginx, JVM, Spring Boot, Postgres, Cassandra, or Elasticsearch, and clusters of these processes, like a Cassandra cluster. For each component it detects, the agent collects its configuration data and starts monitoring it for changes. It also starts sending important metrics for each component every second. The agent automatically detects and uses metrics that services expose through mechanisms such as JMX or Dropwizard.
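Monitoring a component's configuration for changes can be thought of as comparing the latest collected configuration against the previous one. The following is a minimal sketch of that idea, not the agent's actual implementation; the `diff_config` function and the example keys are assumptions.

```python
def diff_config(old, new):
    """Compare two flat configuration dicts and report what changed between them."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical configuration of a JVM process, collected at two points in time
before = {"heap_max": "2g", "gc": "G1", "port": 8080}
after = {"heap_max": "4g", "gc": "G1", "debug": True}

print(diff_config(before, after))
```

An empty result in all three categories means nothing changed, so only real changes need to be reported.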
As a next step, the agent starts to inject trace functionality into the service code. For example, it intercepts HTTP calls, database calls, and queries to Elasticsearch. It captures the context of each call, such as stack traces or payload.
The intelligence that combines this data into traces, discovers dependencies and services, and detects changes and issues resides on the server. The agent is therefore lightweight and can be deployed to thousands of hosts.
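As a rough illustration of the server-side step, the sketch below groups raw call records into traces and derives service dependencies from them. The record fields (`trace_id`, `source`, `target`, `start`) and both function names are assumptions made for this example; the real processing is considerably more involved.

```python
from collections import defaultdict


def group_into_traces(calls):
    """Group call records by trace id and order each trace by call start time."""
    traces = defaultdict(list)
    for call in calls:
        traces[call["trace_id"]].append(call)
    for trace_calls in traces.values():
        trace_calls.sort(key=lambda c: c["start"])
    return dict(traces)


def discover_dependencies(calls):
    """Derive 'service A calls service B' edges from the observed calls."""
    deps = defaultdict(set)
    for call in calls:
        deps[call["source"]].add(call["target"])
    return {source: sorted(targets) for source, targets in deps.items()}


# Hypothetical call records as they might arrive from several agents, out of order
calls = [
    {"trace_id": "t1", "source": "cart", "target": "db", "start": 2},
    {"trace_id": "t1", "source": "shop", "target": "cart", "start": 1},
    {"trace_id": "t2", "source": "shop", "target": "search", "start": 5},
]

traces = group_into_traces(calls)
print([c["source"] for c in traces["t1"]])
print(discover_dependencies(calls))
```

Because the dependency map is derived from observed calls rather than configured by hand, it updates on its own as new services appear and old ones disappear.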
Automatic, immediate, and continuous discovery is a requirement for the new generation of monitoring solutions. Instana has been fundamentally designed around this requirement.