Dynamic Graph

Concepts

The Dynamic Graph is the core technology powering Instana. It is a model of your application that understands all physical and logical dependencies of components. Components are the individual parts of your application like Host, OS, JVM, Cassandra Node, MySQL, etc. The graph has more than the physical components – it also includes logical components like traces, applications, services, clusters, or tablespaces. Components and their dependencies are discovered automatically by the Instana Agent and Sensors, so the graph is kept up to date in real time.

Every node in the graph is also continuously updated with state information like metrics, configuration data, and a calculated health value based on semantic knowledge and a machine learning approach. The same knowledge is used to analyze the dependencies in the graph to find logical groupings, like services and applications, to understand impact on that level and derive the criticality of issues. The whole graph is persistent, meaning the Instana application can go back and forth in time to leverage the entire knowledge base of the graph for many operational use cases.

Based on the Dynamic Graph, we calculate the impact of changes and issues on the application or service, and, if the impact is critical, we combine a set of correlated issues and changes into an Incident. An incident shows how issues and changes evolve over time, enabling Instana to point directly to the root cause of the incident. Any change is then automatically discovered and we calculate its impact on surrounding nodes. A change can be a degradation of health (which we call an “Issue”), a configuration change, a deployment or appearance/disappearance of a process, container or server.

To make this concrete, let’s look at how we would model and understand a simple application that uses an Elasticsearch cluster to search for a product using a web interface. In fact, this could be just one µService but it shows how we understand clusters and dependencies in Instana.

Understanding a Dynamic Application

Let’s develop a model of the Dynamic Graph for an Elasticsearch cluster to understand how this works and why this is useful in distributed and fluid environments.

We start with a single Elasticsearch node, which technically is a Java application, so the graph looks like this:

ES Node Graph

The nodes show the automatically discovered components on the host and their relationships. For an Elasticsearch node, we would discover a JVM, a Process, a Docker container (if the node runs inside of a container), and the host on which it is running. If it is running in a cloud environment like Amazon AWS, we would also discover its availability zone and add that to the graph.

Each node has properties (like JVM_Version=1.7.21) and all the relevant metrics in real-time, e.g. I/O and network statistics of the Host, Garbage Collection statistics of the JVM, and number of documents indexed by the ES node.

The edges between the nodes describe their relationships. In this case, these are “runs on” relationships. For example, the ES node “runs on” the JVM.
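As a rough illustration of the structure described above, the following Python sketch models components as nodes with properties and metrics, and relationships as labeled edges. The class names, the component names, and the exact order of the “runs on” chain are assumptions made for this example; this is not Instana's internal data model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A component in the graph, e.g. a host, a JVM, or an ES node."""
    name: str
    kind: str
    properties: dict = field(default_factory=dict)  # e.g. {"JVM_Version": "1.7.21"}
    metrics: dict = field(default_factory=dict)     # latest metric values


class Graph:
    def __init__(self):
        self.nodes = {}
        self.edges = []  # (source, relation, target) triples

    def add_node(self, node):
        self.nodes[node.name] = node
        return node

    def add_edge(self, source, relation, target):
        self.edges.append((source, relation, target))

    def runs_on_chain(self, name):
        """Follow "runs on" edges from a component down to its host."""
        chain = [name]
        while True:
            targets = [t for s, r, t in self.edges
                       if s == chain[-1] and r == "runs on"]
            # Stop at the bottom of the chain (or if an edge loops back).
            if not targets or targets[0] in chain:
                return chain
            chain.append(targets[0])


# Hypothetical components for a single Elasticsearch node.
g = Graph()
for name, kind in [("es-node-1", "Elasticsearch node"), ("jvm-1", "JVM"),
                   ("process-1", "Process"), ("container-1", "Docker container"),
                   ("host-1", "Host")]:
    g.add_node(Node(name, kind))
g.nodes["jvm-1"].properties["JVM_Version"] = "1.7.21"

# Assumed ordering of the "runs on" relationships.
g.add_edge("es-node-1", "runs on", "jvm-1")
g.add_edge("jvm-1", "runs on", "process-1")
g.add_edge("process-1", "runs on", "container-1")
g.add_edge("container-1", "runs on", "host-1")

print(" -> ".join(g.runs_on_chain("es-node-1")))
```

Storing edges as labeled triples keeps the sketch open to other relationship types, such as a cluster depending on its member nodes.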

For an Elasticsearch cluster, we would have multiple nodes that together form the cluster.

ES Cluster Graph

In this case, we added a cluster node to the graph that represents the state and the health of the whole cluster. It has dependencies on all four Elasticsearch nodes that comprise the cluster.

The logical unit of Elasticsearch is the index – the index is used by applications to access documents in Elasticsearch. It is physically structured in shards that are distributed to the ES nodes in the cluster.

We add the index to the graph to understand the statistics and health of the index used by applications.

ES Index Graph

In addition, we assume that we access the Elasticsearch index with a simple Spring Boot application.

Now the graph includes the Spring Boot application.

Spring Boot Graph

As the Instana Java sensor records distributed traces, Instana will know whether the Spring Boot application accesses an Elasticsearch index. We correlate these traces with the logical components in the graph and track statistics and health on the different traces.

Using this graph, we can understand different Elasticsearch issues and show how we analyze the impact on the overall service health.

Let’s assume that we have two different problems:

  1. I/O problem on one host causing read/write on index/shard data to be slow.
  2. Thread pool in one Elasticsearch node is overloaded so that requests are queued as they cannot be handled until a thread is free.

Graph Incident

Incident Description

In this case, the Host (1) starts having I/O problems. Our health intelligence would display the host’s health as yellow and then fire an issue to our issue tracker. A few minutes later, the ES (Elasticsearch) Node (2) would be affected by this, and our health intelligence would see that the throughput on this node has degraded to the point where we mark the node as yellow, firing another issue. Our engine would then correlate the two issues and add them to one incident. The incident would not yet be marked as problematic because the cluster health is still good, so the service quality is not affected.

Then on another ES node (3), the thread pool for processing queries fills up and requests get queued. As the performance is badly affected by this, our engine marks the status of the node as red. This affects the ES cluster (4), which turns yellow as the throughput decreases. The two issues generated are added to the initial incident.

As the cluster affects the performance of the index (5), we mark the index as yellow and add the issue to the incident. Now the performance of the product search transactions is affected, and our performance health analytics will mark the transaction as yellow (6), which also affects the health of the application (7).

As both the application and the transaction are affected, our incident will actually fire with a yellow status saying that the product search performance is decreasing and users are affected. The paths to the two root causes are highlighted – the I/O problem and the thread pool problem. As seen in the screenshot, Instana will show the evolution of the incident, and the user can drill into the components at the time the issue was happening – including the exact historic environment and metrics at that point in time.
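The walkthrough above can be approximated with a small Python sketch. It is a toy heuristic, not Instana's correlation engine: the dependency map, the component names, and the rules are assumptions chosen for illustration. The rules are that an issue counts as a root cause if no earlier issue sits beneath it in the graph, and that the incident fires once a user-facing component is affected.

```python
# Hypothetical dependency map: each component lists the components that depend on it.
DEPENDENTS = {
    "host": ["es-node-a"],
    "es-node-a": ["es-cluster"],
    "es-node-b": ["es-cluster"],
    "es-cluster": ["index"],
    "index": ["product-search"],
    "product-search": ["application"],
}

# Components whose degradation is assumed to mean that users are affected.
USER_FACING = {"product-search", "application"}


def downstream(component):
    """Return every component that directly or indirectly depends on `component`."""
    seen, stack = set(), [component]
    while stack:
        for dependent in DEPENDENTS.get(stack.pop(), []):
            if dependent not in seen:
                seen.add(dependent)
                stack.append(dependent)
    return seen


def correlate(issues):
    """Fold time-ordered issues into a single incident."""
    incident = {"issues": [], "root_causes": [], "fired": False}
    for component in issues:
        # Everything that earlier issues could already have impacted.
        explained = set()
        for earlier in incident["issues"]:
            explained |= downstream(earlier)
        if component not in explained:
            incident["root_causes"].append(component)
        incident["issues"].append(component)
        if component in USER_FACING:
            incident["fired"] = True
    return incident


# Issues in the order described in the walkthrough, steps (1) to (7).
ISSUES = ["host", "es-node-a", "es-node-b", "es-cluster",
          "index", "product-search", "application"]

incident = correlate(ISSUES)
print(incident["root_causes"], incident["fired"])
```

With only the first four issues, the incident exists but has not fired, which mirrors the stage where the cluster is degraded but the product search is still healthy.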

This shows the unique capabilities of Instana:

  • Combining physical, process, and trace information using the graph to understand their dependencies.
  • Intelligence to understand the health of single components, but also the health of clusters, applications, and traces.
  • Intelligent impact analysis to understand if an issue is critical or not.
  • Showing the root cause of a problem with actionable information and context.
  • Keeping the history of the graph, its properties, metrics, changes, and issues, and providing a “timeshift” feature to analyze any given problem with a clear view of the state and dependencies of all components.

Finding the root cause in modern environments will only get more challenging in the coming years. The simple example above has shown that finding the root cause is not a trivial task without an understanding of the context, dependencies, and impact. Now think of “liquid” systems based on µServices that add and remove services all the time with new releases pushed out frequently – Instana keeps track of the state and health in real time, and understands any impact of these changes or issues. All of this happens without any manual configuration.

Terminology

Zone

Zones can be in different continents and regions. They can fail or have different performance characteristics.

Host/Machine

Either physical, virtual or “as a service”. Each host has resources like CPU, memory and I/O that can be a bottleneck. Each host runs in one zone.

Container

Running on top of a host and can be managed by a scheduler like Kubernetes or Mesos.

Process

Running in the container (usually one per container) or on the host. Can be runtime environments like Java or PHP but also middleware like Tomcat, Oracle or Elasticsearch.

Cluster

Many services can act as a group or cluster, so that they appear as one distributed process to the outside world. The number of instances within a cluster can change, which can have an impact on the cluster's performance.

Service

Logical units of work that can have many instances and different versions running on top of the previously mentioned physical building blocks.

Endpoint

Public API of a service, to expose specific commands to the rest of the system.

Application Perspective

A perspective on a set of services and endpoints defined by a common context (declared using tags).

Trace

A trace is the sequence of synchronous and asynchronous calls between service endpoints. Services talk to each other and deliver a result for a user request. Transforming data in a data flow can involve many services.

Call

Describes an activity within a monitored process, typically a request between two services. A trace is composed of one or more calls. A call is composed of either one or two spans:

  • entry span: For example an HTTP request from an uninstrumented process to an instrumented process.
  • exit span: For example a database call from an instrumented process to a database (databases are not instrumented).
  • exit + entry span pair: For example an HTTP request from an instrumented process to another instrumented process. There will be an exit span for the client process and an entry span for the process serving the response.
  • intermediate span: For example custom spans added via the SDK, instrumented in-process caching or instrumented view technologies.
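The first three cases in the list above follow from which side of the call is instrumented. A minimal Python sketch of that rule is shown below; the function name and the boolean flags are hypothetical, and in-process intermediate spans are deliberately left out.

```python
def spans_for_call(caller_instrumented: bool, callee_instrumented: bool) -> list:
    """Return the span kinds recorded for a call between two processes."""
    spans = []
    if caller_instrumented:
        spans.append("exit")   # recorded by the calling process
    if callee_instrumented:
        spans.append("entry")  # recorded by the process serving the request
    return spans


# HTTP request from an uninstrumented process to an instrumented one.
print(spans_for_call(False, True))
# Database call from an instrumented process to an uninstrumented database.
print(spans_for_call(True, False))
# HTTP request between two instrumented processes.
print(spans_for_call(True, True))
```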

Span

The lowest granularity level at which data is collected by the Instana tracers. A span represents an activity within a monitored process, e.g. an incoming or outgoing HTTP request, a received message or a database call.

  • Self Time: Self time is a property of a span which denotes the time spent just in the span itself without calls to the underlying services.
  • Waiting Time: The waiting time denotes the time spent waiting for calls to underlying services to return.
  • Network Time: The network time is always present between an exit span and an entry span, and represents the time between the exit and entry span.
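To make the three timings concrete, here is a small worked example in Python. The `Span` class and its fields are hypothetical. The sketch also assumes that downstream calls run sequentially without overlapping, and that network time can be estimated as the exit span's duration minus the matching entry span's duration. Actual tracers may compute these values differently.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    start_ms: int
    end_ms: int
    children: list = field(default_factory=list)  # exit spans for downstream calls

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def waiting_time(span):
    """Time spent waiting for downstream calls (assumes they do not overlap)."""
    return sum(child.duration_ms for child in span.children)


def self_time(span):
    """Time spent in the span itself, excluding downstream calls."""
    return span.duration_ms - waiting_time(span)


def network_time(exit_span, entry_span):
    """Estimated time on the wire between an exit span and its entry span."""
    return exit_span.duration_ms - entry_span.duration_ms


# A search request that makes one downstream call to another instrumented process.
es_exit = Span("ES query (client side)", start_ms=20, end_ms=80)
es_entry = Span("ES query (server side)", start_ms=25, end_ms=70)
request = Span("GET /search", start_ms=0, end_ms=100, children=[es_exit])

print(waiting_time(request))            # 60 ms waiting on the downstream call
print(self_time(request))               # 40 ms spent in the request itself
print(network_time(es_exit, es_entry))  # 15 ms between exit and entry span
```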

Usage

The Dynamic Graph is created and updated automatically. The definition of some components, like services, can be further specified through service configuration.

Graph traversal and scoping can be accomplished using our powerful dynamic focus ability.