Events and Incidents

Concepts

Instana detects three major types of events to help you manage the Quality of Service of your applications:

Issues
An Issue is an event that is created when an entity becomes unhealthy. Issues are detected by machine learning algorithms and health signatures built to recognize a range of unhealthy situations, from degradations of service quality to complex infrastructure problems to disk saturation.

Incidents
Incidents are the highest level of events. They are created when a discovered service is impacted, and they correlate all relevant events by leveraging the Dynamic Graph to provide context.

Changes
A Change is an event representing anything from a server start or stop to a deployment or a configuration change on a system. Changes are further separated into:

  • Changes - Changed configuration of components.
  • Offline/Online - Tracking presence of components under management. 

Issues

An Issue is an event that is triggered if something out of the ordinary happens.

Critical infrastructure issues, such as disk saturation and Elasticsearch cluster split brain situations, will trigger incidents because their end result is most likely data loss.

An example of an issue:

image2016 11 22 17 13 5

In this example, the CPU steal time on one Linux machine is suspicious and therefore marked as an issue. An issue by itself does not trigger an alert; Instana simply notes that it happened. Should the service to which this system is connected behave badly, this issue will become part of the incident. This methodology is one of the major benefits of Instana because it frees you from manually correlating events and performance problems. Just because something uses too much CPU for a while does not mean there is a problem as such. The information becomes relevant only when a service is impacted.

Instana records the time when an issue first occurred and the time when the condition ceased to exist (start and end time). In this case, you see that the CPU steal exceeded the 5% limit for just under three minutes (from 16:25:11 to 16:27:53). Clicking on the issue line shows the details on the right-hand side of the screen. The drop in CPU steal is evident at around 16:27.
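The duration of the issue follows directly from the recorded start and end times. The snippet below is only an illustration of that arithmetic using the two timestamps from the example; it is not how Instana computes or stores event durations.

```python
from datetime import datetime

# Start and end times taken from the example above. The example gives no
# date, so only the time of day is parsed.
start = datetime.strptime("16:25:11", "%H:%M:%S")
end = datetime.strptime("16:27:53", "%H:%M:%S")

duration = end - start
minutes, seconds = divmod(int(duration.total_seconds()), 60)
print(f"Issue was active for {minutes} min {seconds} s")  # 2 min 42 s
```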

The “View Built-in Event” link takes you directly to the corresponding definition in the Events & Alerts settings for this issue. This helps you understand the basis on which a particular issue was created.

Notes:

  • Applications, services, or endpoints that receive infrequent traffic (e.g., one call every 15 minutes) are not considered to provide a sufficient basis for our issue detection.

Incidents

Incidents help you understand situations impacting your edge services and critical infrastructure by automatically learning their behavior and health, and then sending alerts when they become unhealthy. Edge services are what customers or other systems outside the monitored application actually access; they are the external deliverables of the application.

events and incidents 1

Incidents are created as soon as Instana detects either a breach of a key performance indicator (KPI) on an edge service or a critical infrastructure issue. Please read Part II of the Monitoring Microservices blog post for more context.

Instana tracks KPIs for discovered application services. Those KPIs are:

  • Load (calls/second)
  • Latency (response time in milliseconds)
  • Errors (error rate)
  • Saturation (of the most constrained resources of a service)

Instana automatically measures these KPIs for every service and applies machine learning to them to determine the health of a service. Typical problems that are detected are:

  • The error rate is higher than normal
  • The performance of the service is slow
  • There is a sudden drop or increase of load
  • The saturation of the service is close to reaching a limit
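To make the load, latency, and error KPIs more concrete, here is a minimal sketch of how they could be derived from a batch of recorded calls in one time window. The `Call` record and the `compute_kpis` function are hypothetical names introduced for illustration; they are not part of Instana, and saturation is left out because it depends on resource metrics rather than on calls.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """One recorded call to a service (hypothetical record shape)."""
    latency_ms: float
    erroneous: bool


def compute_kpis(calls: list[Call], window_seconds: float) -> dict[str, float]:
    """Derive load, mean latency, and error rate for one time window."""
    if not calls:
        return {"load": 0.0, "latency_ms": 0.0, "error_rate": 0.0}
    total = len(calls)
    return {
        "load": total / window_seconds,                           # calls/second
        "latency_ms": sum(c.latency_ms for c in calls) / total,   # mean latency
        "error_rate": sum(c.erroneous for c in calls) / total,    # fraction 0..1
    }


calls = [Call(120.0, False), Call(80.0, False), Call(250.0, True), Call(150.0, False)]
kpis = compute_kpis(calls, window_seconds=2.0)
print(kpis)  # {'load': 2.0, 'latency_ms': 150.0, 'error_rate': 0.25}
```

A simple mean is used for latency here; a real system would more likely track percentiles as well.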

KPIs are determined by capturing and analyzing every trace across the services and the application. Traces automatically capture error indicators such as status codes, exceptions, or error logs to find out if something went wrong. Traces also measure the time spent in each service and its underlying components. Based on the Google Dapper architecture, a trace is a tree that consists of spans, where a span is a basic unit of work. In the microservice world, one span normally equals one request to a service or component, such as a database. This way, Instana automatically obtains not only end-to-end tracing of the application but also performance information for each individual service and component.
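The idea of a trace as a tree of spans can be sketched as a small data structure. The `Span` class, the service names, and the `self_time_by_service` helper below are hypothetical and only illustrate the concept; the sketch also assumes that child spans run one after another and do not overlap, which real traces do not guarantee.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """A basic unit of work: one request to a service or component."""
    service: str
    duration_ms: float
    error: bool = False
    children: list["Span"] = field(default_factory=list)


def self_time_by_service(root: Span) -> dict[str, float]:
    """Walk the span tree and sum the time each service spent itself,
    i.e. its span duration minus the duration of its direct children."""
    totals: dict[str, float] = {}
    stack = [root]
    while stack:
        span = stack.pop()
        child_time = sum(child.duration_ms for child in span.children)
        own_time = span.duration_ms - child_time
        totals[span.service] = totals.get(span.service, 0.0) + own_time
        stack.extend(span.children)
    return totals


# One end-to-end trace: the frontend calls a catalog service (which queries
# its database) and a payment service (which fails).
trace = Span("shop-frontend", 120.0, children=[
    Span("catalog", 40.0, children=[Span("catalog-db", 25.0)]),
    Span("payment", 50.0, error=True),
])
print(self_time_by_service(trace))
```

The same tree therefore yields both the end-to-end duration (the root span) and a per-service breakdown.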

If the health of a service is impacted, Instana will create a new incident and correlate it with all the other incidents by traversing the Dynamic Graph of Issues and Events.

The result is a comprehensive overview of the situation regarding Service and Event Impact.

Changes

A change can be anything from a server start or stop to a deployment or a configuration change on a system.

Instana recognizes changes by tracking relevant configurations specific to each monitored technology, and by monitoring whether a component comes online (is monitored by Instana) or goes offline (is no longer monitored by Instana).

Every change is recorded and typically has a duration of only 1 second (the difference between start and end time). Like issues, change events are correlated into an incident if they are relevant, which spares you an alert just because a system went offline. For example, the system may have been turned off because the load decreased at the end of the working day and it was no longer needed.

image2016 11 22 17 6 2

Terminology

Issue

An event marked as suspicious by Instana using characteristics specific to the technology within which it occurred.

Incident

An issue correlated to performance degradation that should be investigated by a human.

Change

Any change within the environment, including but not limited to: a deployment, a configuration change, a server start or stop.

Usage

How to analyze an incident

Let’s take an example incident: a service is suddenly responding more slowly than usual - we call this a “sudden increase in average latency.” The incident is marked in yellow as a warning. This is done automatically and there is no configuration option for thresholds - which is another major benefit of Instana. The color is shown for as long as the incident is active. Once the incident is resolved, it is shown in grey but remains available for drill-down. Clicking on the line in the table opens the incident details:

events and incidents 4

The incident detail view (right part of the screen) is organized into three parts:

  1. The header contains the key facts of the incident:

    • Start time;
    • End time (current time if the incident is still ongoing);
    • The number of events that are still active;
    • The number of changes involved;
    • The number of affected entities.

These values appear in the header as follows:

image2016 11 22 17 33 26

  2. The second section provides a visual representation of how the incident developed over time. The chart shows the complete time frame, from start to end (current server time if the incident is still open), and all events, sorted by start time. The view is limited to seven events when collapsed. Press the expand button to see the full view if your incident contains more than seven events. Clicking on any of the bars opens the detail view for that issue:

image2016 11 08 15 44 12

  3. The third section contains the details for the graph view in section 2. A list of all events, sorted by start time, shows all available information for each event. Click an event to expand it:

image2016 11 08 15 56 23

The details help you understand the event and are followed by multiple charts that plot the corresponding metric. If an event is still active, the chart continues to render incoming metric values. Two flags are available, indicating that the event affects a service and/or that the event triggered the incident. If present, the flags are placed at the top of each event in the list (see screenshot above).

When you focus on an event, the detail section provides the same information as described for the incident's event list in point 3.

Search capabilities - finding an incident

Searching through events discovered by Instana relies on the Dynamic Focus feature. When in the incident view, you’ll notice that event.type:incident has been automatically added to the search bar. In addition to this event filter, the current time window is also applied as a filter.

In addition, you can use the search box to find specific items by the data shown in the columns “Title” or “On” (the name of the service on which the incident occurred) in the overview table. In this example, the search term is “sudden” and you don’t need to add a wildcard character to the string. The result is a list of all incidents containing the word “sudden” in the title:

image2016 11 22 18 6 10
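The search behavior described above, where a plain term matches without wildcards against the “Title” and “On” columns, can be pictured as a simple substring match. This is a sketch under assumptions, not Instana's search implementation: the `matches` function, the sample incidents, and in particular the case-insensitive comparison are illustrative choices.

```python
def matches(incident: dict[str, str], term: str) -> bool:
    """Return True if the term occurs anywhere in the title or the
    service name. Case-insensitive matching is an assumption here."""
    needle = term.lower()
    return needle in incident["title"].lower() or needle in incident["on"].lower()


# Hypothetical rows of the incident overview table.
incidents = [
    {"title": "Sudden increase in average latency", "on": "checkout-service"},
    {"title": "Disk saturation", "on": "db-host-1"},
    {"title": "Sudden drop in the number of calls", "on": "catalog-service"},
]

found = [i["title"] for i in incidents if matches(i, "sudden")]
print(found)
```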

Configuration

Instana has a predefined set of health rules, built and maintained by our experts, to detect anomalies in traces and infrastructure issues. Individual predefined rules can be disabled if they do not suit your environment.

Additionally, users can configure their own custom rules in Instana, which can then trigger custom issues.

This functionality can be found in the Events & Alerts section of the Team Settings menu.

Events Overview

The Events page gives an overview of the currently available events. It lists both built-in events, which are included out of the box with Instana, and user-defined custom events.

events overview

The symbol in the first column indicates the severity (incident, critical or warning), the state (active or disabled) and the kind of event (issue or incident). This list can be filtered using the dropdowns above it or by a full-text search. Events can be disabled at any time. However, only custom events can be deleted.

Built-in Events

Built-in events are predefined health events based on integrated algorithms which help to understand the health of the monitored system in real-time. These can be disabled individually in case a built-in event is not relevant for the monitored system. The number of built-in events is growing constantly as the experts at Instana are continuously working on new rules to enable better insight into the health of the monitored system.

Custom Events

A custom event enables users to create issues or incidents based on an individual metric of any given entity. You can create a new custom event by clicking on “New Event” at the top of the Events Overview page. The “new event” form has three sections: Event Details, Condition, and Scope.

Event Details

create event, step 1

Condition

create event, step 2

There are three data sources providing metrics that can be used for triggering custom events:

  • Built-in metrics: These are metrics that are always available when the corresponding entity is instrumented. For example, when a JVM is being monitored (entity = JVM), Instana always provides built-in metrics like the amount of memory used.
  • Custom metrics: These are metrics that are explicitly exposed by a monitored application. For example, if an application exposes Dropwizard metrics, these metrics will be found here.
  • System Rules: Currently, Instana provides only one system rule, which is “offline event detection”. This rule is active when an entity (like a JVM or a host) is down.

Depending on the metric type, you have different options for defining the condition that triggers the custom event. For example, you can configure an event that triggers if the number of errors within a time window of 5 minutes is greater than 10.
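The example condition, more than 10 errors within a 5-minute window, can be sketched as a sliding-window counter. The `ErrorCountCondition` class below is hypothetical and only illustrates the logic of such a condition; it says nothing about how Instana evaluates custom events internally.

```python
from collections import deque


class ErrorCountCondition:
    """Fires when more than `threshold` errors fall inside a sliding
    time window (illustrative sketch, timestamps in seconds)."""

    def __init__(self, window_seconds: float = 300.0, threshold: int = 10):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._timestamps: deque[float] = deque()

    def record_error(self, timestamp: float) -> bool:
        """Register one error and report whether the condition is met."""
        self._timestamps.append(timestamp)
        # Drop errors that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold


condition = ErrorCountCondition(window_seconds=300.0, threshold=10)
# Eleven errors, ten seconds apart: only the eleventh exceeds the threshold.
fired = [condition.record_error(float(t)) for t in range(0, 110, 10)]
print(fired)
```

Note that exactly 10 errors do not trigger the event, because the condition is "greater than 10".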

Scope

create event, step 3

Typically you don’t want an event to trigger on all entities in your application or system landscape, but want to restrict it to a specific set of entities. The scope lets you define for which entities the event will be evaluated:

  • Application perspective: Reference an application perspective.
  • Selected entities: Define a dynamic focus query. Only entities matching that query will be considered when the event is evaluated.
  • All available entities: No restriction, evaluate the event for all entities in your application or system landscape.