Events and Incidents

Concepts

Instana detects three major event groups to help you manage the Quality of Service of your applications:

Issues
An Issue is created when an entity becomes unhealthy. Issues are detected by machine learning algorithms and health signatures built to recognize a range of unhealthy situations, from degradations in service quality to complex infrastructure issues to disk saturation.

Incidents
Incidents are the highest level of events. They are created when a discovered service is impacted, and they correlate all relevant events by leveraging the Dynamic Graph to provide context.

Changes
A Change is anything from a server start or stop to a deployment or a configuration change on a system. Changes are further separated into:

  • Changes - Changed configuration of components.
  • Offline/Online - Tracking presence of components under management. 

Issues

An Issue is an event that is triggered if something out of the ordinary happens.

Critical infrastructure issues, such as disk saturation and Elasticsearch cluster split brain situations, will trigger incidents because their end result is most likely data loss.

An example of an issue:

image2016 11 22 17 13 5

In this example, the CPU steal time on one Linux machine is suspicious and is therefore marked as an issue. An issue by itself does not trigger an alert; Instana simply notes that it happened. Should the service that this system is connected to behave badly, this issue will become part of the incident. This methodology is one of the major benefits of Instana because it frees you from manually correlating events and performance problems. Just because something is using too much CPU for a while doesn’t mean there is a problem as such. The information becomes relevant only when a service is impacted.

Instana records when an issue first occurred and when the condition ceased to exist (start and end time). In this case, the CPU steal exceeded the 5% limit for a little over two and a half minutes (from 16:25:11 to 16:27:53). Clicking on the issue line shows the details on the right-hand side of the screen. The drop in CPU steal is evident at around 16:27.
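As an illustration of how a start and end time can be derived from a metric crossing a limit, here is a minimal Python sketch. It is not Instana's implementation: the `issue_window` helper and the sample values are hypothetical, and only the 5% limit and the two timestamps come from the example above.

```python
from datetime import datetime

def parse(t):
    """Parse an HH:MM:SS string; the date part is irrelevant for this example."""
    return datetime.strptime(t, "%H:%M:%S")

def issue_window(samples, threshold):
    """Return (start, end) for the first run of samples above the threshold.

    start is the first sample above the threshold; end is the first later
    sample at or below it, or None if the condition is still active.
    """
    start = None
    for ts, value in samples:
        if start is None and value > threshold:
            start = ts
        elif start is not None and value <= threshold:
            return start, ts
    return start, None

# Hypothetical CPU steal samples (fraction of CPU time), chosen to match the example.
samples = [
    (parse("16:24:50"), 0.02),
    (parse("16:25:11"), 0.08),
    (parse("16:26:30"), 0.11),
    (parse("16:27:53"), 0.03),
]

start, end = issue_window(samples, threshold=0.05)
duration = end - start
print(duration)  # 0:02:42
```

The resulting duration of 2 minutes 42 seconds corresponds to the window shown in the example.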

Incidents

Incidents help you understand situations impacting your edge services and critical infrastructure by automatically learning their behavior and health, and then sending alerts when they become unhealthy. Edge services are what customers or other systems outside the monitored application actually access; they are the external deliverables of the application.

events and incidents 1

Incidents are created as soon as Instana detects either a breached key performance indicator (KPI) on an edge service or a critical infrastructure issue. Please read Part II of the Monitoring Microservices blog post for more context.

Instana tracks KPIs for discovered application services. Those KPIs are:

  • Load (calls/second)
  • Latency (response time in milliseconds)
  • Errors (error rate)
  • Saturation (of the most constrained resources of a service)

Instana automatically measures these KPIs for every service and applies machine learning to them to determine the health of a service. Typical problems that are detected are:

  • The error rate is higher than normal
  • The performance of the service is slow
  • There is a sudden drop or increase of load
  • The saturation of the service is close to reaching a limit

KPIs are determined by capturing and analyzing every trace across the services and the application. Traces automatically capture errors like status codes, exceptions, or error logs to find out if something went wrong. Traces also measure the time spent in each service and underlying components. Based on the Google Dapper architecture, a trace is a tree that consists of spans, where a span is a basic unit of work. In the microservice world, one span normally equals one request to a service or component, like a database. This way, Instana automatically has not only end-to-end tracing of the application, but also information about the performance of each individual service and component.
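To make the trace model concrete, the following Python sketch represents a trace as a tree of spans and derives per-service call counts, error counts, and time spent. It is a simplified illustration, not Instana's data model: the `Span` class, the service names, and the assumption that child spans run sequentially inside their parent are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List

@dataclass
class Span:
    """One unit of work, typically one request to a service or component."""
    service: str
    duration_ms: int
    error: bool = False
    children: List["Span"] = field(default_factory=list)

def walk(span):
    """Yield the span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)

def per_service_stats(root):
    """Aggregate calls, errors, and self time (duration minus children) per service."""
    stats = defaultdict(lambda: {"calls": 0, "errors": 0, "self_ms": 0})
    for span in walk(root):
        entry = stats[span.service]
        entry["calls"] += 1
        entry["errors"] += int(span.error)
        # Assumes children run sequentially and do not overlap.
        entry["self_ms"] += span.duration_ms - sum(c.duration_ms for c in span.children)
    return dict(stats)

# Hypothetical trace: a frontend request calling two services, one of which uses a database.
trace = Span("shop-frontend", 120, children=[
    Span("cart-service", 70, children=[Span("postgres", 40)]),
    Span("payment-service", 30, error=True),
])

stats = per_service_stats(trace)
for service, entry in stats.items():
    print(service, entry)
```

The root span gives the end-to-end time of the request, while the per-service entries show where that time was spent and which service reported the error.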

If the health of a service is impacted, Instana will create a new incident and correlate it with all the other incidents by traversing the Dynamic Graph of Issues and Events.

The result is a comprehensive overview of the situation regarding Service and Event Impact.

Changes

A change can be anything from a server start or stop, a deployment, a configuration change on a system, you name it.

Instana recognizes changes by tracking relevant configurations specific to each monitored technology, as well as monitoring if something goes online (monitored by Instana) or goes offline (is not monitored by Instana anymore).

Every change is recorded and typically has a duration of only 1 second (start and end time difference). Like issues, change events are also correlated into an incident should they be relevant, thus sparing you an alert just because a system went offline. Maybe it was turned off because the load decreased at the end of the working day and it was no longer needed. 

image2016 11 22 17 6 2

Terminology

Issue

An event marked as suspicious by Instana using characteristics specific to the technology within which it occurred.

Incident

An issue correlated to performance degradation that should be investigated by a human.

Change

Any change within the environment, including but not limited to: a deployment, a configuration change, a server start or stop.

Usage

How to analyze an incident

Let’s take an example incident: a service is suddenly responding slower than usual - we call this a “sudden increase in average latency.” The incident is marked in yellow as a warning. This is done automatically, and there is no configuration option for thresholds - which is another major benefit of Instana. The color is shown for as long as the incident is active. Once it is resolved, it is shown in grey but remains available for drill-down. Clicking on the line in the table opens the incident detail view:

events and incidents 4

The incident detail view (right part of the screen) is organized into three parts:

  1. The header contains basic information about the key facts of the incident. 

    • Start time;
    • End time (current if it is still ongoing);
    • The number of the still active events;
    • The number of changes involved;
    • The number of affected entities.

You can see the incident start date, the end date if available, how many events are still active, how many changes belong to this incident, and the number of affected entities:

image2016 11 22 17 33 26

  2. The second section provides a visual representation of the incident development over time. The chart shows the complete time frame, from start to end (current server time, if the incident is still open) and all events, sorted by start time. The view is limited to seven events when collapsed. Press the expand button to see the full view if your incident contains more than seven events. Clicking on any of the bars will open the detail view for that issue:

image2016 11 08 15 44 12

  3. The third section contains the details for the graph view in section 2. A list of all events, sorted by start time, allows the user to see all available information for each event. To do this, just click an event to expand it:

image2016 11 08 15 56 23

The details help in understanding the event, followed by multiple charts with the corresponding metric plotted for visualization. If an event is still active, the chart will continue rendering new incoming metric values. There are two flags available, emphasizing that this event affects a service and/or that this event has triggered the incident. If available, the flags are placed at the top of each event in the list (see screenshot above).

When focusing on an event, the detail section will provide the same information described for the incident's event list in point 3.

Search capabilities - finding an incident

Searching through events discovered by Instana relies on the Dynamic Focus feature. When in the incident view, you’ll notice that event.type:incident has been automatically added to the search bar. In addition to this event filter, the current time window is also applied as a filter.

In addition, you can use the search box to find specific items by the data shown in the columns “Title” or “On” (the name of the service on which the incident occurred) in the overview table. In this example, the search term is “sudden” and you don’t need to add a wildcard character to the string. The result is a list of all incidents containing the word “sudden” in the title:

image2016 11 22 18 6 10
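The matching behavior described above can be pictured as a plain substring test against the “Title” and “On” columns. The Python sketch below is only an illustration of that idea: the incident records are invented, and treating the match as case-insensitive is an assumption, not documented behavior.

```python
def search_incidents(incidents, term):
    """Return incidents whose title or service name contains the search term.

    No wildcard is needed: the term is matched anywhere in the text.
    Case-insensitive matching is an assumption made for this sketch.
    """
    needle = term.lower()
    return [
        incident for incident in incidents
        if needle in incident["title"].lower() or needle in incident["on"].lower()
    ]

# Hypothetical rows from the incident overview table.
incidents = [
    {"title": "Sudden increase in average latency", "on": "cart-service"},
    {"title": "Error rate is higher than normal", "on": "payment-service"},
    {"title": "Sudden drop in the number of calls", "on": "shop-frontend"},
]

for incident in search_incidents(incidents, "sudden"):
    print(incident["title"], "on", incident["on"])
```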

Configuration

Instana uses a predefined set of health rules, built by our experts, to determine anomalies in the trace model and what issues the user needs to know about. Users can configure their own custom rules into Instana, which can then trigger custom issues.

This functionality can currently be found under Knowledge Management in the Settings menu.

Custom Rules

A custom rule enables users to define conditions on individually selectable application metrics. To select a metric, the user first chooses between built-in metrics and custom metrics (such as Dropwizard, JMX, or StatsD metrics) using the origin dropdown menu, and gives the rule a name:

custom rule origin

In case of the built-in metrics, the user needs to additionally choose the entity type of interest:

custom rule entity type

After that, the user has to provide the following input to configure a custom rule:

  • metric - defines the metric to which the user wants to apply the rule, like CPU steal or a custom metric name (supports filtering by typing part of the metric name);
  • time window - defines the time interval which is used to evaluate the rule; 
  • aggregation - defines whether to take an average or a sum of the metric values in the specified time interval; 
  • operator - indicates the condition type to be applied against the aggregation value defined above;
  • value - the threshold value which, when crossed, leads to Instana triggering an issue (see issue configuration in the next section).

In the snapshot below, the custom rule dictates that an alert will be triggered when the host’s CPU steal time is, on average over a 1-minute time interval, greater than twenty percent:

custom rule example
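The five inputs of a custom rule can be read as a small evaluation procedure: collect the metric values inside the time window, aggregate them, and compare the result with the threshold. The Python sketch below illustrates that logic under stated assumptions. The `CustomRule` class, its field names, and the metric name `cpu.steal` are hypothetical and are not Instana's configuration format or API.

```python
import operator
from dataclasses import dataclass
from statistics import mean

# Supported aggregations and comparison operators for this sketch.
AGGREGATIONS = {"avg": mean, "sum": sum}
OPERATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

@dataclass
class CustomRule:
    name: str
    metric: str        # metric the rule applies to (hypothetical name)
    window_s: int      # time window in seconds
    aggregation: str   # "avg" or "sum"
    op: str            # comparison operator
    value: float       # threshold

    def is_violated(self, samples, now):
        """samples is a list of (timestamp_s, value) pairs for the rule's metric."""
        in_window = [v for ts, v in samples if now - self.window_s < ts <= now]
        if not in_window:
            return False  # no data in the window, nothing to evaluate
        aggregated = AGGREGATIONS[self.aggregation](in_window)
        return OPERATORS[self.op](aggregated, self.value)

# The rule from the snapshot: average CPU steal over 1 minute greater than 20%.
rule = CustomRule("High CPU steal", "cpu.steal", 60, "avg", ">", 0.20)

# Hypothetical samples; the one at t=0 falls outside the window ending at t=60.
busy = [(0, 0.05), (10, 0.25), (30, 0.30), (50, 0.22), (60, 0.19)]
calm = [(10, 0.02), (30, 0.04), (50, 0.03)]

print(rule.is_violated(busy, now=60))
print(rule.is_violated(calm, now=60))
```

Using an average rather than a single sample means a short spike inside the window does not trigger the rule on its own.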

Custom Issues & Incidents

Configuring custom issues lets users apply any custom rule they have already built to multiple entities. The user defines which entities are covered by specifying a filter query. In the snapshot below, the rule of load > 1 is being applied to those entities in zone:us-east-1d. Now, whenever the defined conditions evaluate to true, an issue is generated and Instana will alert the user.

After applying a custom rule to a defined group of entities, there are further customization options. The user can define the severity of the custom issue as either “warning” or “critical.” They can set an expiration time for the issue, assign a custom string of text to name the issue, and provide a short description. They can also toggle whether the issue is important enough to generate an incident containing all the correlated events.

images2017 03 27 16 05 36
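Conceptually, a custom issue combines a rule, a filter query that selects entities, and the options described above. The following Python sketch shows that combination for the load > 1 example in zone:us-east-1d. It is an illustration only: the `CustomIssue` class, the entity dictionaries, and the `matches` helper are hypothetical, and the helper understands only a single key:value term, not the full filter query syntax.

```python
from dataclasses import dataclass

@dataclass
class CustomIssue:
    name: str
    query: str                       # filter query, e.g. "zone:us-east-1d"
    threshold: float                 # rule used here: load > threshold
    severity: str = "warning"        # "warning" or "critical"
    triggers_incident: bool = False  # whether the issue should open an incident

def matches(entity, query):
    """Match a single key:value term against an entity's attributes."""
    key, _, value = query.partition(":")
    return entity.get(key) == value

def evaluate(issue, entities):
    """Return one event per matching entity that violates the rule."""
    return [
        {
            "entity": entity["name"],
            "title": issue.name,
            "severity": issue.severity,
            "incident": issue.triggers_incident,
        }
        for entity in entities
        if matches(entity, issue.query) and entity["load"] > issue.threshold
    ]

issue = CustomIssue("High load in us-east-1d", "zone:us-east-1d", threshold=1.0,
                    severity="critical", triggers_incident=True)

# Hypothetical entities with their current load.
entities = [
    {"name": "host-a", "zone": "us-east-1d", "load": 1.7},
    {"name": "host-b", "zone": "us-east-1d", "load": 0.4},
    {"name": "host-c", "zone": "us-west-2a", "load": 3.2},
]

events = evaluate(issue, entities)
print(events)
```

Only host-a produces an event here: host-b matches the filter but not the rule, and host-c violates the rule but lies outside the filtered zone.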

It is common for teams to want to be informed if some entity goes offline for any reason. Instana tracks both online and offline events, and can leverage this information to define custom issues based on those offline events. To enable this, select the ‘offline’ rule in any custom issue configuration dialogue box. Subsequently, if any entity matching the applied filter query goes offline, this custom issue will fire, creating a warning or critical event, or an incident, depending on your configuration.

offline alerting

Once a custom issue is defined, it can be easily enabled or disabled in the custom issues overview:

images2017 03 27 16 12 28