Instana detects three major event groups to help you manage the Quality of Service of your applications:
An Issue is created when an entity becomes unhealthy. Issues are detected by machine learning algorithms and health signatures built to recognize a variety of unhealthy situations, ranging from degraded service quality to complex infrastructure problems such as disk saturation.
Incidents are the highest level of events. They are created when a discovered service is impacted, and they correlate all relevant events by leveraging the Dynamic Graph to provide context.
Changes. A Change is anything from a server start or stop, a deployment, or a configuration change on a system. Changes are further separated into:
- Changes - Changed configuration of components.
- Offline/Online - Tracking the presence of components under management.
An Issue is an event that is triggered if something out of the ordinary happens.
Critical infrastructure issues, such as disk saturation and Elasticsearch cluster split-brain situations, will trigger incidents because they are most likely to result in data loss.
An example of an issue:
In this example, the CPU steal time on one Linux machine is suspicious and is therefore marked as an issue. An issue by itself does not trigger an alert; Instana simply notes that it happened. Should the service to which this system is connected behave badly, this issue will become part of the incident. This methodology is one of the major benefits of Instana because it frees you from manually correlating events and performance problems. Just because something uses too much CPU for a while doesn’t mean there is a problem as such. Only when a service is impacted does this become relevant information.
Instana records the time when an issue occurred for the first time and also when the condition ceased to exist (start and end time). In this case, you see that the CPU steal exceeded the 5% limit for only two and a half minutes (from 16:25:11 to 16:27:53). By clicking on the issue line, you will see the details on the right hand side of the screen. The drop in CPU steal is evident at around 16:27.
- Applications, services, or endpoints which receive infrequent traffic (e.g., one call every 15 minutes) do not provide a sufficient basis for issue detection.
Incidents help you understand situations impacting your edge services and critical infrastructure by automatically learning their behavior and health, and then sending alerts when they become unhealthy. Edge services are what customers or other systems outside the monitored application actually access; they are the external deliverables of the application.
Incidents are created as soon as Instana detects either a breach of a key performance indicator (KPI) on an edge service or a critical infrastructure issue. Please read Part II of the Monitoring Microservices blog post for more context.
Instana tracks KPIs for discovered application services. Those KPIs are:
- Load (calls/second)
- Latency (response time in milliseconds)
- Errors (error rate)
- Saturation (of the most constrained resources of a service)
Instana automatically measures these KPIs for every service, and applies machine learning on these KPIs to figure out the health of a service. Typical problems that are detected are:
- The error rate is higher than normal
- The performance of the service is slow
- There is a sudden drop or increase of load
- The saturation of the service is close to reaching a limit
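To make the "higher than normal" idea above concrete, here is a deliberately simplified baseline check for the error-rate case. This is an illustrative toy only: Instana's actual detection uses machine-learned health signatures, not a fixed sigma rule, and the function name here is hypothetical.

```python
# Toy illustration of flagging an anomalous error rate against a
# historical baseline. NOT Instana's actual algorithm.
from statistics import mean, stdev

def is_error_rate_anomalous(history, current, n_sigma=3.0):
    """Flag `current` if it deviates more than n_sigma standard
    deviations from the historical baseline."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) > n_sigma * spread

# A service whose error rate is normally ~1% suddenly jumps to 15%.
normal_window = [0.010, 0.012, 0.009, 0.011, 0.010, 0.013]
print(is_error_rate_anomalous(normal_window, 0.15))  # True
```

The same shape of check applies to the other KPIs: establish what "normal" looks like from recent history, then flag significant deviations.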
KPIs are determined by capturing and analyzing every trace across the services and the application. Traces automatically capture errors such as status codes, exceptions, or error logs to find out if something went wrong. Traces also measure the time spent in each service and its underlying components. Based on the Google Dapper architecture, a trace is a tree that consists of spans, where a span is a basic unit of work. In the microservice world, one span normally equals one request to a service or component, such as a database. In this way, Instana automatically obtains not only end-to-end tracing of the application, but also information about the performance of each individual service and component.
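The Dapper-style trace model described above can be sketched in a few lines: a trace is a tree of spans, and subtracting child durations from a span yields the time spent in that service itself. The `Span` class and service names here are illustrative stand-ins, not Instana's API.

```python
# Minimal sketch of a trace as a tree of spans, with per-service
# self-time derived by subtracting child span durations.
from dataclasses import dataclass, field

@dataclass
class Span:
    service: str
    duration_ms: float            # total time spent in this span
    children: list = field(default_factory=list)

    def self_time_ms(self):
        """Time spent in this service itself, excluding child spans."""
        return self.duration_ms - sum(c.duration_ms for c in self.children)

# Hypothetical end-to-end trace: shop calls catalog, which calls MySQL.
trace = Span("shop", 120.0, [
    Span("catalog", 80.0, [
        Span("mysql", 30.0),
    ]),
])

def per_service_self_time(span, acc=None):
    """Walk the span tree and sum self-time per service/component."""
    acc = {} if acc is None else acc
    acc[span.service] = acc.get(span.service, 0.0) + span.self_time_ms()
    for child in span.children:
        per_service_self_time(child, acc)
    return acc

print(per_service_self_time(trace))
# {'shop': 40.0, 'catalog': 50.0, 'mysql': 30.0}
```

This is how a single trace yields both the end-to-end picture (the root span) and per-service performance (each span's self-time).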
If the health of a service is impacted, Instana will create a new incident and correlate it with all related issues and events by traversing the Dynamic Graph.
The result is a comprehensive overview of the situation regarding Service and Event Impact.
A change can be anything from a server start or stop, a deployment, a configuration change on a system, you name it.
Instana recognizes changes by tracking relevant configurations specific to each monitored technology, as well as by monitoring whether something comes online (becomes monitored by Instana) or goes offline (is no longer monitored by Instana).
Every change is recorded and typically has a duration of only 1 second (start and end time difference). Like issues, change events are also correlated into an incident should they be relevant, thus sparing you an alert just because a system went offline. Maybe it was turned off because the load decreased at the end of the working day and it was no longer needed.
- Issue - An event marked as suspicious by Instana using characteristics specific to the technology within which it occurred.
- Incident - An issue correlated to performance degradation that should be investigated by a human.
- Change - Any change within the environment, including but not limited to: a deployment, a configuration change, a server start or stop.
How to analyze an incident
Let’s take an example incident: a service is suddenly responding slower than usual - we call this a “sudden increase in average latency.” The incident is marked in yellow as a warning. This is done automatically, with no configuration option for thresholds - which is another major benefit of Instana. The color remains as long as the incident is active. Once it is resolved, the incident is presented in grey but is still available for drill-down. Clicking on the line in the table starts the experience:
The incident detail view (right part of the screen) is organized into three parts:
The header contains basic information about the key facts of the incident.
- Start time;
- End time (current if it is still ongoing);
- The number of still-active events;
- The number of changes involved;
- The number of affected entities.
You can see the incident start date, the end date if available, how many events are still active, how many changes belong to this incident, and the number of affected entities:
- The second section provides a visual representation of the incident’s development over time. The chart shows the complete time frame, from start to end (current server time, if the incident is still open) and all events, sorted by start time. The view is limited to seven events when collapsed. Press the expand button to see the full view if your incident contains more than seven events. Clicking on any of the bars opens the detail view for that issue:
- The third section contains the details for the graph view in section 2. A list of all events, sorted by start time, allows you to see all available information for each event. To do this, just click an event to expand it:
The details help in understanding the event, followed by multiple charts with the corresponding metric plotted for visualization. If an event is still active, the chart will continue rendering new incoming metric values. There are two flags available, emphasizing that the event affects a service and/or that the event triggered the incident. If available, the flags are placed at the top of each event in the list (see screenshot above).
When focusing on an event, the detail section provides the same information described for the incident’s event list in section 3.
Search capabilities - finding an incident
Searching through events discovered by Instana relies on the Dynamic Focus feature. When in the incident view, you’ll notice that event.type:incident has been automatically added to the search bar. In addition to this event filter, the current time window is also applied as a filter.
In addition, you can use the search box to find specific items by the data shown in the columns “Title” or “On” (the name of the service on which the incident occurred) in the overview table. In this example, the search term is “sudden” and you don’t need to add a wildcard character to the string. The result is a list of all incidents containing the word “sudden” in the title:
Instana uses a predefined set of health rules, built by our experts, to determine anomalies in the trace model and what issues the user needs to know about. Users can configure their own custom rules into Instana, which can then trigger custom issues.
This functionality can currently be found under Knowledge Management in the Settings menu.
Built-in rules are predefined health rules based on integrated algorithms which help you understand the health of the monitored system in real time. Each can be disabled individually in case a built-in rule is not relevant for the monitored system. The number of built-in rules is growing constantly, as the experts at Instana are continuously working on new rules to enable better insight into the health of the monitored system. The rules can be sorted by Name, Entity type, or Enabled by clicking on the header of each column. The search box above the table allows you to filter the built-in rules by rule name and entity type.
The expand button in the leftmost column can be clicked to get a brief overview of each built-in rule, containing the affected entity type, a short description, and the parameters used.
Clicking on any built-in rule navigates to its details page. This page describes the algorithm behind the rule, the input for the rule (type of metric used), and a detailed description and the values of each parameter.
A custom rule enables users to define conditions on individually selectable application metrics. In order to select a metric, the user first needs to choose between the built-in and custom metrics, such as Dropwizard, JMX or StatsD metrics, using the origin dropdown menu and give the rule a name:
In case of the built-in metrics, the user needs to additionally choose the entity type of interest:
After that, the user has to provide the following input to configure a custom rule:
- metric - defines the metric to which the user wants to apply the rule, like CPU steal or a custom metric name (supports filtering by typing part of the metric name);
- time window - defines the time interval which is used to evaluate the rule;
- aggregation - defines whether to take an average or a sum of the metric values in the specified time interval;
- operator - indicates the condition type to be applied against the aggregation value defined above;
- value - the threshold value which, when crossed, leads to Instana triggering an issue (see issue configuration in the next section).
In the snapshot below, the custom rule dictates that an alert will be triggered when the host’s CPU steal time, averaged over a 1-minute interval, is greater than 20 percent:
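The way the inputs above combine can be sketched as follows: the metric samples in the time window are aggregated (average or sum), and the result is compared against the threshold with the chosen operator. The function and parameter names here are illustrative, not Instana's configuration API.

```python
# Sketch of evaluating a custom rule: aggregate metric samples over
# the time window, then compare against the threshold.
import operator

OPERATORS = {">": operator.gt, "<": operator.lt,
             ">=": operator.ge, "<=": operator.le}

def evaluate_rule(samples, aggregation, op, threshold):
    """Return True if the rule condition holds, i.e. an issue fires."""
    if not samples:
        return False
    if aggregation == "avg":
        value = sum(samples) / len(samples)
    elif aggregation == "sum":
        value = sum(samples)
    else:
        raise ValueError(f"unknown aggregation: {aggregation}")
    return OPERATORS[op](value, threshold)

# CPU steal samples (percent) over a 1-minute window, one per 10 s.
cpu_steal = [22.0, 25.0, 19.0, 30.0, 24.0, 21.0]   # average: 23.5
print(evaluate_rule(cpu_steal, "avg", ">", 20.0))  # True: issue fires
```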
Custom Issues & Incidents
Configuring custom issues lets users apply any custom rule they have already built to multiple entities. The user defines which entities by specifying a filter query. In the snapshot below, the rule load > 1 is being applied to the entities in zone:us-east-1d. Now, whenever the defined conditions evaluate to true, an issue is generated and Instana will alert the user.
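Scoping a rule with a filter query amounts to selecting the matching entities and applying the condition to each. The entity records and matching logic below are simplified stand-ins for illustration, not Instana's filter-query engine.

```python
# Illustrative: apply "load > 1" only to entities in a given zone,
# mirroring the zone:us-east-1d example above.
entities = [
    {"name": "host-1", "zone": "us-east-1d", "load": 1.7},
    {"name": "host-2", "zone": "us-east-1d", "load": 0.4},
    {"name": "host-3", "zone": "eu-west-1a", "load": 2.3},
]

def issues_for(entities, zone, load_threshold):
    """Names of entities in `zone` whose load exceeds the threshold --
    each of these would generate a custom issue."""
    return [e["name"] for e in entities
            if e["zone"] == zone and e["load"] > load_threshold]

print(issues_for(entities, "us-east-1d", 1.0))  # ['host-1']
```

Note that host-3 is excluded despite its high load: the filter query limits the rule to the chosen zone.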
After applying a custom rule to a defined group of entities, there are further customization options. The user can set the severity of the custom issue to either “warning” or “critical”, implement an expiration time for the issue, assign a custom string of text to name the issue, and provide a quick description. They can also toggle whether the issue is important enough to generate an incident containing all the correlated events.
It is common for teams to want to be informed if some entity goes offline for any reason. Instana tracks both online and offline events, and can leverage this information to define custom issues based on those offline events. To enable this, select the ‘offline’ rule in any custom issue configuration dialogue box. Subsequently, if any entity matching the applied filter query goes offline, this custom issue will fire, creating a warning or critical event, or an incident, depending on your configuration.
Once a custom issue is defined, it can be easily enabled or disabled in the custom issues overview: