Modern IT environments can consist of many different components, ranging from a few monolithic applications, to thousands of microservices. These components are connected by application code in an ever-changing composition to adapt to constant technical and business change. We use tracing to understand the interaction between different components and the messages that are passed between them. In order to understand how a system of services works together and the impact of subsequent component failure, we enhance traces with information about the underlying service, application, and system infrastructure using our unique Dynamic Graph.
A trace is the representation of a single request and its path through a system of services. This could be the direct result of a request initiated by a customer’s browser, but could also be initiated by a scheduled job or any other internal execution. Each trace is made up of one or more calls.
A call represents a communication between two services. It consists of detailed information of the caller and callee including endpoints, type specific payloads such as parameters, error codes and details.
A call itself is typically made of one or two spans. For most users and use-cases the concept of a call should be enough to get started and solve a lot of problems. To dig deeper see Further Concepts below.
Traces underpin each section of the Application view in Instana and can be examined in detail in the Analyze view. In the Analyze view, you can investigate the traces and calls collected by Instana. In order to understand how the whole application behaves with each call, we monitor each one of those calls as they come in to the system.
Traces in Application & Service Dashboards
From the applications and services dashboards a user can dive into the underpinning traces. On each dashboard a Traces button is located in the upper right corner. Selecting it allows to “Analyze Traces” by Application, Service and Endpoint.
The Analyze View can filter and group traces or calls by arbitrary tags. Filters are connected with AND logic operator so a trace or a call needs to match all the filters. The default grouping is the endpoint name. Grouping can be removed to inspect the individual traces and calls that match the filters. The above example filters by “application=Demo” and “service=shop” and shows the traces grouped by the trace endpoint.
Selecting a group and picking a trace from the Analyze View will open up the Trace View. Selecting a call will open this particular call in context of its trace.
Above is a summary of a specific trace, including a timeline and a service list. The summary contains:
- The tracename (usually an HTTP entry)
- The name of the service it occurred on
- The type or technology.
The core KPIs:
- Calls to other services
- The total latency
- Errors within the trace
The timeline shows when the trace has started and the chronological order on which services have been called throughout the trace. The call chains hang from the root element (or span, as we call it). On simple three tier systems, you will have a typical depth of about four levels, while on systems with a distributed service or microservices architecture, you can expect to see much longer icicles. That graph gives a good impression of the call structure - if you have either very long subcalls of the trace, or certain periodic call patterns, like one HTTP call per database entry. When you then click on a span in the graph you jump to the details of it.
The different services have their own color (in this example shop = blue, productsearch green). Hovering over a particular call shows in detail where the time was spent: Self (within the call), Waiting on another call or on the Network. Clicking on a call shows the call on the trace tree further down.
The service list below summarizes all calls per service and allows a quick overview of number of calls, the aggregated time and the errors occurred.
Trace Tree & Call Detail
The trace tree gives a detailed insight in the structure of upstream and downstream service calls. It shows the type of the call and allows to explore the trace tree by expanding and collapsing individual parts of the tree. It also allows to directly go to the services and endpoints dashboards of a call.
When clicking on a call the call detail sidebar is opened. It shows call details such as the Errors and status code, the stack trace and some more details captured during that call.
Capturing Logs and Errors
Instana automatically captures errors when a service returns a bad response or log with an
WARN level (or similar depending on the framework) was detected.
Automatic Aggregation of Short Exit Calls
Instana always endeavors to give you the best understanding of service interactions, while also minimizing impact on the actual application. Certain scenarios, however, require Instana to drop data in order to achieve that. A very common problem in systems is the so called 1+N query problem, which describes a situation where code performs 1 database call to get a list of items, followed by N individual calls to retrieve the individual items. The problem usually can be fixed by only performing one call and joining the other calls to it.
At the moment, capturing of parameters or return values is not possible due to impact concerns.
Long Running Tasks
Due to timeouts, high load, or any other number of environmental conditions, it is possible that calls need significant time until they respond. Traces can contain tens or even hundreds of such calls. As we don’t want to wait until all calls have responded to deliver tracing information, long running spans are replaced with a placeholder. When the span finally returns, the placeholder is replaced again with the correct call information.
Instana stores all traces and calls for 7 days. Past this period, our retention strategy retains statistically significant traces and calls to prevent unbounded storage growth. On-prem setups can configure this retention period to suit their needs.
When looking at historic data, this indicator will be displayed. Trace and call KPIs are unaffected within Analyze past the retention period, but only retained traces and calls will be available to view in detail.
For most users and use-cases the core concepts should be enough to get started and solve a lot of problems. The following section is for architects and interested developers to dig deeper and understand the underlying concepts.
A trace is the representation of a single request and its path through a system of services. This could directly be the result of a request initiated by a customer’s browser, but could also be initiated by a scheduled job or any other internal execution. Each trace is made up of one or more calls.
A call represents a communication between two services and can be seen as a request and a response where the response can be asynchronous. A call is a set of data and time measurements corresponding to a particular RPC or service call. One could have an HTTP or database call, for example. Instana makes it easy to understand the
type of a call, like HTTP, messaging, database or batch, by highlighting that on the UI.
In order to capture this data, Instana measures both on the caller and the callee side. This is crucial in distributed systems since depending on the scenario the monitoring tool can measure on the caller side or the callee side or on both sides. In distributed tracing, these individual measurements are called spans.
Calls may represent operations that have incurred in errors. For example, an call representing an HTTP operation could result on a 4xx or 5xx status code, or the invocation of an API through Java RMI may result in an exception being thrown. Such calls are said to be erroneous, and Instana marks them accordingly in the UI.
As shown in the image above, the ceoncepts of “erroneous call” and “error log” should not be confused. Instana automatically collect logs with level
ERROR (and equivalent, depending on the logging framework). Each collected log is linked to the call during which execution the log has been filed. In the image above, we see that the root call of the trace is both erroneous and has one error log associated with it. In general, however, a call may be erroneous without having error logs associated with it, and vice-versa.
Spans represents timing of code execution (literally an action with its start and end times). It also carries a set of data consisting of both a timestamp and a duration. Different types of spans can have one or several sets of these, complete with metadata annotations. Every trace model is comprised of a block of spans in a hierarchical set, ordered by 64-bit identifiers used for reference between parent (“caller”) and child (“callee”) spans. The first span in any trace serves as root, and its 64-bit identifier is considered the identifier for the trace as a whole.
The first span of a particular service indicates that a call entered the service, and is called an entry span (in the Dapper paper, this is named “server span”). Spans of calls leaving a service are called exit span (in the Dapper paper, this is named a “client span”). In addition to entry and exit spans, intermediate spans mark significant sections of code so the trace runtime can be clearly attributed to the correct code.
By the way, the name “span” is derived from the famous Google’s Dapper paper, and is short for “timespan”.
Distributed Tracing Theory
A span carries a set of data consisting of both a timestamp and a duration. Different types of spans can have one or several sets of these, complete with metadata annotations. Every trace model is comprised of a block of spans in a hierarchical set, ordered by 64-bit identifiers used for reference between parent (“caller”) and child (“callee”) spans. The first span in any trace serves as root, and its 64-bit identifier is considered the identifier for the trace as a whole.
The first span in a particular service indicates that a call entered the service, and is called an entry span (The Dapper paper refers to it as a “server span”). Spans of calls leaving a service are called exit spans (in the Dapper paper they are “client spans”). Intermediate spans mark significant sections of code so the trace runtime can be clearly attributed to the correct code. In order to follow a sequence of spans across services, Instana sends correlation headers automatically with instrumented exits, and those correlation headers are automatically read by Instana’s entries. See HTTP Tracing Headers.
Each span has an associated type, such as HTTP call or database connection. Depending on the span type, additional contextual data is also associated with the span.
For the tracing implemented in Instana, we took inspiration from Dapper, a paper describing distributed tracing at Google, and the OpenTracing Initiative started by developers of “Zipkin,” a Dapper-like tracing developed at Twitter. Google and Twitter both face highly distributed service architectures and have already addressed the problems many microservice users have today.
A couple of key takeaways from Dapper / OpenTracing worth highlighting are:
- The precise lines of code running are not interesting to either operations or development. For operations, code changes are rarely possible, which turns most of the APM tools into development focused tools. But even for development departments, most of the data on code contained in callstacks is noise. It is either unchangeable runtime code, framework code, or third party code.
- The interaction between Services and Components matters. Most problems can be resolved when the failing component or service has been identified. While there can be internal errors, service availability and interaction is the most important to monitor. Instana also uses tracing to understand the logical dependencies of components that contribute to its the Dynamic Graph.
- Remote calls are especially interesting due to their inherent slowness. While many APM tools focus on in process callstacks, it turns out that most of the time these are really fast. Slow performance most often comes into play when accessing remote services.
- One-off problems are time wasters to troubleshoot, insights can be gained even when looking at a very low percentage of all traces. Every engineer wants to make sure his/her code runs as fast as possible, but it is inefficient to just investigate one call out of a thousand, even when it is 100ms slower than the average.
Motivation on Tracing
Tracing is at the core of every Application Performance Management tool. It is called “User-Defined Transaction Profiling” by Gartner and has two important values for the users of APM tools:
- Understand transaction flows through all involved components to get a better view of the application architecture and distributed call patterns.
- Troubleshoot performance problems by getting method/code level details, including execution times.
The current generation of APM Tools, and their tracing capabilities, focus mainly on the second point, with the “callstack”. At Instana, we have found this approach to be less than ideal for several reasons, and instead, we focus on the first value, especially when dealing with highly distributed microservice environments.
To set the stage for understanding tracing, here is a quick introduction on how callstacks work. A callstack is an ordered list of code executions, such that whenever code invokes other code, the new code is put onto the top of that stack. Callstacks are used by runtimes of all programming languages, and they can usually be seen printed out as “stacktrace” when errors occur as they allow “tracing back” to what calls led to the error.
For example, an error message might say, “Apple is not a number”. Combined with the callstack, it is possible to narrow down where in a complex system the error actually occurred. The message alone is usually insufficient, as the NumberUtil algorithm might be used in many places.
Thread.run() HttpFramework.service() HttpFramework.dispatch() ShoppingCart.update() ShoppingCart.updateCart() ShoppingCart.parseQuantity() ShoppingCart.convertParams() NumberUtil.convert() <-- Error: "Apple is not a number"
Developers can use this callstack to trace back from the error to the relevant business method (
ShoppingCart.parseQuantity()) in order to get a better understanding of why the error occurred.
When service oriented architectures appeared ten years ago, the callstack was broken apart. The “ShoppingCart” logic now might reside on server A, while “NumberUtil” resides on server B.
Suddenly, the error trace on server B was only containing the short callstack of the parse error, and on server A a new error was produced, saying that something went wrong on server B, but not stating the problem. Instead of a single error callstack, which is easy to troubleshoot, a user now ended up with two callstacks with two errors. Even worse, there was no connection between the two, sometimes making it impossible to have access to both at the same time.
Thread.run() HttpFramework.service() HttpFramework.dispatch() ShoppingCart.update() ShoppingCart.updateCart() ShoppingCart.parseQuantity() ShoppingCart.convertParams() RestClient.invokeConversion() <-- Error: Unkown
Thread.run() HttpFramework.service() HttpFramework.dispatch() NumberUtil.convert() <-- Error: "Apple is not a number"
This is where distributed tracing came into the picture. The idea behind distributed tracing was to fix this problem by simply connecting the two error call stacks with each other. While many ideas exist on how to do that, most implementations use a simple mechanism: When server A calls server B, the APM tool adds an identifier to the call which servers as a common reference point between the callstacks in the APM system. Such a mechanism is usually called “correlation,” and it sufficiently joins the two callstacks to produce one error.
Thread.run() HttpFramework.service() HttpFramework.dispatch() ShoppingCart.update() ShoppingCart.updateCart() ShoppingCart.parseQuantity() ShoppingCart.convertParams() RestClient.invokeConversion() Thread.run() HttpFramework.service() HttpFramework.dispatch() NumberUtil.convert() <-- Error: "Apple is not a number"
Much better: With some added decoration, where the remote call actually takes place and on which servers which parts of the callstack were executed, this allows for again finding out that the “ShoppingCart” was the context of the error, and the “NumberUtil” caused the shopping cart activity to fail.
While the examples so far illustrated error tracing, APM tools use the same mechanism also for taking and presenting performance measurements. So they annotate the trace with performance numbers like this:
413 Thread.run() 413 HttpFramework.service() 413 HttpFramework.dispatch() 412 ShoppingCart.update() 411 ShoppingCart.updateCart() 211 ShoppingCart.parseQuantity() 210 ShoppingCart.convertParams() 200 RestClient.invokeConversion() 10 Thread.run() 10 HttpFramework.service() 10 HttpFramework.dispatch() 5 NumberUtil.convert()
The total time for executing that shopping cart update was apparently 413ms. The actual number conversion took 5ms. The time spent in between is distributed among many calls, so one usually looks for bigger cliffs. In this case, updating the cart takes a total total of 411ms, while the parsing itself only requires 211ms, which itself did spend most time doing the remote call.
In most APM tools, the trace has become the core element and central domain of the product. We saw that because of that focus, most customers of existing APM solutions actually do not use the tool for operations and monitoring, but only for troubleshooting performance issues or errors. Callstacks themselves are insufficient for monitoring. They are not easy to read for users of the APM tool and do not provide information to correlate performance and availability of a system to overall health. System and application health needs to take into account not only what happens on a code execution, but also correlate much more information like process activity, resource usage, queuing, access patterns, load, and throughput.
Tracing at Instana
In case of errors or slow performance, we provide a detailed context, so that all the data needed for troubleshooting a particular case is available. We don’t collect this data (including callstack) for every trace because it is an invasive task that can cause overhead and side effects in the code itself.
Going back to the above example, this is roughly how Instana will show the above transaction:
Service A | ShoppingCart.update - 412ms | Service A | RestClient.invokeConversion - 200ms | Service B | NumberService - 5ms|
This is a more visual representation of call nesting and length. It is reduced to the important parts, showing where time was spent and remote calls took place. More importantly, it connects to our Dynamic Graph, which, for example, knows that the CPU on the Service B server is overloaded, and can correlate this to the above transaction for root cause deduction. Should other information, like Service URLs or database queries, be relevant Instana will capture them as well.
HTTP Tracing Headers
The Instana tracing feature utilizes 3 different HTTP headers:
X-INSTANA-L. They are hex encoded and refer to the span and trace ID of the preceding exit.
x-instana-l carries information about the trace level, where 0 means that no trace shall be recorded. If you have any firewalls, proxies or similar infrastructure in place that operates on an HTTP header whitelist, please add them to it.
To futureproof we recommend also to allow
tracestate, as these are the W3C standardized headers for tracing purpose.
Instana is designed to monitor modern applications that are built on microservices and are highly dynamic and largely scaled. When we started to think about how a distributed tracing feature should be designed for such environments, it seemed like an impossible task, as we wanted to be able to:
- Trace without any configuration,
- Trace without any manual instrumentation,
- Trace every service request – no sampling,
- Trace without measurable impact on the application,
- Trace with automatic correlation between service and underlying infrastructure metrics.
Many months and several patents later, we succeeded.
Monitoring systems can usually only work with data the system under management makes available. This is called “Observability.” Many standard system components, like database systems or web servers, provide such endpoints from which to read monitoring data. Unfortunately, applications often do not provide sufficient data on their own, thus a monitoring tool has to interact with the system under management and make data available itself.
Instana performs this interaction for all kinds of systems automatically when needed.
No Measurable Impact
Impact on an application can occur in various ways that are influenced by the instrumentation technology and implementation.
The instrumentation code of the APM tool could:
- Add execution time (latency) to every service request which makes them slower,
- Add additional objects/memory to the application code that may add overhead on garbage collection,
- Change the way the runtime (e.g. Just In Time Compiler) interprets the code, which leads to different or less optimized applications.
Instana’s instrumentation is built such that there is no measurable impact on the application – meaning the instrumented application will behave and perform with Instana the same way as without Instana. The overhead we impart is not in the running application process, but is in our Agent’s process only. That overhead is limited solely to adding a low percentage of CPU on one core of that host if free CPU is available.
There are three different categories of tracing support:
- Automatic Zero-Configuration Tracing;
- Manual Zero-Configuration Tracing;
- Fully Customizable Tracing.
Automatic Zero-Configuration Tracing
Tracing is supported automatically for:
- .NET (running on .NET 4.0+);
- Java (running on a JVM 1.6 +);
- Scala (running on a JVM 1.6 +);
- PHP (Apache+mod_php, PHP-FPM and PHP-CGI running PHP 5.3 or higher, incl. 7).
- Python (Python 2.7 and greater)
Manual Zero-Configuration Tracing
Additionally, tracing is supported after manual inclusion of the Instana library:
Fully Customizable Tracing
Instana supports a programmatic usage of tracing through several Trace SDKs:
- Java Trace SDK - Integrate additional information into the automatic tracing. Enables custom correlation.
- Ruby Trace SDK - Integrate additional information into the automatic tracing. Enables custom correlation, cross host, job, and asynchronous tracing.
- Trace REST Web Service - Supports creation of traces from arbitrary code.
- OpenTracing compatible SDK