Concepts of tracing

Tracing, or User-Defined Transaction Profiling in Gartner's terminology, is at the core of every Application Performance Management (APM) tool. Instana provides a comprehensive view of your application architecture and distributed call patterns by following transaction flows through all the connected components. This approach is especially relevant in highly distributed and microservice environments.

This section describes the general concept of distributed tracing and how it is implemented in Instana AutoTrace™. For more information about which technologies and runtimes can be traced with Instana, see Tracing in Instana.

Trace

A trace represents a single request and its path through a system of services. A trace can be the direct result of a request initiated by a customer's browser, a scheduled job, or any other internal execution. Each trace is made up of one or more calls.

Call

A call represents communication between two services: a request and its response (which can be asynchronous). Each call is a set of data and time measurements corresponding to a particular Remote Procedure Call (RPC) or service call. Within the Instana UI, each type of call is highlighted, such as HTTP, messaging, database, batch, or internal.

To capture the call data, both the caller and the callee side are measured, which is crucial in distributed systems. In distributed tracing, these individual measurements are called spans.

An internal call is a particular type of call that represents work that is done inside a service. It can be created from intermediate spans that are sent through custom tracing. If you prefer to write your own custom instrumentation, Instana supports OpenTelemetry, OpenTracing, OpenCensus, Jaeger, Zipkin, the Web Trace SDK, or one of the language-based tracing SDKs.

Calls can represent operations that result in errors. For example, a call that represents an HTTP operation might result in a 5xx status code, or the invocation of an API through Java Remote Method Invocation (RMI) might result in an exception. Such calls are considered erroneous and are marked accordingly in the Instana UI, as shown in the following image.

HTTP calls that result in a 4xx status code are not considered erroneous, because 4xx codes are defined as client-side errors.

Trace View

As shown in the image, error logs are shown in the call that they are associated with. Instana automatically collects logs with the levels WARN and ERROR (and their equivalents, depending on the logging framework). In the image, a call is erroneous and has one error log associated with it. However, in general a call might be erroneous without having error logs associated with it, and vice versa.

Span

The name span is derived from Google's Dapper paper and is short for timespan. A span represents the timing of a code execution, that is, an action with a start and an end time. Each span also carries a set of data that consists of a timestamp and a duration, and, depending on the type of span, one or several sets of data complete with metadata annotations. Every trace consists of a hierarchy of spans, ordered by 64-bit identifiers that are used to reference between parent (caller) and child (callee) spans. In each trace, the first span serves as the root, and its 64-bit identifier is the identifier for the whole trace.

The first span of a particular service indicates that a call entered the service and is called an entry span (in the Dapper paper, entry spans are named "server spans"). Spans of calls that leave a service are called exit spans (in the Dapper paper, exit spans are named "client spans"). In addition to entry and exit spans, intermediate spans mark significant sections of code, so the trace runtime can be clearly attributed to the correct code.

Span Model

Each span has an associated type, such as an HTTP call or a database connection. Depending on the type of span, more contextual data is associated with it. To follow a sequence of spans across services, Instana automatically sends correlation headers with instrumented exits, and those correlation headers are automatically read by Instana's entries. For more information, see HTTP tracing headers.
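
To make the span model concrete, the following minimal sketch expresses it as a Java record. The field names and types are illustrative only; they do not mirror Instana's internal span format.

import java.util.Map;

// Illustrative span model: a timed code execution plus type-specific metadata.
public record Span(
        long traceId,                    // identifier of the whole trace; equals the root span's own ID
        long spanId,                     // 64-bit identifier of this span
        Long parentSpanId,               // spanId of the caller; null for the root span
        Kind kind,                       // ENTRY ("server"), EXIT ("client"), or INTERMEDIATE
        long startTimestampMillis,       // when the measured code execution started
        long durationMillis,             // how long the execution took
        Map<String, String> annotations  // contextual data, for example an HTTP URL or a database query
) {
    public enum Kind { ENTRY, EXIT, INTERMEDIATE }
}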

Understanding tracing

Callstacks

A callstack is an ordered list of code executions. Whenever code invokes other code, the new code is put on top of the stack. Callstacks are used by the runtimes of all programming languages and are usually printed as a stacktrace. When an error occurs, the stacktrace traces back through the calls that led to the error.

For example, the following error message states Apple is not a number. Combined with the callstack, it is possible to narrow down where in a complex system the error occurred. The message alone is usually insufficient, as the NumberUtil algorithm might be used in many places.

Thread.run()
  HttpFramework.service()
    HttpFramework.dispatch()
      ShoppingCart.update()
        ShoppingCart.updateCart()
          ShoppingCart.parseQuantity()
            ShoppingCart.convertParams()
              NumberUtil.convert()  <-- Error: "Apple is not a number"

To understand why the error occurred, use the callstack to trace back from the error to the relevant business method, which in this case is ShoppingCart.parseQuantity().

Callstacks by themselves are insufficient for monitoring. They are not easy to read, and they do not provide enough information to correlate the performance and availability of a system with its overall health. To see what happens during a code execution and to correlate it, consider information like process activity, resource usage, queuing, access patterns, load and throughput, and system and application health.

Distributed tracing

With the introduction of service-oriented architectures (SOA), the callstack is broken apart. For example, the ShoppingCart logic might now reside on server A, while NumberUtil resides on server B. An error trace on server B contains only the short callstack of the parse error, while server A produces a new error stating that something went wrong on server B, but not stating the problem itself.

Instead of a single error callstack that is easy to troubleshoot, you end up with two callstacks and two errors. Also, because no connection exists between the two, you cannot view both at the same time.

Server A:

Thread.run()
  HttpFramework.service()
    HttpFramework.dispatch()
      ShoppingCart.update()
        ShoppingCart.updateCart()
          ShoppingCart.parseQuantity()
            ShoppingCart.convertParams()
              RestClient.invokeConversion() <-- Error: Unknown

Server B:

Thread.run()
  HttpFramework.service()
    HttpFramework.dispatch()
      NumberUtil.convert()  <-- Error: "Apple is not a number"

The idea behind distributed tracing is to fix this problem by connecting the two error callstacks with each other. Most implementations use a simple mechanism to do so: when server A calls server B, the application performance monitoring (APM) tool adds an identifier to the call that serves as a common reference point between the callstacks in the APM system. This mechanism is called correlation, and it joins the two callstacks to produce one error.

Thread.run()
  HttpFramework.service()
    HttpFramework.dispatch()
      ShoppingCart.update()
        ShoppingCart.updateCart()
          ShoppingCart.parseQuantity()
            ShoppingCart.convertParams()
              RestClient.invokeConversion()
                Thread.run()
                  HttpFramework.service()
                    HttpFramework.dispatch()
                      NumberUtil.convert()  <-- Error: "Apple is not a number"

By understanding where the remote call takes place and on which server each part of the callstack was executed, you can determine that the ShoppingCart was the context of the error, and that NumberUtil caused the shopping cart activity to fail.
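
The following sketch illustrates the correlation mechanism on the caller side: before server A calls server B, the trace and span identifiers are attached to the outgoing request, so server B can link its callstack to server A's. The header names are the ones Instana uses for HTTP (see Tracing headers later in this section); the endpoint and class name are hypothetical.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CorrelatedRestClient {

    public static String invokeConversion(String traceId, String parentSpanId)
            throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://server-b/convert?value=Apple")) // hypothetical endpoint
                .header("X-INSTANA-T", traceId)      // identifies the whole trace
                .header("X-INSTANA-S", parentSpanId) // identifies the calling span on server A
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // Server B reads these headers, creates an entry span whose parent is
        // parentSpanId, and thereby joins the two callstacks into one trace.
        return response.body();
    }
}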

Measuring performance

The preceding examples illustrate how APM tools trace errors. The same mechanism is used for taking and presenting performance measurements: the trace is annotated with timing numbers (in milliseconds), as shown:

413 Thread.run()
413   HttpFramework.service()
413     HttpFramework.dispatch()
412       ShoppingCart.update()
411         ShoppingCart.updateCart()
211           ShoppingCart.parseQuantity()
210             ShoppingCart.convertParams()
200               RestClient.invokeConversion()
 10                 Thread.run()
 10                   HttpFramework.service()
 10                     HttpFramework.dispatch()
  5                       NumberUtil.convert()

The total time for executing the shopping cart update is approximately 413 ms. The number conversion (NumberUtil.convert()) took 5 ms. The time in between is distributed across many calls, so you are looking for bigger cliffs. In this example, updating the cart (ShoppingCart.updateCart()) took a total of 411 ms, while the parsing (ShoppingCart.parseQuantity()) required only 211 ms, most of which was spent in the remote call.

Tracing with Instana

If an error or slow performance occurs, Instana provides detailed context so that all the data that is required for troubleshooting a particular case is available. This data, including the callstack, is not collected for every call, because collecting it is an invasive task that can cause processing overhead.

Referring to the preceding example, Instana displays the transaction as shown:

Service A |   ShoppingCart.update - 412ms                       |
Service A        | RestClient.invokeConversion - 200ms |
Service B                    | NumberService - 5ms|

The displayed output is a better visual representation of call nesting and duration: it is reduced to the critical parts, showing where time is spent and where remote calls took place. It also connects to the Dynamic Graph, which knows that the CPU on the Service B server is overloaded and can correlate this with the transaction for root cause analysis. Other relevant information, such as service URLs or database queries, is also captured.

Trace continuity

Trace continuity means that all calls that are triggered by one external request are collected into one trace. Instana employs protocol-specific means to add metadata, such as HTTP headers, gRPC metadata, Kafka message headers, AMQP headers, JMS headers, and more. Adding this metadata ensures trace continuity across all protocols and services.

Communication protocols without support for any metadata do not support trace continuity, which means that when you call another service over such a protocol, the outgoing call is a leaf in the trace tree. The work that happens in the receiver of the call is not part of that trace. Instead, receiving the call starts a new trace and all subsequent calls that are triggered in the receiver belong to this new trace.

Trace continuity is not supported in the following cases:

  • Kafka up to version 0.10 (Kafka introduced headers in version 0.11)
  • Sending or receiving Kafka messages with the Node.js package kafka-node, because the package does not support headers. When you work with Kafka in Node.js, use the npm package kafkajs instead of kafka-node, because kafkajs supports trace continuity. For more information, see the additional remarks for continuing the trace for incoming messages.
  • NATS and NATS streaming messaging
  • Microsoft Message Queue

W3C Trace Context Support

The following Instana tracers support the W3C trace context specification for HTTP or HTTPS communication in addition to the proprietary headers like X-INSTANA-T or X-INSTANA-S:

The following Instana tracers currently do not support the W3C trace context specification. Only the proprietary headers like X-INSTANA-T or X-INSTANA-S are supported:

Tracing headers

To ensure the trace continuity across different services, Instana tracers use different headers or metadata properties, depending on the protocol.

HTTP tracing headers

Instana tracers support two sets of HTTP headers for trace correlation: Instana's vendor-specific headers (X-INSTANA-*) and the standard headers from the W3C trace context specification. Instana tracers add both sets of headers to downstream requests. If both sets of headers are present in an incoming request, the X-INSTANA-* headers take priority over the W3C headers. If only one set of headers is present, the trace is continued from that set. This behavior ensures interoperability with other W3C-compliant instrumentations (like OpenTelemetry) while also providing backwards compatibility with earlier versions of Instana tracers (without W3C support) that are still deployed in the field.

Instana-specific trace correlation headers:

  • X-INSTANA-T: This header is the trace ID of the trace that is in progress. Instana tracers support trace IDs with a length of 16 or 32 characters from the character range [0-9a-f]. When you start a new trace, the tracers generate a random trace ID with a length of 16 characters. For example, 7fa8b643c98711ef.
  • X-INSTANA-S: This header is the span ID of the HTTP exit span that represents the outgoing HTTP request on the client side. Instana tracers support span IDs that are 16 characters long from the character range [0-9a-f]. This ID becomes the parent span ID for the entry span on the receiving server side. For example, ff1938c2b29a8010.
  • X-INSTANA-L: This header is the trace level. The value 0 means that no spans are created (also known as trace suppression), and the value 1 means that spans are created. If this header is missing, the value 1 is assumed. When you send X-INSTANA-L=0, omit X-INSTANA-T and X-INSTANA-S.

W3C trace context headers:

  • traceparent: This header contains the trace ID, parent span ID, and additional flags. This header is roughly equivalent to a combination of X-INSTANA-T and X-INSTANA-S. For more information, see the W3C trace context specification.
  • tracestate: This header is an optional list of key-value pairs that are collected during the ongoing trace. For more information, see the W3C trace context specification. Instana tracers contribute a key-value pair with the key in to this list, in the format in=trace-id;span-id.
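
For illustration, reusing the example IDs from the previous list, a hypothetical pair of W3C headers emitted by an Instana tracer might look as follows. The 16-character trace ID is left-padded to the 32 characters that traceparent requires; whether the tracestate entry carries the padded or the unpadded trace ID is an assumption in this example.

traceparent: 00-00000000000000007fa8b643c98711ef-ff1938c2b29a8010-01
tracestate: in=7fa8b643c98711ef;ff1938c2b29a8010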

If you have any firewalls, proxies, or similar infrastructure in place that operates on HTTP headers, add all five headers to its allow list. This applies to all versions of HTTP; in particular, HTTP/1.1 and HTTP/2 do not differ with regard to the tracing headers.
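
The priority rule for incoming requests can be sketched as follows. This is a minimal illustration of the behavior described above, not actual Instana tracer code; the class and method names are made up, and header names are assumed to be normalized to the spellings shown.

import java.util.Map;
import java.util.Optional;

public final class TraceContextExtractor {

    /** Immutable holder for a continued trace context. */
    public record TraceContext(String traceId, String parentSpanId) {}

    /** Continues a trace from incoming headers: X-INSTANA-* wins, W3C traceparent is the fallback. */
    public static Optional<TraceContext> extract(Map<String, String> headers) {
        // X-INSTANA-L: 0 suppresses tracing entirely.
        if ("0".equals(headers.get("X-INSTANA-L"))) {
            return Optional.empty();
        }
        String traceId = headers.get("X-INSTANA-T");
        String spanId = headers.get("X-INSTANA-S");
        if (traceId != null && spanId != null) {
            return Optional.of(new TraceContext(traceId, spanId));
        }
        // Fallback: W3C traceparent, for example
        // "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01".
        String traceparent = headers.get("traceparent");
        if (traceparent != null) {
            String[] parts = traceparent.split("-");
            if (parts.length == 4) {
                return Optional.of(new TraceContext(parts[1], parts[2]));
            }
        }
        return Optional.empty(); // no usable context: start a new trace
    }
}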

Generic messaging headers

For many messaging protocols, the same headers are used as over HTTP, but with underscores (_) instead of hyphens (-). That is, the headers are X_INSTANA_T, X_INSTANA_S, and X_INSTANA_L. For more information about the semantics of the individual headers, see HTTP tracing headers. To find out which messaging protocols use this header format, see the following sections.

AMQP message headers

For Advanced Message Queuing Protocol (AMQP) messages, the same headers are used as over HTTP, that is, X-INSTANA-T, X-INSTANA-S, and X-INSTANA-L. W3C trace context headers are currently not supported for AMQP because no stable specification exists for trace context propagation over that protocol yet. For more information, see HTTP tracing headers.

AWS SNS message attributes

For Amazon Simple Notification Service (AWS SNS), the generic messaging headers are used as message attributes, that is, X_INSTANA_T, X_INSTANA_S, and X_INSTANA_L. W3C trace context headers are currently not supported for AWS SNS because no specification exists for trace context propagation over that protocol yet. For more information, see generic messaging headers.

AWS SQS

For Amazon Simple Queue Service (AWS SQS), the generic messaging headers are used, that is, X_INSTANA_T, X_INSTANA_S, and X_INSTANA_L. W3C trace context headers are currently not supported for AWS SQS because no specification exists for trace context propagation over that protocol yet. For more information, see generic messaging headers.

Google Cloud Pub/Sub

For Google Cloud Pub/Sub, the same headers are used as over HTTP, but all in lowercase, that is, x-instana-t, x-instana-s, and x-instana-l. W3C trace context headers are currently not supported for Google Cloud Pub/Sub because no specification exists for trace context propagation over that protocol yet. For more information, see HTTP tracing headers.

GraphQL

Trace correlation for GraphQL relies on the underlying transport protocol. For more information about GraphQL over HTTP, see HTTP tracing headers. For GraphQL queries and mutations that are transported over a different protocol, such as AMQP and Kafka, see the section for that particular protocol.

gRPC metadata

For gRPC, the same headers are used as over HTTP, that is, X-INSTANA-T, X-INSTANA-S, and X-INSTANA-L. W3C trace context headers are currently not supported for gRPC because no stable specification exists for trace context propagation over that protocol yet. For more information, see HTTP tracing headers.

IBM MQ

For IBM MQ, the Java tracer uses the generic messaging headers, that is, X_INSTANA_T, X_INSTANA_S, and X_INSTANA_L. For more information, see generic messaging headers. In addition to the generic messaging headers, the IBM MQ Tracing user exit and the IBM App Connect Enterprise (ACE) Tracing user exit support the W3C trace context headers.

Currently, no W3C trace context specification exists for messaging protocols; the W3C specifies trace context only for HTTP and HTTPS communications. When IBM MQ and ACE communicate over a messaging protocol, and the user exits inside IBM MQ or ACE need to propagate trace context headers, the traceparent and tracestate headers are propagated in the same format that is used in HTTP communications.

JMS tracing headers

For Java Message Service (JMS), the generic messaging headers are used, that is, X_INSTANA_T, X_INSTANA_S, and X_INSTANA_L. W3C trace context headers are currently not supported for JMS because no specification exists for trace context propagation over that protocol yet. For more information, see generic messaging headers.
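
As an illustration, a JMS producer could attach the generic messaging headers as string properties, as in the following sketch. JMS property names must be valid Java identifiers, which is why the hyphens of the HTTP header names become underscores here. The helper class is hypothetical.

import javax.jms.JMSException;
import javax.jms.Message;

public final class JmsTraceHeaders {

    /** Attaches the generic messaging headers to an outgoing JMS message. */
    public static void inject(Message message, String traceId, String spanId)
            throws JMSException {
        message.setStringProperty("X_INSTANA_T", traceId);
        message.setStringProperty("X_INSTANA_S", spanId);
        message.setStringProperty("X_INSTANA_L", "1"); // 1 = create spans, 0 = suppress
    }
}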

Kafka tracing headers

Kafka tracing headers are currently undergoing a migration. Historically, the header X_INSTANA_C was used with a binary representation of the trace ID and the parent span ID. Unfortunately, some incomplete or noncompliant Kafka drivers and applications cannot handle nonstring headers correctly. For this reason, Instana tracers are moving toward a set of headers with string content (X_INSTANA_T, X_INSTANA_S). All Instana tracers still support the legacy header X_INSTANA_C, but they also already support the new header format X_INSTANA_T and X_INSTANA_S. For more information about this migration, see migration.

Modern Kafka tracing headers X_INSTANA_T and X_INSTANA_S

The following string headers are used for Kafka trace correlation:

  • X_INSTANA_T: The trace ID as a string, always 32 characters long, left-padded with 0 as necessary. Example: "00000000000000007fa8b643c98711ef".
  • X_INSTANA_S: The parent span ID as a string, 16 characters long. Example: "ff1938c2b29a8010".
  • X_INSTANA_L_S: The trace level (optional, type string). The value 0 means that no spans are created (also known as trace suppression), and the value 1 means that spans are created. If the X_INSTANA_L_S header is missing, the value 1 is assumed. Omit X_INSTANA_T and X_INSTANA_S when you send X_INSTANA_L_S=0.
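
The following sketch shows how these string headers could be attached to a record with the standard Kafka Java client (org.apache.kafka:kafka-clients). The topic, payload, and class name are illustrative only.

import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.ProducerRecord;

public final class KafkaTraceHeaders {

    /** Builds a record that carries the modern string correlation headers. */
    public static ProducerRecord<String, String> withTraceHeaders(
            String traceId, String spanId) {
        ProducerRecord<String, String> record =
                new ProducerRecord<>("orders", "cart-update");
        // Left-pad the trace ID to the required 32 characters.
        String traceId32 = "0".repeat(32 - traceId.length()) + traceId;
        record.headers()
                .add("X_INSTANA_T", traceId32.getBytes(StandardCharsets.UTF_8))
                .add("X_INSTANA_S", spanId.getBytes(StandardCharsets.UTF_8))
                .add("X_INSTANA_L_S", "1".getBytes(StandardCharsets.UTF_8));
        return record;
    }
}
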
Legacy Kafka tracing header X_INSTANA_C

The following binary headers are used for Kafka trace correlation before the header format migration:

  • X_INSTANA_C
  • X_INSTANA_L

The X_INSTANA_C (trace context) header combines the trace ID and the span ID. Its value is a 24-byte binary value: the first 16 bytes are the trace ID, and the last 8 bytes are the span ID. When 64-bit trace IDs are used, the first 8 bytes are 0. When a process receives a Kafka message with the X_INSTANA_C header and needs to transform this header into a string representation of the trace ID and parent span ID, the following rules must be applied:

  • If the first 8 bytes of the X_INSTANA_C header are all 0, bytes 9-16 are converted into a string of 16 characters from the alphabet [0-9a-f].
  • If bytes 1-8 of the X_INSTANA_C header contain at least one nonzero byte, bytes 1-16 are converted into a string of 32 characters from the same alphabet.
  • In either case, bytes 17-24 are converted into a string of 16 characters from the alphabet [0-9a-f].

The following examples show conversions between the trace ID and span ID strings and the binary X_INSTANA_C header. All that is necessary for the conversion is to translate the characters of the string directly into octets, and vice versa:

With 64-bit trace ID:

Trace ID             Span ID              X_INSTANA_C
"8000000000000000"   "ffffffffffffffff"   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
"0000000000000001"   "0000000000000002"   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x02
"7fffffffffffffff"   "0f0f0f0f0f0f0f0f"   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x7f, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f

With 128-bit trace ID:

Instana does not use 128-bit trace IDs, and the mentioned migration from the binary X_INSTANA_C header to string headers happens before any migration to 128-bit trace IDs. So, this table has merely theoretical value and is not applicable in practice.

Trace ID                             Span ID              X_INSTANA_C
"f0f0f0f0f0f0f0f08000000000000000"   "ffffffffffffffff"   0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
"00000000000000010000000000000002"   "0000000000000003"   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x03
"f0f0f0f0f0f0f0f07fffffffffffffff"   "0f0f0f0f0f0f0f0f"   0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0x7f, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f

The X_INSTANA_L header (type integer) denotes the trace level. The value 0 means that no spans are created (also known as trace suppression), and the value 1 means that spans are created. Do not send the X_INSTANA_C header when you send X_INSTANA_L=0.
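
As a minimal illustration of the conversion rules and example tables above, the following sketch encodes and decodes the binary X_INSTANA_C value (Java 17, using java.util.HexFormat). The class and method names are made up.

import java.util.HexFormat;

public final class LegacyKafkaHeader {

    /** Builds the 24-byte X_INSTANA_C value from hex trace ID and span ID strings. */
    public static byte[] encode(String traceId, String spanId) {
        HexFormat hex = HexFormat.of();
        byte[] value = new byte[24];
        byte[] trace = hex.parseHex(traceId); // 8 bytes (64-bit ID) or 16 bytes (128-bit ID)
        // Left-pad 64-bit trace IDs with zero bytes into the 16-byte slot.
        System.arraycopy(trace, 0, value, 16 - trace.length, trace.length);
        System.arraycopy(hex.parseHex(spanId), 0, value, 16, 8);
        return value;
    }

    /** Extracts the trace ID string; 16 characters when the first 8 bytes are all zero. */
    public static String decodeTraceId(byte[] value) {
        HexFormat hex = HexFormat.of();
        for (int i = 0; i < 8; i++) {
            if (value[i] != 0) {
                return hex.formatHex(value, 0, 16); // 128-bit ID: 32 characters
            }
        }
        return hex.formatHex(value, 8, 16); // 64-bit ID: bytes 9-16 only
    }

    /** Extracts the 16-character span ID from bytes 17-24. */
    public static String decodeSpanId(byte[] value) {
        return HexFormat.of().formatHex(value, 16, 24);
    }
}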