Distributed Tracing

Created: 2023-10-04 Last Modified: 2024-06-24

This document was translated by ChatGPT

#1. Distributed Tracing

DeepFlow presents application Spans, system Spans, and network Spans involved in a single call on a flame graph through distributed tracing, enabling collaboration across multiple departments such as business development teams, framework development teams, service mesh operations teams, container operations teams, DBA teams, and cloud operations teams on a single platform.

#1.1 Overview

Initiating a trace on a call in the distributed tracing page opens a right slide-out panel that displays the traced call chain, as shown below.

Note: The flame graph and topology graph of distributed tracing do not currently support adding to the Dashboard.

00-Overview


The call tracing right slide-out panel is divided into three parts: header information, data visualization, and call information data list.

  • ① Header Information: Displays basic information about the call chain, such as client, server, request start time, duration, request type, and request resource.
  • ② Data Visualization: Displays the Span data of the traced call as a flame graph, or displays the services involved in the traced call as a topology graph.
  • ③ Call Information Data List: Displays information associated with the call.

#1.1.1 Flame Graph

01-Flame Graph


The flame graph consists of multiple bar segments, each representing a Span. The x-axis represents time, and the y-axis represents the depth of the call stack, displayed from top to bottom in the order of Span calls. Below is a detailed introduction:

  • Length: Combined with the x-axis, it represents the execution time of a Span, with both ends corresponding to the start and end times.
  • Service List: Displays the proportion of time delay consumed by each service. Clicking on a service can link with the flame graph to highlight the corresponding Span of the service.
    • Color: Application Spans and system Spans represent each service with a different color; all network Spans are gray (as network Spans do not belong to any service).
  • Display Information: The display information of the bar segment consists of icon + call information + execution time.
    • Icon: Different types of Spans are distinguished by icons.
      • A: Application Span, collected through the OpenTelemetry protocol, covering business code and framework code.
      • S: System Span, collected through eBPF with zero intrusion, covering system calls, application functions (such as HTTPS), API Gateway, and service mesh Sidecar.
      • N: Network Span, collected from network traffic through BPF, covering container network components such as iptables, ipvs, OvS, and LinuxBridge.
    • Call Information: The call information displayed by different Spans varies slightly.
      • Application Span and System Span: Application Protocol, Request Type, Request Resource.
      • Network Span: Observation Point.
    • Execution Time: The total time consumed from the start to the end of the Span.
  • Operation: Supports hover and click.
    • Hover: Hover over a Span to display call information + instance information + execution time in the form of a TIP.
      • Instance Information: Application Span displays service + resource instance; System Span displays process + resource instance; Network Span displays network card + resource instance.
      • Execution Time: Displays the total execution time of the Span and the proportion of it that is the Span's own (self) execution time.
    • Click: Click on a Span to highlight itself and its parent Span, and view detailed information of the clicked Span.
  • Collapse Sidebar: Click to collapse the service list.
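The layout rules above can be illustrated with a minimal sketch. It assumes each Span carries a parent reference and microsecond timestamps; the `Span` class and all of its field names are hypothetical and are not DeepFlow's actual data model. The row of a bar is the depth of the Span in the call stack, the bar length is the Span's execution time, and network Spans are drawn in gray because they belong to no service.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    # All field names are illustrative assumptions, not DeepFlow's schema.
    span_id: str
    parent_id: Optional[str]  # None for the root Span
    kind: str                 # "A" (application), "S" (system) or "N" (network)
    service: Optional[str]    # None for network Spans
    start_us: int
    end_us: int

    @property
    def duration_us(self) -> int:
        """Bar length on the x-axis: end time minus start time."""
        return self.end_us - self.start_us


def row_of_each_span(spans: List[Span]) -> Dict[str, int]:
    """Return the y-axis row of each Span: the root is 0, a child is parent + 1.

    Assumes the parent links form a tree (no cycles).
    """
    by_id = {s.span_id: s for s in spans}
    depth: Dict[str, int] = {}

    def depth_of(span: Span) -> int:
        if span.span_id not in depth:
            parent = by_id.get(span.parent_id) if span.parent_id else None
            depth[span.span_id] = 0 if parent is None else depth_of(parent) + 1
        return depth[span.span_id]

    for s in spans:
        depth_of(s)
    return depth


def bar_color(span: Span) -> str:
    """Network Spans are gray; other Spans take the color of their service."""
    return "gray" if span.kind == "N" else f"color-of:{span.service}"
```

For a client system Span whose child is a network Span, whose child in turn is a server system Span, `row_of_each_span` places the three bars on rows 0, 1, and 2.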

#1.1.2 Call Topology Graph

02-Call Topology Graph


The call topology graph displays data in an orderly and structured manner, with data aggregated by service as nodes. The parent-child relationships between Spans are displayed using horizontal and vertical lines, showing their request call relationships. Below is a detailed introduction:

  • Node: Corresponds to the service in the service list of the flame graph, aggregating one or more Spans under the same service into a node and displaying the time consumed by the service in the call chain.
    • Display Information: The square node display information consists of icon + call information + self time.
      • Icon: Different types of Spans are distinguished by icons. For details, please refer to the [Flame Graph] section.
    • Self Time: The total time consumed by one or more Spans corresponding to the service.
  • Path: Draws the topological path corresponding to the parent Span to child Span relationship in the flame graph.
  • Operation: Supports hover and click. For details, please refer to the [Flame Graph] section.
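As a rough illustration of how Spans could be aggregated into service nodes and paths, the sketch below groups Spans by service and derives an edge for every parent-child link that crosses a service boundary. The dictionary keys are hypothetical, and the self-time formula (duration minus the durations of direct children, assuming the children do not overlap) is an assumption made for the example, not DeepFlow's documented calculation. Network Spans belong to no service, so the sketch walks up through them to find the nearest ancestor that does.

```python
from collections import defaultdict
from typing import Any, Dict, List, Set, Tuple

# A Span is a plain dict with hypothetical keys:
# "id", "parent", "service" (None for network Spans), "start", "end".
Span = Dict[str, Any]


def build_topology(spans: List[Span]) -> Tuple[Dict[str, int], Set[Tuple[str, str]]]:
    """Return (self time per service node, set of caller -> callee edges)."""
    by_id = {s["id"]: s for s in spans}

    # Total duration of the direct children of each Span.
    child_total: Dict[Any, int] = defaultdict(int)
    for s in spans:
        if s["parent"] is not None:
            child_total[s["parent"]] += s["end"] - s["start"]

    node_time: Dict[str, int] = defaultdict(int)
    edges: Set[Tuple[str, str]] = set()
    for s in spans:
        service = s["service"]
        if service is None:  # network Span: not part of any node
            continue
        # Assumed self time: own duration minus direct children, never negative.
        node_time[service] += max(0, (s["end"] - s["start"]) - child_total[s["id"]])

        # Skip network Spans to find the nearest ancestor that has a service.
        parent = by_id.get(s["parent"])
        while parent is not None and parent["service"] is None:
            parent = by_id.get(parent["parent"])
        if parent is not None and parent["service"] != service:
            edges.add((parent["service"], service))
    return dict(node_time), edges
```

With a client Span of 100 time units that wraps a network Span of 80 units, which wraps a frontend Span of 60 units, this yields 20 units of self time for the client node, 60 for the frontend node, and a single client-to-frontend path.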

#1.1.3 Bottom Tab

#1.1.3.1 Call Details

Displays detailed information of Spans in the flame graph in the form of a list. Clicking on a Span in the flame graph will highlight the corresponding call details in the list; conversely, clicking on a row in the list will highlight the corresponding Span.

Call Details


#1.1.3.2 IO Events

When a system Span is clicked in the flame graph and the process corresponding to that Span has IO read/write events, those IO events can be viewed. The IO events tab gives a quick view of how much time the Span spent on file reads and writes.

IO Events


  • ① First Row: Overlays all IO event blocks of the threads below, with darker colors indicating more overlap.
  • ② Thread Row: Displays the IO events of each thread, with each block corresponding to an event. The length of the block is calculated based on the start and end times of the IO event.
    • Tip: Consists of file name + IO event type + event duration.
  • ③ Detailed Information: Displays details of the IO event.
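The overlay in the first row can be thought of as counting how many IO events are active at each moment. The following sketch uses a standard sweep over event boundaries to compute that count; the function name and the `(start, end)` tuple representation are assumptions made for illustration, and the mapping from count to color shade is left out.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) of one IO event, end exclusive


def overlap_profile(events: List[Interval]) -> List[Tuple[int, int, int]]:
    """Return (start, end, count) segments, where count is the number of
    IO events active during that segment. Idle gaps are omitted."""
    points: List[Tuple[int, int]] = []
    for start, end in events:
        points.append((start, 1))   # an event begins
        points.append((end, -1))    # an event ends
    # At equal timestamps, -1 sorts before +1, so events that merely
    # touch end-to-start are not counted as overlapping.
    points.sort()

    segments: List[Tuple[int, int, int]] = []
    active = 0
    prev = None
    for t, delta in points:
        if prev is not None and t > prev and active > 0:
            segments.append((prev, t, active))
        active += delta
        prev = t
    return segments
```

A segment with a higher count corresponds to a darker block in the first row. For example, events `(0, 10)` and `(5, 15)` on two threads produce a count of 2 between 5 and 10 and a count of 1 on either side.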

#1.1.3.3 Flow Logs

When a network Span is clicked in the flame graph, this tab analyzes the latency data of the flow logs corresponding to the time period of the call log.

Flow Logs


  • ① Status Row: Shows the observation point, flow duration, and flow log status.
  • ② Latency: Analyzes network-related latency, including TCP connection latency, TLS connection latency, average data latency, average system latency, and average client wait latency. For how each latency is calculated, refer to the metric diagram.

#1.1.3.4 Span Tracing

When analyzing why a Span exists in the flame graph, the Span tracing feature can be used. Clicking on a Span in the flame graph displays the relationship with other Spans in the form of a list. DeepFlow's distributed tracing is calculated based on a series of IDs, including TraceID, SpanID, ParentSpanID, request X-Request-ID, response X-Request-ID, request Syscall TraceID, response Syscall TraceID, request TCP Seq number, and response TCP Seq number. When there is an association between IDs, the Spans can be displayed in a single flame graph, with the association of IDs marked in purple in the list.

Span Tracing


  • ① Clicked Span: The Span clicked in the flame graph.
  • ② Associated Span: The Span associated with the clicked Span.
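To make the idea of ID-based association concrete, here is a simplified sketch that reports which IDs two Spans have in common. It is not DeepFlow's actual matching algorithm: the field names are hypothetical, and the rule used here (two Spans are associated when they share a value in the same ID field, or when one Span's SpanID equals the other's ParentSpanID) is an assumption chosen to keep the example small.

```python
from typing import Any, Dict, List, Tuple

Span = Dict[str, Any]

# Hypothetical field names for the IDs listed above.
SAME_FIELD_KEYS = [
    "trace_id",
    "x_request_id_req", "x_request_id_resp",
    "syscall_trace_id_req", "syscall_trace_id_resp",
    "tcp_seq_req", "tcp_seq_resp",
]


def matching_ids(a: Span, b: Span) -> List[str]:
    """Return the names of the IDs that link Span a and Span b."""
    matches = [
        k for k in SAME_FIELD_KEYS
        if a.get(k) is not None and a.get(k) == b.get(k)
    ]
    # Parent-child links compare SpanID against ParentSpanID.
    if a.get("span_id") is not None and a.get("span_id") == b.get("parent_span_id"):
        matches.append("span_id=parent_span_id")
    if a.get("parent_span_id") is not None and a.get("parent_span_id") == b.get("span_id"):
        matches.append("parent_span_id=span_id")
    return matches


def associated_spans(clicked: Span, spans: List[Span]) -> List[Tuple[Span, List[str]]]:
    """List every other Span linked to the clicked one, with the linking IDs."""
    result = []
    for s in spans:
        if s is clicked:
            continue
        ids = matching_ids(clicked, s)
        if ids:
            result.append((s, ids))
    return result
```

In this sketch, the list of ID names returned for each associated Span plays the role of the purple highlighting in the Span tracing list: it shows why two Spans ended up in the same flame graph.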

#1.1.4 Quick Understanding of Flame Graph

Flame Graph Example


The flame graph represents the passage of time from left to right. In the sample call chain above, the complete processing of a business request goes through the following process:

  • (1) The "Client" process initiates an HTTP GET request, which is transmitted through multiple network cards to the "Frontend Service".
  • (2) To complete this business process, the "Frontend Service" first initiates a DNS query, which is transmitted through the network to the "DNS Service".
  • (3) The "DNS Service" processes the query and returns a DNS response, which is transmitted through the network to the "Frontend Service".
  • (4) The "Frontend Service" continues to initiate an SQL query, which is transmitted through the network to the "MySQL Service".
  • (5) The "MySQL Service" processes the query and returns an SQL response, which is transmitted through the network to the "Frontend Service".
  • (6) The "Frontend Service" continues to initiate an RPC request, which is transmitted through the network to the "RPC Service".
  • (7) The "RPC Service" processes the request and returns an RPC response, which is transmitted through the network to the "Frontend Service".
  • (8) The "Frontend Service" receives the RPC response and replies with the final HTTP response, which is transmitted through the network to the "Client".

The difference in length between any two Spans of the same call represents the delay introduced between the two positions where those Spans were collected.
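This rule amounts to simple subtraction, sketched below. The function names and the observation-point labels are made up for the example. The input is the chain of Spans for one call, ordered from the outermost observation point (nearest the client) to the innermost (nearest the server), each with its duration.

```python
from typing import List, Tuple

# (observation point, Span duration in milliseconds), outermost first.
Chain = List[Tuple[str, int]]


def hop_delays(chain: Chain) -> List[Tuple[str, str, int]]:
    """Delay introduced between each pair of neighboring observation points:
    the duration of the outer Span minus the duration of the inner Span."""
    return [
        (outer, inner, outer_ms - inner_ms)
        for (outer, outer_ms), (inner, inner_ms) in zip(chain, chain[1:])
    ]


def slowest_hop(chain: Chain) -> Tuple[str, str, int]:
    """The pair of neighboring positions that adds the most delay.

    Requires a chain with at least two Spans."""
    return max(hop_delays(chain), key=lambda hop: hop[2])


chain = [
    ("client process", 120),
    ("client node NIC", 118),
    ("server node NIC", 60),
    ("server process", 58),
]
print(slowest_hop(chain))
```

In this made-up chain, the two process-to-NIC hops each add 2 ms, while the hop between the two node NICs adds 58 ms, so the forwarding path between the nodes is where the time goes.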

#1.1.5 Flame Graph Analysis Examples

  • Example 1: Significant Difference Between Network Spans

In the figure below, the significant difference in length between two network Spans indicates a noticeable delay in transmitting the call's data packets between the two network cards. If the two network cards belong to the client container node and the server container node, the root cause of the slow response is the forwarding network between the container nodes.

Slow Call Flame Graph Example 1 - Significant Difference Between Network Spans


  • Example 2: Significant Difference Between System Spans

In the figure below, the significant difference in length between two system Spans of the "Frontend Service" indicates that the root cause of the slow response lies in the processing inside the "Frontend Service".

Slow Call Flame Graph Example 2 - Significant Difference Between System Spans


  • Example 3: Significant Length of Terminal System Span

In the figure below, the significant length of the terminal system Span of the "DNS Service" indicates that the root cause of the slow response lies in the processing inside the "DNS Service".

Slow Call Flame Graph Example 3 - Significant Length of Terminal System Span
