AutoMetrics

Created:2024-06-01 Last Modified:2024-06-24

This document was translated by ChatGPT

#1. Introduction

In the past, we typically inserted statistical code actively through SDKs, bytecode enhancement, or manual instrumentation. This imposed a heavy burden on application developers, who needed to adapt to various programming languages and RPC frameworks. Rapid business iterations often led developers to discover the lack of necessary performance metrics only when dealing with failures. Additionally, operations personnel for foundational services like Nginx and MySQL were more passive, relying on the hope that these services had already exposed key performance metrics. However, built-in metrics often only provided instance-level granularity, unable to distinguish user calls.

DeepFlow's AutoMetrics capability can automatically obtain golden performance metrics for each microservice's API calls across the full-stack path, including application functions, system functions, and network communications. These capabilities are extended to a broader range of Linux kernel versions and Windows operating systems through BPF and AF_PACKET/winpcap.

Currently, DeepFlow supports the parsing of mainstream application protocols via eBPF, including HTTP 1/2, HTTPS (Golang/openssl), Dubbo, gRPC, SOFARPC, FastCGI, MySQL, PostgreSQL, Redis, MongoDB, Kafka, MQTT, and DNS, with plans to support more application protocols in the future. Based on DeepFlow's AutoMetrics capability, it is possible to non-intrusively obtain RED (Request, Error, Delay) metrics for applications, as well as throughput, latency, connection anomalies, retransmissions, zero windows, and other metrics for the network protocol stack. The DeepFlow Agent maintains the session state for each TCP connection and each application protocol request, referred to as Flow. All raw performance metric data is detailed to the Flow level and automatically aggregated to generate 1s and 1min metric data. Based on these metric data, we can present full-stack performance data for any service, workload, or API and draw a universal map of interactions between any services.

#2. Metric Types

To more intuitively experience the capabilities of AutoMetrics, if there is no business traffic in your DeepFlow environment, it is recommended to first refer to the OpenTelemetry WebStore Demo Experience - Deploy Demo section to deploy an official OpenTelemetry microservice demo application. This demo consists of more than ten microservices implemented in languages such as Go, C#, Node.js, Python, and Java. It is particularly noted that all metric data presented in this section has no dependency on OpenTelemetry.

#3. Observation Point Explanation

DeepFlow automatically collects metric data from various locations through cBPF/eBPF. To distinguish these data observation points, we use the observation_point label to annotate the data.

For data collected by cBPF, the meanings of observation_point values are as follows:

Data Source Type observation_point Value Observation Point Meaning
cBPF c Client Network Card
cBPF c-nd Client Container Node
cBPF c-hv Client Host Machine
cBPF c-gw-hv Client to Gateway Host
cBPF c-gw Client to Gateway
cBPF local Local Network Card
cBPF rest Other Network Cards
cBPF s-gw Gateway to Server
cBPF s-gw-hv Gateway Host to Server
cBPF s-hv Server Host Machine
cBPF s-nd Server Container Node
cBPF s Server Network Card

For data collected by eBPF, the meanings of observation_point values are as follows:

Data Source Type observation_point Value Observation Point Meaning
eBPF c-p Client Process
eBPF s-p Server Process

Additionally, after integrating OpenTelemetry data, we also inject the observation_point label, with the following meanings:

Data Source Type observation_point Value Observation Point Meaning
OTel c-app Client Application, corresponding to span.spankind = SPAN_KIND_CLIENT, SPAN_KIND_PRODUCER
OTel s-app Server Application, corresponding to span.spankind = SPAN_KIND_SERVER, SPAN_KIND_CONSUMER
OTel app Application, corresponding to other spankind values