Metrics and Operators Calculation Logic
This document was translated by ChatGPT
This article will introduce different types of metrics and the calculation logic of various operators.
#1. Metrics
Metrics are divided into two main categories: Application Performance Metrics and Network Performance Metrics.
#1.1 Application Performance Metrics
Application metrics are used to measure the performance of services during actual operation, focusing mainly on service throughput, response delay, and anomalies. By collecting these metrics, operations personnel and developers can better understand the performance of applications in real-world usage, identify potential performance issues, and take appropriate measures for optimization and improvement.
Each metric described below records one value per statistical cycle. The cycle length can be customized by the user; by default the system supports 1m (one minute) and 1s (one second), which the DeepFlow platform collectively refers to as raw data sources. If multiple values are calculated within a statistical cycle, they are aggregated into a single value. The aggregation logic is described in the subsequent Types section.
#1.1.1 Throughput
Field | DisplayName | Unit | Type | Description |
---|---|---|---|---|
request | Request | | counter | |
response | Response | | counter | |
#1.1.2 Delay
Field | DisplayName | Unit | Type | Description |
---|---|---|---|---|
rrt | Avg Delay | us | delay | |
rrt_max | Max Delay | us | delay | |
#1.1.3 Error
Field | DisplayName | Unit | Type | Description |
---|---|---|---|---|
error | Error | | counter | |
client_error | Client Error | | counter | |
server_error | Server Error | | counter | |
timeout | Timeout | | counter | |
error_ratio | Error % | % | percentage | |
client_error_ratio | Client Error % | % | percentage | |
server_error_ratio | Server Error % | % | percentage | |
#1.2 Network Performance Metrics
Network metrics are quantitative indicators used to evaluate network performance, covering the network layer, transport layer, and application layer. These metrics include throughput, delay, performance, and anomaly types.
#1.2.1 L3 Throughput
Field | DisplayName | Unit | Type | Description |
---|---|---|---|---|
byte | Byte | Byte | counter | |
byte_tx | Byte TX | Byte | counter | |
byte_rx | Byte RX | Byte | counter | |
packet | Packet | Packet | counter | |
packet_tx | Packet TX | Packet | counter | |
packet_rx | Packet RX | Packet | counter | |
l3_byte | L3 Payload | Byte | counter | |
l3_byte_tx | L3 Payload TX | Byte | counter | |
l3_byte_rx | L3 Payload RX | Byte | counter | |
bpp | Bytes per Packet | Byte | quotient | |
bpp_tx | Bytes per Packet TX | Byte | quotient | |
bpp_rx | Bytes per Packet RX | Byte | quotient | |
#1.2.2 L4 Throughput
Field | DisplayName | Unit | Type | Description |
---|---|---|---|---|
new_flow | New Flow | Flow | counter | |
closed_flow | Closed Flow | Flow | counter | |
flow_load | Active Flow | Flow | gauge | |
syn_count | SYN Packet | Packet | counter | |
synack_count | SYN-ACK Packet | Packet | counter | |
l4_byte | L4 Payload | Byte | counter | |
l4_byte_tx | L4 Payload TX | Byte | counter | |
l4_byte_rx | L4 Payload RX | Byte | counter | |
Active connection calculation logic:
- The collector counts raw active connections per quadruple (client IP, server IP, protocol, server port) and then derives the active connections for the corresponding resources and paths.
- A connection is counted as active if traffic is collected within the time interval of the data source, with the following special cases:
  - 1s data source: describes the active connections counted per second.
    - The first second of each minute: includes connections that have no traffic within that second but have not ended. It is generally used to evaluate concurrent connections (multiple non-overlapping connections lasting less than one second may introduce some error).
    - The last 59 seconds of each minute: if the flows sharing a quadruple have no traffic within a given second, the connections for that quadruple are ignored for that second. It is generally used to evaluate the lower bound of concurrent connections.
  - 1m data source: describes the active connections counted per minute.
    - Includes connections that have no traffic but have not ended. It is generally used to evaluate the upper bound of concurrent connections.
  - Custom data source: calculated from the 1s/1m data sources using Avg/Max/Min, with the same meaning as querying the 1s/1m data sources directly with the Avg/Max/Min operators.
#1.2.3 TCP Slow
Field | DisplayName | Unit | Type | Description |
---|---|---|---|---|
retrans_syn | SYN Retransmission | Packet | counter | |
retrans_synack | SYN-ACK Retransmission | Packet | counter | |
retrans | TCP Retransmission | Packet | counter | |
retrans_tx | TCP Client Retransmission | Packet | counter | |
retrans_rx | TCP Server Retransmission | Packet | counter | |
zero_win | TCP ZeroWindow | Packet | counter | |
zero_win_tx | TCP Client ZeroWindow | Packet | counter | |
zero_win_rx | TCP Server ZeroWindow | Packet | counter | |
retrans_syn_ratio | SYN Retrans. % | % | percentage | |
retrans_synack_ratio | SYN-ACK Retrans. % | % | percentage | |
retrans_ratio | TCP Retrans. % | % | percentage | |
retrans_tx_ratio | TCP Client Retrans. % | % | percentage | |
retrans_rx_ratio | TCP Server Retrans. % | % | percentage | |
zero_win_ratio | TCP ZeroWindow % | % | percentage | |
zero_win_tx_ratio | TCP Client ZeroWindow % | % | percentage | |
zero_win_rx_ratio | TCP Server ZeroWindow % | % | percentage | |
#1.2.4 TCP Error
Field | DisplayName | Unit | Type | Description |
---|---|---|---|---|
tcp_establish_fail | Error | Flow | counter | |
client_establish_fail | Client Error | Flow | counter | |
server_establish_fail | Server Error | Flow | counter | |
tcp_establish_fail_ratio | Error % | % | percentage | |
client_establish_fail_ratio | Client Error % | % | percentage | |
server_establish_fail_ratio | Server Error % | % | percentage | |
tcp_transfer_fail | Transfer Error | Flow | counter | All transfer errors. |
tcp_transfer_fail_ratio | Transfer Error % | % | percentage | |
tcp_rst_fail | RST | Flow | counter | All RST errors. |
tcp_rst_fail_ratio | RST % | % | percentage | |
client_source_port_reuse | Est. - Client Port Reuse | Flow | counter | |
server_syn_miss | Est. - Server SYN Miss | Flow | counter | |
client_establish_other_rst | Est. - Client Other RST | Flow | counter | |
client_ack_miss | Est. - Client ACK Miss | Flow | counter | |
server_reset | Est. - Server Direct RST | Flow | counter | |
server_establish_other_rst | Est. - Server Other RST | Flow | counter | |
client_rst_flow | Transfer - Client RST | Flow | counter | |
server_rst_flow | Transfer - Server RST | Flow | counter | |
server_queue_lack | Transfer - Server Queue Overflow | Flow | counter | |
tcp_timeout | Transfer - TCP Timeout | Flow | counter | |
client_half_close_flow | Close - Client Half Close | Flow | counter | |
server_half_close_flow | Close - Server Half Close | Flow | counter | |
#1.2.4.1 TCP Connection Errors
Figure: TCP connection establishment errors
#1.2.4.2 TCP Transmission Errors
Figure: TCP transmission errors
#1.2.5 Delay
Field | DisplayName | Unit | Type | Description |
---|---|---|---|---|
rtt | Avg TCP Est. Delay | us | delay | |
rtt_client | Avg TCP Est. Client Delay | us | delay | |
rtt_server | Avg TCP Est. Server Delay | us | delay | |
srt | Avg TCP/ICMP Response Delay | us | delay | |
art | Avg Data Delay | us | delay | |
cit | Avg Client Idle Delay | us | delay | |
rtt_max | Max TCP Est. Delay | us | delay | |
rtt_client_max | Max TCP Est. Client Delay | us | delay | |
rtt_server_max | Max TCP Est. Server Delay | us | delay | |
srt_max | Max TCP/ICMP Response Delay | us | delay | |
art_max | Max Data Delay | us | delay | |
cit_max | Max Client Idle Delay | us | delay | |
Figure: Anatomy of TCP network delay
- Delay generated during connection establishment
  - [1] The complete connection establishment delay covers the entire time from the client sending the SYN packet, through receiving the SYN+ACK packet from the server, to the client replying with an ACK packet. It can be further divided into client connection establishment delay and server connection establishment delay.
  - [2] Client connection establishment delay is the time taken for the client to reply with an ACK packet after receiving the SYN+ACK packet.
  - [3] Server connection establishment delay is the time taken for the server to reply with a SYN+ACK packet after receiving the SYN packet.
- Delay generated during data communication can be divided into client waiting delay + data transmission delay.
  - [4] Client waiting delay is the time taken for the client to send the first request after the connection is successfully established; it is also the time taken for the client to send a data packet after receiving a data packet from the server.
  - [5] Data transmission delay is the time between the client sending a data packet and receiving a reply data packet from the server.
  - [6] Within the data transmission delay there is also a delay generated by the system protocol stack, called system delay, which is the time between a data packet and the ACK packet that acknowledges it.
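The relationships among delays [1] to [6] can be sketched with plain timestamp arithmetic. This is a minimal illustration, assuming all timestamps are taken at a single capture point and expressed in microseconds; the `Handshake` class, the function names, and the dictionary keys are hypothetical and are not DeepFlow identifiers.

```python
from dataclasses import dataclass


@dataclass
class Handshake:
    """Packet timestamps in microseconds at one capture point (hypothetical layout)."""
    syn: int      # client sends SYN
    syn_ack: int  # server replies with SYN+ACK
    ack: int      # client replies with ACK


def establishment_delays(h: Handshake) -> dict:
    server = h.syn_ack - h.syn   # [3] server connection establishment delay
    client = h.ack - h.syn_ack   # [2] client connection establishment delay
    # [1] the complete delay is the sum of the two parts
    return {"total": server + client, "client": client, "server": server}


def data_delays(prev_packet: int, request: int, request_ack: int, response: int) -> dict:
    return {
        "client_waiting": request - prev_packet,   # [4] idle time before the client sends
        "data_transmission": response - request,   # [5] request to reply data packet
        "system": request_ack - request,           # [6] request to the ACK acknowledging it
    }


hs = Handshake(syn=0, syn_ack=300, ack=350)
print(establishment_delays(hs))
print(data_delays(prev_packet=hs.ack, request=1350, request_ack=1400, response=3350))
```

Here the first request follows the handshake ACK, so `prev_packet` is the ACK timestamp; for later requests it would be the timestamp of the previous data packet from the server.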
#1.2.6 Application
Field | DisplayName | Unit | Type | Description |
---|---|---|---|---|
l7_request | Request | | counter | |
l7_response | Response | | counter | |
rrt | Avg App. Delay | us | delay | |
rrt_max | Max App. Delay | us | delay | |
l7_error | App. Error | | counter | |
l7_client_error | App. Client Error | | counter | |
l7_server_error | App. Server Error | | counter | |
l7_timeout | App. Server Timeout | | counter | |
l7_error_ratio | App. Error % | % | percentage | |
l7_client_error_ratio | App. Client Error % | % | percentage | |
l7_server_error_ratio | App. Server Error % | % | percentage | |
#1.2.7 Cardinality
Cardinality metrics count the number of unique values of a tag collected during the statistical cycle. For example, querying the client IP address (ip_0) metric for all accesses to pod_1 means counting the number of unique client IP addresses in all traffic accessing pod_1.
Field | DisplayName | Unit | Type | Description |
---|---|---|---|---|
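The pod_1 example above amounts to an exact distinct count over the matching traffic. The sketch below assumes each flow record is a dictionary with hypothetical keys `client_ip` and `server_pod`; it shows exact counting only and does not attempt to reproduce the estimated variant.

```python
def count_unique_clients(flows, server_pod):
    """Exact number of distinct client IPs among flows that access server_pod."""
    return len({f["client_ip"] for f in flows if f["server_pod"] == server_pod})


# Hypothetical flow records for one statistical cycle
flows = [
    {"client_ip": "10.0.0.1", "server_pod": "pod_1"},
    {"client_ip": "10.0.0.2", "server_pod": "pod_1"},
    {"client_ip": "10.0.0.1", "server_pod": "pod_1"},  # repeated client, counted once
    {"client_ip": "10.0.0.3", "server_pod": "pod_2"},  # different pod, not counted
]
print(count_unique_clients(flows, "pod_1"))  # 2
```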
#2. Operators
Operators calculate data from raw data sources based on the selected time range and interval. For example, using a line chart to view 1s raw data sources for the last 5 minutes with a 20s interval, a point on the line chart (14:43:00) would read all data within the time range of 14:42:40 - 14:43:00 and then calculate the average value.
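The line chart example can be sketched as bucketing 1s points into 20s intervals and averaging each bucket. This is an illustration only: it assumes each bucket is the half-open range [end - interval, end) labelled by its end time, which the text does not specify, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def avg_by_interval(points, interval):
    """points: iterable of (second, value); returns {bucket_end_second: average}."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Assumption: a point at second ts belongs to the bucket ending at the
        # next multiple of `interval` strictly after ts.
        bucket_end = (ts // interval + 1) * interval
        buckets[bucket_end].append(value)
    return {end: mean(values) for end, values in sorted(buckets.items())}


# Seconds since midnight: 14:42:40 is 52960 and 14:43:00 is 52980
start = 14 * 3600 + 42 * 60 + 40
points = [(start + i, i) for i in range(20)]  # one value per second
print(avg_by_interval(points, 20))  # {52980: 9.5}
```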
Operators can be nested, but aggregate operators cannot be stacked on one another. For example, PerSecond(Avg(byte)) means that Avg(byte) is calculated first, and PerSecond is then applied to the result.
#2.1 Aggregate Operators
Operator | English Name | Applicable Metric Types | Description |
---|---|---|---|
Avg | Average | All types | Average value (does not ignore zero values for Counter/Gauge metrics) |
AAvg | Arithmetic Average | All types | Arithmetic average (first calculate the average at each time point, then calculate the average of the averages) |
Sum | Sum | Counter type | Sum |
Max | Maximum | All types | Maximum value |
Min | Minimum | All types | Minimum value |
Percentile | Estimated Percentile | All types | Estimated percentile |
PercentileExact | Exact Percentile | All types | Exact percentile |
Spread | Spread | All types | Absolute spread, Max minus Min within the statistical cycle |
Rspread | Relative Spread | All types | Relative spread, Max divided by Min within the statistical cycle |
Stddev | Standard Deviation | All types | Standard deviation |
Apdex | Application Performance Index | Delay type | Delay satisfaction |
Last | Last | All types | Latest value |
Uniq | Estimated Uniq | Cardinality type | Estimated cardinality |
UniqExact | Exact Uniq | Cardinality type | Exact cardinality |
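A few of the operators in the table are defined precisely enough to sketch directly. The functions below follow the table's descriptions of Spread, Rspread, and AAvg; the function names, the input layout, and the handling of a zero minimum in `rspread` are assumptions made for illustration.

```python
from statistics import mean


def spread(values):
    """Absolute spread: Max minus Min."""
    return max(values) - min(values)


def rspread(values):
    """Relative spread: Max divided by Min (None when Min is 0; an assumption)."""
    lowest = min(values)
    return max(values) / lowest if lowest != 0 else None


def aavg(values_by_time):
    """Average of the per-time-point averages."""
    return mean(mean(values) for values in values_by_time.values())


print(spread([2, 8, 4]))                 # 6
print(rspread([2, 8, 4]))                # 4.0
print(aavg({0: [1, 2, 3], 1: [10]}))     # 6, the mean of 2 and 10
```

In the last call the pooled mean of all four values would be 4, which shows why averaging the per-time-point averages can give a different result from a single overall average.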
#2.2 Secondary Operators
Operator | Description |
---|---|
PerSecond | Calculate rate, divide the result of the inner operator by the time interval [1] |
Math | Arithmetic operations, supports +, -, *, / |
Percentage | Unit conversion % |
- [1] For example: `PerSecond(Sum)` means calculating the sum first, then dividing by the time interval `interval` passed by the API; `PerSecond(Avg)` means calculating the average first, then dividing by the data source time interval `data_precision`.
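The two divisors in note [1] can be compared with a small sketch. It assumes a 1m data source (`data_precision` of 60 seconds), a query `interval` of 180 seconds, and that every data point in the range is present; the function names are hypothetical.

```python
def per_second_of_sum(values, interval_seconds):
    """PerSecond(Sum): sum first, then divide by the API interval."""
    return sum(values) / interval_seconds


def per_second_of_avg(values, data_precision_seconds):
    """PerSecond(Avg): average first, then divide by the data source precision."""
    return (sum(values) / len(values)) / data_precision_seconds


byte_per_minute = [600, 1200, 1800]  # three 1m points covering 180 seconds
print(per_second_of_sum(byte_per_minute, 180))  # 20.0
print(per_second_of_avg(byte_per_minute, 60))   # 20.0
```

With all points present, both forms give the same rate, because the sum covers `interval` seconds while each averaged point covers `data_precision` seconds.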
#3. Calculation Logic of Different Metrics' Operators
#3.1 Counter/Gauge Metrics
- flow_metrics data table
  - `Sum` operator
    - Calculate the `Sum` of all data within the query time range
  - `Avg` operator
    - Calculate the `Sum` of all data within the query time range and divide by `interval/data_precision`
  - Other operators
    - First use `Sum` to aggregate based on `data_precision`
    - Then call the `ClickHouse` function for the selected specific operator
  - When forced (due to the need for other metrics in the same statement) to use two layers of `SQL` calculations
    - `Sum/Avg` operator
      - First use `Sum` to aggregate based on `data_precision`
      - Then call the `ClickHouse` function for the selected specific operator
- flow_log data table
  - Call the `ClickHouse` function for the selected specific operator
- prometheus/ext_metrics/deepflow_system data table
  - Same as flow_metrics data table
- Additional notes
  - The `Min` operator fills 0 for time points with no data or data as `null`
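The flow_metrics rules for Counter/Gauge metrics can be sketched as follows. This is an illustration under stated assumptions, not the actual SQL: `interval` and `data_precision` are in seconds, "aggregate based on `data_precision`" is read as summing the rows that share a timestamp, and all function names are hypothetical.

```python
from collections import defaultdict


def counter_avg(values, interval, data_precision):
    """Avg: Sum of all data divided by the number of expected time points."""
    expected_points = interval / data_precision
    return sum(values) / expected_points


def counter_other(rows, func):
    """Other operators: Sum per time point first, then apply the operator."""
    per_point = defaultdict(int)
    for ts, value in rows:
        per_point[ts] += value
    return func(per_point.values())


# 1s data source, 20s interval, but only 10 seconds carried traffic
print(counter_avg([5] * 10, interval=20, data_precision=1))  # 2.5

# Two rows at second 0 (for example, different tags) and one row at second 1
rows = [(0, 3), (0, 4), (1, 5)]
print(counter_other(rows, max))  # 7
```

Because the divisor is the number of expected time points rather than the number of rows, seconds without data pull the average down, which matches the table's note that Avg does not ignore zero values for Counter/Gauge metrics.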
#3.2 Quotient/Percentage Metrics
- flow_metrics data table
  - `Avg` operator
    - Calculate `Sum(x)/Sum(y)` for all data within the query time range
  - Other operators
    - First use `Sum(x)/Sum(y)` to aggregate based on `data_precision`
    - Then call the `ClickHouse` function for the selected specific operator
  - When forced (due to the need for other metrics in the same statement) to use two layers of `SQL` calculations
    - `Avg` operator
      - First use `Sum(x)/Sum(y)` to aggregate based on `data_precision`
      - Then call the `ClickHouse` function for the selected specific operator
- flow_log data table
  - Call the `ClickHouse` function `func(x/y)` for the selected specific operator
- Additional notes
  - The `Min` operator for `Percentage` metrics fills 0 for time points with no data
  - When calculating `Sum(x)/Sum(y)`, points with a denominator of `0/null` or a numerator of `null` are ignored
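The `Sum(x)/Sum(y)` rule and its handling of invalid points can be sketched as below. The example assumes each point is an `(x, y)` pair with `None` standing in for `null`; the function name and the choice to return `None` when no valid point remains are assumptions.

```python
def quotient_avg(points):
    """Avg for Quotient/Percentage metrics: Sum(x) / Sum(y) over valid points."""
    # Drop points whose numerator is null or whose denominator is 0 or null
    kept = [(x, y) for x, y in points if x is not None and y]
    if not kept:
        return None
    return sum(x for x, _ in kept) / sum(y for _, y in kept)


# Bytes and packets per time point, as for a bytes-per-packet metric
points = [(1500, 10), (None, 5), (300, 0), (600, 30)]
print(quotient_avg(points))  # 52.5
```

Only the first and last points are valid, giving 2100 / 40 = 52.5. Averaging the two per-point ratios (150 and 20) would instead give 85, so the ratio of sums weights each point by its denominator.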
#3.3 Delay/BoundedGauge Metrics
- flow_metrics data table
  - Call the `ClickHouse` function for the selected specific operator
  - When forced (due to the need for other metrics in the same statement) to use two layers of `SQL` calculations
    - `Avg/Min/Max` operator
      - Both layers call the `ClickHouse` function for the selected specific operator
    - `Spread/Rspread` operator
      - First use `Max` and `Min` to aggregate based on `data_precision`
      - Then call the `ClickHouse` function for the selected specific operator
    - Other operators
      - First use `groupArray` to aggregate
      - Then call the `ClickHouse` function for the selected specific operator
- flow_log data table
  - Call the `ClickHouse` function for the selected specific operator
- Additional notes
  - The `Min` operator for `BoundedGauge` metrics fills 0 for time points with no data or data as `null`
  - `Delay` metrics ignore points with a value of 0, considering 0 as a meaningless delay value
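Two of the rules above lend themselves to a short sketch: ignoring zero-valued delay points, and the two-layer `Spread` calculation that first takes `Max` and `Min` per time point. The function names and the `(timestamp, value)` row layout are hypothetical, and treating `None` like 0 is an assumption.

```python
def delay_avg(values):
    """Average delay, ignoring points whose value is 0 (or None; an assumption)."""
    kept = [v for v in values if v]
    return sum(kept) / len(kept) if kept else None


def two_layer_spread(rows):
    """Spread in two layers: Max and Min per time point, then overall Max minus Min."""
    per_point = {}
    for ts, value in rows:
        if not value:  # 0 is treated as a meaningless delay value
            continue
        low, high = per_point.get(ts, (value, value))
        per_point[ts] = (min(low, value), max(high, value))
    if not per_point:
        return None
    overall_max = max(high for _, high in per_point.values())
    overall_min = min(low for low, _ in per_point.values())
    return overall_max - overall_min


print(delay_avg([0, 200, 400, 0]))  # 300.0
print(two_layer_spread([(0, 100), (0, 300), (1, 0), (1, 250)]))  # 200
```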
#3.4 data_precision of Different Databases/Tables
Database | data_precision | Remarks |
---|---|---|
flow_metrics | 1s/1m | Supports 1s and 1m by default, can be aggregated to 1h and 1d |
flow_log | 1s | No actual concept of `data_precision`; the value is only for convenience in calculation |
application_log | 1s | No actual concept of `data_precision`; the value is only for convenience in calculation |
prometheus | 10s | Can be modified through the data_source_prometheus_interval field in server.yaml |
ext_metrics | 10s | Can be modified through the data_source_ext_metrics_interval field in server.yaml |
deepflow_admin | 10s | |
deepflow_tenant | 10s | |
event | 1s | No actual concept of `data_precision`; the value is only for convenience in calculation |
profile | 1s | No actual concept of `data_precision`; the value is only for convenience in calculation |