v6.5 EE Release Notes

Created:2024-07-05 Last Modified:2024-08-08

This document was translated by ChatGPT

#1. Zero Intrusion

#1.1 Tracing

  • AutoTracing
    • ⭐ Enhanced the ability to extract TraceID and SpanID from SQL statement comments, support parsing variable values in precompiled SQL statements, and support collecting login usernames and current database names. See documentation here.
    • ⭐ Added parsing capability for the bRPC protocol. See documentation here.
    • ⭐ Added parsing capabilities for RabbitMQ AMQP, ActiveMQ OpenWire, NATS, ZeroMQ, and Pulsar protocols. See documentation here.
    • ⭐ Enhanced Kafka protocol parsing: added the ability to parse Partition, Offset, GroupID fields, and JoinGroup, LeaveGroup, SyncGroup messages; support extracting correlation_id from Kafka protocol headers as x_request_id_0/1, automatically tracing Kafka call chains in Request-Response mode; support extracting SpanID from traceparent and sw8 in protocol headers, enhancing tracing capabilities. See documentation here.
    • ⭐ Support using Wasm Plugin to enhance the parsing of Dubbo, NATS, and ZeroMQ protocols. See demo here (opens new window).
    • Support parsing Kryo serialization format for Dubbo protocol. See documentation here.
    • Mark MySQL unidirectional messages (CLOSE, QUIT) log type directly as session.
    • Added captured_request_byte and captured_response_byte metrics to call logs. See documentation here (opens new window).
    • Enhanced the parsing capability of X-Tingyun TraceID.
  • AutoTagging
    • ⭐ Added biz_type tag to application metrics and call logs, which can be used with Wasm Plugin to identify business types.
    • ⭐ Kafka protocol supports extracting topic_name as endpoint. See documentation here.
    • ⭐ Aggregated metrics no longer aggregate WAN server-side as 0.0.0.0, and private IP addresses without any resource tags (192.168, 172.16, 10, 169.254) are no longer marked as WAN.
    • ⭐ Support custom collection of HTTP/HTTP2/gRPC header fields and store them in the attribute.$field_name field of call logs. See detailed documentation.
    • For A/AAAA type DNS requests, extract QNAME as request_domain. See documentation here.
    • FastCGI, MQTT, and DNS protocols support extracting the endpoint field. See documentation here.
    • Added main IP (pod_node_ip, chost_ip) and hostname (pod_node_hostname, chost_hostname) tags for container nodes and cloud servers to all data.
    • The auto_service tag automatically aggregates container nodes (pod_node) into container clusters (pod_cluster), but auto_instance will not do this aggregation.
    • When a K8s workload (pod_group) is associated with multiple container services, the service name with the smallest dictionary order is used to mark the container service (pod_service) tag.
  • Search Capabilities
    • Added syntax sugar field XX, which can be used to match either of the two original fields XX_0 or XX_1. Supported fields include: x_request_id, syscall_thread, syscall_coroutine, syscall_cap_seq, syscall_trace_id, tcp_seq.
    • Added role grouping capability on the resource analysis page to distinguish statistics when resources act as clients or servers.
    • Optimized the loading speed when switching the search box to container search or process search mode.
    • The client and server columns in the aggregated data table support copy-pasting to the search bar.
    • When entering resource filter conditions, candidate options support hovering to prompt resource information.
  • Usability Enhancements
    • ⭐ Linked call chain tracing with flow logs to view network performance metrics of Spans.
    • ⭐ Call chain tracing and topology analysis pages support using DeepFlow Stella agent for intelligent analysis and interpretation, supporting the GPT4 model.
    • Optimized the user experience of the search bar in "click search button to trigger" mode.
    • Support remembering the activated state of the Tab below the call chain tracing flame graph and stabilizing the Tab layout.
    • Linked highlighting of Spans in the call chain tracing flame graph and call logs in the table below.
    • Optimized the presentation of Span tracing in call chain tracing.
    • Optimized the parent-child logic of NET Spans in the call chain tracing flame graph.
    • Improved the zoom in and zoom out experience of the topology graph.
    • Enhanced the usability of copying knowledge graph tags.
    • Displayed delay 0 as N/A in tables.
    • Optimized the display of the "query area" in data tags.
    • Support displaying resource icons by application protocol.

#1.2 Profiling

  • AutoProfiling
    • ⭐ Support Off-CPU Profiling, low overhead, continuous operation, can be used to quickly locate bottleneck functions when application performance is low but CPU usage is not high.
  • Usability Enhancements
    • ⭐ Performance profiling flame graph supports using DeepFlow Stella agent for intelligent analysis and interpretation, supporting the GPT4 model.
    • Changed the first line name in the flame graph from root to $app_service, which is the process name collected by eBPF or the service name set internally by the application.
    • Optimized the loading speed when switching the search box to container search or process search mode.
    • Differentiated the types of function names displayed in the eBPF flame graph: kernel functions, dynamic library functions, application functions.
    • Optimized the Tip display of the eBPF flame graph.

#1.3 Network

  • AutoMetrics
    • Exposed traffic distribution metrics, supporting monitoring of traffic rates matching specific traffic distribution strategies.
    • Renamed anomaly metrics: Connection-Client SYN End (client_syn_repeat) renamed to Connection-Server SYN Missing (server_syn_miss) and included in server anomalies.
    • Renamed anomaly metrics: Connection-Server SYN End (server_syn_repeat) renamed to Connection-Client ACK Missing (client_ack_miss) and included in client anomalies.
    • Set the status of flow logs with TCP disconnection anomalies to normal.
  • AutoTagging
    • ⭐ Added request_domain field to network flow logs, automatically associating with application metrics and call logs.
    • ⭐ Aggregated metrics no longer aggregate WAN server-side as 0.0.0.0, and private IP addresses without any resource tags (192.168, 172.16, 10, 169.254) are no longer marked as WAN.
    • Added main IP (pod_node_ip, chost_ip) and hostname (pod_node_hostname, chost_hostname) tags for container nodes and cloud servers to all data.
    • The auto_service tag automatically aggregates container nodes (pod_node) into container clusters (pod_cluster), but auto_instance will not do this aggregation.
    • When a K8s workload (pod_group) is associated with multiple container services, the service name with the smallest dictionary order is used to mark the container service (pod_service) tag.
  • Search Capabilities
    • Added syntax sugar field XX, which can be used to match either of the two original fields XX_0 or XX_1. Supported fields include: tunnel_tx_ip, tunnel_rx_ip, tunnel_tx_mac, tunnel_rx_mac, tcp_seq.
    • Added role grouping capability on the resource analysis page to distinguish statistics when resources act as clients or servers.
    • Optimized the loading speed when switching the search box to container search or process search mode.
    • The client and server columns in the aggregated data table support copy-pasting to the search bar.
    • When entering resource filter conditions, candidate options support hovering to prompt resource information.
  • Usability Enhancements
    • ⭐ Topology analysis page supports using DeepFlow Stella agent for intelligent analysis and interpretation, supporting the GPT4 model.
    • Optimized the user experience of the search bar in "click search button to trigger" mode.
    • Improved the zoom in and zoom out experience of the topology graph.
    • Enhanced the usability of copying knowledge graph tags.
    • Displayed delay 0 as N/A in tables.
    • Optimized the display of the "query area" in data tags.
    • Optimized the display of the access relationship right slide-out panel.

#2. Customization

#2.1 Dashboard

  • Panel Enhancements
    • ⭐ Added text-type Panels, supporting Markdown and Mermaid syntax.
    • ⭐ Support adding Markdown descriptions to Panels.
    • ⭐ Support customizing the right slide-out panel Tab page for all Panels, automatically associating the data displayed in the Tab page.
    • Panels with multiple query conditions support waking up the right slide-out panel, automatically associating all observability data.
    • The background curve of the overview chart supports hiding the coordinate axis.
    • Optimized the color selection box on the Panel editing page.
    • Optimized the style, metric settings, and advanced settings of Panels.
  • Usability Enhancements
    • Support copying and cloning Panels.
    • Added metric setting function to the Panel editing box.
    • The detail table supports sorting by start time and end time columns.
    • The list page supports sorting by name, creator, and modification time.
    • Optimized the ability to set icon information in the Panel editing box.
    • Optimized the interaction of the new Panel creation box.
    • Optimized the legend display of line charts, bar charts, and pie charts.
    • Moved the modify metric button of Panels into the right slide-out panel for editing.
    • Optimized the layout and style of the Panel page, and optimized the layout and style of the right slide-out panel for editing Panels.
    • Split the Dashboard list into two pages: custom Dashboards and built-in Dashboards.
    • Support switching the chart type of Panels.
    • Optimized the search module on the Panel editing page.

#2.2 Universal Map

  • Usability Enhancements
    • ⭐ Optimized data display in the physical network section, enhancing the usability of "cloud and on-premises integrated monitoring".
    • Support batch (multi-select client services, server services) definition of paths in the business.
    • Improved the zoom in and zoom out experience of the topology graph.
    • Optimized the operation experience in the topology graph editing mode, and optimized the operation experience of arranging services and service groups in the topology graph.
    • Optimized the operation experience of the right slide-out panel in the universal map.

#3. Integration

#3.1 Metrics

  • Metric Templates
    • ⭐ Added metric template management capabilities, facilitating quick selection of metric sets on tracing, network, and Dashboard pages.

#3.2 Logs

#4. Operations

#4.1 Alerts

  • Alert Policies
    • ⭐ Enhanced granularity: added configuration capabilities for monitoring frequency and monitoring intervals.
    • ⭐ Refined event types: added configuration capabilities for recovery events and information events.
  • Push Endpoints
    • Added Kafka push endpoint, supporting Plain type SASL authentication.
  • System Alerts
    • When the disk space where ClickHouse is located is insufficient, deepflow-server will perform a forced cleanup, triggering a built-in system alert to notify the user.
    • Added more comprehensive metrics to the alert for collector data loss.
  • Usability Enhancements
    • Optimized the display of the alert policy list and alert event list.

#4.2 Reports

N/A

#5. Management

#5.1 Resources

  • AutoTagging
    • ⭐ Significantly improved the real-time performance of K8s tag injection. The previous code path involved 5 independent 1-minute timers, while the optimized path only involves 1 10-second timer and 1 1-minute timer. The worst-case delay is reduced from 5 minutes to 1 minute and 20 seconds (the agent's list/watch of K8s resources may span two cycles, so the worst-case delay may be 20 seconds).
    • Enhanced the ability to synchronize resource information with Ping An Cloud, supporting the acquisition of CIDR for tenant Pods in Serverless clusters.
    • By default, the enterprise edition disables the Agent from automatically triggering the generation of Kubernetes-type cloud platforms, simplifying the deployment steps in On-Prem mode.
    • Support synchronizing custom tags of cloud servers in Alibaba Cloud and automatically injecting cloud.tag.$key tag fields into all observability data.
    • Decoupled the synchronization of cloud resources and container resources, so that errors in the public cloud API do not affect the synchronization of container resource tags.
  • Usability Improvements
    • Excluded deleted resources from the resource count displayed in the knowledge graph.

#5.2 System

  • SQL API
    • ⭐ Optimized the Percentile operator for Delay and BoundedGauge type metrics, reducing the number of layers in the compiled ClickHouse SQL to one.
    • Modified ClickHouse table names and field names, see the table at the end (deprecated names can still be used, but will no longer be supported starting from v7.0).
    • Data in the flow_log and event databases support precise search using the _id field.
    • Simplified the query semantics of map-type fields for easier user understanding.
  • Server
    • ⭐ Added Kafka Exporter data export method. See documentation here, supporting the export of the following observability signals:
      • Metrics: flow_metrics.application* (application performance metrics/access relationships), flow_metrics.network* (network performance metrics/access relationships).
      • Logs: flow_log.l4_flow_log (network flow logs), flow_log.l7_flow_log (application call logs).
      • Events: event.perf_event (file read/write events).
    • Prometheus Remote Write supports exporting metrics from flow_metrics.application* and flow_metrics.network*.
    • Added a global configuration for whether the Agent requests the Server NAT IP, suitable for scenarios where all Agents request the Server through the public network.
    • Added a Token management page and optimized the Token timeout determination mechanism.
    • Traffic distribution strategies support export and import.
  • Agent
    • ⭐ Default enabled system load circuit breaker mechanism. When the ratio of system load to CPU cores exceeds system_load_circuit_breaker_threshold, the Agent triggers the circuit breaker mechanism, automatically entering a disabled state and alerting. Configuration details can be found in the Agent configuration sample.
    • ⭐ Optimized Redis and MySQL protocol parsing performance: after optimization, an Agent with 1 CPU and 300MB memory can collect 50K TPS MySQL or Redis traffic.
    • Added flow-count-limit configuration parameter to prevent the Agent from consuming too much memory under sudden traffic, avoiding triggering the OOM Killer.
    • ⭐ Improved HTTP2 Huffman decoding performance. Under the condition of limited 1 logical core, the extreme TPS collection performance increased by 5 to 25 times. Test data is shown in the table below.
    • ⭐ Support configuring call log blacklist to reduce storage consumption, eliminate large delay metrics interference from health checks, and eliminate DNS NXDOMAIN anomaly interference.
    • ⭐ Support eBPF data out-of-order reordering and segment reassembly, enhancing the success rate of application protocol parsing.
    • Support collecting traffic from Open vSwitch Bond sub-interfaces and correctly aggregating them into flow logs and call logs.
    • Dedicated collectors support stripping ERSPAN, TEB, and VXLAN tunnel encapsulation from mirrored traffic.
    • Improved eBPF collection performance [test data to be supplemented].
    • Added 6443 (default port for K8s apiserver) to the default parsed ports for the TLS protocol.
    • Allowed collectors to remotely execute low-privilege debug commands.
  • Deployment
  • Usability Improvements
    • ⭐ Added AskGPT Copilot to DeepFlow Topo and DeepFlow Tracing Panel in Grafana: Demo1 (opens new window), Demo2 (opens new window). Currently supported models include GPT4, Tongyi Qianwen, Wenxin Yiyan, ChatGLM.
    • URLs in the page are URL-ized, supporting opening in a new page through the right-click menu.
    • Simplified the URL length of the page.

HTTP2 Collection Performance Comparison Test:

Random Header Count Version Agent CPU Agent Memory TPS
3 OLD 96% 34 MB 10K
NEW 97% 94 MB 50K
12 OLD 89% 9 MB 1.2K
NEW 93% 112 MB 30K

#5.3 Account

  • Multi-Tenant Support
    • ⭐ Support creating multiple isolated organizations to meet the isolation needs of large enterprises with multiple subsidiaries and business units, and support joint operation of SaaS services with industry clouds.
    • Support setting tenant account permissions, including four roles: owner, maintainer, user, and guest.
    • Support dividing tenant accounts into teams according to the organizational structure and setting the visibility of resources within the team.
    • Support Google and GitHub account SSO.
  • Usability Improvements
    • Added a preference settings page, allowing configuration of search box trigger methods, search box display forms, icon display, and other behaviors.

#6. Compatibility

#6.1 Incompatible Changes

  • eBPF AutoProfiling
  • AutoTagging
    • Synchronization of security group information in cloud resources is no longer supported.
  • Server
  • Agent
    • The static configuration item src-interfaces has been merged into the dynamic configuration item tap_interface_regex, reducing configuration complexity in scenarios such as MACVlan, Huawei Cloud CCE Turbo, VMware, etc.

#6.2 Compatible Changes

Note: The following changes will no longer be compatible starting from v7.0.

Modifications to table names in the ClickHouse flow_metrics database:

Old Name New Name Data Function
vtap_app_port application Application performance metrics for all services
vtap_app_edge_port application_map Application access relationships and their performance metrics
vtap_flow_port network Network performance metrics for all services
vtap_flow_edge_port network_map Network access relationships and their performance metrics
vtap_acl traffic_policy Network policy metrics (Enterprise Edition only)

Modifications to field names in the ClickHouse database:

Old Name New Name Data Function
vtap agent Agent
vtap_id agent_id Agent ID
tap_side observation_point Observation point
tap capture_network_type Network location (Enterprise Edition only)
tap_port capture_nic Capture NIC identifier
tap_port_name capture_nic_name Capture NIC name
tap_port_type capture_nic_type Capture NIC type
tap_port_host capture_nic_host Host machine of the capture NIC (Enterprise Edition only)
tap_port_chost capture_nic_chost Cloud server of the capture NIC
tap_port_pod_node capture_nic_pod_node Container node of the capture NIC

#7. Documentation