v6.5 EE Release Notes

Created: 2024-02-27, Last Modified: 2024-07-05

This document was translated by ChatGPT

#1. Zero Intrusion

#1.1 Tracing

  • AutoTracing
    • ⭐ Enhanced extraction of TraceID and SpanID from SQL statement comments; added support for parsing variable values in prepared (precompiled) SQL statements and for collecting the login username and current database name. See documentation.
    • ⭐ Added parsing capability for the bRPC protocol. See documentation.
    • ⭐ Added parsing capabilities for RabbitMQ AMQP, ActiveMQ OpenWire, NATS, ZeroMQ, and Pulsar protocols. See documentation.
    • ⭐ Enhanced Kafka protocol parsing: added parsing of the Partition, Offset, and GroupID fields and of JoinGroup, LeaveGroup, and SyncGroup messages; support extracting correlation_id from Kafka protocol headers as x_request_id_0/1 to automatically trace Kafka call chains in Request-Response mode; support extracting SpanID from traceparent and sw8 in protocol headers, enhancing tracing capabilities. See documentation.
    • ⭐ Support using Wasm Plugin to enhance the parsing of Dubbo, NATS, and ZeroMQ protocols. See demo (opens new window).
    • Support parsing Kryo serialization format for Dubbo protocol. See documentation.
    • The log type of MySQL unidirectional messages (CLOSE, QUIT) is now marked directly as session.
    • Added captured_request_byte and captured_response_byte metrics to call logs. See documentation (opens new window).
    • Enhanced the parsing capability of X-Tingyun TraceID.
  • AutoTagging
    • ⭐ Added biz_type tag to application metrics and call logs, which can be used with Wasm Plugin to identify business types.
    • ⭐ Kafka protocol supports extracting topic_name as endpoint. See documentation.
    • ⭐ Aggregated metrics no longer aggregate the WAN server side as 0.0.0.0, and private or link-local IP addresses without any resource tags (192.168/16, 172.16/12, 10/8, 169.254/16) are no longer marked as WAN.
    • ⭐ Support custom collection of HTTP/HTTP2/gRPC header fields and store them in the attribute.$field_name field of call logs. See documentation.
    • For A/AAAA type DNS requests, extract QNAME as request_domain. See documentation.
    • FastCGI, MQTT, and DNS protocols support extracting the endpoint field. See documentation.
    • Added main IP (pod_node_ip, chost_ip) and hostname (pod_node_hostname, chost_hostname) tags for container nodes and cloud servers to all data.
    • The auto_service tag automatically aggregates container nodes (pod_node) into container clusters (pod_cluster), whereas auto_instance is not aggregated this way.
    • When a K8s workload (pod_group) is associated with multiple container services, the lexicographically smallest service name is used for the container service (pod_service) tag.
  • Search Capabilities
    • Added syntactic-sugar fields: each field XX matches either of the two original fields XX_0 and XX_1. Supported fields include: x_request_id, syscall_thread, syscall_coroutine, syscall_cap_seq, syscall_trace_id, tcp_seq.
    • Added role grouping capability on the resource analysis page to distinguish statistics when resources act as clients or servers.
    • Optimized the loading speed when switching the search box to container search or process search mode.
    • The client and server columns in the aggregated data table support copy-pasting to the search bar.
    • When entering resource filter conditions, hovering over a candidate option shows its resource information.
  • Usability Improvements
    • ⭐ Linked distributed tracing with flow logs to view network performance metrics of Spans.
    • ⭐ Distributed tracing and topology analysis pages support using DeepFlow Stella intelligent analysis, with support for the GPT4 model.
    • Optimized the user experience of the search bar in "click search button to trigger" mode.
    • Support remembering the active state of the Tab below the distributed tracing flame graph and stabilizing the Tab layout.
    • Linked highlighting between Spans in the distributed tracing flame graph and call logs in the table below.
    • Optimized the presentation of Span tracing in distributed tracing.
    • Optimized the parent-child logic of NET Spans in the distributed tracing flame graph.
    • Improved the zoom in and zoom out experience of the topology graph.
    • Enhanced the usability of copying knowledge graph tags.
    • A delay of 0 is displayed as N/A in tables.
    • Optimized the display of the "query area" in data tags.
    • Support displaying resource icons by application protocol.
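
As an illustration of the SQL-comment tracing feature above, here is a minimal sketch of extracting a TraceID and SpanID from a statement comment. The `traceparent`-style comment format below is an assumption for demonstration only; the actual comment keys DeepFlow parses are configurable.

```python
import re

# Hypothetical sketch: pull TraceID/SpanID out of a SQL statement comment.
# The "traceparent" key and W3C-style ID layout are illustrative assumptions.
TRACEPARENT_RE = re.compile(
    r"/\*\s*traceparent:\s*\d{2}-([0-9a-f]{32})-([0-9a-f]{16})-\d{2}\s*\*/"
)

def extract_trace_ids(sql: str):
    """Return (trace_id, span_id) from a SQL comment, or (None, None)."""
    m = TRACEPARENT_RE.search(sql)
    return (m.group(1), m.group(2)) if m else (None, None)

sql = ("/* traceparent: 00-0af7651916cd43dd8448eb211c80319c"
       "-b7ad6b7169203331-01 */ SELECT id FROM orders WHERE user_id = ?")
trace_id, span_id = extract_trace_ids(sql)
```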

#1.2 Profiling

  • AutoProfiling
    • ⭐ Added Off-CPU Profiling: low overhead and continuous operation, useful for quickly locating bottleneck functions when application performance is poor but CPU usage is not high.
  • Usability Improvements
    • ⭐ Performance profiling flame graph supports using DeepFlow Stella intelligent analysis, with support for the GPT4 model.
    • Changed the first line name in the flame graph from root to $app_service, which is the process name collected by eBPF or the service name set internally by the application.
    • Optimized the loading speed when switching the search box to container search or process search mode.
    • Differentiated the types of function names in the eBPF flame graph: kernel functions, dynamic library functions, application functions.
    • Optimized the Tip display in the eBPF flame graph.
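
To illustrate what Off-CPU Profiling measures: off-CPU time is wall-clock time minus on-CPU time, i.e. time a thread spends blocked on locks, I/O, or sleeps. The conceptual sketch below demonstrates this for a single thread; it is not DeepFlow's implementation, which samples per-function in the kernel via eBPF.

```python
import time

# Conceptual sketch only: off-CPU time = wall-clock time - on-CPU time.
def measure(fn):
    wall_start, cpu_start = time.monotonic(), time.process_time()
    fn()
    wall = time.monotonic() - wall_start
    on_cpu = time.process_time() - cpu_start
    return wall, on_cpu, wall - on_cpu  # third value is off-CPU time

def blocking_workload():
    time.sleep(0.2)          # blocked (off-CPU): lock waits, I/O, sleeps
    sum(range(100_000))      # running (on-CPU)

wall, on_cpu, off_cpu = measure(blocking_workload)
```

A workload like this looks idle to a CPU profiler even though its latency is dominated by the 200 ms of off-CPU time.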

#1.3 Network

  • AutoMetrics
    • Exposed traffic distribution metrics to support monitoring the traffic rate of specific traffic distribution strategies.
    • Renamed anomaly metrics: Connection-Client SYN End (client_syn_repeat) renamed to Connection-Server SYN Missing (server_syn_miss) and included in server anomalies.
    • Renamed anomaly metrics: Connection-Server SYN End (server_syn_repeat) renamed to Connection-Client ACK Missing (client_ack_miss) and included in client anomalies.
    • Set the status of flow logs with TCP disconnection anomalies to normal.
  • AutoTagging
    • ⭐ Added request_domain field to network flow logs, automatically associating with application metrics and call logs.
    • ⭐ Aggregated metrics no longer aggregate the WAN server side as 0.0.0.0, and private or link-local IP addresses without any resource tags (192.168/16, 172.16/12, 10/8, 169.254/16) are no longer marked as WAN.
    • Added main IP (pod_node_ip, chost_ip) and hostname (pod_node_hostname, chost_hostname) tags for container nodes and cloud servers to all data.
    • The auto_service tag automatically aggregates container nodes (pod_node) into container clusters (pod_cluster), whereas auto_instance is not aggregated this way.
    • When a K8s workload (pod_group) is associated with multiple container services, the lexicographically smallest service name is used for the container service (pod_service) tag.
  • Search Capabilities
    • Added syntactic-sugar fields: each field XX matches either of the two original fields XX_0 and XX_1. Supported fields include: tunnel_tx_ip, tunnel_rx_ip, tunnel_tx_mac, tunnel_rx_mac, tcp_seq.
    • Added role grouping capability on the resource analysis page to distinguish statistics when resources act as clients or servers.
    • Optimized the loading speed when switching the search box to container search or process search mode.
    • The client and server columns in the aggregated data table support copy-pasting to the search bar.
    • When entering resource filter conditions, hovering over a candidate option shows its resource information.
  • Usability Improvements
    • ⭐ Topology analysis page supports using DeepFlow Stella intelligent analysis, with support for the GPT4 model.
    • Optimized the user experience of the search bar in "click search button to trigger" mode.
    • Improved the zoom in and zoom out experience of the topology graph.
    • Enhanced the usability of copying knowledge graph tags.
    • A delay of 0 is displayed as N/A in tables.
    • Optimized the display of the "query area" in data tags.
    • Optimized the display of the right slide frame for access relationships.
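
The request_domain field for DNS is extracted from the QNAME of the query. A minimal sketch of standard QNAME wire-format decoding follows (compression pointers per RFC 1035 §4.1.4 are omitted for brevity):

```python
def decode_qname(data: bytes) -> str:
    # DNS QNAME wire format: length-prefixed labels terminated by 0x00.
    # Minimal sketch; real parsers must also handle compression pointers.
    labels, i = [], 0
    while data[i] != 0:
        length = data[i]
        labels.append(data[i + 1:i + 1 + length].decode("ascii"))
        i += 1 + length
    return ".".join(labels)

qname = decode_qname(b"\x03www\x07example\x03com\x00")  # "www.example.com"
```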

#2. Customization

#2.1 Dashboard

  • Panel Enhancements
    • ⭐ Added text-type panels, supporting Markdown and Mermaid syntax.
    • ⭐ Support adding Markdown descriptions to panels.
    • ⭐ Support customizing Tab pages in the right slide frame for all panels, automatically associating the data displayed in each Tab page.
    • Panels with multiple query conditions support opening the right slide frame, automatically associating all observability data.
    • The background curve of the overview chart supports hiding the coordinate axis.
    • Optimized the color selection box on the panel editing page.
    • Optimized the style, metric settings, and advanced settings of panels.
  • Usability Improvements
    • Support copying and cloning panels.
    • Added metric setting function to the panel editing box.
    • The detail table supports sorting by start time and end time columns.
    • The list page supports sorting by name, creator, and modification time.
    • Optimized the ability to set icon information in the panel editing box.
    • Optimized the interaction of the new panel box.
    • Optimized the legend display of line charts, bar charts, and pie charts.
    • Moved the modify metric button of the panel into the right slide frame of the editor.
    • Optimized the layout and style of the panel page, and optimized the layout and style of the panel editing right slide frame.
    • Split the dashboard list into two pages: custom dashboards and built-in dashboards.
    • Support switching the chart type of the panel.
    • Optimized the search module on the panel editing page.

#2.2 Universal Map

  • Usability Improvements
    • ⭐ Optimized the data display in the physical network section, enhancing the usability of "cloud and on-premises integrated monitoring".
    • Support batch definition of paths in a business (multi-selecting client and server services).
    • Improved the zoom in and zoom out experience of the topology graph.
    • Optimized the operation experience in the topology graph editing mode, and optimized the operation experience of arranging services and service groups in the topology graph.
    • Optimized the operation experience of the right slide frame in the universal map.

#3. Integration

#3.1 Metrics

  • Metric Templates
    • ⭐ Added metric template management capabilities to facilitate quick selection of metric sets in tracing, network, and dashboard pages.

#3.2 Logs

#4. Operations

#4.1 Alerts

  • Alert Policies
    • ⭐ Enhanced granularity: added configuration capabilities for monitoring frequency and monitoring intervals.
    • ⭐ Refined event types: added configuration capabilities for recovery events and informational events.
  • Push Endpoints
    • Added Kafka push endpoint, supporting SASL authentication of Plain type.
  • System Alerts
    • When the disk space where ClickHouse is located is insufficient, deepflow-server will perform a forced cleanup, triggering a built-in system alert to inform the user.
    • Added richer metrics to the alert for collector data loss.
  • Usability Improvements
    • Optimized the display of the alert policy list and alert event list.

#4.2 Reports

N/A

#5. Management

#5.1 Resources

  • AutoTagging
    • ⭐ Significantly improved the real-time performance of K8s tag injection. The previous code path involved 5 independent 1-minute timers, while the optimized path involves only one 10-second timer and one 1-minute timer. The worst-case delay is reduced from 5 minutes to 1 minute 20 seconds (the agent's list/watch of K8s resources may span at most two 10-second cycles, contributing up to 20 seconds).
    • Enhanced the ability to synchronize resource information with Ping An Cloud, supporting the acquisition of CIDR for tenant Pods in Serverless clusters.
    • By default, the enterprise edition disables the Agent from automatically triggering the generation of Kubernetes-type cloud platforms, simplifying the deployment steps in On-Prem mode.
    • Support synchronizing custom tags of cloud servers in Alibaba Cloud and automatically injecting cloud.tag.$key tag fields into all observability data.
    • Decoupled the synchronization of cloud resources and container resources, so that errors in the public cloud API do not affect the synchronization of container resource tags.
  • Usability Improvements
    • Excluded deleted resources from the resource count displayed in the knowledge graph.
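
The worst-case delay figures for K8s tag injection above reduce to simple arithmetic, sketched here:

```python
# Worst-case K8s tag-injection delay, per the description above.
OLD_TIMERS = 5          # five independent timers on the old code path
OLD_PERIOD_S = 60       # each fired once per minute
old_worst_case = OLD_TIMERS * OLD_PERIOD_S  # 300 s = 5 minutes

NEW_FAST_PERIOD_S = 10  # the single 10-second timer
NEW_SLOW_PERIOD_S = 60  # the single 1-minute timer
# The agent's list/watch of K8s resources may span two fast cycles,
# so up to two 10-second periods are added to the 1-minute timer.
new_worst_case = NEW_SLOW_PERIOD_S + 2 * NEW_FAST_PERIOD_S  # 80 s
```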

#5.2 System

  • SQL API
    • ⭐ Optimized the Percentile operator for Delay and BoundedGauge type metrics, reducing the compiled ClickHouse SQL to a single nesting level.
    • Modified ClickHouse table names and field names. See the table at the end (deprecated names can still be used, but will no longer be supported starting from v7.0).
    • Data in the flow_log and event databases support precise search using the _id field.
    • Simplified the query semantics of map type fields for easier user understanding.
  • Server
    • ⭐ Added Kafka Exporter data export method. See documentation. Supports exporting the following observability signals:
      • Metrics: flow_metrics.application* (application performance metrics/access relationships), flow_metrics.network* (network performance metrics/access relationships).
      • Logs: flow_log.l4_flow_log (network flow logs), flow_log.l7_flow_log (application call logs).
      • Events: event.perf_event (file read/write events).
    • Prometheus Remote Write supports exporting metrics from flow_metrics.application* and flow_metrics.network*.
    • Added a global configuration for whether the Agent requests the Server NAT IP, suitable for scenarios where all Agents request the Server through the public network.
    • Added a Token management page and optimized the Token timeout determination mechanism.
    • Traffic distribution strategies support export and import.
  • Agent
    • ⭐ Enabled system load circuit breaker mechanism by default. When the ratio of system load to CPU cores exceeds system_load_circuit_breaker_threshold, the Agent triggers the circuit breaker mechanism, automatically entering a disabled state and alerting. Configuration details can be found in the Agent configuration sample.
    • ⭐ Optimized Redis and MySQL protocol parsing performance: after optimization, an Agent with 1 CPU and 300MB memory can collect 50K TPS MySQL or Redis traffic.
    • Added flow-count-limit configuration parameter to prevent the agent from consuming too much memory under sudden traffic, avoiding triggering the OOM Killer.
    • ⭐ Improved HTTP2 Huffman decoding performance. Under the condition of limited to 1 logical core, the extreme TPS collection performance increased by 5 to 25 times. Test data is shown in the table below.
    • ⭐ Support configuring call log blacklist to reduce storage consumption, eliminate large delay metrics interference from health checks, and eliminate DNS NXDOMAIN anomaly interference.
    • ⭐ Support eBPF data out-of-order reordering and segment reassembly, enhancing the success rate of application protocol parsing.
    • Support collecting traffic from Open vSwitch Bond sub-interfaces and correctly aggregating them into flow logs and call logs.
    • Dedicated collectors support stripping ERSPAN, TEB, and VXLAN tunnel encapsulation from mirrored traffic.
    • Improved eBPF collection performance [test data to be supplemented].
    • Added 6443 (default port for K8s apiserver) to the default parsed ports for the TLS protocol.
    • Allowed low-privilege debug commands to be executed remotely on collectors.
  • Deployment
  • Usability Improvements
    • ⭐ Added AskGPT Copilot to DeepFlow Topo and DeepFlow Tracing Panel in Grafana: Demo1 (opens new window), Demo2 (opens new window). Currently supported models include GPT4, Tongyi Qianwen, Wenxin Yiyan, ChatGLM.
    • Links in the page are now real URLs and can be opened in a new tab via the right-click menu.
    • Shortened page URLs.
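
The system load circuit breaker condition described above can be sketched as follows; the default threshold value used here is illustrative, not the Agent's shipped default.

```python
# Sketch of the circuit-breaker condition: trip when the ratio of system
# load to CPU core count exceeds the configured threshold. The default
# threshold of 1.0 below is an illustrative assumption.
def should_trip(load_1min: float, cpu_cores: int,
                system_load_circuit_breaker_threshold: float = 1.0) -> bool:
    return load_1min / cpu_cores > system_load_circuit_breaker_threshold

# Example: load 6.0 on a 4-core host with threshold 1.0 -> trips.
tripped = should_trip(6.0, 4)
```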

HTTP2 Collection Performance Comparison Test:

| Random Header Count | Version | Agent CPU | Agent Memory | TPS |
| --- | --- | --- | --- | --- |
| 3 | OLD | 96% | 34 MB | 10K |
| 3 | NEW | 97% | 94 MB | 50K |
| 12 | OLD | 89% | 9 MB | 1.2K |
| 12 | NEW | 93% | 112 MB | 30K |

#5.3 Account

  • Multi-Tenant Support
    • ⭐ Support creating multiple isolated organizations to meet the isolation needs of large enterprises with multiple subsidiaries and business units, and support joint operation of SaaS services with industry clouds.
    • Support setting account permissions for tenants, including four roles: owner, maintainer, user, and guest.
    • Support dividing tenant accounts into teams according to the organizational structure and setting the visibility of resources within the team.
    • Support Google and GitHub account SSO.
  • Usability Improvements
    • Added a preference settings page to configure the behavior of the search box trigger method, search box display form, icon display, etc.

#6. Compatibility

#6.1 Incompatible Changes

  • eBPF AutoProfiling
  • AutoTagging
    • Security group information in cloud resources will no longer be synchronized.
  • Server
  • Agent
    • The static configuration item src-interfaces has been merged into the dynamic configuration item tap_interface_regex, reducing configuration complexity in scenarios such as MACVlan, Huawei Cloud CCE Turbo, VMware, etc.
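
To illustrate how a single tap_interface_regex can cover the interfaces previously listed in src-interfaces, here is a sketch with hypothetical interface names (the pattern below is an example, not a shipped default):

```python
import re

# Hypothetical tap_interface_regex covering physical NICs and MACVlan
# sub-interfaces; names and pattern are illustrative examples only.
tap_interface_regex = r"^(eth\d+|ens\d+|macvlan\d+)$"

interfaces = ["eth0", "ens192", "macvlan1", "lo", "docker0"]
captured = [i for i in interfaces if re.match(tap_interface_regex, i)]
```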

#6.2 Compatible Changes

Note: The old names below continue to work in v6.5 but will no longer be supported starting from v7.0.

Changes to table names in the ClickHouse flow_metrics database:

| Old Name | New Name | Data Role |
| --- | --- | --- |
| vtap_app_port | application | Application performance metrics for all services |
| vtap_app_edge_port | application_map | Application access relationships and their performance metrics |
| vtap_flow_port | network | Network performance metrics for all services |
| vtap_flow_edge_port | network_map | Network access relationships and their performance metrics |
| vtap_acl | traffic_policy | Network policy metrics (Enterprise Edition only) |

Changes to field names in the ClickHouse database:

| Old Name | New Name | Data Role |
| --- | --- | --- |
| vtap | agent | Agent |
| vtap_id | agent_id | Agent ID |
| tap_side | observation_point | Observation point |
| tap | capture_network_type | Network location (Enterprise Edition only) |
| tap_port | capture_nic | Capture NIC identifier |
| tap_port_name | capture_nic_name | Capture NIC name |
| tap_port_type | capture_nic_type | Capture NIC type |
| tap_port_host | capture_nic_host | Host of the capture NIC (Enterprise Edition only) |
| tap_port_chost | capture_nic_chost | Cloud server of the capture NIC |
| tap_port_pod_node | capture_nic_pod_node | Container node of the capture NIC |
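
Migrating saved queries to the new names ahead of v7.0 can be sketched as a whole-word rename driven by the tables above (naive string replacement; a real migration should use a SQL parser):

```python
import re

# Subset of the rename tables above, old name -> new name.
RENAMES = {
    "vtap_app_port": "application",
    "vtap_app_edge_port": "application_map",
    "vtap_flow_port": "network",
    "vtap_flow_edge_port": "network_map",
    "vtap_acl": "traffic_policy",
    "vtap_id": "agent_id",
    "tap_side": "observation_point",
}

def migrate(sql: str) -> str:
    # Whole-word replacement of deprecated identifiers in a query string.
    pattern = r"\b(" + "|".join(RENAMES) + r")\b"
    return re.sub(pattern, lambda m: RENAMES[m.group(1)], sql)

migrated = migrate("SELECT tap_side FROM vtap_app_edge_port")
```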

#7. Documentation