Reliability - Stronger Circuit Breakers and Other Safeguards, Plus a New Alert Analysis Dashboard

Xiang Yang (向阳) 2024-06-17

Reliability

To improve the reliability of DeepFlow itself, v6.5 includes a series of changes in four main areas:

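A system load circuit breaker has been added to deepflow-agent, so that when the load on the host environment is too high, deepflow-agent adds no extra pressure and does not compete with business workloads for resources;
The CPU and memory limits of deepflow-agent can now be synchronized directly from agent-group-config to the K8s configuration, so K8s no longer needs to be configured by hand;
deepflow-server automatically monitors the free disk space of ClickHouse; when space runs low, it deletes old data automatically and raises an alert;
Both the Community and Enterprise editions gain a system Alert Analysis dashboard that shows the alert-related metrics of all components in one place, with the cause and remediation explained for each.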

0x0: New System Load Circuit Breaker in the Agent

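When the environment that deepflow-agent runs in is already under high load, running deepflow-agent can make matters worse by competing with business workloads for scarce resources. v6.5 adds a new circuit breaker to deepflow-agent: it continuously monitors the system load, and as soon as the load exceeds a preset threshold, deepflow-agent disables itself until the load drops below a given value and stays there.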

The system load circuit breaker is enabled by default. It involves the following three configurable parameters:

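system_load_circuit_breaker_threshold: the threshold that trips the circuit breaker. The value is compared against system_load / cpu_cores and defaults to 1.0, meaning that deepflow-agent enters the circuit-broken state when the system load is higher than the number of CPU cores.
system_load_circuit_breaker_recover: the threshold below which deepflow-agent is allowed to recover from the circuit-broken (disabled) state. The default is 0.9, and it is usually set to 90% of threshold. Note that deepflow-agent recovers on its own only after it has observed the system load staying below recover for five consecutive minutes.
system_load_circuit_breaker_metric: the system load metric used in the calculation above. The default is load15, and it can be set to one of load1, load5, or load15.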
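The quantity compared against threshold and recover can be sketched as follows. This is a minimal illustration and not the deepflow-agent implementation: the function names `normalized_load`, `current_normalized_load`, and `should_trip` are hypothetical, and treating a ratio exactly equal to the threshold as "not tripped" is an assumption based on the wording "higher than".

```python
import os

# The three load-average windows, in the order returned by os.getloadavg().
VALID_METRICS = ("load1", "load5", "load15")


def normalized_load(load_averages, cpu_cores, metric="load15"):
    """Return system_load / cpu_cores for the chosen load-average metric."""
    if metric not in VALID_METRICS:
        raise ValueError(f"metric must be one of {VALID_METRICS}")
    if cpu_cores <= 0:
        raise ValueError("cpu_cores must be positive")
    return load_averages[VALID_METRICS.index(metric)] / cpu_cores


def current_normalized_load(metric="load15"):
    """Read the live value on a Unix host (os.getloadavg is Unix-only)."""
    return normalized_load(os.getloadavg(), os.cpu_count() or 1, metric)


def should_trip(ratio, threshold=1.0):
    """Assumption: the breaker trips only when the ratio is strictly higher."""
    return ratio > threshold
```

For example, on a 4-core host with a load15 of 4.0 the ratio is 1.0, which does not exceed the default threshold; a load1 of 8.0 on the same host gives a ratio of 2.0, which does.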

Figure: How the deepflow-agent system load circuit breaker works

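As shown in the figure above, the time range with the red background is the period during which deepflow-agent is in the circuit-broken (disabled) state. From the moment the system load exceeds threshold (time t0) until it has stayed below recover for 5 consecutive minutes (time t4), deepflow-agent remains disabled. Looking closely at this mechanism, we can observe the following: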

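When the load is merely below threshold (for example at t1), deepflow-agent does not resume. This ensures that deepflow-agent does not flap between disabled and enabled while the load oscillates around threshold;
When the load dips below recover only occasionally (for example at t2), deepflow-agent does not resume either. This ensures that it stays disabled while the load oscillates around recover, because we consider that the risk may not yet be fully cleared.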

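When the system is under high load, automatically disabling deepflow-agent lowers the overall load, improves the chance that the system recovers on its own, and avoids competing with business workloads for resources. Moreover, in real environments system load usually rises gradually, and deepflow-agent checks the load every 10s, so even if deepflow-agent disables itself automatically, the data collected before that point is usually sufficient for troubleshooting.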
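The trip-and-recover behavior described in this section amounts to hysteresis with a hold timer. The sketch below models it under stated assumptions: the class `LoadCircuitBreaker` and its fields are hypothetical names, the five-minute window is assumed to start at the first sample below recover, and any sample at or above recover is assumed to reset that window. It is not the agent's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadCircuitBreaker:
    threshold: float = 1.0       # trip when load / cpu_cores exceeds this
    recover: float = 0.9         # must stay below this to recover
    hold_seconds: float = 300.0  # five minutes of sustained low load
    disabled: bool = False
    below_since: Optional[float] = None  # start of the current low-load streak

    def observe(self, now: float, ratio: float) -> bool:
        """Feed one sample; return True while the agent should stay disabled."""
        if not self.disabled:
            if ratio > self.threshold:
                self.disabled = True
                self.below_since = None
            return self.disabled
        if ratio < self.recover:
            if self.below_since is None:
                self.below_since = now
            elif now - self.below_since >= self.hold_seconds:
                self.disabled = False
                self.below_since = None
        else:
            # Load rose back to recover or above: restart the hold timer.
            self.below_since = None
        return self.disabled
```

Feeding it one sample every 10 seconds, a brief dip below recover followed by a rebound leaves the breaker tripped, and it only releases once the low-load streak reaches 300 seconds.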

0x1: Stronger Resource Limits for the Agent

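Previously, the maximum CPU and memory that deepflow-agent may use in a K8s environment had to be configured through the K8s API (for example with the kubectl command). v6.5 brings two improvements: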

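The max_cpus and max_memory settings in the DeepFlow agent-group-config are now synchronized directly to K8s, so K8s no longer needs to be configured by hand.
agent-group-config adds max_millicpus, which provides finer-grained CPU limiting.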

For details, see the Agent configuration example:

# CPU Limit (in CPU Cores)
# Unit: number of logical cores. Default: 1. Range: [1, 1000]
# Note: deepflow-agent uses cgroups to limit CPU usage. 1 cpu = 1 core.
#   The actual CPU limit is based on the lesser of max_cpus and max_millicpus.
#   For example, if max_cpus = 2 and max_millicpus = 1500, the actual CPU limit
#   would be 1.5 cores.
max_cpus: 1

# CPU Limit (in MilliCPUs)
# Unit: number of millicpus. Default: 1000. Range: [1, 1000000]
# Note: deepflow-agent uses cgroups to limit CPU usage. 1 millicpu = 1 millicore = 0.001 core.
#   The actual CPU limit is based on the lesser of max_cpus and max_millicpus.
#   For example, if max_cpus = 2 and max_millicpus = 1500, the actual CPU limit
#   would be 1.5 cores.
max_millicpus: 1000

# Memory Limit
# Unit: M bytes. Default: 768. Range: [128, 100000]
# Note: deepflow-agent uses cgroups to limit memory usage.
max_memory: 768
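The "lesser of max_cpus and max_millicpus" rule in the comments above can be worked through in a few lines. This is only an illustration of the arithmetic: the function name `effective_cpu_limit_cores` is hypothetical, and the range checks simply mirror the ranges stated in the comments.

```python
def effective_cpu_limit_cores(max_cpus=1, max_millicpus=1000):
    """Return the effective CPU limit in cores.

    Both settings are converted to millicpus (1 core = 1000 millicpus)
    and the smaller of the two wins.
    """
    if not 1 <= max_cpus <= 1000:
        raise ValueError("max_cpus must be in [1, 1000]")
    if not 1 <= max_millicpus <= 1000000:
        raise ValueError("max_millicpus must be in [1, 1000000]")
    return min(max_cpus * 1000, max_millicpus) / 1000
```

With max_cpus = 2 and max_millicpus = 1500 this yields 1.5 cores, matching the example in the configuration comments; with the defaults it yields 1.0 core.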

0x2: Stronger Protection of Free Disk Space on the Server

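Users can set a different retention period for each kind of observability data, but if disk space runs out before the retention period is reached, deepflow-server forcibly deletes the oldest data. v6.5 strengthens this forced-deletion mechanism, whose current configuration items in server.yaml are as follows: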

ck-disk-monitor:
  check-interval: 180 # check time interval (unit: seconds)
  ttl-check-disabled: false # whether to not check TTL expired data
  # When the disk space is insufficient, the disk occupancy > 'used-percent' and
  # the disk idle < 'free-space' are met at the same time, or the disk occupancy > 'used-space',
  # then the data is cleaned up
  disk-cleanups:
  - disk-name-prefix: default # monitor disks starting with 'disk-name-prefix', check the disks 'select * from system.disks'
    used-percent: 80          #  disk usage threshold, ranges: 0-100
    free-space: 300           #  unit: GB, disk minimum free threshold
    used-space: 0             #  unit: GB, disk maximum usage space threshold. (If it is 0, it means ignore the condition)
  - disk-name-prefix: path_   # monitor disks starting with 'disk-name-prefix', check the disks 'select * from system.disks'
    used-percent: 80          #  disk usage threshold, ranges: 0-100
    free-space: 300           #  unit: GB, disk minimum free threshold
    used-space: 0             #  unit: GB, disk maximum usage space threshold. (If it is 0, it means ignore the condition)
  priority-drops:   # set which database and table data will be deleted first when disk is full
  - database: flow_log
    tables-contain: # tables name containing the string will be priority-dropped. If it is empty, it means all the tables in this database
  - database: flow_metrics
    tables-contain: 1s_local
  - database: profile
  - database: application_log

As the configuration shows, the following behaviors can be defined:

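The interval between disk space checks
Which disks are checked
How low free disk space must fall (as an absolute value and as a percentage) before forced deletion is triggered
How much used disk space triggers forced deletion
Which kinds of data may be deleted when disk space runs low, and in what priority order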
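The cleanup condition stated in the configuration comments (usage above used-percent and free space below free-space at the same time, or usage above used-space) can be expressed as a small predicate. The sketch below is an interpretation of those comments rather than deepflow-server's code: `DiskCleanupRule` and `needs_cleanup` are hypothetical names, and the strict comparisons are an assumption.

```python
from dataclasses import dataclass


@dataclass
class DiskCleanupRule:
    disk_name_prefix: str
    used_percent: float = 80    # usage threshold, 0-100
    free_space_gb: float = 300  # minimum free space, in GB
    used_space_gb: float = 0    # maximum used space, in GB; 0 disables it


def needs_cleanup(rule, disk_name, total_gb, free_gb):
    """Return True if this disk matches the rule and should be cleaned up."""
    if not disk_name.startswith(rule.disk_name_prefix) or total_gb <= 0:
        return False
    used_gb = total_gb - free_gb
    used_pct = used_gb * 100 / total_gb
    # Condition 1: both the percentage and the absolute free space are bad.
    low_space = used_pct > rule.used_percent and free_gb < rule.free_space_gb
    # Condition 2: absolute usage cap, ignored when set to 0.
    over_cap = rule.used_space_gb > 0 and used_gb > rule.used_space_gb
    return low_space or over_cap
```

Under the default rule, a 1000 GB disk with 150 GB free qualifies (85% used and less than 300 GB free), while a 10000 GB disk with 1500 GB free does not, because its free space is still above the 300 GB floor even though usage is 85%.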

0x3: New Alert Analysis Dashboard

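In v6.5 we added a system Alert Analysis dashboard to both the Community and Enterprise editions. It helps users check the health of the whole system more quickly and guides them through handling the alerts it shows. In the figures below, ① describes the dashboard, ② explains each alert metric (including what triggers it and how to handle it), and ③ shows the values of abnormal metrics (no data is shown when an abnormal metric is 0).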

Figure: DeepFlow Community Edition Alert Analysis

Figure: DeepFlow Enterprise Edition system alert analysis

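In addition, deepflow-agent and deepflow-server gain more memory protection: data structures such as Hashmap and Dict that were not previously covered by any safeguard now have a size limit, which prevents uncontrolled memory growth in extreme cases. When one of these data structures reaches its limit, the corresponding alert metric also appears in the system Alert Analysis dashboard.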

0x4: What Is DeepFlow

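DeepFlow is an observability product developed by Yunshan Networks (云杉网络) that aims to provide deep observability for complex cloud-native and AI applications. Built on eBPF, DeepFlow collects observability signals such as application performance metrics, distributed traces, and continuous profiling with Zero Code instrumentation, and uses its SmartEncoding tagging technology to correlate all signals across the Full Stack and to store and query them efficiently. With DeepFlow, cloud-native and AI applications become deeply observable automatically, which relieves developers of the heavy burden of continual instrumentation and gives DevOps/SRE teams monitoring and diagnostic capabilities from code down to infrastructure.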

GitHub: https://github.com/deepflowio/deepflow

Visit the DeepFlow Demo to experience Zero Code, Full Stack observability.