PromQL API

Created: 2024-11-05, Last Modified: 2024-11-05

This document was translated by ChatGPT

# 1. Introduction

Starting from v6.2.1, DeepFlow supports PromQL. The following Prometheus APIs are currently implemented and can be called directly over HTTP, following the Prometheus HTTP API definition:

| HTTP Method | Path | Prometheus API | Description |
| --- | --- | --- | --- |
| GET/POST | /prom/api/v1/query | /api/v1/query | Query data at a single point in time |
| GET/POST | /prom/api/v1/query_range | /api/v1/query_range | Query data over a time range |
| GET | /prom/api/v1/label/:labelName/values | /api/v1/label/<label_name>/values | Get all values of a given label |
| GET/POST | /prom/api/v1/series | /api/v1/series | Get all time series |
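
As a quick illustration, the URL for the label-values endpoint can be assembled from the path in the table. The host, port, and label name below are placeholder values for illustration, not defaults shipped by DeepFlow:

```shell
# Sketch: build the label-values URL for one label.
# deepflow_server_node_ip and port are assumed to be set as shown in
# section 1.1; the fallbacks here are illustrative placeholders only.
base="http://${deepflow_server_node_ip:-127.0.0.1}:${port:-30417}"
label_name="response_code"
echo "${base}/prom/api/v1/label/${label_name}/values"
```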

## 1.1 Calling Method

You can call the API in DeepFlow as follows:

Get the server endpoint port number:

```shell
port=$(kubectl get --namespace deepflow -o jsonpath="{.spec.ports[0].nodePort}" services deepflow-server)
```

Example of API call:

Instant Query:

```shell
time=$(date +%s)

curl -XPOST "http://${deepflow_server_node_ip}:${port}/prom/api/v1/query" \
--data-urlencode "query=sum(flow_log__l7_flow_log__server_error) by(request_resource, response_code)" \
--data-urlencode "time=${time}"
```

Range Query:

```shell
end=$(date +%s)
start=$((end - 600))

curl -XPOST "http://${deepflow_server_node_ip}:${port}/prom/api/v1/query_range" \
--data-urlencode "query=sum(flow_log__l7_flow_log__server_error) by(request_resource, response_code)" \
--data-urlencode "start=${start}" \
--data-urlencode "end=${end}" \
--data-urlencode "step=60s"
```
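
The start/end arithmetic above can be wrapped in a small helper. This is only a sketch; the function name and default window are ours, not part of DeepFlow:

```shell
# Sketch: print start and end timestamps for a range query covering
# the last $1 seconds (default 600, matching the example above).
range_params() {
  local window="${1:-600}" end start
  end=$(date +%s)
  start=$((end - window))
  echo "${start} ${end}"
}

# Usage: read the two values, then pass them as start=/end= parameters.
read -r start end <<<"$(range_params 600)"
```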

## 1.2 DeepFlow Metric Definitions

When exposed through PromQL, DeepFlow metric names are constructed in the format ${database}__${table}__${metric}__${data_precision}. The target data source to query can be determined from the definition of AutoMetrics Metric Types. The specific rules are as follows:

| db | metrics |
| --- | --- |
| flow_log | {db}__{table}__{metric} |
| flow_metrics (data_precision values are 1m/1s) | {db}__{table}__{metric}__{data_precision} |
| prometheus (data written via Prometheus RemoteWrite) | prometheus__samples__{metric} |

For example:

  • flow_metrics__application__request__1m: Represents the number of application layer requests aggregated per minute
  • flow_metrics__network__tcp_timeout__1s: Represents the number of network layer TCP timeouts aggregated per second
  • flow_log__l7_flow_log__error: Represents the number of application layer errors
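
To make the naming rule concrete, here is a small shell sketch (the helper name is ours, not a DeepFlow tool) that composes a metric name from its parts, appending data_precision only when one is given:

```shell
# Sketch: compose a DeepFlow metric name per the rules above.
# data_precision is appended only when present (flow_metrics sources).
deepflow_metric_name() {
  local db="$1" table="$2" metric="$3" precision="$4"
  if [ -n "$precision" ]; then
    echo "${db}__${table}__${metric}__${precision}"
  else
    echo "${db}__${table}__${metric}"
  fi
}

deepflow_metric_name flow_metrics application request 1m   # flow_metrics__application__request__1m
deepflow_metric_name flow_log l7_flow_log error            # flow_log__l7_flow_log__error
```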

## 1.3 Known Limitations

In Grafana panel operations, the following limitations are currently known:

  • Labels cannot be queried directly; select a metric first, then choose labels
  • When reselecting a metric, first remove all previously selected labels
  • Queries will fail if the metric name contains characters such as . or - that are not supported by Prometheus

When querying directly based on PromQL or writing alert rules, the following limitations are currently known:

  • Metric names cannot be matched with the =~/!~ regex operators
  • For metrics provided by DeepFlow, you must first determine the aggregation method through an aggregation operator before querying a specific metric. The stdvar, topk, bottomk, and quantile operators are not yet supported; they will be added in future iterations.

# 2. Querying DeepFlow Metrics Based on PromQL

Based on the definitions above, we can query DeepFlow metrics using PromQL. Remember to determine the aggregation method first, for example:

  • Query the time series trend of all requests with HTTP 500 response codes:

```promql
sum(flow_log__l7_flow_log__server_error{response_code="500"}) by(request_resource, response_code)
```

  • If we want to see the time series trend for each specific service, add a service grouping:

```promql
sum(flow_log__l7_flow_log__server_error{response_code="500"}) by(auto_service_1, request_resource, response_code)
```

  • Query the trend of TCP connection delay changes over the past 5 minutes with a 10s evaluation interval, grouped by service:

```promql
rate(sum(flow_metrics__network_map__rtt__1s)by(auto_service_1)[5m:10s])
```

  • Query the trend of average application delay over the past 10 minutes with a 1m evaluation interval, grouped by service:

```promql
avg_over_time(avg(flow_metrics__application__rrt__1m)by(auto_service)[10m:1m])
```

# 3. Implementing Prometheus Alerts Based on DeepFlow Metrics

Building on the examples above, once Prometheus RemoteRead is configured, you can define alert rules in Prometheus based on these metrics, for example:

  • Alert for requests with latency > 1s and lasting more than 1m:
```yaml
groups:
  - name: requestMonitoring
    rules:
      - alert: requestDelayAlert
        expr: avg(flow_metrics__application__rrt__1m)by(l7_protocol, auto_service, auto_instance) / 10^6 > 1
        for: 1m
        annotations:
          summary: 'High Request Latency'
          description: '{{ $labels.auto_instance }} request to {{ $labels.auto_service }} has a high request latency above 1s (current value: {{ $value }}s)'
```

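The / 10^6 in the alert expression converts rrt, which the threshold comparison implies is reported in microseconds, into seconds before comparing against the 1 s limit. The same conversion, sketched in shell with a made-up sample value:

```shell
# rrt is compared against a threshold in seconds after dividing by
# 10^6 (1,000,000), i.e. microseconds -> seconds.
rrt_us=1500000   # hypothetical sample: 1.5 s of latency
rrt_s=$(awk -v us="$rrt_us" 'BEGIN { printf "%.2f", us / 1000000 }')
echo "$rrt_s"    # 1.50
```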
We will also support configuring alert rules directly in DeepFlow in future iterations, so stay tuned.