Grafana & Alerting

Ready-to-import Grafana dashboards and Prometheus alerting rules built on the proxy and agent metrics. For the full metric reference and what each panel watches, see Monitoring.

Prerequisite

Metrics must be enabled (--metrics / METRICS_ENABLED=true) and a Prometheus datasource must be scraping both the proxy's and the agents' internal /metrics endpoints. See Scraping Internal Metrics.

Dashboards

Two dashboards ship in the repository:

File	Covers
`grafana/prometheus-proxy.json`	Proxy health, throughput, latency, payload sizes, internal state, errors
`grafana/prometheus-agents.json`	Agent health, connections, scrape activity, per-agent latency

Importing

In Grafana (10.0+): Dashboards → Import, then either upload the JSON file or paste the raw contents and pick your Prometheus datasource:

# Proxy dashboard
curl -O https://raw.githubusercontent.com/pambrose/prometheus-proxy/master/grafana/prometheus-proxy.json

# Agents dashboard
curl -O https://raw.githubusercontent.com/pambrose/prometheus-proxy/master/grafana/prometheus-agents.json

The per-panel breakdown of each dashboard is documented under Grafana Dashboards.

Alerting rules

Drop the following into a Prometheus rule_files group. The thresholds are starting points — tune them to your environment. Each rule is grounded in a metric documented in Monitoring.

groups:
  - name: prometheus-proxy
    rules:
      - alert: ProxyScrapeSuccessRateLow
        expr: |
          sum(rate(proxy_scrape_requests{type="success"}[5m]))
            / sum(rate(proxy_scrape_requests[5m])) < 0.99
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Proxy scrape success rate below 99%"

      - alert: ProxyScrapeLatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(proxy_scrape_request_latency_seconds_bucket[5m]))) > 2
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Proxy P99 scrape latency above 2s"

      - alert: ProxyNoAgentsConnected
        expr: proxy_agent_map_size == 0
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "No agents connected to the proxy"

      - alert: ProxyBacklogGrowing
        expr: proxy_cumulative_agent_backlog_size > 100
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Proxy agent scrape backlog is large ({{ $value }} queued)"

      - alert: ProxyPayloadTooLarge
        expr: rate(proxy_scrape_requests{type="payload_too_large"}[5m]) > 0
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Scrapes are being rejected for exceeding the size limit"

      - alert: ProxyFrequentEvictions
        expr: rate(proxy_eviction_count[5m]) > 0
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "Proxy is evicting stale agents repeatedly"

  - name: prometheus-agent
    rules:
      - alert: AgentConnectFailures
        expr: rate(agent_connect_count{type="failure"}[5m]) > 0
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Agent {{ $labels.launch_id }} is failing to connect to the proxy"

      - alert: AgentBacklogGrowing
        expr: agent_scrape_backlog_size > 50
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Agent {{ $labels.launch_id }} scrape backlog is large"

Detecting restarts

proxy_start_time_seconds and agent_start_time_seconds carry a per-process launch_id; a change in launch_id (or a sudden reset of *_start_time_seconds) flags a restart on a dashboard or alert.

Grafana & Alerting

Dashboards

Importing

Alerting rules

See also