
Monitoring & Observability

Both the proxy and agent expose their own operational metrics and admin endpoints.

Enabling Metrics

Metrics are disabled by default. Enable them via CLI, environment variables, or config file.

proxy {
  metrics {
    enabled = true
    port = 8082
    path = "metrics"
  }

  admin {
    enabled = true
    port = 8092
  }
}

Or via the --metrics CLI flag or the METRICS_ENABLED=true environment variable.

Default endpoint: http://proxy-host:8082/metrics

agent {
  metrics {
    enabled = true
    port = 8083
    path = "metrics"
  }

  admin {
    enabled = true
    port = 8093
  }
}

Or via the --metrics CLI flag or the METRICS_ENABLED=true environment variable.

Default endpoint: http://agent-host:8083/metrics

Scraping Internal Metrics

Add these scrape jobs to your prometheus.yml:

scrape_configs:
  # Scrape proxy internal metrics
  - job_name: 'prometheus-proxy'
    metrics_path: /metrics
    static_configs:
      - targets: ['proxy-host.example.com:8082']

  # Scrape agent internal metrics
  - job_name: 'prometheus-agent'
    metrics_path: /metrics
    static_configs:
      - targets: ['agent-host.example.com:8083']

JVM and gRPC Metrics

Both components support optional JVM and gRPC metrics:

proxy.metrics {
  enabled = true

  // Optional JVM metrics
  standardExportsEnabled = true
  memoryPoolsExportsEnabled = true
  garbageCollectorExportsEnabled = true
  threadExportsEnabled = true
  classLoadingExportsEnabled = false
  versionInfoExportsEnabled = false

  // Optional gRPC metrics
  grpc {
    metricsEnabled = true
    allMetricsReported = false    // true = include expensive metrics
  }
}

The same options are available under agent.metrics.
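When the JVM exports above are enabled, the standard Prometheus Java client series (e.g. jvm_gc_collection_seconds, jvm_memory_pool_bytes_used) are served from the same /metrics endpoint. A sample query sketch; the metric names come from the Java client's exporters, not from the proxy itself, and the job label assumes the scrape job name used later in this page:

```promql
# Proxy JVM garbage-collection time per second, by collector:
sum by (gc) (rate(jvm_gc_collection_seconds_sum{job="prometheus-proxy"}[5m]))
```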


Proxy Metrics

Counters

Metric                                   Labels  Description
proxy_scrape_requests                    type    Scrape request outcomes (see below)
proxy_connect_count                      --      Agent connection count
proxy_eviction_count                     --      Stale agent evictions
proxy_heartbeat_count                    --      Heartbeats received from agents
proxy_chunk_validation_failures_total    stage   Chunk integrity failures (chunk or summary)
proxy_chunked_transfers_abandoned_total  --      Chunked transfers abandoned mid-stream
proxy_agent_displacement_total           --      Path registrations that displaced another agent

proxy_scrape_requests type labels:

Value                  Meaning
success                Scrape completed successfully
timed_out              Agent did not respond within timeout
no_agents              No agents registered for the requested path
invalid_path           Requested path is empty or unrecognized
agent_disconnected     Agent stream closed before response was received
missing_results        Internal error: results object was null
path_not_found         Agent returned a non-200 status for the target
payload_too_large      Unzipped content exceeded size limit
invalid_gzip           Gzip decompression failed
proxy_not_running      Proxy is shutting down
invalid_agent_context  All agents for the path are in an invalid state

Histograms

Metric                                Labels          Description
proxy_scrape_request_latency_seconds  path            End-to-end scrape latency
proxy_scrape_response_bytes           path, encoding  Response payload size after decompression

Latency buckets: 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s

Response size buckets: 1KB, 10KB, 100KB, 500KB, 1MB, 5MB, 10MB
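For reference, PromQL's histogram_quantile estimates a quantile from these buckets by linear interpolation inside the bucket where the target rank falls. A minimal Python sketch of that interpolation; the bucket boundaries and counts below are made-up illustration values, not output from a real proxy:

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs, sorted by
    bound and ending with (math.inf, total). Mirrors Prometheus's
    linear interpolation within the bucket containing the target rank.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # rank falls in the +Inf bucket
            span = count - prev_count
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / span
        prev_bound, prev_count = bound, count

# Example: 50 requests under 100ms, 50 more between 100ms and 500ms.
buckets = [(0.1, 50), (0.5, 100), (math.inf, 100)]
print(histogram_quantile(0.99, buckets))  # ~0.492
```

This is why a P99 that "creeps up" may jump in steps: the estimate can only move within the fixed bucket boundaries listed above.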

Gauges

Metric                               Description
proxy_start_time_seconds             Proxy start time (Unix epoch)
proxy_agent_map_size                 Number of connected agents
proxy_path_map_size                  Number of registered scrape paths
proxy_scrape_map_size                Number of in-flight scrape requests
proxy_chunk_context_map_size         Number of in-flight chunked transfers
proxy_cumulative_agent_backlog_size  Total queued scrape requests across all agents

Agent Metrics

Counters

Metric                      Labels           Description
agent_scrape_request_count  launch_id, type  Scrape requests processed
agent_scrape_result_count   launch_id, type  Results sent (non-gzipped, gzipped, chunked)
agent_connect_count         launch_id, type  Connection attempts (success, failure)

The launch_id label uniquely identifies each agent process lifetime.
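Because a restart mints a fresh launch_id, the label can also surface agent restarts. A query sketch, assuming the series are retained over the window and a Prometheus version that supports last_over_time:

```promql
# Distinct agent lifetimes seen over the last 24 hours:
count(count by (launch_id) (last_over_time(agent_start_time_seconds[24h])))
```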

Histograms

Metric                                Labels                 Description
agent_scrape_request_latency_seconds  launch_id, agent_name  Time to fetch from target endpoint

Gauges

Metric                     Labels     Description
agent_start_time_seconds   launch_id  Agent start time (Unix epoch)
agent_scrape_backlog_size  launch_id  Pending scrape requests queued
agent_client_cache_size    launch_id  Number of cached HTTP clients

Metric Flow

Prometheus --- HTTP GET ---> Proxy                        Agent
                              |                             |
                  latency.startTimer()                      |
                              |                             |
                  writeScrapeRequest() -- gRPC stream --> fetchScrapeUrl()
                              |                     agentLatency.startTimer()
                              |                             |
                              |                     HTTP GET to target
                              |                             |
                              |                     agentLatency.observeDuration()
                              |                     scrapeResultCount.inc()
                              |                             |
                  assignScrapeResults() <-- gRPC -----------+
                              |
                  responseBytes.observe()
                  latency.observeDuration()
                  scrapeRequestCount.labels(outcome).inc()
                              |
                <-- HTTP response ---

PromQL Examples

Scrape Success Rate

# Scrape success rate (last 5 minutes):
sum(rate(proxy_scrape_requests{type="success"}[5m]))
  / sum(rate(proxy_scrape_requests[5m])) * 100
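The same expression can back an alerting rule. A sketch, assuming rule files are already wired into your Prometheus config; the group name, alert name, threshold, and durations are illustrative choices, not project defaults:

```yaml
groups:
  - name: prometheus-proxy
    rules:
      - alert: ProxyScrapeSuccessRateLow
        expr: |
          sum(rate(proxy_scrape_requests{type="success"}[5m]))
            / sum(rate(proxy_scrape_requests[5m])) * 100 < 99
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Proxy scrape success rate below 99% for 10 minutes"
```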

P99 Scrape Latency

# P99 scrape latency:
histogram_quantile(0.99,
  sum by (le) (rate(proxy_scrape_request_latency_seconds_bucket[5m]))
)

P99 Latency Per Path

# P99 latency per path:
histogram_quantile(0.99,
  sum by (le, path) (rate(proxy_scrape_request_latency_seconds_bucket[5m]))
)

Error Rate by Type

# Error rate by type:
sum by (type) (rate(proxy_scrape_requests{type!="success"}[5m]))

Agent Latency by Name

# Agent scrape latency by agent name:
histogram_quantile(0.99,
  sum by (le, agent_name) (rate(agent_scrape_request_latency_seconds_bucket[5m]))
)

Admin Endpoints

Admin endpoints (when admin is enabled):

Proxy (default port 8092):
  GET /ping         - Returns "pong" (liveness check)
  GET /healthcheck  - Returns health status JSON
  GET /version      - Returns version info
  GET /threaddump   - Returns JVM thread dump
  GET /debug        - Returns proxy debug info (if debug enabled)

Agent (default port 8093):
  GET /ping         - Returns "pong" (liveness check)
  GET /healthcheck  - Returns health status JSON
  GET /version      - Returns version info
  GET /threaddump   - Returns JVM thread dump
  GET /debug        - Returns agent debug info (if debug enabled)

Enable admin endpoints:

java -jar prometheus-proxy.jar --admin
java -jar prometheus-agent.jar --admin
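The /ping endpoint makes a convenient liveness probe. A minimal Python sketch of such a probe; it is demonstrated here against a local stub that answers "pong" the way the admin endpoint does, since no proxy is running in this example. In a real deployment, point the URL at proxy-host:8092/ping or agent-host:8093/ping:

```python
import http.server
import threading
import urllib.request

class PingHandler(http.server.BaseHTTPRequestHandler):
    """Local stub standing in for the admin /ping endpoint."""
    def do_GET(self):
        if self.path == "/ping":
            body = b"pong"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, fmt, *args):  # keep the demo quiet
        pass

def is_alive(url, timeout=2.0):
    """Return True if the admin endpoint answers 200 'pong'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200 and resp.read().strip() == b"pong"
    except OSError:
        return False

# Demo against the stub on an ephemeral port.
server = http.server.HTTPServer(("127.0.0.1", 0), PingHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()
alive = is_alive(f"http://127.0.0.1:{port}/ping")
server.shutdown()
print(alive)  # True
```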

Grafana Dashboards

Importing Grafana dashboards:

1. In Grafana, go to Dashboards > Import
2. Upload the JSON file or paste its contents
3. Select your Prometheus datasource when prompted

Dashboard files in the repository:
  grafana/prometheus-proxy.json   - Proxy health, throughput, latency, errors
  grafana/prometheus-agents.json  - Agent health, scrape activity, per-agent latency

Requirements:
  - Grafana 10.0 or later
  - A Prometheus datasource scraping both proxy and agent metrics endpoints

Proxy Dashboard

Key panels to monitor:

Section         What to Watch
Overview        Success rate dropping below 99%, error count spikes
Throughput      Sudden changes in request volume or error ratio
Latency         P99 creeping up indicates slow targets or network issues
Payload         Unexpectedly large responses, gzip vs plain distribution
Internal State  Growing backlog means agents can't keep up
Errors          Which error types dominate, frequent evictions

Agents Dashboard

Key panels to monitor:

Section          What to Watch
Overview         Unexpected agent count changes
Connections      Failure spikes indicate proxy or network issues
Scrape Activity  Imbalanced load across agents
Latency          Per-agent latency outliers point to slow targets
Internals        Growing backlog means the agent is falling behind