# Monitoring & Observability
Both the proxy and agent expose their own operational metrics and admin endpoints.
## Enabling Metrics

Metrics are disabled by default. Enable them via CLI options, environment variables, or the config file.
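For example, a minimal config-file sketch (the port comments reflect the default metrics ports used in the scrape config below; treat the exact keys as assumptions if your version differs):

```hocon
proxy.metrics {
  enabled = true   // serve /metrics for the proxy (default port 8082)
}

agent.metrics {
  enabled = true   // serve /metrics for the agent (default port 8083)
}
```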
## Scraping Internal Metrics

Add these scrape jobs to your `prometheus.yml`:
```yaml
scrape_configs:
  # Scrape proxy internal metrics
  - job_name: 'prometheus-proxy'
    metrics_path: /metrics
    static_configs:
      - targets: ['proxy-host.example.com:8082']

  # Scrape agent internal metrics
  - job_name: 'prometheus-agent'
    metrics_path: /metrics
    static_configs:
      - targets: ['agent-host.example.com:8083']
```
## JVM and gRPC Metrics

Both components support optional JVM and gRPC metrics:
```hocon
proxy.metrics {
  enabled = true

  // Optional JVM metrics
  standardExportsEnabled = true
  memoryPoolsExportsEnabled = true
  garbageCollectorExportsEnabled = true
  threadExportsEnabled = true
  classLoadingExportsEnabled = false
  versionInfoExportsEnabled = false

  // Optional gRPC metrics
  grpc {
    metricsEnabled = true
    allMetricsReported = false // true = include expensive metrics
  }
}
```
The same options are available under `agent.metrics`.
## Proxy Metrics

### Counters
| Metric | Labels | Description |
|---|---|---|
| `proxy_scrape_requests` | `type` | Scrape request outcomes (see below) |
| `proxy_connect_count` | -- | Agent connection count |
| `proxy_eviction_count` | -- | Stale agent evictions |
| `proxy_heartbeat_count` | -- | Heartbeats received from agents |
| `proxy_chunk_validation_failures_total` | `stage` | Chunk integrity failures (chunk or summary) |
| `proxy_chunked_transfers_abandoned_total` | -- | Chunked transfers abandoned mid-stream |
| `proxy_agent_displacement_total` | -- | Path registrations that displaced another agent |
`proxy_scrape_requests` `type` labels:
| Value | Meaning |
|---|---|
| `success` | Scrape completed successfully |
| `timed_out` | Agent did not respond within timeout |
| `no_agents` | No agents registered for the requested path |
| `invalid_path` | Requested path is empty or unrecognized |
| `agent_disconnected` | Agent stream closed before response was received |
| `missing_results` | Internal error: results object was null |
| `path_not_found` | Agent returned a non-200 status for the target |
| `payload_too_large` | Unzipped content exceeded size limit |
| `invalid_gzip` | Gzip decompression failed |
| `proxy_not_running` | Proxy is shutting down |
| `invalid_agent_context` | All agents for the path are in an invalid state |
### Histograms
| Metric | Labels | Description |
|---|---|---|
| `proxy_scrape_request_latency_seconds` | `path` | End-to-end scrape latency |
| `proxy_scrape_response_bytes` | `path`, `encoding` | Response payload size after decompression |
Latency buckets: 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s
Response size buckets: 1KB, 10KB, 100KB, 500KB, 1MB, 5MB, 10MB
### Gauges
| Metric | Description |
|---|---|
| `proxy_start_time_seconds` | Proxy start time (Unix epoch) |
| `proxy_agent_map_size` | Number of connected agents |
| `proxy_path_map_size` | Number of registered scrape paths |
| `proxy_scrape_map_size` | Number of in-flight scrape requests |
| `proxy_chunk_context_map_size` | Number of in-flight chunked transfers |
| `proxy_cumulative_agent_backlog_size` | Total queued scrape requests across all agents |
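Gauges like the backlog size are queried directly, without `rate()`. For example, an alert-style sketch that flags a persistently non-empty backlog (the threshold of 10 is an illustrative assumption, not a project recommendation):

```promql
# Backlog stayed above 10 for the entire last 10 minutes:
min_over_time(proxy_cumulative_agent_backlog_size[10m]) > 10
```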
## Agent Metrics

### Counters
| Metric | Labels | Description |
|---|---|---|
| `agent_scrape_request_count` | `launch_id`, `type` | Scrape requests processed |
| `agent_scrape_result_count` | `launch_id`, `type` | Results sent (non-gzipped, gzipped, chunked) |
| `agent_connect_count` | `launch_id`, `type` | Connection attempts (success, failure) |
The `launch_id` label uniquely identifies each agent process lifetime.
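Because `launch_id` changes on every restart, it is usually aggregated away in dashboards; a PromQL sketch based on the counters above:

```promql
# Scrape request rate per outcome, ignoring process restarts:
sum without (launch_id) (rate(agent_scrape_request_count[5m]))
```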
### Histograms
| Metric | Labels | Description |
|---|---|---|
| `agent_scrape_request_latency_seconds` | `launch_id`, `agent_name` | Time to fetch from target endpoint |
### Gauges
| Metric | Labels | Description |
|---|---|---|
| `agent_start_time_seconds` | `launch_id` | Agent start time (Unix epoch) |
| `agent_scrape_backlog_size` | `launch_id` | Pending scrape requests queued |
| `agent_client_cache_size` | `launch_id` | Number of cached HTTP clients |
## Metric Flow

```
Prometheus --- HTTP GET ---> Proxy                           Agent
                               |                               |
                             latency.startTimer()              |
                               |                               |
                             writeScrapeRequest() -- gRPC stream --> fetchScrapeUrl()
                               |                             agentLatency.startTimer()
                               |                               |
                               |                             HTTP GET to target
                               |                               |
                               |                             agentLatency.observeDuration()
                               |                             scrapeResultCount.inc()
                               |                               |
                             assignScrapeResults() <-- gRPC ---+
                               |
                             responseBytes.observe()
                             latency.observeDuration()
                             scrapeRequestCount.labels(outcome).inc()
                               |
           <-- HTTP response --+
```
## PromQL Examples
### Scrape Success Rate

```promql
# Scrape success rate (last 5 minutes):
sum(rate(proxy_scrape_requests{type="success"}[5m]))
  / sum(rate(proxy_scrape_requests[5m])) * 100
```
### P99 Scrape Latency

```promql
# P99 scrape latency:
histogram_quantile(0.99,
  sum by (le) (rate(proxy_scrape_request_latency_seconds_bucket[5m]))
)
```
### P99 Latency Per Path

```promql
# P99 latency per path:
histogram_quantile(0.99,
  sum by (le, path) (rate(proxy_scrape_request_latency_seconds_bucket[5m]))
)
```
### Error Rate by Type
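A sketch for this example, following the pattern of the others and treating every non-`success` `type` value as an error:

```promql
# Error rate by type (last 5 minutes):
sum by (type) (rate(proxy_scrape_requests{type!="success"}[5m]))
```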
### Agent Latency by Name

```promql
# Agent scrape latency by agent name:
histogram_quantile(0.99,
  sum by (le, agent_name) (rate(agent_scrape_request_latency_seconds_bucket[5m]))
)
```
## Admin Endpoints

Admin endpoints (when admin is enabled):

Proxy (default port 8092):

- `GET /ping` - Returns "pong" (liveness check)
- `GET /healthcheck` - Returns health status JSON
- `GET /version` - Returns version info
- `GET /threaddump` - Returns JVM thread dump
- `GET /debug` - Returns proxy debug info (if debug enabled)

Agent (default port 8093):

- `GET /ping` - Returns "pong" (liveness check)
- `GET /healthcheck` - Returns health status JSON
- `GET /version` - Returns version info
- `GET /threaddump` - Returns JVM thread dump
- `GET /debug` - Returns agent debug info (if debug enabled)
Enable admin endpoints:
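A config-file sketch mirroring the metrics block above (the exact keys are assumptions based on the default ports listed, so check your version's config reference):

```hocon
proxy.admin {
  enabled = true
  port = 8092   // proxy admin default
}

agent.admin {
  enabled = true
  port = 8093   // agent admin default
}
```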
## Grafana Dashboards

Importing Grafana dashboards:

1. In Grafana, go to Dashboards > Import
2. Upload the JSON file or paste its contents
3. Select your Prometheus datasource when prompted
Dashboard files in the repository:

- `grafana/prometheus-proxy.json` - Proxy health, throughput, latency, errors
- `grafana/prometheus-agents.json` - Agent health, scrape activity, per-agent latency
Requirements:
- Grafana 10.0 or later
- A Prometheus datasource scraping both proxy and agent metrics endpoints
### Proxy Dashboard
Key panels to monitor:
| Section | What to Watch |
|---|---|
| Overview | Success rate dropping below 99%, error count spikes |
| Throughput | Sudden changes in request volume or error ratio |
| Latency | P99 creeping up indicates slow targets or network issues |
| Payload | Unexpectedly large responses, gzip vs plain distribution |
| Internal State | Growing backlog means agents can't keep up |
| Errors | Which error types dominate, frequent evictions |
### Agents Dashboard
Key panels to monitor:
| Section | What to Watch |
|---|---|
| Overview | Unexpected agent count changes |
| Connections | Failure spikes indicate proxy or network issues |
| Scrape Activity | Imbalanced load across agents |
| Latency | Per-agent latency outliers point to slow targets |
| Internals | Growing backlog means the agent is falling behind |