Refactor Grafana dashboard to use server_name label (#19337)

- Update `synapse_xxx` (server-level) metrics to use `server_name="$server_name",` instead of `instance="$instance"`
- Add `synapse_server_name_info` metric to map Synapse `server_name`s to the `instance`s they're hosted on.
- For process-level metrics, update to use `xxx * on (instance, job, index) group_left(server_name) synapse_server_name_info{server_name="$server_name"}`

All of the changes here are backwards compatible with whatever people were doing before with their Prometheus/Grafana dashboards.

Previously, the recommendation was to use the `instance` label to group everything under the same server (https://github.com/element-hq/synapse/blob/803e4b4d884b2de4b9e20dc47ffb59a983b8a4b5/docs/metrics-howto.md#L93-L147).

But the `instance` label has a special meaning, and we're abusing it by using it that way:

> `instance`: The `<host>:<port>` part of the target's URL that was scraped.
>
> *-- https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series*

Since https://github.com/element-hq/synapse/issues/18592 (Synapse `v1.139.0`), we now have the `server_name` label to use instead.

---

Additionally, the assumption that a single process is serving a single server is no longer true with [Synapse Pro for small hosts](https://docs.element.io/latest/element-server-suite-pro/synapse-pro-for-small-hosts/overview/).

Part of https://github.com/element-hq/synapse-small-hosts/issues/106

### Motivating use case

Although this change also benefits [Synapse Pro for small hosts](https://docs.element.io/latest/element-server-suite-pro/synapse-pro-for-small-hosts/overview/) (https://github.com/element-hq/synapse-small-hosts/issues/106), it actually stems from adding Prometheus metrics to our workerized Docker image (https://github.com/element-hq/synapse/pull/19324, https://github.com/element-hq/synapse/pull/19336) with a more correct label setup (without `instance`) and wanting the dashboard to work better with it.

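To illustrate how the `group_left` join from the summary attaches `server_name` to process-level metrics, here is a minimal pure-Python model of the matching. It is only a sketch: the function name `join_with_server_name` and all label values are hypothetical, and real PromQL additionally drops the metric name and rejects duplicate matches on the info side.

```python
# Toy model of the PromQL join:
#   xxx * on (instance, job, index) group_left(server_name)
#       synapse_server_name_info{server_name="$server_name"}
# All label values below are hypothetical sample data.

JOIN_LABELS = ("instance", "job", "index")


def join_with_server_name(process_samples, info_samples, server_name):
    """Attach `server_name` to process-level samples via the info metric.

    Each sample is a (labels_dict, value) pair. Only info samples matching
    `server_name` take part, mirroring the label matcher in the query.
    """
    # Index the info metric by the labels listed in `on (...)`.
    info_by_key = {}
    for labels, value in info_samples:
        if labels.get("server_name") != server_name:
            continue
        key = tuple(labels.get(name, "") for name in JOIN_LABELS)
        info_by_key[key] = value  # always 1 for an info-style metric

    joined = []
    for labels, value in process_samples:
        key = tuple(labels.get(name, "") for name in JOIN_LABELS)
        if key not in info_by_key:
            continue  # no matching info series: the sample is dropped
        # Multiplying by 1 keeps the value; group_left copies `server_name`.
        joined.append(
            ({**labels, "server_name": server_name}, value * info_by_key[key])
        )
    return joined


process_samples = [
    ({"instance": "host-a:19090", "job": "master", "index": "1"}, 0.25),
    ({"instance": "host-b:19090", "job": "master", "index": "1"}, 0.75),
]
info_samples = [
    ({"instance": "host-a:19090", "job": "master", "index": "1",
      "server_name": "example.com"}, 1),
    ({"instance": "host-b:19090", "job": "master", "index": "1",
      "server_name": "other.example"}, 1),
]

result = join_with_server_name(process_samples, info_samples, "example.com")
```

Only the sample scraped from `host-a:19090` survives, now carrying `server_name="example.com"`, which is how a dashboard panel can filter process-level series by homeserver without relabelling `instance`.
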
### Testing strategy

1. Make sure your firewall allows the Docker containers to communicate with the host (`host.docker.internal`) so they can access exposed ports of other Docker containers. We want to allow Synapse to access the Prometheus container and Grafana to access the Prometheus container.
   - `sudo ufw allow in on docker0 comment "Allow traffic from the default Docker network to the host machine (host.docker.internal)"`
   - `sudo ufw allow in on br-+ comment "(from Matrix Complement testing) Allow traffic from custom Docker networks to the host machine (host.docker.internal)"`
   - [Complement firewall docs](https://github.com/matrix-org/complement/blob/ee6acd9154bbae2d0071a9cb39593c0a5e37268b/README.md#potential-conflict-with-firewall-software)
1. Build the Docker image for Synapse: `docker build -t matrixdotorg/synapse -f docker/Dockerfile .` ([docs](https://github.com/element-hq/synapse/blob/7a24fafbc376b9bffeb3277b1ad4aa950720c96c/docker/README-testing.md#building-and-running-the-images-manually))
1. Generate config for Synapse:
   ```
   docker run -it --rm \
       --mount type=volume,src=synapse-data,dst=/data \
       -e SYNAPSE_SERVER_NAME=my.docker.synapse.server \
       -e SYNAPSE_REPORT_STATS=yes \
       -e SYNAPSE_ENABLE_METRICS=1 \
       matrixdotorg/synapse:latest generate
   ```
1. Start Synapse:
   ```
   docker run -d --name synapse \
       --mount type=volume,src=synapse-data,dst=/data \
       -p 8008:8008 \
       -p 19090:19090 \
       matrixdotorg/synapse:latest
   ```
1. You should be able to see metrics from Synapse at http://localhost:19090/_synapse/metrics
1. Create a Prometheus config (`prometheus.yml`):
   ```yaml
   global:
     scrape_interval: 15s
     scrape_timeout: 15s
     evaluation_interval: 15s

   scrape_configs:
     - job_name: prometheus
       scrape_interval: 15s
       metrics_path: /_synapse/metrics
       scheme: http
       static_configs:
         - targets:
             # This should point to the Synapse metrics listener (we're using `host.docker.internal` because this is from within the Prometheus container)
             - host.docker.internal:19090
   ```
1. Start Prometheus (update the volume bind mount to point at the config you just saved):
   ```
   docker run \
       --detach \
       --name=prometheus \
       --add-host host.docker.internal:host-gateway \
       -p 9090:9090 \
       -v ~/Documents/code/random/prometheus-config/prometheus.yml:/etc/prometheus/prometheus.yml \
       prom/prometheus
   ```
1. Make sure you're seeing some data in Prometheus. On http://localhost:9090/query, search for `synapse_build_info`
1. Start [Grafana](https://hub.docker.com/r/grafana/grafana):
   ```
   docker run -d --name=grafana --add-host host.docker.internal:host-gateway -p 3000:3000 grafana/grafana
   ```
1. Visit the Grafana dashboard, http://localhost:3000/ (Credentials: `admin`/`admin`)
1. **Connections** -> **Data Sources** -> **Add data source** -> **Prometheus**
   - Prometheus server URL: `http://host.docker.internal:9090`
1. Import the Synapse dashboard: `contrib/grafana/synapse.json`

To test workers, you can use the testing strategy from https://github.com/element-hq/synapse/pull/19336 (assumes the changes from this PR and the other PR are combined).
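
As an optional extra check alongside the manual steps above, the new metric can be looked for in the scraped text directly. The sketch below parses Prometheus exposition text with the standard library only; the helper name `server_names_from_metrics` and the `SAMPLE` excerpt are hypothetical, and the exact output of a real Synapse instance may differ.

```python
import re

# Matches lines such as: synapse_server_name_info{server_name="example.com"} 1.0
_INFO_LINE = re.compile(
    r'^synapse_server_name_info\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
# Matches individual label pairs like: server_name="example.com"
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def server_names_from_metrics(text):
    """Return the `server_name` values exposed by `synapse_server_name_info`."""
    names = []
    for line in text.splitlines():
        match = _INFO_LINE.match(line.strip())
        if match is None:
            continue  # skips `# HELP` / `# TYPE` comments and other metrics
        labels = dict(_LABEL.findall(match.group("labels")))
        if "server_name" in labels:
            names.append(labels["server_name"])
    return names


# Hypothetical excerpt of what the metrics endpoint might return.
SAMPLE = """\
# HELP synapse_server_name_info Maps Synapse `server_name`s to the `instance`s they're hosted on
# TYPE synapse_server_name_info gauge
synapse_server_name_info{server_name="my.docker.synapse.server"} 1.0
process_cpu_seconds_total 12.5
"""

print(server_names_from_metrics(SAMPLE))
```

To run it against the container started above, the text could be fetched with `urllib.request.urlopen("http://localhost:19090/_synapse/metrics").read().decode()` and passed to the same function.
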
2026-01-14 17:57:42 -06:00
parent 9b776c6a48
commit 58f59ffbcb
5 changed files with 217 additions and 195 deletions
@@ -0,0 +1 @@
+Refactor Grafana dashboard to use `server_name` label (instead of `instance`).
@@ -123,25 +123,21 @@ Example Prometheus target for Synapse with workers:
    static_configs:
      - targets: ["my.server.here:port"]
        labels:
-          instance: "my.server"
          job: "master"
          index: 1
      - targets: ["my.workerserver.here:port"]
        labels:
-          instance: "my.server"
          job: "generic_worker"
          index: 1
      - targets: ["my.workerserver.here:port"]
        labels:
-          instance: "my.server"
          job: "generic_worker"
          index: 2
      - targets: ["my.workerserver.here:port"]
        labels:
-          instance: "my.server"
          job: "media_repository"
          index: 1
 ```

-Labels (`instance`, `job`, `index`) can be defined as anything.
+Labels (`job`, `index`) can be defined as anything.
 The labels are used to group graphs in grafana.
@@ -659,6 +659,26 @@ build_info.labels(
    " ".join([platform.system(), platform.release()]),
 ).set(1)

+
+synapse_server_name_info = Gauge(
+    "synapse_server_name_info",
+    "Maps Synapse `server_name`s to the `instance`s they're hosted on",
+    # `instance` will automatically be set by Prometheus
+    labelnames=[SERVER_NAME_LABEL],
+)
+"""
+Maps Synapse `server_name`s to the `instance`s they're hosted on.
+
+This is an info-style metric where the value is always 1, and labels carry metadata:
+
+ - `server_name`: The Synapse `server_name`
+ - `instance`: Automatically set by Prometheus; this is the `<host>:<port>` part
+    of the target's URL that was scraped.
+
+This is useful as it allows us to correlate process-level metrics (like `process_*`,
+`python_*`, etc) with homeservers.
+"""
+
 # 3PID send info
 threepid_send_requests = Histogram(
    "synapse_threepid_send_requests_with_tries",
@@ -147,8 +147,10 @@ from synapse.http.matrixfederationclient import MatrixFederationHttpClient
 from synapse.logging.context import PreserveLoggingContext
 from synapse.media.media_repository import MediaRepository
 from synapse.metrics import (
+    SERVER_NAME_LABEL,
    all_later_gauges_to_clean_up_on_shutdown,
    register_threadpool,
+    synapse_server_name_info,
 )
 from synapse.metrics.background_process_metrics import run_as_background_process
 from synapse.metrics.common_usage_metrics import CommonUsageMetricsManager
@@ -361,6 +363,9 @@ class HomeServer(metaclass=abc.ABCMeta):
        self._sync_shutdown_handlers: list[ShutdownInfo] = []
        self._background_processes: set[defer.Deferred[Any | None]] = set()

+        # For every server we spawn in the process, track it in the metrics
+        synapse_server_name_info.labels(**{SERVER_NAME_LABEL: self.hostname}).set(1)
+
    def run_as_background_process(
        self,
        desc: "LiteralString",