1
0

Refactor Grafana dashboard to use server_name label (#19337)

- Update `synapse_xxx` (server-level) metrics to use
`server_name="$server_name",` instead of `instance="$instance"`
- Add `synapse_server_name_info` metric to map Synapse `server_name`s to
the `instance`s they're hosted on.
- For process level metrics, update to use `xxx * on (instance, job,
index) group_left(server_name)
synapse_server_name_info{server_name="$server_name"}`

All of the changes here are backwards compatible with whatever people
were doing before with their Prometheus/Grafana dashboards.

Previously, the recommendation was to use the `instance` label to group
everything under the same server (803e4b4d88/docs/metrics-howto.md (L93-L147))

But the `instance` label actually has a special meaning and we're
actually abusing it by using it that way:

> `instance`: The `<host>:<port>` part of the target's URL that was
scraped.
>
> *--
https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series*

Since https://github.com/element-hq/synapse/issues/18592 (Synapse
`v1.139.0`), we now have the `server_name` label to use instead.


---

Additionally, the assumption that a single process is serving a single
server is no longer true with [Synapse Pro for small
hosts](https://docs.element.io/latest/element-server-suite-pro/synapse-pro-for-small-hosts/overview/).

Part of https://github.com/element-hq/synapse-small-hosts/issues/106

### Motivating use case

Although this change also benefits [Synapse Pro for small
hosts](https://docs.element.io/latest/element-server-suite-pro/synapse-pro-for-small-hosts/overview/)
(https://github.com/element-hq/synapse-small-hosts/issues/106), this is
actually spawning from adding Prometheus metrics to our workerized
Docker image (https://github.com/element-hq/synapse/pull/19324,
https://github.com/element-hq/synapse/pull/19336) with a more correct
label setup (without `instance`) and wanting the dashboard to be better.



### Testing strategy

1. Make sure your firewall allows the Docker containers to communicate
to the host (`host.docker.internal`) so they can access exposed ports of
other Docker containers. We want to allow Synapse to access the
Prometheus container and Grafana to access to the Prometheus container.
- `sudo ufw allow in on docker0 comment "Allow traffic from the default
Docker network to the host machine (host.docker.internal)"`
- `sudo ufw allow in on br-+ comment "(from Matrix Complement testing)
Allow traffic from custom Docker networks to the host machine
(host.docker.internal)"`
- [Complement firewall
docs](ee6acd9154/README.md (potential-conflict-with-firewall-software))
1. Build the Docker image for Synapse: `docker build -t
matrixdotorg/synapse -f docker/Dockerfile .`
([docs](7a24fafbc3/docker/README-testing.md (building-and-running-the-images-manually)))
 1. Generate config for Synapse:
    ```
    docker run -it --rm \
        --mount type=volume,src=synapse-data,dst=/data \
        -e SYNAPSE_SERVER_NAME=my.docker.synapse.server \
        -e SYNAPSE_REPORT_STATS=yes \
        -e SYNAPSE_ENABLE_METRICS=1 \
        matrixdotorg/synapse:latest generate
    ```
 1. Start Synapse:
     ```
    docker run -d --name synapse \
        --mount type=volume,src=synapse-data,dst=/data \
        -p 8008:8008 \
        -p 19090:19090 \
        matrixdotorg/synapse:latest
    ```
1. You should be able to see metrics from Synapse at
http://localhost:19090/_synapse/metrics
 1. Create a Prometheus config (`prometheus.yml`)
    ```yaml
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
      evaluation_interval: 15s
    
    scrape_configs:
      - job_name: prometheus
        scrape_interval: 15s
        metrics_path: /_synapse/metrics
        scheme: http
        static_configs:
          - targets:
# This should point to the Synapse metrics listener (we're using
`host.docker.internal` because this is from within the Prometheus
container)
              - host.docker.internal:19090
    ```
1. Start Prometheus (update the volume bind mount to the config you just
saved somewhere):
    ```
    docker run \
        --detach \
        --name=prometheus \
        --add-host host.docker.internal:host-gateway \
        -p 9090:9090 \
-v
~/Documents/code/random/prometheus-config/prometheus.yml:/etc/prometheus/prometheus.yml
\
        prom/prometheus
    ```
1. Make sure you're seeing some data in Prometheus. On
http://localhost:9090/query, search for `synapse_build_info`
 1. Start [Grafana](https://hub.docker.com/r/grafana/grafana)
    ```
docker run -d --name=grafana --add-host
host.docker.internal:host-gateway -p 3000:3000 grafana/grafana
    ```
1. Visit the Grafana dashboard, http://localhost:3000/ (Credentials:
`admin`/`admin`)
1. **Connections** -> **Data Sources** -> **Add data source** ->
**Prometheus**
     - Prometheus server URL: `http://host.docker.internal:9090`
 1. Import the Synapse dashboard: `contrib/grafana/synapse.json`

To test workers, you can use the testing strategy from
https://github.com/element-hq/synapse/pull/19336 (assumes both changes
from this PR and the other PR are combined)
This commit is contained in:
Eric Eastwood
2026-01-14 17:57:42 -06:00
committed by GitHub
parent 9b776c6a48
commit 58f59ffbcb
5 changed files with 217 additions and 195 deletions

View File

@@ -0,0 +1 @@
Refactor Grafana dashboard to use `server_name` label (instead of `instance`).

File diff suppressed because it is too large Load Diff

View File

@@ -123,25 +123,21 @@ Example Prometheus target for Synapse with workers:
static_configs:
- targets: ["my.server.here:port"]
labels:
instance: "my.server"
job: "master"
index: 1
- targets: ["my.workerserver.here:port"]
labels:
instance: "my.server"
job: "generic_worker"
index: 1
- targets: ["my.workerserver.here:port"]
labels:
instance: "my.server"
job: "generic_worker"
index: 2
- targets: ["my.workerserver.here:port"]
labels:
instance: "my.server"
job: "media_repository"
index: 1
```
Labels (`instance`, `job`, `index`) can be defined as anything.
Labels (`job`, `index`) can be defined as anything.
The labels are used to group graphs in grafana.

View File

@@ -659,6 +659,26 @@ build_info.labels(
" ".join([platform.system(), platform.release()]),
).set(1)
synapse_server_name_info = Gauge(
"synapse_server_name_info",
"Maps Synapse `server_name`s to the `instance`s they're hosted on",
# `instance` will automatically be set by Prometheus
labelnames=[SERVER_NAME_LABEL],
)
"""
Maps Synapse `server_name`s to the `instance`s they're hosted on.
This is an info-style metric where the value is always 1, and labels carry metadata:
- `server_name`: The Synapse `server_name`
- `instance`: Automatically be set by Prometheus and is the `<host>:<port>` part
of the target's URL that was scraped.
This is useful as it allows us to correlate process-level metrics (like `process_*`,
`python_*`, etc) with homeservers.
"""
# 3PID send info
threepid_send_requests = Histogram(
"synapse_threepid_send_requests_with_tries",

View File

@@ -147,8 +147,10 @@ from synapse.http.matrixfederationclient import MatrixFederationHttpClient
from synapse.logging.context import PreserveLoggingContext
from synapse.media.media_repository import MediaRepository
from synapse.metrics import (
SERVER_NAME_LABEL,
all_later_gauges_to_clean_up_on_shutdown,
register_threadpool,
synapse_server_name_info,
)
from synapse.metrics.background_process_metrics import run_as_background_process
from synapse.metrics.common_usage_metrics import CommonUsageMetricsManager
@@ -361,6 +363,9 @@ class HomeServer(metaclass=abc.ABCMeta):
self._sync_shutdown_handlers: list[ShutdownInfo] = []
self._background_processes: set[defer.Deferred[Any | None]] = set()
# For every server we spawn in the process, track it in the metrics
synapse_server_name_info.labels(**{SERVER_NAME_LABEL: self.hostname}).set(1)
def run_as_background_process(
self,
desc: "LiteralString",