Refactor Grafana dashboard to use server_name label (#19337)
- Update `synapse_xxx` (server-level) metrics to use
`server_name="$server_name",` instead of `instance="$instance"`
- Add `synapse_server_name_info` metric to map Synapse `server_name`s to
the `instance`s they're hosted on.
- For process-level metrics, update to use `xxx * on (instance, job, index) group_left(server_name) synapse_server_name_info{server_name="$server_name"}`
All of the changes here are backwards compatible with whatever people
were doing before with their Prometheus/Grafana dashboards.
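To illustrate what the process-level join above computes, here is a minimal pure-Python sketch of the `* on (instance, job, index) group_left(server_name)` matching. It is not how Prometheus implements it; the function name `join_with_server_name` and the sample label values are hypothetical and only used for illustration.

```python
from typing import Dict, List, Tuple

Labels = Dict[str, str]
Sample = Tuple[Labels, float]

# Labels used for matching, as in `on (instance, job, index)`.
JOIN_LABELS = ("instance", "job", "index")


def join_with_server_name(
    process_samples: List[Sample],
    info_samples: List[Sample],
    server_name: str,
) -> List[Sample]:
    """Emulate `xxx * on (...) group_left(server_name) info{server_name=...}`."""
    # Keep only the info series for the selected server, keyed by join labels.
    info_by_key = {}
    for labels, value in info_samples:
        if labels.get("server_name") != server_name:
            continue
        key = tuple(labels.get(name, "") for name in JOIN_LABELS)
        info_by_key[key] = (labels["server_name"], value)

    joined = []
    for labels, value in process_samples:
        key = tuple(labels.get(name, "") for name in JOIN_LABELS)
        match = info_by_key.get(key)
        if match is None:
            # No info series for this process: the sample is dropped.
            continue
        name, info_value = match
        # The info value is always 1, so multiplying keeps the original value
        # while `group_left` copies `server_name` onto the result.
        joined.append(({**labels, "server_name": name}, value * info_value))
    return joined


# Hypothetical samples: two processes, each hosting a different server.
process_samples = [
    ({"instance": "host1:9090", "job": "master", "index": "1"}, 120.0),
    ({"instance": "host2:9090", "job": "master", "index": "1"}, 80.0),
]
info_samples = [
    ({"instance": "host1:9090", "job": "master", "index": "1",
      "server_name": "example.com"}, 1.0),
    ({"instance": "host2:9090", "job": "master", "index": "1",
      "server_name": "other.org"}, 1.0),
]
result = join_with_server_name(process_samples, info_samples, "example.com")
```

Only the process that hosts `example.com` survives the join, and its sample gains a `server_name` label while keeping its value.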
Previously, the recommendation was to use the `instance` label to group
everything under the same server (803e4b4d88/docs/metrics-howto.md (L93-L147)).
However, the `instance` label has a special meaning in Prometheus, and
using it that way abuses it:
> `instance`: The `<host>:<port>` part of the target's URL that was scraped.
>
> *-- https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series*
Since https://github.com/element-hq/synapse/issues/18592 (Synapse
`v1.139.0`), we now have the `server_name` label to use instead.
---
Additionally, the assumption that a single process is serving a single
server is no longer true with [Synapse Pro for small
hosts](https://docs.element.io/latest/element-server-suite-pro/synapse-pro-for-small-hosts/overview/).
Part of https://github.com/element-hq/synapse-small-hosts/issues/106
### Motivating use case
Although this change also benefits [Synapse Pro for small
hosts](https://docs.element.io/latest/element-server-suite-pro/synapse-pro-for-small-hosts/overview/)
(https://github.com/element-hq/synapse-small-hosts/issues/106), it
actually stems from adding Prometheus metrics to our workerized
Docker image (https://github.com/element-hq/synapse/pull/19324,
https://github.com/element-hq/synapse/pull/19336) with a more correct
label setup (without `instance`) and wanting the dashboard to work with that setup.
### Testing strategy
1. Make sure your firewall allows the Docker containers to communicate
with the host (`host.docker.internal`) so they can access exposed ports of
other Docker containers. We want to allow Prometheus to access the
Synapse container and Grafana to access the Prometheus container.
- `sudo ufw allow in on docker0 comment "Allow traffic from the default Docker network to the host machine (host.docker.internal)"`
- `sudo ufw allow in on br-+ comment "(from Matrix Complement testing) Allow traffic from custom Docker networks to the host machine (host.docker.internal)"`
- [Complement firewall
docs](ee6acd9154/README.md (potential-conflict-with-firewall-software))
1. Build the Docker image for Synapse: `docker build -t
matrixdotorg/synapse -f docker/Dockerfile .`
([docs](7a24fafbc3/docker/README-testing.md (building-and-running-the-images-manually)))
1. Generate config for Synapse:
```
docker run -it --rm \
--mount type=volume,src=synapse-data,dst=/data \
-e SYNAPSE_SERVER_NAME=my.docker.synapse.server \
-e SYNAPSE_REPORT_STATS=yes \
-e SYNAPSE_ENABLE_METRICS=1 \
matrixdotorg/synapse:latest generate
```
1. Start Synapse:
```
docker run -d --name synapse \
--mount type=volume,src=synapse-data,dst=/data \
-p 8008:8008 \
-p 19090:19090 \
matrixdotorg/synapse:latest
```
1. You should be able to see metrics from Synapse at
http://localhost:19090/_synapse/metrics
1. Create a Prometheus config (`prometheus.yml`)
```yaml
global:
  scrape_interval: 15s
  scrape_timeout: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: prometheus
    scrape_interval: 15s
    metrics_path: /_synapse/metrics
    scheme: http
    static_configs:
      - targets:
          # This should point to the Synapse metrics listener (we're using
          # `host.docker.internal` because this is from within the Prometheus
          # container)
          - host.docker.internal:19090
```
1. Start Prometheus (update the volume bind mount to the config you just
saved somewhere):
```
docker run \
--detach \
--name=prometheus \
--add-host host.docker.internal:host-gateway \
-p 9090:9090 \
-v ~/Documents/code/random/prometheus-config/prometheus.yml:/etc/prometheus/prometheus.yml \
prom/prometheus
```
1. Make sure you're seeing some data in Prometheus. On
http://localhost:9090/query, search for `synapse_build_info`
1. Start [Grafana](https://hub.docker.com/r/grafana/grafana)
```
docker run -d --name=grafana --add-host host.docker.internal:host-gateway -p 3000:3000 grafana/grafana
```
1. Visit the Grafana dashboard, http://localhost:3000/ (Credentials:
`admin`/`admin`)
1. **Connections** -> **Data Sources** -> **Add data source** ->
**Prometheus**
- Prometheus server URL: `http://host.docker.internal:9090`
1. Import the Synapse dashboard: `contrib/grafana/synapse.json`
To test workers, you can use the testing strategy from
https://github.com/element-hq/synapse/pull/19336 (this assumes the changes
from this PR and the other PR are combined).
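As a quick sanity check for the Synapse metrics step in the list above, the following sketch extracts the `server_name` values reported by `synapse_server_name_info` from Prometheus text-format output. The helper name `server_names_from_metrics` and the `SAMPLE` text are assumptions made for illustration; the sample mirrors the metric name, label and value described in this PR rather than captured output.

```python
import re
from typing import List

# Matches sample lines such as: synapse_server_name_info{server_name="x"} 1.0
_INFO_SAMPLE = re.compile(
    r'^synapse_server_name_info\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
# Matches individual `name="value"` label pairs (allowing escaped characters).
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def server_names_from_metrics(text: str) -> List[str]:
    """Return the `server_name`s exposed via `synapse_server_name_info`."""
    names = []
    for line in text.splitlines():
        match = _INFO_SAMPLE.match(line)
        if match is None:
            continue  # comments (# HELP / # TYPE) and other metrics
        labels = dict(_LABEL.findall(match.group("labels")))
        # Info-style metric: the value is expected to always be 1.
        if "server_name" in labels and float(match.group("value")) == 1.0:
            names.append(labels["server_name"])
    return names


# Illustrative sample, not captured from a real server.
SAMPLE = """\
# HELP synapse_server_name_info Maps Synapse `server_name`s to the `instance`s they're hosted on
# TYPE synapse_server_name_info gauge
synapse_server_name_info{server_name="my.docker.synapse.server"} 1.0
"""

found = server_names_from_metrics(SAMPLE)
```

To run it against a live server, the text could be fetched with `urllib.request.urlopen("http://localhost:19090/_synapse/metrics").read().decode()` and passed to the same function.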
changelog.d/19337.feature
Normal file
@@ -0,0 +1 @@
+Refactor Grafana dashboard to use `server_name` label (instead of `instance`).
File diff suppressed because it is too large
@@ -123,25 +123,21 @@ Example Prometheus target for Synapse with workers:
     static_configs:
       - targets: ["my.server.here:port"]
         labels:
-          instance: "my.server"
           job: "master"
           index: 1
       - targets: ["my.workerserver.here:port"]
         labels:
-          instance: "my.server"
           job: "generic_worker"
           index: 1
       - targets: ["my.workerserver.here:port"]
         labels:
-          instance: "my.server"
           job: "generic_worker"
           index: 2
       - targets: ["my.workerserver.here:port"]
         labels:
-          instance: "my.server"
           job: "media_repository"
           index: 1
 ```

-Labels (`instance`, `job`, `index`) can be defined as anything.
+Labels (`job`, `index`) can be defined as anything.
 The labels are used to group graphs in grafana.
@@ -659,6 +659,26 @@ build_info.labels(
     " ".join([platform.system(), platform.release()]),
 ).set(1)

+
+synapse_server_name_info = Gauge(
+    "synapse_server_name_info",
+    "Maps Synapse `server_name`s to the `instance`s they're hosted on",
+    # `instance` will automatically be set by Prometheus
+    labelnames=[SERVER_NAME_LABEL],
+)
+"""
+Maps Synapse `server_name`s to the `instance`s they're hosted on.
+
+This is an info-style metric where the value is always 1, and labels carry metadata:
+
+- `server_name`: The Synapse `server_name`
+- `instance`: Automatically be set by Prometheus and is the `<host>:<port>` part
+  of the target's URL that was scraped.
+
+This is useful as it allows us to correlate process-level metrics (like `process_*`,
+`python_*`, etc) with homeservers.
+"""
+
 # 3PID send info
 threepid_send_requests = Histogram(
     "synapse_threepid_send_requests_with_tries",
@@ -147,8 +147,10 @@ from synapse.http.matrixfederationclient import MatrixFederationHttpClient
 from synapse.logging.context import PreserveLoggingContext
 from synapse.media.media_repository import MediaRepository
 from synapse.metrics import (
+    SERVER_NAME_LABEL,
     all_later_gauges_to_clean_up_on_shutdown,
     register_threadpool,
+    synapse_server_name_info,
 )
 from synapse.metrics.background_process_metrics import run_as_background_process
 from synapse.metrics.common_usage_metrics import CommonUsageMetricsManager
@@ -361,6 +363,9 @@ class HomeServer(metaclass=abc.ABCMeta):
         self._sync_shutdown_handlers: list[ShutdownInfo] = []
         self._background_processes: set[defer.Deferred[Any | None]] = set()

+        # For every server we spawn in the process, track it in the metrics
+        synapse_server_name_info.labels(**{SERVER_NAME_LABEL: self.hostname}).set(1)
+
     def run_as_background_process(
         self,
         desc: "LiteralString",