33 Commits

Author SHA1 Message Date
Eric Eastwood 826a7dd29a Update "Event Send Time Quantiles" graph to only use dots for the event persistence rate (#19399)
This is the same thing we already do in the [`matrix.org`
dashboard](https://grafana.matrix.org/d/000000012/synapse). Although
the purple dots aren't new (they were introduced in
https://github.com/matrix-org/synapse/pull/10001), you can see that
this was the intention in https://github.com/element-hq/synapse/pull/18510.
I think this was just how our contrib dashboard looked at the time, and
perhaps a Grafana version mismatch is why it didn't translate.
2026-01-22 14:07:22 -06:00
Eric Eastwood d6b45a7c8c Update and align Grafana dashboard to use regex matching for job=~"$job" (#19400)
We're already using `job=~"$job"` in the majority of the other panels.
This is just aligning the stragglers.

### Background

For a Grafana variable, when the "All" value is selected, Grafana
expands the variable into a regex. By default, this is just a giant
list of all of the possible values OR'd together. It's possible to
define a "custom all value", as we've done for `index` with `.*`, and
it feels like we should do the same here in a follow-up PR.

Input:
```
job="$job"
```

Before (using **exact** match) -> resulted in matching nothing:

```
job="(appservice|background_worker|client_reader|device_lists|event_creator|event_persister|federation_inbound|federation_reader|federation_sender|media_repository|pusher|stream_writers|synapse|synchrotron|user_dir)""
```

After (using **regex** match) -> matches all jobs as expected:

```
job=~"(appservice|background_worker|client_reader|device_lists|event_creator|event_persister|federation_inbound|federation_reader|federation_sender|media_repository|pusher|stream_writers|synapse|synchrotron|user_dir)""
```
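
For reference, with a "custom all value" of `.*` (the follow-up floated above), selecting "All" would instead collapse the selector to a simple wildcard:

```
job=~".*"
```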
2026-01-22 11:18:49 -06:00
Eric Eastwood 87d93b1ae6 Latest changes from importing/exporting from Grafana 12.3.1 (#19381)
These are automatic changes from importing/exporting from Grafana
12.3.1.

In order to verify that I'm not sneaking in any changes, you can follow
these steps to get the same output.

Reproduction instructions:

 1. Start [Grafana](https://hub.docker.com/r/grafana/grafana):
    ```
    docker run -d --name=grafana --add-host host.docker.internal:host-gateway -p 3000:3000 grafana/grafana
    ```
 1. Visit the Grafana dashboard at http://localhost:3000/ (credentials: `admin`/`admin`)
 1. Import the Synapse dashboard: `contrib/grafana/synapse.json`
 1. Export the Synapse dashboard: on the dashboard page -> **Export** -> **Export as code** -> using the **Classic** model -> check **Export for sharing externally** -> Copy
 1. Paste into `contrib/grafana/synapse.json`
 1. `git status`/`git diff` to check whether there is any diff

Sanity-checked the dashboard by importing it on
https://grafana.matrix.org/ (Grafana 10.4.1, according to
https://grafana.matrix.org/api/health). The process-level metrics won't
work there because https://github.com/element-hq/synapse/pull/19337 only
just merged and isn't on `matrix.org` yet. More generally, this
dashboard works for me locally with the
[load-tests](https://github.com/element-hq/synapse-rust-apps/pull/397)
I've been doing.


### Motivation

There are a few fixes I want to make to the Grafana dashboard, and it
sucks having to manually translate everything back over because we have
different formatting.

Hopefully, after this bulk change, future exports will contain exactly
the changes we intend to make.
2026-01-16 11:36:49 -06:00
Eric Eastwood 58f59ffbcb Refactor Grafana dashboard to use server_name label (#19337)
- Update `synapse_xxx` (server-level) metrics to use `server_name="$server_name"` instead of `instance="$instance"`
- Add a `synapse_server_name_info` metric to map Synapse `server_name`s to the `instance`s they're hosted on.
- For process-level metrics, update to use `xxx * on (instance, job, index) group_left(server_name) synapse_server_name_info{server_name="$server_name"}` (see the sketch below)
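
A concrete sketch of that join pattern (the base metric `process_cpu_seconds_total` and the `[1m]` range here are illustrative assumptions, not lifted from the dashboard):

```
rate(process_cpu_seconds_total{job=~"$job",index=~"$index"}[1m])
  * on (instance, job, index) group_left (server_name)
    synapse_server_name_info{server_name="$server_name"}
```

Multiplying by the info-style metric (conventionally valued `1`) leaves the rate unchanged, while `group_left` copies the `server_name` label onto the result.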

All of the changes here are backwards compatible with whatever people
were doing before with their Prometheus/Grafana dashboards.

Previously, the recommendation was to use the `instance` label to group
everything under the same server (https://github.com/element-hq/synapse/blob/803e4b4d884b2de4b9e20dc47ffb59a983b8a4b5/docs/metrics-howto.md#L93-L147).

But the `instance` label has a special meaning in Prometheus, and we
were abusing it by using it that way:

> `instance`: The `<host>:<port>` part of the target's URL that was scraped.
>
> *-- https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series*

Since https://github.com/element-hq/synapse/issues/18592 (Synapse
`v1.139.0`), we now have the `server_name` label to use instead.


---

Additionally, the assumption that a single process is serving a single
server is no longer true with [Synapse Pro for small
hosts](https://docs.element.io/latest/element-server-suite-pro/synapse-pro-for-small-hosts/overview/).

Part of https://github.com/element-hq/synapse-small-hosts/issues/106

### Motivating use case

Although this change also benefits [Synapse Pro for small
hosts](https://docs.element.io/latest/element-server-suite-pro/synapse-pro-for-small-hosts/overview/)
(https://github.com/element-hq/synapse-small-hosts/issues/106), it
actually grew out of adding Prometheus metrics to our workerized
Docker image (https://github.com/element-hq/synapse/pull/19324,
https://github.com/element-hq/synapse/pull/19336) with a more correct
label setup (without `instance`), and wanting the dashboard to be better.



### Testing strategy

 1. Make sure your firewall allows the Docker containers to communicate with the host (`host.docker.internal`) so they can access exposed ports of other Docker containers. We want to allow Synapse to access the Prometheus container and Grafana to access the Prometheus container.
    - `sudo ufw allow in on docker0 comment "Allow traffic from the default Docker network to the host machine (host.docker.internal)"`
    - `sudo ufw allow in on br-+ comment "(from Matrix Complement testing) Allow traffic from custom Docker networks to the host machine (host.docker.internal)"`
    - [Complement firewall docs](https://github.com/matrix-org/complement/blob/ee6acd9154bbae2d0071a9cb39593c0a5e37268b/README.md#potential-conflict-with-firewall-software)
 1. Build the Docker image for Synapse: `docker build -t matrixdotorg/synapse -f docker/Dockerfile .` ([docs](https://github.com/element-hq/synapse/blob/7a24fafbc376b9bffeb3277b1ad4aa950720c96c/docker/README-testing.md#building-and-running-the-images-manually))
 1. Generate config for Synapse:
    ```
    docker run -it --rm \
        --mount type=volume,src=synapse-data,dst=/data \
        -e SYNAPSE_SERVER_NAME=my.docker.synapse.server \
        -e SYNAPSE_REPORT_STATS=yes \
        -e SYNAPSE_ENABLE_METRICS=1 \
        matrixdotorg/synapse:latest generate
    ```
 1. Start Synapse:
    ```
    docker run -d --name synapse \
        --mount type=volume,src=synapse-data,dst=/data \
        -p 8008:8008 \
        -p 19090:19090 \
        matrixdotorg/synapse:latest
    ```
 1. You should now be able to see metrics from Synapse at http://localhost:19090/_synapse/metrics
 1. Create a Prometheus config (`prometheus.yml`):
    ```yaml
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: prometheus
        scrape_interval: 15s
        metrics_path: /_synapse/metrics
        scheme: http
        static_configs:
          - targets:
              # This should point to the Synapse metrics listener (we're using
              # `host.docker.internal` because this is from within the Prometheus
              # container)
              - host.docker.internal:19090
    ```
 1. Start Prometheus (update the volume bind mount to point at the config you just saved):
    ```
    docker run \
        --detach \
        --name=prometheus \
        --add-host host.docker.internal:host-gateway \
        -p 9090:9090 \
        -v ~/Documents/code/random/prometheus-config/prometheus.yml:/etc/prometheus/prometheus.yml \
        prom/prometheus
    ```
 1. Make sure you're seeing some data in Prometheus. On http://localhost:9090/query, search for `synapse_build_info`
 1. Start [Grafana](https://hub.docker.com/r/grafana/grafana):
    ```
    docker run -d --name=grafana --add-host host.docker.internal:host-gateway -p 3000:3000 grafana/grafana
    ```
 1. Visit the Grafana dashboard at http://localhost:3000/ (credentials: `admin`/`admin`)
 1. **Connections** -> **Data Sources** -> **Add data source** -> **Prometheus**
    - Prometheus server URL: `http://host.docker.internal:9090`
 1. Import the Synapse dashboard: `contrib/grafana/synapse.json`
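
If everything is wired up, a query like the following (a sketch; `my.docker.synapse.server` is the value from the generate step above) should come back with the `server_name` label attached:

```
synapse_build_info{server_name="my.docker.synapse.server"}
```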

To test workers, you can use the testing strategy from
https://github.com/element-hq/synapse/pull/19336 (assumes both changes
from this PR and the other PR are combined)
2026-01-14 17:57:42 -06:00
Erik Johnston 3ba3c7fe7d Reduce cardinality of metrics on event persister (#19133)
This reduces the size of the metrics output by ~80%. Responding with
the metrics takes a significant amount of time.
2025-11-12 13:41:58 +00:00
Eric Eastwood f13a136396 Refactor Gauge metrics to be homeserver-scoped (#18725)
Bulk refactor `Gauge` metrics to be homeserver-scoped. We also add lints
to make sure that new `Gauge` metrics don't sneak in without using the
`server_name` label (`SERVER_NAME_LABEL`).

Part of https://github.com/element-hq/synapse/issues/18592



### Testing strategy

 1. Add the `metrics` listener in your `homeserver.yaml`
    ```yaml
    listeners:
      # This is just showing how to configure metrics either way
      #
      # `http` `metrics` resource
      - port: 9322
        type: http
        bind_addresses: ['127.0.0.1']
        resources:
          - names: [metrics]
            compress: false
      # `metrics` listener
      - port: 9323
        type: metrics
        bind_addresses: ['127.0.0.1']
    ```
 1. Start the homeserver: `poetry run synapse_homeserver --config-path homeserver.yaml`
 1. Fetch `http://localhost:9322/_synapse/metrics` and/or `http://localhost:9323/metrics`
 1. Observe that the response includes the refactored `Gauge` metrics with the `server_name` label
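
For instance, a matching line in the response might look like this (the metric name and server name are hypothetical placeholders; any of the refactored gauges should carry the label):

```
synapse_example_gauge{server_name="my.synapse.server"} 1.0
```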

2025-07-29 10:37:59 -05:00
Eric Eastwood e80bc4b062 Distinguish all vs local events being persisted in the "Event Send Time Quantiles" graph (#18510)
(Applies to the Grafana graphs)

As discovered by @devonh, we use `synapse_storage_events_persisted_events_total` (which tracks *all* persisted events) for the "Events" rate in the "Event Send Time Quantiles" graph. This is pretty misleading as I would expect it to be the rate of events being sent given the graph title, "Event Send Time Quantiles".

Since the event persistence queues are shared for local and remote events from federation and will block local events being sent, I think it does still make sense to have the event persist rate. I've updated the graph to include the rate of "Local events being persisted" and the rate of "All events being persisted". I think this properly disambiguates and clarifies what the graph is trying to show.
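
A minimal sketch of the two rates: the first expression is the all-events rate using the counter named above; the second, for local events only, assumes the per-origin counter `synapse_storage_events_persisted_events_sep_total` and its `origin_type` label (not named in this message).

```
rate(synapse_storage_events_persisted_events_total[1m])

sum(rate(synapse_storage_events_persisted_events_sep_total{origin_type="local"}[1m]))
```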
2025-06-05 15:30:28 -05:00
Erik Johnston 0455c40085 Update book location 2023-12-13 16:15:22 +00:00
Michael Sasser 3df70aa800 Replace all Prometheus datasource UIDs of the Grafana Dashboard with the variable ${DS_PROMETHEUS} and remove __inputs (#16471) 2023-10-23 19:50:50 +01:00
Will Lewis 835174180b Fixed grafana deploy annotations in the dashboard config, so it shows for those not managing matrix.org (#15957)
Removed the 'matrix.org' hardcoded instance setting

Originally introduced in #15674

Co-authored-by: wrjlewis <will.lewis@askattest.com>
2023-07-20 12:33:06 +00:00
Eric Eastwood 0b5f64ff09 Add Synapse version deploy annotations to Grafana dashboard (#15674)
Fix https://github.com/matrix-org/synapse/issues/15662

This manifests as purple lines that show up on all time series panels;
you can hover over them to see which version was deployed.

Also added a new "Deployed Synapse versions over time" panel, where the
color block changes with each version, and mixed this color block into
the "Up" time series panel.

To get the Grafana dashboard JSON to copy here: use the **Share** icon at the top -> **Export** -> check the **Export for sharing externally** option -> **View JSON** or **Save to file**
2023-05-31 14:35:49 -05:00
reivilibre 51e7255fbb Fix the *MAU Limits* section of the Grafana dashboard relying on a specific job name for the workers of a Synapse deployment. (#14644) 2022-12-13 14:19:43 +00:00
reivilibre a6514792b2 Update forgotten references to legacy metrics in the included Grafana dashboard. (#14477)
Fixes https://github.com/matrix-org/synapse/issues/14465
2022-11-22 10:51:01 +00:00
reivilibre b455c2a5ec Update Grafana dashboard to not use legacy metric names. (#13714) 2022-09-06 12:21:21 +01:00
reivilibre f48f4dd59e Update the Grafana dashboard that is included with Synapse in the contrib directory. (#13697)
* Add missing graph to contrib

* Update with minor but plausible changes, including positioning changes

* Newsfile

Signed-off-by: Olivier Wilkinson (reivilibre) <oliverw@matrix.org>
2022-09-01 16:27:06 +01:00
Richard van der Hoff 434fd82d5f Update grafana dashboard 2022-08-13 21:50:20 +01:00
Sheogorath 77258b6725 docs(contrib): Add link to documentation in dashboard (#12602) 2022-05-09 10:08:31 +00:00
David Robertson a10988983a Break down cache expiry reasons in grafana (#10880)
A follow-up to #10829
2021-09-23 14:45:32 +01:00
Brendan Abolivier d1f43b731c Update the Synapse Grafana dashboard (#10570) 2021-08-16 12:57:09 +02:00
Dirk Klimpel e938f69697 Fix some links in docs and contrib (#10370) 2021-07-13 11:55:48 +01:00
Erik Johnston 9c76d0561b Update the contrib grafana dashboard (#10001) 2021-05-19 11:47:16 +01:00
Michael Kaye f49c2093b5 Cross-link documentation to the prometheus recording rules. (#8667) 2020-10-27 15:29:50 -04:00
Richard van der Hoff fa361c8f65 Update grafana dashboard 2020-07-13 14:48:21 +01:00
Richard van der Hoff 816589b09a update grafana dashboard 2020-06-02 12:44:36 +01:00
Richard van der Hoff 782b811789 update grafana dashboard 2020-03-19 10:45:40 +00:00
Matthew Hodgson cc7ab0d84a rst->md 2020-03-01 21:21:36 +00:00
Richard van der Hoff abd334d27b Add extremities graphs to grafana dashboard 2019-06-25 11:51:32 +01:00
Richard van der Hoff dd1c722a39 format json for grafana dashboard 2019-06-25 08:59:19 +01:00
Richard van der Hoff 6b0ddf8ee5 update grafana dashboard 2019-04-13 13:10:46 +01:00
Richard van der Hoff 0649306fde Update grafana dashboard 2018-09-25 13:29:28 +01:00
Richard van der Hoff 9b92720d88 fix event lag graph 2018-08-07 12:27:34 +01:00
Paul Tötterman 9c14c2b561 Add some documentation for using the dashboard 2018-07-31 12:48:37 +03:00
Richard van der Hoff 6aab397ada synapse grafana dashboard 2018-07-31 09:45:58 +01:00