Commit Graph

39 Commits

Author SHA1 Message Date
Andrew Ferrazzutti
fc244bb592 Use type hinting generics in standard collections (#19046)
aka PEP 585, added in Python 3.9

 - https://peps.python.org/pep-0585/
 - https://docs.astral.sh/ruff/rules/non-pep585-annotation/
2025-10-22 16:48:19 -05:00
Devon Hudson
396de6544a Cleanly shutdown SynapseHomeServer object (#18828)
This PR aims to allow for a clean shutdown of the `SynapseHomeServer`
object so that it can be fully deleted and cleaned up by garbage
collection without shutting down the entire python process.

Fix https://github.com/element-hq/synapse-small-hosts/issues/50

### Pull Request Checklist

<!-- Please read
https://element-hq.github.io/synapse/latest/development/contributing_guide.html
before submitting your pull request -->

* [x] Pull request is based on the develop branch
* [x] Pull request includes a [changelog
file](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#changelog).
The entry should:
- Be a short description of your change which makes sense to users.
"Fixed a bug that prevented receiving messages from other servers."
instead of "Moved X method from `EventStore` to `EventWorkerStore`.".
  - Use markdown where necessary, mostly for `code blocks`.
  - End with either a period (.) or an exclamation mark (!).
  - Start with a capital letter.
- Feel free to credit yourself, by adding a sentence "Contributed by
@github_username." or "Contributed by [Your Name]." to the end of the
entry.
* [x] [Code
style](https://element-hq.github.io/synapse/latest/code_style.html) is
correct (run the
[linters](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#run-the-linters))

---------

Co-authored-by: Eric Eastwood <erice@element.io>
2025-10-01 02:42:09 +00:00
Eric Eastwood
5a9ca1e3d9 Introduce Clock.call_when_running(...) to include logcontext by default (#18944)
Introduce `Clock.call_when_running(...)` to wrap startup code in a
logcontext, ensuring we can identify which server generated the logs.

Background:

>  Ideally, nothing from the Synapse homeserver would be logged against the `sentinel` 
>  logcontext as we want to know which server the logs came from. In practice, this is not 
>  always the case yet especially outside of request handling. 
>   
>  Global things outside of Synapse (e.g. Twisted reactor code) should run in the 
>  `sentinel` logcontext. It's only when it calls into application code that a logcontext 
>  gets activated. This means the reactor should be started in the `sentinel` logcontext, 
>  and any time an awaitable yields control back to the reactor, it should reset the 
>  logcontext to be the `sentinel` logcontext. This is important to avoid leaking the 
>  current logcontext to the reactor (which would then get picked up and associated with 
>  the next thing the reactor does). 
>
> *-- `docs/log_contexts.md`

Also adds a lint to prefer `Clock.call_when_running(...)` over
`reactor.callWhenRunning(...)`

Part of https://github.com/element-hq/synapse/issues/18905
2025-09-22 10:27:59 -05:00
Eric Eastwood
bff4a11b3f Re-introduce: Fix LaterGauge metrics to collect from all servers (#18791)
Re-introduce: https://github.com/element-hq/synapse/pull/18751 that was
reverted in https://github.com/element-hq/synapse/pull/18789 (explains
why the PR was reverted in the first place).

- Adds a `cleanup` pattern that cleans up metrics from each homeserver
in the tests. Previously, the list of hooks built up until our CI
machines couldn't operate properly, see
https://github.com/element-hq/synapse/pull/18789
- Fix long-standing issue with `synapse_background_update_status`
metrics only tracking the last database listed in the config (see
https://github.com/element-hq/synapse/pull/18791#discussion_r2261706749)
2025-09-02 12:14:27 -05:00
Eric Eastwood
ff03a51cb0 Revert "Fix LaterGauge metrics to collect from all servers (#18751)" (#18789)
This PR reverts https://github.com/element-hq/synapse/pull/18751

### Why revert?

@reivilibre
[found](https://matrix.to/#/!vcyiEtMVHIhWXcJAfl:sw1v.org/$u9OEmMxaFYUzWHhCk1A_r50Y0aGrtKEhepF7WxWJkUA?via=matrix.org&via=node.marinchik.ink&via=element.io)
that our CI was failing in bizarre ways (thanks for stepping up to dive
into this 🙇). Examples:

- `twisted.internet.error.ProcessTerminated: A process has ended with a
probable error condition: process ended by signal 9.`
- `twisted.internet.error.ProcessTerminated: A process has ended with a
probable error condition: process ended by signal 15.`

<details>
<summary>More detailed part of the log</summary>


https://github.com/element-hq/synapse/actions/runs/16758038107/job/47500520633#step:9:6809
```
tests.util.test_wheel_timer.WheelTimerTestCase.test_single_insert_fetch
===============================================================================
Error: 
Traceback (most recent call last):
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/trial/_dist/disttrial.py", line 371, in task
    await worker.run(case, result)
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/trial/_dist/worker.py", line 305, in run
    return await self.callRemote(workercommands.Run, testCase=testCaseId)  # type: ignore[no-any-return]
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/defer.py", line 1187, in __iter__
    yield self
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/defer.py", line 1092, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/protocols/amp.py", line 1968, in _massageError
    error.trap(RemoteAmpError)
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/python/failure.py", line 431, in trap
    self.raiseException()
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/python/failure.py", line 455, in raiseException
    raise self.value.with_traceback(self.tb)
twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended by signal 9.

tests.util.test_macaroons.MacaroonGeneratorTestCase.test_guest_access_token
-------------------------------------------------------------------------------
Ran 4325 tests in 669.321s

FAILED (skips=159, errors=62, successes=4108)
while calling from thread
Traceback (most recent call last):
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/base.py", line 1064, in runUntilCurrent
    f(*a, **kw)
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/base.py", line 790, in stop
    raise error.ReactorNotRunning("Can't stop reactor that isn't running.")
twisted.internet.error.ReactorNotRunning: Can't stop reactor that isn't running.

joining disttrial worker #0 failed
Traceback (most recent call last):
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/defer.py", line 1853, in _inlineCallbacks
    result = context.run(
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/python/failure.py", line 467, in throwExceptionIntoGenerator
    return g.throw(self.value.with_traceback(self.tb))
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/trial/_dist/worker.py", line 406, in exit
    await endDeferred
  File "/home/runner/.cache/pypoetry/virtualenvs/matrix-synapse-pswDeSvb-py3.9/lib/python3.9/site-packages/twisted/internet/defer.py", line 1187, in __iter__
    yield self
twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended by signal 15.
```

</details>


With more debugging (thanks @devonh for also stepping in as maintainer),
we were finding that the CI was consistently failing at
`test_exposed_to_prometheus` which was a bit of smoke because of all of
the [metrics
changes](https://github.com/element-hq/synapse/issues/18592) that were
merged recently.

Locally, although I wasn't able to reproduce the bizarre errors, I could
easily see increased memory usage (~20GB vs ~2GB) and the
`test_exposed_to_prometheus` test taking a while to complete when
running a full test run (`SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial
tests`).

<img width="1485" height="78" alt="Lots of memory usage"
src="https://github.com/user-attachments/assets/811e2a96-75e5-4a3c-966c-00dc0512cea9"
/>

After updating `test_exposed_to_prometheus` to dump the
`latest_metrics_response = generate_latest(REGISTRY)`, I could see that
it's a massive 3.2GB response. Inspecting the contents, we can see 4.1M
(4,137,123) entries for just
`synapse_background_update_status{server_name="test"} 3.0` which is a
`LaterGauge`. I don't think we have 4.1M test cases so it's also unclear
why we end up with so many samples but it does make sense that we do see
a lot of duplicates because each `HomeserverTestCase` will create a
homeserver for each test case that will `LaterGauge.register_hook(...)`
(part of the https://github.com/element-hq/synapse/pull/18751 changes).

`tests/storage/databases/main/test_metrics.py`
```python
        latest_metrics_response = generate_latest(REGISTRY)
        with open("/tmp/synapse-test-metrics", "wb") as f:
            f.write(latest_metrics_response)
```

After reverting the https://github.com/element-hq/synapse/pull/18751
changes, running the full test suite locally doesn't result in memory
spikes and seems to run normally.



### Dev notes

Discussion in the
[`#synapse-dev:matrix.org`](https://matrix.to/#/!vcyiEtMVHIhWXcJAfl:sw1v.org/$vkMATs04yqZggVVd6Noop5nU8M2DVoTkrAWshw7u1-w?via=matrix.org&via=node.marinchik.ink&via=element.io)
room.

### Pull Request Checklist

<!-- Please read
https://element-hq.github.io/synapse/latest/development/contributing_guide.html
before submitting your pull request -->

* [x] Pull request is based on the develop branch
* [ ] Pull request includes a [changelog
file](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#changelog).
The entry should:
- Be a short description of your change which makes sense to users.
"Fixed a bug that prevented receiving messages from other servers."
instead of "Moved X method from `EventStore` to `EventWorkerStore`.".
  - Use markdown where necessary, mostly for `code blocks`.
  - End with either a period (.) or an exclamation mark (!).
  - Start with a capital letter.
- Feel free to credit yourself, by adding a sentence "Contributed by
@github_username." or "Contributed by [Your Name]." to the end of the
entry.
* [ ] [Code
style](https://element-hq.github.io/synapse/latest/code_style.html) is
correct (run the
[linters](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#run-the-linters))
2025-08-06 22:14:40 +00:00
Eric Eastwood
076db0ab49 Fix LaterGauge metrics to collect from all servers (#18751)
Fix `LaterGauge` metrics to collect from all servers

Follow-up to https://github.com/element-hq/synapse/pull/18714

Previously, our `LaterGauge` metrics did include the `server_name` label
as expected but we were only seeing the last server being reported in
some cases. Any `LaterGauge` that we were creating multiple times was
only reporting the last instance.

This PR updates all `LaterGauge` to be created once and then we use
`LaterGauge.register_hook(...)` to add in the metric callback as before.
This works now because we store a list of callbacks instead of just one.

I noticed this problem thanks to some [tests in the Synapse Pro for
Small Hosts](https://github.com/element-hq/synapse-small-hosts/pull/173)
repo that sanity check all metrics to ensure that we can see each metric
includes data from multiple servers.


### Testing strategy

1. This is only noticeable when you run multiple Synapse instances in
the same process.
 1. TODO

(see test that was added)

### Dev notes

Previous non-global `LaterGauge`:

```
synapse_federation_send_queue_xxx
synapse_federation_transaction_queue_pending_destinations
synapse_federation_transaction_queue_pending_pdus
synapse_federation_transaction_queue_pending_edus
synapse_handlers_presence_user_to_current_state_size
synapse_handlers_presence_wheel_timer_size
synapse_notifier_listeners
synapse_notifier_rooms
synapse_notifier_users
synapse_replication_tcp_resource_total_connections
synapse_replication_tcp_command_queue
synapse_background_update_status
synapse_federation_known_servers
synapse_scheduler_running_tasks
```



### Pull Request Checklist

<!-- Please read
https://element-hq.github.io/synapse/latest/development/contributing_guide.html
before submitting your pull request -->

* [x] Pull request is based on the develop branch
* [x] Pull request includes a [changelog
file](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#changelog).
The entry should:
- Be a short description of your change which makes sense to users.
"Fixed a bug that prevented receiving messages from other servers."
instead of "Moved X method from `EventStore` to `EventWorkerStore`.".
  - Use markdown where necessary, mostly for `code blocks`.
  - End with either a period (.) or an exclamation mark (!).
  - Start with a capital letter.
- Feel free to credit yourself, by adding a sentence "Contributed by
@github_username." or "Contributed by [Your Name]." to the end of the
entry.
* [x] [Code
style](https://element-hq.github.io/synapse/latest/code_style.html) is
correct (run the
[linters](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#run-the-linters))
2025-08-05 15:28:55 +00:00
Eric Eastwood
e16fbdcdcc Update metrics linting to be able to handle custom metrics (#18733)
Part of https://github.com/element-hq/synapse/issues/18592
2025-08-01 15:34:11 -05:00
Eric Eastwood
e43a1cec84 Fix cache metrics to collect from all servers (#18748)
Follow-up to https://github.com/element-hq/synapse/pull/18604

Previously, our cache metrics did include the `server_name` label as
expected but we were only seeing the last server being reported. This
was caused because we would
`CACHE_METRIC_REGISTRY.register_hook(metric_name, metric.collect)` where
the `metric_name` only took into account the cache name so it would be
overwritten every time we spawn a new server.

This PR updates the register logic to include the `server_name` so we
have a hook for every cache on every server as expected.

I noticed this problem thanks to some [tests in the Synapse Pro for
Small Hosts](https://github.com/element-hq/synapse-small-hosts/pull/173)
repo that sanity check all metrics to ensure that we can see each metric
includes data from multiple servers.
2025-08-01 12:29:58 -05:00
reivilibre
a31d53b28f Use twisted.internet.testing module in tests instead of deprecated twisted.test.proto_helpers. (#18728)
Follows: #18727

---------

Signed-off-by: Olivier 'reivilibre <oliverw@matrix.org>
2025-07-30 12:32:10 +01:00
Eric Eastwood
b7e7f537f1 Refactor background process metrics to be homeserver-scoped (#18670)
Part of https://github.com/element-hq/synapse/issues/18592

Separated out of https://github.com/element-hq/synapse/pull/18656
because it's a bigger, unique piece of the refactor


### Testing strategy

 1. Add the `metrics` listener in your `homeserver.yaml`
    ```yaml
    listeners:
      # This is just showing how to configure metrics either way
      #
      # `http` `metrics` resource
      - port: 9322
        type: http
        bind_addresses: ['127.0.0.1']
        resources:
          - names: [metrics]
            compress: false
      # `metrics` listener
      - port: 9323
        type: metrics
        bind_addresses: ['127.0.0.1']
    ```
1. Start the homeserver: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Fetch `http://localhost:9322/_synapse/metrics` and/or
`http://localhost:9323/metrics`
1. Observe response includes the background processs metrics
(`synapse_background_process_start_count`,
`synapse_background_process_db_txn_count_total`, etc) with the
`server_name` label
2025-07-23 13:28:17 -05:00
Eric Eastwood
cda922830e Clean up MetricsResource and Prometheus hacks (#18687)
Clean up `MetricsResource`, Prometheus hacks
(`_set_prometheus_client_use_created_metrics`), and better document why
we care about having a separate `metrics` listener type.

These clean-up changes have been split out from
https://github.com/element-hq/synapse/pull/18584 since that PR was
closed.
2025-07-17 11:57:19 -05:00
Eric Eastwood
88785dbaeb Refactor cache metrics to be homeserver-scoped (#18604)
(add `server_name` label to cache metrics).

Part of https://github.com/element-hq/synapse/issues/18592
2025-07-16 16:04:57 -05:00
Andrew Morgan
4b1d9d5d0e Add a unit test for the phone home stats (#18463) 2025-05-20 16:26:45 +01:00
Devon Hudson
89cb613a4e Revert "Add total event, unencrypted message, and e2ee event counts to stats reporting" (#18346)
Reverts element-hq/synapse#18260

It is causing a failure when building release debs for `debian:bullseye`
with the following error:
```
sqlite3.OperationalError: near "RETURNING": syntax error
```
2025-04-16 16:41:41 +00:00
Andrew Morgan
a832375bfb Add total event, unencrypted message, and e2ee event counts to stats reporting (#18260)
Co-authored-by: Eric Eastwood <erice@element.io>
2025-04-15 07:49:08 -07:00
V02460
068e22b4b7 Cleanup Python 3.8 leftovers (#17967)
Some small cleanups after Python3.8 became EOL.

- Move some type imports from `typing_extensions` to `typing`
- Remove the `abi3-py38` feature from pyo3

### Pull Request Checklist

<!-- Please read
https://element-hq.github.io/synapse/latest/development/contributing_guide.html
before submitting your pull request -->

* [x] Pull request is based on the develop branch
* [x] Pull request includes a [changelog
file](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#changelog).
The entry should:
- Be a short description of your change which makes sense to users.
"Fixed a bug that prevented receiving messages from other servers."
instead of "Moved X method from `EventStore` to `EventWorkerStore`.".
  - Use markdown where necessary, mostly for `code blocks`.
  - End with either a period (.) or an exclamation mark (!).
  - Start with a capital letter.
- Feel free to credit yourself, by adding a sentence "Contributed by
@github_username." or "Contributed by [Your Name]." to the end of the
entry.
* [x] [Code
style](https://element-hq.github.io/synapse/latest/code_style.html) is
correct
(run the
[linters](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#run-the-linters))

---------

Co-authored-by: Quentin Gliech <quenting@element.io>
2025-02-10 16:53:24 +00:00
Erik Johnston
23740eaa3d Correctly mention previous copyright (#16820)
During the migration the automated script to update the copyright
headers accidentally got rid of some of the existing copyright lines.
Reinstate them.
2024-01-23 11:26:48 +00:00
Patrick Cloke
8e1e62c9e0 Update license headers 2023-11-21 15:29:58 -05:00
Eric Eastwood
561d06b481 Remove support for Python 3.7 (#15851)
Fix https://github.com/matrix-org/synapse/issues/15836
2023-07-05 18:45:42 -05:00
Patrick Cloke
a4ca770655 Add missing type hints to tests. (#14687)
Adds type hints to tests.metrics and tests.crypto.
2022-12-28 08:29:35 -05:00
David Robertson
2bb2c32e8e Avoid incrementing bg process utime/stime counters by negative durations (#14323) 2022-10-31 13:02:07 +00:00
Amber Brown
fcc525b0b7 rest of the changes 2018-05-21 19:48:57 -05:00
Erik Johnston
32015e1109 Escape label values in prometheus metrics 2018-05-02 16:52:42 +01:00
Richard van der Hoff
bc496df192 report metrics on number of cache evictions 2018-02-05 15:34:01 +00:00
Erik Johnston
73c7112433 Change CacheMetrics to be quicker
We change it so that each cache has an individual CacheMetric, instead
of having one global CacheMetric. This means that when a cache tries to
increment a counter it does not need to go through so many indirections.
2016-06-03 11:26:52 +01:00
Mark Haines
7641a90c34 Add a test for TreeCache.__contains__ 2016-02-22 15:22:38 +00:00
Matthew Hodgson
6c28ac260c copyrights 2016-01-07 04:26:29 +00:00
Paul "LeoNerd" Evans
c1cdd7954d Add an .inc_by() method to CounterMetric; implement DistributionMetric a neater way 2015-03-12 16:24:51 +00:00
Paul "LeoNerd" Evans
f1fbe3e09f Rename TimerMetric to DistributionMetric; as it could count more than just time 2015-03-12 16:24:51 +00:00
Paul "LeoNerd" Evans
cbc0406be8 Export CacheMetric as hits+total, rather than hits+misses, as it's easier to derive hit ratio from that 2015-03-12 16:24:51 +00:00
Paul "LeoNerd" Evans
0e847540c3 Prometheus needs "escaped" label values 2015-03-12 16:24:51 +00:00
Paul "LeoNerd" Evans
b3a0179d64 Bugfix to rendering output of vectored TimerMetrics 2015-03-12 16:24:51 +00:00
Paul "LeoNerd" Evans
f9478e475b Rename Metrics' "keys" to "labels" 2015-03-12 16:24:51 +00:00
Paul "LeoNerd" Evans
72625f2f4d Initial hack at a TimerMetric; for storing counts + duration accumulators 2015-03-12 16:24:50 +00:00
Paul "LeoNerd" Evans
23ab0c68c2 Implement vector CallbackMetrics 2015-03-12 16:24:50 +00:00
Paul "LeoNerd" Evans
8664599af7 Rename CacheCounterMetric to just CacheMetric; add a CallbackMetric component to give the size of the cache 2015-03-12 16:24:50 +00:00
Paul "LeoNerd" Evans
d8caa5454d Initial attempt at a scalar callback-based metric to give instantaneous snapshot gauges 2015-03-12 16:24:50 +00:00
Paul "LeoNerd" Evans
ce8b5769f7 Create the concept of a cachecounter metric; generating two counters specific to caches 2015-03-12 16:24:50 +00:00
Paul "LeoNerd" Evans
e7420a3bef Initial tiny attempt at (vectorable) counter metrics 2015-03-12 16:24:50 +00:00