These errors are harmless and are a long-standing issue that is just now
being logged, see https://github.com/element-hq/synapse/issues/19042
```
2025-10-10 15:30:00,026 - synapse.util.metrics - 330 - ERROR - notify_interested_services-0 - Metric named cache_lru_cache__matches_user_in_member_list_example.com already registered for server example.com
2025-10-10 16:30:00.167
2025-10-10 15:30:00,026 - synapse.util.metrics - 330 - ERROR - notify_interested_services-0 - Metric named cache_lru_cache_is_interested_in_room_example.com already registered for server example.com
2025-10-10 16:30:00.167
2025-10-10 15:30:00,025 - synapse.util.metrics - 330 - ERROR - notify_interested_services-0 - Metric named cache_lru_cache_is_interested_in_event_example.com already registered for server example.com
2025-10-10 16:29:15.560
2025-10-10 15:29:15,449 - synapse.util.metrics - 330 - ERROR - notify_interested_services_ephemeral-0 - Metric named cache_lru_cache__matches_user_in_member_list_example.com already registered for server example.com
2025-10-10 16:29:15.560
2025-10-10 15:29:15,449 - synapse.util.metrics - 330 - ERROR - notify_interested_services_ephemeral-0 - Metric named cache_lru_cache_is_interested_in_room_example.com already registered for server example.com
```
(more sane standard location for this sort of thing)
The one difference here is that previously, `start_doing_background_updates
()` only ran on the main Synapse instance. But since it now lives in
`start_background_tasks()`, it will run on the worker that supposed to
`run_background_tasks`. Doesn't seem like a problem though.
This means we
can move the open registration config validation from `setup()` to
`HomeServerConfig.validate_config()` (much more sane).
Spawning from looking at this area of code in
https://github.com/element-hq/synapse/pull/19015
### Background
As part of Element's plan to support a light form of vhosting (virtual
host) (multiple instances of Synapse in the same Python process), we're
currently diving into the details and implications of running multiple
instances of Synapse in the same Python process.
"Clean tenant provisioning" tracked internally by
https://github.com/element-hq/synapse-small-hosts/issues/221
### Partial startup problem
In the context of Synapse Pro for Small Hosts, since the Twisted reactor
is already running (from the `multi_synapse` shard process itself), when
provisioning a homeserver tenant, the `reactor.callWhenRunning(...)`
callbacks will be invoked immediately. This includes the Synapse's
[`start`](0615b64bb4/synapse/app/homeserver.py (L418-L429))
callback which sets up everything (including listeners, background
tasks, etc). If we encounter an error at this point, we are partially
setup but the exception will [bubble back to
us](8be122186b/multi_synapse/app/shard.py (L114-L121))
without us having a handle to the homeserver yet so we can't call
`hs.shutdown()` and clean everything up.
### What does this PR do?
Structures Synapse so we split creating the homeserver instance from
setting everything up. This way we have access to `hs` if anything goes
wrong during setup and can subsequently `hs.shutdown()` to clean
everything up.
This was originally removed in
https://github.com/element-hq/synapse/pull/18886 but it looks like it
snuck back in https://github.com/element-hq/synapse/pull/18828 during a
[bad
merge](4cd3d9172e).
Noticed while looking at Synapse setup and startup (just by happen
stance).
I don't think this has adverse effects on Synapse actually working and
`start_background_tasks()` can be called multiple times.
### Is there a good way to audit all of these merges?
As I would like to see the conflicts for each merge.
This works but it's still hard to notice anything is wrong:
```
git log --remerge-diff <commit-sha>
```
> shows the difference from mechanical merge result and the result that
is actually recorded in a merge commit
via
https://stackoverflow.com/questions/15277708/how-do-you-see-show-a-git-merge-conflict-resolution-that-was-done-given-a-mer/71181334#71181334
The following better. Specify the version range to the commit right
before the merge to the merge. And can even specify which file to look
at to make it more obvious with the hindsight we have now.
```
git log --remerge-diff <merge-commit-sha>~1..<merge-commit-sha> -- synapse/server.py
```
Example:
```
git log --remerge-diff 4cd3d9172ed7b87e509746851a376c861a27820e~1..4cd3d9172ed7b87e509746851a376c861a27820e -- synapse/server.py
```
See https://github.com/matrix-org/synapse/pull/12973 where we previously
used `version_string="Synapse/" +
get_distribution_version_string("matrix-synapse")` everywhere; and then
updated to use `version_string=f"Synapse/{SYNAPSE_VERSION}"` for every
other place except `synapse/app/homeserver.py` (why?!?!?!). This seems
more like a typo than something on purpose especially without any
context in the comments or PR. The whole point of that PR was trying to
solve the missing git info in version strings.
For reference, here is what both variables look like for me locally on
the latest `develop`:
- `SYNAPSE_VERSION`: `1.139.0 (b=develop,1d2ddbc76e,dirty)`
- `VERSION`: `1.139.0`
Only reason we may want to do this is to hide the branch name (some
sensitive name that exposes a security fix, etc). But we don't hide
anything:
`https://matrix.org/_matrix/federation/v1/version`
```json
{
"server": {
"name": "Synapse",
"version": "1.139.0rc3 (b=matrix-org-hotfixes-priv,f538ed5ac3)"
}
}
```
On `matrix.org`, the `Server` response header is masked as `cloudflare`
which would otherwise show `1.139.0rc3` for everything from the main
process.
---
This is spawning from looking at the way we setup and start Synapse for
homeserver tenant provisioning in the Synapse Pro for Small Hosts
project (https://github.com/element-hq/synapse-small-hosts/issues/221)
Add debug logs wherever we change current logcontext (`LoggingContext`).
I've had to make this same set of changes over and over as I've been
debugging things so it seems useful enough to include by default.
Instead of tracing things at the `set_current_context(...)` level, I've
added the debug logging on all of the utilities that utilize
`set_current_context(...)`. It's much easier to reason about the log
context changing because of `PreserveLoggingContext` changing things
than an opaque `set_current_context(...)` call.
Revert https://github.com/element-hq/synapse/pull/18849
Go back to our custom `LogContextScopeManager` after trying
OpenTracing's `ContextVarsScopeManager`.
Fix https://github.com/element-hq/synapse/issues/19004
### Why revert?
For reference, with the normal reactor, `ContextVarsScopeManager` worked
just as good as our custom `LogContextScopeManager` as far as I can tell
(and even better in some cases). But since Twisted appears to not fully
support `ContextVar`'s, it doesn't work as expected in all cases.
Compounding things, `ContextVarsScopeManager` was causing errors with
the experimental `SYNAPSE_ASYNC_IO_REACTOR` option.
Since we're not getting the full benefit that we originally desired, we
might as well revert and figure out alternatives for extending the
logcontext lifetimes to support the use case we were trying to unlock
(c.f. https://github.com/element-hq/synapse/pull/18804).
See
https://github.com/element-hq/synapse/issues/19004#issuecomment-3358052171
for more info.
### Does this require backporting and patch releases?
No. Since `ContextVarsScopeManager` operates just as good with the
normal reactor and was only causing actual errors with the experimental
`SYNAPSE_ASYNC_IO_REACTOR` option, I don't think this requires us to
backport and make patch releases at all.
### Maintain cross-links between main trace and background process work
In order to maintain the functionality introduced in https://github.com/element-hq/synapse/pull/18932 (cross-links between the background process trace and currently active trace), we also needed a small change.
Previously, when we were using `ContextVarsScopeManager`, it tracked the tracing scope across the logcontext changes without issue. Now that we're using our own custom `LogContextScopeManager` again, we need to capture the active span from the logcontext before we reset to the sentinel context because of the `PreserveLoggingContext()` below.
Added some tests to ensure we maintain the `run_as_background` tracing behavior regardless of the tracing scope manager we use.
Prefer the utils over raw logcontext manipulation.
Spawning from adding some logcontext debug logs in
https://github.com/element-hq/synapse/pull/18966 and since we're not
logging at the `set_current_context(...)` level (see reasoning there),
this removes some usage of `set_current_context(...)`.
Spawning from adding some logcontext debug logs in
https://github.com/element-hq/synapse/pull/18966 and since we're not
logging at the `set_current_context(...)` level (see reasoning there),
this removes some usage of `set_current_context(...)`.
Specifically, `MockClock.call_later(...)` doesn't handle logcontexts
correctly. It uses the calling logcontext as the callback context
(wrong, as the logcontext could finish before the callback finishes) and
it didn't reset back to the sentinel context before handing back to the
reactor. It was like this since it was [introduced 10+ years
ago](38da9884e7).
Instead of fixing the implementation which would just be a copy of our
normal `Clock`, we can just remove `MockClock`
### Background
As part of Element's plan to support a light form of vhosting (virtual
host) (multiple instances of Synapse in the same Python process), we're
currently diving into the details and implications of running multiple
instances of Synapse in the same Python process.
"Per-tenant logging" tracked internally by
https://github.com/element-hq/synapse-small-hosts/issues/48
### Prior art
Previously, we exposed `server_name` by providing a static logging
`MetadataFilter` that injected the values:
205d9e4fc4/synapse/config/logger.py (L216)
While this can work fine for the normal case of one Synapse instance per
Python process, this configures things globally and isn't compatible
when we try to start multiple Synapse instances because each subsequent
tenant will overwrite the previous tenant.
### What does this PR do?
We remove the `MetadataFilter` and replace it by tracking the
`server_name` in the `LoggingContext` and expose it with our existing
[`LoggingContextFilter`](205d9e4fc4/synapse/logging/context.py (L584-L622))
that we already use to expose information about the `request`.
This means that the `server_name` value follows wherever we log as
expected even when we have multiple Synapse instances running in the
same process.
### A note on logcontext
Anywhere, Synapse mistakenly uses the `sentinel` logcontext to log
something, we won't know which server sent the log. We've been fixing up
`sentinel` logcontext usage as tracked by
https://github.com/element-hq/synapse/issues/18905
Any further `sentinel` logcontext usage we find in the future can be
fixed piecemeal as normal.
d2a966f922/docs/log_contexts.md (L71-L81)
### Testing strategy
1. Adjust your logging config to include `%(server_name)s` in the format
```yaml
formatters:
precise:
format: '%(asctime)s - %(server_name)s - %(name)s - %(lineno)d -
%(levelname)s - %(request)s - %(message)s'
```
1. Start Synapse: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Make some requests (`curl
http://localhost:8008/_matrix/client/versions`, etc)
1. Open the homeserver logs and notice the `server_name` in the logs as
expected. `unknown_server_from_sentinel_context` is expected for the
`sentinel` logcontext (things outside of Synapse).
Related to https://github.com/element-hq/synapse/issues/18905
These fixes were split off from
https://github.com/element-hq/synapse/pull/18828 where @devonh was
seeing some test failures because `timeout_deferred(...)` is being
updated to use `Clock` utilities instead of raw `reactor` methods. This
test was failing in that branch/PR until we made this new version that
handles the logcontexts properly.
While the previous version of this test does pass on `develop`, it was
using what appears completely wrong assertions, assumptions, and bad
patterns to make it happen (see diff comments below)
---
Test originally introduced in https://github.com/matrix-org/synapse/pull/4407
Before https://github.com/element-hq/synapse/pull/18849, we we're using
our own custom `LogContextScopeManager` which tied the tracing scope to
the `LoggingContext`. Since we created a new
`BackgroundProcessLoggingContext` any time we
`run_as_background_process(...)`, the trace for the background work was
separate from the trace that kicked of the work as expected (e.g.
request trace is separate from the background process we kicked to fetch
more messages from the federation).
Since we've now switched to the `ContextVarsScopeManager` (in
https://github.com/element-hq/synapse/pull/18849), the tracing scope now
crosses the `LoggingContext` boundaries (and thread boundaries) without
a problem. This means we end up with request traces that include all of
the background work that we've kicked off bloating the trace and making
it hard to understand what's going on.
This PR separates the traces again to how things were before.
Additionally, things are even better now since I added some cross-link
references between the traces to easily be able to jump between.
Follow-up to https://github.com/element-hq/synapse/pull/18849
---
In the before, you can see that the trace is blown up by the background
process (`bgproc.qwer`).
In the after, we now only have a little cross-link marker span
(`start_bgproc.qwer`) to jump to background process trace.
Before | After
--- | ---
<some image> | <some image>
### Testing strategy
1. Run a Jaeger instance
(https://www.jaegertracing.io/docs/1.6/getting-started/)
```shell
$ docker run -d --name jaeger \
-e COLLECTOR_ZIPKIN_HTTP_PORT=9411 \
-p 5775:5775/udp \
-p 6831:6831/udp \
-p 6832:6832/udp \
-p 5778:5778 \
-p 16686:16686 \
-p 14268:14268 \
-p 9411:9411 \
jaegertracing/all-in-one:1.59.0
```
1. Configure Synapse to use tracing:
`homeserver.yaml`
```yaml
## Tracing ##
opentracing:
enabled: true
jaeger_config:
sampler:
type: const
param: 1
logging:
false
```
1. Make sure the optional `opentracing` dependency is installed: `poetry
install --extras all`
1. In the `VersionsRestServlet`, modify it to kick off a dummy
background process (easy to test this way)
```python
from synapse.metrics.background_process_metrics import
run_as_background_process
async def _qwer() -> None:
await self.clock.sleep(1)
run_as_background_process("qwer", "test_server", _qwer)
```
1. Run Synapse: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Fire off a version requests: `curl
http://localhost:8008/_matrix/client/versions`
1. Visit http://localhost:16686/search to view the traces
- Select the correct service
- Look for the `VersionsRestServlet` operation
- Press 'Find Traces' button
- Select the relevant trace
- Notice how the trace isn't bloated
- Look for the `start_bgproc.qwer` span cross-linking to the background
process
- Jump to the other trace using the cross-link reference ->
`bgproc.qwer`
Spawning from
https://github.com/matrix-org/synapse/pull/12588#discussion_r865843321
> It turns out `Deferred.cancel()` is a lot like
`Deferred.callback()`/`errback()` in that it will trash the logging
context:
> it can resume a coroutine, which will restore its own logging context,
then run:
>
> - until it blocks, setting the sentinel context
> - or until it terminates, setting the context it was started with
>
> So we need to wrap it in `with PreserveLoggingContext():`, like we do
with `.callback()`:
>
> ```python
> with PreserveLoggingContext():
> self.render_deferred.cancel()
> ```
>
> *-- @squahtx,
https://github.com/matrix-org/synapse/pull/12588#discussion_r865843321*
Regressed in
https://github.com/element-hq/synapse/pull/18900#discussion_r2331554278
(see conversation there for more context)
### How is this a regression?
> To give this an update with more hindsight; this logic *was* redundant
with the early return and it is safe to remove this complexity
✅
>
> It seems like this actually has to do with completed vs incomplete
deferreds...
>
> To explain how things previously worked *without* the early-return
shortcut:
>
> With the normal case of **incomplete awaitable**, we store the
`calling_context` and the `f` function is called and runs until it
yields to the reactor. Because `f` follows the logcontext rules, it sets
the `sentinel` logcontext. Then in `run_in_background(...)`, we restore
the `calling_context`, store the current `ctx` (which is `sentinel`) and
return. When the deferred completes, we restore `ctx` (which is
`sentinel`) before yielding to the reactor again (all good
✅)
>
> With the other case where we see a **completed awaitable**, we store
the `calling_context` and the `f` function is called and runs to
completion (no logcontext change). *This is where the shortcut would
kick in but I'm going to continue explaining as if we commented out the
shortcut.* -- Then in `run_in_background(...)`, we restore the
`calling_context`, store the current `ctx` (which is same as the
`calling_context`). Because the deferred is already completed, our extra
callback is called immediately and we restore `ctx` (which is same as
the `calling_context`). Since we never yield to the reactor, the
`calling_context` is perfect as that's what we want again (all good
✅)
>
> ---
>
> But this also means that our early-return shortcut is no longer just
an optimization and is *necessary* to act correctly in the **completed
awaitable** case as we want to return with the `calling_context` and not
reset to the `sentinel` context. I've updated the comment in
https://github.com/element-hq/synapse/pull/18964 to explain the
necessity as it's currently just described as an optimization.
>
> But because we made the same change to
`run_coroutine_in_background(...)` which didn't have the same
early-return shortcut, we regressed the correct behavior ❌ . This is
being fixed in https://github.com/element-hq/synapse/pull/18964
>
>
> *-- @MadLittleMods,
https://github.com/element-hq/synapse/pull/18900#discussion_r2373582917*
### How did we find this problem?
Spawning from @wrjlewis
[seeing](https://matrix.to/#/!SGNQGPGUwtcPBUotTL:matrix.org/$h3TxxPVlqC6BTL07dbrsz6PmaUoZxLiXnSTEY-QYDtA?via=jki.re&via=matrix.org&via=element.io)
`Starting metrics collection 'typing.get_new_events' from sentinel
context: metrics will be lost` in the logs:
<details>
<summary>More logs</summary>
```
synapse.http.request_metrics - 222 - ERROR - sentinel - Trying to stop RequestMetrics in the sentinel context.
2025-09-23 14:43:19,712 - synapse.util.metrics - 212 - WARNING - sentinel - Starting metrics collection 'typing.get_new_events' from sentinel context: metrics will be lost
2025-09-23 14:43:19,713 - synapse.rest.client.sync - 851 - INFO - sentinel - Client has disconnected; not serializing response.
2025-09-23 14:43:19,713 - synapse.http.server - 825 - WARNING - sentinel - Not sending response to request <XForwardedForRequest at 0x7f23e8111ed0 method='POST' uri='/_matrix/client/unstable/org.matrix.simplified_msc3575/sync?pos=281963%2Fs929324_147053_10_2652457_147960_2013_25554_4709564_0_164_2&timeout=30000' clientproto='HTTP/1.1' site='8008'>, already dis
connected.
2025-09-23 14:43:19,713 - synapse.access.http.8008 - 515 - INFO - sentinel - 92.40.194.87 - 8008 - {@me:wi11.co.uk} Processed request: 30.005sec/-8.041sec (0.001sec, 0.000sec) (0.000sec/0.002sec/2) 0B 200! "POST /_matrix/client/unstable/org.matrix.simplified_msc3575/
```
</details>
From the logs there, we can see things relating to
`typing.get_new_events` and
`/_matrix/client/unstable/org.matrix.simplified_msc3575/sync` which led
me to trying out Sliding Sync with the typing extension enabled and
allowed me to reproduce the problem locally. Sliding Sync is a unique
scenario as it's the only place we use `gather_optional_coroutines(...)`
-> `run_coroutine_in_background(...)` (introduced in
https://github.com/element-hq/synapse/pull/17884) to exhibit this
behavior.
### Testing strategy
1. Configure Synapse to enable
[MSC4186](https://github.com/matrix-org/matrix-spec-proposals/pull/4186):
Simplified Sliding Sync which is actually under
[MSC3575](https://github.com/matrix-org/matrix-spec-proposals/pull/3575)
```yaml
experimental_features:
msc3575_enabled: true
```
1. Start synapse: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Make a Sliding Sync request with one of the extensions enabled
```http
POST
http://localhost:8008/_matrix/client/unstable/org.matrix.simplified_msc3575/sync
{
"lists": {},
"room_subscriptions": {
"!FlgJYGQKAIvAscfBhq:my.synapse.linux.server": {
"required_state": [],
"timeline_limit": 1
}
},
"extensions": {
"typing": {
"enabled": true
}
}
}
```
1. Open your homeserver logs and notice warnings about `Starting ...
from sentinel context: metrics will be lost`
Fix `no active span when trying to log` tracing error on startup.
Example error:
```log
synapse.logging.opentracing - 427 - ERROR - wake_destinations_needing_catchup-0 - There was no active span when trying to log. Did you forget to start one or did a context slip?
Stack (most recent call last):
File "/usr/lib/python3.13/threading.py", line 1014, in _bootstrap
self._bootstrap_inner()
File "/usr/lib/python3.13/threading.py", line 1043, in _bootstrap_inner
self.run()
File "/usr/lib/python3.13/threading.py", line 994, in run
self._target(*self._args, **self._kwargs)
File "python3.13/site-packages/twisted/_threads/_threadworker.py", line 75, in work
task()
File "python3.13/site-packages/twisted/_threads/_team.py", line 192, in doWork
task()
File "python3.13/site-packages/twisted/python/threadpool.py", line 269, in inContext
result = inContext.theWork() # type: ignore[attr-defined]
File "python3.13/site-packages/twisted/python/threadpool.py", line 285, in <lambda>
inContext.theWork = lambda: context.call( # type: ignore[attr-defined]
File "python3.13/site-packages/twisted/python/context.py", line 117, in callWithContext
return self.currentContext().callWithContext(ctx, func, *args, **kw)
File "python3.13/site-packages/twisted/python/context.py", line 82, in callWithContext
return func(*args, **kw)
File "python3.13/site-packages/twisted/enterprise/adbapi.py", line 282, in _runWithConnection
result = func(conn, *args, **kw)
File "synapse/synapse/storage/database.py", line 1094, in inner_func
return func(db_conn, *args, **kwargs)
File "synapse/synapse/storage/database.py", line 822, in new_transaction
opentracing.log_kv({"message": "commit"})
File "synapse/synapse/logging/opentracing.py", line 427, in ensure_active_span_inner_2
logger.error(
```
### Why did this happen before?
This previously occurred because we called `init_tracer(...)` after the
reactor started up in `_base.start()`. But we actually attempt some
database transactions earlier than that which try to do some tracing
because of that `oidc = hs.get_oidc_handler()` line.
Notice `oidc = hs.get_oidc_handler()` happened before `_base.start(hs)`:
5be7679dd9/synapse/app/homeserver.py (L397-L408)
With this PR, I've updated things to `init_tracer(...)` earlier on
alongside where we `setup_logging(...)`.
## Fix last seen timestamp in `/_synapse/admin/v2/users` response
Fixes#18955
The last seen timestamps contained in `/_synapse/admin/v2/users`
responses were computed as follows:
```sql
[...]
LEFT JOIN (
SELECT user_id, MAX(last_seen) AS last_seen_ts
FROM user_ips GROUP BY user_id
) ls ON u.name = ls.user_id
[...]
```
4367fb2d07/synapse/storage/databases/main/__init__.py (L302C1-L305C44)
This leads to empty timestamps (as in: user was never seen) if users are
inactive for longer than
[`user_ips_max_age`](https://element-hq.github.io/synapse/latest/usage/configuration/config_documentation.html#user_ips_max_age).
The fix is quite trivial: Use the `devices` table, as this one also
contains last seen timestamps but is *not* periodically purged.
We are using this for automatic user account deletion (via
[synadm](https://codeberg.org/synadm/synadm)) and the patched code works
as intended, whereas the unpatched version wants to delete users during
long vacations. 🫣