Replace PyICU with Rust icu_segmenter crate (#18553)

Co-authored-by: anoa's Codex Agent <codex@amorgan.xyz>
Co-authored-by: Quentin Gliech <quenting@element.io>
This commit is contained in:
Andrew Morgan
2025-07-03 11:12:12 +01:00
committed by GitHub
parent 832690e746
commit be4c95baf1
15 changed files with 70 additions and 136 deletions

View File

@@ -29,8 +29,6 @@ easiest way of installing the latest version is to use [rustup](https://rustup.r
Synapse can connect to PostgreSQL via the [psycopg2](https://pypi.org/project/psycopg2/) Python library. Building this library from source requires access to PostgreSQL's C header files. On Debian or Ubuntu Linux, these can be installed with `sudo apt install libpq-dev`.
Synapse has an optional, improved user search with better Unicode support. For that you need the development package of `libicu`. On Debian or Ubuntu Linux, this can be installed with `sudo apt install libicu-dev`.
The source code of Synapse is hosted on GitHub. You will also need [a recent version of git](https://github.com/git-guides/install-git).
For some tests, you will need [a recent version of Docker](https://docs.docker.com/get-docker/).

View File

@@ -164,10 +164,7 @@ $ poetry cache clear --all .
# including the wheel artifacts which is not covered by the above command
# (see https://github.com/python-poetry/poetry/issues/10304)
#
# This is necessary in order to rebuild or fetch new wheels. For example, if you update
# the `icu` library in on your system, you will need to rebuild the PyICU Python package
# in order to incorporate the correct dynamically linked library locations otherwise you
# will run into errors like: `ImportError: libicui18n.so.75: cannot open shared object file: No such file or directory`
# This is necessary in order to rebuild or fetch new wheels.
$ rm -rf $(poetry config cache-dir)
```

View File

@@ -286,7 +286,7 @@ Installing prerequisites on Ubuntu or Debian:
```sh
sudo apt install build-essential python3-dev libffi-dev \
python3-pip python3-setuptools sqlite3 \
libssl-dev virtualenv libjpeg-dev libxslt1-dev libicu-dev
libssl-dev virtualenv libjpeg-dev libxslt1-dev
```
##### ArchLinux
@@ -295,7 +295,7 @@ Installing prerequisites on ArchLinux:
```sh
sudo pacman -S base-devel python python-pip \
python-setuptools python-virtualenv sqlite3 icu
python-setuptools python-virtualenv sqlite3
```
##### CentOS/Fedora
@@ -305,8 +305,7 @@ Installing prerequisites on CentOS or Fedora Linux:
```sh
sudo dnf install libtiff-devel libjpeg-devel libzip-devel freetype-devel \
libwebp-devel libxml2-devel libxslt-devel libpq-devel \
python3-virtualenv libffi-devel openssl-devel python3-devel \
libicu-devel
python3-virtualenv libffi-devel openssl-devel python3-devel
sudo dnf group install "Development Tools"
```
@@ -333,7 +332,7 @@ dnf install python3.12 python3.12-devel
```
Finally, install common prerequisites
```bash
dnf install libicu libicu-devel libpq5 libpq5-devel lz4 pkgconf
dnf install libpq5 libpq5-devel lz4 pkgconf
dnf group install "Development Tools"
```
###### Using venv module instead of virtualenv command
@@ -365,20 +364,6 @@ xcode-select --install
Some extra dependencies may be needed. You can use Homebrew (https://brew.sh) for them.
You may need to install icu, and make the icu binaries and libraries accessible.
Please follow [the official instructions of PyICU](https://pypi.org/project/PyICU/) to do so.
If you're struggling to get icu discovered, and see:
```
RuntimeError:
Please install pkg-config on your system or set the ICU_VERSION environment
variable to the version of ICU you have installed.
```
despite it being installed and having your `PATH` updated, you can omit this dependency by
not specifying `--extras all` to `poetry`. If using postgres, you can install Synapse via
`poetry install --extras saml2 --extras oidc --extras postgres --extras opentracing --extras redis --extras sentry`.
ICU is not a hard dependency on getting a working installation.
On ARM-based Macs you may also need to install libjpeg and libpq:
```sh
brew install jpeg libpq
@@ -400,8 +385,7 @@ Installing prerequisites on openSUSE:
```sh
sudo zypper in -t pattern devel_basis
sudo zypper in python-pip python-setuptools sqlite3 python-virtualenv \
python-devel libffi-devel libopenssl-devel libjpeg62-devel \
libicu-devel
python-devel libffi-devel libopenssl-devel libjpeg62-devel
```
##### OpenBSD

View File

@@ -117,6 +117,13 @@ each upgrade are complete before moving on to the next upgrade, to avoid
stacking them up. You can monitor the currently running background updates with
[the Admin API](usage/administration/admin_api/background_updates.html#status).
# Upgrading to v1.134.0
## ICU bundled with Synapse
Synapse now uses the Rust `icu` library for improved user search. Installing the
native ICU library on your system is no longer required.
# Upgrading to v1.130.0
## Documented endpoint which can be delegated to a federation worker

View File

@@ -77,14 +77,11 @@ The user provided search term is lowercased and normalized using [NFKC](https://
this treats the string as case-insensitive, canonicalizes different forms of the
same text, and maps some "roughly equivalent" characters together.
The search term is then split into words:
* If [ICU](https://en.wikipedia.org/wiki/International_Components_for_Unicode) is
available, then the system's [default locale](https://unicode-org.github.io/icu/userguide/locale/#default-locales)
will be used to break the search term into words. (See the
[installation instructions](setup/installation.md) for how to install ICU.)
* If unavailable, then runs of ASCII characters, numbers, underscores, and hyphens
are considered words.
The search term is then split into segments using the [`icu_segmenter`
Rust crate](https://crates.io/crates/icu_segmenter). This crate ships with its
own dictionary and Long Short Term-Memory (LSTM) machine learning models
per-language to segment words. Read more [in the crate's
documentation](https://docs.rs/icu/latest/icu/segmenter/struct.WordSegmenter.html#method.new_auto).
The queries for PostgreSQL and SQLite are detailed below, but their overall goal
is to find matching users, preferring users who are "real" (e.g. not bots,