Full-text search in Sonar

In this post I'll outline how we do full-text search in Sonar. Sonar is a peer-to-peer database that integrates a full-text search engine.

The peer-to-peer database is based on the Kappa architecture. Our basic model for a set of content is an Island: a set of Hypercore feeds that are replicated together. Through kappa-core, these feeds are indexed as a set. The first feature we added here was a rewrite of kappa-core to be able to deal with different kinds of sources, which allowed us to add kappa-sparse-indexer to the mix. The sparse indexer takes a set of hypercores and creates a log of all blocks in all the cores, providing a local, total ordering of all messages in an island. It also assigns a unique lseq number to each message, making it easy to identify a single message within the local context.
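Conceptually, the indexer's output looks something like the following sketch. This is an illustration of the idea, not the actual kappa-sparse-indexer API; the field names (`key`, `seq`, `lseq`) are assumptions:

```javascript
// Sketch: interleave blocks from several feeds into one local log,
// assigning each message a monotonically increasing lseq number.
function createLocalLog () {
  let lseq = 0
  const log = []
  return {
    // Record that block `seq` of feed `key` arrived, in arrival order.
    append (key, seq, value) {
      log.push({ lseq: lseq++, key, seq, value })
    },
    read: () => log
  }
}

const log = createLocalLog()
log.append('feedA', 0, 'hello')
log.append('feedB', 0, 'world')
log.append('feedA', 1, '!')
console.log(log.read().map(m => m.lseq)) // [ 0, 1, 2 ]
```

The lseq is only meaningful locally: two peers will generally interleave the same feeds in a different order, so lseq identifies a message within one peer's log, not across the network.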

This combined log of all feeds in an island is then fed into kappa-core, from where the messages flow into the different views mounted on the kappa core. A view is a function that receives each message exactly once, with its indexing state tracked by the kappa core. Views typically build secondary indexes out of the stream of messages.
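A minimal sketch of such a view, assuming a simplified message shape (`{ lseq, id, value }`) rather than the real kappa-core view interface:

```javascript
// Sketch of a kappa-style view: a map function that sees each batch of
// messages once and maintains a secondary index (here: record id ->
// known versions). The kappa core would track how far this view has
// processed the log and call map with new batches as they arrive.
function createRecordView () {
  const byId = new Map()
  return {
    map (messages, done) {
      for (const msg of messages) {
        const versions = byId.get(msg.id) || []
        versions.push(msg)
        byId.set(msg.id, versions)
      }
      done()
    },
    get: id => byId.get(id) || []
  }
}

const view = createRecordView()
view.map([{ lseq: 0, id: 'a', value: { title: 'Hello' } }], () => {})
console.log(view.get('a').length) // 1
```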

In Sonar, we have a few simple views that use LevelDB for storage. They track a global history and implement our CRDT. CRDT stands for "conflict-free replicated data type", a strategy for dealing with updates that happen out of sync. Sonar employs a very simple CRDT based on unordered materialized views and the concept of backlinks. It works like this: whenever you update a record, the update includes a list of the most recent versions of that record that you know about. Other views then use these links to delete or invalidate the previous, now-overwritten versions of the record.
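The backlink rule can be sketched in a few lines. This is a simplified illustration of the idea, with assumed field names (`version`, `links`), not Sonar's actual implementation:

```javascript
// Sketch of the backlink-based CRDT: each update carries links to the
// versions it supersedes. The materialized view keeps only the "heads",
// i.e. versions that no other known version links to.
function materialize (messages) {
  const superseded = new Set()
  for (const msg of messages) {
    for (const link of msg.links || []) superseded.add(link)
  }
  return messages.filter(msg => !superseded.has(msg.version))
}

const messages = [
  { version: 'feedA@1', id: 'rec1', value: { title: 'draft' }, links: [] },
  { version: 'feedA@2', id: 'rec1', value: { title: 'final' }, links: ['feedA@1'] }
]
console.log(materialize(messages).map(m => m.value.title)) // [ 'final' ]
```

Note what happens when two peers update the same record concurrently: both updates link to the same previous version, neither links to the other, and so both survive as heads. The conflict is kept visible rather than silently resolved.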

The full-text search works in exactly the same way. Because we want our search to be fast, we need a fast search engine, and we also want one that's easy to integrate and run without additional setup. For these reasons we picked Tantivy, a full-text search engine written in Rust. To integrate Tantivy with Sonar, we compile the tantivy library into a custom binary called sonar-tantivy. This binary is built automatically for different architectures and operating systems, and the result is published as a downloadable artifact on GitHub. When installing Sonar, the correct binary for your operating system and architecture is downloaded automatically from GitHub releases. sonar-tantivy includes a wrapper module written in Node.js that starts the Rust binary and talks to it over a simple RPC protocol.
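To give a feel for this kind of integration, here is a minimal sketch of a newline-delimited JSON RPC client, the sort of framing a Node.js wrapper could use to talk to a child process over stdin/stdout. The message shape and method names are assumptions for illustration, not the actual sonar-tantivy protocol:

```javascript
// One JSON object per line ("newline-delimited JSON").
function encode (msg) {
  return JSON.stringify(msg) + '\n'
}

// Incremental decoder: feed it chunks (e.g. from the child's stdout);
// it buffers partial lines and emits complete messages.
function createDecoder (onMessage) {
  let buf = ''
  return chunk => {
    buf += chunk
    let idx
    while ((idx = buf.indexOf('\n')) !== -1) {
      const line = buf.slice(0, idx)
      buf = buf.slice(idx + 1)
      if (line.trim()) onMessage(JSON.parse(line))
    }
  }
}

// Pair requests with responses via an id counter.
class RpcClient {
  constructor (write) {
    this.write = write
    this.pending = new Map()
    this.id = 0
    this.decode = createDecoder(msg => {
      const cb = this.pending.get(msg.id)
      if (cb) {
        this.pending.delete(msg.id)
        cb(msg.result)
      }
    })
  }
  request (method, params, cb) {
    const id = ++this.id
    this.pending.set(id, cb)
    this.write(encode({ id, method, params }))
  }
}

// Demo with a loopback "server" that answers every query immediately.
const client = new RpcClient(data => {
  const req = JSON.parse(data)
  client.decode(encode({ id: req.id, result: { hits: [] } }))
})
client.request('query', { index: 'textdump', query: 'sonar' }, result => {
  console.log(result.hits.length) // 0
})
```

In the real setup, `write` would go to the child process's stdin and the decoder would consume its stdout.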

Now, in sonar-core, we define a kappa view that uses the sonar-tantivy wrapper module to push new messages in an island into the full-text search. Currently, schema management around this is quite basic. There's a catch-all full-text index where all string fields of a message are appended together into a single field. This "textdump" index is what's used when you type into the search bar in the Sonar UI. It's nice because it just works without setting up a schema for your records. We also create a separate index for each schema that has the index property set. At the moment, we're developing a more powerful schema API that will allow you to declaratively define indexing properties for individual fields. Once we merge this, you will be able to set properties for each field in a schema definition, e.g. whether the field should be indexed as a facet. This will make it possible to filter the search view by these facets. Follow this tracking issue for details.
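Since the schema API is still under development, the following is purely hypothetical: a guess at what a per-field indexing definition might look like. The property names (`fulltext`, `facet`, `basic`) are assumptions, not the final API:

```javascript
// Hypothetical schema definition with per-field indexing properties.
const eventSchema = {
  name: 'event',
  fields: {
    title: { type: 'string', index: { fulltext: true } }, // searchable text
    category: { type: 'string', index: { facet: true } }, // filterable facet
    date: { type: 'date', index: { basic: true } }        // plain stored field
  }
}
console.log(eventSchema.fields.category.index.facet) // true
```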

The interesting part is that the search index runs on each instance of Sonar, consuming both schemas and data records over the peer-to-peer network and live-indexing them into the full-text search. This is usually very fast. There's work going on in Tantivy to make it even faster by supporting in-memory segments. Tantivy's architecture is already very friendly towards this model, as the search index is constructed out of immutable segments of data.