Evaluating Tantivy with Information Retrieval measures
With Sonar we are implementing search on peer-to-peer data structures. To do so efficiently and in the spirit of peer-to-peer concepts, we seek to implement distributed search indices while keeping system requirements low.
On our search for potential technologies we found Tantivy, a search engine whose indices are segmented. Segmented indices already bring us a big step closer to distributed indices. In addition, Tantivy is implemented in Rust and therefore light on system resources.
Another candidate for implementing search functionality would have been Lucene, which is well established and returns high-quality results. However, its requirement for a Java VM conflicts with our goal of keeping resource usage low.
As the choice of the underlying search engine is a very important one for the Sonar project, we tested Tantivy extensively in early 2019 to gain confidence in choosing it for the implementation of search functionality. The most important question for us was, and still is, whether the search results returned by Tantivy hold up to information retrieval quality measures, i.e. whether they are "good results". To measure this, we needed to compare Tantivy against a baseline. We chose Lucene as that baseline, as it is the reference point regarding both search quality and search performance.
The speed of querying Tantivy indices has already been evaluated. The results support our decision, as they show that searching a Tantivy index is as fast as or even faster than searching a Lucene index. To our knowledge, however, Tantivy has not yet been evaluated with regard to information retrieval measures.
So we set out to run an information retrieval evaluation of Tantivy in order to gain greater confidence in our technology decision. The code of our evaluation for Tantivy and Lucene can be found on GitHub.
In the following we describe the dataset and the search tasks we used.
Dataset
Sadly, there is still a lack of openly available datasets for information retrieval evaluation tasks.
So we chose the MOVIES dataset, to which we had access through the university context. Many thanks at this point to Prof. Hannah Bast, who provided us with this evaluation dataset. MOVIES is a collection of information about movies. The dataset contains nearly 190,000 files, each a simple text file with tab-separated <title> and <body> fields. One of these files looks, for example, like this:
Hipsters "A vibrant musical might not be what you'd expect from contemporary Russian cinema, but Valery Todorovsky's Hipsters is an Iron Curtain version of Swing Kids meets Hairspray, bursting with razzle, dazzle and, of course, rhythm. Christened with a name that stands for Marx-Engels-Lenin-Stalin, Communist youth Mels (Anton Shagin) is obviously primed to rebel. [...]
As part of our evaluation we extract this information and transform it into Tantivy and Lucene documents, respectively. For every text file, a document with the fields title, body and fulltext (title and body combined) is created.
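To illustrate this step, here is a minimal sketch of how such documents could be indexed with Tantivy. The field names match the description above, but this is only a sketch and not the exact code of our evaluation (which can be found in the repository linked above):

```rust
use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::{doc, Index};

fn main() -> tantivy::Result<()> {
    // Build a schema with the three fields described above.
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT | STORED);
    let body = schema_builder.add_text_field("body", TEXT | STORED);
    let fulltext = schema_builder.add_text_field("fulltext", TEXT);
    let schema = schema_builder.build();

    // For this sketch the index lives in RAM; the real evaluation writes it to disk.
    let index = Index::create_in_ram(schema);
    let mut writer = index.writer(50_000_000)?;

    // One document per text file: title, body and the combined fulltext.
    let (t, b) = ("Hipsters", "A vibrant musical might not be what you'd expect [...]");
    writer.add_document(doc!(
        title => t,
        body => b,
        fulltext => format!("{} {}", t, b)
    ));
    writer.commit()?;
    Ok(())
}
```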
The MOVIES dataset comes with a benchmark file which contains ten queries together with the associated document IDs, i.e. the relevant search results for each query. One line in the benchmark file looks like this:
films about spiders 4858 5908 14601 22926 25464 32538 38331 64378 92610 140836
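The following sketch shows one way such a benchmark line could be parsed into the query string and the set of relevant document IDs. It assumes the IDs are the trailing whitespace-separated numeric tokens, as in the line above; the helper is hypothetical and not taken from our evaluation code:

```rust
use std::collections::HashSet;

/// Split a benchmark line into the query text and its relevant document IDs.
/// Assumes the line ends with a run of numeric tokens (the IDs) and that the
/// query itself contains no purely numeric words -- a simplification.
fn parse_benchmark_line(line: &str) -> (String, HashSet<u64>) {
    let tokens: Vec<&str> = line.split_whitespace().collect();
    // Index of the first token that consists only of digits.
    let first_id = tokens
        .iter()
        .position(|t| t.chars().all(|c| c.is_ascii_digit()))
        .unwrap_or(tokens.len());
    let query = tokens[..first_id].join(" ");
    let ids = tokens[first_id..]
        .iter()
        .filter_map(|t| t.parse().ok())
        .collect();
    (query, ids)
}
```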
Evaluation
For every query three search tasks were evaluated with Lucene and Tantivy:
1) Searching only in the title field
2) Searching only in the body field
3) Fulltext search in the combined text of title and body.
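As an illustration of these tasks, a single-field search in Tantivy could look roughly like the sketch below, which restricts the query parser to one field and retrieves at most 100 results as in our evaluation. It continues the indexing sketch above and is written against a recent Tantivy API, so details may differ slightly from the 0.10 version we tested:

```rust
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::Field;
use tantivy::{DocAddress, Index};

// `index` and `field` are assumed to come from the indexing sketch above.
// For task 1) `field` would be the title field, for 2) the body field,
// and for 3) the fulltext field.
fn search_field(index: &Index, field: Field, query_str: &str)
    -> tantivy::Result<Vec<DocAddress>> {
    let reader = index.reader()?;
    let searcher = reader.searcher();
    let query_parser = QueryParser::for_index(index, vec![field]);
    let query = query_parser.parse_query(query_str)?;
    // Retrieve at most 100 results, as in our evaluation.
    let top_docs = searcher.search(&query, &TopDocs::with_limit(100))?;
    Ok(top_docs.into_iter().map(|(_score, addr)| addr).collect())
}
```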
We decided to use MP@3 (Mean Precision at 3), MP@R (Mean Precision at R, where R is the number of relevant document IDs in our benchmark file) and MAP (Mean Average Precision) to measure our results. These measures capture different ratios of how many of the relevant documents are included in the search results. The closer these measures are to 1 (or 100 %), the better. For an in-depth explanation of these measures, please refer to the slides of the Stanford Introduction to Information Retrieval course chapter on evaluation.
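For completeness, the sketch below shows how Precision@k and Average Precision can be computed for a single query from the ranked result IDs and the set of relevant IDs; averaging these values over the ten benchmark queries then yields MP@k and MAP. This is a generic illustration of the standard definitions, not our exact evaluation code:

```rust
use std::collections::HashSet;

/// Precision@k: the fraction of the top k results that are relevant.
fn precision_at_k(results: &[u64], relevant: &HashSet<u64>, k: usize) -> f64 {
    if k == 0 {
        return 0.0;
    }
    let hits = results
        .iter()
        .take(k)
        .filter(|id| relevant.contains(*id))
        .count();
    hits as f64 / k as f64
}

/// Average Precision: the sum of Precision@i over all ranks i at which a
/// relevant document was retrieved, normalized by the total number of
/// relevant documents.
fn average_precision(results: &[u64], relevant: &HashSet<u64>) -> f64 {
    if relevant.is_empty() {
        return 0.0;
    }
    let mut hits = 0usize;
    let mut sum = 0.0;
    for (rank, id) in results.iter().enumerate() {
        if relevant.contains(id) {
            hits += 1;
            sum += hits as f64 / (rank + 1) as f64;
        }
    }
    sum / relevant.len() as f64
}
```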
We computed the aforementioned information retrieval measures for Tantivy (0.10.1) and Lucene (8.2.2) with a limit of 100 search results and got the following results:
Lucene
| Task | MP@3 | MP@R | MAP |
|---|---|---|---|
| MOVIES (Title) | 0.134 | 0.065 | 0.039 |
| MOVIES (Body) | 0.434 | 0.241 | 0.250 |
| MOVIES (Fulltext) | 0.500 | 0.264 | 0.237 |
Tantivy
| Task | MP@3 | MP@R | MAP |
|---|---|---|---|
| MOVIES (Title) | 0.100 | 0.066 | 0.039 |
| MOVIES (Body) | 0.367 | 0.282 | 0.291 |
| MOVIES (Fulltext) | 0.500 | 0.294 | 0.283 |
Conclusion
For this specific dataset, Lucene and Tantivy return comparable results on the information retrieval measures. With respect to MP@R and MAP, Tantivy's search results are even slightly better. The fulltext task returns the best results for MP@3 and MP@R, which was to be expected.
In summary, although we used a very specific dataset with only ten queries, Tantivy appears to be a good choice. Tantivy's speed is very good, and our preliminary information retrieval evaluation gives us reason to conclude that Tantivy returns search results that are as good as Lucene's.
Over the last months we focused our work on the architecture and base functionality of Sonar. After we made the decision to use Tantivy, we implemented basic search functionality for Sonar and switched our focus to other parts, e.g. the Dat layer and the UI. In the next months we will concentrate on improving the search functionality and our integration of Tantivy. We also hope to continue our work on evaluating Tantivy and Sonar with regard to information retrieval measures.