![Spider-Man-Across-the-Spider-Verse-Spider-People-scaled](https://github.com/meilisearch/meilisearch/assets/3610253/34669999-a4eb-403c-b468-f3ec82025902)
Meilisearch is currently exploring the semantic search universe. A promising era of search is unfolding right before us. Semantic search unlocks new paradigms, and understanding both the documents and the user query is a game changer. While it may seem easy to aggregate vectors in a store and retrieve the nearest neighbors for a query, much more can be done. This is only the first step of the wonderful journey that awaits us.
## Where are we?
We just [released the v1.3.0-rc.0 of Meilisearch with the vector store experimental feature](https://github.com/meilisearch/meilisearch/releases/tag/v1.3.0-rc.0). It can do what most of the other vector stores on the market do: store documents associated with a vector and return the nearest ones based on another vector. We built it in a week or so. We will polish it and ship it in the final v1.3 release as an experimental feature. We would also like to provide this experimental feature on the cloud and make it easier to compute your vectors by plugging Meilisearch into OpenAI and Hugging Face.
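To make that concrete, here is a minimal sketch of the workflow over plain HTTP, assuming the API shape of the v1.3 release candidate: a `_vectors` field on documents and a `vector` parameter at search time. The index name, API key, and embeddings are placeholders, and the exact field names may still change while the feature is experimental.

```python
# Minimal sketch of the experimental vector workflow, assuming the v1.3-rc API
# shape (`_vectors` on documents, `vector` at search time) and an instance with
# the experimental vector store enabled. Values below are placeholders.
import requests

MEILI_URL = "http://localhost:7700"
HEADERS = {"Authorization": "Bearer MASTER_KEY"}

# 1. Index documents that carry their own embedding in `_vectors`.
documents = [
    {"id": 1, "name": "DeLonghi coffee machine", "_vectors": [0.12, 0.87, 0.33]},
    {"id": 2, "name": "Nespresso coffee machine", "_vectors": [0.10, 0.91, 0.30]},
]
requests.post(f"{MEILI_URL}/indexes/products/documents", json=documents, headers=HEADERS)

# 2. Search with the embedding of the user's query and get the nearest documents.
query_embedding = [0.11, 0.89, 0.31]
response = requests.post(
    f"{MEILI_URL}/indexes/products/search",
    json={"vector": query_embedding, "limit": 5},
    headers=HEADERS,
)
print(response.json()["hits"])
```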
Exposing vector database capabilities unlocks a lot of potential use cases. People will be able to use Meilisearch to create conversational chatbots. We did a fun experiment: feeding Meilisearch our private Notion pages and asking the bot about internal processes [using LangChain](https://github.com/hwchase17/langchain). The results were stunning!
There are a lot of other interesting use cases, specifically around raw search. Since Meilisearch can return the nearest documents based on the query's vector, one can mix that similarity score with the [soon-to-be-released ranking score](https://github.com/meilisearch/product/discussions/489#discussioncomment-6153006) of the keyword search ranking rules. It is one of the paths we can take toward hybrid search. Not necessarily the best one, but it works for many simple use cases. We did some experiments with an e-commerce dataset, and the results have been pretty good so far!
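As a rough illustration of that mixing idea, here is a naive client-side blend of the two scores. The `_rankingScore` and `_semanticScore` hit fields and the 50/50 weights are assumptions made for the sketch, not a committed API.

```python
# Naive client-side blend of keyword and semantic scores, as a stand-in for
# real hybrid search. Assumes every hit exposes a keyword `_rankingScore` and
# a semantic `_semanticScore`, both in [0, 1]; field names and weights are
# illustrative only.
def blend(keyword_hits, semantic_hits, alpha=0.5):
    scores = {}
    for hit in keyword_hits:
        scores[hit["id"]] = alpha * hit.get("_rankingScore", 0.0)
    for hit in semantic_hits:
        scores[hit["id"]] = scores.get(hit["id"], 0.0) + (1 - alpha) * hit.get("_semanticScore", 0.0)
    # Highest combined score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```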
## Where do we want to go?
There are plenty of roads we can take. Meilisearch is an excellent keyword search engine, one of the best on the market. We created tools to measure our engine's relevancy in terms of precision and recall. This ensures we keep improving our results and avoid regressions across releases. We plan to publish a blog post explaining the methodology and why we have better relevancy than the competition on keyword-based queries, e.g., _blue shirt_ or _delonghi coffee machine_.
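Our benchmark tooling itself is not shown here, but the two metrics boil down to the classical ratios below, computed over a set of queries whose relevant documents are known in advance.

```python
# Classical precision@k and recall@k, computed per query against the set of
# document ids judged relevant for that query.
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Share of the top-k results that are actually relevant."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Share of all relevant documents that show up in the top-k results."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(relevant_ids)
```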
However, we plan to extend this TREC-COVID test suite with question-based queries, not only keyword ones. This is where Meilisearch lacks semantic understanding, e.g., _What's the capital of France?_. By mixing semantic search with keyword search, we can make sure that questioning the engine works and that it does not waste effort searching for meaningless words, e.g., _What_ or _the_. However, mixing the different scores doesn't play well with the ranking rules. Those rules let the user define how relevancy and custom sorting are ordered relative to each other. When you sort products by ascending price, you still want the most relevant ones first, even if they are more expensive.
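To picture where sorting sits, here is how the ranking rules are ordered today. With the default order below, `sort` only applies after the relevancy-oriented rules, so a cheap but irrelevant product cannot jump above a relevant one. The index name, API key, and `price` attribute are placeholders.

```python
# Sketch of configuring the ranking rules and running a sorted search over
# plain HTTP. `sort` sits after the relevancy rules in the default order, so
# relevancy wins first and the ascending price sort only breaks the remaining ties.
import requests

MEILI_URL = "http://localhost:7700"
HEADERS = {"Authorization": "Bearer MASTER_KEY"}

requests.put(
    f"{MEILI_URL}/indexes/products/settings/ranking-rules",
    json=["words", "typo", "proximity", "attribute", "sort", "exactness"],
    headers=HEADERS,
)
# `price` must also be declared as a sortable attribute.
requests.put(
    f"{MEILI_URL}/indexes/products/settings/sortable-attributes",
    json=["price"],
    headers=HEADERS,
)

# A search can then ask for an ascending sort on price.
requests.post(
    f"{MEILI_URL}/indexes/products/search",
    json={"q": "delonghi coffee machine", "sort": ["price:asc"]},
    headers=HEADERS,
)
```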
Furthermore, semantic search doesn't handle typos well, nor prefix search, also known as search-as-you-type queries. When an e-commerce user is typing their query in the search bar, the last word is only partially written. The keyword-based Meilisearch handles that very well, but the semantic version does not. In our relevancy benchmark, a partial query like _machine del_ gives bad results with semantic search: it interprets _del_ as the computer brand, not as the prefix of _coffee machine delonghi_. On the other hand, the keyword version finds occurrences of _machine_ near _DeLonghi_, which shows in the relevancy scores.
![Meilisearch's Where to Watch demo](https://raw.githubusercontent.com/meilisearch/meilisearch/main/assets/demo-dark.gif)
We will explore different ways to mix them. We can run the semantic search when the query is lengthy and the keyword-based search on short queries. We could also **increase the recall** with query rewriting: fetching the nearest document, extracting its keywords, and using those keywords as an alternative query. Extracting the most important terms of the document lets us fetch more documents on the same subject: not necessarily better ones, but more of them. In the same vein, we would like to explore automatic synonyms.
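Here is a toy version of that recall-boosting idea, assuming we already fetched the nearest document for the query's embedding. The frequency-based extraction is a deliberately crude stand-in for a real keyword-extraction method such as TF-IDF or RAKE.

```python
# Crude query rewriting: pull the most frequent non-stopword terms out of the
# nearest document and use them as an alternative keyword query.
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "is", "what", "for", "with"}

def extract_keywords(text, n=3):
    words = [w.lower().strip(".,?!") for w in text.split()]
    words = [w for w in words if w and w not in STOPWORDS]
    return [word for word, _ in Counter(words).most_common(n)]

def rewrite_query(nearest_document_text):
    # e.g. "DeLonghi Magnifica automatic coffee machine" -> "delonghi magnifica automatic"
    return " ".join(extract_keywords(nearest_document_text))
```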
We can also mix the ranking rules system with semantic search. We will experiment with exposing a new `semanticBoosting` ranking rule that decides whether a document semantically matches the query and moves it up or down depending on the score. This ranking rule **increases the precision** by increasing the number of interesting documents on the first page and moving the others down.
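None of this exists yet, but a sketch helps picture what such a rule could do: documents whose embedding is close enough to the query embedding move to the front, and the rest keep their keyword-based order. The cosine similarity and the 0.8 threshold are illustrative choices, not the actual design.

```python
# Speculative sketch of a semantic boosting step. `hits` are assumed to arrive
# already ordered by the previous ranking rules and to carry their `_vectors`
# embedding; the stable sort keeps that order within each bucket.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_boost(hits, query_embedding, threshold=0.8):
    # Documents above the similarity threshold (key == False) come first.
    return sorted(hits, key=lambda hit: cosine(hit["_vectors"], query_embedding) < threshold)
```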
Another interesting way to use semantic understanding is to refine filters. By understanding negation, we could move unrelated documents down, either by relying on the `semanticBoosting` ranking rule, by adding new filters, or at least by proposing those filters to the user via a nice UI. Unfortunately, we got poor results with the OpenAI ada-002 text embedding model when searching an e-commerce dataset with negations, e.g., _non-ASUS computer_.
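One possible shape for the "propose a filter" idea is to detect a simple negation pattern and translate it into a filter the user can confirm in the UI. The parsing below is deliberately naive, and the `brand` attribute is a placeholder.

```python
# Turn a naive "non-X ..." pattern into a keyword query plus a filter suggestion.
import re

def negation_to_filter(query):
    match = re.search(r"\bnon-(\w+)\s+(.+)", query, re.IGNORECASE)
    if not match:
        return query, None
    excluded, rest = match.group(1), match.group(2)
    return rest, f"brand != '{excluded}'"

print(negation_to_filter("non-ASUS computer"))  # ("computer", "brand != 'ASUS'")
```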
## Technical Terms
In the following examples, I want you to imagine, very deeply, that you have an e-commerce dataset with 100 _DeLonghi machines_ and 200 _Nespresso coffee machines_.
**The recall** is the share of the dataset's interesting documents that the engine finds for a given query. If you search for _coffee_ with Meilisearch's keyword search, you will likely find the 200 _Nespresso coffee machines_ and not a single _DeLonghi machine_, because the keyword _coffee_ appears in the Nespresso documents but not in the DeLonghi ones. However, if you use Meilisearch's semantic search, you will probably get all 300 coffee machines, and probably even more stuff. In this example, semantic search greatly increases the recall.
**The precision** is the share of interesting documents among the results the engine puts on the first pages. If you search for _delonghi_ this time with Meilisearch's keyword search, you'll find exactly what you want: no more than the 100 _DeLonghi machines_. However, if you use Meilisearch's semantic search, you will get all 300 coffee machines in your dataset. There is a high chance the engine will mix the Nespresso and DeLonghi ones; the distance at which it finds those documents is quite useful, but it will hardly be as good as a keyword-based search here.
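Plugging the toy numbers above into the classical ratios, and assuming all 300 machines count as relevant for _coffee_ but only the 100 DeLonghi ones for _delonghi_:

```python
# Recall for "coffee": how many of the 300 relevant machines come back.
print("keyword  recall  =", 200 / 300)  # ~0.67: only the Nespresso documents match
print("semantic recall  =", 300 / 300)  # 1.0: every coffee machine is nearby

# Precision for "delonghi": how many of the returned machines are DeLonghi ones.
print("keyword  precision =", 100 / 100)  # 1.0: exactly the 100 DeLonghi machines
print("semantic precision =", 100 / 300)  # ~0.33: Nespresso machines mixed in
```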
To be continued...
[_could have been written with the help of an LLM_]