How to bring ML to Search?

If you use the leading search engines a lot, you have probably noticed that they haven’t improved much over the years. They seem stuck in the 2000s, just with more spam than back then. One way they could improve is by using Machine Learning (ML) to process search results. Some of the companies involved have strong ML and AI teams, so why not use that know-how to improve their search engines? In this post I want to talk about how to bring ML to search engines.

One technique from ML that could be used is Text Summarization, which produces a short summary of an arbitrary text. Today, page previews on search engines are simply snippets taken from around the occurrences of the search terms on the page. With Text Summarization, the relevant parts of the page would instead be summarized for the user and displayed under the page link. I believe this would make it easier to understand what results are on offer and would improve the overall user experience.
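As a rough sketch of what this could look like, the snippet below uses the Hugging Face transformers summarization pipeline to generate such a preview. The default model, the truncation limit, and the length settings are illustrative assumptions, not production choices.

```python
# Minimal sketch: generate an abstractive summary to show under a result link,
# instead of a raw keyword snippet. Assumes the Hugging Face `transformers`
# library with its default summarization model; lengths are illustrative.
from transformers import pipeline

summarizer = pipeline("summarization")  # downloads a default summarization model

def preview_for_result(page_text: str) -> str:
    """Summarize page text for display under a search result link."""
    # Real pages are long; a production system would first select the passages
    # most relevant to the query, then summarize only those.
    truncated = " ".join(page_text.split()[:500])
    result = summarizer(truncated, max_length=60, min_length=20, do_sample=False)
    return result[0]["summary_text"]
```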

The next technique from ML to use is Text Clustering, where similar texts are grouped together into clusters. This could prevent search engines from presenting essentially the same content multiple times from different domains, ensuring there is enough original material on the first page of results. Text Clustering can also help combat spam, which significantly degrades result quality.
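A minimal sketch of such de-duplication might look like the following, using scikit-learn's TF-IDF vectorizer and k-means; the choice of vectorizer, clustering algorithm, and number of clusters are assumptions made for illustration.

```python
# Minimal sketch: cluster candidate results by textual similarity and keep
# one representative per cluster, so near-duplicate content from different
# domains does not fill the first page of results.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def deduplicate_results(result_texts: list[str], n_clusters: int = 10) -> list[int]:
    """Return the indices of one representative result per content cluster."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(result_texts)
    k = min(n_clusters, len(result_texts))
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(vectors)
    seen, keep = set(), []
    for i, label in enumerate(labels):
        if label not in seen:  # keep the first (highest-ranked) result of each cluster
            seen.add(label)
            keep.append(i)
    return keep
```

In practice, clustering would probably run offline over the index rather than per query, but de-duplicating only the top candidates keeps the sketch simple.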

Another technique from ML to use is Word-sense Disambiguation, but I have written an entire post about this, so I won’t cover it again here.

These aren’t even the newest or most difficult ML techniques, and they don’t necessarily require Deep Learning, so bringing ML to Search should be within reach of most companies with their own Web index.

The most important application of ML by far is to estimate page content quality directly. Currently, search engines use something like the PageRank algorithm to rank results indirectly, via the number of inbound links. For many reasons this is a dated method. I think it should be possible to use ML to estimate the quality of the page content itself, fighting spam at the same time. Unlike the others in this post, this is not a well-established technique, but there is a good chance it could work, and it would be the strongest argument for ML in search. There are many high-quality sites online with terrible search engine rankings; bringing them to the front would considerably improve the usefulness of a search engine.
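Since this is speculative, the following is only a minimal sketch of what a learned quality score might look like, assuming hand-labeled training pages and a simple bag-of-words classifier from scikit-learn; the features, model, and labels are all illustrative assumptions.

```python
# Minimal sketch: learn a content-quality score from hand-labeled pages.
# The training data, features, and model are placeholders for illustration;
# a real system would need far richer signals and much more labeled data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labels: 1 = high-quality content, 0 = low-quality or spam.
pages = ["in-depth article citing sources ...", "keyword-stuffed spam page ..."]
labels = [1, 0]

quality_model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
quality_model.fit(pages, labels)

def quality_score(page_text: str) -> float:
    """Probability that a page is high quality; could be blended into ranking."""
    return float(quality_model.predict_proba([page_text])[0, 1])
```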

The strongest factor holding back such a deployment of ML is the consumption of computational resources, in particular processor time. Applying all the techniques mentioned here to an entire Web search index will require extensive resources. But processing power has come a long way since the early days of these search engines, so it should be feasible. There are probably also clever optimizations to be found that reduce the full load, for example by initially processing only the most frequently served results.

Web search has been losing relevance for quite a few years now, so improving result quality should be a priority for any incumbent. Either way, there is a lot of potential for product improvement here, and a lot of earning potential as well!