Search Term Disambiguation

If you are interested in the island of Java and type its name into a regular Web search engine, you get copious results about the programming language of the same name. This is of course easily remedied by searching for “Java the island” instead. But if your query is more complex and involves multiple terms, fixing it manually becomes tricky. Today’s idea is a component of a search engine or search agent that helps users resolve the ambiguity in their search terms before returning matching results.

The relevant technique from Natural Language Processing (NLP) is called “word-sense disambiguation” (WSD). It automatically discerns which sense of a word is relevant to an occurrence of that word in a text of sufficient length. This can be applied to the documents that a Web search engine processes, as part of a full-text indexing step. Instead of storing search terms only as strings, the index would also keep track of representations of their meaning. So far this is nothing unusual.
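As a rough illustration of such a sense-aware index, here is a minimal sketch in Python. It uses a simplified Lesk-style overlap count as a stand-in for a real WSD component. The `SENSES` inventory, the signature words, the sample documents and all function names are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical sense inventory: each sense of an ambiguous term is described
# by a small set of signature words (an assumption for this sketch).
SENSES = {
    "java": {
        "programming_language": {"code", "compiler", "class", "software", "jvm"},
        "island": {"indonesia", "jakarta", "volcano", "travel", "sea"},
        "coffee": {"cup", "brew", "roast", "bean", "drink"},
    },
}


def disambiguate(term, context_words):
    """Simplified Lesk: pick the sense whose signature overlaps the context most.

    Returns None for terms without an inventory entry, or when no signature
    word occurs in the context at all.
    """
    senses = SENSES.get(term)
    if not senses:
        return None
    context = set(context_words)
    best = max(senses, key=lambda sense: len(senses[sense] & context))
    if not senses[best] & context:
        return None
    return best


def build_index(documents):
    """Map (term, sense) pairs, rather than bare strings, to document ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        words = text.lower().split()
        for word in set(words):
            index[(word, disambiguate(word, words))].add(doc_id)
    return index


documents = {
    "d1": "java is an island of indonesia with more than one volcano",
    "d2": "the java compiler turns each class into bytecode for the jvm",
}
index = build_index(documents)
print(index[("java", "island")])
print(index[("java", "programming_language")])
```

A production system would replace the hand-written signatures with a proper lexical resource and a stronger WSD model; the point here is only that the index key becomes a term-meaning pair.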

But the search query terms have to be disambiguated too, so that they can be properly matched with the indexed terms. This cannot be done fully automatically, because one to four search terms alone give too little context for NLP. Instead, an interactive system could work as follows: if the user says “Java”, the disambiguation system presents the choices “programming language”, “island” and “coffee”. Once the user chooses one of these, that sense of the word is used for the term for the rest of the search session. Multiple search terms can be disambiguated in turn in the same way, so that the search proceeds with several term-meaning pairs rather than just text tokens.
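The interactive flow described above could be sketched roughly like this in Python. The `SENSE_CHOICES` table, the `SearchSession` class and the chooser callback are hypothetical names invented for the example; a real system would draw its choices from the index and ask the user through its own interface.

```python
# Hypothetical table of ambiguous terms and the senses offered to the user.
SENSE_CHOICES = {
    "java": ["programming language", "island", "coffee"],
    "python": ["programming language", "snake"],
}


class SearchSession:
    """Remembers the sense chosen for each term for the rest of the session."""

    def __init__(self, choose):
        # choose(term, options) -> the option the user picked
        self.choose = choose
        self.resolved = {}

    def resolve_query(self, query):
        """Turn a query string into a list of (term, sense) pairs."""
        pairs = []
        for term in query.lower().split():
            options = SENSE_CHOICES.get(term)
            if options is None:
                # Term is not known to be ambiguous: keep it as a plain token.
                pairs.append((term, None))
                continue
            if term not in self.resolved:
                # Ask only once per term and session.
                self.resolved[term] = self.choose(term, options)
            pairs.append((term, self.resolved[term]))
        return pairs


def console_chooser(term, options):
    """Example chooser for a terminal: list the senses and read a number."""
    for number, option in enumerate(options, start=1):
        print(f"{number}. {term} ({option})")
    return options[int(input("Which sense? ")) - 1]


# Scripted stand-in for a user who always means the island.
session = SearchSession(lambda term, options: "island")
print(session.resolve_query("Java travel"))
print(session.resolve_query("java volcano"))
```

Passing `console_chooser` instead of the scripted lambda would prompt on the terminal. Keeping the chooser as a callback is one way to separate the session logic from whatever front end presents the choices.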

This is a rather simple example, and there are of course more challenging ones. How far a team would get from easy to hard test cases, I cannot predict. But even handling only moderate examples would, in my opinion, make the undertaking worthwhile. There is much more feature space to be explored along these lines!

So far this is more of a product enhancement idea than a startup idea. But it is possible to create a novel search engine around it, for example using the Common Crawl data set, or just Wikipedia. Common Crawl provides data from Web crawling so that you don’t have to operate your own crawler, which can be the hardest part of Web search operations. And Wikipedia provides its open data set for download, for free. This too can be the foundation of a better search engine. In the case of Wikipedia, the structured data includes disambiguation pages for common terms, which can feed into the semantic analysis.
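To give an idea of how disambiguation pages might feed a sense inventory, here is a small Python sketch that pulls candidate senses out of wikitext. The sample text is hand-written in the general style of a disambiguation page, not copied from Wikipedia, and the regular expression only handles the simple case of a bulleted line that starts with a wiki link.

```python
import re

# Hand-written sample in the style of a disambiguation page (an assumption,
# not actual Wikipedia content).
SAMPLE_WIKITEXT = """\
'''Java''' may refer to:
* [[Java]], an island of Indonesia
* [[Java (programming language)]], an object-oriented programming language
* [[Java coffee]], a variety of coffee
* [[Java (board game)|''Java'' (board game)]], a board game
"""

# A bulleted line that starts with [[Target]] or [[Target|label]],
# followed by a free-text description.
LINK = re.compile(r"^\*+\s*\[\[([^\]|]+)(?:\|[^\]]*)?\]\](.*)$")


def extract_senses(wikitext):
    """Return (link target, description) pairs, one per matching bullet."""
    senses = []
    for line in wikitext.splitlines():
        match = LINK.match(line)
        if match:
            target = match.group(1).strip()
            description = match.group(2).strip(" ,")
            senses.append((target, description))
    return senses


for target, description in extract_senses(SAMPLE_WIKITEXT):
    print(f"{target}: {description}")
```

Real disambiguation pages are messier (section headings, nested bullets, links in the middle of a line), so this would only be a starting point.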

If I haven’t convinced you yet that this idea is important, simply try to find a book about creating (as in programming) your own database engine. No matter how you search and which terms you use, you will only get results about implementing a database using an existing engine. The search terms are the same: “implement”, “develop”, “code” … The space of implementing your own engine is simply swamped by the many more texts about implementing a database with an existing engine. To this day I don’t know whether such books even exist!