Question Answering for Wikipedia

wikipedia 2022-04-20 529 words 3 minutes

There is a Natural Language Processing (NLP) technique called Question Answering (QA), in which users put a question to a system in natural language and receive an answer. For example, we might ask, “What is the maximum take-off weight of the Airbus A380?” and receive the answer “575,000 kg”. This is an improvement over plain search engines, which take only keywords as a query and respond with entire pages or documents, from which the user has to extract the desired answer. Google already has partial Question Answering capability, and indeed it answers the question about the A380 correctly.

Research systems for Question Answering have been built and tested for quite some time now, and the technology is probably mature enough to build a product on.

Applying full Question Answering capability to Web search is likely to be too computationally expensive for a startup, and possibly even for an established Web search engine. So instead, I propose implementing it just for Wikipedia at first, or more precisely for a single language edition, such as the English one. Wikipedia contains enough knowledge to cover many use cases while being a manageably sized database. In addition, Wikipedia includes quite a bit of semantic markup, which would give a head start to any algorithm working with the text.

There are different ways to provide such a service. One works as a dialog system, like a chatbot, which interacts with the user, with both questions and answers in plain English. Another would have the user pose questions in English but return snippets from Wikipedia as answers instead of automatically formulated text. In our Airbus example, the system would quote the row of a Wikipedia table that contains the right information. In other cases, a single paragraph would be returned. This second type might be both easier to implement and less error-prone than the first.
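To make the snippet-returning variant a little more concrete, here is a minimal sketch in Python. It is not a real Question Answering algorithm, only a stand-in: it picks the snippet that shares the most informative words with the question, weighting rare words more heavily. The names `tokenize` and `best_snippet` and the scoring formula are my own assumptions for illustration; a real system would use proper NLP techniques in their place.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def best_snippet(question, snippets):
    """Return the snippet sharing the most informative words with the question.

    Hypothetical helper: words that occur in few snippets count for more
    (an IDF-style weight). Returns None if no snippet shares a word.
    """
    if not snippets:
        return None
    tokenized = [set(tokenize(s)) for s in snippets]
    # Document frequency: in how many snippets does each word occur?
    df = Counter(word for words in tokenized for word in words)
    n = len(snippets)
    q_words = set(tokenize(question))

    def score(words):
        # Sum the rarity weights of the words shared with the question.
        return sum(math.log((n + 1) / df[w]) for w in q_words & words)

    scores = [score(words) for words in tokenized]
    best = max(range(n), key=scores.__getitem__)
    return snippets[best] if scores[best] > 0 else None
```

Given the question about the A380 and a handful of candidate snippets, one of which is a table row reading "Maximum take-off weight: 575,000 kg", this sketch selects that row, because "maximum", "take", "off" and "weight" are rare among the candidates. It has no notion of meaning, which is exactly the gap the NLP algorithms would have to fill.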

While this capability is part of existing search engines, I think there is a business case for a separate product. A dedicated Question Answering system could enter into an interaction with the user and extract more information about what they are looking for than a keyword-based search can. Users who are less comfortable with IT systems would find such a QA system easier to use, in part because they would no longer have to interpret result pages themselves.

While this idea requires advanced NLP algorithms to work, the theory is established enough to be covered in standard NLP textbooks and should be accessible to competent engineers in the field. Aside from these NLP algorithms, the hardest part is probably parsing Wikipedia properly. The downloadable content dumps of Wikipedia are XML files, but most of the content inside them is marked up in so-called Wikitext, for which there are many half-baked parsers but no mature one that works reliably in all cases. The syntax of Wikitext is public knowledge, however, so creating a good parser for the project should be possible.
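The two layers of the parsing problem can be sketched as follows. The XML layer is the easy part: the sketch streams `<page>` elements from a dump with Python's standard `xml.etree.ElementTree.iterparse` and yields each title with its raw Wikitext. The Wikitext layer is where it gets hard: the regular expression below handles only simple `[[target]]` and `[[target|label]]` links and ignores templates, nesting and everything else, so it is an illustration of the problem rather than a solution. The function names, the regular expression and the tiny sample dump are assumptions of mine, not part of any existing parser.

```python
import io
import re
import xml.etree.ElementTree as ET

# Matches simple internal links only: [[target]] or [[target|label]].
# Assumption: no nested brackets or templates inside the link.
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def local_name(tag):
    """Strip the XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_file):
    """Yield (title, wikitext) for every <page> in a dump, streaming."""
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # free the finished page so memory stays bounded


def links(wikitext):
    """Return (target, label) pairs for the simple links in the Wikitext."""
    return [(target.strip(), (label or target).strip())
            for target, label in LINK.findall(wikitext)]


# A hand-written miniature dump for illustration, not real Wikipedia data.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Airbus A380</title>
    <revision>
      <text>The '''Airbus A380''' is a [[wide-body aircraft|wide-body]] airliner by [[Airbus]].</text>
    </revision>
  </page>
</mediawiki>"""

for title, text in iter_pages(io.BytesIO(SAMPLE)):
    print(title, links(text))
```

Streaming matters here because a full dump is far too large to load into memory at once. Everything beyond the link pattern, such as infobox templates and tables, is where the existing parsers tend to break down.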

This idea could be the starting point for a different type of search interface, one that might at some point replace incumbent search engines.