A Semantics Based Bi-Lingual Information Retrieval System To Improve Results Of Search Engines

The advent of pervasive computing (smartphones, notebooks, tablets, Kindles) has greatly changed access to information in developing countries. However, input devices, specifically keyboards, are still mostly English-based. This has given rise to the Romanization of languages, and the same is the case with Roman Urdu, which is commonly used for typing text messages on cell phones, for chatting, and even on social media (such as Twitter and Facebook). Using a native language with a familiar English-layout keyboard can help reduce computer shyness in people who are not well versed in English. However, despite Urdu being spoken by a large population, online resources in the Urdu language are scarce, and the local population is left with the dilemma of search results that constitute only a small percentage of the actual content available on the Web related to their query. To address this issue, some mechanism is required to convert words written in Romanized form to English in the first place.

This applied research presents a Semantics-based Bi-Lingual Information Retrieval System to improve the results of search engines, materialized as an Ontology resource and an intelligent text search. We demonstrate the need for such a system by searching Urdu words written in Roman script on the top 5 contemporary browsers and scrutinizing the links retrieved. The evaluation reveals that browsers perform text-based search, not semantic search, for Roman Urdu words. The idea of semantic search is to retrieve semantically similar or related information in addition to information that exactly matches the input query. We present a practical implementation of our work in the form of a semantically enriched Ontology and a functional interface to it. The search mechanism augments classical approximate string matching algorithms with an indigenously developed phonetic matching algorithm. The extended algorithm significantly improves both recall and precision. Semantic search results have been validated both empirically and observationally against a human-curated standard, and the outcomes were found encouraging.
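As an illustration only, the idea of combining approximate string matching with phonetic matching can be sketched as follows. This is a minimal sketch under stated assumptions, not the algorithm developed in this work: the phonetic rule (`phonetic_key`), the tiny lexicon (`LEXICON`), the `lookup` function, and the distance threshold are all hypothetical stand-ins chosen to show how spelling variants of a Roman Urdu word might be mapped to a single entry.

```python
# Illustrative sketch only: every name below is hypothetical and the
# phonetic rule is a simple stand-in, not the algorithm from this work.

VOWELS = set("aeiou")

# Hypothetical three-entry lexicon: Roman Urdu word -> English gloss.
LEXICON = {
    "kitab": "book",
    "pani": "water",
    "mohabbat": "love",
}


def edit_distance(a: str, b: str) -> int:
    """Classical Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def phonetic_key(word: str) -> str:
    """Assumed normalization: lowercase, collapse repeated letters,
    then drop every vowel except a leading one."""
    collapsed = []
    for ch in word.lower():
        if not collapsed or collapsed[-1] != ch:
            collapsed.append(ch)
    if not collapsed:
        return ""
    head, tail = collapsed[0], collapsed[1:]
    return head + "".join(ch for ch in tail if ch not in VOWELS)


def lookup(query: str, lexicon: dict, max_distance: int = 1) -> list:
    """Return (entry, gloss) pairs whose phonetic keys lie within
    max_distance edits of the query's phonetic key, closest first."""
    qkey = phonetic_key(query)
    scored = []
    for entry, gloss in lexicon.items():
        d = edit_distance(qkey, phonetic_key(entry))
        if d <= max_distance:
            scored.append((d, entry, gloss))
    scored.sort()
    return [(entry, gloss) for _, entry, gloss in scored]


if __name__ == "__main__":
    for q in ("kitaab", "paani", "muhabat"):
        print(q, "->", lookup(q, LEXICON))
```

In this sketch, comparing phonetic keys rather than raw strings lets variants such as "kitaab" and "kitab" collapse to the same key before the edit distance is applied, which is one plausible way a phonetic step could raise recall without loosening the distance threshold.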