Extension Of Semantic Based Urdu Linguistic Resources Using Natural Language Processing Techniques

Urdu is the eighth most widely spoken language in the world, with more than 140 million speakers in the subcontinent. It is the national language of Pakistan and an official language of six Indian states. Despite its wide usage, research on the Urdu language is scarce in terms of electronic linguistic tools and resources such as Named Entity Recognizers and WordNet. Terms in any language represent concepts, which can be interconnected to form a conceptual model of relationships for use in applications. Automatic term recognition is the baseline for enhancing Natural Language Processing tasks such as ontology and vocabulary creation, text recognition, and automatic translation. For Urdu, very little research on multi-word terms has been carried out. Therefore, there is a need to develop such linguistic resources and to recognize multi-word terms automatically.

In this thesis, a framework comprising NLP tools for Urdu is developed. First, a tool and a language resource are proposed and developed: an Automatic Urdu Multi-word Terms Recognizer, and an extension of Urdu WordNet comprising multi-word terms, which is created automatically using the recognizer. These terms are then connected semantically to produce a structure similar to that of Princeton WordNet. The Automatic Term Recognizer uses lexical and statistical information from an Urdu corpus to calculate a score value for each candidate term. It is applied to Urdu corpora to collect multi-word terms, which are then organized into synsets using information from various online Urdu dictionaries and thesauri. Finally, both resources, i.e. the Automatic Term Recognizer and the WordNet extension, are validated. The Automatic Term Recognizer achieved a precision of 58.3% and a recall of 73%. The developed Urdu WordNet extension contains 735 multi-word terms and has been validated by a domain expert.
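
To illustrate how statistical information from a corpus can be turned into a score for multi-word term candidates, the following is a minimal sketch in Python. It assumes a C-value style measure, a widely used termhood score; the abstract does not specify the scoring formula, so this is an illustrative stand-in and not necessarily the method used in the thesis. The function names, the toy tokens, and the frequencies are hypothetical.

```python
import math


def contains(longer, shorter):
    """Return True if `shorter` occurs as a contiguous token run in `longer`."""
    n, m = len(longer), len(shorter)
    return any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_value(freqs):
    """Score multi-word candidates with a simplified C-value measure.

    `freqs` maps each candidate (a tuple of tokens) to its corpus frequency.
    A candidate nested inside longer candidates has the mean frequency of
    those longer candidates subtracted from its own frequency, and the
    result is weighted by log2 of the candidate length.
    ASSUMPTION: this formula is a stand-in; the thesis may score differently.
    """
    scores = {}
    for term, freq in freqs.items():
        # Frequencies of longer candidates that contain this candidate.
        nesting = [
            f for other, f in freqs.items()
            if len(other) > len(term) and contains(other, term)
        ]
        adjusted = freq - sum(nesting) / len(nesting) if nesting else freq
        scores[term] = math.log2(len(term)) * adjusted
    return scores


# Hypothetical candidates and frequencies, for illustration only.
toy_freqs = {
    ("w1", "w2"): 5,
    ("w1", "w2", "w3"): 2,
}
toy_scores = c_value(toy_freqs)
ranked = sorted(toy_scores.items(), key=lambda item: item[1], reverse=True)
```

In a sketch like this, candidates whose score exceeds a chosen threshold would be kept as multi-word terms. The threshold and the lexical filter that produces the candidates (for example, part-of-speech patterns) are left out because the abstract does not describe them.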