Supervised Topic Modeling in Urdu Language for Domain Lexicon Generation

Urdu is the 8th most widely spoken language in the world, with more than 140 million speakers in the subcontinent. It is the national language of Pakistan and an official language of six Indian states. Despite this widespread use, research on the Urdu language is still limited. The volume of Urdu text is increasing day by day, and it is important to understand and extract information about its underlying themes. Latent Dirichlet Allocation (LDA) is a topic modeling technique that has been widely applied to large collections of text to uncover latent themes. It is based on the “bag of words” assumption. For Urdu, little research on topic modeling has been carried out, and it is limited to unigrams. In this thesis, an alternative method for Urdu LDA based on “bag of words” plus “bag of multi-terms” is developed. Multi-terms are extracted using the C-value method and are then integrated with single words. Because LDA is an unsupervised method, it does not provide a label for an extracted theme. The proposed framework therefore also includes automatic labeling of Urdu topic models. Candidate labels are extracted using linguistic filters and are then ranked by the similarity between topic vectors and candidate label vectors, computed using word2vec and letter trigrams. Experiments are performed on an Urdu corpus collected from BBC Urdu. Results are evaluated using a word intrusion user study and coherence scores, and the performance of the model is also tested on unseen documents. For automatic labeling, results are validated by domain experts, which demonstrates that our framework can help Urdu researchers gain a faster and better understanding of their Urdu document collections.
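
The label-ranking step can be illustrated with a minimal sketch. The abstract does not say how the topic and label vectors are built, so the code makes several assumptions: the function names are hypothetical, `#` is used as a word-boundary marker, the topic vector is the sum of the letter-trigram counts of the topic's top words, and similarity is plain cosine. The word2vec component is omitted. It is not the thesis implementation.

```python
from collections import Counter
from math import sqrt


def letter_trigrams(text):
    """Count letter trigrams of `text`, padded with '#' (assumed boundary marker)."""
    padded = f"#{text}#"
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_labels(topic_words, candidate_labels):
    """Rank candidate labels by trigram similarity to the topic's top words."""
    # Assumption: the topic vector is the sum of its words' trigram counts.
    topic_vec = Counter()
    for word in topic_words:
        topic_vec.update(letter_trigrams(word))
    scored = [(label, cosine(topic_vec, letter_trigrams(label)))
              for label in candidate_labels]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative topic about cricket: cricket, team, match, player.
topic = ["کرکٹ", "ٹیم", "میچ", "کھلاڑی"]
# Candidate labels: "economy" and "cricket team".
candidates = ["معیشت", "کرکٹ ٹیم"]

for label, score in rank_labels(topic, candidates):
    print(f"{score:.3f}  {label}")
```

In this toy example the label "کرکٹ ٹیم" shares several trigrams with the topic words and ranks first, while "معیشت" shares none. A fuller version would presumably combine this score with a similarity computed over word2vec embeddings.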