Embedding Part of Speech (POS) Information in Word2vec for Text Classification

In this research, we present a novel approach to enhancing text classification using improved word representations that embed linguistic information (i.e., POS tags). Word2vec, a vector-based word representation, can be combined with vector-based POS representations to improve classifier accuracy on text classification tasks. We demonstrate how distributed representations of words can be used together with part-of-speech embeddings by jointly learning both models on top of a pre-trained convolutional neural network used to carry out the text classification task. Distributed representations of words capture semantic associations between words, but comparatively little effective work has addressed their syntactic associations. Prior findings also suggest that the CBOW model works better on syntactic tasks, whereas skip-gram is slightly better on semantic tasks. The primary objective of this thesis is to surface syntactic relationships and their impact on word vectors. Syntactic information can be incorporated into word embeddings in various ways; one is to integrate part-of-speech (POS) information into different text classification tasks. To date, word2vec achieves state-of-the-art performance on many natural language processing tasks, yet it remains inadequate at capturing the syntactic information carried by POS tags. To incorporate such information, we focus on combining semantic and syntactic information to produce high-quality word vectors. The resulting word vectors show effective improvements over baseline techniques. We summarize our best published results on well-known, publicly available benchmark datasets for text classification.
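A common way to realize the combination described above is to concatenate each token's word vector with a vector for its POS tag before feeding the sequence to the classifier. The following is a minimal sketch of that idea, not the thesis's actual implementation: the toy random embeddings stand in for pre-trained word2vec and learned POS embeddings, and all names (`word_vecs`, `pos_vecs`, `embed_sentence`) are hypothetical.

```python
import numpy as np

# Toy stand-ins: in practice, word vectors come from pre-trained word2vec
# and POS vectors are learned (or one-hot) embeddings of tagger output.
WORD_DIM, POS_DIM = 8, 4
rng = np.random.default_rng(0)

word_vecs = {w: rng.normal(size=WORD_DIM) for w in ["the", "cat", "sat"]}
pos_vecs = {t: rng.normal(size=POS_DIM) for t in ["DET", "NOUN", "VERB"]}

def embed_sentence(tokens, tags):
    """Concatenate each word vector with its POS vector, producing a
    (sentence_length, WORD_DIM + POS_DIM) matrix as CNN input."""
    return np.stack([np.concatenate([word_vecs[w], pos_vecs[t]])
                     for w, t in zip(tokens, tags)])

X = embed_sentence(["the", "cat", "sat"], ["DET", "NOUN", "VERB"])
print(X.shape)  # (3, 12)
```

The concatenated matrix then plays the role of the embedding layer's output in a convolutional text classifier, so the convolution filters see both semantic (word) and syntactic (POS) features of every token.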