Automatic Filtration of Spams in Multi-lingual Short Messages on Mobile and Social Media

Spam in SMS (short messages) and on online social media is a growing problem, and applying machine learning techniques to spam detection is an active area of research. The popularity of social media platforms such as Twitter and Facebook has made them a major medium of interaction, and spam has consequently emerged on these platforms as well, with increasingly sophisticated methods of spreading unwanted messages. Our research analyses several techniques currently used for spam filtering in SMS and email. Because SMS content is short and distinctive in style, techniques that work well on email are not necessarily effective on SMS. Messages on Twitter are likewise limited to 140 characters, so the characteristics of mobile SMS carry over to tweets. We aim to extend the techniques developed during the course of our research to both SMS and tweets.

This research covers filtering and classifying Roman Urdu messages (Urdu written in the English alphabet) as spam or not spam by applying algorithms such as Naïve Bayes, SVM and several others to Roman Urdu SMS and tweets from Twitter. Our Naïve Bayes classifier achieved an accuracy of 92.22% on the SMS dataset, 93.65% on English tweets and 98.1% on Roman Urdu tweets. Some algorithms performed satisfactorily and improved further with our features: SVM, for example, gave an accuracy of 70.88% on English tweets, which rose to 87.49% after our features were used. The best performance was achieved with the Naïve Bayes Multinomial and DMNBText classifiers. Furthermore, we provide a manually curated dataset of SMS messages in Roman Urdu and English, labelled as spam or not spam, that can be used by other researchers.
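
To illustrate the kind of classifier evaluated here, a multinomial Naïve Bayes spam filter can be sketched in a few lines of Python. This is a minimal sketch with Laplace smoothing and plain whitespace tokenisation, not the implementation or feature set used in this research; the class name, the `tokenize` helper and the toy Roman Urdu messages are assumptions invented purely for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical helper: lowercase and split on whitespace only.
    return text.lower().split()


class MultinomialNaiveBayes:
    """Illustrative multinomial Naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, messages, labels):
        self.class_counts = Counter(labels)          # messages per class
        self.word_counts = defaultdict(Counter)      # word frequencies per class
        for text, label in zip(messages, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            denom = total_words + self.alpha * len(self.vocab)
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this sketch (not taken from the curated dataset).
train_messages = [
    "mubarak ho aap ne inaam jeeta hai abhi call karein",
    "free balance hasil karein abhi reply karein",
    "kal class kitne baje hai",
    "ami keh rahi hain ghar jaldi aana",
]
train_labels = ["spam", "spam", "not_spam", "not_spam"]

model = MultinomialNaiveBayes().fit(train_messages, train_labels)
print(model.predict("free inaam jeeta abhi call karein"))
print(model.predict("ghar jaldi aana kal"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, and the smoothing term keeps a word seen in only one class from driving the other class's probability to zero.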