A Multilingual Framework for Identification and Categorization of Hate Speech and Abusive Language

In legal terms, hate speech is defined as "speech that attacks an individual or a group of individuals on the basis of attributes such as religious orientation, race, gender, disability, national origin, or ethnic origin." The definition and implications of hate speech and extremism differ across societies, largely because of the dynamics that persist within each one. Considerable work has already been done on detecting hate speech and offensive language in English tweets on Twitter. With almost everyone now on social media, the volume of posts has grown heavily in recent years. Dignitaries, celebrities, government officials, politicians, and others use social media to convey messages, update their fans, or inform the public, and the freedom of speech on such platforms makes it easy to direct hate speech and offensive language at them. This volume cannot be handled manually, so there is a need for automated systems that can detect and identify such posts and prevent them from being published. To that end, datasets have been collected, annotated, and analyzed with different techniques and algorithms to detect such tweets and categorize them as offensive so that they are not published. The dataset used in this work comes from the HASOC 2019 shared task on predicting hate speech and offensive language. We fine-tuned "Bidirectional Encoder Representations from Transformers (BERT), a neural-network-based technique for pre-training natural language processing (NLP)," applying it with model reinforcement over the dataset, and achieved considerable results. Our results are better than those of the competition winners in all tasks: the F1 scores we achieve are fractionally better for subtask A, marginally better for subtask B, and promising for subtask C. Our model produces macro F1 scores of 0.8426, 0.6285, and 0.6884 for subtasks A, B, and C respectively, exceeding the score of the team that topped each subtask on the English dataset.
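As a minimal, illustrative sketch of the fine-tuning step described above (not the authors' exact configuration: the multilingual checkpoint, hyperparameters, labels, and toy examples below are assumptions for demonstration), BERT can be fine-tuned for a HASOC-style binary classification task, such as subtask A's hate-and-offensive (HOF) vs. not (NOT) split, using the Hugging Face transformers library:

```python
# Illustrative sketch of fine-tuning BERT for HASOC-style binary
# classification (subtask A: HOF vs. NOT). The checkpoint name,
# hyperparameters, and toy examples are assumptions, not the paper's setup.
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2  # 0 = NOT, 1 = HOF
)

# Toy training examples standing in for annotated HASOC tweets.
texts = ["I respect everyone here", "you people are worthless"]
labels = torch.tensor([0, 1])

batch = tokenizer(texts, padding=True, truncation=True,
                  max_length=128, return_tensors="pt")

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):  # a few epochs is typical for BERT fine-tuning
    optimizer.zero_grad()
    out = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=labels)
    out.loss.backward()  # cross-entropy loss from the classification head
    optimizer.step()

# Inference: predict the label of a new tweet.
model.eval()
with torch.no_grad():
    enc = tokenizer(["some new tweet"], truncation=True,
                    max_length=128, return_tensors="pt")
    pred = model(**enc).logits.argmax(dim=-1)
print(pred.item())
```

The same setup extends to the finer-grained subtasks B and C by changing num_labels to the number of categories in each subtask.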