
Fight against racism and hate speech on reddit

Reddit is an American social news aggregation and discussion website. Registered members submit content to the site such as links, text posts, and images, which are then voted up or down by other members. Posts are organized by subject into boards called "subreddits", which cover a variety of topics including news, science, movies, video games, music, books, fitness, food, and image-sharing. Registering an account with Reddit is free and does not require an email address. Because Reddit is an open platform where anyone is free to post, there is little or no censorship. As a result, we came across many offensive posts filled with negative comments.

How is Reddit structured?

Racism and hate speech can cause a lot of damage to both individuals and communities. A study of 800 Australian secondary school students found that racism had a large impact on the mental health of the students who experienced it. We have built a plugin to battle this hate speech on Reddit.


  • Data was collected using the Reddit API and then cleaned. It came from subreddits such as r/ImGoingToHellForThis/, r/Incels/, etc.
  • The text of the posts and their comments was extracted to form the training data for our machine learning model. Each comment was annotated by 3 members: it was marked 1 if it was considered offensive and 0 otherwise.
  • A machine learning model based on SVM was trained on the collected dataset (TfidfVectorizer was used instead of a feature set).
  • The model was tested on posts from a few subreddits: /r/gaming, /r/ImGoingToHellForThis, /r/aww, and /r/MadeMeSmile.
  • The top 200 comments of a post are scanned, and if the number of offensive comments crosses a certain threshold, the post is labelled as offensive.
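The post-level rule in the last step can be sketched as follows. This is a minimal illustration, not the plugin's actual code: the function name `label_post` and the threshold value of 20 are assumptions, since the post does not state which threshold was used, and "crosses" is read here as "strictly greater than".

```python
from typing import Sequence

MAX_COMMENTS = 200        # the post says the top 200 comments are scanned
OFFENSIVE_THRESHOLD = 20  # hypothetical value; the real threshold is not stated


def label_post(comment_labels: Sequence[int],
               threshold: int = OFFENSIVE_THRESHOLD,
               max_comments: int = MAX_COMMENTS) -> bool:
    """Return True if the post should be labelled offensive.

    comment_labels holds one 0/1 label per comment, ordered from the top
    comment down (1 = offensive, 0 = not offensive).
    """
    scanned = comment_labels[:max_comments]  # ignore anything past the top 200
    offensive_count = sum(scanned)
    # Assumption: "crosses the threshold" means strictly greater than it.
    return offensive_count > threshold


# 25 offensive comments among the top 200 exceeds the assumed threshold of 20.
print(label_post([1] * 25 + [0] * 175))
```

Counting offensive comments rather than taking a fraction follows the wording of the step above; a fraction-based rule would behave differently on posts with fewer than 200 comments.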

  • Snippet of the dataset collected using PRAW (the Python Reddit API Wrapper)

    Some points for the methodology:
    • In the training set, we took 50% offensive and 50% non-offensive cases to improve the accuracy of the model and keep it fair.
    • The size of the training set was 950 comments.
    • The model was improved using VADER, a Python library for sentiment analysis. Our final plugin used it to improve accuracy: each comment was labelled according to the score it received.
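A balanced TF-IDF plus linear SVM setup of the kind described in these points might look like the sketch below. It assumes scikit-learn is installed; the helper name `train_classifier`, the default hyperparameters, and the four toy comments (which stand in for the 950 annotated comments) are illustrative assumptions, not the project's actual code or data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def train_classifier(comments, labels):
    """Fit a TF-IDF + linear SVM pipeline on annotated comments (1 = offensive)."""
    # Enforce the 50/50 split between offensive and non-offensive comments.
    if sum(labels) * 2 != len(labels):
        raise ValueError("expected a 50/50 split of offensive and non-offensive comments")
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(comments, labels)
    return model


# Toy, made-up examples standing in for the real annotated dataset.
toy_comments = [
    "you are a worthless idiot",
    "i hate you, stupid idiot",
    "what a lovely puppy",
    "this made my day, thanks",
]
toy_labels = [1, 1, 0, 0]

model = train_classifier(toy_comments, toy_labels)
print(model.predict(["stupid idiot", "lovely puppy"]))
```

Wrapping the vectorizer and the classifier in one pipeline means the same TF-IDF vocabulary is applied at training and prediction time.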

    The plugin was tested on 4 major subreddits. /r/ImGoingToHellForThis is a subreddit well known for offensive content, so it was the first one we analysed. We tested 10 posts from it (top of all time). A similar analysis was done for the 3 other subreddits, and the findings matched our expectations: /r/ImGoingToHellForThis had the most offensive posts, with /r/gaming just behind it, while /r/aww and /r/MadeMeSmile, which are popular family-friendly subreddits, had the fewest.

    TOTAL POSTS = 10*4 = 40
    ACCURACY = 0.875

    observations for the 4 subreddits analysed

    For the final plugin we used VADER sentiment analysis to classify each comment as offensive or not. VADER uses a lexicon-based, bag-of-words style approach: whenever it comes across an offensive word, the sentence is given a strongly negative score. This helped us classify the comments. The accuracy of LinearSVC could probably be improved by substantially enlarging the dataset and tuning the parameters, but with the dataset we had, VADER gave us the best score.
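Labelling a comment from its sentiment score can be sketched as below. The cut-off of -0.5 and the names `is_offensive`, `toy_score`, and `TOY_LEXICON` are assumptions made for illustration; the post does not say which score boundary was used. The scorer is passed in as a function so the sketch runs without extra packages, and the toy scorer is only a stand-in, not VADER itself.

```python
from typing import Callable

# Hypothetical cut-off; the post does not say which score it used.
OFFENSIVE_CUTOFF = -0.5


def is_offensive(comment: str,
                 compound_score: Callable[[str], float],
                 cutoff: float = OFFENSIVE_CUTOFF) -> int:
    """Return 1 if the comment's sentiment score is at or below the cutoff, else 0."""
    return 1 if compound_score(comment) <= cutoff else 0


# Toy stand-in scorer so the sketch is self-contained.
# It is NOT VADER: it just averages hand-picked word scores in [-1, 1].
TOY_LEXICON = {"idiot": -0.9, "hate": -0.8, "stupid": -0.7, "love": 0.8, "cute": 0.6}


def toy_score(comment: str) -> float:
    words = [w.strip(".,!?").lower() for w in comment.split()]
    hits = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


print(is_offensive("I hate you, stupid idiot!", toy_score))   # strongly negative words
print(is_offensive("Such a cute puppy, love it", toy_score))  # positive words only
```

With the `vaderSentiment` package installed, the scorer could instead be `lambda text: SentimentIntensityAnalyzer().polarity_scores(text)["compound"]`, whose compound score also lies between -1 and 1.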

    Linear SVC vs Vader accuracy

    Final Plugin and poster presentation:

    • We used a Django server to communicate between the plugin and the Python machine learning model.
    • Whenever a user opens a post, an HTTP POST request is sent to the server. The score is calculated and returned in the response, which the plugin shows as a nudge to the user.
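The request and response exchange can be sketched roughly as follows. To stay self-contained, this uses Python's standard-library `http.server` purely as a stand-in for the Django server. The JSON fields (`comments`, `score`, `offensive`), the 0.5 cut-off, and the keyword-based `classify` placeholder are all assumptions, not the plugin's real protocol or model.

```python
import json
from http.server import BaseHTTPRequestHandler


def classify(comment: str) -> int:
    """Placeholder for the trained model: 1 = offensive, 0 = not offensive."""
    return int("idiot" in comment.lower())


def build_response(payload: dict) -> dict:
    """Score a post from its comments (hypothetical request/response fields)."""
    comments = payload.get("comments", [])
    flagged = sum(classify(c) for c in comments)
    score = flagged / len(comments) if comments else 0.0
    return {"score": score, "offensive": score > 0.5}  # 0.5 is an assumed cut-off


class NudgeHandler(BaseHTTPRequestHandler):
    """Answers the plugin's POST request with a JSON score."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "invalid JSON")
            return
        if not isinstance(payload, dict):
            self.send_error(400, "expected a JSON object")
            return
        body = json.dumps(build_response(payload)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


print(build_response({"comments": ["you idiot", "total idiot", "nice post"]}))
```

To try it locally, the handler could be served with `HTTPServer(("127.0.0.1", 8000), NudgeHandler).serve_forever()` from the same `http.server` module. In a Django project, the same logic would live in a view that returns a `JsonResponse`.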

    response to a non-offensive post

    response to an offensive post

    Some pics from the poster presentation

    Group Members:

    Ashutosh Batabyal (Group Leader)

    Shreya Sharma

    Abhishek Chauhan

    Shivani Raina

    Aarushi Arya

    Sarthak Jindal



