
Fight against racism and hate speech on Reddit

Reddit is an American social news aggregation and discussion website. Registered members submit content to the site, such as links, text posts, and images, which are then voted up or down by other members. Posts are organized by subject into boards called "subreddits", which cover a wide variety of topics including news, science, movies, video games, music, books, fitness, food, and image-sharing. Registering an account with Reddit is free and does not require an email address. Since Reddit is an open platform and anyone is free to post anything, there is little to no censorship, and as a result many posts are filled with offensive and negative comments.



How is Reddit structured?



Racism and hate speech can cause serious damage to both individuals and communities. A study of 800 Australian secondary school students found that racism had significant mental health impacts on the students who experienced it. We have built a plugin to battle this hate speech on Reddit.


Methodology:

  • Collection of data using the Reddit API (via PRAW), followed by cleaning of the data. The data was collected from subreddits such as r/ImGoingToHellForThis and r/Incels (see the collection sketch after this list).
  • The text of the posts and their comments was extracted and used as the training data for our machine learning model. Each comment was annotated by 3 members: if the comment was considered offensive it was marked as 1, otherwise 0.
  • A machine learning model based on SVM was trained on the collected dataset (a TfidfVectorizer was used instead of a hand-crafted feature set).
  • The model was tested on the posts of a few subreddits: /r/gaming, /r/ImGoingToHellForThis, /r/aww and /r/MadeMeSmile.
  • The top 200 comments of a post are scanned, and if the number of offensive comments crosses a certain threshold the post is labelled as offensive.
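A minimal sketch of the collection step, assuming PRAW is installed and Reddit API credentials are available; the credential placeholders and the output file name are illustrative, while the subreddit names are the ones listed above.

    import csv
    import praw

    # Placeholder credentials; a real script needs an app registered on Reddit.
    reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",
                         client_secret="YOUR_CLIENT_SECRET",
                         user_agent="hate-speech-scraper")

    rows = []
    for sub_name in ["ImGoingToHellForThis", "Incels"]:
        for submission in reddit.subreddit(sub_name).top(limit=10):
            submission.comments.replace_more(limit=0)      # flatten the comment tree
            for comment in submission.comments.list():
                rows.append({"subreddit": sub_name,
                             "post_id": submission.id,
                             "comment": comment.body})

    # Dump the raw comments for manual annotation (1 = offensive, 0 = not offensive).
    with open("comments_raw.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["subreddit", "post_id", "comment"])
        writer.writeheader()
        writer.writerows(rows)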




  • Snippet of the dataset collected using PRAW (Python Reddit API Wrapper)




    Some points for the methodology:
    • In the training set, we took 50% offensive and 50% non-offensive cases in order to keep the model balanced and improve its accuracy.
    • The size of the training set was 950 comments.
    • The model was later complemented with VADER sentiment analysis, a Python library for sentiment analysis; our final plugin used it to improve accuracy. Comments were labelled according to the score they received (a training sketch for the SVM baseline follows this list).
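    A sketch of the SVM baseline under the assumptions above (950 balanced, manually labelled comments), using scikit-learn's TfidfVectorizer and LinearSVC; the CSV file name and its "comment"/"label" column names are illustrative.

        import pandas as pd
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.model_selection import train_test_split
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import LinearSVC

        data = pd.read_csv("comments_annotated.csv")       # 950 manually labelled comments
        X_train, X_test, y_train, y_test = train_test_split(
            data["comment"], data["label"],
            test_size=0.2, stratify=data["label"], random_state=42)

        # TfidfVectorizer stands in for a hand-crafted feature set, as noted above.
        model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                              LinearSVC())
        model.fit(X_train, y_train)
        print("Held-out accuracy:", model.score(X_test, y_test))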





    Analysis:
    The plugin was tested on 4 major subreddits. /r/ImGoingToHellForThis is a subreddit well known for offensive content, so it was the first one we analysed: 10 posts (all-time top) from this subreddit were tested. A similar analysis was done for the 3 other subreddits, and the findings matched our expectations. /r/ImGoingToHellForThis had the highest number of offensive posts, with /r/gaming just behind it. /r/aww and /r/MadeMeSmile, which are popular family-friendly subreddits, had the fewest offensive posts.

    TOTAL POSTS = 10*4 = 40
    FALSE POSITIVE = 3
    FALSE NEGATIVE = 2
    TRUE POSITIVE = 18
    TRUE NEGATIVE = 17
    ACCURACY = 0.875
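    These counts are consistent: accuracy = (TP + TN) / total posts = (18 + 17) / 40 = 0.875.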





    observations for the 4 subreddits analysed



    For the final plugin we used VADER sentiment analysis to classify each comment as offensive or not. VADER uses a bag-of-words approach: whenever it comes across an offensive word, the sentence receives a strongly negative score, which helped us classify the comments. The accuracy of the LinearSVC model could be improved by greatly enlarging the dataset and tuning its parameters, but with the dataset we had, VADER gave us the best score (a minimal classification sketch follows).
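    A minimal sketch of the VADER-based labelling, assuming the standalone vaderSentiment package; the -0.5 compound-score threshold is an assumption for illustration, not the exact cut-off used in the project.

        from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

        analyzer = SentimentIntensityAnalyzer()

        def is_offensive(comment: str, threshold: float = -0.5) -> bool:
            # VADER returns a compound score in [-1, 1]; text containing words from
            # its negative lexicon (including many slurs) scores strongly negative.
            return analyzer.polarity_scores(comment)["compound"] <= threshold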


    Linear SVC vs Vader accuracy


    Final Plugin and poster presentation:

    • We used a Django server to let the plugin communicate with the Python machine learning model.
    • Whenever a user opens a post, an HTTP POST request is sent to the server; the score is calculated and returned as a response (a nudge for the user). A server-side sketch follows this list.
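    A server-side sketch, assuming a Django view mapped to an endpoint such as /score/ ; the JSON field names, the threshold value, and the fetch_top_comments() helper are illustrative, not the project's actual code.

        import json

        from django.http import JsonResponse
        from django.views.decorators.csrf import csrf_exempt

        OFFENSIVE_THRESHOLD = 0.3          # assumed fraction of offensive comments

        @csrf_exempt                       # the plugin posts from outside Django's own pages
        def score_post(request):
            payload = json.loads(request.body)
            post_id = payload["post_id"]

            # fetch_top_comments() is a hypothetical helper (e.g. built on PRAW)
            # that returns the top 200 comment bodies of the post.
            comments = fetch_top_comments(post_id, limit=200)
            offensive = sum(is_offensive(c) for c in comments)   # VADER classifier above
            ratio = offensive / max(len(comments), 1)

            # The response is the "nudge" the plugin shows to the user.
            return JsonResponse({"offensive": ratio >= OFFENSIVE_THRESHOLD,
                                 "offensive_ratio": round(ratio, 3)})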





    response to a non-offensive post

    response to an offensive post




    Some pics from the poster presentation




    Group Members:

    Ashutosh Batabyal (Group Leader)


    Shreya Sharma



    Abhishek Chauhan




    Shivani Raina



    Aarushi Arya




    Sarthak Jindal





    References:

    https://en.wikipedia.org/wiki/Reddit
    https://www.reddit.com/








      
