
Social Bot Detection on Twitch



Twitch is the world's leading live streaming video platform for the gaming community. It is a highly popular social network with close to 100 million monthly unique users. Bots are prominent on the platform because of the financial benefits it offers to popular accounts. The main objective of our project is detecting social bots on Twitch using several techniques: metadata analysis, sentiment analysis of chats on a channel, and classification using machine learning.

We started by collecting the usernames of 510 channels, for which we compared the chatter and viewer counts on each channel's live video. We found 51 channels with more chatters than viewers. On those channels, we performed temporal analysis over a period of four weeks. Alongside, we collected their metadata, such as followers, followings, status, partner value, and total views. From these features we calculated a score, where a higher score indicates a higher probability that the account is a bot.
Twitch has an official IRC interface, from which we collected a channel's chats using a tool called Chatty. Chats were collected for channels with higher scores, and sentiment analysis was performed on the extracted messages. From the IRC client, we could also see which users were banned from chat by Twitch itself. We treated such users as ground truth for our next technique.
Using the score from the metadata features and the ground truth from the chat analysis, we labeled our data set, which we used as the training set for our machine learning models, along with a randomly collected test set to measure classification accuracy.


                              Summary of Methodology and Analysis


Data Collection and Filtering:  We used Kraken, Twitch's official API, to collect 510 random users along with their chatter and viewer counts. Of these, 51 accounts showed bot behaviour, i.e. more chatters than viewers.
We limited our analysis to these 51 users, from whom we extracted the list of chatters. For those users (chatters), we collected several endpoints, such as followers, followings, status, partner value, and views, which were then used for temporal analysis, metadata analysis, and machine learning classification.
For sentiment analysis, we extracted the chats of the most suspicious accounts using a tool called Chatty, which pulls live chats from Twitch's IRC interface.
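The chatters-vs-viewers filter described above can be sketched as follows. The sample channel names and counts are made up for illustration; in practice the viewer count would come from the Kraken API and the chatter list from Twitch's IRC/chatters endpoint.

```python
# Sketch of the chatters-vs-viewers filter used to shortlist suspicious
# channels. Channel names and counts below are illustrative placeholders.

def flag_suspicious(channel_stats):
    """Keep channels whose chatter count exceeds their live viewer count."""
    return [name for name, (chatters, viewers) in channel_stats.items()
            if chatters > viewers]

stats = {
    "channel_a": (120, 95),   # more chatters than viewers -> suspicious
    "channel_b": (40, 300),   # normal: viewers outnumber chatters
    "channel_c": (15, 15),    # equal counts -> not flagged
}
print(flag_suspicious(stats))  # -> ['channel_a']
```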
Temporal Analysis:  Since our project dealt with live streaming data, temporal analysis was integral to our data collection. We tracked viewer and chatter counts over a period of four weeks to finalize the data set for further analysis. We had initially found 51 of 510 users showing bot behavior, i.e. chatter count > viewer count; this analysis helped us keep track of accounts that showed bot activity regularly.

Meta-data Analysis: Using the collected metrics (followers, followings, status, partner, and views), we created a Botscore formula to estimate the probability of an account being a bot. The Botscore was weighted according to the prominence of each metric and given out of 5: the higher the score, the more suspicious the account. We then took users with higher Botscores for sentiment analysis on their chats. This data was also used to train and classify accounts using machine learning models.
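A score out of 5 over these five metrics could be computed as below. The thresholds and the exact rules are assumptions for illustration, not our precise formula; only the five input metrics and the 0-to-5 scale come from the write-up.

```python
# Illustrative Botscore sketch: one point per suspicious trait across the
# five metadata metrics. Thresholds here are assumed, not the exact formula.

def botscore(followers, followings, partner, views, status_len):
    score = 0
    if followers < 10:                       # almost no followers
        score += 1
    if followings > 5 * max(followers, 1):   # follows far more than followed
        score += 1
    if not partner:                          # not a Twitch partner
        score += 1
    if views < 50:                           # negligible channel views
        score += 1
    if status_len == 0:                      # empty status/stream title
        score += 1
    return score                             # 0 (likely human) .. 5 (likely bot)
```

Accounts scoring near 5 would be the ones forwarded to chat-level sentiment analysis.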














Sentiment Analysis: For sentiment analysis, we used TextBlob, a Python library for processing textual data. It provides a simple API for common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and translation.
We collected the chats of suspected bots from the IRC chat channel. Using TextBlob, we computed the polarity and subjectivity of each message. For neutral messages containing possible spam words, we incremented a per-user spam count, giving us data on how frequently each user ID sent spam messages. We further analyzed these users' profiles and found that 70% of the accounts had already been suspended by Twitch.
Through this method we achieved 75% detection accuracy.
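The per-user spam counting can be sketched as below. Message polarity would come from TextBlob (`TextBlob(msg).sentiment.polarity`); here it is passed in precomputed so the counting logic stands alone, and the spam-word list is an assumed example, not our actual list.

```python
# Sketch of the per-user spam counter. Polarity is assumed precomputed via
# TextBlob; the spam-word list below is an illustrative placeholder.
from collections import Counter

SPAM_WORDS = {"free", "followers", "viewers", "promo"}  # assumed examples

def is_spam(message, polarity):
    """Flag neutral messages (polarity == 0) that contain a spam word."""
    words = set(message.lower().split())
    return polarity == 0.0 and bool(words & SPAM_WORDS)

def spam_counts(chat_log):
    """chat_log: iterable of (user_id, message, polarity) triples."""
    counts = Counter()
    for user, msg, pol in chat_log:
        if is_spam(msg, pol):
            counts[user] += 1
    return counts
```

User IDs with high spam counts are the ones whose profiles would then be checked against Twitch's own suspensions.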


Classification using Machine Learning: The manually annotated data, together with the ground truth we collected, was used to train machine learning models. We trained the following five models:
1. Gaussian Naive Bayes
2. Logistic Regression
3. Support Vector Machines
4. Decision Tree Classifier
5. Neural Networks
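Training the five classifiers can be sketched with scikit-learn as below. The feature matrix and labels are random placeholders standing in for the metadata features and bot/human labels, not our actual data set.

```python
# Sketch of training the five classifiers listed above with scikit-learn.
# X and y are random placeholders for the metadata features and labels.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((200, 5))             # followers, followings, status, partner, views
y = (X[:, 0] < 0.3).astype(int)      # placeholder bot/human labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
models = {
    "Gaussian Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(),
    "Support Vector Machine": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Neural Network": MLPClassifier(max_iter=500),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: {model.score(X_te, y_te):.2f}")
```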
Following are the obtained accuracies for the different models:


Next, to check the validity of our Botscore, we built a small data set similar to the one used for training. The only difference was that the ground-truth labels were replaced by our calculated Botscore.

We re-ran our trained models on this new test set and found that the accuracies were comparable to the original ones. This confirmed that our Botscore formula was consistent with the trained models.

Conclusion: Twitch is a growing platform, and fake accounts hurt both the company's revenue and deserving streamers. The need to identify bot accounts therefore grows in proportion to the number of bots being created. Our analysis successfully identified 70% of the accounts actually banned or suspended by Twitch itself. In the future, we plan to enlarge the data set, use more features for machine learning, and perform content analysis on the extracted chats.

Link to our video - https://youtu.be/AXLK9H_Uuls             

The Team

  
    L-R:  Akhil Goel, Shreyash Arya, Tushita Rathore, Sarthika Dhawan, Mayank Bhoria 

Some more photos from the presentation:









