
Raincheck for SoundCloud : Detecting spam users in SoundCloud

Raincheck for SoundCloud

SoundCloud is an audio distribution platform that lets its users upload, record, and share their original sounds.
Bot and fake accounts are a persistent problem on SoundCloud, where wealthy but undeserving musicians use them to inflate their follower counts. As a result, genuinely talented musicians go unnoticed while plagiarized or undeserving music, promoted by bot and fake accounts, becomes famous. Below is an infographic that explains the motivation behind our work. The primary objective of our project was to prevent users from following untrustworthy SoundCloud accounts and to level the playing field for artists who are actually worthy.




To address this problem, we built a Chrome extension that detects bot-like profiles on SoundCloud. It relies on a machine learning model that we trained and validated to identify fake accounts using the feature vector described below.


Methodology


We crawled SoundCloud’s website in breadth-first (BFS) order (followers of a follower, and so on) and extracted the following features by web scraping:
    • Follower and Following Count
    • Profile picture
    • Number of tracks posted
    • Suspicious Usernames

Total number of users crawled: 2000
Total number of users manually annotated (each user was annotated by 3 people): 300 (ground truth)
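
The breadth-first crawl described above can be sketched roughly as follows. This is only an illustration, not our actual crawler: `crawl_bfs`, the `get_followers` callback and the toy follower graph are hypothetical stand-ins for the scraping code, and the 2000-user cap simply mirrors the number of users we crawled.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl_bfs(
    seed: str,
    get_followers: Callable[[str], Iterable[str]],
    max_users: int = 2000,
) -> List[str]:
    """Visit users breadth-first from `seed`, stopping after `max_users`."""
    visited = {seed}          # users already queued, so nobody is crawled twice
    order: List[str] = []     # users in the order they were crawled
    queue = deque([seed])
    while queue and len(order) < max_users:
        user = queue.popleft()
        order.append(user)
        # get_followers is a placeholder for scraping the user's followers page
        for follower in get_followers(user):
            if follower not in visited:
                visited.add(follower)
                queue.append(follower)
    return order


# Toy follower graph with made-up names, standing in for scraped pages.
toy_graph = {
    "seed_artist": ["fan_a", "fan_b"],
    "fan_a": ["fan_c", "seed_artist"],
    "fan_b": ["fan_c", "fan_d"],
    "fan_c": [],
    "fan_d": ["fan_a"],
}

crawled = crawl_bfs("seed_artist", lambda u: toy_graph.get(u, []), max_users=4)
print(crawled)
```

The queue guarantees that all direct followers of the seed are visited before any follower of a follower, which is what makes the traversal breadth-first.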


An account is likely to be a bot if:
    • Its follower-following ratio is skewed in one direction, taking the profile's creation date into account.
    • It has no profile picture.
    • It has a pornographic profile picture.
    • It has fewer than 𝞪 posts, where 𝞪 is an arbitrary threshold chosen on the basis of our ground truth.
    • It repeats the same activity.
Using the feature vector above, we created the ground truth through manual annotation. We then trained our model using three supervised machine learning techniques: SVM, linear regression, and a decision tree classifier. We did not use neural networks because the ground truth dataset was too small.
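
To make the idea of a feature vector concrete, here is a minimal sketch of how a scraped profile could be turned into numbers for a classifier. The `Profile` fields, the `to_features` helper and the "long run of trailing digits" username pattern are assumptions made for illustration; the exact encoding and the suspicious-username rule we used are not reproduced here, and signals such as profile creation date, picture content and repeated activity are left out.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Profile:
    username: str
    followers: int
    following: int
    has_picture: bool
    track_count: int


# Assumed heuristic: a username ending in 4 or more digits looks auto-generated.
SUSPICIOUS_NAME = re.compile(r"\d{4,}$")


def to_features(p: Profile) -> List[float]:
    """Encode one profile as a flat list of numeric features."""
    # +1 in the denominator avoids division by zero for users following nobody
    ratio = p.followers / (p.following + 1)
    return [
        float(p.followers),
        float(p.following),
        ratio,
        1.0 if p.has_picture else 0.0,
        float(p.track_count),
        1.0 if SUSPICIOUS_NAME.search(p.username) else 0.0,
    ]


# Made-up profile with the traits listed above: skewed ratio, no picture, no tracks.
example = Profile("user-482913377", followers=3, following=1999,
                  has_picture=False, track_count=0)
features = to_features(example)
print(features)
```

A vector like this can then be fed to any of the three classifiers mentioned above.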
We classified accounts into 3 categories:
  • Genuine users - Users who appear genuine and real
  • Ambiguous users - Users for whom we are not sure whether they are bots/fake
  • Fake users - Users who are bots/fake
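
Since every user in the ground truth was annotated by 3 people, the three labels have to be combined into one. The sketch below assumes a simple majority vote, with a fallback to "ambiguous" when no label has a strict majority. This rule is an assumption made for illustration, and the `aggregate` helper is hypothetical.

```python
from collections import Counter
from typing import Sequence

LABELS = ("genuine", "ambiguous", "fake")


def aggregate(votes: Sequence[str]) -> str:
    """Return the label chosen by a strict majority, else 'ambiguous'."""
    if not votes:
        raise ValueError("at least one vote is required")
    for vote in votes:
        if vote not in LABELS:
            raise ValueError(f"unknown label: {vote}")
    label, count = Counter(votes).most_common(1)[0]
    # Assumed tie-break: without a strict majority the user is ambiguous.
    return label if count * 2 > len(votes) else "ambiguous"


print(aggregate(["fake", "fake", "genuine"]))
print(aggregate(["fake", "genuine", "ambiguous"]))
```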


Results
Figure 1: Graph representing the distribution of type of users in our study

Figure 2: Graph representing the distribution of type of profile pictures of fake profiles
                                                          
Figure 3: Skewed follower-following ratio for fake profiles

Figure 4: Accuracy vs User count graph for 3 different methodologies
As shown in Figure 1, fake users/bots made up 14% of the SoundCloud accounts in our study, i.e. 280 of the 2000 crawled accounts. In other words, undeserving musicians are gaining the attention of a large number of followers with the help of bots. This share is too large to be ignored.
Figures 2 and 3 describe the behaviour of our features.
The profile picture turned out to be an important feature for detection: 57% of fake users have no picture uploaded and 29% have a pornographic/inappropriate picture (Figure 2). Figure 3 also shows that the follower-following ratio of fake users is marginally skewed in one direction compared with that of genuine musicians.
Finally, as shown in Figure 4, the accuracy of our methodology was around 80%, with the decision tree classifier achieving the highest accuracy.
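
For readers unfamiliar with the metric, accuracy here means the fraction of users whose predicted category matches the annotated one. The snippet below is a small sketch with made-up labels, not our evaluation data, and the `accuracy` helper is hypothetical.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up labels: 4 of the 5 predictions match the annotation.
actual = ["fake", "genuine", "genuine", "ambiguous", "fake"]
predicted = ["fake", "genuine", "fake", "ambiguous", "fake"]
print(accuracy(predicted, actual))
```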


Conclusion
The trained ML model was exposed through a Chrome extension. The extension warns users when they open the page of a suspected fake user and gives the green light when they open a genuine user’s profile. With this tool, users can make more informed decisions about whom to follow. SoundCloud employees could also use this data when deciding whether a profile is fake.



Disclaimer : All images used in this blog have been created or captured by us with the exception of the following picture captured by Siddharth Arya.

The Team 
From L-R : Rishi Mohan, Ishita Verma, Prachi Singh, Arushi Kumar, Akshat Sharda, Anisha Sejwal
