Raincheck for SoundCloud: Detecting spam users in SoundCloud

SoundCloud is an audio distribution platform that enables its users to upload, record, and share their originally created sounds.
Bot/fake accounts are a persistent problem on SoundCloud, where rich but undeserving musicians use them to get more followers. As a result, genuinely talented musicians go unnoticed while plagiarized/undeserving music (promoted by bot/fake accounts) becomes famous. Below is an infographic that explains the motivation behind our work. The primary objective of our project was to prevent users from following untrustworthy SoundCloud accounts and to level the playing field for the artists who are actually worthy.




Therefore, to solve the above problem, we built a Chrome extension that detects bot-like profiles on SoundCloud. It relies on a machine learning model that we validated for identifying fake accounts using the feature vector described below.


Methodology


We crawled SoundCloud’s website in a breadth-first (BFS) manner (followers of a follower, and so on) and extracted the following features using web scraping:
    • Follower and Following Count
    • Profile picture
    • Number of tracks posted
    • Suspicious Usernames
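
As a rough illustration of the breadth-first crawl described above, here is a minimal Python sketch. It is not our actual crawler: the `get_followers` callback and the toy follower graph are hypothetical stand-ins for the scraping step, and only the traversal order is the point.

```python
from collections import deque


def crawl_bfs(seed, get_followers, max_users):
    """Visit users breadth-first: the seed, its followers, their followers, and so on.

    `get_followers` is a hypothetical callback that returns the follower
    usernames of a user (in a real crawler this would scrape the profile page).
    """
    visited = [seed]      # users in the order they were discovered
    seen = {seed}         # fast membership check to avoid revisiting users
    queue = deque([seed])
    while queue and len(visited) < max_users:
        user = queue.popleft()
        for follower in get_followers(user):
            if follower not in seen:
                seen.add(follower)
                visited.append(follower)
                queue.append(follower)
                if len(visited) >= max_users:
                    break
    return visited


# Made-up follower graph, used only to show the traversal order.
TOY_GRAPH = {
    "artist": ["fan1", "fan2"],
    "fan1": ["fan3"],
    "fan2": ["fan3", "fan4"],
    "fan3": [],
    "fan4": ["artist"],
}

order = crawl_bfs("artist", lambda u: TOY_GRAPH.get(u, []), max_users=2000)
print(order)  # ['artist', 'fan1', 'fan2', 'fan3', 'fan4']
```

The `max_users` cap plays the role of the crawl budget, and the `seen` set keeps cycles in the follower graph from causing repeated visits.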

Total number of users crawled: 2000
Total number of users manually annotated (every user was annotated by 3 people): 300 (ground truth)
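
Since every user was annotated by 3 people, the three labels have to be combined into one. The post does not say how this was done, so the following Python sketch simply assumes a majority vote, with a hypothetical fallback label when all three annotators disagree.

```python
from collections import Counter


def majority_label(votes, fallback="ambiguous"):
    """Return the label chosen by more than half of the annotators.

    Assumption: majority voting is used, and a three-way disagreement
    falls back to `fallback`. Neither detail is stated in the post.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else fallback


print(majority_label(["fake", "genuine", "fake"]))       # fake
print(majority_label(["fake", "genuine", "ambiguous"]))  # ambiguous
```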


An account is likely to be a bot if:
    • Its follower-following ratio is skewed in one direction (taking into account the date the profile was created).
    • It has no profile picture uploaded.
    • It has a pornographic profile picture uploaded.
    • It has fewer than 𝞪 posts uploaded, where 𝞪 is a threshold chosen on the basis of our ground truth.
    • It repeats the same activity.
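
The criteria above can be turned into per-profile signals roughly as follows. This is a sketch under stated assumptions, not our actual feature extraction: the `Profile` fields, the `picture` encoding, the default `alpha`, and the `ratio_limit` cut-off are all hypothetical, and the creation-date and repeated-activity criteria are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    username: str
    followers: int
    following: int
    picture: str  # hypothetical encoding: "none", "inappropriate" or "normal"
    tracks: int


def bot_signals(profile, alpha=1, ratio_limit=10.0):
    """Compute boolean warning signs for one profile.

    `alpha` (minimum number of tracks) and `ratio_limit` (how lopsided the
    follower-following ratio may be) are illustrative values, not the
    thresholds chosen from the ground truth.
    """
    # Adding 1 to both counts avoids division by zero for empty profiles.
    ratio = (profile.followers + 1) / (profile.following + 1)
    skewed = ratio > ratio_limit or ratio < 1 / ratio_limit
    return {
        "skewed_ratio": skewed,
        "no_picture": profile.picture == "none",
        "inappropriate_picture": profile.picture == "inappropriate",
        "few_tracks": profile.tracks < alpha,
    }


suspect = Profile("xx123", followers=2, following=900, picture="none", tracks=0)
regular = Profile("band_official", followers=150, following=120, picture="normal", tracks=12)
print(bot_signals(suspect))
print(bot_signals(regular))
```

The resulting dictionary of booleans can be fed to a classifier as a feature vector or shown to annotators as hints.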
Using the criteria above, we created the ground truth through manual annotation. We then trained our model using 3 supervised machine learning techniques: SVM, linear regression and a decision tree classifier. We did not use neural nets because the ground truth dataset was not large enough.
We classified accounts into 3 categories:
  • Genuine users - Users who appear genuine and real
  • Ambiguous users - Users for whom we are not sure whether they are bots/fake
  • Fake users - Users who are bots/fake


Results
Figure 1: Graph representing the distribution of user types in our study

Figure 2: Graph representing the distribution of profile picture types among fake profiles

Figure 3: Skewed follower-following ratio for fake profiles

Figure 4: Accuracy vs User count graph for 3 different methodologies
As shown in Figure 1, about 14% of the SoundCloud users in our study (280 of 2000) turned out to be fake users/bots, which suggests that undeserving musicians are getting the attention of a large number of followers with the help of bots. This share is too large to ignore.
The behaviour of our features is described in Figure 2 and Figure 3.
Profile picture turned out to be an important feature for detection: 57% of fake users have no picture uploaded and 29% of fake users have a pornographic/inappropriate picture (Figure 2). Figure 3 also shows that the follower-following ratio of fake users is marginally skewed in one direction compared with genuine musicians.
Finally, as shown in Figure 4, the accuracy of our methodology was around 80% (highest for the decision tree classifier).
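
For readers unfamiliar with the metric, accuracy here means the fraction of annotated users whose predicted category matches the ground truth label. A minimal Python sketch follows; the labels in the example are made up for illustration and are not our results.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one label is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Made-up labels, purely to show the calculation (4 of 5 match).
truth = ["fake", "genuine", "genuine", "ambiguous", "fake"]
preds = ["fake", "genuine", "fake", "ambiguous", "fake"]
print(accuracy(truth, preds))  # 0.8
```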


Conclusion
The trained ML model was exposed through a Chrome extension. The extension warned users when they opened the page of a suspected fake user and gave the green light when they opened a genuine user’s profile. With this tool, users can make more informed decisions about whom to follow. SoundCloud employees could also use this data when deciding whether a profile is fake.



Disclaimer: All images used in this blog have been created or captured by us, with the exception of the following picture captured by Siddharth Arya.

The Team
From L-R: Rishi Mohan, Ishita Verma, Prachi Singh, Arushi Kumar, Akshat Sharda, Anisha Sejwal
