Raincheck for SoundCloud
SoundCloud is an audio distribution platform that enables its users to upload, record, and share their originally-created sounds.
Bot and fake accounts are a persistent problem on SoundCloud, where wealthy but undeserving musicians use them to gain more followers. As a result, genuinely talented musicians go unnoticed while plagiarized or undeserving music (promoted by bot/fake accounts) becomes famous. Below is an infographic that explains the motivation behind our work. The primary objective of our project was to prevent users from following untrustworthy SoundCloud accounts and to level the playing field for the artists who are actually worthy.
To solve this problem, we built a Chrome extension that detects SoundCloud profiles that are bots or behave like bots. It relies on a machine learning model trained to identify fake accounts from the feature vector described below.
We crawled SoundCloud's website in a BFS manner (followers of a follower, and so on) and extracted the following features using web scraping:
- Follower and Following Count
- Profile picture
- Number of tracks posted
- Suspicious Usernames
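The breadth-first crawl order can be illustrated with a minimal sketch. This is not our actual crawler: `bfs_crawl`, `toy_get_followers` and `TOY_GRAPH` are hypothetical names, and the toy graph stands in for the follower lists that would be scraped from real profile pages.

```python
from collections import deque


def bfs_crawl(seed, get_followers, max_users):
    """Visit accounts breadth-first from `seed`, stopping at `max_users`."""
    visited = [seed]   # accounts in the order they were discovered
    seen = {seed}      # fast membership check to avoid revisiting
    queue = deque([seed])
    while queue and len(visited) < max_users:
        user = queue.popleft()
        for follower in get_followers(user):
            if follower not in seen:
                seen.add(follower)
                visited.append(follower)
                queue.append(follower)
                if len(visited) >= max_users:
                    break
    return visited


# Toy follower graph (hypothetical data) in place of scraped pages.
TOY_GRAPH = {
    "seed": ["a", "b"],
    "a": ["c", "seed"],
    "b": ["c", "d"],
    "c": [],
    "d": ["e"],
    "e": [],
}


def toy_get_followers(user):
    return TOY_GRAPH.get(user, [])


crawl_order = bfs_crawl("seed", toy_get_followers, max_users=5)
print(crawl_order)
```

In a real crawl, `get_followers` would fetch and parse a profile's follower page, and the features listed above would be extracted for each visited account.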
Total number of users crawled: 2000
Total number of users manually annotated (every user was annotated by 3 people): 300 (ground truth)
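One simple way to combine three annotations per user into a single ground-truth label is a majority vote. The sketch below is an assumption for illustration only, not a description of how our labels were actually merged; `majority_label` and its `fallback` parameter are hypothetical.

```python
from collections import Counter


def majority_label(votes, fallback="ambiguous"):
    """Return the most common label, or `fallback` when the top labels tie."""
    counts = Counter(votes).most_common()
    # A tie for first place (e.g. three different votes) has no majority.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return fallback
    return counts[0][0]


clear_case = majority_label(["fake", "fake", "genuine"])
split_case = majority_label(["fake", "genuine", "ambiguous"])
print(clear_case, split_case)
```

Falling back to "ambiguous" on a three-way split is a design choice of this sketch; discarding such users from the ground truth would be another reasonable option.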
An account is likely to be a bot if:
- The follower-following ratio is skewed in one direction (taking the profile's creation date into account).
- No profile picture is uploaded.
- A pornographic picture is uploaded.
- Fewer than α posts are uploaded, where α is an arbitrary number chosen on the basis of our ground truth.
- The same activity is repeated.
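Most of the criteria above can be expressed as simple checks on a profile. The following is a rough sketch under stated assumptions: the `Profile` fields, the function `bot_signals`, and the default thresholds `min_tracks` (standing in for α) and `ratio_threshold` are hypothetical, not the values we used. Account age and repeated activity are left out because they need data this sketch does not model.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    followers: int
    following: int
    has_picture: bool
    picture_is_inappropriate: bool
    track_count: int


def bot_signals(p, min_tracks=1, ratio_threshold=10.0):
    """List which of the annotation criteria a profile triggers."""
    signals = []
    low, high = sorted((p.followers, p.following))
    # Skew in either direction counts; +1 avoids division by zero.
    if (high + 1) / (low + 1) >= ratio_threshold:
        signals.append("skewed_ratio")
    if not p.has_picture:
        signals.append("no_picture")
    elif p.picture_is_inappropriate:
        signals.append("inappropriate_picture")
    if p.track_count < min_tracks:
        signals.append("few_tracks")
    return signals


suspicious = Profile(followers=5, following=2000, has_picture=False,
                     picture_is_inappropriate=False, track_count=0)
ordinary = Profile(followers=300, following=250, has_picture=True,
                   picture_is_inappropriate=False, track_count=12)
print(bot_signals(suspicious))
print(bot_signals(ordinary))
```

Signals like these are inputs for a human annotator or a classifier to weigh, not a verdict on their own.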
Based on the feature vector above, we created the ground truth through manual annotation. We then trained our model using 3 supervised machine learning techniques: SVM, linear regression, and a decision tree classifier. We did not use neural nets because the ground truth dataset was not large enough.
We classified the users into 3 categories:
- Genuine users - Users that appear genuine and real
- Ambiguous users - Users for which we are not sure whether they are bots/fake
- Fake users - Users that are bots/fake
Figure 1: Graph representing the distribution of type of users in our study
Figure 2: Graph representing the distribution of type of profile pictures of fake profiles
Figure 3: Skewed follower-following ratio for fake profiles
Figure 4: Accuracy vs. user count graph for 3 different methodologies
As shown in Figure 1, 14% of the SoundCloud users in our study, i.e. 280 of the 2000 crawled accounts, came out to be fake users/bots. With the help of such bots, undeserving musicians are getting the attention of a huge number of followers. This share is too large to be ignored.
The behaviour of our features is described in Figure 2 and Figure 3.
The profile picture came out to be an important feature for our detection: 57% of fake users do not have any picture uploaded, and 29% of fake users have a pornographic/inappropriate picture (Figure 2). Also, it can be seen from Figure 3 that the follower-following ratio is marginally skewed in one direction for fake users as compared to genuine musicians.
Finally, as shown in Figure 4, the accuracy of our methodology came out to be around 80% (the maximum being in the case of the decision tree classifier).
The trained ML model was exposed through a Chrome extension. The extension warned users when they opened the page of a suspected fake user, and gave the green light when a genuine user's profile was opened. Using this tool, users will be better informed. This data could also be used by SoundCloud employees when deciding whether a profile is fake.
Disclaimer: All images used in this blog have been created or captured by us, with the exception of the following picture captured by Siddharth Arya.
From L-R: Rishi Mohan, Ishita Verma, Prachi Singh, Arushi Kumar, Akshat Sharda, Anisha Sejwal