Identifying fake reviews in Goodreads

E-commerce has been around for a long time, and many people have moved towards it. Dependence on these product-based companies has grown enormously. Few stopped to ask why, but everyone followed the trend. The reason, in fact, is that people trusted these services and the products sold on them.

Most users never knew what was really going on behind these websites, or on most social media today. The issue we want to raise through this blog is ‘Fake Reviewing’. Being fake is a growing problem these days, and many people are already working against it. Like them, we took a step in this direction.

What does ‘Being Fake’ mean? It means deliberately presenting to others information that is not true, so that ‘someone fake’ leads them to believe it is.

Many have recognised this problem only recently, and many are still trying to pin it down. Coming to the point, it has become common to check a product's reviews online before buying it. Reviews are one source of product information that people trust, because they assume each review was written honestly by another user. But those other users may or may not be genuine. Fake reviews from non-genuine users can be positive as well as negative: positive reviews tend to increase sales of the product in question, whereas negative reviews create a negative image of it.

Both kinds of fake review are written for a single purpose: to influence the product's sales. A fake positive review is meant to push sales up, and a fake negative one to pull them down. This was the basic idea behind our project.

Aim of the Project

For our project, we chose fake review detection for books on an online social network (OSN) named Goodreads.

Goodreads is a social network for book readers which was launched in 2007. It allows users to sign up, make friends, review books, create reading lists, join groups and so on. With more than 55 million users, Goodreads has become the number one site for checking out book reviews.

Readers around the world prefer this network for sharing their opinions about books, getting reviews from other readers and having healthy discussions. Unfortunately, this OSN has become a victim of opinion spamming.

Rather than posting honest reviews of a book, some reviewers intentionally post untrue opinions, comments about the author rather than the book, or unrelated content such as advertisements. There have been multiple reports of books being reviewed negatively before their publication date, and authors complain about fake reviews of their books on various discussion forums. Sites like Fiverr also offer commercial services such as posting multiple positive or negative reviews for money. These incidents motivated us to choose Goodreads for our course project.

Our project, in a nutshell:

Chose a network of our interest.
Identified a problem that we could work with.
Established the fact that the problem exists.
Collected data by web scraping.
Analysed the data to decide on features.
Defined what counts as a fake review for us.
Annotated reviews according to our definitions.
Trained a model to classify the reviews.
Reported 5-fold cross-validation accuracy and the ROC plot.

1. Proof of Existence of the Problem

To verify that the problem of fake reviews exists, we chose 5 random books for each year from 2013 to 2017. That gave 25 books in total, and for each book we collected 500 reviews through web scraping. We identified fake reviews by comparing each review's date with the book's publication date: a review dated before publication was counted as fake. The results were astonishing. The number of fake reviews shows an upward trend, which indicates how prominent the problem has become.
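The date check described above can be sketched in a few lines of Python. This is only an illustration, not our original scraping code: the function names and the sample dates are hypothetical, and it assumes the review and publication dates have already been parsed into date objects.

```python
from datetime import date


def is_pre_publication(review_date: date, publish_date: date) -> bool:
    """Return True when a review is dated before the book was published."""
    return review_date < publish_date


def pre_publication_share(review_dates, publish_date):
    """Fraction of reviews dated before publication (0.0 when there are none)."""
    if not review_dates:
        return 0.0
    flagged = sum(1 for d in review_dates if is_pre_publication(d, publish_date))
    return flagged / len(review_dates)


# Hypothetical dates for illustration only; they are not from our dataset.
publish = date(2017, 10, 1)
reviews = [date(2017, 6, 1), date(2017, 10, 5), date(2017, 9, 30), date(2018, 1, 2)]
share = pre_publication_share(reviews, publish)
print(f"{share:.0%} of the sample reviews predate publication")
```

Computing this share per book and grouping the books by publication year is one way to look for a year-on-year trend.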

Since not much work has been done on Goodreads before, there was no existing annotated dataset on which to train our model, so we manually annotated 10,000 reviews. Once the annotated dataset was ready, we had to select features that would be useful in differentiating fake reviews from genuine ones.

After selecting the network, we searched for evidence that the problem exists. Let's have a look at some examples:

The following is a screenshot of Dan Brown's book Origin, which was rated long before it was even published.

Another example

2. Data Collection

Next came data collection. We faced several problems along the way:

Firstly, the Goodreads API does not share any useful data beyond book and user identifiers, so we had to collect all of the data by web scraping.
Secondly, Goodreads does not make much data publicly available. For example, when collecting the reviews of a book, we could only collect the 500 most popular reviews, no matter whether the book had 10k or 30k reviews. We even emailed the Goodreads administrators about this, but they declined to provide the complete data.

3. Defining a “Fake” review

Ours is a binary classification problem: a review is either “fake” or it is not. So it was very important for us to draw a clear boundary for what kind of review counts as fake. For this, we took help from earlier research papers on Amazon reviews.

Here are the four criteria we used to mark a review as fake:

Reviews posted before publishing date of the book.
Untruthful opinions.
Review about the author and not the book.
Advertisements and promotions.

4. Manual Annotation

It was a new dataset, so to train a model we needed ground-truth labels. We manually annotated 10,000 reviews based on the above 4 criteria and used them to fit the model.

5. Feature Set Identification

The most crucial part, which is largely responsible for the accuracy of the model, is its feature set. The following features were used in the project to train the classifier. Two kinds of features were used: review-centric and reviewer-centric.

Review Centric	Reviewer Centric
Number of reviews and replies a book gets.	Average rating
Length of the review	Standard Deviation in Ratings
Cosine Similarity.	If all reviews have been good or all have been bad.
Position of the review when sorted with time	If the book is one of Reader’s Choice books
Percentage of positive and negative opinion bearing words in the review.	Deviation of a review from the average rating.
Percentage of times author name is mentioned in the review.
Percentage of numerals, capital letters and non-capital letters in the review
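To make a few of these features concrete, here is a minimal Python sketch of how they could be computed. It is not our original feature-extraction code, and every function name is hypothetical. It also rests on some assumptions: cosine similarity is taken between the bag-of-words vectors of two reviews, the character percentages are measured over all characters in the review, and the rating deviation is a review's rating minus the average of a list of ratings.

```python
import math
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def character_percentages(text):
    """Percentage of digits, capital letters and lowercase letters in a review."""
    n = len(text)
    if n == 0:
        return {"numerals": 0.0, "capitals": 0.0, "non_capitals": 0.0}
    return {
        "numerals": 100.0 * sum(c.isdigit() for c in text) / n,
        "capitals": 100.0 * sum(c.isupper() for c in text) / n,
        "non_capitals": 100.0 * sum(c.islower() for c in text) / n,
    }


def author_mention_percentage(text, author_name):
    """Percentage of review tokens that are part of the author's name."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    name_tokens = set(tokenize(author_name))
    return 100.0 * sum(t in name_tokens for t in tokens) / len(tokens)


def rating_features(ratings, review_rating):
    """Average rating, its standard deviation, and one review's deviation."""
    average = mean(ratings)
    return {
        "average": average,
        "std_dev": pstdev(ratings),
        "deviation": review_rating - average,
    }
```

Each function returns one number or a small dictionary, so the values for a review can be collected into a single row of the feature matrix.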

6. Results of the model

We used these features to train a logistic regression classifier. The following plot shows the 5-fold cross-validation accuracies.

https://lh3.googleusercontent.com/0il3RP2gf9ydwfyz2mxY-OqgET1J3dGX1c2FUqf9KscyIWRr3JRPiiyQGYklAp_sdYMXbge4FgOoTnxJ1hMKAu3PdubhdcBhJPLW-Y25qzALDr9IqbPMiksAmCSHToAlJ6vqiy2Dmd0

Fig. 5-Fold Cross-Validation Accuracies

https://lh5.googleusercontent.com/oTRMTLQv3K0Ud2T911vEHCPi1-go0U15wEMQD4owlnofi7vXj6a-Hgt4mAcSRAqjzcO9oBbZ8RLHPm_nol6HC7KIIlHkUHb9stKViez_IuDYWrNqj0--4zG23_pfNdAZE84K8Bcx51U

Fig. ROC Curve
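For readers who want to reproduce this kind of evaluation, the sketch below shows 5-fold cross-validation of a logistic regression classifier with scikit-learn. It is an illustration rather than our original training script. The data is synthetic, generated with make_classification as a stand-in for the annotated feature matrix, so the numbers it prints say nothing about our results. Feature scaling, stratified folds and the random seeds are assumptions made for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 500 rows, 12 feature columns,
# binary labels where 1 plays the role of "fake". Not our real dataset.
X, y = make_classification(n_samples=500, n_features=12, random_state=0)

# Scaling before logistic regression is a choice made for this sketch.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy value per fold, which is what the accuracy plot shows.
fold_accuracies = cross_val_score(model, X, y, cv=folds, scoring="accuracy")

# Out-of-fold probabilities for the positive class, usable for an ROC curve.
fake_probability = cross_val_predict(
    model, X, y, cv=folds, method="predict_proba"
)[:, 1]
auc = roc_auc_score(y, fake_probability)

print("fold accuracies:", fold_accuracies)
print("area under the ROC curve:", auc)
```

Passing y and fake_probability to sklearn.metrics.roc_curve gives the false positive and true positive rates needed to draw the ROC plot.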

Through our work, we created a labelled dataset of fake reviews on Goodreads and trained a model on it. Our project can be used to identify fake reviews and help readers and authors strengthen their trust in Goodreads. Accuracy could be improved further with a better feature set, better hyperparameter optimization and removal of outliers.

Future Work

Finding relationships between fake reviewers.
Creating a browser extension based on our learning model to rate fakeness of a review on a scale of 1-10.

References

[1] https://www.cs.uic.edu/~liub/FBS/fake-reviews.html

[2] http://www.bbc.com/news/technology-22166606

[3] http://www.thedenverchannel.com/news/woman-paid-to-post-five-star-google-feedback

[4] https://www.cs.uic.edu/~liub/FBS/opinion-spam-WSDM-08.pdf

[5] http://blogs.wsj.com/wallet/2009/07/09/delonghis-strange-brew-tracking-down-fake-amazon-raves/

The Project Presentation

Our Team: (left to right) Prerna Kalla, Ojasvi Agarwal and Prateek Kumar Yadav

CSE648: Privacy and Security in Online Social Media Projects
