
Identifying fake reviews in Goodreads

E-commerce has been around for a long time, and many people have moved towards it. Our dependence on product-based companies has grown enormously. Few people stopped to ask why; they simply followed the trend. The real reason is that people trusted these services and the products sold on them.
Most users never knew what was actually going on behind these websites, or on most social media these days. The issue we want to raise in this blog is ‘Fake Reviewing’. Being fake is a widespread problem these days, and many people are already out there fighting it. Like them, we took a step in this direction.

What does ‘Being Fake’ mean? It means purposefully presenting to others information that isn’t actually true, while trying to make them believe that it is.
Many have recognised this problem recently, and many are still working out how to identify it. Coming to the point: it has become common to check a product’s online reviews before buying it. Reviews are one source of product information that people trust, because they assume each review was written honestly by another user. But these users may or may not be genuine. Fake reviews from non-genuine users can be positive or negative. Positive reviews tend to increase the sales of the product in question, whereas negative reviews damage its image.
Both kinds are written purely to influence the sales of a product: a fake positive review may push sales up, and a fake negative one may pull them down. This was the basic idea behind our project.
Aim of the Project
For our project, we chose to detect fake reviews of books on an OSN (online social network) called Goodreads.
Goodreads is a social network for book readers that launched in 2007. It allows users to sign up, make friends, review books, create reading lists, join groups and so on. With more than 55 million users, Goodreads has become the number one site for checking out book reviews.
Readers throughout the world prefer this network for sharing their opinions about books, getting reviews from other readers, and having healthy discussions. Unfortunately, this OSN has become a victim of opinion spamming.
Rather than posting honest reviews of a book, some reviewers intentionally post untrue opinions, comments about the author rather than the book, or unrelated content such as advertisements. There have been multiple reports of books being reviewed negatively before their publishing date, and authors have complained about fake reviews of their books on various discussion forums. Sites like Fiverr also offer commercial services such as posting multiple positive or negative reviews for money. These incidents motivated us to choose Goodreads for our course project.

Our project, in a nutshell:
  • Chose a network of our interest.
  • Identified a problem that we could work with.
  • Established the fact that the problem exists.
  • Collected data by web scraping.
  • Analysed the data to decide on features.
  • Defined what we consider a fake review.
  • Annotated reviews according to our definitions.
  • Trained a model to classify the reviews.
  • Showed 5-fold cross-validation accuracy and an ROC plot.

1. Proof of existence of the problem
To verify that the problem of fake reviews exists, we chose 5 random books for each year from 2013 to 2017. That gave us 25 books in total, and for each book we collected 500 reviews through web scraping. We identified fake reviews by comparing each review’s date with the publishing date of the book. The results were astonishing: the number of fake reviews shows an upward trend, which shows how prominent the problem has become.
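The date comparison described above can be sketched in a few lines of Python. This is only a minimal illustration, not our actual script: the function names and the sample dates are hypothetical, and it assumes that each scraped review carries a date that can be compared with the book's publishing date.

```python
from datetime import date


def is_prepublication(review_date, publication_date):
    """Return True if the review was posted strictly before publication."""
    return review_date < publication_date


def prepublication_rate(review_dates, publication_date):
    """Fraction of reviews posted before the book's publishing date."""
    if not review_dates:
        return 0.0
    flagged = sum(is_prepublication(d, publication_date) for d in review_dates)
    return flagged / len(review_dates)


# Hypothetical sample data, for illustration only.
publication_date = date(2017, 10, 3)
review_dates = [
    date(2017, 6, 1),    # before publication -> flagged
    date(2017, 9, 30),   # before publication -> flagged
    date(2017, 10, 3),   # on the publishing date -> not flagged
    date(2017, 11, 15),  # after publication -> not flagged
]

rate = prepublication_rate(review_dates, publication_date)
print(f"{rate:.0%} of the sample reviews predate publication")
```

Running the same check per book and grouping the rates by publication year is one way to see the kind of trend described above.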

Since not much work has been done on Goodreads before, there was no existing annotated dataset on which to train our model. We manually annotated 10,000 reviews. Once the annotated dataset was ready, we had to select features that would help differentiate fake reviews from genuine ones.

After selecting the network, we looked for evidence that the problem exists. Let’s have a look at it:
The following is a snapshot of Dan Brown’s book Origin, which was rated well before it was even published.

Another example

2. Data Collection
Next came data collection. We faced several problems along the way:

  • Firstly, the Goodreads API does not share much useful data beyond book and user identifiers, so we had to collect all of our data through web scraping.
  • Secondly, Goodreads does not expose much data publicly. While collecting the reviews of a book, we could only get the 500 most popular reviews, no matter whether the book had 10k reviews or 30k. We even emailed the Goodreads administrators about this, but they declined to provide the complete data.

3. Defining a “Fake” review
Ours is a binary classification problem: a review is either “fake” or it is not. So it was very important for us to draw the boundary that defines what kind of review counts as fake. For this, we drew on earlier research papers on Amazon reviews.
Here are the four kinds of review we treated as fake:
  1. Reviews posted before publishing date of the book.
  2. Untruthful opinions.
  3. Review about the author and not the book.
  4. Advertisements and promotions.

4. Manual Annotation
It was a new dataset, so to train a model we needed ground-truth labels. We manually annotated 10,000 reviews based on the above 4 criteria and used them to fit the model.

5. Feature Set Identification

The most crucial factor in the accuracy of a model is its feature set. We used the following features to fit the dataset to the classifier. They fall into two kinds: review-centric and reviewer-centric.

Review Centric
  • Number of reviews and replies a book gets.
  • Length of the review.
  • Cosine similarity.
  • Position of the review when sorted by time.
  • Percentage of positive and negative opinion-bearing words in the review.
  • Percentage of times the author's name is mentioned in the review.

Reviewer Centric
  • Average rating.
  • Standard deviation in ratings.
  • Whether all reviews have been good or all have been bad.
  • Whether the book is one of the Reader’s Choice books.
  • Deviation of a review from the average rating.
  • Percentage of numerals, capital and non-capital letters in reviews.
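To make a few of these features concrete, here is a rough sketch of how some of them could be computed in plain Python. It is not our actual feature-extraction code. The function names are hypothetical, cosine similarity is computed over simple word counts, the rating deviation is taken as an absolute difference, and the author-mention percentage is interpreted as the share of words that match a word of the author's name. All of these are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two reviews."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def char_percentages(text):
    """Percentage of characters that are numerals and capital letters."""
    if not text:
        return {"numerals": 0.0, "capitals": 0.0}
    n = len(text)
    return {
        "numerals": 100 * sum(c.isdigit() for c in text) / n,
        "capitals": 100 * sum(c.isupper() for c in text) / n,
    }


def rating_deviation(rating, all_ratings):
    """Absolute deviation of one rating from the average rating."""
    return abs(rating - sum(all_ratings) / len(all_ratings))


def author_mention_percentage(text, author_name):
    """Percentage of review words that match a word of the author's name."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    name_tokens = set(tokenize(author_name))
    return 100 * sum(t in name_tokens for t in tokens) / len(tokens)


# Hypothetical reviews, for illustration only.
review = "Dan Brown is OVERRATED. Buy 2 get 1 free at my store!"
other = "Buy 2 get 1 free at my store!"
print(len(review), cosine_similarity(review, other))
print(char_percentages(review), rating_deviation(1, [5, 5, 4, 1]))
print(author_mention_percentage(review, "Dan Brown"))
```

A high cosine similarity between reviews is a hint of copy-pasted text, which is why it is useful as a feature.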

6. Results of the model
We trained a logistic regression model on these features. The following plot shows the 5-fold cross-validation accuracies.
Fig. 5-Fold Cross-Validation Accuracies
Fig. ROC Curve
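For readers who want to try a similar evaluation, here is a minimal sketch of logistic regression with 5-fold cross-validation and an ROC curve. It assumes scikit-learn, which is our choice for this illustration only, and it runs on a synthetic dataset generated with make_classification as a stand-in for the annotated feature matrix. The numbers it prints say nothing about our actual results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the annotated data (NOT the Goodreads dataset):
# 1000 "reviews", 12 features, label 1 = fake, label 0 = genuine.
X, y = make_classification(
    n_samples=1000, n_features=12, n_informative=6, random_state=0
)

# Scale the features, then fit logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy score per fold.
fold_accuracies = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

# Out-of-fold probabilities of the "fake" class, used for the ROC curve.
fake_probabilities = cross_val_predict(
    model, X, y, cv=cv, method="predict_proba"
)[:, 1]
fpr, tpr, thresholds = roc_curve(y, fake_probabilities)
auc = roc_auc_score(y, fake_probabilities)

print("Fold accuracies:", np.round(fold_accuracies, 3))
print("Out-of-fold ROC AUC:", round(auc, 3))
```

The fpr and tpr arrays can be passed to any plotting library to draw the ROC curve, and the five fold accuracies can be shown as a simple bar chart.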
Through our work, we created a labelled dataset of fake reviews on Goodreads and trained a model on it. Our project can be used to identify fake reviews and help readers and authors strengthen their trust in Goodreads. The accuracy could be improved with a better set of features, better hyperparameter optimization techniques, and removal of outliers.
Future Work
  • Finding relationships between fake reviewers.
  • Creating a browser extension based on our learning model to rate fakeness of a review on a scale of 1-10.

The Project Presentation
Our Team: (left to right) Prerna Kalla, Ojasvi Agarwal and Prateek Kumar Yadav

