It has been a long time since e-commerce was introduced, and many people have moved towards it. Dependence on product-based companies has grown enormously. Few stopped to ask why; everyone simply followed the trend. The real reason is that people trusted these services and the products they offered.
They never knew what was really going on on these websites, or on most social media these days. The point we want to raise through this blog is ‘Fake Reviewing’. Being fake is a trending problem these days, and many are already fighting against it. Like them, we took a step in this direction.
What does ‘Being Fake’ mean? It means deliberately bringing to others’ notice information that does not actually exist, while ‘someone fake’ tries to make others believe that it does.
Many have recognised this problem recently, and many are still trying to identify it. Coming to the point, it has become common to check a product’s reviews online before buying it. Reviews are a source of product information that people trust, because they assume each review was written honestly by another user. But these other users may or may not be genuine. Fake reviews from non-genuine users can be positive as well as negative. Positive reviews tend to increase sales of the product in question, whereas negative reviews create a negative image of it.
These positive and negative effects on a product’s image are created solely to influence its sales: a fake positive review is meant to raise sales, and a fake negative one to lower them. This was the basic idea behind our project.
Aim of the Project
For our project, we chose fake review detection for books on an online social network (OSN) named Goodreads.
Goodreads is a social network for book readers that launched in 2007. It allows users to sign up, make friends, review books, create reading lists, join groups and so on. With more than 55 million users, Goodreads has become the number one site for checking out book reviews.
Readers around the world prefer this network for sharing their opinions about books, getting reviews from other readers and having healthy discussions. Unfortunately, this OSN has become a victim of opinion spamming.
Rather than posting honest reviews of a book, reviewers intentionally post untrue opinions, reviews about the author rather than the book, or unrelated content such as advertisements. There have been multiple reports of books being reviewed negatively before their publishing date, and authors complain about fake reviews of their books on various discussion forums. Sites like Fiverr also offer commercial services such as posting multiple positive or negative reviews for money. These incidents motivated us to choose Goodreads for our course project.
Our project, in a nutshell:
- Chose a network of our interest.
- Identified a problem that we could work with.
- Established the fact that the problem exists.
- Collected data by web scraping.
- Analysed the data to decide on features.
- Defined what a fake review is, according to us.
- Annotated reviews according to our definitions.
- Trained a model to classify the reviews.
- Showed 5-fold cross-validation accuracy and the ROC plot.
1. Proof of existence of the problem
To verify that the problem of fake reviews exists, we chose 5 random books for each year from 2013 to 2017. That gave 25 books in total, and for each book we collected 500 reviews through web scraping. We identified fake reviews by comparing each review’s date with the publishing date of the book. The results were astonishing: the number of fake reviews shows an upward trend, which shows how prominent the problem has become.
Since not much work has been done on Goodreads before, there was no existing annotated dataset on which to train our model, so we manually annotated 10,000 reviews. Once the annotated dataset was ready, we had to select features that would help differentiate fake reviews from genuine ones.
After selecting the network, we searched for proof that the problem exists. Let’s have a look:
The following is a snapshot of Dan Brown’s book Origin, which was rated well before it was even published.
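The date comparison described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the field names (`review_id`, `review_date`), the helper `flag_prepublication_reviews` and the sample dates are hypothetical, not the schema or data we actually scraped.

```python
from datetime import date

def flag_prepublication_reviews(reviews, publication_date):
    """Return the reviews dated strictly before the book's publication date."""
    return [r for r in reviews if r["review_date"] < publication_date]

# Hypothetical sample data; field names and dates are made up for illustration.
publication_date = date(2016, 5, 10)
reviews = [
    {"review_id": 1, "review_date": date(2016, 2, 1)},   # before release
    {"review_id": 2, "review_date": date(2016, 5, 10)},  # on release day
    {"review_id": 3, "review_date": date(2016, 8, 23)},  # after release
]

flagged = flag_prepublication_reviews(reviews, publication_date)
share = len(flagged) / len(reviews)
print([r["review_id"] for r in flagged], round(share, 2))
```

Computing `share` per book and grouping the books by publication year would give the kind of year-by-year comparison described above.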
2. Data Collection
Next came data collection. We faced several problems along the way:
- Firstly, the Goodreads API does not share much useful data beyond book and user identifiers, so we had to collect all the data by web scraping.
- Secondly, Goodreads does not share much data publicly either. When collecting the reviews of a book, we could retrieve only the 500 most popular reviews, no matter whether the book had 10k or 30k. We even emailed the Goodreads administrators about this, but they declined to provide the complete data.
3. Defining a “Fake” review
Ours is a binary classification problem: a review is either “fake” or it is not. So it was very important for us to draw the boundary that defines which reviews count as fake. For this, we drew on earlier research papers on Amazon reviews.
Here are the four kinds of review we treated as fake:
- Reviews posted before publishing date of the book.
- Untruthful opinions.
- Review about the author and not the book.
- Advertisements and promotions.
4. Manual Annotation
It was a new dataset, so to train a model we needed ground-truth labels. We manually annotated 10,000 reviews based on the above four parameters to fit the model.
5. Feature Set Identification
The most crucial factor in the accuracy of the model is its feature set. We used the following features to train the classifier. They are of two kinds: review-centric and reviewer-centric.
- Number of reviews and replies a book gets.
- Length of the review.
- Standard deviation in ratings.
- Whether all reviews have been good or all have been bad.
- Position of the review when sorted by time.
- Whether the book is one of the Reader’s Choice books.
- Percentage of positive and negative opinion-bearing words in the review.
- Deviation of a review from the average rating.
- Percentage of times the author’s name is mentioned in the review.
- Percentage of numerals, capital letters and non-capital letters in the review.
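To make a few of the review-centric features concrete, here is a minimal sketch of how they might be computed from the raw review text. Everything in it is an assumption for illustration: the function name `review_features`, the tiny word lists, the tokenisation and the choice of absolute deviation are ours, and a real run would need a full opinion-word lexicon.

```python
import re
from statistics import mean

# Tiny placeholder lexicons (assumption); a real run needs a full opinion-word list.
POSITIVE_WORDS = {"good", "great", "amazing", "love", "brilliant"}
NEGATIVE_WORDS = {"bad", "boring", "awful", "hate", "terrible"}

def review_features(text, rating, book_ratings, author_name):
    """Compute a handful of review-centric features for one review."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = max(len(words), 1)   # avoid division by zero on empty reviews
    n_chars = max(len(text), 1)
    author_tokens = set(author_name.lower().split())
    return {
        "length": len(words),
        "pct_positive": sum(w in POSITIVE_WORDS for w in words) / n_words,
        "pct_negative": sum(w in NEGATIVE_WORDS for w in words) / n_words,
        # Absolute distance of this rating from the book's average rating.
        "rating_deviation": abs(rating - mean(book_ratings)),
        "pct_author_mentions": sum(w in author_tokens for w in words) / n_words,
        "pct_numerals": sum(c.isdigit() for c in text) / n_chars,
        "pct_capitals": sum(c.isupper() for c in text) / n_chars,
    }

# Made-up example review, not taken from the dataset.
features = review_features(
    text="Dan Brown is AMAZING, buy all 5 books!",
    rating=5,
    book_ratings=[5, 3, 4, 2, 1],
    author_name="Dan Brown",
)
print(features)
```

Each review then becomes one row of numbers, which is the form a classifier such as logistic regression expects.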
6. Results of the model
We used these features to train a logistic regression model. The following plot shows the 5-fold cross-validation accuracies.
Fig. 5 Fold Cross Validation Accuracies
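For readers unfamiliar with the procedure, the sketch below shows what 5-fold cross-validation of a logistic regression classifier involves, written in plain Python so that every step is visible. It is not our project code: the gradient-descent trainer, the learning rate, the number of epochs and the synthetic two-feature data are all assumptions made for illustration, and in practice a machine learning library would handle this.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic_regression(X, y, lr=0.1, epochs=200):
    """Fit weights and bias with plain batch gradient descent on log loss."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

def k_fold_accuracies(X, y, k=5, seed=0):
    """Shuffle once, split into k folds, and score each held-out fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held_out_set = set(held_out)
        train = [i for i in idx if i not in held_out_set]
        w, b = train_logistic_regression([X[i] for i in train],
                                         [y[i] for i in train])
        correct = sum(predict(w, b, X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores

# Synthetic stand-in for the annotated feature vectors (not the project's data):
# "fake" points cluster around (2, 2), "genuine" points around (-2, -2).
rng = random.Random(42)
X = ([[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(50)]
     + [[rng.gauss(-2, 1), rng.gauss(-2, 1)] for _ in range(50)])
y = [1] * 50 + [0] * 50

scores = k_fold_accuracies(X, y)
print(scores, sum(scores) / len(scores))
```

The five numbers in `scores` correspond to the five bars or points one would plot in a figure like the one captioned above.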
Fig. ROC Curve
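An ROC curve is built by sweeping the decision threshold over the classifier’s predicted probabilities and recording the false positive and true positive rates at each step. The sketch below is a minimal illustration of that idea; the labels and probabilities are invented, and the helper names `roc_curve_points` and `auc` are our own.

```python
def roc_curve_points(labels, probs):
    """Return (fpr, tpr) pairs obtained by sweeping the decision threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(probs, labels), key=lambda p: -p[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical classifier outputs: 1 = fake, 0 = genuine.
labels = [1, 1, 0, 1, 0, 0]
probs = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
points = roc_curve_points(labels, probs)
print(points, auc(points))
```

An area close to 1 means the model ranks fake reviews above genuine ones almost always, while 0.5 is no better than guessing.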
Through our work, we created a labelled dataset of fake reviews on Goodreads and trained a model on it. Our project can be used to identify fake reviews and help readers and authors strengthen their trust in Goodreads. Accuracy could be improved with a better feature set, better hyperparameter optimisation and removal of outliers. Possible future work includes:
- Finding relationships between fake reviewers.
- Creating a browser extension based on our learning model to rate fakeness of a review on a scale of 1-10.
The Project Presentation
Our Team: (left to right) Prerna Kalla, Ojasvi Agarwal and Prateek Kumar Yadav