
Better Safe Than Sorry: An Analysis of Crime in Various Indian Cities

If rich people died in wars, the media would shift and finally pay attention and label it as a crime - @marvinthe_great 



Data is being created all the time, producing enormous datasets; online social networks such as Twitter are one example. By mining these large datasets, it is possible to extract useful samples and patterns. The primary focus of this project is to analyse Twitter, identify references to crime or insecurity in the public text of tweets, and then plot crime hotspots on a map using the location metadata attached to these tweets. We aim to measure the correlation between the "crimes talked about" and the crime statistics of these areas over the years.


Better safe than sorry. Your mother most likely shouted those words as she chased you down the road, bicycle helmet in hand. The expression becomes less fun, however, when you consider this: a violent crime occurs every 45 seconds in India. Perhaps it is time to stop fooling around about safety after all?
Would you be concerned if you knew that people tweeted about rape three times as often in your city as in the neighbouring one?
Whether you see crimes or not, they are happening, and people are talking about them.
To find out what people talk about the most, we scoured 40 days' worth of Twitter posts from various cities in India for crime-related words. We collated the findings into maps and charts, sorting the crimes into five main categories: violence, theft, sexual assault, drug smuggling, and damage.


The key outcomes of this project:
  1. Source code that scans tweets for the crime-related keywords of each crime type by pattern matching. The scan covers the tweet text and the URL entity attached to it, if any.
  2. An appropriate statistical test, with an analysis of the results, to check for correlation between the two datasets (tweets and official statistics) for each crime type.
  3. An analysis of city-wise crime rates to obtain crime statistics per city.
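
The post does not name the statistical test that was used. As one possibility, a rank correlation such as Spearman's could compare per-city tweet counts with per-city official counts for a single crime type. The sketch below is an assumption and not the project's code; the function names `ranks`, `pearson` and `spearman` are hypothetical, and no real figures are included.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))
```

Here `xs` would hold the tweet counts and `ys` the official counts, one entry per city in the same order. With only a handful of cities the coefficient is a rough indicator at best, so it should be read alongside the raw counts.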


There were three significant steps involved in curating the data for our analysis:

Collecting and Categorising Twitter Data

We gathered tweets from the past 40 days for seven metropolitan cities in India (Bengaluru, Chennai, Goa, Hyderabad, Kolkata, Mumbai and New Delhi), based on the search-term dictionary defined below.

Crime Category: Crime Related Terms
Violence: ‘murder’, ‘killed’, ‘stabbed’, ‘homicide’, ‘drink driving’, ‘harassment’, ‘gunshot’, ‘domestic abuse’, ‘stalking’
Theft: ‘theft’, ‘stolen’, ‘robbed’, ‘burglary’, ‘breaking’ + ‘entering’, ‘nicked’, ‘stealing’
Sexual Assault: ‘rape’, ‘molestation’, ‘sexual assault’, ‘raped’, ‘sexual abuse’
Drug Smuggling: ‘drug’ + ‘possession’, ‘drugs’, ‘possession’, ‘intent to supply’, ‘heroin’, ‘LSD’, ‘cocaine’, ‘smuggle’
Damage: ‘vandalise’, ‘vandals’, ‘arson’, ‘set fire’, ‘destroyed’, ‘vandalize’
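
As a rough illustration of how the search-term dictionary above could drive the pattern matching, here is a minimal sketch. It is not the project's source code: the names `CRIME_TERMS`, `contains_term` and `categorise` are hypothetical, the category labels follow the five categories named earlier, and whole-word, case-insensitive matching is an assumption (under it, 'murder' does not match 'murdered').

```python
import re

# Terms copied from the table above. A tuple means every word in it must
# appear somewhere in the text (the table's 'breaking' + 'entering' notation).
CRIME_TERMS = {
    "violence": ["murder", "killed", "stabbed", "homicide", "drink driving",
                 "harassment", "gunshot", "domestic abuse", "stalking"],
    "theft": ["theft", "stolen", "robbed", "burglary",
              ("breaking", "entering"), "nicked", "stealing"],
    "sexual assault": ["rape", "molestation", "sexual assault", "raped",
                       "sexual abuse"],
    "drug smuggling": [("drug", "possession"), "drugs", "possession",
                       "intent to supply", "heroin", "LSD", "cocaine",
                       "smuggle"],
    "damage": ["vandalise", "vandals", "arson", "set fire", "destroyed",
               "vandalize"],
}


def contains_term(text, term):
    """True if `term` occurs in `text` as a whole word or phrase, ignoring case."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def categorise(text):
    """Return the set of crime categories whose terms appear in `text`."""
    matched = set()
    for category, terms in CRIME_TERMS.items():
        for term in terms:
            words = term if isinstance(term, tuple) else (term,)
            if all(contains_term(text, word) for word in words):
                matched.add(category)
                break  # one hit is enough for this category
    return matched
```

The same function could be run on the title text of a linked URL as well as on the tweet text. A tweet can fall into more than one category, which is why a set is returned.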

The collection for each city was done through Twitter's GET search/tweets Application Programming Interface (API), using the keywords from the table above, and also through scraping. We collected both geo-coded tweets (tweets within a 15 km radius of the city) and tweets that mention both the crime and the city in their text.
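
The post does not say how the 15 km radius was applied. One way to check it on the client side, assuming each geo-coded tweet carries a latitude and longitude, is the haversine great-circle distance. The function names and the choice of a single centre point per city are assumptions made for this sketch.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def within_radius(tweet_coords, city_centre, radius_km=15.0):
    """True if the tweet's (lat, lon) lies within `radius_km` of the city centre."""
    return haversine_km(*tweet_coords, *city_centre) <= radius_km
```

For example, `within_radius((19.10, 72.90), (19.076, 72.8777))` tests a point against an approximate centre for Mumbai; the coordinates are illustrative only.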

Sanitising Outputs

The collected tweets also needed sanitising. Many of them (around 75%) were irrelevant to our analysis. For example, while searching for the keyword 'murder', we encountered multiple tweets about the 2004 Bollywood movie 'Murder'.
We semi-automatically removed the tweets that had nothing to do with crime and had been collected by coincidence. This was done by filtering out tweets containing noisy keywords, which reduced the total to roughly 30,000 tweets.
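
The automatic part of this filtering could look something like the sketch below. The noise terms shown are hypothetical examples prompted by the 'Murder' movie case; the post does not list the terms that were actually used, and the function names are also assumptions.

```python
# Hypothetical noise terms per search keyword; the real list is not given
# in the post.
NOISE_TERMS = {
    "murder": ["movie", "film", "bollywood", "song", "trailer"],
}


def is_noisy(text, keyword, noise_terms=NOISE_TERMS):
    """True if `text` contains any noise term registered for `keyword`."""
    lowered = text.lower()
    return any(term in lowered for term in noise_terms.get(keyword, []))


def sanitise(tweets, keyword, noise_terms=NOISE_TERMS):
    """Keep only tweets that contain none of the noise terms for `keyword`."""
    return [t for t in tweets if not is_noisy(t, keyword, noise_terms)]
```

A simple substring filter like this will also discard some genuine reports, for instance a crime that happened at a film screening, which is one reason a manual pass over the output remains useful.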

Collecting Actual Crime Data

To obtain actual crime data, we analysed the 2013 crime statistics published online by the Government of India. We recorded the crimes in each city according to our categories.


The first step of the analysis was feature extraction. We used the text of the tweet, along with the text from the titles of the linked URLs (mostly news reports and YouTube videos about the incident), as well as any text from images attached to the tweet. From this text, we extracted tf-idf features and reduced them to 20 features using machine learning techniques.
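
To make the tf-idf step concrete, here is a small pure-Python sketch. It uses one common weighting (term share of the document multiplied by log(N / document frequency)); the project may have used a different variant or a library such as scikit-learn. The reduction to 20 features is not shown because the post does not name the technique used, and the names `tokenise` and `tfidf` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenise(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf(documents):
    """Return one {term: weight} dict per document.

    tf is the term's share of the document's tokens; idf is
    log(N / document_frequency), so a term present in every document scores 0.
    """
    tokenised = [tokenise(doc) for doc in documents]
    n_docs = len(tokenised)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors
```

Each "document" here would be the combined text for one tweet: the tweet itself, the linked page title and any text recovered from images.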

After feature extraction, we clustered the data. Our dataset contained data for each category in each city. To identify tweets describing the same incident, we applied DBSCAN clustering to each of these subsets. The following was the output:
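
For readers unfamiliar with DBSCAN, the sketch below shows the core of the algorithm in plain Python. It is an illustration only: the `eps` and `min_samples` values used in the project are not stated, Euclidean distance is an assumption, and a library implementation would normally be preferred over this quadratic-time version.

```python
from math import dist

NOISE = -1


def region_query(points, index, eps):
    """Indices of all points within `eps` of points[index] (itself included)."""
    return [j for j, p in enumerate(points) if dist(points[index], p) <= eps]


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_samples:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but does not expand
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)  # j is a core point, keep expanding
    return labels
```

In this setting each point would be the 20-feature vector of one tweet, each resulting cluster a candidate incident, and tweets labelled `NOISE` would be treated as not belonging to any multi-tweet incident.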

We also made word clouds for each of the cities, namely Bangalore, Delhi, Mumbai, Kolkata and Hyderabad. The following were the results:





These word clouds gave us a good idea of which words were used most often to describe crimes in each city, and let us present that effectively as well.

Additionally, we made graphs for each city comparing the actual crime numbers from the government site with the number of crimes found via Twitter after collecting the data, filtering the tweets, extracting features and clustering.

The following were the results:
In some cities, the number of crimes found on Twitter was higher than the actual number, and in others it was lower. Figures for crimes such as theft are much higher in police statistics, because people report petty thefts to the police far more often than they post about them on Twitter. Cases of sexual violence, on the other hand, are usually picked up by newspapers, which triggers more Twitter activity. Smuggling is rarely reported or talked about, since people are more interested in crimes that threaten their personal liberties.

In addition, we made graphs of tweets per category, comparing how often each gender talked about each type of crime. We inferred the gender of each user from his/her name using the 'Gender API' service. The following were the results:

The graphs show that, overall, men talk about and report more crimes on Twitter than women. This is consistent with men having a larger presence on social media. Violent crimes like murder are talked about more than other crimes by both genders in all the cities.


Because the government doesn’t release actual crime descriptions case-by-case, there is no ground truth to verify the news stories on Twitter. This makes it difficult to figure out if two tweets are talking about the same incident. 

Team and Poster Presentation

The following is the project poster we presented on the day of the poster presentation:

Group 16. 

Members - 
  1. Parikshit Diwan (2014074)
  2. Shiven Mian (2015094)
  3. Madhav Varma (2014061)
  4. Mridul Gupta (2015061)

