Ever wondered what's the best way to get real audience opinions about the movie you are about to watch? So did we. What we have today is either opinionated critic reviews or fan-dominated IMDb.
So we thought it would be cool to build an app that rates movies based on what real-world audiences think. But wait, where are the people and their opinions? On Twitter, of course. It's a good thing that people are highly opinionated on Twitter, which helps us make a clear distinction between positive and negative sentiment about movies.
So go to this site and you get a rating for every movie currently running in theaters. We also show how many tweets we used to arrive at each score, and since the process is automatic, we are not putting our own biases into the score. You will find the scores are on par with IMDb's, but fully automatic and democratic, as everyone gets only one vote.
How do we do it?
OK, now you'd like to know our secret sauce. Let's get to it.
We use machine learning to do this. Specifically, we use sentiment analysis to classify a given tweet as positive, negative, or neutral. We collect a large number of tweets about these movies over a few weeks and then aggregate them to come up with the final score.
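To make the aggregation step concrete, here is a minimal sketch in Python. The exact formula is an assumption, since it is not spelled out here: it takes the share of positive votes among opinionated (non-neutral) votes and scales it to 10. The function name `movie_score`, the `(user_id, label)` input shape, and the choice to keep each user's latest tweet as their single vote are all hypothetical.

```python
from collections import Counter


def movie_score(tweets, scale=10.0):
    """Aggregate classified tweets into one score (illustrative formula only).

    tweets: iterable of (user_id, label) pairs, where label is
    "positive", "negative" or "neutral".
    Returns (score, number_of_voters); score is None if nobody had an opinion.
    """
    # One vote per user: a later tweet from the same user replaces the earlier one.
    votes = {}
    for user_id, label in tweets:
        votes[user_id] = label

    counts = Counter(votes.values())
    opinionated = counts["positive"] + counts["negative"]
    if opinionated == 0:
        return None, len(votes)

    # Assumed scoring rule: fraction of positive votes, scaled to 0..scale.
    score = round(scale * counts["positive"] / opinionated, 1)
    return score, len(votes)
```

Called with the classifier's output for one movie, this returns both the score and the number of voters behind it, which matches the two numbers shown for each movie.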
If you want to play with this sentiment analyzer, you can go here.
Machine Learning Model
We used this paper as our base, with some improvements of our own. The following are a few of the improvements to the model:
- Negation Handling
The original paper does not say much about negation handling. Since that work was published in 2009, many different ways of handling negation have been proposed. We use the technique of negating every word after a negation until we reach a marker for the end of the phrase, such as a comma, colon, question mark, or period. Because Twitter limits tweets to 140 characters, this approach works well most of the time.
- Hashtag Handling
As Twitter grew bigger and bigger, hashtags came to play a bigger role in it. Hashtags are a way of organizing and searching through billions of tweets. The model uses hashtags as a strong indicator of mood: a hashtag like #awesome is a positive indicator, and #sad is a negative one.
- Caps Handling
People use caps to indicate a strong opinion. The model incorporates that too.
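The three improvements above can be sketched as a single feature-extraction step. This is only an illustration under stated assumptions, not our production code: the negation word list, the `NOT_`, `HASHTAG_` and `CAPS_` feature prefixes, and the function name `extract_features` are all made up for the example, and hashtags are deliberately left un-negated to keep it short.

```python
import re

# Assumed, deliberately tiny list of negation words.
NEGATION_WORDS = {"not", "no", "never", "cannot"}
# Phrase-ending markers that switch negation off again.
PHRASE_END = {",", ":", "?", "."}
# Tokens: hashtags, words (with an optional 't-style suffix), or phrase enders.
TOKEN = re.compile(r"#\w+|\w+(?:'\w+)?|[,:?.]")


def extract_features(tweet):
    """Turn a tweet into a list of string features for the classifier."""
    features = []
    negating = False
    for token in TOKEN.findall(tweet):
        if token in PHRASE_END:
            negating = False  # end of phrase: stop negating
            continue
        if token.startswith("#"):
            # Hashtags become their own strong-signal features.
            features.append("HASHTAG_" + token[1:].lower())
            continue
        is_caps = len(token) > 1 and token.isupper()
        word = token.lower()
        if word in NEGATION_WORDS or word.endswith("n't"):
            negating = True  # negate everything up to the next phrase end
            features.append(word)
            continue
        if negating:
            word = "NOT_" + word
        features.append(word)
        if is_caps:
            # Extra feature so the model can weigh shouted words more heavily.
            features.append("CAPS_" + word)
    return features


print(extract_features("I did not like this movie, but the ending was AWESOME #mustwatch"))
```

In the printed list, the words between "not" and the comma carry the `NOT_` prefix, "AWESOME" contributes both a plain and a `CAPS_` feature, and the hashtag shows up as `HASHTAG_mustwatch`.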
Machine Learning Classifier – Naive Bayes
We picked Naive Bayes for two reasons. First, it works very well for document classification compared to logistic regression. Second, although it is less accurate than SVM, it is much faster: we observed that SVM did not give us much better accuracy, but it came with a significant performance hit.
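For readers who have not seen it before, here is a minimal multinomial Naive Bayes sketch in plain Python. It is only meant to show why the algorithm is fast (training is just counting, prediction is a sum of logs). It is not the library implementation we run, and the class name `NaiveBayes` and the smoothing parameter `alpha` are choices made for this example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        # Training is pure counting: documents per class and words per class.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # log P(class) + sum of log P(word | class)
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            denom = total_words + self.alpha * len(self.vocab)
            for tok in tokens:
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NaiveBayes().fit(
    [["loved", "it"], ["awesome", "movie"], ["hated", "it"], ["boring", "movie"]],
    ["positive", "positive", "negative", "negative"],
)
print(model.predict(["boring", "movie"]))
```

The token lists would normally come from a feature-extraction step rather than being typed by hand as they are here.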
Technology stack – Spark and MLlib
Spark is a high-performance cluster computing engine with a great API. MLlib is its machine learning library, which lets us run algorithms like Naive Bayes at large scale.