Ever wondered what’s the best way to get real audience opinions about the movie you are about to watch? So did we. What we have now is either opinionated critic reviews or fan-dominated IMDb.


So we thought it would be cool to build an app that rates movies based on what the real-world audience thinks. Wait, but where are people and their opinions? On Twitter, of course. It’s a good thing that people are highly opinionated on Twitter, which helps us make a clear distinction between positive and negative sentiment about movies.

twitter_allaboutmovies

So go to this site and you get a rating for every movie currently running in theaters. We also show how many tweets we used to arrive at each score, and as it is automatic, we are not putting our own biases into the score. You will find the scores on par with IMDb, but fully automatic and democratic, as everyone gets only one vote.

How do we do it?

OK, now you’d like to know our secret sauce. Let’s get to it.

Technique

We use machine learning to do this. In particular, we use sentiment analysis to rate a given tweet as positive, negative or neutral. We collect a huge number of tweets about these movies over a few weeks and then aggregate them to come up with the final score.
If you want to play with this sentiment analyzer, you can go here.
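
The exact aggregation formula is not spelled out here, so the following is only a minimal sketch of one way per-tweet labels could be rolled up into a rating. The function name movie_score, the 0 to 10 scale and the choice to ignore neutral tweets are all assumptions made for illustration, not our production formula.

```python
from collections import Counter

def movie_score(labels, scale=10.0):
    """Turn per-tweet sentiment labels into a single rating.

    Assumption: neutral tweets are ignored and the score is the share of
    positive tweets among the opinionated ones, scaled to 0..scale.
    """
    counts = Counter(labels)
    opinionated = counts["positive"] + counts["negative"]
    if opinionated == 0:
        return None  # no opinionated tweets, so no score
    return round(scale * counts["positive"] / opinionated, 1)

# Toy labels standing in for classifier output on five tweets.
labels = ["positive", "positive", "negative", "neutral", "positive"]
print(movie_score(labels), "from", len(labels), "tweets")
```

Because every tweet contributes exactly one label, each one counts as a single vote in the final score.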

Motog
twitter_senti_motog

Machine learning Model

We used this paper as a base, with improvements from our side. The following are a few of the improvements to the model:

  • Negation Handling
    The original paper does not say much about negation handling. Since that work was published in 2009, a lot of different ways to handle negation have been proposed. We use the technique of negating every word after a negation until we hit some marker for the end of the phrase, such as a comma, colon, question mark or full stop. As Twitter has a limit of 140 characters, negation handled this way works well most of the time.
  • Hashtag Handling
    As Twitter grew bigger and bigger, hashtags came to play a bigger role in it. Hashtags are the way of organizing and searching through billions of tweets. The model uses hashtags as a strong indicator of mood: a hashtag like #awesome gives a positive indicator and #sad gives a negative one.
  • Caps Handling
    People use caps to indicate a strong opinion. The model incorporates that too.
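
The three improvements above can be sketched as a single feature extraction step. This is a minimal illustration, not the model's actual code: the negation word list, the NOT_, HASHTAG_ and CAPS_ feature prefixes and the function name extract_features are all assumptions.

```python
import re

# Assumed negation cues; contractions ending in "n't" are also treated as negations.
NEGATIONS = {"not", "no", "never", "cannot"}
# Punctuation that ends the scope of a negation.
PHRASE_END = re.compile(r"[.,:;!?]")
# Words (optionally hashtags, optionally with an apostrophe) or phrase-ending punctuation.
TOKEN = re.compile(r"#?\w+(?:'\w+)?|[.,:;!?]")

def extract_features(tweet):
    """Return a list of features with negation, hashtag and caps markers."""
    features = []
    negating = False
    for token in TOKEN.findall(tweet):
        if PHRASE_END.fullmatch(token):
            negating = False  # punctuation closes the negated phrase
            continue
        word = token.lower()
        if token.startswith("#"):
            # Hashtags become their own feature and are never negated here.
            features.append("HASHTAG_" + word[1:])
            continue
        if word in NEGATIONS or word.endswith("n't"):
            negating = True
            features.append(word)
            continue
        base = "NOT_" + word if negating else word
        features.append(base)
        if token.isupper() and len(token) > 1:
            # An all-caps word adds an extra feature signalling emphasis.
            features.append("CAPS_" + base)
    return features

print(extract_features("I didn't like the ending, but the cast was AWESOME #mustwatch"))
```

In this sketch "like", "the" and "ending" are marked as negated, the comma resets the negation, and the all-caps word and the hashtag each produce an extra feature that a classifier can weight separately.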

Machine Learning Classifier – Naive Bayes

We picked Naive Bayes for two reasons. First, it works very well for document classification compared to logistic regression. Second, although it is less accurate than SVM, it is much faster. We observed that SVM does not give much better accuracy, but it comes with a significant performance hit.
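
To show why Naive Bayes is cheap to train, here is a minimal sketch of a multinomial Naive Bayes classifier in plain Python. It is an illustration only and not the MLlib implementation we run; the class name, the Laplace smoothing value alpha=1.0 and the toy training data are assumptions.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()            # documents seen per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()

    def fit(self, docs, labels):
        # Training is a single counting pass over the data.
        for tokens, label in zip(docs, labels):
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + self.alpha * len(self.vocab)
            for tok in tokens:
                if tok in self.vocab:  # tokens never seen in training are skipped
                    score += math.log((self.word_counts[label][tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training set, made up for illustration.
docs = [["loved", "it", "awesome"], ["great", "movie", "loved"],
        ["boring", "waste"], ["hated", "it", "boring"]]
labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayes().fit(docs, labels)
print(model.predict(["awesome", "movie"]))
```

Training only counts tokens per class, which is what makes the approach fast and easy to distribute across a cluster.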

Technology stack – Spark and MLlib

Spark is a high-performance cluster computing engine with a great API. MLlib is its machine learning library, which lets us run algorithms like Naive Bayes at large scale.

Team

This weekend, more than 20 companies came together at the BigData conclave to explore big ideas. It was a two-day event held at the Sheraton hotel, Bangalore. Developers from all across the country flew to Bangalore to participate in the event.

Flutura, a big data company, hosted a hackathon to crack a big data problem within 24 hours. More than 53 teams from various companies participated in the event. We were able to crack the problem and win the hackathon. It was a great team performance.

Problem

Energy is a well-known problem across the globe. There is a huge need to precisely analyze usage and predict the ever-growing energy demand. The evolution of smart meters has allowed the industry to capture more data about power usage, but it is still challenging to analyze this data for specific actions. So every team was provided with data for the last four years, which recorded power usage every minute over that period. We had 24 hours to predict future energy usage and device usage patterns using this data. It was a very interesting problem and we were all ready to have a crack at it.
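
To make the shape of the data concrete, here is a minimal sketch of rolling per-minute readings up into daily totals and flagging unusually heavy days. It is not our hackathon solution (that was written in Scala on Spark); the function names, the (timestamp, reading) tuple layout, the threshold factor and the toy readings are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

def daily_totals(readings):
    """Sum per-minute (timestamp, reading) pairs into one total per calendar day."""
    totals = defaultdict(float)
    for timestamp, reading in readings:
        totals[timestamp.date()] += reading
    return dict(totals)

def spike_days(totals, factor=2.0):
    """Return the days whose total exceeds `factor` times the median day."""
    typical = median(totals.values())
    return sorted(day for day, total in totals.items() if total > factor * typical)

# Toy readings, made up for illustration: two minutes on each of three days.
readings = [
    (datetime(2010, 1, 1, 18, 0), 1.0), (datetime(2010, 1, 1, 18, 1), 1.0),
    (datetime(2010, 1, 2, 18, 0), 5.0), (datetime(2010, 1, 2, 18, 1), 5.0),
    (datetime(2010, 1, 3, 18, 0), 1.0), (datetime(2010, 1, 3, 18, 1), 1.5),
]
totals = daily_totals(readings)
print(spike_days(totals))
```

The same group-by-day and compare-to-typical pattern applies per sub-meter, which is one simple way to surface usage patterns for individual devices.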

Our solution

We chose Scala and Spark as the platform to solve this particular problem over traditional Hadoop, as it gives more control when implementing machine learning algorithms. As Spark allowed us faster prototyping and implementation, we were able to solve the problems within the given time. We will be sharing more information about our solution in future posts, but here are a few interesting facts from our analysis.

Did you know that there is a sharp spike in kitchen power usage on Dec 25? It seems more people like to cook and celebrate with family.

S1 indicates kitchen appliances, S2 indicates laundry and S3 water heater

AC and water heaters take more than 33% of power usage across the whole globe.

Device-Usage-Distribution-monthly

S1 indicates kitchen appliances, S2 indicates laundry and S3 water heater

We will be back with more interesting facts about the data and our solution, so keep an eye on this place.

Hadoop 2.0 is here. Five and a half years after the initial proposal, the Hadoop community has delivered the next major version of the world’s most popular big data stack. Though it looks like a single-number upgrade, it’s going to redefine how we use Hadoop.

It’s the right time to get on the next-generation platform. If you are still not convinced, the following reasons should get you excited.

Hadoop becomes Big Data OS

Hadoop is the kernel of a distributed operating system – Doug Cutting

With Hadoop 2.0, Hadoop becomes the kernel of a big data operating system. Hadoop 1.0 supported only one way of processing data, Map/Reduce, which limited it to a few specific applications. But with the introduction of YARN, Hadoop can now support any kind of data processing model.

Evolving Hadoop ecosystem independent of core

Over the years, ecosystem projects like Hive and Pig have suffered from having to support ever-changing Hadoop versions. As there was tight coupling between Map/Reduce and Hadoop versions, they ended up supporting only 1.x versions, leaving all 0.21 and 0.22 versions unsupported. As Map/Reduce is becoming a userland library, independent of the core of Hadoop, ecosystem projects can now evolve faster and support any Hadoop 2.x version.

Innovation at Scale

More processing paradigms like MPI and Spark are landing on Hadoop with YARN

Map/Reduce is a great algorithm for certain kinds of problems, but it fares poorly on problems like iterative processing and machine learning. Even though projects like Mahout and RHadoop tried to build machine learning on Map/Reduce, it was too hard. Over the years, many projects like Spark and Storm have been solving this problem, but until Hadoop 2.0 they were not able to run natively on a Hadoop cluster. From 2.0 onwards, they can run natively on Hadoop and deliver high performance on the same cluster. This is enabling frameworks to innovate at scale.
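
For readers new to the model, here is a minimal in-memory sketch of the Map/Reduce pattern as a word count. It only illustrates the programming model and is not Hadoop's API; the function names run_job, tokenize and count are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record and stream out (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}

def run_job(records, mapper, reducer):
    """One complete pass: map, then shuffle, then reduce."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)

def tokenize(line):
    # Mapper: emit (word, 1) for every word in the line.
    for word in line.lower().split():
        yield word, 1

def count(_word, ones):
    # Reducer: add up the ones emitted for this word.
    return sum(ones)

lines = ["Hadoop runs MapReduce", "Spark runs on Hadoop"]
print(run_job(lines, tokenize, count))
```

An iterative algorithm has to call something like run_job once per iteration, and on a real cluster each pass is a separate job that writes its results out before the next one starts, which is why this model is a poor fit for machine learning workloads.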

Hadoop becomes Real time

Hadoop was built for batch processing. But as its popularity grew, the need for real-time processing also grew. As Map/Reduce is inherently a batch processing system, it was very difficult to support real time on it. But from 2.0 there are many frameworks, like Apache Storm and Apache Tez, available to support real-time processing.

With the Hadoop 2.0 release, Hadoop finally comes out of the shadow of Google’s Map/Reduce. Hadoop 2.0 is built for the enterprise from scratch. Hadoop 2.0 is ready, are you?