Zinnia Systems

How many of you still like to watch old English movies? If you do, which one do you want to start with, and which one suits your interests? Many of us are unsure which movies we would enjoy. Ever wondered whether all of these questions could be answered in one place? That is why we built the Recommendation Engine, which finds the best match for your movie interests.

So how does our app work out which movies you would like to watch? We thought Facebook would be the best place to learn about your interests. Our app does exactly that: it takes the movies you have liked on your Facebook profile and uses them to recommend old English movies. So visit this link and get free recommendations.

How do we do it?

Let's look at the technical background behind the magic of the recommendation engine.

Background

This app is built entirely on machine learning. Recommendation engines are normally designed using one of two approaches:

Collaborative Filtering: It is based on collecting and analyzing a large amount of information on users’ behaviors, activities or preferences, and predicting what users will like based on their similarity to other users.
Content-Based Filtering: It is based on a description of the item and a profile of the user’s preferences. In a content-based recommender system, keywords are used to describe the items; in addition, a user profile is built to indicate the type of item this user likes.

We chose the collaborative filtering approach: our model is built from the ratings users have given to different movies.

Our Model

We used the Alternating Least Squares (ALS) matrix factorization model, a well-known algorithm for recommendations these days. It can handle both implicit and explicit feedback from users, and it learns a feature set for the movies and for the users who have rated them.
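
To make the idea concrete, here is a minimal pure-Python sketch of ALS on explicit ratings. It is not our production code: the tiny ratings list, the function names and the rank, regularization and iteration values are all assumptions chosen for illustration. The sketch alternates between fixing the movie features and solving for the user features, then the other way round.

```python
import random

def solve(a, b):
    """Solve the small linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def update_factors(ratings_by_row, fixed, rank, reg):
    """Regularized least-squares update of one side while the other is fixed."""
    updated = {}
    for row_id, rated in ratings_by_row.items():
        # Normal equations: (sum of v v^T + reg * I) x = sum of rating * v
        a = [[reg if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for col_id, rating in rated.items():
            v = fixed[col_id]
            for i in range(rank):
                b[i] += rating * v[i]
                for j in range(rank):
                    a[i][j] += v[i] * v[j]
        updated[row_id] = solve(a, b)
    return updated

def als(ratings, rank=2, reg=0.1, iterations=50, seed=0):
    """Return (user_features, movie_features) learned from (user, movie, rating) triples."""
    rng = random.Random(seed)
    by_user, by_movie = {}, {}
    for user, movie, rating in ratings:
        by_user.setdefault(user, {})[movie] = rating
        by_movie.setdefault(movie, {})[user] = rating
    movie_f = {m: [rng.random() for _ in range(rank)] for m in by_movie}
    user_f = {}
    for _ in range(iterations):
        user_f = update_factors(by_user, movie_f, rank, reg)   # movies fixed
        movie_f = update_factors(by_movie, user_f, rank, reg)  # users fixed
    return user_f, movie_f

def predict(user_f, movie_f, user, movie):
    """Predicted rating is the dot product of the two feature vectors."""
    return sum(u * v for u, v in zip(user_f[user], movie_f[movie]))

# Hypothetical ratings on a 1-5 scale, for illustration only.
ratings = [
    ("ann", "casablanca", 5), ("ann", "vertigo", 5), ("ann", "alien", 1),
    ("bob", "casablanca", 5), ("bob", "vertigo", 5), ("bob", "alien", 1),
    ("bob", "blade_runner", 1),
    ("cat", "casablanca", 1), ("cat", "alien", 5), ("cat", "blade_runner", 5),
    ("dan", "casablanca", 1), ("dan", "vertigo", 1), ("dan", "alien", 5),
    ("dan", "blade_runner", 5),
]
user_features, movie_features = als(ratings)
print(predict(user_features, movie_features, "ann", "blade_runner"))
```

The printed value is the model's guess for a movie that "ann" has not rated. On real data the same alternating updates run over millions of ratings, which is where a distributed implementation comes in.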

We also use K-Means clustering to group similar movies, based on the feature set the ALS model builds for each movie. The clustering helps us fetch movies dynamically and deliver them to the user faster.
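
As a rough illustration of the clustering step, the sketch below runs plain K-Means over a handful of movie feature vectors. The feature values, the movie names and the simple farthest-point initialization are assumptions made for this example, not a description of our exact setup.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Return (centers, assignment) for a list of equal-length feature vectors."""
    # Deterministic start (an assumption of this sketch): take the first point,
    # then repeatedly add the point farthest from the centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centers)))
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        assignment = [
            min(range(k), key=lambda i: squared_distance(p, centers[i]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centers[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers, assignment

# Hypothetical two-dimensional movie features, for illustration only.
movie_vectors = {
    "casablanca": [0.9, 0.1],
    "vertigo": [1.0, 0.2],
    "alien": [0.1, 0.9],
    "blade_runner": [0.2, 1.0],
}
names = list(movie_vectors)
centers, assignment = kmeans([movie_vectors[n] for n in names], k=2)
clusters = dict(zip(names, assignment))
print(clusters)
```

Once every movie carries a cluster id, looking up "movies similar to this one" becomes a lookup within a single cluster instead of a scan over the whole catalogue.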

“So this is it: the whole idea behind the mysterious recommendation engine, built with machine learning techniques and strong models.”

What question comes to mind when you are planning to buy a new gadget? One question is common to everyone: how long will this technology last in the market? We can only judge the fate of today's technology when we know what is coming next.

Wondering how to find out about future technology and products today? A simple solution is to read technical blogs every day and search for rumors about upcoming technology. Isn't that tough? We felt the same, and that inspired us to build an app capable of capturing rumors from a wide variety of sources.

What is a Rumor engine?
Once upon a time, customers learned about new technology only weeks or months after its release. Now they are curious about next-generation technology while it is still in development or testing. Most customers want to know about new technology well before its release, but not everyone has the time or patience to read technical blogs across tens of blogging sites and follow rumors, leaks and releases. This Rumor engine is for such people. We crawl multiple technical blogging sites for data, crunch it into a machine-readable format, and apply machine learning and natural language processing algorithms to capture rumors. The Rumor engine also lets users search for specific types of rumors, such as rumors about Samsung or the iPhone.

How do we do it?
We use machine learning algorithms and natural language processing techniques to do the task described above.

Technique
A Rumor engine is quite a new idea, so we could not find any labelled dataset on the internet and chose to build one manually. We collected XML feeds from multiple blog sites for 2 months and labelled 1.5k blog posts by hand; this became our initial dataset. We then used a technique called co-training to label the remaining unlabelled data. We used this complete dataset to train our algorithm and create a model, and we use that model to classify blog posts daily and show them on our UI. Now that you have some idea about this engine, you can go ahead and use it here
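
Co-training is easier to follow with a small example. The sketch below assumes each blog post has two views, a title and a body, and uses a very simple word-vote scorer in place of a real classifier; the views, the scorer, the labels and the sample posts are all assumptions made for illustration. In each round, the model trained on one view labels the unlabelled post it is most confident about, and that post joins the labelled set used to train the model for the other view.

```python
from collections import Counter

def train_view(examples, view):
    """Count, per label, how often each word of one view occurs."""
    counts = {"rumor": Counter(), "not_rumor": Counter()}
    for doc, label in examples:
        counts[label].update(doc[view].lower().split())
    return counts

def classify_view(counts, doc, view):
    """Return (label, confidence) from a word-vote score (stand-in classifier)."""
    score = 0
    for word in doc[view].lower().split():
        score += counts["rumor"][word] - counts["not_rumor"][word]
    label = "rumor" if score > 0 else "not_rumor"
    return label, abs(score)

def co_train(labelled, unlabelled, views=("title", "body"), rounds=5, per_round=1):
    """Grow the labelled set by letting each view label its most confident posts."""
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(rounds):
        for view in views:
            if not pool:
                return labelled, pool
            model = train_view(labelled, view)
            scored = [(classify_view(model, doc, view), doc) for doc in pool]
            scored.sort(key=lambda item: item[0][1], reverse=True)
            for (label, confidence), doc in scored[:per_round]:
                if confidence > 0:  # skip posts the view knows nothing about
                    labelled.append((doc, label))
                    pool.remove(doc)
    return labelled, pool

# Hypothetical posts, for illustration only.
seed_posts = [
    ({"title": "leaked iphone photos",
      "body": "sources say the phone may launch soon"}, "rumor"),
    ({"title": "galaxy review",
      "body": "we tested the phone for a week"}, "not_rumor"),
]
unlabelled_posts = [
    {"title": "leaked nexus specs", "body": "insiders claim it ships next year"},
    {"title": "tablet review", "body": "we tested the battery and screen"},
]
all_labelled, remaining = co_train(seed_posts, unlabelled_posts)
for doc, label in all_labelled:
    print(label, "-", doc["title"])
```

In this toy run the title view labels the first unlabelled post and the body view labels the second, which is the point of the technique: each view contributes labels that the other view then learns from.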
Data collection phase
Data collection was made easy by a Google API for parsing XML feeds from the blog sites. Different blog sites use different feed formats, and developing a generic parser would have been very time-consuming. The API provided by Google does this job and returns every feed in a single format.
Features creation
We create a feature array based on the words present in each blog post. A numerical statistic called tf-idf (term frequency–inverse document frequency) scores how important a word is to a document within a corpus. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others. The score therefore handles stop words and common words that appear in all blog posts, so that they do not play a prominent role in classifying documents. We first experimented with raw word counts as features, which performed very poorly compared to the tf-idf technique we implemented later.
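
The calculation described above can be sketched in a few lines of plain Python. This is a minimal illustration that assumes whitespace tokenization and the basic formula tf × log(N / df); the sample sentences are made up, and real libraries usually add smoothing on top.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: score} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors

# Hypothetical blog snippets, for illustration only.
posts = [
    "the new phone leaked",
    "the phone review",
    "the camera review",
]
vectors = tf_idf(posts)
print(vectors[0])
```

A word such as "the", which appears in every post, gets a score of zero, while "leaked", which appears in only one post, gets the highest score in that post.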
Building a classification model
We use the Naive Bayes classification algorithm to build a model that can separate rumors from non-rumors. Naive Bayes is widely regarded as one of the stronger simple algorithms for text classification.
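
For readers new to the algorithm, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing. It works on raw word counts to stay short, and the class name, labels and training sentences are assumptions made for illustration rather than our trained model.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated words."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(documents, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, document):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Start from the log prior of the class.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in document.lower().split():
                if word not in self.vocabulary:
                    continue  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training data, for illustration only.
train_docs = [
    "leaked photos suggest new phone",
    "sources claim tablet launches soon",
    "we reviewed the phone camera",
    "hands on with the new tablet",
]
train_labels = ["rumor", "rumor", "not_rumor", "not_rumor"]
model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("leaked sources claim new launch"))
```

The classifier picks the label whose words make the post most likely, which is why a few strongly "rumor-like" words such as "leaked" or "sources" are enough to tip the decision.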
Searching for specific rumors
Once we had completed the first phase of our Rumor engine, we wanted something more than just showing today's rumors on a user interface. This pushed us to give users an option to search for specific categories of rumors. Initially we thought of implementing a hash-based database search, but that would not have been fast enough for a real-time application. Then we came across Apache Solr, which is built on the Apache Lucene project. Apache Solr is an open source enterprise search platform that includes full-text search and other features. It searches an inverted index built by Lucene, which performs well in real time.
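
To give a feel for the kind of index a full-text search engine keeps, here is a minimal sketch of an inverted index in plain Python. It is only an illustration of the idea: the function names and sample posts are assumptions, and it does not use or imitate Solr's actual API.

```python
from collections import defaultdict

def build_index(posts):
    """Map every word to the set of post ids that contain it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in text.lower().split():
            index[word].add(post_id)
    return index

def search(index, query):
    """Return the ids of posts that contain every word of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical rumor posts, for illustration only.
rumor_posts = {
    1: "Samsung Galaxy leak shows new camera",
    2: "iPhone rumor points to larger screen",
    3: "New Samsung tablet rumor",
}
rumor_index = build_index(rumor_posts)
print(search(rumor_index, "samsung rumor"))
```

Because the index is built once and each query only intersects a few small sets, lookups stay fast even as the number of posts grows.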

What do we use underneath to do this?

Technology stack
Spark is a high-performance cluster computing engine with a great API. MLlib is its machine learning library, which lets us run Naive Bayes and natural language processing on a large number of blog posts.

In short, what are we trying to say?
Our Rumor engine is for people who are keen to learn about upcoming products and the rumors surrounding them. So go ahead and use our engine. For now, the Rumor engine covers only the technology domain. The idea can be extended to other domains, such as the automobile industry. We hope we get the chance to build that too and offer it to people who are interested.

Ever wondered what the best way is to get real audience opinions about the movie you are about to watch? So did we. What we have today is either opinionated critic reviews or fan-dominated IMDb.

So we thought it would be cool to build an app that rates movies on the basis of a real-world audience. But wait, where are the people and their opinions? On Twitter, of course. It is a good thing that people are highly opinionated on Twitter, because it helps us draw a clear distinction between positive and negative sentiment about movies.

So go to this site and you will get ratings for all the movies currently running in theaters. We also show how many tweets we used to arrive at each score, and because the process is automatic, we do not put our own biases into the score. You will find the scores on par with IMDb, but fully automatic and democratic, as everyone gets only one vote.

How do we do it?

OK, now you would like to know our secret sauce. Let’s get to it.

Technique

We use machine learning to do this. In particular, we use sentiment analysis to rate a given tweet as positive, negative or neutral. We collect a huge number of tweets about these movies over a few weeks and then aggregate them to come up with the final score.
If you want to play with this sentiment analyzer, you can go here.
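
The aggregation step can be sketched as follows. The exact scoring formula is not spelled out here, so this example assumes a simple one: keep one vote per user, ignore neutral tweets, and report the share of positive votes on a 0 to 10 scale. The function name and the sample data are also hypothetical.

```python
def movie_score(labelled_tweets):
    """Turn (user, sentiment) pairs into a score between 0 and 10.

    Assumed rule for this sketch: only the first tweet of each user counts,
    and neutral tweets do not affect the score.
    """
    votes = {}
    for user, sentiment in labelled_tweets:
        votes.setdefault(user, sentiment)  # one vote per user
    positive = sum(1 for s in votes.values() if s == "positive")
    negative = sum(1 for s in votes.values() if s == "negative")
    if positive + negative == 0:
        return None  # not enough opinionated tweets to score the movie
    return round(10 * positive / (positive + negative), 1)

# Hypothetical classifier output, for illustration only.
tweets = [
    ("user_a", "positive"),
    ("user_a", "positive"),   # second tweet by the same user is ignored
    ("user_b", "negative"),
    ("user_c", "positive"),
    ("user_d", "neutral"),
]
print(movie_score(tweets))
```

With two positive votes and one negative vote, this toy example prints 6.7.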


Machine learning Model

We used this paper as our base, with improvements of our own. The following are a few of the improvements to the model:

Negation Handling
The original paper does not say much about negation handling. Since it was published in 2009, many different ways of handling negation have been proposed. We negate every word that follows a negation until we reach a marker for the end of the phrase, such as , : ? or . Because Twitter limits tweets to 140 characters, this approach works well most of the time.
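
A minimal sketch of this kind of negation marking is shown below. The list of negation words, the NOT_ prefix and the tokenizer are assumptions made for illustration; the idea is simply that "like" and "NOT_like" become different features for the classifier.

```python
import re

# Assumed negation words for this sketch; words ending in "n't" also count.
NEGATIONS = {"not", "no", "never", "cannot"}
PHRASE_END = {",", ".", ":", ";", "!", "?"}

def mark_negation(tweet):
    """Prefix every word after a negation with NOT_ until the phrase ends."""
    tokens = re.findall(r"[\w'#@]+|[,.:;!?]", tweet.lower())
    marked = []
    negating = False
    for token in tokens:
        if token in PHRASE_END:
            negating = False          # punctuation closes the negated phrase
        elif negating:
            marked.append("NOT_" + token)
        else:
            marked.append(token)
            if token in NEGATIONS or token.endswith("n't"):
                negating = True
    return marked

print(mark_negation("I did not like the movie, but the music was great"))
```

Here "like", "the" and "movie" are marked as negated, while everything after the comma is left untouched.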
Hashtag Handling
As Twitter grew bigger and bigger, hashtags came to play a bigger role. Hashtags are the way to organize and search through billions of tweets. Our model uses hashtags as a strong indicator of mood: a hashtag like #awesome is a positive indicator and #sad is a negative one.
Caps Handling
People use caps to indicate a strong opinion. The model incorporates that too.
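
One possible way to feed hashtags and caps into a word-based classifier is sketched below. The details are assumptions made for illustration: hashtags are repeated to give them extra weight, and words written in all caps get an additional CAPS_ feature.

```python
def tweet_features(tweet, hashtag_weight=2):
    """Turn a tweet into a list of features for a bag-of-words classifier."""
    features = []
    for raw in tweet.split():
        word = raw.strip(".,:;!?")
        if not word:
            continue
        if word.startswith("#") and len(word) > 1:
            # Assumed weighting: count each hashtag more than once.
            features.extend([word.lower()] * hashtag_weight)
        elif word.isupper() and len(word) > 1:
            # Keep the word itself and add a marker for the emphasis.
            features.append(word.lower())
            features.append("CAPS_" + word.lower())
        else:
            features.append(word.lower())
    return features

print(tweet_features("LOVED this movie #awesome"))
```

Single capital letters such as "I" are treated as ordinary words, so only deliberate shouting is marked.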

Machine learning classifier – Naive Bayes

We picked Naive Bayes for two reasons. First, it works very well for document classification compared to logistic regression. Second, although it is less accurate than an SVM, it is much faster: we observed that an SVM did not give us much better accuracy but came with a significant performance hit.

Technology stack – Spark and MLlib

Spark is a high-performance cluster computing engine with a great API. MLlib is its machine learning library, which lets us run algorithms like Naive Bayes at large scale.

In the software industry, the adoption of Hadoop has put data scientists in high demand. It is well known that people from a data science background struggle to apply data science to big data because they lack big data knowledge, while people from a programming background struggle with the same task because they lack data science knowledge. These are two different sets of people with the same end goal: “Machine Learning on Big Data”. So we try to solve this problem and give you the right steps to get started.

Where to start Machine Learning?

We followed the “Stanford Machine Learning” course on Coursera, which covers all the basic machine learning algorithms in great depth. In the course, we implemented the algorithms in Octave, a language similar to Matlab. Octave is not enough to get started with big data, because it is limited to in-memory calculations on a single machine. We needed the same for big data, so we decided to solve the same exercises on Spark. Our team at Zinnia Systems wants to share this machine learning work through our GitHub repository, which contains all the exercises of the course ported from Octave to Spark. We hope that comparing the Octave and Spark code will speed up your learning of machine learning on big data.

Why Spark?

Spark’s in-memory processing and emphasis on iterative processing make it best suited for machine learning

With the arrival of Spark, the gap between data scientists and the programming community started narrowing very quickly. What keeps Spark ahead of other big data frameworks is its basic design: a strong emphasis on iterative algorithms and in-memory processing, which is how it reached the landmark “100x faster than Hadoop”. When we started working with Spark, we grew very fond of RDDs (Resilient Distributed Datasets), the distributed immutable collections that act as the framework's basic abstraction. Spark also provides more complex operations such as joins, group-by and reduce-by, which are very handy for building complex model data flows without explicit iteration. By now you are probably aware that the most happening thing in big data is machine learning on Spark. So our team at Zinnia Systems started doing machine learning on Spark.
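
To see why iterative algorithms benefit from keeping data in memory, consider batch gradient descent for a straight-line fit. The sketch below is plain Python, not Spark code, and the data, learning rate and iteration count are assumptions for illustration. Every iteration maps over the whole data set and reduces the results, so a framework that re-reads the data from disk on each pass pays that cost thousands of times.

```python
from functools import reduce

def gradient(point, weights):
    """Gradient of the squared error of one (x, y) point for y = slope * x + intercept."""
    x, y = point
    slope, intercept = weights
    error = slope * x + intercept - y
    return (error * x, error)

def add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def fit_line(points, learning_rate=0.05, iterations=2000):
    """Fit y = slope * x + intercept with batch gradient descent."""
    weights = (0.0, 0.0)
    n = len(points)
    for _ in range(iterations):
        # "Map" every point to its gradient, then "reduce" by summing.
        total = reduce(add, map(lambda p: gradient(p, weights), points))
        weights = (weights[0] - learning_rate * total[0] / n,
                   weights[1] - learning_rate * total[1] / n)
    return weights

# Hypothetical data lying exactly on y = 2x + 1.
points = [(x, 2 * x + 1) for x in range(5)]
slope, intercept = fit_line(points)
print(round(slope, 3), round(intercept, 3))
```

In Spark the same map and reduce steps would run over a data set held in memory across the cluster between iterations, which is the design choice the paragraph above describes.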

Machine Learning Library (MLlib)

MLlib is the fastest growing machine learning community, with more than 137 active contributors

Everybody who starts using machine learning in Spark ends up at MLlib. In fact, MLlib is the fastest growing community, with more than 137 active contributors. It has already pushed Apache Mahout into the back seat in developing machine learning algorithms. MLlib already includes algorithms such as Linear Regression and Logistic Regression using Stochastic Gradient Descent, K-Means, Recommender Systems and Linear Support Vector Machines. MLlib is part of a larger machine learning project, MLbase, which includes an API for feature extraction and an optimizer. It is better not to start by implementing the algorithms yourself, since active contributors are already publishing their work, and only after testing it under adequate conditions. So here comes our team, which wants to help you use MLlib from the first day. Through our repo, we give you basic programs for using all the algorithms in MLlib in one shot. It will be a very interesting journey. You will be amazed once you start your adventure in machine learning on big data using Spark. You must be excited to start this venture!

As you know, Zinnia Systems is an active contributor to open source, and this is our next step in the same direction. Come, try it, and build your knowledge in areas you have never explored before.

Zinnia Systems Blog


Movie Recommendation Engine

Rumor Engine

Movie rating using twitter sentiment analysis

Machine Learning with Spark Get equipped with basics of Machine Learning