How many of you still like to watch old English movies? If you do, which one would you start with, and which would suit your interests? Many of us are unsure which movies we might enjoy. Our Recommendation Engine addresses all of these questions in one place and gives you the best match for your movie interests.

So what is the idea behind telling you which movies you would like to watch? We thought Facebook would be the best place to learn about your interests. Our app does exactly that: it takes the movies you have liked on your Facebook profile and uses them to recommend old English movies. So visit this link and get free recommendations.


How do we do it?

Let's look at the technical background behind the magic of the recommendation engine.


This app is built entirely on machine learning. Recommendation engines are normally designed using one of two approaches:

  • Collaborative Filtering: It collects and analyzes a large amount of information on users' behaviors, activities or preferences, and predicts what users will like based on their similarity to other users.

  • Content-Based Filtering: It is based on a description of the item and a profile of the user's preferences. In a content-based recommender system, keywords are used to describe the items; in addition, a user profile is built to indicate the type of item this user likes.

We used the collaborative filtering approach: our model is built from the ratings that users have given to different movies.

Our Model

We used the Alternating Least Squares (ALS) matrix factorization model, a well-known algorithm for recommendations these days. It can handle both implicit and explicit feedback data from users, and it builds a feature set for both the movies and the users who have rated them.
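To make the alternating idea concrete, here is a minimal sketch in plain Scala rather than Spark. It is not the model used in the app: it assumes a single latent feature per user and per movie, a tiny invented ratings list, and hypothetical names (`AlsSketch`, `solve`, `train`, `predict`). With one side's features held fixed, each feature on the other side has a closed-form regularized least-squares solution, and the two sides are updated in turn.

```scala
object AlsSketch {
  // (user, movie, rating) triples; toy data invented for illustration
  val ratings: Seq[(Int, Int, Double)] = Seq(
    (0, 0, 5.0), (0, 1, 4.0),
    (1, 0, 4.0), (1, 2, 1.0),
    (2, 1, 5.0), (2, 2, 2.0)
  )

  // One least-squares half-step: with the other side's features held fixed,
  // each feature is sum(r * f) / (sum(f * f) + lambda) over its observed ratings.
  def solve(byKey: Map[Int, Seq[(Int, Double)]],
            fixed: Map[Int, Double],
            lambda: Double): Map[Int, Double] =
    byKey.map { case (key, obs) =>
      val num = obs.map { case (other, r) => r * fixed(other) }.sum
      val den = obs.map { case (other, _) => fixed(other) * fixed(other) }.sum + lambda
      key -> num / den
    }

  def train(iterations: Int, lambda: Double): (Map[Int, Double], Map[Int, Double]) = {
    val byUser = ratings.groupBy(_._1).map { case (u, rs) => u -> rs.map(r => (r._2, r._3)) }
    val byMovie = ratings.groupBy(_._2).map { case (m, rs) => m -> rs.map(r => (r._1, r._3)) }
    var movieF: Map[Int, Double] = byMovie.keys.map(m => m -> 1.0).toMap
    var userF: Map[Int, Double] = Map.empty
    for (_ <- 1 to iterations) {
      userF = solve(byUser, movieF, lambda)  // fix movies, solve for users
      movieF = solve(byMovie, userF, lambda) // fix users, solve for movies
    }
    (userF, movieF)
  }

  // Predicted rating is the product of the user and movie features
  def predict(userF: Map[Int, Double], movieF: Map[Int, Double], user: Int, movie: Int): Double =
    userF(user) * movieF(movie)
}
```

Calling `AlsSketch.train(10, 0.1)` returns the two feature maps, and `predict` can then score a movie the user has not rated. A real model would use several features per user and movie, which turns each half-step into a small linear system instead of a single division.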

We also use K-Means clustering to group similar movies, based on the movie feature set built by the ALS model. The clustering helps us fetch movies dynamically and deliver them to the user faster.
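As an illustration of clustering movies by their feature vectors, here is a small K-Means sketch in plain Scala. The object name `KMeansSketch`, the helper names, and the choice of passing in the starting centers explicitly are assumptions made for this example, not details of our app.

```scala
object KMeansSketch {
  type Vec = Vector[Double]

  // Squared Euclidean distance between two feature vectors
  def sqDist(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Component-wise mean of a non-empty group of vectors
  def mean(points: Seq[Vec]): Vec =
    points.transpose.map(col => col.sum / col.size).toVector

  // Index of the center nearest to p
  def closest(centers: Seq[Vec], p: Vec): Int =
    centers.indices.minBy(i => sqDist(centers(i), p))

  // Repeat: assign every point to its nearest center, then move each
  // center to the mean of the points assigned to it.
  def fit(points: Seq[Vec], initial: Seq[Vec], iterations: Int): Seq[Vec] =
    (1 to iterations).foldLeft(initial) { (centers, _) =>
      val groups = points.groupBy(p => closest(centers, p))
      // keep the old center if a cluster ends up empty
      centers.indices.map(i => groups.get(i).map(mean).getOrElse(centers(i)))
    }
}
```

Once the centers are fitted, `closest(centers, movieFeatures)` gives the cluster of any movie, so recommendations can be looked up within one cluster instead of across the whole catalogue.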

“So this is it: the whole idea behind the mysterious recommendation engine, built with machine learning techniques and strong models.”

With the adoption of Hadoop in the software industry, data scientists are in high demand. It is well known that people from a data science background struggle to apply data science to big data because they lack big data knowledge, while people from a programming background struggle with the same task because they lack data science knowledge. These are two different groups of people with the same end goal: “Machine Learning on Big Data”. We try to bridge this gap and give you the right steps to get started.

Where to start with Machine Learning?

We followed the “Stanford Machine Learning” course on Coursera, in which the exercises are implemented in Octave, a language similar to Matlab. The course covers all the basic machine learning algorithms in great depth. Octave alone is not enough to get started with big data, because it is limited to in-memory calculations on a single machine, so we decided to solve the same exercises on Spark. Our team at Zinnia Systems wants to share this machine learning work through our GitHub repository, which contains all the exercises of the course ported from Octave to Spark. We hope that comparing the Octave and Spark code will speed up your learning of machine learning on big data.

Why Spark?

Spark’s in-memory processing and emphasis on iterative processing make it well suited for Machine Learning

With the advent of Spark, the gap between data scientists and the programming community started to close very quickly. What keeps Spark ahead of other big data frameworks is its basic design: a strong emphasis on iterative algorithms and in-memory processing, which is behind the landmark claim of “100x faster than Hadoop”. When we started working with Spark we became very fond of RDDs (Resilient Distributed Datasets), the distributed immutable collections that act as the basic abstraction in the framework. Spark also provides more complex operations such as joins, group-by and reduce-by, which are very handy for building complex model data flows without iterations. By now you are probably aware that the most happening thing in big data is machine learning on Spark. So our team at Zinnia Systems started doing machine learning on Spark.

Machine Learning Library (MLlib)

MLlib is the fastest growing machine learning community with more than 137 active contributors

Everybody who starts using machine learning in Spark ends up at MLlib. In fact, MLlib is the fastest growing community, with more than 137 active contributors, and it has already pushed Apache Mahout into the back seat in developing machine learning algorithms. MLlib already includes algorithms such as Linear Regression and Logistic Regression using Stochastic Gradient Descent, K-Means, Recommender Systems and Linear Support Vector Machines. MLlib is part of a larger machine learning project, MLbase, which also includes an API for feature extraction and an optimizer. It is not good practice to start by implementing these algorithms yourself, because active contributors are already publishing their work, and only after testing it under adequate conditions. So our team wants to help you use MLlib from day one. Through our repo we give you basic programs that exercise all the algorithms in MLlib in one place. It will be a very interesting journey: what may have looked like the nightmare of machine learning on big data becomes much more approachable with Spark. We hope you are excited to start!

As you know, Zinnia Systems is an active open-source contributor, and this is our next step in the same direction. Come, try it, and build your knowledge in areas you may never have explored before.

This is the 2nd post in the Predicting Global Energy Demand using Spark series. The series has the following posts.

It's our pleasure to briefly share our approach to solving the hackathon problems. Let's start the exploration with some details about the data we handled. The given data represents the energy consumption of one household, recorded once per minute. The data set has 2075259 measurements gathered between December 2008 and November 2012 (almost 4 years). You can find the schema for data here.

Data Preparation

Data preparation covers the pre-processing techniques that make the data ready for the actual prediction. It involves three phases: data cleaning, handling duplicate data and data sampling.

In data cleaning, we handled the missing values in the data. Our approach was to ignore the measurements for which there are no readings. The next step is handling duplicates: we ensured there are no duplicates in the data. The final step is data sampling. We refine our process by applying it to sampled data first. Once we are confident that the results hold for the subset, we apply the same process to the complete data and publish the results.
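The three phases can be sketched in a few lines of plain Scala. This is only an illustration under assumptions: the `Reading` case class, the use of `Option` for a missing value, and the function names are hypothetical, and the hackathon code worked on Spark RDDs rather than local collections.

```scala
object PrepSketch {
  // Hypothetical simplified reading: a timestamp and a power value that may be missing
  case class Reading(timestamp: String, power: Option[Double])

  // Cleaning: drop readings whose value is missing
  def clean(readings: Seq[Reading]): Seq[(String, Double)] =
    readings.collect { case Reading(ts, Some(p)) => (ts, p) }

  // Duplicates: keep only the first copy of identical rows
  def dedupe(rows: Seq[(String, Double)]): Seq[(String, Double)] =
    rows.distinct

  // Sampling: keep each row with the given probability; the seed makes it repeatable
  def sample(rows: Seq[(String, Double)], fraction: Double, seed: Long): Seq[(String, Double)] = {
    val rng = new scala.util.Random(seed)
    rows.filter(_ => rng.nextDouble() < fraction)
  }
}
```

A run would chain the steps, for example `sample(dedupe(clean(raw)), 0.1, 42L)`, develop the process on that subset, and then repeat it with `fraction = 1.0` on the full data.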

Data Modeling

Data modelling covers the steps in which we shape our data so that it fits the algorithm used for prediction. We used the Spark framework and the Scala language to accomplish this task. The next sections tell you more about the platform and the algorithm we used for prediction.

Scala and Spark

Scala is a JVM-based functional language that is well known for its concurrency support. It is a breeze to write map/reduce-based machine learning algorithms in very few lines of code. All the examples in this post are written in Scala.

Apache Spark is an open source cluster computing system that aims to make data analytics fast, both fast to run and fast to write. We used Spark because it allowed us to prototype within the time limits of the hackathon.

Geometric Brownian Motion

Every prediction problem has a backbone algorithm that is the key to its prediction model. Since the given data is about household energy consumption, it is a quantity that changes over time under uncertainty. Quantities of this kind follow Geometric Brownian Motion, a model that is frequently used for diverse quantities. Other quantities that follow this model include stock prices, natural resource prices and growth in demand for products or services.

Brownian motion is the simplest of the continuous-time stochastic processes, and the Wiener process gives its mathematical representation. The model says that in one unit of time the variable's value changes by an amount that is normally distributed with parameters µ and σ, where µ denotes the drift and σ denotes the volatility. Brownian motion suggests the following equation to predict the next value after a constant time interval Δt.

Sᵢ₊₁ = Sᵢ + SᵢµΔt + Sᵢσε√Δt


  • Sᵢ₊₁ : Predicted value
  • Sᵢ : Present known value
  • µ : Drift
  • σ : Volatility
  • ε : Random number drawn from a standard normal distribution


µ – Standard Mean (Drift in data)

import org.apache.spark.rdd.RDD

def standardMean(inputRDD: RDD[(Date, Record)]): (List[Double], Double) = {
  // Collect once on the driver instead of once per loop iteration
  val records = inputRDD.collect()
  val count = records.length
  var sum = 0.0
  var riList = List[Double]()
  for (i <- 1 until count) {
    val firstRecord = records(i)
    val secondRecord = records(i - 1)
    // Relative change between two consecutive records
    val difference = (firstRecord._2.totalPowerUsed - secondRecord._2.totalPowerUsed) / firstRecord._2.totalPowerUsed
    riList = riList ::: List(difference)
    sum += difference
  }
  (riList, sum / count)
}

σ – Standard Deviation (Volatility in data)

  def standDeviation(inputRDD: RDD[Double], mean: Double): Double = {
    // Sum of squared deviations from the mean
    val sum = inputRDD.map(value => {
      (value - mean) * (value - mean)
    }).reduce((firstValue, secondValue) => {
      firstValue + secondValue
    })
    scala.math.sqrt(sum / inputRDD.count())
  }

Note: Here inputRDD has each measurement aggregated to one day.
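The daily aggregation mentioned in the note can be pictured with a short plain-Scala sketch. The name `DailySketch` and the use of string dates are assumptions for illustration; on a Spark pair RDD the same grouping would typically be written with `reduceByKey(_ + _)`.

```scala
object DailySketch {
  // (date, power) pairs, one per minute-level reading; dates are plain strings here
  def dailyTotals(readings: Seq[(String, Double)]): Map[String, Double] =
    readings.groupBy(_._1).map { case (date, rs) => date -> rs.map(_._2).sum }
}
```

Each key in the result is one day, and its value is the sum of all readings recorded on that day.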

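Sᵢ₊₁ – Predicted Value

// ε is a standard normal draw, matching the Wiener process described above
val predictedValue = lastValue * (1 + mean * 1 + stdDeviation * scala.util.Random.nextGaussian() * 1)


  • predictedValue : Sᵢ₊₁
  • lastValue : Sᵢ
  • mean : µ
  • stdDeviation : σ
  • scala.util.Random.nextGaussian() : ε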

Since we are predicting the next day from daily aggregated data, both Δt and √Δt are 1.

We hope this blog gave you an idea of our strategy for estimating the next day's energy consumption. Watch for the next blog, which will explain the prediction technique for the next one year.
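Putting the pieces together, here is a self-contained sketch of the next-day estimate on a local sequence of daily totals. It is an illustration only: the names in `GbmSketch` are hypothetical, the random term `eps` is passed in by the caller (assumed here to be a standard normal draw), and the drift is averaged over the number of day-to-day changes.

```scala
object GbmSketch {
  // Relative day-over-day changes, with the same convention as standardMean:
  // (current - previous) / current
  def relativeChanges(daily: Seq[Double]): Seq[Double] =
    daily.sliding(2).collect { case Seq(prev, cur) => (cur - prev) / cur }.toSeq

  // Drift: mean of the relative changes
  def drift(changes: Seq[Double]): Double = changes.sum / changes.size

  // Volatility: standard deviation of the relative changes around the drift
  def volatility(changes: Seq[Double], mean: Double): Double =
    math.sqrt(changes.map(c => (c - mean) * (c - mean)).sum / changes.size)

  // S(i+1) = S(i) * (1 + mu * dt + sigma * eps * sqrt(dt)), with dt = 1 day
  def predictNext(last: Double, mu: Double, sigma: Double, eps: Double): Double =
    last * (1 + mu + sigma * eps)
}
```

A typical call computes `changes`, then `mu` and `sigma` from them, and finally `predictNext(daily.last, mu, sigma, new scala.util.Random(42).nextGaussian())`. Fixing the seed makes a run repeatable; averaging many draws gives a feel for the spread of the estimate.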

This is the first post in the Predicting Global Energy Demand using Spark series. The series has the following posts.

This series of blogs will walk you through our complete solution, designed and implemented for predicting future energy demand using Spark.

This problem was solved within 24 hours at a hackathon.

Problem statement

Predict the global energy demand for the next year using the energy usage data available for the last four years, in order to enable utility companies to handle the energy demand effectively.

Fluctuating energy demand is becoming a very big problem for utilities. To be more precise, the demand for energy is not consistent over time, and this variation is a serious burden for the utilities. With existing systems, storing electricity is extremely inefficient, which makes it difficult for utilities to bank energy against a time of sudden demand.

Energy demand rises and falls throughout the day in response to a number of things, including time and environmental factors. The difference between the extremes of demand is important, because utilities must be able to match demand with supply. One approach to managing electricity demand is to build more generation facilities that can be brought online to manage peaks. But with the right estimates, companies can make better decisions.

Available data

Data for energy consumption from 2008-2012 is available here.

The following is the structure of the data:

  • date: Date in format dd/mm/yyyy
  • time: time in format hh:mm:ss
  • global_active_power: household global minute-averaged active power (in kilowatt)
  • global_reactive_power: household global minute-averaged reactive power (in kilowatt)
  • Voltage: minute-averaged voltage (in volt)
  • global_intensity: household global minute-averaged current intensity (in ampere)
  • sub_metering_1: energy sub-metering No. 1 (in watt-hour of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered)
  • sub_metering_2: energy sub-metering No. 2 (in watt-hour of active energy). It corresponds to the laundry room, containing a washing-machine, a tumble-drier, a refrigerator and a light.
  • Sub_metering_3: energy sub-metering No. 3 (in watt-hour of active energy). It corresponds to an electric water-heater and an air-conditioner
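To show how one line of this data could be turned into a typed record, here is a small parsing sketch. The `PowerRecord` case class, the `;` field separator and the sample line used below are assumptions for illustration; check them against the actual file before relying on them.

```scala
object RecordSketch {
  // One field per column, in the order listed above (names are hypothetical)
  case class PowerRecord(date: String, time: String, globalActivePower: Double,
                         globalReactivePower: Double, voltage: Double, globalIntensity: Double,
                         subMetering1: Double, subMetering2: Double, subMetering3: Double)

  // Assumes ';'-separated fields. Returns None for lines with the wrong
  // field count or with non-numeric values in the numeric columns.
  def parse(line: String): Option[PowerRecord] = {
    val f = line.split(";").map(_.trim)
    if (f.length != 9) None
    else scala.util.Try {
      val n = f.drop(2).map(_.toDouble)
      PowerRecord(f(0), f(1), n(0), n(1), n(2), n(3), n(4), n(5), n(6))
    }.toOption
  }
}
```

For example, `RecordSketch.parse("16/12/2008;17:24:00;4.216;0.418;234.84;18.4;0.0;1.0;17.0")` yields a record, while a line with a missing or malformed value yields `None` and can be dropped during data cleaning.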

Problems we addressed

  • What would be the energy consumption for the next day?
  • What would be the week-wise energy consumption for the next one year?
  • What would be the household's peak-time load (peak time is between 7 AM and 10 AM) for the next month?
    During weekdays
    During weekends
  • What are the patterns in voltage fluctuations?
  • Can you identify the device usage patterns?
  • Assuming there was a full day of outage, calculate the revenue loss for a particular day next year by finding the Average Revenue per Day (ARPD) of the household using the given tariff plan.

In upcoming blog posts, we will share our solutions to the above problems.