How many of you still like to watch old English movies? If so, which one do you want to start with? Which ones suit your interests? Many of us are simply unsure which movies we would enjoy. Ever wished you could have all of these questions answered in one place? Here we come with a recommendation engine that finds the best match for your movie interests.


So now the question is: how does our app know which movies you would like to watch? We thought Facebook would be the best place to learn your interests. That is exactly what our app does: we take the movies you have liked on your Facebook profile and use them to recommend old English movies to you. So visit this link and get free recommendations.

allaboutmovies

How do we do it?

Let's look at the technical background behind the magic of the recommendation engine.

Background

This app is built entirely on machine learning. Recommendation engines are normally designed using one of two approaches:

  • Collaborative Filtering: This approach is based on collecting and analyzing a large amount of information on users' behaviors, activities or preferences, and predicting what a user will like based on their similarity to other users.

  • Content-Based Filtering: This approach is based on a description of the item and a profile of the user's preferences. In a content-based recommender system, keywords are used to describe the items; in addition, a user profile is built to indicate the type of items the user likes.

We used the collaborative filtering approach, where the model is built from the ratings users have given to different movies.

Our Model

We used the Alternating Least Squares (ALS) matrix factorization model, a well-known algorithm for recommendations these days. It can handle both implicit and explicit feedback from users, and it builds a feature vector for each movie and for each user who has rated movies.
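As an illustration, here is a minimal sketch of how such a model could be trained with Spark MLlib's ALS implementation; the file name, rank, iteration count and regularisation value are illustrative assumptions, not our production settings.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// sc is an existing SparkContext; ratings.csv is assumed to hold "userId,movieId,rating" lines
def trainRecommender(sc: SparkContext) = {
    val ratings = sc.textFile("ratings.csv").map { line =>
      val Array(user, movie, rating) = line.split(",")
      Rating(user.toInt, movie.toInt, rating.toDouble)
    }
    // rank = number of latent features per user/movie, 10 iterations, regularisation 0.01
    val model = ALS.train(ratings, rank = 10, iterations = 10, lambda = 0.01)
    // Top 5 movie recommendations for an example user id
    model.recommendProducts(42, 5)
}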

We also use K-Means clustering to group similar movies, clustering them on the feature vectors built by the ALS model. The clustering lets us fetch related movies dynamically and deliver them to the user at a faster rate, as sketched below.
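Continuing the sketch above, the movie feature vectors produced by ALS could be clustered with MLlib's K-Means roughly like this; k and the iteration count are illustrative.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// model.productFeatures is an RDD[(movieId, Array[Double])] learned by the ALS model above
val movieVectors = model.productFeatures.mapValues(features => Vectors.dense(features)).cache()

// Group the movies into 20 clusters, using at most 50 iterations
val clusters = KMeans.train(movieVectors.values, k = 20, maxIterations = 50)

// Tag each movie with its cluster id so that similar movies can be served together
val movieToCluster = movieVectors.mapValues(vector => clusters.predict(vector))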

“So this is it: the whole idea behind the mysterious recommendation engine, built with machine learning techniques and the models described above.”

What question strikes your mind when you are planning to buy a new gadget? There is one common question in everyone's mind: how long will this technology survive in the market? We will only know the fate of today's technology when we know what is coming next.

Wondering how you can learn about future technology and products today? A simple solution is to read technical blogs every day and hunt for rumors about upcoming technology. Isn't that tedious? Yes, we felt the same, and that was the inspiration to build an app capable of capturing rumors from a wide variety of sources.

What is a Rumor engine?
Once upon a time, customers got to know about new technology only weeks or months after its release, but now they are curious about next-generation technology while it is still in development or testing. Most customers want to know about new technology well before its release, but not everyone has the time or patience to read technical blogs across dozens of sites and follow rumors, leaks and releases. This rumor engine is for such people. We crawl multiple technical blogging sites for data, crunch it into a machine-understandable format, and apply machine learning and natural language processing algorithms to capture rumors. The rumor engine also gives users the flexibility of searching for specific types of rumors, such as rumors about Samsung, iPhone, etc.

rumorengine

How do we do it?
We use machine learning algorithms and natural language processing techniques to do the above.

  • Technique
    A rumor engine is quite a new idea, so we could not find any labelled dataset on the internet; hence we chose to build one manually. We collected XML feeds from multiple blog sites for two months and labelled 1.5k blogs by hand, which became our initial dataset. After that we used a technique called co-training to label the remaining unlabelled data. We used this complete dataset to train our algorithm and build a model from it. We use this model to classify blogs daily and show them on our UI. Now that you have some idea about this engine, you can go ahead and use it here.
  • Data Collection phase
    The data collection phase was made easy by using the Google API for parsing XML feeds from the blog sites. Different blog sites use different feed formats, and developing a generic parser would be time-consuming; the API provided by Google does this task and delivers feeds in a single format.
  • Features creation
    We create a feature array based on the words present in each blog. A numerical statistic called tf-idf (term frequency-inverse document frequency) scores how important a word is to a document within a corpus. The tf-idf value increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word across the corpus, which controls for the fact that some words are generally more common than others. This score handles stop words and common words that appear in all blogs, so that such words do not play a prominent role in classifying documents. We had experimented with raw word-count features, which performed very poorly compared to the tf-idf technique we implemented later (a sketch of this step follows the list).
  • Building a classification model
    We use the Naive Bayes classification algorithm to build a model capable of classifying rumors and non-rumors. Naive Bayes is known to perform very well for natural language processing tasks such as text classification (see the sketch after this list).
  • Searching for specific rumors
    Once we completed the first phase of our rumor engine, we wanted something more than just showing today's rumors on a user interface. This pushed us to give users an option to search for specific categories of rumors. Initially we considered a hash-based database search, but that would not be fast enough for a real-time application. Then we came across Apache Solr, which is built on the Apache Lucene project. Apache Solr is an open-source enterprise search platform that provides full-text search among other features, and its Lucene-based index makes searches fast enough for real-time use.
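To make the feature creation step above more concrete, here is a minimal sketch using Spark MLlib's HashingTF and IDF utilities; the file name and whitespace tokenisation are simplifying assumptions rather than our exact pipeline.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Each line of blogs.txt is assumed to contain the full text of one blog post
def buildTfIdf(sc: SparkContext): RDD[Vector] = {
    val tokenised: RDD[Seq[String]] =
      sc.textFile("blogs.txt").map(_.toLowerCase.split("\\s+").toSeq)
    val hashingTF = new HashingTF()
    val tf = hashingTF.transform(tokenised)  // raw term frequencies hashed into fixed-size vectors
    tf.cache()                               // IDF makes two passes over the data
    val idf = new IDF().fit(tf)              // document frequencies across the whole corpus
    idf.transform(tf)                        // final tf-idf vectors, one per blog post
}

The classification step could then be sketched with MLlib's Naive Bayes, building on the imports above and assuming each tf-idf vector is paired with a 1.0 (rumor) or 0.0 (non-rumor) label; the variable names here are illustrative.

import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.regression.LabeledPoint

// labelledBlogs: RDD[(Double, Vector)] pairing a label from the manual/co-training step
// with the blog's tf-idf vector
def trainClassifier(labelledBlogs: RDD[(Double, Vector)]) = {
    val training = labelledBlogs.map { case (label, features) => LabeledPoint(label, features) }
    NaiveBayes.train(training, lambda = 1.0)  // lambda is the additive smoothing parameter
}

// Usage: model.predict(newBlogTfIdfVector) == 1.0 means the post is classified as a rumor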

What do we use underneath to do this?

  • Technology stack
    Spark is a high-performance cluster computing engine with a great API. MLlib is Spark's machine learning library, which lets us run Naive Bayes and the natural language processing steps over a large number of blogs.

In short, what are we trying to say?
Our rumor engine is for people who are keen to follow upcoming products and the rumors about them. So go ahead and use our engine. As of now, the rumor engine is focused on the technology domain, but the idea can be extended to other domains such as the automobile industry. We hope we get the chance to build that too and offer it to people who are interested.

Ever wondered what's the best way to get real audience opinions about the movie you are about to watch? So did we. What we have today is either opinionated critic reviews or the fan-dominated IMDb.


So we thought it would be cool to build an app that rates movies based on the real-world audience. Wait, but where are these people and their opinions? On Twitter, of course. People are highly opinionated on Twitter, which helps us make a clear distinction between positive and negative sentiment about movies.

twitter_allaboutmovies

So go to this site and you get movie ratings for all the movies currently running in theaters. We also show how many tweets we used to arrive at each score, and since the process is automatic, we are not injecting our own biases into the scores. You can see that the scores are on par with IMDb, but fully automatic and democratic, since everyone gets exactly one vote.

How do we do it?

OK, so now you would like to know our secret sauce. Let's get to it.

Technique

We use machine learning to do this, specifically sentiment analysis, to rate a given tweet as positive, negative or neutral. We collect a large number of tweets about these movies over a few weeks and then aggregate them to arrive at the final score.
If you want to play with this sentiment analyzer you can go here.

Motog
twitter_senti_motog

Machine learning Model

We used this paper as a base, with improvements from our side. The following are a few of the improvements we made to the model.

  • Negation Handling
    The original paper does not say much about negation handling. Since its publication in 2009, many different ways to handle negation have been proposed. We use the technique of negating words until some kind of phrase-ending identifier, such as , : ? or ., is reached. Since Twitter limits tweets to 140 characters, this form of negation works well most of the time (a rough sketch follows this list).
  • Hashtag Handling
    As Twitter grew bigger and bigger, hashtags played a larger role: they are the way of organizing and searching through billions of tweets. Our model uses hashtags as a strong indicator of mood; hashtags like #awesome give a positive signal and #sad gives a negative one.
  • Caps Handling
    People use capital letters to indicate a strong opinion. The model incorporates that too.
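As a rough sketch of the negation idea (the negation word list and phrase-ending characters below are illustrative choices, not the exact ones used by our model):

// Mark every token that follows a negation word until a phrase-ending character is seen
val negationWords = Set("not", "no", "never", "cannot")
val phraseEnders = Set(',', '.', ':', '?', '!')

def markNegation(tokens: Seq[String]): Seq[String] = {
    var negating = false
    tokens.map { token =>
      if (token.exists(phraseEnders.contains)) { negating = false; token }
      else if (negationWords.contains(token.toLowerCase)) { negating = true; token }
      else if (negating) token + "_NEG"
      else token
    }
}

// markNegation(Seq("not", "a", "good", "movie", ".", "great", "cast"))
// -> Seq("not", "a_NEG", "good_NEG", "movie_NEG", ".", "great", "cast")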

Machine Learning Classifier – Naive Bayes

We picked Naive Bayes for two reasons. First, it works very well for document classification compared to logistic regression. Second, although it is less accurate than SVM, it is much faster: we observed that SVM does not give us much better accuracy, but there is a significant performance hit.

Technology stack – Spark and MLLib

Spark is a high-performance cluster computing engine with a great API. MLlib is Spark's machine learning library, which allows algorithms like Naive Bayes to run at large scale.

In the software industry, with the adoption of Hadoop, data scientists are in high demand. It is well known that people from a data science background often struggle to apply data science to big data because they lack big data knowledge, while people from a programming background face the same struggle because they lack data science knowledge. Here we see two different sets of people whose end goal, "Machine Learning on Big Data", is the same. So we try to address this and give you the right steps to get started.

Where to start Machine Learning?

We followed the "Stanford Machine Learning" course on Coursera, which covers all the basic algorithms of machine learning in great depth and has you implement them in Octave, a language similar to Matlab. Octave alone, however, is not enough to get started with big data, since it is limited to in-memory calculations. We needed the same algorithms on big data, so we decided to solve the same exercises on Spark. Our team at Zinnia Systems wants to share this machine learning work through our GitHub repository, which contains all the exercises of the above-mentioned course ported from Octave to Spark. We hope it accelerates your machine learning on big data when you start learning by comparing the Octave and Spark code.

Why Spark?

Spark's in-memory processing and emphasis on iterative computation make it well suited for machine learning.

With the advent of Spark on big data, the gap between the data science and programming communities started closing very quickly. One thing that keeps Spark ahead of other big data frameworks is its basic design: a strong emphasis on iterative algorithms and in-memory processing, behind the claimed "100x faster than Hadoop". When we started working with Spark we grew very fond of RDDs (Resilient Distributed Datasets), which are distributed immutable collections acting as the basic abstraction in the framework. Along with this, Spark provides higher-level operations like joins, group-by and reduce-by, which are very handy for forming complex model data flows without explicit iteration. By now you are probably aware that the most happening thing in big data is machine learning on Spark, so our team at Zinnia Systems started doing machine learning on Spark.
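As a small illustration of these operations on toy data (the local master setting is for demonstration only):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))

// An RDD is an immutable distributed collection; every transformation returns a new RDD
val ratings = sc.parallelize(Seq(("movieA", 4.0), ("movieB", 2.0), ("movieA", 5.0)))

val totals = ratings.reduceByKey(_ + _)     // reduce-by: total rating per movie
val grouped = ratings.groupByKey()          // group-by: all ratings per movie

// join: combine the totals with another keyed RDD
val genres = sc.parallelize(Seq(("movieA", "drama"), ("movieB", "comedy")))
val joined = totals.join(genres)            // RDD[(String, (Double, String))]

joined.collect().foreach(println)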

Machine Learning Library (MLLIB)

MLlib is one of the fastest growing machine learning communities, with more than 137 active contributors

Everybody who tries to do machine learning on Spark ends up using MLlib. In fact, MLlib is one of the fastest growing communities, with more than 137 active contributors, and it has already pushed Apache Mahout to the back seat for developing machine learning algorithms. MLlib already includes algorithms such as linear regression and logistic regression using stochastic gradient descent, K-Means, recommender systems and linear support vector machines. MLlib is part of a larger machine learning project, MLbase, which includes an API for feature extraction and an optimizer. It is not good practice to re-implement these algorithms yourself, since the active contributors publish their work only after testing it under adequate conditions. So our team wants to help you use MLlib from day one: through our repo we give you basic programs exercising all the MLlib algorithms in one place. It will be a very interesting journey, and you will be amazed once you start machine learning on big data using Spark. You must be excited to start this venture!
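To give a feel for the API, here is a tiny, self-contained example of one of those algorithms (linear regression with stochastic gradient descent) on made-up toy data; sc is an existing SparkContext and the numbers are illustrative.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Toy dataset: the label is equal to the single feature
val data = sc.parallelize(Seq(
    LabeledPoint(1.0, Vectors.dense(1.0)),
    LabeledPoint(2.0, Vectors.dense(2.0)),
    LabeledPoint(3.0, Vectors.dense(3.0))
))

val model = LinearRegressionWithSGD.train(data, numIterations = 100, stepSize = 0.1)

// Predict the label for a new feature value; should come out close to 4.0
println(model.predict(Vectors.dense(4.0)))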

As you know, Zinnia Systems is an active contributor to open source, and this is our next step in the same direction. Come, try, and grow your knowledge in areas you may never have explored before.

This is the final post in the Predicting Global Energy Demand using Spark series. The series has the following posts.

Problem:

Assuming there was a full day of outage, calculate the revenue loss for a particular day next year by finding the Average Revenue Per Day (ARPD) of the household (use the tariff plan below).

Time of Use (TOU) tariff plan:

Time Period     | Tariff (Rupees per kWh)
----------------|------------------------
12 AM to 5 AM   | 4
5 AM to 7 AM    | 6
7 AM to 10 AM   | 12
10 AM to 4 PM   | 4
4 PM to 8 PM    | 6
8 PM to 10 PM   | 10
10 PM to 12 AM  | 6

Brief Solution:

We need to find the average cost per day in a year, which will be the revenue loss.
To solve this problem we need to aggregate the data on an hourly basis and, based on the tariff plan, calculate the cost per hour. We then aggregate this cost to a daily basis, add up the cost for every day in a year, and get the average by dividing by the total number of days in that year. This average represents the revenue loss in case of an outage.

Getting per hour revenue:

To get the revenue per hour you can refer to the hourly aggregation in part 3. After the hourly data is obtained, the sub-meter readings are multiplied by the tariff that applies to that particular hour, which gives us the cost for every hour.

def getPerHourRevenue(inputRDD:RDD[Record]):RDD[Record] = {
    val dataAggregator = new DataAggregator()
    val hourlyRDD = dataAggregator.hourlyAggregator(inputRDD)
    hourlyRDD.map(value => {
      var ((date, hour), record) = value
      val hourOfDay = record.hourofDay
      if ((hourOfDay >= 0 && hourOfDay <= 4) || (hourOfDay >= 10 && hourOfDay <= 15)) {
        record.totalCost = (record.subMetering1 * 4 + record.subMetering2 * 4 + record.subMetering3 * 4) / 1000
      } else if ((hourOfDay >= 5 && hourOfDay <= 6) || (hourOfDay >= 16 && hourOfDay <= 19) || (hourOfDay >= 22 && hourOfDay <= 23)) {
        record.totalCost = (record.subMetering1 * 6 + record.subMetering2 * 6 + record.subMetering3 * 6) / 1000
      } else if (hourOfDay >= 7 && hourOfDay <= 9) {
        record.totalCost = (record.subMetering1 * 12 + record.subMetering2 * 12 + record.subMetering3 * 12) / 1000
      } else {
        record.totalCost = (record.subMetering1 * 10 + record.subMetering2 * 10 + record.subMetering3 * 10) / 1000
      }

      record
    })
  }

This hourly data is then aggregated to a daily basis, for which you can refer to the dailyAggregator code in part 2. The average cost per day computed from this daily aggregated data represents the revenue loss in the case of an outage.

def getRevenueLoss(inputRDD:RDD[Record]):Double = {
    val dataAggregator = new DataAggregator()
    val revenueRDD= getPerHourRevenue(inputRDD)
    val dailyRDD = dataAggregator.dailyAggregator(revenueRDD)
    val totalCostCalcRDD = dailyRDD.map(value => ("sum", value._2.totalCost)).reduceByKey((a, b) => a + b)
    val revenueLossForDayOutage = totalCostCalcRDD.first()._2 / dailyRDD.count()
    revenueLossForDayOutage
  }

This ends the explanation of the solutions to the problems given in the hackathon. Each of us on the team took a particular problem and explained its solution. We hope you go through all the solutions and get the most out of them.

You can refer to the complete code here.

This is the 3rd post in the Predicting Global Energy Demand using Spark series. The series has the following posts.

This is a sequel to the previous blogs from our team members, where the solution to a problem related to energy consumption was explained. In this blog, let me explain our solution to another problem on the same data, whose schema is described here.

Problem

What would be the household's peak-time load (peak time is between 7 AM and 10 AM) for the next month:

  • During Weekdays?
  • During Weekends?

Brief Solution

To solve the above problem, we need to model the data so that it is aggregated on an hourly basis, followed by filtering out the peak-time records and splitting them into weekday and weekend records. These separated records are then aggregated on a monthly basis, and the peak-time load can be calculated from that. A rough sketch of the filtering and splitting steps is shown below.
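The sketch below illustrates the peak-time filtering and weekday/weekend split; the day-of-week helper and the wiring are illustrative, not the exact code from our repository.

import java.util.{Calendar, GregorianCalendar}

// hourlyAggregator (shown below) keys each record by (date, hourOfDay)
val hourlyRDD = dataAggregator.hourlyAggregator(inputRDD)

// Keep only peak-time records, i.e. hours 7, 8 and 9 (7 AM to 10 AM)
val peakRDD = hourlyRDD.filter { case ((_, hour), _) => hour >= 7 && hour <= 9 }

// Decide weekday vs weekend from the dd/mm/yyyy date string
def isWeekend(date: String): Boolean = {
    val Array(d, m, y) = date.split("/").map(_.toInt)
    val dayOfWeek = new GregorianCalendar(y, m - 1, d).get(Calendar.DAY_OF_WEEK)
    dayOfWeek == Calendar.SATURDAY || dayOfWeek == Calendar.SUNDAY
}

val weekendRDD = peakRDD.filter { case ((date, _), _) => isWeekend(date) }
val weekdayRDD = peakRDD.filter { case ((date, _), _) => !isWeekend(date) }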

Hourly Aggregation

To aggregate the energy consumption data, we need to keep a few key factors in mind: the meter readings are summed to form their aggregate, whereas voltage and global intensity are averaged to form theirs.

def hourlyAggregator(inputRDD: RDD[Record]): RDD[((String, Long), Record)] = {
    val groupRDD = inputRDD.map(record => ((record.date, record.hourofDay), record)).reduceByKey((firstRecord,
    secondRecord) => {
      val record = new Record()
      record.date = firstRecord.date
      record.day = firstRecord.day
      record.month = firstRecord.month
      record.year = firstRecord.year
      record.hourofDay = firstRecord.hourofDay
      record.subMetering1 = firstRecord.subMetering1 + secondRecord.subMetering1
      record.subMetering2 = firstRecord.subMetering2 + secondRecord.subMetering2
      record.subMetering3 = firstRecord.subMetering3 + secondRecord.subMetering3
      record.activePower = firstRecord.activePower + secondRecord.activePower
      record.reactivePower = firstRecord.reactivePower + secondRecord.reactivePower
      record.voltage = (firstRecord.voltage + secondRecord.voltage) / 2
      record.globalIntensity = (firstRecord.globalIntensity + secondRecord.globalIntensity) / 2
      record
    })
    groupRDD
  }

Monthly Aggregation

Once the hourly aggregation is done, the records are aggregated on a daily basis, followed by aggregation by month.

 def dailyAggregator(inputRDD: RDD[Record]): RDD[(String, Record)] = {
    val groupRDD = inputRDD.map(record => (record.date, record)).reduceByKey((firstRecord, secondRecord) => {
      val record = new Record()
      record.date = firstRecord.date
      record.day = firstRecord.day
      record.month = firstRecord.month
      record.year = firstRecord.year
      record.subMetering1 = firstRecord.subMetering1 + secondRecord.subMetering1
      record.subMetering2 = firstRecord.subMetering2 + secondRecord.subMetering2
      record.subMetering3 = firstRecord.subMetering3 + secondRecord.subMetering3
      record.totalCost = firstRecord.totalCost + secondRecord.totalCost
      record.activePower = firstRecord.activePower + secondRecord.activePower
      record.reactivePower = firstRecord.reactivePower + secondRecord.reactivePower
      record.voltage = (firstRecord.voltage + secondRecord.voltage) / 2
      record.globalIntensity = (firstRecord.globalIntensity + secondRecord.globalIntensity) / 2
      record
    })
    groupRDD
  }
def monthlyAggregator(inputRDD: RDD[Record]): RDD[((Int, Long), Record)] = {
    val groupRDD = inputRDD.map(record => ((record.month, record.year), record)).reduceByKey((firstRecord,
    secondRecord) => {
      val record = new Record()
      record.date = firstRecord.date
      record.day = firstRecord.day
      record.month = firstRecord.month
      record.year = firstRecord.year
      record.subMetering1 = firstRecord.subMetering1 + secondRecord.subMetering1
      record.subMetering2 = firstRecord.subMetering2 + secondRecord.subMetering2
      record.subMetering3 = firstRecord.subMetering3 + secondRecord.subMetering3
      record.totalCost = firstRecord.totalCost + secondRecord.totalCost
      record.activePower = firstRecord.activePower + secondRecord.activePower
      record.reactivePower = firstRecord.reactivePower + secondRecord.reactivePower
      record.voltage = (firstRecord.voltage + secondRecord.voltage) / 2
      record.globalIntensity = (firstRecord.globalIntensity + secondRecord.globalIntensity) / 2
      record
    })
    groupRDD
  }

Calculation of peak time load for the next month

The peak-time load for the next month can be predicted using the Geometric Brownian Motion algorithm: since the data is aggregated by month, the algorithm can be applied to it directly.

def findNextMonthPeakLoad(rdd:RDD[((Int,Long),Record)],sparkContext:SparkContext) : Double={
      def standardMean(inputRDD: RDD[((Int,Long),Record)]): (List[Double], Double) = {
        val count = inputRDD.count()
        var sum = 0.0
        var riList = List[Double]()
        for (i <- Range(1, count.toInt)) {
          val firstRecord = inputRDD.toArray()(i)
          val secondRecord = inputRDD.toArray()(i - 1)
          val difference = (firstRecord._2.totalPowerUsed - secondRecord._2.totalPowerUsed) / firstRecord._2.totalPowerUsed
          riList = riList ::: List(difference)
          sum += difference
        }
        (riList, sum / count)
      }
      def standDeviation(inputRDD:RDD[Double],mean:Double): Double = {
        val sum = inputRDD.map(value => {
          (value - mean) * (value - mean)
        }).reduce((firstValue, secondValue) => {
          firstValue + secondValue
        })
        scala.math.sqrt(sum / inputRDD.count())
      }
      val (rList,mean) = standardMean(rdd)
      val stdDeviation = standDeviation(sparkContext.makeRDD(rList),mean)
      val sortedRdd=rdd.sortByKey(false)
      val lastValue = sortedRdd.first()._2.totalPowerUsed
      var newValues = List[Double]()
      for(i<- Range(0,1000)){
        val predictedValue = lastValue * (1 + mean * 1 + stdDeviation * scala.math.random * 1)
        newValues::=predictedValue
      }
      val sorted = newValues.sorted
      val value = sorted(10)/1000
      value
    }

We hope you understood the idea behind predicting the peak-time load for the next month. Further blogs in this series will explain the solutions to the other problems.

For the complete code, refer here.

This is the 2nd post in the Predicting Global Energy Demand using Spark series. The series has the following posts.

It is our pleasure to share our approach to solving the hackathon problems. So, let's start the exploration with details about the data we handled. The given data represents energy consumption recorded for each minute of household usage. The dataset has 2,075,259 measurements gathered between December 2008 and November 2012 (almost 4 years). You can find the schema for the data here.

Data Preparation

Data preparation involves pre-processing techniques that make the data ready for the actual prediction. It has three phases: data cleaning, handling duplicate data and data sampling.

In data cleaning, we handled the missing values in the data; our approach was to ignore measurements where there are no readings for that day. The next step is handling duplicates, where we ensured there are no duplicate records in the data. The final step is data sampling: we drive our process towards perfection by applying it to sampled data first. Once we ensure that the results hold true for the subset, we apply the same process to the complete data and publish the results.
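A minimal sketch of those three steps on the raw file, assuming the dataset marks missing readings with a "?" and is semicolon separated; the file name, sampling fraction and seed are illustrative.

// sc is an existing SparkContext
val raw = sc.textFile("household_power_consumption.txt")

// Data cleaning: drop empty lines and rows with missing readings
val cleaned = raw.filter(line => line.trim.nonEmpty && !line.contains("?"))

// Duplicate handling: keep each measurement only once
val deduped = cleaned.distinct()

// Data sampling: tune the pipeline on roughly 10% of the data first
val sample = deduped.sample(withReplacement = false, fraction = 0.1, seed = 42L)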

Data Modeling

Data modelling involves the steps where we shape our data so that it fits the algorithm used for prediction. We used the Spark framework and the Scala language to accomplish this task. In the next sections you will learn more about the platform and the algorithm we used for prediction.

Scala and Spark

Scala is a JVM-based functional language well known for its concurrency capabilities. It's a breeze to write map/reduce-based machine learning algorithms in very few lines of code. All the examples in this post are written in Scala.

Apache Spark is an open-source cluster computing system that aims to make data analytics fast, both fast to run and fast to write. We used Spark because it allowed us to prototype within the time limits of the hackathon.

Geometric Brownian Motion

For every prediction problem there is a backbone algorithm that is the key to the prediction model. Since the given data is about household energy consumption, it is a quantity that changes under uncertainty, and such quantities can be modelled with Geometric Brownian Motion. This model is frequently invoked for diverse quantities; other examples that follow it are stock prices, natural resource prices, and growth in demand for products or services.

Brownian motion is the simplest of the continuous-time stochastic processes; the Wiener process gives its mathematical representation. The model says that the variable's value changes in one unit of time by an amount that is normally distributed with parameters µ and σ, where µ denotes drift and σ denotes volatility. Geometric Brownian motion suggests the following equation to predict the next value at a given instant over a constant interval.

Sᵢ₊₁ = Sᵢ(1 + µΔt + σԑ√Δt)

Where,

  • Sᵢ₊₁ : Predicted value
  • Sᵢ : Present known value
  • µ : Drift
  • σ : Volatility
  • ԑ : Random Number

Code

µ – Standard Mean ( Drift in data )

def standardMean(inputRDD: RDD[(Date, Record)]): (List[Double], Double) = {
  val count = inputRDD.count()
  var sum = 0.0
  var riList = List[Double]()
  for (i <- Range(1, count.toInt)) {
    val firstRecord = inputRDD.toArray()(i)
    val secondRecord = inputRDD.toArray()(i - 1)
    val difference = (firstRecord._2.totalPowerUsed - secondRecord._2.totalPowerUsed) / firstRecord._2.totalPowerUsed
    riList = riList ::: List(difference)
    sum += difference
  }
  (riList, sum / count)
}

σ – Standard Deviation ( Volatility in data )

  def standDeviation(inputRDD: RDD[Double], mean: Double): Double = {
    val sum = inputRDD.map(value => {
      (value - mean) * (value - mean)
    }).reduce((firstValue, secondValue) => {
      firstValue + secondValue
    })
    scala.math.sqrt(sum / inputRDD.count())
  }

Note: here inputRDD has each measurement aggregated to one day.

Sᵢ₊₁ – Predicted Value

PredictedValue = lastValue * (1 + mean * 1 + stdDeviation * scala.math.random * 1)

Here,

  • PredictedValue : Sᵢ₊₁
  • lastValue : Sᵢ
  • mean : µ
  • stdDeviation : σ
  • scala.math.random : ԑ

Since we are predicting for next day with the daily aggregated data both Δt and √Δt will be 1.
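As a quick worked example with made-up numbers: if yesterday's aggregated consumption were 25.0 kWh, the drift µ were 0.02 and the volatility σ were 0.05, one simulated draw would look like this.

val lastValue = 25.0                  // Sᵢ, yesterday's aggregated consumption (illustrative)
val mean = 0.02                       // µ, the drift computed by standardMean
val stdDeviation = 0.05               // σ, the volatility computed by standDeviation
val epsilon = scala.math.random       // ԑ, a random draw in [0, 1)

val predicted = lastValue * (1 + mean * 1 + stdDeviation * epsilon * 1)
// With these numbers the prediction lies between 25.5 (ԑ = 0) and about 26.75 (ԑ close to 1)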

We hope this blog gave you an idea of our prediction strategy for next-day energy consumption. Wait for the next blog, which will explain the prediction technique for the next one year.

This is the first post in the Predicting Global Energy Demand using Spark series. The series has the following posts.

This series of blogs will walk you through our complete solution, designed and implemented for predicting future energy demand using Spark.

This problem was solved within 24 hours at the hackathon.

Problem statement

Predict the global energy demand for the next year using the energy usage data available for the last four years, in order to enable utility companies to handle the energy demand effectively.

Fluctuating energy demand is becoming a very big problem for utilities. To be more precise, the demand for energy over a period of time is not consistent, and this becomes a huge burden for utilities with varying demand. As we know, storing electricity with existing systems is extremely inefficient, which makes it difficult for utilities to bank energy against a time of sudden demand.

Energy demand rises and falls throughout the day in response to a number of things, including time and environmental factors. The difference between the extremes of demand matters, because utilities must be able to match demand with supply. One approach to managing electricity demand is building more generation facilities that can be brought online to cover peaks, but accurate estimates allow companies to make better decisions.

Available data

Data for energy consumption from 2008-2012 is available here.

The following is the structure of the data (a sketch of a matching Record class follows the list):

  • date: Date in format dd/mm/yyyy
  • time: time in format hh:mm:ss
  • global_active_power: household global minute-averaged active power (in kilowatt)
  • global_reactive_power: household global minute-averaged reactive power (in kilowatt)
  • Voltage: minute-averaged voltage (in volt)
  • global_intensity: household global minute-averaged current intensity (in ampere)
  • sub_metering_1: energy sub-metering No. 1 (in watt-hour of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered)
  • sub_metering_2: energy sub-metering No. 2 (in watt-hour of active energy). It corresponds to the laundry room, containing a washing-machine, a tumble-drier, a refrigerator and a light.
  • Sub_metering_3: energy sub-metering No. 3 (in watt-hour of active energy). It corresponds to an electric water-heater and an air-conditioner
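For reference, here is a sketch of what a Record class and a line parser matching this schema might look like, based on the fields used later in the aggregation code; the actual class in our repository may differ.

class Record extends Serializable {
    var date: String = _              // dd/mm/yyyy
    var day: Long = _
    var month: Int = _
    var year: Long = _
    var hourofDay: Long = _
    var activePower: Double = _
    var reactivePower: Double = _
    var voltage: Double = _
    var globalIntensity: Double = _
    var subMetering1: Double = _
    var subMetering2: Double = _
    var subMetering3: Double = _
    var totalPowerUsed: Double = _
    var totalCost: Double = _
}

def parse(line: String): Record = {
    val fields = line.split(";")
    val record = new Record()
    record.date = fields(0)
    val Array(d, m, y) = fields(0).split("/")
    record.day = d.toLong
    record.month = m.toInt
    record.year = y.toLong
    record.hourofDay = fields(1).split(":")(0).toLong
    record.activePower = fields(2).toDouble
    record.reactivePower = fields(3).toDouble
    record.voltage = fields(4).toDouble
    record.globalIntensity = fields(5).toDouble
    record.subMetering1 = fields(6).toDouble
    record.subMetering2 = fields(7).toDouble
    record.subMetering3 = fields(8).toDouble
    record
}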

Problems we addressed

  • What would be the energy consumption for the next day?
  • What would be the week-wise energy consumption for the next one year?
  • What would be the household's peak-time load (peak time is between 7 AM and 10 AM) for the next month?
    During Weekdays
    During Weekends
  • What are the patterns in voltage fluctuations?
  • Can you identify the device usage patterns?
  • Assuming there was a full day of outage, calculate the revenue loss for a particular day next year by finding the Average Revenue Per Day (ARPD) of the household using the given tariff plan

In upcoming blog posts, we will be sharing our solutions to the above problems.

Team

This weekend, more than 20 companies came together at the BigData conclave to explore big ideas. It was a two-day event held at the Sheraton hotel, Bangalore. Developers from all across the country flew to Bangalore to participate in the event.

Flutura, a big data company, hosted a hackathon to crack a big data problem within 24 hours. More than 53 teams from various companies participated in the event. We were able to crack the problem and win the hackathon. It was a great team performance.

Problem

Energy is a well-known problem across the globe. There is a huge need to precisely analyse usage and predict the ever-growing energy demand. The evolution of smart meters has allowed the industry to capture more data about power usage, but it is still challenging to analyse this data for specific actions. So every team was provided with data for the last four years, recording power usage for every minute of that period. We had 24 hours to predict future energy usage and device usage patterns from this data. It was a very interesting problem and we were all ready to have a crack at it.

Our solution

We chose Scala and Spark as the platform to solve this particular problem over traditional Hadoop, as it gives more control for implementing machine learning algorithms. Because Spark allowed faster prototyping and implementation, we were able to solve the problems within the given time. We will be sharing more about our solution in future posts, but here are a few interesting facts from our analysis.

Did you know that there is a sharp spike in kitchen power usage on Dec 25? It seems more people like to cook and celebrate with family.

S1 indicates kitchen appliances, S2 indicates laundry and S3 water heater

The AC and water heater take up more than 33% of the overall power usage.

Device-Usage-Distribution-monthly

S1 indicates kitchen appliances, S2 indicates laundry and S3 water heater

We will be back with more interesting facts about the data and our solution, so keep an eye on this space.