This is the 2nd post in the Predicting Global Energy Demand using Spark series. The series has the following posts.

Its our pleasure to share our approach in solving the hackathon problems briefly. So, lets start the exploration with details about the data we handled. The given data represents the energy consumption details recorded for each minute in household usage.The hourly data was having 2075259 measurements gathered between December 2008 and November 2012 (almost 4 years).You can find the schema for data here.

#### Data Preparation

Data preparation involves pre-processing techniques where we make the data ready for actual prediction. It involves three phases, data cleaning , handling duplicate data and data sampling.

In data cleaning, we handled the missing values in the data.Our approach was to ignore the measurements where there is no readings for that day. Next step is handling duplicates in the data. We ensured there are no duplicates in the data. Final step is data sampling.We drive our process towards perfection by applying it on sampled data. Once we ensure that, the results for the subset holds true. We apply the same process for complete data and publish the results.

#### Data Modeling

Data modelling involves steps where we model our data in such a way that, it fits into the algorithm used for prediction. We used spark framework and scala language to accomplish this task. In the next sections you will know more about the platform and algorithm we used for prediction.

#### Scala and Spark

Scala is a JVM based functional language which is well known for its concurrent capabilities.It’s breeze to write map/reduce based machine learning algorithms with very few lines of code. All our examples in the post are written in scala.

Apache Spark is an open source cluster computing system that aims to make data analytics fast, both fast to run and fast to write. We used spark as it allowed us to prototype within time limits of hackathon.

#### Geometric Brownian Motion

For every prediction problem there will be a strong backbone algorithm which is the key for prediction models. Since the given data is about the energy consumption in household, it is a type of quantity which changes over uncertainty. These kind of quantities follows Geometric Brownian Motion. This model is frequently invoked as a model for diverse quantities.Other quantities which follows this model are stock prices,natural resource prices and growth in demand for products or services etc.

Brownian motion is the simplest of the continuous time stochastic processes.Wiener process gives the mathematical representation for the brownian motion. This model says the variable’s value changes in one unit of time by an amount that is normally distributed with µ and σ ,where µ denotes drift and σ denotes volatility. Brownian motion suggests the following equation to predict the next value at the given instant in constant interval.

Sᵢ˖₁ =SᵢµΔt + Sᵢσԑ√Δt

Where,

- Sᵢ˖₁ : Predicted value
- Sᵢ : Present known value
- µ : Drift
- σ : Volatility
- ԑ : Random Number

#### Code

µ – Standard Mean ( Drift in data )

standardMean(inputRDD: RDD[(Date, Record)]): (List[Double], Double) = { val count = inputRDD.count() var sum = 0.0 var riList = List[Double]() for (i val firstRecord = inputRDD.toArray()(i) val secondRecord = inputRDD.toArray()(i - 1) val difference = (firstRecord._2.totalPowerUsed - secondRecord._2.totalPowerUsed) / firstRecord._2.totalPowerUsed riList = riList ::: List(difference) sum += difference } (riList, sum / count) }

σ – Standard Deviation ( Volatility in data )

def standDeviation(inputRDD: RDD[Double], mean: Double): Double = { val sum = inputRDD.map(value => { (value - mean) * (value - mean) }).reduce((firstValue, secondValue) => { firstValue + secondValue }) scala.math.sqrt(sum / inputRDD.count()) }

Note : Here inputRDD has each measurement aggregated to one day.

Sᵢ˖₁ – Predicted Value

PredictedValue = lastValue * (1 + mean * 1 + stdDeviation * scala.math.random * 1)

Here,

- PredictedValue : Sᵢ˖₁
- lastValue : Sᵢ
- mean : µ
- stdDeviation : σ
- scala.math.random:ԑ

Since we are predicting for next day with the daily aggregated data both Δt and √Δt will be 1.

Hope this blog gave you an idea about our prediction strategy for the next day energy consumption estimation. Wait for the next blog which will explain you the prediction technique for next one year.

Hi,

Thanks for the detailed explanation about the algorithm to solve this problem.

Could you please help me understand the formula to calculate drift? What I am making out of the scala code above is not making sense to me.

Please help.

Thanks in advance!

-Ravi

I love your blog.. very nice colors & theme.

Did you create this website yourself or did you hire

someone to do it for you? Plz answer back

as I’m looking to design my own blog and would like to know

where u got this from. kudos

Hi, I’m struggling in understanding why you translated the formula:

Sᵢ˖₁ =SᵢµΔt + Sᵢσԑ√Δt

Into:

PredictedValue = lastValue * (1 + mean * 1 + stdDeviation * scala.math.random * 1)

In my opinion, this code should refer to the following formula:

Sᵢ˖₁ =Sᵢ + SᵢµΔt + Sᵢσԑ√Δt