Hackathon
"A hackathon is a programming ritual, usually held at night, where programmers solve, in a single night, problems that might otherwise take years."
- Anonymous
Where / When
December 6 and 7, 2013
At the BigData Conclave
Hosted by Flutura
Solving real-world problem(s) in 24 hours
No restriction on the tools to be used
Deliverables: results, working code, and visualization
What?
Predict the global energy demand for the next year using the energy usage data available for the last four years, in order to enable utility companies to handle the energy demand effectively.
Per-minute usage data collected from smart meters
From 2008 to 2012
Around 133 MB uncompressed
275,000 records
Around 25,000 missing data records
Questions to be answered
What would be the energy consumption for the next day?
What would be the week-wise energy consumption for the next year?
What would be the household's peak-time load (peak time is 7 AM to 10 AM) for the next month?
During weekdays
During weekends
Assuming there was a full day of outage, calculate the revenue loss for a particular day next year by finding the Average Revenue Per Day (ARPD) of the household using the given tariff plan.
Can you identify the device usage patterns?
Who
A four-member team from Zinnia Systems
Chose Spark/Scala for solving these problems
Everyone except me was new to Scala and Spark
Solved the problems on time
Won first prize at the hackathon
Why Spark?
Faster prototyping
Excellent in-memory performance
Uses Akka
Able to run about 2.5 million concurrent actors in 1 GB of RAM
Easy to debug
Excellent integration with IntelliJ and Eclipse
Little code to write: around 500 lines in total
Something new to try at a hackathon
Solution
Uses only core Spark
Uses the Geometric Brownian Motion algorithm for prediction (a minimal sketch follows below)
Complete code available on GitHub:
https://github.com/zinniasystems/spark-energy-prediction
Under the Apache license
Blog series:
http://zinniasystems.com/blog/2013/12/12/predicting-global-energy-demand-using-spark-part-1/
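To make the prediction step concrete, here is a minimal Geometric Brownian Motion sketch in plain Scala. It is not the repository's implementation; the function name and the drift/volatility parameters are illustrative, and in practice both would be estimated from the historical usage data.

import scala.util.Random

// One GBM step: S(t+dt) = S(t) * exp((mu - sigma^2/2) * dt + sigma * sqrt(dt) * Z), Z ~ N(0,1)
def gbmForecast(lastValue: Double, drift: Double, volatility: Double,
                steps: Int, dt: Double = 1.0,
                rng: Random = new Random()): Seq[Double] = {
  Iterator.iterate(lastValue) { current =>
    val z = rng.nextGaussian()
    current * math.exp((drift - 0.5 * volatility * volatility) * dt +
                       volatility * math.sqrt(dt) * z)
  }.drop(1).take(steps).toSeq
}

// Example: project the next 7 days from the last observed daily consumption.
// val nextWeek = gbmForecast(lastValue = 42.0, drift = 0.001, volatility = 0.05, steps = 7)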
Embrace Scala
Scala is the JVM language in which Spark is implemented.
Though Spark also provides Java and Python APIs, the Scala API feels more natural.
If you are coming from Pig, you will feel at home with the Scala API.
The Spark code base is small, so knowing Scala helps you peek at the Spark source whenever needed.
Excellent REPL support (a spark-shell example follows below).
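As an illustration of the REPL workflow, a quick first look at the data in spark-shell might go like this; the file name and the header check are illustrative, not the actual data layout.

// spark-shell already provides a SparkContext named sc.
val lines    = sc.textFile("energy_usage.csv")                // illustrative path
val readings = lines.filter(line => !line.startsWith("Date")) // skip a header line, if any
println(readings.count())                                     // number of readings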
Go functional
Spark encourages you to use functional idioms over object-oriented ones.
Some of the functional idioms available are:
Closures
Function chaining
Lazy evaluation
Ex: standard deviation, whose core is Sum((xi - Mean) * (xi - Mean)); the standard deviation is the square root of this sum divided by N.
Map/Reduce way:
Map calculates (xi - Mean) * (xi - Mean)
Reduce does the sum
Spark way
// Functional way (the Spark way): map computes each squared deviation,
// reduce sums them, then divide by the count and take the square root.
private def standDeviation(inputRDD: RDD[Double], mean: Double): Double = {
  val sumOfSquares = inputRDD.map { value =>
    (value - mean) * (value - mean)
  }.reduce((firstValue, secondValue) => firstValue + secondValue)
  math.sqrt(sumOfSquares / inputRDD.count())
}

// Imperative way (does not work): map is lazy and never runs without an action,
// and even if it did, each worker would mutate its own copy of sum, not the driver's.
private def standDeviation(inputRDD: RDD[Double], mean: Double): Double = {
  var sum = 0.0
  inputRDD.map(value => sum += (value - mean) * (value - mean))
  math.sqrt(sum / inputRDD.count()) // sum is still 0.0 here
}

The code is available in EnergyUsagePrediction.scala.
Use Tuples
Map/Reduce is restricted to key/value pairs.
Representing data such as grouped data is difficult with plain key/value pairs.
Writables are too much work to develop and maintain.
There was a Tuple Map/Reduce effort at some point of time.
Spark (Scala) has tuples built in.
Tuple Example
Aggregating data over an hour:
def hourlyAggregator(inputRDD: RDD[Record]): RDD[((String, Long), Record)]
The resulting RDD has a tuple key that combines the date and the hour of the day.
These tuples can be passed as input to other functions.
Aggregating data over a day:
def dailyAggregation(inputRDD: RDD[((String, Long), Record)]): RDD[(Date, Record)]
A sketch of hourlyAggregator is shown below.
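One possible shape of hourlyAggregator, assuming a simplified stand-in Record case class; the real Record class lives in the repository and certainly differs.

import org.apache.spark.SparkContext._   // pair-RDD functions such as reduceByKey (Spark 0.x/1.x)
import org.apache.spark.rdd.RDD

// Simplified stand-in for the repository's Record class.
case class Record(date: String, hour: Long, usage: Double)

// Key every reading by (date, hour) and sum the usage within each hour.
def hourlyAggregator(inputRDD: RDD[Record]): RDD[((String, Long), Record)] =
  inputRDD
    .map(record => ((record.date, record.hour), record))
    .reduceByKey((first, second) => first.copy(usage = first.usage + second.usage))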
Use Lazy evaluation
Map/Reduce does not embrace lazy evaluation: the output of every job has to be written to HDFS.
HDFS is the only way to share data between Map/Reduce jobs.
Spark differs:
Every operation, other than actions, is lazily evaluated.
Write only critical data to disk; cache other intermediate data in memory.
Be careful when you use actions; try to delay calling them as late as possible (a small sketch follows below).
Refer to Main.scala.
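A minimal sketch of how that plays out; the file path and column index are illustrative, and the real pipeline is in Main.scala.

// Transformations below are lazy; nothing is computed yet.
val lines    = sc.textFile("energy_usage.csv")                   // illustrative path
val readings = lines.filter(line => !line.startsWith("Date"))
val kwh      = readings.map(line => line.split(",")(2).toDouble) // hypothetical usage column

kwh.cache()                      // keep the intermediate RDD in memory instead of on disk

val total = kwh.reduce(_ + _)                       // first action: triggers the whole pipeline
val peak  = kwh.reduce((a, b) => math.max(a, b))    // second action: served from the cached RDD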