1. Spark @ Hackathon
Madhukara Phatak
Zinnia Systems
@madhukaraphatak

2. Hackathon
"A hackathon is a programming ritual, usually done at night, where programmers solve problems overnight that might otherwise take years."
-Anonymous

3. Where / When
Dec 6 and 7, 2013
At the BigData Conclave, hosted by Flutura
Solving real-world problem(s) in 24 hours
No restriction on the tools to be used
Deliverables: results, working code, and visualization

4. What?
Predict the global energy demand for the next year using the energy usage data available for the last four years, in order to enable utility companies to handle the energy demand effectively.
Usage data collected every minute from smart meters
From 2008 to 2012
Around 133 MB uncompressed
Around 275,000 records
Around 25,000 missing data records

5. Questions to be answered
What would be the energy consumption for the next day?
What would be the week-wise energy consumption for the next one year?
What would be the household's peak-time load (peak time is between 7 AM and 10 AM) for the next month?
  During weekdays
  During weekends
Assuming there was a full day of outage, calculate the revenue loss for a particular day next year by finding the Average Revenue Per Day (ARPD) of the household using the given tariff plan.
Can you identify the device usage patterns?

6. Who
A four-member team from Zinnia Systems
Chose Spark/Scala for solving these problems
Everyone except me was new to Scala and Spark
Solved the problems on time
Won first prize at the hackathon

7. Why Spark?
Faster prototyping
Excellent in-memory performance
Uses Akka: able to run 2.5 million concurrent actors in 1 GB of RAM
Easy to debug
Excellent integration with IntelliJ and Eclipse
Little code to write: about 500 lines
Something new to try at a hackathon

8. Solution
Uses only core Spark
Uses the geometric Brownian motion algorithm for prediction
Complete code available on GitHub, under the Apache license:
https://github.com/zinniasystems/spark-energy-prediction
Blog series:
http://zinniasystems.com/blog/2013/12/12/predicting-global-energy-demand-using-spark-part-1/

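The deck does not reproduce the prediction code itself. As a rough, self-contained sketch of what a geometric Brownian motion projection involves (the function names and parameters here are illustrative assumptions, not the repository's actual code):

import scala.util.Random

object GbmSketch {
  // One GBM step: next = current * exp((mu - sigma^2/2) * dt + sigma * sqrt(dt) * Z),
  // where Z is a standard normal sample, mu the drift and sigma the
  // volatility estimated from the historical usage data (assumed given).
  def gbmNext(current: Double, mu: Double, sigma: Double, dt: Double, rng: Random): Double =
    current * math.exp((mu - sigma * sigma / 2) * dt + sigma * math.sqrt(dt) * rng.nextGaussian())

  // Project `steps` future values starting from the last observed one.
  def gbmPath(start: Double, mu: Double, sigma: Double, steps: Int, rng: Random = new Random()): Seq[Double] =
    Iterator.iterate(start)(gbmNext(_, mu, sigma, 1.0, rng)).take(steps + 1).toSeq
}

Since each path is random, one common approach is to simulate many paths and average them to get an expected demand curve.
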
9. Embrace Scala
Scala is the JVM language in which Spark is implemented.
Though Spark provides Java and Python APIs, the Scala API feels more natural.
If you are coming from Pig, you will feel at home with the Scala API.
The Spark source base is small, so knowing Scala helps you peek at the Spark source whenever needed.
Excellent REPL support.

10. Go functional
Spark encourages you to use functional idioms over object-oriented ones.
Some of the functional idioms available: closures, function chaining, lazy evaluation.
Example: standard deviation, whose core computation is sum((xi - mean) * (xi - mean)).
The Map/Reduce way:
  Map calculates (xi - mean) * (xi - mean)
  Reduce does the sum

11. Spark way
The functional version chains map and reduce:

private def standDeviation(inputRDD: RDD[Double], mean: Double): Double = {
  // Sum of squared deviations, as defined on the previous slide; a full
  // standard deviation would still divide by the count and take the square root.
  inputRDD
    .map(value => (value - mean) * (value - mean))
    .reduce((firstValue, secondValue) => firstValue + secondValue)
}

The imperative version below, by contrast, does not work in Spark: the closure mutates a local variable on the executors, so the driver never sees the updates.

private def standDeviation(inputRDD: RDD[Double], mean: Double): Double = {
  var sum = 0.0
  inputRDD.map(value => sum += (value - mean) * (value - mean))
  sum // still 0.0: map is lazy, and the mutation happens remotely anyway
}

The code is available in EnergyUsagePrediction.scala.

12. Use Tuples
Map/Reduce is restricted to key/value pairs.
Representing data such as grouped data is too difficult with key/value pairs.
Writables are too much work to develop and maintain.
There was a Tuple MapReduce effort at some point in time.
Spark (Scala) has tuples built in.

13. Tuple Example
Aggregating data over the hour:

def hourlyAggregator(inputRDD: RDD[Record]): RDD[((String, Long), Record)]

The resulting RDD has a tuple key that combines the date and the hour of the day. These tuples can be passed as input to other functions.

Aggregating data over the day:

def dailyAggregation(inputRDD: RDD[((String, Long), Record)]): RDD[(Date, Record)]

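The deck shows only the signatures, and the fields of Record are not given. Assuming a record carries a date, an hour, and a usage reading, a minimal sketch of the hourly step could be:

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._ // pair-RDD operations such as reduceByKey (implicit in newer Spark)

// Hypothetical record shape; the real fields are not shown in the deck.
case class Record(date: String, hour: Long, usage: Double)

// Key each record by a (date, hour) tuple, then merge records that fall
// into the same hour by summing their usage.
def hourlyAggregator(inputRDD: RDD[Record]): RDD[((String, Long), Record)] =
  inputRDD
    .map(record => ((record.date, record.hour), record))
    .reduceByKey((a, b) => a.copy(usage = a.usage + b.usage))
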
14. Use Lazy evaluation
Map/Reduce does not embrace lazy evaluation: the output of every job has to be written to HDFS.
HDFS is the only way to share data in Map/Reduce.
Spark differs:
Every operation other than actions is lazily evaluated.
Write only critical data to disk; cache other intermediate data in memory.
Be careful when you use actions; try to delay calling actions as late as possible.
Refer to Main.scala, and see the small illustration below.

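As a minimal illustration of this laziness (sc is a SparkContext; the file name and parsing are made up for the example, while the deck points to Main.scala for the real flow):

val records = sc.textFile("energy-usage.csv") // lazy: nothing is read yet
  .map(line => line.split(","))               // lazy: only extends the lineage
  .cache()                                    // mark the parsed data for in-memory reuse

val total = records.count()  // first action: triggers the actual job
val sample = records.first() // second action: served from the cache
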
15. Thank you