Why Functional Programming Is Important in Big Data Era

"The only thing that works for parallel programming is functional programming."
-- Carnegie Mellon Professor Bob Harper

  1. Why Functional Programming Is Important In Big Data Era? handaru@tiket.com
  2. What Is Big Data?
  3. What Are The Steps? Collect → Analyze → Act On
  4. What We Need? (diagram: Distributed Computing, Cluster, Process Data)
  5. What We Need?
     • Spark for data processing on the cluster, originally written in Scala, which allows concise function syntax and interactive use
     • Mesos as the cluster manager
     • ZooKeeper as a highly reliable distributed coordinator
     • HDFS as distributed storage
  6. What We Need?
     • Pure functions
     • Atomic operations
     • Parallel patterns or skeletons
     • Lightweight algorithms
     "The only thing that works for parallel programming is functional programming." -- Carnegie Mellon Professor Bob Harper
     (A short Scala illustration of pure functions and parallel patterns follows below.)
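     The following minimal sketch is not from the deck; it assumes a Scala version (2.11/2.12) in which .par parallel collections ship with the standard library. It illustrates why purity matters: a parallel map may apply the function in any order, on any thread, and the result does not change.

         // pure function: the result depends only on the argument, no side effects
         def square(x: Int): Int = x * x

         val xs = (1 to 10).toList
         val sequential = xs.map(square).sum       // 385
         val parallel   = xs.par.map(square).sum   // 385 -- same answer, computed in parallel
         println(sequential == parallel)           // true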
  7. What Is Functional Programming?
  8. FP Quick Tour In Scala
     • Basic transformations:
         var array = new Array[Int](10)
         var list = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
     • Indexing:
         array(0) = 1
         println(list(0))
     • Anonymous functions:
         val multiply = (x: Int, y: Int) => x * y
         val procedure = { x: Int =>
           println("Hello, " + x)
           println(x * 10)
         }
  9. FP Quick Tour In Scala
     • Scala closure syntax:
         (x: Int) => x * 10   // full version
         x => x * 10          // type inference
         _ * 10               // underscore syntax
         x => {               // body is a block of code
           val y = 10
           x * y
         }
  10. FP Quick Tour In Scala
      • Processing collections:
          var list = List(1, 2, 3, 4, 5, 6, 7, 8, 9)
          list.foreach(x => println(x))
          list.map(_ * 10)
          list.filter(x => x % 2 == 0)
          list.reduce((x, y) => x + y)
          list.reduce(_ + _)
          def f(x: Int) = List(x - 1, x, x + 1)
          list.map(x => f(x))
          list.map(f(_))
          list.flatMap(x => f(x))
          list.map(x => f(x)).reduce(_ ++ _)
  11. Spark Quick Tour
      • Spark context:
        • Entry point to Spark functionality
        • In spark-shell, created as sc
        • In a standalone Spark program, we must create it ourselves (see the sketch below)
      • Resilient distributed datasets (RDDs):
        • A distributed memory abstraction
        • A logically centralized entity, but physically partitioned across multiple machines inside a cluster based on some notion of key
        • Immutable
        • Automatically rebuilt on failure
        • Evicted based on an LRU (Least Recently Used) algorithm
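      A minimal sketch (not from the deck) of creating the Spark context by hand in a standalone program; the object name, application name, and local[*] master are illustrative assumptions.

          import org.apache.spark.{SparkConf, SparkContext}

          object StandaloneApp {                          // hypothetical object name
            def main(args: Array[String]): Unit = {
              val conf = new SparkConf()
                .setAppName("fp-big-data-demo")           // illustrative application name
                .setMaster("local[*]")                    // assumed master; on a cluster this would point at Mesos
              val sc = new SparkContext(conf)             // the entry point described on the slide
              try {
                println(sc.parallelize(1 to 100).count()) // 100
              } finally {
                sc.stop()                                 // release resources when done
              }
            }
          }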
  12. Spark Quick Tour: Working with RDDs (diagram)
  13. Spark Quick Tour: Cached RDDs (diagram)
  14. Spark Quick Tour
      • Transformations:
        • Lazy operations to build RDDs from other RDDs
        • Narrow transformations (involve no data shuffling): map, flatMap, filter
        • Wide transformations (involve data shuffling): sortByKey, reduceByKey, groupByKey
      • Actions:
        • Return a result or write it to storage: collect, count, take(n)
      (A short sketch contrasting narrow and wide transformations follows below.)
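      The sketch below is not from the deck; it assumes an existing SparkContext named sc, as in spark-shell, and uses illustrative data.

          val lines = sc.parallelize(Seq("to be or not to be", "to be"))

          // narrow transformations: each output partition depends on a single
          // input partition, so no data moves between machines
          val tokens = lines.flatMap(_.split(" "))
          val pairs  = tokens.map(w => (w, 1))

          // wide transformation: rows with the same key must be brought together,
          // which forces a shuffle across the cluster
          val counts = pairs.reduceByKey(_ + _)

          // nothing has executed yet -- transformations are lazy;
          // this action triggers the actual job
          counts.collect().foreach(println)   // e.g. (be,3), (to,3), (or,1), (not,1)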
  15. Spark Quick Tour: Transformations (diagram)
  16. Spark Quick Tour
      • Creating RDDs (turning a collection into an RDD gives a base RDD):
          val numbers = sc.parallelize(List(1, 2, 3, 4, 5))
          val textFile = sc.textFile("hdfs://localhost/test/tobe.txt")
          val textFile = sc.textFile("hdfs://localhost/test/*.txt")
      • Basic transformations (each produces a transformed RDD):
          val squares = numbers.map(x => x * x)
          val evens = squares.filter(_ < 9)
          val mapto = numbers.flatMap(x => 1 to x)
          val words = textFile.flatMap(_.split(" ")).cache()
  17. Spark Quick Tour
      • Basic actions (note the influence of the cache on repeated actions):
          words.collect()
          words.take(5)
          words.count()
          words.reduce(_ + _)
          words.filter(_ == "be").count()
          words.filter(_ == "or").count()
          words.saveAsTextFile("hdfs://localhost/test/result")
      (A short sketch of the cache's influence follows below.)
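      A brief sketch (not from the deck) of what "the influence of cache" means in practice; it reuses the tobe.txt path from slide 16 and assumes sc exists as in spark-shell.

          val cached = sc.textFile("hdfs://localhost/test/tobe.txt")
                         .flatMap(_.split(" "))
                         .cache()        // mark the RDD to be kept in memory

          cached.count()                 // first action: reads from HDFS and fills the cache
          cached.count()                 // second action: served from cached partitions, so it is faster
          cached.filter(_ == "be").count()   // later actions also reuse the cached data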
  18. Spark Quick Tour
      • Pair syntax:
          val pair = (a, b)
      • Accessing pair elements:
          pair._1
          pair._2
      • Key-value operations:
          val pets = sc.parallelize(List(("cat", 1), ("dog", 2), ("cat", 3)))
          pets.reduceByKey(_ + _)
          pets.groupByKey()
          pets.sortByKey()
  19. Hello World
          val logFile = "hdfs://localhost/test/tobe.txt"
          val logData = sc.textFile(logFile).cache()
          val wordCount = logData.flatMap(_.split(" "))
                                 .map((_, 1))
                                 .reduceByKey(_ + _)
          wordCount.saveAsTextFile("hdfs://localhost/wordcount/result")
          sc.stop()
  20. Execution (diagram)
  21. Software Components (diagram): the Application with its Spark Context, ZooKeeper, the Mesos Master, Mesos Slaves each running a Spark Executor, and HDFS/Other Storage. (A configuration sketch follows below.)
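      A minimal sketch (not from the deck) of wiring these components together from the driver side; the host names, ports, and executor package path are placeholders.

          import org.apache.spark.{SparkConf, SparkContext}

          val conf = new SparkConf()
            .setAppName("fp-on-mesos")                                   // illustrative name
            // the driver asks ZooKeeper which Mesos master is currently leading
            .setMaster("mesos://zk://zk1:2181,zk2:2181,zk3:2181/mesos")  // placeholder hosts
            // where Mesos slaves fetch the Spark executor from (HDFS in this stack)
            .set("spark.executor.uri", "hdfs://localhost/spark/spark-dist.tgz")  // placeholder path

          val sc = new SparkContext(conf)   // RDD operations now run on the Mesos-managed executors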
  22. Literature
      • Parallel Programming With Spark
      • Spark: Low latency, massively parallel processing framework
  23. handaru@tiket.com
  24. handaru@tiket.com
