3. 22
Functional Programming
It’s new and shiny! It’s the latest thing!
Not so much…
– Mathematical foundations introduced with lambda calculus in the 1930s
– Lisp language created at MIT in 1958
– Google based much of original Map/Reduce research on well-established Functional Programming
concepts
“Pure” Functional Languages
– Common Lisp, Haskell, F#, Erlang
– Can be mathematically proven to be correct (as opposed to “Design Patterns”)
Hybrids
– Scala, Clojure
Non-functional (mostly – with increasing FP capabilities)
– C, C++, Java, Python
This presentation focuses on real life application – not academic theory.
“Think functionally” is most important to start.
4. 33
Functional Programming
First-class functions
– Functions are “first-class citizens” and can be used anywhere in the code
– Function can be an argument to another function – or returned as a
result from a function
Pure functions
– No side effects
– Facilitates parallel and/or concurrent operations and optimizations
Recursion
– Iteration done through recursion (as opposed to looping)
Immutability
– Avoid changing state and avoid mutable data
Declarative (vs Imperative)
– What to do, not how to do it
http://en.wikipedia.org/wiki/Functional_programming
8. 77
Pure Functions
Y = f(x)
1. The function always evaluates the same result value given the same
argument value(s). The function result value cannot depend on any hidden
information…
2. Evaluation of the result does not cause any semantically observable side
effect or output, such as mutation of mutable objects…
http://en.wikipedia.org/wiki/Pure_function
Parallelization – work in parallel, gather results
Referentially Transparent – results can be cached and/or reused safely
Lazy evaluation – Because of guarantee of no side effect, may be able to skip function
9. 88
Recursion
• Important concept in computer science – many nuances – big topic
• For this discussion, think recursion vs iteration
• Recursion in FP allows for:
• “Iteration” without mutating
• “Tail call” optimization to avoid blowing out the stack
• Simpler code
Bad:
for (int x = 0 ; x < len(myList); x++) {
myList2(x) = doSomething with myList…
}
Good:
for (var x in myList):
produce myList2
Better:
myList2 = map(doSomething, myList)
10. 99
Immutability
• Once an object/structure is created it cannot be changed
• Thread safety is no longer an issue
• Mutations produce side effects – (some you don’t know about)
• Code is more maintainable – it just does what it does
for (int x = 0 ; x < len(myList); x++) {
myList2(x) = doSomething with myList
}
Oops! – “myList” is altered by another thread during the loop
Thought for the day:
HDFS is broken – I can’t change a file! Hurry up and fix it!
(Are you sure?)
13. 1212
<OPINION>
Gen 1 Gen 2 Gen 3
Transform Map/Reduce
(Java)
Spark (Scala) Sparkling (Clojure)
Data Flow Pig
Cascading (Java)
Scalding (Scala) Cascalog (Clojure)
CEP Storm (Java) ScalaStorm
(Scala)
Storm (Clojure
DSL)
Messaging Kafka (Java) Kafka(Scala) Kafka (Clojure)
</OPINION>
Gravitate to same language
Increasingly Functional Over Time
Big Data Framework “disappears” into
Language
14. 1313
Case Study
Hadoop job to scan transactions within date range specified as argument to script
Output a record for each transaction within the range which has:
– “closed_date” within the range specified – add “type” of “CLOSED”
– “open_date” within the range specified – add “type” of “OPENED”
– if one record has both dates within the range specified, output 2 records – 1 with each “type”
Cust
ID
Open Date Close Date
101 2014/02/01 null
202 2014/01/12 2014/02/22
303 2014/02/02 2014/02/28
404 2014/01/04 2014/03/14
Cust
ID
Open Date Close Date Type
101 2014/02/01 null OPENED
202 2014/01/12 2014/02/22 CLOSED
303 2014/02/02 2014/02/28 OPENED
303 2014/02/02 2014/02/28 CLOSED
Range specified: 2014/02/01 through 2014/02/28
15. 1414
Case Study – Version 1
Bash script date arithmetic to create multiple FOR LOOPs
– In LOOP executes “OPENED” Pig script – once per day in range
– In LOOP # 2, executes “CLOSED” Pig script – once per day in range
– So…for this example, a Pig script called 56 times
Pig script:
– FILTER input BY (close or open) date for respective script
– Call Java UDF for transformation – assigns “Type” depending on Pig script version
Outputs merged to same directory outside Pig
Results were 100% correct from business standpoint
To process a range of 1 month ~16 hours
16. 1515
Case Study – Version 2
Bash script passes date range to Pig script, calls 1 script 1 time
Pig script:
– FILTER input BY close or open date within range (either or both match)
– Call Java UDF for transformation – assigns one or two “Type” depending on date values
Results compared to “Version 1” code – matched exactly….unless I ran late in the day?!
Side effect!
Results were 100% correct from business standpoint
To process a range of 1 month ~30 minutes
3% of Version 1 runtime