Three Functional Programming Technologies for Big Data
1. Functional Programming and Big Data
http://glennengstrand.info/analytics/fp
What role will Functional Programming play in processing Big Data streams?
Glenn Engstrand
September 2014
2. Clojure News Feed
http://glennengstrand.info/software/architecture/oss/clojure
union
intersection
difference
map
reduce
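The five operations listed on this slide can be sketched with ordinary in-memory collections. The sketch below uses Scala, the language of the later spark shell slide; the participant ids and post texts are hypothetical sample data, not taken from the news feed project.

```scala
object SetAndFoldSketch {
  // Hypothetical friend ids for two participants (illustrative data only)
  val friendsOfA: Set[Int] = Set(1, 2, 3)
  val friendsOfB: Set[Int] = Set(3, 4)

  // union: ids that appear in either set
  val either: Set[Int] = friendsOfA.union(friendsOfB)
  // intersection: ids that appear in both sets
  val both: Set[Int] = friendsOfA.intersect(friendsOfB)
  // difference: ids in the first set that are absent from the second
  val onlyA: Set[Int] = friendsOfA.diff(friendsOfB)

  // map each post to its length, then reduce the lengths to one total
  val posts: List[String] = List("hello", "big", "data")
  val totalLength: Int = posts.map(_.length).reduce(_ + _)
}
```

The same shape carries over to distributed collections: the set operations and the map step work element by element, and the reduce step only needs an associative function such as addition.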
3. OSCON 2014
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Source Libraries
http://www.oscon.com/oscon2014/public/schedule/detail/34159
Data Workflows for Machine Learning
http://www.oscon.com/oscon2014/public/schedule/detail/34913
4. netflix
PigPen is map-reduce for Clojure, or distributed Clojure. It
compiles to Apache Pig, but you don't need to know much
about Pig to use it.
https://github.com/Netflix/PigPen
6. clumsy process
cd /path/to/git/clojure-news-feed/client/pigpenperf
lein run
# remove the :main from project.clj
lein uberjar
cp target/pigpenperf-0.1.0-SNAPSHOT-standalone.jar ~/oss/hadoop/pig-0.12.1/pigpen.jar
cd /path/to/oss/hadoop/pig-0.12.1
bin/pig -x local -f /path/to/pigpenperf.pig
7. Cascading
Cascalog is a fully-featured data processing and querying library for Clojure or Java.
http://cascalog.org/
Cascading is the proven application development platform for building data applications on Hadoop.
http://www.cascading.org/
8. declarative and implicit
(defn per-minute-post-action-counts
  "count of post operations grouped by time stamp"
  [input-directory output-directory]
  (let [data-point (metrics input-directory)
        output (hfs-delimited output-directory)]
    (c/?<- output
           [?ts ?cnt]
           (data-point ?year ?month ?day ?hour ?minute ?entity ?action ?count)
           (format-time-stamp ?year ?month ?day ?hour ?minute :> ?ts)
           (= ?action "post")
           (o/count :> ?cnt))))
10. Scala compared to...
http://www.scala-lang.org/
compared to Clojure:
strongly typed
more versatile
less idiomatic
no homoiconicity
more mainstream
compared to Java 7:
lambda expressions
for comprehensions
streams
higher order functions
11. spark shell
val t = sc.textFile("/path/to/newsFeedRawMetrics/perfpostgres.csv")
t.filter(line => line.contains("post"))
.map(line => (line.split(",").slice(0, 5).mkString(","), 1))
.reduceByKey(_ + _)
.saveAsTextFile("/tmp/postCount")
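To show what the pipeline above computes without needing a Spark installation, here is a minimal sketch of the same filter, map and reduce-by-key steps on an in-memory Seq. The column layout (year, month, day, hour, minute, entity, action, count) is an assumption borrowed from the variables in the Cascalog query, and the sample rows are hypothetical.

```scala
object PostCountSketch {
  // Mirrors the spark shell steps on a local collection:
  // keep lines mentioning "post", key each by its first five
  // comma-separated fields, then sum the 1s per key.
  def countPosts(lines: Seq[String]): Map[String, Int] =
    lines
      .filter(line => line.contains("post"))
      .map(line => (line.split(",").slice(0, 5).mkString(","), 1))
      .groupBy(_._1) // local stand-in for Spark's reduceByKey grouping
      .map { case (key, pairs) => key -> pairs.map(_._2).sum }

  // Hypothetical rows; the real file's layout is assumed, not confirmed
  val sample: Seq[String] = Seq(
    "2014,9,1,12,30,outbound,post,1",
    "2014,9,1,12,30,outbound,post,1",
    "2014,9,1,12,31,outbound,post,1",
    "2014,9,1,12,31,inbound,search,1"
  )
}
```

Calling PostCountSketch.countPosts(PostCountSketch.sample) groups the three post rows into two per-minute keys. Note that contains("post") matches the text anywhere in the line, not only in the action column.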
12. map reduce
fast
compact
interactive
not as distributed
limited reduce side
good for counters
not good for percentiles
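One way to read the last two bullets: a counter can be merged from partial results with a plain sum, while a percentile cannot be rebuilt from per-partition percentiles. The sketch below illustrates this with plain Scala collections; the two partitions, their values, and the median helper are hypothetical and only stand in for data held on separate workers.

```scala
object CombineSketch {
  // Hypothetical values split across two partitions
  val partitionA: Seq[Int] = Seq(1, 2, 3)
  val partitionB: Seq[Int] = Seq(10, 20, 30, 40, 50)

  // Middle element of the sorted values (lower middle for even sizes)
  def median(xs: Seq[Int]): Int = {
    val sorted = xs.sorted
    sorted((sorted.size - 1) / 2)
  }

  // Counters: summing partial counts equals counting everything at once
  val mergedCount: Int = partitionA.size + partitionB.size
  val directCount: Int = (partitionA ++ partitionB).size

  // Percentiles: the median of partial medians can differ from the
  // median of the combined data, so partial results are not enough
  val medianOfMedians: Int = median(Seq(median(partitionA), median(partitionB)))
  val trueMedian: Int = median(partitionA ++ partitionB)
}
```

Here mergedCount and directCount agree, while medianOfMedians and trueMedian do not, which is why an exact percentile needs the full sorted data for a key rather than a pairwise reduce function.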
13. margin for error
unfair basis for comparison
local spark does not use hadoop
single node mode
14. custom functions
built in functions are not as expressive as hive
can custom functions be as expressive as YARN?
future blog
Cascalog equivalent to News Feed Performance map reduce job.
15. spark streaming
more popular than spark map reduce
more real-time and reactive
future blog
compare with cascalog for reproducing news feed performance map reduce functionality
Is it really distributed?