Three Functional Programming Technologies for Big Data


Learn about the future of Functional Programming and Big Data with this summary on a recent evaluation of three related open source technologies.


Functional Programming and Big Data
http://glennengstrand.info/analytics/fp
What role will Functional Programming play in processing Big Data streams?
Glenn Engstrand
September 2014

Clojure News Feed
http://glennengstrand.info/software/architecture/oss/clojure
union
intersection
difference
map
reduce

OSCON 2014
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Source Libraries
http://www.oscon.com/oscon2014/public/schedule/detail/34159
Data Workflows for Machine Learning
http://www.oscon.com/oscon2014/public/schedule/detail/34913

Netflix
PigPen is map-reduce for Clojure, or distributed Clojure. It
compiles to Apache Pig, but you don't need to know much
about Pig to use it.
https://github.com/Netflix/PigPen

query-like syntax
(defn my-query
  [data]
  (->> data
       (pig/map my-map)
       (pig/filter (fn [x] (= (:action x) "post")))
       (pig/group-by :ts {:fold (fold/count)})
       (pig/store-tsv "/path/to/newsFeedPigOutput")))

clumsy process
cd /path/to/git/clojure-news-feed/client/pigpenperf
lein run
# remove the :main from project.clj
lein uberjar
cp target/pigpenperf-0.1.0-SNAPSHOT-standalone.jar \
  ~/oss/hadoop/pig-0.12.1/pigpen.jar
cd /path/to/oss/hadoop/pig-0.12.1
bin/pig -x local -f /path/to/pigpenperf.pig

Cascalog and Cascading
Cascalog: fully-featured data processing and querying library for Clojure or Java.
http://cascalog.org/
Cascading is the proven application development platform for building data applications on Hadoop.
http://www.cascading.org/

declarative and implicit
(defn per-minute-post-action-counts
  "count of post operations grouped by time stamp"
  [input-directory output-directory]
  (let [data-point (metrics input-directory)
        output (hfs-delimited output-directory)]
    (c/?<- output
           [?ts ?cnt]
           (data-point ?year ?month ?day ?hour ?minute ?entity ?action ?count)
           (format-time-stamp ?year ?month ?day ?hour ?minute :> ?ts)
           (= ?action "post")
           (o/count :> ?cnt))))

idiomatic
(defn parse-data-line
  "parses the kafka output into the corresponding fields"
  [line]
  ;; escape the pipe: an unescaped | is regex alternation, not a literal delimiter
  (s/split line #"\|"))

(defn metrics [dir]
  (let [source (c/hfs-textline dir)]
    (c/<- [?year ?month ?day ?hour ?minute ?entity ?action ?count]
          (source ?line)
          (parse-data-line ?line :> ?year ?month ?day ?hour ?minute ?entity ?action ?count)
          (:distinct false))))
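Clojure regex literals compile to java.util.regex patterns, so the delimiter handed to split is worth checking against a sample line. The sketch below uses Scala, which calls the same JVM regex engine through String.split. The object name and the sample record are hypothetical and not taken from the news feed project.

```scala
// Sketch only: the sample record below is hypothetical.
object PipeSplitSketch {
  val line: String = "2014|9|1|12|30|outbound|post|17"

  // "\\|" is the regex for a literal pipe character.
  val escaped: Array[String] = line.split("\\|")

  // A bare "|" is an alternation of two empty patterns, so it matches
  // between characters instead of matching the delimiter itself.
  val unescaped: Array[String] = line.split("|")

  def main(args: Array[String]): Unit = {
    println(s"escaped:   ${escaped.length} fields")
    println(s"unescaped: ${unescaped.length} fields")
  }
}
```

Only the escaped form yields one element per field, which is what a query binding eight output variables would need.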

Scala compared to...
http://www.scala-lang.org/
Clojure
strongly typed
more versatile
less idiomatic
no homoiconicity
more mainstream
Java 7
lambda expressions
for comprehensions
streams
higher-order functions

spark shell
val t = sc.textFile("/path/to/newsFeedRawMetrics/perfpostgres.csv")
t.filter(line => line.contains("post"))
  .map(line => (line.split(",").slice(0, 5).mkString(","), 1))
  .reduceByKey(_ + _)
  .saveAsTextFile("/tmp/postCount")
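To see what each step of that pipeline does without a Spark installation, the same filter, map, and per-key sum can be sketched on plain Scala collections. This is an illustration under assumptions: the object name, the sample rows, and their column layout are hypothetical, and groupBy followed by a sum stands in for reduceByKey.

```scala
// Plain-Scala sketch of the pipeline; no Spark required.
// The sample rows and their column layout are hypothetical.
object PostCountSketch {
  // Count the lines containing "post", keyed by the first five
  // comma-separated fields of each line.
  def postCounts(lines: Seq[String]): Map[String, Int] =
    lines
      .filter(line => line.contains("post"))
      .map(line => (line.split(",").slice(0, 5).mkString(","), 1))
      .groupBy { case (key, _) => key }
      .map { case (key, pairs) => key -> pairs.map(_._2).sum }

  val sample: Seq[String] = Seq(
    "2014,9,1,12,30,outbound,post,17",
    "2014,9,1,12,30,inbound,post,21",
    "2014,9,1,12,30,outbound,search,5",
    "2014,9,1,12,31,outbound,post,9"
  )

  def main(args: Array[String]): Unit =
    postCounts(sample).toSeq.sorted.foreach(println)
}
```

On the sample rows, the key for 12:30 gets a count of 2 and the key for 12:31 gets a count of 1; the search row is filtered out.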

map reduce
fast
compact
interactive
not as distributive
limited reduce side
good for counters
not good for percentiles
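One plausible reading of the last two points: a count can be built from partial counts with a simple pairwise sum, while a percentile needs every value for a key in one place. The sketch below illustrates that difference on plain Scala collections. The object name, the partition values, and the nearest-rank percentile definition are assumptions chosen for the example, not the author's benchmark.

```scala
// Sketch only: the data and the nearest-rank definition are assumptions.
object CountersVsPercentiles {
  // Hypothetical latency values that landed on two different partitions.
  val partitionA: Seq[Int] = Seq(10, 20, 30, 40)
  val partitionB: Seq[Int] = Seq(50, 1000)
  val all: Seq[Int] = partitionA ++ partitionB

  // A counter is an associative, commutative fold, so partial counts
  // computed per partition can simply be added together.
  def count(xs: Seq[Int]): Int = xs.foldLeft(0)((acc, _) => acc + 1)

  // Nearest-rank percentile: sorts the complete collection first.
  def percentile(xs: Seq[Int], p: Double): Int = {
    require(xs.nonEmpty && p > 0 && p <= 100, "need data and 0 < p <= 100")
    val sorted = xs.sorted
    val rank = math.ceil(p / 100.0 * sorted.size).toInt
    sorted(rank - 1)
  }

  def main(args: Array[String]): Unit = {
    // Partial counts combine into the overall count.
    println(count(partitionA) + count(partitionB) == count(all))
    // Per-partition medians do not combine into the overall median.
    val partialMedians = Seq(percentile(partitionA, 50), percentile(partitionB, 50))
    println(percentile(partialMedians, 50) == percentile(all, 50))
  }
}
```

With these values the first comparison holds and the second does not, which is why a per-key sum fits a reduce step more naturally than a percentile does.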

margin for error
unfair basis for comparison
local spark does not use hadoop
single node mode

custom functions
built-in functions are not as expressive as Hive
can custom functions be as expressive as YARN?
future blog
Cascalog equivalent to News Feed Performance map reduce job

spark streaming
more popular than spark map reduce
more real-time and reactive
future blog
compare with cascalog for reproducing news feed performance map reduce functionality
Is it really distributed?


