MacGyver loved duct tape. From defusing bombs to building ultralights, he never left home without it. If MacGyver were a coder today, odds are Spark would be his digital duct tape. From replacing MapReduce batch processing, to Spark SQL, to machine learning, to streaming, Spark covers the gamut. While there are a lot of exotic uses, this presentation focuses on one of Spark's fundamental use cases: ETL.
Whether the final resting place for your data is MongoDB or SQL Server, ETL is a must. This presentation will show just how impactful Spark can be with your existing ETL processes, and that you don't have to do a full rewrite of your stack to start taking advantage of it.
4. MacGyver Trivia
● Answer these questions 3:
○ What was the name of the actor who played the role of MacGyver?
○ What other series is this actor best known for?
○ Was there another actor who was in both MacGyver and the other series?
○ OR
○ What’s MacGyver’s first name?
5. If MacGyver Were a Coder ….
● Suppose he retired in 1996 from the Phoenix Foundation and became a Software Engineer
● He’s given the assignment in 2004 to build a new ETL platform.
● What would the architecture look like?
6. ETL circa 2000 - Present
● ETL = SQL, running on a relational database:
○ Oracle DB
○ SQL Server
○ MySQL
○ PostgreSQL
8. MacGyver at Orchestro
● If he worked at Orchestro ….
○ He might find a lot that looks familiar
○ But he might also find cases where the problems need additional tools
9. Maybe Orchestro doesn’t have a BIG Data Problem
● Currently, our team deals with 15-20TB of data total
● In Matt Asay’s talk at QCon in 2014,
○ 64% of Big Data projects have < 100TB
● So maybe we don’t have a “Big Data” problem, but there’s a good chance we have a complex data problem
11. Example
4,177 Stores
~600 products sold at 4,177 Stores
~2.4 million new sales recs / day
But supplier is a Category Captain, > 70 million new sales recs to analyze (including competitor data)
And that’s just for Smiley Mart!
13. Drivers for Change
(in no particular order)
● Cost
○ SQL Server License = $$$
● ~$6k per core
○ DB Servers = $$$
● 64 Cores, 256GB RAM
○ DSAN = $$$
● Scalability
○ This model relies on vertical scaling
14. Drivers for Change (cont.)
● Performance
○ Cleansing, Loading, Analytics, Reports only getting more complex
● Which increases the time to complete each
15. New Tools in MacGyver’s Toolbox
“A paperclip can be a wondrous thing. More times than I can remember, one of these has gotten me out of a tight spot.”
16. Enter Spark
Why Spark?
● Performance
○ 100 TB unsorted data
○ Previous record:
● 2100 Node Hadoop cluster at Yahoo!
● Completed in 72 min, 1.42 TB/min
○ Spark
● 206 Nodes
● 23 min, 4.27 TB/min
● 1 PB, 190 Nodes, 234 min, 4.27 TB/min - previously unattainable
○ https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
○ Fairly easy to tune (will show later)
17. Enter Spark
Why Spark (cont.)?
● Operating Cost
○ Open Source (Apache Licensed)
○ Gets more done with fewer nodes
○ Memory less expensive nowadays
○ Runs on commodity hardware
○ Predictable projection for growth
● Hardware costs grow with customer base
● Add memory to node
● When memory maxed out, add node to cluster
18. Enter Spark
Why Spark (cont.)?
● Multi-Faceted, Simplified API
○ Map/Reduce can often be completed as a one-liner
○ Functional, immutable API
● Easy to keep concepts in your head
● Transformations - abstract
● Actions - concrete
○ ETL generally only needs Map and Filter (see the sketch below)
○ Multi-language APIs
● Scala, Java, Python
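To make that concrete, here is a minimal sketch of cleansing as a filter-plus-map pair, written with the Sparkling Clojure wrapper covered later in this deck; the pipe-delimited layout and the blank-line rule are hypothetical.

    (require '[clojure.string :as str]
             '[sparkling.core :as spark])

    ;; Cleansing an RDD of raw text lines: filter out junk, map to parsed records.
    (defn cleanse [rdd]
      (->> rdd
           (spark/filter #(not (str/blank? %)))  ; drop empty lines (hypothetical rule)
           (spark/map #(str/split % #"\|"))))    ; parse pipe-delimited fields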
20. Enter Clojure
Why Clojure?
● We started out with Python
○ Good cultural fit
● Dynamic language
● Cross paradigm - OO, Functional
○ But…
● Lags behind Scala and Java Spark releases
● Only worked in YARN client mode
21. Enter Clojure
Why Clojure?
● Clojure is:
○ Dynamic Language
○ Built on JVM
● Can use just about any Java API you want
● Can optionally compile a Clojure app into a
Java Archive
○ (Only) Functional
● Comes with map, reduce, filter baked in (example below)
● Fns are first class objects
● Immutable data structures
● No generics
○ Great concurrency support
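A small illustration (not from the deck) of those points: map, filter, and reduce are core functions, keywords act as functions, and the vectors and maps are immutable.

    (def sales [{:store 1 :qty 3} {:store 2 :qty 0} {:store 1 :qty 5}])

    (->> sales
         (filter (comp pos? :qty))  ; keep records with a positive quantity
         (map :qty)                 ; pull out the quantity field
         (reduce +))                ; total units sold => 8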
23. Developing in Spark and Clojure
● Clojure comes with a shell called the REPL (Read Eval Print Loop); a short session is shown below
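For example, a quick session might look like this:

    user=> (map inc [1 2 3])
    (2 3 4)
    user=> (filter even? (range 10))
    (0 2 4 6 8)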
24. Developing in Spark and Clojure
● We currently use the Sparkling API (https://gorillalabs.github.io/sparkling/); see the sketch below
○ Idiomatic wrapper around Spark Java API
● May use Flambo in the future:
○ https://github.com/yieldbot/flambo
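A minimal Sparkling sketch, assuming a local master; the app name and sample data are made up.

    (require '[clojure.string :as str]
             '[sparkling.conf :as conf]
             '[sparkling.core :as spark])

    ;; Build a SparkContext from a conf, then run a tiny filter/count job.
    (def sc
      (spark/spark-context
        (-> (conf/spark-conf)
            (conf/master "local[*]")
            (conf/app-name "sparkling-example"))))

    (->> (spark/parallelize sc ["store1|3" "" "store2|5"])
         (spark/filter #(not (str/blank? %)))  ; drop the empty record
         (spark/count))                        ; => 2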
25. Developing in Spark and Clojure
● Best editor for Clojure is Emacs
○ Cider plugin
● Integrated REPL
● Code completion
● In Clojure, Java, and REPL
● Easier to read errors
● Great article on how to set up Cider: http://www.braveclojure.com/using-emacs-with-clojure/
26. Developing in Spark and Clojure
● Build with Leiningen (or lein)
○ Project file written in Clojure (sketch below)
○ Provides integration with Maven and Clojars repos
○ Runs unit tests
○ Generates uberjar
● Looking into potential use of Gradle
○ Better fit for continuous integration/deployment
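A bare-bones project.clj sketch; the project name, versions, and Spark coordinates are illustrative, not the team's actual file.

    (defproject etl-example "0.1.0-SNAPSHOT"
      :dependencies [[org.clojure/clojure "1.8.0"]
                     [gorillalabs/sparkling "1.2.5"]]
      ;; Spark itself is provided by the cluster at runtime
      :profiles {:provided
                 {:dependencies [[org.apache.spark/spark-core_2.10 "1.6.0"]]}}
      :aot :all  ; Sparkling needs AOT-compiled classes
      :main etl-example.core)

Running lein uberjar then produces the standalone jar that gets handed to spark-submit.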
28. Spark ETL Architecture
● Advantages
○ Lower risk
● Fits into existing process
○ Single Responsibility:
● Do cleansing, and do it well (sketch below)
○ Huge potential to improve performance
● Next Steps
○ Build out loading capability in Spark
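One way that cleansing stage could be wired up, reusing the cleanse fn sketched earlier; the paths and output format are hypothetical.

    (defn run-cleanse [sc in-path out-path]
      (->> (spark/text-file sc in-path)          ; raw extract from the existing process
           (cleanse)                             ; the Spark stage: cleansing only
           (spark/map #(str/join "|" %))         ; back to the pipe-delimited shape
           (spark/save-as-text-file out-path)))  ; existing loaders pick up from here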
29. Initial Results
● 70M record Point of Sale data set
○ Prod Cleansing Time: 1h 29m
○ Spark Cleansing Time: 1m 36s
○ How?
● Keys (example submit command below)
○ Yarn Cluster mode
○ Num Executors
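For illustration, a submit command along these lines exercises both keys; the jar, class, and executor sizing are hypothetical.

    spark-submit --master yarn \
                 --deploy-mode cluster \
                 --num-executors 20 \
                 --executor-memory 8g \
                 --class etl_example.core \
                 etl-example-standalone.jar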
Scenario:
Let’s say that the records coming from Walmart are too many to fit in one file, so they have to be pulled down in chunks, maybe 100+ files for the whole set.
JnJ has an expectation that loads and analytics will be completed at night so that the system is available for ad hoc Business Intelligence reports.
On the current system, that