How MacGyver Learned to Leave Duct Tape
Behind and Use Spark Instead
April 22, 2015
DC Spark Interactive Meetup
Agenda
• MacGyver Who?
• Complex Data Problem
• Current Architecture
• New Tools in MacGyver’s Toolbox
● Spark Architecture
● Initial Results
• Q&A
MacGyver Who?
MacGyver Trivia
● Answer these questions 3:
○ What was the name of the actor who
played the role of MacGyver?
○ What other series is this actor best known
for?
○ Was there another actor who was in both
MacGyver and the other series?
○ OR
○ What’s MacGyver’s first name?
If MacGyver Were a Coder ….
● Suppose he retired in 1996
from the Phoenix Foundation
and became a Software
Engineer
● He’s given the assignment in
2004 to build a new ETL
platform.
● What would the architecture
look like?
ETL circa 2000 - Present
ETL = SQL
SQL = Oracle DB, SQL Server, MySQL, PostgreSQL
ETL Architecture
MacGyver at Orchestro
● If he worked at Orchestro ….
○ He might find a lot of :
○ But he might find cases where there are
problems that need additional tools.
Maybe Orchestro doesn’t
have a BIG Data Problem
● Currently, our team deals with 15-20TB of
data total
● In Matt Asay’s talk at QCon in 2014,
○ 64% of Big Data projects have < 100TB
● So maybe we don’t have a “Big Data”
problem, but there’s a good chance we have
a complex data problem
Complex Data Problem
Example
4,177 Stores
~600 products sold at 4,177
Stores
~ 2.4 million new sales recs /
day
But supplier is a Category Captain,
> 70 million new sales recs to analyze
(including competitor data)
And that’s just for
Smiley Mart!
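A rough check of the daily volume above, as a minimal sketch. It assumes (the slide does not say this) that every product produces one sales record per store per day; the variable names are illustrative only.

```python
# Figures taken from the slide.
stores = 4_177
products_per_store = 600  # approximate ("~600")

# Assumption: one sales record per product per store per day.
own_records_per_day = stores * products_per_store
print(f"{own_records_per_day:,} records/day for the supplier's own products")

# Category Captain volume quoted on the slide (includes competitor data).
category_records = 70_000_000
ratio = category_records / own_records_per_day
print(f"category data is roughly {ratio:.0f}x the supplier's own volume")
```

Under that assumption the product comes to about 2.5 million, the same order as the ~2.4 million on the slide. A slightly lower real figure is plausible if not every product sells in every store each day.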
Current ETL Architecture
Standardized Text
Delimited Format
Landing (Raw) Staging (Cleansed)
Stored in Snowflake
Schema in EDW
(around 1200 tables)
Drivers for Change
(in no particular order)
● Cost
○ SQL Server License = $$$
● ~$6k per core
○ DB Servers = $$$
● 64 Cores, 256GB RAM
○ DSAN = $$$
● Scalability
○ This model relies on vertical scaling
Drivers for Change
(cont.)
● Performance
○ Cleansing, Loading, Analytics,
Reports only getting more complex
● Which increases time to
complete each
New Tools in
MacGyver’s Toolbox
“A paperclip can be a
wondrous thing.
More times than I can
remember, one of
these has gotten me
out of a tight spot.”
Enter Spark
Why Spark?
● Performance
○ Benchmark: sorting 100 TB of unsorted data
○ Previous record
● 2100 Node Hadoop cluster at Yahoo!
● Completed in 73 min, 1.42 TB/min
○ Spark
● 206 Nodes
● 23 min, 4.27 TB/min
● 1 PB, 190 Nodes, 234 min, 4.27 TB/min - previously
unattainable
○ https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
○ Fairly easy to tune (will show later)
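To put the sort figures above on a per-node basis, here is a small sketch that uses only the node counts and TB/min rates quoted on the slide. The variable names are illustrative, and the comparison ignores hardware differences between the two clusters.

```python
# (nodes, TB per minute) as quoted on the slide.
hadoop_nodes, hadoop_tb_per_min = 2100, 1.42
spark_nodes, spark_tb_per_min = 206, 4.27

# Throughput contributed by each node, in TB per minute.
hadoop_per_node = hadoop_tb_per_min / hadoop_nodes
spark_per_node = spark_tb_per_min / spark_nodes

cluster_speedup = spark_tb_per_min / hadoop_tb_per_min
per_node_speedup = spark_per_node / hadoop_per_node

print(f"cluster throughput: {cluster_speedup:.1f}x")
print(f"per-node throughput: {per_node_speedup:.1f}x")
```

The cluster-level rate is about three times higher, while the per-node rate is about thirty times higher, because the Spark run used roughly a tenth of the nodes.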
Enter Spark
Why Spark (cont.)?
● Operating Cost
○ Open Source (Apache Licensed)
○ Gets more done with fewer nodes
○ Memory less expensive nowadays
○ Runs on commodity hardware
○ Predictable projection for growth
● Hardware costs grow with customer base
● Add memory to node
● When memory maxed out, add node to cluster
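The growth rule above (add memory to a node until it is maxed out, then add a node) can be sketched as a small sizing function. This is an illustration under stated assumptions, not the team's actual capacity model: `plan_cluster`, the 256 GB node ceiling, and the workload sizes are all hypothetical.

```python
import math


def plan_cluster(required_gb, node_max_gb):
    """Return (node_count, gb_per_node) for a given memory requirement.

    Uses the fewest nodes that can hold the requirement, then spreads
    the memory evenly across them (rounded up to a whole GB).
    """
    if required_gb <= 0 or node_max_gb <= 0:
        raise ValueError("memory sizes must be positive")
    nodes = math.ceil(required_gb / node_max_gb)
    gb_per_node = math.ceil(required_gb / nodes)
    return nodes, gb_per_node


# Hypothetical workloads against a hypothetical 256 GB per-node ceiling.
small = plan_cluster(200, 256)   # fits on one node
larger = plan_cluster(600, 256)  # needs more nodes
print(small, larger)
```

Cost then moves in predictable steps: memory upgrades first, an extra commodity node only when the per-node ceiling is reached.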
Enter Spark
Why Spark (cont.)?
● Multi-Faceted, Simplified API
○ Map/Reduce can often be completed as a
one liner
○ Functional, immutable API
● Easy to keep concepts in your head
● Transformations - abstract (lazy; they only describe the computation)
● Actions - concrete (they trigger execution and return a result)
○ ETL generally only needs Map and Filter
○ Multi-language APIs
● Scala, Java, Python
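The "ETL generally only needs Map and Filter" point can be illustrated without a cluster. The sketch below uses Python's built-in `map` and `filter` as stand-ins for the Spark operations of the same name; it is not Spark code. The pipe-delimited record layout, the `parse` and `is_valid` helpers, and the sample rows are all hypothetical.

```python
# Hypothetical raw point-of-sale lines: store|product|quantity|price
raw_lines = [
    "1001|gum|3|1.25",
    "  1002|duct tape|1|4.99  ",
    "bad record",
    "1003|paper clips|-2|0.99",
]


def parse(line):
    """Turn one delimited line into a record dict, or None if malformed."""
    fields = line.strip().split("|")
    if len(fields) != 4:
        return None
    store, product, qty, price = fields
    try:
        return {"store": store, "product": product,
                "qty": int(qty), "price": float(price)}
    except ValueError:
        return None


def is_valid(record):
    """Keep only well-formed records with a positive quantity."""
    return record is not None and record["qty"] > 0


# "Transformations": map and filter return lazy iterators, so nothing runs yet.
parsed = map(parse, raw_lines)
cleansed = filter(is_valid, parsed)

# "Action": materializing the result is what forces the work to happen.
result = list(cleansed)
for record in result:
    print(record)
```

In Python 3 the built-ins are lazy in the same spirit as Spark transformations: the pipeline is only described until something, here `list`, asks for the data.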
Another Tool
Enter Clojure
Why Clojure?
● We started out with Python
○ Good cultural fit
● Dynamic language
● Cross paradigm - OO, Functional
○ But…
● Lags behind Scala and Java Spark releases
● Only worked in YARN client mode
Enter Clojure
Why Clojure?
● Clojure is:
○ Dynamic Language
○ Built on JVM
● Can use just about any Java API you want
● Can optionally compile a Clojure app into a
Java Archive
○ (Only) Functional
● Comes with map, reduce, filter baked in
● Fns are first class objects
● Immutable data structures
● No generics
○ Great concurrency support
Clojure Syntax
● Maps: {}
○ {:weapon "chewing gum"
:outcome "boom"}
● Sequences/Lists: ()
○ '(1 "Mullet " " please")
● Vectors: []
○ ["Got " "duct tape?"]
● Functions: (fname arg0 arg1)
○ (catch-bad-guy-with "Paper Clip")
Developing in Spark
and Clojure
● Clojure comes with a shell called
the REPL (Read Eval Print Loop)
Developing in Spark
and Clojure
● We currently use the Sparkling
API
(https://gorillalabs.github.io/sparkling/)
○ Idiomatic wrapper around Spark
Java API
● May use Flambo in the future:
○ https://github.com/yieldbot/flambo
Developing in Spark
and Clojure
● Best editor for Clojure is Emacs
○ Cider plugin
● Integrated REPL
● Code completion
● In Clojure, Java, and REPL
● Easier to read errors
● Great article on how to set up Cider:
http://www.braveclojure.com/using-emacs-with-clojure/
Developing in Spark
and Clojure
● Build with Leiningen (or lein)
○ Project file written in Clojure
○ Provides integration with Maven and
Clojars repos
○ Runs unit tests
○ Generates uberjar
● Looking into potential use of Gradle
○ Better fit for continuous
integration/deployment
Spark ETL Architecture
1 - Cleansing 2 - Loading
Spark ETL Architecture
● Advantages
○ Lower risk
● Fits into existing process
○ Single Responsibility:
● Do cleansing, and do it well
○ Huge potential to improve performance
● Next Steps
○ Build out loading capability in Spark
Initial Results
● 70-million-record point-of-sale data set
○ Prod Cleansing Time:
1h 29m
○ Spark Cleansing Time:
1m 36s
○ How?
● Keys
○ YARN cluster mode
○ Number of executors
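The two cleansing times above can be turned into a speedup and a records-per-second rate. This sketch uses only the figures on the slide; the variable names are illustrative.

```python
# Times quoted on the slide, converted to seconds.
prod_seconds = 1 * 3600 + 29 * 60   # 1h 29m
spark_seconds = 1 * 60 + 36         # 1m 36s
records = 70_000_000

speedup = prod_seconds / spark_seconds
prod_rate = records / prod_seconds
spark_rate = records / spark_seconds

print(f"speedup: {speedup:.1f}x")
print(f"production: {prod_rate:,.0f} records/s")
print(f"spark: {spark_rate:,.0f} records/s")
```

That is a reduction of more than fifty-fold in cleansing time for the same data set.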
Questions?
● Contact Info:
○ Jared Holmberg
● jared.holmberg@orchestro.com
● http://www.orchestro.com
Thank You!


Editor's Notes

  1. Scenario: Let’s say that the records coming from Walmart are too numerous to fit in one file, so they have to be pulled down in chunks, maybe 100+ files for the whole set. JnJ has an expectation that loads and analytics will be completed at night so that the system is available for ad hoc Business Intelligence reports. On the current system, that