How MacGyver Learned to Leave Duct Tape
Behind and Use Spark Instead
April 22, 2015
DC Spark Interactive Meetup | 1
Agenda
• MacGyver Who?
• Complex Data Problem
• Current Architecture
• New Tools in MacGyver’s box
● Spark Architecture
● Initial Results
• Q&A
MacGyver Who?
MacGyver Trivia
● Answer these questions 3:
  ○ What was the name of the actor who played the role of MacGyver?
  ○ What other series is this actor best known for?
  ○ Was there another actor who was in both MacGyver and the other series?
  ○ OR
  ○ What’s MacGyver’s first name?
If MacGyver Were a Coder ….
● Suppose he retired in 1996 from the Phoenix Foundation and became a Software Engineer
● He’s given the assignment in 2004 to build a new ETL platform.
● What would the architecture look like?
ETL circa 2000 - Present
[Image] = SQL
[Image] = Oracle DB, SQL Server, MySQL, PostgreSQL
ETL Architecture
MacGyver at Orchestro
● If he worked at Orchestro ….
  ○ He might find a lot of:
  ○ But he might find cases where there are problems that need additional tools.
Maybe Orchestro doesn’t
have a BIG Data Problem
● Currently, our team deals with 15-20TB of data total
● In Matt Asay’s talk at QCon in 2014,
  ○ 64% of Big Data projects have < 100TB
● So maybe we don’t have a “Big Data” problem, but there’s a good chance we have a complex data problem
Complex Data Problem
Example
4,177 Stores
~600 products sold at 4,177 Stores
~2.4 million new sales records / day
But the supplier is a Category Captain, so there are > 70 million new sales records to analyze (including competitor data)
And that’s just for Smiley Mart!
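As a rough sanity check on these volumes, the sketch below multiplies the store and product counts from the slide. It assumes about one sales record per product per store per day, which is an assumption made for illustration and not something the slide states; the function and variable names are hypothetical.

```python
# Back-of-the-envelope estimate of daily sales records for one retailer.
# Assumption (not from the slide): one record per product per store per day.
STORES = 4_177              # from the slide
PRODUCTS_PER_STORE = 600    # "~600" on the slide, so only approximate


def daily_records(stores: int, products: int) -> int:
    """Records per day if every product reports once in every store."""
    return stores * products


supplier_records = daily_records(STORES, PRODUCTS_PER_STORE)
print(f"{supplier_records:,} records/day")  # 2,506,200 records/day
```

Under that assumption the estimate lands in the same ballpark as the ~2.4 million on the slide. The > 70 million figure for a Category Captain is then roughly 29 times larger, since it also covers competitor data.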
Current ETL Architecture
Standardized Text Delimited Format
Landing (Raw), Staging (Cleansed)
Stored in Snowflake Schema in EDW (around 1200 tables)
Drivers for Change
(in no particular order)
● Cost
  ○ SQL Server License = $$$
    ● ~$6k per core
  ○ DB Servers = $$$
    ● 64 Cores, 256GB RAM
  ○ DSAN = $$$
● Scalability
  ○ This model relies on vertical scaling
Drivers for Change
(cont.)
● Performance
  ○ Cleansing, Loading, Analytics, Reports only getting more complex
    ● Which increases time to complete each
New Tools in
MacGyver’s Toolbox
“A paperclip can be a wondrous thing. More times than I can remember, one of these has gotten me out of a tight spot.”
Enter Spark
Why Spark?
● Performance
  ○ 100 TB unsorted data
  ○ Previous Record achieved
    ● 2100 Node Hadoop cluster at Yahoo!
    ● Completed in 73 min, 1.42 TB/min
  ○ Spark
    ● 206 Nodes
    ● 23 min, 4.27 TB/min
    ● 1 PB, 190 Nodes, 234 min, 4.27 TB/min - previously unattainable
  ○ https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
  ○ Fairly easy to tune (will show later)
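To put the sort figures on a common footing, the following sketch divides each quoted cluster-wide rate by its node count. It uses only the numbers on the slide; the `SortRun` class and the decimal TB-to-GB conversion are assumptions made for illustration, and the two clusters did not run on identical hardware, so the per-node comparison is only rough.

```python
# Per-node sort throughput, using only the figures quoted on the slide.
from dataclasses import dataclass


@dataclass(frozen=True)
class SortRun:
    name: str
    nodes: int
    tb_per_min: float  # cluster-wide rate as quoted on the slide

    def gb_per_min_per_node(self) -> float:
        # 1 TB is treated as 1000 GB here (decimal units, an assumption).
        return self.tb_per_min * 1000 / self.nodes


hadoop = SortRun("Hadoop (Yahoo!)", nodes=2100, tb_per_min=1.42)
spark = SortRun("Spark, 100 TB", nodes=206, tb_per_min=4.27)

for run in (hadoop, spark):
    print(f"{run.name}: {run.gb_per_min_per_node():.2f} GB/min per node")

per_node_ratio = spark.gb_per_min_per_node() / hadoop.gb_per_min_per_node()
print(f"Spark per-node advantage: about {per_node_ratio:.0f}x")
```

On these figures each Spark node moved roughly 30 times as much data per minute as each node in the earlier record, which is the substance behind "gets more done with fewer nodes".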
Enter Spark
Why Spark (cont.)?
● Operating Cost
  ○ Open Source (Apache Licensed)
  ○ Gets more done with fewer nodes
  ○ Memory less expensive nowadays
  ○ Runs on commodity hardware
  ○ Predictable projection for growth
    ● Hardware costs grow with customer base
    ● Add memory to node
    ● When memory maxed out, add node to cluster
Enter Spark
Why Spark (cont.)?
● Multi-Faceted, Simplified API
  ○ Map/Reduce can often be completed as a one liner
  ○ Functional, immutable API
    ● Easy to keep concepts in your head
    ● Transformations - abstract
    ● Actions - concrete
  ○ ETL generally only needs Map and Filter
  ○ Multi-language APIs
    ● Scala, Java, Python
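To make the transformation/action split concrete, here is a minimal sketch in plain Python (no Spark required) of a cleansing step built from only map and filter. The record layout, field names, and sample rows are invented for illustration. Python's lazy `map` and `filter` iterators stand in for transformations, and `list(...)` plays the role of an action; in PySpark the analogous calls would be `rdd.map`, `rdd.filter`, and `rdd.collect`.

```python
# Cleansing pipe-delimited sales rows with only map and filter.
# The layout (store_id|product_id|units) and the rows are hypothetical.
raw_rows = [
    "0042| SKU-1 |3",
    "0042|SKU-2|",        # missing units: rejected
    "bad row",            # wrong field count: rejected
    "0107|SKU-1|12",
]


def parse(row: str) -> list:
    """Split a row into whitespace-trimmed fields."""
    return [field.strip() for field in row.split("|")]


def is_valid(fields: list) -> bool:
    """Keep rows with exactly three fields and a numeric units column."""
    return len(fields) == 3 and fields[2].isdigit()


def to_record(fields: list) -> dict:
    """Turn validated fields into a typed record."""
    return {"store": fields[0], "product": fields[1], "units": int(fields[2])}


# "Transformations": map and filter return lazy iterators, so nothing runs yet.
cleansed = map(to_record, filter(is_valid, map(parse, raw_rows)))

# "Action": materializing the result forces the whole pipeline to execute.
records = list(cleansed)
print(records)
```

Because each step is a pure function over immutable input rows, the same three functions could be handed to a distributed engine unchanged, which is what keeps the model easy to hold in your head.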
Another Tool
Enter Clojure
Why Clojure?
● We started out with Python
  ○ Good cultural fit
    ● Dynamic language
    ● Cross paradigm - OO, Functional
  ○ But…
    ● Lags behind Scala and Java Spark releases
    ● Only worked in YARN client mode
Enter Clojure
Why Clojure?
● Clojure is:
  ○ Dynamic Language
  ○ Built on JVM
    ● Can use just about any Java API you want
    ● Can optionally compile a Clojure app into a Java Archive
  ○ (Only) Functional
    ● Comes with map, reduce, filter baked in
    ● Fns are first class objects
    ● Immutable data structures
    ● No generics
  ○ Great concurrency support
Clojure Syntax
● Maps: {}
  ○ {:weapon "chewing gum" :outcome "boom"}
● Sequences/Lists: ()
  ○ '(1 "Mullet " " please")
● Vectors: []
  ○ ["Got " "duct tape?"]
● Functions: (fname arg0 arg1)
  ○ (catch-bad-guy-with "Paper Clip")
Developing in Spark
and Clojure
● Clojure comes with a shell called the REPL (Read Eval Print Loop)
Developing in Spark
and Clojure
● We currently use the Sparkling API (https://gorillalabs.github.io/sparkling/)
  ○ Idiomatic wrapper around Spark Java API
● May use Flambo in the future:
  ○ https://github.com/yieldbot/flambo
Developing in Spark
and Clojure
● Best editor for Clojure is Emacs
  ○ Cider plugin
    ● Integrated REPL
    ● Code completion
    ● In Clojure, Java, and REPL
    ● Easier to read errors
● Great article on how to set up Cider: http://www.braveclojure.com/using-emacs-with-clojure/
Developing in Spark
and Clojure
● Build with Leiningen (or lein)
  ○ Project file written in Clojure
  ○ Provides integration with Maven and Clojars repos
  ○ Runs unit tests
  ○ Generates uberjar
● Looking into potential use of Gradle
  ○ Better fit for continuous integration/deployment
Spark ETL Architecture
1 - Cleansing 2 - Loading
Spark ETL Architecture
● Advantages
  ○ Lower risk
    ● Fits into existing process
  ○ Single Responsibility:
    ● Do cleansing, and do it well
  ○ Huge potential to improve performance
● Next Steps
  ○ Build out loading capability in Spark
Initial Results
● 70 million record Point of Sale data set
  ○ Prod Cleansing Time: 1h 29m
  ○ Spark Cleansing Time: 1m 36s
  ○ How?
● Keys
  ○ YARN cluster mode
  ○ Num Executors
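The size of that improvement is easy to work out from the two times above. The sketch below does the arithmetic with the standard library; the variable names are hypothetical, and the record count is the rounded 70 million from the slide, so the throughput figure is approximate.

```python
# Speedup implied by the two cleansing times on the slide.
from datetime import timedelta

prod_time = timedelta(hours=1, minutes=29)     # production cleansing: 1h 29m
spark_time = timedelta(minutes=1, seconds=36)  # Spark cleansing: 1m 36s

# Dividing one timedelta by another yields a plain float ratio.
speedup = prod_time / spark_time
print(f"Spark run was about {speedup:.1f}x faster")  # about 55.6x

# Approximate throughput, using the rounded record count from the slide.
record_count = 70_000_000
records_per_second = record_count / spark_time.total_seconds()
print(f"{records_per_second:,.0f} records/s on Spark")
```

That is 5,340 seconds against 96 seconds, a reduction of a little over 55 times for the cleansing step alone.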
Questions?
● Contact Info:
  ○ Jared Holmberg
    ● jared.holmberg@orchestro.com
    ● http://www.orchestro.com
Thank You!

MacGyver Learns Spark

  • 1. How MacGyver Learned to Leave Duct Tape Behind and Use Spark Instead April 22, 2015 DC Spark Interactive Meetup | 1
  • 2. Agenda • MacGyver Who? • Complex Data Problem • Current Architecture • New Tools in MacGyver’s box â—Ź Spark Architecture â—Ź Initial Results • Q&A DC Spark Meetup | 2
  • 4. MacGyver Trivia â—Ź Answer these questions 3: â—‹ What was the name of the actor who played the role of MacGyver? â—‹ What other series is this actor best known for? â—‹ Was there another actor who was in both MacGyver and the other series? â—‹ OR â—‹ What’s MacGyver’s first name?
  • 5. If MacGyver Were a Coder …. â—Ź Suppose he retired in 1996 from the Phoenix Foundation and became a Software Engineer â—Ź He’s given the assignment in 2004 to build a new ETL platform. â—Ź What would the architecture look like?
  • 6. ETL circa 2000 - Present = SQL = Oracle DB SQL Server MySQL PostGreSQL
  • 8. MacGyver at Orchestro â—Ź If he worked at Orchestro …. â—‹ He might find a lot of : â—‹ But he might find cases where there are problems need additional tools.
  • 9. Maybe Orchestro doesn’t have a BIG Data Problem â—Ź Currently, our team deals with 15-20TB of data total â—Ź In Matt Asay’s talk at QCon in 2014, â—‹ 64% of Big Data projects have < 100TB â—Ź So maybe we don’t have a “Big Data” problem, but there’s a good chance we have a complex data problem
  • 11. Example 4,177 Stores ~600 products sold at 4,177 Stores ~ 2.4 million new sales recs / day But supplier is a Category Captain, > 70 million new sales recs to analyze (including competitor data) And that’s just for Smiley Mart!
  • 12. Current ETL Architecture Standardized Text Delimited Format Landing (Raw) Staging (Cleansed) Stored in Snowflake Schema in EDW (around 1200 tables)
  • 13. Drivers for Change (in no particular order) ● Cost ○ SQL Server License = $$$ ● ~$6k per core ○ DB Servers = $$$ ● 64 Cores, 256GB RAM ○ DSAN = $$$ ● Scalability ○ This model relies on vertical scaling
  • 14. Drivers for Change (cont.) ● Performance ○ Cleansing, Loading, Analytics, Reports only getting more complex ● Which increases time to complete each
  • 15. New Tools in MacGyver’s Toolbox “A paperclip can be a wondrous thing. More times than I can remember, one of these has gotten me out of a tight spot.”
  • 16. Enter Spark Why Spark? ● Performance ○ 100 TB unsorted data ○ Previous Record achieved ● 2100 Node Hadoop cluster at Yahoo! ● Completed in 73 min, 1.42 TB/min ○ Spark ● 206 Nodes ● 23 min, 4.27 TB/min ● 1 PB, 190 Nodes, 234 min, 4.27 TB/min - previously unattainable ○ https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html ○ Fairly easy to tune (will show later)
  • 17. Enter Spark Why Spark (cont.)? ● Operating Cost ○ Open Source (Apache Licensed) ○ Gets more done with fewer nodes ○ Memory less expensive nowadays ○ Runs on commodity hardware ○ Predictable projection for growth ● Hardware costs grow with customer base ● Add memory to node ● When memory maxed out, add node to cluster
  • 18. Enter Spark Why Spark (cont.)? ● Multi-Faceted, Simplified API ○ Map/Reduce can often be completed as a one liner ○ Functional, immutable API ● Easy to keep concepts in your head ● Transformations - abstract ● Actions - concrete ○ ETL generally only needs Map and Filter ○ Multi-language APIs ● Scala, Java, Python
  • 20. Enter Clojure Why Clojure? ● We started out with Python ○ Good cultural fit ● Dynamic language ● Cross paradigm - OO, Functional ○ But… ● Lags behind Scala and Java Spark releases ● Only worked in YARN client mode
  • 21. Enter Clojure Why Clojure? ● Clojure is: ○ Dynamic Language ○ Built on JVM ● Can use just about any Java API you want ● Can optionally compile a Clojure app into a Java Archive ○ (Only) Functional ● Comes with map, reduce, filter baked in ● Fns are first class objects ● Immutable data structures ● No generics ○ Great concurrency support
  • 22. Clojure Syntax ● Maps: {} ○ {:weapon "chewing gum" :outcome "boom"} ● Sequences/Lists: () ○ '(1 "Mullet " " please") ● Vectors: [] ○ ["Got " "duct tape?"] ● Functions: (fname arg0 arg1) ○ (catch-bad-guy-with "Paper Clip")
  • 23. Developing in Spark and Clojure ● Clojure comes with a shell called the REPL (Read Eval Print Loop)
  • 24. Developing in Spark and Clojure ● We currently use the Sparkling API (https://gorillalabs.github.io/sparkling/) ○ Idiomatic wrapper around Spark Java API ● May use Flambo in the future: ○ https://github.com/yieldbot/flambo
  • 25. Developing in Spark and Clojure ● Best editor for Clojure is Emacs ○ Cider plugin ● Integrated REPL ● Code completion ● In Clojure, Java, and REPL ● Easier to read errors ● Great article on how to set up Cider http://www.braveclojure.com/using-emacs-with-clojure/
  • 26. Developing in Spark and Clojure ● Build with Leiningen (or lein) ○ Project file written in Clojure ○ Provides integration with Maven and Clojars repos ○ Runs unit tests ○ Generates uberjar ● Looking into potential use of Gradle ○ Better fit for continuous integration/deployment
  • 27. Spark ETL Architecture 1 - Cleansing 2 - Loading
  • 28. Spark ETL Architecture ● Advantages ○ Lower risk ● Fits into existing process ○ Single Responsibility: ● Do cleansing, and do it well ○ Huge potential to improve performance ● Next Steps ○ Build out loading capability in Spark
  • 29. Initial Results ● 70 million record Point of Sale data set ○ Prod Cleansing Time: 1h 29m ○ Spark Cleansing Time: 1m 36s ○ How? ● Keys ○ YARN Cluster mode ○ Num Executors
  • 30. Questions? ● Contact Info: ○ Jared Holmberg ● jared.holmberg@orchestro.com ● http://www.orchestro.com Thank You!
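
The Initial Results on slide 29 can be turned into a speedup and a records-per-second rate with a little arithmetic. The sketch below is only an illustration in plain Python, not code from the talk: the helper `to_seconds` and the variable names are hypothetical, and it assumes the data set holds exactly 70 million records, which the slides give only as an approximate figure.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert an hours/minutes/seconds duration to whole seconds."""
    return hours * 3600 + minutes * 60 + seconds


# Assumption: slide 29's record count, taken as exactly 70 million.
RECORDS = 70_000_000

prod_s = to_seconds(hours=1, minutes=29)     # Prod Cleansing Time: 1h 29m
spark_s = to_seconds(minutes=1, seconds=36)  # Spark Cleansing Time: 1m 36s

speedup = prod_s / spark_s       # how many times faster the Spark run was
prod_rate = RECORDS / prod_s     # records cleansed per second, current system
spark_rate = RECORDS / spark_s   # records cleansed per second, Spark

print(f"Prod:  {prod_s} s, {prod_rate:,.0f} records/s")
print(f"Spark: {spark_s} s, {spark_rate:,.0f} records/s")
print(f"Speedup: {speedup:.1f}x")
```

On the slide's numbers, 5,340 seconds against 96 seconds works out to a speedup of roughly 56x. The per-second rates are only as accurate as the 70 million record assumption.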

Editor's Notes

  1. Scenario: Let’s say that the records coming from Walmart are too many to fit in one file, so they have to be pulled down in chunks, maybe 100+ files for the whole set. JnJ has an expectation that loads and analytics will be completed at night so that the system is available for ad hoc Business Intelligence reports. On the current system, that