Spark Application Development
Made Fast and Easy
Lance Co Ting Keh
Machine Learning @ Box
Distributed ML Infrastructure
Go Blue Devils!
Shivnath Babu
Associate Professor @ Duke
Chief Scientist at Unravel Data Systems
R&D in Management of Data Systems
Genomics Telecom Geospatial
Applications Powered by Spark @ Box
• NLP
• Image Processing
• IoT
• Operational Intelligence
• Recommender Systems
• Fraud Detection
What’s so great about Spark?
From http://evinrevello.com/big-data-pocket-knife/
What’s so great about Spark?
From https://weminoredinfilm.files.wordpress.com/2014/04/59911951.jpg
Complexity
Spark Execution
sc.textFile(hdfsPath)
.map(parseInput)
.filter(subThreshold)
.reduceByKey(tallyCount)
.map(formatOutput)
.saveAsTextFile(outPath)
[Diagram: RDD0 → map → RDD1 → filter → RDD2 → reduceByKey → RDD3 → map → RDD4 → saveAsTextFile → HDFS; each dataset is shown with four partitions, part-0 to part-3]
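To make the chain concrete, here is a minimal plain-Python sketch of what such a pipeline computes. It does not use Spark. The helper names echo the slide, but their bodies, the `key,value` input format, and the threshold are assumptions made purely for illustration.

```python
THRESHOLD = 100  # hypothetical cutoff; the slide does not give one


def parse_input(line):
    """Assumed input format: one 'key,value' record per line."""
    key, value = line.split(",")
    return key, int(value)


def sub_threshold(pair):
    """Keep records whose value is below the assumed threshold."""
    return pair[1] < THRESHOLD


def tally_count(a, b):
    """Associative, commutative combiner, as reduceByKey expects."""
    return a + b


def format_output(pair):
    return f"{pair[0]}\t{pair[1]}"


def run_pipeline(lines):
    parsed = map(parse_input, lines)        # .map(parseInput)
    kept = filter(sub_threshold, parsed)    # .filter(subThreshold)
    totals = {}
    for key, value in kept:                 # .reduceByKey(tallyCount)
        totals[key] = tally_count(totals[key], value) if key in totals else value
    # .map(formatOutput); sorted only to make this sketch deterministic
    return sorted(format_output(p) for p in totals.items())


result = run_pipeline(["a,1", "b,250", "a,2", "b,3"])
print(result)
```

Like Spark transformations, Python's `map` and `filter` are lazy here: nothing is parsed until the loop that plays the role of `reduceByKey` starts consuming records.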
Spark Execution
[Diagram: the same pipeline and partitions as above, now divided into Stage 0 and Stage 1]
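In Spark, map and filter are narrow transformations that can be pipelined within a partition, whereas reduceByKey requires a shuffle, and the scheduler places a stage boundary at the shuffle. The plain-Python sketch below simulates that split over four partitions. The sample records, the stand-in map and filter logic, and the byte-sum partitioner are all assumptions, and no Spark API is involved.

```python
NUM_PARTITIONS = 4  # matches part-0 .. part-3 on the slide


def stage0(partition):
    """Narrow work (map + filter) runs per partition with no data movement."""
    pairs = ((word, 1) for word in partition)     # stand-in for map
    return [p for p in pairs if p[0] != ""]       # stand-in for filter


def shuffle(stage0_outputs):
    """Route every record for a given key to a single reducer partition."""
    buckets = [[] for _ in range(NUM_PARTITIONS)]
    for output in stage0_outputs:
        for key, value in output:
            # Deterministic stand-in for a hash partitioner.
            buckets[sum(key.encode()) % NUM_PARTITIONS].append((key, value))
    return buckets


def stage1(bucket):
    """reduceByKey plus the final map, again one task per partition."""
    totals = {}
    for key, value in bucket:
        totals[key] = totals.get(key, 0) + value
    return [f"{k}={v}" for k, v in sorted(totals.items())]


input_partitions = [["a", "b"], ["a", ""], ["c", "a"], ["b"]]
shuffled = shuffle([stage0(p) for p in input_partitions])
output_partitions = [stage1(b) for b in shuffled]
merged = sorted(line for part in output_partitions for line in part)
print(output_partitions)
```

The point of the sketch is that Stage 1 cannot start reducing a key until every Stage 0 task has contributed its records for that key, which is why the shuffle separates the two stages.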
Spark Execution
[Diagram: the same pipeline with Stage 0 and Stage 1, with the partitions distributed across executors Exec0, Exec1, and Exec2]
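Each stage runs one task per partition, and those tasks are spread over the available executors. As a rough sketch, assume each of the three executors runs one task at a time and that tasks are handed out round-robin. Both are simplifications, since Spark's actual scheduler also considers data locality and the number of cores per executor. Under those assumptions, four partitions need two waves of tasks per stage:

```python
import math

EXECUTORS = ["Exec0", "Exec1", "Exec2"]          # names from the slide
PARTITIONS = [f"part-{i}" for i in range(4)]     # part-0 .. part-3


def assign_round_robin(partitions, executors):
    """Map each executor to the partitions it would process (simplified)."""
    plan = {e: [] for e in executors}
    for i, part in enumerate(partitions):
        plan[executors[i % len(executors)]].append(part)
    return plan


plan = assign_round_robin(PARTITIONS, EXECUTORS)
# With one task slot per executor, a stage needs ceil(tasks / slots) waves.
waves = math.ceil(len(PARTITIONS) / len(EXECUTORS))
print(plan, waves)
```

In this simplified model one executor handles two partitions while the others handle one each, so the stage finishes only when the busiest executor does.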
Anything that can go wrong,
will go wrong (at some point)
What can go wrong?
• Failures
  • My query failed after 6 hours!
  • What does this exception mean?
• Wrong results
  • Result of my job looks wrong
• Bad performance
  • My app is very slow
  • Pipeline is not meeting the 4hr SLA
• Poor scalability
  • Oh, but it worked on the dev cluster!
• Bad App(le)s
  • Tom’s query brought the cluster down
And Why?
• Application Problems
  • Poor choice of transformations
  • Ineffective caching
  • Bloated data structures
• Data/Storage Problems
  • Skewed data, load imbalance
  • Small files, poor data partitioning
• Spark Problems
  • Shuffle
  • Lazy evaluation causes confusion
• Resource Problems
  • Resource contention
  • Performance degradation
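Skewed data is one of the causes listed here that can be checked for directly. The sketch below, in plain Python rather than Spark, counts the records that land in each partition under a simple key-based partitioner and reports the ratio of the largest partition to the mean. The sample keys, the partitioner, and the function names are assumptions for illustration.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition."""
    # Byte-sum modulo is a deterministic stand-in for a hash partitioner.
    counts = Counter(sum(k.encode()) % num_partitions for k in keys)
    return [counts.get(i, 0) for i in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition divided by the mean size (1.0 means perfectly even)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0


keys = ["a"] * 9 + ["b", "c", "d"]   # one hot key dominates the data set
sizes = partition_sizes(keys, 4)
ratio = skew_ratio(sizes)
print(sizes, ratio)
```

A ratio well above 1.0 suggests that one task will process far more data than its peers, so the whole stage waits on that task.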
How do application developers
detect & fix these problems today?
Look at Logs?
Logs in distributed systems are spread out, incomplete,
& usually very difficult to understand
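One small step toward making scattered logs readable is to merge them into a single timeline. The following is a minimal sketch, assuming each executor's log is already sorted and each line begins with an ISO-8601 timestamp. The log lines are invented for illustration and do not reflect Spark's real log format.

```python
import heapq

# Hypothetical per-executor logs; real Spark log formats differ.
logs = {
    "Exec0": ["2015-01-01T10:00:01 task 0 started",
              "2015-01-01T10:00:05 task 0 finished"],
    "Exec1": ["2015-01-01T10:00:02 task 1 started",
              "2015-01-01T10:00:04 task 1 failed"],
}


def tagged(executor, lines):
    """Yield (timestamp, executor, message) tuples for one executor's log."""
    for line in lines:
        timestamp, message = line.split(" ", 1)
        yield timestamp, executor, message


# heapq.merge lazily merges already-sorted inputs into one sorted stream.
timeline = list(heapq.merge(*(tagged(e, lines) for e, lines in logs.items())))
for timestamp, executor, message in timeline:
    print(timestamp, executor, message)
```

This only addresses the "spread out" part of the problem; incomplete or hard-to-interpret entries still need the kind of analysis the following slides argue for.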
There has to be a better way for application developers to detect & fix problems
Visualize: Show me all relevant data in one place
Optimize: Analyze the data for me and give me diagnoses and fixes
Strategize: Help me prevent the problems from happening and meet my goals
For Hadoop
Visualize Optimize Strategize
Ambrose Yes
Lipstick Yes
ATS/Ambari Yes
Inviso Yes
Vaidya Yes
Dr. Elephant Yes
Starfish Yes Yes Yes
Unravel Yes Yes Yes
For Spark
Visualize Optimize Strategize
Spark UI Yes
Spark Debugger Yes
Sematext SPM Yes
Sparkling Yes
Unravel Yes Yes Yes
Demo

LanceShivnathHadoopSummit2015

Editor's Notes

  1. pull stages apart slide for Applications Powered by Spark @Box screenshots of Spark WebUI
  2. How does it all relate to my app
  3. Piece together the story and connect it back to app