LanceShivnathHadoopSummit2015

Spark Application Development
Made Fast and Easy

Lance Co Ting Keh
Machine Learning @ Box
Distributed ML Infrastructure
Go Blue Devils!

Shivnath Babu
Associate Professor @ Duke
Chief Scientist at Unravel Data Systems
R&D in Management of Data Systems

Applications Powered by Spark @ Box
• Genomics
• Telecom
• Geospatial
• NLP
• Image Processing
• IoT
• Operational Intelligence
• Recommender Systems
• Fraud Detection

What’s so great about Spark?
From http://evinrevello.com/big-data-pocket-knife/

What’s so great about Spark?
From: https://weminoredinfilm.files.wordpress.com/2014/04/59911951.jpg
Complexity

Spark Execution
sc.textFile(hdfsPath)
.map(parseInput)
.filter(subThreshold)
.reduceByKey(tallyCount)
.map(formatOutput)
.saveAsTextFile(outPath)
[Diagram: RDD0 -> map -> RDD1 -> filter -> RDD2 -> reduceByKey -> RDD3 -> map -> RDD4 -> saveAsTextFile -> HDFS; each column is drawn with four partitions, part-0 to part-3]
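The chain above can be mimicked in plain Python to show what each step does to the partitions. This is only a sketch, not Spark code: the slide names `parseInput`, `subThreshold`, `tallyCount`, and `formatOutput` but never defines them, so the bodies below, the `key,count` input format, the threshold of 100, and the CRC-based partitioner are all assumptions made for illustration.

```python
import zlib

# Hypothetical helpers: named on the slide, but their bodies are assumed here.
def parse_input(line):
    """Parse an assumed 'key,count' text line into a (key, int) pair."""
    key, count = line.split(",")
    return key.strip(), int(count)

def sub_threshold(pair, threshold=100):
    """Keep records whose count is below an assumed threshold."""
    return pair[1] < threshold

def tally_count(a, b):
    """Combine two counts that belong to the same key."""
    return a + b

def format_output(pair):
    """Render a (key, count) pair as a tab-separated line."""
    return f"{pair[0]}\t{pair[1]}"

def run_pipeline(partitions):
    """Mimic the slide's chain on a list of partitions (each a list of lines)."""
    n = len(partitions)
    # map and filter touch each partition independently: no data moves.
    rdd1 = [[parse_input(line) for line in part] for part in partitions]
    rdd2 = [[pair for pair in part if sub_threshold(pair)] for part in rdd1]
    # reduceByKey needs every value of a key in one place, so records are
    # redistributed by a deterministic hash of the key (the "shuffle").
    rdd3 = [dict() for _ in range(n)]
    for part in rdd2:
        for key, count in part:
            target = rdd3[zlib.crc32(key.encode()) % n]
            target[key] = tally_count(target[key], count) if key in target else count
    # The final map again works partition by partition.
    return [[format_output(pair) for pair in sorted(part.items())] for part in rdd3]

if __name__ == "__main__":
    for i, part in enumerate(run_pipeline([["a,1", "b,200"], ["a,2", "c,5"]])):
        print(f"part-{i}", part)
```

Only the `reduceByKey` step moves records between partitions; every other step keeps a record in the partition where it started.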

Spark Execution
[Diagram: the same six columns (RDD0 to RDD4, then HDFS), each with partitions part-0 to part-3, grouped into Stage 0 and Stage 1; the code labels shown are .map(parseInput) and .map(formatOutput)]
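The grouping into Stage 0 and Stage 1 follows from where the shuffle falls: Spark cuts a job into stages at shuffle boundaries, and `reduceByKey` is the only operation in this chain that needs one. A simplified sketch of that rule, assuming a hand-written set of shuffle operations rather than Spark's real dependency analysis:

```python
# Assumption: of the operations on the slide, only reduceByKey needs a shuffle.
SHUFFLE_OPS = {"reduceByKey"}

def split_into_stages(ops):
    """Group a linear chain of operations into stages, cutting at each shuffle."""
    stages = [[]]
    for op in ops:
        if op in SHUFFLE_OPS and stages[-1]:
            # The shuffle output is read by a new stage, so the operation
            # that consumes shuffled data opens the next group.
            stages.append([])
        stages[-1].append(op)
    return stages

if __name__ == "__main__":
    chain = ["textFile", "map", "filter", "reduceByKey", "map", "saveAsTextFile"]
    for i, stage in enumerate(split_into_stages(chain)):
        print(f"Stage {i}: {stage}")
```

Under this rule `.map(parseInput)` lands in Stage 0 and `.map(formatOutput)` in Stage 1, matching the labels on the slide.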

Spark Execution
[Diagram: six columns of partitions part-0 to part-3, grouped into Stage 0 and Stage 1; the code labels shown are .map(parseInput) and .map(formatOutput)]

Spark Execution
[Diagram: six columns of partitions part-0 to part-3, grouped into Stage 0 and Stage 1 with the code labels .map(parseInput) and .map(formatOutput), shown running on executors Exec0, Exec1, and Exec2]
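Each stage runs one task per partition, and those tasks are spread over the executors. The sketch below illustrates this with four partitions and three executors, as on the slide. The round-robin placement is an assumption made for simplicity; Spark's actual scheduler also weighs data locality and free cores.

```python
from itertools import cycle

def assign_tasks(num_stages, num_partitions, executors):
    """Place one task per (stage, partition) on executors in round-robin order."""
    placement = {name: [] for name in executors}
    for stage in range(num_stages):
        ring = cycle(executors)  # restart the rotation for every stage
        for partition in range(num_partitions):
            placement[next(ring)].append((stage, partition))
    return placement

if __name__ == "__main__":
    plan = assign_tasks(2, 4, ["Exec0", "Exec1", "Exec2"])
    for name, tasks in plan.items():
        print(name, tasks)
```

With four partitions on three executors, one executor always carries an extra task per stage, which is a small-scale version of the load imbalance discussed later in the talk.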

Anything that can go wrong,
will go wrong (at some point)

What can go wrong?
• Failures
  • My query failed after 6 hours!
  • What does this exception mean?
• Wrong results
  • Result of my job looks wrong
• Bad performance
  • My app is very slow
  • Pipeline is not meeting the 4-hour SLA
• Poor scalability
  • Oh, but it worked on the dev cluster!
• Bad App(le)s
  • Tom’s query brought the cluster down

And Why?
• Application Problems
  • Poor choice of transformations
  • Ineffective caching
  • Bloated data structures
• Data/Storage Problems
  • Skewed data, load imbalance
  • Small files, poor data partitioning
• Spark Problems
  • Shuffle
  • Lazy evaluation causes confusion
• Resource Problems
  • Resource contention
  • Performance degradation
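Skewed data is one of the causes listed above that can be checked with very little code once per-partition record counts are available. A minimal sketch, assuming the counts have already been collected into a list and that "skewed" means more than three times the median (both the function name and the factor are illustrative choices, not taken from the talk):

```python
from statistics import median

def find_skewed_partitions(sizes, factor=3.0):
    """Return indices of partitions whose record count exceeds factor * median."""
    typical = median(sizes)
    if typical <= 0:
        return []  # nothing meaningful to compare against
    return [i for i, size in enumerate(sizes) if size > factor * typical]

if __name__ == "__main__":
    record_counts = [100, 120, 90, 2000]  # made-up counts for part-0..part-3
    print(find_skewed_partitions(record_counts))
```

The median is used instead of the mean so that a single very large partition does not raise the baseline it is compared against.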

How do application developers
detect & fix these problems today?

Look at Logs?
Logs in distributed systems are spread out, incomplete,
& usually very difficult to understand

There has to be a better way for application
developers to
detect & fix problems

Visualize:
Show me all relevant data in one place

Optimize:
Analyze the data for me and give me diagnoses
and fixes

Strategize:
Help me prevent the problems from happening
and meet my goals

For Hadoop
Visualize Optimize Strategize
Ambrose Yes
Lipstick Yes
ATS/Ambari Yes
Inviso Yes
Vaidya Yes
Dr. Elephant Yes
Starfish Yes Yes Yes
Unravel Yes Yes Yes

For Spark
Visualize Optimize Strategize
Spark UI Yes
Spark Debugger Yes
Sematext SPM Yes
Sparkling Yes
Unravel Yes Yes Yes
