Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Spark Application Development Made Easy

1,095 views

Published on

Hadoop Summit 2015

Published in: Technology
  • Be the first to comment

Spark Application Development Made Easy

  1. 1. Lance Co Ting Keh Machine Learning @ Box Distributed ML Infrastructure Go Blue Devils! Shivnath Babu Associate Professor @ Duke Chief Scientist at Unravel Data Systems R&D Management of Data Systems
  2. 2. Genomics Telecom Geospatial NLP Image Processing IoT Operational Intelligence Recommender Systems Fraud Detection
  3. 3. Applications Powered by Spark @ Box
  4. 4. What’s so great about Spark? From http://evinrevello.com/big-data-pocket-knife/
  5. 5. What’s so great about Spark? From:
  6. 6. https://weminoredinfilm.files.wordpress.com/2014/04/59911951.jpg Complexity
  7. 7. Spark Execution sc.textFile(hdfsPath) .map(parseInput) .filter(subThreshold) .reduceByKey(tallyCount) .map(formatOutput) .saveAsTextFile(outPath) part-0 part-1 part-2 part-3 part-0 part-1 part-2 part-3 map filter reduceBykey map saveAsTextFile part-0 part-1 part-2 part-3 part-0 part-1 part-2 part-3 part-0 part-1 part-2 part-3 part-0 part-1 part-2 part-3 RDD0 RDD1 RDD2 RDD3 RDD4 HDFS
  8. 8. Spark Execution sc.textFile(hdfsPath) .map(parseInput) .filter(subThreshold) .reduceByKey(tallyCount) .map(formatOutput) .saveAsTextFile(outPath) part-0 part-1 part-2 part-3 part-0 part-1 part-2 part-3 map filter reduceBykey map saveAsTextFile part-0 part-1 part-2 part-3 part-0 part-1 part-2 part-3 part-0 part-1 part-2 part-3 part-0 part-1 part-2 part-3 HDFS Stage 0 Stage 1 RDD0 RDD1 RDD2 RDD3 RDD4
  9. 9. part-0 part-1 part-2 part-3 part-0 part-1 part-2 part-3 map filter reduceBykey map saveAsTextFile part-0 part-1 part-2 part-3 part-0 part-1 part-2 part-3 part-0 part-1 part-2 part-3 part-0 part-1 part-2 part-3 RDD0 RDD1 RDD2 RDD3 RDD4 HDFS Stage 0 Stage 1 sc.textFile(hdfsPath) .map(parseInput) .filter(subThreshold) .reduceByKey(tallyCount) .map(formatOutput) .saveAsTextFile(outPath) Spark Execution
  10. 10. Spark Execution part-0 part-1 part-2 part-3 part-0 part-1 part-2 part-3 map filter reduceBykey map saveAsTextFile part-0 part-1 part-2 part-3 part-0 part-1 part-2 part-3 part-0 part-1 part-2 part-3 part-0 part-1 part-2 part-3 Exec0Exec1Exec2 RDD0 RDD1 RDD2 RDD3 RDD4 HDFS sc.textFile(hdfsPath) .map(parseInput) .filter(subThreshold) .reduceByKey(tallyCount) .map(formatOutput) .saveAsTextFile(outPath) Stage 0 Stage 1
  11. 11. Spark Execution part-0 part-1 part-2 part-3 part-0 part-1 part-2 part-3 map filter reduceBykey map saveAsTextFile part-0 part-1 part-2 part-3 part-0 part-1 part-2 part-3 part-0 part-1 part-2 part-3 part-0 part-1 part-2 part-3 Exec0Exec1Exec2 RDD0 RDD1 RDD2 RDD3 RDD4 HDFS sc.textFile(hdfsPath) .map(parseInput) .filter(subThreshold) .reduceByKey(tallyCount) .map(formatOutput) .saveAsTextFile(outPath) Stage 0 Stage 1
  12. 12. Spark Execution part-0 part-1 part-2 part-3 part-0 part-1 part-2 part-3 map filter reduceBykey map saveAsTextFile part-0 part-1 part-2 part-3 part-0 part-1 part-2 part-3 part-0 part-1 part-2 part-3 part-0 part-1 part-2 part-3 Exec0Exec1Exec2 RDD0 RDD1 RDD2 RDD3 RDD4 HDFS sc.textFile(hdfsPath) .map(parseInput) .filter(subThreshold) .reduceByKey(tallyCount) .map(formatOutput) .saveAsTextFile(outPath) Stage 0 Stage 1
  13. 13. Anything that can go wrong, will go wrong (at some point)
  14. 14. What can go wrong? • Failures • My query failed after 6 hours! • What does this exception mean? • Wrong results • Result of my job looks wrong • Bad performance • My app is very slow • Pipeline is not meeting the 4hr SLA • Poor scalability • Oh, but it worked on the dev cluster! • Bad App(le)s • Tom’s query brought the cluster down • Application Problems • Poor choice of transformations • Ineffective caching • Bloated data structures • Data/Storage Problems • Skewed data, load imbalance • Small files, poor data partitioning • Spark Problems • Shuffle • Lazy evaluation causes confusion • Resources Problems • Resource contention • Performance degradation And Why?
  15. 15. How do application developers detect & fix these problems today?
  16. 16. A Look at Logs Logs in distributed systems are spread out, incomplete, & usually very difficult to understand
  17. 17. There has to be a better way for application developers to detect & fix problems
  18. 18. Visualize: Show me all relevant data in one place
  19. 19. Optimize: Analyze the data for me and give me diagnoses and fixes
  20. 20. Strategize: Help me prevent the problems from happening and meet my goals
  21. 21. For Hadoop Visualize Optimize Strategize Ambrose Yes Lipstick Yes ATS /Ambari Yes Inviso Yes Vaidya Yes Dr. Elephant Yes Starfish Yes Yes Yes Unravel Yes Yes Yes
  22. 22. For Spark Visualize Optimize Strategize Spark UI Yes Spark Debugger Yes Sematext SPM Yes Sparkling Yes Unravel Yes Yes Yes
  23. 23. Demo

×