
December 2013 HUG: Spark at Yahoo!


Published in: Technology, Education


  1. HADOOP AND SPARK JOIN FORCES AT YAHOO
     Andy Feng (afeng@yahoo-inc.com), Distinguished Architect, Platforms, Yahoo
     Monday, December 2, 2013
  2. YAHOO 2012: TODAY MODULE http://visualize.yahoo.com/core/#
  3. YAHOO 2013: PERSONALIZED HOMEPAGE http://www.yahoo.com Mobile
  4. YAHOO 2013: PERSONALIZED PROPERTIES http://finance.yahoo.com Mobile
  5. YAHOO 2013: IMPROVED WEB SEARCH W/ VERTICAL CONTENT http://search.yahoo.com Mobile
  6. DATA SCIENCE AT SCALE Data Scientist
  7. I. CHALLENGE: SCIENCE
     ✦ Single model for all items in homepage stream
       ✴ Millions of items
       ✴ 1000’s of item/user features
         • Yahoo content categories
         • Wikipedia entity names
       ✴ Over 800 million users
     ✦ Objective function
       ✴ Relevance & user engagement
       ✴ Freshness & popularity
       ✴ Diversity
     ★ Algorithm exploration
       ✴ Logistic regression?
       ✴ Collaborative filtering?
       ✴ Decision trees?
       ✴ Hybrid?
  8. II. CHALLENGE: SPEED
     ✦ Ex. Item CTR in Yahoo homepage Today Module
       * Short lifetimes
       * Temporal effect
       * Breaking news
     ✦ Models should be constructed hourly or faster
  9. III. CHALLENGE: SCALE
     ✦ 150 PB of data on Yahoo Hadoop clusters
       ✴ Yahoo data scientists need the data for
         ‣ Model building
         ‣ BI analytics
       ✴ Such datasets should be accessed efficiently
         ‣ Avoid latency caused by data movement
     ✦ 35,000 servers in Hadoop cluster
       ✴ Science projects need to leverage all these servers for computation
  10. SOLUTION: HADOOP + SPARK
      I. Science ... Spark API & MLlib ease development of ML algorithms
      II. Speed ... Spark reduces latency of model training via in-memory RDD etc.
      III. Scale ... YARN brings Hadoop datasets & servers to scientists’ fingertips
  11. PILOT PROJECT: E-COMMERCE
      Yahoo Taiwan Shopping & Auction
      ✦ Collaborative filtering algorithms for
        ✴ Viewed-also-viewed
        ✴ Bought-also-bought
        ✴ Bought-after-viewed
      ✦ 30 LOC in Spark/Scala
        ✴ 14 min. on 10 servers
          - Hadoop-based algorithm: 106 min.
  12. PILOT PROJECT: STREAM ADS
      ✦ A logistic regression algorithm
        ✴ 120 LOC in Spark/Scala
          ‣ Alternative: Vowpal Wabbit ... Difficult to extend
        ✴ 30 min. on model creation for 100M samples and 13K features with 30 iterations
      ✦ “I used Spark-on-Yarn package today, works great.” Amit (July 26, 2013 2:51 PM)
        ✴ Initial algorithm was launched within 2 hours after Spark-YARN package announcement
        ✴ Compare: Several weeks on system setup and data movement
  13. SUMMARY
      ✦ Spark plays an important role in machine learning at Yahoo
        ✴ Hadoop continues to be the core of our big-data platform
      ✦ Yahoo is excited about your continued contribution to Apache Spark
        ✴ 4 committers
        ✴ Ex. Spark-on-Yarn, Shark, security, scalability, operability etc.
  14. KEY TECHNOLOGY: SPARK ON YARN
      Thomas Graves (tgraves@yahoo-inc.com), Spark Committer & Hadoop PMC, Yahoo
  15. SPARK-ON-YARN: ROADMAP
      ✦ Spark-0.6 & 0.7 - Experimental support of YARN
        ✴ Hadoop 0.23.X and early Hadoop 2.X releases
      ✦ Spark-0.8.0 - Spark-on-Yarn merged into master Spark branch
      ✦ Spark-0.8.X - Future integration with Hadoop
      ✦ Spark-0.9.X - Client-mode introduced
        ✴ Secure HDFS access
        ✴ Use YARN approved directories
        ✴ Link Spark UI to YARN UI
        ✴ Add authentication to Spark
        ✴ Support running Spark on YARN from HDFS
        ✴ Support files/archives in Hadoop distributed cache
        ✴ Support Spark Shell on YARN
        ✴ Spark on YARN running on Hadoop 2.2.X
  16. ARCHITECTURE: STANDALONE MODE
  17. ARCHITECTURE: CLIENT MODE
  18. FUTURE DIRECTIONS
      ✦ Support long running
        ✴ Shark
        ✴ Spark Streaming
      ✦ Dynamic jobs resource allocation
      ✦ Integrate with Hadoop enhancements
        ✴ Generic History Server
        ✴ Preemption, etc.
  19. MORE INFO:
      ✦ http://spark.incubator.apache.org/docs/latest/running-on-yarn.html
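
To illustrate the viewed-also-viewed idea from the e-commerce pilot (slide 11), here is a minimal sketch in plain Python. It is not the 30-line Spark/Scala job from the talk, which the slides do not show. It assumes the simplest form of item-based collaborative filtering: count how often two items appear in the same viewing session, then rank each item's neighbours by that count. The function name `viewed_also_viewed`, the `top_n` parameter, and the sample sessions are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def viewed_also_viewed(sessions, top_n=2):
    """For each item, rank the other items by the number of sessions containing both.

    Assumption: plain co-occurrence counting, with ties broken alphabetically.
    """
    pair_counts = defaultdict(Counter)
    for items in sessions:
        # Deduplicate within a session so repeated views count only once.
        for a, b in combinations(sorted(set(items)), 2):
            pair_counts[a][b] += 1
            pair_counts[b][a] += 1

    recommendations = {}
    for item, counts in pair_counts.items():
        # Highest count first; item name as a deterministic tie-breaker.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        recommendations[item] = [other for other, _ in ranked[:top_n]]
    return recommendations


# Hypothetical viewing sessions, one list of item ids per user session.
sessions = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["phone", "charger", "headset"],
    ["case", "charger"],
]

recommendations = viewed_also_viewed(sessions)
for item in sorted(recommendations):
    print(item, "->", recommendations[item])
```

The same counting step would apply to bought-also-bought by feeding purchase baskets instead of view sessions. On a cluster, the per-session pair generation would presumably become a map step and the counting a reduce-by-key step, but that mapping is an assumption rather than something the slides state.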
