December 2013 HUG: Spark at Yahoo!
Transcript

  • 1. HADOOP AND SPARK JOIN FORCES AT YAHOO Andy Feng (afeng@yahoo-inc.com) Distinguished Architect, Platforms, Yahoo 1 Monday, December 2, 13
  • 2. YAHOO 2012: TODAY MODULE http://visualize.yahoo.com/core/#
  • 3. YAHOO 2013: PERSONALIZED HOMEPAGE http://www.yahoo.com Mobile
  • 4. YAHOO 2013: PERSONALIZED PROPERTIES http://finance.yahoo.com Mobile
  • 5. YAHOO 2013: IMPROVED WEB SEARCH W/ VERTICAL CONTENT http://search.yahoo.com Mobile
  • 6. DATA SCIENCE AT SCALE Data Scientist
  • 7. I. CHALLENGE: SCIENCE ✦ Single model for all items in homepage stream ✴ Millions of items ✴ 1000s of item/user features • Yahoo content categories • Wikipedia entity names ✴ Over 800 million users ✦ Objective function ✴ Relevance & user engagement ✴ Freshness & popularity ✴ Diversity ★ Algorithm exploration ✴ Logistic regression? ✴ Collaborative filtering? ✴ Decision trees? ✴ Hybrid?
  • 8. II. CHALLENGE: SPEED ✦ Ex. Item CTR in Yahoo homepage Today Module * Short lifetimes * Temporal effect * Breaking news ✦ Models should be constructed hourly or faster
  • 9. III. CHALLENGE: SCALE ✦ 150 PB of data on Yahoo Hadoop clusters ✴ Yahoo data scientists need the data for ‣ Model building ‣ BI analytics ✴ Such datasets should be accessed efficiently ‣ Avoid latency caused by data movement ✦ 35,000 servers in Hadoop cluster ✴ Science projects need to leverage all these servers for computation
  • 10. SOLUTION: HADOOP + SPARK I. Science ... Spark API & MLlib ease development of ML algorithms II. Speed ... Spark reduces latency of model training via in-memory RDDs, etc. III. Scale ... YARN puts Hadoop datasets & servers at scientists’ fingertips
  • 11. PILOT PROJECT: E-COMMERCE Yahoo Taiwan Shopping & Auction ✦ Collaborative filtering algorithms for ✴ Viewed-also-viewed ✴ Bought-also-bought ✴ Bought-after-viewed ✦ 30 LOC in Spark/Scala ✴ 14 min. on 10 servers - Hadoop-based algorithm: 106 min.
  • 12. PILOT PROJECT: STREAM ADS ✦ A logistic regression algorithm ✴ 120 LOC in Spark/Scala ‣ Alternative: Vowpal Wabbit ... Difficult to extend ✴ 30 min. for model creation on 100M samples and 13K features with 30 iterations ✦ “I used Spark-on-Yarn package today, works great.” Amit (July 26, 2013 2:51 PM) ✴ Initial algorithm was launched within 2 hours after Spark-YARN package announcement ✴ Compare: Several weeks on system setup and data movement
  • 13. SUMMARY ✦ Spark plays an important role in machine learning at Yahoo ✴ Hadoop continues to be the core of our big-data platform ✦ Yahoo is excited about your continued contribution to Apache Spark ✴ 4 committers ✴ ex. Spark-on-Yarn, Shark, security, scalability, operability, etc.
  • 14. KEY TECHNOLOGY: SPARK ON YARN Thomas Graves (tgraves@yahoo-inc.com) Spark Committer & Hadoop PMC Yahoo
  • 15. SPARK-ON-YARN: ROADMAP ✦ Spark-0.6 & 0.7 - Experimental support of YARN ✴ Hadoop 0.23.X and early Hadoop 2.X releases ✦ Spark-0.8.0 - Spark-on-Yarn merged into master Spark branch ✦ Spark-0.8.X - Future integration with Hadoop ✦ Spark-0.9.X - Client-mode introduced ✴ Secure HDFS access ✴ Use YARN approved directories ✴ Link Spark UI to YARN UI ✴ Add authentication to Spark ✴ Support running Spark on YARN from HDFS ✴ Support files/archives in Hadoop distributed cache ✴ Support Spark Shell on YARN ✴ Spark on YARN running on Hadoop 2.2.X
  • 16. ARCHITECTURE: STANDALONE MODE
  • 17. ARCHITECTURE: CLIENT MODE
  • 18. FUTURE DIRECTIONS ✦ Support long running ✴ Shark ✴ Spark Streaming ✦ Dynamic jobs resource allocation ✦ Integrate with Hadoop enhancements ✴ Generic History Server ✴ Preemption, etc.
  • 19. MORE INFO: ✦ http://spark.incubator.apache.org/docs/latest/running-on-yarn.html
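
Slide 11 names "viewed-also-viewed" collaborative filtering but shows no code. As a rough illustration of the general idea only, the following single-machine Python sketch ranks items by how often they share a session with a given item. It is not the 30-line Spark/Scala job from the talk; the function name `viewed_also_viewed`, the `top_k` parameter, and the toy session data are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def viewed_also_viewed(sessions, top_k=2):
    """For each item, list the items that most often appear in the same session.

    Assumption: `sessions` is an iterable of lists of item IDs, one per user session.
    """
    pair_counts = defaultdict(Counter)
    for items in sessions:
        # Count each unordered pair once per session, then record it both ways.
        for a, b in combinations(sorted(set(items)), 2):
            pair_counts[a][b] += 1
            pair_counts[b][a] += 1
    return {
        item: [other for other, _ in counts.most_common(top_k)]
        for item, counts in pair_counts.items()
    }


# Hypothetical view logs, invented for illustration.
sessions = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["phone", "charger"],
    ["case", "screen"],
]
recommendations = viewed_also_viewed(sessions)
```

The same counting scheme would presumably cover "bought-also-bought" by feeding purchase baskets instead of view sessions. A cluster version would distribute the pair counting across servers rather than hold it in one dictionary.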
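
Slide 12 describes logistic regression trained with 30 iterations. The sketch below shows one common way such iterative training works, using batch gradient descent in plain Python. It is an assumption-laden stand-in, not the 120-line Spark/Scala implementation from the talk; the learning rate, function names, and toy data are hypothetical, and only the iteration count of 30 is taken from the slide.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, x):
    """Predicted probability of the positive class for feature vector x."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def train_logistic_regression(samples, labels, iterations=30, learning_rate=0.5):
    """Batch gradient descent on the mean log-loss; one full data pass per iteration."""
    weights = [0.0] * len(samples[0])
    for _ in range(iterations):
        gradient = [0.0] * len(weights)
        for x, y in zip(samples, labels):
            error = predict(weights, x) - y
            for j, xi in enumerate(x):
                gradient[j] += error * xi
        weights = [
            w - learning_rate * g / len(samples)
            for w, g in zip(weights, gradient)
        ]
    return weights


# Toy data, invented for illustration: each sample is [bias term, one feature].
samples = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
labels = [0, 0, 1, 1]
weights = train_logistic_regression(samples, labels)
```

Each iteration rereads the whole dataset, which is the access pattern slide 10 credits in-memory RDDs with speeding up: the data can stay cached between passes rather than being reloaded every time.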