December 2013 HUG: Spark at Yahoo!

Transcript

  • 1. HADOOP AND SPARK JOIN FORCES AT YAHOO
    Andy Feng (afeng@yahoo-inc.com), Distinguished Architect, Platforms, Yahoo
    Monday, December 2, 13
  • 2. YAHOO 2012: TODAY MODULE http://visualize.yahoo.com/core/#
  • 3. YAHOO 2013: PERSONALIZED HOMEPAGE http://www.yahoo.com Mobile
  • 4. YAHOO 2013: PERSONALIZED PROPERTIES http://finance.yahoo.com Mobile
  • 5. YAHOO 2013: IMPROVED WEB SEARCH W/ VERTICAL CONTENT http://search.yahoo.com Mobile
  • 6. DATA SCIENCE AT SCALE Data Scientist
  • 7. I. CHALLENGE: SCIENCE
    ✦ Single model for all items in homepage stream
      ✴ Millions of items
      ✴ 1000’s of item/user features
        • Yahoo content categories
        • Wikipedia entity names
      ✴ Over 800 million users
    ✦ Objective function
      ✴ Relevance & user engagement
      ✴ Freshness & popularity
      ✴ Diversity
    ✦ Algorithm exploration
      ✴ Logistic regression?
      ✴ Collaborative filtering?
      ✴ Decision trees?
      ✴ Hybrid?
  • 8. II. CHALLENGE: SPEED
    ✦ Ex. Item CTR in Yahoo homepage Today Module
      * Short lifetimes
      * Temporal effect
      * Breaking news
    ✦ Models should be constructed hourly or faster
  • 9. III. CHALLENGE: SCALE
    ✦ 150 PB of data on Yahoo Hadoop clusters
      ✴ Yahoo data scientists need the data for
        ‣ Model building
        ‣ BI analytics
      ✴ Such datasets should be accessed efficiently
        ‣ Avoid latency caused by data movement
    ✦ 35,000 servers in Hadoop cluster
      ✴ Science projects need to leverage all these servers for computation
  • 10. SOLUTION: HADOOP + SPARK
    I. Science ... Spark API & MLlib ease development of ML algorithms
    II. Speed ... Spark reduces latency of model training via in-memory RDDs etc.
    III. Scale ... YARN brings Hadoop datasets & servers to scientists’ fingertips
  • 11. PILOT PROJECT: E-COMMERCE
    Yahoo Taiwan Shopping & Auction
    ✦ Collaborative filtering algorithms for
      ✴ Viewed-also-viewed
      ✴ Bought-also-bought
      ✴ Bought-after-viewed
    ✦ 30 LOC in Spark/Scala
      ✴ 14 min. on 10 servers
        - Hadoop-based algorithm: 106 min.
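
The "viewed-also-viewed" idea on slide 11 can be illustrated with a small item co-occurrence count. This is a minimal single-machine sketch in plain Python, not the 30-line Spark/Scala job the slide refers to; the function name `also_viewed`, the session data, and the item names are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def also_viewed(sessions, top_n=2):
    """For each item, list the items most often viewed in the same session."""
    pair_counts = defaultdict(Counter)
    for items in sessions:
        # Count each unordered pair once per session, in both directions.
        for a, b in combinations(sorted(set(items)), 2):
            pair_counts[a][b] += 1
            pair_counts[b][a] += 1
    # Rank by descending count, then by item name so ties are deterministic.
    return {
        item: [other for other, _ in
               sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]]
        for item, counts in pair_counts.items()
    }


# Hypothetical view sessions; the item names are made up for illustration.
sessions = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["phone", "charger", "cable"],
    ["case", "screen_film"],
]
recommendations = also_viewed(sessions)
print(recommendations["phone"])
```

The same counting pattern would presumably cover "bought-also-bought" by feeding purchase baskets instead of view sessions; "bought-after-viewed" would need ordered (view, purchase) pairs rather than unordered ones.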
  • 12. PILOT PROJECT: STREAM ADS
    ✦ A logistic regression algorithm
      ✴ 120 LOC in Spark/Scala
        ‣ Alternative: Vowpal Wabbit ... Difficult to extend
      ✴ 30 min. on model creation for 100M samples and 13K features with 30 iterations
    ✦ “I used Spark-on-Yarn package today, works great.” Amit (July 26, 2013 2:51 PM)
      ✴ Initial algorithm was launched within 2 hours after Spark-YARN package announcement
      ✴ Compare: Several weeks on system setup and data movement
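
For readers unfamiliar with the iterative training slide 12 mentions, the following is a rough sketch of logistic regression fitted by batch gradient descent for 30 iterations. It is plain Python on a toy dataset, not the 120-line Spark/Scala implementation; the learning rate, the data, and the function names are assumptions made for illustration. Each iteration makes a full pass over the samples, which is the access pattern that benefits from keeping the dataset in memory.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, x):
    """Predicted probability that the label of x is 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def train_logistic(samples, labels, iterations=30, learning_rate=0.5):
    """Fit weights with batch gradient descent on the log loss."""
    weights = [0.0] * len(samples[0])
    for _ in range(iterations):
        gradient = [0.0] * len(weights)
        # One full pass over the data per iteration.
        for x, y in zip(samples, labels):
            error = predict(weights, x) - y
            for j, xi in enumerate(x):
                gradient[j] += error * xi
        weights = [w - learning_rate * g / len(samples)
                   for w, g in zip(weights, gradient)]
    return weights


# Toy data (assumed): first feature is a constant bias term, second is the signal.
samples = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
labels = [0, 0, 1, 1]
weights = train_logistic(samples, labels)
print(predict(weights, [1.0, 2.0]), predict(weights, [1.0, -2.0]))
```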
  • 13. SUMMARY
    ✦ Spark plays an important role in machine learning at Yahoo
      ✴ Hadoop continues to be the core of our big-data platform
    ✦ Yahoo is excited about your continued contribution to Apache Spark
      ✴ 4 committers
      ✴ ex. Spark-on-Yarn, Shark, security, scalability, operability etc.
  • 14. KEY TECHNOLOGY: SPARK ON YARN
    Thomas Graves (tgraves@yahoo-inc.com), Spark Committer & Hadoop PMC, Yahoo
  • 15. SPARK-ON-YARN: ROADMAP
    ✦ Spark-0.6 & 0.7 - Experimental support of YARN
      ✴ Hadoop 0.23.X and early Hadoop 2.X releases
    ✦ Spark-0.8.0 - Spark-on-Yarn merged into master Spark branch
    ✦ Spark-0.8.X - Future integration with Hadoop
    ✦ Spark-0.9.X - Client-mode introduced
      ✴ Secure HDFS access
      ✴ Use YARN approved directories
      ✴ Link Spark UI to YARN UI
      ✴ Add authentication to Spark
      ✴ Support running Spark on YARN from HDFS
      ✴ Support files/archives in Hadoop distributed cache
      ✴ Support Spark Shell on YARN
      ✴ Spark on YARN running on Hadoop 2.2.X
  • 16. ARCHITECTURE: STANDALONE MODE
  • 17. ARCHITECTURE: CLIENT MODE
  • 18. FUTURE DIRECTIONS
    ✦ Support long running
      ✴ Shark
      ✴ Spark Streaming
    ✦ Dynamic jobs resource allocation
    ✦ Integrate with Hadoop enhancements
      ✴ Generic History Server
      ✴ Preemption, etc.
  • 19. MORE INFO:
    ✦ http://spark.incubator.apache.org/docs/latest/running-on-yarn.html