0
Spark and Hadoop at Yahoo:
Brought to you by YARN

Andy Feng
Yahoo! Hadoop
(afeng@yahoo-inc.com)
Personalized Web
Big-Data in Yahoo!

3

9/10/13
Hadoop + Spark:
Empowered by YARN

30k+ Yahoo! production nodes on YARN since Q1 2013
Shark Pilot: Advertising Data Analytics
§  Business questions
› 

Are two sets of audience cohorts similar to each other?...
Shark Perf: TCP-H Benchmark
Average
Seconds
600
500
400
300
200
100
0
Spark Pilot: Model Training Pipeline
§  A DAG of M/R jobs in Hadoop Streaming
› 

Feature extraction

› 

Train models

›...
Use Case: Ad Targeting

Spark

M/R and Storm

8

9/10/13
Use Case: Content Recommendation
w/ Collaborative Filtering
Input

CF Learning

Ranking

Spark

Spark

9

9/10/13

Output
Spark-YARN: Deployment Simplified
run spark.deploy.yarn.Client --jar … --class … --args …
--queue …--num-workers … --worke...
Acknowledgement
§  AMPLab team
› 

Outstanding collaboration: Ion, Matei, Reynold, Patrick, Matt, …

§  Yahoo! Hadoop te...
We Are Hiring!
http://careers.yahoo.com/
Upcoming SlideShare
Loading in...5
×

Yahoo spark

598

Published on

Published in: Technology
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
598
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
5
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide

Transcript of "Yahoo spark"

  1. 1. Spark and Hadoop at Yahoo: Brought to you by YARN Andy Feng Yahoo! Hadoop (afeng@yahoo-inc.com)
  2. 2. Personalized Web
  3. 3. Big-Data in Yahoo! 3 9/10/13
  4. 4. Hadoop + Spark: Empowered by YARN 30k+ Yahoo! production nodes on YARN since Q1 2013
  5. 5. Shark Pilot: Advertising Data Analytics §  Business questions ›  Are two sets of audience cohorts similar to each other? ›  What audience segment is most likely to be interested in this ad campaign? ›  In what way was the new front page rollout different than the previous front page as far as audience engagement goes? ›  What are the right metrics to define user engagement? §  Shark pilot ›  50 nodes, each w/ 96GB RAM •  Currently loaded w/ 3.2 TB sample data in memory ›  Homegrown BI tools for ad-hoc queries •  Using Shark Server (contributed to community by Yahoo!)
  6. 6. Shark Perf: TCP-H Benchmark Average Seconds 600 500 400 300 200 100 0
  7. 7. Spark Pilot: Model Training Pipeline §  A DAG of M/R jobs in Hadoop Streaming ›  Feature extraction ›  Train models ›  Score and analyze models §  Initial Spark prototype ›  3x speedup on feature extraction §  Production launch ›  Apply Spark against complete pipeline ›  Spark on 80 node cluster •  Thanks to the enhanced UI and metrics in Spark 0.8 7 9/10/13
  8. 8. Use Case: Ad Targeting Spark M/R and Storm 8 9/10/13
  9. 9. Use Case: Content Recommendation w/ Collaborative Filtering Input CF Learning Ranking Spark Spark 9 9/10/13 Output
  10. 10. Spark-YARN: Deployment Simplified run spark.deploy.yarn.Client --jar … --class … --args … --queue …--num-workers … --worker-memory … Spark-YARN (contributed by Yahoo!) is being adopted by community (ex. Taobao) for production use. You should try it on your Hadoop cluster. 10 9/10/13
  11. 11. Acknowledgement §  AMPLab team ›  Outstanding collaboration: Ion, Matei, Reynold, Patrick, Matt, … §  Yahoo! Hadoop team ›  Thomas, Bobby, Paul, Rajiv, Mithun, … §  Yahoo! Lab. ›  Mridul, Nathan, … §  Yahoo! data analytics ›  Supreeth, Ram, Tim, … §  Yahoo! spark users ›  Gavin, Jay, Hirakendu, … 11 9/10/13
  12. 12. We Are Hiring! http://careers.yahoo.com/
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×