Load Balancing: Brooklyn (DNS) directs users to their local datacenter
RSS Feeds: Feed-norm leverages Yahoo Traffic Server to normalize, cache, and proxy site feeds for Auto Apps
Image and Video Delivery: All images and thumbnails displayed on the page
A substantial part of the 20-25 billion objects YCS serves a day
Stats coming
Site thumbnails (Auto-apps)
These are the Metro applications generated from web sites that are added to the left column
Metro is currently storing about 220K thumbnails replicated on both US coasts
Usage is currently about 55K/second (heavily cached by YCS), growing 100% month over month
Attachment Store
Mail uses YMDB (MObStor precursor) to store 10TB of attachments
Search Index: Data mining to obtain the top-n user search queries
Ads Optimization: Ongoing refreshes to the ad ranking model for revenue optimization
Content Optimization: Computation of content-centric user profiles to get user segmentation
Model generation and refresh for content categorization
User-centric recommendation module
Machine Learning: Model creation for various purposes at Yahoo
Spam Filters: Utilizing co-occurrence and other data-intensive techniques for mail spam detection
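The Search Index item above (mining the top-n user search queries) follows a familiar map/reduce shape. The following is a minimal single-process sketch of that shape, not Yahoo's actual pipeline: the function names, the one-query-per-line log format, and the lowercase normalization are all assumptions made for illustration.

```python
from collections import Counter
import heapq


def map_phase(log_lines):
    """Map step: emit (query, 1) per log line.

    Assumes one raw query per line; lowercasing and trimming are
    illustrative normalization choices, not taken from the slides.
    """
    for line in log_lines:
        query = line.strip().lower()
        if query:
            yield query, 1


def reduce_phase(pairs):
    """Reduce step: sum the counts emitted for each query."""
    counts = Counter()
    for query, n in pairs:
        counts[query] += n
    return counts


def top_n(counts, n):
    """Return the n most frequent queries, ties broken alphabetically."""
    return heapq.nsmallest(n, counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    logs = ["hadoop", "Pig ", "hadoop", "hbase", "pig", "hadoop", ""]
    print(top_n(reduce_phase(map_phase(logs)), 2))
```

On a real cluster the map and reduce steps would run as distributed tasks and the final top-n selection would typically be a second, much smaller job over the aggregated counts.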
Lower cost ways of defending Hadoop resources from jobs?
More pluggable APIs
Scheduler, Block placement, logging
Various implementation choices for Map-Reduce
Push vs Pull, full sorted output?, reduce locality cases?
Map-Reduce-Reduce - Can MR be made more efficient for multiple MR jobs?
Load balancing tricks in HDFS (eliminate hot spots, slow nodes)
Block placement strategies
More failure domains (power? User Zones?)
Replication on every rack
Various RAID / parity approaches
Collocation of data (this is hard)
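As a starting point for experimenting with the block placement ideas above, here is a toy placement function that spreads replicas across failure domains (racks) and prefers lightly loaded nodes. It is a sketch under stated assumptions: the `(name, rack, used_blocks)` tuple layout and the function name are hypothetical, and this is not the policy HDFS itself implements.

```python
def place_replicas(nodes, replicas):
    """Pick `replicas` node names for one block.

    nodes: list of (name, rack, used_blocks) tuples (hypothetical layout).
    First pass takes the least-loaded node from each distinct rack, so
    replicas land in separate failure domains where possible. Second pass
    fills any remaining slots with the least-loaded unused nodes.
    """
    by_load = sorted(nodes, key=lambda node: (node[2], node[0]))
    chosen = []
    used_racks = set()

    for name, rack, _ in by_load:
        if len(chosen) == replicas:
            break
        if rack not in used_racks:
            chosen.append(name)
            used_racks.add(rack)

    for name, _, _ in by_load:
        if len(chosen) == replicas:
            break
        if name not in chosen:
            chosen.append(name)

    return chosen


if __name__ == "__main__":
    cluster = [("a1", "r1", 5), ("a2", "r1", 1), ("b1", "r2", 3), ("c1", "r3", 9)]
    print(place_replicas(cluster, 3))
```

Swapping the sort key (for example, to penalize slow nodes) or redefining what counts as a failure domain (power circuit, user zone) is the kind of variation the pluggable block placement API is meant to allow.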
Some Areas to Explore: Applications
Implementing standard algorithms (in Pig?)
Joins, aggregations, vector operations? ML primitives…
Programming model for Iterative computations
Machine learning does a lot of this, how can we enhance the framework to support ML? (beyond MPI)
Debugging & performance tools, UI etc
One of the easiest ways to have impact!
Log collection and management
Hadoop should be better at monitoring itself and user jobs
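To make the iterative-computation point concrete, the sketch below runs 1-D k-means as a driver loop around a map step and a reduce step. In plain Map-Reduce each pass of the loop would be a separate job that rereads its input, which is the overhead a better programming model would aim to remove. The algorithm choice, function names, and tolerance are assumptions for illustration only.

```python
from collections import defaultdict


def map_assign(points, centers):
    """Map step: emit (index of nearest center, point)."""
    for p in points:
        idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        yield idx, p


def reduce_mean(pairs, centers):
    """Reduce step: new center = mean of its assigned points.

    A center with no assigned points keeps its previous value.
    """
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    return [
        sum(groups[i]) / len(groups[i]) if groups[i] else centers[i]
        for i in range(len(centers))
    ]


def kmeans_1d(points, centers, max_iters=20, tol=1e-9):
    """Driver loop: one map/reduce round per iteration until convergence."""
    for iteration in range(1, max_iters + 1):
        new_centers = reduce_mean(map_assign(points, centers), centers)
        if all(abs(a - b) <= tol for a, b in zip(new_centers, centers)):
            return new_centers, iteration
        centers = new_centers
    return centers, max_iters


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
    print(kmeans_1d(data, [0.0, 5.0]))
```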
Some Areas to Explore: Pig
Memory Usage
Java provides poor models for managing RAM. This is key to Pig, Hive, HBase, the HDFS NN…
Automated Hadoop Tuning
Can Pig or Oozie or MR itself figure out how to configure Hadoop to best run a particular script / job?
RDBMS tricks
Cost based optimization – how does current RDBMS technology carry over to MR world?
Indices, materialized views, etc. – How do these traditional RDBMS tools fit into the MR world?
Build an optimizing compiler for Pig Latin, perhaps incorporating some database query optimization techniques
Use data layout information for query optimization in Pig
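One small example of what cost-based optimization could look like for Pig: choosing between a map-side (replicated) join and a reduce-side (shuffle) join from input size estimates. The sketch below is a toy cost rule, not how Pig's planner works; the function name, the byte-size inputs, and the memory budget parameter are all hypothetical.

```python
def choose_join_strategy(left_bytes, right_bytes, memory_budget_bytes):
    """Pick a join strategy from estimated input sizes.

    Returns ("replicated", side) when the smaller input is estimated to
    fit in each task's memory budget, so it can be loaded on every map
    task and the shuffle avoided. Otherwise returns ("shuffle", None).
    The single-threshold rule is an illustrative assumption.
    """
    small_side = "left" if left_bytes <= right_bytes else "right"
    small_bytes = min(left_bytes, right_bytes)
    if small_bytes <= memory_budget_bytes:
        return ("replicated", small_side)
    return ("shuffle", None)


if __name__ == "__main__":
    MB = 1024 * 1024
    # Large fact table joined with a small dimension table.
    print(choose_join_strategy(50_000 * MB, 40 * MB, 100 * MB))
    # Two large inputs: neither fits in memory.
    print(choose_join_strategy(50_000 * MB, 8_000 * MB, 100 * MB))
```

A real optimizer would need reliable size or cardinality statistics, which is where data layout information and RDBMS-style techniques would come in.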
Questions?
Eric Baldeschwieler
VP Hadoop Software Development
[email_address]
For more information:
http://hadoop.apache.org/
http://hadoop.yahoo.com/ (including job openings)