Apache Hadoop India Summit 2011 talk "Hadoop Avatar at eBay" by Srinivasan Rengarajan and Mohit Soni

Transcript of "Apache Hadoop India Summit 2011 talk "Hadoop Avatar at eBay" by Srinivasan Rengarajan and Mohit Soni"

  1. Avatar at eBay
     Srinivasan Rengarajan (srengarajan@ebay.com)
     Mohit Soni (mosoni@ebay.com)
     Courtesy: Anil Madan (amadan@ebay.com)
  2. 2007: Research team builds a 4-node cluster
       • Subset of click stream and EDW data
       • Innovation with Mobius Query Language
       • Visualization and click path analysis
     September 2009: Search clusters
       • Machine learning ranking cluster of 28 nodes
       • Search relevance cluster of 10 nodes
       • Subset of click stream and EDW data
     May 2010: Athena* exploratory cluster of 532 nodes
       • Platform teams join hands with Search/Research to build a larger cluster
       • Build it as a core competency for advanced insights into complex data
       • Rapid build-out, with timelines pulled in by a couple of months
     * Athena is the goddess of civilization, wisdom, strength, strategy, craft, justice and skill in Greek mythology. MIT's Athena ushered the world into a new era of distributed systems when it started in the mid-80s.
  3. Infrastructure
     • Enterprise nodes
         Sun 64-bit, Red Hat Linux
         2 quad-core Nehalem, 72GB RAM, 4TB
         Servers: NameNode(s), Job Tracker, Zookeeper, HBaseMaster, Ganglia Server, eBay (Cloudera) HUE
     • Data nodes
         SGI-Rackables, CentOS, 1U, 5.3PB
         2 quad-core Nehalem, 36GB RAM, 10TB
         HBase on 20 nodes
     • Network
         TOR 1Gbps
         Core switches uplink 40Gbps
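The 5.3PB figure on the Infrastructure slide is consistent with the 532-node cluster size from slide 2 and the 10TB per data node listed above. The following is a rough check, not a statement of how the cluster was actually provisioned. It assumes that nearly all 532 nodes are data nodes, and it uses a replication factor of 3 (the HDFS default), which the slides do not mention.

```python
# Figures taken from the slides.
TOTAL_NODES = 532            # Athena cluster size (slide 2)
DISK_PER_DATA_NODE_TB = 10   # disk per data node (slide 3)

# Assumption: HDFS default replication factor; the slides do not state it.
REPLICATION_FACTOR = 3


def raw_capacity_pb(nodes: int, disk_tb: float) -> float:
    """Raw disk capacity in decimal petabytes (1 PB = 1000 TB)."""
    return nodes * disk_tb / 1000


def usable_capacity_pb(raw_pb: float, replication: int) -> float:
    """Capacity left for unique data once every block is stored `replication` times."""
    return raw_pb / replication


raw = raw_capacity_pb(TOTAL_NODES, DISK_PER_DATA_NODE_TB)
usable = usable_capacity_pb(raw, REPLICATION_FACTOR)
print(f"raw: {raw:.2f} PB, usable at {REPLICATION_FACTOR}x replication: {usable:.2f} PB")
```

Under these assumptions the raw capacity comes out at 5.32 PB, which rounds to the 5.3PB on the slide.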
  4. Ecosystem
     • Monitoring & Alerting
         Ganglia, Nagios
     • Tools
         HUE/Mobius: lifecycle of user jobs
         UC4: scheduling
         Oozie: user workflows and data pipelines
         Mahout: data mining
     • Data Access Frameworks
         HBase: for EDW data
         Pig: data pipelines
         Hive: ad hoc queries
         MQL: Mobius Query Language
     • MapReduce
         Sourcing data primarily Java Applications using Perl, Scala, Python…
     Stack diagram labels: Monitoring & Alerting (Ganglia, Nagios); Tools & Libraries (HUE, UC4, Oozie, Mobius, Mahout); Data Access (HBase, Pig, Hive); MapReduce (Java, Streaming, Pipes, Scala); Hadoop Core (HDFS, Common)
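The Ecosystem slide lists Streaming as one of the MapReduce options and Python among the languages in use. As a minimal sketch of what a Streaming-style job could look like, the example below counts clicks per page. The record layout (tab-separated user id and page) and the sample data are hypothetical and do not come from the talk.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'page<TAB>1' for each click record of the form 'user<TAB>page'."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip malformed records
        yield f"{fields[1]}\t1"


def reducer(lines):
    """Sum the counts per key. Input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


# Local simulation of map -> sort (shuffle) -> reduce on hypothetical records.
sample = ["u1\t/item/42", "u2\t/search", "u3\t/item/42"]
shuffled = sorted(mapper(sample))
result = list(reducer(shuffled))
print(result)
```

In an actual Hadoop Streaming job, the mapper and reducer would each be a separate script reading from standard input and writing to standard output, and the framework would do the sorting between the two phases.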
  5. Administration
     • Groups
         Built to support multiple groups
         Job invocation uses the group name
     • Fair Scheduler
         Allocations based on investment
         Weights
         Minimum share of mappers and reducers
         poolMaxJobsDefault
         userMaxJobsDefault
         defaultMinSharePreemptionTimeout
         fairSharePreemptionTimeout
     • Auth & Auth
         HUE: custom module to use corporate credentials
         CLI*: custom PAM module
         Security*: implement token interface to replace Kerberos with SAML
     * Work in progress
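The Fair Scheduler settings named on the Administration slide are normally kept in an XML allocation file. The sketch below generates such a file, with one pool per group. The pool names and all numeric values are invented for illustration. The element names follow the Hadoop MapReduce (MR1) Fair Scheduler allocation-file format as generally documented, so they should be checked against the Hadoop version actually deployed.

```python
import xml.etree.ElementTree as ET


def build_allocations(pools, defaults):
    """Return a Fair Scheduler allocation file as an XML string.

    pools:    {pool_name: {element_name: value}} for per-pool settings
    defaults: {element_name: value} for cluster-wide defaults
    """
    root = ET.Element("allocations")
    for name, settings in pools.items():
        pool = ET.SubElement(root, "pool", {"name": name})
        for tag, value in settings.items():
            ET.SubElement(pool, tag).text = str(value)
    for tag, value in defaults.items():
        ET.SubElement(root, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical pools: weights and minimum shares stand in for "investment".
pools = {
    "search": {"minMaps": 200, "minReduces": 100, "weight": 2.0},
    "research": {"minMaps": 100, "minReduces": 50, "weight": 1.0},
}
# The four defaults named on the slide, with placeholder values.
defaults = {
    "poolMaxJobsDefault": 20,
    "userMaxJobsDefault": 5,
    "defaultMinSharePreemptionTimeout": 300,
    "fairSharePreemptionTimeout": 600,
}

allocations_xml = build_allocations(pools, defaults)
print(allocations_xml)
```

Generating the file from one table of groups is only one possible design. It keeps each group's weight and minimum share in a single place, which suits allocations that track investment.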
  6. Data Sourcing Patterns
     Diagram labels: Click Stream, Search Indices, EDW, Analytics Reporting, Description, Acquisition, Algorithmic Models, Images
  7. Search Use Case – Machine Learned Ranking
     Diagram labels: ClickStream, Items, Users, Feedback, Classifiers, Ranking Function, Great Search Results
     • Goal
         Enhance search relevance for eBay's items.
     • Hadoop Usage
         Build a ranking function that takes multiple factors into account, such as price, listing format, seller track record, and relevance.
         Ability to add new factors to validate hypotheses
  8. Research Use Case – Description Data Mining
     Example listing description:
         BARBIE
         1999 "PREMIERE NIGHT"
         Home Shopping Special Edition
         Gorgeous Doll With Beautiful Blond Hair / In A Gown Of Purple And Silver
         New / Never Removed From Box / Doll Is In Mint Condition / Remember This Beauty Is 11 Years Old
         Free Shipping To US Only / Will Ship International / Please E-mail For Cost
         Feel Free To Ask Me Any Questions Or Concerns
         Smoke - Free Environment
     Extracted name-value pairs:
         Free Shipping
         Year: 1999
         Model: premiere night
         Edition: home shopping special
         Hair: blond
         Gown: purple and silver
         Condition: new / never removed from box / mint
     • Goal
         Extend catalog coverage
     • Hadoop Usage
         Leverage data mining/machine learning techniques to turn inventory into name-value pairs in a completely unsupervised way
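For the Machine Learned Ranking use case on slide 7, the slides call for a ranking function that combines several factors and lets new factors be added to test a hypothesis. The sketch below shows one way such a function could be structured, as a weighted sum of pluggable factors. It is not eBay's implementation. The class, the item fields, and the weights are hypothetical. The weights here are hand-picked placeholders, whereas in a machine-learned setting they would be fitted from feedback data.

```python
class RankingFunction:
    """Weighted sum of named scoring factors (illustrative only)."""

    def __init__(self):
        self._factors = {}  # factor name -> (weight, scorer)

    def add_factor(self, name, weight, scorer):
        """Register a factor; scorer maps an item dict to a number."""
        self._factors[name] = (weight, scorer)

    def score(self, item):
        return sum(weight * scorer(item) for weight, scorer in self._factors.values())

    def rank(self, items):
        """Return items ordered from highest to lowest score."""
        return sorted(items, key=self.score, reverse=True)


# Hypothetical items and placeholder weights.
items = [
    {"id": "a", "relevance": 0.9, "price": 20.0, "seller_rating": 0.95},
    {"id": "b", "relevance": 0.8, "price": 10.0, "seller_rating": 0.99},
]

ranker = RankingFunction()
ranker.add_factor("relevance", 1.0, lambda item: item["relevance"])
ranker.add_factor("seller", 0.5, lambda item: item["seller_rating"])
before = [item["id"] for item in ranker.rank(items)]

# Hypothesis under test: cheaper items should rank somewhat higher.
ranker.add_factor("price", 0.5, lambda item: 1.0 / (1.0 + item["price"] / 10.0))
after = [item["id"] for item in ranker.rank(items)]

print(before, after)
```

Comparing the ordering before and after a factor is added is the kind of experiment the slide describes. On a cluster, the same comparison would be run over click-stream feedback instead of two hand-made items.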
  9. (no text transcribed for this slide)
  10. Acknowledgments
      • Athena Team
      • Cloudera Inc.
      • Community