2 hadoop@ebay-hug-2010-07-21


Published in: Technology
  • 07/22/10

    1. 1. @eBay: Anil Madan (amadan@ebay.com), Analytics Platform Development
    2. 2.
       • 2007: Research team builds a 4-node cluster
           • Subset of Click Stream and EDW data
           • Innovation with Mobius Query Language
           • Visualization and click-path analysis
       • September 2009: Search clusters
           • Machine Learning Ranking cluster of 28 nodes
           • Search relevance cluster of 10 nodes
           • Subset of Click Stream and EDW data
       • May 2010: Athena* exploratory cluster of 532 nodes
           • Platform teams join hands with Search/Research to build a larger cluster
           • Build it as a core competency for advanced insights into complex data
           • Rapid build-out, with timelines pulled in by a couple of months
           • * Athena is the goddess of civilization, wisdom, strength, strategy, craft, justice and skill in Greek mythology.
           • MIT's Athena ushered the world into a new era of distributed systems when it started in the mid-80s.
    3. 3. Infrastructure
       • Enterprise Nodes
           • Sun 64-bit, Red Hat Linux
           • 2 quad-core Nehalem, 72GB RAM, 4TB
           • Servers
               • NameNode(s)
               • JobTracker
               • ZooKeeper
               • HBase Master
               • Ganglia Server
               • eBay (Cloudera) HUE
       • Data Nodes
           • SGI-Rackables, CentOS, 1U, 5.3PB
           • 2 quad-core Nehalem, 36GB RAM, 10TB
           • HBase on 20 nodes
       • Network
           • TOR 1Gbps
           • Core switches uplink 40Gbps
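The capacity figure on this slide can be sanity-checked with a little arithmetic. The sketch below assumes that all 532 nodes of the Athena cluster are data nodes with 10TB each, which the slides do not confirm, and it assumes the HDFS default replication factor of 3, which the slides do not state. The function names are illustrative only.

```python
# Figures taken from the slides; their combination is an assumption.
DATA_NODES = 532        # Athena cluster size (assumed to be all data nodes)
DISK_PER_NODE_TB = 10   # disk per data node


def raw_capacity_pb(nodes, tb_per_node):
    """Raw disk capacity in decimal petabytes (1 PB = 1000 TB)."""
    return nodes * tb_per_node / 1000


def usable_capacity_pb(raw_pb, replication=3):
    """Capacity left after block replication (3 is the HDFS default, assumed here)."""
    return raw_pb / replication


raw = raw_capacity_pb(DATA_NODES, DISK_PER_NODE_TB)
print(f"raw: {raw:.2f} PB")                                   # raw: 5.32 PB
print(f"usable at 3x replication: {usable_capacity_pb(raw):.2f} PB")
```

Under these assumptions the raw figure comes to 5.32 PB, which is consistent with the 5.3PB quoted for the data nodes.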
    4. 4. Ecosystem
       • Diagram: Hadoop Core (HDFS, Common) | MapReduce (Java, Streaming, Pipes, Scala) | Data Access (HBase, Pig, Hive) | Tools & Libraries (HUE, UC4, Oozie, Mobius, Mahout) | Monitoring & Alerting (Ganglia, Nagios)
       • MapReduce
           • Sourcing data: primarily Java; applications using Perl, Scala, Python…
       • Data Access Frameworks
           • HBase: for EDW data
           • Pig: data pipelines
           • Hive: ad hoc queries
           • MQL: Mobius Query Language
       • Monitoring & Alerting
           • Ganglia, Nagios
       • Tools
           • HUE/Mobius: lifecycle of user jobs
           • UC4: scheduling
           • Oozie: user workflow and data pipelines
           • Mahout: data mining
    5. 5. Administration
       • Groups
           • Built to support multiple groups
           • Job invocation uses the group name
           • Fair Scheduler
               • Allocations based on investment
               • Weights
               • Minimum share of mappers and reducers
               • poolMaxJobsDefault
               • userMaxJobsDefault
               • defaultMinSharePreemptionTimeout
               • fairSharePreemptionTimeout
       • Authentication & Authorization
           • HUE: custom module to use corporate credentials
           • CLI*: custom PAM module
           • Security*: implement token interface to replace Kerberos with SAML
           • * Work in progress
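To make the interaction between weights and minimum shares concrete, here is a minimal Python sketch of weighted fair sharing with a guaranteed floor per pool. It is a simplified model, not the Fair Scheduler's own code: it ignores job demand, the per-pool and per-user job limits, and the preemption timeouts listed above. The pool names, weights and slot counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    weight: float   # relative share, e.g. proportional to a group's investment
    min_share: int  # guaranteed number of slots (mappers or reducers)


def fair_shares(pools, total_slots, iterations=100):
    """Return {pool name: slots}, where each share is max(min_share, weight * ratio).

    The ratio is found by bisection so that the shares add up to total_slots.
    """
    if any(p.weight <= 0 for p in pools):
        raise ValueError("weights must be positive")
    if sum(p.min_share for p in pools) > total_slots:
        raise ValueError("minimum shares exceed cluster capacity")

    def slots_used(ratio):
        return sum(max(p.min_share, p.weight * ratio) for p in pools)

    low, high = 0.0, 1.0
    while slots_used(high) < total_slots:  # grow the upper bound until it is enough
        high *= 2
    for _ in range(iterations):            # bisect towards the exact ratio
        mid = (low + high) / 2
        if slots_used(mid) < total_slots:
            low = mid
        else:
            high = mid
    return {p.name: max(p.min_share, p.weight * high) for p in pools}


# Hypothetical groups sharing 1000 map slots.
pools = [Pool("search", 2.0, 100), Pool("research", 1.0, 50), Pool("ads", 1.0, 400)]
shares = fair_shares(pools, 1000)
```

In this example the "ads" pool keeps its guaranteed 400 slots because its weight alone would earn it less, and the remaining 600 slots are split 2:1 between "search" and "research".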
    6. 6. Data Sourcing Patterns
       • Diagram: Click Stream, EDW, Images, Search Indices, Analytics, Reporting, Algorithmic Models
       • Table columns: Acquisition, Description, Source, Preparation, Format, Pattern
       • Click Stream
           • Session, Event, Session Container
           • Session/Event: streamed as LZO/Text
           • SessionContainer: generate SequenceFiles
           • Session/Event data: build an index and use LzoTextInputFormat for splits, based on the work done by Johan Oskarsson/Twitter
           • Session Container: 'Value to Type Conversion' pattern
           • Secondary sort with reduce-side join
       • EDW
           • Item, Transaction, User, Feedback, Bids
           • Streamed as GZIP/Text
           • Generate SequenceFile/HBase snapshot with the previous day's snapshot and the current day's data
           • Hive StorageHandlers to point to the SequenceFile/HBase snapshot
           • TotalOrderPartitioner with RandomSamplers to identify partition ranges for reducers
               • Create HBase regions using HFile
               • Update RegionServers using the Ruby script loadtable.rb
               • Concerns: HBase append performance, HFile flush (HBASE-1923)
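Two of the patterns named on this slide, secondary sort with a reduce-side join and sample-based total-order partitioning, can be illustrated outside Hadoop. The sketch below uses plain Python rather than the MapReduce API, so it shows the idea only and is not the deck's implementation. The record layouts, function names and sample data are assumptions made for illustration.

```python
import bisect
import random
from itertools import groupby

# --- Secondary sort with reduce-side join ---------------------------------
# Tag 0 marks the single session record, tag 1 marks its events. Sorting on
# the composite key (session id, tag, timestamp) means each group delivers
# the session record first, followed by its events in time order.


def reduce_side_join(sessions, events):
    """sessions: (session_id, info); events: (session_id, timestamp, name)."""
    tagged = [((sid, 0, 0), info) for sid, info in sessions]
    tagged += [((sid, 1, ts), name) for sid, ts, name in events]
    tagged.sort(key=lambda kv: kv[0])  # stands in for the shuffle's key sort
    joined = []
    # Group on the join key only, as a grouping comparator would.
    for sid, group in groupby(tagged, key=lambda kv: kv[0][0]):
        info = None
        for (_, tag, ts), value in group:
            if tag == 0:
                info = value
            elif info is not None:  # inner join: drop events with no session
                joined.append((sid, info, ts, value))
    return joined


# --- Total-order partitioning from a random sample -------------------------


def split_points(keys, num_reducers, sample_size, seed=0):
    """Pick num_reducers - 1 boundaries from a sorted random sample of keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_reducers
    return [sample[int(step * i)] for i in range(1, num_reducers)]


def partition(key, boundaries):
    """Reducer index for a key; larger keys never map to a smaller index."""
    return bisect.bisect_right(boundaries, key)


sessions = [("s1", "US"), ("s2", "DE")]
events = [("s1", 20, "bid"), ("s2", 5, "view"), ("s1", 10, "search"), ("s3", 1, "view")]
joined = reduce_side_join(sessions, events)

keys = list(range(1000))
boundaries = split_points(keys, num_reducers=4, sample_size=100)
```

Because `partition` is monotonic in the key, concatenating the sorted output of reducer 0, 1, 2 and so on yields one globally sorted result, which is the property that bulk-building ordered HBase regions relies on.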
    7. 7. Search Use Case – Machine Learned Ranking
       • Diagram: ClickStream, Items, Users, Feedback, Classifiers, Ranking Function, Great Search Results
       • Goal
           • Enhance search relevance for eBay's items
       • Hadoop Usage
           • Build a ranking function that takes multiple factors into account, such as price, listing format, seller track record and relevance
           • Ability to add new factors to validate hypotheses
    8. 8. Research Use Case – Description Data Mining
       • Goal
           • Extend catalog coverage
       • Hadoop Usage
           • Leverage data mining/machine learning techniques to turn inventory into name-value pairs in a completely unsupervised way
       • Example listing description
           • BARBIE 1999 "PREMIERE NIGHT" Home Shopping Special Edition Gorgeous Doll With Beautiful Blond Hair / In A Gown Of Purple And Silver New / Never Removed From Box / Doll Is In Mint Condition / Remember This Beauty Is 11 Years Old Free Shipping To US Only / Will Ship International / Please E-mail For Cost Feel Free To Ask Me Any Questions Or Concerns Smoke - Free Environment Free Shipping
       • Resulting name-value pairs
           • Year: 1999
           • Model: premiere night
           • Edition: home shopping special
           • Hair: blond
           • Gown: purple and silver
           • Condition: new / never removed from box / mint
    9. 9.
       • Platform
           • Metrics: Job Statistics, System/Disk Consumption, Utilization
           • Infrastructure: Publish/Subscribe ETL tools, low-latency data movement
           • Development: Tools, Environment, IDE
           • Architecture: Schemas, Metadata, Governance, Policies
           • Operations: Administration, Configuration, Monitoring
           • Reporting: Visualization, BI Generation, Information delivery
           • Security: User & Group Management, Authentication & Authorization
       • Clusters
           • Exploratory: Strategic investment, 1000-5000 nodes
           • Production: Site facing, low latency, high availability
           • Use Case Specific: Advertising, Trust & Safety, Merchandizing
    10. 10. Acknowledgments
        • Athena Team
        • Cloudera Inc.
        • Community