2 hadoop@e bay-hug-2010-07-21


Published in: Technology
  • 07/22/10
  • Transcript

    1. Hadoop @eBay
       • Anil Madan, Analytics Platform Development
    2.
       • 2007: Research team builds a 4-node cluster
           • Subset of click stream and EDW data
           • Innovation with Mobius Query Language
           • Visualization and click path analysis
       • September 2009: Search clusters
           • Machine learning ranking cluster of 28 nodes
           • Search relevance cluster of 10 nodes
           • Subset of click stream and EDW data
       • May 2010: Athena* exploratory cluster of 532 nodes
           • Platform teams join hands with Search/Research to build a larger cluster
           • Build it as a core competency for advanced insights into complex data
           • Rapid build-out, with timelines pulled in by a couple of months
           • * Athena is the goddess of civilization, wisdom, strength, strategy, craft, justice, and skill in Greek mythology
           • MIT's Athena ushered the world into a new era of distributed systems when it started in the mid-1980s
    3. Infrastructure
       • Enterprise Nodes
           • Sun 64-bit, Red Hat Linux
           • 2 quad-core Nehalem, 72 GB RAM, 4 TB
           • Servers
               • NameNode(s)
               • JobTracker
               • ZooKeeper
               • HBase Master
               • Ganglia server
               • eBay (Cloudera) HUE
       • Data Nodes
           • SGI Rackables, CentOS, 1U, 5.3 PB
           • 2 quad-core Nehalem, 36 GB RAM, 10 TB
           • HBase on 20 nodes
       • Network
           • Top-of-rack (TOR) switches: 1 Gbps
           • Core switch uplinks: 40 Gbps
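The data-node figures on this slide can be cross-checked with a little arithmetic. The sketch below is illustrative only: it assumes that all 532 nodes of the Athena cluster are counted as 10 TB data nodes, that capacities use decimal units, and that HDFS runs with its default replication factor of 3. The slides state none of these, so the usable figure in particular is only an estimate.

```python
# Rough capacity check for the data-node tier.
# Assumptions (not stated on the slide): all 532 Athena nodes are data nodes,
# decimal units (1 PB = 1000 TB), and the HDFS default replication factor of 3.
DATA_NODES = 532
DISK_PER_NODE_TB = 10
TB_PER_PB = 1000


def raw_capacity_pb(nodes: int, tb_per_node: float) -> float:
    """Total raw disk across the cluster, in petabytes."""
    return nodes * tb_per_node / TB_PER_PB


def usable_capacity_pb(raw_pb: float, replication: int = 3) -> float:
    """Approximate capacity left for unique data once every block is replicated."""
    return raw_pb / replication


raw = raw_capacity_pb(DATA_NODES, DISK_PER_NODE_TB)
print(f"raw: {raw:.2f} PB, usable at 3x replication: {usable_capacity_pb(raw):.2f} PB")
```

Under these assumptions the raw total rounds to the 5.3 PB quoted on the slide.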
    4. Ecosystem
       • Stack diagram
           • Hadoop Core (HDFS, Common)
           • MapReduce (Java, Streaming, Pipes, Scala)
           • Data Access (HBase, Pig, Hive)
           • Tools & Libraries (HUE, UC4, Oozie, Mobius, Mahout)
           • Monitoring & Alerting (Ganglia, Nagios)
       • MapReduce
           • Sourcing data: primarily Java
           • Applications using Perl, Scala, Python…
       • Data Access Frameworks
           • HBase: EDW data
           • Pig: data pipelines
           • Hive: ad hoc queries
           • MQL: Mobius Query Language
       • Monitoring & Alerting
           • Ganglia, Nagios
       • Tools
           • HUE/Mobius: lifecycle of user jobs
           • UC4: scheduling
           • Oozie: user workflows and data pipelines
           • Mahout: data mining
    5. Administration
       • Groups
           • Built to support multiple groups
           • Job invocation uses the group name
           • Fair Scheduler
               • Allocations based on investment
               • Weights
               • Minimum share of mappers and reducers
               • poolMaxJobsDefault
               • userMaxJobsDefault
               • defaultMinSharePreemptionTimeout
               • fairSharePreemptionTimeout
       • Authentication & Authorization
           • HUE: custom module to use corporate credentials
           • CLI*: custom PAM module
           • Security*: implement token interface to replace Kerberos with SAML
           • * Work in progress
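To illustrate how weights and minimum shares interact when slots are divided among groups, here is a minimal sketch. It is a simplified model, not the Hadoop Fair Scheduler's actual algorithm: each pool first receives its minimum share (capped by its demand), and the remaining slots are then split in proportion to weight, with any surplus redistributed. The pool names and numbers are hypothetical, and the sketch assumes the minimum shares sum to no more than the total slot count.

```python
# Simplified weighted sharing with minimum shares (illustrative only; this is
# NOT the Hadoop Fair Scheduler implementation). Pool names and numbers are
# hypothetical. Assumes the minimum shares fit within total_slots.
def fair_shares(total_slots, pools):
    """pools maps name -> {"weight": w, "min_share": m, "demand": d}."""
    # Step 1: every pool gets its minimum share, but never more than it demands.
    shares = {n: float(min(p["min_share"], p["demand"])) for n, p in pools.items()}
    remaining = total_slots - sum(shares.values())

    # Step 2: split what is left by weight; pools that reach their demand
    # drop out and their surplus is offered to the others in the next round.
    active = {n for n, p in pools.items() if shares[n] < p["demand"]}
    while remaining > 1e-9 and active:
        total_weight = sum(pools[n]["weight"] for n in active)
        handed_out = 0.0
        satisfied = set()
        for n in active:
            offer = remaining * pools[n]["weight"] / total_weight
            take = min(offer, pools[n]["demand"] - shares[n])
            shares[n] += take
            handed_out += take
            if shares[n] >= pools[n]["demand"] - 1e-9:
                satisfied.add(n)
        remaining -= handed_out
        active -= satisfied
        if not satisfied:
            break  # every offer was accepted in full, nothing left to give
    return shares


example_pools = {
    "search":   {"weight": 2, "min_share": 20, "demand": 100},
    "research": {"weight": 1, "min_share": 10, "demand": 100},
    "adhoc":    {"weight": 1, "min_share": 0,  "demand": 10},
}
print(fair_shares(100, example_pools))
```

In this example the small "adhoc" pool is capped at its demand, and the slots it cannot use flow to the two larger pools in a 2:1 ratio.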
    6. Data Sourcing Patterns
       • Diagram labels: Click Stream, EDW, Images, Search Indices; Analytics, Reporting, Algorithmic Models
       • Table columns: Acquisition, Description, Source, Preparation, Format, Pattern
       • Click Stream
           • Data: Session, Event, Session Container
           • Session/Event: streamed as LZO/Text
           • Session Container: generated as Sequence Files
           • Session/Event data: build an index and use LzoTextInputFormat for splits, based on the work done by Johan Oskarsson/Twitter
           • Session Container: 'Value to Type Conversion' pattern
           • Secondary sort with reduce-side join
       • EDW
           • Data: Item, Transaction, User, Feedback, Bids
           • Streamed as GZIP/Text
           • Generate SequenceFile/HBase snapshot from the previous day's snapshot and the current day's data
           • Hive StorageHandlers point to the SequenceFile/HBase snapshot
           • TotalOrderPartitioner with RandomSampler to identify partition ranges for reducers
               • Create HBase regions using HFile
               • Update RegionServers using the Ruby script loadtable.rb
               • Concerns: HBase append performance, HFile flush (HBASE-1923)
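The "secondary sort with reduce-side join" pattern listed for click stream data can be simulated in memory as follows. This is a sketch of the general technique, not eBay's implementation: the record fields (`session_id`, `user`, `ts`, `page`) are hypothetical, and the sort and group steps stand in for what the MapReduce shuffle would do with a composite key, a custom partitioner, and a grouping comparator.

```python
# In-memory sketch of a reduce-side join driven by a secondary sort.
# Field names are hypothetical; sorted() and groupby() stand in for the shuffle.
from itertools import groupby
from operator import itemgetter

CONTAINER, EVENT = 0, 1  # tag order: the container sorts ahead of its events


def map_phase(containers, events):
    """Emit ((session_id, tag, ts), (tag, record)) pairs from both inputs."""
    for c in containers:
        yield (c["session_id"], CONTAINER, 0), (CONTAINER, c)
    for e in events:
        yield (e["session_id"], EVENT, e["ts"]), (EVENT, e)


def shuffle(pairs):
    """Sort on the full composite key, but group on session_id alone."""
    ordered = sorted(pairs, key=itemgetter(0))
    for session_id, group in groupby(ordered, key=lambda kv: kv[0][0]):
        yield session_id, [value for _, value in group]


def reduce_join(tagged_values):
    """The container arrives first, so events can be joined in one pass."""
    container = None
    for tag, record in tagged_values:
        if tag == CONTAINER:
            container = record
        elif container is not None:  # inner join: skip events with no container
            yield {**container, **record}


def run_join(containers, events):
    joined = []
    for _, values in shuffle(map_phase(containers, events)):
        joined.extend(reduce_join(values))
    return joined


containers = [{"session_id": "s1", "user": "u1"}, {"session_id": "s2", "user": "u2"}]
events = [
    {"session_id": "s2", "ts": 5, "page": "bid"},
    {"session_id": "s1", "ts": 9, "page": "view"},
    {"session_id": "s1", "ts": 3, "page": "search"},
]
for row in run_join(containers, events):
    print(row)
```

Because the tag is part of the sort key, the reducer never has to buffer events while waiting for the session container, which is the main reason to pair the join with a secondary sort.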
    7. Search Use Case: Machine Learned Ranking
       • Diagram labels: ClickStream, Items, Users, Feedback, Classifiers, Ranking Function, Great Search Results
       • Goal
           • Enhance search relevance for eBay's items
       • Hadoop Usage
           • Build a ranking function that takes multiple factors into account, such as price, listing format, seller track record, and relevance
           • Ability to add new factors to validate hypotheses
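A ranking function that combines several factors, and that lets new factors be added to test a hypothesis, might be structured like the sketch below. This is an assumption-laden illustration: the slide describes a machine-learned function, whereas the weights here are hand-set placeholders, and the class name, factor names, and item fields are all hypothetical.

```python
# Sketch of a pluggable multi-factor ranking function.
# Weights are hand-set placeholders (the real function is machine-learned);
# the class, factor names, and item fields are hypothetical.
from typing import Callable, Dict, List, Tuple

Item = Dict[str, object]
Scorer = Callable[[Item], float]


class RankingFunction:
    def __init__(self) -> None:
        self.factors: Dict[str, Tuple[float, Scorer]] = {}

    def add_factor(self, name: str, weight: float, scorer: Scorer) -> None:
        """Register (or replace) a factor so a new hypothesis can be tested."""
        self.factors[name] = (weight, scorer)

    def score(self, item: Item) -> float:
        """Weighted sum of every registered factor."""
        return sum(weight * scorer(item) for weight, scorer in self.factors.values())

    def rank(self, items: List[Item]) -> List[Item]:
        """Return items ordered from highest to lowest score."""
        return sorted(items, key=self.score, reverse=True)


listings = [
    {"id": "a", "relevance": 0.9, "seller_rating": 0.50},
    {"id": "b", "relevance": 0.7, "seller_rating": 0.99},
]

ranker = RankingFunction()
ranker.add_factor("relevance", 1.0, lambda item: item["relevance"])
print([item["id"] for item in ranker.rank(listings)])

# Hypothesis: seller track record should influence the ordering.
ranker.add_factor("seller_track_record", 0.5, lambda item: item["seller_rating"])
print([item["id"] for item in ranker.rank(listings)])
```

Keeping each factor as a named, weighted scorer makes it cheap to add or remove one and compare the resulting orderings.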
    8. Research Use Case: Description Data Mining
       • Goal
           • Extend catalog coverage
       • Hadoop Usage
           • Leverage data mining/machine learning techniques to convert inventory into name-value pairs in a completely unsupervised way
       • Example listing description
           • BARBIE 1999 "PREMIERE NIGHT" Home Shopping Special Edition
           • Gorgeous Doll With Beautiful Blond Hair / In A Gown Of Purple And Silver
           • New / Never Removed From Box / Doll Is In Mint Condition / Remember This Beauty Is 11 Years Old
           • Free Shipping To US Only / Will Ship International / Please E-mail For Cost
           • Feel Free To Ask Me Any Questions Or Concerns
           • Smoke-Free Environment
           • Free Shipping
       • Extracted name-value pairs
           • Year: 1999
           • Model: premiere night
           • Edition: home shopping special
           • Hair: blond
           • Gown: purple and silver
           • Condition: new / never removed from box / mint
    9. Platform
       • Platform details
           • Metrics: job statistics, system/disk consumption, utilization
           • Infrastructure: publish/subscribe, ETL tools, low-latency data movement
           • Development: tools, environment, IDE
           • Architecture: schemas, metadata, governance, policies
           • Operations: administration, configuration, monitoring
           • Reporting: visualization, BI generation, information delivery
           • Security: user & group management, Auth & Auth
       • Cluster details
           • Exploratory: strategic investment, 1000-5000 nodes
           • Production: site facing, low latency, high availability
           • Use case specific: Advertising, Trust & Safety, Merchandizing
    10. Acknowledgments
       • Athena Team
       • Cloudera Inc.
       • Community