2 hadoop@e bay-hug-2010-07-21
Published in: Technology

0 Comments
5 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
2,644
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
82
Comments
0
Likes
5
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide
  • 07/22/10
  • Transcript

    • 1. @eBay – Anil Madan (amadan@ebay.com), Analytics Platform Development
    • 2.
      • 2007 – Research team builds a 4-node cluster
        • Subset of Click Stream and EDW data
        • Innovation with Mobius Query Language
        • Visualization and Click Path analysis
      • Sept 2009 – Search clusters
        • Machine Learning Ranking cluster of 28 nodes
        • Search relevance cluster of 10 nodes
        • Subset of Click Stream and EDW Data
      • May 2010 – Athena* exploratory cluster of 532 nodes
        • Platform teams join hands with Search/Research to build a larger cluster
        • Build it as a core competency for advanced insights into complex data
        • Rapid build-out, with timelines pulled in by a couple of months
        • * Athena is the goddess of civilization, wisdom, strength, strategy, craft, justice and skill in Greek mythology
        • MIT's Athena ushered the world into a new era of distributed systems when it started in the mid-80s
    • 3. Infrastructure
      • Enterprise Nodes
        • Sun 64-bit, Red Hat Linux
        • 2 quad-core Nehalem, 72GB RAM, 4TB
        • Servers
          • NameNode(s)
          • Job Tracker
          • Zookeeper
          • HBase Master
          • Ganglia Server
          • eBay (Cloudera) HUE
      • Data Nodes
        • SGI-Rackables, CentOS, 1U, 5.3PB
        • 2 quad-core Nehalem, 36GB RAM, 10TB
        • HBase on 20 nodes
      • Network
        • TOR 1Gbps
        • Core Switches uplink 40Gbps
    • 4. Ecosystem: Hadoop Core (HDFS, Common); MapReduce (Java, Streaming, Pipes, Scala); Data Access (HBase, Pig, Hive); Tools & Libraries (HUE, UC4, Oozie, Mobius, Mahout); Monitoring & Alerting (Ganglia, Nagios)
      • MapReduce
        • Sourcing data – primarily Java
        • Applications – using Perl, Scala, Python…
      • Data Access Frameworks
        • HBase – for EDW data
        • Pig – data pipelines
        • Hive – ad hoc queries
        • MQL – Mobius Query Language
      • Monitoring & Alerting
        • Ganglia, Nagios
      • Tools
        • HUE/Mobius – lifecycle of user jobs
        • UC4 – scheduling
        • Oozie – user workflow and data pipelines
        • Mahout – data mining
    • 5. Administration
      • Groups
        • Built to support multiple groups
        • Job invocation uses the group name
        • Fair Scheduler
          • Allocations based on investment
          • Weights
          • Minimum share of mappers and reducers
          • poolMaxJobsDefault
          • userMaxJobsDefault
          • defaultMinSharePreemptionTimeout
          • fairSharePreemptionTimeout
      • Auth & Auth (authentication and authorization)
        • HUE – custom module to use corporate credentials
        • CLI* – custom PAM module
        • Security* – implement token interface to replace Kerberos with SAML
        • * Work in Progress
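
The Fair Scheduler settings listed on this slide (weights and a minimum share of mappers and reducers per pool) can be illustrated with a small sketch. This is not the Hadoop implementation: it ignores job demand and the preemption timeouts, and the pool names, weights and slot counts in the usage note are hypothetical. It searches for the weight ratio at which every pool receives the larger of its minimum share and its weighted share, so that the shares add up to the cluster's slots.

```python
def fair_shares(total_slots, pools, iterations=60):
    """Split total_slots across pools.

    pools maps a pool name to (weight, min_share). Each pool receives
    max(min_share, weight * ratio), where ratio is chosen by binary search
    so that the shares sum to total_slots. Simplified: no demand, no preemption.
    """
    if sum(min_share for _, min_share in pools.values()) > total_slots:
        raise ValueError("minimum shares exceed cluster capacity")
    if not any(weight > 0 for weight, _ in pools.values()):
        raise ValueError("at least one pool needs a positive weight")

    def slots_used(ratio):
        return sum(max(min_share, weight * ratio)
                   for weight, min_share in pools.values())

    # Grow the upper bound until the cluster would be fully used.
    low, high = 0.0, 1.0
    while slots_used(high) < total_slots:
        high *= 2.0

    # Narrow the ratio down by bisection.
    for _ in range(iterations):
        mid = (low + high) / 2.0
        if slots_used(mid) < total_slots:
            low = mid
        else:
            high = mid

    return {name: max(min_share, weight * high)
            for name, (weight, min_share) in pools.items()}
```

With 100 slots and hypothetical pools `search` (weight 2, minimum 10), `research` (weight 1, minimum 10) and `adhoc` (weight 1, minimum 40), the sketch assigns about 40, 20 and 40 slots: `adhoc` is held at its minimum share and the remainder is split 2:1 by weight.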
    • 6. Data Sourcing Patterns
      • Diagram labels: Click Stream, EDW, Images, Search Indices, Analytics, Reporting, Algorithmic Models, Acquisition
      • Table columns: Description, Source, Preparation, Format, Pattern
      • Click Stream
        • Data: Session, Event, Session Container
        • Session/Event – streamed as LZO/Text
        • Session Container – generate Sequence Files
        • Session/Event data – build an index and use LzoTextInputFormat for splits, based on the work done by Johan Oskarsson/Twitter
        • Session Container – ‘Value to Type Conversion’ pattern; secondary sort with reduce-side join
      • EDW
        • Data: Item, Transaction, User, Feedback, Bids
        • Streamed as GZIP/Text
        • Generate SequenceFile/HBase snapshot from the previous day's snapshot and the current day's data
        • Hive StorageHandlers to point to the SequenceFile/HBase snapshot
      • TotalOrderPartitioner with RandomSamplers to identify partition ranges for reducers
          • Create HBase regions using HFile
          • Update RegionServers using the Ruby script loadtable.rb
          • Concerns – HBase append performance, HFile flush (HBASE-1923)
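
The idea behind pairing a total-order partitioner with random sampling, as named on this slide, can be sketched in plain Python. This is an illustration of the technique, not the Hadoop `TotalOrderPartitioner` or `InputSampler.RandomSampler` API; the function names, sample size and seed are assumptions. A random sample of keys is sorted, evenly spaced keys from it become the split points, and each key is routed to the reducer whose range contains it, so concatenating the sorted reducer outputs yields a globally sorted result.

```python
import bisect
import random


def sample_split_points(keys, num_reducers, sample_size, seed=0):
    """Pick num_reducers - 1 split keys from a random sample of the input keys."""
    if not keys or sample_size < 1:
        raise ValueError("need at least one key and a positive sample size")
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # Evenly spaced positions in the sorted sample approximate key quantiles.
    step = len(sample) / num_reducers
    return [sample[int(step * i)] for i in range(1, num_reducers)]


def partition_for(key, split_points):
    """Return the reducer index whose key range contains key."""
    return bisect.bisect_right(split_points, key)


def total_order_partition(keys, num_reducers, sample_size=100):
    """Group keys into num_reducers buckets whose key ranges do not overlap."""
    splits = sample_split_points(keys, num_reducers, sample_size)
    buckets = [[] for _ in range(num_reducers)]
    for key in keys:
        buckets[partition_for(key, splits)].append(key)
    return buckets
```

How evenly the buckets fill depends on how well the sample reflects the key distribution, which is why the split points are taken from a sample of the real data rather than fixed in advance.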
    • 7. Search Use Case – Machine Learned Ranking
      • Diagram labels: ClickStream, Items, Users, Feedback, Classifiers, Ranking Function, Great Search Results
      • Goal
        • Enhance search relevance for eBay’s items.
      • Hadoop Usage
        • Build a ranking function that takes multiple factors into account, such as price, listing format, seller track record and relevance
        • Ability to add new factors to validate hypotheses
    • 8. Research Use Case – Description Data Mining
      • Goal
        • Extend catalog coverage
      • Hadoop Usage
        • Leverage data mining/machine learning techniques to turn inventory descriptions into name-value pairs in a completely unsupervised way
      • Example listing description
        • BARBIE 1999 "PREMIERE NIGHT" Home Shopping Special Edition Gorgeous Doll With Beautiful Blond Hair / In A Gown Of Purple And Silver New / Never Removed From Box / Doll Is In Mint Condition / Remember This Beauty Is 11 Years Old Free Shipping To US Only / Will Ship International / Please E-mail For Cost Feel Free To Ask Me Any Questions Or Concerns Smoke-Free Environment Free Shipping
      • Extracted name-value pairs
        • Year: 1999
        • Model: premiere night
        • Edition: home shopping special
        • Hair: blond
        • Gown: purple and silver
        • Condition: new / never removed from box / mint
    • 9.
      • Platform – Details
        • Metrics – job statistics, system/disk consumption, utilization
        • Infrastructure – publish/subscribe ETL tools, low latency data movement
        • Development – tools, environment, IDE
        • Architecture – schemas, metadata, governance, policies
        • Operations – administration, configuration, monitoring
        • Reporting – visualization, BI generation, information delivery
        • Security – user & group management, Auth & Auth
      • Clusters – Details
        • Exploratory – strategic investment, 1000-5000 nodes
        • Production – site facing, low latency, high availability
        • Use Case Specific – Advertising, Trust & Safety, Merchandizing
    • 10. Acknowledgments
      • Athena Team
      • Cloudera Inc.
      • Community
