2 hadoop@e bay-hug-2010-07-21
Presentation Transcript

  • @eBay Anil Madan amadan@ebay.com Analytics Platform Development
    • 2007 Research Team Builds a 4 node Cluster
      • Subset of Click Stream and EDW data
      • Innovation with Mobius Query Language
      • Visualization and Click Path analysis
    • 2009 Sept Search Clusters
      • Machine Learning Ranking cluster of 28 nodes
      • Search relevance cluster of 10 nodes
      • Subset of Click Stream and EDW Data
    • 2010 May – Athena* Exploratory Cluster of 532 nodes
      • Platform teams join hands with Search/Research to build a larger cluster.
      • Build it as a core competency for advanced insights into complex data
      • Rapid build-out, with timelines pulled in by a couple of months
      • * Athena is the goddess of civilization, wisdom, strength, strategy, craft, justice and skill in Greek mythology.
      • MIT's Athena ushered the world into a new era of distributed systems when it started in the mid-80s.
  • Infrastructure
    • Enterprise Nodes
      • Sun 64-bit, Red Hat Linux
      • 2 Quad Core Nehalem, 72GB RAM, 4TB
      • Servers
        • NameNode(s)
        • Job Tracker
        • Zookeeper
        • HBaseMaster
        • Ganglia Server
        • eBay (Cloudera) HUE
    • Data Nodes
      • SGI-Rackables, CentOS, 1U, 5.3PB
      • 2 Quad Core Nehalem, 36GB RAM, 10TB
      • HBase on 20 nodes
    • Network
      • TOR 1Gbps
      • Core Switches uplink 40Gbps
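The 5.3PB figure on the Data Nodes slide can be cross-checked against the node count and per-node disk. The sketch below is only a back-of-the-envelope check: it assumes all 532 Athena nodes are data nodes with 10TB each, and it assumes the HDFS default replication factor of 3, which the slides do not state.

```python
# Back-of-the-envelope capacity check for the data-node tier.
NODES = 532        # cluster size from the slides (assumed here to all be data nodes)
TB_PER_NODE = 10   # raw disk per data node, from the slides
REPLICATION = 3    # HDFS default replication factor (assumption, not from the slides)


def raw_capacity_pb(nodes: int, tb_per_node: float) -> float:
    """Raw capacity in PB, using decimal units (1 PB = 1000 TB)."""
    return nodes * tb_per_node / 1000


def usable_capacity_pb(raw_pb: float, replication: int) -> float:
    """Capacity left for unique data once every block is stored `replication` times."""
    return raw_pb / replication


raw = raw_capacity_pb(NODES, TB_PER_NODE)
usable = usable_capacity_pb(raw, REPLICATION)
print(f"raw: {raw:.2f} PB, usable at {REPLICATION}x replication: {usable:.2f} PB")
```

Under these assumptions the raw figure comes out at 5.32 PB, in line with the 5.3PB quoted on the slide.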
  • Ecosystem: Hadoop Core (HDFS, Common); MapReduce (Java, Streaming, Pipes, Scala); Data Access (HBase, Pig, Hive); Tools & Libraries (HUE, UC4, Oozie, Mobius, Mahout); Monitoring & Alerting (Ganglia, Nagios)
    • MapReduce
      • Sourcing data – primarily Java
      • Applications – using Perl, Scala, Python…
    • Data Access Frameworks
      • HBase – for EDW data
      • Pig – data pipelines
      • Hive – ad hoc queries
      • MQL – Mobius Query Language
    • Monitoring & Alerting
      • Ganglia, Nagios
    • Tools
      • HUE/Mobius – lifecycle of user jobs
      • UC4 – scheduling
      • Oozie – user workflow and data pipelines
      • Mahout – data mining
  • Administration
    • Groups
      • Built to support multiple groups
      • Job invocation uses the group name
      • Fair Scheduler
        • Allocations based on investment
        • Weights
        • Minimum share of mappers and reducers
        • poolMaxJobsDefault
        • userMaxJobsDefault
        • defaultMinSharePreemptionTimeout
        • fairSharePreemptionTimeout
    • Auth & Auth
      • HUE – custom module to use corp. credentials
      • CLI* – custom PAM module
      • Security* – implement token interface to replace Kerberos with SAML
      • * Work in Progress
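The Fair Scheduler settings listed under Groups normally live in an XML allocation file. As a rough illustration, the sketch below generates such a file with Python's standard library. The element names follow the Hadoop 0.20-era Fair Scheduler allocation file format as far as I know it and should be checked against the documentation for the version in use. The pool names, weights, and timeout values are hypothetical and are not taken from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical pools: names and numbers are illustrative only, not eBay's settings.
POOLS = {
    "search": {"weight": 2.0, "minMaps": 200, "minReduces": 100},
    "research": {"weight": 1.0, "minMaps": 100, "minReduces": 50},
}

# Cluster-wide defaults named on the slide; the values are placeholders.
DEFAULTS = {
    "poolMaxJobsDefault": 20,
    "userMaxJobsDefault": 5,
    "defaultMinSharePreemptionTimeout": 600,  # seconds
    "fairSharePreemptionTimeout": 1200,       # seconds
}


def build_allocations(pools, defaults):
    """Build an <allocations> tree with one <pool> per group plus the defaults."""
    root = ET.Element("allocations")
    for name, settings in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        for key, value in settings.items():
            ET.SubElement(pool, key).text = str(value)
    for key, value in defaults.items():
        ET.SubElement(root, key).text = str(value)
    return root


allocations = build_allocations(POOLS, DEFAULTS)
print(ET.tostring(allocations, encoding="unicode"))
```

Giving each group its own pool is one way to express "allocations based on investment": a group that funded more nodes gets a larger weight and a larger minimum share of mappers and reducers.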
  • Data Sourcing Patterns
    • Diagram labels: Click Stream, EDW, Images, Search Indices; Analytics, Reporting, Algorithmic Models; Acquisition
    • Table columns: Description, Source, Preparation, Format, Pattern
    • Click Stream
      • Session, Event, Session Container
      • Session/Event: streamed as LZO/Text
      • SessionContainer: generate Sequence Files
      • Session/Event data: build an index and use LzoTextInputFormat for splits, based on the work done by Johan Oskarsson/Twitter
      • Session Container: ‘Value to Type Conversion’ pattern; secondary sort with reduce-side join
    • EDW
      • Item, Transaction, User, Feedback, Bids
      • Streamed as GZIP/Text
      • Generate SequenceFile/HBase snapshot with previous day snapshot and current day data
      • Hive StorageHandlers to point to SequenceFile/HBase snapshot
    • TotalOrderPartitioner with RandomSampler to identify partition ranges for reducers.
        • Create HBase regions using HFile
        • Update RegionServers using the Ruby script loadtable.rb
        • Concerns – HBase append performance, HFile flush (HBASE-1923)
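The idea behind pairing a total-order partitioner with a random sampler can be sketched outside Hadoop. The Python below is not the Hadoop API, only a minimal model of the technique: sample the keys, pick boundary keys from the sorted sample, and route each key to the reducer whose range contains it. All function names and the synthetic `item…` keys are made up for illustration.

```python
import bisect
import random


def sample_split_points(keys, num_partitions, sample_size, seed=0):
    """Pick num_partitions - 1 boundary keys from a random sample of the input."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]


def partition_for(key, split_points):
    """Index of the partition (reducer) whose key range contains `key`."""
    return bisect.bisect_right(split_points, key)


# Synthetic, shuffled row keys standing in for real input.
keys = [f"item{n:06d}" for n in range(10_000)]
random.Random(1).shuffle(keys)

splits = sample_split_points(keys, num_partitions=4, sample_size=500)

partitions = [[] for _ in range(4)]
for k in keys:
    partitions[partition_for(k, splits)].append(k)

# Each "reducer" sorts only its own partition; concatenating the partitions
# in order then yields a globally sorted sequence.
output = [k for part in partitions for k in sorted(part)]
print([len(p) for p in partitions])
```

Because every reducer ends up with a contiguous, non-overlapping key range, each reducer's sorted output can be written as its own file covering one range, which is presumably what makes this pattern a good fit for building HBase regions from HFiles.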
  • Search Use Case – Machine Learned Ranking
    • Diagram: ClickStream, Items, Users, Feedback → Classifiers → Ranking Function → Great Search Results
    • Goal
      • Enhance search relevance for eBay’s items.
    • Hadoop Usage
      • Build a ranking function that takes multiple factors into account, such as price, listing format, seller track record, and relevance.
      • Ability to add new factors to validate hypotheses.
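To make the idea of a multi-factor ranking function concrete, here is a deliberately simple sketch: a weighted sum over per-listing factor scores. It is not eBay's ranking function. The `Listing` fields, the 0 to 1 score scale, and the weights are all hypothetical, and in a machine-learned ranker the weights would be fitted from data rather than set by hand.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    title: str
    relevance: float     # hypothetical 0..1 text-match score
    price_score: float   # hypothetical 0..1, higher = more attractive price
    seller_score: float  # hypothetical 0..1 seller track record
    format_score: float  # hypothetical 0..1 fit of the listing format


# Hand-set placeholder weights; a learned ranker would fit these from data.
WEIGHTS = {
    "relevance": 0.5,
    "price_score": 0.2,
    "seller_score": 0.2,
    "format_score": 0.1,
}


def score(listing, weights):
    """Weighted sum of the factor scores named in `weights`."""
    return sum(w * getattr(listing, name) for name, w in weights.items())


def rank(listings, weights):
    """Listings ordered from highest to lowest score."""
    return sorted(listings, key=lambda item: score(item, weights), reverse=True)


listings = [
    Listing("c", 0.2, 0.2, 0.2, 0.2),
    Listing("b", 0.4, 0.9, 0.9, 0.9),
    Listing("a", 0.9, 0.5, 0.5, 0.5),
]
print([item.title for item in rank(listings, WEIGHTS)])
```

In this shape, testing a hypothesis about a new factor amounts to adding one field and one weight and comparing the resulting rankings.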
  • Research Use Case – Description Data Mining
    • Goal
      • Extend catalog coverage
    • Hadoop Usage
      • Leverage data mining/machine learning techniques to turn item descriptions into name-value pairs in a completely unsupervised way
    • Example description
      • BARBIE 1999 "PREMIERE NIGHT" Home Shopping Special Edition Gorgeous Doll With Beautiful Blond Hair / In A Gown Of Purple And Silver New / Never Removed From Box / Doll Is In Mint Condition / Remember This Beauty Is 11 Years Old Free Shipping To US Only / Will Ship International / Please E-mail For Cost Feel Free To Ask Me Any Questions Or Concerns Smoke - Free Environment Free Shipping
    • Extracted name-value pairs
      • Year: 1999
      • Model: premiere night
      • Edition: home shopping special
      • Hair: blond
      • Gown: purple and silver
      • Condition: new / never removed from box / mint
  • Platform
    • Metrics – Job Statistics, System/Disk Consumption, Utilization
    • Infrastructure – Publish/Subscribe ETL tools, low latency data movement
    • Development – Tools, Environment, IDE
    • Architecture – Schemas, Metadata, Governance, Policies
    • Operations – Administration, Configuration, Monitoring
    • Reporting – Visualization, BI Generation, Information delivery
    • Security – User & Group Management, Auth & Auth
  • Clusters
    • Exploratory – Strategic investment, 1000-5000 nodes
    • Production – Site facing, low latency, high availability
    • Use Case Specific – Advertising, Trust & Safety, Merchandizing
  • Acknowledgments
    • Athena Team
    • Cloudera Inc.
    • Community