Hadoop @ eBay: Past, Present and Future
Ryan Hennig
Hadoop Platform Team
ABOUT ME
RYAN HENNIG
Born and raised in Seattle, WA
Studied Computer Science at University of Washington in Seattle
Worked on Micro...
AGENDA

Past: Growth of Hadoop at eBay
Present: Hadoop Use Cases, Operations Tools
Future: Hadoop 2.0
HADOOP AT EBAY: PAST
Growth of Hadoop at eBay
Adventures in Forking
Partnership with Hortonworks
HADOOP EVOLUTION @ eBay

[Timeline figure, 2007 to 2013: Search, single digit nodes, 10s of nodes, shared clusters ...]
ADVENTURES IN FORKING
• 2007-2010: eBay runs shared clusters on Cloudera Distribution of Hadoop
• 2010-2012: eBay runs sha...
EBAY AND HORTONWORKS
• 2012: eBay enters partnership with Hortonworks
– Goals
• Focus on eBay-specific development interna...
HADOOP AT EBAY: PRESENT
Shared and Dedicated Clusters
Job Distribution
Use Case Examples
eBay Data Platform Overview
SHARED AND DEDICATED CLUSTERS
Shared clusters
– 10s of PB and 10s of thousands of slots per cluster
Used primaril...
JOB DISTRIBUTION BY TYPE

USE CASE EXAMPLES
•Cassini, eBay’s new search engine:
– Use MR to build full and incremental near-real-time indexes
– Raw ...
HADOOP OPERATIONS
LDAP Integration
- All users stored in Active Directory, accessed via LDAP
- Access to MapReduce Queues ...
HADOOP OPERATIONS
Team has Development and Operations Responsibilities
- 2 Huge shared clusters
- 1800+ users, exponential...
HADOOP MANAGEMENT CONSOLE
• Custom Web application built on Ruby on Rails
• Self-service tools are continually added to re...
ldap-admin
•Command-line tool written in Ruby
•Swiss-army knife tool, features added on demand for support issues
•Often u...
HADOOP AT EBAY: FUTURE
HDFS Federation
YARN
New Scenarios
Storage and Operational Efficiency
HDFS HA and Federation
• HDFS High-Availability for Reliability
– NameNode in Hadoop 1.0 is a Single Point of Failure
– Au...
HDFS HA

HDFS Federation
Horizontal Scalability of HDFS Namespace
Multiple independent NameNodes serving a subtree of the NameSpace...
YARN
Hadoop 1.0: MapReduce
– JobTracker and TaskTracker services
– Handles Resource Management, Job Execution

Hadoop 2.0:...
New Scenarios
• Iterative Query
– Stinger (Hive), Impala, etc
– Rapid Data exploration and analysis
• Graph Databases
– Ti...
Efficiency and Reliability
• Storage Efficiency
– HDFS introduces a 3x storage cost for its replicas
– HDFS-RAID: more rel...
Open Source
• HMC Metadata
– Long term goal: standardize on open source technologies (HCatalog)
– Short term: explore what...
THANK YOU
Questions?
Hadoop @ eBay: Past, Present, and Future

An overview of eBay's experience with Hadoop in the Past and Present, as well as directions for the Future. Given by Ryan Hennig at the Big Data Meetup at eBay in Netanya, Israel on Dec 2, 2013

Transcript of "Hadoop @ eBay: Past, Present, and Future"

  1. Hadoop @ eBay: Past, Present and Future
     Ryan Hennig
     Hadoop Platform Team
  2. ABOUT ME
  3. RYAN HENNIG
     Born and raised in Seattle, WA
     Studied Computer Science at University of Washington in Seattle
     Worked on Microsoft SQL Server 2006 – 2012
     - Shipped SQL Server 2008, 2008 R2, 2012
     Joined eBay Hadoop team in early 2012
     - Based in Bellevue, suburb of Seattle
     COMPUTE AND DATA INFRASTRUCTURE
  4. AGENDA
     Past: Growth of Hadoop at eBay
     Present: Hadoop Use Cases, Operations Tools
     Future: Hadoop 2.0
  5. HADOOP AT EBAY: PAST
     Growth of Hadoop at eBay
     Adventures in Forking
     Partnership with Hortonworks
  6. HADOOP EVOLUTION @ eBay
     [Timeline figure; years shown: 2007, 2009, 2010, 2011, 2012, 2013; label: Search. Stages in order of growth:]
     • Single digit nodes
     • 10s of nodes
     • Shared cluster: 100s nodes, 1000s+ core, PB, CDH2
     • Shared clusters: 1000s node, 10,000+ core, 10s PB, Wilma (0.20)
     • Shared clusters: 1000s node, 10,000+ core, 10s PB, Argon (0.22)
     • Shared clusters: 4k+ node, 40,000+ core, 50s PB, HDP 1.x
  7. ADVENTURES IN FORKING
     • 2007-2010: eBay runs shared clusters on Cloudera Distribution of Hadoop
     • 2010-2012: eBay runs shared clusters on custom Hadoop versions
       – 2010: Wilma (based on 0.20)
       – 2011: Argon (based on 0.22)
       – 2012: Custom branch abandoned
     • Lessons Learned
       – Forking a fast-changing open source project is difficult and risky
         • Balancing Development and operations needs
         • Development team size
           – Facebook had 100
           – eBay had 15
         • Coordination with open source community = lots of overhead
         • Divergence from open source: Push changes early and often
  9. EBAY AND HORTONWORKS
     • 2012: eBay enters partnership with Hortonworks
       – Goals
         • Focus on eBay-specific development internally
         • Leverage Hortonworks expertise for general Hadoop Development
         • Avoid source code divergence by making open source contribution a priority
       – Benefits to Hortonworks
         • Credibility enhanced by having a well-known customer
         • Ability to test at large scale
  10. HADOOP AT EBAY: PRESENT
      Shared and Dedicated Clusters
      Job Distribution
      Use Case Examples
      eBay Data Platform Overview
  11. SHARED AND DEDICATED CLUSTERS
      Shared clusters
      – 10s of PB and 10s of thousands of slots per cluster
      – Used primarily for analytics of user behavior and inventory
      – Mix of production and ad-hoc jobs
      – Mix of MR, Hive, PIG, Cascading etc.
      – Hadoop and HBase security enabled
      Dedicated clusters
      – Very specific use cases like Index Building
      – Tight SLAs for jobs (in order of minutes)
      – Immediate revenue impact
      – Usually smaller than our shared clusters, but still big (100s of nodes…)
  12. JOB DISTRIBUTION BY TYPE [chart-only slide]
  13. USE CASE EXAMPLES
      • Cassini, eBay’s new search engine:
        – Use MR to build full and incremental near-real-time indexes
        – Raw Data is stored in HBase for efficient updates and random read
        – Strong SLAs: < 10 minutes
        – Run on dedicated clusters
      • Related and similar Items recommendations:
        – Use transactional data, click stream data, search index, etc.
        – Production MR jobs on a shared cluster
      • Analytics dashboard:
        – Run Mobius MR jobs to join click stream data and transactional data
        – Store summary data in HBase
        – Web application to query HBase
  14. HADOOP OPERATIONS
      LDAP Integration
      - All users stored in Active Directory, accessed via LDAP
      - Access to MapReduce Queues granted via MapReduce queues
      - Batch users: shared by a group of users
      Security
      - Kerberos as implemented by Microsoft Active Directory
      - One domain for users, another for service/server principals
      - Batch users authenticated via keytabs, not passwords
      Misc
      - 10s of slave nodes are broken at any given time
      - Often need to add several racks of machines at a time
  15. HADOOP OPERATIONS
      Team has Development and Operations Responsibilities
      - 2 Huge shared clusters
      - 1800+ users, exponential growth
      - About 10 Hadoop developers
      - Recently: operations work moved to dedicated team
      Developed several tools to manage operations
      - Hadoop Management Console: user-facing web app
      - ldap-admin: swiss-army knife style tool for hadoop admins
      - Puppet: for adding machines to the clusters, many racks at a time
      - Decom/Recom scripts: automatic detection, repair, decommission, and recommission of slave nodes
  16. HADOOP MANAGEMENT CONSOLE
      • Custom Web application built on Ruby on Rails
      • Self-service tools are continually added to reduce support load
        – User Management
          • Access Requests
          • Group Membership
        – Batch User Management
          • New Requests
          • Sudoer management
        – Dataset Management
          • Explore Datasets
          • Request New dataset transfer between Teradata and Hadoop
        – Metadata tools
          • Each dataset is stored in custom XML format
          • Code Generation: Hive Tables, Java POJOs
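The "Code Generation: Hive Tables" item can be pictured with a minimal sketch. The slides do not show eBay's custom XML format, so the `<dataset>` and `<field>` elements, their attributes, and the function name `hive_ddl_from_xml` below are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical dataset description; the real eBay XML format is not shown in the slides.
DATASET_XML = """
<dataset name="click_events" location="/data/click_events">
  <field name="user_id" type="bigint"/>
  <field name="item_id" type="bigint"/>
  <field name="event_time" type="string"/>
</dataset>
"""


def hive_ddl_from_xml(xml_text: str) -> str:
    """Turn one dataset description into a Hive CREATE EXTERNAL TABLE statement."""
    root = ET.fromstring(xml_text.strip())
    # One column definition per <field> element, in document order.
    columns = ",\n".join(
        f"  `{field.get('name')}` {field.get('type').upper()}"
        for field in root.findall("field")
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{root.get('name')}` (\n"
        f"{columns}\n"
        f")\n"
        f"LOCATION '{root.get('location')}'"
    )


if __name__ == "__main__":
    print(hive_ddl_from_xml(DATASET_XML))
```

Keeping one metadata file per dataset and generating the table definitions from it means the Hive schema and any generated Java classes cannot drift apart, which is presumably the appeal of the approach.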
  24. ldap-admin
      • Command-line tool written in Ruby
      • Swiss-army knife tool, features added on demand for support issues
      • Often used features:
        – Add a user to a group
        – View key details for LDAP users and groups
        – List all users, batch users, hadoop groups
        – Reset batch user passwords and keytabs
        – Show/add/remove sudoers for a batch account
        – Run user diagnostics: check permissions, keytabs, etc
  25. HADOOP AT EBAY: FUTURE
      HDFS Federation
      YARN
      New Scenarios
      Storage and Operational Efficiency
  26. HDFS HA and Federation
      • HDFS High-Availability for Reliability
        – NameNode in Hadoop 1.0 is a Single Point of Failure
        – Automated failover to hot standby
        – Depends on ZooKeeper
      • HDFS Federation for Scalability and Isolation
        – Hadoop 1.0: Single NameNode service
          • “Secondary NameNode” is not for failover
          • Storage scales horizontally, but Namespace scales vertically
          • No isolation for different tenants or applications
        – Hadoop 2.0: HDFS Federation
          • Partition the HDFS Namespace
          • Many independent NameNodes
          • Allows direct access to Block Storage w/o going through HDFS interface
  27-29. HDFS HA [three figure-only slides]
  30. HDFS Federation
      Horizontal Scalability of HDFS Namespace
      Multiple independent NameNodes, each serving a subtree of the NameSpace
      Example: NN1 provides /users, NN2 provides /reports
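The /users and /reports example amounts to a mount table that maps path prefixes to NameNodes. Below is a minimal sketch of that lookup. The names `MOUNT_TABLE` and `resolve_namenode` and the strings "nn1" and "nn2" are illustrative assumptions, not Hadoop APIs; in Hadoop 2 a client-side mount table (ViewFs) plays a comparable role.

```python
import posixpath

# Mount table taken from the slide's example; "nn1" and "nn2" stand in for
# real NameNode addresses, which the slides do not give.
MOUNT_TABLE = {"/users": "nn1", "/reports": "nn2"}


def resolve_namenode(path, mount_table=MOUNT_TABLE):
    """Return the NameNode whose subtree contains `path` (longest prefix wins)."""
    path = posixpath.normpath(path)
    best = None
    for prefix in mount_table:
        # Match whole path components only, so /usersx does not match /users.
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no NameNode serves {path}")
    return mount_table[best]


if __name__ == "__main__":
    print(resolve_namenode("/users/alice/data.txt"))
    print(resolve_namenode("/reports/2013/q4"))
```

Because each NameNode holds metadata only for its own subtree, adding a NameNode adds namespace capacity, and a heavy tenant under one subtree does not load the NameNode serving another.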
  31. YARN
      Hadoop 1.0: MapReduce
      – JobTracker and TaskTracker services
      – Handles Resource Management, Job Execution
      Hadoop 2.0: YARN
      - Refactoring Responsibilities of JobTracker and TaskTracker into more general platform
      - Global ResourceManager
        - Cluster-wide resource management
      - Per-application ApplicationMaster
        - Application-specific job control
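The split of responsibilities can be illustrated with a toy model. This is not the YARN API: the classes below only borrow the names ResourceManager and ApplicationMaster, and the idea that one container runs one task per round is a simplifying assumption.

```python
class ResourceManager:
    """Toy global scheduler: tracks free containers for the whole cluster."""

    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, requested):
        # Grant as many containers as are currently free, up to the request.
        granted = min(requested, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count


class ApplicationMaster:
    """Toy per-application controller: requests containers and tracks its own tasks."""

    def __init__(self, name, tasks, resource_manager):
        self.name = name
        self.tasks = tasks
        self.rm = resource_manager

    def run(self):
        completed = 0
        while completed < self.tasks:
            granted = self.rm.allocate(self.tasks - completed)
            if granted == 0:
                raise RuntimeError("cluster has no free containers")
            completed += granted      # assumption: each container runs one task
            self.rm.release(granted)  # hand the containers back when done
        return completed


if __name__ == "__main__":
    rm = ResourceManager(total_containers=4)
    am = ApplicationMaster("wordcount", tasks=10, resource_manager=rm)
    print(am.run(), rm.free)
```

The ResourceManager here knows nothing about jobs or tasks, only about containers, which is what lets frameworks other than MapReduce share the same cluster.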
  32-35. YARN [four figure-only slides]
  36. New Scenarios
      • Iterative Query
        – Stinger (Hive), Impala, etc
        – Rapid Data exploration and analysis
      • Graph Databases
        – TitanDB, Giraph
        – Billions of vertices and edges
        – Complex Graph Traversals
        – Applications: PayPal fraud detection, Social Graph Analysis
      • Real-Time Processing
        – Storm (Twitter), Apache S4
        – Reinforcement Learning, Monitoring
  37. Efficiency and Reliability
      • Storage Efficiency
        – HDFS introduces a 3x storage cost for its replicas
        – HDFS-RAID: more reliability for 1.5x storage cost
          • Reed-Solomon
          • Locally Repairable Codes (Project Xorbas)
        – Tradeoff: the cost of repairing lost data is much higher
      • Operational Efficiency
        – More automation
        – More self-service tools
        – Better Monitoring
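The storage and repair tradeoff can be checked with simple arithmetic. The slide gives only the 3x and 1.5x figures. The stripe layout used below (10 data blocks plus 5 parity blocks) is an assumption picked because it yields 1.5x, and may not be the configuration eBay evaluated. The repair count reflects classic Reed-Solomon decoding, which reads as many surviving blocks as there are data blocks in the stripe.

```python
def replication_overhead(replicas):
    """Bytes stored per byte of user data under plain replication."""
    return float(replicas)


def erasure_overhead(data_blocks, parity_blocks):
    """Bytes stored per byte of user data for a (data, parity) erasure-coded stripe."""
    return (data_blocks + parity_blocks) / data_blocks


def blocks_read_to_repair(data_blocks=None):
    """Blocks read to rebuild one lost block.

    Replication copies one surviving replica; classic Reed-Solomon reads
    `data_blocks` surviving blocks from the stripe.
    """
    return 1 if data_blocks is None else data_blocks


if __name__ == "__main__":
    # Assumed stripe layout: 10 data + 5 parity blocks (illustrative only).
    print("replication  :", replication_overhead(3), "x storage,",
          blocks_read_to_repair(), "block read per repair")
    print("Reed-Solomon :", erasure_overhead(10, 5), "x storage,",
          blocks_read_to_repair(10), "blocks read per repair")
```

Under these assumed parameters a stripe survives the loss of any 5 blocks, versus 2 lost copies for 3x replication, at half the storage cost but roughly ten times the read traffic per repair. Locally Repairable Codes are aimed at reducing that repair traffic.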
  38. Open Source
      • HMC Metadata
        – Long term goal: standardize on open source technologies (HCatalog)
        – Short term: explore what should be open sourced
      • Hadoop Management Console
        – Hadoop Access Request Automation
        – Batch user creation and management
        – Metadata management
        – Code generation of dataset to Hive tables and Java POJOs
      • ldap-admin tools
        – Very useful but tightly coupled to eBay’s LDAP configuration
        – Willing to open source if there is interest
  39. THANK YOU
      Questions?