HBaseCon 2012 | Gap Inc Direct: Serving Apparel Catalog from HBase for Live Website
 

Gap Inc Direct, the online division of Gap Inc., uses HBase to serve, in real time, the apparel catalog for all of its brands' and markets' websites. This case study reviews the business case as well as key decisions about schema design and cluster configuration, and discusses implementation challenges and lessons learned.

Presentation Transcript

  • HBaseCon 2012, Applications Track – Case Study 1
  • Suraj Varma, Director of Technology Implementation, Gap Inc Direct (GID), San Francisco, CA (IRC: svarma); Gupta Gogula, Director-IT & Domain Architect of Catalog Management & Distribution, Gap Inc Direct (GID), San Francisco, CA 2
  • Problem Domain; HBase Schema Specifics; HBase Cluster Specifics; Learning & Challenges 3
  • [Timeline diagram, 2005–2010: new brands (Piperlime, Athleta), CA and EU site launches, “universality”, and new markets, with incoming US/CA/EU traffic fanning out to separate application servers and databases.] 4
  • Evolution of the GID Apparel Catalog: 2005 – three independent brands in the US; 2010 – five integrated brands across US, CA, and EU. Rapid expansion of the apparel catalog; however, each brand/market combination necessitated separate logical catalog databases. 5
  • Single catalog store for all brands/markets; horizontally scalable over time; cross-brand business features; direct access to the data store (to take advantage of item inventory awareness); minimal caching, only for optimization (keeping caches in sync is a problem); highly available. 6
  • Sharded RDBMS, Memcached, etc.: significant effort was required and scalability limits remained. Non-relational alternatives were considered; an HBase POC (early 2010) gave promising results, so we decided to move ahead. 7
  • Strong consistency model; server-side filters; automatic sharding, distribution, and failover; Hadoop integration out of the box; general purpose (other use cases outside of the catalog); strong community! 8
  • [Diagram: incoming requests for catalog data hit backend services, which read from the HBase cluster; near real-time inventory updates, pricing updates, and item updates arrive as mutations.] 9
  • Read-mostly: website traffic; sync MR jobs. Write/delete bursts: catalog publish, being phased out in favor of near real-time updates from the originating systems; MR jobs on the live cluster. Continuous writes: inventory updates. 10
  • Hierarchical data (primarily): SKU -> Style lookups (child -> parent); cross-brand sell (sibling <-> sibling). Rows: ~100 KB average size, 1,000–5,000 columns, sparse. Data access patterns: full product graph in one read; single path of the graph from root to leaf node; search via secondary indices; large feed files. 11
  • [Diagram: read the full product graph vs. read a single path/edge.] 12
  • Built a custom “bean to schema mapper”: POJO graph <-> HBase qualifiers; flexibility to shorten column qualifiers; flexibility to change schema qualifiers (per environment/developer). Example mapping: <…> <association>one-to-many</association> <prefix>SC</prefix> <uniqueId>colorCd</uniqueId> <beanName>styleColorBean</beanName> <…> 13
  • Pattern: ancestor IDs embedded in the qualifier name. <PP>_<id1>_QQ_<id2>_RR_<id3>_name, where PP is the parent, QQ the child, and RR the grandchild. Examples: cf1:VAR_1_SC_0012_colorCd, cf2:VAR_1_SC_0012_SCIMG_10_path 14
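    The slide names only the pattern; below is a minimal sketch of how such qualifiers could be assembled and written with the 0.9x-era HBase client API. The table name "catalog", the row key, and the attribute values are illustrative assumptions, not from the talk.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.util.Bytes;

        public class CatalogWriter {
            // Builds a qualifier of the form <PP>_<id1>_QQ_<id2>_..._attr,
            // embedding every ancestor prefix/id on the path to the attribute.
            static byte[] qualifier(String attr, String... pathParts) {
                StringBuilder sb = new StringBuilder();
                for (String part : pathParts) {
                    sb.append(part).append('_');
                }
                return Bytes.toBytes(sb.append(attr).toString());
            }

            public static void main(String[] args) throws Exception {
                Configuration conf = HBaseConfiguration.create();
                HTable table = new HTable(conf, "catalog");      // hypothetical table name
                Put put = new Put(Bytes.toBytes("KEY_0001"));    // one product graph per row

                // cf1:VAR_1_SC_0012_colorCd (variant 1 -> style color 0012 -> colorCd)
                put.add(Bytes.toBytes("cf1"),
                        qualifier("colorCd", "VAR", "1", "SC", "0012"),
                        Bytes.toBytes("TRUE NAVY"));

                // cf2:VAR_1_SC_0012_SCIMG_10_path (style color 0012 -> image 10 -> path)
                put.add(Bytes.toBytes("cf2"),
                        qualifier("path", "VAR", "1", "SC", "0012", "SCIMG", "10"),
                        Bytes.toBytes("/images/0012/10.jpg"));

                table.put(put);
                table.close();
            }
        }

    Packing the ancestor ids into the qualifier keeps the whole product graph in one wide, sparse row, so a single Get can return the full graph.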
  • Pattern: secondary index to hierarchical ancestors. Secondary index: <id3> => RR; QQ; PP. A FilterList built from the (RR, QQ, PP) ids retrieves the thin-slice path (index row KEY_5555 -> ancestor ids 4444, 333, 22, 1). 15
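    A sketch of the FilterList idea under the naming scheme above, assuming a secondary-index lookup has already resolved the leaf's ancestor ids. A ColumnPrefixFilter on an intermediate prefix also matches sibling subtrees, so the filters actually used in the talk were likely narrower; the table, row key, and prefixes here are illustrative.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
        import org.apache.hadoop.hbase.filter.FilterList;
        import org.apache.hadoop.hbase.util.Bytes;

        public class ThinSliceReader {
            public static void main(String[] args) throws Exception {
                Configuration conf = HBaseConfiguration.create();
                HTable catalog = new HTable(conf, "catalog");   // hypothetical table name

                // Suppose the secondary index resolved image "10" to the path
                // variant "1" -> style color "0012"; fetch only that slice.
                FilterList path = new FilterList(FilterList.Operator.MUST_PASS_ONE);
                path.addFilter(new ColumnPrefixFilter(Bytes.toBytes("VAR_1_SC_0012_SCIMG_10_")));
                path.addFilter(new ColumnPrefixFilter(Bytes.toBytes("VAR_1_SC_0012_")));
                path.addFilter(new ColumnPrefixFilter(Bytes.toBytes("VAR_1_")));

                Get get = new Get(Bytes.toBytes("KEY_0001"));
                get.setFilter(path);
                Result slice = catalog.get(get);   // the path (and anything sharing its prefixes)
                System.out.println("cells returned: " + slice.size());
                catalog.close();
            }
        }

    Because filters are evaluated server side, only the matching qualifiers travel back to the client.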
  • “Publish at midnight”: future-dated PUTs, then Get/Scan with a time range. Large feed files: sharded into smaller chunks, < 2 MB per cell. Pattern: sharded chunks (row KEY_nnnn with chunk columns S_1, S_2, S_3, S_4). 16
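    A sketch of the future-dated PUT plus time-range read, and of chunking a large feed value to stay under ~2 MB per cell. The launch timestamp, chunk size constant, and all names are illustrative assumptions.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.util.Bytes;

        public class MidnightPublish {
            static final int CHUNK_BYTES = 2 * 1024 * 1024;   // keep each cell under ~2 MB

            public static void main(String[] args) throws Exception {
                Configuration conf = HBaseConfiguration.create();
                HTable catalog = new HTable(conf, "catalog");
                long launchTs = System.currentTimeMillis() + 3600 * 1000L;  // stand-in for "midnight"

                // Future-dated write: the new cells carry the launch timestamp.
                Put put = new Put(Bytes.toBytes("KEY_0001"));
                put.add(Bytes.toBytes("cf1"), Bytes.toBytes("price"), launchTs, Bytes.toBytes("49.95"));

                // Large feed payload sharded into chunk columns S_1, S_2, ...
                byte[] feed = Bytes.toBytes("...large feed payload...");
                for (int i = 0, chunk = 1; i < feed.length; i += CHUNK_BYTES, chunk++) {
                    int len = Math.min(CHUNK_BYTES, feed.length - i);
                    byte[] slice = new byte[len];
                    System.arraycopy(feed, i, slice, 0, len);
                    put.add(Bytes.toBytes("cf2"), Bytes.toBytes("S_" + chunk), launchTs, slice);
                }
                catalog.put(put);

                // Readers bound their reads by "now", so the future-dated cells
                // stay invisible until the launch timestamp passes.
                Get get = new Get(Bytes.toBytes("KEY_0001"));
                get.setTimeRange(0, System.currentTimeMillis());
                System.out.println(catalog.get(get));
                catalog.close();
            }
        }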
  • 16 slave nodes (RS + TT + DN), 8 & 16 GB RAM; 3 master nodes (HM, ZK, JT, NN), 8 GB RAM; NN failover via NFS. 17
  • Block cache: maximize the block cache; hfile.block.cache.size: 0.6. Garbage collection: MSLAB enabled; CMSInitiatingOccupancyFraction tuned. 18
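    Roughly where these knobs live; only the 0.6 block cache fraction is a value quoted on the slide, the rest is illustrative.

        <!-- hbase-site.xml (region server side) -->
        <property>
          <name>hfile.block.cache.size</name>
          <value>0.6</value>   <!-- fraction of RegionServer heap given to the block cache -->
        </property>
        <property>
          <name>hbase.hregion.memstore.mslab.enabled</name>
          <value>true</value>  <!-- "MSLAB enabled" on the slide -->
        </property>

        <!-- hbase-env.sh (illustrative CMS flags; the slide only names
             CMSInitiatingOccupancyFraction):
             export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC \
                 -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"
        -->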
  • Quick recovery on node failure: the default timeouts are too large. ZooKeeper: zookeeper.session.timeout. Region server: hbase.rpc.timeout. Data node: dfs.heartbeat.recheck.interval / heartbeat.recheck.interval. 19
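    The same idea for the timeout properties named on this slide, with illustrative values only (the talk quotes the property names, not the numbers).

        <!-- hbase-site.xml -->
        <property>
          <name>zookeeper.session.timeout</name>
          <value>60000</value>   <!-- ms; much lower than the default of that era -->
        </property>
        <property>
          <name>hbase.rpc.timeout</name>
          <value>30000</value>   <!-- ms -->
        </property>

        <!-- hdfs-site.xml: how quickly the NameNode declares a DataNode dead
             (the slide lists both dfs.heartbeat.recheck.interval and
             heartbeat.recheck.interval) -->
        <property>
          <name>dfs.heartbeat.recheck.interval</name>
          <value>45000</value>   <!-- ms -->
        </property>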
  • Block cache size tuning: watch block cache churn. Hot-row scenarios: perf tests & phased rollouts. Hot-region issues: perf tests & pre-split regions. Filters: CPU intensive – profiling needed. 20
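    A sketch of the "pre-split regions" idea with the 0.9x admin API: create the table with split points drawn from the expected key distribution, so a publish burst is spread across region servers from the start. The table name, families, and split keys are made up for illustration.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.HColumnDescriptor;
        import org.apache.hadoop.hbase.HTableDescriptor;
        import org.apache.hadoop.hbase.client.HBaseAdmin;
        import org.apache.hadoop.hbase.util.Bytes;

        public class PreSplitTable {
            public static void main(String[] args) throws Exception {
                Configuration conf = HBaseConfiguration.create();
                HBaseAdmin admin = new HBaseAdmin(conf);

                HTableDescriptor desc = new HTableDescriptor("catalog");
                desc.addFamily(new HColumnDescriptor("cf1"));
                desc.addFamily(new HColumnDescriptor("cf2"));

                // Split points chosen from the expected row-key distribution.
                byte[][] splits = new byte[][] {
                    Bytes.toBytes("KEY_2000"),
                    Bytes.toBytes("KEY_4000"),
                    Bytes.toBytes("KEY_6000"),
                    Bytes.toBytes("KEY_8000"),
                };
                admin.createTable(desc, splits);
                admin.close();
            }
        }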
  • Monitoring is crucial: go layer by layer to find the bottleneck; use metrics to target optimization & tuning; essential for troubleshooting. Non-uniform hardware: sub-optimal region distribution; hefty boxes end up lightly loaded. 21
  • M/R jobs running on the live cluster have an impact, so they cannot run at full throttle – go easy. Feature enablement – phase it in: don't turn on several features together; that makes it easier to identify potential hot regions/rows, overloaded region servers, etc. 22
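    A sketch of a deliberately gentle table scan for an MR job on the live cluster: modest scanner caching and no block-cache pollution, so website reads keep their hot blocks. Class and table names are illustrative; the mapper body is left empty.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.client.Scan;
        import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
        import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
        import org.apache.hadoop.hbase.mapreduce.TableMapper;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

        public class CatalogSyncJob {
            static class SyncMapper extends TableMapper<NullWritable, NullWritable> {
                @Override
                protected void map(ImmutableBytesWritable row, Result value, Context ctx) {
                    // ... reconcile / sync logic would go here ...
                }
            }

            public static void main(String[] args) throws Exception {
                Configuration conf = HBaseConfiguration.create();
                Job job = new Job(conf, "catalog-sync");
                job.setJarByClass(CatalogSyncJob.class);

                Scan scan = new Scan();
                scan.setCaching(100);          // small batches: less pressure per RPC
                scan.setCacheBlocks(false);    // don't evict the website's hot blocks

                TableMapReduceUtil.initTableMapperJob(
                    "catalog", scan, SyncMapper.class,
                    NullWritable.class, NullWritable.class, job);
                job.setNumReduceTasks(0);
                job.setOutputFormatClass(NullOutputFormat.class);
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }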
  • [Diagram: enabling feature “A” adds an additional “N” req/sec and feature “B” an additional “K” req/sec of incoming requests, so backend services send a lot more requests to the HBase cluster, on top of the inventory, pricing, and item update mutations.] Enable features individually to measure impact and tune the cluster accordingly. 23
  • Search: no out-of-the-box secondary indexes; custom solution with Solr. Transactions: only row-level atomicity, but you can't pack everything into a single row; atomic cross-row Put/Delete and HBASE-5229 look like potential partial solutions (0.94+). 24
  • Orthogonal access patterns: optimize for the most frequently used pattern. Filters may suffice, with early-out configurations, but they impact CPU usage. Duplicating data for every access pattern is too drastic – too much effort to keep all copies in sync. 25
  • Rebuild from source data: takes time … but no data loss. Export/import-based backups: faster … but stale, and another MR job on the live cluster. Better options in future releases … 26
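    For reference, the stock Export/Import MapReduce tools that ship with HBase can be driven like this (table name and HDFS paths are illustrative); being MR jobs, they compete with live traffic like any other job on the cluster.

        # dump a table to HDFS, and restore it into an existing table later
        hbase org.apache.hadoop.hbase.mapreduce.Export catalog /backups/catalog-2012-05-22
        hbase org.apache.hadoop.hbase.mapreduce.Import catalog /backups/catalog-2012-05-22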
  • We’re hiring! http://www.gapinc.com 27