HBaseCon 2012 | Gap Inc Direct: Serving Apparel Catalog from HBase for Live Website

Cloudera, Inc.
HBaseCon 2012

Applications Track – Case Study




                                  1
   Suraj Varma
     Director of Technology Implementation
     Gap Inc Direct (GID), San Francisco, CA
     IRC: svarma


   Gupta Gogula
     Director-IT & Domain Architect of Catalog
      Management & Distribution
     Gap Inc Direct (GID), San Francisco, CA
                                                  2
   Problem Domain

   HBase Schema Specifics

   HBase Cluster Specifics

   Learning & Challenges


                              3
   [Figure: 2005-2010 timeline of site growth (Piperlime, Universality, new Athleta site launch, CA and EU markets). Incoming traffic is routed to separate application servers and databases per market: US, CA, EU.]
                                                        4
   Evolution of the GID Apparel Catalog
     2005 - Three independent brands in US
     2010 - Five integrated brands in US, CA, EU


   Rapid Expansion of Apparel Catalog

   However, each brand / market combination
    necessitated separate logical catalog
    databases
                                                 5
   Single Catalog store for all brands/markets
     Horizontally scalable over time
     Cross brand business features

   Access data store directly
     To get up-to-date inventory awareness of items

   Minimal Caching – only for optimization
     Keeping caches in sync is a problem.

   Highly Available
                                                  6
   Sharded RDBMS, Memcached, etc.
     Significant effort was required
     Still had scalability limits


   Non-relational alternatives considered

   HBase POC (early-2010)
     Promising results - decided to move ahead


                                                 7
   Strong Consistency Model
   Server Side Filters
   Automatic Sharding, Distribution, Failover
   Hadoop Integration out of the box

   General Purpose
   Other use cases outside of Catalog

   Strong Community!
                                                 8
   [Figure: incoming requests for catalog data flow through backend services to the HBase cluster. Near-real-time inventory updates, pricing updates, and item updates reach the cluster as mutations.]
                                                                  9
   Read Mostly
     Website Traffic
     Sync MR Jobs

   Write / Delete Bursts
     Catalog Publish
       Phase out to near real-time updates from originating systems
     MR jobs on Live Cluster

   Continuous Writes
     Inventory Updates

                                                            10
   Hierarchical Data (Primarily)
     SKU -> Style Lookups (child -> parent)
     Cross Brand Sell (sibling <-> sibling)

   Rows
     100KB avg size
     1000-5000 cols
     Sparse rows

   Data Access Patterns
     Full Product Graph in one read
     Single path of graph from root to leaf node
     Search - Secondary Indices
     Large Feed files

                                                             11
   [Figure: two read patterns on the product graph: read full graph, and read single path / edge.]



                                       12
   Built custom “bean to schema mapper”
     POJO graph <-> HBase qualifiers
     Flexibility to shorten column qualifiers
     Flexibility to change schema qualifiers (per environment / developer)

     <…>
     <association>one-to-many</association>
     <prefix>SC</prefix>
     <uniqueId>colorCd</uniqueId>
     <beanName>styleColorBean</beanName>
     <…>
                                                     13
   PP_<id1>_QQ_<id2>_RR_<id3>_name
     Where PP is parent, QQ is child, RR is grandchild


Pattern: ANCESTOR IDS EMBEDDED IN QUALIFIER NAME

cf1:VAR_1_SC_0012_colorCd
cf2:VAR_1_SC_0012_SCIMG_10_path
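
The qualifier pattern above can be sketched in a few lines of Python. This is only an illustration of the naming scheme, not the mapper described in the talk: the function names are hypothetical, and it assumes that ids and attribute names contain no underscores, which holds for the two example qualifiers shown.

```python
def node_prefix(path):
    """Join (prefix, id) pairs, root first, into 'PP_<id1>_QQ_<id2>_'."""
    return "".join(f"{prefix}_{node_id}_" for prefix, node_id in path)


def build_qualifier(path, name):
    """Build a column qualifier with all ancestor ids embedded in it."""
    return node_prefix(path) + name


def parse_qualifier(qualifier):
    """Recover the ancestor path and attribute name from a qualifier.

    Assumption: ids and attribute names never contain '_'.
    """
    *pairs, name = qualifier.split("_")
    if len(pairs) % 2 != 0:
        raise ValueError(f"not a PP_<id>_..._name qualifier: {qualifier!r}")
    path = list(zip(pairs[0::2], pairs[1::2]))
    return path, name


if __name__ == "__main__":
    style_color = [("VAR", "1"), ("SC", "0012")]
    print(build_qualifier(style_color, "colorCd"))
    print(build_qualifier(style_color + [("SCIMG", "10")], "path"))
```

Because every qualifier carries its full ancestry, a single cell can be placed in the product graph without reading any other column.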



                                                          14
   Secondary Index
     <id3> => RR ; QQ ; PP
     FilterList with (RR, QQ, PP) ids to get thin slice path


Pattern: SECONDARY INDEX TO HIERARCHICAL ANCESTORS
      KEY_5555              4444     333       22          1
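
A rough in-memory sketch of the lookup described above: the index maps a leaf id to its ancestor chain, and the chain is turned into qualifier prefixes that select only the columns along that one path. The dictionaries stand in for HBase rows and for the server-side FilterList; the index layout, the sample row, and all function names are assumptions made for illustration. It also assumes attribute names contain no underscores.

```python
# Hypothetical index row: leaf id -> ancestor chain, root first.
SECONDARY_INDEX = {
    "10": [("VAR", "1"), ("SC", "0012"), ("SCIMG", "10")],
}


def path_prefixes(chain):
    """Return one qualifier prefix per node on the root-to-leaf path."""
    prefixes = []
    current = ""
    for prefix, node_id in chain:
        current += f"{prefix}_{node_id}_"
        prefixes.append(current)
    return prefixes


def thin_slice(row, leaf_id, index=SECONDARY_INDEX):
    """Keep only the columns owned by nodes on the path to leaf_id.

    A column belongs to a node when it is that node's prefix followed
    by a bare attribute name (assumed to contain no '_').
    """
    prefixes = path_prefixes(index[leaf_id])
    return {
        qualifier: value
        for qualifier, value in row.items()
        if any(
            qualifier.startswith(p) and "_" not in qualifier[len(p):]
            for p in prefixes
        )
    }


if __name__ == "__main__":
    # Illustrative row; the values are made up.
    row = {
        "VAR_1_name": "tee",
        "VAR_1_SC_0012_colorCd": "blue",
        "VAR_1_SC_0013_colorCd": "red",
        "VAR_1_SC_0012_SCIMG_10_path": "/img/a.jpg",
        "VAR_1_SC_0012_SCIMG_11_path": "/img/b.jpg",
    }
    for qualifier in thin_slice(row, "10"):
        print(qualifier)
```

Sibling branches (here the second color and the second image) are dropped, so the result is one path through the graph rather than the whole row.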




                                                               15
   “Publish at Midnight”
     Future Dated PUTs
     Get/Scan with time range
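
The idea behind future-dated PUTs can be modeled with a toy versioned cell: a write carries the timestamp at which it should become visible, and readers ask only for versions up to the current time. The class below is a stand-in written for this sketch, not the HBase client API; it mirrors the half-open [min, max) semantics of HBase time ranges. Timestamps are plain integers here.

```python
class VersionedCell:
    """Toy model of one cell holding several timestamped versions."""

    def __init__(self):
        self._versions = {}  # timestamp -> value

    def put(self, value, timestamp):
        """Store a version; the timestamp may lie in the future."""
        self._versions[timestamp] = value

    def get(self, min_ts, max_ts):
        """Return the newest value with min_ts <= timestamp < max_ts."""
        visible = [ts for ts in self._versions if min_ts <= ts < max_ts]
        return self._versions[max(visible)] if visible else None


def read_as_of(cell, now):
    """Read what is 'published' at time `now`, ignoring future versions."""
    return cell.get(0, now + 1)


if __name__ == "__main__":
    price = VersionedCell()
    price.put("current price", timestamp=100)
    price.put("midnight price", timestamp=200)  # future-dated write
    print(read_as_of(price, 150))
    print(read_as_of(price, 200))
```

The time range matters because a read without one returns the newest version, which would expose the future-dated value early.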


   Large Feed Files
     Sharded into smaller chunks < 2MB per cell

Pattern: SHARDED CHUNKS
     KEY_nnnn             S_1      S_2     S_3     S_4
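
A minimal sketch of the sharded-chunks pattern, assuming the feed file is available as bytes and that chunks are stored under qualifiers S_1, S_2, ... as in the layout above. The function names and the exact size limit are assumptions for illustration; the slide only states that each cell stays under 2MB.

```python
MAX_CELL_BYTES = 2 * 1024 * 1024 - 1  # stay strictly under 2 MB per cell


def shard_feed(data, max_cell_bytes=MAX_CELL_BYTES):
    """Split a feed file into ordered chunks keyed S_1, S_2, ..."""
    if max_cell_bytes <= 0:
        raise ValueError("max_cell_bytes must be positive")
    return {
        f"S_{i + 1}": data[offset:offset + max_cell_bytes]
        for i, offset in enumerate(range(0, len(data), max_cell_bytes))
    }


def reassemble(cells):
    """Concatenate chunks in numeric order (S_10 must follow S_9, not S_1)."""
    ordered = sorted(cells, key=lambda qualifier: int(qualifier.split("_")[1]))
    return b"".join(cells[qualifier] for qualifier in ordered)


if __name__ == "__main__":
    feed = bytes(range(10))
    cells = shard_feed(feed, max_cell_bytes=4)  # tiny limit for the demo
    print(list(cells))
    print(reassemble(cells) == feed)
```

Sorting by the numeric suffix rather than by the qualifier string is the one detail that is easy to get wrong once a feed needs ten or more chunks.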

                                                         16
   16 Slave (RS + TT + DN) Nodes
     8 & 16 GB RAM


   3 Master (HM, ZK, JT, NN) Nodes
     8 GB RAM


   NN Failover via NFS


                                    17
   Block Cache
     Maximize Block Cache
     hfile.block.cache.size: 0.6


   Garbage Collection
     MSLAB enabled
     CMSInitiatingOccupancyFraction



                                     18
   Quick Recovery on node failure
     Default timeouts too large
     zookeeper.session.timeout

   Region Server
     hbase.rpc.timeout

   Data Node
     dfs.heartbeat.recheck.interval
     heartbeat.recheck.interval

                                       19
   Block Cache Size Tuning
     Block Cache Churn

   Hot Row scenarios
     Perf Tests & Doing Phased Rollouts

   Hot Region issues
     Perf Tests & Pre-split Regions.

   Filters
     CPU Intensive – profiling needed.

                                           20
   Monitoring is crucial
     Layer by layer -> what’s the bottleneck
     Metrics to target optimization & tuning
     Troubleshooting


   Non Uniform Hardware
     Sub-optimal region distribution
     Hefty boxes lightly loaded.

                                                21
   M/R Jobs running on live cluster
     Has an impact – so cannot run full throttle
     Go easy …


   Feature Enablement – Phase in
     Don’t turn on several features together
     Easier identification of potential hot regions /
      rows, overloaded RS, etc

                                                         22
   [Figure: the same request flow (incoming requests -> backend services -> HBase cluster, with inventory, pricing, and item updates) under added load. Feature “A” enabled adds “N” req / sec and feature “B” enabled adds “K” req / sec, so the backend services send a lot more requests to the cluster.]

 Enable Features individually to measure impact and tune cluster accordingly
                                                                               23
   Search
     No out-of-the-box secondary indexes.
     Custom solution with Solr


   Transactions
     Only row level atomicity
     But … can’t pack all in a single row
     Atomic Cross-Row Put/Delete and HBASE-5229
     look like potential partial solutions (0.94+)
                                                   24
   Orthogonal access patterns
     Optimize for most frequently used pattern.

   Filters
     May suffice, with early out configurations
     Impacts CPU usage

   Duplicate data for every access pattern
     Too drastic
     Effort to keep all copies in sync

                                                   25
   Rebuild from source data
     Takes time … but no data loss


   Export / import based backups
     Faster … but stale
     Another MR on live cluster


   Better options in future releases …

                                          26
We’re hiring!
    http://www.gapinc.com




                            27
Combining Orchestration and Choreography for a Clean Architecture
ThomasHeinrichs168 views
Micron CXL product and architecture update by CXL Forum
Micron CXL product and architecture updateMicron CXL product and architecture update
Micron CXL product and architecture update
CXL Forum27 views
Empathic Computing: Delivering the Potential of the Metaverse by Mark Billinghurst
Empathic Computing: Delivering  the Potential of the MetaverseEmpathic Computing: Delivering  the Potential of the Metaverse
Empathic Computing: Delivering the Potential of the Metaverse
Mark Billinghurst449 views
Spesifikasi Lengkap ASUS Vivobook Go 14 by Dot Semarang
Spesifikasi Lengkap ASUS Vivobook Go 14Spesifikasi Lengkap ASUS Vivobook Go 14
Spesifikasi Lengkap ASUS Vivobook Go 14
Dot Semarang35 views
PharoJS - Zürich Smalltalk Group Meetup November 2023 by Noury Bouraqadi
PharoJS - Zürich Smalltalk Group Meetup November 2023PharoJS - Zürich Smalltalk Group Meetup November 2023
PharoJS - Zürich Smalltalk Group Meetup November 2023
Noury Bouraqadi113 views
Future of Learning - Yap Aye Wee.pdf by NUS-ISS
Future of Learning - Yap Aye Wee.pdfFuture of Learning - Yap Aye Wee.pdf
Future of Learning - Yap Aye Wee.pdf
NUS-ISS38 views
[2023] Putting the R! in R&D.pdf by Eleanor McHugh
[2023] Putting the R! in R&D.pdf[2023] Putting the R! in R&D.pdf
[2023] Putting the R! in R&D.pdf
Eleanor McHugh38 views
Astera Labs: Intelligent Connectivity for Cloud and AI Infrastructure by CXL Forum
Astera Labs:  Intelligent Connectivity for Cloud and AI InfrastructureAstera Labs:  Intelligent Connectivity for Cloud and AI Infrastructure
Astera Labs: Intelligent Connectivity for Cloud and AI Infrastructure
CXL Forum125 views
Business Analyst Series 2023 - Week 3 Session 5 by DianaGray10
Business Analyst Series 2023 -  Week 3 Session 5Business Analyst Series 2023 -  Week 3 Session 5
Business Analyst Series 2023 - Week 3 Session 5
DianaGray10165 views
TE Connectivity: Card Edge Interconnects by CXL Forum
TE Connectivity: Card Edge InterconnectsTE Connectivity: Card Edge Interconnects
TE Connectivity: Card Edge Interconnects
CXL Forum96 views
The Importance of Cybersecurity for Digital Transformation by NUS-ISS
The Importance of Cybersecurity for Digital TransformationThe Importance of Cybersecurity for Digital Transformation
The Importance of Cybersecurity for Digital Transformation
NUS-ISS25 views
Beyond the Hype: What Generative AI Means for the Future of Work - Damien Cum... by NUS-ISS
Beyond the Hype: What Generative AI Means for the Future of Work - Damien Cum...Beyond the Hype: What Generative AI Means for the Future of Work - Damien Cum...
Beyond the Hype: What Generative AI Means for the Future of Work - Damien Cum...
NUS-ISS28 views
JCon Live 2023 - Lice coding some integration problems by Bernd Ruecker
JCon Live 2023 - Lice coding some integration problemsJCon Live 2023 - Lice coding some integration problems
JCon Live 2023 - Lice coding some integration problems
Bernd Ruecker67 views
"Role of a CTO in software outsourcing company", Yuriy Nakonechnyy by Fwdays
"Role of a CTO in software outsourcing company", Yuriy Nakonechnyy"Role of a CTO in software outsourcing company", Yuriy Nakonechnyy
"Role of a CTO in software outsourcing company", Yuriy Nakonechnyy
Fwdays40 views
MemVerge: Gismo (Global IO-free Shared Memory Objects) by CXL Forum
MemVerge: Gismo (Global IO-free Shared Memory Objects)MemVerge: Gismo (Global IO-free Shared Memory Objects)
MemVerge: Gismo (Global IO-free Shared Memory Objects)
CXL Forum112 views
Architecting CX Measurement Frameworks and Ensuring CX Metrics are fit for Pu... by NUS-ISS
Architecting CX Measurement Frameworks and Ensuring CX Metrics are fit for Pu...Architecting CX Measurement Frameworks and Ensuring CX Metrics are fit for Pu...
Architecting CX Measurement Frameworks and Ensuring CX Metrics are fit for Pu...
NUS-ISS32 views
Microchip: CXL Use Cases and Enabling Ecosystem by CXL Forum
Microchip: CXL Use Cases and Enabling EcosystemMicrochip: CXL Use Cases and Enabling Ecosystem
Microchip: CXL Use Cases and Enabling Ecosystem
CXL Forum129 views

HBaseCon 2012 | Gap Inc Direct: Serving Apparel Catalog from HBase for Live Website

  • 2. Suraj Varma: Director of Technology Implementation, Gap Inc Direct (GID), San Francisco, CA; IRC: svarma. Gupta Gogula: Director-IT & Domain Architect of Catalog Management & Distribution, Gap Inc Direct (GID), San Francisco, CA.
  • 3. Problem Domain; HBase Schema Specifics; HBase Cluster Specifics; Learning & Challenges.
  • 4. [Diagram: timeline 2005-2010 in which INCOMING TRAFFIC, APPLICATION SERVERS and per-market DATABASES (US, CA, EU) multiply with each milestone. Labels: NEW SITE LAUNCH, PIPERLIME, UNIVERSALITY, ATHLETA, CA & EU MARKETS.]
  • 5. Evolution of the GID Apparel Catalog: 2005, three independent brands in US; 2010, 5 integrated brands in US, CA, EU. Rapid Expansion of Apparel Catalog. However, each brand / market combination necessitated separate logical catalog databases.
  • 6. Single Catalog store for all brands/markets: Horizontally scalable over time; Cross brand business features. Access data store directly: To avail of inventory awareness of items. Minimal Caching, only for optimization: Keeping caches in sync is a problem. Highly Available.
  • 7. Sharded RDBMS, MemCached, etc.: Significant effort was required; Still had scalability limits. Non-relational alternatives considered: HBase POC (early 2010); Promising results, decided to move ahead.
  • 8. Strong Consistency Model. Server Side Filters. Automatic Sharding, Distribution, Failover. Hadoop Integration out of the box. General Purpose: Other use cases outside of Catalog. Strong Community!
  • 9. [Diagram: INCOMING REQUESTS -> BACKEND SERVICES -> REQUESTS FOR CATALOG DATA -> HBASE CLUSTER. NEAR REAL TIME INVENTORY UPDATES, PRICING UPDATES and ITEM UPDATES reach the cluster as MUTATIONS.]
  • 10. Read Mostly: Website Traffic; Sync MR Jobs. Write / Delete Bursts: Catalog Publish (Phase out to near real-time updates from originating systems); MR jobs on Live Cluster. Continuous Writes: Inventory Updates.
  • 11. Hierarchical Data (Primarily): SKU -> Style Lookups (child -> parent); Cross Brand Sell (sibling <-> sibling). Data Access Patterns: Full Product Graph in one read; Single path of graph from root to leaf node; Search (Secondary Indices); Large Feed files. Rows: 100 KB avg size, 1000-5000 cols, sparse rows.
  • 12. [Diagram: READ FULL GRAPH vs. READ SINGLE PATH / EDGE]
  • 13. Built custom “bean to schema mapper”: POJO graph <-> HBase qualifiers; Flexibility to shorten column qualifiers; Flexibility to change schema qualifiers (per environment / developer). Mapping excerpt: <…> <association>one-to-many</association> <prefix>SC</prefix> <uniqueId>colorCd</uniqueId> <beanName>styleColorBean</beanName> <…>
  • 14. Pattern: ANCESTOR IDS EMBEDDED IN QUALIFIER NAME. <PP>_<id1>_QQ_<id2>_RR_<id3>_name, where PP is parent, QQ is child, RR is grandchild. Examples: cf1:VAR_1_SC_0012_colorCd; cf2:VAR_1_SC_0012_SCIMG_10_path.
  • 15. Pattern: SECONDARY INDEX TO HIERARCHICAL ANCESTORS. Secondary Index: <id3> => RR ; QQ ; PP. FilterList with (RR, QQ, PP) ids to get thin slice path. Diagram labels: KEY_5555 4444 333 22 1.
  • 16. “Publish at Midnight”: Future Dated PUTs; Get/Scan with time range. Large Feed Files: Sharded into smaller chunks, < 2 MB per cell. Pattern: SHARDED CHUNKS (KEY_nnnn: S_1 S_2 S_3 S_4).
  • 17. 16 Slave (RS + TT + DN) Nodes: 8 & 16 GB RAM. 3 Master (HM, ZK, JT, NN) Nodes: 8 GB RAM. NN Failover via NFS.
  • 18. Block Cache: Maximize Block Cache; hfile.block.cache.size: 0.6. Garbage Collection: MSLAB enabled; CMSInitiatingOccupancyFraction.
  • 19. Quick Recovery on node failure: Default timeouts too large; zookeeper.session.timeout. Region Server: hbase.rpc.timeout. Data Node: dfs.heartbeat.recheck.interval; heartbeat.recheck.interval.
  • 20. Block Cache Size Tuning: Block Cache Churn. Hot Row scenarios: Perf Tests & Doing Phased Rollouts. Hot Region issues: Perf Tests & Pre-split Regions. Filters: CPU Intensive, profiling needed.
  • 21. Monitoring is crucial: Layer by layer -> what’s the bottleneck; Metrics to target optimization & tuning; Troubleshooting. Non Uniform Hardware: Sub-optimal region distribution; Hefty boxes lightly loaded.
  • 22. M/R Jobs running on live cluster: Has an impact, so cannot run full throttle; Go easy … Feature Enablement, Phase in: Don’t turn on several features together; Easier identification of potential hot regions / rows, overloaded RS, etc.
  • 23. [Diagram: INCOMING REQUESTS -> BACKEND SERVICES -> LOT MORE REQUESTS -> HBASE CLUSTER, alongside INVENTORY UPDATES, PRICING UPDATES and ITEM UPDATES. FEATURE “A” ENABLED: ADDITIONAL “N” REQ / SEC. FEATURE “B” ENABLED: ADDITIONAL “K” REQ / SEC.] Enable Features individually to measure impact and tune cluster accordingly.
  • 24. Search: No out-of-the-box secondary indexes; Custom solution with Solr. Transactions: Only row level atomicity; But … can’t pack all in a single row; Atomic Cross-Row Put/Delete and HBASE-5229 look like potential partial solutions (0.94+).
  • 25. Orthogonal access patterns: Optimize for most frequently used pattern. Filters: May suffice, with early out configurations; Impacts CPU usage. Duplicate data for every access pattern: Too drastic; Effort to keep all copies in sync.
  • 26. Rebuild from source data: Takes time … but no data loss. Export / import based backups: Faster … but stale; Another MR on live cluster. Better options in future releases …
  • 27. We’re hiring! http://www.gapinc.com
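
The qualifier naming pattern of slide 14 and the "thin slice" read of slide 15 can be illustrated with a small in-memory sketch. This is not the authors' mapper: `build_qualifier`, `path_prefixes` and `thin_slice` are hypothetical names, the row is modeled as a plain dict instead of an HBase `Result`, and the sketch assumes attribute names contain no underscore. The Python filter only stands in for the server-side FilterList mentioned on the slide; which filters were actually combined is not stated.

```python
from typing import Dict, List, Sequence, Tuple

# A path is a list of (prefix, id) pairs from root to leaf,
# e.g. [("VAR", "1"), ("SC", "0012"), ("SCIMG", "10")].
Path = Sequence[Tuple[str, str]]


def build_qualifier(path: Path, name: str) -> str:
    """Embed every ancestor prefix and id in the qualifier, then the attribute name."""
    parts: List[str] = []
    for prefix, node_id in path:
        parts.extend([prefix, node_id])
    parts.append(name)
    return "_".join(parts)


def path_prefixes(path: Path) -> List[str]:
    """Return one qualifier prefix per level of the path, root first."""
    prefixes: List[str] = []
    tokens: List[str] = []
    for prefix, node_id in path:
        tokens.extend([prefix, node_id])
        prefixes.append("_".join(tokens) + "_")
    return prefixes


def thin_slice(cells: Dict[str, bytes], path: Path) -> Dict[str, bytes]:
    """Keep only the attributes of the nodes that lie on the given path.

    Assumption: attribute names contain no underscore, so a qualifier
    belongs to a node when exactly one token follows that node's prefix.
    """
    selected: Dict[str, bytes] = {}
    for qualifier, value in cells.items():
        for prefix in path_prefixes(path):
            rest = qualifier[len(prefix):]
            if qualifier.startswith(prefix) and rest and "_" not in rest:
                selected[qualifier] = value
                break
    return selected
```

In use, a secondary index entry (slide 15) would map a leaf id such as `"10"` to its ancestor path, and that path would then be handed to `thin_slice` so that siblings like color `0013` or image `11` are skipped.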
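
The "sharded chunks" pattern of slide 16 (large feed files split into cells below 2 MB, stored as S_1, S_2, ... under one row key) might look roughly like the following. The function names and the exact byte limit are assumptions made for illustration; the slides only state "< 2MB per cell".

```python
from typing import Dict

# Assumed limit: stay strictly below 2 MB per cell, as the slide suggests.
MAX_CELL_BYTES = 2 * 1024 * 1024 - 1


def shard_feed(payload: bytes, max_cell_bytes: int = MAX_CELL_BYTES) -> Dict[str, bytes]:
    """Split a feed file into qualifier -> chunk pairs named S_1, S_2, ..."""
    if max_cell_bytes <= 0:
        raise ValueError("max_cell_bytes must be positive")
    chunks: Dict[str, bytes] = {}
    offsets = range(0, len(payload), max_cell_bytes)
    for number, start in enumerate(offsets, start=1):
        chunks[f"S_{number}"] = payload[start:start + max_cell_bytes]
    return chunks


def reassemble(chunks: Dict[str, bytes]) -> bytes:
    """Concatenate chunks in numeric order (S_10 must come after S_9, not after S_1)."""
    ordered = sorted(chunks, key=lambda qualifier: int(qualifier.split("_")[1]))
    return b"".join(chunks[qualifier] for qualifier in ordered)
```

Sorting numerically matters because qualifiers sort lexicographically in HBase, so a reader that relies on column order would see S_10 before S_2 unless the chunk numbers are zero-padded or re-sorted on the client.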
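
The "Publish at Midnight" idea of slide 16 (future-dated PUTs that become visible once reads use a time range ending at the current time) can be sketched without a cluster. The model below is a simplification: one cell is a dict of timestamp -> value, and `put` / `get_as_of` are hypothetical helpers, not HBase client calls. It assumes the half-open `[min, max)` time-range semantics that HBase's Get/Scan time ranges use.

```python
from typing import Dict, Optional


def put(versions: Dict[int, bytes], timestamp_ms: int, value: bytes) -> None:
    """Store a version under an explicit timestamp, which may lie in the future."""
    versions[timestamp_ms] = value


def get_as_of(versions: Dict[int, bytes], now_ms: int) -> Optional[bytes]:
    """Return the newest version inside the time range [0, now_ms + 1)."""
    visible = [ts for ts in versions if 0 <= ts < now_ms + 1]
    if not visible:
        return None
    return versions[max(visible)]
```

A catalog publish could then write tomorrow's data ahead of time with a timestamp of midnight; readers keep getting the current version until their clock passes that timestamp, with no write burst at the switch-over moment.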