SlideShare a Scribd company logo
1 of 13
HBase powered Merchant Lookup
Service at Intuit
Vrushali Channapattan, Intuit
Lightning Talk @ HBaseCon2012 (May 22nd, 2012)
About Intuit




              Intuit is a leader in this trend
              because we are entrusted with the
              collective data of our 50 million
              customers.
2
Problem: Duplicate Merchants

                   Company ABC                                  Company PQR

name: The Windsor Press, Inc.              name: The Windsor Press
address: PO Box 465 6 North Third Street   address: P.O. Box 465 6 North 3rd St.
city: Hamburg                              city: Hamburg
state: PA                                  state: PA
zip: 19526                                 zip: 19526-0465
phone: (610) 562-2267                      phone: (610) 562-2267




                Both of the above vendor records map to the D&B business:

                ID: 002114902
                Name: The Windsor-Press Inc
                Street: 6 N 3rd St
                City: Hamburg
   Dun &        State: PA
   Bradstreet   Zip: 19526-1502
                Phone: (610)-562-2267
Applications of Merchant Lookup
Applications of Merchant Lookup
Backend Architecture
      Input                                                   Applications
                         Loader
      Data

                                                   Internal
                                                   Research
                                                   Projects



                                              Update
                                                         Merchant
                                                          Splicer

         Full table                    Final
         Scan                          Match Score




     Name        Phone      Address
                                      Individual               Score
                                      Matcher
                                      Scores                  Combiner

    Various
    Matchers


6
Data Model -Tables in HBase
    Merchants
       Master dataset of merchants
    Sangria_id
      Unique id generation coordination across mapper processes
    Duplicates
      Noting duplicate merchants after deduplication
    SnapshotMerchants
      Merging into master dataset
    NewMerchants
      The new merchant set that is to be added to the master data set of
       merchants



7
Schema
    Merchants
      Row key    Info (column family)                Mapping (column
                                                     family)
      25204939   name:Crepevine                      sourcename:10000048,
                 street:367 University Avenue        10000075
                 city:Palo Alto
                 state:CA
                 zip:94031
                 county:Santa Clara County
                 country: United States of America
                 website:www.crepevine.com
                 phoneNumber:16503233900
                 latitude:37.430211
                 longitude:-122.098221
                 source:internet
                 mint_category:Food & Dining
                 qbo_category:Restaurants
                 NAICS:722110
                 SIC:5182



8
Schema
     Sangria_id
       Row key     Info (column family)
       default     seed:30000
                   comment:initial seed by vc of 1000

       qbo         seed:20550000
                   comment:initial seed by kf of 20000000


    Duplicates
       Row key               Info (column family)

       10000043              25204921:0.998

       10000048              25204939:0.78

       10000075              25204939:0.95

9
Optimizations (job level)
     • For Hadoop jobs interfacing with HBase, used TableMapReduceUtil
     • Emitted a ‘put’ from Mapper or Reducer instead of a regular htable put
       – Use context.write(rowKey,put)
     • To make the full table scan faster (hbase read only hadoop jobs – deduping
       matchers , Solr index generator)
        scan.setCaching(500);
        scan.setCacheBlocks(false);
     • Used Customized TableInputFormat while scanning (custom number of
       splits for map tasks)
       job.setInputFormatClass(CustomizedTableInputFormat.class);
       extends TableInputFormat class and overriding getSplits
        method



10
Optimizations (code level)
     • Storing frequently used column family and column names as byte arrays in a
       public interface
       public static final byte[] COLUMN_NAME =
        Bytes.toBytes("name");
       public static final byte[] COLUMN_FAMILY_INFO =
        Bytes.toBytes("info");
     • Utility class for getting values from hbase.client.Result
       HBaseUtils.getColumnValue(result, COLUMN_FAMILY_INFO,
        COLUMN_NAME));
       public static String getColumnValue(Result result, byte[]
         type, byte[] columnName) {
             return Bytes.toString(result.getValue(type,
     columnName));
             }
     • Writing a sample set of 31 million records into the HBase cluster changed
       from 4 hours 37 mins 47 secs to 32 mins, 18 seconds
11
Thank You!
     Vrushali Channapattan, Intuit Data Group (BIO)
     vrushali_channapattan@intuit.com




12
Schema
      SnapshotMerchants

        Row key    Info (column family)
        merge      first:1336813613
                   start:1337029113
                   end:1337120100
                   comments:merging qbo against dandb
                   merchants initiated on May 14th 2012
                   outcome:started (or) merge run successful



     NewMerchants- same as Merchants




13

More Related Content

What's hot

BIG DATA: From mammoth to elephant
BIG DATA: From mammoth to elephantBIG DATA: From mammoth to elephant
BIG DATA: From mammoth to elephantRoman Nikitchenko
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumTathastu.ai
 
Bringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on HadoopBringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on HadoopDataWorks Summit
 
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010Jonathan Seidman
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCCloudera, Inc.
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Sumeet Singh
 
Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1Abbas Maazallahi
 
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon
 
Benchmarking Apache Druid
Benchmarking Apache Druid Benchmarking Apache Druid
Benchmarking Apache Druid Matt Sarrel
 
Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keownCisco Canada
 
Building Data Applications with Apache Druid
Building Data Applications with Apache DruidBuilding Data Applications with Apache Druid
Building Data Applications with Apache DruidImply
 
Net flowhadoop flocon2013_yhlee_final
Net flowhadoop flocon2013_yhlee_finalNet flowhadoop flocon2013_yhlee_final
Net flowhadoop flocon2013_yhlee_finalYeounhee Lee
 
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...HBaseCon
 
Apache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use CasesApache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use CasesData Con LA
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Vinoth Chandar
 

What's hot (20)

Hadoop and HBase @eBay
Hadoop and HBase @eBayHadoop and HBase @eBay
Hadoop and HBase @eBay
 
BIG DATA: From mammoth to elephant
BIG DATA: From mammoth to elephantBIG DATA: From mammoth to elephant
BIG DATA: From mammoth to elephant
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
 
Bringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on HadoopBringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on Hadoop
 
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010
Using Hadoop and Hive to Optimize Travel Search , WindyCityDB 2010
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
 
Understanding hdfs
Understanding hdfsUnderstanding hdfs
Understanding hdfs
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
 
Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1
 
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase
 
Benchmarking Apache Druid
Benchmarking Apache Druid Benchmarking Apache Druid
Benchmarking Apache Druid
 
Introduction to Pig
Introduction to PigIntroduction to Pig
Introduction to Pig
 
Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keown
 
Building Data Applications with Apache Druid
Building Data Applications with Apache DruidBuilding Data Applications with Apache Druid
Building Data Applications with Apache Druid
 
Net flowhadoop flocon2013_yhlee_final
Net flowhadoop flocon2013_yhlee_finalNet flowhadoop flocon2013_yhlee_final
Net flowhadoop flocon2013_yhlee_final
 
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
 
Apache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use CasesApache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use Cases
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 
Big Data Journey
Big Data JourneyBig Data Journey
Big Data Journey
 

Viewers also liked

HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaHBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaCloudera, Inc.
 
Case studies session 2
Case studies   session 2Case studies   session 2
Case studies session 2HBaseCon
 
Bulk Loading in the Wild: Ingesting the World's Energy Data
Bulk Loading in the Wild: Ingesting the World's Energy DataBulk Loading in the Wild: Ingesting the World's Energy Data
Bulk Loading in the Wild: Ingesting the World's Energy DataHBaseCon
 
HBaseCon 2012 | Storing and Manipulating Graphs in HBase
HBaseCon 2012 | Storing and Manipulating Graphs in HBaseHBaseCon 2012 | Storing and Manipulating Graphs in HBase
HBaseCon 2012 | Storing and Manipulating Graphs in HBaseCloudera, Inc.
 
Design Patterns for Building 360-degree Views with HBase and Kiji
Design Patterns for Building 360-degree Views with HBase and KijiDesign Patterns for Building 360-degree Views with HBase and Kiji
Design Patterns for Building 360-degree Views with HBase and KijiHBaseCon
 
HBaseCon 2012 | Orchestrating Clusters with Ironfan and Chef - Runa
HBaseCon 2012 | Orchestrating Clusters with Ironfan and Chef - RunaHBaseCon 2012 | Orchestrating Clusters with Ironfan and Chef - Runa
HBaseCon 2012 | Orchestrating Clusters with Ironfan and Chef - RunaCloudera, Inc.
 
Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance TuningLars Hofhansl
 
Beautiful Seams: The Intuit Design System
Beautiful Seams: The Intuit Design SystemBeautiful Seams: The Intuit Design System
Beautiful Seams: The Intuit Design Systemuxpin
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Casesnzhang
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadooproyans
 
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!Cloudera, Inc.
 
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...Cloudera, Inc.
 
HBaseCon 2012 | Building Mobile Infrastructure with HBase
HBaseCon 2012 | Building Mobile Infrastructure with HBaseHBaseCon 2012 | Building Mobile Infrastructure with HBase
HBaseCon 2012 | Building Mobile Infrastructure with HBaseCloudera, Inc.
 
HBaseCon 2013: Being Smarter Than the Smart Meter
HBaseCon 2013: Being Smarter Than the Smart MeterHBaseCon 2013: Being Smarter Than the Smart Meter
HBaseCon 2013: Being Smarter Than the Smart MeterCloudera, Inc.
 
HBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 MinutesHBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 MinutesCloudera, Inc.
 
HBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBaseCon
 
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...Cloudera, Inc.
 
Tales from the Cloudera Field
Tales from the Cloudera FieldTales from the Cloudera Field
Tales from the Cloudera FieldHBaseCon
 
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.Cloudera, Inc.
 
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...Cloudera, Inc.
 

Viewers also liked (20)

HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaHBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
 
Case studies session 2
Case studies   session 2Case studies   session 2
Case studies session 2
 
Bulk Loading in the Wild: Ingesting the World's Energy Data
Bulk Loading in the Wild: Ingesting the World's Energy DataBulk Loading in the Wild: Ingesting the World's Energy Data
Bulk Loading in the Wild: Ingesting the World's Energy Data
 
HBaseCon 2012 | Storing and Manipulating Graphs in HBase
HBaseCon 2012 | Storing and Manipulating Graphs in HBaseHBaseCon 2012 | Storing and Manipulating Graphs in HBase
HBaseCon 2012 | Storing and Manipulating Graphs in HBase
 
Design Patterns for Building 360-degree Views with HBase and Kiji
Design Patterns for Building 360-degree Views with HBase and KijiDesign Patterns for Building 360-degree Views with HBase and Kiji
Design Patterns for Building 360-degree Views with HBase and Kiji
 
HBaseCon 2012 | Orchestrating Clusters with Ironfan and Chef - Runa
HBaseCon 2012 | Orchestrating Clusters with Ironfan and Chef - RunaHBaseCon 2012 | Orchestrating Clusters with Ironfan and Chef - Runa
HBaseCon 2012 | Orchestrating Clusters with Ironfan and Chef - Runa
 
Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance Tuning
 
Beautiful Seams: The Intuit Design System
Beautiful Seams: The Intuit Design SystemBeautiful Seams: The Intuit Design System
Beautiful Seams: The Intuit Design System
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
 
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
 
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
 
HBaseCon 2012 | Building Mobile Infrastructure with HBase
HBaseCon 2012 | Building Mobile Infrastructure with HBaseHBaseCon 2012 | Building Mobile Infrastructure with HBase
HBaseCon 2012 | Building Mobile Infrastructure with HBase
 
HBaseCon 2013: Being Smarter Than the Smart Meter
HBaseCon 2013: Being Smarter Than the Smart MeterHBaseCon 2013: Being Smarter Than the Smart Meter
HBaseCon 2013: Being Smarter Than the Smart Meter
 
HBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 MinutesHBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 Minutes
 
HBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region Replicas
 
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
 
Tales from the Cloudera Field
Tales from the Cloudera FieldTales from the Cloudera Field
Tales from the Cloudera Field
 
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
 
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
 

Similar to HBaseCon 2012 | HBase powered Merchant Lookup Service at Intuit

Merchant Lookup Service Intuit
Merchant Lookup Service IntuitMerchant Lookup Service Intuit
Merchant Lookup Service IntuitDataWorks Summit
 
Neo4j Aura on AWS: The Customer Choice for Graph Databases
Neo4j Aura on AWS: The Customer Choice for Graph DatabasesNeo4j Aura on AWS: The Customer Choice for Graph Databases
Neo4j Aura on AWS: The Customer Choice for Graph DatabasesNeo4j
 
Cio summit 20170223_v20
Cio summit 20170223_v20Cio summit 20170223_v20
Cio summit 20170223_v20Joshua Bae
 
Customer Intelligence_ Harnessing Elephants at Transamerica Presentation (1)
Customer Intelligence_ Harnessing Elephants at Transamerica    Presentation (1)Customer Intelligence_ Harnessing Elephants at Transamerica    Presentation (1)
Customer Intelligence_ Harnessing Elephants at Transamerica Presentation (1)Vishal Bamba
 
VSSML17 L5. Basic Data Transformations and Feature Engineering
VSSML17 L5. Basic Data Transformations and Feature EngineeringVSSML17 L5. Basic Data Transformations and Feature Engineering
VSSML17 L5. Basic Data Transformations and Feature EngineeringBigML, Inc
 
How to govern and secure a Data Mesh?
How to govern and secure a Data Mesh?How to govern and secure a Data Mesh?
How to govern and secure a Data Mesh?confluent
 
Achieving Business Value by Fusing Hadoop and Corporate Data
Achieving Business Value by Fusing Hadoop and Corporate DataAchieving Business Value by Fusing Hadoop and Corporate Data
Achieving Business Value by Fusing Hadoop and Corporate DataInside Analysis
 
Analyst View of Data Virtualization: Conversations with Boulder Business Inte...
Analyst View of Data Virtualization: Conversations with Boulder Business Inte...Analyst View of Data Virtualization: Conversations with Boulder Business Inte...
Analyst View of Data Virtualization: Conversations with Boulder Business Inte...Denodo
 
The Perfect Fit: Scalable Graph for Big Data
The Perfect Fit: Scalable Graph for Big DataThe Perfect Fit: Scalable Graph for Big Data
The Perfect Fit: Scalable Graph for Big DataInside Analysis
 
Introduction: Relational to Graphs
Introduction: Relational to GraphsIntroduction: Relational to Graphs
Introduction: Relational to GraphsNeo4j
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkCaserta
 
Graph Database Use Cases - StampedeCon 2015
Graph Database Use Cases - StampedeCon 2015Graph Database Use Cases - StampedeCon 2015
Graph Database Use Cases - StampedeCon 2015StampedeCon
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use CasesMax De Marzi
 
Large Scale Fuzzy Name Matching with a Custom ML Pipeline in Batch and Stream...
Large Scale Fuzzy Name Matching with a Custom ML Pipeline in Batch and Stream...Large Scale Fuzzy Name Matching with a Custom ML Pipeline in Batch and Stream...
Large Scale Fuzzy Name Matching with a Custom ML Pipeline in Batch and Stream...Databricks
 
Hadoop Demo eConvergence
Hadoop Demo eConvergenceHadoop Demo eConvergence
Hadoop Demo eConvergencekvnnrao
 
Sqlforetltesting 130712042826-phpapp01
Sqlforetltesting 130712042826-phpapp01Sqlforetltesting 130712042826-phpapp01
Sqlforetltesting 130712042826-phpapp01Gyanendra Kumar
 
Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email E...
Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email E...Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email E...
Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email E...WebExpo
 
L’architettura di classe enterprise di nuova generazione
L’architettura di classe enterprise di nuova generazioneL’architettura di classe enterprise di nuova generazione
L’architettura di classe enterprise di nuova generazioneMongoDB
 

Similar to HBaseCon 2012 | HBase powered Merchant Lookup Service at Intuit (20)

Merchant Lookup Service Intuit
Merchant Lookup Service IntuitMerchant Lookup Service Intuit
Merchant Lookup Service Intuit
 
Neo4j Aura on AWS: The Customer Choice for Graph Databases
Neo4j Aura on AWS: The Customer Choice for Graph DatabasesNeo4j Aura on AWS: The Customer Choice for Graph Databases
Neo4j Aura on AWS: The Customer Choice for Graph Databases
 
Cio summit 20170223_v20
Cio summit 20170223_v20Cio summit 20170223_v20
Cio summit 20170223_v20
 
Customer Intelligence_ Harnessing Elephants at Transamerica Presentation (1)
Customer Intelligence_ Harnessing Elephants at Transamerica    Presentation (1)Customer Intelligence_ Harnessing Elephants at Transamerica    Presentation (1)
Customer Intelligence_ Harnessing Elephants at Transamerica Presentation (1)
 
VSSML17 L5. Basic Data Transformations and Feature Engineering
VSSML17 L5. Basic Data Transformations and Feature EngineeringVSSML17 L5. Basic Data Transformations and Feature Engineering
VSSML17 L5. Basic Data Transformations and Feature Engineering
 
How to govern and secure a Data Mesh?
How to govern and secure a Data Mesh?How to govern and secure a Data Mesh?
How to govern and secure a Data Mesh?
 
Achieving Business Value by Fusing Hadoop and Corporate Data
Achieving Business Value by Fusing Hadoop and Corporate DataAchieving Business Value by Fusing Hadoop and Corporate Data
Achieving Business Value by Fusing Hadoop and Corporate Data
 
Analyst View of Data Virtualization: Conversations with Boulder Business Inte...
Analyst View of Data Virtualization: Conversations with Boulder Business Inte...Analyst View of Data Virtualization: Conversations with Boulder Business Inte...
Analyst View of Data Virtualization: Conversations with Boulder Business Inte...
 
The Perfect Fit: Scalable Graph for Big Data
The Perfect Fit: Scalable Graph for Big DataThe Perfect Fit: Scalable Graph for Big Data
The Perfect Fit: Scalable Graph for Big Data
 
Montali - DB-Nets: On The Marriage of Colored Petri Nets 
and Relational Data...
Montali - DB-Nets: On The Marriage of Colored Petri Nets 
and Relational Data...Montali - DB-Nets: On The Marriage of Colored Petri Nets 
and Relational Data...
Montali - DB-Nets: On The Marriage of Colored Petri Nets 
and Relational Data...
 
Introduction: Relational to Graphs
Introduction: Relational to GraphsIntroduction: Relational to Graphs
Introduction: Relational to Graphs
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
 
Graph Database Use Cases - StampedeCon 2015
Graph Database Use Cases - StampedeCon 2015Graph Database Use Cases - StampedeCon 2015
Graph Database Use Cases - StampedeCon 2015
 
Graph database Use Cases
Graph database Use CasesGraph database Use Cases
Graph database Use Cases
 
Large Scale Fuzzy Name Matching with a Custom ML Pipeline in Batch and Stream...
Large Scale Fuzzy Name Matching with a Custom ML Pipeline in Batch and Stream...Large Scale Fuzzy Name Matching with a Custom ML Pipeline in Batch and Stream...
Large Scale Fuzzy Name Matching with a Custom ML Pipeline in Batch and Stream...
 
Hadoop Demo eConvergence
Hadoop Demo eConvergenceHadoop Demo eConvergence
Hadoop Demo eConvergence
 
Sqlforetltesting 130712042826-phpapp01
Sqlforetltesting 130712042826-phpapp01Sqlforetltesting 130712042826-phpapp01
Sqlforetltesting 130712042826-phpapp01
 
SQL for ETL Testing
SQL for ETL TestingSQL for ETL Testing
SQL for ETL Testing
 
Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email E...
Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email E...Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email E...
Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email E...
 
L’architettura di classe enterprise di nuova generazione
L’architettura di classe enterprise di nuova generazioneL’architettura di classe enterprise di nuova generazione
L’architettura di classe enterprise di nuova generazione
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 

Recently uploaded (20)

Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 

HBaseCon 2012 | HBase powered Merchant Lookup Service at Intuit

  • 1. HBase powered Merchant Lookup Service at Intuit Vrushali Channapattan, Intuit Lightning Talk @ HBaseCon2012 (May 22nd, 2012)
  • 2. About Intuit Intuit is a leader in this trend because we are entrusted with the collective data of our 50 million customers. 2
  • 3. Problem: Duplicate Merchants Company ABC Company PQR name: The Windsor Press, Inc. name: The Windsor Press address: PO Box 465 6 North Third Street address: P.O. Box 465 6 North 3rd St. city: Hamburg city: Hamburg state: PA state: PA zip: 19526 zip: 19526-0465 phone: (610) 562-2267 phone: (610) 562-2267 Both of the above vendor records map to the D&B business: ID: 002114902 Name: The Windsor-Press Inc Street: 6 N 3rd St City: Hamburg Dun & State: PA Bradstreet Zip: 19526-1502 Phone: (610)-562-2267
  • 6. Backend Architecture Input Applications Loader Data Internal Research Projects Update Merchant Splicer Full table Final Scan Match Score Name Phone Address Individual Score Matcher Scores Combiner Various Matchers 6
  • 7. Data Model -Tables in HBase Merchants  Master dataset of merchants Sangria_id Unique id generation coordination across mapper processes Duplicates Noting duplicate merchants after deduplication SnapshotMerchants Merging into master dataset NewMerchants The new merchant set that is to be added to the master data set of merchants 7
  • 8. Schema Merchants Row key Info (column family) Mapping (column family) 25204939 name:Crepevine sourcename:10000048, street:367 University Avenue 10000075 city:Palo Alto state:CA zip:94031 county:Santa Clara County country: United States of America website:www.crepevine.com phoneNumber:16503233900 latitude:37.430211 longitude:-122.098221 source:internet mint_category:Food & Dining qbo_category:Restaurants NAICS:722110 SIC:5182 8
  • 9. Schema  Sangria_id Row key Info (column family) default seed:30000 comment:initial seed by vc of 1000 qbo seed:20550000 comment:initial seed by kf of 20000000 Duplicates Row key Info (column family) 10000043 25204921:0.998 10000048 25204939:0.78 10000075 25204939:0.95 9
  • 10. Optimizations (job level) • For Hadoop jobs interfacing with HBase, used TableMapReduceUtil • Emitted a ‘put’ from Mapper or Reducer instead of a regular htable put – Use context.write(rowKey,put) • To make the full table scan faster (hbase read only hadoop jobs – deduping matchers , Solr index generator)  scan.setCaching(500);  scan.setCacheBlocks(false); • Used Customized TableInputFormat while scanning (custom number of splits for map tasks) job.setInputFormatClass(CustomizedTableInputFormat.class); extends TableInputFormat class and overriding getSplits method 10
  • 11. Optimizations (code level) • Storing frequently used column family and column names as byte arrays in a public interface public static final byte[] COLUMN_NAME = Bytes.toBytes("name"); public static final byte[] COLUMN_FAMILY_INFO = Bytes.toBytes("info"); • Utility class for getting values from hbase.client.Result HBaseUtils.getColumnValue(result, COLUMN_FAMILY_INFO, COLUMN_NAME)); public static String getColumnValue(Result result, byte[] type, byte[] columnName) { return Bytes.toString(result.getValue(type, columnName)); } • Writing a sample set of 31 million records into the HBase cluster changed from 4 hours 37 mins 47 secs to 32 mins, 18 seconds 11
  • 12. Thank You! Vrushali Channapattan, Intuit Data Group (BIO) vrushali_channapattan@intuit.com 12
  • 13. Schema  SnapshotMerchants Row key Info (column family) merge first:1336813613 start:1337029113 end:1337120100 comments:merging qbo against dandb merchants initiated on May 14th 2012 outcome:started (or) merge run successful NewMerchants- same as Merchants 13

Editor's Notes

  1. 9,223,372,036,854,775,80720550000