SlideShare a Scribd company logo
Building a Large Search
Platform on a             find the talk
Shoestring Budget

Jacques Nadeau, CTO
jacques@yapmap.com
@intjesus


May 22, 2012
Agenda
What is YapMap?
• Interfacing with Data
• Using HBase as a data processing pipeline
• NoSQL Schemas: Adjusting and Migrating
• Index Construction
• HBase Operations
What is YapMap?
• A visual search technology
• Focused on threaded
  conversations
• Built to provide better
  context and ranking
• Built on Hadoop & HBase for
  massive scale
• Two self-funded guys
• Motoyap.com is largest
  implementation at 650mm       www.motoyap.com
  automotive docs
Why do this?
               • Discussion forums and
                 mailings list primary
                 home for many hobbies
               • Threaded search sucks
                 – No context in the middle
                   of the conversation
How does it work?
                    Post 1
                    Post 2
                        Post 3
                             Post 4
                    Post 5
                        Post 6
A YapMap Search Result Page
Conceptual data model
                                        Entire Thread is MainDocGroup
             Post 1
             Post 2
                  Post 3                  For long threads, a single
                       Post 4             group may have multiple
                                          MainDocs
              Post 5
                  Post 6

                                  Each individual post is a
                                  DetailDoc

 • Threads are broken up among many web pages and don’t
   necessarily arrive in order
 • Longer threads are broken up
    – For short threads, MainDocGroup == MainDoc
General architecture

           RabbitMQ            MapReduce

    Targeted     Processing     Indexing      Results
    Crawlers      Pipeline       Engine    Presentation


                HBase                      Riak
                        HDFS/MapRfs
                         Zookeeper

   MySQL                                          MySQL
We match the tool to the use case
                     MySQL             HBase                      Riak
Primary Use          Business          Storage of crawl data,     Storage of
                     management        processing pipeline        components
                     information                                  directly related to
                                                                  presentation
Key features that   Transactions, SQL, Consistency, redundancy, Predictable low
drove selection     JPA                memory to persitence       latency, full
                                       ratio                      uptime, max one
                                                                  IOP per object
Average Object Size             Small                         20k                    2k
Object Count                <1 million                500 million             1 billion
System Count                         2                         10                     8
Memory Footprint                  <1gb                     120gb                 240gb
Dataset Size                    10mb                        10tb                    2tb


    We also evaluated Voldemort and Cassandra
Agenda
• What is YapMap?
Interfacing with Data
• Using HBase as a data processing pipeline
• NoSQL Schemas: Adjusting and Migrating
• Index Construction
• HBase Operations
HBase client is a power user interface
 • HBase client interface is low-level
   – Similar to JDBC/SQL
 • Most people start by using
   Bytes.to(String|Short|Long)
   – Spaghetti data layer
 • New developers have to learn a bunch of new
   concepts
 • Mistakes are easy to make
We built a DTO layer to simplify dev
 • Data Transfer Objects (DTO) & data access layer provide single point
   for code changes and data migration
 • First-class row key objects
 • Centralized type serialization
     – Standard data types
     – Complex object serialization layer via protobuf
 • Provide optimistic locking
 • Enable asynchronous operation
 • Minimize mistakes:
     – QuerySet abstraction (columns & column families)
     – Field state management (not queried versus null)
 • Newer tools have arrived to ease this burden
     – Kundera and Gora
Examples from our DTO abstraction
             <table name="crawlJob" row-id-class=“example.CrawlJobId" >
               <column-family name="main" compression="LZO" blockCacheEnabled="false" versions="1">
Definition
 Model




                  <column name="firstCheckpoint" type=“example.proto.JobProtos$CrawlCheckpoint" />
                  <column name="firstCheckpointTime" type="Long" />
                  <column name="entryCheckpointCount" type="Long" />
                  ...
             public class CrawlJobModel extends SparseModel<CrawlJobId>{
               public CrawlJobId getId(){…}
Generated
 Model




               public boolean hasFirstCheckpoint(){…}
               public CrawlCheckpoint getFirstCheckpoint(){…}
               public void setFirstCrawlCheckpoint(CrawlCheckpoint checkpoint){…}
               …
             public interface HBaseReadWriteService{
               public void putUnsafe(T model);
               public void putVersioned(T model);
Interface
  HBase




               public T get(RowId<T> rowId, QuerySet<T> querySet);
               public void increment(RowId<T> rowId, IncrementPair<T>... pairs);
               public SutructuredScanner<T> scanByPrefix(byte[] bytePrefix, QuerySet<T> querySet);
               ….
Example Primary Keys
UrlId                           Path + Query String
  org.apache.hbase:80:x:/book/architecture.html

 Reverse domain       Client Protocol (e.g. user name + http)
                    Optional Port

MainDocId
   GroupId (row)    2 byte bucket number (part)
 xxxx x xxxxxxx xx
           Additional identifier (4, 8 or 32 bytes depending on type)
      1 byte type of identifier enum (int, long or sha2, generic 32)
   4 byte source id
Agenda
• What is YapMap?
• Interfacing with Data
Using HBase as a data processing pipeline
• NoSQL Schemas: Adjusting and Migrating
• Index Construction
• HBase Operations
Processing pipeline is built on HBase
 •    Multiple steps with checkpoints to manage failures
 •    Out of order input assumed
 •    Idempotent operations at each stage of process
 •    Utilize optimistic locking to do coordinated merges
 •    Use regular cleanup scans to pick up lost tasks
 •    Control batch size of messages to control throughput versus latency

                Message                Message             Message                 Batch
                          Build Main       Merge + Split
                                                                Pre-index Main   Indexing
     Crawlers                               Main Doc
                             Docs            Groups                  Docs           RT
                                                                                 Indexing

     Cache       DFS          t1:cf1                 t2:cf1     t2:cf2
Migrating from messaging to coprocessors

 • Big challenges
    – Mixing system code and application code
    – Memory impact: we have a GC stable state
 • Exploring HBASE-4047 to solve

             Message                Message             Message                    Batch
                       Build Main       Merge + Split
                                                                Pre-index Main   Indexing
  Crawlers                               Main Doc
                          Docs            Groups                     Docs           RT
                                                                                 Indexing
                                        CP                 CP
  Cache       DFS          t1:cf1                 t2:cf1        t2:cf1
Agenda
• What is YapMap?
• Interfacing with Data
• Using HBase as a data processing pipeline
NoSQL Schemas: Adjusting and Migrating
• Index Construction
• HBase Operations
Learn to leverage NoSQL strengths
• Original Structure was similar    • New structure utilizes a cell for
  to traditional RDBMS,               each DetailDoc
    – static column names           • Split metadata maps MainDoc >
    – fully realized MainDoc          DetailDocId
• One new DetailDoc could cause     • HBase handles cascading changes
  a cascading regeneration of all   • MainDoc realized on app read
  MainDocs
                                    • Use column prefixing
            0        1         2        metadata detailid1   detailid2

        MainDoc MainDoc MainDoc           Splits   Detail     Detail
Schema migration steps
 1. Disable application writes on OldTable
 2. Extract OldSplits from OldTable
 3. Create NewTable with appropriate column families and
    properties
 4. Split NewTable based on OldSplits
 5. Run MapReduce job that converts old objects into new
    objects
    –   Use HTableInputFormat as input on OldTable
    –   Use HFileOutputFormat as output format pointing at NewTable
 6. Bulk load output into NewTable
 7. Redeploy application to read on NewTable
 8. Enable writes in application layer
Agenda
• What is YapMap?
• Interfacing with Data
• Using HBase as a data processing pipeline
• NoSQL Schemas: Adjusting and Migrating
Index Construction
• HBase Operations
Index Shards loosely based on HBase regions
 • Indexing is split      Tokenized Main Docs
   between major
   indices (batch) and
   minor (real time)             R1             Shard 1
 • Primary key order is
   same as index order
 • Shards are based on           R2             Shard 2
   snapshots of splits
 • IndexedTableSplit
   allows cross-region           R3             Shard 3
   shard splits to be
   integrated at Index
   load time
Batch indices are memory based, stored on DFS

 • Total of all shards about 1tb
    – With ECC memory <$7/gb, systems easily achieving 128-256gb
      each=> no problem
 • Each shard ~5gb in size to improve parallelism on search
    – Variable depending on needs and use case
 • Each shard is composed of multiple map and reduce parts
   along with MapReduce statistics from HBase
    – Integration of components are done in memory
    – Partitioner utilizes observed term distributions
    – New MR committer: FileAndPutOutputCommitter
        • Allows low volume secondary outputs from Map phase to be used
          during reduce phase
Agenda
• What is YapMap?
• Interfacing with Data
• Using HBase as a data processing pipeline
• NoSQL Schemas: Adjusting and Migrating
• Index Construction
HBase Operations
HBase Operations
• Getting GC right – 6 months
   – Machines have 32gb, 12gb for HBase, more was a problem
• Pick the right region size: With HFile v2, just start bigger
• Be cautious about using multiple CFs
• Consider Asynchbase Client
   – Benoit did some nice work at SU
   – Ultimately we just leveraged EJB3.1 @Async capabilities to make our HBase
     service async
• Upgrade: typically on the first or second point release
   – Testing/research cluster first
• Hardware: 8 core low power chips, low power ddr3, 6x WD
  Black 2TB drives per machine, Infiniband
• MapR’s M3 distribution of Hadoop
Questions
• Why not Lucene/Solr/ElasticSearch/etc?
    – Data locality between main and detail documents to do document-at-once scoring
    – Not built to work well with Hadoop and HBase (Blur.io is first to tackle this head on)
• Why not store indices directly in HBase?
    – Single cell storage would be the only way to do it efficiently
    – No such thing as a single cell no-read append (HBASE-5993)
    – No single cell partial read
• Why use Riak for presentation side?
    – Hadoop SPOF
    – Even with newer Hadoop versions, HBase does not do sub-second row-level HA on node
      failure (HBASE-2357)
    – Riak has more predictable latency
• Why did you switch to MapR?
    – Index load performance was substantially faster
    – Less impact on HBase performance
    – Snapshots in trial copy were nice for those 30 days

More Related Content

What's hot

HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon
 
HBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
HBaseCon 2012 | Solbase - Kyungseog Oh, PhotobucketHBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
HBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
Cloudera, Inc.
 
HBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial IndustryHBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial Industry
HBaseCon
 
Keynote: The Future of Apache HBase
Keynote: The Future of Apache HBaseKeynote: The Future of Apache HBase
Keynote: The Future of Apache HBase
HBaseCon
 
A Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesA Survey of HBase Application Archetypes
A Survey of HBase Application Archetypes
HBaseCon
 
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsightOptimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
HBaseCon
 
HBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region Replicas
HBaseCon
 
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big DataHBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
Cloudera, Inc.
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
DataWorks Summit/Hadoop Summit
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
Cloudera, Inc.
 
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWSHBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon
 
HBaseCon 2013: Apache HBase Operations at Pinterest
HBaseCon 2013: Apache HBase Operations at PinterestHBaseCon 2013: Apache HBase Operations at Pinterest
HBaseCon 2013: Apache HBase Operations at Pinterest
Cloudera, Inc.
 
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
Cloudera, Inc.
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
Yahoo Developer Network
 
HBaseCon 2015 General Session: Zen - A Graph Data Model on HBase
HBaseCon 2015 General Session: Zen - A Graph Data Model on HBaseHBaseCon 2015 General Session: Zen - A Graph Data Model on HBase
HBaseCon 2015 General Session: Zen - A Graph Data Model on HBase
HBaseCon
 
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
HBaseCon 2015: HBase at Scale in an Online and  High-Demand EnvironmentHBaseCon 2015: HBase at Scale in an Online and  High-Demand Environment
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
HBaseCon
 
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
Michael Stack
 
HBaseCon 2013: Compaction Improvements in Apache HBase
HBaseCon 2013: Compaction Improvements in Apache HBaseHBaseCon 2013: Compaction Improvements in Apache HBase
HBaseCon 2013: Compaction Improvements in Apache HBase
Cloudera, Inc.
 
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaHBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
Cloudera, Inc.
 
HBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and SparkHBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and Spark
HBaseCon
 

What's hot (20)

HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
 
HBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
HBaseCon 2012 | Solbase - Kyungseog Oh, PhotobucketHBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
HBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
 
HBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial IndustryHBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial Industry
 
Keynote: The Future of Apache HBase
Keynote: The Future of Apache HBaseKeynote: The Future of Apache HBase
Keynote: The Future of Apache HBase
 
A Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesA Survey of HBase Application Archetypes
A Survey of HBase Application Archetypes
 
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsightOptimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
 
HBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region Replicas
 
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big DataHBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
 
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWSHBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
 
HBaseCon 2013: Apache HBase Operations at Pinterest
HBaseCon 2013: Apache HBase Operations at PinterestHBaseCon 2013: Apache HBase Operations at Pinterest
HBaseCon 2013: Apache HBase Operations at Pinterest
 
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
 
HBaseCon 2015 General Session: Zen - A Graph Data Model on HBase
HBaseCon 2015 General Session: Zen - A Graph Data Model on HBaseHBaseCon 2015 General Session: Zen - A Graph Data Model on HBase
HBaseCon 2015 General Session: Zen - A Graph Data Model on HBase
 
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
HBaseCon 2015: HBase at Scale in an Online and  High-Demand EnvironmentHBaseCon 2015: HBase at Scale in an Online and  High-Demand Environment
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
 
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
 
HBaseCon 2013: Compaction Improvements in Apache HBase
HBaseCon 2013: Compaction Improvements in Apache HBaseHBaseCon 2013: Compaction Improvements in Apache HBase
HBaseCon 2013: Compaction Improvements in Apache HBase
 
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaHBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
 
HBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and SparkHBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and Spark
 

Viewers also liked

HBaseCon 2012 | Getting Real about Interactive Big Data Management with Lily ...
HBaseCon 2012 | Getting Real about Interactive Big Data Management with Lily ...HBaseCon 2012 | Getting Real about Interactive Big Data Management with Lily ...
HBaseCon 2012 | Getting Real about Interactive Big Data Management with Lily ...
Cloudera, Inc.
 
HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase
HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase
HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase
Cloudera, Inc.
 
HBaseCon 2012 | Developing Real Time Analytics Applications Using HBase in th...
HBaseCon 2012 | Developing Real Time Analytics Applications Using HBase in th...HBaseCon 2012 | Developing Real Time Analytics Applications Using HBase in th...
HBaseCon 2012 | Developing Real Time Analytics Applications Using HBase in th...
Cloudera, Inc.
 
HBaseCon 2013: Real-Time Model Scoring in Recommender Systems
HBaseCon 2013: Real-Time Model Scoring in Recommender Systems HBaseCon 2013: Real-Time Model Scoring in Recommender Systems
HBaseCon 2013: Real-Time Model Scoring in Recommender Systems
Cloudera, Inc.
 
HBaseCon 2012 | Real-time Analytics with HBase - Sematext
HBaseCon 2012 | Real-time Analytics with HBase - SematextHBaseCon 2012 | Real-time Analytics with HBase - Sematext
HBaseCon 2012 | Real-time Analytics with HBase - Sematext
Cloudera, Inc.
 
HBaseCon 2013: Scalable Network Designs for Apache HBase
HBaseCon 2013: Scalable Network Designs for Apache HBaseHBaseCon 2013: Scalable Network Designs for Apache HBase
HBaseCon 2013: Scalable Network Designs for Apache HBase
Cloudera, Inc.
 
HBaseCon 2012 | Gap Inc Direct: Serving Apparel Catalog from HBase for Live W...
HBaseCon 2012 | Gap Inc Direct: Serving Apparel Catalog from HBase for Live W...HBaseCon 2012 | Gap Inc Direct: Serving Apparel Catalog from HBase for Live W...
HBaseCon 2012 | Gap Inc Direct: Serving Apparel Catalog from HBase for Live W...
Cloudera, Inc.
 
HBaseCon 2013: Full-Text Indexing for Apache HBase
HBaseCon 2013: Full-Text Indexing for Apache HBaseHBaseCon 2013: Full-Text Indexing for Apache HBase
HBaseCon 2013: Full-Text Indexing for Apache HBase
Cloudera, Inc.
 
HBaseCon 2013: Realtime User Segmentation using Apache HBase -- Architectural...
HBaseCon 2013: Realtime User Segmentation using Apache HBase -- Architectural...HBaseCon 2013: Realtime User Segmentation using Apache HBase -- Architectural...
HBaseCon 2013: Realtime User Segmentation using Apache HBase -- Architectural...
Cloudera, Inc.
 
HBaseCon 2013: Near Real Time Indexing for eBay Search
HBaseCon 2013: Near Real Time Indexing for eBay SearchHBaseCon 2013: Near Real Time Indexing for eBay Search
HBaseCon 2013: Near Real Time Indexing for eBay Search
Cloudera, Inc.
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
Cloudera, Inc.
 
Docker Monitoring Webinar
Docker Monitoring  WebinarDocker Monitoring  Webinar
Docker Monitoring Webinar
Sematext Group, Inc.
 
Solr Anti Patterns
Solr Anti PatternsSolr Anti Patterns
Solr Anti Patterns
Sematext Group, Inc.
 
Tuning Solr for Logs
Tuning Solr for LogsTuning Solr for Logs
Tuning Solr for Logs
Sematext Group, Inc.
 
From Zero to Production Hero: Log Analysis with Elasticsearch (from Velocity ...
From Zero to Production Hero: Log Analysis with Elasticsearch (from Velocity ...From Zero to Production Hero: Log Analysis with Elasticsearch (from Velocity ...
From Zero to Production Hero: Log Analysis with Elasticsearch (from Velocity ...
Sematext Group, Inc.
 
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
Sematext Group, Inc.
 
Metrics, Logs, Transaction Traces, Anomaly Detection at Scale
Metrics, Logs, Transaction Traces, Anomaly Detection at ScaleMetrics, Logs, Transaction Traces, Anomaly Detection at Scale
Metrics, Logs, Transaction Traces, Anomaly Detection at Scale
Sematext Group, Inc.
 
Tuning Elasticsearch Indexing Pipeline for Logs
Tuning Elasticsearch Indexing Pipeline for LogsTuning Elasticsearch Indexing Pipeline for Logs
Tuning Elasticsearch Indexing Pipeline for Logs
Sematext Group, Inc.
 
Side by Side with Elasticsearch & Solr, Part 2
Side by Side with Elasticsearch & Solr, Part 2Side by Side with Elasticsearch & Solr, Part 2
Side by Side with Elasticsearch & Solr, Part 2
Sematext Group, Inc.
 
Hue: Big Data Web applications for Interactive Hadoop at Big Data Spain 2014
Hue: Big Data Web applications for Interactive Hadoop at Big Data Spain 2014Hue: Big Data Web applications for Interactive Hadoop at Big Data Spain 2014
Hue: Big Data Web applications for Interactive Hadoop at Big Data Spain 2014
gethue
 

Viewers also liked (20)

HBaseCon 2012 | Getting Real about Interactive Big Data Management with Lily ...
HBaseCon 2012 | Getting Real about Interactive Big Data Management with Lily ...HBaseCon 2012 | Getting Real about Interactive Big Data Management with Lily ...
HBaseCon 2012 | Getting Real about Interactive Big Data Management with Lily ...
 
HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase
HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase
HBaseCon 2013:High-Throughput, Transactional Stream Processing on Apache HBase
 
HBaseCon 2012 | Developing Real Time Analytics Applications Using HBase in th...
HBaseCon 2012 | Developing Real Time Analytics Applications Using HBase in th...HBaseCon 2012 | Developing Real Time Analytics Applications Using HBase in th...
HBaseCon 2012 | Developing Real Time Analytics Applications Using HBase in th...
 
HBaseCon 2013: Real-Time Model Scoring in Recommender Systems
HBaseCon 2013: Real-Time Model Scoring in Recommender Systems HBaseCon 2013: Real-Time Model Scoring in Recommender Systems
HBaseCon 2013: Real-Time Model Scoring in Recommender Systems
 
HBaseCon 2012 | Real-time Analytics with HBase - Sematext
HBaseCon 2012 | Real-time Analytics with HBase - SematextHBaseCon 2012 | Real-time Analytics with HBase - Sematext
HBaseCon 2012 | Real-time Analytics with HBase - Sematext
 
HBaseCon 2013: Scalable Network Designs for Apache HBase
HBaseCon 2013: Scalable Network Designs for Apache HBaseHBaseCon 2013: Scalable Network Designs for Apache HBase
HBaseCon 2013: Scalable Network Designs for Apache HBase
 
HBaseCon 2012 | Gap Inc Direct: Serving Apparel Catalog from HBase for Live W...
HBaseCon 2012 | Gap Inc Direct: Serving Apparel Catalog from HBase for Live W...HBaseCon 2012 | Gap Inc Direct: Serving Apparel Catalog from HBase for Live W...
HBaseCon 2012 | Gap Inc Direct: Serving Apparel Catalog from HBase for Live W...
 
HBaseCon 2013: Full-Text Indexing for Apache HBase
HBaseCon 2013: Full-Text Indexing for Apache HBaseHBaseCon 2013: Full-Text Indexing for Apache HBase
HBaseCon 2013: Full-Text Indexing for Apache HBase
 
HBaseCon 2013: Realtime User Segmentation using Apache HBase -- Architectural...
HBaseCon 2013: Realtime User Segmentation using Apache HBase -- Architectural...HBaseCon 2013: Realtime User Segmentation using Apache HBase -- Architectural...
HBaseCon 2013: Realtime User Segmentation using Apache HBase -- Architectural...
 
HBaseCon 2013: Near Real Time Indexing for eBay Search
HBaseCon 2013: Near Real Time Indexing for eBay SearchHBaseCon 2013: Near Real Time Indexing for eBay Search
HBaseCon 2013: Near Real Time Indexing for eBay Search
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
 
Docker Monitoring Webinar
Docker Monitoring  WebinarDocker Monitoring  Webinar
Docker Monitoring Webinar
 
Solr Anti Patterns
Solr Anti PatternsSolr Anti Patterns
Solr Anti Patterns
 
Tuning Solr for Logs
Tuning Solr for LogsTuning Solr for Logs
Tuning Solr for Logs
 
From Zero to Production Hero: Log Analysis with Elasticsearch (from Velocity ...
From Zero to Production Hero: Log Analysis with Elasticsearch (from Velocity ...From Zero to Production Hero: Log Analysis with Elasticsearch (from Velocity ...
From Zero to Production Hero: Log Analysis with Elasticsearch (from Velocity ...
 
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
 
Metrics, Logs, Transaction Traces, Anomaly Detection at Scale
Metrics, Logs, Transaction Traces, Anomaly Detection at ScaleMetrics, Logs, Transaction Traces, Anomaly Detection at Scale
Metrics, Logs, Transaction Traces, Anomaly Detection at Scale
 
Tuning Elasticsearch Indexing Pipeline for Logs
Tuning Elasticsearch Indexing Pipeline for LogsTuning Elasticsearch Indexing Pipeline for Logs
Tuning Elasticsearch Indexing Pipeline for Logs
 
Side by Side with Elasticsearch & Solr, Part 2
Side by Side with Elasticsearch & Solr, Part 2Side by Side with Elasticsearch & Solr, Part 2
Side by Side with Elasticsearch & Solr, Part 2
 
Hue: Big Data Web applications for Interactive Hadoop at Big Data Spain 2014
Hue: Big Data Web applications for Interactive Hadoop at Big Data Spain 2014Hue: Big Data Web applications for Interactive Hadoop at Big Data Spain 2014
Hue: Big Data Web applications for Interactive Hadoop at Big Data Spain 2014
 

Similar to HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget

Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
sonukumar379092
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
arslanhaneef
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
Roorkee College of Engineering, Roorkee
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
chariorienit
 
Searching conversations with hadoop
Searching conversations with hadoopSearching conversations with hadoop
Searching conversations with hadoop
DataWorks Summit
 
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
MaharajothiP
 
Deploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopDeploying Grid Services Using Hadoop
Deploying Grid Services Using Hadoop
George Ang
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
Ramesh Pabba - seeking new projects
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
Ramesh Pabba - seeking new projects
 
Apache Drill
Apache DrillApache Drill
Apache Drill
Ted Dunning
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
Venneladonthireddy1
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
Sandeep Singh
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
saipriyacoool
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
York University
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreduce
hansen3032
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
Ayyappan Paramesh
 
מיכאל
מיכאלמיכאל
מיכאל
sqlserver.co.il
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
Ovidiu Dimulescu
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Fwdays
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
bddmoscow
 

Similar to HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget (20)

Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Searching conversations with hadoop
Searching conversations with hadoopSearching conversations with hadoop
Searching conversations with hadoop
 
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
 
Deploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopDeploying Grid Services Using Hadoop
Deploying Grid Services Using Hadoop
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreduce
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
מיכאל
מיכאלמיכאל
מיכאל
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
Cloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
Cloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
Cloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
HarisZaheer8
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
Shinana2
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Precisely
 
SAP S/4 HANA sourcing and procurement to Public cloud
SAP S/4 HANA sourcing and procurement to Public cloudSAP S/4 HANA sourcing and procurement to Public cloud
SAP S/4 HANA sourcing and procurement to Public cloud
maazsz111
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Tatiana Kojar
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
alexjohnson7307
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 

Recently uploaded (20)

Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
 
SAP S/4 HANA sourcing and procurement to Public cloud
SAP S/4 HANA sourcing and procurement to Public cloudSAP S/4 HANA sourcing and procurement to Public cloud
SAP S/4 HANA sourcing and procurement to Public cloud
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 

HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget

  • 1. Building a Large Search Platform on a find the talk Shoestring Budget Jacques Nadeau, CTO jacques@yapmap.com @intjesus May 22, 2012
  • 2. Agenda What is YapMap? • Interfacing with Data • Using HBase as a data processing pipeline • NoSQL Schemas: Adjusting and Migrating • Index Construction • HBase Operations
  • 3. What is YapMap? • A visual search technology • Focused on threaded conversations • Built to provide better context and ranking • Built on Hadoop & HBase for massive scale • Two self-funded guys • Motoyap.com is largest implementation at 650mm www.motoyap.com automotive docs
  • 4. Why do this? • Discussion forums and mailings list primary home for many hobbies • Threaded search sucks – No context in the middle of the conversation
  • 5. How does it work? Post 1 Post 2 Post 3 Post 4 Post 5 Post 6
  • 6. A YapMap Search Result Page
  • 7. Conceptual data model Entire Thread is MainDocGroup Post 1 Post 2 Post 3 For long threads, a single Post 4 group may have multiple MainDocs Post 5 Post 6 Each individual post is a DetailDoc • Threads are broken up among many web pages and don’t necessarily arrive in order • Longer threads are broken up – For short threads, MainDocGroup == MainDoc
  • 8. General architecture RabbitMQ MapReduce Targeted Processing Indexing Results Crawlers Pipeline Engine Presentation HBase Riak HDFS/MapRfs Zookeeper MySQL MySQL
  • 9. We match the tool to the use case MySQL HBase Riak Primary Use Business Storage of crawl data, Storage of management processing pipeline components information directly related to presentation Key features that Transactions, SQL, Consistency, redundancy, Predictable low drove selection JPA memory to persitence latency, full ratio uptime, max one IOP per object Average Object Size Small 20k 2k Object Count <1 million 500 million 1 billion System Count 2 10 8 Memory Footprint <1gb 120gb 240gb Dataset Size 10mb 10tb 2tb We also evaluated Voldemort and Cassandra
  • 10. Agenda • What is YapMap? Interfacing with Data • Using HBase as a data processing pipeline • NoSQL Schemas: Adjusting and Migrating • Index Construction • HBase Operations
  • 11. HBase client is a power user interface • HBase client interface is low-level – Similar to JDBC/SQL • Most people start by using Bytes.to(String|Short|Long) – Spaghetti data layer • New developers have to learn a bunch of new concepts • Mistakes are easy to make
  • 12. We built a DTO layer to simplify dev • Data Transfer Objects (DTO) & data access layer provide single point for code changes and data migration • First-class row key objects • Centralized type serialization – Standard data types – Complex object serialization layer via protobuf • Provide optimistic locking • Enable asynchronous operation • Minimize mistakes: – QuerySet abstraction (columns & column families) – Field state management (not queried versus null) • Newer tools have arrived to ease this burden – Kundera and Gora
  • 13. Examples from our DTO abstraction <table name="crawlJob" row-id-class=“example.CrawlJobId" > <column-family name="main" compression="LZO" blockCacheEnabled="false" versions="1"> Definition Model <column name="firstCheckpoint" type=“example.proto.JobProtos$CrawlCheckpoint" /> <column name="firstCheckpointTime" type="Long" /> <column name="entryCheckpointCount" type="Long" /> ... public class CrawlJobModel extends SparseModel<CrawlJobId>{ public CrawlJobId getId(){…} Generated Model public boolean hasFirstCheckpoint(){…} public CrawlCheckpoint getFirstCheckpoint(){…} public void setFirstCrawlCheckpoint(CrawlCheckpoint checkpoint){…} … public interface HBaseReadWriteService{ public void putUnsafe(T model); public void putVersioned(T model); Interface HBase public T get(RowId<T> rowId, QuerySet<T> querySet); public void increment(RowId<T> rowId, IncrementPair<T>... pairs); public SutructuredScanner<T> scanByPrefix(byte[] bytePrefix, QuerySet<T> querySet); ….
  • 14. Example Primary Keys UrlId Path + Query String org.apache.hbase:80:x:/book/architecture.html Reverse domain Client Protocol (e.g. user name + http) Optional Port MainDocId GroupId (row) 2 byte bucket number (part) xxxx x xxxxxxx xx Additional identifier (4, 8 or 32 bytes depending on type) 1 byte type of identifier enum (int, long or sha2, generic 32) 4 byte source id
  • 15. Agenda • What is YapMap? • Interfacing with Data Using HBase as a data processing pipeline • NoSQL Schemas: Adjusting and Migrating • Index Construction • HBase Operations
  • 16. Processing pipeline is built on HBase • Multiple steps with checkpoints to manage failures • Out of order input assumed • Idempotent operations at each stage of process • Utilize optimistic locking to do coordinated merges • Use regular cleanup scans to pick up lost tasks • Control batch size of messages to control throughput versus latency Message Message Message Batch Build Main Merge + Split Pre-index Main Indexing Crawlers Main Doc Docs Groups Docs RT Indexing Cache DFS t1:cf1 t2:cf1 t2:cf2
  • 17. Migrating from messaging to coprocessors • Big challenges – Mixing system code and application code – Memory impact: we have a GC stable state • Exploring HBASE-4047 to solve Message Message Message Batch Build Main Merge + Split Pre-index Main Indexing Crawlers Main Doc Docs Groups Docs RT Indexing CP CP Cache DFS t1:cf1 t2:cf1 t2:cf1
  • 18. Agenda • What is YapMap? • Interfacing with Data • Using HBase as a data processing pipeline NoSQL Schemas: Adjusting and Migrating • Index Construction • HBase Operations
  • 19. Learn to leverage NoSQL strengths • Original Structure was similar • New structure utilizes a cell for to traditional RDBMS, each DetailDoc – static column names • Split metadata maps MainDoc > – fully realized MainDoc DetailDocId • One new DetailDoc could cause • HBase handles cascading changes a cascading regeneration of all • MainDoc realized on app read MainDocs • Use column prefixing 0 1 2 metadata detailid1 detailid2 MainDoc MainDoc MainDoc Splits Detail Detail
  • 20. Schema migration steps 1. Disable application writes on OldTable 2. Extract OldSplits from OldTable 3. Create NewTable with appropriate column families and properties 4. Split NewTable based on OldSplits 5. Run MapReduce job that converts old objects into new objects – Use HTableInputFormat as input on OldTable – Use HFileOutputFormat as output format pointing at NewTable 6. Bulk load output into NewTable 7. Redeploy application to read on NewTable 8. Enable writes in application layer
  • 21. Agenda • What is YapMap? • Interfacing with Data • Using HBase as a data processing pipeline • NoSQL Schemas: Adjusting and Migrating Index Construction • HBase Operations
  • 22. Index Shards loosely based on HBase regions • Indexing is split Tokenized Main Docs between major indices (batch) and minor (real time) R1 Shard 1 • Primary key order is same as index order • Shards are based on R2 Shard 2 snapshots of splits • IndexedTableSplit allows cross-region R3 Shard 3 shard splits to be integrated at Index load time
  • 23. Batch indices are memory based, stored on DFS • Total of all shards about 1tb – With ECC memory <$7/gb, systems easily achieving 128-256gb each=> no problem • Each shard ~5gb in size to improve parallelism on search – Variable depending on needs and use case • Each shard is composed of multiple map and reduce parts along with MapReduce statistics from HBase – Integration of components are done in memory – Partitioner utilizes observed term distributions – New MR committer: FileAndPutOutputCommitter • Allows low volume secondary outputs from Map phase to be used during reduce phase
  • 24. Agenda • What is YapMap? • Interfacing with Data • Using HBase as a data processing pipeline • NoSQL Schemas: Adjusting and Migrating • Index Construction HBase Operations
  • 25. HBase Operations • Getting GC right – 6 months – Machines have 32gb, 12gb for HBase, more was a problem • Pick the right region size: With HFile v2, just start bigger • Be cautious about using multiple CFs • Consider Asynchbase Client – Benoit did some nice work at SU – Ultimately we just leveraged EJB3.1 @Async capabilities to make our HBase service async • Upgrade: typically on the first or second point release – Testing/research cluster first • Hardware: 8 core low power chips, low power ddr3, 6x WD Black 2TB drives per machine, Infiniband • MapR’s M3 distribution of Hadoop
  • 26. Questions • Why not Lucene/Solr/ElasticSearch/etc? – Data locality between main and detail documents to do document-at-once scoring – Not built to work well with Hadoop and HBase (Blur.io is first to tackle this head on) • Why not store indices directly in HBase? – Single cell storage would be the only way to do it efficiently – No such thing as a single cell no-read append (HBASE-5993) – No single cell partial read • Why use Riak for presentation side? – Hadoop SPOF – Even with newer Hadoop versions, HBase does not do sub-second row-level HA on node failure (HBASE-2357) – Riak has more predictable latency • Why did you switch to MapR? – Index load performance was substantially faster – Less impact on HBase performance – Snapshots in trial copy were nice for those 30 days