This document discusses YapMap, a visual search platform built on Hadoop and HBase. It summarizes how YapMap interfaces with HBase data, uses HBase as a checkpointed data processing pipeline, and adjusted schemas and migrated data as the system evolved. It also covers how YapMap constructs search indexes in shards based on HBase regions and stores those indexes on HDFS. The document concludes with some lessons learned around optimizing HBase operations.
HBaseCon 2012 | Building a Large Search Platform on a Shoestring Budget
1. Building a Large Search Platform on a Shoestring Budget
Jacques Nadeau, CTO
jacques@yapmap.com
@intjesus
May 22, 2012
2. Agenda
What is YapMap?
• Interfacing with Data
• Using HBase as a data processing pipeline
• NoSQL Schemas: Adjusting and Migrating
• Index Construction
• HBase Operations
3. What is YapMap?
• A visual search technology
• Focused on threaded conversations
• Built to provide better context and ranking
• Built on Hadoop & HBase for massive scale
• Two self-funded guys
• Motoyap.com (www.motoyap.com) is the largest implementation, at 650 million automotive docs
4. Why do this?
• Discussion forums and mailing lists are the primary home for many hobbies
• Threaded search sucks
– No context in the middle of the conversation
5. How does it work?
(Diagram: a thread of six posts, Post 1 through Post 6.)
7. Conceptual data model
(Diagram: a thread of six posts annotated with the data model.)
• The entire thread is a MainDocGroup
• For long threads, a single group may have multiple MainDocs
• Each individual post is a DetailDoc
• Threads are broken up among many web pages and don't necessarily arrive in order
• Longer threads are broken up
– For short threads, MainDocGroup == MainDoc
8. General architecture
(Architecture diagram: Targeted Crawlers -> Processing Pipeline -> Indexing Engine -> Results Presentation, connected via RabbitMQ and MapReduce; storage layer: HBase, Riak, HDFS/MapRfs, Zookeeper, and MySQL at each end.)
9. We match the tool to the use case
                     MySQL                 HBase                          Riak
Primary use          Business management   Storage of crawl data;         Storage of components
                     information           processing pipeline            directly related to
                                                                          presentation
Key features that    Transactions, SQL,    Consistency, redundancy,       Predictable low latency,
drove selection      JPA                   memory-to-persistence ratio    full uptime, max one
                                                                          IOP per object
Average object size  Small                 20 KB                          2 KB
Object count         <1 million            500 million                    1 billion
System count         2                     10                             8
Memory footprint     <1 GB                 120 GB                         240 GB
Dataset size         10 MB                 10 TB                          2 TB
We also evaluated Voldemort and Cassandra
10. Agenda
• What is YapMap?
Interfacing with Data
• Using HBase as a data processing pipeline
• NoSQL Schemas: Adjusting and Migrating
• Index Construction
• HBase Operations
11. HBase client is a power user interface
• HBase client interface is low-level
– Similar to JDBC/SQL
• Most people start by using Bytes.to(String|Short|Long) (see the sketch below)
– Spaghetti data layer
• New developers have to learn a bunch of new concepts
• Mistakes are easy to make
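A minimal sketch of that raw-client style, using the 0.92-era Java API (the "crawlJob" table and column names echo slide 13 but are illustrative only):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Every field is hand-converted with Bytes.*, so column names and value
// types end up scattered ("spaghetti") across the code base.
public class RawClientExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "crawlJob");

    // Write: the caller must remember family, qualifier, and value type.
    Put put = new Put(Bytes.toBytes("job-42"));
    put.add(Bytes.toBytes("main"), Bytes.toBytes("firstCheckpointTime"),
            Bytes.toBytes(System.currentTimeMillis()));
    table.put(put);

    // Read: the same knowledge must be repeated; a wrong conversion
    // (e.g. Bytes.toInt on a long cell) fails only at runtime.
    Result result = table.get(new Get(Bytes.toBytes("job-42")));
    long ts = Bytes.toLong(result.getValue(
        Bytes.toBytes("main"), Bytes.toBytes("firstCheckpointTime")));
    table.close();
  }
}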
12. We built a DTO layer to simplify dev
• Data Transfer Objects (DTO) & data access layer provide a single point for code changes and data migration
• First-class row key objects
• Centralized type serialization
– Standard data types
– Complex object serialization layer via protobuf
• Provide optimistic locking
• Enable asynchronous operation
• Minimize mistakes:
– QuerySet abstraction (columns & column families)
– Field state management (not queried versus null)
• Newer tools have arrived to ease this burden
– Kundera and Gora
13. Examples from our DTO abstraction
Definition:
<table name="crawlJob" row-id-class="example.CrawlJobId" >
  <column-family name="main" compression="LZO" blockCacheEnabled="false" versions="1">
    <column name="firstCheckpoint" type="example.proto.JobProtos$CrawlCheckpoint" />
    <column name="firstCheckpointTime" type="Long" />
    <column name="entryCheckpointCount" type="Long" />
    ...

Generated model:
public class CrawlJobModel extends SparseModel<CrawlJobId> {
  public CrawlJobId getId() {…}
  public boolean hasFirstCheckpoint() {…}
  public CrawlCheckpoint getFirstCheckpoint() {…}
  public void setFirstCrawlCheckpoint(CrawlCheckpoint checkpoint) {…}
  …
}

HBase interface:
public interface HBaseReadWriteService<T> {
  public void putUnsafe(T model);
  public void putVersioned(T model);
  public T get(RowId<T> rowId, QuerySet<T> querySet);
  public void increment(RowId<T> rowId, IncrementPair<T>... pairs);
  public StructuredScanner<T> scanByPrefix(byte[] bytePrefix, QuerySet<T> querySet);
  …
}
14. Example Primary Keys
UrlId (example: org.apache.hbase:80:x:/book/architecture.html)
– Reverse domain
– Optional port
– Client protocol (e.g. user name + http)
– Path + query string

MainDocId: GroupId (row) plus a 2-byte bucket number (part)
– 4-byte source id
– 1-byte identifier-type enum (int, long, sha2, or generic 32)
– Additional identifier (4, 8, or 32 bytes depending on type)
– 2-byte bucket number
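A minimal sketch of packing that MainDocId layout into a sortable row key (illustrative only; the enum value and field order follow the description above, not YapMap's actual code):

import java.nio.ByteBuffer;

public class MainDocIdKey {
  static final byte TYPE_LONG = 1; // hypothetical enum value for a long identifier

  // [4-byte source id][1-byte type enum][8-byte id][2-byte bucket]
  public static byte[] toRowKey(int sourceId, long id, short bucket) {
    ByteBuffer buf = ByteBuffer.allocate(4 + 1 + 8 + 2);
    buf.putInt(sourceId);   // 4-byte source id
    buf.put(TYPE_LONG);     // 1-byte identifier-type enum
    buf.putLong(id);        // additional identifier (8-byte long variant)
    buf.putShort(bucket);   // 2-byte bucket number
    return buf.array();     // big-endian, so keys sort by source, type, id
  }
}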
15. Agenda
• What is YapMap?
• Interfacing with Data
Using HBase as a data processing pipeline
• NoSQL Schemas: Adjusting and Migrating
• Index Construction
• HBase Operations
16. Processing pipeline is built on HBase
• Multiple steps with checkpoints to manage failures
• Out of order input assumed
• Idempotent operations at each stage of process
• Utilize optimistic locking to do coordinated merges (see the sketch after the diagram)
• Use regular cleanup scans to pick up lost tasks
• Control message batch size to trade throughput against latency
(Pipeline diagram: Crawlers -> [message] -> Build Main Docs -> [message] -> Merge + Split Main Doc Groups -> [message] -> Pre-index Main Docs -> [batch] -> Indexing / RT Indexing; backed by Cache, DFS, and HBase tables t1:cf1, t2:cf1, t2:cf2.)
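A minimal sketch of the optimistic-locking pattern using the client's checkAndPut (table, family, and qualifier names are hypothetical; the merge logic is a placeholder):

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Each writer reads a version cell, merges, and commits only if the version
// is unchanged; losing the race means re-reading and retrying, which is safe
// because every stage is idempotent.
public class OptimisticMerge {
  static final byte[] CF = Bytes.toBytes("main");
  static final byte[] VERSION = Bytes.toBytes("version");
  static final byte[] BODY = Bytes.toBytes("body");

  static void merge(HTable table, byte[] row, byte[] newDetail) throws Exception {
    while (true) {
      Result current = table.get(new Get(row));
      byte[] oldVersion = current.getValue(CF, VERSION);
      long next = (oldVersion == null) ? 1L : Bytes.toLong(oldVersion) + 1;

      Put put = new Put(row);
      put.add(CF, BODY, mergeBodies(current.getValue(CF, BODY), newDetail));
      put.add(CF, VERSION, Bytes.toBytes(next));

      // Atomic: commits only if the version cell still matches what we read
      // (a null expected value means "the cell must not exist yet").
      if (table.checkAndPut(row, CF, VERSION, oldVersion, put)) {
        return;
      }
    }
  }

  static byte[] mergeBodies(byte[] existing, byte[] addition) {
    return addition; // placeholder for the real merge
  }
}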
17. Migrating from messaging to coprocessors
• Big challenges
– Mixing system code and application code
– Memory impact: we currently have a GC-stable state
• Exploring HBASE-4047 to solve this
(Same pipeline diagram, with coprocessors (CP) replacing two of the message hops; backed by Cache, DFS, and HBase tables t1:cf1, t2:cf1, t2:cf2.)
18. Agenda
• What is YapMap?
• Interfacing with Data
• Using HBase as a data processing pipeline
NoSQL Schemas: Adjusting and Migrating
• Index Construction
• HBase Operations
19. Learn to leverage NoSQL strengths
Original structure:
• Similar to a traditional RDBMS
– Static column names
– Fully realized MainDoc
• One new DetailDoc could cause a cascading regeneration of all MainDocs

New structure:
• Utilizes a cell for each DetailDoc
• Split metadata maps MainDoc > DetailDocId
• HBase handles cascading changes
• MainDoc realized on app read
• Use column prefixing (see the sketch below)

(Row layout: columns 0, 1, 2 hold MainDocs; a metadata column holds the Splits; columns detailid1, detailid2 hold Details.)
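A minimal sketch of the cell-per-DetailDoc layout and a prefix read (table, family, and qualifier names are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

// A new post is one new cell, so HBase absorbs the change instead of the
// application regenerating every MainDoc in the group.
public class DetailDocCells {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mainDocGroup");
    byte[] groupRow = Bytes.toBytes("group-1");
    byte[] cf = Bytes.toBytes("main");

    // One DetailDoc = one cell under a prefixed qualifier.
    Put put = new Put(groupRow);
    put.add(cf, Bytes.toBytes("detail:42"), Bytes.toBytes("serialized post"));
    table.put(put);

    // Column prefixing pulls back just the detail cells; the MainDoc is
    // then realized in the application at read time.
    Get get = new Get(groupRow);
    get.setFilter(new ColumnPrefixFilter(Bytes.toBytes("detail:")));
    Result details = table.get(get);
    System.out.println(details.size() + " detail cells");
    table.close();
  }
}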
20. Schema migration steps
1. Disable application writes on OldTable
2. Extract OldSplits from OldTable
3. Create NewTable with appropriate column families and properties
4. Split NewTable based on OldSplits
5. Run a MapReduce job that converts old objects into new objects (see the sketch below)
– Use TableInputFormat as input on OldTable
– Use HFileOutputFormat as output format pointing at NewTable
6. Bulk load output into NewTable
7. Redeploy application to read from NewTable
8. Enable writes in application layer
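A minimal sketch of steps 5 and 6 (ConvertMapper, the table names, and the output path are hypothetical; the job wiring uses the standard 0.92-era MR classes):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SchemaMigration {

  // Hypothetical mapper: rewrites an old-schema row into new-schema Puts.
  static class ConvertMapper extends TableMapper<ImmutableBytesWritable, Put> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws IOException, InterruptedException {
      Put put = new Put(row.get());
      // ... copy and transform cells from the old layout into the new one ...
      ctx.write(row, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "schema-migration");
    job.setJarByClass(SchemaMigration.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // bigger scanner batches for a full-table read
    scan.setCacheBlocks(false);  // don't pollute the block cache from MR

    // Step 5: read OldTable, emit Puts shaped for NewTable
    TableMapReduceUtil.initTableMapperJob("OldTable", scan, ConvertMapper.class,
        ImmutableBytesWritable.class, Put.class, job);

    // Write HFiles partitioned to match NewTable's regions (the OldSplits)
    HTable newTable = new HTable(conf, "NewTable");
    HFileOutputFormat.configureIncrementalLoad(job, newTable);
    FileOutputFormat.setOutputPath(job, new Path("/tmp/newtable-hfiles"));

    if (job.waitForCompletion(true)) {
      // Step 6: bulk load the generated HFiles into NewTable
      new LoadIncrementalHFiles(conf).doBulkLoad(
          new Path("/tmp/newtable-hfiles"), newTable);
    }
  }
}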
21. Agenda
• What is YapMap?
• Interfacing with Data
• Using HBase as a data processing pipeline
• NoSQL Schemas: Adjusting and Migrating
Index Construction
• HBase Operations
22. Index Shards loosely based on HBase regions
• Indexing is split between major indices (batch) and minor (real time)
• Primary key order is the same as index order
• Shards are based on snapshots of splits
• IndexedTableSplit allows cross-region shard splits to be integrated at index load time

(Diagram: Tokenized Main Docs in regions R1, R2, R3 feed Shard 1, Shard 2, Shard 3.)
23. Batch indices are memory-based, stored on DFS
• Total of all shards is about 1 TB
– With ECC memory under $7/GB and systems easily reaching 128-256 GB each => no problem
• Each shard is ~5 GB to improve parallelism on search
– Variable depending on needs and use case
• Each shard is composed of multiple map and reduce parts along with MapReduce statistics from HBase
– Integration of components is done in memory
– Partitioner utilizes observed term distributions (see the sketch below)
– New MR committer: FileAndPutOutputCommitter
• Allows low-volume secondary outputs from the map phase to be used during the reduce phase
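A minimal sketch of a term-distribution-driven partitioner (the split terms are hypothetical sampled data; YapMap's actual partitioner and key types may differ):

import java.util.Arrays;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Split points sampled from observed term frequencies, so each reducer
// (and therefore each index part) receives a similar posting volume.
public class TermDistributionPartitioner extends Partitioner<Text, Text> {
  private static final String[] TERM_SPLITS = { "engine", "oil", "tire" };

  @Override
  public int getPartition(Text term, Text posting, int numPartitions) {
    int idx = Arrays.binarySearch(TERM_SPLITS, term.toString());
    int bucket = (idx >= 0) ? idx + 1 : -(idx + 1);
    return Math.min(bucket, numPartitions - 1);
  }
}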
24. Agenda
• What is YapMap?
• Interfacing with Data
• Using HBase as a data processing pipeline
• NoSQL Schemas: Adjusting and Migrating
• Index Construction
HBase Operations
25. HBase Operations
• Getting GC right took 6 months
– Machines have 32 GB, with 12 GB for HBase; more was a problem
• Pick the right region size: with HFile v2, just start bigger
• Be cautious about using multiple CFs
• Consider the asynchbase client
– Benoit did some nice work at SU
– Ultimately we just leveraged EJB 3.1 @Async capabilities to make our HBase service async (see the sketch below)
• Upgrade: typically on the first or second point release
– Testing/research cluster first
• Hardware: 8-core low-power chips, low-power DDR3, 6x WD Black 2 TB drives per machine, Infiniband
• MapR's M3 distribution of Hadoop
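A minimal sketch of the EJB 3.1 @Asynchronous approach (the wiring is illustrative, not YapMap's actual service; CrawlJobModel, RowId, QuerySet, and HBaseReadWriteService are the slide 13 abstractions):

import java.util.concurrent.Future;
import javax.ejb.AsyncResult;
import javax.ejb.Asynchronous;
import javax.ejb.EJB;
import javax.ejb.Stateless;

@Stateless
public class AsyncCrawlJobService {

  @EJB
  private HBaseReadWriteService<CrawlJobModel> hbase;

  // The container runs this on its own thread pool, so the caller gets a
  // Future immediately instead of blocking on the HBase RPC.
  @Asynchronous
  public Future<CrawlJobModel> getAsync(RowId<CrawlJobModel> id,
                                        QuerySet<CrawlJobModel> query) {
    CrawlJobModel model = hbase.get(id, query); // synchronous HBase read
    return new AsyncResult<CrawlJobModel>(model);
  }
}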
26. Questions
• Why not Lucene/Solr/ElasticSearch/etc?
– Data locality between main and detail documents to do document-at-once scoring
– Not built to work well with Hadoop and HBase (Blur.io is the first to tackle this head-on)
• Why not store indices directly in HBase?
– Single cell storage would be the only way to do it efficiently
– No such thing as a single cell no-read append (HBASE-5993)
– No single cell partial read
• Why use Riak for the presentation side?
– Hadoop SPOF
– Even with newer Hadoop versions, HBase does not do sub-second row-level HA on node failure (HBASE-2357)
– Riak has more predictable latency
• Why did you switch to MapR?
– Index load performance was substantially faster
– Less impact on HBase performance
– Snapshots in the trial copy were nice for those 30 days