Searching conversations with hadoop

Searching Conversations using Hadoop: More than Just Analytics

Jacques Nadeau, CTO

jacques@yapmap.com

@intjesus

June 13, 2012

Agenda

✓ What is YapMap?
•  Fitting Hadoop into your architecture
•  YapMap Approach
–  Crawling
–  Processing
–  Index Generation
–  Results
•  Operations, Getting Started & Questions

What is YapMap?

•  A visual search technology
•  Focused on threaded conversations
•  Built to provide better context and ranking
•  Built on the Hadoop ecosystem for massive scale
•  Two self-funded guys
•  Motoyap.com (www.motoyap.com) is the largest implementation, at 650 million automotive docs

Why do this?

•  Discussion forums and mailing lists are the primary home for many hobbies
•  Threaded search sucks
–  No context in the middle of the conversation

How does it work?

[Figure: a conversation shown as Post 1 through Post 6]

Conceptual data model

[Figure: a Thread made up of Post 1 through Post 6, with labels for a Sub-thread and an Individual post]

•  Single thread scattered across many web pages
•  Posts don’t necessarily arrive in order
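
A minimal sketch of the conceptual model above, assuming hypothetical class and field names (`Post`, `Thread`, `seq`, `max_len`); none of these come from YapMap itself. It shows posts that arrive out of order being put back in thread order, and a long thread being split into sub-threads.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Post:
    thread_id: str
    seq: int          # position within the thread (assumed to be known from the page)
    body: str


@dataclass
class Thread:
    thread_id: str
    posts: List[Post] = field(default_factory=list)

    def add(self, post: Post) -> None:
        """Insert a post, keeping thread order even when posts arrive out of order."""
        if any(p.seq == post.seq for p in self.posts):
            return  # already seen: adding the same post twice is a no-op
        self.posts.append(post)
        self.posts.sort(key=lambda p: p.seq)

    def sub_threads(self, max_len: int) -> List[List[Post]]:
        """Split the thread into consecutive sub-threads of at most max_len posts."""
        return [self.posts[i:i + max_len] for i in range(0, len(self.posts), max_len)]


thread = Thread("t1")
for seq in (3, 1, 2, 6, 4, 5, 3):          # out of order, with one duplicate
    thread.add(Post("t1", seq, f"post {seq}"))

ordered = [p.seq for p in thread.posts]
chunks = [[p.seq for p in chunk] for chunk in thread.sub_threads(max_len=3)]
```

Splitting by a fixed post count is only one possible rule; the slides do not say how YapMap chooses sub-thread boundaries.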

A YapMap search result page

Agenda

•  What is YapMap?
✓ Fitting Hadoop into your architecture
•  YapMap Approach
–  Crawling
–  Processing
–  Index Generation
–  Results
•  Operations, Getting Started & Questions

Evolution of Hadoop

Hadoop Today
•  Batch analysis system
•  Lacks enterprise features (e.g. HA, stability, compat)
•  Limited applications, primarily BI & analytics
•  Clusters focused on point use cases

Hadoop Tomorrow
•  Real-time enterprise application platform
•  Strong enterprise features
•  BI, Email/Collaboration, Marketing DW, etc.
•  Shared resource supporting a large number of use cases

Complementary to existing technologies

Traditional Tools
•  Glassfish 3.1.2 (EJB & CDI)
•  MySQL
•  RabbitMQ
•  Protobuf
•  Varnish
•  Riak

Hadoop Additions
•  Zookeeper
•  HBase
•  MapReduce
•  MapRfs/HDFS
•  Mahout

General architecture

[Figure: Crawler -> Processing Pipeline -> Indexing Engine -> Results Presentation, drawn with RabbitMQ, MapReduce, HBase, Riak, HDFS/MapRfs, Zookeeper, and MySQL as the supporting components]

Hadoop doesn’t solve all problems

                                  | MySQL                           | HBase                                                | Riak
Primary use                       | Business management information | Storage of crawl data, processing pipeline           | Storage of components directly related to presentation
Key features that drove selection | Transactions, SQL, JPA          | Consistency, redundancy, memory to persistence ratio | Predictable low latency, full uptime, max one IOP per object
Average object size               | Small                           | 20k                                                  | 2k
Object count                      | <1 million                      | 500 million                                          | 1 billion
System count                      | 2                               | 10                                                   | 8
Memory footprint                  | <1gb                            | 120gb                                                | 240gb
Dataset size                      | 10mb                            | 10tb                                                 | 2tb

We also evaluated Voldemort and Cassandra.
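
As a quick consistency check on the table, object count times average object size should roughly match dataset size, and memory footprint divided by dataset size gives the memory to persistence ratio. The sketch below assumes decimal units and that "20k" and "2k" mean bytes per object; those unit readings are assumptions, not something stated on the slide.

```python
KB, GB, TB = 10**3, 10**9, 10**12   # decimal units (assumed)

stores = {
    # name: (object count, average object size, memory footprint, dataset size)
    "HBase": (500 * 10**6, 20 * KB, 120 * GB, 10 * TB),
    "Riak":  (1 * 10**9,    2 * KB, 240 * GB,  2 * TB),
}


def summarize(count, avg_size, memory, dataset):
    estimated = count * avg_size          # bytes implied by count x average size
    ratio = memory / dataset              # fraction of the dataset that fits in memory
    return estimated, ratio


results = {name: summarize(*row) for name, row in stores.items()}

for name, (estimated, ratio) in results.items():
    print(f"{name}: ~{estimated / TB:.0f} TB implied, memory/dataset = {ratio:.1%}")
```

Under those assumptions the implied sizes come out to 10 TB and 2 TB, matching the table, and Riak keeps roughly ten times more of its data in memory than HBase does (12% versus 1.2%), which fits the "predictable low latency" goal on the presentation side.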

How we use Hadoop

•  Zookeeper (vs. Corosync, Accord, JGroups)
–  Distributed locks
–  Cluster membership coordination
–  Index distribution coordination
•  HBase (vs. Teradata, Exadata, sharded MySQL, Cassandra)
–  Primary data store
–  Crawl caching
–  Data merging
–  Processing pipeline
•  MapReduce (vs. MPI, JPPF, Clustered EJB)
–  Index generation
•  MapRfs/HDFS (vs. Gluster, SAN/NAS, Lustre)
–  Index storage
•  Mahout (vs. Carrot2, Lingpipe, Lexalytics)
–  Cluster identification

YapMap Approach: Crawling

YapMap crawling challenges

•  Depth versus breadth
•  Crawls must be throttled to avoid overloading
•  Avoid duplicate crawling
•  Save progress of long running crawls
•  Need an elastic and fully distributed approach to crawling
•  Crawler death must be managed

Crawler overview

1. New crawl job arrives (RabbitMQ)
2. Crawler checks document cache (HBase)
3. Crawler acquires domain lock (Zookeeper)
4. Crawler retrieves external assets
5. Crawler outputs posts, using append as necessary (DFS)
6. Crawler generates more crawl tasks

After achieving time and/or quantity thresholds, the crawl pauses, checkpoints in HBase, and resubmits to the RabbitMQ queue.
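
The crawl loop above can be sketched roughly as follows. This is a single-process illustration only: the queue, cache, lock set, checkpoint dict, and the `SITE` map are in-memory stand-ins invented for the example, not RabbitMQ, HBase, Zookeeper, or DFS clients, and `run_crawl_job` and `max_pages` are hypothetical names.

```python
from collections import deque

queue = deque()          # stand-in for the RabbitMQ queue
doc_cache = set()        # stand-in for the HBase document cache
checkpoints = {}         # stand-in for crawl checkpoints stored in HBase
domain_locks = set()     # stand-in for Zookeeper domain locks
output = []              # stand-in for posts appended to the DFS

# Hypothetical site: each URL maps to (post text, links found on that page).
SITE = {
    "forum/p1": ("post 1", ["forum/p2", "forum/p3"]),
    "forum/p2": ("post 2", ["forum/p3"]),
    "forum/p3": ("post 3", []),
}


def run_crawl_job(job, max_pages):
    """Process one crawl job; pause and resubmit after max_pages fetches."""
    domain = job["domain"]
    if domain in domain_locks:            # 3. another crawler owns this domain
        queue.append(job)
        return
    domain_locks.add(domain)
    try:
        pending = deque(job["urls"])
        fetched = 0
        while pending and fetched < max_pages:
            url = pending.popleft()
            if url in doc_cache:          # 2. skip documents already crawled
                continue
            text, links = SITE[url]       # 4. retrieve the external asset
            doc_cache.add(url)
            output.append(text)           # 5. output posts
            pending.extend(links)         # 6. generate more crawl tasks
            fetched += 1
        if pending:                       # threshold hit: checkpoint and resubmit
            checkpoints[domain] = list(pending)
            queue.append({"domain": domain, "urls": list(pending)})
    finally:
        domain_locks.discard(domain)      # always release the lock


queue.append({"domain": "forum", "urls": ["forum/p1"]})   # 1. new crawl job arrives
rounds = 0
while queue:
    run_crawl_job(queue.popleft(), max_pages=2)
    rounds += 1
```

Holding one lock per domain is what keeps two crawlers from hitting the same site at once, and the checkpoint-and-resubmit step lets a long crawl survive the loss of any single crawler.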

YapMap Approach: Processing Pipeline

Processing pipeline challenges

•  Independent posts => complete threads
•  Split long threads into multiple sub-threads
•  Fully parallel processing pipeline
•  Accommodate out of order data

Processing pipeline using HBase

•  Multiple steps with checkpoints to manage failures
•  Idempotent operations at each stage of process
•  Utilize optimistic locking to do coordinated merges
•  Use regular cleanup scans to pick up lost tasks
•  Control batch size of messages to control throughput versus latency
•  Out of order input assumed
[Figure: posts from the crawler flow as messages through "Process & pre-index", thread building, and "Merge + split" of threads and sub-threads, then on to batch indexing and RT indexing; HBase and Riak appear as the backing stores]

YapMap Approach: Index Generation

Index generation challenges

•  Shard size control
•  Index ordering
•  Maintain inverted and un-inverted data in parallel
•  Minimize merging costs
•  Support multi-grain indexing and scoring

Index Shards loosely based on HBase regions

•  HBase primary key order is same as index order
•  Shards sized based on parallelization requirements
–  Typically ~5gb each
•  Shards are based on snapshots of splits for data locality

[Figure: Pre-index Docs in regions R1, R2, R3 map to Shard 1, Shard 2, Shard 3]
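
A minimal sketch of size-based shard planning, assuming the input is a list of regions already in primary-key order, each with a known size. The function name `plan_shards`, the region keys, and the sizes are made up for illustration; the slides only say that shards follow region splits and are typically about 5gb.

```python
def plan_shards(regions, target_bytes):
    """Group consecutive regions (already in row-key order) into shards.

    regions: list of (start_key, size_in_bytes) in primary-key order.
    Returns a list of shards, each a list of region start keys.
    """
    shards, current, current_size = [], [], 0
    for start_key, size in regions:
        # Close the current shard before it would exceed the target size.
        if current and current_size + size > target_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.append(start_key)
        current_size += size
    if current:
        shards.append(current)
    return shards


GB = 10**9
regions = [("a", 2 * GB), ("f", 2 * GB), ("k", 4 * GB), ("p", 1 * GB), ("t", 3 * GB)]
shards = plan_shards(regions, target_bytes=5 * GB)
```

Since regions are only ever grouped with their neighbors, the shards stay in primary-key order, so index order still matches HBase key order.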

MapReduce for Index Generation

[Figure: MapReduce job for index generation. Labels: IndexedTableInputFormat, Map Split, Map, Term Distribution Partitioner, Statistics, Barrier, Reduce, Term: Posting Lists, FileAndPutOutputCommitter. Outputs: inverted data (indices & dictionaries) and un-inverted data characteristics, with DFS and HBase as the stores]
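
The general shape of building term posting lists with map, partition, and reduce steps can be shown in a few lines of plain Python. This is an in-memory illustration, not Hadoop code, and the first-letter range partitioner here is a simple stand-in chosen for the example, not the Term Distribution Partitioner named on the slide.

```python
from collections import defaultdict


def map_doc(doc_id, text):
    """Map: emit (term, (doc_id, position)) for every token."""
    for position, term in enumerate(text.lower().split()):
        yield term, (doc_id, position)


def partition(term, num_reducers):
    """Range partitioner on the first letter, so reducer outputs stay in term order.

    Assumes terms start with a lowercase ASCII letter.
    """
    span = 26 / num_reducers
    return min(int((ord(term[0]) - ord("a")) / span), num_reducers - 1)


def reduce_term(term, postings):
    """Reduce: build the posting list for one term, sorted by document then position."""
    return term, sorted(postings)


def build_index(docs, num_reducers=2):
    # Shuffle: route each emitted pair to a reducer and group by term.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for doc_id, text in docs:
        for term, posting in map_doc(doc_id, text):
            buckets[partition(term, num_reducers)][term].append(posting)
    # Each reducer writes its terms in sorted order.
    return [
        [reduce_term(term, bucket[term]) for term in sorted(bucket)]
        for bucket in buckets
    ]


docs = [(1, "brake pads squeal"), (2, "replace brake pads"), (3, "squeal after rain")]
parts = build_index(docs)
```

A range partitioner keeps each reducer's output a contiguous, sorted slice of the term space, so the per-reducer files can be concatenated without a merge step.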

YapMap Approach: Results Presentation

Presentation Layer Challenges

•  Distributed search tree
•  High performance index loading and serving
•  No SPOF
•  Effective memory management & allocation
•  Automatic cluster management
•  Smart index distribution

Results Presentation Layer

Request path (Results Servers):
1. Request
2. Query Zookeeper for active servers
3. Fan-out request, consolidate responses
4. Retrieve assets (Riak)
5. Response

Shard loading (Index Servers, each running Shard Daemons):
1. Load shard profile & configure memory (HBase)
2. Parallel load and integrate shard (DFS)
3. Register new shard availability (Zookeeper)
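
The fan-out and consolidate step in the request path can be sketched as a parallel query over the active shards followed by a top-k merge. Everything here is a stand-in: `SHARDS`, `active_servers`, and `search_shard` are hypothetical names, and the "remote call" is a local lookup.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shard contents: shard name -> list of (score, doc_id).
SHARDS = {
    "shard-1": [(0.9, "d1"), (0.4, "d2")],
    "shard-2": [(0.8, "d3"), (0.7, "d4")],
    "shard-3": [(0.95, "d5")],
}


def active_servers():
    """Stand-in for asking Zookeeper which shard servers are currently registered."""
    return sorted(SHARDS)


def search_shard(shard, query, k):
    """Stand-in for a remote call; each shard returns its own top-k hits."""
    return heapq.nlargest(k, SHARDS[shard])


def search(query, k):
    servers = active_servers()
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, query, k), servers))
    # Consolidate: the global top-k is the top-k of the per-shard top-k lists.
    return heapq.nlargest(k, (hit for partial in partials for hit in partial))


top = search("brake squeal", k=3)
```

Asking each shard for only its own top k keeps the consolidation cheap: the results server merges at most k hits per shard regardless of shard size.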

Agenda

•  What is YapMap?
•  Fitting Hadoop into your architecture
•  YapMap Approach
–  Crawling
–  Processing
–  Index Generation
–  Results
✓ Operations, Getting Started & Questions

Operations

•  Hardware
–  Supermicro with 8 core low power chips, low power DDR3
–  WD Black 2TB drives
–  DDR Infiniband using IPoIB for index loading performance
•  Software
–  Started on Cloudera, switched to MapR’s M3 distribution of Hadoop
•  GC was painful, now manageable
–  HBase now supports MSLAB for writes and off-heap block cache to support larger memory usage
–  Shard servers utilize large pages to minimize fragmentation
–  Shard servers do immediate large allocations to minimize GC problems

Getting Started

•  Amazon Elastic Map Reduce
–  Common Crawl dataset is a great data set to start with
•  Cheap old-gen cluster if you want to run things like HBase
–  We built an effective 6 node Hadoop/HBase cluster for $1500 (Craigslist, eBay)
–  Mailing lists are littered with performance and interconnectivity challenges when using cloud computing resources to do Hadoop stuff

Questions

•  Why not Lucene/Solr/ElasticSearch/Katta/etc?
–  Not built to work well with Hadoop and HBase (Blur.io is first to tackle this head on)
–  Data locality between threads and posts to do document-at-once scoring
•  Why not store indices directly in HBase?
–  Single cell storage would be the only way to do it efficiently
–  No such thing as a single cell no-read append (HBASE-5993)
–  No single cell partial read
•  Why use Riak for presentation side?
–  Hadoop SPOF
–  Even with newer Hadoop versions, HBase does not do sub-second row-level HA on node failure (HBASE-2357)
–  Riak has more predictable latency
•  Why did you switch to MapR?
–  Index load performance was substantially faster
–  Snapshots in trial copy were nice for those 30 days
–  Less impact on HBase performance
