Realtime Analytics with Hadoop and HBase

Realtime Analytics using
Hadoop & HBase
Lars George,
Solutions Architect @ Cloudera
lars@cloudera.com

Monday, July 25, 11

About Me

• Solutions Architect @ Cloudera
• Apache HBase & Whirr Committer
• Working with Hadoop & HBase since 2007
• Author of O’Reilly’s “HBase - The Definitive Guide”


The Application Stack
• Solve Business Goals
• Rely on Proven Building Blocks
• Rapid Prototyping
‣ Templates, MVC, Reference Implementations
• Evolutionary Innovation Cycles
“Let there be light!”

L Linux

A Apache

M MySQL

P PHP/Perl


L Linux

A Apache

M MySQL

M Memcache

P PHP/Perl


The Dawn of Big Data
• Industry verticals produce a staggering amount of data
• Not only web properties, but also “brick and mortar”
businesses
‣ Smart Grid, Bio Informatics, Financial, Telco
• Scalable computation frameworks allow analysis of all the data
‣ No sampling anymore
• Suitable algorithms derive even more data
‣ Machine learning
• “The Unreasonable Effectiveness of Data”
‣ More data is better than smart algorithms


Hadoop

• HDFS + MapReduce
• Based on Google Papers
• Distributed Storage and Computation
Framework
• Affordable Hardware, Free Software
• Significant Adoption

HDFS
• Reliably store petabytes of replicated data across
thousands of nodes
‣ Data divided into 64MB blocks, each block replicated
three times
• Master/Slave Architecture
‣ Master NameNode holds the metadata
‣ Slave DataNodes manage blocks on the local file system
• Built on “commodity” hardware
‣ No 15k RPM disks or RAID required (nor wanted!)
‣ Commodity Server Hardware
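
As a rough worked example of the block arithmetic above, the sketch below estimates how many blocks a file occupies and how much raw disk it consumes with 64 MB blocks and three replicas. The function name and the 1 GB example file are illustrative assumptions, not part of HDFS itself.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB block size, as stated on the slide
REPLICATION = 3                # each block is stored three times


def hdfs_footprint(file_size_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (block_count, raw_bytes) for a file of the given size.

    Simplified model: the last block only occupies as many bytes as it
    holds, so raw usage is the file size times the replication factor.
    """
    blocks = -(-file_size_bytes // block_size)  # integer ceiling division
    return blocks, file_size_bytes * replication


blocks, raw_bytes = hdfs_footprint(1024 ** 3)  # hypothetical 1 GB file
print(blocks, raw_bytes)
```

Under these assumptions a 1 GB file splits into 16 blocks and takes 3 GB of raw disk across the cluster.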


MapReduce
• Distributed programming model to reliably
process petabytes of data
• Locality of data to processing is vital
‣ Run code where data resides
• Inspired by map and reduce functions in
functional programming

Input ➜ Map() ➜ Copy/Sort ➜ Reduce() ➜ Output
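
The pipeline above can be sketched in a few lines of plain Python. This is a single-process toy that only mirrors the Map, Copy/Sort, and Reduce stages with a word count; it does not use the Hadoop API, and all function names are made up for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map(): emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Copy/Sort: group all values by key and return the keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce(): collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return [reduce_phase(key, values) for key, values in shuffle(pairs)]


print(word_count(["hadoop stores data", "hbase serves data"]))
# [('data', 2), ('hadoop', 1), ('hbase', 1), ('serves', 1), ('stores', 1)]
```

In a real cluster the map calls run on the nodes that hold the input blocks, and the shuffle moves data over the network to the reducers.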

From Short to Long Term
Internet

LAM(M)P
• Serves the Client
• Stores Intermediate Data

Hadoop
• Background Batch Processing
• Stores Long-Term Data


Batch Processing
• Scale is Unlimited
‣ Bound only by Hardware
• Harness the Power of the Cluster
‣ CPUs, Disks, Memory

• Disks extend Memory
‣ Spills represent Swapping

• Trade Size Limitations for Time
‣ Jobs run from a few minutes to hours or days

From Batch to Realtime
• “Time is Money”
• Bridging the gap between batch and “now”
• Realtime often means “faster than batch”
• 80/20 Rule
‣ Hadoop solves the 80% easily
‣ The remaining 20% takes 80% of the effort
• Go as close as possible, don’t overdo it!


Stop Gap Solutions
• In Memory
‣ Memcached
‣ MemBase
‣ GigaSpaces
• Relational Databases
‣ MySQL
‣ PostgreSQL
• NoSQL
‣ Cassandra
‣ HBase


HBase Architecture


Client Access


Auto Sharding


Distribution


HBase Key Design


Key Cardinality


Fold, Store, and Shift


Complemental Design #1
• Keep Backup in HDFS
• MapReduce over HDFS
• Synchronize HBase
‣ Batch Puts
‣ Bulk Import

Diagram components: Internet, LAM(M)P, Hadoop, HBase


• Add Log Support
• Synchronize HBase
‣ Batch Puts
‣ Bulk Import

Diagram components: Internet, LAM(M)P, Flume, Hadoop, HBase


Mitigation Planning
• Reliable storage has top priority
• Disaster Recovery
• HBase Backups
‣ Export - but what if HBase is “down”?
‣ CopyTable - same issue
‣ Snapshots - not available


• Add Log Processing
• Remove Direct Connection
• Synchronize HBase
‣ Batch Puts
‣ Bulk Import

Diagram components: Internet, LAM(M)P, Flume, Log Proc, Hadoop, HBase


Facebook Insights

• > 20B Events per Day
• 1M Counter Updates per Second
‣ 100 Nodes Cluster
‣ 10K OPS per Node

Web ➜ Scribe ➜ Ptail ➜ Puma ➜ HBase
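
To put the figures above in proportion, the following back-of-the-envelope calculation converts them into per-second and per-node rates. It treats the quoted numbers as exact and the load as evenly spread, which is a simplifying assumption; the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_events_per_second(events_per_day):
    """Average rate, ignoring daily peaks and troughs."""
    return events_per_day / SECONDS_PER_DAY


def updates_per_node(updates_per_second, nodes):
    """Per-node load, assuming updates are spread evenly over the cluster."""
    return updates_per_second / nodes


print(round(average_events_per_second(20_000_000_000)))  # 231481
print(updates_per_node(1_000_000, 100))                  # 10000.0
```

The second figure matches the 10K OPS per node quoted on the slide.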


Collection Layer

• “Like” button triggers AJAX request
• Event written to log file using Scribe
‣ Handles aggregation, delivery, file rollover, etc.
‣ Uses HDFS to store files
✓ Use Flume or Scribe


Filter Layer
• Ptail “follows” logs written by Scribe
• Aggregates from multiple logs
• Separates into event types
‣ Sharding for future growth
• Facebook internal tool
✓ Use Flume
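
The aggregate-and-separate step can be illustrated with a small sketch. Ptail is an internal Facebook tool, so this is not its implementation; the tab-separated line format and the function name are assumptions made up for this example.

```python
from collections import defaultdict
from itertools import chain


def split_by_event_type(*logs):
    """Combine several log streams and bucket their lines per event type.

    Assumes each line looks like "<event_type>\t<payload>", which is a
    made-up format for this sketch.
    """
    buckets = defaultdict(list)
    for line in chain(*logs):
        event_type, _, payload = line.partition("\t")
        buckets[event_type].append(payload)
    return dict(buckets)


log_a = ["like\turl-1", "share\turl-2"]
log_b = ["like\turl-3"]
print(split_by_event_type(log_a, log_b))
```

Each bucket could then be handed to its own downstream consumer, which is where the sharding for future growth comes in.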


Batching Layer
• Puma batches updates
‣ 1 sec, staggered
• Flush the batch once the previous one is done
• Duration limited by key distribution
• Facebook internal tool
✓ Use Coprocessors (0.92.0)
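
The idea of batching counter updates can be sketched as follows. Puma is an internal Facebook tool, so the class below is a hypothetical illustration: it sums increments per (row, column) in memory and hands the combined deltas to a sink when flushed. The row and column names are simplified placeholders.

```python
from collections import Counter


class CounterBatcher:
    """Accumulates increments in memory and flushes them as combined deltas."""

    def __init__(self, sink):
        self._sink = sink          # callable receiving {(row, column): delta}
        self._pending = Counter()

    def add(self, row, column, delta=1):
        self._pending[(row, column)] += delta

    def flush(self):
        """Hand the combined deltas to the sink and start a new batch."""
        batch, self._pending = dict(self._pending), Counter()
        if batch:
            self._sink(batch)
        return len(batch)


flushed = []
batcher = CounterBatcher(flushed.append)
for _ in range(3):
    batcher.add("com.cloudera.www", "hourly:6pm_total")
batcher.add("com.cloudera.blog", "hourly:6pm_total")
batcher.flush()  # a real batcher would do this on a timer, e.g. every second
print(flushed)
```

Four events collapse into two increments here, which is the saving that makes batching worthwhile when many events hit the same counter.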


Counters
• Store counters per Domain and per URL
‣ Leverage HBase increment (atomic read-modify-write) feature
• Each row is one specific Domain or URL
• The columns are the counters for specific metrics
• Column families are used to group counters by time range
‣ Set time-to-live on CF level to auto-expire counters by age to save space, e.g., 2 weeks on “Daily Counters” family
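
The counter behaviour described above can be modelled with a toy in-memory table. This is not the HBase client API: the class, its methods, and the explicit `now` argument are assumptions chosen to keep the sketch deterministic. It shows an atomic read-modify-write increment and a per-family time-to-live measured from the last write.

```python
import threading


class CounterTable:
    """Toy in-memory stand-in for a counter table (not the HBase client API)."""

    def __init__(self, family_ttls):
        # family name -> time-to-live in seconds, or None for "keep forever"
        self._ttls = dict(family_ttls)
        self._cells = {}               # (row, family, qualifier) -> (value, written_at)
        self._lock = threading.Lock()  # makes the read-modify-write atomic

    def _live_value(self, key, now):
        value, written_at = self._cells.get(key, (0, now))
        ttl = self._ttls[key[1]]
        if ttl is not None and now - written_at > ttl:
            return 0                   # the cell has expired
        return value

    def increment(self, row, family, qualifier, amount, now):
        key = (row, family, qualifier)
        with self._lock:
            new_value = self._live_value(key, now) + amount
            self._cells[key] = (new_value, now)
            return new_value

    def get(self, row, family, qualifier, now):
        with self._lock:
            return self._live_value((row, family, qualifier), now)


TWO_WEEKS = 14 * 24 * 60 * 60
table = CounterTable({"daily": TWO_WEEKS, "lifetime": None})
table.increment("com.cloudera.www", "daily", "1/1 Total", 1, now=0)
table.increment("com.cloudera.www", "daily", "1/1 Total", 1, now=10)
table.increment("com.cloudera.www", "lifetime", "Total", 2, now=10)
```

Reading the daily counter shortly afterwards returns 2, while reading it more than two weeks after the last write returns 0; the lifetime counter never expires.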


Key Design
• Reversed Domains, e.g. “com.cloudera.www”, “com.cloudera.blog”
‣ Helps keep pages per site close together, as HBase efficiently scans blocks of sorted keys
• Domain Row Key =
MD5(Reversed Domain) + Reversed Domain
‣ Leading MD5 hash spreads keys randomly across all regions for load balancing
‣ Hashing only the domain keeps rows grouped per site (and per subdomain if needed)
• URL Row Key =
MD5(Reversed Domain) + Reversed Domain + URL ID
‣ Unique ID per URL already available, make use of it
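
The two row key formulas above can be sketched with Python's standard `hashlib`. The helper names are illustrative, and the 8-byte big-endian encoding of the URL ID is an assumption, since the slide does not say how the ID is serialized.

```python
import hashlib


def reverse_domain(domain):
    """'www.cloudera.com' -> 'com.cloudera.www'"""
    return ".".join(reversed(domain.split(".")))


def domain_row_key(domain):
    """MD5(Reversed Domain) + Reversed Domain, as raw bytes."""
    reversed_domain = reverse_domain(domain).encode("utf-8")
    return hashlib.md5(reversed_domain).digest() + reversed_domain


def url_row_key(domain, url_id):
    """Domain row key followed by the URL ID.

    Assumption: the URL ID is a non-negative integer stored as 8 big-endian
    bytes, so that IDs sort numerically within a domain.
    """
    return domain_row_key(domain) + url_id.to_bytes(8, "big")


key = url_row_key("www.cloudera.com", 42)
print(len(key), key[16:32])
```

Because every URL key starts with its domain row key, all rows for one domain sort next to each other, while the hash prefix decides where in the table that group lands.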


Insights Schema
Row Key: Domain Row Key
Columns:
Hourly Counters CF:   6pm Total = 100 | 6pm Male = 50 | 6pm US = 92 | 7pm ... = 45
Daily Counters CF:    1/1 Total = 1000 | 1/1 Male = 320 | 1/1 US = 670 | 2/1 ... = 990
Lifetime Counters CF: Total = 10000 | Male = 6780 | Female = 3220 | US = 9900

Row Key: URL Row Key
Columns:
Hourly Counters CF:   6pm Total = 10 | 6pm Male = 5 | 6pm US = 9 | 7pm ... = 4
Daily Counters CF:    1/1 Total = 100 | 1/1 Male = 20 | 1/1 US = 70 | 2/1 ... = 99
Lifetime Counters CF: Total = 100 | Male = 8 | Female = 92 | US = 100


Summary
• Design for Use-Case
‣ Read, Write, or Both?
• Avoid Hotspotting
‣ Region and Table
• Manage Automatism at Scale
‣ For now!


Questions?

lars@cloudera.com
http://cloudera.com

