This document summarizes Lars George's presentation on moving from batch to real-time processing with Hadoop. It covers using Hadoop (HDFS and MapReduce) for batch processing of large amounts of data and integrating real-time components such as HBase and Flume to enable faster querying and analytics. The example architectures shown combine batch and real-time systems, collecting streaming data as it arrives and periodically synchronizing results into Hadoop and HBase for long-term storage and analysis.
From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012
1. From Batch to Realtime
with Hadoop
Berlin Buzzwords, June 2012
Lars George
lars@cloudera.com
2. About Me
• Solutions Architect @ Cloudera
• Apache HBase & Whirr Committer
• Working with Hadoop & HBase since 2007
• Author of O’Reilly’s “HBase - The Definitive Guide”
3. The Application Stack
• Solve Business Goals
• Rely on Proven Building Blocks
• Rapid Prototyping
‣ Templates, MVC, Reference Implementations
• Evolutionary Innovation Cycles
“Let there be light!”
7. The Dawn of Big Data
• Industry verticals produce a staggering amount of data
• Not only web properties, but also “brick and mortar” businesses
‣ Smart Grid, Bio Informatics, Financial, Telco
• Scalable computation frameworks allow analysis of all the data
‣ No sampling anymore
• Suitable algorithms derive even more data
‣ Machine learning
• “The Unreasonable Effectiveness of Data”
‣ More data is better than smart algorithms
8. Hadoop
• HDFS + MapReduce
• Based on Google Papers
• Distributed Storage and Computation Framework
• Affordable Hardware, Free Software
• Significant Adoption
9. HDFS
• Reliably store petabytes of replicated data across thousands of nodes
• Master/Slave Architecture
• Built on “commodity” hardware
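To make the client view concrete, here is a minimal sketch of writing and reading a file through Hadoop's FileSystem API; the path and content are made up for this example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Picks up the cluster address etc. from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/user/demo/events.log"); // hypothetical path
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.writeUTF("event-1"); // blocks are replicated across DataNodes
    }
    try (FSDataInputStream in = fs.open(path)) {
      System.out.println(in.readUTF());
    }
  }
}
```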
10. MapReduce
• Distributed programming model to reliably process petabytes of data
• Locality of data to processing is vital
‣ Run code where data resides
• Inspired by the map and reduce functions in functional programming
Input ➜ Map() ➜ Copy/Sort ➜ Reduce() ➜ Output
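As a concrete instance of this pipeline, here is the classic introductory word-count job (not from the slides) written against the org.apache.hadoop.mapreduce API of that era:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map: emit (word, 1) for every token in the input split.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          ctx.write(word, ONE);
        }
      }
    }
  }

  // Reduce: the framework has already copied and sorted the pairs by word.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class); // local pre-aggregation before the copy phase
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```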
11. From Short to Long Term
Internet
LAM(M)P
• Serves the Client
• Stores Intermediate Data
Hadoop
• Background Batch Processing
• Stores Long-Term Data
12. Batch Processing
• Scale is Unlimited
‣ Bound only by Hardware
• Harness the Power of the Cluster
‣ CPUs, Disks, Memory
• Disks extend Memory
‣ Spills represent Swapping
• Trade Size Limitations for Time
‣ Jobs run from a few minutes to hours or days
13. From Batch to Realtime
• “Time is Money”
• Bridging the gap between batch and “now”
• Realtime often means “faster than batch”
• 80/20 Rule
‣ Hadoop solves the 80% easily
‣ The remaining 20% takes 80% of the effort
• Go as close as possible, don’t overdo it!
14. Stopgap Solutions
• In Memory
‣ Memcached
‣ Membase
‣ GigaSpaces
• Relational Databases
‣ MySQL
‣ PostgreSQL
• NoSQL
‣ Cassandra
‣ HBase
15. Complemental Design #1
Internet ➜ LAM(M)P ➜ Hadoop + HBase
• Keep Backup in HDFS
• MapReduce over HDFS
• Synchronize HBase
‣ Batch Puts (see the sketch below)
‣ Bulk Import
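A minimal sketch of the "Batch Puts" arm of this design, using the HBase client API of that era; the table and column names are assumptions. Bulk import would instead write HFiles with a MapReduce job and load them directly into the region servers.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchPutSync {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "webdata"); // hypothetical table
    table.setAutoFlush(false); // buffer writes client-side

    List<Put> batch = new ArrayList<Put>();
    for (int i = 0; i < 10000; i++) {
      Put put = new Put(Bytes.toBytes(String.format("row-%08d", i)));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value-" + i));
      batch.add(put);
    }
    table.put(batch);      // sent in bulk, grouped by region server
    table.flushCommits();  // drain the remaining write buffer
    table.close();
  }
}
```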
16. Complemental Design #2
Internet ➜ LAM(M)P ➜ Flume ➜ Hadoop + HBase
• Add Log Support
• Synchronize HBase
‣ Batch Puts
‣ Bulk Import
17. Mitigation Planning
• Reliable storage has top priority
• Disaster Recovery
• HBase Backups
‣ Export - but what if HBase is “down”?
‣ CopyTable - same issue
‣ Snapshots - not available (yet)
19. Facebook Insights
• > 20B Events per Day
• 1M Counter Updates per Second
‣ 100-Node Cluster
‣ 10K OPS per Node
Web ➜ Scribe ➜ Ptail ➜ Puma ➜ HBase
20. Collection Layer
• “Like” button triggers AJAX request
• Event written to log file using Scribe
‣ Handles aggregation, delivery, file rollover, etc.
‣ Uses HDFS to store files
✓ Use Flume or Scribe
21. Filter Layer
• Ptail “follows” logs written by Scribe
• Aggregates from multiple logs
• Separates into event types
‣ Sharding for future growth
• Facebook internal tool
✓ Use Flume
22. Batching Layer
• Puma batches updates
‣ 1 sec, staggered
• Flush the next batch when the last one is done
• Duration limited by key distribution
• Facebook internal tool
✓ Use Coprocessors (0.92.0)
23. Counters
• Store counters per Domain and per URL
‣ Leverage HBase's increment (atomic read-modify-write) feature, as sketched below
• Each row is one specific Domain or URL
• The columns are the counters for specific metrics
• Column families are used to group counters by time range
‣ Set a time-to-live at the CF level to auto-expire counters by age and save space, e.g., 2 weeks on the “Daily Counters” family
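A minimal sketch of both halves of this slide in the HBase 0.92-era API: a one-time table setup with a two-week TTL on the daily family, and the per-event atomic increment. The table, family, and counter names are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class CounterExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // One-time setup: counters in "daily" auto-expire after 14 days.
    HTableDescriptor desc = new HTableDescriptor("counters");
    HColumnDescriptor daily = new HColumnDescriptor("daily");
    daily.setTimeToLive(14 * 24 * 60 * 60); // TTL is given in seconds
    desc.addFamily(daily);
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.createTable(desc);

    // Per-event path: atomic read-modify-write on the server, no client locking.
    HTable table = new HTable(conf, "counters");
    byte[] row = Bytes.toBytes("com.cloudera.www"); // simplified; see key design below
    table.incrementColumnValue(row, Bytes.toBytes("daily"), Bytes.toBytes("total"), 1L);
    table.close();
  }
}
```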
24. Key Design
• Reversed Domains, e.g., “com.cloudera.www”, “com.cloudera.blog”
‣ Helps keep pages per site close together, as HBase efficiently scans blocks of sorted keys
• Domain Row Key = MD5(Reversed Domain) + Reversed Domain
‣ Leading MD5 hash spreads keys randomly across all regions for load balancing reasons
‣ Hashing only the domain keeps the grouping per site (and per subdomain if needed)
• URL Row Key = MD5(Reversed Domain) + Reversed Domain + URL ID
‣ A unique ID per URL is already available, so make use of it (see the sketch below)
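A minimal sketch of these two key layouts; reverseDomain, domainRowKey, and urlRowKey are hypothetical helpers written for this example.

```java
import java.security.MessageDigest;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyDesign {
  // "www.cloudera.com" -> "com.cloudera.www"
  static String reverseDomain(String domain) {
    String[] parts = domain.split("\\.");
    StringBuilder sb = new StringBuilder();
    for (int i = parts.length - 1; i >= 0; i--) {
      sb.append(parts[i]);
      if (i > 0) sb.append('.');
    }
    return sb.toString();
  }

  // Domain Row Key = MD5(Reversed Domain) + Reversed Domain
  static byte[] domainRowKey(String domain) throws Exception {
    byte[] reversed = Bytes.toBytes(reverseDomain(domain));
    byte[] hash = MessageDigest.getInstance("MD5").digest(reversed);
    // The fixed-width hash prefix spreads rows across regions; since only
    // the domain is hashed, all keys of one site still sort together.
    return Bytes.add(hash, reversed);
  }

  // URL Row Key = MD5(Reversed Domain) + Reversed Domain + URL ID
  static byte[] urlRowKey(String domain, byte[] urlId) throws Exception {
    return Bytes.add(domainRowKey(domain), urlId);
  }
}
```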
25. Insights Schema
Row Key: Domain Row Key
Columns:
  Hourly Counters CF:   6pm Total = 100, 6pm Male = 50, 6pm US = 92, ..., 7pm Total = 45, ...
  Daily Counters CF:    1/1 Total = 1000, 1/1 Male = 320, 1/1 US = 670, ..., 2/1 Total = 990, ...
  Lifetime Counters CF: Total = 10000, Male = 6780, Female = 3220, US = 9900, ...
Row Key: URL Row Key
Columns:
  Hourly Counters CF:   6pm Total = 10, 6pm Male = 5, 6pm US = 9, ..., 7pm Total = 4, ...
  Daily Counters CF:    1/1 Total = 100, 1/1 Male = 20, 1/1 US = 70, ..., 2/1 Total = 99, ...
  Lifetime Counters CF: Total = 100, Male = 8, Female = 92, US = 100, ...
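Reading this schema could look like the following sketch: one Get fetches every lifetime counter of a domain row in a single round trip. It reuses the hypothetical domainRowKey helper and the "counters" table from the earlier sketches, with the family assumed to be named "lifetime".

```java
import java.util.Map;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadLifetimeCounters {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "counters");
    Get get = new Get(KeyDesign.domainRowKey("www.cloudera.com"));
    get.addFamily(Bytes.toBytes("lifetime")); // only the Lifetime Counters CF
    Result result = table.get(get);
    for (Map.Entry<byte[], byte[]> e :
         result.getFamilyMap(Bytes.toBytes("lifetime")).entrySet()) {
      System.out.println(Bytes.toString(e.getKey()) + " = " + Bytes.toLong(e.getValue()));
    }
    table.close();
  }
}
```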
27. Batch + Stream
• Currently moves complexity into the app layer
‣ Reads need to merge batch and stream results (see the sketch below)
• Stream results can be dropped once the data is persisted in the batch layer
• The stream might not be 100% correct, but good enough in most cases
‣ Eventual Accuracy
• Latency vs. Throughput - best of both worlds
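A sketch of the read-time merge this slide describes, in plain Java with in-memory maps standing in for the two stores: the batch layer periodically replaces the authoritative count, after which the stream delta can be dropped.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MergedCounterView {
  private final Map<String, Long> batchCounts = new ConcurrentHashMap<>();  // rebuilt by batch jobs
  private final Map<String, Long> streamDeltas = new ConcurrentHashMap<>(); // updated per event

  // Stream path: low latency, possibly not 100% correct.
  public void onEvent(String key) {
    streamDeltas.merge(key, 1L, Long::sum);
  }

  // Batch path: the authoritative result replaces the stream delta.
  // Not fully race-free between put and remove - "eventual accuracy".
  public void onBatchComplete(String key, long count) {
    batchCounts.put(key, count);
    streamDeltas.remove(key);
  }

  // Read path: the app layer merges both layers.
  public long get(String key) {
    return batchCounts.getOrDefault(key, 0L) + streamDeltas.getOrDefault(key, 0L);
  }
}
```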