This document summarizes Lars George's presentation on moving from batch to real-time processing with Hadoop. It covers using Hadoop (HDFS and MapReduce) for batch processing of large amounts of data and integrating real-time components such as HBase and Flume to enable faster querying and analytics. The example architectures shown combine batch and real-time systems, collecting streaming data as it arrives and periodically synchronizing results into Hadoop and HBase for long-term storage and analysis.
From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012
1. From Batch to Realtime
with Hadoop
Berlin Buzzwords, June 2012
Lars George
lars@cloudera.com
2. About Me
• Solutions Architect @ Cloudera
• Apache HBase & Whirr Committer
• Working with Hadoop & HBase since 2007
• Author of O’Reilly’s “HBase - The Definitive Guide”
3. The Application Stack
• Solve Business Goals
• Rely on Proven Building Blocks
• Rapid Prototyping
‣ Templates, MVC, Reference Implementations
• Evolutionary Innovation Cycles
“Let there be light!”
7. The Dawn of Big Data
• Industry verticals produce a staggering amount of data
• Not only web properties, but also “brick and mortar” businesses
‣ Smart Grid, Bio Informatics, Financial, Telco
• Scalable computation frameworks allow analysis of all the data
‣ No sampling anymore
• Suitable algorithms derive even more data
‣ Machine learning
• “The Unreasonable Effectiveness of Data”
‣ More data is better than smart algorithms
8. Hadoop
• HDFS + MapReduce
• Based on Google Papers
• Distributed Storage and Computation Framework
• Affordable Hardware, Free Software
• Significant Adoption
9. HDFS
• Reliably store petabytes of replicated data across thousands of nodes
• Master/Slave Architecture
• Built on “commodity” hardware
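To make the client view concrete, here is a minimal sketch of writing and reading a file through Hadoop's FileSystem API; the path and content are made up for this example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Picks up the cluster address etc. from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/user/demo/events.log"); // hypothetical path
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.writeUTF("event-1"); // blocks are replicated across DataNodes
    }
    try (FSDataInputStream in = fs.open(path)) {
      System.out.println(in.readUTF());
    }
  }
}
```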
10. MapReduce
• Distributed programming model to reliably process petabytes of data
• Locality of data to processing is vital
‣ Run code where data resides
• Inspired by the map and reduce functions in functional programming
Input ➜ Map() ➜ Copy/Sort ➜ Reduce() ➜ Output
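As a concrete instance of this pipeline, here is the classic introductory word-count job (not from the slides) written against the org.apache.hadoop.mapreduce API of that era:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map: emit (word, 1) for every token in the input split.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          ctx.write(word, ONE);
        }
      }
    }
  }

  // Reduce: the framework has already copied and sorted the pairs by word.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class); // local pre-aggregation before the copy phase
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```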
11. From Short to Long Term
Internet
LAM(M)P
• Serves the Client
• Stores Intermediate Data
Hadoop
• Background Batch Processing
• Stores Long-Term Data
12. Batch Processing
• Scale is Unlimited
‣ Bound only by Hardware
• Harness the Power of the Cluster
‣ CPUs, Disks, Memory
• Disks extend Memory
‣ Spills represent Swapping
• Trade Size Limitations for Time
‣ Jobs run from a few minutes to hours or days
13. From Batch to Realtime
• “Time is Money”
• Bridging the gap between batch and “now”
• Realtime often means “faster than batch”
• 80/20 Rule
‣ Hadoop solves the 80% easily
‣ The remaining 20% takes 80% of the effort
• Go as close as possible, don’t overdo it!
14. Stopgap Solutions
• In Memory
‣ Memcached
‣ Membase
‣ GigaSpaces
• Relational Databases
‣ MySQL
‣ PostgreSQL
• NoSQL
‣ Cassandra
‣ HBase
15. Complemental Design #1
Internet ➜ LAM(M)P ➜ Hadoop + HBase
• Keep Backup in HDFS
• MapReduce over HDFS
• Synchronize HBase
‣ Batch Puts (see the sketch below)
‣ Bulk Import
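A minimal sketch of the "Batch Puts" arm of this design, using the HBase client API of that era; the table and column names are assumptions. Bulk import would instead write HFiles with a MapReduce job and load them directly into the region servers.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchPutSync {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "webdata"); // hypothetical table
    table.setAutoFlush(false); // buffer writes client-side

    List<Put> batch = new ArrayList<Put>();
    for (int i = 0; i < 10000; i++) {
      Put put = new Put(Bytes.toBytes(String.format("row-%08d", i)));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value-" + i));
      batch.add(put);
    }
    table.put(batch);      // sent in bulk, grouped by region server
    table.flushCommits();  // drain the remaining write buffer
    table.close();
  }
}
```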
16. Complemental Design #2
Internet ➜ LAM(M)P ➜ Flume ➜ Hadoop + HBase
• Add Log Support
• Synchronize HBase
‣ Batch Puts
‣ Bulk Import
17. Mitigation Planning
• Reliable storage has top priority
• Disaster Recovery
• HBase Backups
‣ Export - but what if HBase is “down”?
‣ CopyTable - same issue
‣ Snapshots - not available (yet)
19. Facebook Insights
• > 20B Events per Day
• 1M Counter Updates per Second
‣ 100-Node Cluster
‣ 10K OPS per Node
Web ➜ Scribe ➜ Ptail ➜ Puma ➜ HBase
20. Collection Layer
• “Like” button triggers AJAX request
• Event written to log file using Scribe
‣ Handles aggregation, delivery, file rollover, etc.
‣ Uses HDFS to store files
✓ Use Flume or Scribe
21. Filter Layer
• Ptail “follows” logs written by Scribe
• Aggregates from multiple logs
• Separates into event types
‣ Sharding for future growth
• Facebook internal tool
✓ Use Flume
22. Batching Layer
• Puma batches updates
‣ 1 sec, staggered
• Flush the next batch when the last one is done
• Duration limited by key distribution
• Facebook internal tool
✓ Use Coprocessors (0.92.0)
23. Counters
• Store counters per Domain and per URL
‣ Leverage HBase's increment (atomic read-modify-write) feature, as sketched below
• Each row is one specific Domain or URL
• The columns are the counters for specific metrics
• Column families are used to group counters by time range
‣ Set a time-to-live at the CF level to auto-expire counters by age and save space, e.g., 2 weeks on the “Daily Counters” family
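A minimal sketch of both halves of this slide in the HBase 0.92-era API: a one-time table setup with a two-week TTL on the daily family, and the per-event atomic increment. The table, family, and counter names are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class CounterExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // One-time setup: counters in "daily" auto-expire after 14 days.
    HTableDescriptor desc = new HTableDescriptor("counters");
    HColumnDescriptor daily = new HColumnDescriptor("daily");
    daily.setTimeToLive(14 * 24 * 60 * 60); // TTL is given in seconds
    desc.addFamily(daily);
    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.createTable(desc);

    // Per-event path: atomic read-modify-write on the server, no client locking.
    HTable table = new HTable(conf, "counters");
    byte[] row = Bytes.toBytes("com.cloudera.www"); // simplified; see key design below
    table.incrementColumnValue(row, Bytes.toBytes("daily"), Bytes.toBytes("total"), 1L);
    table.close();
  }
}
```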
24. Key Design
• Reversed Domains, e.g., “com.cloudera.www”, “com.cloudera.blog”
‣ Helps keep pages per site close together, as HBase efficiently scans blocks of sorted keys
• Domain Row Key = MD5(Reversed Domain) + Reversed Domain
‣ Leading MD5 hash spreads keys randomly across all regions for load balancing reasons
‣ Hashing only the domain keeps the grouping per site (and per subdomain if needed)
• URL Row Key = MD5(Reversed Domain) + Reversed Domain + URL ID
‣ A unique ID per URL is already available, so make use of it (see the sketch below)
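A minimal sketch of these two key layouts; reverseDomain, domainRowKey, and urlRowKey are hypothetical helpers written for this example.

```java
import java.security.MessageDigest;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyDesign {
  // "www.cloudera.com" -> "com.cloudera.www"
  static String reverseDomain(String domain) {
    String[] parts = domain.split("\\.");
    StringBuilder sb = new StringBuilder();
    for (int i = parts.length - 1; i >= 0; i--) {
      sb.append(parts[i]);
      if (i > 0) sb.append('.');
    }
    return sb.toString();
  }

  // Domain Row Key = MD5(Reversed Domain) + Reversed Domain
  static byte[] domainRowKey(String domain) throws Exception {
    byte[] reversed = Bytes.toBytes(reverseDomain(domain));
    byte[] hash = MessageDigest.getInstance("MD5").digest(reversed);
    // The fixed-width hash prefix spreads rows across regions; since only
    // the domain is hashed, all keys of one site still sort together.
    return Bytes.add(hash, reversed);
  }

  // URL Row Key = MD5(Reversed Domain) + Reversed Domain + URL ID
  static byte[] urlRowKey(String domain, byte[] urlId) throws Exception {
    return Bytes.add(domainRowKey(domain), urlId);
  }
}
```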
25. Insights Schema
Row Key: Domain Row Key
Columns:
  Hourly Counters CF:   6pm Total = 100, 6pm Male = 50, 6pm US = 92, ..., 7pm Total = 45, ...
  Daily Counters CF:    1/1 Total = 1000, 1/1 Male = 320, 1/1 US = 670, ..., 2/1 Total = 990, ...
  Lifetime Counters CF: Total = 10000, Male = 6780, Female = 3220, US = 9900, ...
Row Key: URL Row Key
Columns:
  Hourly Counters CF:   6pm Total = 10, 6pm Male = 5, 6pm US = 9, ..., 7pm Total = 4, ...
  Daily Counters CF:    1/1 Total = 100, 1/1 Male = 20, 1/1 US = 70, ..., 2/1 Total = 99, ...
  Lifetime Counters CF: Total = 100, Male = 8, Female = 92, US = 100, ...
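Reading this schema could look like the following sketch: one Get fetches every lifetime counter of a domain row in a single round trip. It reuses the hypothetical domainRowKey helper and the "counters" table from the earlier sketches, with the family assumed to be named "lifetime".

```java
import java.util.Map;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadLifetimeCounters {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "counters");
    Get get = new Get(KeyDesign.domainRowKey("www.cloudera.com"));
    get.addFamily(Bytes.toBytes("lifetime")); // only the Lifetime Counters CF
    Result result = table.get(get);
    for (Map.Entry<byte[], byte[]> e :
         result.getFamilyMap(Bytes.toBytes("lifetime")).entrySet()) {
      System.out.println(Bytes.toString(e.getKey()) + " = " + Bytes.toLong(e.getValue()));
    }
    table.close();
  }
}
```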
27. Batch + Stream
• Currently moves complexity into the app layer
‣ Reads need to merge batch and stream results (see the sketch below)
• Stream results can be dropped once the data is persisted in the batch layer
• The stream might not be 100% correct, but good enough in most cases
‣ Eventual Accuracy
• Latency vs. Throughput - best of both worlds
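A sketch of the read-time merge this slide describes, in plain Java with in-memory maps standing in for the two stores: the batch layer periodically replaces the authoritative count, after which the stream delta can be dropped.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MergedCounterView {
  private final Map<String, Long> batchCounts = new ConcurrentHashMap<>();  // rebuilt by batch jobs
  private final Map<String, Long> streamDeltas = new ConcurrentHashMap<>(); // updated per event

  // Stream path: low latency, possibly not 100% correct.
  public void onEvent(String key) {
    streamDeltas.merge(key, 1L, Long::sum);
  }

  // Batch path: the authoritative result replaces the stream delta.
  // Not fully race-free between put and remove - "eventual accuracy".
  public void onBatchComplete(String key, long count) {
    batchCounts.put(key, count);
    streamDeltas.remove(key);
  }

  // Read path: the app layer merges both layers.
  public long get(String key) {
    return batchCounts.getOrDefault(key, 0L) + streamDeltas.getOrDefault(key, 0L);
  }
}
```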