
From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012



In the early days of web applications, sites were designed to serve users and gather information along the way. With the proliferation of data sources and growing user bases, the volume of data generated required new approaches to storage and processing. Hadoop's HDFS and its batch-oriented MapReduce opened new possibilities, yet they fall short of delivering aggregate data to end users instantly. Adding HBase and other layers, such as stream processing with Twitter's Storm, can overcome this delay and bridge the gap to realtime aggregation and reporting. This presentation takes the audience from the beginnings of web application design to the current architecture, which combines multiple technologies to process vast amounts of data while still reacting in a timely manner and reporting near-realtime statistics.

Published in: Technology


  1. From Batch to Realtime with Hadoop • Berlin Buzzwords, June 2012 • Lars George
  2. About Me • Solutions Architect @ Cloudera • Apache HBase & Whirr Committer • Working with Hadoop & HBase since 2007 • Author of O’Reilly’s “HBase - The Definitive Guide”
  3. The Application Stack • Solve Business Goals • Rely on Proven Building Blocks • Rapid Prototyping ‣ Templates, MVC, Reference Implementations • Evolutionary Innovation Cycles • “Let there be light!”
  4. LAMP
  5. L: Linux • A: Apache • M: MySQL • P: PHP/Perl
  6. L: Linux • A: Apache • M: MySQL • M: Memcache • P: PHP/Perl
  7. The Dawn of Big Data • Industry verticals produce a staggering amount of data • Not only web properties, but also “brick and mortar” businesses ‣ Smart Grid, Bioinformatics, Financial, Telco • Scalable computation frameworks allow analysis of all the data ‣ No sampling anymore • Suitable algorithms derive even more data ‣ Machine learning • “The Unreasonable Effectiveness of Data” ‣ More data is better than smart algorithms
  8. Hadoop • HDFS + MapReduce • Based on Google Papers • Distributed Storage and Computation Framework • Affordable Hardware, Free Software • Significant Adoption
  9. HDFS • Reliably store petabytes of replicated data across thousands of nodes • Master/Slave Architecture • Built on “commodity” hardware
  10. MapReduce • Distributed programming model to reliably process petabytes of data • Locality of data to processing is vital ‣ Run code where data resides • Inspired by map and reduce functions in functional programming • Input ➜ Map() ➜ Copy/Sort ➜ Reduce() ➜ Output
  11. From Short to Long Term • Internet • LAM(M)P ‣ Serves the Client ‣ Stores Intermediate Data • Hadoop ‣ Background Batch Processing ‣ Stores Long-Term Data
  12. Batch Processing • Scale is Unlimited ‣ Bound only by Hardware • Harness the Power of the Cluster ‣ CPUs, Disks, Memory • Disks extend Memory ‣ Spills represent Swapping • Trade Size Limitations for Time ‣ Jobs run from a few minutes to hours or days
  13. From Batch to Realtime • “Time is Money” • Bridging the gap between batch and “now” • Realtime often means “faster than batch” • 80/20 Rule ‣ Hadoop solves the 80% easily ‣ The remaining 20% takes 80% of the effort • Go as close as possible, don’t overdo it!
  14. Stop Gap Solutions • In Memory ‣ Memcached ‣ MemBase ‣ GigaSpaces • Relational Databases ‣ MySQL ‣ PostgreSQL • NoSQL ‣ Cassandra ‣ HBase
  15. Complemental Design #1 • Keep Backup in HDFS • MapReduce over HDFS • Synchronize HBase ‣ Batch Puts ‣ Bulk Import • Diagram components: Internet, LAM(M)P, Hadoop, HBase
  16. Complemental Design #2 • Add Log Support • Synchronize HBase ‣ Batch Puts ‣ Bulk Import • Diagram components: Internet, LAM(M)P, Flume, Hadoop, HBase
  17. Mitigation Planning • Reliable storage has top priority • Disaster Recovery • HBase Backups ‣ Export - but what if HBase is “down”? ‣ CopyTable - same issue ‣ Snapshots - not available (yet)
  18. Complemental Design #3 • Add Log Processing • Remove Direct Connection • Synchronize HBase ‣ Batch Puts ‣ Bulk Import • Diagram components: Internet, LAM(M)P, Flume, Log Proc, Hadoop, HBase
  19. Facebook Insights • > 20B Events per Day • 1M Counter Updates per Second ‣ 100-Node Cluster ‣ 10K OPS per Node • Web ➜ Scribe ➜ Ptail ➜ Puma ➜ HBase
  20. Collection Layer • “Like” button triggers AJAX request • Event written to log file using Scribe ‣ Handles aggregation, delivery, file rollover, etc. ‣ Uses HDFS to store files • ✓ Use Flume or Scribe
  21. Filter Layer • Ptail “follows” logs written by Scribe • Aggregates from multiple logs • Separates into event types ‣ Sharding for future growth • Facebook internal tool • ✓ Use Flume
  22. Batching Layer • Puma batches updates ‣ 1 sec, staggered • Flush batch when the previous one is done • Duration limited by key distribution • Facebook internal tool • ✓ Use Coprocessors (0.92.0)
  23. Counters • Store counters per Domain and per URL ‣ Leverage HBase increment (atomic read-modify-write) feature • Each row is one specific Domain or URL • The columns are the counters for specific metrics • Column families are used to group counters by time range ‣ Set time-to-live on CF level to auto-expire counters by age to save space, e.g., 2 weeks on “Daily Counters” family
  24. Key Design • Reversed Domains, e.g., “com.cloudera.www”, “” ‣ Helps keep pages per site close together, as HBase efficiently scans blocks of sorted keys • Domain Row Key = MD5(Reversed Domain) + Reversed Domain ‣ Leading MD5 hash spreads keys randomly across all regions for load balancing ‣ Hashing only the domain keeps keys grouped per site (and per subdomain if needed) • URL Row Key = MD5(Reversed Domain) + Reversed Domain + URL ID ‣ Unique ID per URL already available, make use of it
  25. Insights Schema • Row Key: Domain Row Key ‣ Columns: Hourly Counters CF, Daily Counters CF, Lifetime Counters CF ‣ Column qualifiers pair a time bucket with a metric: 6pm Total, 6pm Male, 6pm US, 7pm ... ; 1/1 Total, 1/1 Male, 1/1 US, 2/1 ... ; Total, Male, Female, US ‣ Sample values: 100 50 92 45 1000 320 670 990 10000 6780 3220 9900 • Row Key: URL Row Key ‣ Columns: Hourly Counters CF, Daily Counters CF, Lifetime Counters CF (same qualifiers) ‣ Sample values: 10 5 9 4 100 20 70 99 100 8 92 100
  26. Complemental Design #4 • Add Stream Processing ‣ In-Memory ‣ Fault Tolerant ‣ Aggregations • Bridges minutes/hours vs. months/years • Diagram components: Internet, LAM(M)P, Storm, Flume, Hadoop, HBase
  27. Batch + Stream • Currently moves complexity into app layer ‣ Reads need to merge batch and stream results • Stream results can be dropped once data is persisted in batch layer • Stream might not be 100% correct, but good enough in most cases ‣ Eventual Accuracy • Latency vs. Throughput - best of both worlds
  28. Questions? • lars@cloudera.com
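
The "Input ➜ Map() ➜ Copy/Sort ➜ Reduce() ➜ Output" flow from the MapReduce slide can be illustrated with a small single-process sketch. This is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count job are assumptions chosen for illustration, and a real cluster runs each phase distributed across nodes.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply the user's map function to every input record."""
    for record in records:
        yield from map_fn(record)


def shuffle(pairs):
    """Stand-in for Copy/Sort: group values by key, then sort by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped, reduce_fn):
    """Apply the user's reduce function once per key."""
    return {key: reduce_fn(key, values) for key, values in grouped}


def run_job(records, map_fn, reduce_fn):
    """Input -> Map() -> Copy/Sort -> Reduce() -> Output, in one process."""
    return reduce_phase(shuffle(map_phase(records, map_fn)), reduce_fn)


# Hypothetical example job: count words.
def word_map(line):
    for word in line.lower().split():
        yield word, 1


def sum_reduce(key, values):
    return sum(values)


word_counts = run_job(["hadoop hbase", "hbase storm hbase"], word_map, sum_reduce)
print(word_counts)
```

The user supplies only the map and reduce functions; grouping and sorting between them is the framework's job, which is what lets the same job scale out over many machines.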
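
The row key formulas from the Key Design slide can be written out as a short sketch. The slide does not specify byte-level encoding, so the choices here are assumptions: the raw 16-byte MD5 digest as prefix, UTF-8 for the reversed domain, and a non-negative URL ID stored as 8 big-endian bytes so that IDs sort numerically. The function names are hypothetical.

```python
import hashlib


def reverse_domain(domain):
    """'www.cloudera.com' -> 'com.cloudera.www'"""
    return ".".join(reversed(domain.lower().split(".")))


def domain_row_key(domain):
    """MD5(Reversed Domain) + Reversed Domain, as bytes."""
    reversed_bytes = reverse_domain(domain).encode("utf-8")
    return hashlib.md5(reversed_bytes).digest() + reversed_bytes


def url_row_key(domain, url_id):
    """Domain row key + URL ID (assumed: non-negative, 8 bytes, big-endian)."""
    return domain_row_key(domain) + url_id.to_bytes(8, "big")


domain_key = domain_row_key("www.cloudera.com")
url_keys = [url_row_key("www.cloudera.com", url_id) for url_id in (7, 300, 42)]

print(reverse_domain("www.cloudera.com"))
print(all(key.startswith(domain_key) for key in url_keys))
```

Because only the domain is hashed, every URL of a site shares the same prefix and therefore sorts directly after the site's domain row, while different sites land at effectively random positions in the key space.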
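
The Batch + Stream slide notes that reads need to merge batch and stream results, and that stream results can be dropped once the batch layer has persisted the data. A minimal sketch of that read path, assuming stream counts are kept per numbered time window and the batch layer reports the last window it has fully processed (`persisted_through`); all names are hypothetical and not taken from Storm or HBase.

```python
def read_counts(batch_counts, stream_windows, persisted_through):
    """Merge batch totals with stream counts the batch layer has not seen yet."""
    merged = dict(batch_counts)
    for window, counts in stream_windows.items():
        if window <= persisted_through:
            continue  # already included in the batch totals
        for key, value in counts.items():
            merged[key] = merged.get(key, 0) + value
    return merged


def drop_persisted(stream_windows, persisted_through):
    """Discard stream windows that the batch layer has caught up with."""
    return {
        window: counts
        for window, counts in stream_windows.items()
        if window > persisted_through
    }


batch_counts = {"com.cloudera.www": 1000}  # covers windows up to 9
stream_windows = {
    9: {"com.cloudera.www": 40},
    10: {"com.cloudera.www": 7, "org.apache.hbase": 3},
}

merged = read_counts(batch_counts, stream_windows, persisted_through=9)
remaining = drop_persisted(stream_windows, persisted_through=9)
print(merged)
print(sorted(remaining))
```

Skipping windows at or below the watermark avoids double counting, and any inaccuracy in a stream window is bounded in time because the batch result eventually replaces it.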