Realtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBase: Presentation Transcript

  • Realtime Analytics using Hadoop & HBase
    Lars George, Solutions Architect @ Cloudera
    lars@cloudera.com
    Monday, July 25, 2011
  • About Me
    • Solutions Architect @ Cloudera
    • Apache HBase & Whirr Committer
    • Working with Hadoop & HBase since 2007
    • Author of O’Reilly’s “HBase: The Definitive Guide”
  • The Application Stack
    • Solve Business Goals
    • Rely on Proven Building Blocks
    • Rapid Prototyping
      ‣ Templates, MVC, Reference Implementations
    • Evolutionary Innovation Cycles
    “Let there be light!”
  • LAMP
  • L: Linux, A: Apache, M: MySQL, P: PHP/Perl
  • L: Linux, A: Apache, M: MySQL, M: Memcache, P: PHP/Perl
  • The Dawn of Big Data
    • Industry verticals produce a staggering amount of data
    • Not only web properties, but also “brick and mortar” businesses
      ‣ Smart Grid, Bioinformatics, Financial, Telco
    • Scalable computation frameworks allow analysis of all the data
      ‣ No sampling anymore
    • Suitable algorithms derive even more data
      ‣ Machine learning
    • “The Unreasonable Effectiveness of Data”
      ‣ More data is better than smart algorithms
  • Hadoop
    • HDFS + MapReduce
    • Based on Google Papers
    • Distributed Storage and Computation Framework
    • Affordable Hardware, Free Software
    • Significant Adoption
  • HDFS
    • Reliably store petabytes of replicated data across thousands of nodes
      ‣ Data divided into 64 MB blocks, each block replicated three times
    • Master/Slave Architecture
      ‣ Master NameNode contains metadata
      ‣ Slave DataNodes manage blocks on the local file system
    • Built on “commodity” hardware
      ‣ No 15k RPM disks or RAID required (nor wanted!)
      ‣ Commodity Server Hardware
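    The block math on this slide can be sketched in a few lines. This is a conceptual illustration only, not the Hadoop API; `block_layout` is a hypothetical helper, and the 64 MB block size and 3x replication are the defaults stated above:

    ```python
    import math

    BLOCK_SIZE = 64 * 1024 * 1024   # default HDFS block size from the slide
    REPLICATION = 3                 # each block stored three times

    def block_layout(file_bytes):
        """Return (number of blocks, total replicas stored cluster-wide)."""
        blocks = math.ceil(file_bytes / BLOCK_SIZE)
        return blocks, blocks * REPLICATION

    # A 1 GB file splits into 16 blocks, stored as 48 replicas in total.
    print(block_layout(1024 * 1024 * 1024))  # (16, 48)
    ```

    The NameNode keeps only this block-to-DataNode mapping in memory; the DataNodes hold the actual block bytes.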
  • MapReduce
    • Distributed programming model to reliably process petabytes of data
    • Locality of data to processing is vital
      ‣ Run code where data resides
    • Inspired by the map and reduce functions in functional programming

    Input ➜ Map() ➜ Copy/Sort ➜ Reduce() ➜ Output
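    The Input ➜ Map ➜ Copy/Sort ➜ Reduce pipeline above can be simulated in plain Python. This is a conceptual sketch of the data flow only; a real Hadoop job implements Mapper and Reducer classes against the Hadoop API and runs them distributed across the cluster:

    ```python
    from collections import defaultdict

    def map_phase(lines):
        # Map(): emit (key, value) pairs; here, one (word, 1) per word
        for line in lines:
            for word in line.split():
                yield word, 1

    def shuffle(pairs):
        # Copy/Sort: group all values by key (done by the framework in Hadoop)
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        # Reduce(): aggregate each key's values into a final result
        return {key: sum(values) for key, values in groups.items()}

    lines = ["hbase on hadoop", "hadoop stores data"]
    print(reduce_phase(shuffle(map_phase(lines))))
    # {'hbase': 1, 'on': 1, 'hadoop': 2, 'stores': 1, 'data': 1}
    ```

    Locality means the `map_phase` work is scheduled on the DataNodes that already hold the input blocks, so only the much smaller intermediate pairs cross the network.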
  • From Short to Long Term
    Internet ➜ LAM(M)P
    • Serves the Client
    • Stores Intermediate Data
    Hadoop
    • Background Batch Processing
    • Stores Long-Term Data
  • Batch Processing
    • Scale is Unlimited
      ‣ Bound only by Hardware
    • Harness the Power of the Cluster
      ‣ CPUs, Disks, Memory
    • Disks extend Memory
      ‣ Spills represent Swapping
    • Trade Size Limitations for Time
      ‣ Jobs run for a few minutes to hours, or days
  • From Batch to Realtime
    • “Time is Money”
    • Bridging the gap between batch and “now”
    • Realtime often means “faster than batch”
    • 80/20 Rule
      ‣ Hadoop solves the 80% easily
      ‣ The remaining 20% takes 80% of the effort
    • Go as close as possible, but don’t overdo it!
  • Stop-Gap Solutions
    • In Memory
      ‣ Memcached
      ‣ Membase
      ‣ GigaSpaces
    • Relational Databases
      ‣ MySQL
      ‣ PostgreSQL
    • NoSQL
      ‣ Cassandra
      ‣ HBase
  • HBase Architecture
  • Client Access
  • Auto Sharding
  • Distribution
  • HBase Key Design
  • Key Cardinality
  • Fold, Store, and Shift
  • Complemental Design #1
    • Keep Backup in HDFS
    • MapReduce over HDFS
    • Synchronize HBase
      ‣ Batch Puts
      ‣ Bulk Import
    (Diagram: Internet ➜ LAM(M)P ➜ Hadoop / HBase)
  • Complemental Design #2
    • Add Log Support
    • Synchronize HBase
      ‣ Batch Puts
      ‣ Bulk Import
    (Diagram: Internet ➜ LAM(M)P ➜ Flume ➜ Hadoop / HBase)
  • Mitigation Planning
    • Reliable storage has top priority
    • Disaster Recovery
    • HBase Backups
      ‣ Export - but what if HBase is “down”?
      ‣ CopyTable - same issue
      ‣ Snapshots - not available
  • Complemental Design #3
    • Add Log Processing
    • Remove Direct Connection
    • Synchronize HBase
      ‣ Batch Puts
      ‣ Bulk Import
    (Diagram: Internet ➜ LAM(M)P ➜ Flume ➜ Log Proc ➜ Hadoop / HBase)
  • Facebook Insights
    • > 20B Events per Day
    • 1M Counter Updates per Second
      ‣ 100-Node Cluster
      ‣ 10K OPS per Node

    Web ➜ Scribe ➜ Ptail ➜ Puma ➜ HBase
  • Collection Layer
    • “Like” button triggers AJAX request
    • Event written to log file using Scribe
      ‣ Handles aggregation, delivery, file roll over, etc.
      ‣ Uses HDFS to store files
    ✓ Use Flume or Scribe
  • Filter Layer
    • Ptail “follows” logs written by Scribe
    • Aggregates from multiple logs
    • Separates into event types
      ‣ Sharding for future growth
    • Facebook internal tool
    ✓ Use Flume
  • Batching Layer
    • Puma batches updates
      ‣ 1 sec, staggered
    • Flush a batch when the last one is done
    • Duration limited by key distribution
    • Facebook internal tool
    ✓ Use Coprocessors (0.92.0)
  • Counters
    • Store counters per Domain and per URL
      ‣ Leverage HBase’s increment (atomic read-modify-write) feature
    • Each row is one specific Domain or URL
    • The columns are the counters for specific metrics
    • Column families are used to group counters by time range
      ‣ Set time-to-live at the CF level to auto-expire counters by age and save space, e.g., 2 weeks on the “Daily Counters” family
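    A toy in-memory model of this counter table, to make the layout concrete. This is not the HBase client API; `increment` only roughly mirrors the semantics of HBase’s atomic increment, and the row key, family, and qualifier names are made up for illustration:

    ```python
    from collections import defaultdict

    # row key -> "family:qualifier" -> counter value
    table = defaultdict(lambda: defaultdict(int))

    def increment(row_key, family, qualifier, amount=1):
        """Mimic HBase's atomic read-modify-write increment on one column."""
        column = f"{family}:{qualifier}"
        table[row_key][column] += amount
        return table[row_key][column]

    # One row per domain; column families group counters by time range.
    increment("com.cloudera.www", "daily", "2011-07-25:total")
    increment("com.cloudera.www", "daily", "2011-07-25:total")
    increment("com.cloudera.www", "lifetime", "total", 2)

    print(dict(table["com.cloudera.www"]))
    # {'daily:2011-07-25:total': 2, 'lifetime:total': 2}
    ```

    In real HBase the increments for one row are serialized server-side by the hosting region server, which is what makes the read-modify-write atomic without client locking; the TTL on the “daily” family would then expire old counters automatically.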
  • Key Design
    • Reversed Domains, e.g. “com.cloudera.www”, “com.cloudera.blog”
      ‣ Helps keep pages per site close together, as HBase efficiently scans blocks of sorted keys
    • Domain Row Key = MD5(Reversed Domain) + Reversed Domain
      ‣ The leading MD5 hash spreads keys randomly across all regions for load balancing
      ‣ Hashing only the domain keeps keys grouped per site (and per subdomain if needed)
    • URL Row Key = MD5(Reversed Domain) + Reversed Domain + URL ID
      ‣ A unique ID per URL is already available, so make use of it
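    A minimal sketch of this row-key scheme. The exact byte layout and separators in the original system are not specified, so the concatenation below (hex MD5 prefix, “/” before the URL ID) is an assumption for illustration:

    ```python
    import hashlib

    def reversed_domain(domain):
        """'www.cloudera.com' -> 'com.cloudera.www' so a site's keys sort together."""
        return ".".join(reversed(domain.split(".")))

    def domain_row_key(domain):
        """MD5(reversed domain) + reversed domain: hash prefix balances load."""
        rd = reversed_domain(domain)
        return hashlib.md5(rd.encode()).hexdigest() + rd

    def url_row_key(domain, url_id):
        """Reuse the domain prefix so all URLs of one site stay adjacent."""
        return domain_row_key(domain) + "/" + str(url_id)

    print(reversed_domain("www.cloudera.com"))  # com.cloudera.www
    print(url_row_key("www.cloudera.com", 42))
    ```

    Because the hash covers only the domain, every URL of a site shares the same prefix and can be fetched with one contiguous scan, while different sites still land on different regions.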
  • Insights Schema

    Row Key: Domain Row Key
    Columns:
      Hourly Counters CF:   6pm:Total=100, 6pm:Male=50, 6pm:Female=92, 7pm:US=45, ...
      Daily Counters CF:    1/1:Total=1000, 1/1:Male=320, 1/1:Female=670, 2/1:US=990, ...
      Lifetime Counters CF: Total=10000, Male=6780, Female=3220, US=9900, ...

    Row Key: URL Row Key
    Columns:
      Hourly Counters CF:   6pm:Total=10, 6pm:Male=5, 6pm:Female=9, 7pm:US=4, ...
      Daily Counters CF:    1/1:Total=100, 1/1:Male=20, 1/1:Female=70, 2/1:US=99, ...
      Lifetime Counters CF: Total=100, Male=8, Female=92, US=100, ...
  • Summary
    • Design for the Use Case
      ‣ Read, Write, or Both?
    • Avoid Hotspotting
      ‣ Region and Table
    • Manage Automatism at Scale
      ‣ For now!
  • Questions?
    lars@cloudera.com
    http://cloudera.com