Hadoop and HBase in the Real World

Cloudera Solutions Architect Joey Echeverria explains Hadoop and HBase's architecture and roles in the real world of data management and storage.

Published in: Technology

Hadoop and HBase in the Real World

  1. Apache Hadoop and HBase in the Real World
     Joey Echeverria, @fwiffo, #novahug
  2. The Plug
     • We're Training! Developer Training July 25 to 27; Admin Training July 28 to 29. http://www.cloudera.com/training
     • We're Hiring! Solution Architects, Trainers, Distributed Systems Engineers. http://www.cloudera.com/careers
     Copyright 2011 Cloudera Inc. All rights reserved
  3. 1 Minute Hadoop Recap
     • HDFS: distributed file system; optimized for streaming reads and writes; block-level replication
     • MapReduce: distributed processing framework; reads/writes data in HDFS (typically); operates over a (key, value) view of data
  4. Where does HBase come in?
     • Google: invented GFS and MapReduce; GFS is optimized for streaming reads and writes
     • BigTable: Google's answer to random read/write workloads
  5. HBase: BigTable-like storage (for Hadoop)
  6. What is HBase?
     • Key/value column family store
     • Data stored in HDFS
     • ZooKeeper for coordination
     • Access model is get/put/del
     • Plus range scans and versions
  7. Architecture
     Image courtesy Lars George, licensed under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Germany License.
  8. Tables and Column Families
     • Static part of the schema
     • Column families also form locality groups: one Store per family; multiple HFiles per Store
     • Tables split into regions: continuous range of row keys; unit of distribution; automatically split; pre-split for performance
  9. Why use HBase?
     • Variable schema in each record
     • Collections of data for each key
     • Atomic control of per-key data
     • Row access to each column family
  10. HBase Applications
     • “Smart Data, at Scale, made Easy”: http://www.lilyproject.org
     • “Distributed, scalable Time Series Database (TSDB)”: http://opentsdb.net
  11. Real-time ad optimizations
     • Capturing impressions and serving ads
     • HBase front-end: to serve models (via memcached)
     • HBase back-end: to serve pixels and capture cookies
     • MapReduce to compute models between the two
  12. Click stream sessionization
     • Key on userid and time
     • Separate table for significant events (e.g. purchase)
     • Load data using HBase importtsv tool
     • Sessionization performed by simple scans
  13. Mozilla - Socorro
     • When Firefox crashes, where do reports go?
     • The Mozilla team gathers those crashes in HBase
     • Crashes vary widely and change format often
     • Processors take each crash individually and parse it out
     http://crash-stats.mozilla.com
     http://code.google.com/p/socorro
  14. Navteq
     • Location-based content serving
     • All served out of HBase; location makes a great key
     • Content is variable: maps, POI, user data
     • Preprocessing is all done via MR jobs
  15. Cloudera
     • Gathers data about customer clusters
     • Each customer node is a key with Avro values
     • Easy to browse, quick to find issues on nodes
     • Dump to HDFS and process with Pig
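
The access model on slide 6 (get/put/del, plus range scans and versions) can be illustrated with a small in-memory stand-in. This is a minimal sketch, not the HBase client API: the `ToyTable` class, its method signatures, and the `max_versions` default are all hypothetical names chosen for illustration. It only assumes what the slide states, plus the fact that HBase keeps rows sorted by row key and addresses cells as `family:qualifier`.

```python
import bisect


class ToyTable:
    """Hypothetical in-memory model of a sorted, versioned key/value table.

    Not the HBase API; it only mimics the get/put/delete/scan access model.
    """

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._row_keys = []  # row keys, kept in sorted order
        self._cells = {}     # row -> {column -> [(timestamp, value), ...]}

    def put(self, row, column, value, timestamp):
        """Store a new version of a cell, keeping only the newest versions."""
        if row not in self._cells:
            bisect.insort(self._row_keys, row)
            self._cells[row] = {}
        versions = self._cells[row].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest first
        del versions[self.max_versions:]

    def get(self, row, column, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        return self._cells.get(row, {}).get(column, [])[:versions]

    def delete(self, row):
        """Remove a whole row if it exists."""
        if row in self._cells:
            del self._cells[row]
            index = bisect.bisect_left(self._row_keys, row)
            del self._row_keys[index]

    def scan(self, start_row, stop_row):
        """Yield (row, latest values) for start_row <= row < stop_row."""
        lo = bisect.bisect_left(self._row_keys, start_row)
        hi = bisect.bisect_left(self._row_keys, stop_row)
        for row in self._row_keys[lo:hi]:
            latest = {col: vs[0][1] for col, vs in self._cells[row].items()}
            yield row, latest


table = ToyTable(max_versions=2)
table.put("user1", "info:email", "old@example.com", timestamp=1)
table.put("user1", "info:email", "new@example.com", timestamp=2)
table.put("user2", "info:email", "two@example.com", timestamp=1)
```

Because rows are held in key order, a range scan is just a slice between two positions found by binary search, which is why key design matters so much in the use cases on the later slides.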

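Slide 8 describes tables split into regions, each covering a continuous range of row keys, with the option to pre-split for performance. The sketch below shows how a row key maps to a region given a sorted list of split points. The function names and the hex-prefix split points are assumptions made for illustration; the only HBase behavior relied on is that a region covers keys from its start key (inclusive) up to its end key (exclusive).

```python
import bisect
from collections import Counter


def region_index(split_points, row_key):
    """Return the index of the region that holds row_key.

    split_points is a sorted list of boundary keys. N split points define
    N + 1 regions; a key equal to a split point belongs to the region that
    starts at that split point.
    """
    return bisect.bisect_right(split_points, row_key)


def region_bounds(split_points, index):
    """Return (start_key, end_key) for a region; None marks an open end."""
    start = split_points[index - 1] if index > 0 else ""
    end = split_points[index] if index < len(split_points) else None
    return start, end


def keys_per_region(split_points, row_keys):
    """Count how many of the given row keys fall into each region."""
    return Counter(region_index(split_points, key) for key in row_keys)


# Hypothetical pre-split points for row keys that start with a hex digit.
SPLITS = ["4", "8", "c"]
```

With evenly spread key prefixes, `keys_per_region(SPLITS, keys)` gives roughly equal counts, which is the point of pre-splitting: load is spread across several regions from the start instead of landing on a single region until automatic splits catch up.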
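
The click stream use case on slide 12 keys on userid and time and performs sessionization with simple scans. A minimal sketch of that idea follows, using plain Python lists in place of an HBase scan. The key layout (`userid:zero-padded-epoch`), the helper names, and the 30-minute inactivity gap are assumptions, not details from the talk.

```python
SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity threshold, not from the slides


def row_key(user_id, epoch_seconds):
    """Build a composite key; zero-padding makes string order match time order."""
    return f"{user_id}:{epoch_seconds:010d}"


def parse_key(key):
    """Split a composite key back into (user_id, epoch_seconds)."""
    user_id, timestamp = key.rsplit(":", 1)
    return user_id, int(timestamp)


def sessionize(sorted_keys, gap=SESSION_GAP_SECONDS):
    """Group keys, already in scan (sorted) order, into per-user sessions.

    A new session starts when the user changes or when the time since the
    previous event exceeds `gap` seconds.
    """
    sessions = []
    current_user, last_ts = None, None
    for key in sorted_keys:
        user_id, ts = parse_key(key)
        if user_id != current_user or ts - last_ts > gap:
            sessions.append((user_id, []))
        sessions[-1][1].append(ts)
        current_user, last_ts = user_id, ts
    return sessions


events = [("alice", 100), ("bob", 150), ("alice", 200), ("alice", 5000)]
keys = sorted(row_key(user, ts) for user, ts in events)
sessions = sessionize(keys)
```

Because the user ID comes first in the key and the timestamp is zero-padded, all events for one user sit next to each other in time order, so a single pass over the sorted keys is enough to cut sessions.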