Hadoop and HBase in the Real World


Cloudera Solutions Architect Joey Echeverria explains the architecture of Hadoop and HBase and their roles in real-world data management and storage.

Published in: Technology

Transcript

  • 1. Apache Hadoop and HBase in the Real World. Joey Echeverria, @fwiffo, #novahug
  • 2. The Plug. We’re Training! Developer Training July 25 to 27; Admin Training July 28 to 29. http://www.cloudera.com/training We’re Hiring! Solution Architects, Trainers, Distributed Systems Engineers. http://www.cloudera.com/careers Copyright 2011 Cloudera Inc. All rights reserved.
  • 3. 1 Minute Hadoop Recap. HDFS: distributed file system; optimized for streaming reads and writes; block-level replication. MapReduce: distributed processing framework; reads/writes data in HDFS (typically); operates over a (key, value) view of data.
  • 4. Where does HBase come in? Google: Google invented GFS and MapReduce; GFS is optimized for streaming reads and writes. BigTable: Google's answer to random read/write workloads.
  • 5. HBase: BigTable-like storage (for Hadoop)
  • 6. What is HBase? Key/value column family store. Data stored in HDFS. ZooKeeper for coordination. Access model is get/put/del, plus range scans and versions.
  • 7. Architecture. Image courtesy Lars George, licensed under Creative Commons Attribution-Noncommercial-Share Alike 3.0 Germany License.
  • 8. Tables and Column Families. Static part of the schema. Column families also form locality groups: one Store per family; multiple HFiles per Store. Tables split into regions: continuous range of row keys; unit of distribution; automatically split; pre-split for performance.
  • 9. Why use HBase? Variable schema in each record. Collections of data for each key. Atomic control of per-key data. Row access to each column family.
  • 10. HBase Applications. “Smart Data, at Scale, made Easy” http://www.lilyproject.org and “Distributed, scalable Time Series Database (TSDB)” http://opentsdb.net
  • 11. Real-time ad optimizations. Capturing impressions and serving ads. HBase front-end to serve models (via memcached). HBase back-end to serve pixels and capture cookies. MapReduce to compute models between the two.
  • 12. Click stream sessionization. Key on userid and time. Separate table for significant events (e.g. purchase). Load data using HBase importtsv tool. Sessionization performed by simple scans.
  • 13. Mozilla - Socorro. When Firefox crashes, where do reports go? The Mozilla team gathers those crashes in HBase. Crashes vary widely and change format often. Processors take each individually and parse it out. http://crash-stats.mozilla.com http://code.google.com/p/socorro
  • 14. Navteq. Location-based content serving. All served out of HBase; location makes a great key. Content is variable: maps, POI, user data. Preprocessing is all done via MR jobs.
  • 15. Cloudera. Gathers data about customer clusters. Each customer node is a key with Avro values. Easy to browse, quick to find issues on nodes. Dump to HDFS and process with Pig.
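
Slide 3 describes MapReduce as operating over a (key, value) view of data. The sketch below illustrates that view with a single-process word count. It is not Hadoop code: `map_reduce`, `word_mapper`, and `count_reducer` are hypothetical names chosen for this illustration, and the in-memory grouping stands in for the shuffle that a real cluster performs across machines.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Apply mapper to each (key, value) record, group outputs by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # stand-in for the shuffle phase
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def word_mapper(offset, line):
    """Emit (word, 1) for every word in a line; the input key (offset) is unused."""
    for word in line.split():
        yield word.lower(), 1


def count_reducer(word, counts):
    """Sum the counts collected for one word."""
    return sum(counts)
```

The input keys here are byte offsets of each line, which is an assumption made for the example; any (key, value) pairs would work.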
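
Slide 6 summarizes the HBase access model as get/put/del plus range scans and versions. The following is a toy in-memory model of that idea, not the HBase client API: the `ToyTable` class, its method signatures, the logical clock used for timestamps, and the default of three retained versions are all assumptions made for illustration.

```python
import bisect
import itertools


class ToyTable:
    """In-memory stand-in for a sorted, versioned key/value table."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._rows = {}         # row key -> {column: [(timestamp, value), ...]}, newest first
        self._sorted_keys = []  # row keys kept sorted so range scans are cheap
        self._clock = itertools.count(1)  # logical timestamps (assumption)

    def put(self, row, column, value):
        if row not in self._rows:
            self._rows[row] = {}
            bisect.insort(self._sorted_keys, row)
        cells = self._rows[row].setdefault(column, [])
        cells.insert(0, (next(self._clock), value))
        del cells[self.max_versions:]  # keep only the newest versions

    def get(self, row, column, versions=1):
        """Return up to `versions` values for a cell, newest first."""
        cells = self._rows.get(row, {}).get(column, [])
        return [value for _, value in cells[:versions]]

    def delete(self, row):
        if row in self._rows:
            del self._rows[row]
            self._sorted_keys.remove(row)

    def scan(self, start_row, stop_row):
        """Yield (row, {column: newest value}) for start_row <= row < stop_row."""
        lo = bisect.bisect_left(self._sorted_keys, start_row)
        hi = bisect.bisect_left(self._sorted_keys, stop_row)
        for row in self._sorted_keys[lo:hi]:
            yield row, {col: cells[0][1] for col, cells in self._rows[row].items()}
```

Because each row holds its own dictionary of columns, two rows can carry entirely different columns, which loosely mirrors the "variable schema in each record" point on slide 9.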
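
Slide 8 says tables are split into regions, each covering a range of row keys, and that tables can be pre-split for performance. A minimal sketch of how a row key maps to a region given a sorted list of split keys might look like the following. The helper names `find_region` and `presplit`, and the zero-padded numeric key space, are assumptions for illustration; HBase's own region lookup and split logic are not shown here.

```python
import bisect


def find_region(split_keys, row_key):
    """Return the index of the region that holds row_key.

    split_keys is a sorted list of boundaries. Region i covers
    [split_keys[i - 1], split_keys[i]), with the first and last
    regions open-ended.
    """
    return bisect.bisect_right(split_keys, row_key)


def presplit(key_space_size, num_regions, width=8):
    """Evenly spaced, zero-padded split keys for a numeric key space."""
    step = key_space_size // num_regions
    return [str(i * step).zfill(width) for i in range(1, num_regions)]
```

Zero-padding matters in this sketch because keys are compared as strings; without it, "100" would sort before "25".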
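
Slide 12 describes click stream sessionization: key on userid and time, then sessionize with simple scans. The sketch below shows why that key design makes the scan simple: with the user id first and a zero-padded timestamp second, one user's events sort together in time order. The `#` separator, the 10-digit padding, the 30-minute gap, and both function names are assumptions for illustration, and the input list stands in for the rows a scan would return.

```python
def row_key(user_id, epoch_seconds):
    """Composite key: user id, then zero-padded time so keys sort chronologically per user."""
    return f"{user_id}#{epoch_seconds:010d}"


def sessionize(sorted_keys, gap_seconds=1800):
    """Group key-ordered events into sessions.

    A new session starts when the user changes or when the gap since
    the previous event exceeds gap_seconds.
    """
    sessions = []
    last_user, last_time = None, None
    for key in sorted_keys:
        user_id, _, raw_time = key.rpartition("#")
        timestamp = int(raw_time)
        if user_id != last_user or timestamp - last_time > gap_seconds:
            sessions.append((user_id, [timestamp]))
        else:
            sessions[-1][1].append(timestamp)
        last_user, last_time = user_id, timestamp
    return sessions
```

The sketch assumes plain alphanumeric user ids, so that the separator sorts before any id character and one user's keys stay contiguous.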
