Apache HBase Application Archetypes

  1. Apache HBase Application Archetypes
     Strata + Hadoop World Barcelona, November 20th, 2014
     Lars George | @larsgeorge | Cloudera EMEA Chief Architect | HBase PMC
     Jonathan Hsieh | @jmhsieh | Cloudera HBase Tech Lead | HBase PMC
  2. About Lars and Jon
     Lars George
     • EMEA Chief Architect @Cloudera
       – Apache HBase PMC
       – O’Reilly author of HBase: The Definitive Guide
     • Contact
       – lars@cloudera.com
       – @larsgeorge
     Jon Hsieh
     • Tech Lead, HBase Team @Cloudera
       – Apache HBase PMC
       – Apache Flume founder
     • Contact
       – jon@cloudera.com
       – @jmhsieh
     11/20/14 Strata+Hadoop World Barcelona 2014. George and Hsieh
  3. About Supporting HBase at Cloudera
     • Supporting customers using HBase since 2011
       – HBase Training
       – Professional Services
     • Team has experience supporting and running HBase since 2009
       – 9 committers on staff
       – 2 HBase book authors
     • As of Jan 2014, ~20,000 HBase nodes (in aggregate) under management
     • Information in this presentation is either aggregated customer data or from public sources.
  4. An Apache HBase Timeline (2008 to 2015)
     • Summer ’09: StumbleUpon goes production on HBase ~0.20
     • Apr ’11: CDH3 GA with HBase 0.90.1
     • Summer ’11: Messages on HBase
     • Summer ’11: Web Crawl Cache
     • Sept ’11: HBase TDG published
     • Nov ’11: Cassini on HBase
     • May ’12: HBaseCon 2012
     • Nov ’12: HBase in Action published
     • Jan ’13: Phoenix on HBase
     • Jun ’13: HBaseCon 2013
     • Aug ’13: Flurry 1k-1k node cluster replication
     • Jan ’14: Cloudera has ~20k HBase nodes under management
     • May ’14: HBaseCon 2014
     • Fall ’14/Winter ’15: HBase v1.0.0 released
  5. Apache HBase “Nascar” Slide
  6. Outline
     • Definitions
     • Archetypes
       – The Good
       – The Bad
       – The Maybe
     • Conclusion
  7. Definitions: A vocabulary for HBase Archetypes
  8. Defining HBase Archetypes
     • There are a lot of HBase applications
       – Some successful, some less so
       – They have common architecture patterns
       – They have common tradeoffs
     • Archetypes are common architecture patterns
       – Common across multiple use-cases
       – Extracted to be repeatable
     • Our Goal: Define patterns à la “Gang of Four” (Gamma, Helm, Johnson, Vlissides)
  9. So you want to use HBase?
     • What data is being stored?
       – Entity data
       – Event data
     • Why is the data being stored?
       – Operational use cases
       – Analytical use cases
     • How does the data get in and out?
       – Real time vs. Batch
       – Random vs. Sequential
  10. What is being stored?
     There are primarily two kinds of big data workloads. They have different storage requirements.
     • Entities
     • Events
  11. Entity Centric Data
     • Entity data is information about current state
       – Generally real time reads and writes
     • Examples:
       – Accounts
       – Users
       – Geolocation points
       – Click counts and metrics
       – Current sensor readings
     • Scales up with # of Humans and # of Machines/Sensors
       – Billions of distinct entities
  12. Event Centric Data
     • Event centric data are time-series data points recording successive points spaced over time intervals.
       – Generally real time write, some combination of real time read or batch read
     • Examples:
       – Sensor data over time
       – Historical stock ticker data
       – Historical metrics
       – Click time-series
     • Scales up due to finer grained intervals, retention policies, and the passage of time
  13. Events about Entities
     • The majority of Big Data use cases deal with event-based data
       – |Entities| * |Events| = Big data
     • When you ask questions, do you home in on the entity first?
     • When you ask questions, do you home in on time ranges first?
     • Your answer will help you determine where and how to store your data.
  14. Why are you storing the data?
     • So what kind of questions are you asking the data?
     • Entity-centric questions
       – Give me everything about entity e
       – Give me the most recent event v about entity e
       – Give me the n most recent events V about entity e
       – Give me all events V about e between time [t1,t2]
     • Event and time-centric questions
       – Give me an aggregate for each entity between time [t1,t2]
       – Give me an aggregate for each time interval for entity e
       – Find events V that match some other given criteria
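The entity-centric questions above all become contiguous range reads once events are kept sorted by (entity, timestamp). The following is a minimal sketch of that idea, assuming an in-memory sorted list as a stand-in for a sorted table; the entity names, timestamps, and function names are hypothetical and not part of the talk.

```python
import bisect

# Hypothetical stand-in for a table kept sorted by (entity, timestamp);
# the entity names, timestamps, and payloads are made up for illustration.
EVENTS = sorted([
    ("e1", 10, "login"),
    ("e1", 20, "click"),
    ("e1", 30, "logout"),
    ("e2", 15, "login"),
])


def events_between(entity, t1, t2):
    """All events about `entity` with t1 <= timestamp <= t2 (integer timestamps)."""
    lo = bisect.bisect_left(EVENTS, (entity, t1))
    hi = bisect.bisect_left(EVENTS, (entity, t2 + 1))
    return EVENTS[lo:hi]


def everything_about(entity):
    """Every event about `entity`: one contiguous slice of the sorted data."""
    lo = bisect.bisect_left(EVENTS, (entity,))
    hi = bisect.bisect_left(EVENTS, (entity + "\x00",))
    return EVENTS[lo:hi]


def n_most_recent(entity, n):
    """The n most recent events about `entity`, newest first."""
    if n <= 0:
        return []
    return everything_about(entity)[-n:][::-1]
```

The event and time-centric questions do not map onto one slice of this ordering: an aggregate for each entity over [t1,t2] has to visit every entity's range.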
  15. How does data get in and out of HBase?
     [Figure: access paths into and out of HBase. Labels: Put, Incr, Append (HBase Client); Gets, Short scan (HBase Client); Full Scan, MapReduce (HBase Scanner); Bulk Import (HBase Client); HBase Replication (in and out).]
  16. How does data get in and out of HBase?
     [Figure: the same access paths, annotated “low latency” and “high throughput”. Labels: Put, Incr, Append (HBase Client); Get, Scan (HBase Client); Gets, Short scan; Full Scan, MapReduce (HBase Scanner); Bulk Import (HBase Client); HBase Replication (in and out).]
  17. What system is most efficient?
     • It is all physics
     • You have a limited I/O budget
       – Use all your I/O by parallelizing access and reading/writing sequentially.
       – Choose the system and features that reduce I/O in general
     • Pick the system that is best for your workload
     [Figure label: IOPs/s/disk]
  18. The physics of Hadoop Storage Systems
     Workload         | HBase                               | HDFS
     Low Latency      | ms, cached                          | mins, MR + seconds, Impala
     Random Read      | primary index                       | - index?, small files problem
     Short Scan       | sorted                              | + partition
     Full Scan        | 0 live table + (MR on snapshots)    | MR, Hive, Impala
     Random Write     | log structured                      | - not supported
     Sequential Write | HBase overhead, bulk load           | minimal overhead
     Updates          | log structured                      | - not supported
  21. The Archetypes: HBase Applications
  22. HBase application use cases
     • The Good
       – Simple Entities
       – Messaging Store
       – Graph Store
       – Metrics Store
     • The Bad
       – Large Blobs
       – Naïve RDBMS port
       – Analytic Archive
     • The Maybe
       – Time series DB
       – Combined workloads
  23. Archetypes: The Good
     HBase, you are my soul mate.
  24. Archetype: Simple Entities
     • Purely entity data, no relation between entities
       – Batch or real-time, random writes
       – Real-time, random reads
       – Could be a well-done denormalized RDBMS port
       – Often from many different sources, with poly-structured data
     • Schema:
       – Row per entity
       – Row key => entity ID, or hash of entity ID
       – Col qualifier => Property / field, possibly time stamp
     • Examples:
       – Geolocation data
       – Search index building
       – Use Solr to make text data searchable
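The schema above (one row per entity, keyed by entity ID or a hash of it, one column qualifier per property) can be sketched with plain dictionaries. This is an illustration only: the short MD5 prefix followed by the readable ID is an assumed layout, not one the slides prescribe, and `entity_row_key` and `TABLE` are hypothetical names.

```python
import hashlib


def entity_row_key(entity_id, prefix_len=4):
    """Row key = short hash prefix + entity ID (an assumed, illustrative layout)."""
    digest = hashlib.md5(entity_id.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}-{entity_id}"


# Hypothetical table: row key -> {column qualifier: value}, one row per entity.
TABLE = {
    entity_row_key("user-1001"): {"name": "Ada", "country": "ES"},
    entity_row_key("user-1002"): {"name": "Linus", "country": "FI"},
}


def get_entity(entity_id):
    """Random read of one entity: a single lookup by row key."""
    return TABLE.get(entity_row_key(entity_id))
```

Keeping the ID after the hash prefix leaves the key human-readable while the prefix spreads sequentially assigned IDs across the key space.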
  25. Simple Entities access pattern
     [Figure: HBase access paths for this archetype. Labels: Put, Incr, Append (HBase Client); Get, Scan (HBase Client); Gets, Short scan; Bulk Import (HBase Client); Full Scan, MapReduce (HBase Scanner); HBase Replication (in and out); Solr; low latency; high throughput.]
  26. Archetype: Messaging Store
     • Messaging Data:
       – Realtime random writes: Emails, SMS, MMS, IM
       – Realtime random updates: Msg read, starred, moved, deleted
       – Reading of top-N entries, sorted by time
       – Records are of varying size
       – Some time series, but mostly random read/write
     • Schema:
       – Row = users/feed/inbox
       – Row key = UID or UID + time
       – Column Qualifier = time or conversation id + time
       – Use CFs for indexes
     • Examples:
       – Facebook Messages, Xiaomi Messages
       – Telco SMS/MMS services
       – Feeds like Tumblr, Pinterest
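For the "UID + time" row key and the top-N read sorted by time, one common trick is to store a reversed timestamp so that the newest message sorts first within a user's key range. The slides do not say which encoding to use; the sketch below assumes a big-endian reversed timestamp and a zero-byte separator, and all names in it are hypothetical.

```python
import struct

MAX_TS = 2**63 - 1  # largest value that fits in a signed 64-bit field


def message_row_key(uid, ts_millis):
    """UID + reversed timestamp, so newer messages sort before older ones."""
    reverse_ts = MAX_TS - ts_millis
    return uid.encode("utf-8") + b"\x00" + struct.pack(">q", reverse_ts)


def top_n(sorted_keys, uid, n):
    """First n keys for `uid` in sort order, i.e. the n newest messages."""
    prefix = uid.encode("utf-8") + b"\x00"
    matches = [key for key in sorted_keys if key.startswith(prefix)]
    return matches[:n]
```

With this ordering, "top N" is a short scan from the start of the user's key range rather than a read of the whole inbox followed by a sort.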
  27. Facebook Messages - Statistics
     Source: HBaseCon 2012 - Anshuman Singh
  28. Messages Access Pattern
     [Figure: HBase access paths for this archetype. Labels: Put, Incr, Append (HBase Client); Get, Scan (HBase Client); Gets, Short scan; Bulk Import (HBase Client); Full Scan, MapReduce (HBase Scanner); HBase Replication (in and out); low latency; high throughput.]
  29. Archetype: Graph Data
     • Graph Data: All entities and relations
       – Batch or realtime, random writes
       – Batch or realtime, random reads
       – It's an entity with relation edges
     • Schema:
       – Row = Node
       – Row key => Node ID
       – Col qualifier => Edge ID, or properties:values
     • Examples:
       – Web Caches: Yahoo!, Trend Micro
       – Titan Graph DB with HBase storage backend
       – Sessionization (financial transactions, click streams, network traffic)
       – Government (connect the bad guy)
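The graph schema above (row = node, column qualifiers = edges or properties) makes "who is this node connected to" a single-row read. A minimal sketch with nested dictionaries follows; the `edge:` and `prop:` qualifier prefixes and the sample nodes are assumptions made for illustration.

```python
EDGE = "edge:"  # assumed qualifier prefix for edges; "prop:" marks properties

# Hypothetical graph table: row key = node ID, qualifiers = edges and properties.
GRAPH = {
    "alice": {"edge:bob": "follows", "edge:carol": "follows", "prop:country": "ES"},
    "bob": {"edge:carol": "follows"},
    "carol": {"prop:country": "DE"},
}


def neighbors(node):
    """Outgoing edges of `node`: one row read, filtered by qualifier prefix."""
    row = GRAPH.get(node, {})
    return sorted(q[len(EDGE):] for q in row if q.startswith(EDGE))


def two_hop(node):
    """Nodes reachable in exactly two steps: one extra row read per neighbor."""
    reached = set()
    for neighbor in neighbors(node):
        reached.update(neighbors(neighbor))
    return sorted(reached - {node})
```

Each hop costs one random read per visited node, which is why the archetype lists random reads as its access pattern.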
  30. Graph Data Access Pattern
     [Figure: HBase access paths for this archetype. Labels: Put, Incr, Append (HBase Client); Get, Scan (HBase Client); Gets, Short scan; Bulk Import (HBase Client); Full Scan, MapReduce (HBase Scanner); HBase Replication (in and out); low latency; high throughput.]
  31. Archetype: Metrics
     • Frequently updated metrics
       – Increments
       – Roll ups generated by MR and bulk loaded to HBase
     • Schema
       – Row: Entity for a time period
       – Row key: entity-<yymmddhh> (granular time)
       – Col Qualifier: property -> count
     • Examples
       – Campaign Impression/Click counts (Ad tech)
       – Sensor data (Energy, Manufacturing, Auto)
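The `entity-<yymmddhh>` row key and the "property -> count" qualifiers can be sketched as follows. This is an in-memory illustration only; `metric_row_key`, `increment`, and `COUNTERS` are hypothetical names, and a dictionary stands in for the table.

```python
from collections import defaultdict
from datetime import datetime


def metric_row_key(entity, ts):
    """Row key entity-<yymmddhh>: one row per entity per hour."""
    return f"{entity}-{ts:%y%m%d%H}"


# Hypothetical table: row key -> {property: count}.
COUNTERS = defaultdict(lambda: defaultdict(int))


def increment(entity, prop, ts, amount=1):
    """Add `amount` to one counter cell and return the new value."""
    row = COUNTERS[metric_row_key(entity, ts)]
    row[prop] += amount
    return row[prop]
```

For example, `metric_row_key("campaign42", datetime(2014, 11, 20, 15, 30))` yields `campaign42-14112015`, and every click in that hour increments the same row.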
  32. Metrics Access Pattern
     [Figure: HBase access paths for this archetype. Labels: Put, Incr, Append (HBase Client); Get, Scan (HBase Client); Gets, Short scan; Bulk Import (HBase Client); Full Scan, MapReduce (HBase Scanner); HBase Replication (in and out); low latency; high throughput.]
  33. Archetypes: The Bad
     These are not the droids you are looking for
  34. Current HBase weak spots
     • HBase’s architecture can handle a lot
       – Engineering tradeoffs optimize for some use cases and against others
       – HBase can still do things it is not optimal for
       – However, other systems are fundamentally more efficient for some workloads
     • We’ve seen folks forcing apps into HBase
       – If there is only one workload on the data, consider another system
       – If there is a mixed workload, some of the cases become “maybes”
     • Just because it is not good today doesn’t mean it can’t be better tomorrow!
  35. Bad Archetype: Large Blob Store
     • Saving large objects >3MB per cell
     • Schema:
       – Normal entity pattern, but with some columns with large cells
     • Examples
       – Raw photo or video storage in HBase
       – Large frequently updated structs as a single cell
     • Problems:
       – Write amplification when reoptimizing data for read (compactions on large unchanging data)
       – Write amplification when large structs are rewritten to update subfields. Cells are atomic, and HBase must rewrite an entire cell
     • Note: Medium Binary Object (MOB) support coming (lots of 100KB-10MB cells)
       – See HBASE-11339 for more details.
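The second problem above is simple arithmetic: because a cell is rewritten as a whole, the bytes written are the full cell size no matter how small the changed subfield is. A rough sketch with made-up sizes follows; the 5 MiB and 1 KiB figures are hypothetical, and the function ignores compaction and replication overhead.

```python
def rewrite_amplification(cell_bytes, changed_bytes):
    """Bytes rewritten per byte actually changed when a whole cell is rewritten."""
    if changed_bytes <= 0:
        raise ValueError("changed_bytes must be positive")
    return cell_bytes / changed_bytes


# Hypothetical example: a 5 MiB struct stored as one cell, 1 KiB subfield updated.
LARGE_CELL = 5 * 1024 * 1024
SMALL_UPDATE = 1024
AMPLIFICATION = rewrite_amplification(LARGE_CELL, SMALL_UPDATE)  # 5120.0
```

Splitting the struct into one cell per subfield would bring the ratio close to 1 for the same update, at the cost of more cells per row.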
  36. Bad Archetype: Naïve RDBMS port
     • A naïve port of an RDBMS into HBase, directly copying the schema
     • Schema
       – Many tables, just like an RDBMS schema
       – Row key: primary key or auto-incrementing key, like RDBMS schema
       – Column qualifiers: field names
       – Manually do joins, or secondary indexes (not consistent)
     • Solution:
       – HBase is not a SQL database
       – No multi-region/multi-table transactions in HBase (yet)
       – No built-in join support. You must denormalize your schema to use HBase
  37. Large blob store, Naïve RDBMS port access patterns
     [Figure: HBase access paths for these archetypes. Labels: Put, Incr, Append (HBase Client); Get, Scan (HBase Client); Gets, Short scan; Bulk Import (HBase Client); Full Scan, MapReduce (HBase Scanner); HBase Replication (in and out); low latency; high throughput.]
  38. Bad Archetype: Analytic archive
     • Store purely chronological data, partitioned by time
       – Real time writes, chronological time as primary index
       – Column-centric aggregations over all rows
       – Bulk reads out, generally for generating periodic reports
     • Schema
       – Row key: date+xxx or salt+date+xxx
       – Column qualifiers: properties with data or counters
     • Example
       – Machine logs organized by date (causes write hotspotting)
       – Full fidelity clickstream organized by date (as opposed to campaign)
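The `salt+date+xxx` row key above trades read locality for write spread. The sketch below shows both sides, assuming a fixed number of salt buckets chosen by a CRC of the key suffix; the bucket count, the key format, and the function names are hypothetical.

```python
import zlib

NUM_BUCKETS = 8  # hypothetical number of salt buckets


def salted_key(day, suffix):
    """salt+date+xxx: the salt spreads one day's writes over NUM_BUCKETS key ranges."""
    bucket = zlib.crc32(suffix.encode("utf-8")) % NUM_BUCKETS
    return f"{bucket:02d}-{day}-{suffix}"


def scan_prefixes(day):
    """Reading one day back now takes one prefix scan per bucket."""
    return [f"{bucket:02d}-{day}-" for bucket in range(NUM_BUCKETS)]
```

Without the salt, all writes for the current date land in one key range (the write hotspot the slide mentions); with it, every date-based read fans out into `NUM_BUCKETS` scans.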
  39. Bad Archetype: Analytic Archive Problems
     • HBase non-optimal as primary use case
       – Will get crushed by frequent full table scans
       – Will get crushed by large compactions
       – Will get crushed by write-side region hot spotting
     • Solution:
       – Store in HDFS; use Parquet columnar data storage + Impala/Hive
       – Build rollups in HDFS+MR; store and serve rollups in HBase
  40. Analytic Archive access patterns
     [Figure: HBase access paths for this archetype. Labels: Put, Incr, Append (HBase Client); Get, Scan (HBase Client); Gets, Short scan; Bulk Import (HBase Client); Full Scan, MapReduce (HBase Scanner); HBase Replication (in and out); low latency; high throughput.]
  41. Archetypes: The Maybe
     And this is crazy | But here’s my data, | serve it, maybe!
  42. The Maybe’s
     • For some applications, doing it right gets complicated.
     • More sophisticated or nuanced cases
     • Require considering these questions:
       – When do you choose HBase vs HDFS storage for time series data?
       – Are there times where bad archetypes are ok?
  43. Time Series: in HBase or HDFS?
     • Time series IO pattern physics:
       – Reads: Co-locate related data
         • Make reads cheap and fast
       – Writes: Spread writes out as much as possible
         • Maximize write throughput
     • HBase: Tension between these goals
       – Spreading writes spreads data, making reads inefficient
       – Co-locating on write causes hotspots and underutilizes resources by limiting write throughput
     • HDFS: The sweet spot
       – Sequential writes and sequential reads
       – Just write more files in date-dirs; physically spreads writes but logically groups data
       – Reads for time-centric queries: just read files in date-dir
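The "date-dirs" idea above can be sketched as a path convention: writers append files under the directory for the event's date, and a time-centric query only lists the directories its range covers. The base path and the `YYYY/MM/DD` layout below are assumptions for illustration; the slides do not specify a layout.

```python
from datetime import date, timedelta


def date_dir(base, day):
    """Directory holding all files for one day, e.g. <base>/2014/11/20."""
    return f"{base}/{day:%Y/%m/%d}"


def dirs_for_range(base, start, end):
    """Directories a query over [start, end] (inclusive) has to read."""
    days = (end - start).days
    return [date_dir(base, start + timedelta(days=i)) for i in range(days + 1)]
```

Many writers can add files to the same date directory in parallel, which is the "physically spreads writes but logically groups data" point.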
  44. Time Series data flows
     • Ingest
       – Flume or similar direct tool via app
     • HDFS for historical
       – No real time serving
       – Batch queries and generate rollups in Hive/MR
       – Faster queries in Impala
     • HBase for recent
       – Serve individual events
       – Serve pre-computed aggregates
  45. Archetype: Entity Time Series
     • Full fidelity historical record of metrics
       – Random write to event data, random read of specific event or aggregate data
     • Schema:
       – Rowkey: entity-timestamp or hash(entity)-timestamp, possibly with salt added after entity
       – Col qualifiers: granular time stamps -> value
       – Use custom aggregation to consolidate old data
       – Use TTLs to bound and age off old data
     • Examples:
       – OpenTSDB is a system on HBase that handles this for numeric values
         • Lazily aggregates cells for better performance
       – Facebook Insights, ODS
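One way to read the schema above is: the row key carries the entity plus a coarse time bucket, and the column qualifier carries the fine-grained offset within that bucket. The sketch below assumes hourly rows, second-resolution qualifiers, and zero-padded decimal timestamps; these choices and the function names are illustrative, not the exact format of OpenTSDB or any other system.

```python
ROW_SPAN_SECONDS = 3600  # assumed: one row per entity per hour


def cell_address(entity, ts):
    """Map an event timestamp (epoch seconds) to (row key, column qualifier)."""
    base = ts - (ts % ROW_SPAN_SECONDS)
    # Zero-padding keeps string sort order equal to numeric time order.
    return f"{entity}-{base:010d}", ts - base


def is_expired(cell_ts, now, ttl_seconds):
    """TTL check: a cell older than ttl_seconds is eligible to age off."""
    return now - cell_ts > ttl_seconds
```

A read of one entity over a time range then touches one row per hour in the range rather than one row per data point.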
  46. Entity Time Series access pattern
     [Figure: HBase access paths for this archetype. Labels: Put, Incr, Append (HBase Client); Get, Scan (HBase Client); Gets, Short scan; Bulk Import (HBase Client); Full Scan, MapReduce (HBase Scanner); HBase Replication (in and out); Flume; Custom App; low latency; high throughput.]
  47. Archetypes: Hybrid Entity Time Series
     • Essentially a combo of the Metric Archetype and Entity Time Series Archetype, with bulk loads of rollups via HDFS
       – Land data in HDFS and HBase
       – Keep all data in HDFS for future use
       – Aggregate in HDFS and write to HBase
       – HBase can do some aggregates too (counters)
       – Keep serve-able data in HBase
       – Use TTL to discard old values from HBase
  48. Hybrid time series access pattern
     [Figure: HBase access paths for this archetype. Labels: Put, Incr, Append (HBase Client); Get, Scan (HBase Client); Gets, Short scan; Hive or MR: Bulk Import (HBase Client); Full Scan, MapReduce (HBase Scanner); HBase Replication (in and out); Flume; HDFS; low latency; high throughput.]
  49. Meta Archetype: Combined workloads
     • In these cases, the use of HBase depends on workload
     • Cases where we have multiple workload styles
       – In many cases we want to do multiple things with the same data
       – Primary use case (real time, random access)
       – Secondary use case (analytical)
       – Pick for your primary; here are some patterns for how to do your secondary.
  50. Operational with Analytical access pattern
     [Figure: operational and analytical access paths. Labels: Get, Scan (HBase Client), annotated “poor latency! full scans interfere with latency!”; MapReduce (HBase Scanner), annotated “high throughput”; Put, Incr, Append (HBase Client); HBase Replication; Bulk Import (HBase Client).]
  51. Operational with Analytical access pattern
     [Figure: operational and analytical access paths with HBase Replication. Labels: Get, Scan (HBase Client), annotated “low latency, isolated from full scans”; MapReduce (HBase Scanner), annotated “high throughput”; Put, Incr, Append (HBase Client); HBase Replication; Bulk Import (HBase Client), annotated “high throughput”.]
  52. MR over Table Snapshots (0.98, CDH5.0)
     • Previously, MapReduce jobs over HBase required an online full table scan
     • Take a snapshot and run the MR job over snapshot files
       – Doesn’t use HBase client
       – Avoids affecting HBase caches
       – 3-5x perf boost
       – Still requires more IOPs than HDFS raw files
     [Figure: two map/reduce job diagrams, one of them reading from a snapshot.]
  53. Analytic Archive access pattern
     [Figure: HBase access paths for this archetype. Labels: Put, Incr, Append (HBase Client); Get, Scan (HBase Client); Gets, Short scan; Bulk Import (HBase Client); Full Scan, MapReduce (HBase Scanner); HBase Replication (in and out); low latency; high throughput.]
  54. Analytic Archive Snapshot access pattern
     [Figure: HBase access paths with table snapshots. Labels: HDFS; Put, Incr, Append (HBase Client); Gets, Short scan (HBase Client); Snapshot Scan, MR (HBase Scanner); Bulk Import (HBase Client); HBase Replication (in and out); Table snapshot; low latency; higher throughput.]
  55. Request Scheduling
     • We want to run MR for analytics while serving low-latency requests in one cluster.
     • Performance Isolation (proposed)
       – Limit the performance impact that load on one table has on others. (HBASE-6721)
     • Request prioritization and scheduling
       – Current default is FIFO, added Deadline
       – Prioritize short requests before long scans (HBASE-10994)
     • Throttling
       – Limit the request throughput of MR jobs.
     [Figure: request queues for a mixed workload (delayed by long scan requests), a rescheduled workload (new requests get priority), and an isolated workload.]
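The difference between FIFO and "short requests before long scans" can be illustrated with a small priority queue. This is a toy model, not HBase's scheduler: the cost estimate, the class name `ShortFirstQueue`, and its methods are hypothetical.

```python
import heapq
import itertools


class ShortFirstQueue:
    """Toy request queue: cheapest estimated request first, FIFO among equals."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker that preserves arrival order

    def submit(self, name, estimated_cost):
        heapq.heappush(self._heap, (estimated_cost, next(self._arrival), name))

    def next_request(self):
        """Pop and return the name of the request to serve next."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

Under plain FIFO a long scan that arrives first delays every get queued behind it; here the gets are served first. A production scheduler also has to keep long scans from starving, which this sketch does not attempt.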
  56. Conclusions
  57. Big Data Workloads
     [Figure: systems placed by latency (low latency to batch) and access type (random access, short scan, full scan). Labels: HDFS + Impala; HDFS + MR (Hive/Pig); HBase; HBase + MR; HBase + Snapshots -> HDFS + MR.]
  58. Big Data Workloads
     [Figure: the same chart with archetypes overlaid. System labels: HDFS + Impala; HDFS + MR (Hive/Pig); HBase; HBase + MR; HBase + Snapshots -> HDFS + MR. Archetype labels: Analytic archive; Entity Time series; Hybrid Entity Time series + Rollup generation; Simple Entities; Graph data; Current Metrics; Messages; Hybrid Entity Time series + Rollup serving; Index building.]
  59. HBase is evolving to be an Operational Database
     • Excels at consistent row-centric operations
       – Dev efforts aimed at using all machine resources efficiently, reducing MTTR, and improving latency predictability.
       – Projects built on HBase that enable secondary indexing and multi-row transactions
       – Apache Phoenix or Impala provide a SQL skin for simplified application development
       – Evolution towards OLTP workloads
     • Analytic workloads?
       – Can be done but will be beaten by direct HDFS + MR/Spark/Impala
  60. Join the Discussion
     Get community help or provide feedback: cloudera.com/community
  61. Try Hadoop Now
     cloudera.com/live
  62. Thank you! More questions?
     Join us at Office Hours, 4pm @ Table B