Time Machine for Business Application
Data by Capturing, Organizing &
Processing of Change Records
Sharad Varshney
Th 1620 1700 glazen slide deck new
  1. Time Machine for Business Application Data by Capturing, Organizing & Processing of Change Records. Sharad Varshney, Hortonworks
  2. © Hortonworks Inc. 2012. Time Machine: By capturing Insert, Update & Delete. Do we still need dimensional modeling?
  3. Data Lake
  4. Enterprise Data Warehouse (EDW). SOURCES (TRANSACTIONAL): ERP, CRM, PLM, EAM, OTHERS
  5. Enterprise Data Warehouse (EDW). SOURCES (TRANSACTIONAL): ERP, CRM, PLM, EAM, OTHERS. STAGING AREA: Processing, Pre-Calculation, Trash Undesired Data
  6. Enterprise Data Warehouse (EDW). SOURCES (TRANSACTIONAL): ERP, CRM, PLM, EAM, OTHERS. STAGING AREA: Processing, Pre-Calculation, Trash Undesired Data. DATA WAREHOUSE, DATA MART: Cube, Star Schema, Data Vault, Snowflake Schema. REPORTING, ADHOC, OLAP, VISUALIZE
  7. Project Life Cycle: Determine List of Questions; Design EDW Schema; Collect Data; Develop ETL Process; Execute ETL Process; Ask Question From the List
  8. +/- Dim Modeling. Small Data Storage; Better Query Performance; Simple Query; Cubes / OLAP / In Memory; KPI Calculations for future. ❌ Question Limitations ❌ Poor Turnaround Time ❌ Data loss ❌ Change Management ❌ New Schema to learn ❌ Difficult to Scale ❌ KPI for Past
  9. Economics of Data. Chart: Cost of Data Generation
  10. Economics of Data. Chart: Cost of Data Generation, Amount of Data
  11. Economics of Data • Most of an organization's wealth is stored in OLTP systems. • Why throw away any information from critical data sources? Chart: Cost of Data Generation, Amount of Data
  12. Introducing Time Machine Concept. SOURCES (TRANSACTIONAL): ERP, CRM, PLM, EAM, OTHERS. Store Every Insert, Update & Delete; Same Schema as Source. Time Machine Algorithms. REPORTING, ADHOC, VISUALIZE, DASHBOARD
  13. Data Capture. SOURCES (TRANSACTIONAL): INSERT, UPDATE, DELETE. Capture every transaction: • Triggers • Periodic select query over rowstamp/timestamp • Oracle archived logs • SQL Server change data capture
  14. Data Storage • Store every transaction with a date-time stamp • Store every transaction with an I, U, or D flag • Assumption: every table has a primary key. Diagram labels: INSERT, UPDATE, DELETE
  15. Data Transformation • Convert back to source schema • Use secondary sort M/R algorithm. Diagram labels: INSERT, UPDATE, DELETE; Staging; Source Schema
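The storage and transformation steps on slides 14 and 15 can be sketched in a few lines of Python. This is only an in-memory illustration of the secondary-sort idea (group by primary key, order by timestamp within each key), not a MapReduce job and not the speaker's implementation. The record layout `(primary_key, timestamp, flag, row)` and the name `rebuild` are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical change records: (primary_key, timestamp, flag, row).
# The flag is I (insert), U (update), or D (delete), as on slide 14.
changes = [
    (2, "2012-01-05T09:00:00", "I", {"item": "pants", "qty": 4}),
    (1, "2012-01-01T08:00:00", "I", {"item": "shirts", "qty": 10}),
    (1, "2012-02-01T08:00:00", "U", {"item": "shirts", "qty": 7}),
    (2, "2012-03-01T10:00:00", "D", None),
]

def rebuild(records):
    """Return {primary_key: row} in the source schema after replaying all changes."""
    # Secondary sort: the primary key groups the records, the timestamp
    # orders them inside each group.
    ordered = sorted(records, key=itemgetter(0, 1))
    state = {}
    for pk, group in groupby(ordered, key=itemgetter(0)):
        _, _, flag, row = list(group)[-1]  # the latest change wins
        if flag != "D":                    # a deleted row is absent from the result
            state[pk] = row
    return state

print(rebuild(changes))  # {1: {'item': 'shirts', 'qty': 7}}
```

In a real MapReduce job the grouping and ordering would be done by the partitioner and sort comparator, so the reducer would see each key's changes already in time order.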
  16. Ask Questions • No new schema to learn. What’s missing? Diagram labels: INSERT, UPDATE, DELETE; Staging; Source Schema; SQL; chart (Series 1, Series 2, Series 3)
  17. Time Dimension • We don’t know how records have changed • Need a mechanism to see history records • Should be able to use SQL
  18. SNAPSHOT: SELECT <COL> FROM <TABLE> SNAPSHOT <DATE TIME>. Diagram labels: INSERT, UPDATE, DELETE; Staging; Source Schema; SQL; chart (Series 1, Series 2, Series 3)
  19. TREND: SELECT <COL> FROM <TABLE> TREND BY <YEAR,MONTH,WEEK,DAY>. Chart: Inventory of Shirts, Pants, Jeans, and PJs by year, 2011 to 2014
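SNAPSHOT and TREND BY on slides 18 and 19 are presented as SQL extensions, and the slides do not define their semantics in detail. The sketch below shows one plausible reading in Python: a snapshot replays the change log up to a point in time, and a yearly trend is one snapshot per year end. The log layout and the function names `snapshot` and `trend_by_year` are assumptions for illustration only.

```python
# Hypothetical change log for an inventory table: (primary_key, date, flag, qty).
log = [
    ("shirts", "2011-03-01", "I", 5),
    ("jeans",  "2011-06-15", "I", 8),
    ("shirts", "2012-02-10", "U", 12),
    ("jeans",  "2013-01-20", "D", None),
    ("shirts", "2013-07-04", "U", 9),
]

def snapshot(records, as_of):
    """Table state as of the given ISO date (inclusive)."""
    state = {}
    # ISO dates in one fixed format sort correctly as plain strings.
    for pk, ts, flag, qty in sorted(records, key=lambda r: r[1]):
        if ts > as_of:
            break
        if flag == "D":
            state.pop(pk, None)
        else:
            state[pk] = qty
    return state

def trend_by_year(records, years):
    """One snapshot per year, taken at the end of that year."""
    return {year: snapshot(records, f"{year}-12-31") for year in years}

print(snapshot(log, "2012-12-31"))
print(trend_by_year(log, [2011, 2012, 2013]))
```

Because every change is kept with its timestamp, a past KPI such as year-end inventory can be computed after the fact without having modelled it in advance.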
  20. Advantages • Ask any question • No new Schema to learn • KPI can be generated for Past • Quick Turnaround • No ETL Errors (Automations) • Peace of mind that data is not lost
  21. Deployment: Ask Questions; Develop Queries; See Results; Improve Business Process; Develop Curiosity; Infrastructure; Collect Data; Store Data
  22. Questions
  23. Thank You. Sharad Varshney. Phone: +1 678 438 3701
  24. HBase 0.96 – A Report on the Current Status. Lars George, Cloudera
  25. HBase 0.96+: A Report on the Current Status. Lars George | EMEA Chief Architect
  26. About Me • EMEA Chief Architect @ Cloudera • Consulting on Hadoop projects (everywhere) • Apache Committer • HBase and Whirr • O’Reilly Author • HBase – The Definitive Guide • Now in Japanese! • Contact • lars@cloudera.com • @larsgeorge. Caption: The Japanese edition is out too!
  27. The Content... • Version History • Overview of new Features • Summary
  28. Version History: A Timeline Overview
  29. HBase Releases. URL: http://s.apache.org/hbase-releases
  30. HBase Releases – Issues Closed (JIRA). URL: http://s.apache.org/hbase-releases
  31. HBase Releases – Issues Closed (Distinct). URL: http://s.apache.org/hbase-releases
  32. HBase Book? I targeted 0.92.0 but… r1130336 | stack | 2011-06-02 00:52:45 +0200 (Thu, 02 Jun 2011) | 1 line Add link to meet up ... r1234894 | stack | 2012-01-23 17:50:43 +0100 (Mon, 23 Jan 2012) | 1 line Move version on past 0.92.0 to 0.92.1-SNAPSHOT $ svn log -r 1130336:1234894 | grep "^r" | wc -l 807
  33. HBase Book? I am trailing 0.92.0 by 800+ commits, including for example r1153634 | tedyu | 2011-08-03 21:59:48 +0200 (Wed, 03 Aug 2011) | 2 lines HBASE-3857 Change the HFile Format (Mikhail & Liyin) …which is not “unimportant”. I am working on an update!
  34. Coprocessors and more… HBase 0.92
  35. HBase 0.92 - Highlights • 682 issues addressed • 811 issues total in 0.92.x line • New logo! (HBASE-4312) • HFile v2 (HBASE-3857) • Distributed Log Splitting (HBASE-1364) • Enhanced Master UI • Major compaction progress (HBASE-3900) • Regions in transition (HBASE-4291) • Tasks (HBASE-3839) • Slow Query Metrics (HBASE-4117)
  36. HBase 0.92 - Highlights • Coprocessors (HBASE-2000) • Offheap cache (HBASE-4027) • Online Table Schema Change (HBASE-1730) • Region Size from 256MB to 1GB (HBASE-4374) • Hadoop 1 Support (HBASE-5125) • Snappy Support (HBASE-3691) • Keep last version with TTL (HBASE-4071) • Multithreaded Compactions (HBASE-4572)
  37. HFile v1 – HBase 0.90 • Previously the file layout was data blocks, meta blocks, and then file metadata such as indexes. • Each data block held a magic header followed by the actual data, stored sequentially.
  38. HFile v2 – HBase 0.92+. The second version of HFile splits the indexes and Bloom filters into a hierarchy and interleaves them with the data blocks. The data block header now holds additional information about the block itself. Source: http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/
  39. Coprocessors: Observers
  40. Coprocessors: RPC Calls
  41. Slab Cache – Off-heap Block Cache. http://blog.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/ • The off-heap cache uses Java NIO’s direct ByteBuffer structures • Uses its own slab allocation handling • Does copy-on-read during access of data • Used as an L2 cache and replaces the OS buffer cache
  42. Performance Tuning… HBase 0.94
  43. HBase 0.94 - Highlights • 420 issues addressed • 1394 issues total in 0.94.x line • Read Caching Improvements (HBASE-5074) • Seek Optimization • Bloom Filter for Delete Family (HBASE-4532) • Lazy Seeks (HBASE-4465) • Write to WAL Optimizations • WAL Compression (HBASE-4608) • Data Block Encoding of KeyValues (HBASE-4218) • Improved HBaseFsck (HBASE-5128)
  44. HBase 0.94 - Highlights • Simplified Region Sizing (HBASE-4365) • Smarter Transaction Semantics • Atomic Put&Delete in One Call (HBASE-3584) • Snapshots (0.94.6) (HBASE-7360) • Atomic Appends (HBASE-4102) • Multi Increment and Append (HBASE-2947) • More Aggressive Off-Peak Compactions (HBASE-4463)
  45. HBase 0.94 - Highlights • Per Column Family Metrics (HBASE-4219) • Multi-row local transactions (HBASE-5229) • Pluggable Split Key Policy (HBASE-5304) • Load balance regions by table (HBASE-3373) • Also backported to 0.92.1 • Make Compaction Code Pluggable (HBASE-6427) • Deprecate HTablePool (0.94.11) (HBASE-6580) • Canary Test Tool (HBASE-4393)
  46. Block Encoding • Reduces the data footprint in memory • Only encodes the key portion of a key/value pair • Encoded keys also stay encoded during flushes • Compression on top of encoding takes care of the values and the remainder of the key data. Example: key length 90B, value length 8B. Type / Ratio: Key Compression 92%; Total Compression 85%; LZO on same data 85%; LZO after encoding 91%. https://issues.apache.org/jira/browse/HBASE-4218
  47. Block Encoding: None. Source: http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/ • With no encoding the Key/Value structures are stored verbatim (with some overhead for lengths) • In the past you were advised to keep the “keys” short for that reason
  48. Block Encoding: Prefix Encoding. Source: http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/ • The encoding patch added a new Cell abstraction that allows for extra fields in a Key/Value • The fields are used to track necessary details for the encoding
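The general idea behind prefix encoding of sorted keys can be sketched as follows. This is a simplified illustration in Python and deliberately not HBase's on-disk format: the key strings, the tuple layout, and the function names are all assumptions for the example. Each key is stored as the number of leading characters it shares with the previous key plus the remaining suffix.

```python
def shared_prefix_len(a, b):
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def prefix_encode(sorted_keys):
    """Encode each key as (shared_prefix_length, suffix) relative to the previous key."""
    encoded, previous = [], ""
    for key in sorted_keys:
        shared = shared_prefix_len(previous, key)
        encoded.append((shared, key[shared:]))
        previous = key
    return encoded

def prefix_decode(encoded):
    """Rebuild the original keys from (shared_prefix_length, suffix) pairs."""
    keys, previous = [], ""
    for shared, suffix in encoded:
        key = previous[:shared] + suffix
        keys.append(key)
        previous = key
    return keys

# Hypothetical keys in sorted order, as they would appear within one block.
keys = ["row1/cf:col1", "row1/cf:col2", "row2/cf:col1"]
encoded = prefix_encode(keys)
print(encoded)
```

The savings depend on how much adjacent keys repeat, which is why encoding helps most with long, similar keys and little with random ones.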
  49. Block Encoding: Diff Encoding. Source: http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/ • Apart from prefix encoding there are other ways of encoding the keys • Diff encoding is one such approach
  50. Block Encoding • The advantage of block encoding is faster decompression/decoding • 20-80% faster than LZO • It also still allows seeking in the data, which is not possible with compressed data • The penalty is slightly slower read performance compared to non-encoded keys • It is important to watch the sizes and repetition of key data; encoding might not be useful for random data. https://issues.apache.org/jira/browse/HBASE-4218
  51. The Singularity: HBase 0.96
  52. HBase 0.96 - Highlights • 1219 issues addressed • 2243 issues total in 0.96.x line • Improved Stability (HBASE-6241/6201) • ZK based Read/Write locks for table operations (HBASE-7305) • Scalability Improvements (HBASE-8877) • Schema Storage (HBASE-8778) • Log Cleaner for Replication Speed Up (HBASE-9208) • Mean-Time-To-Recovery (MTTR) Improvements (HBASE-5844/5926) • Distributed Log Replay (HBASE-7006) • Dedicated WAL for System Table (HBASE-7213/8631)
  53. HBase 0.96 - Highlights • Operability Improvements • Hooks for Health Scripts (HBASE-7399/7406) • Trace Lagging Calls with HTrace (HBASE-9121) • Versioned RPCs and Metadata (Protobufs) (HBASE-3505) • Parallel Seeks in Stores (HBASE-7495) • Hadoop 1 and 2 Support • Secure Short Circuit Reads on H2 (HBASE-6783) • Namespaces Support (HBASE-8015) • New Metrics v2 (HBASE-4050) • Cell Interface vs KeyValue (HBASE-7162)
  54. HBase 0.96 - Highlights • No more ROOT table (HBASE-3171) • Remove HFile v1 (HBASE-7660) • Trie Data Block Encoding (HBASE-4676) • Remove Client-side Row Locks (HBASE-7263/7315) • Compaction and Flush Improvements (HBASE-7516/7763/6466/7678) (HBASE-7667/7110/7603/7519/7842) • Improved Default Configuration (HBASE-4657?) • Client-side Type Library (HBASE-8089)
  55. HBase 0.96 - Highlights • Online Region Merging (HBASE-7403/8219) • Bucket Cache Support (HBASE-7404) • Remove older ICV Calls (HBASE-7032) • New “Bootstrap” based UIs! (HBASE-6135) • Remove Client-side Row Locks (HBASE-7263/7315) • Compaction and Flush Improvements (HBASE-7516/7763/6466/7678) (HBASE-7667/7110/7603/7519/7842)
  56. [Quote] Michael Stack, HBase PMC Chair
  57. Mean-Time-To-Recovery (MTTR) • Lots of effort was put into reducing how long data might be inaccessible during a region move • The offline period is made up of phases: • a detection phase, • a repair phase, • reassignment, and finally, • clients noticing the data available in its new location • Improvements in many of those areas • Faster detection, efficient repair, parallel replay • Dedicated WAL for system tables. https://blog.cloudera.com/blog/2013/10/hbase-0-96-0-released/
  58. Cell Level Security: HBase 0.98
  59. HBase 0.98 - Highlights • 1303 issues addressed • 1458 issues total in 0.98.x line • Cell Level Security (HBASE-6222/7663/7662) • Server-side Encryption (HBASE-7544) • WAL Throughput Improvements (HBASE-8755) • Reverse Scanner (HBASE-4811) • MapReduce over Snapshot Files (HBASE-8369) • Striped Compactions (HBASE-7667) • Throttle Replication (HBASE-9501)
  60. Cell Level Security • Added HFile v3, which can store arbitrary metadata in a cell, called tags • Also extended ACL checks to apply at the cell level • With this, visibility labels can be stored in tags • An API and CLI tools are provided that are akin to Accumulo’s, after which the feature is modeled • Additional encryption of data at rest further secures sensitive data. https://blogs.apache.org/hbase/entry/hbase_cell_security
  61. Visibility Labels. The API allows setting visibility by using expressions with “&”, “|”, and “!”, as well as “(” and “)”. For example, the label set { confidential, secret, topsecret, probationary } could be combined as ( secret | topsecret ) & !probationary. At runtime the expressions are evaluated against a user and then applied to each cell.
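To make the expression on slide 61 concrete, here is a minimal evaluator written in Python. It is a sketch for illustration and does not use or reproduce HBase's own parser. The function names are hypothetical, and the operator precedence ("!" binds tightest, then "&", then "|") is an assumption, since the slide does not state it.

```python
import re

TOKEN = re.compile(r"\s*([&|!()]|[A-Za-z_][A-Za-z0-9_]*)")

def tokenize(expression):
    """Split a visibility expression into operators, parentheses, and labels."""
    tokens, pos = [], 0
    expression = expression.rstrip()
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if not match:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def is_visible(expression, user_labels):
    """Return True if a user holding user_labels satisfies the expression."""
    tokens = tokenize(expression)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def parse_or():
        value = parse_and()
        while peek() == "|":
            advance()
            right = parse_and()
            value = value or right
        return value

    def parse_and():
        value = parse_unary()
        while peek() == "&":
            advance()
            right = parse_unary()
            value = value and right
        return value

    def parse_unary():
        token = peek()
        if token == "!":
            advance()
            return not parse_unary()
        if token == "(":
            advance()
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing ')'")
            advance()
            return value
        if token is None or token in {"&", "|", ")"}:
            raise ValueError("expected a label")
        advance()
        return token in user_labels

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result

expr = "( secret | topsecret ) & !probationary"
print(is_visible(expr, {"secret"}))                  # True
print(is_visible(expr, {"secret", "probationary"}))  # False
```

A cell tagged with this expression would be returned to the first user and filtered out for the second.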
  62. The Future… HBase 0.??
  63. HBase Future • Not much is written in stone yet • Master gets rewritten, and also META table handling • Built-in consensus (HBASE-10296) • Co-locate Master and META (HBASE-10569) • MTTR is further extended into interesting areas • Read replicas (HBASE-10070). It remains to be seen when 1.0.0 is released and what it contains. Your opinion counts!
  64. Questions? @larsgeorge
