SlideShare a Scribd company logo
1 of 28
Gary Helmling, Software Engineer
@gario
HBaseCon 2015 - May 7
Reusable Data Access Patterns
Agenda
• A brief look at data storage challenges
• How these challenges have influenced our work at Cask
• Exploration of Datasets and how they help
• Review of some common data access patterns and how Datasets apply
• A look underneath at the tech that makes it work
Data Storage Challenges
• Many HBase apps tend to solve similar problems with common needs
• Developers rebuild these solutions on their own, sometimes repeated for each app
• Effective schema (row key) design is hard
• byte[] conversions, no real native types
• No composite keys
• Some efforts - Orderly library and HBase types (HBASE-8089) - but nothing complete
Data Storage Challenges
• Easy to wind up with many similar, but slightly different implementations
Cask Data Application Platform
• CDAP is a scale-out application platform for Hadoop and HBase
• Enables easy scaling of application components
• Combines real-time and batch data processing
• Abstracts data storage details with Datasets
• Built from experience by Hadoop and HBase users and contributors
• CDAP and the components it builds on are open source (Apache License v2.0):
• Tephra, a scalable transaction engine for HBase and Hadoop
• Apache Twill, makes writing distributed apps on YARN as simple as running threads
• Encapsulate a data access pattern in a reusable, domain-specific API
• Establishes best practices in schema definition
• Abstract away underlying storage platform
HBase
CDAP Datasets
Table 1
Table 2
Dataset
App 2
App 1
How CDAP Datasets Help
• Reusable as data storage templates
• Easy sharing of stored data:
• Between applications
• Batch and real-time processing
• Integrated testing
• Extensible to create your own solutions
• Leverage common services to ease development:
• Transactions
• Readless increments
Reusing Datasets
• CDAP integrates lifecycle management
• Centralized metadata provides key configuration
• Transparent integration with other systems
• Hive metastore
• Map Reduce Input/OutputFormats
• Spark RDDs
Reusing Datasets
/**
* Counter Flowlet.
*/
public class Counter extends AbstractFlowlet {
@UseDataSet("wordCounts")
private KeyValueTable wordCountsTable;
…
}
Testing Datasets
• CDAP provides three storage backends: in-memory, local (standalone), distributed
In-Memory
NavigableMap
Local Temp Files
Local
LevelDB
Local Files
Distributed
HBase
HDFS
Dataset APIs
Development Lifecycle
Extending Datasets
• Existing datasets can be used as building blocks for your own patterns
• @EmbeddedDataset annotation injects wrapped instance in custom code
• Operations seamlessly wrapped in same transaction, no need to re-implement
public class UniqueCountTable extends AbstractDataset {
public UniqueCountTable(DatasetSpecification spec,
@EmbeddedDataset("unique") Table uniqueCountTable,
@EmbeddedDataset("entry") Table entryCountTable) {
…
}
}
HBase Data Patterns & Datasets
• Secondary Indexes
• Object-mapping
• Timeseries
• Data cube
Secondary Indexing
• Example use case: Entity storage - store customer records indexed by location
• HBase sorts data by row key
• Retrieving by a secondary value means storing a reference in another table
• Two types: global and local
• Global: efficient reads, but updates can be inconsistent
• Local: updates can be made consistent, but reads require contacting all servers
• IndexedTable Dataset performs global indexing
• Uses two tables: data table, index table
• Uses global transactions to keep updates consistent
Object-Mapping
• Example use case: Entity storage - easily store User instances for user profiles
• Easy serialization / deserialization of Java objects (think Hibernate)
• Maps property fields to HBase columns
• No defined schemas in HBase
• Accessing data by other means requires knowledge of object structure
• ObjectMappedTable Dataset: automatically persists object properties as columns in HBase
• Metadata managed by CDAP
• Stores the object's schema
• Automatically registers a table definition in Hive metastore with the same schema
Timeseries Data
• Example use case: any data organized around a time dimension
• System metrics
• Stock ticker data
• Sensor data - smart meters
• Constructing keys to avoid hotspotting and support efficient retrieval can be tricky
• TimeseriesTable Dataset: for each data key, stores a set of (timestamp, value) records
• Each stored value may have a set of tags used to filter results
• Each row represents a time bucket, individual values in that bucket stored as columns
• When reading data, projects entries back into a simple Iterator for easy consumption
Data Cube
• Example use case: Retail product sales reports, web analytics
• Stores “fact” entries, with aggregated values along configured combinations of the “fact”
dimensions
• Pre-aggregation necessary for efficient retrieval
• HBase increments can be costly in write-heavy workload
• Querying requires knowledge of pre-aggregation structure
• Reconfiguration can be difficult
• Need metadata around configuration
• Cube Dataset: uses readless increments for efficient aggregation
• Transactions keep pre-aggregations consistent
• Dataset framework manages metadata
Transactions
• Provided by Tephra (http://tephra.io), an open-source, distributed, scalable transaction engine
designed for HBase and Hadoop
• Each transaction assigned a time-based, globally unique transaction ID
• Transaction =
• Write Pointer: Timestamp for HBase writes
• Read pointer: Upper bound timestamp for reads
• Excludes: List of timestamps to exclude from reads
• HBase cell versions provide MVCC for Snapshot Isolation
Tephra Architecture
Tx Manager
(active)
Tx Manager
(standby)
HBase
RS 1
start / commit
Client
Client
Client
read / write
RS 2
Tx CP Tx CP
Transactional Writes
• Client sets write pointer (transaction ID) as timestamp on all writes
• Maintains set of change coordinates (row-level or column-level granularity depending on needs)
• On commit, client sends change set to Transaction Manager
• If any overlap with change sets of commits since transaction start, returns failure
• On commit failure, attempts to rollback any persisted changes
• Deletes use special markers instead of HBase deletes
• HBase deletes cannot be rolled back
Transactional Reads
• TransactionAwareHTable client sets the transaction state as an attribute on all read operations
• Get, Scan
• Transaction Processor RegionObserver translates transaction state into request properties
• max versions
• time range
• TransactionVisibilityFilter - excludes cells from:
• Invalid transactions (failed but not cleaned up)
• In-progress transactions
• “Delete” markers
• TTL’d cells
Increment Performance
• HBase increments perform read-modify-write cycle
• Happens server-side, but read operation still incurs overhead
• Read cost is unnecessary if we don't care about return value
• Not a great fit for write-heavy workloads
HBase Increments
row:col timestamp value
hello:count 1001 1
1002 2
Example: Word count on “Hello, hello, world”
counting 2nd “hello”
1. read: value = 1
2. modify: value += 1
3. write: value = 2
Readless Increments
• Readless increments store individual increment values for each write
• Mark cell value as Increment instead of normal Put
• Increment values are summed up on read
Readless Increments
row:col timestamp value
hello:count 1001 +1
1002 +1
1. write: increment = 1
Example: Word count on “Hello, hello, world”
counting 2nd “hello”
Readless Increments
row:col timestamp value
hello:count 1001 +1
1002 +1
Example: Word count on “Hello, hello, world”
reading current count for “hello”
read: 1 + 1 = 2 (total value)
Readless Increments
• Increments become simple writes
• Good for write-heavy workloads (many uses of increments)
• Reads incur extra cost from reading all versions up to latest full sum
• HBase RegionObserver merges increments on flush and compaction
• Limits cost of coalesce-on-read
• Work well with transactions: increments do not conflict!
Want to Learn More?
Open-source (Apache License v2)
Website:
http://cdap.io
Mailing List:
cdap-user@googlegroups.com
cdap-dev@googlegroups.com
Open-source (Apache License v2)
Website:
http://tephra.io
Mailing List:
tephra-user@googlegroups.com
tephra-dev@googlegroups.com
QUESTIONS?
Want to work on these and other challenges?
http://cask.co/careers/

More Related Content

Viewers also liked

HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!Cloudera, Inc.
 
HBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 MinutesHBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 MinutesCloudera, Inc.
 
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...Cloudera, Inc.
 
HBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBaseCon
 
Cross-Site BigTable using HBase
Cross-Site BigTable using HBaseCross-Site BigTable using HBase
Cross-Site BigTable using HBaseHBaseCon
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCCloudera, Inc.
 
HBaseCon 2012 | Building Mobile Infrastructure with HBase
HBaseCon 2012 | Building Mobile Infrastructure with HBaseHBaseCon 2012 | Building Mobile Infrastructure with HBase
HBaseCon 2012 | Building Mobile Infrastructure with HBaseCloudera, Inc.
 
HBaseCon 2013: Rebuilding for Scale on Apache HBase
HBaseCon 2013: Rebuilding for Scale on Apache HBaseHBaseCon 2013: Rebuilding for Scale on Apache HBase
HBaseCon 2013: Rebuilding for Scale on Apache HBaseCloudera, Inc.
 
HBaseCon 2012 | Scaling GIS In Three Acts
HBaseCon 2012 | Scaling GIS In Three ActsHBaseCon 2012 | Scaling GIS In Three Acts
HBaseCon 2012 | Scaling GIS In Three ActsCloudera, Inc.
 
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...Cloudera, Inc.
 
HBaseCon 2015: DeathStar - Easy, Dynamic, Multi-tenant HBase via YARN
HBaseCon 2015: DeathStar - Easy, Dynamic,  Multi-tenant HBase via YARNHBaseCon 2015: DeathStar - Easy, Dynamic,  Multi-tenant HBase via YARN
HBaseCon 2015: DeathStar - Easy, Dynamic, Multi-tenant HBase via YARNHBaseCon
 
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBaseHBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBaseHBaseCon
 
HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon
HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUponHBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon
HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUponCloudera, Inc.
 
HBaseCon 2015: State of HBase Docs and How to Contribute
HBaseCon 2015: State of HBase Docs and How to ContributeHBaseCon 2015: State of HBase Docs and How to Contribute
HBaseCon 2015: State of HBase Docs and How to ContributeHBaseCon
 
Bulk Loading in the Wild: Ingesting the World's Energy Data
Bulk Loading in the Wild: Ingesting the World's Energy DataBulk Loading in the Wild: Ingesting the World's Energy Data
Bulk Loading in the Wild: Ingesting the World's Energy DataHBaseCon
 
HBaseCon 2013: ETL for Apache HBase
HBaseCon 2013: ETL for Apache HBaseHBaseCon 2013: ETL for Apache HBase
HBaseCon 2013: ETL for Apache HBaseCloudera, Inc.
 
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big DataHBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big DataCloudera, Inc.
 
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...Cloudera, Inc.
 
HBaseCon 2013: Apache HBase Operations at Pinterest
HBaseCon 2013: Apache HBase Operations at PinterestHBaseCon 2013: Apache HBase Operations at Pinterest
HBaseCon 2013: Apache HBase Operations at PinterestCloudera, Inc.
 
HBase: Extreme Makeover
HBase: Extreme MakeoverHBase: Extreme Makeover
HBase: Extreme MakeoverHBaseCon
 

Viewers also liked (20)

HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
 
HBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 MinutesHBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 Minutes
 
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
 
HBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region Replicas
 
Cross-Site BigTable using HBase
Cross-Site BigTable using HBaseCross-Site BigTable using HBase
Cross-Site BigTable using HBase
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
 
HBaseCon 2012 | Building Mobile Infrastructure with HBase
HBaseCon 2012 | Building Mobile Infrastructure with HBaseHBaseCon 2012 | Building Mobile Infrastructure with HBase
HBaseCon 2012 | Building Mobile Infrastructure with HBase
 
HBaseCon 2013: Rebuilding for Scale on Apache HBase
HBaseCon 2013: Rebuilding for Scale on Apache HBaseHBaseCon 2013: Rebuilding for Scale on Apache HBase
HBaseCon 2013: Rebuilding for Scale on Apache HBase
 
HBaseCon 2012 | Scaling GIS In Three Acts
HBaseCon 2012 | Scaling GIS In Three ActsHBaseCon 2012 | Scaling GIS In Three Acts
HBaseCon 2012 | Scaling GIS In Three Acts
 
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
 
HBaseCon 2015: DeathStar - Easy, Dynamic, Multi-tenant HBase via YARN
HBaseCon 2015: DeathStar - Easy, Dynamic,  Multi-tenant HBase via YARNHBaseCon 2015: DeathStar - Easy, Dynamic,  Multi-tenant HBase via YARN
HBaseCon 2015: DeathStar - Easy, Dynamic, Multi-tenant HBase via YARN
 
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBaseHBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
 
HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon
HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUponHBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon
HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon
 
HBaseCon 2015: State of HBase Docs and How to Contribute
HBaseCon 2015: State of HBase Docs and How to ContributeHBaseCon 2015: State of HBase Docs and How to Contribute
HBaseCon 2015: State of HBase Docs and How to Contribute
 
Bulk Loading in the Wild: Ingesting the World's Energy Data
Bulk Loading in the Wild: Ingesting the World's Energy DataBulk Loading in the Wild: Ingesting the World's Energy Data
Bulk Loading in the Wild: Ingesting the World's Energy Data
 
HBaseCon 2013: ETL for Apache HBase
HBaseCon 2013: ETL for Apache HBaseHBaseCon 2013: ETL for Apache HBase
HBaseCon 2013: ETL for Apache HBase
 
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big DataHBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
 
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
 
HBaseCon 2013: Apache HBase Operations at Pinterest
HBaseCon 2013: Apache HBase Operations at PinterestHBaseCon 2013: Apache HBase Operations at Pinterest
HBaseCon 2013: Apache HBase Operations at Pinterest
 
HBase: Extreme Makeover
HBase: Extreme MakeoverHBase: Extreme Makeover
HBase: Extreme Makeover
 

More from HBaseCon

hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kuberneteshbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: Building online HBase cluster of Zhihu based on KubernetesHBaseCon
 
hbaseconasia2017: HBase on Beam
hbaseconasia2017: HBase on Beamhbaseconasia2017: HBase on Beam
hbaseconasia2017: HBase on BeamHBaseCon
 
hbaseconasia2017: HBase Disaster Recovery Solution at Huawei
hbaseconasia2017: HBase Disaster Recovery Solution at Huaweihbaseconasia2017: HBase Disaster Recovery Solution at Huawei
hbaseconasia2017: HBase Disaster Recovery Solution at HuaweiHBaseCon
 
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinteresthbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
hbaseconasia2017: Removable singularity: a story of HBase upgrade in PinterestHBaseCon
 
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程HBaseCon
 
hbaseconasia2017: Apache HBase at Netease
hbaseconasia2017: Apache HBase at Neteasehbaseconasia2017: Apache HBase at Netease
hbaseconasia2017: Apache HBase at NeteaseHBaseCon
 
hbaseconasia2017: HBase在Hulu的使用和实践
hbaseconasia2017: HBase在Hulu的使用和实践hbaseconasia2017: HBase在Hulu的使用和实践
hbaseconasia2017: HBase在Hulu的使用和实践HBaseCon
 
hbaseconasia2017: 基于HBase的企业级大数据平台
hbaseconasia2017: 基于HBase的企业级大数据平台hbaseconasia2017: 基于HBase的企业级大数据平台
hbaseconasia2017: 基于HBase的企业级大数据平台HBaseCon
 
hbaseconasia2017: HBase at JD.com
hbaseconasia2017: HBase at JD.comhbaseconasia2017: HBase at JD.com
hbaseconasia2017: HBase at JD.comHBaseCon
 
hbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecturehbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architectureHBaseCon
 
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huaweihbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
hbaseconasia2017: Ecosystems with HBase and CloudTable service at HuaweiHBaseCon
 
hbaseconasia2017: HBase Practice At XiaoMi
hbaseconasia2017: HBase Practice At XiaoMihbaseconasia2017: HBase Practice At XiaoMi
hbaseconasia2017: HBase Practice At XiaoMiHBaseCon
 
hbaseconasia2017: hbase-2.0.0
hbaseconasia2017: hbase-2.0.0hbaseconasia2017: hbase-2.0.0
hbaseconasia2017: hbase-2.0.0HBaseCon
 
HBaseCon2017 Democratizing HBase
HBaseCon2017 Democratizing HBaseHBaseCon2017 Democratizing HBase
HBaseCon2017 Democratizing HBaseHBaseCon
 
HBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon2017 Removable singularity: a story of HBase upgrade in PinterestHBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon2017 Removable singularity: a story of HBase upgrade in PinterestHBaseCon
 
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBaseHBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBaseHBaseCon
 
HBaseCon2017 Transactions in HBase
HBaseCon2017 Transactions in HBaseHBaseCon2017 Transactions in HBase
HBaseCon2017 Transactions in HBaseHBaseCon
 
HBaseCon2017 Highly-Available HBase
HBaseCon2017 Highly-Available HBaseHBaseCon2017 Highly-Available HBase
HBaseCon2017 Highly-Available HBaseHBaseCon
 
HBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at DidiHBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at DidiHBaseCon
 
HBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon2017 gohbase: Pure Go HBase ClientHBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon2017 gohbase: Pure Go HBase ClientHBaseCon
 

More from HBaseCon (20)

hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kuberneteshbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
 
hbaseconasia2017: HBase on Beam
hbaseconasia2017: HBase on Beamhbaseconasia2017: HBase on Beam
hbaseconasia2017: HBase on Beam
 
hbaseconasia2017: HBase Disaster Recovery Solution at Huawei
hbaseconasia2017: HBase Disaster Recovery Solution at Huaweihbaseconasia2017: HBase Disaster Recovery Solution at Huawei
hbaseconasia2017: HBase Disaster Recovery Solution at Huawei
 
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinteresthbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
 
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
 
hbaseconasia2017: Apache HBase at Netease
hbaseconasia2017: Apache HBase at Neteasehbaseconasia2017: Apache HBase at Netease
hbaseconasia2017: Apache HBase at Netease
 
hbaseconasia2017: HBase在Hulu的使用和实践
hbaseconasia2017: HBase在Hulu的使用和实践hbaseconasia2017: HBase在Hulu的使用和实践
hbaseconasia2017: HBase在Hulu的使用和实践
 
hbaseconasia2017: 基于HBase的企业级大数据平台
hbaseconasia2017: 基于HBase的企业级大数据平台hbaseconasia2017: 基于HBase的企业级大数据平台
hbaseconasia2017: 基于HBase的企业级大数据平台
 
hbaseconasia2017: HBase at JD.com
hbaseconasia2017: HBase at JD.comhbaseconasia2017: HBase at JD.com
hbaseconasia2017: HBase at JD.com
 
hbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecturehbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecture
 
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huaweihbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
 
hbaseconasia2017: HBase Practice At XiaoMi
hbaseconasia2017: HBase Practice At XiaoMihbaseconasia2017: HBase Practice At XiaoMi
hbaseconasia2017: HBase Practice At XiaoMi
 
hbaseconasia2017: hbase-2.0.0
hbaseconasia2017: hbase-2.0.0hbaseconasia2017: hbase-2.0.0
hbaseconasia2017: hbase-2.0.0
 
HBaseCon2017 Democratizing HBase
HBaseCon2017 Democratizing HBaseHBaseCon2017 Democratizing HBase
HBaseCon2017 Democratizing HBase
 
HBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon2017 Removable singularity: a story of HBase upgrade in PinterestHBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
 
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBaseHBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
 
HBaseCon2017 Transactions in HBase
HBaseCon2017 Transactions in HBaseHBaseCon2017 Transactions in HBase
HBaseCon2017 Transactions in HBase
 
HBaseCon2017 Highly-Available HBase
HBaseCon2017 Highly-Available HBaseHBaseCon2017 Highly-Available HBase
HBaseCon2017 Highly-Available HBase
 
HBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at DidiHBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at Didi
 
HBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon2017 gohbase: Pure Go HBase ClientHBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon2017 gohbase: Pure Go HBase Client
 

Recently uploaded

(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfCionsystems
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 

Recently uploaded (20)

(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 

HBaseCon 2015: Reusable Data Access Patterns with CDAP Datasets

  • 1. Gary Helmling, Software Engineer @gario HBaseCon 2015 - May 7 Reusable Data Access Patterns
  • 2. Agenda • A brief look at data storage challenges • How these challenges have influenced our work at Cask • Exploration of Datasets and how they help • Review of some common data access patterns and how Datasets apply • A look underneath at the tech that makes it work
  • 3. Data Storage Challenges • Many HBase apps tend to solve similar problems with common needs • Developers rebuild these solutions on their own, sometimes repeated for each app • Effective schema (row key) design is hard • byte[] conversions, no real native types • No composite keys • Some efforts - Orderly library and HBase types (HBASE-8089) - but nothing complete
  • 4. Data Storage Challenges • Easy to wind up with many similar, but slightly different implementations
  • 5. Cask Data Application Platform • CDAP is a scale-out application platform for Hadoop and HBase • Enables easy scaling of application components • Combines real-time and batch data processing • Abstracts data storage details with Datasets • Built from experience by Hadoop and HBase users and contributors • CDAP and the components it builds on are open source (Apache License v2.0): • Tephra, a scalable transaction engine for HBase and Hadoop • Apache Twill, makes writing distributed apps on YARN as simple as running threads
  • 6. • Encapsulate a data access pattern in a reusable, domain-specific API • Establishes best practices in schema definition • Abstract away underlying storage platform HBase CDAP Datasets Table 1 Table 2 Dataset App 2 App 1
  • 7. How CDAP Datasets Help • Reusable as data storage templates • Easy sharing of stored data: • Between applications • Batch and real-time processing • Integrated testing • Extensible to create your own solutions • Leverage common services to ease development: • Transactions • Readless increments
  • 8. Reusing Datasets • CDAP integrates lifecycle management • Centralized metadata provides key configuration • Transparent integration with other systems • Hive metastore • Map Reduce Input/OutputFormats • Spark RDDs
  • 9. Reusing Datasets /** * Counter Flowlet. */ public class Counter extends AbstractFlowlet { @UseDataSet("wordCounts") private KeyValueTable wordCountsTable; … }
  • 10. Testing Datasets • CDAP provides three storage backends: in-memory, local (standalone), distributed In-Memory NavigableMap Local Temp Files Local LevelDB Local Files Distributed HBase HDFS Dataset APIs Development Lifecycle
  • 11. Extending Datasets • Existing datasets can be used as building blocks for your own patterns • @EmbeddedDataset annotation injects wrapped instance in custom code • Operations seamlessly wrapped in same transaction, no need to re-implement public class UniqueCountTable extends AbstractDataset { public UniqueCountTable(DatasetSpecification spec, @EmbeddedDataset("unique") Table uniqueCountTable, @EmbeddedDataset("entry") Table entryCountTable) { … } }
  • 12. HBase Data Patterns & Datasets • Secondary Indexes • Object-mapping • Timeseries • Data cube
  • 13. Secondary Indexing • Example use case: Entity storage - store customer records indexed by location • HBase sorts data by row key • Retrieving by a secondary value means storing a reference in another table • Two types: global and local • Global: efficient reads, but updates can be inconsistent • Local: updates can be made consistent, but reads require contacting all servers • IndexedTable Dataset performs global indexing • Uses two tables: data table, index table • Uses global transactions to keep updates consistent
  • 14. Object-Mapping • Example use case: Entity storage - easily store User instances for user profiles • Easy serialization / deserialization of Java objects (think Hibernate) • Maps property fields to HBase columns • No defined schemas in HBase • Accessing data by other means requires knowledge of object structure • ObjectMappedTable Dataset: automatically persists object properties as columns in HBase • Metadata managed by CDAP • Stores the object's schema • Automatically registers a table definition in Hive metastore with the same schema
  • 15. Timeseries Data • Example use case: any data organized around a time dimension • System metrics • Stock ticker data • Sensor data - smart meters • Constructing keys to avoid hotspotting and support efficient retrieval can be tricky • TimeseriesTable Dataset: for each data key, stores a set of (timestamp, value) records • Each stored value may have a set of tags used to filter results • Each row represents a time bucket, individual values in that bucket stored as columns • When reading data, projects entries back into a simple Iterator for easy consumption
  • 16. Data Cube • Example use case: Retail product sales reports, web analytics • Stores “fact” entries, with aggregated values along configured combinations of the “fact” dimensions • Pre-aggregation necessary for efficient retrieval • HBase increments can be costly in write-heavy workload • Querying requires knowledge of pre-aggregation structure • Reconfiguration can be difficult • Need metadata around configuration • Cube Dataset: uses readless increments for efficient aggregation • Transactions keep pre-aggregations consistent • Dataset framework manages metadata
  • 17. Transactions • Provided by Tephra (http://tephra.io), an open-source, distributed, scalable transaction engine designed for HBase and Hadoop • Each transaction assigned a time-based, globally unique transaction ID • Transaction = • Write Pointer: Timestamp for HBase writes • Read pointer: Upper bound timestamp for reads • Excludes: List of timestamps to exclude from reads • HBase cell versions provide MVCC for Snapshot Isolation
  • 18. Tephra Architecture Tx Manager (active) Tx Manager (standby) HBase RS 1 start / commit Client Client Client read / write RS 2 Tx CP Tx CP
  • 19. Transactional Writes • Client sets write pointer (transaction ID) as timestamp on all writes • Maintains set of change coordinates (row-level or column-level granularity depending on needs) • On commit, client sends change set to Transaction Manager • If any overlap with change sets of commits since transaction start, returns failure • On commit failure, attempts to rollback any persisted changes • Deletes use special markers instead of HBase deletes • HBase deletes cannot be rolled back
  • 20. Transactional Reads • TransactionAwareHTable client sets the transaction state as an attribute on all read operations • Get, Scan • Transaction Processor RegionObserver translates transaction state into request properties • max versions • time range • TransactionVisibilityFilter - excludes cells from: • Invalid transactions (failed but not cleaned up) • In-progress transactions • “Delete” markers • TTL’d cells
  • 21. Increment Performance • HBase increments perform read-modify-write cycle • Happens server-side, but read operation still incurs overhead • Read cost is unnecessary if we don't care about return value • Not a great fit for write-heavy workloads
  • 22. HBase Increments row:col timestamp value hello:count 1001 1 1002 2 Example: Word count on “Hello, hello, world” counting 2nd “hello” 1. read: value = 1 2. modify: value += 1 3. write: value = 2
  • 23. Readless Increments • Readless increments store individual increment values for each write • Mark cell value as Increment instead of normal Put • Increment values are summed up on read
  • 24. Readless Increments row:col timestamp value hello:count 1001 +1 1002 +1 1. write: increment = 1 Example: Word count on “Hello, hello, world” counting 2nd “hello”
  • 25. Readless Increments row:col timestamp value hello:count 1001 +1 1002 +1 Example: Word count on “Hello, hello, world” reading current count for “hello” read: 1 + 1 = 2 (total value)
  • 26. Readless Increments • Increments become simple writes • Good for write-heavy workloads (many uses of increments) • Reads incur extra cost from reading all versions up to latest full sum • HBase RegionObserver merges increments on flush and compaction • Limits cost of coalesce-on-read • Work well with transactions: increments do not conflict!
  • 27. Want to Learn More? Open-source (Apache License v2) Website: http://cdap.io Mailing List: cdap-user@googlegroups.com cdap-dev@googlegroups.com Open-source (Apache License v2) Website: http://tephra.io Mailing List: tephra-user@googlegroups.com tephra-dev@googlegroups.com
  • 28. QUESTIONS? Want to work on these and other challenges? http://cask.co/careers/