
Apache HBase - Introduction & Use Cases

Apache HBase™ is the Hadoop database, a distributed, scalable, big data store. Use Apache HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables (billions of rows × millions of columns) atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable.

This talk introduces Apache HBase and gives an overview of columnar databases. We will also look at how Facebook is using HBase today, and cover HBase security, Apache Phoenix, and Apache Slider.

  • Note on the CAP slide: CAP is frequently misunderstood as if a distributed system had to abandon one of the three guarantees at all times. In fact, the choice is really between consistency and availability, and only when a network partition or failure happens; at all other times, no trade-off has to be made.


  1. Apache HBase: Introduction & Use Cases (Subash D'Souza)
  2. What is HBase?
     • HBase is an open-source, distributed, sorted map modeled after Google's Bigtable
     • A NoSQL solution built atop Apache Hadoop
     • A top-level Apache project
  3. CAP Theorem
     • Consistency: all nodes see the same data at the same time
     • Availability: every request receives a response indicating whether it succeeded or failed
     • Partition tolerance: the system continues to operate despite arbitrary message loss or failure of part of the system
     The theorem is often paraphrased as "pick any two of the three", but the real trade-off is narrower: the choice between consistency and availability only has to be made while a network partition is in progress; at all other times no trade-off is required.
  4. Ref: http://blog.nahurst.com/visual-guide-to-nosql-systems
  5. Usage Scenarios
     • Lots of data: hundreds of gigabytes to petabytes
     • High throughput: thousands of records per second
     • Scalable cache capacity: adding nodes adds to the available cache
     • Data layout: excels at key lookups, with no penalty for sparse columns
  6. Column-Oriented Databases
     • HBase belongs to the family of databases known as column-oriented
     • Column-oriented databases store their data grouped by columns
     • Storing values on a per-column basis rests on the assumption that, for specific queries, not all of the values are needed
     • Reduced I/O is one of the primary reasons for this layout
     • Specialized algorithms (for example, delta and/or prefix compression) selected based on the type of the column, i.e. on the data stored, can yield huge improvements in compression ratios; better ratios result in more efficient bandwidth usage
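A toy Python sketch of the idea (the table data and the run-length encoder below are invented for illustration; this is not HBase's storage format): grouping values by column means a query touching one column reads one compact, homogeneous run, which also compresses far better than interleaved row data.

```python
# Invented example table, stored row-wise vs. column-wise.
rows = [
    {"user": "alice", "country": "US", "plan": "free"},
    {"user": "bob",   "country": "US", "plan": "free"},
    {"user": "carol", "country": "US", "plan": "paid"},
]

# Row-oriented layout: values of different columns interleaved on disk.
row_layout = [v for r in rows for v in r.values()]

# Column-oriented layout: each column stored contiguously.
col_layout = {c: [r[c] for r in rows] for c in rows[0]}

# A query touching only "plan" reads one compact column, not every row.
print(col_layout["plan"])            # ['free', 'free', 'paid']

# Homogeneous columns compress well; a toy run-length encoding:
def rle(values):
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1][1] += 1          # extend the current run
        else:
            out.append([v, 1])       # start a new run
    return out

print(rle(col_layout["country"]))    # [['US', 3]]
```

The "country" column collapses to a single run, while the same values scattered through a row layout would not.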
  7. HBase as a Column-Oriented Database
     • HBase is not a column-oriented database in the typical RDBMS sense, but it does use an on-disk column storage format
     • This is where most of the similarities end: although HBase stores data on disk in a column-oriented format, it is distinctly different from traditional columnar databases
     • Whereas columnar databases excel at providing real-time analytical access to data, HBase excels at providing key-based access to a specific cell of data, or to a sequential range of cells
  8. HBase and Hadoop
     • Hadoop excels at storing data of arbitrary, semi-structured, or even unstructured formats, since it lets you decide how to interpret the data at analysis time; you can change the way you classify the data at any time: once you have updated the algorithms, you simply run the analysis again
     • HBase sits atop Hadoop, using the best features of HDFS such as scalability and data replication
  9. When Not to Use HBase
     • Unknown data access patterns: HBase follows a data-centric model rather than a relationship-centric one, so building an ERD model for HBase does not make sense
     • Small amounts of data: just use an RDBMS
     • Limited or no random reads and writes: just use HDFS directly
  10. HBase Use Cases - Facebook
     • One of the earliest and largest users of HBase
     • Facebook's messaging platform was built atop HBase in 2010
     • Chosen for its high write throughput and low-latency random reads
     • Other deciding features: horizontal scalability, strong consistency, and high availability via automatic failover
  11. HBase Use Cases - Facebook
     • In addition to online transaction processing workloads like messages, it is also used for online analytic processing workloads where large data scans are prevalent
     • Also used in production by other Facebook services, including the internal monitoring system, the recently launched Nearby Friends feature, search indexing, streaming data analysis, and data scraping for their internal data warehouses
  12. Seek vs. Transfer
     • One of the fundamental differences between a typical RDBMS and NoSQL stores is the use of B/B+ trees versus log-structured merge (LSM) trees, the latter being the basis of Google's Bigtable
  13. B+ Trees
     • B+ trees allow efficient insertion, lookup, and deletion of records identified by keys
     • They are dynamic, multilevel indexes with lower and upper bounds on entries per segment or page
     • This allows a higher fanout than binary trees, resulting in a lower number of I/O operations
     • Range scans are also very efficient
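The fanout argument can be made concrete with some back-of-the-envelope arithmetic (the key count and page size below are hypothetical, chosen only to illustrate the scale): tree height approximates the number of I/O operations per lookup, and height grows with log base fanout.

```python
import math

# Hypothetical index: 100 million keys; a B+ tree page holding ~500 keys.
n = 100_000_000

binary_height = math.ceil(math.log2(n))        # height of a binary tree
bplus_height  = math.ceil(math.log(n, 500))    # height with fanout 500

print(binary_height)   # 27 node visits per lookup
print(bplus_height)    # 3 page reads per lookup
```

Three page reads versus twenty-seven node visits is why on-disk indexes favor wide pages over binary nodes.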
  14. LSM Trees
     • Incoming data is first stored in a logfile, completely sequentially
     • Once the log has the modification saved, it updates an in-memory store
     • Once enough updates have accrued in the in-memory store, a sorted list of key -> record pairs is flushed to disk, creating store files
     • At this point the updates in the log can be deleted, since the modifications have been persisted
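The four steps above can be sketched as a toy model (illustrative only; names and the flush threshold are invented, and real LSM stores are far more involved):

```python
log = []                 # 1. sequential write-ahead log
memstore = {}            # 2. in-memory mutable store
store_files = []         # 3. on-"disk" immutable sorted runs
FLUSH_THRESHOLD = 3      # tiny threshold, just for the demo

def put(key, value):
    log.append((key, value))             # append sequentially to the log
    memstore[key] = value                # then update the in-memory store
    if len(memstore) >= FLUSH_THRESHOLD:
        # flush a sorted key -> record run, creating a store file
        store_files.append(sorted(memstore.items()))
        memstore.clear()
        log.clear()                      # 4. persisted, so the log can go

for k, v in [("b", 1), ("a", 2), ("c", 3), ("d", 4)]:
    put(k, v)

print(store_files)   # [[('a', 2), ('b', 1), ('c', 3)]]
print(memstore)      # {'d': 4}
```

Note that every disk write here is sequential (a log append or a sorted-run flush); no random in-place updates ever happen.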
  15. Fundamental Difference
     B+ trees on disk drives:
     • Too many modifications force costly optimizations
     • More data at random locations causes faster fragmentation
     • Updates and deletes happen at disk seek rates rather than disk transfer rates
  16. Fundamental Difference (Contd)
     LSM trees:
     • Work at disk transfer rates
     • Scale better to handle large amounts of data
     • Guarantee a consistent insert rate
     • Transform random writes into sequential writes using logfiles plus an in-memory store
     • Reads are independent of writes, so there is no contention between the two
  17. HBase Basics
     • When data is added to HBase, it is first written to the write-ahead log (WAL), called the HLog
     • Once the write is done, the data is written to an in-memory store called the MemStore
     • Once the MemStore exceeds a certain threshold, it is flushed to disk as an HFile
     • Over time, HBase merges smaller HFiles into larger ones; this process is called compaction
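The compaction step can be illustrated with a toy merge (a sketch of the idea, not HBase internals; file contents are invented): several sorted HFiles are combined into one, and when a key appears in more than one file, the newer version wins.

```python
# Toy "HFiles": sorted (key, value) runs, listed oldest-first.
hfile_1 = [("a", "v1"), ("c", "v1")]          # older flush
hfile_2 = [("a", "v2"), ("b", "v1")]          # newer flush

def compact(*hfiles):
    newest = {}
    # later (newer) files overwrite earlier entries for the same key
    for hfile in hfiles:
        newest.update(hfile)
    # the merged result is again a single sorted run
    return sorted(newest.items())

print(compact(hfile_1, hfile_2))
# [('a', 'v2'), ('b', 'v1'), ('c', 'v1')]
```

After compaction a read consults one file instead of two, which is the point: compaction trades background write work for cheaper reads.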
  18. Ref: https://www.altamiracorp.com/blog/employee-posts/handling-big-data-with-hbase-part-3-architecture-overview
  19. Facebook - HydraBase
     • In HBase, when a regionserver fails, all regions hosted by that regionserver are moved to another regionserver
     • Depending on how HBase has been set up, this typically entails splitting and replaying the WAL files, which takes time and lengthens the failover
     • HydraBase differs here: instead of a region being hosted by a single regionserver, it is hosted by a set of regionservers
     • When a regionserver fails, standby regionservers are ready to take over
  20. Facebook - HydraBase
     • The standby regionservers can be spread across different racks or even data centers, providing availability
     • The set of regionservers serving each region forms a quorum; each quorum has a leader that services read and write requests from the client
     • HydraBase uses the Raft consensus protocol to ensure consistency across the quorum
     • With a quorum of 2F+1, HydraBase can tolerate up to F failures
     • This increases reliability from 99.99% to 99.999%, i.e. roughly 5 minutes of downtime per year
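The 2F+1 and downtime figures are simple arithmetic, sketched below (F = 2 is an invented example; the quorum rule itself is standard majority-consensus math, not HydraBase-specific code):

```python
# Majority quorum: more than half the replicas must agree.
def majority(n):
    return n // 2 + 1

F = 2                                # failures we want to survive
replicas = 2 * F + 1                 # 5 regionservers in the quorum
survivors = replicas - F             # 3 remain after F failures

assert survivors >= majority(replicas)            # majority still reachable
assert replicas - (F + 1) < majority(replicas)    # F+1 failures: quorum lost

# Downtime implied by a given availability, in minutes per year:
minutes_per_year = 365 * 24 * 60
print(round((1 - 0.99999) * minutes_per_year, 1))   # ~5.3 minutes at "five nines"
```

So with 5 replicas, any 3 surviving nodes still form a majority, which is exactly why 2F+1 tolerates F failures.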
  21. HBase Users - Flurry
     • Mobile analytics, monetization, and advertising company founded in 2005
     • Recently acquired by Yahoo
     • 2 data centers with 2 clusters each, bidirectional replication
     • 1,000 slave nodes per cluster: 32 GB RAM, 4 drives (1 or 2 TB), 1 GigE, dual quad-core processors x 2 HT = 16 logical processors
     • ~30 tables, 250k regions, 430 TB (after LZO compression)
     • 2 big tables make up roughly 90% of that: one wide table with 3 column families and 4 billion rows with up to 1 million cells per row; the other a tall table with 1 column family, 1 trillion rows, and 1 cell per row
  22. HBase Security - 0.98
     • Cell tags: all values in HBase are now written as cells, which can also carry an arbitrary number of tags, such as metadata
     • Cell ACLs: enable checking of (R)ead, (W)rite, e(X)ecute, (A)dmin, and (C)reate permissions
     • Cell labels: visibility-expression support via a new security coprocessor
     • Transparent encryption: data is encrypted on disk; HFiles are encrypted when written and decrypted when read
     • RBAC: implemented using the Hadoop Group Mapping Service and ACLs
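A toy version of a per-cell ACL check (illustrative only; the users, the ACL table, and the check function are invented, and HBase's real AccessController works quite differently). The permission set is the R/W/X/A/C set named above.

```python
from enum import Flag, auto

class Perm(Flag):
    READ = auto()
    WRITE = auto()
    EXEC = auto()
    ADMIN = auto()
    CREATE = auto()

# Hypothetical ACL attached to one cell: user -> granted permissions.
cell_acl = {"alice": Perm.READ | Perm.WRITE, "bob": Perm.READ}

def allowed(user, wanted):
    # Grant access only if the user's flags include the wanted permission.
    return bool(cell_acl.get(user, Perm(0)) & wanted)

print(allowed("alice", Perm.WRITE))   # True
print(allowed("bob", Perm.WRITE))     # False
```

Storing the grants as bit flags on the cell is what makes per-cell (rather than per-table) authorization cheap to evaluate on every access.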
  23. Apache Phoenix
     • SQL layer atop HBase: a query engine, metadata repository, and embedded JDBC driver; a top-level Apache project, currently HBase-only
     • Fastest way to access HBase data: HBase-specific push-down, compiles queries into native, direct HBase calls (no MapReduce), executes scans in parallel
     • Integrates with Pig, Flume, and Sqoop
     • Phoenix maps the HBase data model to the relational world
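The gist of that relational mapping can be sketched as follows (simplified and hypothetical; Phoenix's actual row-key encoding, type serialization, and column-family handling are more involved): primary-key columns concatenate into the HBase rowkey, and the remaining columns become column qualifiers.

```python
# Toy relational row and its primary key (invented example).
row = {"host": "web1", "ts": 1700000000, "cpu": 0.42}

def to_kv(table_pk, row, family="0"):
    # PK columns joined (here with a NUL separator) form the rowkey;
    # non-PK columns become (rowkey, family, qualifier) -> value cells.
    rowkey = "\x00".join(str(row[c]) for c in table_pk)
    return {(rowkey, family, col): val
            for col, val in row.items() if col not in table_pk}

print(to_kv(["host", "ts"], row))
# {('web1\x001700000000', '0', 'cpu'): 0.42}
```

Because the rowkey starts with the leading PK column, a SQL `WHERE host = ...` predicate can be pushed down as an HBase rowkey range scan instead of a full table scan.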
  24. Ref: Taming HBase with Apache Phoenix and SQL, HBaseCon 2014
  25. OpenTSDB 2.0
     • Distributed, scalable time-series database on top of HBase
     • Time series: data points for an identity over time
     • Stores trillions of data points, never loses precision, scales using HBase
     • Good for system monitoring and measurement (servers and networks), sensor data (the Internet of Things, SCADA), financial data, results of scientific experiments, etc.
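The core idea can be sketched in a few lines (a simplification with invented metric names; OpenTSDB's real schema packs metric and tag IDs, plus salting, into binary row keys): keying each point by (metric, timestamp) means "one metric over a time range" maps onto a contiguous slice of sorted keys. In HBase that becomes a cheap rowkey range scan; here a linear filter over a sorted list stands in for it.

```python
# Toy data points keyed by (metric, timestamp), kept sorted like HBase rows.
points = sorted([
    (("sys.cpu", 1000), 0.30),
    (("sys.cpu", 1060), 0.35),
    (("sys.mem", 1000), 0.70),
    (("sys.cpu", 1120), 0.90),
])

def scan(metric, t_start, t_end):
    # Stand-in for an HBase range scan over [metric+t_start, metric+t_end).
    return [(k[1], v) for k, v in points
            if k[0] == metric and t_start <= k[1] < t_end]

print(scan("sys.cpu", 1000, 1100))   # [(1000, 0.3), (1060, 0.35)]
```

Because all points for one metric sort adjacently, the range query never touches unrelated metrics, which is what lets the schema scale to trillions of points.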
  26. OpenTSDB 2.0
     • Users: OVH (3rd-largest cloud/hosting provider) monitors everything from networking, temperature, and voltage to resource utilization
     • Yahoo uses it to monitor application performance and statistics
     • Arista Networks uses it for high-performance networking
     • Other users include Pinterest, eBay, Box, etc.
  27. Apache Slider (Incubating)
     • A YARN application to deploy existing distributed applications on YARN, monitor them, and make them larger or smaller as desired, even while the application is running
     • An incubating Apache project; similar in spirit to Tez for Hive/Pig
     • Applications can be stopped ("frozen") and later restarted ("thawed"); users can create and run multiple instances of applications, even with different application versions if needed
     • Applications such as HBase, Accumulo, and Storm can run atop it
  28. Thanks!!
     • Credits: Apache, Cloudera, Hortonworks, MapR, Facebook, Flurry & HBaseCon
     • @sawjd22
     • www.linkedin.com/in/sawjd/
     • Q & A
