Keynote: The Future of Apache HBase

Moderated by Lars Hofhansl (Salesforce), with Matteo Bertozzi (Cloudera), John Leach (Splice Machine), Maxim Lukiyanov (Microsoft), Matt Mullins (Facebook), and Carter Page (Google)

The future of HBase, via a variety of viewpoints.

Published in: Software

  • Top image from HBaseCon website.
    Bottom image is public domain: https://en.wikipedia.org/wiki/Democratic_National_Committee#/media/File:Chicago_delegation_to_the_January_8,_1912_meeting_of_the_Democratic_National_Committee.jpg
  • Images created by Google.
  • Open source results in richer technology stacks.

    (Image free for commercial use https://pixabay.com/en/zebra-gnu-giraffe-africa-namibia-1170177/)
  • Keynote: The Future of Apache HBase

    1. The Future of HBase. Lars Hofhansl: Principal Architect & VP, Salesforce; HBase PMC & Committer; Phoenix PMC & Committer; Apache Member
    2. Who’s driving it?
    3. It’s us* We define our future. (* people in this room, developers, contributors, comments on mailing list, committers, PMC members, etc.)
    4. Kafka: Confluent. Spark: Databricks. Cassandra: Datastax. HBase: ????
    5. Kafka: Confluent. Spark: Databricks. Cassandra: Datastax. HBase: Cloudera. Hortonworks.
    6. Kafka: Confluent. Spark: Databricks. Cassandra: Datastax. HBase: Adobe, Alibaba, Apple, Cask, Cloudera, Facebook, Google, Hortonworks, Huawei, HubSpot, IBM, Intel, NGDATA, Salesforce, The Gap, Twitter, Xiaomi, Yahoo!, etc, etc….
    7. Cloud?
    8. Cloud? Carter Page, Google
    9. What does Cloud mean for HBase’s future?
    10. More problems? More work to do? Rather...
    11. Free stuff?
    12. Free stuff engineering.
    13. HBase on GCP: HBase on Dataproc; HBase on Cloud Bigtable
    14. HBase on GCP: HBase on Dataproc; HBase on Cloud Bigtable; HBase Client (>= v1.0)
    15. Why is HBase the client for Cloud Bigtable?
    16. Why HBase? #1: Open source is the de facto way that standards are defined now. Committers, not committees.
    17. Why HBase? #2: HBase is indisputably the best open source implementation of the Bigtable architecture. Bigtable HBase
    18. Why HBase? #3: Because supporting an ecosystem is the right thing. Technology needs a rich community to flourish.
    19. Supporting how?
    20. Rich abstractions on top of HBase. Future big data customers need fully formed solutions: a great graph database, a great IoT solution, a great geo solution, and so on... Open source. Each with the scale of HBase. And we want to help, with engineering time and code.
    21. But there’s already a great _______ HBase solution that could use some love!
    22. Please email me with ideas. (Really.) carterp (at) google.com Have a great open source HBase integration that could use some Google engineering help?
    23. Cloud? Maxim Lukiyanov, Microsoft
    24. CPU utilization: typical picture in pure key/value stores
    25. Unutilized CPU! Run something else on it (Analytics on HBase, anybody?). Give it back to the cloud.
    26. HBase file system abstraction
    27. HBase file system abstraction
    28. HBase in the cloud
    29. HBase in the cloud
    30. OLAP?
    31. OLTP?
    32. Database?
    33. “What is your biggest mistake as an engineer? Not putting distributed transactions in BigTable. If you wanted to update more than one row you had to roll your own transaction protocol. It wasn’t put in because it would have complicated the system design. In retrospect lots of teams wanted that capability and built their own with different degrees of success. We should have implemented transactions in the core system. It would have been useful internally as well. Spanner fixed this problem by adding transactions.” - Jeff Dean, March 7th, 2016
    34. John Leach, Founder & CTO. Call for Founders!!! Be part of bringing Splice Machine to Open Source. Splice Machine, the first dual-engine RDBMS on HBase and Spark, is headed to open source, and we are looking for some key individuals to be founders to support the transition.
    35. Multi-Tenant Mixed Workloads
    36. Current Storage Challenges: Lack of Transactions (see Dean Quote); Single Write Optimized Store: Log Structured Merge Tree; Limited Metadata Facilities. Current Execution Challenges: OLTP: Limited/Rigid Concurrency Model; OLAP: Foggy Execution Model; Remote Client Scans (Slow); Internal Scans via Coprocessor (In JVM); Custom Rolled Data Flow Engine (Yikes). Maintenance Operations: Do not talk about Fight Club (Compactions)
    37. Future Storage Approach (Code Named: Janus): Typed Storage System; JSON first class citizen; Serde based on Spark UnsafeRow; Hierarchical, Partition Aware Transactions; Partitions: Within and Across Data Centers; Write Optimized Store (Optional) LSM Tree; Read Optimized Store (Optional) Positional Delta Trees, Columnar; Full Metadata Facilities (https://datasketches.github.io/): Theta Sketch, Quantiles, Frequent Items
    38. Future Execution Approach (Dual Engine). All Execution Engines: Statistical Hooks (Sketching Algorithms). OLAP Execution Engines: Spark, Flink, MapReduce, Impala etc.; YARN, Fair Scheduling; Transactional Input/Output Formats; File System based with incremental memstore deltas; Columnar Support; Arrow, Calcite; Perform Compactions (yes, it works). OLTP Execution Engines: Row Based Storage, Remote HBase Scans
    39. Modern Hardware?
    40. Matt Mullins, Facebook; Lars Hofhansl, Salesforce
    41. Salesforce Single-SKU project: We used to have 30+ different SKUs. Now there is one SKU (almost) for all projects. 1U, 10GbE everywhere, fat-tree networking (no/little oversubscription). Same SKU used by all projects. Very few exceptions: high storage SKU and high compute SKU. Vendor: varies. Allows us to order/repurpose in large quantities and then assign to projects. Compromise for individual projects, but cheaper overall. Fat-tree network -> location independence
    42. HBase 2.0. Matteo Bertozzi, Cloudera
    43. HBase 2.0: We are trying to avoid another singularity (like 0.94 to 0.96). (Almost) rolling upgradable from 1.x. Wire compatible with 1.x. Possible features: HBASE-11425 - Off-Heap for read and write path; HBASE-13773 - Replication off ZooKeeper and ReplicationAdmin with ACLs support; HBASE-14070 - Hybrid-Logical Clocks; HBASE-14123 - Backups
    44. “What makes HBase… truly Special?”
    45. The (Big) Landscape: Cassandra, CouchDB, DB2, Hana, HBase, Hive, Hypertable, Impala, Kudu, LevelDB, MySQL, RedShift, RocksDB, MongoDB, Oracle, PostgreSQL, SOLR, SleepyCat, SQLite, SQL Server, Voldemort
    46. Can you spot HBase?
    47. The (Big) Landscape: Cassandra, CouchDB, DB2, Hana, HBase, Hive, Hypertable, Impala, Kudu, LevelDB, MySQL, RedShift, RocksDB, MongoDB, Oracle, PostgreSQL, SOLR, SleepyCat, SQLite, SQL Server, Voldemort
    48. The (Big) Landscape: Cassandra, CouchDB, DB2, Hana, HBase, Hive, Hypertable, Impala, Kudu, LevelDB, MySQL, RedShift, RocksDB, MongoDB, Oracle, PostgreSQL, SOLR, SleepyCat, SQLite, SQL Server, Voldemort
    49. (CC BY-SA 2.5)
    50. The HBase Sweet Spot: (1) Scales single clusters to 100s or 1000s of commodity machines. (2) Small scans (<100m rows) and Gets. (3) Operations are harder, but amortized over large (>15 node) installs. (4) Consistent, open source, on-prem, cloud. (5) A foundational, general purpose, low latency storage engine. There is no system that handles analytical and OLTP workloads well. There is no replacement for HBase in this sweet spot.
    51. Future work: Using large RAM effectively; off-heaping everything; in-memory compactions; optimizing lock contention to utilize all cores, on SSDs, 10GbE or more; scaling assignment manager; Spark integration for large scans, OLAP; multi-tenancy; sister projects such as Phoenix, for easy interfacing; easier operations to ease on-boarding
    52. Time To Party!
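
The HBase 2.0 slide lists Hybrid-Logical Clocks (HBASE-14070) among the possible features without saying how such a clock works. As a rough illustration only, here is a minimal Python sketch of the general hybrid logical clock idea: each timestamp pairs a physical-time component with a logical counter, so timestamps keep increasing even when wall clocks stall or disagree between nodes. This is not HBase's implementation; the class name `HybridLogicalClock`, the methods `tick` and `update`, and the millisecond resolution are all assumptions made for the example.

```python
import time
from typing import Callable, Optional, Tuple

# A timestamp is (physical component, logical counter); tuples compare in that order.
Timestamp = Tuple[int, int]


class HybridLogicalClock:
    """Illustrative hybrid logical clock (hypothetical names, not HBase code)."""

    def __init__(self, physical_clock: Optional[Callable[[], int]] = None):
        # The physical clock is injectable so the behavior can be tested deterministically.
        self._now = physical_clock or (lambda: time.time_ns() // 1_000_000)
        self.l = 0  # highest physical time seen so far
        self.c = 0  # logical counter used to break ties within the same l

    def tick(self) -> Timestamp:
        """Timestamp a local event or an outgoing message."""
        pt = self._now()
        if pt > self.l:
            # Wall clock moved forward: adopt it and reset the counter.
            self.l, self.c = pt, 0
        else:
            # Wall clock stalled or went backwards: advance the counter instead.
            self.c += 1
        return (self.l, self.c)

    def update(self, remote: Timestamp) -> Timestamp:
        """Merge a timestamp received from another node."""
        pt = self._now()
        rl, rc = remote
        new_l = max(self.l, rl, pt)
        if new_l == self.l and new_l == rl:
            self.c = max(self.c, rc) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == rl:
            self.c = rc + 1
        else:
            # Local wall clock is strictly ahead of everything seen so far.
            self.c = 0
        self.l = new_l
        return (self.l, self.c)


# Usage: node b's wall clock lags node a's, yet b's timestamps still order after a's.
a = HybridLogicalClock(lambda: 100)
b = HybridLogicalClock(lambda: 90)
sent = a.tick()
received = b.update(sent)
assert received > sent
```

The point of the design is that comparing `(l, c)` tuples respects causality (a received message always orders before the receiver's next timestamp) while `l` stays close to real time whenever the wall clocks are well behaved.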
