Lars George and Jon Hsieh presented archetypes for common Apache HBase application patterns. They defined archetypes as common architectural patterns, extracted from multiple use cases, that can be repeated. The presentation covered "good" archetypes that are well suited to HBase's capabilities, such as storing simple entities, messaging data, and metrics. "Bad" archetypes that are not optimal fits for HBase included using it as a large blob store, naively porting a relational database schema, and using it as an analytic archive requiring frequent full scans. A discussion of access patterns and tradeoffs concluded the overview of HBase application archetypes.
HBaseCon 2012 | Lessons Learned from OpenTSDB - Benoit Sigoure, StumbleUpon (Cloudera, Inc.)
OpenTSDB was built on the belief that, through HBase, a new breed of monitoring systems could be created, one that can store and serve billions of data points forever without the need for destructive downsampling, one that could scale to millions of metrics, and where plotting real-time graphs is easy and fast. In this presentation we’ll review some of the key points of OpenTSDB’s design, some of the mistakes that were made, how they were or will be addressed, and what were some of the lessons learned while writing and running OpenTSDB as well as asynchbase, the asynchronous high-performance thread-safe client for HBase. Specific topics discussed will be around the schema, how it impacts performance and allows concurrent writes without need for coordination in a distributed cluster of OpenTSDB instances.
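The coordination-free writes mentioned above follow from OpenTSDB's row key layout: metric UID, then a base timestamp aligned to the hour, then sorted tag UID pairs, so every writer computes the same key independently. Below is a minimal sketch of that layout in plain Python; the UID widths and hourly alignment follow OpenTSDB's documented defaults, but the UID values and function names are illustrative.

```python
import struct

# Illustrative sketch of OpenTSDB's row key layout. UID widths and the
# hourly base timestamp match the documented defaults; the UID values
# themselves are made up for the example.
METRIC_UID_WIDTH = 3   # bytes per metric UID
TAG_UID_WIDTH = 3      # bytes per tag-key / tag-value UID

def make_row_key(metric_uid: int, timestamp: int, tags: dict) -> bytes:
    """Build a row key: metric UID + base hour + sorted (tagk, tagv) UID pairs."""
    base_hour = timestamp - (timestamp % 3600)   # align to the hour
    key = metric_uid.to_bytes(METRIC_UID_WIDTH, "big")
    key += struct.pack(">I", base_hour)          # 4-byte unsigned seconds
    for tagk_uid, tagv_uid in sorted(tags.items()):
        key += tagk_uid.to_bytes(TAG_UID_WIDTH, "big")
        key += tagv_uid.to_bytes(TAG_UID_WIDTH, "big")
    return key

# All points for one metric+tagset within the same hour share a row, so
# concurrent writers append cells without coordinating with each other.
key = make_row_key(0x000001, 1700003605, {0x0000A0: 0x0000B7})
print(len(key), key.hex())
```

Because the key starts with the metric UID rather than a timestamp, writes for different metrics spread across regions instead of hotspotting the newest one.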
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S... (Spark Summit)
What if you could get the simplicity, convenience, interoperability, and storage niceties of an old-fashioned CSV with the speed of a NoSQL database and the storage requirements of a gzipped file? Enter Parquet.
At The Weather Company, Parquet files are a quietly awesome and deeply integral part of our Spark-driven analytics workflow. Using Spark + Parquet, we’ve built a blazing fast, storage-efficient, query-efficient data lake and a suite of tools to accompany it.
We will give a technical overview of how Parquet works and how recent improvements from Tungsten enable SparkSQL to take advantage of this design to provide fast queries by overcoming two major bottlenecks of distributed analytics: communication costs (IO bound) and data decoding (CPU bound).
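The columnar idea behind Parquet can be sketched in a few lines of plain Python (toy data, no Parquet library involved): pivoting rows into per-column arrays means a query touching one column never decodes the others, which is exactly the IO bottleneck the talk describes.

```python
# Toy columnar layout: rows are pivoted into one contiguous list per
# column, so a query touching one column never decodes the others.
rows = [
    {"station": "KATL", "temp_f": 71.2, "humidity": 0.64},
    {"station": "KJFK", "temp_f": 58.9, "humidity": 0.71},
    {"station": "KSEA", "temp_f": 52.3, "humidity": 0.88},
]

def to_columnar(rows):
    return {name: [r[name] for r in rows] for name in rows[0]}

columns = to_columnar(rows)

# Column pruning: an aggregate over temp_f reads 3 floats, not 9 values.
avg_temp = sum(columns["temp_f"]) / len(columns["temp_f"])
print(round(avg_temp, 1))  # -> 60.8
```

Real Parquet adds per-column encodings (dictionary, run-length) and per-page statistics on top of this layout, which is what makes the CPU-bound decoding side cheap as well.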
HBase has established itself as the backend for many operational and interactive use cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features, HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tunable write-ahead logging, and so on. This talk is based on the research for the upcoming second edition of the speaker's HBase book, combined with practical experience from medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of matching use cases, through determining the number of servers needed, and leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
This presentation describes how to efficiently load data into Hive. It covers partitioning, predicate pushdown, ORC file optimization, and different loading schemes.
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. As Hive continues to grow its support for analytics, reporting, and interactive query, the community is hard at work in improving it along with many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations which have landed in the project over the last year. Materialized views, the extension of ACID semantics to non-ORC data, and workload management are some noteworthy new features.
We will discuss optimizations which provide major performance gains, including significantly improved performance for ACID tables. The talk will also provide a glimpse of what is expected to come in the near future.
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce (Cloudera, Inc.)
Most developers are familiar with the topic of “database design”. In the relational world, normalization is the name of the game. How do things change when you’re working with a scalable, distributed, non-SQL database like HBase? This talk will cover the basics of HBase schema design at a high level and give several common patterns and examples of real-world schemas to solve interesting problems. The storage and data access architecture of HBase (row keys, column families, etc.) will be explained, along with the pros and cons of different schema decisions.
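Two of the row-key patterns that recur in HBase schema-design talks can be sketched briefly: salting, to keep a monotonically growing key space from hotspotting one region, and timestamp reversal, so the newest cell sorts first in a scan. The sketch below is illustrative only; the bucket count and key formats are assumptions, not a prescribed design.

```python
import hashlib

# Sketch of two common HBase row-key patterns (names are illustrative):
# salting to spread sequential keys across regions, and timestamp
# reversal so the newest row sorts first lexicographically.
NUM_SALT_BUCKETS = 8
MAX_LONG = 2**63 - 1

def salted_key(user_id: str) -> bytes:
    # A one-byte salt from a stable hash keeps sequential ids from
    # hotspotting a single region server; scans must fan out over all
    # NUM_SALT_BUCKETS prefixes, which is the tradeoff.
    salt = hashlib.md5(user_id.encode()).digest()[0] % NUM_SALT_BUCKETS
    return bytes([salt]) + user_id.encode()

def reverse_ts_key(user_id: str, ts_millis: int) -> bytes:
    # Subtracting from MAX_LONG makes lexicographic order newest-first,
    # so the first row a scan returns is the user's latest event.
    return user_id.encode() + b"|" + str(MAX_LONG - ts_millis).encode()
```

Both patterns trade something away (scan fan-out, human readability) for better write distribution or read latency, which is the kind of pro/con weighing the talk walks through.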
Performance Tuning RocksDB for Kafka Streams' State Stores (Dhruba Borthakur, ...) (Confluent)
RocksDB is the default state store for Kafka Streams. In this talk, we will discuss how to improve single node performance of the state store by tuning RocksDB and how to efficiently identify issues in the setup. We start with a short description of the RocksDB architecture. We discuss how Kafka Streams restores the state stores from Kafka by leveraging RocksDB features for bulk loading of data. We give examples of hand-tuning the RocksDB state stores based on Kafka Streams metrics and RocksDB’s metrics. At the end, we dive into a few RocksDB command line utilities that allow you to debug your setup and dump data from a state store. We illustrate the usage of the utilities with a few real-life use cases. The key takeaway from the session is the ability to understand the internal details of the default state store in Kafka Streams so that engineers can fine-tune their performance for different varieties of workloads and operate the state stores in a more robust manner.
Hive’s RCFile has been the standard format for storing Hive data for the last 3 years. However, RCFile has limitations because it treats each column as a binary blob without semantics. The upcoming Hive 0.11 will add a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type specific readers and writers that provide light weight compression techniques such as dictionary encoding, bit packing, delta encoding, and run length encoding -- resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that queries reading only one column read only the required bytes. Furthermore, ORC files include light weight indexes that include the minimum and maximum values for each column in each set of 10,000 rows and the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that aren’t important for this query. Finally, ORC works together with the upcoming query vectorization work providing a high bandwidth reader/writer interface.
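The lightweight index described above is easy to demonstrate in miniature: keep min/max statistics per group of 10,000 rows, and a pushdown predicate can skip any group whose range cannot contain a match. The sketch below is a toy model of the idea, not ORC's actual file format.

```python
# Toy version of ORC's lightweight index: min/max statistics per group
# of rows let a pushdown predicate skip whole groups without reading them.
ROWS_PER_GROUP = 10_000

def build_index(values):
    groups = []
    for start in range(0, len(values), ROWS_PER_GROUP):
        chunk = values[start:start + ROWS_PER_GROUP]
        groups.append({"start": start, "min": min(chunk), "max": max(chunk)})
    return groups

def scan_with_pushdown(values, index, lo, hi):
    """Return matching values, decoding only groups whose [min, max] overlaps."""
    hits, groups_read = [], 0
    for g in index:
        if g["max"] < lo or g["min"] > hi:
            continue  # entire group skipped: zero bytes decoded
        groups_read += 1
        chunk = values[g["start"]:g["start"] + ROWS_PER_GROUP]
        hits.extend(v for v in chunk if lo <= v <= hi)
    return hits, groups_read

values = list(range(100_000))      # sorted data makes skipping effective
idx = build_index(values)
hits, read = scan_with_pushdown(values, idx, 25_000, 25_009)
print(len(hits), read)             # 10 matches, 1 group decoded out of 10
```

Note how the payoff depends on data layout: when the filtered column is sorted (or clustered), most groups have disjoint ranges and are skipped; on randomly ordered data every group may overlap the predicate and nothing is saved.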
C* Summit 2013: The World's Next Top Data Model by Patrick McFadin (DataStax Academy)
You know you need Cassandra for its uptime and scaling, but what about that data model? Let's bridge that gap and get you building your game-changing app. We'll break down topics like storing objects and indexing for fast retrieval. You will see that by understanding a few things about Cassandra internals, you can put your data model in the spotlight. The goal of this talk is to get you comfortable working with data in Cassandra throughout the application lifecycle. What are you waiting for? The cameras are waiting!
This talk delves into the many ways a user can apply HBase in a project. Lars will look at many practical examples based on real applications in production, for example at Facebook and eBay, and at the right approach for those wanting to find their own implementation. He will also discuss advanced concepts, such as counters, coprocessors, and schema design.
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase (HBaseCon)
In this presentation, we will introduce Hotspot's Garbage First collector (G1GC) as the most suitable collector for latency-sensitive applications running in large-memory environments. We will first discuss G1GC internal operations and tuning opportunities, and also cover tuning flags that set desired GC pause targets, change adaptive GC thresholds, and adjust GC activities at runtime. We will provide several HBase case studies using Java heaps as large as 100GB that show how to best tune applications to remove unpredictable, protracted GC pauses.
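A starting flag set for the kind of tuning described above might look like the following. These are real HotSpot flags, but the specific values are illustrative assumptions; appropriate pause targets and region sizes depend on the workload and must be validated against GC logs.

```
-XX:+UseG1GC
-XX:MaxGCPauseMillis=100                 # desired (not guaranteed) pause target
-XX:InitiatingHeapOccupancyPercent=65    # start concurrent marking earlier
-XX:G1HeapRegionSize=32m                 # larger regions for very large heaps
-Xms100g -Xmx100g                        # fixed heap size avoids resize pauses
```

The common theme in talks like this is iteration: set a pause target, watch the GC log for evacuation failures and long mixed collections, then adjust the marking threshold and region size rather than hand-tuning generation sizes as one would with CMS.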
Storing Time Series Data with Apache Cassandra - Patrick McFadin
If you are looking to collect and store time series data, it's probably not going to be small. Don't get caught without a plan! Apache Cassandra has proven itself a solid choice; now you can learn how to use it. We'll look at possible data models and the choices you have to make to be successful. Then let's open the hood and learn how data is stored in Apache Cassandra. You don't need to be an expert in distributed systems to make this work, and I'll show you how. I'll give you real-world examples and work through the steps. Give me an hour and I will upgrade your time series game.
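A pattern that comes up in nearly every Cassandra time-series talk is time bucketing: fold a time window into the partition key so no partition grows without bound. The sketch below shows the idea; the table and column names are illustrative, and the right bucket size (day, hour, week) depends on write rate.

```python
from datetime import datetime, timezone

# Sketch of the time-bucketed partition pattern: partition key =
# (sensor_id, day), clustering key = timestamp. Each day's readings for
# a sensor land in one bounded partition.
def partition_key(sensor_id: str, ts: float):
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y%m%d")
    return (sensor_id, day)

# Equivalent CQL (illustrative table and column names):
#   CREATE TABLE readings (
#       sensor_id text, day text, ts timestamp, value double,
#       PRIMARY KEY ((sensor_id, day), ts)
#   ) WITH CLUSTERING ORDER BY (ts DESC);

print(partition_key("sensor-42", 1_700_000_000.0))
```

Reads for a time range then touch a predictable, small set of partitions (one per bucket in the range), which keeps both queries and compaction well behaved.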
In a world where compute is paramount, it is all too easy to overlook the importance of storage and IO in the performance and optimization of Spark jobs.
Migrating Your Clusters and Workloads from Hadoop 2 to Hadoop 3 (DataWorks Summit)
The Hadoop community announced Hadoop 3.0 GA in December 2017 and 3.1 in April 2018, loaded with features and improvements. One of the biggest challenges for any new major release of a software platform is compatibility, and the Apache Hadoop community has focused on ensuring wire and binary compatibility for Hadoop 2 clients and workloads.
There are many challenges for admins to address while upgrading to a major release of Hadoop. Users running workloads on Hadoop 2 should be able to seamlessly run or migrate their workloads onto Hadoop 3. This session dives deep into the upgrade process and provides a detailed preview of migration strategies, with information on what works and what might not. The talk covers the motivation for upgrading to Hadoop 3 and provides a cluster upgrade guide for admins and a workload migration guide for users of Hadoop.
Speaker
Suma Shivaprasad, Hortonworks, Staff Engineer
Rohith Sharma, Hortonworks, Senior Software Engineer
Cassandra Concepts, Patterns and Anti-patterns - Dave Gardner
An introduction to the fundamental concepts behind Apache Cassandra. This talk explains the engineering principles that make Cassandra such an attractive choice for building highly resilient and available systems and then goes on to explain how to use it - covering basic data modelling patterns and anti-patterns.
Hive Bucketing in Apache Spark with Tejas Patil (Databricks)
Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea of bucketing is to partition, and optionally sort, the data based on a subset of columns while it is written out (a one-time cost), making successive reads of the data more performant for downstream jobs if the SQL operators can take advantage of this property. Bucketing enables faster joins (e.g., a single-stage sort-merge join), the ability to short-circuit a FILTER operation if the file is pre-sorted on the column in the filter predicate, and quick data sampling.
In this session, you’ll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook’s performance tests have shown bucketing to improve Spark performance by 3-5x when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark has resulted in 2-3x savings compared to Hive. You’ll also hear about real-world applications of bucketing, such as loading cumulative tables with daily deltas, and the characteristics that help identify suitable candidate jobs that can benefit from bucketing.
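The mechanics behind the shuffle-free join can be sketched in a few lines: route each row to a bucket by hashing the join key at write time, and then join co-numbered buckets pairwise with no data movement. This is a toy model with an illustrative hash function; Hive and Spark each use their own hash functions (and they differ, which is one reason cross-engine bucketed joins need care).

```python
# Toy model of hash bucketing: rows are routed to a bucket file by
# hash(key) % num_buckets at write time. Two tables bucketed the same
# way can then be joined bucket-by-bucket with no shuffle.
NUM_BUCKETS = 4

def bucket_of(key: str) -> int:
    # Stable hash stand-in for the engine's real bucketing hash.
    return sum(key.encode()) % NUM_BUCKETS

def write_bucketed(rows, key_col):
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[bucket_of(row[key_col])].append(row)
    return buckets

def bucketed_join(left_buckets, right_buckets, key_col):
    # Join pairs of co-numbered buckets only: each pair is one task,
    # and no row ever needs to move to another bucket.
    out = []
    for lb, rb in zip(left_buckets, right_buckets):
        index = {}
        for r in rb:
            index.setdefault(r[key_col], []).append(r)
        for l in lb:
            for r in index.get(l[key_col], []):
                out.append({**l, **r})
    return out

users = write_bucketed([{"uid": "a", "name": "Ann"}, {"uid": "b", "name": "Bo"}], "uid")
clicks = write_bucketed([{"uid": "a", "page": "/home"}], "uid")
joined = bucketed_join(users, clicks, "uid")
print(joined)
```

The one-time cost is visible here too: `write_bucketed` does the routing work up front so that every later `bucketed_join` on the same key skips the shuffle entirely.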
HBaseCon 2015: Industrial Internet Case Study Using HBase and TSDB (HBaseCon)
This case study involves analysis of high-volume, continuous time-series aviation data from jet engines, consisting of temperature, pressure, vibration, and related parameters from on-board sensors, joined with well-characterized, slowly changing engine asset configuration data and other enterprise data for continuous engine diagnostics and analytics. This data is ingested via a distributed fabric comprising transient containers, message queues, and columnar, compressed storage leveraging OpenTSDB over Apache HBase.
HBase can be an intimidating beast for someone considering its adoption. For what kinds of workloads is it well suited? How does it integrate into the rest of my application infrastructure? What are the data semantics upon which applications can be built? What are the deployment and operational concerns? In this talk, I'll address each of these questions in turn. As supporting evidence, both high-level application architecture and internal details will be discussed. This is an interactive talk: bring your questions and your use-cases!
Improvements to Apache HBase and Its Applications in Alibaba Search (HBaseCon)
Yu Li and Shaoxuan Wang (Alibaba)
HBase is the core storage system in Alibaba’s Search Infrastructure. In this session, we will talk about the details of how we use HBase to serve such high-throughput, low-latency, mixed workloads and the various improvements we made to HBase to meet these challenges.
HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web Archiving (HBaseCon)
Web archiving initiatives around the world capture ephemeral web content to preserve our collective digital memory. However, that requires scalable, responsive tools that support exploration and discovery of captured content. Here you'll learn about why Warcbase, an open-source platform for managing web archives built on HBase, is one such tool. It provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge, tightly integrates with Hadoop for analytics and data processing, and relies on HBase for storage infrastructure.
Speaker: Bryan Beaudreault (HubSpot)
Running HBase in real time in the cloud provides an interesting and ever-changing set of challenges -- instance types are not ideal, neighbors can degrade your performance, and instances can randomly die in unanticipated ways. This talk will cover what HubSpot has learned about running in production on Amazon EC2, how it handle DR and redundancy, and the tooling the team has found to be the most helpful.
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B... (HBaseCon)
Speakers: Chris Huang and Scott Miao (Trend Micro)
Trend Micro collects a great deal of threat knowledge data for clients, containing many different threat (web) entities. Most threat entities are observed along with relations, such as malicious behaviors or interaction chains among them. So we built a graph model on HBase to store all the known threat entities and their relationships, allowing clients to query threat relationships via any given threat entity. This presentation covers the problems we set out to solve, the design decisions we made and why, how we designed the graph model, and the graph computation tasks involved.
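A graph on a wide-column store like HBase is commonly modeled as an adjacency list: one row per entity, one column per neighbor, with the relation type as the cell value, so a single row read answers "what is this entity connected to?". The sketch below uses a dict of dicts as a stand-in for the table; the column naming scheme and entity names are illustrative, not the talk's actual schema.

```python
# Sketch of an adjacency-list graph on a wide-column store: row key =
# entity, one column per neighbor, column value = relation type. A dict
# of dicts stands in for the HBase table.
table = {}

def add_edge(src, relation, dst):
    table.setdefault(src, {})[f"edge:{dst}"] = relation

def neighbors(entity):
    """One row read returns every known relation for the entity."""
    return {col.split(":", 1)[1]: rel for col, rel in table.get(entity, {}).items()}

add_edge("evil.example.com", "redirects_to", "payload.example.net")
add_edge("evil.example.com", "resolves_to", "203.0.113.9")
print(neighbors("evil.example.com"))
```

Traversal is then iterated single-row lookups (breadth-first from a seed entity), which maps naturally onto HBase's strength of fast point reads on wide rows.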
HBaseCon 2015: Blackbird Collections - In-situ Stream Processing in HBase (HBaseCon)
Blackbird is a large-scale object store built at Rocket Fuel, which stores 100+ TB of data and provides real-time access to 10 billion+ objects in 2-3 milliseconds at a rate of 1 million+ requests per second. In this talk (an update from HBaseCon 2014), we will describe Blackbird's comprehensive collections API and give various examples of how it can be used to model collections like sets and maps, and aggregates on those collections, such as counters. We will also illustrate the flexibility and power of the API by modeling custom collection types that are unique to the Rocket Fuel context.
HBaseCon 2015: HBase Operations in a Flurry (HBaseCon)
With multiple clusters of 1,000+ nodes replicated across multiple data centers, Flurry has learned many operational lessons over the years. In this talk, you'll explore the challenges of maintaining and scaling Flurry's cluster, how we monitor, and how we diagnose and address potential problems.
Rolling Out Apache HBase for Mobile Offerings at Visa HBaseCon
Partha Saha and CW Chung (Visa)
Visa has embarked on an ambitious multi-year redesign of its entire data platform that powers its business. As part of this plan, the Apache Hadoop ecosystem, including HBase, will now become a staple in many of its solutions. Here, we will describe our journey in rolling out a high-availability NoSQL solution based on HBase behind some of our prominent mobile offerings.
HBaseCon 2015: Solving HBase Performance Problems with Apache HTrace (HBaseCon)
HTrace is a new Apache incubator project which makes it much easier to diagnose and detect performance problems in HBase. It provides a unified view of the performance of requests, following them from their origin in the HBase client, through the HBase region servers, and finally into HDFS. System administrators can use a central web interface to query and view aggregate performance information for the whole cluster. This talk will cover the motivations for creating HTrace, its design, and some examples of how HTrace can help diagnose real-world HBase problems.
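The core idea of following a request across layers can be shown with a minimal span sketch. This is a toy illustration of the concept behind HTrace, not its API: nested spans record parent/child relationships and timing, so one client call can be attributed to the server-side and filesystem work it triggered (the span names below are made up).

```python
import time
from contextlib import contextmanager

# Minimal tracing sketch: nested spans record parent/child timing so one
# request can be followed from client code down through server-side calls.
spans = []
_stack = []

@contextmanager
def trace(name):
    span = {"name": name,
            "parent": _stack[-1]["name"] if _stack else None,
            "start": time.monotonic()}
    _stack.append(span)
    try:
        yield span
    finally:
        _stack.pop()
        span["elapsed"] = time.monotonic() - span["start"]
        spans.append(span)

# One logical request, three layers deep (illustrative span names).
with trace("client.get"):
    with trace("regionserver.read"):
        with trace("hdfs.pread"):
            pass

print([(s["name"], s["parent"]) for s in spans])
```

A real tracer additionally propagates the span context across process and network boundaries and samples rather than recording everything, but the aggregated parent/child view an admin queries is built from records shaped like these.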
This year we'll talk about the joys of the HBase Fuzzy Row Filter, new TSDB filters, expression support, Graphite functions and running OpenTSDB on top of Google’s hosted Bigtable. AsyncHBase now includes per-RPC timeouts, append support, Kerberos auth, and a beta implementation in Go.
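The Fuzzy Row Filter mentioned above matches row keys against a pattern plus a mask marking which byte positions are fixed and which are wildcards, which is handy when a composite key embeds several fields and the query constrains only some of them. Below is a toy version of the matching rule (in HBase's convention, a mask byte of 0 means the position is fixed and 1 means it may vary); the example key layout is illustrative.

```python
# Toy version of the fuzzy row matching idea: a key pattern plus a mask
# marking which byte positions are fixed (0) and which are wildcards (1).
def fuzzy_match(row_key: bytes, pattern: bytes, mask: bytes) -> bool:
    if len(row_key) < len(pattern):
        return False
    return all(m == 1 or r == p
               for r, p, m in zip(row_key, pattern, mask))

# Match any 2-byte prefix (e.g. a metric id) but a fixed 2-byte suffix
# (e.g. a tag value), a shape common in OpenTSDB-style composite keys.
pattern = b"\x00\x00\xAB\xCD"
mask    = b"\x01\x01\x00\x00"
assert fuzzy_match(b"\x12\x34\xAB\xCD", pattern, mask)
assert not fuzzy_match(b"\x12\x34\xAB\xCE", pattern, mask)
```

Because the server evaluates the mask while scanning, rows that cannot match are skipped region-side instead of being shipped to the client and filtered there.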
Digital Library Collection Management using HBase (HBaseCon)
Speaker: Ron Buckley (OCLC)
OCLC has been working over the last year to move its massive repository to HBase. This talk will focus on the impetus behind the move, implementation details and technology choices we've made (key design, shredding PDFs and other digital objects into HBase, scaling), and the value-add that HBase brings to digital collection management.
HBase Data Modeling and Access Patterns with Kite SDKHBaseCon
Speaker: Adam Warrington (Cloudera)
The Kite SDK is a set of libraries and tools focused on making it easier to build systems on top of the Hadoop ecosystem. HBase support has recently been added to the Kite SDK Data Module, which allows a developer to model and access data in HBase consistent with how they would model data in HDFS using Kite. This talk will focus on Kite's HBase support by covering Kite basics and moving through the specifics of working with HBase as a data source. This feature overview will be supplemented by specifics of how that feature is being used in production applications at Cloudera.
HBase at Bloomberg: High Availability Needs for the Financial IndustryHBaseCon
Speaker: Sudarshan Kadambi and Matthew Hunt (Bloomberg LP)
Bloomberg is a financial data and analytics provider, so data management is core to what we do. There's tremendous diversity in the type of data we manage, and HBase is a natural fit for many of these datasets - from the perspective of the data model as well as in terms of a scalable, distributed database. This talk covers data and analytics use cases at Bloomberg and operational challenges around HA. We'll explore the work currently being done under HBASE-10070, further extensions to it, and how this solution is qualitatively different to how failover is handled by Apache Cassandra.
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWSHBaseCon
In this session, we will briefly cover the FINRA use case and then dive into our approach with a particular focus on how we leverage HBase on AWS. Among the topics covered will be our use of HBase Bulk Loading and ExportSnapShots for backup. We will also cover some lessons learned and experiences of running a persistent HBase cluster on AWS.
Speaker: Daniel Nelson (Nielsen)
The motivation behind content identification is to determine the media people are consuming (via TV shows, movies, or streaming). Nielsen collects that data via its Fingerprints system, which generates significant amounts of structured data that is stored in HBase. This presentation will review the options a developer has for HBase querying and retrieval of hash data. Also covered is the use of wire protocols (Protocol Buffers), and how they can improve network efficiency and throughput, especially when combined with an HBase coprocessor.
Apache HBase in the Enterprise Data Hub at CernerHBaseCon
Swarnim Kulkarni (Cerner)
Cerner has been an active consumer of HBase for a very long time, storing petabytes of healthcare data in its multiple isolated HBase clusters. This talk will walk through the design of Cerner's enterprise data hub with a focus on the multi-tenant HBase as a service offering within the hub.
The tech talk was given by Lars Hofhansl, Salesforce Vice President and Principal Architect.
HBase/PHOENIX @ Scale: A study of Salesforce’s use of HBase and Phoenix
Welcome!
Michael Stack, Software Engineer, Cloudera & HBase PMC Chair
9:00-9:05am
Conference MC Michael Stack, Chair of the HBaseCon 2013 Program Committee, welcomes you to the conference and offers a preview of the day.
The Apache HBase Community: Best Ever and Getting Better
Amr Awadallah, CTO and Co-founder, Cloudera
9:05-9:15am
Amr comments on the explosion of interest in Apache HBase over the past few years, how that interest has influenced the Hadoop stack overall, and why Cloudera considers its involvement in the HBase community to be so important.
State of the Apache HBase Union
Michael Stack & Lars Hofhansl, Architect, Salesforce.com
9:15-9:40am
Release-managers-in-crime Michael and Lars offer a look back, and a look forward, at HBase releases and what they have brought us (and will bring us in the future).
The Apache HBase Ecosystem
Aaron Kimball, Chief Architect, WibiData
9:40-10:05am
Today, HBase stands as Apache Hadoop did years ago, a project with a growing and vibrant community in its own right. In this talk, Aaron will overview some of the projects built on top of HBase that you’ll get a chance to learn about during the day – each of these projects having grown out of a need to use HBase for an application that requires real-time atomic access to data. As an example, he’ll present the motivations for Kiji and how it is helping organizations create amazing new applications using HBase and Hadoop.
Overview of Apache HBase at Facebook (Slides Not Available)
Liyin Tang, Software Engineer, Facebook & HBase PMC Member
10:05-10:30am
In this keynote, you’ll get an overview of how HBase is used at Facebook. Explore Facebook’s applications using HBase as an OLTP service, which require high reliability, efficiency, and scalability, and how HBase can tolerate small network glitches and rack failures. You’ll also learn the use cases for adopting HBase as a batch processing service and various optimizations to scale processing throughput. Finally, learn Facebook’s thoughts about the future of HBase.
Hortonworks and Platfora in Financial Services - WebinarHortonworks
Big Data Analytics is transforming how banks and financial institutions unlock insights, make more meaningful decisions, and manage risk. Join this webinar to see how you can gain a clear understanding of the customer journey by leveraging Platfora to interactively analyze the mass of raw data that is stored in your Hortonworks Data Platform. Our experts will highlight use cases, including customer analytics and security analytics.
Speakers: Mark Lochbihler, Partner Solutions Engineer at Hortonworks, and Bob Welshmer, Technical Director at Platfora
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...Chris Huang
Trend Micro collects lots of threat knowledge data for clients, containing many different threat (web) entities. Most threat entities are observed along with relations, such as malicious behaviors or interaction chains among them. So we built a graph model on HBase to store all the known threat entities and their relationships, allowing clients to query threat relationships via any given threat entity. This presentation covers the problems we tried to solve, the design decisions we made and why, how we designed the graph model, and the graph computation tasks involved.
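The edge-as-row-key design such a graph model typically uses can be sketched like this (an illustration of the general technique, not Trend Micro's actual schema):

```python
from bisect import bisect_left

# Store each directed edge as its own row, keyed "<src>|<relation>|<dst>".
# Because HBase keeps rows sorted by key, all relationships of an entity sit
# in one contiguous range and can be fetched with a single prefix scan.
edges = sorted([
    b"evil.example|hosts|1.2.3.4",
    b"evil.example|redirects|bad.example",
    b"good.example|links|evil.example",
])

def prefix_scan(rows, prefix: bytes):
    """Mimic an HBase scan bounded below by the prefix."""
    start = bisect_left(rows, prefix)
    return [r for r in rows[start:] if r.startswith(prefix)]

hits = prefix_scan(edges, b"evil.example|")
```

A reverse-edge table (keyed by destination) is the usual companion for answering "who points at this entity" with the same scan trick.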
Big data architectures and the data lakeJames Serra
With so many new technologies it can get confusing on the best approach to building a big data architecture. The data lake is a great new concept, usually built in Hadoop, but what exactly is it and how does it fit in? In this presentation I'll discuss the four most common patterns in big data production implementations, the top-down vs bottoms-up approach to analytics, and how you can use a data lake and a RDBMS data warehouse together. We will go into detail on the characteristics of a data lake and its benefits, and how you still need to perform the same data governance tasks in a data lake as you do in a data warehouse. Come to this presentation to make sure your data lake does not turn into a data swamp!
From: DataWorks Summit Munich 2017 - 20170406
While you might be tempted to assume data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned is not (yet) important. This talk first introduces the overarching issue and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, and finally shows a viable approach using built-in tools. You will also learn not to take this topic lightheartedly and what is needed to implement and guarantee continuous operation of Hadoop cluster-based solutions.
Hortonworks Data Platform 2.2 includes Apache HBase for fast NoSQL data access. In this 30-minute webinar, we discussed HBase innovations that are included in HDP 2.2, including: support for Apache Slider; Apache HBase high availability (HA); block cache compression; and wire-level encryption.
Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email E...WebExpo
“Big Data Tools to Build Personalization”
More at http://webexpo.net/prague2013/talk/using-hadoop-and-hbase-to-personalize-web-mobile-and-email-experience-for-millions-of-users/
Do you need to deal with massive data, where hundreds of gigabytes growing into terabytes or even petabytes are part of your day-to-day? Do you need to perform thousands of operations per second over multiple terabytes of data? Come learn about Apache HBase, a NoSQL database that runs on top of HDFS and is highly available, fault tolerant, and scalable. HBase is widely used at companies such as Facebook and Twitter. This talk gives an introduction, showing what HBase is and when to use it, its architecture, and examples of real-world solutions from large companies such as Facebook, Twitter, and Trend Micro.
SQL on Hadoop Batch, Interactive and Beyond.
Public Presentation showing history and where Hortonworks is looking to go with 100% Open Source Technology.
Apache Hive, Apache SparkSQL, Apache Phoenix, and Apache Druid
Atlanta meetup presentation, discussion around big data processing engines (Hive, HBase, Druid, Spark). Weighs the relative strengths of each engine and the use cases for which each is best suited.
hbaseconasia2017: Building online HBase cluster of Zhihu based on KubernetesHBaseCon
Zhiyong Bai
As a high-performance and scalable key-value database, HBase is used at Zhihu to provide an online data store alongside MySQL and Redis. Zhihu's platform team had accumulated experience with container technology, and this time, based on Kubernetes, we built a flexible platform for our online HBase systems: we rapidly create multiple logically isolated HBase clusters on a shared physical cluster and provide customized service for different business needs. Combined with Consul and a DNS server, we implemented highly available access to HBase using clients mainly written in Python. This presentation shares the architecture of the online HBase platform at Zhihu and some practical experience from the production environment.
hbaseconasia2017 hbasecon hbase
Jingcheng Du
Apache Beam is an open source, unified programming model for defining batch and streaming jobs that run on many execution engines. HBase on Beam is a connector that allows Beam to use HBase as a bounded data source and target data store for both batch and streaming data sets. With this connector, HBase can work with many batch and streaming engines directly, for example Spark, Flink, and Google Cloud Dataflow. In this session, I will introduce Apache Beam, the current implementation of HBase on Beam, and the future plans for it.
https://www.eventbrite.com/e/hbasecon-asia-2017-tickets-34935546159#
hbaseconasia2017: HBase Disaster Recovery Solution at HuaweiHBaseCon
Ashish Singhi
The HBase disaster recovery solution aims to maintain high availability of the HBase service in case of the disaster of one HBase cluster, with very minimal user intervention. This session will introduce the HBase disaster recovery use cases and the various solutions adopted at Huawei, including:
a) Cluster Read-Write mode
b) DDL operations synchronization with standby cluster
c) Mutation and bulk loaded data replication
d) Further challenges and pending work
hbaseconasia2017: Removable singularity: a story of HBase upgrade in PinterestHBaseCon
Tianying Chang
HBase is used to serve online-facing traffic at Pinterest, which means no downtime is allowed. However, we were on HBase 0.94. To upgrade to the latest version, we needed to figure out a way to upgrade live while keeping the Pinterest site up. Recently, we successfully upgraded our 0.94 HBase clusters to 1.2 with no downtime. We made changes on both the AsyncHBase and HBase server side. We will talk about what we did and how we did it, as well as our findings from the config and performance tuning we did to achieve low latency.
Xinxin Fan and Hongxiang Jiang
First, we will give a brief introduction to the HBase service at Netease, including basic cluster info and the key HBase services. Then we will share some tips on tuning practices for HBase. Finally, we will introduce some improvements in our internal HBase version.
hbaseconasia2017: Large scale data near-line loading method and architectureHBaseCon
Shuaifeng Zhou
When we do real-time data loading to HBase, we use the put/putList interface. After receiving a put request, the regionserver writes the WAL, writes the data into the memstore, flushes the memstore to disk, then compacts files again and again. That procedure consumes too many resources and degrades read/write performance. To solve the problem, we provide a near-line loading method and architecture that greatly increases loading bandwidth and decreases the impact on read operations.
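A toy model of the contrast (our illustration, not Huawei's implementation): instead of pushing every record through put → WAL → memstore → flush → compact, a near-line loader buffers records, sorts them, and hands the store one pre-sorted run per batch, the same trick HBase bulk load uses to bypass the write path.

```python
# Toy near-line loader: buffer, sort, emit pre-sorted runs ("HFiles").
class NearLineLoader:
    def __init__(self, batch_size=4):
        self.batch_size = batch_size
        self.buffer = []
        self.files = []          # stands in for flushed, pre-sorted HFiles

    def write(self, key, value):
        self.buffer.append((key, value))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.files.append(sorted(self.buffer))  # one sorted run per batch
            self.buffer = []

loader = NearLineLoader(batch_size=2)
for k in ["c", "a", "d", "b"]:
    loader.write(k, k.upper())
```

The payoff is that the store ingests each run with sequential I/O and no per-record WAL write, at the cost of the records only becoming visible at flush time, hence "near-line" rather than real-time.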
hbaseconasia2017: Ecosystems with HBase and CloudTable service at HuaweiHBaseCon
Jieshan Bi and Yanhui Zhong
1. CTBase: A light-weight HBase client for structured data.
1). Schematized table, more friendly for structured data storage.
2). Global secondary index for HBase.
3). HBase Query DSL. JSON based light-weight API.
4). Cluster table: pre-joining with keys, a better solution for cross-table join queries in HBase.
2. Tagram: Distributed bitmap index implementation with HBase.
1). Distributed bitmap index for accelerating AD-HOC queries with low cardinality columns.
2). Powerful and flexible query API.
3). Tagram offers millisecond-level query latency.
3. CloudTable Service Introduction: HBase on Huawei cloud.
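To illustrate the Tagram idea above, here is a minimal bitmap-index sketch (the mechanics are assumed; Python ints stand in for the distributed bitmaps Tagram keeps in HBase): one bitmap per distinct value of a low-cardinality column, with bit i set when row i has that value, so ad-hoc AND/OR queries become bitwise operations.

```python
from collections import defaultdict

class BitmapIndex:
    def __init__(self):
        self.bitmaps = defaultdict(int)   # value -> bitmap over row ids

    def add(self, row_id: int, value: str):
        self.bitmaps[value] |= 1 << row_id

    def query(self, *values):
        """AND the bitmaps for the given values, return matching row ids."""
        result = ~0
        for v in values:
            result &= self.bitmaps[v]
        return [i for i in range(result.bit_length()) if result >> i & 1]

idx = BitmapIndex()
for row_id, (city, tier) in enumerate([("sz", "gold"), ("bj", "gold"), ("sz", "free")]):
    idx.add(row_id, city)
    idx.add(row_id, tier)
```

This is why bitmap indexes shine for low-cardinality columns: a handful of bitmaps covers every row, and intersecting them is a few machine-word operations per 64 rows.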
hbaseconasia2017: HBase Practice At XiaoMiHBaseCon
Zheng Hu
We'll share some HBase experience at XiaoMi:
1. How we tuned G1GC for our HBase clusters.
2. Development and performance of the async HBase client.
HBase-2.0.0 has been a couple of years in the making. It is chock-a-block full of a long list of new features and fixes. In this session, the 2.0.0 release manager will perform the impossible, describing the release content inside the session time bounds.
As HBase and Hadoop continue to become routine across enterprises, these enterprises inevitably shift priorities from effective deployments to cost-efficient operations. Consolidation of infrastructure, the sum of hardware, software, and system-administrator effort, is the most common strategy to reduce costs. As a company grows, the number of business organizations, development teams, and individuals accessing HBase grows commensurately, creating a not-so-simple requirement: HBase must effectively service many users, each with a variety of use-cases. This problem is known as multi-tenancy. While multi-tenancy isn't a new problem, it also isn't a solved one, in HBase or otherwise. This talk will present a high-level view of the common issues organizations face when multiple users and teams share a single HBase instance and how certain HBase features were designed specifically to mitigate the issues created by the sharing of finite resources.
HBaseCon2017 Removable singularity: a story of HBase upgrade in PinterestHBaseCon
HBase is used to serve online-facing traffic at Pinterest, which means no downtime is allowed. However, we were on HBase 0.94. To upgrade to the latest version, we needed to figure out a way to upgrade live while keeping the Pinterest site up. Recently, we successfully upgraded our 0.94 HBase clusters to 1.2 with no downtime. We made changes on both the AsyncHBase and HBase server side. We will talk about what we did and how we did it, as well as our findings from the config and performance tuning we did to achieve low latency.
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBaseHBaseCon
Hundreds of millions of people use Quora to find accurate, informative, and trustworthy answers to their questions. As it so happens, counting things at scale is both an important and a difficult problem to solve.
In this talk, we will be talking about Quanta, Quora's counting system built on top of HBase that powers our high-volume near-realtime analytics that serves many applications like ads, content views, and many dashboards. In addition to regular counting, Quanta supports count propagation along the edges of an arbitrary DAG. HBase is the underlying data store for both the counting data and the graph data.
We will describe the high-level architecture of Quanta and share our design goals, constraints, and choices that enabled us to build Quanta very quickly on top of our existing infrastructure systems.
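Count propagation along a DAG, as Quanta's description suggests, can be sketched as follows (our guess at the mechanics, not Quora's code): each counter is a node, and an increment to a node also increments every ancestor reachable from it.

```python
from collections import defaultdict

class DagCounter:
    """Toy hierarchical counter: increments propagate up a DAG of counters."""
    def __init__(self):
        self.parents = defaultdict(set)   # child -> set of parent counters
        self.counts = defaultdict(int)

    def add_edge(self, child, parent):
        self.parents[child].add(parent)

    def incr(self, node, delta=1):
        seen, stack = set(), [node]
        while stack:                      # visit each reachable ancestor once
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            self.counts[n] += delta
            stack.extend(self.parents[n])

c = DagCounter()
c.add_edge("answer_views", "question_views")
c.add_edge("question_views", "topic_views")
c.incr("answer_views", 5)
```

In a real system the counts and the graph edges would both live in HBase, as the abstract notes, and propagation would typically be batched near-realtime rather than synchronous per increment.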
In the age of NoSQL, big data storage engines such as HBase have given up ACID semantics of traditional relational databases, in exchange for high scalability and availability. However, it turns out that in practice, many applications require consistency guarantees to protect data from concurrent modification in a massively parallel environment. In the past few years, several transaction engines have been proposed as add-ons to HBase; three different engines, namely Omid, Tephra, and Trafodion were open-sourced in Apache alone. In this talk, we will introduce and compare the different approaches from various perspectives including scalability, efficiency, operability and portability, and make recommendations pertaining to different use cases.
In order to effectively predict and prevent online fraud in real time, Sift Science stores hundreds of terabytes of data in HBase—and needs it to be always available. This talk will cover how we used circuit-breaking, cluster failover, monitoring, and automated recovery procedures to improve our HBase uptime from 99.7% to 99.99% on top of unreliable cloud hardware and networks.
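The circuit-breaking technique mentioned is a generic availability pattern; a minimal sketch (not Sift Science's code) might look like this: after a threshold of consecutive failures the breaker opens and calls fail fast until a cooldown passes, protecting callers from a struggling cluster.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures."""
    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None         # half-open: allow one probe call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0                 # success closes the circuit
        return result
```

Wrapping HBase client calls this way lets an application fail over to a replica cluster quickly instead of letting request threads pile up behind a sick region server.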
DiDi Chuxing is China's most popular ride-sharing company. We use HBase to serve our big data needs.
We run three clusters that serve different business needs. We backported the region grouping feature to our internal HBase version so we could isolate the different use cases.
We built the Didi HBase Service platform which is popular amongst engineers at our company. It includes a workflow and project management function as well as a user monitoring view.
Internally we recommend that users go through Phoenix to simplify access. Moreover, we use row timestamps and multidimensional table schemas to solve multi-dimension query problems.
C++, Go, Python, and PHP clients get to HBase via thrift2 proxies and QueryServer.
We run many important business applications on our HBase clusters, such as ETA, GPS, historical orders, API metrics monitoring, and Traffic in the Cloud. If you are interested in any of the aspects listed above, please come to our talk. We would like to share our experiences with you.
HBaseCon2017 gohbase: Pure Go HBase ClientHBaseCon
gohbase is an implementation of an HBase client in pure Go: https://github.com/tsuna/gohbase. In this presentation we'll talk about its architecture and compare its performance against the native Java HBase client as well as AsyncHBase (http://opentsdb.github.io/asynchbase/) and some nice characteristics of golang that resulted in a simpler implementation.
1. Apache HBase Application Archetypes
Lars George | @larsgeorge | Cloudera EMEA Chief Architect | HBase PMC
Jonathan Hsieh | @jmhsieh | Cloudera HBase Tech Lead | HBase PMC
HBaseCon 2014 | May 5th, 2014
2. About Lars and Jon
Lars George
• EMEA Chief Architect @Cloudera
• Apache HBase PMC
• O’Reilly author of HBase – The Definitive Guide
• Contact: lars@cloudera.com, @larsgeorge
Jon Hsieh
• Tech Lead, HBase Team @Cloudera
• Apache HBase PMC
• Apache Flume founder
• Contact: jon@cloudera.com, @jmhsieh
3. About Supporting HBase at Cloudera
• Supporting customers using HBase since 2011
• HBase training
• Professional services
• Team has experience supporting and running HBase since 2009
• 8 committers on staff
• 2 HBase book authors
• As of Jan 2014, ~20,000 HBase nodes (in aggregate) under management
• Information in this presentation is either aggregated customer data or from public sources
4. An Apache HBase Timeline (2008–2015)
• Summer ’09: StumbleUpon goes production on HBase ~0.20
• Apr ’11: CDH3 GA with HBase 0.90.1
• Summer ’11: Messages on HBase; Web Crawl Cache
• Sept ’11: HBase TDG published
• Nov ’11: Cassini on HBase
• May ’12: HBaseCon 2012
• Nov ’12: HBase in Action published
• Jan ’13: Phoenix on HBase
• Jun ’13: HBaseCon 2013
• Aug ’13: Flurry 1k-1k node cluster replication
• Jan ’14: Cloudera has ~20k HBase nodes under management
• May ’14: HBaseCon 2014
• Summer ’14: HBase v1.0.0 released
7. A vocabulary for HBase Archetypes
Definitions
8. Defining HBase Archetypes
• There are a lot of HBase applications
• Some successful, some less so
• They have common architecture patterns
• They have common tradeoffs
• Archetypes are common architecture patterns
• Common across multiple use-cases
• Extracted to be repeatable
• Our Goal: Define patterns à la “Gang of Four” (Gamma, Helm, Johnson, Vlissides)
9. So you want to use HBase?
• What data is being stored?
• Entity data
• Event data
• Why is the data being stored?
• Operational use cases
• Analytical use cases
• How does the data get in and out?
• Real time vs. Batch
• Random vs. Sequential
10. What is being stored?
There are primarily two kinds of big data workloads, and they have different storage requirements: entities and events.
11. Entity Centric Data
• Entity data is information about current state
• Generally real time reads and writes
• Examples:
• Accounts
• Users
• Geolocation points
• Click Counts and Metrics
• Current Sensor Readings
• Scales up with # of Humans and # of Machines/Sensors
• Billions of distinct entities
12. Event Centric Data
• Event-centric data consists of time-series data points recorded at successive intervals over time
• Generally real time write, some combination of real time read or batch read
• Examples:
• Sensor data over time
• Historical Stock Ticker data
• Historical Metrics
• Clicks time-series
• Scales up due to finer grained intervals, retention policies, and the passage of time
13. Events about Entities
• The majority of big data use cases deal with event-based data
• |Entities| * |Events| = Big data
• When you ask questions, do you hone in on entity first?
• When you ask questions, do you hone in on time ranges first?
• Your answer will help you determine where and how to store your data
14. Why are you storing the data?
• So what kinds of questions are you asking of the data?
• Entity-centric questions:
• Give me everything about entity e
• Give me the most recent event v about entity e
• Give me the n most recent events V about entity e
• Give me all events V about e between time [t1, t2]
• Event- and time-centric questions:
• Give me an aggregate for each entity between time [t1, t2]
• Give me an aggregate for each time interval for entity e
• Find events V that match some other given criteria
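The entity-first questions above map directly onto HBase row-key design: if the key leads with the entity ID and ends with a timestamp, the time-range question becomes a single bounded scan. A minimal Python sketch of the key arithmetic (the key layout and names are illustrative assumptions, not from the slides):

```python
import struct

def event_scan_range(entity_id: bytes, t1: int, t2: int):
    # Assumed row-key layout: entity ID followed by a big-endian
    # 8-byte timestamp, so events for one entity sort by time.
    start = entity_id + struct.pack(">Q", t1)
    stop = entity_id + struct.pack(">Q", t2 + 1)  # HBase stop rows are exclusive
    return start, stop

start, stop = event_scan_range(b"user42", 1000, 2000)
```

Because HBase stores rows in lexicographic key order, this (start, stop) pair bounds exactly "all events about e between [t1, t2]".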
15. How does data get in and out of HBase?
[Diagram: data paths in; HBase client (Put, Incr, Append), bulk import, and HBase replication. Data paths out: HBase client (Gets, short scans), HBase scanner (full scans, MapReduce), and HBase replication.]
16. How does data get in and out of HBase?
[Diagram: the same paths annotated by cost; client Put/Incr/Append and Get/Scan are low latency, while bulk import, replication, and full scans/MapReduce via the HBase scanner are high throughput.]
17. What system is most efficient?
• It is all physics
• You have a limited I/O budget
• Use all your I/O by parallelizing access
and read/write sequentially.
• Choose the system and features that reduce I/O in general
• Pick the system best suited to your workload
18. The physics of Hadoop Storage Systems
Workload         | HBase                        | HDFS
-----------------|------------------------------|-----------------------------
Low latency      | ms, cached                   | mins (MR); seconds (Impala)
Random Read      | primary index                | index?, small-files problem
Short Scan       | sorted                       | partition
Full Scan        | live table; MR on snapshots  | MR, Hive, Impala
Random Write     | log structured               | not supported
Sequential Write | HBase overhead; bulk load    | minimal overhead
Updates          | log structured               | not supported
22. HBase application use cases
• The Good
• Simple Entities
• Messaging Store
• Graph Store
• Metrics Store
• The Bad
• Large Blobs
• Naïve RDBMS port
• Analytic Archive
• The Maybe
• Time series DB
• Combined workloads
24. Archetype: Simple Entities
• Purely entity data, no relation between entities
• Batch or real-time, random writes
• Real-time, random reads
• Could be a well-done denormalized RDBMS port.
• Often from many different sources, with poly-structured data
• Schema:
• Row per entity
• Row key => entity ID, or hash of entity ID
• Col qualifier => Property / field, possibly time stamp
• Examples:
• Geolocation data
• Search index building (use Solr to make text data searchable)
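The "hash of entity ID" row-key option above trades scan locality for even write distribution. A minimal sketch of one common variant (the 4-byte prefix length and MD5 choice are illustrative assumptions):

```python
import hashlib

def entity_row_key(entity_id: str) -> bytes:
    # A short hash prefix spreads otherwise-sequential entity IDs
    # (e.g. auto-increment user IDs) evenly across regions; the full
    # ID is kept after the prefix so point Gets still work.
    prefix = hashlib.md5(entity_id.encode()).digest()[:4]
    return prefix + entity_id.encode()
```

The cost of the prefix is that range scans over raw entity IDs are no longer possible; that is fine here because simple entities have no relations between rows.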
25. Simple Entities access pattern
[Diagram: simple-entity access pattern; low-latency client Put/Incr/Append and Get/Scan, high-throughput bulk import and replication, with full scans via the HBase scanner feeding Solr index building.]
26. Archetype: Messaging Store
• Messaging Data:
• Realtime Random writes: Emails, SMS, MMS, IM
• Realtime random updates: Msg read, starred, moved, deleted
• Reading of top-N entries, sorted by time
• Records are of varying size
• Some time series, but mostly random read/write
• Schema:
• Row per user/feed/inbox
• Row key = UID, or UID + time
• Column qualifier = time, or conversation ID + time
• Use column families for indexes
• Examples:
• Facebook Messages, Xiaomi Messages
• Telco SMS/MMS services
• Feeds like tumblr, pinterest
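The "reading of top-N entries, sorted by time" requirement is usually met by reversing the timestamp in the row key, so the newest message sorts first and a top-N read is a short forward scan. A minimal sketch (the separator byte and timestamp width are illustrative assumptions):

```python
MAX_TS = 2**63 - 1  # ceiling used to reverse timestamps

def inbox_row_key(uid: str, ts_millis: int) -> bytes:
    # Storing (MAX_TS - timestamp) makes newer messages sort first
    # lexicographically; "n most recent for this user" becomes a
    # short scan starting at the UID prefix.
    reversed_ts = MAX_TS - ts_millis
    return uid.encode() + b"#" + reversed_ts.to_bytes(8, "big")
```

With this layout, the inbox view never needs to sort client-side; HBase's key order does the work.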
27. Facebook Messages - Statistics
Source: HBaseCon 2012 - Anshuman Singh
28. Messages Access Pattern
[Diagram: messaging access pattern; low-latency client Put/Incr/Append and Get/Scan, high-throughput bulk import, replication, and full scans/MapReduce via the HBase scanner.]
29. Archetype: Graph Data
• Graph Data: All entities and relations
• Batch or realtime, random writes
• Batch or realtime, random reads
• It's entity data with relation edges
• Schema:
• Row = Node.
• Row key => Node ID.
• Col qualifier => Edge ID, or properties:values
• Examples:
• Web Caches – Yahoo!, Trend Micro
• Titan Graph DB with HBase storage backend
• Sessionization (financial transactions, clicks streams, network traffic)
• Government (connect the bad guy)
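The schema above is an adjacency-list layout: one row per node, one column qualifier per edge. A minimal in-memory sketch of how such a row would be shaped (the `edge:` qualifier prefix and property encoding are illustrative assumptions, not a real HBase API):

```python
def node_row(node_id: str, edges: dict) -> dict:
    # One row per node: the row key is the node ID, and each
    # outgoing edge becomes a column qualifier whose cell value
    # carries the edge's properties.
    return {
        "row_key": node_id.encode(),
        "columns": {("edge:" + dst).encode(): props.encode()
                    for dst, props in edges.items()},
    }

row = node_row("page1", {"page2": "rel=links_to", "page3": "rel=links_to"})
```

A single Get on the node row then returns the full edge list, which is why random-read graph traversal fits HBase well.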
30. Graph Data Access Pattern
[Diagram: graph-data access pattern; low-latency client Put/Incr/Append and Get/Scan, high-throughput bulk import, replication, and full scans/MapReduce via the HBase scanner.]
31. Archetype: Metrics
• Frequently updated Metrics
• Increments
• Roll ups generated by MR and bulk loaded to HBase
• Poor man’s datacubes
• Examples
• Campaign Impression/Click counts (Ad tech)
• Sensor data (Energy, Manufacturing, Auto)
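The two write paths above, real-time increments plus bulk-loaded MR rollups, can be modeled in a few lines. This is an in-memory sketch, not the HBase API: `counters` stands in for counter cells, `incr` for an HBase Increment, and `daily_rollup` for the MR-generated aggregate that would be bulk loaded back in:

```python
from collections import defaultdict

counters = defaultdict(int)  # stands in for counter cells in a table

def incr(metric: str, hour: str, delta: int = 1):
    # Models an HBase Increment against an hourly counter cell.
    counters[(metric, hour)] += delta

def daily_rollup(metric: str, day: str) -> int:
    # Models the batch-generated daily aggregate: the "poor man's
    # datacube" layer served back out of HBase.
    return sum(v for (m, hour), v in counters.items()
               if m == metric and hour.startswith(day))
```

Serving the pre-computed rollup is what keeps reads cheap; queries never aggregate raw counters on the fly.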
32. Metrics Access Pattern
[Diagram: metrics access pattern; low-latency client Put/Incr/Append and Get/Scan, high-throughput bulk import of rollups, replication, and full scans/MapReduce via the HBase scanner.]
34. Current HBase weak spots
• HBase's architecture can handle a lot
• We make engineering trade-offs to optimize for common workloads
• HBase can still do things it is not optimal for
• However, other systems are fundamentally more efficient for some
workloads
• We've often seen folks forcing apps into HBase
• If one of these is your only workload on this data, use another system
• If you are in a mixed-workload case, some of these become "maybes"
• Just because it is not good today doesn't mean it can't be better
tomorrow
35. Bad Archetype: Large Blob Store
• Saving large objects >3MB per cell
• Schema:
• Normal entity pattern, but with some columns with large cells.
• Examples
• Raw photo or video storage in HBase
• Large frequently updated structs as a single cell
• Problems:
• Will get crushed by write amplification when reoptimizing data for reads
(compactions rewrite large, unchanging data)
• Will crush the write pipeline if large structs have frequently updated subfields:
cells are atomic, so HBase must rewrite the entire cell
• There is some work on adding LOB support
• This requires new architecture elements
36. Bad Archetype: Naïve RDBMS port
• A naïve port of an RDBMS schema onto HBase, copying the schema directly
• Schema
• Many tables, just like the RDBMS schema
• Row key: primary key or auto-incrementing key, as in the RDBMS schema
• Column qualifiers: field names
• Joins and secondary indexes must be done by hand (and are not consistent)
• Solution:
• HBase is not a SQL database
• No multi-region/multi-table transactions in HBase (yet)
• You must denormalize your schema to use HBase
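Denormalizing means folding the joined tables into one wide row so a single Get answers the question. A toy sketch of the contrast (the `info:`/`order:` column families and sample data are illustrative assumptions):

```python
# Normalized, RDBMS-style: two tables joined on user_id at read time.
users = {1: "Ada"}
orders = [(1, "book"), (1, "pen")]

def items_via_join(uid):
    return sorted(item for owner, item in orders if owner == uid)

# Denormalized, HBase-style: one wide row per user; order items are
# folded into column qualifiers so one Get returns everything.
user_row = {b"info:name": b"Ada", b"order:book": b"", b"order:pen": b""}

def items_via_get(row):
    return sorted(q.split(b":", 1)[1].decode()
                  for q in row if q.startswith(b"order:"))
```

Both paths return the same answer; the difference is that the wide-row version needs no join machinery and no cross-table transaction.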
37. Large blob store, Naïve RDBMS port access patterns
[Diagram: large blob store and naïve RDBMS port access patterns; the same low-latency client and high-throughput bulk/replication/scanner paths, all of which these archetypes stress badly.]
38. Bad Archetype: Analytic archive
• Store purely chronological data, partitioned by time
• Real time writes, chronological time as primary index
• Column-centric aggregations over all rows.
• Bulk reads out, generally for generating periodic reports
• Schema
• Row key: date+xxx or salt+date+xxx
• Column qualifiers: properties with data or counters
• Example
• Machine logs organized by date.
• Full fidelity clickstream
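The salt+date+xxx key above is the standard mitigation for write-side hot spotting on chronological keys, and it illustrates why this archetype strains HBase: the salt spreads writes but multiplies reads. A sketch (the one-byte salt, bucket count, and MD5 choice are illustrative assumptions):

```python
import hashlib

N_BUCKETS = 8

def salted_key(date: str, entity: str) -> bytes:
    # salt+date+xxx: a one-byte salt derived from the entity spreads
    # the write hot spot of purely chronological keys across
    # N_BUCKETS regions.
    salt = int(hashlib.md5(entity.encode()).hexdigest(), 16) % N_BUCKETS
    return bytes([salt]) + date.encode() + b"#" + entity.encode()

# The cost: every time-range read must fan out into one scan per bucket.
scan_prefixes = [bytes([s]) + b"2014-05-05" for s in range(N_BUCKETS)]
```

So even after salting, the frequent full reads this archetype demands remain expensive, which is the next slide's point.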
39. Bad Archetype: Analytic archive Problems
• HBase is non-optimal as the primary store for this use case
• Will get crushed by frequent full table scans.
• Will get crushed by large compactions.
• Will get crushed by write-side region hot spotting.
• Instead
• Store in HDFS; Use Parquet columnar data storage + Impala/Hive
• Build rollups in HDFS+MR; store and serve rollups in HBase
40. Analytic Archive access patterns
[Diagram: analytic archive access pattern; writes land via the low-latency client path, but the dominant reads are high-throughput full scans/MapReduce via the HBase scanner.]
41. And this is crazy | But here’s my data, | serve it, maybe!
Archetypes: The Maybe
42. The Maybe’s
• For some applications, doing it right gets complicated.
• These more sophisticated or nuanced cases require considering
these questions:
• When do you choose HBase vs HDFS storage for time series data?
• Are there times where bad archetypes are ok?
43. Time Series: in HBase or HDFS?
• I/O patterns:
• Reads: co-locate related data
• Make reads cheap and fast
• Writes: spread writes out as much as possible
• Maximize write throughput
• HBase: tension between these goals
• Spreading writes spreads the data, making reads inefficient
• Co-locating on write causes hot spots and underutilizes resources by capping
write throughput
• HDFS: the sweet spot
• Sequential writes and sequential reads
• Just write more files into date directories; this physically spreads writes but
logically groups data
• Time-centric queries just read the files in the relevant date directories
44. Time Series data flows
• Ingest
• Flume or a similar tool, fed directly from the app
• HDFS
• Batch queries and rollup generation in Hive/MR
• Faster queries in Impala
• No user-facing serving
• HBase for recent data, HDFS for historical
• HBase
• Serve individual events
• Serve pre-computed aggregates
45. Archetype: Entity Time Series
• A time series access pattern suitable for HBase
• Random writes of event data; random reads of specific events or aggregate data
• Generate aggregates via counters; don't compute aggregates directly at
query time
• HBase is the system of record
• Schema:
• Row key: entity-timestamp or hash(entity)-timestamp, possibly with a salt
added after the entity
• Column qualifiers: property
• Use custom aggregation to consolidate old data
• Use TTLs to bound and age off old data
• Examples:
• OpenTSDB does this well for numeric values; it lazily aggregates cells for
better performance
• Facebook Insights, ODS
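The hash(entity)-timestamp row key above is often built OpenTSDB-style: one wide row per entity per time bucket, with the in-bucket offset in the column qualifier. A sketch of the key arithmetic (the 4-byte hash, hour-wide rows, and field widths are illustrative assumptions, not OpenTSDB's exact format):

```python
import hashlib
import struct

ROW_SPAN = 3600  # seconds of data per row (hour-wide rows)

def ets_cell(entity: str, ts: int):
    # Row key: short hash of the entity plus the hour-aligned base
    # timestamp; the column qualifier holds the offset within the
    # hour, so one row collects an hour of points for one entity.
    base = ts - (ts % ROW_SPAN)
    row = hashlib.md5(entity.encode()).digest()[:4] + struct.pack(">I", base)
    qualifier = struct.pack(">H", ts - base)
    return row, qualifier
```

Wide rows keep related points physically together for reads while the entity hash spreads writes, which is the balance this archetype needs.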
46. Entity Time Series access pattern
[Diagram: entity time series access pattern; ingest arrives via Flume or a custom app into the low-latency client Put/Incr/Append path, with Get/Scan serving, plus high-throughput bulk import, replication, and full scans/MapReduce via the HBase scanner.]
47. Archetypes: Hybrid Entity Time Series
• Essentially a combination of the Metrics archetype and the Entity Time
Series archetype, with bulk loads of rollups via HDFS
• Land data in both HDFS and HBase
• Keep all data in HDFS for future use
• Aggregate in HDFS and write the results to HBase
• HBase can do some aggregates too (counters)
• Keep serveable data in HBase
• Use TTLs to discard old values from HBase
48. Hybrid time series access pattern
[Diagram: hybrid time series access pattern; Flume lands data in both HDFS and HBase, Hive or MR jobs bulk import rollups, and the usual low-latency Get/Scan serving, replication, and scanner-based full scan paths apply.]
49. Meta Archetype: Combined workloads
• In these cases, the use of HBase depends on the workload
• Cases where we have multiple workload styles
• In many cases we want to do multiple things with the same
data:
• a primary use case (real-time, random access)
• a secondary use case (analytical)
• Pick for your primary use case; here are some patterns for handling
your secondary
50. Real time workloads and Analytical access
[Diagram: a single cluster serving both workloads; full scans via MapReduce interfere with Get/Scan latency, so the real-time path sees poor latency despite high scan throughput.]
51. Real time workloads and Analytical access
[Diagram: replication to a second cluster; the serving cluster keeps low-latency Get/Scan and Put/Incr/Append, isolated from full scans, while the replica handles high-throughput MapReduce.]
52. MR over Table Snapshots (0.98, CDH5.0)
• Previously, MapReduce jobs over HBase required an online full table
scan
• Take a snapshot and run the MR job over the snapshot files instead
• Doesn't use the HBase client
• Avoids polluting HBase caches
• 3-5x performance boost
• Still requires more IOPS than raw HDFS files
[Diagram: two MapReduce jobs of map and reduce tasks; one scans the live table, the other reads the snapshot files directly.]
53. Analytic Archive access pattern
[Diagram: analytic archive access pattern revisited; low-latency client writes and reads alongside high-throughput bulk import, replication, and full scans/MapReduce via the HBase scanner.]
55. Multitenancy (in progress)
• We want to run MR for analytics while
serving low-latency requests in one
cluster
• Performance isolation
• Limit the performance impact that load on
one table has on others (HBASE-6721)
• Request prioritization and scheduling
• Today the default is FIFO
• Need to schedule some requests
before others (HBASE-10994)
[Diagram: a FIFO request queue where short requests are delayed behind long scan requests, versus a rescheduled queue where new requests get priority; mixed versus isolated workloads.]
58. Big Data Workloads
[Diagram: workload quadrants, low latency versus batch against random access / short scan / full scan. HBase covers low-latency random access and short scans (simple entities, messages, graph data, current metrics, entity time series, hybrid rollup serving); HBase + MR covers batch scans; HBase + snapshots feeding HDFS + MR bridges HBase data into batch full scans; HDFS + MR (Hive/Pig) covers batch full scans (analytic archive, index building, hybrid rollup generation); HDFS + Impala covers lower-latency full scans.]
59. HBase is evolving to be an Operational Database
• Excels at consistent, single-row operations
• Dev efforts are aimed at using all machine resources efficiently,
reducing MTTR, and improving latency predictability
• Projects built on HBase enable secondary indexing and
multi-row transactions
• Apache Phoenix (incubating) and Impala provide a SQL skin for
simplified application development
• Analytic workloads?
• Can be done, but will be beaten by direct HDFS +
MR/Spark/Impala