This document summarizes a presentation about HBase storage internals and future developments. It discusses how HBase provides random read/write access on HDFS using tables, regions, and region servers. It describes the write path involving the client, master, and region servers as well as the read path. It also covers topics like snapshots, compactions, and future plans to improve encryption, security, write-ahead logs, and compaction policies.
This document discusses tuning HBase and HDFS for performance and correctness. Some key recommendations include:
- Enable HDFS sync on close and sync behind writes for correctness on power failures.
- Tune HBase compaction settings like blockingStoreFiles and compactionThreshold based on whether the workload is read-heavy or write-heavy.
- Size RegionServer machines based on disk size, heap size, and number of cores to optimize for the workload.
- Set client and server RPC chunk sizes like hbase.client.write.buffer to 2MB to maximize network throughput.
- Configure various garbage collection settings in HBase like -Xmn512m and -XX:+UseCMSInitiatingOccupancyOnly.
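To make the RPC and compaction settings concrete, here is a small hedged sketch using the HBase client Configuration API (the property names are real HBase settings; the values are illustrative and would normally live in hbase-site.xml, not in code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class TuningSketch {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Batch client puts into ~2MB RPCs to maximize network throughput.
        conf.setLong("hbase.client.write.buffer", 2L * 1024 * 1024);
        // Compaction knobs to tune for read-heavy vs write-heavy
        // workloads (illustrative values).
        conf.setInt("hbase.hstore.blockingStoreFiles", 10);
        conf.setInt("hbase.hstore.compactionThreshold", 3);
    }
}
```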
This document discusses techniques for improving latency in HBase. It analyzes the write and read paths, identifying sources of latency such as networking, HDFS flushes, garbage collection, and machine failures. For writes, it finds that single puts can achieve millisecond latency while streaming puts can hide latency spikes. For reads, it notes cache hits are sub-millisecond while cache misses and seeks add latency. GC pauses of 25-100ms are common, and failures hurt locality and require cache rebuilding. The document outlines ongoing work to reduce GC, use off-heap memory, improve compactions and caching to further optimize for low latency.
This talk explores the many ways a user can put HBase to work in a project. Lars will look at practical examples based on real applications in production, for example at Facebook and eBay, and the right approach for those looking to build their own implementation. He will also discuss advanced concepts such as counters, coprocessors, and schema design.
Apache Tez - A New Chapter in Hadoop Data Processing (DataWorks Summit)
Apache Tez is a framework for accelerating Hadoop query processing. It is based on expressing a computation as a dataflow graph and executing it in a highly customizable way. Tez is built on top of YARN and provides benefits like better performance, predictability, and utilization of cluster resources compared to traditional MapReduce. It allows applications to focus on business logic rather than Hadoop internals.
Supporting Apache HBase: Troubleshooting and Supportability Improvements (DataWorks Summit)
This document discusses supporting Apache HBase and improving troubleshooting and supportability. It introduces two Cloudera employees who work on HBase support and provides an overview of typical troubleshooting scenarios for HBase like performance degradation, process crashes, and inconsistencies. The agenda covers using existing tools like logs and metrics to troubleshoot HBase performance issues with a general approach, and introduces htop as a real-time monitoring tool for HBase.
Kudu is popularly referred to as "Fast Analytics on Fast Data," capable of performing both OLAP and OLTP operations. This covers everything from the essentials to a deep dive into Kudu internals and architecture, for building applications based on Kudu and integrating with the Hadoop ecosystem.
Read about Kudu clusters, architecture, operations, primary key design and column optimizations, partitioning and other performance considerations.
HBase Accelerated introduces an in-memory flush and compaction pipeline for HBase to improve performance of real-time workloads. By keeping data in memory longer and avoiding frequent disk flushes and compactions, it reduces I/O and improves read and scan latencies. Evaluation on workloads with high update rates and small working sets showed the new approach significantly outperformed the default HBase implementation by serving most data from memory. Work is ongoing to further optimize the in-memory representation and memory usage.
The document summarizes Apache Phoenix and its past, present, and future as a SQL interface for HBase. It describes Phoenix's architecture and key features like secondary indexes, joins, aggregations, and transactions. Recent releases added functional indexes, the Phoenix Query Server, and initial transaction support. Future plans include improvements to local indexes, integration with Calcite and Hive, and adding JSON and other SQL features. The document aims to provide an overview of Phoenix's capabilities and roadmap for building a full-featured SQL layer over HBase.
This document summarizes a presentation about optimizing HBase performance through caching. It discusses how baseline tests showed low cache hit rates and CPU/memory utilization. Reducing the table block size improved cache hits but increased overhead. Adding an off-heap bucket cache to store table data minimized JVM garbage collection latency spikes and improved memory utilization by caching frequently accessed data outside the Java heap. Configuration parameters for the bucket cache are also outlined.
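As a hedged sketch of what such bucket cache parameters look like (the property names are real HBase settings, but the values here are illustrative and normally belong in the RegionServer's hbase-site.xml):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class BucketCacheConfig {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Keep cached data blocks off the Java heap, out of the GC's reach.
        conf.set("hbase.bucketcache.ioengine", "offheap");
        // Off-heap cache size in MB (illustrative value).
        conf.setInt("hbase.bucketcache.size", 4096);
        // The JVM also needs matching direct memory, e.g.
        // -XX:MaxDirectMemorySize=5g on the RegionServer.
    }
}
```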
Practical advice on how to achieve persistence in Redis: a detailed overview of the pros and cons of RDB snapshots and AOF logging, plus tips and tricks for proper persistence configuration with Redis pools and master/slave replication.
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update... (Databricks)
The convergence of big data technology toward the traditional database domain has become an industry trend. At present, open source big data processing engines such as Apache Spark, Apache Hadoop, and Apache Flink already support SQL interfaces, and SQL usage basically occupies a dominant position. Companies use the above open source software to build their own ETL frameworks and OLAP technology. However, OLTP remains a strong point of traditional databases, one main reason being their support for ACID.
The document discusses Apache Tez, a framework for building data processing applications on Hadoop. It provides an introduction to Tez and describes key features like expressing computations as directed acyclic graphs (DAGs), container reuse, dynamic parallelism, integration with YARN timeline service, and recovery from failures. The document also outlines improvements to Tez around performance, debuggability, and status/roadmap.
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a... (Altinity Ltd)
Columnar stores like ClickHouse enable users to pull insights from big data in seconds, but only if you set things up correctly. This talk will walk through how to implement a data warehouse that contains 1.3 billion rows using the famous NY Yellow Cab ride data. We'll start with basic data implementation including clustering and table definitions, then show how to load efficiently. Next, we'll discuss important features like dictionaries and materialized views, and how they improve query efficiency. We'll end by demonstrating typical queries to illustrate the kind of inferences you can draw rapidly from a well-designed data warehouse. It should be enough to get you started--the next billion rows is up to you!
Sparklens: Understanding the Scalability Limits of Spark Applications with R... (Databricks)
One of the common requests we receive from customers (at Qubole) is debugging slow Spark applications. Usually this is done with trial and error, which takes time and requires running clusters beyond normal usage (read: wasted resources). Moreover, it doesn't tell us where to look for further improvements. We at Qubole are looking into making this process more self-serve. Toward this goal we have built a tool (OSS, https://github.com/qubole/sparklens) based on the Spark event listener framework.
From a single run of the application, Sparklens provides insights about the scalability limits of a given Spark application. In this talk we will cover what Sparklens does and the theory behind it. We will talk about how the structure of a Spark application puts important constraints on its scalability, how we can find these structural constraints, and how to use them as a guide in solving performance and scalability problems of Spark applications.
This talk will help the audience answer the following questions about their Spark applications: 1) Will the application run faster with more executors? 2) How will cluster utilization change as the number of executors changes? 3) What is the absolute minimum time this application will take even if we give it infinite executors? 4) What is the expected wall clock time for the application once we fix its most important structural limits? Sparklens makes the ROI of additional executors extremely obvious for a given application, and needs just a single run to determine how the application will behave with different executor counts. Specifically, it will help managers take the correct side of the tradeoff between spending developer time optimizing applications and spending money on compute bills.
Tez is the next generation Hadoop Query Processing framework written on top of YARN. Computation topologies in higher level languages like Pig/Hive can be naturally expressed in the new graph dataflow model exposed by Tez. Multi-stage queries can be expressed as a single Tez job, resulting in lower latency for short queries and improved throughput for large scale queries. MapReduce has been the workhorse for Hadoop, but its monolithic structure has made innovation slower. YARN separates resource management from application logic and thus enables the creation of Tez, a more flexible and generic framework for data processing for the benefit of the entire Hadoop query ecosystem.
This document provides an overview of Hive and its performance capabilities. It discusses Hive's SQL interface for querying large datasets stored in Hadoop, its architecture which compiles SQL queries into MapReduce jobs, and its support for SQL semantics and datatypes. The document also covers techniques for optimizing Hive performance, including data abstractions like partitions, buckets and skews. It describes different join strategies in Hive like shuffle joins, broadcast joins and sort-merge bucket joins and how they are implemented in MapReduce. The overall presentation aims to explain how Hive provides scalable SQL processing for big data.
Meta/Facebook's database serving social workloads runs on top of MyRocks (MySQL on RocksDB). This means our performance and reliability depend a lot on RocksDB. Beyond MyRocks, we also have other important systems running on top of RocksDB. We have learned many lessons from operating and debugging RocksDB at scale.
In this session, we will offer an overview of RocksDB, key differences from InnoDB, and share a few interesting lessons learned from production.
The document discusses average active sessions (AAS) as a single metric for measuring database performance and load. It provides methods for calculating AAS by sampling active session history (ASH) data or using time statistics, and compares the AAS value to metrics like CPU count to understand whether the database is under- or over-utilized.
It also describes how the components of AAS like CPU usage and wait times can provide more insight, and how tools like the Oracle Enterprise Manager (OEM) can show AAS over time as well as its subcomponents to help identify performance bottlenecks.
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers (Cloudera, Inc.)
Todd Lipcon presents a solution to avoid full garbage collections (GCs) in HBase by using MemStore-Local Allocation Buffers (MSLABs). The document outlines that write operations in HBase can cause fragmentation in the old generation heap, leading to long GC pauses. MSLABs address this by allocating each MemStore's data into contiguous 2MB chunks, eliminating fragmentation. When MemStores flush, the freed chunks are large and contiguous. With MSLABs enabled, the author saw basically zero full GCs during load testing. MSLABs improve performance and stability by preventing GC pauses caused by fragmentation.
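To make the idea concrete, here is a toy Java sketch of MSLAB-style allocation (an illustration of the technique, not HBase's actual implementation): cell bytes are copied into private 2MB chunks so that a MemStore flush frees whole contiguous chunks at once instead of scattering small objects across the old generation.

```java
import java.util.ArrayList;
import java.util.List;

public class MslabSketch {
    private static final int CHUNK_SIZE = 2 * 1024 * 1024; // 2MB chunks, as in MSLAB
    private final List<byte[]> chunks = new ArrayList<>();
    private byte[] current;
    private int offset;

    public MslabSketch() {
        current = new byte[CHUNK_SIZE];
        chunks.add(current);
    }

    /** Copies one cell's bytes into the current chunk; returns its offset. */
    public synchronized int copyCell(byte[] cell) {
        // (the real MSLAB hands oversized cells to the normal allocator)
        if (cell.length > CHUNK_SIZE - offset) {
            current = new byte[CHUNK_SIZE]; // retire the chunk, start a fresh one
            chunks.add(current);
            offset = 0;
        }
        System.arraycopy(cell, 0, current, offset, cell.length);
        int start = offset;
        offset += cell.length;
        return start;
    }
}
```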
Espresso: LinkedIn's Distributed Data Serving Platform (Paper), Amy W. Tang
This paper, written by the LinkedIn Espresso Team, appeared at the ACM SIGMOD/PODS Conference (June 2013). To see the talk given by Swaroop Jagadish (Staff Software Engineer @ LinkedIn), go here:
http://www.slideshare.net/amywtang/li-espresso-sigmodtalk
How Netflix Tunes EC2 Instances for Performance (Brendan Gregg)
CMP325 talk for AWS re:Invent 2017, by Brendan Gregg. "
At Netflix we make the best use of AWS EC2 instance types and features to create a high performance cloud, achieving near bare metal speed for our workloads. This session will summarize the configuration, tuning, and activities for delivering the fastest possible EC2 instances, and will help other EC2 users improve performance, reduce latency outliers, and make better use of EC2 features. We'll show how we choose EC2 instance types, how we choose between EC2 Xen modes: HVM, PV, and PVHVM, and the importance of EC2 features such as SR-IOV for bare-metal performance. SR-IOV is used by EC2 enhanced networking, and recently for the new i3 instance type for enhanced disk performance as well. We'll also cover kernel tuning and observability tools, from basic to advanced. Advanced performance analysis includes the use of Java and Node.js flame graphs, and the new EC2 Performance Monitoring Counter (PMC) feature released this year."
RedisConf17 - Using Redis at Scale @ Twitter (Redis Labs)
The document discusses Nighthawk, Twitter's distributed caching system which uses Redis. It provides caching services at a massive scale of over 10 million queries per second and 10 terabytes of data across 3000 Redis nodes. The key aspects of Nighthawk's architecture that allow it to scale are its use of a client-oblivious proxy layer and cluster manager that can independently scale and rebalance partitions across Redis nodes. It also employs replication between data centers to provide high availability even in the event of node failures. Some challenges discussed are handling "hot keys" that get an unusually high volume of requests and more efficiently warming up replicas when nodes fail.
The document provides an overview of the InnoDB storage engine used in MySQL. It discusses InnoDB's architecture including the buffer pool, log files, and indexing structure using B-trees. The buffer pool acts as an in-memory cache for table data and indexes. Log files are used to support ACID transactions and enable crash recovery. InnoDB uses B-trees to store both data and indexes, with rows of variable length stored within pages.
Enabling the Active Data Warehouse with Apache Kudu (Grant Henke)
Apache Kudu is an open source data storage engine that makes fast analytics on fast and changing data easy. In this presentation, Grant Henke from Cloudera will provide an overview of what Kudu is, how it works, and how it makes building an active data warehouse for real time analytics easy. Drawing on experiences from some of our largest deployments, this talk will also include an overview of common Kudu use cases and patterns. Additionally, some of the newest Kudu features and what is coming next will be covered.
MySQL Sharding: Tools and Best Practices for Horizontal Scaling (Mats Kindahl)
This presentation provides an introduction to what you need to consider when implementing a sharding solution, and introduces MySQL Fabric as a tool to help you easily set up a sharded database.
What's the time? ...and why? (Mattias Sax, Confluent) - Kafka Summit SF 2019 (confluent)
Data stream processing is built on the core concept of time. However, understanding time semantics and reasoning about time is not simple, especially if deterministic processing is expected. In this talk, we explain the difference between processing, ingestion, and event time and what their impact is on data stream processing. Furthermore, we explain how Kafka clusters and stream processing applications must be configured to achieve specific time semantics. Finally, we deep dive into the time semantics of the Kafka Streams DSL and KSQL operators, and explain in detail how the runtime handles time. Apache Kafka offers many ways to handle time on the storage layer, i.e., the brokers, allowing users to build applications with different semantics. Time semantics in the processing layer, i.e., Kafka Streams and KSQL, are even richer, more powerful, but also more complicated. Hence, it is paramount for developers to understand different time semantics and to know how to configure Kafka to achieve them. Therefore, this talk enables developers to design applications with their desired time semantics, helps them reason about the runtime behavior with regard to time, and allows them to understand processing/query results.
Speaker: Jesse Anderson (Cloudera)
As optional pre-conference prep for attendees who are new to HBase, this session will offer a brief Cliff's Notes-level overview of architecture, API, and schema design. The architecture section will cover the daemons and their functions; the API section will cover HBase's GET, PUT, and SCAN classes; and the schema design section will cover how HBase differs from an RDBMS and how much effort to place on schema and row-key design.
The document discusses designing robust data architectures for decision making. It advocates for building architectures that can easily add new data sources, improve and expand analytics, standardize metadata and storage for easy data access, discover and recover from mistakes. The key aspects discussed are using Kafka as a data bus to decouple pipelines, retaining all data for recovery and experimentation, treating the filesystem as a database by storing intermediate data, leveraging Spark and Spark Streaming for batch and stream processing, and maintaining schemas for integration and evolution of the system.
HBase is a distributed, column-oriented database that stores data in tables divided into rows and columns. It is optimized for random, real-time read/write access to big data. The document discusses HBase's key concepts like tables, regions, and column families. It also covers performance tuning aspects like cluster configuration, compaction strategies, and intelligent key design to spread load evenly. Different use cases are suitable for HBase depending on access patterns, such as time series data, messages, or serving random lookups and short scans from large datasets. Proper data modeling and tuning are necessary to maximize HBase's performance.
Jesse Anderson (Smoking Hand)
This early-morning session offers an overview of what HBase is, how it works, its API, and considerations for using HBase as part of a Big Data solution. It will be helpful for people who are new to HBase, and also serve as a refresher for those who may need one.
Jingwei Lu and Jason Zhang (Airbnb)
AirStream is a realtime stream computation framework built on top of Spark Streaming and HBase that allows our engineers and data scientists to easily leverage HBase to get real-time insights and build real-time feedback loops. In this talk, we will introduce AirStream, and then go over a few production use cases.
This document introduces HBase, an open-source, non-relational, distributed database modeled after Google's BigTable. It describes what HBase is, how it can be used, and when it is applicable. Key points include that HBase stores data in columns and rows accessed by row keys, integrates with Hadoop for MapReduce jobs, and is well-suited for large datasets, fast random access, and write-heavy applications. Common use cases involve log analytics, real-time analytics, and messages-centered systems.
Apache HBase is Hadoop's open source, distributed, versioned storage manager, well suited for random, realtime read/write access. This talk will give an overview of how HBase achieves random I/O, focusing on the storage layer internals: starting from how the client interacts with Region Servers and the Master, then going into WAL, MemStore, Compactions, and on-disk format details. It also looks at how the storage is used by features like snapshots, and how it can be improved to gain flexibility, performance, and space efficiency.
1) HBase satisfied Facebook's requirements for a real-time data store by providing excellent write performance, horizontal scalability, and features like atomic operations.
2) At Facebook, HBase is used for messaging and user activity tracking applications that involve massive write-throughput and petabytes of data.
3) HBase's integration with HDFS provides fault tolerance and scalability, while its column orientation enables complex queries on user activity data.
The document introduces Maxtable, an open-source distributed database. It consists of three components: a metadata server that manages the global namespace, Ranger servers that hold data partitions, and client libraries. Data is automatically partitioned and scaled across servers. The document describes Maxtable's architecture, features like scalability and recovery, its query language, and how to operate and maintain the system. Future work may include secondary indexes and join queries.
Near-realtime analytics with Kafka and HBase (dave_revell)
A presentation at OSCON 2012 by Nate Putnam and Dave Revell about Urban Airship's analytics stack. Features Kafka, HBase, and Urban Airship's own open source projects statshtable and datacube.
Siebel Server Cloning available in 8.1.1.9 / 8.2.2.2 (Jeroen Burgers)
Installation Cloning
Siebel server cloning
Enterprise cloning
Patch Deployment
Capture installation changes
Apply changes to target environments
Server Configuration Deployment
Extract server configuration settings
Migrate server configurations to target environments
This document summarizes a talk about Facebook's use of HBase for messaging data. It discusses how Facebook migrated data from MySQL to HBase to store metadata, search indexes, and small messages in HBase for improved scalability. It also outlines performance improvements made to HBase, such as for compactions and reads, and future plans such as cross-datacenter replication and running HBase in a multi-tenant environment.
The document discusses Facebook's use of HBase to store messaging data. It provides an overview of HBase, including its data model, performance characteristics, and how it was a good fit for Facebook's needs due to its ability to handle large volumes of data, high write throughput, and efficient random access. It also describes some enhancements Facebook made to HBase to improve availability, stability, and performance. Finally, it briefly mentions Facebook's migration of messaging data from MySQL to their HBase implementation.
The document discusses Facebook's use of HBase as the database storage engine for its messaging platform. It provides an overview of HBase, including its data model, architecture, and benefits like scalability, fault tolerance, and simpler consistency model compared to relational databases. The document also describes Facebook's contributions to HBase to improve performance, availability, and achieve its goal of zero data loss. It shares Facebook's operational experiences running large HBase clusters and discusses its migration of messaging data from MySQL to a de-normalized schema in HBase.
Webinar: Deep Dive on Apache Flink State - Seth Wiesman (Ververica)
Apache Flink is a world-class stateful stream processor that presents a huge variety of optional features and configuration choices to the user. Determining the optimal choice for any production environment and use case can be challenging. In this talk, we will explore and discuss the universe of Flink configuration with respect to state and state backends.
We will start with a closer look under the hood, at core data structures and algorithms, to build the foundation for understanding the impact of tuning parameters and the cost-benefit tradeoffs that come with certain features and options. In particular, we will focus on state backend choices (Heap vs RocksDB), tuning checkpointing (incremental checkpoints, ...) and recovery (local recovery), serializers, and Apache Flink's new state migration capabilities.
This document provides an overview of WebLogic Server topology, configuration, and administration. It describes key concepts such as domains, servers, clusters, and configuration files. It also discusses administration tools for configuring and managing WebLogic domains including the Configuration Wizard, Administration Console, and WLST scripting tool. The Configuration Wizard is a GUI tool for creating domains from templates, while the Administration Console is a browser-based interface for ongoing domain administration.
This document discusses WebLogic server domains, clusters, and high availability configurations. It defines domains as logically related groups of WebLogic servers managed from a single configuration, and notes they contain servers and server clusters. It describes the administration server's role in central configuration and deployment, and managed servers which host applications. It explains how server clusters provide scalability and high availability through load balancing, failover, and replication across multiple servers.
Omid: Efficient Transaction Mgmt and Processing for HBase (DataWorks Summit)
This document discusses Omid, a system for providing efficient transaction management and incremental processing for HBase. Omid implements an optimistic concurrency control model called snapshot isolation without locking. It has a simple API based on Java Transaction API and HBase API. Omid's architecture involves a centralized server for transaction metadata coordination and replication of metadata to HBase clients. It uses BookKeeper for fault tolerance. An example application described is performing TF-IDF indexing of tweets incrementally using Omid transactions.
Apache CloudStack is open source software for building public, private, and hybrid Infrastructure as a Service (IaaS) clouds. It allows users to provision virtual servers, storage, and networking resources through a web interface, and provides APIs for management and integration with other systems. It supports various hypervisors including KVM, Xen, VMware, and Oracle VM VirtualBox, as well as storage systems like iSCSI, NFS, and object storage.
Rigorous and Multi-tenant HBase Performance Measurement (DataWorks Summit)
The document discusses techniques for rigorously measuring HBase performance in both standalone and multi-tenant environments. It begins with an overview of HBase and the Yahoo! Cloud Serving Benchmark (YCSB) for evaluating databases. It then discusses best practices for cluster setup, data loading, and benchmarking techniques like warming the cache, setting target throughput, and using appropriate workloads. Finally, it covers challenges in measuring HBase performance when used alongside other frameworks like MapReduce and Solr in a multi-tenant setting.
Rigorous and Multi-tenant HBase Performance (Cloudera, Inc.)
The document discusses techniques for rigorously measuring Apache HBase performance in both standalone and multi-tenant environments. It introduces the Yahoo! Cloud Serving Benchmark (YCSB) and best practices for cluster setup, workload generation, data loading, and measurement. These include pre-splitting tables, warming caches, setting target throughput, and using appropriate workload distributions. The document also covers challenges in achieving good multi-tenant performance across HBase, MapReduce and Apache Solr.
[Hic2011] Using Hadoop/Lucene/Solr for large-scale search, by Systex (James Chen)
This document discusses using Hadoop/MapReduce with Solr/Lucene for large scale distributed search. It begins with an introduction to the speaker and his experience with Hadoop. The agenda then outlines discussing why search big data, an overview of Lucene, Solr and Zookeeper, distributed searching and indexing with Hadoop, and a case study on web log categorization.
The document summarizes CloudStack architecture plans for the future. It discusses moving to management server clusters per availability zone rather than per region. It also discusses using an object storage system for templates and snapshots rather than a separate NFS server. Finally, it discusses a possible future model where CloudStack manages existing virtualization clusters rather than deploying and managing its own system VMs.
1. HBase Storage Internals, present and future!
Matteo Bertozzi | @Cloudera
March 2013 - Hadoop Summit Europe
2. What is HBase?
• Open source Storage Manager that provides random read/write on top of HDFS
• Provides Tables with a “Key:Column/Value” interface
• Dynamic columns (qualifiers), no schema needed
• “Fixed” column groups (families)
• table[row:family:column] = value
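As a rough mental model (an illustration only, not HBase's implementation), the “Key:Column/Value” interface behaves like a sorted map of sorted maps:

```java
import java.util.TreeMap;

// Conceptual sketch of table[row:family:column] = value:
// rows sorted by key, each row holding "family:column" -> value.
// (Timestamps/versions are omitted for brevity.)
public class ConceptualTable {
    private final TreeMap<String, TreeMap<String, byte[]>> rows = new TreeMap<>();

    public void put(String row, String familyColumn, byte[] value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>()).put(familyColumn, value);
    }

    public byte[] get(String row, String familyColumn) {
        TreeMap<String, byte[]> cols = rows.get(row);
        return cols == null ? null : cols.get(familyColumn);
    }
}
```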
3. HBase ecosystem
• Apache Hadoop HDFS for data durability and reliability (Write-Ahead Log)
• Apache ZooKeeper for distributed coordination
• Apache Hadoop MapReduce built-in support for running MapReduce jobs
[Diagram: an App and MapReduce jobs on top of HBase, which sits on ZooKeeper and HDFS]
5. Master, Region Servers and Regions
[Diagram: a Client talks to ZooKeeper, the Master, and the Region Servers, each hosting a set of Regions on top of HDFS]
• Region Server
  • Server that contains a set of Regions
  • Responsible to handle reads and writes
• Region
  • The basic unit of scalability in HBase
  • Subset of the table’s data
  • Contiguous, sorted range of rows stored together
• Master
  • Coordinates the HBase Cluster
  • Assignment/Balancing of the Regions
  • Handles admin operations (create/delete/modify table, …)
6. Autosharding and .META. table
• A Region is a Subset of the table’s data
• When there is too much data in a Region…
  • a split is triggered, creating 2 regions
• The association “Region -> Server” is stored in a System Table
• The Location of .META. is stored in ZooKeeper

  Table       Start Key   Region ID   Region Server
  testTable   Key-00      1           machine01.host
  testTable   Key-31      2           machine03.host
  testTable   Key-65      3           machine02.host
  testTable   Key-83      4           machine01.host
  …           …           …           …
  users       Key-AB      1           machine03.host
  users       Key-KG      2           machine02.host

[Diagram: machine01 hosts Region 1 and Region 4 of testTable; machine02 hosts Region 3 of testTable and Region 1 of users; machine03 hosts Region 2 of testTable and Region 2 of users]
7. The Write Path – Create a New Table
• The client asks the master to create a new Table
  • hbase> create ‘myTable’, ‘cf’
• The Master
  • Stores the Table information (“schema”)
  • Creates Regions based on the key-splits provided
    • if no splits are provided, one single region by default
  • Assigns the Regions to the Region Servers (“enable”)
  • The assignment Region -> Server is written to a system table called “.META.”
[Diagram: Client calls createTable() on the Master, which stores the Table “Metadata” and assigns the Regions to the Region Servers]
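As an illustrative client-side sketch of this step, using the 0.94-era Java API contemporary with this deck (‘myTable’, ‘cf’, and the split keys are example values):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("myTable");
        desc.addFamily(new HColumnDescriptor("cf"));

        // Pre-split into 3 regions: (-inf, Key-31), [Key-31, Key-65), [Key-65, +inf)
        byte[][] splitKeys = { Bytes.toBytes("Key-31"), Bytes.toBytes("Key-65") };
        admin.createTable(desc, splitKeys);
        admin.close();
    }
}
```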
8. The Write Path – “Inserting” data
• table.put(row-key:family:column, value)
• The client asks ZooKeeper the location of .META.
• The client scans .META. searching for the Region Server responsible to handle the Key
• The client asks that Region Server to insert/update/delete the specified key/value
• The Region Server processes the request and dispatches it to the Region responsible to handle the Key
  • The operation is written to a Write-Ahead Log (WAL)
  • …and the KeyValues added to the Store: “MemStore”
[Diagram: Client asks ZooKeeper “Where is .META.?”, scans .META. on its Region Server, then sends the KeyValue insert to the Region Server hosting the Key]
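A minimal client sketch of the put described above (the table, row, and column names are example values); the ZooKeeper and .META. lookups happen transparently inside the client library:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // The HTable client performs the .META. lookup behind the scenes,
        // then routes the Put to the right Region Server.
        HTable table = new HTable(conf, "myTable");
        Put put = new Put(Bytes.toBytes("row-key"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("column"), Bytes.toBytes("value"));
        table.put(put);
        table.close();
    }
}
```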
9. The Write Path – Append Only to Random R/W
• Files in HDFS are
  • Append-Only
  • Immutable once closed
• HBase provides Random Writes?
  • …not really from a storage point of view
  • KeyValues are stored in memory and written to disk on pressure
    • Don’t worry, your data is safe in the WAL!
    • (The Region Server can recover data from the WAL in case of crash)
  • But this allows sorting data by Key before writing to disk
• Deletes are like Inserts but with a “remove me flag”
[Diagram: a Region Server with its Regions, a WAL, and a MemStore flushing sorted KeyValues (Key0 – value 0 … Key5 – value 5) to Store Files (HFiles)]
10. The Read Path – “reading” data
• The client asks ZooKeeper the location of .META.
• The client scans .META. searching for the Region Server responsible to handle the Key
• The client asks that Region Server to get the specified key/value
• The Region Server processes the request and dispatches it to the Region responsible to handle the Key
  • MemStore and Store Files are scanned to find the key
[Diagram: Client asks ZooKeeper “Where is .META.?”, scans .META. on its Region Server, then sends the Get to the Region Server hosting the Key]
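And the matching read, again a minimal sketch with example names:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GetExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "myTable");
        // Server-side, this triggers the MemStore + Store File scan
        // described above.
        Result result = table.get(new Get(Bytes.toBytes("row-key")));
        byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("column"));
        System.out.println(Bytes.toString(value));
        table.close();
    }
}
```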
11. The Read Path – Append Only to Random R/W
• On each flush a new file is created
• Each file has KeyValues sorted by key
• Two or more files can contain the same key (updates/deletes)
• To find a Key you need to scan all the files
  • …with some optimizations
  • Filter Files by Start/End Key
  • Having a bloom filter on each file
[Diagram: three store files with overlapping sorted keys, e.g. an older file holding Key0 – value 0.0 … Key9 – value 9.0 and a newer one holding Key0 – value 0.1 and Key5 – [deleted]]
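A toy sketch of the start/end-key file filtering mentioned above (purely illustrative; the class and method names are hypothetical): only files whose key range covers the requested key need to be scanned, and a per-file bloom filter can then prune further.

```java
import java.util.ArrayList;
import java.util.List;

class StoreFileRange {
    final String firstKey, lastKey; // the file's smallest and largest keys

    StoreFileRange(String firstKey, String lastKey) {
        this.firstKey = firstKey;
        this.lastKey = lastKey;
    }

    // A file can only contain the key if it falls inside [firstKey, lastKey].
    boolean mayContain(String key) {
        return firstKey.compareTo(key) <= 0 && lastKey.compareTo(key) >= 0;
    }

    static List<StoreFileRange> candidates(List<StoreFileRange> files, String key) {
        List<StoreFileRange> out = new ArrayList<>();
        for (StoreFileRange f : files) {
            if (f.mayContain(key)) out.add(f);
        }
        return out; // only these files need an actual lookup
    }
}
```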
13. HFile Format
• Only Sequential Writes, just append(key, value)
• Large Sequential Reads are better
• Why group records in blocks?
  • Easy to split
  • Easy to read
  • Easy to cache
  • Easy to index (if records are sorted)
  • Block Compression (snappy, lz4, gz, …)
[Diagram: an HFile as a sequence of blocks: Header, record blocks (Record 0 … Record N), Index 0 … Index N, Trailer; each Key/Value record is Key Length : int, Value Length : int, Key : byte[], Value : byte[]]
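A simplified sketch of serializing one record in the layout shown above (the real HFile writer adds block headers, indexes, compression, and a trailer on top of this):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RecordWriterSketch {
    // Encodes a record as (key length : int)(value length : int)(key)(value),
    // which is then appended sequentially to the file.
    public static byte[] encode(byte[] key, byte[] value) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(key.length);
        out.writeInt(value.length);
        out.write(key);
        out.write(value);
        out.flush();
        return buf.toByteArray();
    }
}
```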
14. Data Block Encoding
• “Be aware of the data”
• Block Encoding allows compressing the Key based on what we know
  • Keys are sorted… prefix may be similar in most cases
  • One file contains keys from one Family only
  • Timestamps are “similar”, we can store the diff
  • Type is “put” most of the time…
[Diagram: the “on-disk” KeyValue layout: Row Length : short, Row : byte[], Family Length : byte, Family : byte[], Qualifier : byte[], Timestamp : long, Type : byte]
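A toy illustration of the idea behind prefix-style key encoding (not HBase's actual DataBlockEncoding code): since keys in a block are sorted, each key can be stored as a shared-prefix length plus its suffix.

```java
public class PrefixEncodingSketch {
    // Length of the byte prefix two consecutive keys share.
    static int commonPrefix(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length), i = 0;
        while (i < n && a[i] == b[i]) i++;
        return i;
    }

    public static void main(String[] args) {
        byte[] prev = "row-0001/cf:col".getBytes();
        byte[] curr = "row-0002/cf:col".getBytes();
        int shared = commonPrefix(prev, curr);
        // Store only (shared, suffix) instead of the full key.
        System.out.println("shared=" + shared
            + " suffix=" + new String(curr, shared, curr.length - shared));
    }
}
```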
16. Compactions
• Reduces the number of files to look into during a scan
• Removes duplicated keys (updated values)
• Removes deleted keys
• Creates a new file by merging the content of two or more files
• Removes the old files
[Diagram: two store files with overlapping keys (e.g. Key0 – value 0.0 vs Key0 – value 0.1, and Key5 – [deleted]) merged into one file containing only the latest live values, Key0 – value 0.1 … Key9 – value 9.0]
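A toy merge illustrating what a compaction achieves (illustrative only; real compactions stream sorted files through a heap rather than materializing whole maps in memory):

```java
import java.util.TreeMap;

public class CompactionSketch {
    // Toy tombstone marker standing in for HBase's delete markers.
    static final byte[] DELETED = new byte[0];

    /** Merges store files (oldest first); newest value wins, deletes are dropped. */
    @SafeVarargs
    static TreeMap<String, byte[]> compact(TreeMap<String, byte[]>... files) {
        TreeMap<String, byte[]> merged = new TreeMap<>();
        for (TreeMap<String, byte[]> file : files) {
            merged.putAll(file); // newer files overwrite older values
        }
        merged.values().removeIf(v -> v == DELETED); // drop deleted keys
        return merged;
    }
}
```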
17. Pluggable Compactions
• Try different algorithms
• Be aware of the data
  • Time Series? I guess no updates from the 80s
• Be aware of the requests
• Compact based on statistics
  • which files are hot and which are not
  • which keys are hot and which are not
[Diagram: the same store-file merge example as the previous slide]
18. Snapshots
Zero-Copy Snapshots and Table Clones
19. What Is a Snapshot?
• “a Snapshot is not a copy of the table”
• a Snapshot is a set of metadata information
  • The table “schema” (column families and attributes)
  • The Regions information (start key, end key, …)
  • The list of Store Files
  • The list of active WALs
[Diagram: the Master and two Region Servers coordinating through ZooKeeper; each Region Server holds Regions, a WAL, and Store Files (HFiles)]
20. How Taking a Snapshot Works
• The master orchestrates the Region Servers
  • the communication is done via ZooKeeper
  • using a “2-phase commit like” transaction (prepare/commit)
• Each Region Server is responsible to take its “piece” of the snapshot
  • For each Region, store the metadata information needed
  • (list of Store Files, WALs, region start/end keys, …)
[Diagram: the Master and two Region Servers coordinating through ZooKeeper; each Region Server holds Regions, a WAL, and Store Files (HFiles)]
21. Cloning a Table from a Snapshot
• hbase> clone_snapshot ‘snapshotName’, ‘tableName’
• Creates a new table with the data “contained” in the snapshot
  • No data copies involved
  • HFiles are immutable, and shared between tables and snapshots
• You can insert/update/remove data from the new table
  • No repercussions on the snapshot, original tables or other cloned tables
22. Compactions & Archiving
• HFiles are immutable, and shared between tables and snapshots
• On compaction or table deletion, files are removed from disk
• If one of these files is referenced by a snapshot or a cloned table
  • The file is moved to an “archive” directory
  • And deleted later, when there are no references to it
24. 0.96 is coming up
• Moving RPC to Protobuf
• Allows rolling upgrades with no surprises
• HBase Snapshots
• Pluggable Compactions
• Remove -ROOT-
• Table Locks
25. 0.98 and Beyond
• Transparent Table/Column-Family Encryption
• Cell-level security
• Multiple WALs per Region Server (MTTR)
• Data Placement Awareness (MTTR)
• Data Type Awareness
• Compaction policies, based on the data needs
• Managing blocks directly (instead of files)
26. Questions?
Matteo Bertozzi | @Cloudera