HBase In Action - Chapter 10: Operations
3. 09/24/15
10.1 Monitoring Your Cluster
A critical aspect of any production system is the ability of its operators to monitor its state and behavior. In this section, we'll talk about how HBase exposes metrics and the frameworks that are available to you to capture those metrics and use them to make sense of how your cluster is performing.
How HBase exposes metrics
Collecting and graphing the metrics
The metrics HBase exposes
Application-side monitoring
10.1.1 How HBase exposes metrics
The metrics framework is another of the many ways that HBase depends on Hadoop. HBase is tightly integrated with Hadoop and uses Hadoop's underlying metrics framework to expose its metrics. The framework works by outputting metrics through a context implementation that implements the MetricsContext interface; the Ganglia context and the File context are two such implementations. HBase also exposes metrics using Java Management Extensions (JMX).
10.1.2 Collecting and graphing the metrics
Metrics solutions involve two aspects: collection and graphing. Collection frameworks collect the metrics being generated by the system that is being monitored and store them efficiently so they can be used later. Graphing tools use the data captured and stored by collection frameworks and make it easily consumable for the end user in the form of graphs and pretty pictures. Numerous collection and graphing tools are available, but not all of them are tightly integrated with how Hadoop and HBase expose metrics. The two integration points covered here are Ganglia and JMX.
GANGLIA
Ganglia (http://ganglia.sourceforge.net/) is a distributed monitoring framework designed to monitor clusters. It was developed at UC Berkeley and open-sourced. To configure HBase to output metrics to Ganglia, set the parameters in the hadoop-metrics.properties file, which resides in the $HBASE_HOME/conf/ directory.
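As a concrete sketch, the Ganglia context can be configured along these lines; the collector host and port (ganglia.example.com:8649) are assumptions you'd adjust for your own deployment, and the context class depends on your Ganglia version:

```properties
# $HBASE_HOME/conf/hadoop-metrics.properties (illustrative)
# Emit HBase metrics to a Ganglia 3.1 collector every 10 seconds.
hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
hbase.period=10
hbase.servers=ganglia.example.com:8649
```

GangliaContext31 targets Ganglia 3.1's wire format; older 3.0 installations use GangliaContext instead.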
JMX
Several open source tools, such as Cacti and OpenTSDB, can be used to collect metrics via JMX. JMX metrics can also be viewed as JSON from the Master and RegionServer web UIs:
JMX metrics from the Master: http://master_ip_address:port/jmx
JMX metrics from a particular RegionServer: http://region_server_ip_address:port/jmx
The default port for the Master is 60010 and for the RegionServer is 60030.
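Because the /jmx endpoint returns plain JSON, a thin script can pull out just the beans you care about. The following is a minimal sketch: the bean name and attribute names in the sample payload are hypothetical, and in practice you'd fetch the document from http://master_ip_address:60010/jmx rather than use a canned string:

```python
import json

def extract_beans(jmx_doc: str, name_fragment: str) -> dict:
    """Map MBean name -> bean dict for beans whose name contains name_fragment.

    The /jmx endpoint returns an object with a top-level "beans" array;
    each entry carries a "name" key plus its metric attributes.
    """
    beans = json.loads(jmx_doc).get("beans", [])
    return {b["name"]: b for b in beans if name_fragment in b["name"]}

# Trimmed, hypothetical payload standing in for a live /jmx response.
sample = """{"beans": [
  {"name": "hadoop:service=RegionServer,name=RegionServerStatistics",
   "requests": 412, "blockCacheHitRatio": 87}
]}"""

for name, bean in extract_beans(sample, "RegionServerStatistics").items():
    print(f"{name}: requests={bean['requests']}")
```

The same filtering works unchanged on the much larger bean list a real Master or RegionServer returns.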
FILE BASED
HBase can also be configured to output metrics into a flat file. File-based metrics aren't a useful way of recording metrics because they're hard to consume thereafter.
10.1.3 The metrics HBase exposes
The Master and RegionServers expose metrics. The metrics of interest depend on the workload the cluster is sustaining, and we'll categorize them accordingly.
GENERAL METRICS
HDFS throughput and latency
HDFS usage
Underlying disk throughput
Network throughput and latency from each node
WRITE-RELATED METRICS
To understand the system state during writes, the metrics of interest are the ones that are collected as data is written into the system.
READ-RELATED METRICS
Reads are different from writes, and so are the metrics you should monitor to understand them.
10.1.4 Application-side monitoring
In a production environment, we recommend that you add to the system-level monitoring that Ganglia and other tools provide and also monitor how HBase looks from your application's perspective:
Put performance as seen by the client (the application) for every RegionServer
Get performance as seen by the client for every RegionServer
Scan performance as seen by the client for every RegionServer
Connectivity to all RegionServers
Network latencies between the application tier and the HBase cluster
Number of concurrent clients connected to HBase at any point in time
Connectivity to ZooKeeper
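The first three items above boil down to timing your own client calls. A minimal sketch of such a recorder follows; the operation labels and the percentile choice are illustrative, and in real use you'd wrap actual Put/Get/Scan calls instead of the stand-in lambda:

```python
import time
from collections import defaultdict

class ClientLatencyMonitor:
    """Record per-operation latencies as seen by the application."""

    def __init__(self):
        self.samples = defaultdict(list)  # op name -> list of latencies in ms

    def timed(self, op, fn, *args, **kwargs):
        """Run fn, record its wall-clock latency under op, return its result."""
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            self.samples[op].append((time.perf_counter() - start) * 1000.0)

    def percentile(self, op, p):
        """Nearest-rank percentile of recorded latencies, or None if no samples."""
        data = sorted(self.samples[op])
        if not data:
            return None
        idx = min(len(data) - 1, int(p / 100.0 * len(data)))
        return data[idx]

monitor = ClientLatencyMonitor()
# Stand-in for a real HBase put; any callable works.
monitor.timed("put", lambda: time.sleep(0.001))
print("p95 put latency (ms):", monitor.percentile("put", 95))
```

Feeding these client-side numbers into the same graphing stack as your Ganglia metrics lets you line up application-visible latency against server-side behavior.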
10.2 Performance of your HBase cluster
Performance of any database is measured in terms of the response times of the operations that it supports. This is important to measure in the context of your application so you can set the right expectations for users. To make sure your HBase cluster is performing within the expected SLAs, you must test performance thoroughly and tune the cluster to extract the maximum performance you can get out of it.
Performance testing
What impacts HBase's performance?
Tuning dependency systems
Tuning HBase
10.2.1 Performance testing
There are different ways you can test the performance of your HBase cluster.
PERFORMANCEEVALUATION TOOL—BUNDLED WITH HBASE
HBase ships with a tool called PerformanceEvaluation, which you can use to evaluate the performance of your HBase cluster in terms of various operations. Examples:
To run a single evaluation client:
$ bin/hbase org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 1
$ hbase org.apache.hadoop.hbase.PerformanceEvaluation --rows=10 sequentialWrite 1
10.2.1 Performance testing (cont'd)
YCSB—YAHOO! CLOUD SERVING BENCHMARK
YCSB is the closest we have come to having a standard benchmarking tool that can be used
to measure and compare the performance of different distributed databases.
YCSB is available from the project’s GitHub repository
(http://github.com/brianfrankcooper/YCSB/).
Before running the workload, you need to create the HBase table YCSB will
write to. You can do that from the shell:
hbase(main):002:0> create 'mytable', 'myfamily'
$ bin/ycsb load hbase -P workloads/workloada -p columnfamily=myfamily -p table=mytable
You can do all sorts of fancy stuff with YCSB workloads, including configuring
multiple clients, configuring multiple threads, and running mixed workloads
with different statistical distributions of the data.
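As a sketch, a run phase with multiple client threads might look like the following. The thread count and target throughput are illustrative assumptions, and the command is only echoed so you can inspect it before running it against a real cluster:

```shell
# Sketch of a YCSB run phase; the -threads and -target values are
# illustrative assumptions, not recommendations from the text.
CMD="bin/ycsb run hbase -P workloads/workloada -p columnfamily=myfamily -p table=mytable -threads 8 -target 1000"
# Echoed rather than executed, so nothing runs without a cluster.
echo "$CMD"
```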
10.2.2 What impacts HBase’s performance?
HBase is a distributed database tightly coupled with Hadoop,
which makes its performance susceptible to the entire stack
under it (figure 10.8).
10.2.3 Tuning dependency systems
Tuning an HBase cluster to extract maximum performance involves tuning
all dependencies.
HARDWARE CHOICES
NETWORK CONFIGURATION
OPERATING SYSTEM
LOCAL FILE SYSTEM
HDFS
10.2.4 Tuning HBase
Tuning an HBase cluster typically involves tuning multiple different
configuration parameters to suit the workload that you plan to put on the
cluster.
Random-read-heavy
Sequential-read-heavy
Write-heavy
Mixed
Each of these workloads demands a different kind of configuration tuning.
RANDOM-READ-HEAVY : For random-read-heavy workloads, effective use of the
cache and better indexing will get you higher performance.
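For instance, one common knob for random reads is the fraction of RegionServer heap given to the block cache. The fragment below is a minimal sketch written to a temporary file so it runs anywhere; the 0.4 value is an assumption to tune for your own workload, not a prescription from the text:

```shell
# Sketch: an hbase-site.xml override for a random-read-heavy workload.
# The value 0.4 is an illustrative assumption, to be tuned per workload.
cat > /tmp/random-read-overrides.xml <<'EOF'
<property>
  <name>hfile.block.cache.size</name>
  <!-- fraction of RegionServer heap given to the read (block) cache -->
  <value>0.4</value>
</property>
EOF
cat /tmp/random-read-overrides.xml
```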
SEQUENTIAL-READ-HEAVY : For sequential-read-heavy workloads, the read cache
doesn’t buy you a lot; chances are you’ll be hitting the disk more often
than not, unless the sequential reads are small in size and limited to a
particular key range.
WRITE-HEAVY : Write-heavy workloads need different tuning than read-heavy
ones. The cache no longer plays an important role. Writes always go into
the MemStore and are flushed to form new HFiles, which are later compacted.
The way to get good write performance is to avoid flushing, compacting, or
splitting too often, because the I/O load goes up during those operations
and slows the system.
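Flushing and splitting less often usually comes down to raising a few size thresholds. The fragment below is an illustrative sketch; both values are assumptions to be tuned, not recommendations from the text:

```shell
# Sketch: hbase-site.xml overrides for a write-heavy workload, aimed at
# flushing and splitting less often. Both values are assumptions to tune.
cat > /tmp/write-heavy-overrides.xml <<'EOF'
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <!-- larger MemStore flush threshold: fewer, bigger flushes (256 MB) -->
  <value>268435456</value>
</property>
<property>
  <name>hbase.hregion.max.filesize</name>
  <!-- larger maximum region size: fewer splits under heavy writes (10 GB) -->
  <value>10737418240</value>
</property>
EOF
grep -c '<property>' /tmp/write-heavy-overrides.xml
```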
MIXED : With completely mixed workloads, tuning becomes slightly trickier.
You have to tweak a mix of the parameters described earlier to achieve the
optimal combination. Iterate over various combinations, and run performance
tests to see where you get the best results.
Compression
Rowkey design
Major compactions
RegionServer handler count
10.3 Cluster management
During the course of running a production system, management
tasks need to be performed at different stages.
Things like starting or stopping the cluster, upgrading the OS on the nodes,
replacing bad hardware, and backing up data are important tasks and need to be
done right to keep the cluster running smoothly.
This section highlights some of the important tasks you may need to
perform and teaches how to do them.
10.3.1 Starting and stopping HBase
The order in which the HBase daemons are stopped and started
matters only to the extent that the dependency systems (HDFS and
ZooKeeper) need to be up before HBase is started and should be
shut down only after HBase has shut down.
SCRIPTS : in the $HBASE_HOME/bin directory
CENTRALIZED MANAGEMENT : Cluster-management frameworks like Puppet
and Chef can be used to manage the starting and stopping of daemons from a
central location.
10.3.2 Graceful stop and decommissioning nodes
When you need to shut down daemons on individual servers for any
management purpose (upgrading, replacing hardware, and so on), you need
to ensure that the rest of the cluster keeps working fine and there is
minimal outage as seen by client applications.
The graceful_stop.sh script follows these steps (in order) to gracefully stop a RegionServer:
Disable the region balancer.
Move the regions off the RegionServer, and randomly assign them to other servers in the cluster.
Stop the REST and Thrift services if they’re active.
Stop the RegionServer process.
$ bin/graceful_stop.sh
Usage: graceful_stop.sh [--config <conf-dir>] [--restart] [--reload]
[--thrift] [--rest] <hostname>
thrift If we should stop/start thrift before/after the hbase stop/start
10.3.3 Adding nodes
As your application gets more successful or more use cases crop up, chances
are you’ll need to scale up your HBase cluster.
It could also be that you’re replacing a node for some reason. The process to
add a node to the HBase cluster is the same in both cases.
10.3.4 Rolling restarts and upgrading
It’s not rare to patch or upgrade Hadoop and HBase releases in running
clusters.
In production systems, upgrades can be tricky. Often, it isn’t possible to
take downtime on the cluster to do upgrades.
But not all upgrades are major releases that require downtime.
To do upgrades without taking a downtime, follow these steps:
Deploy the new HBase version to all nodes in the cluster, including the new ZooKeeper if
that needs an update as well.
Turn off the balancer process. One by one, gracefully stop the RegionServers and bring them
back up.
Restart the HBase Masters one by one.
If ZooKeeper requires a restart, restart all the nodes in the quorum one by one.
Upgrade the clients.
You can use the same steps to do a rolling restart for any other purpose as well.
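The per-RegionServer part of those steps can be sketched as a loop over graceful_stop.sh with its --restart and --reload flags. The hostnames below are placeholders, and the commands are echoed rather than executed:

```shell
# Sketch of a rolling-restart loop over the RegionServers.
# Hostnames are placeholders; drop the echo to actually run each command.
for host in rs1.example.com rs2.example.com rs3.example.com; do
  echo "bin/graceful_stop.sh --restart --reload ${host}"
done
```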
10.3.5 bin/hbase and the HBase shell
The script runs the Java class associated with the command you pass to it.
We’ll focus on the tools group of commands (shown in bold). To get a
description for any command, you can run help 'command_name' in the shell.
ZK_DUMP : You can use the zk_dump
command to find out the current state
of ZooKeeper:
STATUS COMMAND : You can use the
status command to determine the
status of the cluster.
COMPACTIONS
BALANCER
SPLITTING TABLES OR REGIONS
ALTERING TABLE SCHEMAS
TRUNCATING TABLES
10.3.6 Maintaining consistency—hbck
HBase comes with a tool called hbck (or HBaseFsck) that checks for the
consistency and integrity of the HBase cluster.
hbck recently underwent an overhaul, and the resulting tool was nicknamed uberhbck.
Inconsistencies can occur at two levels:
Region inconsistencies
Table inconsistencies
Hbck performs two primary functions: detect inconsistencies and fix
inconsistencies.
DETECTING INCONSISTENCIES :
$ $HBASE_HOME/bin/hbase hbck
$ $HBASE_HOME/bin/hbase hbck -details
FIXING INCONSISTENCIES :
Incorrect assignments
Missing or extra regions
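As a hedged sketch, the repair modes are selected with flags on the hbck command line; the two shown below correspond to the two cases above. They are echoed rather than run, since hbck needs a live cluster:

```shell
# Sketch of hbck repair invocations; echoed so nothing runs without a cluster.
echo "bin/hbase hbck -fixAssignments   # repair incorrect region assignments"
echo "bin/hbase hbck -fixMeta          # repair missing or extra regions in meta"
```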
10.3.7 Viewing HFiles and HLogs
HBase provides utilities to
examine the HFiles and HLogs
(WAL) that are being created at
write time.
The HLogs are located in the .logs
directory in the HBase root
directory on the file system. You
can examine them by using the
hlog command of the bin/hbase
script, like this:
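A minimal sketch of such an invocation follows; the RegionServer directory and log file names are placeholders, since HBase generates them per server:

```shell
# Sketch: dumping a write-ahead log with the hlog command.
# <regionserver-dir> and <log-file> are placeholders for generated names.
echo "bin/hbase hlog /hbase/.logs/<regionserver-dir>/<log-file>"
```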
The script has a similar utility for
examining the HFiles. To print the
help for the command, run it
without any arguments.
The output contains a lot of
information about the HFile, and
other options can be used to get
different bits of information.
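For example, assuming the -m (print meta block) and -p (print key/values) options of the hfile command, invocations might be sketched as follows; the table, region, family, and file names are placeholders:

```shell
# Sketch: examining an HFile's metadata and contents; the path
# components are placeholders for names generated by HBase.
echo "bin/hbase hfile -m -f /hbase/mytable/<region>/<family>/<hfile>"
echo "bin/hbase hfile -p -f /hbase/mytable/<region>/<family>/<hfile>"
```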
10.3.8 Presplitting tables
Table splitting during heavy write
loads can result in increased latencies.
Splitting is typically followed by
regions moving around to balance the
cluster, which adds to the overhead.
Presplitting tables is also desirable for
bulk loads, which we cover later in the
chapter. If the key distribution is well
known, you can split the table into the
desired number of regions at the time
of table creation.
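From the HBase shell, a presplit create can be sketched like this; the split keys 'g', 'm', and 't' are illustrative assumptions and should come from your actual key distribution:

```shell
# Sketch: a presplit table creation, echoed as the shell command you
# would type; the split keys are illustrative assumptions.
echo "create 'mytable', 'myfamily', SPLITS => ['g', 'm', 't']"
```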
10.4 Backup and replication
Inter-cluster replication
Backup using MapReduce jobs
Backing up the root directory
10.4.2 Backup using MapReduce jobs
MapReduce jobs can be configured to use HBase tables as the source and
sink, as we covered in chapter 3. This ability comes in handy for
point-in-time backups of tables: scan through them and output the data
to flat files or other HBase tables.
This is different from inter-cluster replication, which the last section
described.
Inter-cluster replication is a push mechanism.
Running MapReduce jobs over tables is a pull mechanism
EXPORT/IMPORT
The prebundled Export MapReduce job can be used to export data from HBase tables into flat files.
That data can later be imported into another HBase table on the same or a different cluster using the Import job.
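A sketch of that round trip follows, with placeholder table names and HDFS paths; the commands are echoed rather than run, since both jobs need a live cluster:

```shell
# Sketch: export a table to flat files, then import into another table.
# Table names and the /backup path are placeholders.
echo "bin/hbase org.apache.hadoop.hbase.mapreduce.Export mytable /backup/mytable"
echo "bin/hbase org.apache.hadoop.hbase.mapreduce.Import mytable_copy /backup/mytable"
```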
ADVANCED IMPORT WITH IMPORTTSV
ImportTsv is more feature-rich.
It allows you to load data from newline-terminated, delimited text files.
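A sketch of an ImportTsv invocation, mapping the first column of a tab-separated file to the rowkey and the second to myfamily:col1; the table name, column name, and input path are placeholders:

```shell
# Sketch: loading tab-separated data with ImportTsv; names and paths are
# placeholders. HBASE_ROW_KEY marks which input column becomes the rowkey.
echo "bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,myfamily:col1 mytable /input/data.tsv"
```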
10.4.3 Backing up the root directory
HBase stores its data in the directory specified by the hbase.rootdir
configuration property. This directory contains all the region information,
all the HFiles for the tables, as well as the WALs for all RegionServers.
When an HBase cluster is up and running, several things are going on:
MemStore flushes, region splits, compactions, and so on.
But if you stop the HBase daemons cleanly, the MemStore is flushed and
the root directory isn’t altered by any process.
10.5 Summary
Production-quality operations of any software system are
learned over time. This chapter covered several aspects of
operating HBase in production with the intention of getting
you started on the path to understanding the concepts.
New tools and scripts will likely be developed by HBase users, and you will benefit from them.
The first aspect of operations is instrumenting and monitoring the system.
From monitoring, the chapter transitioned into talking about performance
testing, measuring performance, and tuning HBase for different kinds of
workloads.
From there we covered a list of common management tasks and how and
when to do them.
Mastering HBase operations requires an understanding of the internals and
experience gained by working with the system.
Editor's Notes
When issues happen, the last thing an operator wants to do is to sift through GBs and TBs of logs to make sense of the state of the system and the root cause of the issue. Not many people are champions at reading thousands of log lines across multiple servers to make sense of what’s going on. That’s where recording detailed metrics comes into play. Many things are happening in a production-quality database like HBase, and each of them can be measured in different ways. These measurements are exposed by the system and can be captured by external frameworks that are designed to record them and make them available to operators in a consumable fashion.
We recommend that you set up your full metrics collection, graphing, and monitoring stack even in the prototyping stage of your HBase adoption. This will enable you to become familiar with the various aspects of operating HBase and will make the transition to production much smoother.
One interesting metric to keep an eye on is the CPU I/O wait percentage. This indicates
the amount of time the CPU spends waiting for disk I/O and is a good indicator of
whether your system is I/O bound.
The limitation of this testing utility is that you can’t run mixed workloads without
coding it up yourself.
The test has to be one of the bundled ones, and they have to be
run individually as separate runs. If your workload consists of Scans and Gets and Puts
happening at the same time, this tool doesn’t give you the ability to truly test your cluster
by mixing it all up. That brings us to our next testing utility.
Once YCSB is compiled, put your HBase cluster’s configuration in hbase/src/main/conf/hbase-site.xml. You only need to put the hbase.zookeeper.quorum property in the config file so YCSB can use it as the entry point for the cluster. Now you’re ready to run workloads to test your cluster. YCSB comes with a few sample workloads that you can find in the workloads directory.
Performance is affected by everything from the underlying hardware that makes
up the boxes in the cluster to the network connecting them to the OS (specifically the
file system) to the JVM to HDFS. The state of the HBase system matters too. For
instance, performance is different during a compaction or during MemStore flushes
compared to when nothing is going on in the cluster. Your application’s performance
depends on how it interacts with HBase, and your schema design plays an integral role
as much as anything else.
When looking at HBase performance, all of these factors matter; and when you tune
your cluster, you need to look into all of them. Going into tuning each of those layers
is beyond the scope of this text. We covered JVM tuning (garbage collection specifically)
in chapter 9. We’ll discuss some key aspects of tuning your HBase cluster next.
Although Import is a simple complement to Export, ImportTsv is more feature-rich. It allows you to load data from newline-terminated, delimited text files. Most commonly this is a tab-separated format, but the delimiter is configurable (for loading comma-separated files). You specify a destination table and provide a mapping from columns in your data file(s) to columns in HBase.