HBase is a distributed, column-oriented database built on top of HDFS that can handle large datasets across a cluster. Data is modeled as a multidimensional sorted map spread across nodes. Writes go first to a write-ahead log and to memory, then are flushed to disk files and compacted for efficiency. Client applications access HBase programmatically through APIs rather than SQL. Map-reduce jobs on HBase use input, mapper, reducer, and output classes to process table data in parallel across regions.
A tutorial presentation based on hadoop.apache.org documentation.
I gave this presentation at Amirkabir University of Technology as a Teaching Assistant for the Cloud Computing course of Dr. Amir H. Payberah in the spring 2015 semester.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system.
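The split, map, sort, and reduce flow described above can be sketched as a tiny in-memory word count in plain Java (a toy illustration, not the Hadoop API; the class and method names here are invented for the example):

```java
import java.util.*;
import java.util.stream.*;

public class MiniMapReduce {
    // "Map" phase: each input chunk is processed independently, emitting (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        return Arrays.stream(chunk.trim().split("\\s+"))
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // The framework sorts/groups map output by key; "reduce" then sums each group.
    static SortedMap<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            result.merge(p.getKey(), p.getValue(), Integer::sum);
        return result;
    }

    public static SortedMap<String, Integer> wordCount(List<String> chunks) {
        // In real MapReduce each chunk is mapped in parallel on a different node;
        // here we just iterate, then feed all map output to the reduce step.
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String c : chunks) all.addAll(map(c));
        return reduce(all);
    }
}
```

Each chunk's map output is independent, which is what allows the map tasks to run in parallel; only the grouped reduce step needs to see all values for a key.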
More about Hadoop
www.beinghadoop.com
https://www.facebook.com/hadoopinfo
This presentation covers the complete Hadoop architecture and how a user request is processed in Hadoop, including:
Namenode
Datanode
Jobtracker
Tasktracker
Hadoop installation post-configuration
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It was initially developed by Facebook.
HBase Data Modeling and Access Patterns with Kite SDK (HBaseCon)
Speaker: Adam Warrington (Cloudera)
The Kite SDK is a set of libraries and tools focused on making it easier to build systems on top of the Hadoop ecosystem. HBase support has recently been added to the Kite SDK Data Module, which allows a developer to model and access data in HBase consistent with how they would model data in HDFS using Kite. This talk will focus on Kite's HBase support by covering Kite basics and moving through the specifics of working with HBase as a data source. This feature overview will be supplemented by specifics of how that feature is being used in production applications at Cloudera.
In this session you will learn:
What is Big Data?
What is Hadoop?
Overview of Hadoop Ecosystem
Hadoop Distributed File System or HDFS
Hadoop Cluster Modes
Yarn
MapReduce
Hive
Pig
Zookeeper
Flume
Sqoop
For more information, visit: https://www.mindsmapped.com/courses/big-data-hadoop/hadoop-developer-training-a-step-by-step-tutorial/
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.
Simplifying Use of Hive with the Hive Query Tool (DataWorks Summit)
As TripAdvisor moves increasing amounts of data into Hadoop and Hive, the need for simplifying, controlling, and expanding access to this data has grown. Having reviewed existing solutions without finding what we needed, we began working on our own solution to meet our specific goals and use-cases. The Hive Query Tool (HQT) is a web interface that allows anybody to configure and run Hive queries without requiring client-side installation or even knowledge of the query language. Users familiar with HQL can add sophisticated and highly customizable queries with a flexible and powerful template system. A primary innovation, the template system, allows one to define the inputs available to the end-user, validation checks, and what HQL to generate, easily and concisely. We plan to release the code as open-source. This talk will discuss: – The features of the HQT and how it is used for business intelligence – The challenges it was built to meet and how its design and architecture addresses them – Installing and running an HQT server – How to use, customize, and expand the template system – Known limitations and issues – Future plans and features
HBase has established itself as the backend for many operational and interactive use-cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tunable write-ahead logging, and so on. This talk is based on the research for the upcoming second release of the speaker's HBase book, correlated with practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of matching use-cases, through determining the number of servers needed, and leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
Apache HBase™ is the Hadoop database: a distributed, scalable big data store. It's a column-oriented database management system that runs on top of HDFS.
Apache HBase is an open source NoSQL database that provides real-time read/write access to those large data sets. ... HBase is natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN.
From: DataWorks Summit 2017 - Munich - 20170406
This talk delves into the many ways that a user has to use HBase in a project. Lars will look at many practical examples based on real applications in production, for example, on Facebook and eBay and the right approach for those wanting to find their own implementation. He will also discuss advanced concepts, such as counters, coprocessors and schema design.
HBase Advanced Schema Design - Berlin Buzzwords - June 2012 (larsgeorge)
While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second. This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they
http://berlinbuzzwords.de/sessions/advanced-hbase-schema-design
In this session you will learn:
HBase Introduction
Row & Column storage
Characteristics of a huge DB
What is HBase?
HBase Data-Model
HBase vs RDBMS
HBase architecture
HBase in operation
Loading Data into HBase
HBase shell commands
HBase operations through Java
HBase operations through MR
To know more, click here: https://www.mindsmapped.com/courses/big-data-hadoop/big-data-and-hadoop-training-for-beginners/
Introduction to HBase. HBase is a NoSQL database that has experienced a tremendous increase in popularity in recent years. Large companies like Facebook, LinkedIn, and Foursquare are using HBase. In this presentation we will address questions like: What is HBase? How does it compare to relational databases? What is the architecture? How does HBase work? What about schema design? What about the IT resources? These questions should help you consider whether this solution might be suitable in your case.
2. RDBMS Scaling
• Cannot scale for large distributed data sets
• Vendors offer replication and partitioning solutions to grow the database beyond the confines of a single node, but these are generally complicated to install and maintain
• Such techniques compromise RDBMS features such as:
– Joins, complex queries, views, triggers and foreign-key constraints
– These queries become expensive
3. Why BigTable?
• Performance of an RDBMS is good for transaction processing, but for very large scale analytic processing the solutions are expensive and specialized.
• Very large scale analytic processing:
– Big queries – typically range or table scans.
– Big databases (100s of TB)
• MapReduce on BigTable, optionally with Cascading on top to support some relational algebra, may be a cost-effective solution.
• Sharding (shared-nothing horizontal partitioning) is not a solution to scale open source RDBMS platforms:
– Application specific
– Labor-intensive (re)partitioning
4. Key concept
HBase is a distributed column-oriented database built on top of HDFS.
• At its core, HBase / BigTable is a map.
• It is persistent storage.
• HBase and BigTable are built upon distributed file-systems.
• Unlike most map implementations, in HBase/BigTable the key/value pairs are kept in strict alphabetical order.
• Multidimensional map.
• Sparse.
5. Map
• A map is "an abstract data type composed of a collection of keys and a collection of values, where each key is associated with one value."
{
  "Name" : "Subhas",
  "Mail" : "subhas.ghosh@siemens.com",
  "Location" : "9F-TA-WS-21",
  "Phone" : "+918025113529",
  "Sal" : ************
}
In this example "Name" is a key, and "Subhas" is the corresponding value.
6. Persistent
• Persistence merely means that the data you put in this special map "persists" after the program that created or accessed it is finished.
• This is no different in concept than any other kind of persistent storage, such as a file on a file-system.
• Each value can be versioned in HBase.
7. Distributed
• Built upon distributed file-systems:
– File storage can be spread out among an array of independent machines.
– HBase sits atop either Hadoop's Distributed File System (HDFS) or Amazon's Simple Storage Service (S3).
– BigTable makes use of the Google File System (GFS).
• Data is replicated across a number of participating nodes in an analogous manner to how data is striped across disks in a RAID system.
8. Sorted
Continuing our example, the sorted version looks like this:
{
  "Location" : "9F-TA-WS-21",
  "Mail" : "subhas.ghosh@siemens.com",
  "Name" : "Subhas",
  "Phone" : "+918025113529",
  "Sal" : ************
}
Sorting can ensure that items of greatest interest to you are near each other.
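The same strict key ordering can be seen with a Java TreeMap (a toy illustration of the sorting behavior, not HBase code):

```java
import java.util.*;

public class SortedRowKeys {
    public static List<String> sortedKeys() {
        // Insertion order does not matter: a TreeMap keeps its keys sorted,
        // just as HBase/BigTable keeps key/value pairs in strict alphabetical order.
        SortedMap<String, String> row = new TreeMap<>();
        row.put("Name", "Subhas");
        row.put("Mail", "subhas.ghosh@siemens.com");
        row.put("Location", "9F-TA-WS-21");
        row.put("Phone", "+918025113529");
        return new ArrayList<>(row.keySet());
    }
}
```

However the entries are inserted, iteration always returns Location, Mail, Name, Phone, which is why related keys chosen to sort near each other are cheap to scan together.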
9. Multidimensional
A map of maps:
{
  "Location" :
  {
    "FL" : "9F",
    "TOWER" : "A",
    "WS" : "21"
  },
  "Mail" : "subhas@xyz.com",
  "Name" :
  {
    "FIRST" : "Subhas",
    "MID" : "Kumar",
    "LAST" : "Ghosh"
  },
  "Phone" : "+918025113529",
  "Sal" : ************
}
Each key points to a map with one or more keys, e.g. "FL", "TOWER", "WS".
A top-level key/map pair is a "row".
Also, in BigTable/HBase nomenclature, the "FL" and "TOWER" mappings would be called "Column Families".
10. Multidimensional
• A table's column families are specified when the table is created, and are difficult or impossible to modify later.
• It can also be expensive to add new column families, so it's a good idea to specify all the ones you'll need up front.
• Fortunately, a column family may have any number of columns, denoted by a column "qualifier" or "label".
11. Multidimensional
…
"aaaaa" : {
  "A" : {
    "foo" : "y",
    "bar" : "d"
  },
  "B" : {
    "" : "w"
  }
},
"aaaab" : {
  "A" : {
    "foo" : "world",
    "bar" : "domination"
  },
  "B" : {
    "" : "ocean"
  }
}
…
Here column family "A" has two columns, "foo" and "bar", while column family "B" has just one column, whose qualifier is the empty string ("").
When asking HBase/BigTable for data, provide the full column name in the form "<family>:<qualifier>", e.g. "A:foo", "A:bar" and "B:".
12. Multidimensional
• Labeled tables of rows X columns X timestamp
– Cells addressed by row/column/timestamp
– As a (perverse) Java declaration:
SortedMap<byte[], SortedMap<byte[], List<Cell>>> hbase =
    new TreeMap<>(new RawByteComparator());
• Row keys are uninterpreted byte arrays, e.g. a URL
– Rows are ordered by a Comparator (default: byte order)
– Row updates are atomic, even if they span hundreds of columns
• Columns are grouped into column-families
– Columns have a column-family prefix and then a qualifier
• E.g. webpage:mimetype, webpage:language
– The column-family must be 'printable'; the qualifier may be arbitrary bytes
– Column-families are in the table schema, but qualifiers are not
13. Multidimensional
• A cell is an uninterpreted byte array plus a timestamp
– E.g. webpage content
• Tables are partitioned into Regions
– A region is defined by its start & end row
– Regions are the 'atoms' of distribution, deployed around the cluster
– start < end, in the lexicographic sense
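Because regions are contiguous, sorted row ranges, locating the region for a row key reduces to a floor lookup on the region start keys. A minimal sketch (illustrative only; this is not the real catalog-table lookup code, and the class name is invented):

```java
import java.util.*;

public class RegionLocator {
    // Regions keyed by start row; a row belongs to the region with the
    // greatest start key that is <= the row key.
    private final NavigableMap<String, String> regionsByStartKey = new TreeMap<>();

    public void addRegion(String startKey, String regionName) {
        regionsByStartKey.put(startKey, regionName);
    }

    public String locate(String rowKey) {
        Map.Entry<String, String> e = regionsByStartKey.floorEntry(rowKey);
        return e == null ? null : e.getValue();
    }
}
```

This is why sorted row keys matter: a single ordered lookup pins a row to exactly one region, which can then be served by one region server.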
16. What HBase Is Not
• Tables have one primary index, the row key.
• No join operators.
• Scans and queries can select a subset of available columns, perhaps by using a wildcard.
• There are three types of lookups:
– Fast lookup using row key and optional timestamp
– Full table scan
– Range scan from region start to end
• Limited atomicity and transaction support:
– HBase supports multiple batched mutations of single rows only.
– Data is unstructured and untyped.
• Not accessed or manipulated via SQL:
– Programmatic access via Java, REST, or Thrift APIs
– Scripting via JRuby
– No JOINs, no sophisticated query engine, no column typing, no ODBC/JDBC, no Crystal Reports, no transactions, no secondary indices
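The three lookup types above all fall out of the single sorted row-key index, which a sorted in-memory "table" makes concrete (a toy model, not the HBase client API):

```java
import java.util.*;

public class RowKeyScans {
    // Fast lookup by row key.
    public static String getRow(NavigableMap<String, String> table, String rowKey) {
        return table.get(rowKey);
    }

    // Full table scan, in row-key order.
    public static Collection<String> fullScan(NavigableMap<String, String> table) {
        return table.values();
    }

    // Range scan over [startRow, stopRow) - the same contract HBase scans use.
    public static Collection<String> rangeScan(NavigableMap<String, String> table,
                                               String startRow, String stopRow) {
        return table.subMap(startRow, true, stopRow, false).values();
    }
}
```

Anything else, such as a lookup by value or a join, has no index to lean on and degenerates to a full scan; that is the design trade-off this slide is describing.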
17. Map-Reduce With HBase
• When we use a map-reduce framework with an HBase table, a map function is executed for each region independently, in parallel.
• Within each map, the query is answered by scanning the rows in order, from the lowest key to the highest key.
• Optionally, certain rows and columns (column families) can be filtered out for better performance.
19. Elements
– Table: a list of tuples sorted by row key ascending, column name ascending and timestamp descending.
– Regions: a table is broken up into row ranges called regions. Each row range contains rows from start-key to end-key. (A set of regions, sorted appropriately, forms an entire table.)
– HStore: each column family in a region is managed by an HStore.
– HFile: each HStore may have one or more HFiles (a Hadoop HDFS file type).
20. Components
• Master
o Responsible for monitoring region servers
o Load balancing for regions
o Redirects clients to the correct region servers
o Currently a SPOF (single point of failure)
• Regionserver slaves
o Serve client requests (Write/Read/Scan)
o Send heartbeats to the Master
o Throughput and region count scale with the number of region servers
21. Components
• ZooKeeper
– A centralized service for maintaining:
• configuration information,
• naming,
• distributed synchronization, and
• group services.
– ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace, organized similarly to a standard file system.
• The namespace consists of data registers, called znodes in ZooKeeper parlance, which are similar to files and directories.
• Unlike a typical file system, which is designed for storage, ZooKeeper data is kept in-memory, which means ZooKeeper can achieve high throughput and low latency.
23. Distributed Coordination
• The replicated database is in-memory.
• Updates are logged to disk for recoverability.
• Writes are serialized to disk before they are applied to the in-memory database.
• Clients connect to exactly one server to submit requests.
• Read requests are serviced from the local replica of each server's database.
• Requests that change the state of the service (write requests) are processed by an agreement protocol.
24. Distributed Coordination
• As part of the agreement protocol, all write requests from clients are forwarded to a single server, called the leader.
• The rest of the ZooKeeper servers, called followers, receive message proposals from the leader and agree upon message delivery.
• The messaging layer takes care of replacing leaders on failures and syncing followers with leaders.
• ZooKeeper uses a custom atomic messaging protocol.
– ZooKeeper can guarantee that the local replicas never diverge.
– When the leader receives a write request, it calculates what the state of the system will be when the write is applied and transforms this into a transaction that captures this new state.
26. The general protocol flow
1. The client contacts ZooKeeper to find where it should put the data.
2. For this purpose, HBase maintains two catalog tables, namely -ROOT- and .META..
3. First, HBase finds information in the -ROOT- table about the location of the .META. table.
4. Subsequently, it finds the server location of the assigned region of a table from the .META. table.
5. The client caches this information and contacts the HRegionServer.
6. Next, the HRegionServer creates an HRegion object corresponding to the opened region.
   1. When the HRegion is "opened", it sets up an HStore instance for each HColumnFamily of the table, as defined by the user beforehand.
   2. Each of the HStore instances has one or more StoreFile instances.
   3. StoreFiles are lightweight wrappers around the actual storage file, called an HFile.
27. Where is my data?
[Diagram: the client asks ZooKeeper for the -ROOT- region, which holds a row per .META. region; each .META. row in turn holds a row per table region, leading the client to MyRow in MyTable.]
28. The general protocol flow
7. The client issues an HTable.put(Put) request to the HRegionServer, which hands the details to the matching HRegion instance.
8. The first step is to decide if the data should first be written to the "Write-Ahead Log" (WAL), represented by the HLog class. The WAL is a standard Hadoop SequenceFile and it stores HLogKeys.
9. These keys contain a sequential number as well as the actual data, and are used to replay not-yet-persisted data after a server crash.
10. Once the data is written (or not) to the WAL, it is placed in the MemStore. At the same time it is checked whether the MemStore is full, and in that case a flush to disk is requested.
11. The store files created on disk are immutable. Sometimes the store files are merged together; this is done by a process called compaction. This buffer-flush-merge strategy is a common pattern, described in the Log-Structured Merge-Tree literature.
12. After a compaction, if a newly written store file is larger than the size specified in hbase.hregion.max.filesize (default 256 MB), the region is split into two new regions.
[Timeline: a sequence of flushes, periodically interleaved with compactions.]
29. Log Structured Merge Trees
• Random IO for writes is bad in HDFS.
• LSM trees convert random writes to sequential writes.
• Writes go to a commit log and in-memory storage (MemStore).
• The MemStore is occasionally flushed to disk (StoreFile).
• The disk stores are periodically compacted into HFiles (on HDFS).
• Use Bloom filters with the merge.
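The buffer-flush-compact cycle above can be simulated with plain sorted maps (a deliberately simplified sketch: no WAL, no timestamps, and the class name is invented; it only shows how random puts become sequential, immutable store files):

```java
import java.util.*;

public class MiniLsm {
    private final int flushThreshold;
    private SortedMap<String, String> memStore = new TreeMap<>();          // in-memory writes
    private final List<SortedMap<String, String>> storeFiles = new ArrayList<>(); // immutable flushes

    public MiniLsm(int flushThreshold) { this.flushThreshold = flushThreshold; }

    public void put(String row, String value) {
        memStore.put(row, value);            // random write lands in sorted memory
        if (memStore.size() >= flushThreshold) flush();
    }

    void flush() {
        // A flush writes the whole sorted buffer out sequentially as an immutable file.
        storeFiles.add(Collections.unmodifiableSortedMap(memStore));
        memStore = new TreeMap<>();
    }

    // Compaction: merge all store files into one; newer files win over older ones.
    public SortedMap<String, String> compact() {
        SortedMap<String, String> merged = new TreeMap<>();
        for (SortedMap<String, String> f : storeFiles) merged.putAll(f);
        merged.putAll(memStore);
        storeFiles.clear();
        storeFiles.add(merged);
        return merged;
    }

    public int storeFileCount() { return storeFiles.size(); }
}
```

Reads in a real LSM store must consult the MemStore plus every store file, which is why compaction (and Bloom filters to skip files that cannot contain a key) matters for read performance.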
30. Buffer-Flush-Compact (minor)
[Diagram: within a region, writes are appended to the HLog (an append-only WAL on HDFS; a SequenceFile, one per region) and buffered in the MemStore. Flushes write the buffer out as StoreFiles (HFiles on HDFS); multiple HFiles are later compacted into one. Reads consult the buffer and the HFiles. An HFile is an immutable sorted map, byte[] to byte[]: (row, column, timestamp) to cell value.]
31. Compaction
• Major compaction:
– The most important difference between minor and major compactions is
that major compactions process delete markers, maximum versions, etc.,
while minor compactions do not.
– This is because delete markers might also affect data in the non-merged
files, so they can only be processed when all files are merged.
• When a delete is performed on an HBase table, nothing gets deleted
immediately; rather, a delete marker (a.k.a. tombstone) is written.
– This is because HBase does not modify files once they are written.
– The deletes are processed during the major compaction process, at
which point neither the data they hide nor the delete marker itself will be
present in the merged file.
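The tombstone behaviour can be modelled like this; `MajorCompactionSketch` and the `TOMBSTONE` sentinel are our stand-ins for HBase's real delete markers:

```java
import java.util.List;
import java.util.TreeMap;

// Toy major compaction: merge all store files oldest-to-newest, then process
// delete markers. Illustrative only, not HBase code.
public class MajorCompactionSketch {
    public static final String TOMBSTONE = "__DELETE__";

    public static TreeMap<String, String> majorCompact(List<TreeMap<String, String>> files) {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> f : files) merged.putAll(f); // newer files win
        // Only safe because *all* files are merged: drop both the delete
        // marker and the older data it hides.
        merged.values().removeIf(v -> v.equals(TOMBSTONE));
        return merged;
    }
}
```

A minor compaction would have to keep the marker, since an unmerged older file might still contain data it hides.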
33. Java Example
HBaseConfiguration config = new HBaseConfiguration();
HTable table = new HTable(config, "myTable");
Get get = new Get(Bytes.toBytes("myRow"));
get.addColumn(Bytes.toBytes("myColumnFamily"), Bytes.toBytes("columnQualifier1"));
Result result = table.get(get);
34. Java Example: A Table Mapper
Scan scan = new Scan();
scan.addFamily(COLUMN_FAMILY_NAME);
// add some more filters to the scan here, e.g. scan.setFilter(...);
TableMapReduceUtil.initTableMapperJob(TABLE_NAME, scan, MyMapper.class,
ImmutableBytesWritable.class, IntWritable.class, job);

public class MyMapper extends TableMapper<ImmutableBytesWritable, IntWritable>
{
@Override
public void map(ImmutableBytesWritable row, Result values, Context context) throws
IOException
{
ImmutableBytesWritable userKey = new ImmutableBytesWritable(row.get());
IntWritable res = new IntWritable();
for (KeyValue value : values.list())
{
ByteBuffer b = ByteBuffer.wrap(value.getValue());
String column = Bytes.toString(value.getColumn());
// compute something and put it in res, e.g. res.set(...)
try { context.write(userKey, res); }
catch (InterruptedException e) { throw new IOException(e); }
}
}
}
KeyValue in the HFile is a low-level byte array that allows for "zero-copy" access to the data,
even with lazy or custom parsing if necessary.
37. InputFormat
• The InputFormat class is responsible for the actual splitting of the input data, and
for returning a RecordReader instance that defines the classes of the key and value
objects and provides a next() method used to iterate over each input record.
• In HBase, the implementation is called TableInputFormatBase, along with its
subclass TableInputFormat.
• TableInputFormat is a lightweight concrete version.
• You provide the name of the table to scan and the columns you want to
process during the Map phase.
• It splits the table into appropriate pieces for you and hands them over to the
subsequent classes.
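The "proper pieces" are the table's regions: one input split per region, so each map task scans exactly one region. This can be approximated as follows; `TableSplitSketch` and the start/stop-pair representation are assumptions for illustration, not the TableInputFormat internals:

```java
import java.util.ArrayList;
import java.util.List;

// Toy version of turning region boundaries into input splits.
public class TableSplitSketch {
    // Given the sorted start keys of a table's regions, produce one
    // [startRow, stopRow) split per region. "" denotes the table boundary.
    public static List<String[]> splitsFromRegionStartKeys(List<String> startKeys) {
        List<String[]> splits = new ArrayList<>();
        for (int i = 0; i < startKeys.size(); i++) {
            String start = startKeys.get(i);
            String stop = (i + 1 < startKeys.size()) ? startKeys.get(i + 1) : "";
            splits.add(new String[] { start, stop });
        }
        return splits;
    }
}
```

With one split per region, map tasks can be scheduled near the region server holding the data.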
38. Mapper
• The Mapper class(es) form the next stage of the MapReduce job.
• In this step each record read by the RecordReader is processed by the
map() method.
• A TableMap class is a Mapper specific to iterating over an HBase table.
• One specific implementation is IdentityTableMap, which is also a good
example of how to add your own functionality to the supplied classes.
• The TableMap class itself does not implement anything; it only adds the
signatures declaring the actual key/value pair classes.
• The IdentityTableMap simply passes the records on to the next stage of
processing.
39. Reducer
• The Reduce stage and class layout are very similar to the Mapper
stage explained above.
• This time we get the output of a Mapper class and process it
after the data has been shuffled and sorted.
40. OutputFormat
• The final stage is the OutputFormat class, whose job is to persist the data in
various locations.
• There are specific implementations that allow output to files, or to HBase
tables in the case of TableOutputFormat.
• TableOutputFormat uses a RecordWriter to write the data into the specified HBase
output table.
• It is important to note the cardinality as well: while there are many Mappers
handing records to many Reducers, there is only one OutputFormat, which takes
each output record from its Reducer in turn.
• It is the final class handling the key/value pairs, and it writes them to their final
destination, be it a file or a table.
• The name of the output table is specified when the job is created.
41. Map-reduce options with HBase

Input ↓ / Output → | Raw data                 | Table-A                  | Table-B
Raw Data           | Map + Reduce (Hadoop)    | Map only or Map + Reduce | Map only or Map + Reduce
Table-A            | Map only or Map + Reduce | Map + Reduce             | Map
Table-B            | Map only or Map + Reduce | Map                      | Map + Reduce
Reading from and writing into the same table hinders the proper distribution of regions
across the servers (open scanners block region splits), and a scan may or may not see the
newly written data; writes must therefore be done in TableReduce.reduce().
Reading from one table and writing to another: updates can be written directly in
TableMap.map().
Alternatively, the Map stage completely reads a table and then passes the data on in
intermediate files to the Reduce stage; the Reducer reads from the DFS and writes into
the now-idle HBase table.
43. Classes
• HBaseAdmin
• HBaseConfiguration
• HTable
• HTableDescriptor
• Put
• Get
• Scanner
• Filters
(Diagram: the corresponding concepts — database admin, table, family, column qualifier.)
44. Using HBase API
HBaseConfiguration: Adds HBase configuration files to a Configuration
new HBaseConfiguration ( )
new HBaseConfiguration (Configuration c)
<property>
  <name>name</name>
  <value>value</value>
</property>
HBaseAdmin: new HBaseAdmin(HBaseConfiguration conf)
• Ex:
HBaseAdmin admin = new HBaseAdmin(config);
admin.disableTable("tablename");
45. Using HBase API
HTableDescriptor: HTableDescriptor contains the name of an HTable, and its
column families.
new HTableDescriptor()
new HTableDescriptor(String name)
• Ex: HTableDescriptor htd = new HTableDescriptor(tablename);
htd.addFamily(new HColumnDescriptor("Family"));
HColumnDescriptor: An HColumnDescriptor contains information about a column family
new HColumnDescriptor(String familyname)
• Ex:
HTableDescriptor htd = new HTableDescriptor(tablename);
HColumnDescriptor col = new HColumnDescriptor("content:");
htd.addFamily(col);
46. Using HBase API
HTable: Used for communication with a single HBase table.
new HTable(HBaseConfiguration conf, String tableName)
• Ex:
HTable table = new HTable (conf, Bytes.toBytes ( tablename ));
ResultScanner scanner = table.getScanner ( family );
Put: Used to perform Put operations for a single row.
new Put(byte[] row)
new Put(byte[] row, RowLock rowLock)
• Ex:
HTable table = new HTable(conf, Bytes.toBytes(tablename));
Put p = new Put(Bytes.toBytes(row));
p.add(family, qualifier, value);
table.put(p);
47. Using HBase API
Get: Used to perform Get operations on a single row.
new Get (byte[] row)
new Get (byte[] row, RowLock rowLock)
• Ex:
HTable table = new HTable(conf, Bytes.toBytes(tablename));
Get g = new Get(Bytes.toBytes(row));
Result: Single row result of a Get or Scan query.
new Result()
• Ex:
HTable table = new HTable(conf, Bytes.toBytes(tablename));
Get g = new Get(Bytes.toBytes(row));
Result rowResult = table.get(g);
byte[] ret = rowResult.getValue(Bytes.toBytes(family), Bytes.toBytes(column));
48. Using HBase API
Scan
• All operations are identical to Get.
– Rather than specifying a single row, an optional startRow and stopRow
may be defined.
• If rows are not specified, the Scanner will iterate over all rows.
• Constructors:
– new Scan()
– new Scan(byte[] startRow, byte[] stopRow)
– new Scan(byte[] startRow, Filter filter)
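The startRow/stopRow semantics map naturally onto a sorted map: startRow is inclusive, stopRow is exclusive, and omitting both scans everything. `ScanSketch` below is a toy model of those semantics, not the HBase client:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Toy illustration of Scan range semantics over a sorted table.
public class ScanSketch {
    public static SortedMap<String, String> scan(TreeMap<String, String> table,
                                                 String startRow, String stopRow) {
        if (startRow == null && stopRow == null) return table;   // full-table scan
        if (stopRow == null) return table.tailMap(startRow);     // open-ended scan
        return table.subMap(startRow, stopRow);                  // [startRow, stopRow)
    }
}
```

This is why row-key design matters: a well-chosen key prefix turns a query into one contiguous range scan.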
49. HBase Shell
• Non-SQL (intentional) “DSL”
• list : List all tables in hbase
• get : Get row or cell contents; pass table name, row, and optionally a
dictionary of column(s), timestamp and versions.
• put : Put a cell 'value' at specified table/row/column and optionally
timestamp coordinates.
• create : hbase> create 't1', {NAME => 'f1', VERSIONS => 5}
• scan : Scan a table; pass table name and optionally a dictionary of
scanner specifications.
• delete : Put a delete cell value at specified table/row/column and
optionally timestamp coordinates.
• enable : Enable the named table
• disable : Disable the named table: e.g. "hbase> disable 't1'"
• drop : Drop the named table.
50. HBase non-java access
• Languages talking to the JVM:
– Jython interface to HBase
– Groovy DSL for HBase
– Scala interface to HBase
• Languages with a custom protocol
– REST gateway specification for HBase
– Thrift gateway specification for HBase
51. Example: Frequency Counter
• HBase has records of web_access_logs: we record each web page access by
a user.
• The schema looks like this:
userID_timestamp => {
  details => {
    page:
  }
}
• We want to count how many times we have seen each user.

row      | details:page
user1_t1 | a.html
user2_t2 | b.html
user3_t4 | a.html
user1_t5 | c.html
user1_t6 | b.html
user2_t7 | c.html
user4_t8 | a.html

user  | count (frequency)
user1 | 3
user2 | 2
user3 | 1
user4 | 1
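Stripped of the MapReduce machinery, the counting logic is just prefix extraction plus a per-user sum. `FreqCounterSketch` is our illustrative name (the deck's actual job is called FreqCounter):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the FreqCounter logic: row keys look like "userID_timestamp",
// so the map step extracts the userID prefix and the reduce step sums
// the 1s emitted per row.
public class FreqCounterSketch {
    public static Map<String, Integer> countUsers(String[] rowKeys) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String rowKey : rowKeys) {
            String user = rowKey.substring(0, rowKey.indexOf('_')); // map: extract userID
            counts.merge(user, 1, Integer::sum);                    // reduce: sum per user
        }
        return counts;
    }
}
```

In the real job, the map emits (userID, 1) pairs and the shuffle groups them by userID before the reduce sums them.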
52. Tutorial
• hbase shell
create 'access_logs', 'details'
create 'summary_user', {NAME=>'details', VERSIONS=>1}
• Add some data using Importer
• scan 'access_logs', {LIMIT => 5}
• Run 'FreqCounter'
• scan 'summary_user', {LIMIT => 5}
• Show output with PrintUserCount
53. Coprocessors
• The HBase 0.92 release provides coprocessor functionality, which includes:
– observers (similar to triggers, fired on certain events), and
– endpoints (similar to stored procedures, invoked from the client).
• Observers can run at the region, master, or WAL (Write-Ahead Log)
level.
• Once a Region Observer has been created, it can be specified in
hbase-default.xml, in which case it applies to all regions and the tables in them;
alternatively, the Region Observer can be specified on a table, in which case it
applies only to that table.
• Arbitrary code can run at each tablet in the table server.
• High-level call interface for clients:
– Calls are addressed to rows or ranges of rows, and the coprocessor client library
resolves them to actual locations.
– Calls across multiple rows are automatically split into multiple parallelized RPCs.
• Provides a very flexible model for building distributed services.
• Automatic scaling, load balancing, and request routing for applications.
54. Three observer interfaces
• RegionObserver: Provides hooks for data manipulation events, Get, Put,
Delete, Scan, and so on. There is an instance of a RegionObserver
coprocessor for every table region and the scope of the observations they
can make is constrained to that region.
• WALObserver: Provides hooks for write-ahead log (WAL) related operations.
This is a way to observe or intercept WAL writing and reconstruction events.
A WALObserver runs in the context of WAL processing. There is one such
context per region server.
• MasterObserver: Provides hooks for DDL-type operations, i.e., create, delete,
modify table, etc. The MasterObserver runs within the context of the HBase
master.
55. Example
package org.apache.hadoop.hbase.coprocessor;
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
// Sample access-control coprocessor. It extends BaseRegionObserver
// and intercepts the preXXX() methods to check user privileges for the given
// table and column family.
public class AccessControlCoprocessor extends BaseRegionObserver {
@Override
public void preGet(final ObserverContext<RegionCoprocessorEnvironment> c,
final Get get, final List<KeyValue> result) throws IOException {
// check permissions...
if (!permissionGranted()) {
throw new AccessDeniedException("User is not allowed to access.");
}
}
// override prePut(), preDelete(), etc.
}
56. Avoiding Long Pauses from the Garbage Collector
• Stop-the-world garbage collections are common in HBase,
especially during loading.
• There are two issues to be addressed:
– concurrent mark and sweep (CMS) performance, and
– fragmentation of the MemStore.
• To address the first, start CMS earlier than the default by adding
-XX:CMSInitiatingOccupancyFraction and setting it below the
default. Start at 60 or 70 percent (the lower you bring the
threshold, the more GC work is done and the more CPU is used).
• To address the second issue, fragmentation, there is an
experimental facility, hbase.hregion.memstore.mslab.enabled
(the MemStore-local allocation buffer), to be set to true in the
configuration.
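In configuration terms, the MSLAB tuning above would look roughly like this (the property name is from the slide; the surrounding file layout follows the standard hbase-site.xml convention):

```xml
<!-- hbase-site.xml: enable the experimental MemStore-local allocation buffer -->
<property>
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>true</value>
</property>
```

The CMS flag would go into the region server's JVM options instead, e.g. -XX:CMSInitiatingOccupancyFraction=70 added to HBASE_OPTS in hbase-env.sh (the value 70 is an example starting point, not a recommendation).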
57. For Loading Data, Pre-Create Regions
• Tables in HBase are initially created with one region by
default.
• For bulk imports, this means that all clients will write
to the same region until it is large enough to split and
become distributed across the cluster.
• A useful pattern to speed up the bulk import process is
to pre-create empty regions.
• Note that too many regions can actually degrade
performance.
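Pre-creating regions means choosing split keys up front. The even-spacing scheme in `PreSplitSketch` below is one simple illustrative choice (assuming row keys start with a letter in a known range); real imports would pick split keys from the expected key distribution:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of computing evenly spaced split keys over a one-character key
// prefix, so a bulk import spreads across region servers from the start.
public class PreSplitSketch {
    public static List<String> splitKeys(char first, char last, int regions) {
        List<String> keys = new ArrayList<>();
        int span = last - first + 1;
        for (int i = 1; i < regions; i++)            // n regions need n-1 split points
            keys.add(String.valueOf((char) (first + (span * i) / regions)));
        return keys;
    }
}
```

The resulting keys would be handed to table creation (e.g. via the shell's SPLITS option) so every client writes to a different region immediately.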
58. Enable Scan Caching
• When HBase is used as an input source for a MapReduce job,
set setCaching to something greater than the default (which is
1).
• With the default value, the map task makes a call back to the
region server for every record processed.
– Setting this value to 80, for example, will transfer 80 rows at a time to
the client to be processed.
• There is a cost/benefit trade-off to a large cache value, because
it costs more memory on both the client and the RegionServer, so
bigger isn't always better.
• Experimentation suggests that a value between 50 and 100
gives good performance in our setup.
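The payoff is easy to quantify: the number of round trips is roughly the row count divided by the caching value (`ScanCachingMath` and `rpcCount` are our helper names for this back-of-the-envelope calculation):

```java
// Back-of-the-envelope for scan caching: a map task makes roughly
// ceil(rows / caching) RPCs, so raising caching from 1 to 80 cuts the
// round trips ~80x at the cost of buffering 80 rows per call.
public class ScanCachingMath {
    public static long rpcCount(long rows, int caching) {
        return (rows + caching - 1) / caching;   // ceiling division
    }
}
```

For a million-row region, caching=1 means a million round trips, while caching=80 means 12,500.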
59. Right Scan Attribute Selection
• Whenever a Scan is used to process large numbers of
rows (and especially when used as a MapReduce
source), select the right set of attributes.
• If scan.addFamily is called, then all of the attributes in
the specified ColumnFamily will be returned to the
client.
• If only a small number of the available attributes are to
be processed, then only those attributes should be
specified in the input scan, because attribute over-selection
is a non-trivial performance penalty over
large datasets.
60. Optimize handler.count
• This is the count of RPC listener instances spun up on
RegionServers; the same property is used by the Master
for its count of master handlers.
– The default is 10.
• This setting in essence sets how many requests are
concurrently being processed inside the RegionServer
at any one time.
• If multiple map-reduce jobs are running in the cluster
and there is enough map capacity to handle the jobs
concurrently, then this parameter needs to be tuned.