Jonathan Gray gave an introduction to HBase at the NYC Hadoop Meetup. He began with an overview of HBase and why it was created to handle large datasets beyond what Hadoop alone could support. He then described what HBase is: a distributed, column-oriented database management system. Gray explained how HBase works with its Master and RegionServer nodes and how it partitions data across tables and regions. He highlighted some key features of HBase and examples of companies using it in production. Gray concluded with what is planned for the future of HBase and contrasted it with a relational database example.
HBase Advanced Schema Design - Berlin Buzzwords - June 2012 (larsgeorge)
While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second. This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns.
http://berlinbuzzwords.de/sessions/advanced-hbase-schema-design
Apache HBase™ is the Hadoop database, a distributed, scalable, big data store. It's a column-oriented database management system that runs on top of HDFS.
Apache HBase is an open source NoSQL database that provides real-time read/write access to large data sets. HBase is natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN.
This presentation will give you information about:
1. HBase Overview and Architecture
2. HBase Installation
3. HBase Shell
4. CRUD Operations
5. Scanning and Batching
6. HBase Filters
7. HBase Key Design
With Facebook's public endorsement, HBase is on everyone's lips when it comes to the discussion around the new "NoSQL" area of databases. In this talk, Lars will introduce and present a comprehensive overview of HBase. This includes the history of HBase, the underlying architecture, available interfaces, and integration with Hadoop.
Hadoop World 2011: Apache HBase Road Map - Jonathan Gray, Facebook (Cloudera, Inc.)
This technical session will provide a quick review of the Apache HBase project, looking at it from the past to the future. It will cover the imminent HBase 0.92 release as well as what is slated for 0.94 and beyond. A number of companies and use cases will be used as examples to describe the overall direction of the HBase community and project.
Most developers are familiar with the topic of “database design”. In the relational world, normalization is the name of the game. How do things change when you’re working with a scalable, distributed, non-SQL database like HBase? This talk will cover the basics of HBase schema design at a high level and give several common patterns and examples of real-world schemas to solve interesting problems. The storage and data access architecture of HBase (row keys, column families, etc.) will be explained, along with the pros and cons of different schema decisions.
HBase has established itself as the backend for many operational and interactive use-cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tunable write-ahead logging, and so on. This talk is based on the research for the upcoming second edition of the speaker's HBase book, combined with practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of matching use-cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
Hadoop World 2011: Building Realtime Big Data Services at Facebook with Hadoo... (Cloudera, Inc.)
Facebook has one of the largest Apache Hadoop data warehouses in the world, primarily queried through Apache Hive for offline data processing and analytics. However, the need for realtime analytics and end-user access has led to the development of several new systems built using Apache HBase. This talk will cover specific use cases and the work done at Facebook around building large scale, low latency and high throughput realtime services with Hadoop and HBase. This includes several significant contributions to existing projects as well as the release of new open source projects.
Jesse Anderson (Smoking Hand)
This early-morning session offers an overview of what HBase is, how it works, its API, and considerations for using HBase as part of a Big Data solution. It will be helpful for people who are new to HBase, and also serve as a refresher for those who may need one.
Hadoop World 2011: Advanced HBase Schema Design (Cloudera, Inc.)
While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second.
This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns.
Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera (Cloudera, Inc.)
"While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second.
This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns. "
Big Data Fundamentals in the Emerging New Data World (Jongwook Woo)
I talk about the fundamentals of Big Data, including Hadoop, Data-Intensive Computing, and the NoSQL databases that have received attention for computing and storing Big Data that is usually greater than petabytes. Besides, I introduce case studies that use Hadoop and NoSQL DBs.
In this session you will learn:
HBase Introduction
Row & Column storage
Characteristics of a huge DB
What is HBase?
HBase Data-Model
HBase vs RDBMS
HBase architecture
HBase in operation
Loading Data into HBase
HBase shell commands
HBase operations through Java
HBase operations through MR
To know more, click here: https://www.mindsmapped.com/courses/big-data-hadoop/big-data-and-hadoop-training-for-beginners/
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B... (HBaseCon)
Speakers: Chris Huang and Scott Miao (Trend Micro)
Trend Micro collects a great deal of threat knowledge data for clients, containing many different threat (web) entities. Most threat entities are observed along with relations, such as malicious behaviors or interaction chains among them. So, we built a graph model on HBase to store all the known threat entities and their relationships, allowing clients to query threat relationships via any given threat entity. This presentation covers what problems we tried to solve, which design decisions we made and why, how we designed the graph model, and the graph computation tasks involved.
Chicago Data Summit: Apache HBase: An Introduction (Cloudera, Inc.)
Apache HBase is an open source distributed data-store capable of managing billions of rows of semi-structured data across large clusters of commodity hardware. HBase provides real-time random read-write access as well as integration with Hadoop MapReduce, Hive, and Pig for batch analysis. In this talk, Todd will provide an introduction to the capabilities and characteristics of HBase, comparing and contrasting it with traditional database systems. He will also introduce its architecture and data model, and present some example use cases.
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ... (HBaseCon)
Phoenix has evolved to become a full-fledged relational database layer over HBase data. We'll discuss the fundamental principles of how Phoenix pushes the computation to the server and why this leads to performance enabling direct support of low-latency applications, along with some major new features. Next, we'll outline our approach for transaction support in Phoenix, a work in-progress, and discuss the pros and cons of the various approaches. Lastly, we'll examine the current means of integrating Phoenix with the rest of the Hadoop ecosystem.
Do you need to handle massive data, where hundreds of gigabytes growing into terabytes or even petabytes are part of your day-to-day? Do you need to perform thousands of operations per second on multiple terabytes of data? Come and learn about Apache HBase, a NoSQL database that runs on top of HDFS and is highly available, fault tolerant, and scalable. HBase has been widely used at companies such as Facebook and Twitter. This talk gives an introduction, showing what HBase is and when to use it, its architecture, and examples of real solutions from large companies such as Facebook, Twitter, and Trend Micro.
In this introduction to Apache Hive the following topics are covered:
1. Hive Introduction
2. Hive origin
3. Where does Hive fall in the Big Data stack
4. Hive architecture
5. Its job execution mechanisms
6. HiveQL and Hive Shell
7. Types of tables
8. Querying data
9. Partitioning
10. Bucketing
11. Pros
12. Limitations of Hive
Data Explosion
- TBs of data generated every day
Solution – HDFS to store data and Hadoop Map-Reduce framework to parallelize processing of Data
What is the catch?
Hadoop Map Reduce is Java intensive
Thinking in Map Reduce paradigm can get tricky
From: DataWorks Summit 2017 - Munich - 20170406
2. About Me
• Jonathan Gray
– HBase Committer
– HBase User since early 2008
– Migrated large PostgreSQL instance to HBase
• In production @ streamy.com since June 2008
– Core contributor to performance improvements
in HBase 0.20
– Currently consulting around HBase
• As well as Hadoop/MR and Lucene/Katta
3. Overview
• Why HBase?
• What is HBase?
• How does HBase work?
• HBase Today and Tomorrow
• HBase vs. RDBMS Example
• HBase and “NoSQL”
4. Why HBase?
• Same reasons we need Hadoop
– Datasets growing into Terabytes and Petabytes
– Scaling out is cheaper than scaling up
• Continue to grow just by adding commodity nodes
• But sometimes Hadoop is not enough
– Need to support random reads and random writes
Traditional databases are expensive to scale
and difficult to distribute
5. What is HBase?
• Distributed
• Column-Oriented
• Multi-Dimensional
• High-Availability
• High-Performance
• Storage System
Project Goal
Billions of Rows * Millions of Columns * Thousands of Versions
Petabytes across thousands of commodity servers
6. HBase is not…
• A Traditional SQL Database
– No joins, no query engine, no types, no SQL
– Transactions and secondary indexing possible but these
are add-ons, not part of core HBase
• A drop-in replacement for your RDBMS
• You must be OK with RDBMS anti-schema
– Denormalized data
– Wide and sparsely populated tables
Just say “no” to your inner DBA
7. How does HBase work?
• Two types of HBase nodes:
Master and RegionServer
• Master (one at a time)
– Manages cluster operations
• Assignment, load balancing, splitting
– Not part of the read/write path
– Highly available with ZooKeeper and standbys
• RegionServer (one or more)
– Hosts tables; performs reads, buffers writes
– Clients talk directly to them for reads/writes
8. HBase Tables
• An HBase cluster is made up of any number of user-defined tables
• Table schema only defines its column families
– Each family consists of any number of columns
– Each column consists of any number of versions
– Columns only exist when inserted, NULLs are free
– Everything except table/family names are byte[]
– Rows in a table are sorted and stored sequentially
– Columns in a family are sorted and stored sequentially
(Table, Row, Family, Column, Timestamp) → Value
9. HBase Table as Data Structures
• A table maps rows to its families
– SortedMap(Row → List(ColumnFamilies))
• A family maps column names to versioned values
– SortedMap(Column → SortedMap(VersionedValues))
• A column maps timestamps to values
– SortedMap(Timestamp → Value)
An HBase table is a three-dimensional sorted map
(row, column, and timestamp)
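The nested-map view of slides 8 and 9 can be sketched in plain Python. This is a toy in-memory model, not the HBase API: dicts with sorted iteration stand in for SortedMap, and all names here are illustrative.

```python
from collections import defaultdict

class ToyTable:
    """Toy model of the HBase data shape:
    row -> family -> column -> timestamp -> value, with sorted traversal."""

    def __init__(self):
        self.data = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

    def put(self, row, family, column, timestamp, value):
        self.data[row][family][column][timestamp] = value

    def get(self, row, family, column):
        """Return the newest version, like a default HBase Get."""
        versions = self.data[row][family][column]
        return versions[max(versions)] if versions else None

    def scan(self):
        """Yield cells sorted by row, family, column, then timestamp descending."""
        for row in sorted(self.data):
            for family in sorted(self.data[row]):
                for column in sorted(self.data[row][family]):
                    for ts in sorted(self.data[row][family][column], reverse=True):
                        yield row, family, column, ts, self.data[row][family][column][ts]

t = ToyTable()
t.put("row1", "content", "type", 100, "text/html")
t.put("row1", "content", "type", 200, "text/plain")
print(t.get("row1", "content", "type"))  # newest version wins: text/plain
```

Note how "columns only exist when inserted" falls out naturally: an absent column is simply a missing dict key, so NULLs cost nothing.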
10. HBase Regions
• Table is made up of any number of regions
• Region is specified by its startKey and endKey
– Empty table:
(Table, NULL, NULL)
– Two-region table:
(Table, NULL, “MidKey”) and (Table, “MidKey”, NULL)
• A region only lives on one RegionServer at a time
• Each region may live on a different node and is
made up of several HDFS files and blocks, each of
which is replicated by Hadoop
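The startKey/endKey routing described above can be sketched with a sorted list of region boundaries. The names below are hypothetical; a real HBase client resolves regions through the catalog tables rather than a local list.

```python
import bisect

# Regions as half-open key ranges [startKey, endKey); "" stands in for NULL
# (the unbounded first startKey / last endKey from the slide).
region_start_keys = ["", "g", "p"]   # three regions: [.., "g"), ["g", "p"), ["p", ..)
region_servers = ["rs1", "rs2", "rs3"]

def region_for(row_key):
    """Pick the region whose key range contains row_key."""
    # bisect_right finds the last startKey <= row_key
    idx = bisect.bisect_right(region_start_keys, row_key) - 1
    return region_servers[idx]

print(region_for("apple"))   # rs1
print(region_for("mango"))   # rs2
print(region_for("zebra"))   # rs3
```

Because the ranges partition the whole key space, every row key lands in exactly one region, which is what lets a region live on exactly one RegionServer at a time.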
11. More HBase Architecture
• Region information and locations stored in special
tables called catalog tables
– -ROOT- table contains the location of the .META. table
– .META. table contains schema/locations of user regions
• Location of -ROOT- is stored in ZooKeeper
– This is the “bootstrap” location
• ZooKeeper is used for coordination / monitoring
– Leader election to decide who is master
– Ephemeral nodes to detect RegionServer node failures
14. HBase Key Features
• Automatic partitioning of data
– As data grows, it is automatically split up
• Transparent distribution of data
– Load is automatically balanced across nodes
• Tables are ordered by row, rows by column
– Designed for efficient scanning (not just gets)
– Composite keys allow ORDER BY / GROUP BY
• Server-side filters
• No SPOF because of ZooKeeper integration
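The "composite keys allow ORDER BY / GROUP BY" point can be sketched as follows. This is a hedged example of a common pattern, not code from the deck: concatenating a user id with a reversed timestamp makes a plain sorted scan return each user's rows newest-first (in Java this would be Long.MAX_VALUE - ts; here 2**63 - 1 - ts).

```python
MAX_LONG = 2**63 - 1

def composite_key(user_id: str, ts: int) -> bytes:
    """Build a composite row key: user id, separator, reversed timestamp.
    Reversing the timestamp sorts newer events before older ones."""
    reversed_ts = MAX_LONG - ts
    return user_id.encode() + b"\x00" + reversed_ts.to_bytes(8, "big")

keys = sorted([
    composite_key("alice", 100),
    composite_key("alice", 300),
    composite_key("bob", 200),
])
# Scanning the sorted keys groups rows by user (GROUP BY-like) and orders
# each group newest event first (ORDER BY time DESC-like):
# alice@300, alice@100, bob@200
```

The `\x00` separator is one illustrative way to keep a short user id from sorting as a prefix of a longer one.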
15. HBase Key Features (cont)
• Fast adding/removing of nodes while online
– Moving locations of data doesn’t move data
• Supports creating/modifying tables online
– Both table-level and family-level configuration
parameters
• Close ties with Hadoop MapReduce
– TableInputFormat/TableOutputFormat
– HFileOutputFormat
16. Connecting to HBase
• Native Java Client/API
– Get, Scan, Put, Delete classes
– HTable for read/write, HBaseAdmin for admin stuff
• Non-Java Clients
– Thrift server (Ruby, C++, PHP, etc)
– REST server (stargate contrib)
• HBase Shell
– JRuby shell supports put, delete, get, scan
– Also supports administrative tasks
• TableInputFormat/TableOutputFormat
17. HBase Add-ons
• MapReduce / Cascading / Hive / Pig
– Support for HBase as a data source or sink
• Transactional HBase
– Distributed transactions using OCC
• Indexed HBase
– Utilizes Transactional HBase for secondary indexing
• IHBase
– New contrib for in-memory secondary indexes
• HBql
– SQL syntax on top of HBase
18. HBase Today
• Latest stable release is HBase 0.20.3
– Major improvement over HBase 0.19
– Focus on performance improvement
– Add ZooKeeper, remove SPOF
– Expansion of in-memory and caching capabilities
– Compatible with Hadoop 0.20.x
– Recommend upgrading from earlier 0.20.x HBase
releases as 0.20.3 includes some important fixes
• Improves logging, shell, faster cluster ops, stability
19. HBase in Production
• Streamy
• StumbleUpon
• Adobe
• Meetup
• Ning
• Openplaces
• Powerset
• SocialMedia.com
• TrendMicro
20. The Future of HBase
• Next release is HBase 0.21.0
– Release date will be ~1 month after Hadoop 0.21
• Data durability is fixed in this release
– HDFS append/sync finally works in Hadoop 0.21
– This is implemented and working on TRUNK
– Have added group commit and knobs to adjust
• Other cool features
– Inter-cluster replication
– Master Rewrite
– Parallel Puts
– Co-processors
21. HBase Web Crawl Example
• Store web crawl data
– Table crawl with family content
– Row is URL with Columns
• content:data stores raw crawled data
• content:language stores http language header
• content:type stores http content-type header
– If processing raw data for hyperlinks and images,
add families links and images
• links:<url> column for each hyperlink
• images:<url> column for each image
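The crawl-table layout above can be sketched as follows, with a dict standing in for one HBase row keyed by URL. The column names come from the slide; the helper function is purely illustrative, not an HBase client call.

```python
def crawl_row(url, data, language, ctype, links=(), images=()):
    """Build one row of the crawl table as a flat family:qualifier -> value map."""
    row = {
        "content:data": data,
        "content:language": language,
        "content:type": ctype,
    }
    # One dynamically named column per hyperlink/image: columns only exist
    # when inserted, so sparse rows with few links cost nothing.
    for link in links:
        row[f"links:{link}"] = ""
    for img in images:
        row[f"images:{img}"] = ""
    return url, row

key, row = crawl_row("com.example/index.html", b"<html>...</html>",
                     "en", "text/html",
                     links=["com.example/about"],
                     images=["com.example/logo.png"])
print(sorted(row))  # content:* columns plus one links:* and one images:* column
```

Using the target URL itself as the column qualifier is what lets a single row hold an arbitrary, growing set of outgoing links without any schema change.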
22. Web Crawl Example in RDBMS
• How would this look in a traditional DB?
– Table crawl with columns url, data, language, and
type
– Table links with columns url and link
– Table images with columns url and image
• How will this scale?
– 10M documents w/ avg. 10 links and 10 images
– 210M total rows versus 10M total rows
– Index bloat with links/images tables
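The 210M-vs-10M comparison on this slide is simple arithmetic, checked here:

```python
docs = 10_000_000
links_per_doc = images_per_doc = 10

# Normalized RDBMS: one row per document, plus one row per link and per image
rdbms_rows = docs * (1 + links_per_doc + images_per_doc)

# HBase: links and images become extra columns on the same document rows
hbase_rows = docs

print(f"{rdbms_rows:,} rows vs {hbase_rows:,} rows")  # 210,000,000 rows vs 10,000,000 rows
```

Every link and image row in the RDBMS also needs index entries on `url`, which is where the "index bloat" comes from.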
23. What is “NoSQL”?
• Has little to do with not being SQL
– SQL is just a query language standard
– HBql is an attempt to add SQL syntax to HBase
– Millions are trained in SQL; resistance is futile!
• Popularity of Hive and Pig over raw MapReduce
• Has more to do with anti-RDBMS architecture
– Dropping the relational aspects
– Loosening ACID and transactional elements
24. NoSQL Types and Projects
• Column-oriented
– HBase, Cassandra, Hypertable
• Key/Value
– BerkeleyDB, Tokyo, Memcache, Redis, SimpleDB
• Document
– CouchDB, MongoDB
• Other differentiators as well…
– Strong vs. Eventual consistency
– Database replication vs. Filesystem replication