This document provides an introduction to HBase internals and schema design for HBase users. It discusses the logical and physical views of HBase, including how tables are split into regions and stored across region servers. It covers schema design best practices, such as designing row keys around the expected access patterns and embracing denormalization and redundancy where appropriate. The document also briefly discusses advanced topics like coprocessors and compression. The overall goal is to help HBase users optimize performance and scalability based on its internal architecture.
Intro to HBase Internals & Schema Design (for HBase users)
1. Intro to HBase Internals & Schema Design (for HBase Users)
Alex Baranau, Sematext International, 2012
2. About Me
Software Engineer at Sematext International
http://blog.sematext.com/author/abaranau
@abaranau
http://github.com/sematext (abaranau)
4. Why?
Why should I (an HBase user) care about HBase internals?
HBase will not automatically adjust cluster settings to optimal values based on usage patterns
Schema design, table settings (defined upon creation), etc. depend on HBase implementation aspects
6. Logical View: Regions
An HBase cluster serves multiple tables, distinguished by name
Each table consists of rows
Each row contains cells: (row key, column family, column, timestamp) -> value
A table is split into Regions (table shards, each containing full rows), defined by start and end row keys
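To make the cell coordinates concrete, here is a minimal sketch using the 2012-era HBase Java client; the table name "logins", the family "d" and the values mirror the row-key examples later in the deck and are only illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CellCoordinatesExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "logins");

    // Write a cell at (row key, column family, column, timestamp) -> value;
    // the timestamp is assigned by the server unless set explicitly.
    Put put = new Put(Bytes.toBytes("login_2012-03-01.00:09:17"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("user"), Bytes.toBytes("alex"));
    table.put(put);

    // Read the same cell back by its coordinates.
    Get get = new Get(Bytes.toBytes("login_2012-03-01.00:09:17"));
    Result result = table.get(get);
    byte[] user = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("user"));
    System.out.println(Bytes.toString(user)); // "alex"

    table.close();
  }
}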
7. Logical View: Regions are Shards
Regions are “atoms of distribution”
Each Region is assigned to a single RegionServer (HBase cluster slave)
Rows of a particular Region are served by a single RS (cluster slave)
Regions are distributed evenly across RSs
A Region has a configurable max size
When a Region reaches max size (or on request) it is split into two smaller Regions, which can be assigned to different RSs
8. Logical View: Regions on Cluster
[Diagram: a client talks to the ZooKeeper quorum and the HMaster(s); each RegionServer hosts several Regions]
9. Logical View: Regions Load
It is essential for Regions under load to be evenly distributed across the cluster
It is the HBase user’s job to make sure the above is true. Note: even distribution of Regions over the cluster doesn’t imply that the load is evenly distributed
10. Logical View: Regions Load
Take into account that rows are stored in sorted order
Make sure you don’t write rows with sequential keys, to avoid RS hotspotting*
When writing data with monotonically increasing/decreasing keys, data is written to one RS at a time
Pre-split the table upon creation (see the sketch below)
Starting with a single region means using one RS for some time
In general, splitting can be expensive
Increase max region size
* see https://github.com/sematext/HBaseWD
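A minimal sketch of pre-splitting a table upon creation with the 0.9x-era Java admin API; the table name "events", the single family "d", the region count, and the assumption that row keys begin with a one-byte hash prefix are all illustrative, not from the slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class PreSplitExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("events");
    desc.addFamily(new HColumnDescriptor("d"));

    // Pre-split into 16 regions so writes hit many RegionServers from the start.
    // Split keys must match the key distribution; here we assume every row key
    // starts with a one-byte hash prefix covering the whole 0x00..0xFF range.
    int numRegions = 16;
    byte[][] splits = new byte[numRegions - 1][];
    for (int i = 1; i < numRegions; i++) {
      splits[i - 1] = new byte[] { (byte) (i * (256 / numRegions)) };
    }
    admin.createTable(desc, splits);
    admin.close();
  }
}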
11. Logical View: Slow RSs
When load is distributed evenly, watch for the slowest RSs (HBase slaves)
Since every Region is served by a single RS, one slow RS can slow down cluster performance, e.g. when:
data is written into multiple RSs at an even pace (random value-based row keys)
data is being read from many RSs when doing a scan
13. Physical View: Write/Read Flow
[Diagram: clients write (via the HTable client-side buffer) and read through a RegionServer; each Region holds a Store per column family with a MemStore that flushes to HFiles; writes also go to the Write Ahead Log; all files live in HDFS]
14. Physical: Speed up Writing
Enabling & increasing the client-side buffer reduces the number of RPC operations (see the sketch below)
warn: possible loss of buffered data in case of client failure; design for failover
in case of write failure (networking/server-side issues), this can be handled on the client
Disabling the WAL increases write speed
warn: possible data loss in case of RS failure
Use the bulk import functionality (it writes HFiles directly, which can later be added to HBase)
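A minimal sketch of the client-side knobs mentioned above, using the 0.9x-era API (HTable.setAutoFlush and Put.setWriteToWAL; later versions use BufferedMutator and Put.setDurability instead); the table name, family and buffer size are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class FastWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "events");

    // Enable the client-side write buffer: Puts are batched and sent in fewer RPCs.
    // Buffered data is lost if the client dies before flushCommits().
    table.setAutoFlush(false);
    table.setWriteBufferSize(4 * 1024 * 1024); // 4 MB buffer

    Put put = new Put(Bytes.toBytes("some-row-key"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("user"), Bytes.toBytes("alex"));
    // Skipping the WAL speeds up writes but risks data loss on RegionServer failure.
    put.setWriteToWAL(false);
    table.put(put);

    table.flushCommits(); // push the buffered Puts to the RegionServers
    table.close();
  }
}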
15. Physical: Memstore Flushes
When a memstore is flushed, N HFiles are created (one per CF)
The memstore size which causes flushing is configured on two levels (see the sketch below):
per RS: % of heap occupied by memstores
per table: size in MB of a single memstore (per CF) of a Region
When a Region’s memstores flush, the memstores of all CFs are flushed
An uneven data amount between CFs causes too many flushes & the creation of too many HFiles (one per CF every time)
In most cases having one CF is the best design
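A sketch of where those two levels are configured, under the 0.9x-era API and property names: the per-table flush size can be set on the table descriptor, while the per-RS heap share is a server-side property. The table name, family and sizes are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class MemstoreFlushSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("events");
    desc.addFamily(new HColumnDescriptor("d"));
    // Per-table level: flush this table's Region memstores at 256 MB
    // (the cluster-wide default comes from hbase.hregion.memstore.flush.size).
    desc.setMemStoreFlushSize(256L * 1024 * 1024);
    admin.createTable(desc);
    admin.close();

    // Per-RS level: the share of heap all memstores together may use is a
    // server-side setting in hbase-site.xml
    // (hbase.regionserver.global.memstore.upperLimit in 0.9x-era versions).
  }
}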
16. Physical: Memstore Flushes
Important: there are memstore size thresholds which cause writes to be blocked, so slow memstore flushes and overuse of memory by memstores can cause write performance degradation
Hint: watch the flush queue size metric on RSs
At the same time, the more memory the memstore uses, the better for write/read performance (unless it reaches those “write blocking” thresholds)
17. Physical: Memstore Flushes
Example of a good situation [monitoring chart*]
* http://sematext.com/spm/index.html
18. Physical: HFiles Compaction
HFiles are periodically compacted into bigger HFiles containing the same data
Reading from fewer HFiles is faster
Important: there’s a configured max number of files in a Store which, when reached, causes writes to block
Hint: watch the compaction queue size metric on RSs
[Diagram: a read hits a Store’s MemStore (per CF) and its HFiles]
19. Physical: Data Locality
RSs are usually collocated with HDFS DataNodes
[Diagram: each slave node runs an HBase RegionServer, a MapReduce TaskTracker, and an HDFS DataNode]
20. Physical: Data Locality
HBase tries to assign Regions to RSs so that Region data is stored physically on the same node. But sometimes it fails:
after Region splits there’s no guarantee that there’s a node that has all blocks (HDFS level) of the new Region, and
no guarantee that HBase will not re-assign this Region to a different RS in the future (even distribution of Regions takes preference over data locality)
There’s ongoing work towards better preserving data locality
21. Physical: Data Locality
Also, data locality can break when:
adding new slaves to the cluster
removing slaves from the cluster, incl. node failures
Hint: look at networking IO between slaves when writing/reading data; it should be minimal
Important:
make sure HDFS is well balanced (use the balancer tool)
try to rebalance Regions in the HBase cluster if possible (an HBase Master restart will do that) to regain data locality
pre-split the table on creation to limit (ideally avoid) splits and region movement; managing splits manually sometimes helps
23. Schema: row keys
Using a row key (or key range) is the most efficient way to retrieve data from HBase (see the sketch after the table below)
Row key design is a major part of schema design
Note: no secondary indices are available out of the box
Row Key Data
Row Key Data
‘login_2012-03-01.00:09:17’ d:{‘user’:‘alex’}
... ...
‘login_2012-03-01.23:59:35’ d:{‘user’:‘otis’}
‘login_2012-03-02.00:00:21’ d:{‘user’:‘david’}
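A minimal sketch of the point above: because rows are sorted by key, one day of logins from the example table can be fetched with a single range scan. The table name "logins" and the "d"/"user" family and qualifier follow the example rows; everything else is an assumption.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyRangeScanExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "logins");

    // Rows are stored sorted by key, so all logins of 2012-03-01 can be read
    // with one range scan over the key prefix (start inclusive, stop exclusive).
    Scan scan = new Scan(Bytes.toBytes("login_2012-03-01"),
                         Bytes.toBytes("login_2012-03-02"));
    ResultScanner scanner = table.getScanner(scan);
    for (Result row : scanner) {
      byte[] user = row.getValue(Bytes.toBytes("d"), Bytes.toBytes("user"));
      System.out.println(Bytes.toString(row.getRow()) + " -> " + Bytes.toString(user));
    }
    scanner.close();
    table.close();
  }
}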
24. Schema: row keys
Redundancy is OK! (see the sketch after the table below)
warn: changing two rows in HBase is not an atomic operation
Row Key Data
‘login_2010-01-01.00:09:17’ d:{‘user’:‘alex’}
... ...
‘login_2012-03-01.23:59:35’ d:{‘user’:‘otis’}
‘alex_2010-01-01.00:09:17’ d:{‘action’:‘login’}
... ...
‘otis_2012-03-01.23:59:35’ d:{‘action’:‘login’}
‘alex_login_2010-01-01.00:09:17’ d:{‘device’:’pc’}
... ...
‘otis_login_2012-03-01.23:59:35’ d:{‘device’:‘mobile’}
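A sketch of the redundancy pattern above: the same login event is written under both a time-first and a user-first row key, so it can be found by either access pattern; as the slide warns, the two rows are not updated atomically. The table handle, family "d" and the recordLogin helper are illustrative.

import java.util.Arrays;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class RedundantWriteExample {
  // Writes the same login event under two different row keys so it can be
  // looked up both by time and by user. The two rows may live in different
  // Regions, so the pair of changes is not atomic: the client must tolerate
  // one write succeeding and the other failing.
  static void recordLogin(HTable table, String user, String timestamp) throws Exception {
    Put byTime = new Put(Bytes.toBytes("login_" + timestamp));
    byTime.add(Bytes.toBytes("d"), Bytes.toBytes("user"), Bytes.toBytes(user));

    Put byUser = new Put(Bytes.toBytes(user + "_" + timestamp));
    byUser.add(Bytes.toBytes("d"), Bytes.toBytes("action"), Bytes.toBytes("login"));

    table.put(Arrays.asList(byTime, byUser));
  }
}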
25. Schema: Relations
Not relational
No joins
Denormalization is OK! Use ‘nested entities’ (see the sketch below)
Row Key                Data
‘student_abaranau’     d:{
                         student_firstname:Alex,
                         student_lastname:Baranau,
                         professor_math_firstname:David,
                         professor_math_lastname:Smart,
                         professor_cs_firstname:Jack,
                         professor_cs_lastname:Weird,
                       }
‘prof_dsmart’          d:{...}
(the single ‘student_abaranau’ row nests both the student entity and its professors)
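A sketch of the nested-entities row from the table above: the student and the professors are stored as qualifiers of one row, so a single Get returns everything without a join. The saveStudent helper and the family "d" are illustrative.

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class NestedEntitiesExample {
  // Denormalize: nest the student and its professors into the same row.
  static void saveStudent(HTable table) throws Exception {
    Put put = new Put(Bytes.toBytes("student_abaranau"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("student_firstname"), Bytes.toBytes("Alex"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("student_lastname"), Bytes.toBytes("Baranau"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("professor_math_firstname"), Bytes.toBytes("David"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("professor_math_lastname"), Bytes.toBytes("Smart"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("professor_cs_firstname"), Bytes.toBytes("Jack"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("professor_cs_lastname"), Bytes.toBytes("Weird"));
    table.put(put);
  }
}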
26. Schema: row key/CF/qual size
HBase stores cells individually
great for “sparse” data
the row key, CF name and column name are stored with each cell, which may affect the amount of data to be stored and managed
keep them short
serialize and store many values into a single cell (see the sketch below)
Row Key         Data
‘s_abaranau’    d:{
                  s:Alex#Baranau#cs#2009,
                  p_math:David#Smart,
                  p_cs:Jack#Weird,
                }
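The same row as a sketch with short names and packed values, mirroring the table above; the '#'-delimited packing is only for illustration (a real application would use a proper serialization format such as Avro or protobuf).

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ShortNamesExample {
  // Short row key, short CF ("d"), short qualifiers, and several values
  // serialized into one cell to keep per-cell overhead low.
  static void saveStudent(HTable table) throws Exception {
    Put put = new Put(Bytes.toBytes("s_abaranau"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("s"), Bytes.toBytes("Alex#Baranau#cs#2009"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("p_math"), Bytes.toBytes("David#Smart"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("p_cs"), Bytes.toBytes("Jack#Weird"));
    table.put(put);
  }
}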
28. Advanced: Co-Processors
The Coprocessors API (HBase 0.92.0+) allows you to:
execute (querying/aggregation/etc.) logic on the server side (you may think of it as stored procedures in an RDBMS)
perform auditing of actions performed on the server side (you may think of it as triggers in an RDBMS)
apply security rules for data access
and much more (see the sketch below)
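A sketch of the “trigger”-style use case: a RegionObserver whose prePut hook runs on the RegionServer before every write to the table it is attached to. The signature shown matches the 0.92/0.94-era API and changed in later HBase versions; the class name and the logging are illustrative.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Invoked server-side before each Put, similar to an RDBMS trigger.
public class AuditingObserver extends BaseRegionObserver {
  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                     Put put, WALEdit edit, boolean writeToWAL) throws IOException {
    System.out.println("About to write row: " + Bytes.toString(put.getRow()));
  }
}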
29. Other: Use Compression
Using compression:
reduces the amount of data to be stored on disk
reduces the amount of data to be transferred when an RS reads data not from a local replica
increases the amount of CPU used, but CPU usually isn’t a bottleneck
Favor compression speed over compression ratio
SNAPPY is good
Use wisely (see the sketch below):
e.g. avoid wasting CPU cycles on compressing images
compression can be configured on a per-CF basis, so storing non-compressible data in a separate CF sometimes helps
data blocks are uncompressed in memory; make sure this doesn’t cause an OOME
note: when scanning (seeking data to return for the scan) many data blocks can be uncompressed even if none of the data from those blocks will be returned
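A sketch of enabling SNAPPY on a per-CF basis with the 0.9x-era admin API (note the Compression class moved to a different package in later HBase versions); the table name "events" and the separate uncompressed "img" family for non-compressible data are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression;

public class CompressionExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("events");

    // Compressible data: enable SNAPPY on this column family only.
    HColumnDescriptor data = new HColumnDescriptor("d");
    data.setCompressionType(Compression.Algorithm.SNAPPY);
    desc.addFamily(data);

    // Already-compressed payloads (e.g. images): keep a separate, uncompressed CF.
    desc.addFamily(new HColumnDescriptor("img"));

    admin.createTable(desc);
    admin.close();
  }
}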
30. Other: Use Monitoring
TBD
Ganglia, Cacti, others*. Just use it!
* http://sematext.com/spm/index.html