Most developers are familiar with the topic of “database design”. In the relational world, normalization is the name of the game. How do things change when you’re working with a scalable, distributed, non-SQL database like HBase? This talk will cover the basics of HBase schema design at a high level and give several common patterns and examples of real-world schemas to solve interesting problems. The storage and data access architecture of HBase (row keys, column families, etc.) will be explained, along with the pros and cons of different schema decisions.
Chicago Data Summit: Apache HBase: An Introduction (Cloudera, Inc.)
Apache HBase is an open source distributed data-store capable of managing billions of rows of semi-structured data across large clusters of commodity hardware. HBase provides real-time random read-write access as well as integration with Hadoop MapReduce, Hive, and Pig for batch analysis. In this talk, Todd will provide an introduction to the capabilities and characteristics of HBase, comparing and contrasting it with traditional database systems. He will also introduce its architecture and data model, and present some example use cases.
This deck presents the best practices of using Apache Hive with good performance. It covers getting data into Hive, using ORC file format, getting good layout into partitions and files based on query patterns, execution using Tez and YARN queues, memory configuration, and debugging common query performance issues. It also describes Hive Bucketing and reading Hive Explain query plans.
This presentation describes how to efficiently load data into Hive. I cover partitioning, predicate pushdown, ORC file optimization and different loading schemes
Hadoop World 2011: Advanced HBase Schema Design - Lars George (Cloudera, Inc.)
While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second.
This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns.
Overview of HBase cluster replication feature, covering implementation details as well as monitoring tools and tips for troubleshooting and support of Replication deployments.
There's a big shift at both the architecture and API level from Hadoop 1 to Hadoop 2, particularly YARN, and we held our first meetup to talk about this (http://www.meetup.com/Atlanta-YARN-User-Group/) on 10/13/2013.
Apache HBase™ is the Hadoop database: a distributed, scalable, big data store. It's a column-oriented database management system that runs on top of HDFS.
Apache HBase is an open source NoSQL database that provides real-time read/write access to large data sets. HBase is natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN.
This talk delves into the many ways a user can put HBase to work in a project. Lars will look at many practical examples based on real applications in production, for example at Facebook and eBay, and the right approach for those wanting to find their own implementation. He will also discuss advanced concepts, such as counters, coprocessors and schema design.
HBase has established itself as the backend for many operational and interactive use-cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tunable write-ahead logging and so on. This talk is based on the research for the upcoming second edition of the speaker's HBase book, combined with practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of the matching use-cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce (Cloudera, Inc.)
As more workloads move to serverless-like environments, the importance of properly handling downscaling increases. While recomputing the entire RDD makes sense for dealing with machine failure, if your nodes are being removed frequently you can end up in a seemingly endless loop: you scale down and need to recompute the expensive part of your computation, scale back up, and then need to scale back down again.
Even if you aren't in a serverless-like environment, preemptible or spot instances can encounter similar issues with large decreases in workers, potentially triggering large recomputes. In this talk, we explore approaches for improving the scale-down experience on open source cluster managers such as YARN and Kubernetes: everything from how to schedule jobs to the location of blocks and their impact (shuffle and otherwise).
HBase In Action - Chapter 04: HBase table design (phanleson)
Learning HBase, Real-time Access to Your Big Data, Data Manipulation at Scale, Big Data, Text Mining, HBase, Deploying HBase
Breaking with relational DBMS and dating with HBase [5th IndicThreads.com Conference] (IndicThreads)
Session Presented at 5th IndicThreads.com Conference On Java held on 10-11 December 2010 in Pune, India
WEB: http://J10.IndicThreads.com
------------
HBase is an open-source, non-relational, distributed, sparse, column-oriented data store modeled after Google's BigTable and written in Java.
In this presentation we will talk about how to migrate an RDBMS-based Java application to an HBase-based application. We will discuss the following points:
• HBase schema design (a paradigm shift from the way we think about data storage right now) compared to RDBMS-based schema design.
• The challenges faced while porting the application to HBase.
• Introduction to HBql for querying data from HBase.
• Monitoring the example application for HBase (via the exposed JMX APIs) and machine performance with Ganglia.
• Discussion of the Thrift interface, and how the REST interface can be used to integrate HBase with non-Java applications.
• Cluster replication, and what is coming in the next major 0.90 release of HBase.
• We will end the session with a demo of the ported application.
Takeaways for the audience:
1. When is HBase appropriate, and when not?
2. HBase architecture and schema design
3. RDBMS vs. HBase
4. Interfacing HBase with applications using Thrift or REST
5. HBase clusters and replication
6. HBase monitoring
Architectural anti-patterns for data handling (Gleicon Moraes)
Now with three more anti-patterns and a new required listening. This is the Discipline release; all hail King Crimson and Fripp's care with details.
Big Data Frameworks: Introduction to NoSQL – Aggregate Data Models – HBase: Data Model and Implementations – HBase Clients – Examples – Cassandra: Data Model – Examples – Cassandra Clients – Hadoop Integration. Pig – Grunt – Pig Data Model – Pig Latin – developing and testing Pig Latin scripts. Hive – Data Types and File Formats – HiveQL Data Definition – HiveQL Data Manipulation – HiveQL Queries
Domain-Driven Design is a software development process that focuses on finding a common language for the involved parties. This language and the resulting models are taken from the domain rather than the technical details of the implementation. The goal is to improve communication between customers, developers and all other involved groups. Even though Eric Evans's book on the topic was written almost ten years ago, it remains important because a lot of projects fail for communication reasons.
Relational databases have their own language and push the design of software in a direction further away from the domain: entities have to be created for the sole purpose of adhering to the best practices of relational databases. Two kinds of NoSQL databases are changing that: document stores and graph databases. In a document store you can model a "contains" relation in a more natural way and thereby express whether an entity can exist outside of its surrounding entity. A graph database allows you to model relationships between entities in a straightforward way that can be expressed in the language of the domain.
In this talk I want to look at the way a multi-model database that combines a document store and a graph database can help you model your problems in a way that is understandable to all parties involved, and explain the benefits of this approach for the software development process.
Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple data centres, with asynchronous masterless replication allowing low-latency operations for all clients.
Recently a new breed of "multi-model" databases has emerged. They combine a document store, a graph database and a key/value store in one program, and are therefore able to cover many use cases that would otherwise require multiple different database systems.
This approach promises a boost to the idea of "polyglot persistence", which has become very popular in recent years even though it creates some friction in the form of data conversion and synchronisation between different systems. With a multi-model database, one can enjoy the benefits of polyglot persistence without those disadvantages.
In this talk I will explain the motivation behind the multi-model approach, discuss its advantages and limitations, and then venture some predictions about the NoSQL database market five years from now, which I shall only reveal during the talk.
MyLife with HBase, or HBase Three Flavors (responseteam)
Description:
HBase is a NoSQL column store. What does that mean, functionally, to a software developer?
-A conceptual view of HBase
-How to use HBase
-What features HBase has
-Benefits of HBase
How are we using HBase here at MyLife? I will describe three projects here at MyLife that are currently using HBase in production that I was/am involved with.
-Email content storage
-Connection-Identity mappings
-User stream cache backing
Each of these projects uses HBase in a different way.
6. What Is Schema Design?
Logical Data Modeling
+
Physical Implementation
7. You always start with a logical model.
Even if it's just an implicit one.
That's totally fine. (If you're right.)
8. There are lots of ways to model data.
The most common one is:
Entity / Attribute / Relationship
(This is probably what you know
of as just "data modeling".)
9. There's a well established visual
language for data modeling.
15. Example: Concerts
A note about diagrams: they're useful
for communicating, but can be more
trouble than they're worth. Don't do
them out of obligation; only do them to
understand your problem better.
21. So, what's missing?
If your data is not massive,
NOTHING.
You should use a relational database. They rock*
22. So, what's missing?
If your data is not massive,
NOTHING.
You should use a relational database. They rock*
* - This statement has not been approved by the HBase product management committee, and neglects known
deficiencies with the relational model such as poor modeling of hierarchies and graphs, overly rigid attribute structure
enforcement, neglect of the time dimension, and physical optimization concerns leaking into the conceptual
abstraction.
23. Relational DBs work well because they
are close to the pure logical model.
That model is less likely to change as your
business needs change. You may want to ask
different questions over time, but if you got the
logical model correct, you'll have the answers.
24. Ah, but what if you do have massive
data? Then what's missing?
26. Problem: The relational model
doesn't acknowledge scale.
"It's an implementation concern;
you shouldn't have to worry about it."
27. The trouble is, you do have
to worry about it. So you...
● Add indexes
● Add hints
● Write really complex, messy SQL
● Memorize books by Tom Kyte & Joe Celko
● Bow down to the optimizer!
● Denormalize
● Cache
● etc ...
29. So then you hear about this thing
called NoSQL. Can it help?
30. Maybe. But ultimately, it's just a
different way of physically representing
your same logical data model.
Some things are easier; some are much harder.
If you haven't started by understanding your logical
model, you're doing it wrong.
33. HBase Architecture: A Brief Recap
• Scales by splitting all rows into regions
• Each region is hosted by exactly one server
34. HBase Architecture: A Brief Recap
• Scales by splitting all rows into regions
• Each region is hosted by exactly one server
• Writes are held (sorted) in memory until flush
35. HBase Architecture: A Brief Recap
• Scales by splitting all rows into regions
• Each region is hosted by exactly one server
• Writes are held (sorted) in memory until flush
• Reads merge rows in memory with flushed files
36. HBase Architecture: A Brief Recap
• Scales by splitting all rows into regions
• Each region is hosted by exactly one server
• Writes are held (sorted) in memory until flush
• Reads merge rows in memory with flushed files
• Reads & writes to a single row are consistent
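The four points above can be condensed into a toy model. The sketch below is illustrative Python only (all names are invented; this is not the actual HBase implementation): writes collect in an in-memory buffer until a flush writes them out sorted, and reads merge that buffer with the flushed files, which is why reads and writes to a single row stay consistent.

```python
class RegionSketch:
    """Toy model of one HBase region: an in-memory write buffer
    (the "memstore") plus a list of immutable flushed "files"."""

    def __init__(self, flush_threshold=2):
        self.memstore = {}        # row key -> value, held in memory
        self.flushed_files = []   # each flush appends one immutable, sorted file
        self.flush_threshold = flush_threshold

    def put(self, row_key, value):
        self.memstore[row_key] = value          # writes are held in memory...
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # ...and periodically written out, sorted, as one immutable file.
        if self.memstore:
            self.flushed_files.append(dict(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, row_key):
        # Reads consult the memstore first, then flushed files newest-first,
        # so a read of a single row always sees its latest write.
        if row_key in self.memstore:
            return self.memstore[row_key]
        for f in reversed(self.flushed_files):
            if row_key in f:
                return f[row_key]
        return None

region = RegionSketch(flush_threshold=2)
region.put("row-a", "v1")
region.put("row-b", "v1")    # second distinct row triggers a flush to "disk"
region.put("row-a", "v2")    # newer version lives only in the memstore
print(region.get("row-a"))   # v2 (served from memory)
print(region.get("row-b"))   # v1 (merged in from the flushed file)
```

Real regions flush on memory pressure rather than a row count, but the merge-on-read shape is the same.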
39. HBase Data Model: Brief Recap
Table: design-time namespace, has many rows.
40. HBase Data Model: Brief Recap
Table: design-time namespace, has many rows.
Row: atomic byte array, with one row key
41. HBase Data Model: Brief Recap
Table: design-time namespace, has many rows.
Row: atomic byte array, with one row key
Not just a bunch of bytes, dude! A k/v map!
42. HBase Data Model: Brief Recap
Table: design-time namespace, has many rows.
Row: atomic key/value container, with one row key
43. HBase Data Model: Brief Recap
Table: design-time namespace, has many rows.
Row: atomic key/value container, with one row key
Column: a key in the k/v container inside a row
44. HBase Data Model: Brief Recap
Table: design-time namespace, has many rows.
Row: atomic key/value container, with one row key
Column: a key in the k/v container inside a row
Value: a value in the k/v container inside a row
45. HBase Data Model: Brief Recap
Table: design-time namespace, has many rows.
Row: atomic key/value container, with one row key
Column: a key in the k/v container inside a row
Value: a value in the k/v container inside a row
46. HBase Data Model: Brief Recap
Hold up! What about
Table: design-time namespace, has many rows.
time?
Row: atomic key/value container, with one row key
Column: a key in the k/v container inside a row
Value: a value in the k/v container inside a row
47. HBase Data Model: Brief Recap
Table: design-time namespace, has many rows.
Row: atomic key/value container, with one row key
Column: a key in the k/v container inside a row
Timestamp: long milliseconds, sorted descending
Value: a time-versioned value in the k/v container
48. HBase Data Model: Brief Recap
Table: design-time namespace, has many rows.
Row: atomic key/value container, with one row key
Column: a key in the k/v container inside a row
Timestamp: long milliseconds, sorted descending
Value: a time-versioned value in the k/v container
49. The "row" is atomic, and gets flushed
to disk periodically. But it doesn't have
to be flushed into just a single file!
HBase Data Model: Brief Recap
Table: design-time namespace, has many rows.
Row: atomic key/value container, with one row key
Column: a key in the k/v container inside a row
Timestamp: long milliseconds, sorted descending
Value: a time-versioned value in the k/v container
50. This "row" guy is atomic, and gets
flushed to disk periodically. But it can be
broken up into different store files with
different properties, and reads can
look at just a subset.
HBase Data Model: Brief Recap
Table: design-time namespace, has many rows.
Row: atomic key/value container, with one row key
Column: a key in the k/v container inside a row
Timestamp: long milliseconds, sorted descending
Value: a time-versioned value in the k/v container
51. This "row" guy is atomic, and gets
flushed to disk periodically. It can be
broken up into different store files, and
reads can choose to look at just a subset.
This is called "Column Families". It's
kind of an advanced design option, so
don't think too much about it yet.
HBase Data Model: Brief Recap
Table: design-time namespace, has many rows.
Row: atomic key/value container, with one row key
Column: a key in the k/v container inside a row
Timestamp: long milliseconds, sorted descending
Value: a time-versioned value in the k/v container
52. This "row" guy is atomic, and gets
flushed to disk periodically. It can be
broken up into different store files, and
reads can choose to look at just a subset.
This is called "Column Families".
From the Percolator paper by Google: "Bigtable
allows users to control the performance
characteristics of the table by grouping a set
of columns into a locality group."
That's a good way to think about CFs.
HBase Data Model: Brief Recap
Table: design-time namespace, has many rows.
Row: atomic key/value container, with one row key
Column: a key in the k/v container inside a row
Timestamp: long milliseconds, sorted descending
Value: a time-versioned value in the k/v container
53. HBase Data Model: Brief Recap
Table: design-time namespace, has many rows.
Row: atomic key/value container, with one row key
Column Family: divide columns into physical files
Column: a key in the k/v container inside a row
Timestamp: long milliseconds, sorted descending
Value: a time-versioned value in the k/v container
54. Calling these "columns" is an unfortunate use
of terminology. They're not fixed; each row can
have different keys, and the names are not
defined at design time. So you can represent
another axis of data (in the key of the
key/value pair). More on that later.
HBase Data Model: Brief Recap
Table: design-time namespace, has many rows.
Row: atomic key/value container, with one row key
Column Family: divide columns into physical files
Column: a key in the k/v container inside a row
Timestamp: long milliseconds, sorted descending
Value: a time-versioned value in the k/v container
55. They're officially called "column qualifiers". But
many people just say "columns".
Or "CQ". Or "Quallie". Now you're one of the cool kids.
HBase Data Model: Brief Recap
Table: design-time namespace, has many rows.
Row: atomic key/value container, with one row key
Column Family: divide columns into physical files
Column: a key in the k/v container inside a row
Timestamp: long milliseconds, sorted descending
Value: a time-versioned value in the k/v container
56. What data types are stored in key/value pairs?
HBase Data Model: Brief Recap
Table: design-time namespace, has many rows.
Row: atomic key/value container, with one row key
Column Family: divide columns into physical files
Column: a key in the k/v container inside a row
Timestamp: long milliseconds, sorted descending
Value: a time-versioned value in the k/v container
57. What data types are stored in key/value pairs?
It's all bytes.
HBase Data Model: Brief Recap
Table: design-time namespace, has many rows.
Row: atomic key/value container, with one row key
Column Family: divide columns into physical files
Column: a key in the k/v container inside a row
Timestamp: long milliseconds, sorted descending
Value: a time-versioned value in the k/v container
58. What data types are stored in key/value pairs?
Row keys, column names, values: arbitrary bytes
HBase Data Model: Brief Recap
Table: design-time namespace, has many rows.
Row: atomic key/value container, with one row key
Column Family: divide columns into physical files
Column: a key in the k/v container inside a row
Timestamp: long milliseconds, sorted descending
Value: a time-versioned value in the k/v container
59. What data types are stored in key/value pairs?
Row keys, column names, values: arbitrary bytes
Table and column family names: printable characters
HBase Data Model: Brief Recap
Table: design-time namespace, has many rows.
Row: atomic key/value container, with one row key
Column Family: divide columns into physical files
Column: a key in the k/v container inside a row
Timestamp: long milliseconds, sorted descending
Value: a time-versioned value in the k/v container
60. What data types are stored in key/value pairs?
Row keys, column names, values: arbitrary bytes
Table and column family names: printable characters
Timestamps: long integers
HBase Data Model: Brief Recap
Table: design-time namespace, has many rows.
Row: atomic key/value container, with one row key
Column Family: divide columns into physical files
Column: a key in the k/v container inside a row
Timestamp: long milliseconds, sorted descending
Value: a time-versioned value in the k/v container
61. HBase Data Model: Brief Recap
One more thing that bears repeating: every "cell" (i.e.
the time-versioned value of one column in one row)
is stored "fully qualified" (with its full rowkey, column
family, column name, etc.) on disk.
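The "fully qualified" point above is worth making concrete: every cell carries its whole coordinate set. A rough back-of-the-envelope sketch (this is illustrative accounting, not the actual HFile binary layout; the row key and column names are made up):

```python
# Every cell stores its full coordinates, so the row key and column
# family are repeated for every value in the row.
def cell_size(row: bytes, cf: bytes, qualifier: bytes,
              value: bytes, ts_bytes: int = 8) -> int:
    """Approximate stored size of one fully qualified cell."""
    return len(row) + len(cf) + len(qualifier) + ts_bytes + len(value)

row = b"com.example.bandsite/jonathan-coulton"   # 37-byte row key
cells = [(b"cf1", b"name",  b"Jonathan Coulton"),
         (b"cf1", b"genre", b"Nerd Core")]
total = sum(cell_size(row, cf, q, v) for cf, q, v in cells)
# The 37-byte row key alone is repeated in both cells (74 bytes),
# which is why short row keys and CF names matter.
```

This is exactly why guideline 3 later in the deck warns that row keys and CF names are repeated for every single value.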
62. So now you know what's available.
Now, how do you model things?
63. Let's start with the entity / attribute /
relationship modeling paradigm,
and see how far we get applying it to HBase.
64. A note about my example:
it's for clarity, not realism.
For bands & shows, there's not enough data to
warrant using HBase, even if you're tracking every
show by every band for all of human history. It
might be GB, but not TB.
65. So, back to entities (boxes).
With fancy rounded corners.
70. Attributes!
For example: a "Band" entity might have a "Name" (with values like
"Jonathan Coulton") and a "Genre" (with values like "Nerd Core")
71. Logically, attributes are unordered.
Their vertical position is meaningless. They're simply
characteristics of the entity.
72. Attributes can be identifying.
i.e. they uniquely identify a particular instance of that entity.
73. Attributes can be identifying.
i.e. they uniquely identify a particular instance of that entity.
Logical models usually leave this out, and identity is
implied (i.e. we just assume there's some set of attributes
that's identifying, but we don't have to say what it is)
74. Attributes can be identifying.
i.e. they uniquely identify a particular instance of that entity.
PK
Physical models refer to it explicitly, like how in a relational
database model you'd label some set of attributes as being
the "Primary Key" (PK).
75. Attributes can be identifying.
i.e. they uniquely identify a particular instance of that entity.
So in our Band example, neither Name nor Genre is
uniquely identifying, but something like URL might be.
77. How does this map to HBase?
Identifying attributes (aka the "PK") become parts of
the row key, and other attributes become columns.
78. So here's a Band schema.
The row key is the URL, and we have
Name and Genre as attributes.
79. A much more common pattern is to use IDs.
That way, you have an immutable way to refer to this
entity forever, even if they leave MySpace.
80. A much more common pattern is to use IDs.
That way, you have an immutable way to refer to this
entity forever, even if they leave MySpace.
Where do IDs come from? We'll talk about that later.
81. If there's just a single row key,
how can there be multiple identifying attributes?
82. Let's call that "mashing".
Which is to say, concatenation, in either a fixed-width
or delimited manner, in the byte array.
83. Mashing includes several techniques:
● Fixed byte width representation
● Delimiting (for variable length)
● Other serialization (avro, etc.)
84. Mashing includes several techniques:
● Fixed byte width representation
● Delimiting (for variable length)
● Other serialization (avro, etc.)
Doesn't matter how you do it; the important thing is that any byte
array in HBase can represent more than one logical attribute.
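The mashing techniques above can be sketched in a few lines. This is a minimal illustration of the idea, not an HBase library call; the field names (band_id, show_id, band_url) are hypothetical:

```python
import struct

def mash_fixed(band_id: int, show_id: int) -> bytes:
    # Fixed-width: two big-endian 4-byte unsigned ints.
    # Big-endian packing means byte-wise sort == numeric sort.
    return struct.pack(">II", band_id, show_id)

def unmash_fixed(key: bytes) -> tuple:
    return struct.unpack(">II", key)

def mash_delimited(band_url: str, show_id: str) -> bytes:
    # Delimited: variable-length parts joined by a separator byte
    # that must never appear inside the parts themselves.
    return band_url.encode() + b"|" + show_id.encode()

# Round-trips cleanly, and preserves sort order:
# unmash_fixed(mash_fixed(42, 7)) == (42, 7)
# mash_fixed(2, 0) < mash_fixed(10, 0)   # unlike ASCII "2" > "10"
```

Note the fixed-width version sorts 2 before 10, which naive string concatenation would get wrong; that sort-order property is usually the deciding factor between the techniques.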
85. If we want, we can even add types to
the schema definition.
86. If we want, we can even add types to
the schema definition.
?
87. If we want, we can even add types to
the schema definition.
?
HBase don't care,
but we do (sometimes).
88. If we want, we can even add types.
You could also mark things as ASC or DESC,
depending on whether you invert the bits.
?
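"Inverting the bits" for a DESC sort works like this: subtract the value from the maximum, and byte-wise ascending order becomes numeric descending order. A minimal sketch (illustrative, not an HBase utility):

```python
import struct

MAX_LONG = 2**64 - 1

def desc_long(n: int) -> bytes:
    # Encoding (MAX_LONG - n) as a big-endian unsigned long makes
    # a byte-wise ASCENDING sort yield values in DESCENDING order.
    return struct.pack(">Q", MAX_LONG - n)

def decode_desc_long(key: bytes) -> int:
    return MAX_LONG - struct.unpack(">Q", key)[0]

keys = sorted(desc_long(t) for t in [100, 5, 42])
# A plain ascending byte sort now yields 100, 42, 5.
```

This same trick reappears later in the deck for "most recent first" access patterns like inbox views.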
89. This is pretty textbook stuff, but here's
where it gets exciting.
90. This is pretty textbook stuff, but here's
where it gets exciting.
(If you're astute, you'll notice we haven't
talked about relationships yet.)
91. HBase has no foreign keys, or joins, or
cross-table transactions.
This can make representing relationships
between entities ... tricky.
92. Part of the beauty of the relational model
is that you can punt on this question.
If you model the entities as fully normalized, then you
can write any query you want at run time, and the DB
performs joins on tables as optimally as it can.
(This relies heavily on indexing.)
93. In HBase (or any distributed DB)
you don't have that luxury.
Joins get way more expensive and complicated
across a distributed cluster of machines, as do the
indexes that make them happen efficiently.
HBase has neither joins nor indexes.
94. You have two choices, if you need
relationships between entities.
95. You have two choices, if you need
relationships between entities.
● Roll your own joins
96. You have two choices, if you need
relationships between entities.
● Roll your own joins
● Denormalize the data
97. Rolling your own joins is hard.
You'll probably get it wrong
if you try to do it casually.
Ever implemented a multi-way merge sort by hand?
99. The basic concept of denormalization is simple:
two logical entities share
one physical representation.
100. The basic concept of denormalization is simple:
two logical entities share
one physical representation.
Of course, there are lots of ways to skin a cat...
101. Let's look at standard relational database
denormalization techniques first.
102. In a relational database, if the
normalized physical model is this:
103. You could start with the Band, and give it
extra columns to hold the shows:
104. You could start with the Band, and give it
extra columns to hold the shows:
106. You could also just give it an unstructured
blob of text for all the shows.
107. You could also just give it an unstructured
blob of text for all the shows.
But then you've given up
the integrity of your data.
(Which might be fine.
If so, stop here.)
108. You get similar problems if you try to bring
all the info into the Show table.
110. And pull in copies of the info
in the other tables:
111. Leaving you with one table, with
one row per band/show combination:
112. The cons to this should be self-evident.
Updating and querying are more complicated, because
you have to deal with many representations of the "same"
data, which could get out of sync, etc.
113. But the pros are that with huge data,
answering some questions is much faster.
(If you happen to be asking the query in the same way
you structured the denormalization, that is.)
114. So back to HBase. How do you
denormalize in an HBase schema?
115. HBase columns can be defined at runtime.
A column doesn't have to represent a pre-defined attribute.
116. HBase columns can be defined at runtime.
A column doesn't have to represent a pre-defined attribute.
In fact, it doesn't have to be an attribute at all.
117. A set of dynamically named columns
can represent another entity!
118.
119. If you put data into the column name,
and expect many such columns in the same row,
then logically, you've created a nested entity.
120. You can scan over columns.
See: hadoop-hbase.blogspot.com/2012/01/hbase-intra-row-scanning.html
121. So we can store shows inside bands.
Which is like the denormalization we saw earlier,
except without the relational DB kludges.
122. Note!
HBase don't care.
It's just a matter of how your app treats the columns. If you
put repeating info in column names, you're doing this.
123. Why is this so difficult for most
relational database devs to grok?
Because
relational databases have no concept of nested entities!
You'd make it a separate table in an RDBMS, which is more
flexible but much more difficult to optimize.
124. It's very difficult to talk about HBase
schemas if you don't acknowledge this.
You don't have to represent it this way (with a nested box) but
you have to at least acknowledge it. And maybe name it.
125. But, once you do acknowledge it,
you can do some neat things.
127. Nested entities can have attributes,
some of which are identifying.
Identifying attributes
make up the column
qualifier (just like
row keys, it could be
multiple attributes
mashed together)
128. Nested entities can have attributes,
some of which are identifying.
Identifying attributes
make up the column
qualifier (just like
row keys, it could be
multiple attributes
mashed together)
Non-identifying
attributes are held
in the value (again,
you could mash
many attributes in
here)
129. Shows is nested in Band
show_id is the column qualifier
Other attributes are mashed into the value
The column qualifier
is the show id.
Everything else
gets mashed into
the value field.
130. 1 table can have many nested entities,
provided your app can tell them apart.
131. How do you tell them apart?
With prefixes ...
qualifier starts with:
"s" + show_id
"a" + album_id
"m" + name
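The prefix scheme above can be sketched with a plain dict standing in for one HBase row (this is a model of the idea, not client code; the ids and values are made up):

```python
# One Band row holding three nested entity types, distinguished by a
# one-byte qualifier prefix: "s" + show_id, "a" + album_id, "m" + name.
row = {
    b"s" + (101).to_bytes(4, "big"): b"2012-03-01|The Fillmore",
    b"s" + (102).to_bytes(4, "big"): b"2012-04-15|9:30 Club",
    b"a" + (7).to_bytes(4, "big"):   b"Artificial Heart",
    b"m" + b"Jonathan":              b"vocals",
}

def nested(row: dict, prefix: bytes) -> dict:
    """'Scan' only the qualifiers belonging to one nested entity type
    (in real HBase this maps to a qualifier-prefix scan)."""
    return {q[len(prefix):]: v for q, v in row.items()
            if q.startswith(prefix)}

shows = nested(row, b"s")   # just the two shows, keyed by show_id bytes
```

Because qualifiers sort byte-wise, all "s"-prefixed columns are physically adjacent, so reading one entity type doesn't touch the others.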
134. Where can you
nest entities?
Knock yourself out.
● In columns
● Two levels deep in
columns!
135. Where can you
nest entities?
Knock yourself out.
● In columns
● Two levels deep in
columns!
● In the row key?!
136. Where can you
nest entities?
Knock yourself out.
● In columns
● Two levels deep in
columns!
● In the row key?!
● Using timestamps as a
dimension!!
137. This is a fundamental modeling
property of HBase: nesting entities.
141. So that brings us to a standard way to
show an HBase schema:
Table: Top level entity name, fixed at design time.
Row key: Can consist of multiple parts, each with a
name & type. All data in one row shares the same
row key. Can be nested.
Column family: A container that allows sets of
columns to have separate storage & semantics.
Column qualifiers: Fixed attributes--design-time
named columns, holding a value of a certain type.
Nested entities: Run-time named column qualifiers,
where the qualifier is made up of one (or more)
values, each with a type, along with one or more
values (by timestamp) which also have a type.
Nested versions: Because multiple timestamped
copies of each value can exist, you can also treat the
timestamp as a modeled dimension in its own right.
Hell, you could even use a long that's not a
timestamp (like, say, an ID number). Caveats apply.
142. "But," you say, "what if I don't have all
of these fancy modeling tools?"
143. Say it in text.
XML, JSON, DML, whatever you like.
<table name="Band">
<key>
<column name="band_id" type="int" />
</key>
<columnFamily name="cf1">
<column name="band_name" type="string"/>
<column name="hometown" type="string"/>
<entity name="Show">
<key>
<column name="show_id" type="int" />
</key>
<column name="date" type="date" />
</entity>
</columnFamily>
</table>
144. Text is faster and more general, but
slightly harder to grok quickly.
So we'll stick with diagrams here.
146. Example 1: a simple parent/child relationship
Relational Schema: Applicants & Answers
Standard parent/child relationship. One Applicant has many
Answers; every Answer relates to a single Applicant (by id). It
also relates to a Question (not shown). SQL would have you JOIN
them to materialize an applicant and their answers.
147. HBase Schema: Answers By Applicant
● Answer is contained implicitly in Applicant
● If you know an applicant_id, you get O(1) access
● If you know an applicant_id AND question_id, O(1) access
● Answer.created_date is implicit in the timestamp on value
148. HBase Schema: Answers By Applicant
● You get answer history for free
● Applicant.applied_date can be implicit in the timestamp
● More attributes on applicant directly? Just add them!
● Answers are atomic and transactional by Applicant!!
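The Answers-By-Applicant layout above can be modeled with a dict per row: qualifiers are question ids, and each value keeps timestamped versions the way HBase cells do. A sketch under those assumptions (not HBase client code):

```python
# One Applicant row: qualifier = question_id, value = versioned answers.
applicant_row = {}

def put_answer(row: dict, question_id: int, value: bytes, ts: int):
    """Add a timestamped version of an answer, newest first
    (mirroring HBase's descending-timestamp cell versions)."""
    versions = row.setdefault(question_id, [])
    versions.append((ts, value))
    versions.sort(reverse=True)

put_answer(applicant_row, 123, b"blue",  ts=1000)
put_answer(applicant_row, 123, b"green", ts=2000)

latest_ts, latest = applicant_row[123][0]
# latest == b"green": O(1) by (applicant, question), and the old
# answer survives as history -- no extra audit table needed.
```

Everything here is one row, which is what makes the "atomic and transactional by Applicant" bullet work: a single mutation updates any mix of answers consistently.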
150. Before you get too excited, remember
that there are cons to denormalization.
151. The cons:
● Nested entities aren't independent any more.
○ No: "SELECT avg(value) FROM Answer WHERE question_id = 123;"
○ But you can still do that with map/reduce
● This decomposition only works in one direction.
○ No relation chains without serious trickery.
○ Timestamps are another dimension, but that counts as trickery.
● No way to enforce the foreign key to another table.
○ But would you really do that in a big RDBMS?
● On disk, it repeats the row key for every column
○ You didn't really save anything by having applicant_id be implied.
○ Or did you? Compression negates that on disk ...
○ ... and prefix compression (HBASE-4218) will totally sink this.
152. Example 2: dropping some database science
Relational Schema: Users And Messages
Many-to-many relationship. One User sees many Messages, and
a single Message can be seen by many Users. We want to do
things like show the most recent message by subject (e.g. an inbox
view).
153. What kind of SQL would you run on this? Say, we
want to get the most recent 20 messages for 1 user.
SELECT TOP 20
M.subject,
M.body
FROM
User_Message UM
INNER JOIN Message M
ON UM.message_id = M.message_id
WHERE
UM.user_id = <user_id>
ORDER BY
M.sent_date DESC
154. Seems easy, right? Well, the database is doing
some stuff behind the scenes for you:
Assuming no secondary indexes, it might:
● Drive the join from the User_Message table
● For each new record with our given user_id, do a single
disk access into the Message table (i.e. a hash_join)
● Get the records for *every* message for this user
● Sort them all
● Take the top 20
No shortcuts; we can't find the top 20 by date w/o seeing ALL messages.
This gets more expensive as a user gets more messages. (But it's still
pretty fast if a given user has a reasonable number of messages).
155. How could you do this in HBase?
Try the same pattern as parent / child?
1. Because of the many-to-many, you now have N copies
of the message (one per user).
○ Maybe that's OK (especially if it's immutable!). Disk is cheap.
2. Your statement has to do the same amount of work, but
now you have to do it yourself. :(
156. If I know that I always want it ordered by date, why
not store it that way?
● Now I can scan over messages by date until I get
enough; it's O(1)
● But what if I want it by message_id again? Doh. I'm
screwed, unless ...
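Storing "ordered by date" means putting an inverted timestamp at the front of the qualifier, so a plain forward scan of the first N columns yields the N most recent messages. A sketch of that idea (dict standing in for the user's row; ids and bodies are made up):

```python
import struct

MAX_LONG = 2**64 - 1

def by_date_qualifier(sent_ts: int, message_id: int) -> bytes:
    # Inverted timestamp first: lexicographic order == newest first.
    # message_id breaks ties between messages sent at the same instant.
    return struct.pack(">QQ", MAX_LONG - sent_ts, message_id)

user_row = {
    by_date_qualifier(1000, 1): b"oldest",
    by_date_qualifier(3000, 2): b"newest",
    by_date_qualifier(2000, 3): b"middle",
}

# "Scan" the row in qualifier order; stop after the first N columns.
newest_first = [user_row[q] for q in sorted(user_row)]
# -> [b"newest", b"middle", b"oldest"]
```

That's the O(1)-per-message property the slide claims: getting the top 20 never touches message 21, no matter how big the inbox is.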
157. I store it both ways!
Nice: updates to this are transactional (consistent) for a given
user, because it's all in one row. So it's not a bear to maintain.
158. Which I could even do in different column families ...
(Makes sense if I don't usually access them at the same time; I only
pay the cost for the query I am running.)
160. Or I could just use the by-date one as an "index" ...
So I only store the subject and body once.
This means I need to perform my own "join" in code.
161. See a theme emerging?
Relational DBs lull you into not thinking about physical
access as much; but when you have to scale, there are hard
limits.
HBase makes you think about it sooner, but gives you the
tools to think about it in a more straightforward way.
162. Example 3: Flurry
See: http://www.flurry.com/data/
● One row per device
● Nested entity in CF "Sessions"
○ Polymorphic "fragments"
of session reports
○ Map/Reduce transforms
Caveat: this is based on email exchanges, so the details
may be wonky, but I think the overall point is right.
163. Example 4: Indexes
Example from Andrew Purtell
● Using one table as an index into another often makes sense
● Can't be transactional (without external coordination)
● So you have to keep it clean, or tolerate dirty indexes
● Note that there are several attempts to add solid general purpose
indexing to HBase, but so far none have caught on.
Same caveat: this is based on email exchanges, so the
details may be wonky, but I think the overall point is right.
165. 0: The row key design is the single
most important decision you will make.
166. 0: The row key design is the single
most important decision you will make.
This is also true for the "key" you're putting in the
column family name of nested entities.
168. 1: Design for the questions,
not the answers.
(hat tip to Ilya Katsov from the High Scalability blog
for this useful way to put it; and possibly to Billy
Newport or Eben Hewitt for saying it first.)
170. Let's be clear: this sucks big time,
if you aren't 100% sure what the
questions are going to be.
171. Let's be clear: this sucks big time,
if you aren't 100% sure what the
questions are going to be.
Use a relational DB for that!
Or a document database like CouchDB ...
172. "But isn't NoSQL more flexible than a
relational DB?"
For column schema? Yes!
For row key structures, NO!
173. 2: There are only two sizes of data:
too big, and not too big.
174. 2: There are only two sizes of data:
too big, and not too big.
(That is, too big to scan all of something
while answering an interactive request.)
175. 2: There are only two sizes of data:
too big, and not too big.
178. This is also important because the
rowkey and CF name are repeated for
every single value (memory and disk).
File compression can negate this on disk, and prefix
compression will probably negate this in memory.
179. 4: Use row atomicity as a design tool.
Rows are updated atomically, which gives you a form
of relational integrity in HBase!
180. 4: Use row atomicity as a design tool.
If you made this two HBase tables, you couldn't
guarantee integrity (updated to one could succeed,
while updates to the other fail).
181. 4: Use row atomicity as a design tool.
If you make it one table with a nested entity, you can
guarantee updates will be atomic, and you can do
much more complex mutations of state.
182. 5: Attributes can move into the row key
Even if it's not "identifying" (part of the uniqueness of
an entity), adding an attribute into the row key can
make access more efficient in some cases.
183. 5: Attributes can move into the row key
Even if it's not "identifying" (part of the uniqueness of
an entity), adding an attribute into the row key can
make access more efficient in some cases.
184. 5: Attributes can move into the row key
This ability to move things left or right (without
changing the physical storage requirements) is part of
what Lars George calls "folding".
Also, if you like this subject, go watch his videos on
advanced HBase schema design, they're awesome.
185. 6: If you nest entities, you can
transactionally pre-aggregate data.
You can recalculate aggregates on write,
or periodically with a map/reduce job.
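Because a row updates atomically, a pre-aggregated value can live right next to the nested entities it summarizes and never drift out of sync. A toy sketch of the recalculate-on-write idea (a dict update stands in for what would be a single-row HBase mutation; the column names are hypothetical):

```python
# One Band row: nested shows plus a pre-aggregated show count,
# maintained in the same (atomic) row update.
def add_show_with_count(row: dict, show_id: int, info: bytes):
    # In HBase both writes below would ride in one single-row Put,
    # so readers never see the count disagree with the shows.
    row[b"s" + show_id.to_bytes(4, "big")] = info
    row[b"show_count"] = row.get(b"show_count", 0) + 1

band_row = {}
add_show_with_count(band_row, 1, b"The Fillmore")
add_show_with_count(band_row, 2, b"9:30 Club")
# band_row[b"show_count"] == 2, always consistent with the nested shows
```

Split the count into a second table and you lose exactly this guarantee, which is the point of guideline 4 above.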
187. Practically: how do you load schema
into HBase?
Direct actuation: https://github.com/larsgeorge/hbase-schema-manager
188. Practically: how do you load schema
into HBase?
Direct actuation: https://github.com/larsgeorge/hbase-schema-manager
Coming soon: script generation: https://github.com/ivarley/scoot