SlideShare a Scribd company logo
1 of 17
Download to read offline
HBase schema design
    case studies

    Organized by Evan/Qingyan Liu
     qingyan123 (AT) gmail.com
              2009.7.13
The Tao is ...




De-normalization
Case 1: locations
●
    China
    ●
        Beijing
    ●
        Shanghai
    ●
        Guangzhou
    ●
        Shandong
        –   Jinan
        –   Qingdao
    ●
        Sichuan
        –   Chengdu
In RDBMS
loc_id PK   loc_name      parent_id   child_id
1           China                     2,3,4,5
2           Beijing       1
3           Shanghai      1
4           Guangzhou     1
5           Shandong      1           7,8
6           Sichuan       1           9
7           Jinan         1,5
8           Qingdao       1,5
9           Chengdu       1,6
In HBase
row                      column families
           name:         parent:           child:
<loc_id>                 parent:<loc_id>   child:<loc_id>
1          China                           child:1=state
                                           child:2=state
                                           child:3=state
                                           child:4=state
                                           child:5=state
                                           child:6=state
5          Shangdong     parent:1=nation child:7=city
                                         child:8=city
8          Qingdao       parent:1=nation
                         parent:5=state
Case 2: student-course
●
    Student
    ●
        1 S ~ many C
●
    Course
    ●
        1 C ~ many S
In RDBMS


Students                Courses
id PK      SCs          id PK
name       student_id   title
sex        course_id    introduction
age        type         teacher_id
In HBase
row                           column families
               info:               course:
<student_id>   info:name           course:<course_id>=type
               info:sex
               info:age

row                           column families
               info:               student:
<course_id>    info:title          student:<student_id>=type
               info:introduction
               info:teacher_id
Case 3: user-action
●
    users performs actions now and then
    ●
        store every events
    ●
        query recent events of a user
In RDBMS
                      Actions
                      id PK
                      user_id IDX
                      name
                      time

●   For fast SELECT id, user_id, name, time FROM Action
    WHERE user_id=XXX ORDER BY time DESC LIMIT 10
    OFFSET 20, we must create index on user_id.
    However, indices will greatly decrease insert speed
    for index-rebuild.
In HBase
row                                   column families
                              name:
<user><Long.MAX_VALUE -
System.currentTimeMillis()>
<event id>
Case 4: user-friends
●
    1 user has 1+ friends
●
    will lookup all friends of a user
In RDBMS

        Users
                           Friendships
        id IDX
                           user_id IDX
        name
                           friend_id
        sex
                           type
        age
●
    SELECT * FROM friendships WHERE
    user_id='XXX';
In HBase

row                           column families
                 info:            friend:
<user_id>        info:name        friend:<user_id>=type
                 info:sex
                 info:age

 ●
      actually, it is a graph can be represented by a
      sparse matrix.
 ●
      then you can use M/R to find sth interesting.
      e.g. the shortest path from user A to user B.
Case 5: access log
●
    each log line contains time, ip, domain, url,
    referer, browser_cookie, login_id, etc
●
    will be analyzed every 5 minutes, every hour,
    daily, weekly, and monthly
In RDBMS

Accesslog
time
ip IDX
domain
url
referer
browser_cookie IDX
login_id IDX
In HBase

row                                                  column families
                                          http:                     user
<time><INC_COUNTER>                       http:ip                   user:browser_
                                          http:domain               cookie
                                          http:url                  user:login_id
                                          http:referer




INC_COUNTER is used to distinguish the adjacent same time values.

More Related Content

What's hot

Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
mongodb와 mysql의 CRUD 연산의 성능 비교
mongodb와 mysql의 CRUD 연산의 성능 비교mongodb와 mysql의 CRUD 연산의 성능 비교
mongodb와 mysql의 CRUD 연산의 성능 비교
Woo Yeong Choi
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
enissoz
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
Databricks
 

What's hot (20)

Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
On-boarding with JanusGraph Performance
On-boarding with JanusGraph PerformanceOn-boarding with JanusGraph Performance
On-boarding with JanusGraph Performance
 
Building with AWS Databases: Match Your Workload to the Right Database (DAT30...
Building with AWS Databases: Match Your Workload to the Right Database (DAT30...Building with AWS Databases: Match Your Workload to the Right Database (DAT30...
Building with AWS Databases: Match Your Workload to the Right Database (DAT30...
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databases
 
mongodb와 mysql의 CRUD 연산의 성능 비교
mongodb와 mysql의 CRUD 연산의 성능 비교mongodb와 mysql의 CRUD 연산의 성능 비교
mongodb와 mysql의 CRUD 연산의 성능 비교
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
 
GraphQL ♥︎ GraphDB
GraphQL ♥︎ GraphDBGraphQL ♥︎ GraphDB
GraphQL ♥︎ GraphDB
 
Mongo DB 성능최적화 전략
Mongo DB 성능최적화 전략Mongo DB 성능최적화 전략
Mongo DB 성능최적화 전략
 
RedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ TwitterRedisConf17- Using Redis at scale @ Twitter
RedisConf17- Using Redis at scale @ Twitter
 
Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera
Hadoop World 2011: Advanced HBase Schema Design - Lars George, ClouderaHadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera
Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera
 
Apache HBase Improvements and Practices at Xiaomi
Apache HBase Improvements and Practices at XiaomiApache HBase Improvements and Practices at Xiaomi
Apache HBase Improvements and Practices at Xiaomi
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
Spring Boot on Amazon Web Services with Spring Cloud AWS
Spring Boot on Amazon Web Services with Spring Cloud AWSSpring Boot on Amazon Web Services with Spring Cloud AWS
Spring Boot on Amazon Web Services with Spring Cloud AWS
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron SchildkroutKafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
 
Edge architecture ieee international conference on cloud engineering
Edge architecture   ieee international conference on cloud engineeringEdge architecture   ieee international conference on cloud engineering
Edge architecture ieee international conference on cloud engineering
 
Large Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphLarge Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraph
 
Fluentd v0.14 Plugin API Details
Fluentd v0.14 Plugin API DetailsFluentd v0.14 Plugin API Details
Fluentd v0.14 Plugin API Details
 

Viewers also liked

A 3 dimensional data model in hbase for large time-series dataset-20120915
A 3 dimensional data model in hbase for large time-series dataset-20120915A 3 dimensional data model in hbase for large time-series dataset-20120915
A 3 dimensional data model in hbase for large time-series dataset-20120915
Dan Han
 
HBaseCon 2013: General Session
HBaseCon 2013: General SessionHBaseCon 2013: General Session
HBaseCon 2013: General Session
Cloudera, Inc.
 

Viewers also liked (20)

Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for Architects
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase
 
HBaseCon 2013: Full-Text Indexing for Apache HBase
HBaseCon 2013: Full-Text Indexing for Apache HBaseHBaseCon 2013: Full-Text Indexing for Apache HBase
HBaseCon 2013: Full-Text Indexing for Apache HBase
 
HBase Storage Internals
HBase Storage InternalsHBase Storage Internals
HBase Storage Internals
 
Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance Tuning
 
Hadoop and HBase in the Real World
Hadoop and HBase in the Real WorldHadoop and HBase in the Real World
Hadoop and HBase in the Real World
 
A 3 dimensional data model in hbase for large time-series dataset-20120915
A 3 dimensional data model in hbase for large time-series dataset-20120915A 3 dimensional data model in hbase for large time-series dataset-20120915
A 3 dimensional data model in hbase for large time-series dataset-20120915
 
Osc2012 spring HBase Report
Osc2012 spring HBase ReportOsc2012 spring HBase Report
Osc2012 spring HBase Report
 
Time-Series Apache HBase
Time-Series Apache HBaseTime-Series Apache HBase
Time-Series Apache HBase
 
Apache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use CasesApache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use Cases
 
HBase: Just the Basics
HBase: Just the BasicsHBase: Just the Basics
HBase: Just the Basics
 
Cassandra v0.6-siryou
Cassandra v0.6-siryouCassandra v0.6-siryou
Cassandra v0.6-siryou
 
Hog user manual v3
Hog user manual v3Hog user manual v3
Hog user manual v3
 
Unlocking Data for Analysts & Developers
Unlocking Data for Analysts & DevelopersUnlocking Data for Analysts & Developers
Unlocking Data for Analysts & Developers
 
Spark!
Spark!Spark!
Spark!
 
HBaseCon 2013: General Session
HBaseCon 2013: General SessionHBaseCon 2013: General Session
HBaseCon 2013: General Session
 
20分でわかるHBase
20分でわかるHBase20分でわかるHBase
20分でわかるHBase
 
The Evolution of a Relational Database Layer over HBase
The Evolution of a Relational Database Layer over HBaseThe Evolution of a Relational Database Layer over HBase
The Evolution of a Relational Database Layer over HBase
 

Recently uploaded

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Recently uploaded (20)

Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 

20090713 Hbase Schema Design Case Studies

  • 1. HBase schema design case studies Organized by Evan/Qingyan Liu qingyan123 (AT) gmail.com 2009.7.13
  • 2. The Tao is ... De-normalization
  • 3. Case 1: locations ● China ● Beijing ● Shanghai ● Guangzhou ● Shandong – Jinan – Qingdao ● Sichuan – Chengdu
  • 4. In RDBMS loc_id PK loc_name parent_id child_id 1 China 2,3,4,5 2 Beijing 1 3 Shanghai 1 4 Guangzhou 1 5 Shandong 1 7,8 6 Sichuan 1 9 7 Jinan 1,5 8 Qingdao 1,5 9 Chengdu 1,6
  • 5. In HBase row column families name: parent: child: <loc_id> parent:<loc_id> child:<loc_id> 1 China child:1=state child:2=state child:3=state child:4=state child:5=state child:6=state 5 Shangdong parent:1=nation child:7=city child:8=city 8 Qingdao parent:1=nation parent:5=state
  • 6. Case 2: student-course ● Student ● 1 S ~ many C ● Course ● 1 C ~ many S
  • 7. In RDBMS Students Courses id PK SCs id PK name student_id title sex course_id introduction age type teacher_id
  • 8. In HBase row column families info: course: <student_id> info:name course:<course_id>=type info:sex info:age row column families info: student: <course_id> info:title student:<student_id>=type info:introduction info:teacher_id
  • 9. Case 3: user-action ● users performs actions now and then ● store every events ● query recent events of a user
  • 10. In RDBMS Actions id PK user_id IDX name time ● For fast SELECT id, user_id, name, time FROM Action WHERE user_id=XXX ORDER BY time DESC LIMIT 10 OFFSET 20, we must create index on user_id. However, indices will greatly decrease insert speed for index-rebuild.
  • 11. In HBase row column families name: <user><Long.MAX_VALUE - System.currentTimeMillis()> <event id>
  • 12. Case 4: user-friends ● 1 user has 1+ friends ● will lookup all friends of a user
  • 13. In RDBMS Users Friendships id IDX user_id IDX name friend_id sex type age ● SELECT * FROM friendships WHERE user_id='XXX';
  • 14. In HBase row column families info: friend: <user_id> info:name friend:<user_id>=type info:sex info:age ● actually, it is a graph can be represented by a sparse matrix. ● then you can use M/R to find sth interesting. e.g. the shortest path from user A to user B.
  • 15. Case 5: access log ● each log line contains time, ip, domain, url, referer, browser_cookie, login_id, etc ● will be analyzed every 5 minutes, every hour, daily, weekly, and monthly
  • 17. In HBase row column families http: user <time><INC_COUNTER> http:ip user:browser_ http:domain cookie http:url user:login_id http:referer INC_COUNTER is used to distinguish the adjacent same time values.