HBase schema design Big Data TechCon Boston

Published in: Technology

  1. HBase schema design
     Amandeep Khurana | Solutions Architect
     Big Data TechCon, Boston, April 2013
     Friday, April 12, 13
  2. About me
     • Solutions Architect, Cloudera Inc
     • Amazon Web Services
     • Interested in large scale distributed systems
     • Co-author, HBase In Action
     • Twitter: amansk
     [Book cover: HBase in Action, Nick Dimiduk and Amandeep Khurana, Manning]
  3. About the talk
     • Data model recap
     • Data modeling thought process
     • Tools and techniques
  4. HBase is ...
     • Column family oriented database
       • Column family oriented
       • Tables consisting of rows and columns
     • Persisted Map
       • Sparse
       • Multi dimensional
       • Sorted
       • Indexed by rowkey, column and timestamp
     • Key Value store
       • [rowkey, col family, col qualifier, timestamp] -> cell value
  5. HBase is not ...
     • A relational database
       • No SQL query language
       • No joins
       • No secondary indexing
       • No transactions
  6. Data Model recap
     It's not a relational database system
  7. Important terms
     • Table
       • Consists of rows and columns
     • Row
       • Has a bunch of columns.
       • Identified by a rowkey (primary key)
     • Column Qualifier
       • Dynamic column name
     • Column Family
       • Column groups - logical and physical (Similar access pattern)
     • Cell
       • The actual element that contains the data for a row-column intersection
     • Version
       • Every cell has multiple versions.
  8. Data coordinates
     • Row is addressed using rowkey
     • Cell is addressed using [rowkey + family + qualifier]
  9. Tabular representation
     [Figure: the users table. Column family: info, with column qualifiers name, email, and password. Rowkeys: GrandpaD, HMS_Surprise, SirDoyle, TheRealMT. Sample cells include Mark Twain / samuel@clemens.org, Patrick OBrien / aubrey@sea.com, Fyodor Dostoyevsky / fyodor@brothers.net, and Sir Arthur Conan Doyle / art@TheQueensMen.co.uk, each with password abc123; one password cell also holds the older version Langhorne.]
     • The table is lexicographically sorted on the rowkeys.
     • Each cell has multiple versions, typically represented by the timestamp of when they were inserted into the table (ts1=1329088321289, ts2=1329088818321, ts2 > ts1).
     • The coordinates used to identify data in an HBase table are: (1) rowkey, (2) column family, (3) column qualifier, (4) version.
  10. Key-Value store
      Keys -> Values
      [TheRealMT, info, password, 1329088818321] -> abc123
      [TheRealMT, info, password, 1329088321289] -> Langhorne
      (Each line is a single KeyValue instance.)
  11. Key-Value store
      (1) Start with coordinates of full precision:
          [TheRealMT, info, password, 1329088818321] -> "abc123"
      (2) Drop the version and you're left with a map of version to values:
          [TheRealMT, info, password] -> { 1329088818321 : "abc123", 1329088321289 : "Langhorne" }
      (3) Omit the qualifier and you have a map of qualifiers to the previous maps:
          [TheRealMT, info] -> { "email" : { 1329088321289 : "samuel@clemens.org" }, "name" : { 1329088321289 : "Mark Twain" }, "password" : { 1329088818321 : "abc123", 1329088321289 : "Langhorne" } }
      (4) Finally, drop the column family and you have a row, a map of maps:
          [TheRealMT] -> { "info" : { "email" : { 1329088321289 : "samuel@clemens.org" }, "name" : { 1329088321289 : "Mark Twain" }, "password" : { 1329088818321 : "abc123", 1329088321289 : "Langhorne" } } }
  12. Sorted map of maps
      { "TheRealMT" : { "info" : { "email" : { 1329088321289 : "samuel@clemens.org" }, "name" : { 1329088321289 : "Mark Twain" }, "password" : { 1329088818321 : "abc123", 1329088321289 : "Langhorne" } } } }
      Nesting levels, outermost to innermost: rowkey, column family, column qualifiers, versions, values.
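The "sorted map of maps" view can be sketched with nested `TreeMap`s from the Java standard library. This is only an in-memory illustration of the logical model, not the HBase client API; the class name `SortedMapOfMaps` and its methods are hypothetical.

```java
import java.util.Comparator;
import java.util.TreeMap;

/** Illustrative in-memory model of the logical layout; NOT the HBase client API. */
final class SortedMapOfMaps {
    // rowkey -> column family -> qualifier -> timestamp (newest first) -> value
    private final TreeMap<String, TreeMap<String, TreeMap<String, TreeMap<Long, String>>>> rows =
            new TreeMap<>();

    void put(String rowkey, String family, String qualifier, long timestamp, String value) {
        rows.computeIfAbsent(rowkey, k -> new TreeMap<>())
            .computeIfAbsent(family, k -> new TreeMap<>())
            // Versions are kept newest first, so the first entry is the latest value.
            .computeIfAbsent(qualifier, k -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    /** Returns the newest version of a cell, or null if the cell does not exist. */
    String getLatest(String rowkey, String family, String qualifier) {
        TreeMap<String, TreeMap<String, TreeMap<Long, String>>> families = rows.get(rowkey);
        TreeMap<String, TreeMap<Long, String>> qualifiers = families == null ? null : families.get(family);
        TreeMap<Long, String> versions = qualifiers == null ? null : qualifiers.get(qualifier);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    /** Rows are kept sorted by rowkey, so this is the lexicographically smallest one. */
    String firstRowkey() {
        return rows.firstKey();
    }

    public static void main(String[] args) {
        SortedMapOfMaps users = new SortedMapOfMaps();
        users.put("TheRealMT", "info", "password", 1329088321289L, "Langhorne");
        users.put("TheRealMT", "info", "password", 1329088818321L, "abc123");
        System.out.println(users.getLatest("TheRealMT", "info", "password")); // prints abc123
    }
}
```

Dropping one coordinate at a time, as in the four steps of the Key-Value store slide, corresponds to returning one of the inner maps instead of a single value.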
  13. HFiles and physical data model
      • HFiles are
        • Immutable
        • Sorted on rowkey + qualifier + timestamp
        • In the context of a column family per region
      "TheRealMT", "info", "email", 1329088321289, "samuel@clemens.org"
      "TheRealMT", "info", "name", 1329088321289, "Mark Twain"
      "TheRealMT", "info", "password", 1329088818321, "abc123"
      "TheRealMT", "info", "password", 1329088321289, "Langhorne"
      HFile for the info column family in the users table
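The sort order listed for HFiles (rowkey, then qualifier, then timestamp with the newer version first, as in the password rows) can be sketched as a plain comparison function. The `Cell` class and `HFileOrder` name are hypothetical stand-ins, and `String.compareTo` is used as a simplification of HBase's byte-wise comparison, which gives the same order for the ASCII keys used here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative ordering of cells within one column family; not HBase code. */
final class HFileOrder {
    /** Hypothetical stand-in for a KeyValue within a single column family. */
    static final class Cell {
        final String rowkey;
        final String qualifier;
        final long timestamp;
        final String value;

        Cell(String rowkey, String qualifier, long timestamp, String value) {
            this.rowkey = rowkey;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** Orders by rowkey, then qualifier, then timestamp descending (newest first). */
    static int compare(Cell a, Cell b) {
        int byRow = a.rowkey.compareTo(b.rowkey);
        if (byRow != 0) return byRow;
        int byQualifier = a.qualifier.compareTo(b.qualifier);
        if (byQualifier != 0) return byQualifier;
        return Long.compare(b.timestamp, a.timestamp);
    }

    /** Returns a sorted copy, leaving the input list untouched. */
    static List<Cell> sorted(List<Cell> cells) {
        List<Cell> copy = new ArrayList<>(cells);
        copy.sort(HFileOrder::compare);
        return copy;
    }

    public static void main(String[] args) {
        List<Cell> cells = Arrays.asList(
                new Cell("TheRealMT", "password", 1329088321289L, "Langhorne"),
                new Cell("TheRealMT", "name", 1329088321289L, "Mark Twain"),
                new Cell("TheRealMT", "password", 1329088818321L, "abc123"),
                new Cell("TheRealMT", "email", 1329088321289L, "samuel@clemens.org"));
        for (Cell c : sorted(cells)) {
            System.out.println(c.rowkey + ", " + c.qualifier + ", " + c.timestamp + ", " + c.value);
        }
    }
}
```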
  14. Thinking through the design
      ... it's a database after-all
  15. But isn't HBase schema-less?
      • Number of tables
      • Rowkey design
      • Number of column families per table. What goes into what column family
      • Column qualifier names
      • What goes into the cells
      • Number of versions
  16. Rowkeys
      • Rowkey design is the single most important aspect of HBase table designs
      • The only way to address rows in HBase
  17. TwitBase relationships
      • Users follow users
      • Relationships need to be persisted for usage later on
      • Model tables for the expected access patterns
      • Read pattern
        • Who does A follow?
        • Who follows A?
        • Does A follow B?
      • Write pattern
        • A follows B
        • A unfollows B
  18. Start simple
      • Adjacency list
      Column Family: follows
      row key: userid
      column qualifier: followed user number
      cell value: followed userid
      Rows (each entry is col qualifier:cell value):
        TheFakeMT  1:TheRealMT 2:MTFanBoy 3:Olivia 4:HRogers
        TheRealMT  1:HRogers 2:Olivia
  19. Optimizing the adjacency list
      • We need a count
      • Where does a new followed user go?
      Column Family: follows
        TheFakeMT  1:TheRealMT 2:MTFanBoy 3:Olivia 4:HRogers count:4
        TheRealMT  1:HRogers 2:Olivia count:2
  20. Adding a new user
      Row that needs to be updated (column family follows):
        TheFakeMT  1:TheRealMT 2:MTFanBoy 3:Olivia 4:HRogers count:4
        TheRealMT  1:HRogers 2:Olivia count:2
      Client code:
      (1) Get current count: TheFakeMT : follows: {count -> 4}
      (2) Update (increment) count: TheFakeMT : follows: {count -> 5}
      (3) Add new entry: TheFakeMT : follows: {5 -> MTFanBoy2, count -> 5}
      (4) Write the new data to HBase:
        TheFakeMT  1:TheRealMT 2:MTFanBoy 3:Olivia 4:HRogers 5:MTFanBoy2 count:5
        TheRealMT  1:HRogers 2:Olivia count:2
  21. Transactions == not good
      • HBase doesn't have native support (think scale)
      • Don't want to complicate client side logic
      • Only solution -> simplify schema
      Column Family: follows
        TheFakeMT  TheRealMT:1 MTFanBoy:1 Olivia:1 HRogers:1
        TheRealMT  HRogers:1 Olivia:1
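A minimal sketch of why the simplified schema removes the read-modify-write cycle: once the followed userid is the column qualifier, a follow is a single write and no counter has to be read first. The class below models one row of the follows column family with a standard-library map; `FollowsRow` and its method names are hypothetical, and this is not HBase client code.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** Illustrative model of one row in the simplified "follows" schema; not HBase code. */
final class FollowsRow {
    // qualifier (followed userid) -> cell value (always "1" in this schema)
    private final NavigableMap<String, String> cells = new TreeMap<>();

    /** One blind write: no count to fetch, and repeating it changes nothing. */
    void follow(String followedUserId) {
        cells.put(followedUserId, "1");
    }

    /** One delete of the matching qualifier. */
    void unfollow(String followedUserId) {
        cells.remove(followedUserId);
    }

    /** "Does A follow B?" becomes a lookup of a single qualifier. */
    boolean follows(String followedUserId) {
        return cells.containsKey(followedUserId);
    }

    /** The count is derived from the row itself rather than stored in a counter cell. */
    int followedCount() {
        return cells.size();
    }

    public static void main(String[] args) {
        FollowsRow theFakeMT = new FollowsRow();
        theFakeMT.follow("TheRealMT");
        theFakeMT.follow("MTFanBoy");
        theFakeMT.follow("MTFanBoy"); // idempotent: same qualifier, same value
        System.out.println(theFakeMT.followedCount()); // prints 2
    }
}
```

Compared with the numbered-qualifier layout, the client no longer needs the four-step sequence from the "Adding a new user" slide, which is what made a transaction necessary there.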
  22. Revisit the questions
      • Read pattern
        • Who all does A follow?
        • Who all follows A?
        • Does A follow B?
      • Write pattern
        • A follows B
        • A unfollows B
  23. Revisit the questions
  24. Denormalization
      • Second table for reverse relationship
      • Otherwise scan across entire table and affect read performance
      [Figure labels: Normalization, Denormalization, Write performance, Read performance, Dreamland, Poor design]
  25. More optimizations
      • Convert into tall-narrow table
      • Leverage rowkey indexing better
      • Gets -> short Scans
      CF: f
      row key: follower + followed
      CQ: followed user's name
      cell value: 1
      Keeping the column family and column qualifier names short reduces the data transferred over the network back to the client. The KeyValue objects become smaller.
      The + in the row key refers to concatenating the two values. You could delimit using any character you like, e.g. A-B or A,B.
  26. Tall-narrow table example
      • Denormalization is the way to go
      Column family f (each entry is col qualifier:cell value):
        TheFakeMT+TheRealMT  Mark Twain:1
        TheFakeMT+MTFanBoy   Amandeep Khurana:1
        TheFakeMT+Olivia     Olivia Clemens:1
        TheFakeMT+HRogers    Henry Rogers:1
        TheRealMT+Olivia     Olivia Clemens:1
        TheRealMT+HRogers    Henry Rogers:1
      Putting the user name in the column qualifier saves you from looking up the users table for the name of the user given an id. You can simply list out names or ids while looking at relationships just from this table. The downside is that you need to update the name in all the cells if the user updates their name in their profile. This is classic denormalization.
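The "Gets -> short Scans" idea can be sketched over a sorted map: with rowkeys of the form follower+followed, "Who does A follow?" is a range scan over the keys that start with "A+", and "Does A follow B?" is a point lookup. The class name `TallNarrowFollows` is hypothetical, the map stands in for a sorted table, and the sketch assumes userids never contain the delimiter. For brevity it stores the followed user's name as the map value rather than as a qualifier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Illustrative tall-narrow relationship table backed by a sorted map; not HBase code. */
final class TallNarrowFollows {
    // Assumption: userids never contain this character.
    private static final char DELIMITER = '+';

    // rowkey (follower + DELIMITER + followed) -> followed user's name
    private final NavigableMap<String, String> table = new TreeMap<>();

    void follow(String follower, String followed, String followedName) {
        table.put(follower + DELIMITER + followed, followedName);
    }

    /** Point lookup on the full rowkey. */
    boolean follows(String follower, String followed) {
        return table.containsKey(follower + DELIMITER + followed);
    }

    /** Short scan: every rowkey in the range [follower+DELIMITER, follower+(DELIMITER+1)). */
    List<String> followedBy(String follower) {
        String start = follower + DELIMITER;
        String stop = follower + (char) (DELIMITER + 1);
        return new ArrayList<>(table.subMap(start, true, stop, false).values());
    }

    public static void main(String[] args) {
        TallNarrowFollows t = new TallNarrowFollows();
        t.follow("TheFakeMT", "TheRealMT", "Mark Twain");
        t.follow("TheRealMT", "Olivia", "Olivia Clemens");
        t.follow("TheRealMT", "HRogers", "Henry Rogers");
        System.out.println(t.followedBy("TheRealMT")); // prints [Henry Rogers, Olivia Clemens]
    }
}
```

The reverse question ("Who follows A?") cannot be answered by a prefix scan on this table, which is the reason the Denormalization slide proposes a second table keyed the other way round.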
  27. Uniform rowkey length
      • MD5 the userids -> 16 bytes + 16 bytes rowkeys
      • Better distribution of load
      CF: f
      row key: md5(follower)md5(followed)
      CQ: followed userid
      cell value: followed user's name
      Using MD5 of the user ids gives you fixed lengths instead of variable length user ids. You don't need concatenation logic anymore.
  28. Uniform rowkey length (continued)
      Column family f (each entry is col qualifier:cell value):
        MD5(TheFakeMT) MD5(TheRealMT)  TheRealMT:Mark Twain
        MD5(TheFakeMT) MD5(MTFanBoy)   MTFanBoy:Amandeep Khurana
        MD5(TheFakeMT) MD5(Olivia)     Olivia:Olivia Clemens
        MD5(TheFakeMT) MD5(HRogers)    HRogers:Henry Rogers
        MD5(TheRealMT) MD5(Olivia)     Olivia:Olivia Clemens
        MD5(TheRealMT) MD5(HRogers)    HRogers:Henry Rogers
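A minimal sketch of building the fixed-length rowkey md5(follower)md5(followed) with the standard `java.security.MessageDigest` class. The class name `Md5Rowkey` and the choice of UTF-8 for encoding userids are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative fixed-length rowkey builder; the names here are hypothetical. */
final class Md5Rowkey {
    private static final int MD5_LENGTH = 16; // an MD5 digest is always 16 bytes

    /** MD5 digest of the userid; UTF-8 encoding is an assumption of this sketch. */
    static byte[] md5(String userId) {
        try {
            return MessageDigest.getInstance("MD5").digest(userId.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform is required to provide.
            throw new IllegalStateException(e);
        }
    }

    /** 32-byte rowkey: bytes 0..15 identify the follower, bytes 16..31 the followed user. */
    static byte[] rowkey(String follower, String followed) {
        byte[] key = new byte[2 * MD5_LENGTH];
        System.arraycopy(md5(follower), 0, key, 0, MD5_LENGTH);
        System.arraycopy(md5(followed), 0, key, MD5_LENGTH, MD5_LENGTH);
        return key;
    }

    public static void main(String[] args) {
        System.out.println(rowkey("TheFakeMT", "TheRealMT").length); // prints 32
    }
}
```

Because both halves have a known width, no delimiter is needed, and all rows for one follower still share the same 16-byte prefix, so a prefix scan on md5(follower) continues to work.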
  29. Tall v/s Wide tables storage footprint
      We'll look at what it means to Get() row r5 from this table.
      Logical representation of an HBase table:
        Row  CF1           CF2
        r1   c1:v1         c1:v9 c6:v2
        r2   c1:v2 c3:v6
        r3   c2:v3         c5:v6
        r4   c2:v4
        r5   c1:v1 c3:v5   c7:v8
      Actual physical storage of the table:
        HFile for CF1: r1:CF1:c1:t1:v1, r2:CF1:c1:t2:v2, r2:CF1:c3:t3:v6, r3:CF1:c2:t1:v3, r4:CF1:c2:t1:v4, r5:CF1:c1:t2:v1, r5:CF1:c3:t3:v5
        HFile for CF2: r1:CF2:c1:t1:v9, r1:CF2:c6:t4:v2, r3:CF2:c5:t4:v6, r5:CF2:c7:t3:v8
      Result object returned for a Get() on row r5 (KeyValue objects): r5:CF1:c1:t2:v1, r5:CF1:c3:t3:v5, r5:CF2:c7:t3:v8
      Structure of a KeyValue object: Key = (Row Key, Col Fam, Col Qual, Time Stamp); Value = Cell Value
  30. Rowkey design
      • Single most important aspect of designing tables
      • Depends on expected access patterns
      • HFiles are sorted on Key part of KeyValue objects
      "TheRealMT", "info", "email", 1329088321289, "samuel@clemens.org"
      "TheRealMT", "info", "name", 1329088321289, "Mark Twain"
      "TheRealMT", "info", "password", 1329088818321, "abc123"
      "TheRealMT", "info", "password", 1329088321289, "Langhorne"
      HFile for the info column family in the users table
  31. Write optimized
      • Distribute writes across the cluster
        • Issue most pronounced with time series data
      • Hashing
          hash("TheRealMT") -> random byte[]
      • Salting
          int salt = new Integer(new Long(timestamp).hashCode()).shortValue() % <number of region servers>;
          byte[] rowkey = Bytes.add(Bytes.toBytes(salt), Bytes.toBytes("|"), Bytes.toBytes(timestamp));
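A standalone sketch of the salting idea using only the Java standard library, so it runs without the HBase `Bytes` helper. It departs from the slide in one labeled way: it uses `Math.floorMod` so the salt can never be negative. The class name `SaltedRowkey` and the `buckets` parameter (standing in for the number of region servers) are hypothetical.

```java
import java.nio.ByteBuffer;

/** Illustrative salted rowkey for time series writes; names are hypothetical. */
final class SaltedRowkey {
    /** Salt in the range [0, buckets); floorMod keeps it non-negative for any hash code. */
    static int salt(long timestamp, int buckets) {
        return Math.floorMod(Long.hashCode(timestamp), buckets);
    }

    /** Layout: 4-byte salt, 1-byte '|' separator, 8-byte big-endian timestamp. */
    static byte[] rowkey(long timestamp, int buckets) {
        return ByteBuffer.allocate(Integer.BYTES + 1 + Long.BYTES)
                .putInt(salt(timestamp, buckets))
                .put((byte) '|')
                .putLong(timestamp)
                .array();
    }

    public static void main(String[] args) {
        long timestamp = 1329088818321L;
        int buckets = 4; // assumed number of region servers
        System.out.println("salt = " + salt(timestamp, buckets));
        System.out.println("rowkey length = " + rowkey(timestamp, buckets).length); // 13 bytes
    }
}
```

The salt spreads consecutive timestamps over several key ranges, which helps writes; the cost is that a reader looking for a time range has to query every salt bucket and merge the results.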
  32. Read optimized
      • Data to be accessed together should be stored together
      • e.g. twit streams - last 10 twits by the users I follow
      [Figure: two sorted rowkey orderings of the same twits]
        Rowkey = user + time: Olivia1, Olivia2, Olivia5, Olivia7, Olivia9, TheFakeMT2, TheFakeMT3, TheFakeMT4, TheFakeMT5, TheFakeMT6, TheRealMT1, TheRealMT2, TheRealMT5, TheRealMT8
        Rowkey = time + user: 1Olivia, 1TheRealMT, 2Olivia, 2TheFakeMT, 2TheRealMT, 3TheFakeMT, 4TheFakeMT, 5Olivia, 5TheFakeMT, 5TheRealMT, 6TheFakeMT, 7Olivia, 8TheRealMT, 9Olivia
  33. Relational to Non relational
      • Relational concepts
        • Entities
        • Attributes
        • Relationships
      • Entities
        • Table is a table. Not much going on there
        • Users table contains... users. Those are entities
        • Good place to start
  34. Relational to Non relational
      • Attributes
        • Identifying
          • Primary keys. Compound keys
          • Maps to rowkeys
        • Non-identifying
          • Other columns
          • Maps to column qualifiers and cells
      • Relationships
        • Foreign keys, junction tables, joins.
        • Non-existent in HBase. Instead try to denormalize
  35. Nested Entities
      • Column Qualifiers can contain data instead of just a column name
      [Figure: hbase table → row key → column family → fixed qualifier → timestamp → value. Nested entities (repeating entity): variable qualifier → timestamp → value]
  36. Schema design summary
      • Schema can make or break the performance you get
      • Rowkey is the single most important thing
        • Use tricks like hashing and salting
      • Denormalize to your advantage
        • There are no joins
      • Isolate access patterns
        • Separate CFs or even separate tables
      • Shorter names -> lower storage footprint
      • Column qualifiers can be used to store data and not just column names
        • Big difference as compared to RDBMS