Building apps with HBase - Big Data TechCon Boston
Presentation Transcript

  • Building applications with HBase
    Amandeep Khurana | Solutions Architect
    Big Data TechCon, Boston, April 2013
    Friday, April 12, 13
  • About me
    • Solutions Architect, Cloudera Inc
    • Amazon Web Services
    • Interested in large scale distributed systems
    • Co-author, HBase in Action (Nick Dimiduk and Amandeep Khurana, Manning)
    • Twitter: amansk
  • About the talk
    • Set up a local HBase environment
    • Simple client
    • Deeper dive on API
    • Data model
  • What is HBase?
    • Scalable, distributed data store
    • Open source avatar of Google's Bigtable
    • Sorted map of maps
    • Sparse
    • Multidimensional
    • Tightly integrated with Hadoop
    • Not an RDBMS
  • So, what's the big deal?
    • Give up system complexity and sophisticated features for ease of scale and manageability
    • Simple system != easy to build applications on top
    • Goal: predictable read and write performance at scale
    • Effective API usage
    • Optimal schema design
    • Understand internals and leverage at scale
  • Use cases
    • Canonical use case: storing crawl data and indices for search
    [Figure: Web Search powered by Bigtable]
    Indexing the Internet
        1. Crawlers constantly scour the Internet for new pages. Those pages are stored as individual records in Bigtable.
        2. A MapReduce job runs over the entire table, generating search indexes for the Web Search application.
    Searching the Internet
        3. The user initiates a Web Search request.
        4. The Web Search application queries the search indexes and retrieves matching documents directly from Bigtable.
        5. Search results are presented to the user.
  • Use cases (continued)
    • Capturing incremental trickle data
    • Content serving / OLTP
    • Blob store
  • Installing HBase
    • Get it here -> http://hbase.apache.org/
    • Downloads -> Choose a mirror -> Choose a release
    • Download the tarball
    • Untar it and run it
  • Basics
    • Modes of operation
        • Standalone
        • Pseudo-distributed
        • Fully distributed
    • Start up in standalone mode
        • cd hbase-0.94.1
        • bin/start-hbase.sh
    • Logs are in ./logs/
    • Check out http://localhost:60010 in your browser
  • Ways to interact
    • Shell
    • Java client
    • Thrift interface
    • REST interface
    • Asynchbase
  • Shell
    • Basic interaction tool
    • Written in JRuby
    • Uses the Java client library
    • Good place to start poking around
  • Shell (continued)
    • bin/hbase shell
    • Create table
        • create 'users', 'info'
    • List tables
        • list
    • Describe table
        • describe 'users'
  • Shell (continued)
    • Put a row
        • put 'users', 'u1', 'info:name', 'Mr. Rogers'
    • Get a row
        • get 'users', 'u1'
    • Put more
        • put 'users', 'u1', 'info:email', 'mail@rogers.com'
        • put 'users', 'u1', 'info:password', 'foobar'
    • Get a row
        • get 'users', 'u1'
    • Scan table
        • scan 'users'
  • Demo
    - Standalone mode
    - Shell
  • Java API
    ... can't really build an application just with the shell
  • Important terms
    • Table
        • Consists of rows and columns
    • Row
        • Has a bunch of columns.
        • Identified by a rowkey (primary key)
    • Column Qualifier
        • Dynamic column name
    • Column Family
        • Column groups - logical and physical (similar access pattern)
    • Cell
        • The actual element that contains the data for a row-column intersection
    • Version
        • Every cell has multiple versions.
  • Introducing TwitBase
    • Twitter clone built on HBase
    • Designed to scale
    • It's an example, not a production app
    • Start with users and twits for now
    • Users
        • Have profiles
        • Post twits
    • Twits
        • Have text posted by users
    • Get sample code at: https://github.com/hbaseinaction/twitbase
  • Java API
    • Configuration - holds config information about the cluster. Roughly equivalent to a JDBC connection string
    • HConnection - represents a connection to a cluster
    • HBaseAdmin - admin tool to handle DDL operations
    • HTablePool - pool of connections to the cluster
    • HTable (HTableInterface) - handle on a single HBase table. Sends commands to the table
  • Connecting to HBase
    • Using default confs:
        HTableInterface usersTable = new HTable("users");
    • Passing custom conf:
        Configuration myConf = HBaseConfiguration.create();
        myConf.set("parameter_name", "parameter_value");
        HTableInterface usersTable = new HTable(myConf, "users");
    • Connection pooling:
        HTablePool pool = new HTablePool();
        HTableInterface usersTable = pool.getTable("users");
        // work with the table
        usersTable.close();
  • Inserting data
    • Put call on HTable
        Put p = new Put(Bytes.toBytes("TheRealMT"));
        p.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Mark Twain"));
        p.add(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes("samuel@clemens.org"));
        p.add(Bytes.toBytes("info"), Bytes.toBytes("password"), Bytes.toBytes("Langhorne"));
        usersTable.put(p);
  • Data coordinates
    • Row is addressed using rowkey
    • Cell is addressed using [rowkey + family + qualifier]
  • Updating data
    • Update == Write
        Put p = new Put(Bytes.toBytes("TheRealMT"));
        p.add(Bytes.toBytes("info"), Bytes.toBytes("password"), Bytes.toBytes("abc123"));
        usersTable.put(p);
  • Write path
    [Figure: Client -> WAL and Memstore -> HFiles]
    • Write goes to the write-ahead log (WAL) as well as the in-memory write buffer called the Memstore. Clients don't interact directly with the underlying HFiles during writes.
    • Memstore gets flushed to disk to form a new HFile when filled up with writes.
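    The write path above can be sketched in plain Java. This is only an illustrative model, not HBase code: the class WritePathSketch, its fields, and the entry-count flush threshold are assumptions made for the example (HBase itself decides when to flush based on Memstore size in bytes).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical model of the write path; names and threshold are illustrative only.
class WritePathSketch {
    final List<String> wal = new ArrayList<>();                          // append-only write-ahead log
    NavigableMap<String, String> memstore = new TreeMap<>();             // sorted in-memory write buffer
    final List<NavigableMap<String, String>> hfiles = new ArrayList<>(); // immutable, sorted "files"
    final int flushThreshold;                                            // assumption: counted in entries

    WritePathSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void put(String key, String value) {
        wal.add(key + "=" + value);   // 1. record the edit in the WAL
        memstore.put(key, value);     // 2. apply it to the Memstore
        if (memstore.size() >= flushThreshold) {
            flush();                  // 3. a full Memstore becomes a new HFile
        }
    }

    void flush() {
        if (memstore.isEmpty()) {
            return;
        }
        // The flushed file is never modified again; later writes go to a fresh Memstore.
        hfiles.add(Collections.unmodifiableNavigableMap(memstore));
        memstore = new TreeMap<>();
    }

    public static void main(String[] args) {
        WritePathSketch store = new WritePathSketch(2);
        store.put("TheRealMT/info:name", "Mark Twain");
        store.put("TheRealMT/info:email", "samuel@clemens.org");
        System.out.println("WAL entries: " + store.wal.size());
        System.out.println("HFiles: " + store.hfiles.size());
        System.out.println("Memstore entries: " + store.memstore.size());
    }
}
```

    Note that the client only ever calls put(); the files are produced by the flush, which matches the point that clients do not touch HFiles directly during writes.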
  • Reading data
    • Get the entire row
        Get g = new Get(Bytes.toBytes("TheRealMT"));
        Result r = usersTable.get(g);
    • Get selected columns
        Get g = new Get(Bytes.toBytes("TheRealMT"));
        g.addColumn(Bytes.toBytes("info"), Bytes.toBytes("password"));
        Result r = usersTable.get(g);
    • Extract the value out of the Result object
        byte[] b = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
        String email = Bytes.toString(b);
  • Reading data (continued)
    • Scan
        Scan s = new Scan(startRow, stopRow);
        ResultScanner rs = twits.getScanner(s);
        for (Result r : rs) {
            // iterate over scanner results
        }
    • Single-threaded iteration over a defined set of rows. Could be the entire table too.
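    To make the row range of a scan concrete, here is a small stand-alone sketch that models a table as a sorted map and scans it. It does not use the HBase client; ScanSketch and its scan method are hypothetical names. It assumes the usual Scan semantics (start row inclusive, stop row exclusive) and compares rowkeys as Java strings, whereas HBase compares raw bytes; the two agree for plain ASCII keys like these.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical model of a range scan over rowkey-sorted data.
class ScanSketch {
    // Returns the rowkeys in [startRow, stopRow), in sorted order.
    static List<String> scan(NavigableMap<String, String> table, String startRow, String stopRow) {
        List<String> rowkeys = new ArrayList<>();
        for (String rowkey : table.subMap(startRow, true, stopRow, false).keySet()) {
            rowkeys.add(rowkey);
        }
        return rowkeys;
    }

    // Builds a tiny "users" table keyed by rowkey; the values are placeholders.
    static NavigableMap<String, String> sampleTable() {
        NavigableMap<String, String> users = new TreeMap<>();
        users.put("TheRealMT", "row data");
        users.put("GrandpaD", "row data");
        users.put("SirDoyle", "row data");
        users.put("HMS_Surprise", "row data");
        return users;
    }

    public static void main(String[] args) {
        // The stop row itself is not returned.
        System.out.println(scan(sampleTable(), "GrandpaD", "SirDoyle"));
    }
}
```

    Because rows are kept sorted by rowkey, a scan is a walk over one contiguous slice of the table, which is why rowkey design matters so much for read patterns.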
  • What happens when you read data?
    [Figure: components on the read path: Client, BlockCache, Memstore, HFiles]
  • Deleting data
    • Another simple API call
    • Delete entire row
        Delete d = new Delete(Bytes.toBytes("TheRealMT"));
        usersTable.delete(d);
    • Delete selected columns
        Delete d = new Delete(Bytes.toBytes("TheRealMT"));
        d.deleteColumns(Bytes.toBytes("info"), Bytes.toBytes("email"));
        usersTable.delete(d);
  • Delete internals
    • Tombstone markers
    • Deletion only happens during major compactions
    • Compactions
        • Minor
        • Major
  • Compactions
    [Figure: Memstore flushes produce HFiles; several HFiles are merged into one compacted HFile]
    • Two or more HFiles get combined to form a single HFile during a compaction.
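    The interplay of tombstones and compactions can be illustrated with a simplified model. Everything here is an assumption for the sake of the example: CompactionSketch, the TOMBSTONE marker value, and the idea of an HFile as a sorted map with one value per key are hypothetical simplifications (versions and timestamps are left out).

```java
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical model: a delete writes a tombstone; only a major compaction drops the data.
class CompactionSketch {
    static final String TOMBSTONE = "<tombstone>"; // illustrative marker, not an HBase constant

    // Convenience builder for a sorted "file" from key/value pairs.
    static NavigableMap<String, String> hfile(String... keyValuePairs) {
        NavigableMap<String, String> file = new TreeMap<>();
        for (int i = 0; i + 1 < keyValuePairs.length; i += 2) {
            file.put(keyValuePairs[i], keyValuePairs[i + 1]);
        }
        return file;
    }

    // Merges files given oldest first, so entries from newer files win.
    // A minor compaction keeps tombstones; a major compaction removes them.
    static NavigableMap<String, String> compact(List<NavigableMap<String, String>> hfiles, boolean major) {
        NavigableMap<String, String> merged = new TreeMap<>();
        for (NavigableMap<String, String> file : hfiles) {
            merged.putAll(file);
        }
        if (major) {
            merged.values().removeIf(TOMBSTONE::equals);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<NavigableMap<String, String>> files = List.of(
                hfile("TheRealMT/info:email", "samuel@clemens.org",
                      "TheRealMT/info:name", "Mark Twain"),
                hfile("TheRealMT/info:email", TOMBSTONE)); // the later delete
        System.out.println("minor: " + compact(files, false));
        System.out.println("major: " + compact(files, true));
    }
}
```

    The minor result still carries the marker, because files outside the compaction might hold older values it must keep hiding; the major result, which covers every file, can drop both the marker and the deleted cell.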
  • DAO
    • Like any other application
    • Makes data access management easy
    • Cleaner code
    • Table info at a single place
    • Optimizations like connection pooling etc.
    • Easier management of database configs
  • Admin
    • HBaseAdmin
        • Administrative API to the cluster.
        • Provides an interface to manage HBase database table metadata and general administrative functions.
    • Table Operations
        • createTable()
        • deleteTable()
        • addColumn()
        • modifyColumn()
        • deleteColumn()
        • listTables()
  • Demo
    - Java client
  • More API features
    • Filters
    • Compare and Swap
    • Coprocessors
  • Data Model
    It's not a relational database system
  • HBase is ...
    • Column family oriented database
        • Column family oriented
        • Tables consisting of rows and columns
    • Persisted Map
        • Sparse
        • Multidimensional
        • Sorted
        • Indexed by rowkey, column and timestamp
    • Key Value store
        • [rowkey, col family, col qualifier, timestamp] -> cell value
  • HBase is not ...
    • A relational database
    • No SQL query language
    • No joins
    • No secondary indexing
    • No transactions
  • Tabular representation
    [Figure: the users table, column family "info"]
        Rowkey         name                     email                    password
        GrandpaD       Fyodor Dostoyevsky       fyodor@brothers.net      abc123
        HMS_Surprise   Patrick O'Brien          aubrey@sea.com           abc123
        SirDoyle       Sir Arthur Conan Doyle   art@TheQueensMen.co.uk   abc123
        TheRealMT      Mark Twain               samuel@clemens.org       abc123 (ts2=1329088818321), Langhorne (ts1=1329088321289)
    • The table is lexicographically sorted on the rowkeys
    • The coordinates used to identify data in an HBase table are: (1) rowkey, (2) column family, (3) column qualifier, (4) version
    • Each cell has multiple versions, typically represented by the timestamp of when they were inserted into the table (ts2>ts1)
  • Key-Value store
        Keys                                          Values
        [TheRealMT, info, password, 1329088818321]    abc123
        [TheRealMT, info, password, 1329088321289]    Langhorne
    Each line is a single KeyValue instance
  • Key-Value store
    1. Start with coordinates of full precision
        [TheRealMT, info, password, 1329088818321] -> "abc123"
    2. Drop version and you're left with a map of version to values
        [TheRealMT, info, password] ->
        {
          1329088818321 : "abc123",
          1329088321289 : "Langhorne"
        }
    3. Omit qualifier and you have a map of qualifiers to the previous maps
        [TheRealMT, info] ->
        {
          "email" : { 1329088321289 : "samuel@clemens.org" },
          "name" : { 1329088321289 : "Mark Twain" },
          "password" : {
            1329088818321 : "abc123",
            1329088321289 : "Langhorne"
          }
        }
    4. Finally, drop the column family and you have a row, a map of maps
        [TheRealMT] ->
        {
          "info" : {
            "email" : { 1329088321289 : "samuel@clemens.org" },
            "name" : { 1329088321289 : "Mark Twain" },
            "password" : {
              1329088818321 : "abc123",
              1329088321289 : "Langhorne"
            }
          }
        }
  • Sorted map of maps
        {
          "TheRealMT" : {                                   <- Rowkey
            "info" : {                                      <- Column family
              "email" : {                                   <- Column qualifiers
                1329088321289 : "samuel@clemens.org"
              },
              "name" : {
                1329088321289 : "Mark Twain"
              },
              "password" : {
                1329088818321 : "abc123",                   <- Versions : Values
                1329088321289 : "Langhorne"
              }
            }
          }
        }
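    The "sorted map of maps" view translates almost directly into nested sorted maps. The sketch below is a hypothetical in-memory model (SortedMapOfMaps and its methods are made-up names, and real HBase stores bytes rather than strings); it only illustrates how the four coordinates nest and why the newest version is the one a plain read returns.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Hypothetical in-memory model of the logical layout; this is not the HBase client API.
class SortedMapOfMaps {
    // rowkey -> column family -> qualifier -> timestamp -> value.
    // TreeMap keeps rowkeys sorted; timestamps use a reversed comparator so the newest comes first.
    private final TreeMap<String, TreeMap<String, TreeMap<String, TreeMap<Long, String>>>> rows =
            new TreeMap<>();

    void put(String row, String family, String qualifier, long timestamp, String value) {
        rows.computeIfAbsent(row, k -> new TreeMap<>())
            .computeIfAbsent(family, k -> new TreeMap<>())
            .computeIfAbsent(qualifier, k -> new TreeMap<>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    // Returns the newest version of a cell, or null if the cell does not exist.
    String getLatest(String row, String family, String qualifier) {
        TreeMap<Long, String> versions = rows
            .getOrDefault(row, new TreeMap<>())
            .getOrDefault(family, new TreeMap<>())
            .getOrDefault(qualifier, new TreeMap<>());
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    // Returns the lexicographically smallest rowkey, or null for an empty table.
    String firstRowKey() {
        return rows.isEmpty() ? null : rows.firstKey();
    }

    public static void main(String[] args) {
        SortedMapOfMaps users = new SortedMapOfMaps();
        users.put("TheRealMT", "info", "password", 1329088321289L, "Langhorne");
        users.put("TheRealMT", "info", "password", 1329088818321L, "abc123");
        users.put("GrandpaD", "info", "password", 1329088321289L, "abc123");
        System.out.println(users.getLatest("TheRealMT", "info", "password"));
        System.out.println(users.firstRowKey());
    }
}
```

    Sparseness falls out of this shape for free: a row simply has no entry for a qualifier it never wrote, so absent cells cost nothing.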
  • HFiles and physical data model
    • HFiles are
        • Immutable
        • Sorted on rowkey + qualifier + timestamp
        • In the context of a column family per region
    HFile for the info column family in the users table:
        "TheRealMT" , "info" , "email" , 1329088321289 , "samuel@clemens.org"
        "TheRealMT" , "info" , "name" , 1329088321289 , "Mark Twain"
        "TheRealMT" , "info" , "password" , 1329088818321 , "abc123"
        "TheRealMT" , "info" , "password" , 1329088321289 , "Langhorne"
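    The ordering shown in the HFile listing (rowkey, then qualifier, then newest timestamp first) can be expressed as a comparator. This is a sketch under assumptions: Cell and KeyValueOrderSketch are hypothetical names, and the fields are strings here, whereas HBase compares the raw bytes of its KeyValue entries.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of the sort order of entries inside one HFile.
class KeyValueOrderSketch {
    static final class Cell {
        final String row;
        final String family;
        final String qualifier;
        final long timestamp;
        final String value;

        Cell(String row, String family, String qualifier, long timestamp, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Row, family and qualifier ascending; timestamp descending so the newest version sorts first.
    static final Comparator<Cell> HFILE_ORDER = Comparator
            .comparing((Cell c) -> c.row)
            .thenComparing(c -> c.family)
            .thenComparing(c -> c.qualifier)
            .thenComparing(Comparator.comparingLong((Cell c) -> c.timestamp).reversed());

    // Returns a sorted copy and leaves the input untouched.
    static List<Cell> sorted(List<Cell> cells) {
        List<Cell> copy = new ArrayList<>(cells);
        copy.sort(HFILE_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<Cell> cells = new ArrayList<>();
        cells.add(new Cell("TheRealMT", "info", "password", 1329088321289L, "Langhorne"));
        cells.add(new Cell("TheRealMT", "info", "name", 1329088321289L, "Mark Twain"));
        cells.add(new Cell("TheRealMT", "info", "password", 1329088818321L, "abc123"));
        cells.add(new Cell("TheRealMT", "info", "email", 1329088321289L, "samuel@clemens.org"));
        for (Cell c : sorted(cells)) {
            System.out.println(c.row + " , " + c.family + " , " + c.qualifier + " , " + c.timestamp + " , " + c.value);
        }
    }
}
```

    Since a single HFile belongs to one column family, the family comparison never actually decides anything within a file; it is kept only so the comparator covers the full coordinate.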
  • HBase in distributed mode
    ... what really goes on under the hood
  • Distributed mode
    • Table is split into Regions
    [Figure: full table T1, split into 3 regions - R1, R2 and R3]
        Region   Rowkey   Name    Phone
        T1R1     00001    John    415-111-1234
        T1R1     00002    Paul    408-432-9922
        T1R1     00003    Ron     415-993-2124
        T1R1     00004    Bob     818-243-9988
        T1R2     00005    Carly   206-221-9123
        T1R2     00006    Scott   818-231-2566
        T1R2     00007    Simon   425-112-9877
        T1R2     00008    Lucas   415-992-4432
        T1R3     00009    Steve   530-288-9832
        T1R3     00010    Kelly   916-992-1234
        T1R3     00011    Betty   650-241-1192
        T1R3     00012    Anne    206-294-1298
  • Distributed mode (continued)
    • Regions are served by RegionServers
    • RegionServers are Java processes, typically collocated with the DataNodes
    [Figure: Host1, Host2 and Host3 each run a DataNode and a RegionServer; the RegionServers serve regions T1R1, T1R2 and T1R3 of table T1]
  • Finding your region
    • Two special tables: -ROOT- and .META.
    [Figure: three-level region hierarchy]
        • -ROOT- table, consisting of only 1 region, hosted on region server RS1
        • .META. table, consisting of 3 regions, hosted on RS1, RS2 and RS3: M1 (RS2), M2 (RS3), M3 (RS1)
        • User tables T1 and T2, consisting of 3 and 4 regions respectively, distributed amongst RS1, RS2 and RS3: T1R1 (RS3), T1R2 (RS1), T2R1 (RS2), T1R3 (RS3), T2R2 (RS2), T2R3 (RS1), T2R4 (RS1)
  • Regions distributed across the cluster
    [Figure: three hosts, each running a DataNode and a RegionServer]
    • Host1: RegionServer 1 (RS1), hosting regions R1 of the user table T1 and M2 of .META.
        T1-R1
            00001 John 415-111-1234
            00002 Paul 408-432-9922
            00003 Ron 415-993-2124
            00004 Bob 818-243-9988
        .META. - M2
            T1:00009-T1:00012 R3 RS2
    • Host2: RegionServer 2 (RS2), hosting regions R2, R3 of the user table and M1 of .META.
        T1-R2
            00005 Carly 206-221-9123
            00006 Scott 818-231-2566
            00007 Simon 425-112-9877
            00008 Lucas 415-992-4432
        T1-R3
            00009 Steve 530-288-9832
            00010 Kelly 916-992-1234
            00011 Betty 650-241-1192
            00012 Anne 206-294-1298
        .META. - M1
            T1:00001-T1:00004 R1 RS1
            T1:00005-T1:00008 R2 RS2
    • Host3: RegionServer 3 (RS3), hosting just -ROOT-
        -ROOT-
            M:T100001-M:T100009 M1 RS2
            M:T100010-M:T100012 M2 RS1
  • What the client does
    [Figure, built up in steps: ZooKeeper, -ROOT- (row per META region), .META. (row per table region), MyTable, row TheRealMT]
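    The lookup chain in the figure can be sketched as two "floor" lookups over sorted maps: one in -ROOT- to find the right .META. region, and one in that .META. region to find the user region. The sketch reuses the T1 layout from the "Regions distributed across the cluster" slide, with simplified region boundaries. RegionLookupSketch, the map contents, and the "region@server" strings are assumptions made for illustration; the initial step of asking ZooKeeper where -ROOT- lives is left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of the -ROOT- -> .META. -> user region lookup.
class RegionLookupSketch {
    // -ROOT- sketch: first key covered by each .META. region -> .META. region name.
    static final TreeMap<String, String> ROOT = new TreeMap<>();
    // .META. sketch: per .META. region, user-region start key -> "region@regionserver".
    static final Map<String, TreeMap<String, String>> META = new HashMap<>();

    static {
        ROOT.put("T1:00001", "M1");
        ROOT.put("T1:00009", "M2");

        TreeMap<String, String> m1 = new TreeMap<>();
        m1.put("T1:00001", "R1@RS1");
        m1.put("T1:00005", "R2@RS2");
        TreeMap<String, String> m2 = new TreeMap<>();
        m2.put("T1:00009", "R3@RS2");
        META.put("M1", m1);
        META.put("M2", m2);
    }

    // Returns "region@regionserver" for the row, or null if no region covers it.
    static String locate(String table, String rowkey) {
        String key = table + ":" + rowkey;
        // Step 1: -ROOT- has one row per .META. region; pick the one whose range starts at or before key.
        Map.Entry<String, String> metaRegion = ROOT.floorEntry(key);
        if (metaRegion == null) {
            return null;
        }
        // Step 2: that .META. region has one row per user region; same floor lookup.
        Map.Entry<String, String> userRegion = META.get(metaRegion.getValue()).floorEntry(key);
        return userRegion == null ? null : userRegion.getValue();
    }

    public static void main(String[] args) {
        System.out.println(locate("T1", "00003"));
        System.out.println(locate("T1", "00010"));
    }
}
```

    Both levels work the same way because regions are contiguous, sorted key ranges, so knowing each region's start key is enough to route any rowkey.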