HBase
Tame your BigData

Andrzej Grzesik
Lunar Logic Polska
me: present
    past
Questions?	
Ask them right away!
So
HBase

open-source
high-performance
BigTable
fast
distributed
NoSQL datastore
scalable
built upon Hadoop
fault tolerant

Cool and fun to work with!
Who uses HBase?
Beware!	
 Lots of text
Hadoop stack

"By my count — and it's very possible I'm missing someone —
Hadoop-based startups have raised $104.5 million since May.
The same set of companies has raised $159.7 million since 2009,
when Cloudera closed its first round.

By comparison, the handful of popular NoSQL database vendors,
often lumped into the big data category as well, and similar to
Hadoop in their focus on unstructured data, have announced just
more than $90 million in funding overall."

via http://gigaom.com/cloud/with-40m-for-cloudera-how-much-is-hadoop-worth/
Some theory
architecture

[diagram: HBase runs on top of ZooKeeper and the Hadoop layer (HDFS and MapReduce), spread across many server nodes]
Related projects:
•  Chukwa
   o  Log analysis tool
•  Hive
   o  Or, if Hive is slow:
•  Pig
   o  High-level data manipulation language
   o  Don't write all MapReduce jobs by hand!
Brewer's CAP theorem

[diagram: the CAP triangle (Consistency, Availability, Partition tolerance): pick 2.
 RDBMS sits on the Consistency + Availability edge,
 HBase on the Consistency + Partition tolerance edge,
 CouchDB on the Availability + Partition tolerance edge]
Data organisation

[diagram: rows are sorted by rowkey and split into regions; Region 1 holds Rowkey 1 ... Rowkey n, Region 2 starts at Rowkey n+1]
Data organisation

[diagram: a region contains several column families; each column family groups columns, e.g. col1, col2, col3 in one family and col1, col2 in another]
Data organisation

[diagram: within a region, a cell is addressed by column key and timestamp; column1 holds versions v1@t1, v1@t2, v1@t3, column2 holds v1@t1 and v1@t2, column3 holds v1@t1]
Let's see some code?
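The live code from the talk is not preserved in the deck, so here is a minimal sketch of the addressing model from the previous slides: a value lives at rowkey + column family + column qualifier (+ timestamp). It uses the classic HTable-based client API (newer clients use ConnectionFactory and Table instead); the "users" table and "cf" family are assumptions for illustration, not anything from the talk.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseHello {
        public static void main(String[] args) throws Exception {
            // Picks up hbase-site.xml from the classpath (local or remote cluster).
            Configuration conf = HBaseConfiguration.create();
            // A "users" table with a column family "cf" is assumed to exist.
            HTable table = new HTable(conf, "users");

            // Write: the value is addressed by rowkey + family + qualifier (+ timestamp).
            Put put = new Put(Bytes.toBytes("user-42"));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Andrzej"));
            table.put(put);

            // Read the same cell back; by default the newest version is returned.
            Get get = new Get(Bytes.toBytes("user-42"));
            get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"))));

            table.close();
        }
    }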
Integration testing?

Start a cluster locally
        or
Use a remote one

(see the mini-cluster sketch below)
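For the local option, HBase's test jar ships HBaseTestingUtility, which runs ZooKeeper, HDFS and HBase inside one JVM. A minimal sketch, assuming the older byte[]-based createTable signature; the table and family names are invented. Using a remote cluster instead is mostly a matter of pointing hbase.zookeeper.quorum in the client configuration at it.

    import org.apache.hadoop.hbase.HBaseTestingUtility;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MiniClusterSketch {
        public static void main(String[] args) throws Exception {
            // Starts ZooKeeper, HDFS and HBase in-process, just for the test run.
            HBaseTestingUtility util = new HBaseTestingUtility();
            util.startMiniCluster();
            try {
                HTable table = util.createTable(Bytes.toBytes("test"), Bytes.toBytes("cf"));
                Put put = new Put(Bytes.toBytes("row1"));
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
                table.put(put);
                // ... assertions against the table go here ...
                table.close();
            } finally {
                util.shutdownMiniCluster();
            }
        }
    }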
How to start hacking?
Grab Hadoop
     http://hadoop.apache.org/

and HBase
     http://hbase.apache.org/

Spend an eon learning more than you wanted about plumbing
How to start hacking?
Better (faster) way:

Grab a VM/packages from [vendor shown on slide]
Pro tip
Don't run HBase on Windows or face problems

It's doable
(http://hbase.apache.org/docs/r0.20.6/cygwin.html)
but VMs are faster!
How to start hacking?
Situation will improve, since
modes	
Develop with
•  local mode
   o  single instance, single JVM

Then
•  Pseudo-distributed
   o  multiple instances, single machine

For production
•  Distributed mode
   o  many nodes
One more
Befriend some admins; you will need them
Use cases?
Example from X
•  Customer-provided user data
•  Schema varies between customers
   o  schema definitions kept in an RDBMS
•  Data itself in HBase
   (see the sketch below)
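A sketch of how the pattern above might look, with invented table/family names ("user_data", "attrs"): since HBase creates column qualifiers on write, each customer's attributes simply become dynamic columns in one family, while the per-customer schema description stays in the RDBMS.

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class VaryingSchemaSketch {
        private static final byte[] ATTRS = Bytes.toBytes("attrs");

        // Each customer sends a different set of attributes; every attribute
        // becomes a column qualifier in the "attrs" family of one row.
        public static void storeUser(HTable table, String userId,
                                     Map<String, String> attributes) throws Exception {
            Put put = new Put(Bytes.toBytes(userId));
            for (Map.Entry<String, String> e : attributes.entrySet()) {
                put.add(ATTRS, Bytes.toBytes(e.getKey()), Bytes.toBytes(e.getValue()));
            }
            table.put(put);
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "user_data");
            Map<String, String> attrs = new HashMap<String, String>();
            attrs.put("email", "a@example.com");
            attrs.put("shoe_size", "42");
            storeUser(table, "customerA:user-1", attrs);
            table.close();
        }
    }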
Example from Facebook
HBase drives Facebook Messages

•  Key: UserId
•  Column: Word
•  Version: MessageId
   (see the sketch below)

For more details, see:
http://www.infoq.com/presentations/HBase-at-Facebook
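A hedged sketch of the layout listed above (row = user id, column qualifier = word, cell version = message id), as described in the linked talk; the table and family names ("msg_index", "idx") are placeholders of mine, not Facebook's.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MessageIndexSketch {
        private static final byte[] IDX = Bytes.toBytes("idx");

        // Index one message: every word becomes a column, and the message id is
        // used as the cell timestamp (version), so one row per user holds that
        // user's whole inverted index.
        public static void indexMessage(HTable table, long userId, long messageId,
                                        String[] words) throws Exception {
            Put put = new Put(Bytes.toBytes(userId));
            for (String word : words) {
                put.add(IDX, Bytes.toBytes(word), messageId, Bytes.toBytes(""));
            }
            table.put(put);
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "msg_index");
            indexMessage(table, 42L, 1001L, new String[] {"hello", "world"});
            table.close();
        }
    }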
When to use HBase?
•  Lots of key/value data
•  Need good scalability
•  Need good query times with random access
   (see the scan sketch below)
•  Data analytics
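Random access and range queries both ride on the sorted rowkey; a minimal sketch of a rowkey-range scan, reusing the assumed "users" table and "cf" family from the earlier example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RangeScanSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "users");

            // Scan a contiguous slice of the keyspace: [user-100, user-200).
            Scan scan = new Scan(Bytes.toBytes("user-100"), Bytes.toBytes("user-200"));
            scan.addFamily(Bytes.toBytes("cf"));

            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            } finally {
                scanner.close();
            }
            table.close();
        }
    }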
What is HBase poor at?
•  transactions
•  queries relying on (secondary) indexes
•  security
T(h)ank you!
Useful
Brewer's CAP theorem
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.1495&rep=rep1&type=pdf

Google BigTable
http://labs.google.com/papers/bigtable-osdi06.pdf

DZone Refcardz
http://refcardz.dzone.com/refcardz/getting-started-apache-hadoop
http://refcardz.dzone.com/refcardz/deploying-hadoop
