2. We will discuss
• Introduction to Hadoop
• HBase: Definition, Storage Model, Use Cases
• Basic Data Access from the Shell
• Hands-on with the HBase API
3. What is Hadoop?
• Framework for distributed processing of large datasets (Big Data)
• HDFS + MapReduce
• HDFS (Data):
Distributed filesystem responsible for storing data across the cluster
Provides replication on cheap commodity hardware
NameNode and DataNode processes
• MapReduce (Processing):
May be a future session
4. HBase: What
• A sparse, distributed, persistent, multidimensional, sorted
map (as defined by Google’s BigTable paper)
• A distributed NoSQL database designed on top of HDFS
5. RDBMS Woes (with massive data)
• Scaling is hard and expensive
• Relational features/secondary indexes must be turned off to scale
• Quick reads are hard at larger table sizes (500 GB+)
• Single points of failure
• Schema changes are painful
6. HBase: Why
• Scalable: just add nodes as your data grows
• Distributed: leverages Hadoop HDFS’s advantages
• Built on top of Hadoop: being part of the
ecosystem, it can be integrated with multiple tools
• High performance for reads/writes
Short-circuit reads
Single reads: 1 to 10 ms; scans: 100s of rows in 10 ms
• Schema-less
• Production-ready where data is on the order of petabytes
8. HTable
• Tables are split into regions
• Region: data with a contiguous range of RowKeys,
[start, end), in sorted order
• Regions split as the table grows (region size can be
configured)
• Table schema defines the column families
• (Table, RowKey, ColumnFamily, ColumnName, Timestamp) → Value
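The map model above can be sketched with plain Java collections (a conceptual illustration only; the class name `SortedMapModel` is made up, and this is not the HBase API):

```java
import java.util.Comparator;
import java.util.TreeMap;

// Conceptual model: RowKey -> (Family:Qualifier -> (Timestamp -> Value)),
// every level kept sorted -- the "sorted map" from the BigTable paper.
public class SortedMapModel {
    public static void main(String[] args) {
        TreeMap<String, TreeMap<String, TreeMap<Long, String>>> table = new TreeMap<>();

        // Write the cell (row1, colfam1:qual1, t=100) = "val1".
        // Timestamps sort newest-first, like HBase versions.
        table.computeIfAbsent("row1", r -> new TreeMap<>())
             .computeIfAbsent("colfam1:qual1",
                     c -> new TreeMap<>(Comparator.reverseOrder()))
             .put(100L, "val1");
        table.computeIfAbsent("row9", r -> new TreeMap<>());

        // Rows come back in sorted RowKey order; a region holds one
        // contiguous slice [start, end) of this key space.
        System.out.println(table.firstKey()); // row1
    }
}
```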
13. HBase: API
• Configuration: holds details of where to find the cluster
and tunable settings.
• HConnection: represents a connection to the cluster.
• HBaseAdmin: handles DDL
operations (create, list, drop, alter).
• HTable (HTableInterface): a handle on a single HBase
table. Sends “commands” to the table (Put, Get, Scan,
Delete, Increment)
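The DDL side can be sketched with the classic (pre-1.0) client API these slides use; the table and family names are just examples, and a running cluster reachable via the default Configuration is assumed:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // The schema only declares column families; column
        // qualifiers inside a family are free-form.
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("testtable"));
        desc.addFamily(new HColumnDescriptor("colfam1"));

        if (!admin.tableExists("testtable")) {
            admin.createTable(desc);                    // DDL: create
        }
        for (HTableDescriptor t : admin.listTables()) { // DDL: list
            System.out.println(t.getNameAsString());
        }
        admin.close();
    }
}
```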
17. HBase: API: DML PUT
Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "testtable");
Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"),
    Bytes.toBytes("val1"));
put.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual2"),
    Bytes.toBytes("val2"));
table.put(put);
18. HBase: API: DML GET
Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "testtable");
Get get = new Get(Bytes.toBytes("row1"));
get.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));
Result result = table.get(get);
byte[] val = result.getValue(Bytes.toBytes("colfam1"),
    Bytes.toBytes("qual1"));
System.out.println("Value: " + Bytes.toString(val));
19. HBase: API: DML SCAN
Scan scan1 = new Scan();
ResultScanner scanner1 = table.getScanner(scan1);
for (Result res : scanner1) {
    System.out.println(res);
}
scanner1.close();
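Delete and Increment are also listed among the table commands; a sketch in the same style, assuming the `conf` and "testtable" setup from the previous slides:

```java
// Assumes conf and "testtable" exist as in the Put/Get examples.
HTable table = new HTable(conf, "testtable");

// Delete: remove all versions of one column; with no column
// named, the whole row is deleted.
Delete del = new Delete(Bytes.toBytes("row1"));
del.deleteColumns(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));
table.delete(del);

// Increment: atomic server-side counter update; returns the new value.
long count = table.incrementColumnValue(Bytes.toBytes("row2"),
    Bytes.toBytes("colfam1"), Bytes.toBytes("counter"), 1L);
System.out.println("Counter: " + count);
table.close();
```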
20. Other Projects around HBase
• SQL Layer: Phoenix, Hive, Impala
• Object Persistence: Lily, Kundera
21. Follow-Up
• Part 2:
Building a key-value data store in HBase
Challenges we faced in SMART
• {Rahul, vinay}@briotribes.com
23. HBase: Use Case (Facebook)
• Facebook Messaging:
Titan
1.5 M ops per second at peak
6B+ messages per day
16 columns per operation across different families
• Facebook Insights:
Puma
Provides developers and Page owners with metrics about their
content
> 1 M counter increments per second