Advance Hive, NoSQL Database (HBase) - Module 7

HiveQL: Data Manipulation
Loading Data into Managed Tables
• Hive traditionally has no row-level insert, update, or delete operations.
• Data can only be loaded into tables through “bulk” load operations.
LOAD DATA LOCAL INPATH '/usr/hive/warehouse/california-employees'
OVERWRITE INTO TABLE employees
PARTITION (country = 'US', state = 'CA');

Inserting Data into Tables from Queries
• The INSERT statement loads data into a table from a query.
• With OVERWRITE, any previous contents of the partition (or
whole table if not partitioned) are replaced.

Dynamic Partition Inserts
• With the dynamic partition feature, Hive infers the partitions to create
based on query parameters: the partition values are taken from the trailing
columns of the SELECT clause.
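The routing idea behind dynamic partitions can be illustrated outside Hive. The following is a minimal Java sketch, not Hive's implementation: it assumes that the partition values are the trailing fields of each query result row, and the class name, method name, and sample rows are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DynamicPartitionSketch {

    // Groups result rows by the values of their trailing partition columns.
    // Assumption: the last partitionNames.length fields of each row hold the partition values.
    static Map<String, List<String[]>> route(List<String[]> rows, String... partitionNames) {
        Map<String, List<String[]>> partitions = new TreeMap<>();
        int p = partitionNames.length;
        for (String[] row : rows) {
            int dataCols = row.length - p;
            StringBuilder path = new StringBuilder();
            for (int i = 0; i < p; i++) {
                if (i > 0) {
                    path.append('/');
                }
                path.append(partitionNames[i]).append('=').append(row[dataCols + i]);
            }
            // Create the partition on first sight, then store only the data columns in it.
            partitions.computeIfAbsent(path.toString(), k -> new ArrayList<>())
                      .add(Arrays.copyOf(row, dataCols));
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"Alice", "US", "CA"},
            new String[] {"Bob", "US", "OR"},
            new String[] {"Carol", "US", "CA"});
        route(rows, "country", "state").forEach((partition, data) ->
            System.out.println(partition + " -> " + data.size() + " row(s)"));
    }
}
```

No partition is named in advance here: each distinct (country, state) pair found in the rows produces its own partition, which is the behavior the bullet above describes.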

Creating Tables and Loading Them in One Query
Exporting Data

User Defined Functions
• Hive has the ability to use User Defined Functions written in Java to perform
computations that would otherwise be difficult (or impossible) to perform
using the built-in Hive functions and SQL commands.
• To invoke a UDF from within a Hive script, it is required to:
1. Register the JAR file that contains the UDF class, and
2. Define an alias for the function using the CREATE TEMPORARY FUNCTION
command.

import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.hive.ql.exec.UDF;

public class UDFZodiacSign extends UDF {
  private final SimpleDateFormat df = new SimpleDateFormat("MM-dd-yyyy");
  // Date input: getMonth() is 0-based and getDate() is the day of the month
  // (getDay() would return the day of the week).
  public String evaluate(Date bday) {
    return evaluate(bday.getMonth() + 1, bday.getDate());
  }
  // String input in MM-dd-yyyy form; returns null if it cannot be parsed.
  public String evaluate(String bday) {
    Date date;
    try { date = df.parse(bday); } catch (Exception ex) { return null; }
    return evaluate(date.getMonth() + 1, date.getDate());
  }
  // Only January and February are handled here; other months return null.
  public String evaluate(Integer month, Integer day) {
    if (month == 1) { return day < 20 ? "Capricorn" : "Aquarius"; }
    if (month == 2) { return day < 19 ? "Aquarius" : "Pisces"; }
    return null;
  }
}

HBase: Introduction to HBase
• HBase is a distributed, column-oriented data store built on top of HDFS.
• HBase is an Apache open source project whose goal is to provide storage for
Hadoop distributed computing.
• Data is logically organized into tables, rows, and columns.

HBase vs. HDFS
• HDFS is good for batch processing (scans over big files), but it is:
  • Not good for record lookup
  • Not good for incremental addition of small batches
  • Not good for updates
• HBase is designed to efficiently address the above points:
  • Fast record lookup
  • Support for record-level insertion
  • Support for updates (not in place)
Tables, Rows, Column family
• Table: HBase organizes data into tables. Table names are strings composed of
characters that are safe for use in a file system path.
• Row: Within a table, data is stored according to its row. Rows are identified
uniquely by their row key. Row keys do not have a data type and are always
treated as a byte[ ] (byte array).
• Column Family: Data within a row is grouped by column family. Column families
also impact the physical arrangement of data stored in HBase. For this reason,
they must be defined up front and are not easily modified. Every row in a table
has the same column families, although a row need not store data in all its
families. Column family names are strings composed of characters that are safe for
use in a file system path.

Column, Cell, Timestamp
• Column Qualifier: Data within a column family is addressed via its column
qualifier, or simply, column. Column qualifiers need not be specified in advance.
Column qualifiers need not be consistent between rows. Like row keys, column
qualifiers do not have a data type and are always treated as a byte[ ].
• Cell: A combination of row key, column family, and column qualifier uniquely
identifies a cell. The data stored in a cell is referred to as that cell’s value. Values
also do not have a data type and are always treated as a byte[ ].
• Timestamp: Values within a cell are versioned. Versions are identified by their
version number, which by default is the timestamp of when the cell was written.
If a timestamp is not specified during a write, the current timestamp is used. If
the timestamp is not specified for a read, the latest version is returned. The number
of cell value versions retained by HBase is configured for each column family. The
default number of cell versions is three in older HBase releases (one since HBase 0.96).
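The versioning rules for a single cell can be sketched in plain Java. This is an illustrative model only: the class and method names are hypothetical, values are held as String rather than byte[ ], and excess versions are pruned eagerly on write, whereas HBase itself drops them later during compactions.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

class VersionedCellSketch {
    private final int maxVersions;
    // Newest timestamp first, so the first entry is always the latest version.
    private final TreeMap<Long, String> versions =
        new TreeMap<>(Comparator.<Long>reverseOrder());

    VersionedCellSketch(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    // Write with an explicit timestamp; drop the oldest versions beyond the limit.
    void put(long timestamp, String value) {
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollLastEntry();
        }
    }

    // Write without a timestamp: the current time is used.
    void put(String value) {
        put(System.currentTimeMillis(), value);
    }

    // Read without a timestamp: the latest version is returned (null if the cell is empty).
    String get() {
        Map.Entry<Long, String> latest = versions.firstEntry();
        return latest == null ? null : latest.getValue();
    }

    // Read one specific version (null if it was never written or has been dropped).
    String get(long timestamp) {
        return versions.get(timestamp);
    }

    public static void main(String[] args) {
        VersionedCellSketch cell = new VersionedCellSketch(3);
        for (long ts = 1; ts <= 4; ts++) {
            cell.put(ts, "value-" + ts);
        }
        System.out.println("latest: " + cell.get());
        System.out.println("version 1 still present: " + (cell.get(1L) != null));
    }
}
```

With a limit of three versions, the fourth write pushes out the oldest one, while a read without a timestamp keeps returning the newest value.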

Representation as a Multi Dimensional Map
SortedMap<RowKey, List<SortedMap<Column, List<Value, Timestamp>>>>
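The notation above is schematic rather than valid Java. One way to read it is as nested sorted maps, sketched below under simplifying assumptions: keys and values are String instead of byte[ ], the column family level is made explicit, and all class and method names are hypothetical.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

class TableAsMapSketch {
    // row key -> column family -> column qualifier -> timestamp (newest first) -> value
    private final NavigableMap<String,
        NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>> rows =
            new TreeMap<>();

    // Stores one cell version, creating the row, family, and qualifier maps on demand.
    void put(String row, String family, String qualifier, long timestamp, String value) {
        rows.computeIfAbsent(row, k -> new TreeMap<>())
            .computeIfAbsent(family, k -> new TreeMap<>())
            .computeIfAbsent(qualifier,
                k -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    // Returns the newest value of a cell, or null if the cell does not exist.
    String get(String row, String family, String qualifier) {
        NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> families =
            rows.get(row);
        if (families == null) {
            return null;
        }
        NavigableMap<String, NavigableMap<Long, String>> columns = families.get(family);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, String> versions = columns.get(qualifier);
        return (versions == null || versions.isEmpty()) ? null : versions.firstEntry().getValue();
    }

    // Row keys in sorted order, the order in which a scan would visit them.
    Set<String> rowKeys() {
        return rows.keySet();
    }

    public static void main(String[] args) {
        TableAsMapSketch table = new TableAsMapSketch();
        table.put("row2", "info", "name", 1L, "Bob");
        table.put("row1", "info", "name", 1L, "Alice");
        table.put("row1", "info", "name", 2L, "Alicia");
        System.out.println(table.rowKeys());
        System.out.println(table.get("row1", "info", "name"));
    }
}
```

Because every level is a sorted map, rows come back in key order regardless of insertion order, and rows are free to hold different qualifiers within the same family.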

HBase Table as Key-Value Store

Client API: Administrative API

Client API: CRUD Operations put()

Client API: CRUD Operations get()

Client API: CRUD Operations delete()

HBase Clients
• Java Client
  • Useful when the interacting application is written in Java.
• REST and Thrift
  • HBase ships with REST and Thrift interfaces. These are useful when the
  interacting application is written in a language other than Java.

HBase MapReduce Integration
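import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SimpleRowCounter extends Configured implements Tool {

  // Map-only job: each input row increments a counter; nothing is emitted.
  static class RowCounterMapper extends TableMapper<ImmutableBytesWritable, Result> {
    public static enum Counters { ROWS }

    @Override
    public void map(ImmutableBytesWritable row, Result value, Context context) {
      context.getCounter(Counters.ROWS).increment(1);
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 1) {
      System.err.println("Usage: SimpleRowCounter <tablename>");
      return -1;
    }
    String tableName = args[0];
    Scan scan = new Scan();
    // Return only the first cell of each row, which is enough for counting.
    scan.setFilter(new FirstKeyOnlyFilter());

    Job job = Job.getInstance(getConf(), getClass().getSimpleName());
    job.setJarByClass(getClass());
    TableMapReduceUtil.initTableMapperJob(tableName, scan,
        RowCounterMapper.class, ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(HBaseConfiguration.create(), new SimpleRowCounter(), args);
    System.exit(exitCode);
  }
}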
