11. 11
Yahoo! Cloud Serving Benchmark
• 3 HBase nodes on Solaris zones
Operation   Throughput      Average Response Time   Max Response Time
Write       1808 writes/s   1.6 ms                   0.02% > 1 s (due to region splitting)
Read        9846 reads/s    0.3 ms                   45 ms
13. 13
Setting up Hadoop
• Supported Platforms
• Linux – best
• Solaris – ok. Just works
• Windows – not recommended
• Required Software
• JDK 1.6.x
• SSH
• Packages
• Cloudera
14. 14
Match Hadoop & HBase Version
Hadoop version            HBase version   Compatible?
0.20.3 release            0.90.x          NO
0.20-append               0.90.x          YES
0.20.5 release            0.90.x          YES
0.21.0 release            0.90.x          NO
0.22.x (in development)   0.90.x          NO
15. 15
Running Modes of Hadoop
• Standalone Operation
By default, Hadoop runs in non-distributed mode as a single Java process, which is useful for
debugging.
• Pseudo-Distributed Operation
Run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs
in a separate Java process.
• Fully-Distributed Operation
Runs on a cluster; this is the real production environment.
18. 18
Map Reduce Job
MapReduce is a programming model for data processing on Hadoop. It works by breaking the
processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs
as input and output, the types of which may be chosen by the programmer.
• Mapper
A Mapper usually processes the input one line at a time, ignoring the useless lines and collecting
the useful information into <Key, Value> pairs.
• Reducer
Receives the <Key, <Value1, Value2, …>> pairs grouped from the Mappers' output, aggregates the
values, and writes the results as <Key, Value> pairs.
20. 20
Serialization in Hadoop
Java type   Writable implementation
int         IntWritable
long        LongWritable
boolean     BooleanWritable
byte        ByteWritable
float       FloatWritable
double      DoubleWritable
String      Text
null        NullWritable
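These built-in Writables cover the common primitive types. As the notes at the end of this deck mention, you can also implement your own types against the Writable interface. A minimal sketch (the PageView type and its fields are purely illustrative, not part of the slides):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Illustrative custom type serialized with Hadoop's Writable contract.
public class PageView implements Writable {
  private String page;
  private long hits;

  public PageView() {}                                       // Hadoop needs a no-arg constructor
  public PageView(String page, long hits) { this.page = page; this.hits = hits; }

  public void write(DataOutput out) throws IOException {     // serialize fields in a fixed order
    out.writeUTF(page);
    out.writeLong(hits);
  }

  public void readFields(DataInput in) throws IOException {  // deserialize in the same order
    page = in.readUTF();
    hits = in.readLong();
  }
}
A type used as a map output key would additionally need to implement WritableComparable so it can be sorted during the shuffle.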
21. 21
Example: WordCount
Before we jump into the details, let's walk through an example MapReduce application to get a flavour of how it
works. WordCount is a simple application that counts the number of occurrences of each word in a given input set.
// imports: java.io.IOException, java.util.Iterator, java.util.StringTokenizer,
//          org.apache.hadoop.io.*, org.apache.hadoop.mapred.*
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}

public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Notes on the code:
• The type parameters of Mapper give the input and output <Key, Value> data formats.
• Map and Reduce must extend MapReduceBase and implement the Mapper / Reducer interfaces.
• map() puts each word as the Key and its occurrence count (1) as the Value into the collector.
• The Reducer's input <Key, Value> data format must match the output format of the Mapper.
22. 22
MapReduce Job Configuration
Before running a MapReduce job, the following fields should be set:
• Mapper Class
The mapper class you wrote, to be run by the job.
• Reducer Class
The reducer class you wrote, to be run by the job.
• Input Format & Output Format
Define the format of all inputs and outputs. A large number of formats are supported in the
Hadoop library.
• OutputKeyClass & OutputValueClass
The data type classes of the outputs that Mappers send to Reducers.
23. 23
Example: WordCount
Code to run the job
public class WordCount {
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
Notes on the code: set the output key & value classes; set the Mapper & Reducer classes;
set the InputFormat & OutputFormat classes; set the input & output paths.
25. 25
Example in perf-log
Here is an example of using MapReduce to analyze the log files in the perf-log project.
The log files contain two kinds of record, and each record is a single line:
Event Level
Request Level
26. 26
Example Using MapReduce
Here we use a MapReduce job to calculate the most used event of each day. The event records are
collected in the map phase and the most used event per day is determined in the reduce phase (see the sketch after the data flow below).
Input log (event records interleaved with request records):
  event PLT_LOGIN
  request record…
  request record…
  event PM_HOME
  request record…
  event PM_OPENFORM
  request record…
  request record…
  request record…
  event CDP_LOGOUT
  request record…
  request record…
  request record…
  …

Map output — (date, event) pairs:
  (11/12, PLT_LOGIN) (11/12, PM_HOME) (11/12, PLT_LOGIN) (11/12, PM_LOGOUT) …
  (11/13, CDP_LOGIN) (11/13, CDP_LOGIN) …

Shuffle (automatic) — values grouped by key:
  (11/12, [PLT_LOGIN, PM_HOME, PLT_LOGIN, PLT_LOGOUT…])
  (11/13, [CDP_LOGIN, CDP_LOGIN…])

Reduce output — the most used event per day:
  (11/12, PLT_LOGIN)
  (11/13, PM_HOME)
  (11/14, CDP_HOME)
  (11/15, PLT_HOME)
  …
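A sketch of such a job in the same old-style API as the WordCount example. The MostUsedEvent class name is illustrative, and the parsing assumes a hypothetical event-record layout of "<date> event <EVENT_NAME> …"; adapt it to the real perf-log format.
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class MostUsedEvent {
  // Mapper: emit a (date, event name) pair for every event-level record, skip request records.
  public static class EventMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      String[] fields = value.toString().split("\\s+");
      if (fields.length >= 3 && "event".equals(fields[1])) { // hypothetical "<date> event <NAME>" layout
        output.collect(new Text(fields[0]), new Text(fields[2]));
      }
    }
  }

  // Reducer: count the events of each day and emit the most used one.
  public static class EventReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      Map<String, Integer> counts = new HashMap<String, Integer>();
      while (values.hasNext()) {
        String event = values.next().toString();
        Integer c = counts.get(event);
        counts.put(event, c == null ? 1 : c + 1);
      }
      String mostUsed = null;
      int best = -1;
      for (Map.Entry<String, Integer> e : counts.entrySet()) {
        if (e.getValue() > best) { mostUsed = e.getKey(); best = e.getValue(); }
      }
      output.collect(key, new Text(mostUsed)); // (date, most used event)
    }
  }
}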
28. 28
Table Structure
Tables in HBase have the following features:
1. They are large, sparsely populated tables.
2. Each row has a row key.
3. Table rows are sorted by row key, the table’s primary key.
By default, the sort is byte-ordered.
4. Row columns are grouped into column families. A table’s
column families must be specified up front as part of the
table schema definition and can not be changed.
5. New column family members can be added on demand.
29. 29
Table Structure
Here is the table structure of "perflog" in the perf-log
project:
row key   column family "event"                    column family "req"
          (qualifiers: event_name, event_id, …)    (qualifiers: req1, req1_id, …, req2, req2_id, …)
row1      xxx   xxx   …                            xxx   xxx   …   xxx   xxx   …
row2      xxx   xxx   …                            xxx   xxx   …   xxx   xxx   …
Each xxx cell holds a column value.
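Assuming the structure above, a minimal sketch of creating such a schema through the HBase 0.90 client API (the shell's create command, shown on a later slide, does the same thing):
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Create the "perflog" table with its two column families declared up front.
HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
HTableDescriptor desc = new HTableDescriptor("perflog");
desc.addFamily(new HColumnDescriptor("event"));  // column families are fixed at table creation
desc.addFamily(new HColumnDescriptor("req"));    // new qualifiers inside a family can be added on demand
admin.createTable(desc);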
30. 30
Column Design
When designing column families and qualifiers, pay
attention to the following two points:
1. Keep the number of column families in your schema low.
HBase currently does not do well with anything above two or three column families.
2. Keep column family and qualifier names as short as
possible.
Operating on a table in HBase causes a very large number of comparisons on column names,
so short names improve performance.
31. 31
HBase Command Shell
HBase provides a command shell to operate the
system. Here are some example commands:
• Status
• Create
• List
• Put
• Scan
• Disable & Drop
33. 33
API to Operate Tables in HBase
There are four main methods to operate a table in
HBase:
• Get
• Put
• Scan
• Delete
** Put and Scan are widely used in the perf-log project.
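A minimal sketch of the four operations against the 0.90 client API; the table name, row key, and column used here are illustrative:
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

HTable table = new HTable(HBaseConfiguration.create(), "perflog");

// Put: add a new row or update an existing one
Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("event"), Bytes.toBytes("name"), Bytes.toBytes("PLT_LOGIN"));
table.put(put);

// Get: read the attributes of one row
Result row = table.get(new Get(Bytes.toBytes("row1")));
byte[] name = row.getValue(Bytes.toBytes("event"), Bytes.toBytes("name"));

// Scan: iterate over multiple rows
ResultScanner scanner = table.getScanner(new Scan());
for (Result r : scanner) {
  // process each row here
}
scanner.close();

// Delete: remove a row
table.delete(new Delete(Bytes.toBytes("row1")));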
34. 34
Using Put & Scan in HBase
When using put in HBase, notice:
• AutoFlush
• WAL on Puts
When using scan in HBase, notice:
• Scan Attribute Selection
• Scan Caching
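A sketch of where these knobs live in the 0.90 client API, assuming an HTable named table and a Put named put as in the earlier example; the values are illustrative:
// Put-side tuning
table.setAutoFlush(false);  // buffer Puts client-side; far fewer RPCs when writing a lot of data
put.setWriteToWAL(false);   // skip the Write-Ahead Log: faster, but data is lost on RegionServer failure
table.put(put);
table.flushCommits();       // flush the client write buffer explicitly (close() also flushes it)

// Scan-side tuning
Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("event"));  // select only the column family you actually need
scan.setCaching(500);                    // transfer 500 rows per RPC instead of the default 1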
35. 35
Using Scan with Filter
HBase filters are a powerful feature that can greatly enhance your effectiveness
working with data stored in tables. Four filters are used in the perf-log project:
• SingleColumnValueFilter
You can use this filter when you have exactly one column that decides if an entire row
should be returned or not.
• RowFilter
This filter gives you the ability to filter data based on row keys.
• PageFilter
You paginate through rows by employing this filter.
• FilterList
Enables you to use several filters at the same time.
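For example, a SingleColumnValueFilter keeping only PLT_LOGIN events and a RowFilter matching part of the row key can be attached to a scan as sketched below; the column names follow the perflog example and the comparator value is illustrative:
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.*;
import org.apache.hadoop.hbase.util.Bytes;

// Keep only rows whose event name column equals "PLT_LOGIN"
Filter eventFilter = new SingleColumnValueFilter(
    Bytes.toBytes("event"), Bytes.toBytes("name"),
    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("PLT_LOGIN"));

// Keep only rows whose row key contains the given substring
Filter rowFilter = new RowFilter(
    CompareFilter.CompareOp.EQUAL, new SubstringComparator("11/12"));

Scan scan = new Scan();
scan.setFilter(eventFilter);  // or rowFilter, or a FilterList combining several filters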
36. 36
Using Scan with Filter
• PageFilter
There is a fundamental issue with filtering on physically separate servers: filters run on
different region servers in parallel and cannot retain or communicate their current state
across those boundaries, and each filter scans at least up to pageCount rows before ending
the scan. Thus you may get more rows than you really want.
Filter filter = new PageFilter(5); // 5 is the pageCount
int totalRows = 0;
byte[] lastRow = null;
while (true) {
  Scan scan = new Scan();
  scan.setFilter(filter);
  if (lastRow != null) {
    // start the next page just after the last row already seen
    byte[] startRow = Bytes.add(lastRow, new byte[] { 0 });
    scan.setStartRow(startRow);
  }
  ResultScanner scanner = table.getScanner(scan);
  int localRows = 0;
  Result result;
  while ((result = scanner.next()) != null) {
    localRows++;
    totalRows++;
    lastRow = result.getRow();
  }
  scanner.close();
  if (localRows == 0) break; // no more rows: stop paging
}
37. 37
Using Scan with Filter
• FilterList
When using multiple filters with a FilterList, note that adding the filters to the FilterList
in different orders produces different results.
pageFilter = new PageFilter(5);
singleColumnValueFilter = new SingleColumnValueFilter(
    Bytes.toBytes("event"), Bytes.toBytes("name"),
    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("PLT_LOGIN"));

// Page filter first: take the first 5 records, then return the ones
// whose event name is "PLT_LOGIN".
filterList = new FilterList();
filterList.addFilter(pageFilter);
filterList.addFilter(singleColumnValueFilter);

// Column value filter first: take all the records whose event name is
// "PLT_LOGIN", then return the first 5 of them.
filterList = new FilterList();
filterList.addFilter(singleColumnValueFilter);
filterList.addFilter(pageFilter);
38. 38
Map Reduce with HBase
Here is an example:
static class MyMapper<K, V> extends MapReduceBase implements Mapper<LongWritable, Text, K, V> {
  private HTable table;

  @Override
  public void configure(JobConf jc) {
    super.configure(jc);
    try {
      this.table = new HTable(HBaseConfiguration.create(), "table_name");
    } catch (IOException e) {
      throw new RuntimeException("Failed HTable construction", e);
    }
  }

  @Override
  public void close() throws IOException {
    super.close();
    table.close();
  }

  public void map(LongWritable key, Text value, OutputCollector<K, V> output, Reporter reporter) throws IOException {
    Put p = new Put(…); // a Put needs a row key (byte[]); build it from your input record
    …                   // set your own column values on the put
    table.put(p);
  }
}
39. 39
Bulk Load
HBase includes several methods of loading data into tables. The most
straightforward method is to either use a MapReduce job, or use the normal
client APIs; however, these are not always the most efficient methods.
The bulk load feature uses a MapReduce job to output table data in HBase's
internal data format, and then directly loads the data files into a running
cluster. Using bulk load will use less CPU and network resources than simply
using the HBase API.
Data Files → MapReduce Job → HFiles → HBase
40. 40
Bulk Load
Notice that we use HFileOutputFormat as the output format of the MapReduce job used to
generate the HFiles. However, the HFileOutputFormat provided by the HBase library does
NOT support writing multiple column families into HFiles.
A multi-family-capable version of HFileOutputFormat can be found here:
https://review.cloudera.org/r/1272/diff/1/?file=17977#file17977line93
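A sketch of the typical bulk-load flow, assuming a job whose mapper emits ImmutableBytesWritable/Put pairs; the output path and table name are illustrative:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "perflog");

Job job = new Job(conf, "perflog-bulkload");
// ... set the jar, mapper, input format and input path of your own job here ...
FileOutputFormat.setOutputPath(job, new Path("/tmp/perflog-hfiles"));
HFileOutputFormat.configureIncrementalLoad(job, table);  // partitions & sorts output to match the table's regions
job.waitForCompletion(true);

// Move the generated HFiles directly into the running cluster.
new LoadIncrementalHFiles(conf).doBulkLoad(new Path("/tmp/perflog-hfiles"), table);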
41. 41
Thank You, and Questions
See More About Hadoop & HBase:
http://confluence.successfactors.com/display/ENG/Programming+experience+on+Hadoop+&+HBase
Editor's Notes
http://hadoop.apache.org/common/docs/r0.20.2/hdfs_design.html
Data blocks are automatically replicated across Data Nodes. Fault-tolerant. Default number of replicas is 3.
Share-nothing architecture. Add data nodes to increase disk capacity and I/O throughput.
Due to replication and internal structure, the usable capacity will be less than 1/3 of the raw capacity.
Name Node manages file system’s metadata.
SPOF. Need HA and backup.
Workload increases along with files/blocks number and operations. Potential bottleneck.
Job Tracker manages Map/Reduce job execution. Often runs along with Name Node.
Job is split into tasks. Task Tracker manages task execution. Runs on Data Nodes.
Natural distributed parallel computing architecture.
Web console to monitor job/task.
The "hadoop" command runs jobs and manages nodes and the file system. In particular, "hadoop fs" provides many Unix-like commands to access HDFS.
HMaster manages region servers. It normally runs with Hadoop NameNode together.
Data are sorted by row key and split into regions, which are managed by region server. Region servers often run on data nodes.
Each region includes one MemStore and several store files.
Data writes are recorded in the Write-Ahead Log (HLog; by default it is flushed to disk every second) and written into the MemStore.
When Memstore becomes full, it is flushed to HDFS as a store file.
Full operations: get, put, scan, delete.
GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
Win32 is supported as a development platform. Distributed operation has not been well tested on Win32, so it is not supported as a production platform.
Required software for Linux and Windows include:
Java 1.6.x, preferably from Sun, must be installed.
ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons.
Additional requirements for Windows include:
Cygwin - Required for shell support in addition to the required software above.
1. Hadoop Version
The most used Hadoop versions:
0.20.203.X
The current stable version. Does NOT contain the entire new MapReduce API and does NOT have the sync attribute on HDFS. Currently used in the perf-log project.
0.20.205.X
The current beta version. Does NOT contain the entire new MapReduce API, but has the sync attribute on HDFS.
0.21.X
The newest version. Provides the entire new MapReduce API, but is unstable, unsupported, does not include security, and cannot run with HBase.
2. Running HBase on Hadoop
2. Running HBase on Hadoop
The newest version of HBase is 0.90.x. This version of HBase will only run on Hadoop 0.20.x; it will not run on Hadoop 0.21.x (nor 0.22.x). HBase will lose data unless it is running on an HDFS that has a durable sync. Hadoop 0.20.2 and Hadoop 0.20.203.0 do NOT have this attribute. Choose one of the following solutions:
HBase bundles an instance of the Hadoop jar under its lib directory. The bundled Hadoop was made from the Apache branch-0.20-append branch at the time of the HBase release and has the sync attribute. Replace the Hadoop jar you are running on your cluster with the Hadoop jar found in the HBase lib directory.
You could use the Cloudera or MapR distributions. Cloudera's CDH3 is Apache Hadoop 0.20.x plus patches including all of the 0.20-append additions needed to add a durable sync. CDH3 ships both Hadoop and HBase.
Just use Hadoop 0.20.205.0. Since this release includes a merge of the append/hsync/hflush capabilities from the 0.20-append branch, it can support HBase in secure mode. But it is a beta version.
In Hadoop MapReduce, interprocess communication between nodes in the system is implemented using remote procedure calls (RPCs). The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node, which then deserializes the binary stream into the original message.
Hadoop uses its own serialization format, Writables, which is certainly compact and fast (but not so easy to extend, or use from languages other than Java).
The Hadoop library provides many basic data types to be used in MapReduce, and you can also implement your own data structures according to the Writable interfaces.
Status
Show the status of all nodes in HBase.
Create
Create a table.
List
List all the existing tables.
Put
Put either adds new rows to a table (if the key is new) or can update existing rows (if the key already exists).
Scan
Scan allows iteration over multiple rows for specified attributes of a certain table.
Disable & Drop
To delete a table, first disable it, then drop it.
Get
Get returns attributes for a specified row.
Put
Put either adds new rows to a table (if the key is new) or can update existing rows (if the key already exists).
Scan
Scan allows iteration over multiple rows for specified attributes. It can be used with filters and provides powerful query functions on HBase.
Delete
Delete removes a row from a table.
When using put in HBase, notice:
AutoFlush
AutoFlush gives real-time behaviour: you can see the row immediately after it is added to the table. But when performing a lot of Puts, make sure that setAutoFlush is set to false on your HTable instance. Otherwise, the Puts will be sent one at a time to the RegionServer. With autoFlush = false, these messages are not sent until the write-buffer is filled, which reduces the number of client RPC calls. To explicitly flush the messages, call flushCommits. Calling close on the HTable instance will invoke flushCommits.
WAL on Puts
WAL means Write-Ahead Log. Turning this off means that the RegionServer will not write the Put to the Write-Ahead Log, only into the memstore, which improves performance. However, turning it off is not recommended, because if there is a RegionServer failure there will be data loss.
When using scan in HBase, notice:
Scan Attribute Selection
Whenever a Scan is used to process large numbers of rows, be aware of which attributes are selected. Call scan.addFamily to select only the specific columns you want rather than getting the entire row, because attribute over-selection is a non-trivial performance penalty over large datasets.
Scan Caching
When performing a large number of Scans, make sure that the input Scan instance has setCaching set to something greater than the default (which is 1). Setting this value to 500, for example, will transfer 500 rows at a time to the client to be processed. There is a cost/benefit to having a large cache value because it costs more memory for both the client and the RegionServer, so bigger isn't always better.
HBase can be both the input and output of a Map Reduce Job. In the perf-log project, we use HBase as the output of the MR job and it is best to obey the following rules:
Get one HTable instance
There is a cost instantiating an HTable, so if you do this for each insert, you may have a negative impact on performance. Hence our setup of HTable in the configure() step.
Skip the Reducer if possible
When writing a lot of data to an HBase table from a MR job and specifically where Puts are being emitted from the Mapper, skip the Reducer step. When a Reducer step is used, all of the output (Puts) from the Mapper will get spooled to disk, then sorted/shuffled to other Reducers that will most likely be off-node. It's far more efficient to just write directly to HBase.