1 
Hadoop & HBase Experiences in Perf-Log Project 
Eric Geng & Gary Zhao 
Performance Team, Platform 11/24/2011
2 
Introduction
3 
Architecture of Perf-Log Project
4 
Perf Log Format 
• Event Level 
• Request Level
5 
Event Configuration
6 
Reports and Charts
7 
Log Lookup
8 
HDFS Architecture
9 
MapReduce
10 
HBase Architecture
11 
Yahoo! Cloud Serving Benchmark 
• 3 HBase nodes on Solaris zones 
             Throughput       Average Response Time   Max Response Time 
Write        1808 writes/s    1.6 ms                  0.02% > 1 s (due to region splitting) 
Read         9846 reads/s     0.3 ms                  45 ms
12 
Hadoop Configuration Overview
13 
Setting up Hadoop 
• Supported Platforms 
• Linux – best 
• Solaris – ok. Just works 
• Windows – not recommended 
• Required Software 
• JDK 1.6.x 
• SSH 
• Packages 
• Cloudera
14 
Match Hadoop & HBase Version 
Hadoop version             HBase version   Compatible? 
0.20.3 release             0.90.x          NO 
0.20-append                0.90.x          YES 
0.20.5 release             0.90.x          YES 
0.21.0 release             0.90.x          NO 
0.22.x (in development)    0.90.x          NO
15 
Running Modes of Hadoop 
• Standalone Operation 
By default, Hadoop runs in a non-distributed mode as a single Java process, which is useful 
for debugging. 
• Pseudo-Distributed Operation 
Runs on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a 
separate Java process. 
• Fully-Distributed Operation 
Runs on a cluster; this is the real production environment.
16 
Web Access to Hadoop
17 
Map Reduce Job Guide
18 
Map Reduce Job 
MapReduce is a programming model for data processing on Hadoop. It works by breaking the 
processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs 
as input and output, the types of which may be chosen by the programmer. 
• Mapper 
A Mapper usually processes the data one line at a time. It ignores useless lines and collects the 
useful information into <Key, Value> pairs. 
• Reducer 
A Reducer receives <Key, <Value1, Value2, …>> pairs from the Mappers, aggregates the statistics, 
and writes the results as <Key, Value> pairs.
19 
Data Flow
20 
Serialization in Hadoop 
int IntWritable 
long LongWritable 
boolean BooleanWritable 
byte ByteWritable 
float FloatWritable 
double DoubleWritable 
String Text 
null NullWritable
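The built-in Writable wrappers above cover the key and value types used in this deck. If none of them fits, you can implement the Writable interface yourself; below is a minimal sketch (the EventCount type and its fields are illustrative, not part of the perf-log code). 

import java.io.DataInput; 
import java.io.DataOutput; 
import java.io.IOException; 
import org.apache.hadoop.io.Writable; 

// Minimal custom Writable carrying an event name and a count. 
public class EventCount implements Writable { 
  private String eventName; 
  private int count; 

  public EventCount() { }                       // no-arg constructor required by Hadoop 

  public EventCount(String eventName, int count) { 
    this.eventName = eventName; 
    this.count = count; 
  } 

  @Override 
  public void write(DataOutput out) throws IOException { 
    out.writeUTF(eventName);                    // serialize the fields in a fixed order 
    out.writeInt(count); 
  } 

  @Override 
  public void readFields(DataInput in) throws IOException { 
    eventName = in.readUTF();                   // deserialize in the same order 
    count = in.readInt(); 
  } 
} 

To use such a type as a map output key it would also need to implement WritableComparable so the shuffle can sort it.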
21 
Example: WordCount 
Before we jump into the details, let's walk through an example MapReduce application to get a flavour for how they 
work. WordCount is a simple application that counts the number of occurrences of each word in a given input set. 
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { 
private final static IntWritable one = new IntWritable(1); 
private Text word = new Text(); 
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws 
IOException { 
String line = value.toString(); 
StringTokenizer tokenizer = new StringTokenizer(line); 
while (tokenizer.hasMoreTokens()) { 
word.set(tokenizer.nextToken()); 
output.collect(word, one); 
} 
} 
} 
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { 
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws 
IOException { 
int sum = 0; 
while (values.hasNext()) { 
sum += values.next().get(); 
} 
output.collect(key, new IntWritable(sum)); 
} 
} 
input Key - Value 
data format 
output Key - Value 
data format 
must be extended and 
implemented 
put the word as Key, occurrence as 
Value into the collector 
input Key - Value data format matches 
the output format of the Mapper
22 
MapReduce Job Configuration 
Before running a MapReduce job, the following fields should be set: 
• Mapper Class 
The mapper class you wrote, to be run in the map phase. 
• Reducer Class 
The reducer class you wrote, to be run in the reduce phase. 
• Input Format & Output Format 
Define the format of all inputs and outputs. A large number of formats are supported in 
the Hadoop library. 
• OutputKeyClass & OutputValueClass 
The data type classes of the job's output key and value (also used for the Mapper output by default).
23 
Example: WordCount 
Code to run the job 
public class WordCount { 
public static void main(String[] args) throws Exception { 
JobConf conf = new JobConf(WordCount.class); 
conf.setJobName("wordcount"); 
conf.setOutputKeyClass(Text.class); 
conf.setOutputValueClass(IntWritable.class); 
conf.setMapperClass(Map.class); 
conf.setReducerClass(Reduce.class); 
conf.setInputFormat(TextInputFormat.class); 
conf.setOutputFormat(TextOutputFormat.class); 
FileInputFormat.setInputPaths(conf, new Path(args[0])); 
FileOutputFormat.setOutputPath(conf, new Path(args[1])); 
JobClient.runJob(conf); 
} 
} 
set output key & value class 
set Mapper & Reducer class 
set InputFormat & OutputFormat class 
set input & output path
24 
Example: WordCount 
Input: 
  Hello World, Bye World 
  Hello Hadoop, Goodbye Hadoop 

Map 1 output:          Map 2 output: 
  <Hello, 1>             <Hello, 1> 
  <World, 1>             <Hadoop, 1> 
  <Bye, 1>               <Goodbye, 1> 
  <World, 1>             <Hadoop, 1> 

Shuffle (automatic): 
  <Bye, <1>> 
  <Goodbye, <1>> 
  <Hadoop, <1,1>> 
  <Hello, <1,1>> 
  <World, <1,1>> 

Reduce output: 
  <Bye, 1> 
  <Goodbye, 1> 
  <Hadoop, 2> 
  <Hello, 2> 
  <World, 2>
25 
Example in perf-log 
Here is an example of using MapReduce to analyze the log files in the perf-log project. 
In the log files there are two kinds of record types, and each record is a single line. 
Event Level 
Request Level
26 
Example Using MapReduce 
Here we use a MapReduce job to calculate the most used event for each day. All the event 
records are collected in Map and the most used events are counted in Reduce; a code sketch 
follows the data flow below. 
Input log records: 
  event PLT_LOGIN 
  request record… 
  request record… 
  event PM_HOME 
  request record… 
  event PM_OPENFORM 
  request record… 
  request record… 
  request record… 
  event CDP_LOGOUT 
  request record… 
  request record… 
  request record… 
  ... 

Map output: 
  (11/12, PLT_LOGIN) 
  (11/12, PM_HOME) 
  (11/12, PLT_LOGIN) 
  (11/12, PM_LOGOUT) 
  ... 
  (11/13, CDP_LOGIN) 
  (11/13, CDP_LOGIN) 
  ... 

Shuffle (automatic): 
  (11/12, [PLT_LOGIN, PM_HOME, PLT_LOGIN, PLT_LOGOUT…]) 
  (11/13, [CDP_LOGIN, CDP_LOGIN…]) 
  ... 

Reduce output: 
  (11/12, PLT_LOGIN) 
  (11/13, PM_HOME) 
  (11/14, CDP_HOME) 
  (11/15, PLT_HOME) 
  ...
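A rough sketch of the Mapper and Reducer behind this flow, in the same old-style API as the WordCount example, might look like the code below. The exact perf-log line layout is not given in this deck, so the parsing here (date in the first field, the literal word "event" in the second) is an assumption for illustration only. 

import java.io.IOException; 
import java.util.HashMap; 
import java.util.Iterator; 
import java.util.Map; 
import org.apache.hadoop.io.LongWritable; 
import org.apache.hadoop.io.Text; 
import org.apache.hadoop.mapred.MapReduceBase; 
import org.apache.hadoop.mapred.Mapper; 
import org.apache.hadoop.mapred.OutputCollector; 
import org.apache.hadoop.mapred.Reducer; 
import org.apache.hadoop.mapred.Reporter; 

public class TopEventPerDay { 

  // Sketch only: assumes each event record looks like "<date> event <EVENT_NAME> ...". 
  public static class EventMapper extends MapReduceBase 
      implements Mapper<LongWritable, Text, Text, Text> { 
    public void map(LongWritable key, Text value, 
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException { 
      String[] fields = value.toString().split("\\s+"); 
      if (fields.length >= 3 && "event".equals(fields[1])) {      // skip request-level records 
        output.collect(new Text(fields[0]), new Text(fields[2])); // (date, event name) 
      } 
    } 
  } 

  public static class TopEventReducer extends MapReduceBase 
      implements Reducer<Text, Text, Text, Text> { 
    public void reduce(Text key, Iterator<Text> values, 
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException { 
      Map<String, Integer> counts = new HashMap<String, Integer>(); 
      while (values.hasNext()) {                                   // count each event for this date 
        String event = values.next().toString(); 
        Integer c = counts.get(event); 
        counts.put(event, c == null ? 1 : c + 1); 
      } 
      String top = null; 
      int max = 0; 
      for (Map.Entry<String, Integer> e : counts.entrySet()) {     // keep the most used event 
        if (e.getValue() > max) { max = e.getValue(); top = e.getKey(); } 
      } 
      output.collect(key, new Text(top));                          // (date, most used event) 
    } 
  } 
}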
27 
HBase API Guide
28 
Table Structure 
Tables in HBase have the following features: 
1. They are large, sparsely populated tables. 
2. Each row has a row key. 
3. Table rows are sorted by row key, the table’s primary key. 
By default, the sort is byte-ordered. 
4. Row columns are grouped into column families. A table’s 
column families must be specified up front as part of the 
table schema definition and cannot be changed. 
5. New column family members (i.e. columns within a family) can be added on demand.
29 
Table Structure 
Here is the table structure of “perflog” in the perf-log 
project: 

row key   column family: event          column family: req 
          event_name  event_id  …       req1  req1_id  …  req2  req2_id  … 
row1      xxx         xxx       …       xxx   xxx      …  xxx   xxx      … 
row2      xxx         xxx       …       xxx   xxx      …  xxx   xxx      … 

(the second header row lists the column qualifiers; each xxx is a column value)
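For reference, a table with these two column families could be created through the 0.90-era Java admin API roughly as in the sketch below; this is an illustration, not the project's actual provisioning code. 

import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.hbase.HBaseConfiguration; 
import org.apache.hadoop.hbase.HColumnDescriptor; 
import org.apache.hadoop.hbase.HTableDescriptor; 
import org.apache.hadoop.hbase.client.HBaseAdmin; 

public class CreatePerflogTable { 
  public static void main(String[] args) throws Exception { 
    Configuration conf = HBaseConfiguration.create(); 
    HBaseAdmin admin = new HBaseAdmin(conf); 

    HTableDescriptor desc = new HTableDescriptor("perflog"); 
    desc.addFamily(new HColumnDescriptor("event"));   // event-level columns 
    desc.addFamily(new HColumnDescriptor("req"));     // request-level columns 

    if (!admin.tableExists("perflog")) { 
      admin.createTable(desc);                        // the families are fixed here, at schema time 
    } 
  } 
}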
30 
Column Design 
When designing column families and qualifiers, pay 
attention to the following two points: 
1. Keep the number of column families in your schema low. 
HBase currently does not do well with anything above two or three column families. 
2. Keep the names of column families and qualifiers as short as 
possible. 
Operating on an HBase table causes thousands upon thousands of comparisons on 
column names, so short names improve performance.
31 
HBase Command Shell 
HBase provides a command shell for operating the 
system. Here are some example commands: 
• Status 
• Create 
• List 
• Put 
• Scan 
• Disable & Drop
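A minimal sketch of a shell session using these commands, reusing the perflog table and column family names from this deck: 

hbase(main):001:0> status 
hbase(main):002:0> create 'perflog', 'event', 'req' 
hbase(main):003:0> list 
hbase(main):004:0> put 'perflog', 'row1', 'event:event_name', 'PLT_LOGIN' 
hbase(main):005:0> scan 'perflog' 
hbase(main):006:0> disable 'perflog' 
hbase(main):007:0> drop 'perflog'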
32 
HBase Command Shell
33 
API to Operate Tables in HBase 
There are four main methods for operating on a table in 
HBase: 
• Get 
• Put 
• Scan 
• Delete 
**Put and Scan are widely used in the perf-log project.
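A compact sketch of all four operations against the 0.90 client API (the row key, qualifier, and values below are placeholders, not the project's real data): 

import org.apache.hadoop.hbase.HBaseConfiguration; 
import org.apache.hadoop.hbase.client.Delete; 
import org.apache.hadoop.hbase.client.Get; 
import org.apache.hadoop.hbase.client.HTable; 
import org.apache.hadoop.hbase.client.Put; 
import org.apache.hadoop.hbase.client.Result; 
import org.apache.hadoop.hbase.client.ResultScanner; 
import org.apache.hadoop.hbase.client.Scan; 
import org.apache.hadoop.hbase.util.Bytes; 

public class PerflogCrudSketch { 
  public static void main(String[] args) throws Exception { 
    HTable table = new HTable(HBaseConfiguration.create(), "perflog"); 

    // Put: add a new row or update an existing one. 
    Put put = new Put(Bytes.toBytes("row1")); 
    put.add(Bytes.toBytes("event"), Bytes.toBytes("event_name"), Bytes.toBytes("PLT_LOGIN")); 
    table.put(put); 

    // Get: read the attributes of one row. 
    Get get = new Get(Bytes.toBytes("row1")); 
    Result row = table.get(get); 
    byte[] name = row.getValue(Bytes.toBytes("event"), Bytes.toBytes("event_name")); 
    System.out.println(Bytes.toString(name)); 

    // Scan: iterate over multiple rows. 
    ResultScanner scanner = table.getScanner(new Scan()); 
    for (Result r : scanner) { 
      // process each row here 
    } 
    scanner.close(); 

    // Delete: remove a row. 
    table.delete(new Delete(Bytes.toBytes("row1"))); 
    table.close(); 
  } 
}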
34 
Using Put & Scan in HBase 
When using put in HBase, notice: 
• AutoFlush 
• WAL on Puts 
When using scan in HBase, notice: 
• Scan Attribute Selection 
• Scan Caching
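The points above correspond to the following client calls. This is only a sketch: the 500-row caching value echoes the editor's notes at the end of this deck, and disabling the WAL is shown only to illustrate the trade-off, not as a recommendation. 

import org.apache.hadoop.hbase.HBaseConfiguration; 
import org.apache.hadoop.hbase.client.HTable; 
import org.apache.hadoop.hbase.client.Put; 
import org.apache.hadoop.hbase.client.ResultScanner; 
import org.apache.hadoop.hbase.client.Scan; 
import org.apache.hadoop.hbase.util.Bytes; 

public class PutScanTuningSketch { 
  public static void main(String[] args) throws Exception { 
    HTable table = new HTable(HBaseConfiguration.create(), "perflog"); 

    // AutoFlush: buffer Puts client-side instead of sending one RPC per Put. 
    table.setAutoFlush(false); 

    Put put = new Put(Bytes.toBytes("row1")); 
    put.add(Bytes.toBytes("event"), Bytes.toBytes("event_name"), Bytes.toBytes("PLT_LOGIN")); 
    // WAL on Puts: skipping the write-ahead log is faster but risks data loss 
    // if a RegionServer fails, so it is generally not recommended. 
    put.setWriteToWAL(false); 
    table.put(put); 
    table.flushCommits();                      // explicitly send the buffered Puts 

    // Scan attribute selection and caching. 
    Scan scan = new Scan(); 
    scan.addFamily(Bytes.toBytes("event"));    // fetch only the family you need 
    scan.setCaching(500);                      // rows transferred per RPC; the default is 1 
    ResultScanner scanner = table.getScanner(scan); 
    scanner.close(); 
    table.close(); 
  } 
}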
35 
Using Scan with Filter 
HBase filters are a powerful feature that can greatly enhance your effectiveness 
when working with data stored in tables. Four filters are used in the perf-log project: 
• SingleColumnValueFilter 
You can use this filter when you have exactly one column that decides if an entire row 
should be returned or not. 
• RowFilter 
This filter gives you the ability to filter data based on row keys. 
• PageFilter 
You paginate through rows by employing this filter. 
• FilterList 
Enables you to combine several filters at the same time.
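Hedged sketches of the first two filters follow; the column family and value reuse names from this deck, while the qualifier and the row-key prefix are made up for illustration. 

import org.apache.hadoop.hbase.client.Scan; 
import org.apache.hadoop.hbase.filter.BinaryPrefixComparator; 
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp; 
import org.apache.hadoop.hbase.filter.RowFilter; 
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter; 
import org.apache.hadoop.hbase.util.Bytes; 

public class FilterSketch { 
  public static Scan buildScan() { 
    // Keep only rows whose event:event_name column equals "PLT_LOGIN". 
    SingleColumnValueFilter eventFilter = new SingleColumnValueFilter( 
        Bytes.toBytes("event"), Bytes.toBytes("event_name"), 
        CompareOp.EQUAL, Bytes.toBytes("PLT_LOGIN")); 
    eventFilter.setFilterIfMissing(true);       // also drop rows that lack the column entirely 

    // Keep only rows whose row key starts with a given (made-up) prefix. 
    RowFilter rowFilter = new RowFilter( 
        CompareOp.EQUAL, new BinaryPrefixComparator(Bytes.toBytes("2011-11-12"))); 

    Scan scan = new Scan(); 
    scan.setFilter(eventFilter);                // or rowFilter, or both combined in a FilterList 
    return scan; 
  } 
}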
36 
Using Scan with Filter 
• PageFilter 
There is a fundamental issue with filtering on physically separate servers: filters run on 
different region servers in parallel and cannot retain or communicate their current state 
across those boundaries, and each filter is required to scan at least up to pageCount 
rows before ending the scan. Thus you may get more rows than you really want. 
Filter filter = new PageFilter(5); // 5 is the pageCount 
int totalRows = 0; 
byte[] lastRow = null; 
while (true) { 
  Scan scan = new Scan(); 
  scan.setFilter(filter); 
  if (lastRow != null) { 
    // Restart just after the last row seen; the extra zero byte excludes lastRow itself. 
    byte[] startRow = Bytes.add(lastRow, new byte[] { 0 }); 
    scan.setStartRow(startRow); 
  } 
  ResultScanner scanner = table.getScanner(scan); 
  int localRows = 0; 
  Result result; 
  while ((result = scanner.next()) != null) { 
    localRows++; 
    totalRows++; 
    lastRow = result.getRow(); 
  } 
  scanner.close(); 
  if (localRows == 0) break; 
}
37 
Using Scan with Filter 
• FilterList 
When using multiple filters with a FilterList, note that adding the filters to the FilterList 
in different orders will produce different results. 
pageFilter = new PageFilter(5); 
singleColumnValueFilter = new SingleColumnValueFilter(Bytes.toBytes("event"), Bytes.toBytes("name"), CompareOp.EQUAL, Bytes.toBytes("PLT_LOGIN")); 

Take the first 5 records and then return 
the ones whose event name is 
"PLT_LOGIN": 
filterList = new FilterList(); 
filterList.addFilter(pageFilter); 
filterList.addFilter(singleColumnValueFilter); 

Take all the records whose event name is 
"PLT_LOGIN" and then return the first 5 
of them: 
filterList = new FilterList(); 
filterList.addFilter(singleColumnValueFilter); 
filterList.addFilter(pageFilter);
38 
Map Reduce with HBase 
Here is an example: 
static class MyMapper<K, V> extends MapReduceBase implements Mapper<LongWritable, Text, K, V> { 
  private HTable table; 
  @Override 
  public void configure(JobConf jc) { 
    super.configure(jc); 
    try { 
      this.table = new HTable(HBaseConfiguration.create(), "table_name"); 
    } catch (IOException e) { 
      throw new RuntimeException("Failed HTable construction", e); 
    } 
  } 
  @Override 
  public void close() throws IOException { 
    super.close(); 
    table.close(); 
  } 
  public void map(LongWritable key, Text value, OutputCollector<K, V> output, Reporter reporter) throws IOException { 
    Put p = new Put(…); // Put must be constructed with a row key (byte[]) 
    … // Set the columns of your own Put. 
    table.put(p); 
  } 
}
39 
Bulk Load 
HBase includes several methods of loading data into tables. The most 
straightforward method is to either use a MapReduce job, or use the normal 
client APIs; however, these are not always the most efficient methods. 
The bulk load feature uses a MapReduce job to output table data in HBase's 
internal data format, and then directly loads the data files into a running 
cluster. Using bulk load will use less CPU and network resources than simply 
using the HBase API. 
Data Files → MapReduce Job → HFiles → HBase
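A rough sketch of that flow using the org.apache.hadoop.hbase.mapreduce API is shown below. The mapper, paths, and row-key derivation are placeholders; HFileOutputFormat.configureIncrementalLoad sets up the reducer, partitioner, and output format so the generated HFiles line up with the table's regions, and LoadIncrementalHFiles then moves them into the running cluster (the same step the completebulkload tool performs). 

import java.io.IOException; 
import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.hbase.HBaseConfiguration; 
import org.apache.hadoop.hbase.client.HTable; 
import org.apache.hadoop.hbase.client.Put; 
import org.apache.hadoop.hbase.io.ImmutableBytesWritable; 
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat; 
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles; 
import org.apache.hadoop.hbase.util.Bytes; 
import org.apache.hadoop.io.LongWritable; 
import org.apache.hadoop.io.Text; 
import org.apache.hadoop.mapreduce.Job; 
import org.apache.hadoop.mapreduce.Mapper; 
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 

public class BulkLoadSketch { 
  // Hypothetical mapper: turns each input line into a (row key, Put) pair. 
  static class LineToPutMapper 
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> { 
    protected void map(LongWritable key, Text value, Context context) 
        throws IOException, InterruptedException { 
      byte[] row = Bytes.toBytes(value.toString());      // placeholder row key 
      Put put = new Put(row); 
      put.add(Bytes.toBytes("event"), Bytes.toBytes("event_name"), row); 
      context.write(new ImmutableBytesWritable(row), put); 
    } 
  } 

  public static void main(String[] args) throws Exception { 
    Configuration conf = HBaseConfiguration.create(); 
    Job job = new Job(conf, "perflog-bulkload"); 
    job.setJarByClass(BulkLoadSketch.class); 
    job.setMapperClass(LineToPutMapper.class); 
    job.setMapOutputKeyClass(ImmutableBytesWritable.class); 
    job.setMapOutputValueClass(Put.class); 
    FileInputFormat.addInputPath(job, new Path(args[0])); 
    FileOutputFormat.setOutputPath(job, new Path(args[1])); 

    HTable table = new HTable(conf, "perflog"); 
    // Configures the reducer, partitioner and HFileOutputFormat to match the table's regions. 
    HFileOutputFormat.configureIncrementalLoad(job, table); 

    if (job.waitForCompletion(true)) { 
      // Load the generated HFiles directly into the running cluster. 
      new LoadIncrementalHFiles(conf).doBulkLoad(new Path(args[1]), table); 
    } 
  } 
}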
40 
Bulk Load 
Notice that we use HFileOutputFormat as the output format of the MapReduce 
job used to generate the HFiles. However, the HFileOutputFormat provided by 
the HBase library does NOT support writing multiple column families into HFiles. 
A multi-family-capable version of HFileOutputFormat can be found here: 
https://review.cloudera.org/r/1272/diff/1/?file=17977#file17977line93
41 
Thank You, and Questions 
See More About Hadoop & HBase: 
http://confluence.successfactors.com/display/ENG/Programming+experience+on+Hadoop+&+HBase


Editor's Notes

  1. http://hadoop.apache.org/common/docs/r0.20.2/hdfs_design.html Data blocks are automatically replicated across Data Nodes. Fault-tolerant. The default number of replicas is 3. Share-nothing architecture. Add Data Nodes to increase disk capacity and I/O throughput. Due to replication and internal structure, the actual usable capacity will be less than 1/3 of the raw capacity. The Name Node manages the file system's metadata. SPOF. Needs HA and backup. Its workload increases along with the number of files/blocks and operations. Potential bottleneck.
  2. The Job Tracker manages Map/Reduce job execution. It often runs along with the Name Node. A job is split into tasks. Task Trackers manage task execution and run on the Data Nodes. This is a naturally distributed parallel computing architecture. A web console monitors jobs/tasks. The “hadoop” command runs jobs and manages nodes and the file system. In particular, “hadoop fs” provides many Unix-like commands to access the HDFS.
  3. HMaster manages region servers. It normally runs with Hadoop NameNode together. Data are sorted by row key and split into regions, which are managed by region server. Region servers often run on data nodes. Each region includes one MemStore and several store files. Data writes are recorded into “Write-Ahead-Log” (HLog, but by default it is flushed to disk every 1 second), and written into MemStore. When Memstore becomes full, it is flushed to HDFS as a store file. Full operations: get, put, scan, delete.
  4. GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes. Win32 is supported as a development platform. Distributed operation has not been well tested on Win32, so it is not supported as a production platform. Required software for Linux and Windows include: Java 1.6.x, preferably from Sun, must be installed. ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons. Additional requirements for Windows include: Cygwin - Required for shell support in addition to the required software above.
  5. 1. Hadoop Version. The most used Hadoop versions: 0.20.203.X is the current stable version; it does NOT contain the entire new MapReduce API and does NOT have a durable sync on HDFS; it is currently used in the perf-log project. 0.20.205.X is the current beta version; it does NOT contain the entire new MapReduce API but does have a durable sync on HDFS. 0.21.X is the newest version; it provides the entire new MapReduce API but is unstable and unsupported, does not include security, and cannot run with HBase. 2. Running HBase on Hadoop. The newest version of HBase is 0.90.x. This version of HBase will only run on Hadoop 0.20.x; it will not run on Hadoop 0.21.x (nor 0.22.x). HBase will lose data unless it is running on an HDFS that has a durable sync. Hadoop 0.20.2 and Hadoop 0.20.203.0 do NOT have this attribute. Choose one of the following solutions: (a) HBase bundles an instance of the Hadoop jar under its lib directory; the bundled Hadoop was built from the Apache branch-0.20-append branch at the time of the HBase release and has the sync attribute, so replace the Hadoop jar running on your cluster with the Hadoop jar found in the HBase lib directory. (b) Use the Cloudera or MapR distributions: Cloudera's CDH3 is Apache Hadoop 0.20.x plus patches, including all of the 0.20-append additions needed for a durable sync, and CDH3 contains both Hadoop and HBase. (c) Just use Hadoop 0.20.205.0: since this release includes a merge of append/hsynch/hflush capabilities from the 0.20-append branch, it can support HBase in secure mode, but it is a beta version.
  6. In Hadoop MapReduce, interprocess communication between nodes in the system is implemented using remote procedure calls (RPCs). The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node, which then deserializes the binary stream into the original message. Hadoop uses its own serialization format, Writables, which is certainly compact and fast (but not so easy to extend, or to use from languages other than Java). The Hadoop library provides many basic data types to be used in MapReduce, and you can also implement your own data structures against the Writable interfaces.
  7. Status Show the status of all nodes in HBase. Create Create a table. List List all the existing tables. Put Put either adds new rows to a table (if the key is new) or can update existing rows (if the key already exists). Scan Scan allows iteration over multiple rows for specified attributes of a certain table. Disable & Drop First do disable, then do drop when deleting a table.
  8. Get Get returns attributes for a specified row. Put Put either adds new rows to a table (if the key is new) or can update existing rows (if the key already exists). Scan Scan allows iteration over multiple rows for specified attributes. It can be used with filters and provides powerful query functions on HBase. Delete Delete removes a row from a table.
  9. When using Put in HBase, notice: AutoFlush: AutoFlush meets real-time requirements; you can immediately see the row after it is added to the table. But when performing a lot of Puts, make sure that setAutoFlush is set to false on your HTable instance; otherwise, the Puts will be sent one at a time to the RegionServer. If autoFlush = false, these messages are not sent until the write buffer is filled, which reduces the number of client RPC calls. To explicitly flush the messages, call flushCommits. Calling close on the HTable instance will invoke flushCommits. WAL on Puts: WAL means Write-Ahead Log. Turning this off means that the RegionServer will not write the Put to the Write-Ahead Log, only into the MemStore, which improves performance. HOWEVER, turning it off is not recommended because if there is a RegionServer failure there will be data loss. When using Scan in HBase, notice: Scan Attribute Selection: whenever a Scan is used to process large numbers of rows, be aware of which attributes are selected. Call scan.addFamily to select only the specific columns you want rather than fetching the entire row, because attribute over-selection is a non-trivial performance penalty over large datasets. Scan Caching: when performing a large number of Scans, make sure that the input Scan instance has setCaching set to something greater than the default (which is 1). Setting this value to 500, for example, will transfer 500 rows at a time to the client to be processed. There is a cost/benefit to having a large cache value because it costs more memory for both the client and the RegionServer, so bigger isn't always better.
  10. HBase can be both the input and output of a Map Reduce Job. In the perf-log project, we use HBase as the output of the MR job and it is best to obey the following rules: Get one HTable instance There is a cost instantiating an HTable, so if you do this for each insert, you may have a negative impact on performance. Hence our setup of HTable in the configure() step. Skip the Reducer if possible When writing a lot of data to an HBase table from a MR job and specifically where Puts are being emitted from the Mapper, skip the Reducer step. When a Reducer step is used, all of the output (Puts) from the Mapper will get spooled to disk, then sorted/shuffled to other Reducers that will most likely be off-node. It's far more efficient to just write directly to HBase.