2. What is Hadoop? What is Hadoop not? Hadoop assumptions.
What is Rack, Cluster, Nodes and Commodity Hardware?
HDFS - Hadoop Distributed File System
Using HDFS commands
MapReduce
Higher-level languages over Hadoop: Pig and Hive
HBase – Overview
HCatalog
3. What is Hadoop and its components?
What is the commodity server/Hardware?
Why HDFS?
What is the responsibility of NameNode in HDFS?
What is Fault Tolerance?
What is the default replication factor in HDFS?
What is the heartbeat in HDFS?
What are JobTracker and TaskTracker?
Why the MapReduce programming model?
Where do we have Data Locality in MapReduce?
Why do we need to use Pig and Hive?
What is the difference between HBase and HCatalog?
4. • At Google:
    • Index building for Google Search
    • Article clustering for Google News
    • Statistical machine translation
• At Yahoo!:
    • Index building for Yahoo! Search
    • Spam detection for Yahoo! Mail
• At Facebook:
    • Data mining
    • Ad optimization
    • Spam detection
5. The MapReduce algorithm contains two important tasks (Map and Reduce tasks)
• The Map task converts a set of input data into key-value pairs. It is the leading task and always starts first.
• The Reduce task combines the key-value pairs that share a key.
Map input (set of data):
The quick
Brown fox
The fox ate
Map output (key-value pairs):
The 1, quick 1, Brown 1, Fox 1, The 1, Fox 1, Ate 1
Reduce input (key-value pairs, sorted by key):
Ate 1, Brown 1, Fox 1, Fox 1, quick 1, The 1, The 1
Reduce output:
Ate 1, Brown 1, Fox 2, quick 1, The 2
6. • Data type: key-value records
• Map function: (K_in, V_in) → list(K_inter, V_inter)
• Reduce function: (K_inter, list(V_inter)) → list(K_out, V_out)
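The map and reduce functions over key-value records can be sketched in a few lines of plain Python. This is only an in-memory illustration, not Hadoop's API: `run_mapreduce`, `word_count_map`, and `word_count_reduce` are hypothetical names, and the grouping step stands in for the framework's shuffle.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to each (key, value) record, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        # A map call may emit zero or more intermediate pairs.
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    results = []
    for out_key in sorted(groups):
        # Each reduce call sees one key and all of its values.
        results.extend(reduce_fn(out_key, groups[out_key]))
    return results


def word_count_map(line_number, line):
    for word in line.split():
        yield word.lower(), 1


def word_count_reduce(word, counts):
    yield word, sum(counts)


# Hypothetical input records: (line number, line text).
records = [(0, "The quick"), (1, "Brown fox"), (2, "The fox ate")]
result = run_mapreduce(records, word_count_map, word_count_reduce)
print(result)
```

The driver never looks inside the keys or values, which is why the same skeleton can serve very different jobs.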
8. Input -> Map -> Shuffle & Sort -> Reduce -> Output
Input (one split per mapper):
    Map 1: the quick brown fox
    Map 2: the fox ate the mouse
    Map 3: how now brown cow
Map output:
    Map 1: the, 1; brown, 1; fox, 1; quick, 1
    Map 2: the, 1; fox, 1; the, 1; ate, 1; mouse, 1
    Map 3: how, 1; now, 1; brown, 1; cow, 1
Reduce output:
    Reduce 1: brown, 2; fox, 2; how, 1; now, 1; the, 3
    Reduce 2: ate, 1; cow, 1; mouse, 1; quick, 1
9. • Single master controls job execution on multiple slaves
• Mappers preferentially placed on same node or same rack as their
input block
• Minimizes network usage
• Mappers save outputs to local disk before serving them to
reducers
• Allows recovery if a reducer crashes
• Allows having more reducers than nodes
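The placement preference for mappers (same node first, then same rack) can be illustrated with a small sketch. Everything here is hypothetical: `pick_node`, the node names, and the rack map are invented for illustration and do not reflect Hadoop's actual scheduler code.

```python
def pick_node(replica_nodes, free_nodes, rack_of):
    """Pick a node for a map task: node-local, then rack-local, then any."""
    if not free_nodes:
        raise ValueError("no free nodes to schedule on")
    # Best case: a free node already stores a replica of the input block.
    for node in free_nodes:
        if node in replica_nodes:
            return node, "node-local"
    # Next best: a free node in the same rack as some replica.
    replica_racks = {rack_of[node] for node in replica_nodes}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    # Otherwise the block has to cross racks over the network.
    return free_nodes[0], "off-rack"


# Hypothetical cluster: five nodes in three racks, block replicated on n1 and n3.
RACK_OF = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2", "n5": "r3"}
REPLICAS = {"n1", "n3"}

print(pick_node(REPLICAS, ["n5", "n2", "n3"], RACK_OF))
print(pick_node(REPLICAS, ["n5", "n4"], RACK_OF))
```

The further down the preference list the choice falls, the more of the input block has to travel over the network.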
10. • A combiner is a local aggregation function for repeated keys
produced by same map
• Works for associative and commutative functions like sum, count, max
• Decreases size of intermediate data
• Example: map-side aggregation for Word Count:
def combiner(key, values):
    # Sum the partial counts that one mapper emitted for this key.
    yield key, sum(values)
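To see how a combiner shrinks the intermediate data, the sketch below counts the pairs that would leave each mapper with and without map-side aggregation. The helper names (`map_words`, `combine`, `totals`) and the three input splits are assumptions made for this illustration.

```python
from collections import Counter


def map_words(line):
    """Emit one (word, 1) pair per word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Locally sum the counts for repeated keys from a single mapper."""
    summed = Counter()
    for word, count in pairs:
        summed[word] += count
    return sorted(summed.items())


def totals(pairs):
    """Final per-word totals, as the reducers would compute them."""
    summed = Counter()
    for word, count in pairs:
        summed[word] += count
    return summed


# Hypothetical input: one split per mapper.
splits = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
without_combiner = [pair for line in splits for pair in map_words(line)]
with_combiner = [pair for line in splits for pair in combine(map_words(line))]

print(len(without_combiner), len(with_combiner))
```

The saving is small here because only one split repeats a word, but the final totals are identical either way, which is the property that makes a combiner safe to apply.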
11. Input -> Map & Combine -> Shuffle & Sort -> Reduce -> Output
Input (one split per mapper):
    Map 1: the quick brown fox
    Map 2: the fox ate the mouse
    Map 3: how now brown cow
Map & Combine output:
    Map 1: the, 1; brown, 1; fox, 1; quick, 1
    Map 2: the, 2; fox, 1; ate, 1; mouse, 1
    Map 3: how, 1; now, 1; brown, 1; cow, 1
Reduce output:
    Reduce 1: brown, 2; fox, 2; how, 1; now, 1; the, 3
    Reduce 2: ate, 1; cow, 1; mouse, 1; quick, 1
12. Input Phase − Here we have a Record Reader that
translates each record in an input file and sends the
parsed data to the mapper in the form of key-value pairs.
Map Phase − Map is a user-defined function, which takes
a series of key-value pairs and processes each one of them
to generate zero or more key-value pairs.
Intermediate Keys − The key-value pairs generated by
the mapper are known as intermediate keys.
13. Combiner − A combiner is a type of local Reducer that
groups similar data from the map phase into identifiable
sets. It takes the intermediate keys from the mapper as
input and applies user-defined code to aggregate the
values within the small scope of one mapper. It is not a part of
the main MapReduce algorithm; it is optional.
Shuffle and Sort − The Reducer task starts with the
Shuffle and Sort step. It downloads the grouped key-value
pairs onto the local machine, where the Reducer is
running. The individual key-value pairs are sorted by key
into a larger data list. The data list groups the equivalent
keys together so that their values can be iterated easily in
the Reducer task.
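The Shuffle and Sort step can be sketched as: assign each key to a reducer, sort each reducer's pairs by key, and group equal keys so their values can be iterated together. This is an illustrative sketch only; `partition` and `shuffle_and_sort` are hypothetical names, and CRC32 is used as a stand-in hash because it gives the same result on every run.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def partition(key, num_reducers):
    """Map a key to a reducer index (CRC32 is an assumption of this sketch)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(map_outputs, num_reducers):
    """Return, per reducer, a key-sorted list of (key, [values]) groups."""
    buckets = [[] for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            buckets[partition(key, num_reducers)].append((key, value))
    grouped = []
    for bucket in buckets:
        bucket.sort(key=itemgetter(0))  # sort by key only
        grouped.append([
            (key, [value for _, value in group])
            for key, group in groupby(bucket, key=itemgetter(0))
        ])
    return grouped


# Hypothetical outputs of three mappers.
map_outputs = [
    [("the", 1), ("brown", 1), ("fox", 1), ("quick", 1)],
    [("the", 1), ("fox", 1), ("the", 1), ("ate", 1), ("mouse", 1)],
    [("how", 1), ("now", 1), ("brown", 1), ("cow", 1)],
]
reducer_inputs = shuffle_and_sort(map_outputs, num_reducers=2)
for index, groups in enumerate(reducer_inputs):
    print(index, groups)
```

Because the partition depends only on the key, every value for a given key lands at the same reducer, whichever mapper produced it.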
14. Reducer − The Reducer takes the grouped key-value
paired data as input and runs a Reducer function on each
group. Here, the data can be aggregated, filtered,
and combined in a number of ways, which can involve a wide
range of processing. Once the execution is over, it gives
zero or more key-value pairs to the final step.
Output Phase − In the output phase, we have an output
formatter that translates the final key-value pairs from
the Reducer function and writes them onto a file using a
record writer.
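The input and output phases can be sketched as a pair of helpers: a record reader that turns a text stream into (byte offset, line) pairs, and a record writer that emits one tab-separated key-value pair per line. The function names are hypothetical; the offset-as-key and tab-separated conventions mirror Hadoop's default text formats, but this code is not Hadoop's implementation.

```python
import io


def read_records(stream):
    """Yield (byte_offset, line) pairs from a binary stream of text lines."""
    offset = 0
    for raw in stream:
        yield offset, raw.rstrip(b"\n").decode("utf-8")
        offset += len(raw)  # the next record starts after this line's bytes


def write_records(pairs, stream):
    """Write each final (key, value) pair as 'key<TAB>value' on its own line."""
    for key, value in pairs:
        stream.write(f"{key}\t{value}\n")


source = io.BytesIO(b"the quick\nbrown fox\n")
records = list(read_records(source))

sink = io.StringIO()
write_records([("fox", 2), ("the", 3)], sink)
print(records)
print(sink.getvalue(), end="")
```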
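15. Word Count in Java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Mapper: emits (word, 1) for every token in the input line.
public class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.collect(new Text(itr.nextToken()), ONE);
        }
    }
}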
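16. Word Count in Java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Reducer: sums the counts collected for each word.
public class ReduceClass extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> out,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        out.collect(key, new IntWritable(sum));
    }
}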
17. public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(ReduceClass.class);  // the reducer doubles as the combiner
    conf.setReducerClass(ReduceClass.class);
    FileInputFormat.setInputPaths(conf, args[0]);
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setOutputKeyClass(Text.class);           // output keys are words (strings)
    conf.setOutputValueClass(IntWritable.class);  // output values are counts
    JobClient.runJob(conf);
}
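18. Mapper:
import sys
# Emit "word<TAB>1" for every word read from standard input.
for line in sys.stdin:
    for word in line.split():
        print(word.lower() + "\t" + "1")
Reducer:
import sys
# Sum the counts for each word, then print "word<TAB>total".
counts = {}
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    counts[word] = counts.get(word, 0) + int(count)
for word, count in counts.items():
    print(word + "\t" + str(count))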
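Since Hadoop Streaming hands the reducer its input sorted by key, a reducer can also sum one key group at a time rather than keeping every word in a dictionary. The following is a minimal sketch of that alternative; `reduce_sorted` is a hypothetical name, and `sorted()` merely stands in for the framework's sort step so the example runs locally.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Sum counts from key-sorted 'word<TAB>count' lines, one group at a time."""
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


# Hypothetical mapper output, in arrival order.
mapper_output = ["the\t1\n", "quick\t1\n", "the\t1\n", "fox\t1\n"]
sorted_lines = sorted(mapper_output)  # stand-in for the framework's sort
word_totals = list(reduce_sorted(sorted_lines))
print(word_totals)
```

This version holds only one key's running total in memory, which matters when the number of distinct words is large.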
19. A real-world example to comprehend the power of MapReduce.
Twitter receives around 500 million tweets per day, which is nearly
6,000 tweets per second. The following illustration shows how
Twitter manages its tweets with the help of MapReduce.
20. Many parallel algorithms can be expressed by a
series of MapReduce jobs
But MapReduce is fairly low-level: you must think
about keys, values, partitioning, etc.
Can we capture common “job building blocks”?
21. Pig: started at Yahoo! Research
Runs about 30% of Yahoo!’s jobs
Features:
• Expresses sequences of MapReduce jobs
• Data model: nested “bags” of items
• Provides relational (SQL) operators (JOIN, GROUP BY, etc)
• Easy to plug in Java functions
• Pig Pen development environment for Eclipse
22. • Suppose you have user data in one file, page view data in another, and you need to find the top 5 most visited pages by users aged 18-25.
Dataflow:
Load Users -> Filter by age
Load Pages
(both) -> Join on name -> Group on url -> Count clicks -> Order by clicks -> Take top 5
23. In MapReduce
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
24. Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
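For readers who do not know Pig Latin, the same filter, join, group, count, order, and limit dataflow can be written in plain Python. This is only an in-memory sketch on invented sample data; `top_pages` and the records are hypothetical, and it assumes user names are unique.

```python
from collections import Counter

# Hypothetical sample data: (name, age) and (user, url).
users = [("amy", 22), ("bob", 31), ("cal", 19)]
pages = [
    ("amy", "a.com"), ("cal", "a.com"), ("bob", "b.com"),
    ("amy", "c.com"), ("cal", "c.com"), ("amy", "a.com"),
]


def top_pages(users, pages, low=18, high=25, limit=5):
    # Filter by age.
    eligible = {name for name, age in users if low <= age <= high}
    # Join on name: keep the page views made by eligible users.
    joined = [(user, url) for user, url in pages if user in eligible]
    # Group on url and count clicks.
    clicks = Counter(url for _, url in joined)
    # Order by clicks (descending, ties broken by url) and take the top few.
    ordered = sorted(clicks.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:limit]


print(top_pages(users, pages))
```

Each Python step matches one Pig Latin statement; the difference is that Pig compiles those statements into MapReduce jobs that run over files too large for one machine.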
25. Notice how naturally the components of the job translate into Pig Latin.
Dataflow step -> Pig Latin statement:
Load Users -> Users = load …
Filter by age -> Filtered = filter …
Load Pages -> Pages = load …
Join on name -> Joined = join …
Group on url -> Grouped = group …
Count clicks -> Summed = … count()…
Order by clicks -> Sorted = order …
Take top 5 -> Top5 = limit …
Figure labels: Job 1, Job 3
26. Hive: developed at Facebook
Used for majority of Facebook jobs
“Relational database” built on Hadoop
Maintains list of table schemas
SQL-like query language (HQL)
Can call Hadoop Streaming scripts from HQL
Supports table partitioning, clustering, complex
data types, some optimizations
27. • Find top 5 pages visited by users aged 18-25:
• Filter page views through Python script:
28. Limitations of Hadoop
Hadoop can perform only batch processing, and data will be accessed
only in a sequential manner. That means one has to search the entire
dataset even for the simplest of jobs. A new solution is needed to
access any point of data in a single unit of time (random access).
What is HBase?
HBase is a distributed column-oriented database built on top of the
Hadoop file system. It is designed to provide quick random access to
huge amounts of structured data. It leverages the fault tolerance
provided by the Hadoop File System (HDFS).
29.
Rowid | Column Family  | Column Family  | Column Family  | Column Family
      | col1 col2 col3 | col1 col2 col3 | col1 col2 col3 | col1 col2 col3
1     |                |                |                |
2     |                |                |                |
3     |                |                |                |
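The table layout above (row id, then column family, then column) can be modeled as nested dictionaries. This toy sketch only illustrates the shape of the data; `TinyTable` is a hypothetical class, not the HBase client API, and it ignores versions, timestamps, and persistence.

```python
from collections import defaultdict


class TinyTable:
    """Toy in-memory model: row key -> column family -> column -> value."""

    def __init__(self, families):
        # Column families are fixed when the table is created;
        # columns inside a family can be added freely.
        self.families = set(families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row][family][column] = value

    def get(self, row, family, column, default=None):
        # Look up one cell by its row key without scanning other rows.
        return self.rows.get(row, {}).get(family, {}).get(column, default)


# Hypothetical families and values.
table = TinyTable(["personal", "professional"])
table.put("1", "personal", "name", "raju")
table.put("1", "professional", "role", "manager")
print(table.get("1", "personal", "name"))
```

A lookup by row key goes straight to one cell, which is the random-access pattern that a sequential scan over files cannot offer.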
30. Features of HBase
• HBase is linearly scalable.
• It has automatic failure support.
• It provides consistent reads and writes.
• It integrates with Hadoop, both as a source and a destination.
• It has an easy Java API for clients.
• It provides data replication across clusters.
Where to Use HBase
• Apache HBase is used to have random, real-time read/write access to Big Data.
• It hosts very large tables on top of clusters of commodity hardware.
• Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable runs on top of the Google File System, Apache HBase runs on top of Hadoop and HDFS.
Applications of HBase
• It is used whenever there is a need for write-heavy applications.
• HBase is used whenever we need to provide fast random access to available data.
• Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
32. HBase | RDBMS
HBase is schema-less; it doesn't have the concept of a fixed column schema and defines only column families. | An RDBMS is governed by its schema, which describes the whole structure of tables.
It is built for wide tables. HBase is horizontally scalable. | It is thin and built for small tables. Hard to scale.
No transactions are there in HBase. | RDBMS is transactional.
It has de-normalized data. | It will have normalized data.
It is good for semi-structured as well as structured data. | It is good for structured data.
33. HCatalog provides a relational table
abstraction layer over HDFS. Using the
HCatalog abstraction layer allows query tools
such as Pig and Hive to treat the data in a
familiar relational architecture. It also permits
easier exchange of data between the HDFS
storage and client tools used to present the data
for analysis using familiar data exchange
application programming interfaces (APIs) such
as Java Database Connectivity (JDBC) and
Open Database Connectivity (ODBC).