1. 0-60 Hadoop Development in 60 minutes or less
Abe Taha
abetaha@karmasphere.com
Saturday, August 14, 2010
2. Agenda
• Background
• Motivation for Hadoop
• Hadoop Architecture
- HDFS
- MapReduce framework
• Example Jobs
• Karmasphere Studio
• Ancillary Hadoop technologies
• Questions
3. Background
• Worked at Yahoo on search and social search
• Worked at Google on App infrastructure
• Worked at Ning on Hadoop for analytics and
system management services
• Worked at Ask on Dictionary.com and
Reference.com properties
• Now at Karmasphere
4. Motivation for Hadoop
• Data is growing fast
- Website usage increasing
- Logging user events on the rise
- Disks are becoming cheaper
- Companies realize insights buried in the data
• Era of Big Data
- You know big data when you see it
- Data so large that extracting insights from it in a reasonable amount of time becomes difficult
5. Big Data example
• Apache log files are common for web properties
• Simple format
- 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /search?q=book HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
• Contains wealth of information
- IP address of the client
- User requesting the resource
- Date and Time
- URL Path
- Result code
- Object size returned to the client
- Referrer
- User-Agent
6. Insights in log data
• The log data contains a wealth of information
- Duration of user’s visit
- Most popular queries/pages
- Most common browsers
- Geo location of users
- Flow analysis of user sessions
7. Typical log data lifecycle
• Instead of gaining these insights
- Logs are kept for 30 days
- Then sent to tape
‣ Where they die
‣ Except if the government needs to access
them
• Sometimes
- Data is extracted
- Placed into a data warehouse for future
processing
- Not very flexible, if data fields change
8. Solution?
• Problem prevalent in a lot of search companies
and at a very large scale
• In 2004 Google published their take on the
problem
- Paper in OSDI ’04
- MapReduce: Simplified Data Processing on
Large Clusters
• System built on cheap commodity hardware,
and horizontally scalable
• New paradigm for solving problems
- Map
- Reduce
9. What is MapReduce?
• Old paradigm from functional languages
• Works on data tuples
• For each tuple, apply a mapper function f: [k1, v1] -> [k2, v2]
• Collect tuples with matching keys and apply a reduce (combine) function g: [k2, [v1, v2, …, vn]] -> [k3, v3]
10. MapReduce (cont’d)
• To speed up the computation we divide and
conquer
- Divide the tuples into manageable groups
- Process each group of tuples separately
- Collect similar tuples and send them to the
reduce phase
- Combine the results together
• Luckily in most data problems the data records
are independent
11. MapReduce Framework
• Takes care of the scaffolding around the map/reduce functions
- Partition the data across multiple machines
- Run a function (Map) on each partition in
parallel
- Collect the results, and sort them
- Send the results to multiple machines that
run a Reduce function
- Rinse and repeat if needed
13. Example
• Find the maximum number in a list
• Luckily max(A) = max(max(A[1..k]), max(A[k+1..N]))
• A = [1, 2, 3, 4, 5, …, 10]
• Divide A into chunks
- A1=[1,..,5]
- A2=[6,…,10]
• Map max on A1 to get 5
• Map max on A2 to get 10
• Reduce [5,10] by using max to get 10
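The divide-and-conquer max above can be sketched in plain Java (no Hadoop involved; the class and method names are illustrative) to show how per-chunk maxima are "mapped" and then "reduced":

```java
import java.util.Arrays;
import java.util.List;

public class ChunkedMax {
    // "Map": compute the max of each chunk independently;
    // "Reduce": take the max of the partial maxima.
    static long maxOf(List<long[]> chunks) {
        return chunks.stream()
                .mapToLong(c -> Arrays.stream(c).max().getAsLong()) // per-chunk max
                .max()                                              // reduce step
                .getAsLong();
    }

    public static void main(String[] args) {
        List<long[]> chunks = Arrays.asList(
                new long[]{1, 2, 3, 4, 5},    // A1 -> 5
                new long[]{6, 7, 8, 9, 10});  // A2 -> 10
        System.out.println(maxOf(chunks));    // prints 10
    }
}
```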
14. Another example
• Add Numbers from 1..100
• Sum of A[1..100] = Sum of A[1..k] + Sum of A[k+1..p] + Sum of A[p+1..100]
[Diagram: the sequence 1 2 3 4 5 6 7 8 9 … 100 split into chunks; per-chunk partial sums (15, N, M) are combined to give 5050]
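The partial-sum idea can also be sketched in plain Java (no Hadoop; the three chunk boundaries below are an arbitrary choice, not from the slide):

```java
import java.util.stream.LongStream;

public class ChunkedSum {
    // "Map": sum one sub-range of the sequence independently.
    static long rangeSum(long lo, long hi) {
        return LongStream.rangeClosed(lo, hi).sum();
    }

    public static void main(String[] args) {
        // Split 1..100 into three chunks, sum each on its own,
        // then "reduce" by adding the partial sums together.
        long total = rangeSum(1, 33) + rangeSum(34, 66) + rangeSum(67, 100);
        System.out.println(total); // prints 5050
    }
}
```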
15. Another example
• Canonical word count
• Divide a text into words
- “To be or not to be”
- To, be, or, not, to, be
• Mapper
- For every word emit a tuple (word, 1)
- (To, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1)
• Collect output by word (treating “To” and “to” as the same word)
- (to, [1, 1]), (be, [1, 1]), (or, [1]), (not, [1])
• Reduce the tuples
- (to, 2), (be, 2), (or, 1), (not, 1)
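The tuple flow above can be simulated in plain Java streams (no Hadoop; the class name and the lowercasing of words are illustrative assumptions): `groupingBy` plays the role of the shuffle, and `counting` the role of the reducer.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordCountSim {
    static Map<String, Long> count(String text) {
        // Mapper: split into words and lowercase, conceptually emitting (word, 1).
        // Shuffle + reduce: group equal words and sum their counts.
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                .collect(Collectors.groupingBy(Function.identity(),
                                               Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count("To be or not to be"));
        // e.g. {not=1, be=2, or=1, to=2} (map iteration order may vary)
    }
}
```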
16. So how do we run the examples?
• Using Hadoop
- Open source implementation of MR
framework
- Two major components
‣ Distributed file system--HDFS
‣ Code execution framework--MR
17. HDFS
• Stores data in files that are divided into blocks
• Blocks are large, usually 64MB, to amortize the cost of seeks
• Blocks are stored on multiple machines called
“Data Nodes”
• One master node “Name Node” stores
filesystem meta-data including the directory
hierarchy, file names, and file to block mapping
• All meta data operations go through the Name
Node, however data access goes directly to the
data nodes
18. HDFS
• Single point of failure because there is only one Name Node
- A Secondary Name Node periodically checkpoints the Name Node’s metadata; it is not a hot standby
• Limitation on number of files in the file system
as all meta-data is stored in memory on the
Name Node
- Hadoop archive files
19. MapReduce Framework
• Execution framework that orchestrates the MR
jobs
- Takes care of running the code where the
data is
- Partitions the input into chunks
- Runs the user provided Mappers and collects
the output, sorts and combines the
intermediate results
- Takes care of job failures and straggler tasks
- Runs Reducers to summarize results
• Supports streaming for scripting languages and
Pipes for C/C++
20. And how would WordCount look?
public class HadoopMapper extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
        // Split the line on commas and whitespace.
        String[] line = value.toString().split("[,\\s]+");
        for (String token : line) {
            // Emit (word, 1) for every token.
            output.collect(new Text(token), new LongWritable(1));
        }
    }
}
21. Word Count-Reducer
public class HadoopReducer extends MapReduceBase implements
        Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    public void reduce(Text key, Iterator<LongWritable> value,
                       OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
        long sum = 0;
        // Sum the counts for this word (the mapper emitted 1 per occurrence).
        while (value.hasNext()) {
            sum += value.next().get();
        }
        output.collect(key, new LongWritable(sum));
    }
}
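The slides show the mapper and reducer but not the driver that configures and submits the job. A minimal driver sketch using the same old `org.apache.hadoop.mapred` API might look like this; the `WordCount` class name and the argument-based input/output paths are illustrative assumptions, not from the deck.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        // Output key/value types emitted by the mapper and reducer above.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);

        conf.setMapperClass(HadoopMapper.class);
        conf.setReducerClass(HadoopReducer.class);

        // Input and output locations on HDFS (illustrative paths).
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
```

Running it requires a Hadoop installation; the framework handles partitioning, shuffling, and failures as described on slide 19.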
22. Max - Mapper
public void map(LongWritable key, Text value,
                OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
    String[] numbers = value.toString().split("[,\\s]+");
    long max = Long.MIN_VALUE; // also handles negative inputs
    for (String token : numbers) {
        long number = Long.parseLong(token);
        if (number > max) {
            max = number;
        }
    }
    // Single key so all partial maxima meet in one reducer.
    output.collect(new Text("k"), new LongWritable(max));
}
23. Max-Reducer
public void reduce(Text key, Iterator<LongWritable> value,
                   OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
    long max = Long.MIN_VALUE; // also handles negative partial maxima
    while (value.hasNext()) {
        long number = value.next().get();
        if (number > max) {
            max = number;
        }
    }
    output.collect(key, new LongWritable(max));
}
24. Sum - Mapper
public void map(LongWritable key, Text value,
                OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
    String[] numbers = value.toString().split("[,\\s]+");
    long sum = 0;
    for (String token : numbers) {
        sum += Long.parseLong(token);
    }
    // Single key so all partial sums meet in one reducer.
    output.collect(new Text("k"), new LongWritable(sum));
}
25. Sum - Reducer
public void reduce(Text key, Iterator<LongWritable> value,
                   OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
    long sum = 0;
    while (value.hasNext()) {
        sum += value.next().get();
    }
    output.collect(key, new LongWritable(sum));
}
26. First Impressions
• Lots of overhead even for simple examples
• Can’t test on data before deploying to cluster
- Bugs
- Prototyping for data format changes
- Testing different versions of Hadoop runtime
• Tools like Karmasphere help with that
27. Karmasphere Studio
• For NetBeans and Eclipse
• Two editions
- Community (Free)
- Professional
28. Community Edition
• Community edition focuses on
- Development and prototyping
‣ MR workflow development
‣ Local execution with multiple Hadoop
versions
- Packaging jars
• Eclipse
- http://www.hadoopstudio.org/dist/eclipse-community/site.xml
• NetBeans
- http://hadoopstudio.org/updates/updates.xml
29. Professional Edition
• Professional edition focuses on what happens to
the job after initial development
- Profiling and tuning
- Packaging and deployment (local/colo/ssh
tunnel/EMR)
- Support
• Sign-up for beta on our site
- http://karmasphere.com/Products-Information/karmasphere-studio.html
47. What happened?
• Without deploying anything to the cluster, we
can:
- See how the job behaves locally
- Fix bugs if data output does not match
expectation
- Experiment with different versions of Hadoop
• We can also write custom code for each MR
stage or use the ones provided by Hadoop
48. Running locally
• Studio comes with 3 versions of Hadoop
runtime libraries
- 0.18
- 0.19
- 0.20
• Can run a job locally as an in-process thread using the exported jar
- Test behavior on different Hadoop runtimes
without deploying
- Just need to supply input/output
63. Other Hadoop technologies
• Cascading
- Higher level data flow language
- Operates on sources and sinks
- Turns workflows into jobs
- Studio includes Cascading support
• Hive
- High-level, SQL-like language
- Concepts such as tables and queries
- Converts SQL to MapReduce
- Karmasphere is working on an enterprise-quality, Hive-based SQL product
• Pig
- Scripting language
- Converts script to MR job