1. 0-60 Hadoop Development in 60 minutes or less
Abe Taha
abetaha@karmasphere.com
Saturday, August 14, 2010
2. Agenda
• Background
• Motivation for Hadoop
• Hadoop Architecture
- HDFS
- MapReduce framework
• Example Jobs
• Karmasphere Studio
• Ancillary Hadoop technologies
• Questions
3. Background
• Worked at Yahoo on search and social search
• Worked at Google on App infrastructure
• Worked at Ning on Hadoop for analytics and
system management services
• Worked at Ask on Dictionary.com and
Reference.com properties
• Now at Karmasphere
4. Motivation for Hadoop
• Data is growing fast
- Website usage increasing
- Logging user events on the rise
- Disks are becoming cheaper
- Companies realize insights buried in the data
• Era of Big Data
- You know big data when you see it
- Data so large that extracting insights from it in a reasonable amount of time becomes difficult
5. Big Data example
• Apache log files are common for web properties
• Simple format
- 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /search?q=book HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
• Contains wealth of information
- IP address of the client
- User requesting the resource
- Date and Time
- URL Path
- Result code
- Object size returned to the client
- Referrer
- User-Agent
6. Insights in log data
• The log data contains a wealth of information
- Duration of user’s visit
- Most popular queries/pages
- Most common browsers
- Geo location of users
- Flow analysis of user sessions
7. Typical log data lifecycle
• Instead of gaining these insights
- Logs are kept for 30 days
- Then sent to tape
‣ Where they die
‣ Except if the government needs to access
them
• Sometimes
- Data is extracted
- Placed into a data warehouse for future
processing
- Not very flexible, if data fields change
8. Solution?
• Problem prevalent in a lot of search companies
and at a very large scale
• In 2004 Google published their take on the
problem
- Paper in OSDI ’04
- MapReduce: Simplified Data Processing on
Large Clusters
• System built on cheap commodity hardware,
and horizontally scalable
• New paradigm for solving problems
- Map
- Reduce
9. What is MapReduce?
• Old paradigm from functional languages
• Works on data tuples
• For each tuple, apply a mapper function f: [k1, v1] -> [k2, v2]
• Collect tuples with matching keys and apply a reduce (combine) function g: [k2, [v1, v2, …, vn]] -> [k3, v3]
10. MapReduce (cont’d)
• To speed up the computation we divide and
conquer
- Divide the tuples into manageable groups
- Process each group of tuples separately
- Collect similar tuples and send them to the
reduce phase
- Combine the results together
• Luckily in most data problems the data records
are independent
11. MapReduce Framework
• Takes care of the scaffolding around the map/reduce functions
- Partition the data across multiple machines
- Run a function (Map) on each partition in
parallel
- Collect the results, and sort them
- Send the results to multiple machines that
run a Reduce function
- Rinse and repeat if needed
13. Example
• Find the maximum number in a list
• Luckily max(A) = max(max(A[1..k]), max(A[k+1..N]))
• A = [1, 2, 3, 4, 5, …, 10]
• Divide A into chunks
- A1=[1,..,5]
- A2=[6,…,10]
• Map max on A1 to get 5
• Map max on A2 to get 10
• Reduce [5,10] by using max to get 10
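The divide-and-conquer max above can be sketched in plain Java (no Hadoop involved; the class and method names are illustrative) to show how per-chunk maxima are "mapped" and then "reduced":

```java
import java.util.Arrays;
import java.util.List;

public class ChunkedMax {
    // "Map": compute the max of each chunk independently;
    // "Reduce": take the max of the partial maxima.
    static long maxOf(List<long[]> chunks) {
        return chunks.stream()
                .mapToLong(c -> Arrays.stream(c).max().getAsLong()) // per-chunk max
                .max()                                              // reduce step
                .getAsLong();
    }

    public static void main(String[] args) {
        List<long[]> chunks = Arrays.asList(
                new long[]{1, 2, 3, 4, 5},    // A1 -> 5
                new long[]{6, 7, 8, 9, 10});  // A2 -> 10
        System.out.println(maxOf(chunks));    // prints 10
    }
}
```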
14. Another example
• Add Numbers from 1..100
• Sum of A[1..100] = Sum of A[1..k] + Sum of A[k+1..p] + Sum of A[p+1..100]
[Diagram: the sequence 1 2 3 4 5 6 7 8 9 … 100 split into chunks; per-chunk partial sums (15, N, M) are combined to give 5050]
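The partial-sum idea can also be sketched in plain Java (no Hadoop; the three chunk boundaries below are an arbitrary choice, not from the slide):

```java
import java.util.stream.LongStream;

public class ChunkedSum {
    // "Map": sum one sub-range of the sequence independently.
    static long rangeSum(long lo, long hi) {
        return LongStream.rangeClosed(lo, hi).sum();
    }

    public static void main(String[] args) {
        // Split 1..100 into three chunks, sum each on its own,
        // then "reduce" by adding the partial sums together.
        long total = rangeSum(1, 33) + rangeSum(34, 66) + rangeSum(67, 100);
        System.out.println(total); // prints 5050
    }
}
```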
15. Another example
• Canonical word count
• Divide a text into words
- “To be or not to be”
- To, be, or, not, to, be
• Mapper
- For every word emit a tuple (word, 1)
- (To, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1)
• Collect output by word (treating “To” and “to” as the same word)
- (to, [1, 1]), (be, [1, 1]), (or, [1]), (not, [1])
• Reduce the tuples
- (to, 2), (be, 2), (or, 1), (not, 1)
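The tuple flow above can be simulated in plain Java streams (no Hadoop; the class name and the lowercasing of words are illustrative assumptions): `groupingBy` plays the role of the shuffle, and `counting` the role of the reducer.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordCountSim {
    static Map<String, Long> count(String text) {
        // Mapper: split into words and lowercase, conceptually emitting (word, 1).
        // Shuffle + reduce: group equal words and sum their counts.
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                .collect(Collectors.groupingBy(Function.identity(),
                                               Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count("To be or not to be"));
        // e.g. {not=1, be=2, or=1, to=2} (map iteration order may vary)
    }
}
```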
16. So how do we run the examples?
• Using Hadoop
- Open source implementation of MR
framework
- Two major components
‣ Distributed file system--HDFS
‣ Code execution framework--MR
17. HDFS
• Stores data in files that are divided into blocks
• Blocks are large, usually 64MB, to amortize the cost of seeks
• Blocks are stored on multiple machines called
“Data Nodes”
• One master node “Name Node” stores
filesystem meta-data including the directory
hierarchy, file names, and file to block mapping
• All meta data operations go through the Name
Node, however data access goes directly to the
data nodes
18. HDFS
• Single point of failure because there is only one Name Node
- A Secondary Name Node periodically checkpoints the Name Node’s metadata; it is not a hot standby
• Limitation on number of files in the file system
as all meta-data is stored in memory on the
Name Node
- Hadoop archive files
19. MapReduce Framework
• Execution framework that orchestrates the MR
jobs
- Takes care of running the code where the
data is
- Partitions the input into chunks
- Runs the user provided Mappers and collects
the output, sorts and combines the
intermediate results
- Takes care of job failures and straggler tasks
- Runs Reducers to summarize results
• Supports streaming for scripting languages and
Pipes for C/C++
20. And how would WordCount look?
public class HadoopMapper extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
        // Split the line on commas and whitespace.
        String[] line = value.toString().split("[,\\s]+");
        for (String token : line) {
            // Emit (word, 1) for every token.
            output.collect(new Text(token), new LongWritable(1));
        }
    }
}
21. Word Count-Reducer
public class HadoopReducer extends MapReduceBase implements
        Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    public void reduce(Text key, Iterator<LongWritable> value,
                       OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
        long sum = 0;
        // Sum the counts for this word (the mapper emitted 1 per occurrence).
        while (value.hasNext()) {
            sum += value.next().get();
        }
        output.collect(key, new LongWritable(sum));
    }
}
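The slides show the mapper and reducer but not the driver that configures and submits the job. A minimal driver sketch using the same old `org.apache.hadoop.mapred` API might look like this; the `WordCount` class name and the argument-based input/output paths are illustrative assumptions, not from the deck.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        // Output key/value types emitted by the mapper and reducer above.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);

        conf.setMapperClass(HadoopMapper.class);
        conf.setReducerClass(HadoopReducer.class);

        // Input and output locations on HDFS (illustrative paths).
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
```

Running it requires a Hadoop installation; the framework handles partitioning, shuffling, and failures as described on slide 19.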
22. Max - Mapper
public void map(LongWritable key, Text value,
                OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
    String[] numbers = value.toString().split("[,\\s]+");
    long max = Long.MIN_VALUE; // also handles negative inputs
    for (String token : numbers) {
        long number = Long.parseLong(token);
        if (number > max) {
            max = number;
        }
    }
    // Single key so all partial maxima meet in one reducer.
    output.collect(new Text("k"), new LongWritable(max));
}
23. Max-Reducer
public void reduce(Text key, Iterator<LongWritable> value,
                   OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
    long max = Long.MIN_VALUE; // also handles negative partial maxima
    while (value.hasNext()) {
        long number = value.next().get();
        if (number > max) {
            max = number;
        }
    }
    output.collect(key, new LongWritable(max));
}
24. Sum - Mapper
public void map(LongWritable key, Text value,
                OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
    String[] numbers = value.toString().split("[,\\s]+");
    long sum = 0;
    for (String token : numbers) {
        sum += Long.parseLong(token);
    }
    // Single key so all partial sums meet in one reducer.
    output.collect(new Text("k"), new LongWritable(sum));
}
25. Sum - Reducer
public void reduce(Text key, Iterator<LongWritable> value,
                   OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
    long sum = 0;
    while (value.hasNext()) {
        sum += value.next().get();
    }
    output.collect(key, new LongWritable(sum));
}
26. First Impressions
• Lots of overhead even for simple examples
• Can’t test on data before deploying to cluster
- Bugs
- Prototyping for data format changes
- Testing different versions of Hadoop runtime
• Tools like Karmasphere help with that
27. Karmasphere Studio
• For NetBeans and Eclipse
• Two editions
- Community (Free)
- Professional
28. Community Edition
• Community edition focuses on
- Development and prototyping
‣ MR workflow development
‣ Local execution with multiple Hadoop
versions
- Packaging jars
• Eclipse
- http://www.hadoopstudio.org/dist/eclipse-community/site.xml
• NetBeans
- http://hadoopstudio.org/updates/updates.xml
29. Professional Edition
• Professional edition focuses on what happens to
the job after initial development
- Profiling and tuning
- Packaging and deployment (local/colo/ssh
tunnel/EMR)
- Support
• Sign-up for beta on our site
- http://karmasphere.com/Products-Information/karmasphere-studio.html
47. What happened?
• Without deploying anything to the cluster, we
can:
- See how the job behaves locally
- Fix bugs if data output does not match
expectation
- Experiment with different versions of Hadoop
• We can also write custom code for each MR
stage or use the ones provided by Hadoop
48. Running locally
• Studio comes with 3 versions of Hadoop
runtime libraries
- 0.18
- 0.19
- 0.20
• Can run a job locally as an in-process thread using the exported jar
- Test behavior on different Hadoop runtimes
without deploying
- Just need to supply input/output
63. Other Hadoop technologies
• Cascading
- Higher level data flow language
- Operates on sources and sinks
- Turns workflows into jobs
- Studio includes Cascading support
• Hive
- High-level, SQL-like language
- Concepts such as tables and queries
- Converts SQL to MapReduce
- Karmasphere is working on an enterprise-quality, Hive-based SQL product
• Pig
- Scripting language
- Converts script to MR job