Introduction to Hadoop

INTRODUCTION TO HADOOP
Keegan Witt (@keeganwitt)

SLIDES
http://bit.ly/cm14_hadoop

THINGS I'LL TALK ABOUT
Why Hadoop?
Hadoop ecosystem
Deploying Hadoop
Writing your first job
Testing your first job
Why not Hadoop?
Advanced usages

THINGS I WON'T TALK ABOUT
Anything I lack prod experience in
Configuring & managing a Hadoop cluster
Querying & data mining (e.g. Hive, Pig, Mahout, Flume)

WHY RIDE THE ELEPHANT?
Source: Hadoop

THE PROBLEM
Growing data
Disks are slow
Need higher throughput
More unstructured data

DESIRABLE FEATURES
Scale out, not up
Easy to use
Built-in backups
Built-in fault tolerance

USE CASES
Text mining/pattern recognition
Graph processing
Collaborative filtering
Clustering

WHO ELSE IS RIDING?
Amazon
AOL
Autodesk
eBay
Google*
Groupon
HP
IBM
Intel
J.P. Morgan
Last.fm
LinkedIn
NASA
Navteq
NSA
Rackspace
Samsung
StumbleUpon
Twitter
Visa
Yahoo

CONTRIBUTORS
Source: Cloudera

WHAT IS HADOOP?
Source: Unknown

HDFS ARCHITECTURE
Source: Hadoop

HDFS ARCHITECTURE
Source: Computer Geek Blog

HBASE ARCHITECTURE
Source: Lars George's Blog

HBASE HDFS STRUCTURE
HFILES
/hbase
    /<Table>
        /<Region>
            /<ColumnFamily>
                /<StoreFile>

HLOGS (WALS)
/hbase
    /.logs
        /<RegionServer>
            /<HLog>

LOGICAL VIEW
Source: Manoj Khangaonkar's Blog

SERVER VIEW
Source: Hortonworks

PHYSICAL VIEW
Source: Microsoft

PROCESS VIEW
Source: Rohit Menon's blog

YARN & MAPREDUCE 2
Source: Hortonworks

DEPLOYING HADOOP
Source: Dilbert

DEPLOYING HADOOP
FOR EXPERIMENTING
FOR REAL
From distribution's packages
From source
Cloudera QuickStart VM
Hortonworks Sandbox
Amazon EMR
Cloudera CDH
Hortonworks HDP
MapR
Microsoft HDInsight on Azure

CONFIGURING HADOOP
DEFAULTS
core-site.xml
hdfs-site.xml
mapred-site.xml
hbase-site.xml
hive-site.xml
yarn-site.xml
OVERRIDING
Configuration conf = new Configuration();
conf.set("<optionKey>", "<optionValue>");
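The override order can be pictured with a small plain-Java analogy. This is only a sketch using java.util.Properties, not Hadoop's Configuration class: values loaded from a site file act as the fallback layer, and a value set in code wins over them. The class name ConfigLayeringSketch and the values "4" and "8" are hypothetical, chosen purely for illustration.

```java
import java.util.Properties;

/** Analogy for layered configuration using java.util.Properties; not Hadoop API. */
final class ConfigLayeringSketch {

    /** Builds a job-level view whose lookups fall back to site-file values. */
    static Properties layered() {
        // Stand-in for values read from a *-site.xml file (hypothetical values)
        Properties siteFile = new Properties();
        siteFile.setProperty("mapreduce.job.reduces", "4");
        siteFile.setProperty("io.file.buffer.size", "4096");

        // Stand-in for the job's configuration: unset keys fall back to siteFile
        Properties jobConf = new Properties(siteFile);
        // Comparable in spirit to conf.set("<optionKey>", "<optionValue>")
        jobConf.setProperty("mapreduce.job.reduces", "8");
        return jobConf;
    }

    public static void main(String[] args) {
        Properties conf = layered();
        // Overridden in code, so the programmatic value is returned
        System.out.println(conf.getProperty("mapreduce.job.reduces"));
        // Never overridden, so the site-file value is returned
        System.out.println(conf.getProperty("io.file.buffer.size"));
    }
}
```

In this sketch the first lookup returns the value set in code and the second returns the site-file value.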

WRITING YOUR FIRST JOB
Source: CloudTweaks

DRIVER
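Source: Hadoop (slightly modified)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount_Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Job.getInstance replaces the Job constructor deprecated in Hadoop 2
        Job job = Job.getInstance(conf, "wordcount");
        // Lets Hadoop find the jar that contains the job classes
        job.setJarByClass(WordCount_Driver.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(WordCount_Mapper.class);
        job.setReducerClass(WordCount_Reducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}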

MAPPER
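Source: Hadoop
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCount_Mapper
  extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    // Emits (word, 1) for every whitespace-separated token in the line
    @Override
    public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}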

REDUCER
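Source: Hadoop
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount_Reducer
  extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Sums the counts emitted for one word and writes (word, total)
    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
      Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}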
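To make the data flow between the mapper and reducer above concrete, here is a rough plain-Java imitation of the map, shuffle, and reduce phases for a single line of input. It does not use the Hadoop API; the class WordCountSimulation and its method names are hypothetical, and the in-memory grouping only stands in for what the framework does across machines.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Plain-Java imitation of the map -> shuffle -> reduce flow; not Hadoop API. */
final class WordCountSimulation {

    /** Map phase: emit (word, 1) for every token, as WordCount_Mapper does. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    /** Shuffle phase: group values by key, with keys in sorted order. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: sum the values for one key, as WordCount_Reducer does. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs all three phases over the given input lines. */
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(mapped).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("cat cat dog"))); // prints {cat=2, dog=1}
    }
}
```

The mapper emits one pair per token, so "cat" appears twice before the shuffle groups the two values under a single key for the reducer.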

MAP TEST
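import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCount_Mapper_Test {
    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
    @Before
    public void setUp() {
        WordCount_Mapper mapper = new WordCount_Mapper();
        mapDriver = new MapDriver<LongWritable, Text, Text, IntWritable>();
        mapDriver.setMapper(mapper);
    }
    @Test
    public void testMapper() throws Exception {
        // The mapper emits one pair per token, so "cat" appears twice
        mapDriver.withInput(new LongWritable(1), new Text("cat cat dog"))
            .withOutput(new Text("cat"), new IntWritable(1))
            .withOutput(new Text("cat"), new IntWritable(1))
            .withOutput(new Text("dog"), new IntWritable(1))
            .runTest();
    }
}
MRUnit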

REDUCE TEST
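import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCount_Reducer_Test {
    private ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;
    @Before
    public void setUp() {
        WordCount_Reducer reducer = new WordCount_Reducer();
        reduceDriver = new ReduceDriver<Text, IntWritable, Text, IntWritable>();
        reduceDriver.setReducer(reducer);
    }
    @Test
    public void testReducer() throws Exception {
        List<IntWritable> catValues = new ArrayList<IntWritable>();
        catValues.add(new IntWritable(1));
        catValues.add(new IntWritable(1));
        List<IntWritable> dogValues = new ArrayList<IntWritable>();
        dogValues.add(new IntWritable(1));
        // Both input keys produce an output record, in input order
        reduceDriver.withInput(new Text("cat"), catValues)
            .withInput(new Text("dog"), dogValues)
            .withOutput(new Text("cat"), new IntWritable(2))
            .withOutput(new Text("dog"), new IntWritable(1))
            .runTest();
    }
}
MRUnit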

MAPREDUCE TEST
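import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCount_MapReduce_Test {
    private MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable> mapReduceDriver;
    @Before
    public void setUp() {
        WordCount_Mapper mapper = new WordCount_Mapper();
        WordCount_Reducer reducer = new WordCount_Reducer();
        mapReduceDriver = new MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable>();
        mapReduceDriver.setMapper(mapper);
        mapReduceDriver.setReducer(reducer);
    }
    @Test
    public void testMapReduce() throws Exception {
        // withOutput returns the driver, so the calls can be chained
        mapReduceDriver.withInput(new LongWritable(1), new Text("cat cat dog"))
            .withOutput(new Text("cat"), new IntWritable(2))
            .withOutput(new Text("dog"), new IntWritable(1))
            .runTest();
    }
}
MRUnit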

WHAT ABOUT SYSTEM TESTING?
MiniCluster
HBaseTestingUtility (Sematext example)

WHY NOT RIDE THE ELEPHANT?
Source: geek & poke

WHY NOT RIDE THE ELEPHANT?
Request/response model
External clients
Not much data
Young

DEPENDENCIES
HADOOP_CLASSPATH
Überjar
-libjars
CLASSPATH ORDERING
HADOOP_USER_CLASSPATH_FIRST
mapreduce.task.classpath.first -> true

CUSTOM COUNTERS
public enum KeegansCounters {
    FOO,
    BAR;
}
// ...
context.getCounter(KeegansCounters.FOO).increment(1);
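Each task increments its own copy of a counter, and the framework sums the per-task values into job-wide totals. The following is a plain-Java sketch of that aggregation idea, not the Hadoop counter API; the classes CounterAggregationSketch and TaskCounters are hypothetical, and only the enum mirrors the slide.

```java
import java.util.EnumMap;
import java.util.Map;

/** Plain-Java sketch of summing per-task counters; not Hadoop API. */
final class CounterAggregationSketch {

    enum KeegansCounters { FOO, BAR }

    /** Counters local to one (simulated) task. */
    static final class TaskCounters {
        private final Map<KeegansCounters, Long> counts =
            new EnumMap<>(KeegansCounters.class);

        /** Adds the given amount to the named counter. */
        void increment(KeegansCounters counter, long amount) {
            counts.merge(counter, amount, Long::sum);
        }

        /** Returns the counter's value, or 0 if it was never incremented. */
        long get(KeegansCounters counter) {
            return counts.getOrDefault(counter, 0L);
        }
    }

    /** Sums the counters of all tasks into one job-wide view. */
    static TaskCounters aggregate(TaskCounters... tasks) {
        TaskCounters total = new TaskCounters();
        for (TaskCounters task : tasks) {
            for (KeegansCounters counter : KeegansCounters.values()) {
                total.increment(counter, task.get(counter));
            }
        }
        return total;
    }

    public static void main(String[] args) {
        TaskCounters first = new TaskCounters();
        first.increment(KeegansCounters.FOO, 1);
        first.increment(KeegansCounters.FOO, 1);
        TaskCounters second = new TaskCounters();
        second.increment(KeegansCounters.FOO, 1);
        second.increment(KeegansCounters.BAR, 1);

        TaskCounters total = aggregate(first, second);
        System.out.println("FOO=" + total.get(KeegansCounters.FOO)); // prints FOO=3
        System.out.println("BAR=" + total.get(KeegansCounters.BAR)); // prints BAR=1
    }
}
```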

JOB FLOWS
Sequentially in main()
Use JobControl in main()
Multiple hadoop jar commands
Oozie
Azkaban
ChainMapper & ChainReducer

SQOOP PROCESS OVERVIEW
Source: DevX

SQOOPING DATA FROM RDBMSS
sqoop import \
    --connect jdbc:mysql://foo.com/db \
    --table orders \
    --fields-terminated-by '\t' \
    --lines-terminated-by '\n'

SQOOPING DATA INTO RDBMSS
sqoop export \
    --connect jdbc:mysql://foo.com/db \
    --table bar \
    --export-dir /hdfs_path/bar_data

COMPRESSING INTERMEDIATE DATA
mapred.compress.map.output -> true
mapred.map.output.compression.codec -> org.apache.hadoop.io.compress.SnappyCodec
COMPRESSING OUTPUT
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

PROFILING JOBS
HPROF
Trial and error

DISTRIBUTED CACHE
COMMAND LINE (USING TOOL INTERFACE)
-files
-archives
-libjars
PROGRAMMATICALLY
public void addCacheFile(URI uri)
public void addCacheArchive(URI uri)
public void addFileToClassPath(Path file)
public void addArchiveToClassPath(Path archive)

SECONDARY SORTING
STEPS
Change key to composite
Create Partitioner and grouping Comparator on original key
Create sort Comparator on composite key

SECONDARY SORTING EXAMPLE
job.setPartitionerClass(FirstPartitioner.class);
job.setSortComparatorClass(KeyComparator.class);
job.setGroupingComparatorClass(GroupComparator.class);
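The three steps above can be sketched in plain Java to show how the pieces cooperate. This is an in-memory imitation, not the Hadoop API: the class SecondarySortSketch, its CompositeKey type, and the sample records are hypothetical, and the partition function only mimics a hash partitioner on the original key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java imitation of secondary sorting; not Hadoop API. */
final class SecondarySortSketch {

    /** Composite key: the original key plus the field the values are sorted by. */
    static final class CompositeKey {
        final String naturalKey;
        final int secondary;

        CompositeKey(String naturalKey, int secondary) {
            this.naturalKey = naturalKey;
            this.secondary = secondary;
        }
    }

    /** Partitions on the original key only, so one key's records share a reducer. */
    static int partition(CompositeKey key, int numPartitions) {
        return (key.naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Sort comparator: orders by the whole composite key. */
    static final Comparator<CompositeKey> SORT =
        Comparator.comparing((CompositeKey k) -> k.naturalKey)
                  .thenComparingInt(k -> k.secondary);

    /** Grouping comparator: treats keys with the same original key as equal. */
    static final Comparator<CompositeKey> GROUP =
        Comparator.comparing((CompositeKey k) -> k.naturalKey);

    /** Sorts the keys, then starts a new group whenever the original key changes. */
    static Map<String, List<Integer>> reduceInput(List<CompositeKey> keys) {
        List<CompositeKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        CompositeKey previous = null;
        List<Integer> current = null;
        for (CompositeKey key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                current = new ArrayList<>();
                groups.put(key.naturalKey, current);
            }
            current.add(key.secondary);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = Arrays.asList(
            new CompositeKey("cat", 3), new CompositeKey("dog", 2),
            new CompositeKey("cat", 1), new CompositeKey("dog", 5),
            new CompositeKey("cat", 2));
        System.out.println(reduceInput(keys)); // prints {cat=[1, 2, 3], dog=[2, 5]}
    }
}
```

Each group reaches the simulated reducer with its values already in order, which is the point of sorting on the composite key while grouping on the original one.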

LINKS
http://hadoop.apache.org/
http://mrunit.apache.org/
http://hbase.apache.org/
http://avro.apache.org/
http://www.cascading.org/
http://pig.apache.org/
http://hive.apache.org/
http://flume.apache.org/
http://oozie.apache.org/
https://github.com/azkaban/azkaban
http://crunch.apache.org/
http://spark.incubator.apache.org/
http://developer.yahoo.com/hadoop/tutorial/
http://sortbenchmark.org/
https://github.com/cloudera/impala
