Hadoop
Saeed Iqbal P11-6501 MSCS
What Is Hadoop?
 • A system for processing mind-bogglingly large amounts
   of data.
 • The Apache Hadoop software library is a framework:
    • that allows for the distributed processing of large data sets
        across clusters of computers using simple programming
        models.
    •   It is designed to scale up from single servers to thousands
        of machines, each offering local computation and storage.
    •   Rather than rely on hardware to deliver high availability,
        the library itself is designed to detect and handle failures
        at the application layer, delivering a highly available
        service on top of a cluster of computers, each of which may
        be prone to failure.
Hadoop Core
• An open-source, flexible, and available architecture
    for large-scale computation and data processing on
    a network of commodity hardware
•   Open-source software + commodity hardware
    •   IT cost reduction


    MapReduce  →  Computation
    HDFS       →  Storage
Hadoop, Why?
• Need to process multi-petabyte datasets
• Expensive to build reliability into each application.
• Nodes fail every day
   • Failure is expected, rather than exceptional.
   • The number of nodes in a cluster is not constant.
• Need common infrastructure
   • Efficient, reliable, open source (Apache License)
• The above goals are the same as Condor's, but
   • workloads are I/O-bound, not CPU-bound
Hadoop History
•   2004—Initial versions of what is now Hadoop Distributed Filesystem and
    MapReduce implemented by Doug Cutting and Mike Cafarella.
•   December 2005—Nutch ported to the new framework. Hadoop runs reliably on
    20 nodes.
•   January 2006—Doug Cutting joins Yahoo!.
•   February 2006—Apache Hadoop project officially started to support the
    standalone development of MapReduce and HDFS.
•   February 2006—Adoption of Hadoop by Yahoo! Grid team.
•   April 2006—Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
•   May 2006—Yahoo! set up a Hadoop research cluster—300 nodes.
•   May 2006—Sort benchmark run on 500 nodes in 42 hours (better hardware than
    April benchmark).
•   October 2006—Research cluster reaches 600 nodes.
•   December 2006—Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3
    hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
•   January 2007—Research cluster reaches 900 nodes.
•   April 2007—Research clusters—2 clusters of 1000 nodes.
•   April 2008—Won the 1 terabyte sort benchmark in 209 seconds on 900 nodes.
•   October 2008—Loading 10 terabytes of data per day onto research clusters.
•   March 2009—17 clusters with a total of 24,000 nodes
•   April 2009—Won the minute sort by sorting 500 GB in 59 seconds (on 1400
    nodes) and the 100 terabyte sort in 173 minutes (on 3400 nodes).
Who uses Hadoop?
Amazon/A9
Facebook
Google
IBM
Joost
Last.fm
New York Times
PowerSet
Veoh
Yahoo!
Twitter.com
How does HDFS Work?
Suppose we have a file of size 300 MB.

(Side note, repeated on the next few slides: MapReduce has undergone a
complete overhaul in hadoop-0.23, and we now have what we call
MapReduce 2.0 (MRv2), or YARN. The fundamental idea of MRv2 is to split
up the two major functionalities of the JobTracker, resource management
and job scheduling/monitoring, into separate daemons.)
How does HDFS Work?
HDFS splits the file into blocks. Each block is 128 MB, so the 300 MB
file becomes two 128 MB blocks and one 44 MB block.
How does HDFS Work?
HDFS will keep 3 copies of each block. HDFS stores these blocks on
DataNodes, distributing the replicas across the DNs.
How does HDFS Work?
The NameNode tracks blocks and DataNodes.
[Diagram: nine DataNodes (DN), all tracked by a single NameNode]
How does HDFS Work?
Sometimes a DataNode will die. That is not a problem: since every block
is replicated three times, the NameNode simply re-replicates the lost
blocks onto the surviving DataNodes. (A command-line sketch follows.)
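A hedged sketch of inspecting this from the command line (file and
directory names are illustrative; hadoop fsck reports each block, its
replicas, and their DataNode locations):

  $ hadoop fs -put bigfile.dat /data/bigfile.dat
  $ hadoop fsck /data/bigfile.dat -files -blocks -locations

If a DataNode dies, re-running fsck should show the affected blocks
re-replicated onto other DataNodes.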
Example for MapReduce
Page 1: the weather is good
Page 2: today is good
Page 3: good weather is good.
Map output
Worker 1:
  (the 1), (weather 1), (is 1), (good 1).
Worker 2:
  (today 1), (is 1), (good 1).
Worker 3:
  (good 1), (weather 1), (is 1), (good 1).
Reduce Output
Worker 1:
  (the 1)
Worker 2:
  (is 3)
Worker 3:
  (weather 2)
Worker 4:
  (today 1)
Worker 5:
  (good 4)
MapReduce Architecture
MapReduce: Programming Model
Process data using special map() and reduce()
  functions
  The map() function is called on every item in the
    input and emits a series of intermediate key/value
    pairs
  All values associated with a given key are grouped
    together
  The reduce() function is called on every unique
    key, and its value list, and emits a value that is
    added to the output
MapReduce: Programming Model

Input (two records):
   "How now brown cow"
   "How does it work now"
Map output (one <word,1> pair per token):
   <How,1> <now,1> <brown,1> <cow,1>
   <How,1> <does,1> <it,1> <work,1> <now,1>
The framework groups values by key:
   <brown,1> <cow,1> <does,1> <How,1 1> <it,1> <now,1 1> <work,1>
Reduce output:
   brown 1, cow 1, does 1, How 2, it 1, now 2, work 1
MapReduce: Programming Model
More formally,
   Map(k1,v1) --> list(k2,v2)
   Reduce(k2, list(v2)) --> list(v2)
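In Hadoop's Java API this notation maps directly onto the Mapper and
Reducer base classes. A minimal sketch of the correspondence (the type
choices here are illustrative; the full WordCount example later uses
the same pattern):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(k1,v1) --> list(k2,v2): here k1 = byte offset, v1 = line of text,
// k2 = word, v2 = count; context.write() may be called any number of times.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  protected void map(LongWritable k1, Text v1, Context ctx)
      throws IOException, InterruptedException {
    for (String tok : v1.toString().split("\\s+")) {
      if (!tok.isEmpty()) ctx.write(new Text(tok), new IntWritable(1));
    }
  }
}

// Reduce(k2, list(v2)) --> list(v2): called once per unique k2, with all
// of that key's values grouped together by the framework.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  protected void reduce(Text k2, Iterable<IntWritable> v2s, Context ctx)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : v2s) sum += v.get();
    ctx.write(k2, new IntWritable(sum));
  }
}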
MapReduce Life Cycle
1. Write a map function.
2. Write a reduce function.
3. Run the program as a MapReduce job.
Hadoop Environment
Hadoop has become the kernel of the distributed
  operating system for Big Data
The project includes these modules:
Hadoop Common: The common utilities that support
  the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A
  distributed file system that provides high-throughput
  access to application data.
Hadoop YARN: A framework for job scheduling and
  cluster resource management.
Hadoop MapReduce: A YARN-based system for
  parallel processing of large data sets.
Hadoop Architecture

(top to bottom)
• Hue (Web Console) · Mahout (Data Mining)
• Oozie (Job Workflow & Scheduling)
• Sqoop/Flume (Data Integration) · Pig/Hive (Analytical Language)
• MapReduce Runtime (Dist. Programming Framework) · HBase (Column NoSQL DB)
• Hadoop Distributed File System (HDFS)
• Zookeeper (Coordination) runs alongside the whole stack
Zookeeper – Coordination
Framework
[Hadoop stack diagram repeated; this section highlights Zookeeper (Coordination)]
What is ZooKeeper
• A centralized service for
   •   maintaining configuration information and naming,
   •   providing distributed synchronization,
   •   and providing group services.
   •   A set of tools to build distributed applications that can
       safely handle partial failures
• ZooKeeper was designed to store coordination data
   •   Status information
   •   Configuration
   •   Location information
ZooKeeper
• ZooKeeper allows distributed processes to coordinate
  with each other through a shared hierarchical name
  space of data registers (we call these registers
  znodes), much like a file system.
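A minimal sketch of this in ZooKeeper's Java client API (the ensemble
address, timeout, and znode path are illustrative assumptions):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
  public static void main(String[] args) throws Exception {
    // Connect to a ZooKeeper ensemble (address/timeout are illustrative)
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

    // Create a znode holding a small piece of coordination data
    zk.create("/demo-config", "replication=3".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // Any process connected to the same ensemble reads the same value
    byte[] data = zk.getData("/demo-config", false, null);
    System.out.println(new String(data));

    zk.close();
  }
}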
Flume / Sqoop – Data Integration
Framework
[Hadoop stack diagram repeated; this section highlights Sqoop/Flume (Data Integration)]
FLUME

• Flume is:
   • a distributed, reliable, and available data collection service
   • that efficiently collects, aggregates, and moves large
     amounts of log data
   • fault tolerant, with many failover and recovery mechanisms
• One-stop solution for data collection of all formats
• It has a simple and flexible architecture based on
  streaming data flows.
Flume: High-Level Overview
• Logical Node
• Source
• Sink
(These concepts map onto an agent configuration, sketched below.)
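A hedged sketch of a Flume 1.x (NG) agent definition, where a source
feeds events through a channel into a sink (agent, source, channel, and
sink names, the log path, and the HDFS URL are all illustrative):

# source: tail-like collection of a log file
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app.log
agent1.sources.src1.channels = ch1

# channel: in-memory buffer between source and sink
agent1.channels.ch1.type = memory

# sink: write events into HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events
agent1.sinks.sink1.channel = ch1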
Flume Architecture
[Diagram: many log sources feed Flume Nodes, which aggregate events into HDFS]

December 2nd, 2012 - Apache Flume 1.3.0 released
Sqoop
• Apache Sqoop(TM) is a tool designed for efficiently
    transferring bulk data between Apache Hadoop and
    structured data stores such as relational databases.
•   Easy, parallel database import/export
•   What do you want to do?
    •   Import data from an RDBMS into HDFS
    •   Export data from HDFS back into an RDBMS
Sqoop Architecture & Example (March 2012)
[Diagram: Sqoop sits between an RDBMS and HDFS, moving data in both directions]

     $ sqoop import --connect jdbc:mysql://localhost/world --username root --table City
     ...

     $ hadoop fs -cat City/part-m-00000
     1,Kabul,AFG,Kabol,1780000
     2,Qandahar,AFG,Qandahar,237500
     3,Herat,AFG,Herat,186800
     4,Mazar-e-Sharif,AFG,Balkh,127800
     5,Amsterdam,NLD,Noord-Holland,731200
     ...
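The reverse direction is symmetric. A hedged sketch (the target table
City_copy is illustrative and must already exist in the database;
--export-dir points at the HDFS files produced by the import above):

     $ sqoop export --connect jdbc:mysql://localhost/world --username root \
         --table City_copy --export-dir City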
Pig / Hive – Analytical Language
[Hadoop stack diagram repeated; this section highlights Pig/Hive (Analytical Language)]
Why Hive and Pig?
Although MapReduce is very powerful, it can also be
   complex to master
Many organizations have business or data analysts who
   are skilled at writing SQL queries, but not at writing
   Java code
Many organizations have programmers who are skilled
   at writing code in scripting languages
Hive and Pig are two projects which evolved separately
   to help such people analyze huge amounts of data
   via MapReduce
   Hive was initially developed at Facebook, Pig at Yahoo!
Hive
What is Hive?
   An SQL-like interface to Hadoop
Data warehouse infrastructure that provides easy data
  summarization, ad hoc querying, and analysis of
  large datasets stored in Hadoop-compatible file
  systems.
   MapReduce for execution
   HDFS for storage
Hive Query Language
   Basic SQL: Select, From, Join, Group-By
   Equi-Join, Multi-Table Insert, Multi-Group-By
   Batch query
SELECT storeid, count(*) FROM purchases WHERE price > 100 GROUP BY storeid;
Hive

SQL → Hive → MapReduce
Pig
Apache Pig is a platform for analyzing large data sets.
In simple terms: you have lots and lots of data on which
   you need to do some processing or analysis. One way
   is to write MapReduce code and then run that
   processing on the data.
The other way is to write Pig scripts, which are in turn
   converted to MapReduce code that processes
   your data.
Pig consists of two parts:
•   the Pig Latin language
•   the Pig engine

                         A = LOAD 'a.txt' AS (id, name, age, ...);
                         B = LOAD 'b.txt' AS (id, address, ...);
                         C = JOIN A BY id, B BY id;
                         STORE C INTO 'c.txt';
Pig Latin & Pig Engine
• Pig Latin is a scripting language that lets you describe
    how data flowing from one or more inputs should be
    read, how it should be processed, and then where it
    should be stored.
•   The flows can be simple or complex, with processing
    applied in between. Data can be picked from
    multiple inputs.
•   We can say Pig Latin describes a directed acyclic
    graph where the edges are data flows and the nodes
    are operators that process the data.
•   Pig Engine:
•   The job of the engine is to execute the data flow written
    in Pig Latin in parallel on the Hadoop infrastructure
    (see the run sketch below).
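Handing a script to the engine is a one-line command (a sketch; the
script name is illustrative):

    $ pig -x mapreduce myflow.pig    # compile and run on the Hadoop cluster
    $ pig -x local myflow.pig        # run the same script in-process for testing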
Why is Pig required when we can code it all in MR?
• Pig provides all the standard data processing operations,
    like sort, group, join, filter, order by, and union, right
    inside Pig Latin.
•   In MR we have to do lots of manual coding;
    Pig optimizes Pig Latin scripts while
    compiling them into MR jobs.
•   It creates an optimized set of MapReduce jobs to run on
    the Hadoop cluster.
•   It takes much less time to write a Pig Latin script than to
    write the corresponding MR code.
•   Where Pig is useful:
    Transactional ETL data pipelines (the most common use)
    Research on raw data
    Iterative processing
Pig

Script → Pig → MapReduce
WordCount Example
• Input
  Hello World Bye World
  Hello Hadoop Goodbye Hadoop
• For the given sample input the map emits
  <   Hello, 1>
  <   World, 1>
  <   Bye, 1>
  <   World, 1>
  <   Hello, 1>
  <   Hadoop, 1>
  <   Goodbye, 1>
  <   Hadoop, 1>

• The reduce just sums up the values
  <   Bye, 1>
  <   Goodbye, 1>
  <   Hadoop, 2>
  <   Hello, 2>
  <   World, 2>
WordCount Example In MapReduce
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each unique word
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver: configures and submits the job
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}
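A hedged sketch of compiling and submitting the job (jar name and
input/output paths are illustrative):

    $ javac -classpath `hadoop classpath` -d classes WordCount.java
    $ jar -cvf wordcount.jar -C classes .
    $ hadoop jar wordcount.jar WordCount wordcount/input wordcount/output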
WordCount Example By Pig

-- assumes the input has one token per line (as in the Hive example below)
A = LOAD 'wordcount/input' USING PigStorage() AS (token:chararray);

B = GROUP A BY token;

C = FOREACH B GENERATE group, COUNT(A) AS count;

DUMP C;
WordCount Example By Hive

-- assumes the input has one token per line
CREATE TABLE wordcount (token STRING);

LOAD DATA LOCAL INPATH 'wordcount/input'
OVERWRITE INTO TABLE wordcount;

SELECT token, count(*) FROM wordcount GROUP BY token;
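If each input line holds many words, a variant using Hive's built-in
split() and explode() drops the one-token-per-line assumption (a
sketch; the table name lines is illustrative):

CREATE TABLE lines (line STRING);
LOAD DATA LOCAL INPATH 'wordcount/input' OVERWRITE INTO TABLE lines;

SELECT word, count(*) FROM lines
LATERAL VIEW explode(split(line, ' ')) t AS word
GROUP BY word;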
HBase – Column NoSQL DB
[Hadoop stack diagram repeated; this section highlights HBase (Column NoSQL DB)]
Structured Data vs. Raw Data
• Apache HBase™ is the Hadoop database, a
  distributed, scalable, big data store.
  • Apache HBase is an open-source, distributed, versioned, column-
    oriented store modeled after Google's Bigtable: A Distributed Storage
    System for Structured Data by Chang et al.

• Coordinated by ZooKeeper
• Low latency
• Random reads and writes
• Distributed key/value store
• Simple API
  –   PUT
  –   GET
  –   DELETE
  –   SCAN
HBase
HBase is a type of "NoSQL" database. NoSQL?
 "NoSQL" is a general term meaning that the database
 isn't an RDBMS which supports SQL as its primary
 access language, but there are many types of NoSQL
 databases: BerkeleyDB is an example of a local
 NoSQL database, whereas HBase is very much a
 distributed database.

 Technically speaking, HBase is really more a "Data
 Store" than "Data Base" because it lacks many of the
 features you find in an RDBMS, such as typed
 columns, secondary indexes, triggers, and advanced
 query languages, etc.
HBase Examples
hbase>   create 'mytable', 'mycf'
hbase>   list
hbase>   put 'mytable', 'row1', 'mycf:col1', 'val1'
hbase>   put 'mytable', 'row1', 'mycf:col2', 'val2'
hbase>   put 'mytable', 'row2', 'mycf:col1', 'val3'
hbase>   scan 'mytable'
hbase>   disable 'mytable'
hbase>   drop 'mytable'
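The same PUT/GET operations from the 0.9x-era Java client API, a
minimal sketch against the table created above (configuration is read
from hbase-site.xml on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");

    // PUT: write val1 into row1, column mycf:col1
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("mycf"), Bytes.toBytes("col1"), Bytes.toBytes("val1"));
    table.put(put);

    // GET: read the cell back
    Result result = table.get(new Get(Bytes.toBytes("row1")));
    byte[] value = result.getValue(Bytes.toBytes("mycf"), Bytes.toBytes("col1"));
    System.out.println(Bytes.toString(value));

    table.close();
  }
}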




HBase reference: http://hbase.apache.org




Oozie – Job Workflow & Scheduling

[Hadoop stack diagram repeated; this section highlights Oozie (Job Workflow & Scheduling)]
What is Oozie?
Oozie is a server-based workflow scheduler system to
   manage Apache Hadoop jobs (e.g. loading data, storing
   data, analyzing data, cleaning data, running MapReduce
   jobs, etc.)
   A Java web application
Oozie is a workflow scheduler for Hadoop; workflows can be
   triggered by time or by data availability.
[Diagram: a workflow DAG in which Job 1 and Job 2 feed Job 3, which feeds Job 4 and Job 5]
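A hedged sketch of an Oozie workflow definition (workflow.xml) that
runs a single MapReduce action; the workflow name, parameter names, and
schema version are illustrative:

<workflow-app name="wordcount-wf" xmlns="uri:oozie:workflow:0.2">
  <start to="wordcount"/>
  <action name="wordcount">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Job failed</message>
  </kill>
  <end name="end"/>
</workflow-app>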
Mahout – Data Mining

[Hadoop stack diagram repeated; this section highlights Mahout (Data Mining)]
What is Mahout?
A machine-learning tool
Distributed and scalable machine learning algorithms
  on the Hadoop platform
Makes building intelligent applications easier and faster
Its core algorithms for clustering, classification, and
  batch-based collaborative filtering are implemented
  on top of Apache Hadoop using the map/reduce
  paradigm.
Mahout Use Cases
Yahoo: Spam Detection
Foursquare: Recommendations
SpeedDate.com: Recommendations
Adobe: User Targeting
Amazon: Personalization Platform




Use Case Example
Predict what a user likes based on
   their historical behavior
   the aggregate behavior of people similar to them
(a sketch using Mahout's recommender API follows)
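A hedged sketch with Mahout's Taste user-based recommender (the file
name is illustrative; ratings.csv is assumed to hold one
userID,itemID,rating triple per line):

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
  public static void main(String[] args) throws Exception {
    // Historical behavior: userID,itemID,rating per line
    DataModel model = new FileDataModel(new File("ratings.csv"));
    // "People similar to them": the 10 nearest users by rating correlation
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // Top 5 predicted items for user 1
    List<RecommendedItem> recs = recommender.recommend(1, 5);
    for (RecommendedItem item : recs) {
      System.out.println(item.getItemID() + " " + item.getValue());
    }
  }
}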
Recap – Hadoop Architecture
[Hadoop stack diagram repeated; see the Hadoop Architecture slide above]
THANKS
