Hadoop Product Family
and Ecosystem
PC Liao
Agenda

• What is BigData?
• What is the problem?
• Hadoop
  – Introduction to Hadoop
  – Hadoop components
  – What sort of problems can be solved with Hadoop?
• Hadoop ecosystem
• Conclusion
What is BigData?




  A set of files   A database   A single file
The Data-Driven World

• Modern systems have to deal with far more data than
  was the case in the past
  – Organizations are generating huge amounts of data
  – That data has inherent value, and cannot be discarded
• Examples:
  – Yahoo – over 170PB of data
  – Facebook – over 30PB of data
  – eBay – over 5PB of data




• Many organizations are generating data at a rate of
  terabytes per day
What is the problem

• Traditionally, computation has been processor-bound
• For decades, the primary push was to increase the
  computing power of a single machine
  – Faster processor, more RAM
• Distributed systems evolved to allow developers to use
  multiple machines for a single job
  – At compute time, data is copied to the compute nodes
What is the problem

• Getting the data to the processors
  becomes the bottleneck


• Quick calculation
   – Typical disk data transfer rate:
      • 75MB/sec
   – Time taken to transfer 100GB of data
     to the processor:
      • approx. 22   minutes!
What is the problem

• Failure of a component may cost a lot
• What we need when job fail?
  – May result in a graceful degradation of application performance,
    but entire system does not completely fail
  – Should not result in the loss of any data
  – Would not affect the outcome of the job
Big Data Solutions by Industries
The most common problems Hadoop can solve
Threat Analysis/Trade Surveillance

• Challenge:
  – Detecting threats in the form of fraudulent activity or attacks
     • Large data volumes involved
     • Like looking for a needle in a haystack

• Solution with Hadoop:
  – Parallel processing over huge datasets
  – Pattern recognition to identify anomalies
     • – i.e., threats

• Typical Industry:
  – Security, Financial Services
Big Data Use Case
Smart Protection Network
• Challenge
   – Information accessibility and transparency problems
     for threat researcher due to the size and source of
     data (volume, variety and velocity)

• Size of Data
   – Overall Data
        •   Data sources: 20+
        •   Data fields: 1000+
        •   Daily new records: 23 Billion+
        •   Daily new data size: 4TB+


   SPN Smart Feedback
   •   Feedback components: 26
   •   Data fields : 300+
   •   Daily new file counts: 6 Million+
   •   Daily new records: 90 Million+
   •   Daily new data size: 261GB+
Index=“vsapi” zbot
Recommendation Engine

• Challenge:
  – Using user data to predict which products to recommend
• Solution with Hadoop:
  – Batch processing framework
     • Allow execution in in parallel over large datasets
  – Collaborative filtering
     • Collecting „taste‟ information from many users
     • Utilizing information to predict what similar users like

• Typical Industry
  – ISP, Advertising
Walmart Case


                        Diapers
               Beer


                      Friday




               Revenue         ?
Hadoop!
– inspired by
• Apache Hadoop project
  – inspired by Google's MapReduce and Google File System
    papers.
• Open sourced, flexible and available architecture for
  large scale computation and data processing on a
  network of commodity hardware
• Open Source Software + Hardware Commodity
  – IT Costs Reduction
Hadoop Concepts

• Distribute the data as it is initially stored in the system
• Individual nodes can work on data local to those nodes
• Users can focus on developing applications.
Hadoop Components

• Hadoop consists of two core components
  – The Hadoop Distributed File System (HDFS)
  – MapReduce Software Framework
• There are many other projects based around core
  Hadoop
  – Often referred to as the „Hadoop Ecosystem‟
  – Pig, Hive, HBase, Flume, Oozie, Sqoop, etc

                                                                                  Hue                         Mahout
                                                                             (Web Console)                   (Data Mining)

                                                                                                 Oozie
                                                                                      (Job Workflow & Scheduling)




                                          (Coordination)
                                                           Zookeeper
                                                                             Sqoop/Flume                 Pig/Hive (Analytical
                                                                            (Data integration)                Language)


                                                                       MapReduce Runtime
                                                                       (Dist. Programming Framework)            Hbase
                                                                                                          (Column NoSQL DB)


                                                                             Hadoop Distributed File System (HDFS)
Hadoop Components: HDFS

• HDFS, the Hadoop Distributed File System, is
  responsible for storing data on the cluster
• Two roles in HDFS
  – Namenode: Record metadata
  – Datanode: Store data



                                                                            Hue                         Mahout
                                                                       (Web Console)                   (Data Mining)

                                                                                           Oozie
                                                                                (Job Workflow & Scheduling)




                                    (Coordination)
                                                     Zookeeper
                                                                       Sqoop/Flume                 Pig/Hive (Analytical
                                                                      (Data integration)                Language)


                                                                 MapReduce Runtime
                                                                 (Dist. Programming Framework)            Hbase
                                                                                                    (Column NoSQL DB)


                                                                       Hadoop Distributed File System (HDFS)
How Files Are Stored: Example
                       • NameNode holds metadata for the
                         data files
                       • DataNodes hold the actual blocks
                          • Each block is replicated three
                             times on the cluster
HDFS: Points To Note
                       • When a client application
                         wants to read a file:
                          • It communicates with
                             the NameNode to
                             determine which
                             blocks make up the
                             file, and which
                             DataNodes those
                             blocks reside on
                          • It then
                             communicates
                             directly with the
                             DataNodes to read
                             the data
Hadoop Components: MapReduce

• MapReduce is a method for distributing a task across
  multiple nodes
• It works like a Unix pipeline:
  – cat input | grep | sort     | uniq -c | cat > output
  – Input | Map | Shuffle & Sort | Reduce | Output



                                                                                Hue                         Mahout
                                                                           (Web Console)                   (Data Mining)

                                                                                               Oozie
                                                                                    (Job Workflow & Scheduling)




                                        (Coordination)
                                                         Zookeeper
                                                                           Sqoop/Flume                 Pig/Hive (Analytical
                                                                          (Data integration)                Language)


                                                                     MapReduce Runtime
                                                                     (Dist. Programming Framework)            Hbase
                                                                                                        (Column NoSQL DB)


                                                                           Hadoop Distributed File System (HDFS)
Features of MapReduce

• Automatic parallelization and distribution
• Automatic re-execution on failure
• Locality optimizations
• MapReduce abstracts all the „housekeeping‟ away from
  the developer
  – Developer can concentrate simply on writing the Map and
    Reduce functions
                                                                                 Hue                         Mahout
                                                                            (Web Console)                   (Data Mining)

                                                                                                Oozie
                                                                                     (Job Workflow & Scheduling)




                                         (Coordination)
                                                          Zookeeper
                                                                            Sqoop/Flume                 Pig/Hive (Analytical
                                                                           (Data integration)                Language)


                                                                      MapReduce Runtime
                                                                      (Dist. Programming Framework)            Hbase
                                                                                                         (Column NoSQL DB)


                                                                            Hadoop Distributed File System (HDFS)
Example : word count

• Word count is challenging over massive amounts of
  data
  – Using a single compute node would be too time-consuming
  – Number of unique words can easily exceed the RAM
• MapReduce breaks complex tasks down into smaller
  elements which can be executed in parallel
• More nodes, more faster
Word Count Example




       Key: offset
       Value: line

                            Key: word      Key: word
                            Value: count   Value: sum of count



0:The cat sat on the mat
22:The aardvark sat on the sofa
The Hadoop Ecosystems
Growing Hadoop Ecosystem

• The term „Hadoop‟ is taken to be the combination of
  HDFS and MapReduce
• There are numerous other projects surrounding Hadoop
  – Typically referred to as the „Hadoop Ecosystem‟
     •   Zookeeper
     •   Hive and Pig
     •   HBase
     •   Flume
     •   Other Ecosystem Projects
          – Sqoop
          – Oozie
          – Hue
          – Mahout
The Ecosystem is the System

• Hadoop has become the kernel of the distributed
  operating system for Big Data
• No one uses the kernel alone
• A collection of projects at Apache
Relation Map

                                                Hue                                   Mahout
                                            (Web Console)                         (Data Mining)


                                                                  Oozie
                                                        (Job Workflow & Scheduling)
    (Coordination)
                     Zookeeper




                                           Sqoop/Flume
                                                                          Pig/Hive (Analytical Language)
                                          (Data integration)



                                 MapReduce Runtime
                                 (Dist. Programming Framework)                         Hbase
                                                                                (Column NoSQL DB)


                                              Hadoop Distributed File System (HDFS)
Zookeeper – Coordination Framework

                                                Hue                                   Mahout
                                            (Web Console)                         (Data Mining)


                                                                  Oozie
                                                        (Job Workflow & Scheduling)
    (Coordination)
                     Zookeeper




                                           Sqoop/Flume
                                                                          Pig/Hive (Analytical Language)
                                          (Data integration)



                                 MapReduce Runtime
                                 (Dist. Programming Framework)                         Hbase
                                                                                (Column NoSQL DB)


                                              Hadoop Distributed File System (HDFS)
What is ZooKeeper

• A centralized service for maintaining
  – Configuration information
  – Providing distributed synchronization
• A set of tools to build distributed applications that can
  safely handle partial failures
• ZooKeeper was designed to store coordination data
  – Status information
  – Configuration
  – Location information
Why use ZooKeeper?

• Manage configuration across nodes
• Implement reliable messaging
• Implement redundant services
• Synchronize process execution
ZooKeeper Architecture




   – All servers store a copy of the data (in memory)
   – A leader is elected at startup
   – 2 roles – leader and follower
     • Followers service clients, all updates go through leader
     • Update responses are sent when a majority of servers have persisted the
        change
   – HA support
Hbase – Column NoSQL DB

                                               Hue                                   Mahout
                                           (Web Console)                         (Data Mining)


                                                                 Oozie
                                                       (Job Workflow & Scheduling)
   (Coordination)
                    Zookeeper




                                          Sqoop/Flume
                                                                         Pig/Hive (Analytical Language)
                                         (Data integration)



                                MapReduce Runtime
                                (Dist. Programming Framework)                         Hbase
                                                                               (Column NoSQL DB)


                                             Hadoop Distributed File System (HDFS)
Structured-data vs Raw-data
I – Inspired by

• Apache open source project
• Inspired from Google Big Table
• Non-relational, distributed database written in Java
• Coordinated by Zookeeper
Row & Column Oriented
Hbase – Data Model

• Cells are “versioned”
• Table rows are sorted by row key
• Region – a row range [start-key:end-key]
Architecture

• Master Server (HMaster)
  – Assigns regions to regionservers
  – Monitors the health of regionservers
• RegionServers
  – Contain regions and handle client read/write request
Hbase – workflow
When to use HBase

• Need random, low latency access to the data
• Application has a variable schema where each row is
  slightly different
• Add columns
• Most of columns are NULL in each row
Flume / Sqoop – Data Integration Framework

                                                Hue                                   Mahout
                                            (Web Console)                         (Data Mining)


                                                                  Oozie
                                                        (Job Workflow & Scheduling)
    (Coordination)
                     Zookeeper




                                           Sqoop/Flume
                                                                          Pig/Hive (Analytical Language)
                                          (Data integration)



                                 MapReduce Runtime
                                 (Dist. Programming Framework)                         Hbase
                                                                                (Column NoSQL DB)


                                              Hadoop Distributed File System (HDFS)
What‟s the problem for data collection

• Data collection is currently a priori and ad hoc
• A priori – decide what you want to collect ahead of time
• Ad hoc – each kind of data source goes through its own
  collection path
(and how can it help?)



• A distributed data collection service
• It efficiently collecting, aggregating, and moving large
  amounts of data
• Fault tolerant, many failover and recovery mechanism
• One-stop solution for data collection of all formats
Flume: High-Level Overview
• Logical Node
• Source
• Sink
Architecture

• basic diagram
  – one master control multiple node
Architecture

• multiple master control multiple node
An example flow
Flume / Sqoop – Data Integration Framework

                                                Hue                                   Mahout
                                            (Web Console)                         (Data Mining)


                                                                  Oozie
                                                        (Job Workflow & Scheduling)
    (Coordination)
                     Zookeeper




                                           Sqoop/Flume
                                                                          Pig/Hive (Analytical Language)
                                          (Data integration)



                                 MapReduce Runtime
                                 (Dist. Programming Framework)                         Hbase
                                                                                (Column NoSQL DB)


                                              Hadoop Distributed File System (HDFS)
Sqoop

• Easy, parallel database import/export
• What you want do?
  – Insert data from RDBMS to HDFS
  – Export data from HDFS back into RDBMS
What is Sqoop

• A suite of tools that connect Hadoop and database
  systems
• Import tables from databases into HDFS for deep
  analysis
• Export MapReduce results back to a database for
  presentation to end-users
• Provides the ability to import from SQL databases
  straight into your Hive data warehouse
How Sqoop helps

• The Problem
  – Structured data in traditional databases cannot be easily
    combined with complex data stored in HDFS
• Sqoop (SQL-to-Hadoop)
  – Easy import of data from many databases to HDFS
  – Generate code for use in MapReduce applications
Sqoop - import process
Sqoop - export process

• Exports are performed in parallel using MapReduce
Why Sqoop

• JDBC-based implementation
  – Works with many popular database vendors
• Auto-generation of tedious user-side code
  – Write MapReduce applications to work with your data, faster
• Integration with Hive
  – Allows you to stay in a SQL-based environment
Sqoop - JOB
• Job management options




• E.g sqoop job –create myjob –import –connect xxxxxxx
  --table mytable
Pig / Hive – Analytical Language

                                                Hue                                   Mahout
                                            (Web Console)                         (Data Mining)


                                                                  Oozie
                                                        (Job Workflow & Scheduling)
    (Coordination)
                     Zookeeper




                                           Sqoop/Flume
                                                                          Pig/Hive (Analytical Language)
                                          (Data integration)



                                 MapReduce Runtime
                                 (Dist. Programming Framework)                         Hbase
                                                                                (Column NoSQL DB)


                                              Hadoop Distributed File System (HDFS)
Why Hive and Pig?

• Although MapReduce is very powerful, it can also be
  complex to master
• Many organizations have business or data analysts who
  are skilled at writing SQL queries, but not at writing Java
  code
• Many organizations have programmers who are skilled
  at writing code in scripting languages
• Hive and Pig are two projects which evolved separately
  to help such people analyze huge amounts of data via
  MapReduce
  – Hive was initially developed at Facebook, Pig at Yahoo!
Hive     – Developed by

• What is Hive?
  – An SQL-like interface to Hadoop
• Data Warehouse infrastructure that provides data
  summarization and ad hoc querying on top of Hadoop
  – MapRuduce for execution
  – HDFS for storage
• Hive Query Language
  – Basic-SQL : Select, From, Join, Group-By
  – Equi-Join, Muti-Table Insert, Multi-Group-By
  – Batch query
 SELECT * FROM purchases WHERE price > 100 GROUP BY storeid
Pig                 – Initiated by




• A high-level scripting language (Pig Latin)
• Process data one step at a time
• Simple to write MapReduce program
• Easy understand
• Easy debug         A = load ‘a.txt’ as (id, name, age, ...)
                     B = load ‘b.txt’ as (id, address, ...)
                     C = JOIN A BY id, B BY id;STORE C into ‘c.txt’
Hive vs. Pig

                  Hive                   Pig
Language          HiveQL (SQL-like)      Pig Latin, a scripting language
Schema            Table definitions      A schema is optionally defined
                  that are stored in a   at runtime
                  metastore
Programmait Access JDBC, ODBC            PigServer
WordCount Example

• Input
  Hello World Bye World
  Hello Hadoop Goodbye Hadoop
• For the given sample input the map emits
  < Hello, 1>
  < World, 1>
  < Bye, 1>
  < World, 1>
  < Hello, 1>
  < Hadoop, 1>
  < Goodbye, 1>
  < Hadoop, 1>

   < Bye, 1>
• the reduce just sums up the values
  < Goodbye, 1>
  < Hadoop, 2>
  < Hello, 2>
  < World, 2>
WordCount Example In MapReduce
public class WordCount {
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
         word.set(tokenizer.nextToken());
         context.write(word, one);
      }
    }
}

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
     int sum = 0;
     for (IntWritable val : values) {
        sum += val.get();
     }
     context.write(key, new IntWritable(sum));
  }
}

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();

    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
}
WordCount Example By Pig


A = LOAD 'wordcount/input' USING PigStorage as (token:chararray);

B = GROUP A BY token;

C = FOREACH B GENERATE group, COUNT(A) as count;

DUMP C;
WordCount Example By Hive

CREATE TABLE wordcount (token STRING);

LOAD DATA LOCAL INPATH ‟wordcount/input'
OVERWRITE INTO TABLE wordcount;


SELECT count(*) FROM wordcount GROUP BY token;
Oozie – Job Workflow & Scheduling

                                                Hue                                   Mahout
                                            (Web Console)                         (Data Mining)


                                                                  Oozie
                                                        (Job Workflow & Scheduling)
    (Coordination)
                     Zookeeper




                                           Sqoop/Flume
                                                                          Pig/Hive (Analytical Language)
                                          (Data integration)



                                 MapReduce Runtime
                                 (Dist. Programming Framework)                         Hbase
                                                                                (Column NoSQL DB)


                                              Hadoop Distributed File System (HDFS)
What is                       ?

• A Java Web Application
• Oozie is a workflow scheduler for Hadoop
• Crond for Hadoop
                       Job 1 Job 2

                            Job 3

                       Job 4 Job 5
Why

• Why use Oozie instead of just cascading a jobs one
  after another
• Major flexibility
  – Start, Stop, Suspend, and re-run jobs
• Oozie allows you to restart from a failure
  – You can tell Oozie to restart a job from a specific node in the
    graph or to skip specific failed nodes
High Level Architecture

• Web Service API
• database store :
  – Workflow definitions
  – Currently running workflow instances, including instance states
    and variables



                 Oozie

        WS       Tomcat
                                      Hadoop/Pig/HDFS
        API      web-app



                   DB
How it triggered

• Time
   – Execute your workflow every 15 minutes


     00:15           00:30     00:45      01:00


 • Time and Data
    – Materialize your workflow every hour, but only run them when
      the input data is ready.
                               Hadoop
Input Data Exists?


               01:00         02:00      03:00     04:00
Exeample Workflow
Oozie use criteria

• Need Launch, control, and monitor jobs from your Java
  Apps
  – Java Client API/Command Line Interface
• Need control jobs from anywhere
  – Web Service API
• Have jobs that you need to run every hour, day, week
• Need receive notification when a job done
  – Email when a job is complete
Hue – Web Console

                                               Hue                                   Mahout
                                           (Web Console)                         (Data Mining)


                                                                 Oozie
                                                       (Job Workflow & Scheduling)
   (Coordination)
                    Zookeeper




                                          Sqoop/Flume
                                                                         Pig/Hive (Analytical Language)
                                         (Data integration)



                                MapReduce Runtime
                                (Dist. Programming Framework)                         Hbase
                                                                               (Column NoSQL DB)


                                             Hadoop Distributed File System (HDFS)
Hue – developed by

• Hadoop User Experience
• Apache Open source project
• HUE is a web UI for Hadoop
• Platform for building custom applications with a nice UI
  library
Hue

• HUE comes with a suite of applications
  – File Browser: Browse HDFS; change permissions and
    ownership; upload, download, view and edit files.
  – Job Browser: View jobs, tasks, counters, logs, etc.
  – Beeswax: Wizards to help create Hive tables, load data, run and
    manage Hive queries, and download results in Excel format.
Hue: File Browser UI
Hue: Beewax UI
Mahout – Data Mining

                                                Hue                                   Mahout
                                            (Web Console)                         (Data Mining)


                                                                  Oozie
                                                        (Job Workflow & Scheduling)
    (Coordination)
                     Zookeeper




                                           Sqoop/Flume
                                                                          Pig/Hive (Analytical Language)
                                          (Data integration)



                                 MapReduce Runtime
                                 (Dist. Programming Framework)                         Hbase
                                                                                (Column NoSQL DB)


                                              Hadoop Distributed File System (HDFS)
What is

• Machine-learning tool
• Distributed and scalable machine learning algorithms on
  the Hadoop platform
• Building intelligent applications easier and faster
Why

• Current state of ML libraries
  –   Lack Community
  –   Lack Documentation and Examples
  –   Lack Scalability
  –   Are Research oriented
Mahout – scale

• Scale to large datasets
  – Hadoop MapReduce implementations that scales linearly with
    data
• Scalable to support your business case
  – Mahout is distributed under a commercially friendly Apache
    Software license
• Scalable community
  – Vibrant, responsive and diverse
Mahout – four use cases

• Mahout machine learning algorithms
  – Recommendation mining : takes users‟ behavior and find items
    said specified user might like
  – Clustering : takes e.g. text documents and groups them based
    on related document topics
  – Classification : learns from existing categorized documents what
    specific category documents look like and is able to assign
    unlabeled documents to appropriate category
  – Frequent item set mining : takes a set of item groups (e.g. terms
    in query session, shopping cart content) and identifies, which
    individual items typically appear together
Use case Example

• Predict what the user likes based on
  – His/Her historical behavior
  – Aggregate behavior of people similar to him
Conclusion

Today, we introduced:
• Why Hadoop is needed
• The basic concepts of HDFS and MapReduce
• What sort of problems can be solved with Hadoop
• What other projects are included in the Hadoop
  ecosystem
Recap – Hadoop Ecosystem

                                               Hue                                   Mahout
                                           (Web Console)                         (Data Mining)


                                                                 Oozie
                                                       (Job Workflow & Scheduling)
   (Coordination)
                    Zookeeper




                                          Sqoop/Flume
                                                                         Pig/Hive (Analytical Language)
                                         (Data integration)



                                MapReduce Runtime
                                (Dist. Programming Framework)                         Hbase
                                                                               (Column NoSQL DB)


                                             Hadoop Distributed File System (HDFS)
Questions?
Thank you!

Hadoop Family and Ecosystem

  • 1.
    Hadoop Product Family andEcosystem PC Liao
  • 2.
    Agenda • What isBigData? • What is the problem? • Hadoop – Introduction to Hadoop – Hadoop components – What sort of problems can be solved with Hadoop? • Hadoop ecosystem • Conclusion
  • 3.
    What is BigData? A set of files A database A single file
  • 4.
    The Data-Driven World •Modern systems have to deal with far more data than was the case in the past – Organizations are generating huge amounts of data – That data has inherent value, and cannot be discarded • Examples: – Yahoo – over 170PB of data – Facebook – over 30PB of data – eBay – over 5PB of data • Many organizations are generating data at a rate of terabytes per day
  • 5.
    What is theproblem • Traditionally, computation has been processor-bound • For decades, the primary push was to increase the computing power of a single machine – Faster processor, more RAM • Distributed systems evolved to allow developers to use multiple machines for a single job – At compute time, data is copied to the compute nodes
  • 6.
    What is theproblem • Getting the data to the processors becomes the bottleneck • Quick calculation – Typical disk data transfer rate: • 75MB/sec – Time taken to transfer 100GB of data to the processor: • approx. 22 minutes!
  • 7.
    What is theproblem • Failure of a component may cost a lot • What we need when job fail? – May result in a graceful degradation of application performance, but entire system does not completely fail – Should not result in the loss of any data – Would not affect the outcome of the job
  • 8.
    Big Data Solutionsby Industries The most common problems Hadoop can solve
  • 9.
    Threat Analysis/Trade Surveillance •Challenge: – Detecting threats in the form of fraudulent activity or attacks • Large data volumes involved • Like looking for a needle in a haystack • Solution with Hadoop: – Parallel processing over huge datasets – Pattern recognition to identify anomalies • – i.e., threats • Typical Industry: – Security, Financial Services
  • 10.
    Big Data UseCase Smart Protection Network • Challenge – Information accessibility and transparency problems for threat researcher due to the size and source of data (volume, variety and velocity) • Size of Data – Overall Data • Data sources: 20+ • Data fields: 1000+ • Daily new records: 23 Billion+ • Daily new data size: 4TB+ SPN Smart Feedback • Feedback components: 26 • Data fields : 300+ • Daily new file counts: 6 Million+ • Daily new records: 90 Million+ • Daily new data size: 261GB+
  • 11.
  • 13.
    Recommendation Engine • Challenge: – Using user data to predict which products to recommend • Solution with Hadoop: – Batch processing framework • Allow execution in in parallel over large datasets – Collaborative filtering • Collecting „taste‟ information from many users • Utilizing information to predict what similar users like • Typical Industry – ISP, Advertising
  • 14.
    Walmart Case Diapers Beer Friday Revenue ?
  • 15.
  • 16.
    – inspired by •Apache Hadoop project – inspired by Google's MapReduce and Google File System papers. • Open sourced, flexible and available architecture for large scale computation and data processing on a network of commodity hardware • Open Source Software + Hardware Commodity – IT Costs Reduction
  • 17.
    Hadoop Concepts • Distributethe data as it is initially stored in the system • Individual nodes can work on data local to those nodes • Users can focus on developing applications.
  • 18.
    Hadoop Components • Hadoopconsists of two core components – The Hadoop Distributed File System (HDFS) – MapReduce Software Framework • There are many other projects based around core Hadoop – Often referred to as the „Hadoop Ecosystem‟ – Pig, Hive, HBase, Flume, Oozie, Sqoop, etc Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical (Data integration) Language) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 19.
    Hadoop Components: HDFS •HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster • Two roles in HDFS – Namenode: Record metadata – Datanode: Store data Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical (Data integration) Language) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 20.
    How Files AreStored: Example • NameNode holds metadata for the data files • DataNodes hold the actual blocks • Each block is replicated three times on the cluster
  • 21.
    HDFS: Points ToNote • When a client application wants to read a file: • It communicates with the NameNode to determine which blocks make up the file, and which DataNodes those blocks reside on • It then communicates directly with the DataNodes to read the data
  • 22.
    Hadoop Components: MapReduce •MapReduce is a method for distributing a task across multiple nodes • It works like a Unix pipeline: – cat input | grep | sort | uniq -c | cat > output – Input | Map | Shuffle & Sort | Reduce | Output Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical (Data integration) Language) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 23.
    Features of MapReduce •Automatic parallelization and distribution • Automatic re-execution on failure • Locality optimizations • MapReduce abstracts all the „housekeeping‟ away from the developer – Developer can concentrate simply on writing the Map and Reduce functions Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical (Data integration) Language) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 24.
    Example : wordcount • Word count is challenging over massive amounts of data – Using a single compute node would be too time-consuming – Number of unique words can easily exceed the RAM • MapReduce breaks complex tasks down into smaller elements which can be executed in parallel • More nodes, more faster
  • 25.
    Word Count Example Key: offset Value: line Key: word Key: word Value: count Value: sum of count 0:The cat sat on the mat 22:The aardvark sat on the sofa
  • 26.
  • 27.
    Growing Hadoop Ecosystem •The term „Hadoop‟ is taken to be the combination of HDFS and MapReduce • There are numerous other projects surrounding Hadoop – Typically referred to as the „Hadoop Ecosystem‟ • Zookeeper • Hive and Pig • HBase • Flume • Other Ecosystem Projects – Sqoop – Oozie – Hue – Mahout
  • 28.
    The Ecosystem isthe System • Hadoop has become the kernel of the distributed operating system for Big Data • No one uses the kernel alone • A collection of projects at Apache
  • 29.
    Relation Map Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 30.
    Zookeeper – CoordinationFramework Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 31.
    What is ZooKeeper •A centralized service for maintaining – Configuration information – Providing distributed synchronization • A set of tools to build distributed applications that can safely handle partial failures • ZooKeeper was designed to store coordination data – Status information – Configuration – Location information
  • 32.
    Why use ZooKeeper? •Manage configuration across nodes • Implement reliable messaging • Implement redundant services • Synchronize process execution
  • 33.
    ZooKeeper Architecture – All servers store a copy of the data (in memory) – A leader is elected at startup – 2 roles – leader and follower • Followers service clients, all updates go through leader • Update responses are sent when a majority of servers have persisted the change – HA support
  • 34.
    Hbase – ColumnNoSQL DB Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 35.
  • 36.
    I – Inspiredby • Apache open source project • Inspired from Google Big Table • Non-relational, distributed database written in Java • Coordinated by Zookeeper
  • 37.
    Row & ColumnOriented
  • 38.
    Hbase – DataModel • Cells are “versioned” • Table rows are sorted by row key • Region – a row range [start-key:end-key]
  • 39.
    Architecture • Master Server(HMaster) – Assigns regions to regionservers – Monitors the health of regionservers • RegionServers – Contain regions and handle client read/write request
  • 40.
  • 41.
    When to useHBase • Need random, low latency access to the data • Application has a variable schema where each row is slightly different • Add columns • Most of columns are NULL in each row
  • 42.
    Flume / Sqoop– Data Integration Framework Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 43.
    What‟s the problemfor data collection • Data collection is currently a priori and ad hoc • A priori – decide what you want to collect ahead of time • Ad hoc – each kind of data source goes through its own collection path
  • 44.
    (and how canit help?) • A distributed data collection service • It efficiently collecting, aggregating, and moving large amounts of data • Fault tolerant, many failover and recovery mechanism • One-stop solution for data collection of all formats
  • 45.
    Flume: High-Level Overview •Logical Node • Source • Sink
  • 46.
    Architecture • basic diagram – one master control multiple node
  • 47.
    Architecture • multiple mastercontrol multiple node
  • 48.
  • 49.
    Flume / Sqoop– Data Integration Framework Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 50.
    Sqoop • Easy, paralleldatabase import/export • What you want do? – Insert data from RDBMS to HDFS – Export data from HDFS back into RDBMS
  • 51.
    What is Sqoop •A suite of tools that connect Hadoop and database systems • Import tables from databases into HDFS for deep analysis • Export MapReduce results back to a database for presentation to end-users • Provides the ability to import from SQL databases straight into your Hive data warehouse
  • 52.
    How Sqoop helps •The Problem – Structured data in traditional databases cannot be easily combined with complex data stored in HDFS • Sqoop (SQL-to-Hadoop) – Easy import of data from many databases to HDFS – Generate code for use in MapReduce applications
  • 53.
  • 54.
    Sqoop - exportprocess • Exports are performed in parallel using MapReduce
  • 55.
    Why Sqoop • JDBC-basedimplementation – Works with many popular database vendors • Auto-generation of tedious user-side code – Write MapReduce applications to work with your data, faster • Integration with Hive – Allows you to stay in a SQL-based environment
  • 56.
    Sqoop - JOB •Job management options • E.g sqoop job –create myjob –import –connect xxxxxxx --table mytable
  • 57.
    Pig / Hive– Analytical Language Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 58.
    Why Hive andPig? • Although MapReduce is very powerful, it can also be complex to master • Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code • Many organizations have programmers who are skilled at writing code in scripting languages • Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce – Hive was initially developed at Facebook, Pig at Yahoo!
  • 59.
    Hive – Developed by • What is Hive? – An SQL-like interface to Hadoop • Data Warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop – MapRuduce for execution – HDFS for storage • Hive Query Language – Basic-SQL : Select, From, Join, Group-By – Equi-Join, Muti-Table Insert, Multi-Group-By – Batch query SELECT * FROM purchases WHERE price > 100 GROUP BY storeid
  • 60.
    Pig – Initiated by • A high-level scripting language (Pig Latin) • Process data one step at a time • Simple to write MapReduce program • Easy understand • Easy debug A = load ‘a.txt’ as (id, name, age, ...) B = load ‘b.txt’ as (id, address, ...) C = JOIN A BY id, B BY id;STORE C into ‘c.txt’
  • 61.
    Hive vs. Pig Hive Pig Language HiveQL (SQL-like) Pig Latin, a scripting language Schema Table definitions A schema is optionally defined that are stored in a at runtime metastore Programmait Access JDBC, ODBC PigServer
  • 62.
    WordCount Example • Input Hello World Bye World Hello Hadoop Goodbye Hadoop • For the given sample input the map emits < Hello, 1> < World, 1> < Bye, 1> < World, 1> < Hello, 1> < Hadoop, 1> < Goodbye, 1> < Hadoop, 1> < Bye, 1> • the reduce just sums up the values < Goodbye, 1> < Hadoop, 2> < Hello, 2> < World, 2>
  • 63.
    WordCount Example InMapReduce public class WordCount { public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, one); } } } public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); } } public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new Job(conf, "wordcount"); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(true); }
  • 64.
    WordCount Example ByPig A = LOAD 'wordcount/input' USING PigStorage as (token:chararray); B = GROUP A BY token; C = FOREACH B GENERATE group, COUNT(A) as count; DUMP C;
  • 65.
    WordCount Example ByHive CREATE TABLE wordcount (token STRING); LOAD DATA LOCAL INPATH ‟wordcount/input' OVERWRITE INTO TABLE wordcount; SELECT count(*) FROM wordcount GROUP BY token;
  • 66.
    Oozie – JobWorkflow & Scheduling Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 67.
    What is ? • A Java Web Application • Oozie is a workflow scheduler for Hadoop • Crond for Hadoop Job 1 Job 2 Job 3 Job 4 Job 5
  • 68.
    Why • Why useOozie instead of just cascading a jobs one after another • Major flexibility – Start, Stop, Suspend, and re-run jobs • Oozie allows you to restart from a failure – You can tell Oozie to restart a job from a specific node in the graph or to skip specific failed nodes
  • 69.
    High Level Architecture •Web Service API • database store : – Workflow definitions – Currently running workflow instances, including instance states and variables Oozie WS Tomcat Hadoop/Pig/HDFS API web-app DB
  • 70.
    How it triggered •Time – Execute your workflow every 15 minutes 00:15 00:30 00:45 01:00 • Time and Data – Materialize your workflow every hour, but only run them when the input data is ready. Hadoop Input Data Exists? 01:00 02:00 03:00 04:00
  • 71.
  • 72.
    Oozie use criteria •Need Launch, control, and monitor jobs from your Java Apps – Java Client API/Command Line Interface • Need control jobs from anywhere – Web Service API • Have jobs that you need to run every hour, day, week • Need receive notification when a job done – Email when a job is complete
  • 73.
    Hue – WebConsole Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 74.
    Hue – developedby • Hadoop User Experience • Apache Open source project • HUE is a web UI for Hadoop • Platform for building custom applications with a nice UI library
  • 75.
    Hue • HUE comeswith a suite of applications – File Browser: Browse HDFS; change permissions and ownership; upload, download, view and edit files. – Job Browser: View jobs, tasks, counters, logs, etc. – Beeswax: Wizards to help create Hive tables, load data, run and manage Hive queries, and download results in Excel format.
  • 76.
  • 77.
  • 78.
    Mahout – DataMining Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 79.
    What is • Machine-learningtool • Distributed and scalable machine learning algorithms on the Hadoop platform • Building intelligent applications easier and faster
  • 80.
    Why • Current stateof ML libraries – Lack Community – Lack Documentation and Examples – Lack Scalability – Are Research oriented
  • 81.
    Mahout – scale •Scale to large datasets – Hadoop MapReduce implementations that scales linearly with data • Scalable to support your business case – Mahout is distributed under a commercially friendly Apache Software license • Scalable community – Vibrant, responsive and diverse
  • 82.
    Mahout – fouruse cases • Mahout machine learning algorithms – Recommendation mining : takes users‟ behavior and find items said specified user might like – Clustering : takes e.g. text documents and groups them based on related document topics – Classification : learns from existing categorized documents what specific category documents look like and is able to assign unlabeled documents to appropriate category – Frequent item set mining : takes a set of item groups (e.g. terms in query session, shopping cart content) and identifies, which individual items typically appear together
  • 83.
    Use case Example •Predict what the user likes based on – His/Her historical behavior – Aggregate behavior of people similar to him
  • 84.
    Conclusion Today, we introduced: •Why Hadoop is needed • The basic concepts of HDFS and MapReduce • What sort of problems can be solved with Hadoop • What other projects are included in the Hadoop ecosystem
  • 85.
    Recap – HadoopEcosystem Hue Mahout (Web Console) (Data Mining) Oozie (Job Workflow & Scheduling) (Coordination) Zookeeper Sqoop/Flume Pig/Hive (Analytical Language) (Data integration) MapReduce Runtime (Dist. Programming Framework) Hbase (Column NoSQL DB) Hadoop Distributed File System (HDFS)
  • 86.
  • 87.