2. Agenda
• What is Big Data?
• What is the problem?
• Hadoop
– Introduction to Hadoop
– Hadoop components
– What sort of problems can be solved with Hadoop?
• Hadoop ecosystem
• Conclusion
4. The Data-Driven World
• Modern systems have to deal with far more data than
was the case in the past
– Organizations are generating huge amounts of data
– That data has inherent value, and cannot be discarded
• Examples:
– Yahoo – over 170PB of data
– Facebook – over 30PB of data
– eBay – over 5PB of data
• Many organizations are generating data at a rate of
terabytes per day
5. What is the problem?
• Traditionally, computation has been processor-bound
• For decades, the primary push was to increase the
computing power of a single machine
– Faster processor, more RAM
• Distributed systems evolved to allow developers to use
multiple machines for a single job
– At compute time, data is copied to the compute nodes
6. What is the problem?
• Getting the data to the processors becomes the bottleneck
• Quick calculation
– Typical disk data transfer rate: 75MB/sec
– Time taken to transfer 100GB of data to the processor: approx. 22 minutes!
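The arithmetic behind that estimate: 100GB ≈ 102,400MB, and 102,400MB ÷ 75MB/sec ≈ 1,365 seconds ≈ 22.8 minutes.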
7. What is the problem?
• Failure of a component may be very costly
• What do we need when a job fails?
– Failure may result in a graceful degradation of application performance, but the entire system should not completely fail
– It should not result in the loss of any data
– It should not affect the outcome of the job
8. Big Data Solutions by Industries
The most common problems Hadoop can solve
9. Threat Analysis/Trade Surveillance
• Challenge:
– Detecting threats in the form of fraudulent activity or attacks
• Large data volumes involved
• Like looking for a needle in a haystack
• Solution with Hadoop:
– Parallel processing over huge datasets
– Pattern recognition to identify anomalies, i.e., threats
• Typical Industry:
– Security, Financial Services
10. Big Data Use Case: Smart Protection Network
• Challenge
– Information accessibility and transparency problems for threat researchers, due to the size and sources of the data (volume, variety, and velocity)
• Size of Data
– Overall Data
• Data sources: 20+
• Data fields: 1000+
• Daily new records: 23 Billion+
• Daily new data size: 4TB+
– SPN Smart Feedback
• Feedback components: 26
• Data fields: 300+
• Daily new file counts: 6 Million+
• Daily new records: 90 Million+
• Daily new data size: 261GB+
13. Recommendation Engine
• Challenge:
– Using user data to predict which products to recommend
• Solution with Hadoop:
– Batch processing framework
• Allows execution in parallel over large datasets
– Collaborative filtering
• Collecting 'taste' information from many users
• Utilizing that information to predict what similar users will like
• Typical Industry:
– ISP, Advertising
16. Introduction to Hadoop
• Apache Hadoop project
– Inspired by Google's MapReduce and Google File System papers
• An open-source, flexible, and available architecture for large-scale computation and data processing on a network of commodity hardware
• Open-source software + commodity hardware
– Reduces IT costs
17. Hadoop Concepts
• Distribute the data as it is initially stored in the system
• Individual nodes can work on data local to those nodes
• Users can focus on developing applications.
18. Hadoop Components
• Hadoop consists of two core components
– The Hadoop Distributed File System (HDFS)
– MapReduce Software Framework
• There are many other projects based around core
Hadoop
– Often referred to as the 'Hadoop Ecosystem'
– Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.
[Ecosystem stack diagram: Hue (web console), Mahout (data mining), Oozie (job workflow & scheduling), ZooKeeper (coordination), Sqoop/Flume (data integration), Pig/Hive (analytical languages), MapReduce runtime (distributed programming framework), and HBase (column NoSQL DB), all on top of the Hadoop Distributed File System (HDFS)]
19. Hadoop Components: HDFS
• HDFS, the Hadoop Distributed File System, is
responsible for storing data on the cluster
• Two roles in HDFS
– NameNode: records metadata
– DataNode: stores the actual data
20. How Files Are Stored: Example
• NameNode holds metadata for the
data files
• DataNodes hold the actual blocks
• Each block is replicated three
times on the cluster
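The replication factor is configurable per cluster. As a minimal sketch, the relevant hdfs-site.xml property (dfs.replication, which defaults to 3) looks like this:

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>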
21. HDFS: Points To Note
• When a client application wants to read a file:
– It communicates with the NameNode to determine which blocks make up the file, and which DataNodes those blocks reside on
– It then communicates directly with the DataNodes to read the data
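As a rough sketch of that read path, here is how a Java client might read a file through the HDFS API (the path /user/demo/input.txt is a hypothetical example). The FileSystem client consults the NameNode for metadata, while the returned stream pulls block data directly from the DataNodes:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // client handle; asks the NameNode for metadata
    Path file = new Path("/user/demo/input.txt");  // hypothetical file
    // fs.open() returns a stream that reads block data directly from the DataNodes
    BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(file)));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
    fs.close();
  }
}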
22. Hadoop Components: MapReduce
• MapReduce is a method for distributing a task across
multiple nodes
• It works like a Unix pipeline:
– cat input | grep | sort | uniq -c | cat > output
– Input | Map | Shuffle & Sort | Reduce | Output
23. Features of MapReduce
• Automatic parallelization and distribution
• Automatic re-execution on failure
• Locality optimizations
• MapReduce abstracts all the 'housekeeping' away from the developer
– The developer can concentrate simply on writing the Map and Reduce functions
24. Example: Word Count
• Word count is challenging over massive amounts of data
– Using a single compute node would be too time-consuming
– The number of unique words can easily exceed the available RAM
• MapReduce breaks complex tasks down into smaller elements which can be executed in parallel
• More nodes mean faster processing
25. Word Count Example
• Map input – key: byte offset, value: line of text
0: The cat sat on the mat
22: The aardvark sat on the sofa
• Map output – key: word, value: count
• Reduce output – key: word, value: sum of counts
27. Growing Hadoop Ecosystem
• The term 'Hadoop' is taken to be the combination of HDFS and MapReduce
• There are numerous other projects surrounding Hadoop
– Typically referred to as the 'Hadoop Ecosystem'
• Zookeeper
• Hive and Pig
• HBase
• Flume
• Other Ecosystem Projects
– Sqoop
– Oozie
– Hue
– Mahout
28. The Ecosystem is the System
• Hadoop has become the kernel of the distributed
operating system for Big Data
• No one uses the kernel alone
• A collection of projects at Apache
31. What is ZooKeeper?
• A centralized service for
– Maintaining configuration information
– Providing distributed synchronization
• A set of tools to build distributed applications that can
safely handle partial failures
• ZooKeeper was designed to store coordination data
– Status information
– Configuration
– Location information
32. Why use ZooKeeper?
• Manage configuration across nodes
• Implement reliable messaging
• Implement redundant services
• Synchronize process execution
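A minimal sketch of the first of these, configuration management, with the ZooKeeper Java client (the ensemble addresses zk1:2181,... and the /app-config znode are hypothetical examples):

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigDemo {
  public static void main(String[] args) throws Exception {
    final CountDownLatch connected = new CountDownLatch(1);
    // Hypothetical three-node ensemble; 3000 ms session timeout.
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, new Watcher() {
      public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.SyncConnected) {
          connected.countDown();
        }
      }
    });
    connected.await();  // wait until the session is established
    // Store a piece of shared configuration as a znode...
    if (zk.exists("/app-config", false) == null) {
      zk.create("/app-config", "max_workers=8".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }
    // ...and read it back from any client in the cluster.
    byte[] data = zk.getData("/app-config", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}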
33. ZooKeeper Architecture
– All servers store a copy of the data (in memory)
– A leader is elected at startup
– Two roles: leader and follower
• Followers service clients; all updates go through the leader
• Update responses are sent once a majority of servers have persisted the change
– Provides high-availability (HA) support
36. What is HBase?
• An Apache open-source project
• Inspired by Google's BigTable
• A non-relational, distributed database written in Java
• Coordinated by ZooKeeper
38. HBase – Data Model
• Cells are "versioned"
• Table rows are sorted by row key
• Region – a row range [start-key : end-key]
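To make the model concrete, a small sketch using the classic HBase Java client (the table 'users' and column family 'info' are hypothetical, and the table must already exist):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
    HTable table = new HTable(conf, "users");
    // Write one versioned cell: row key "row1", column family "info", qualifier "name".
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
    table.put(put);
    // Random, low-latency read of the same cell by row key.
    Result result = table.get(new Get(Bytes.toBytes("row1")));
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    table.close();
  }
}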
39. HBase Architecture
• Master Server (HMaster)
– Assigns regions to RegionServers
– Monitors the health of RegionServers
• RegionServers
– Contain regions and handle client read/write requests
41. When to use HBase
• You need random, low-latency access to the data
• Your application has a variable schema where each row is slightly different
• Columns can be added on the fly
• Most columns are NULL in each row
43. What's the problem for data collection?
• Data collection is currently a priori and ad hoc
• A priori – decide what you want to collect ahead of time
• Ad hoc – each kind of data source goes through its own
collection path
44. What is Flume? (and how can it help?)
• A distributed data collection service
• It efficiently collects, aggregates, and moves large amounts of data
• Fault tolerant, with many failover and recovery mechanisms
• A one-stop solution for data collection of all formats
50. Sqoop
• Easy, parallel database import/export
• What do you want to do?
– Import data from an RDBMS into HDFS
– Export data from HDFS back into an RDBMS
51. What is Sqoop
• A suite of tools that connect Hadoop and database
systems
• Import tables from databases into HDFS for deep
analysis
• Export MapReduce results back to a database for
presentation to end-users
• Provides the ability to import from SQL databases
straight into your Hive data warehouse
52. How Sqoop helps
• The Problem
– Structured data in traditional databases cannot be easily combined with complex data stored in HDFS
• Sqoop (SQL-to-Hadoop)
– Easy import of data from many databases into HDFS
– Generates code for use in MapReduce applications
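As a hedged sketch of typical invocations (the connection string, table names, and HDFS paths are hypothetical; --connect, --table, --username, -P, --target-dir, and --export-dir are standard Sqoop options):

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --table purchases \
  --username analyst -P \
  --target-dir /user/analyst/purchases

sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --table results \
  --export-dir /user/analyst/output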
54. Sqoop – Export Process
• Exports are performed in parallel using MapReduce
55. Why Sqoop
• JDBC-based implementation
– Works with many popular database vendors
• Auto-generation of tedious user-side code
– Write MapReduce applications to work with your data, faster
• Integration with Hive
– Allows you to stay in a SQL-based environment
58. Why Hive and Pig?
• Although MapReduce is very powerful, it can also be
complex to master
• Many organizations have business or data analysts who
are skilled at writing SQL queries, but not at writing Java
code
• Many organizations have programmers who are skilled
at writing code in scripting languages
• Hive and Pig are two projects which evolved separately
to help such people analyze huge amounts of data via
MapReduce
– Hive was initially developed at Facebook, Pig at Yahoo!
59. Hive – Developed by Facebook
• What is Hive?
– An SQL-like interface to Hadoop
• Data warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop
– MapReduce for execution
– HDFS for storage
• Hive Query Language
– Basic SQL: SELECT, FROM, JOIN, GROUP BY
– Equi-Join, Multi-Table Insert, Multi-Group-By
– Batch query
SELECT storeid, COUNT(*) FROM purchases WHERE price > 100 GROUP BY storeid;
60. Pig – Initiated by Yahoo!
• A high-level scripting language (Pig Latin)
• Processes data one step at a time
• Simple to write MapReduce programs
• Easy to understand
• Easy to debug

A = LOAD 'a.txt' AS (id, name, age, ...);
B = LOAD 'b.txt' AS (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C INTO 'c.txt';
61. Hive vs. Pig
• Language: Hive uses HiveQL (SQL-like); Pig uses Pig Latin, a scripting language
• Schema: Hive table definitions are stored in a metastore; in Pig, a schema is optionally defined at runtime
• Programmatic access: Hive via JDBC and ODBC; Pig via PigServer
62. WordCount Example
• Input
Hello World Bye World
Hello Hadoop Goodbye Hadoop
• For the given sample input, the map emits
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
• The reduce just sums up the values
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
63. WordCount Example In MapReduce
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
  // Mapper: for each input line, emit (word, 1) for every token.
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts for each word.
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
64. WordCount Example By Pig
A = LOAD 'wordcount/input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) AS token;
C = GROUP B BY token;
D = FOREACH C GENERATE group, COUNT(B) AS count;
DUMP D;
65. WordCount Example By Hive
CREATE TABLE wordcount (token STRING);
LOAD DATA LOCAL INPATH 'wordcount/input'
OVERWRITE INTO TABLE wordcount;
SELECT token, COUNT(*) FROM wordcount GROUP BY token;
67. What is Oozie?
• A Java web application
• Oozie is a workflow scheduler for Hadoop
• Like cron for Hadoop
[Diagram: a workflow DAG of jobs, Job 1 through Job 5]
68. Why Oozie?
• Why use Oozie instead of just cascading jobs one after another? (See the sketch below.)
• Major flexibility
– Start, stop, suspend, and re-run jobs
• Oozie allows you to restart from a failure
– You can tell Oozie to restart a job from a specific node in the graph, or to skip specific failed nodes
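As an illustrative sketch (not taken from the deck), an Oozie workflow is an XML DAG of actions; a single map-reduce action with failure handling might look roughly like this, where the workflow and action names and the ${...} parameters are placeholders:

<workflow-app name="wordcount-wf" xmlns="uri:oozie:workflow:0.2">
  <start to="wordcount"/>
  <action name="wordcount">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>WordCount failed</message>
  </kill>
  <end name="end"/>
</workflow-app>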
69. High-Level Architecture
• Web service API
• The database stores:
– Workflow definitions
– Currently running workflow instances, including instance states and variables
[Diagram: clients call the Oozie WS API exposed by a Tomcat web-app, backed by a DB, which submits work to Hadoop/Pig/HDFS]
70. How is it triggered?
• Time
– Execute your workflow every 15 minutes (00:15, 00:30, 00:45, 01:00, ...)
• Time and data
– Materialize your workflow every hour, but only run it when the input data is ready
[Diagram: hourly slots 01:00 through 04:00, each gated on whether the Hadoop input data exists]
72. Oozie use criteria
• Need to launch, control, and monitor jobs from your Java apps
– Java Client API / command-line interface
• Need to control jobs from anywhere
– Web service API
• Have jobs that you need to run every hour, day, or week
• Need to receive notification when a job is done
– E.g., email when a job is complete
74. Hue – Developed by Cloudera
• Hadoop User Experience
• An open-source project (Apache-licensed)
• HUE is a web UI for Hadoop
• A platform for building custom applications, with a nice UI library
75. Hue
• HUE comes with a suite of applications
– File Browser: Browse HDFS; change permissions and
ownership; upload, download, view and edit files.
– Job Browser: View jobs, tasks, counters, logs, etc.
– Beeswax: Wizards to help create Hive tables, load data, run and
manage Hive queries, and download results in Excel format.
79. What is Mahout?
• A machine-learning tool
• Provides distributed and scalable machine-learning algorithms on the Hadoop platform
• Makes building intelligent applications easier and faster
80. Why Mahout?
• Current state of ML libraries
– Lack community
– Lack documentation and examples
– Lack scalability
– Are research-oriented
81. Mahout – Scale
• Scales to large datasets
– Hadoop MapReduce implementations that scale linearly with data
• Scalable to support your business case
– Mahout is distributed under a commercially friendly Apache Software License
• Scalable community
– Vibrant, responsive, and diverse
82. Mahout – Four Use Cases
• Mahout machine-learning algorithms
– Recommendation mining: takes users' behavior and finds items that a given user might like
– Clustering: takes e.g. text documents and groups them based on related document topics
– Classification: learns from existing categorized documents what documents of a specific category look like, and is able to assign unlabeled documents to the appropriate category
– Frequent itemset mining: takes a set of item groups (e.g. terms in a query session, shopping cart contents) and identifies which individual items typically appear together
83. Use Case Example
• Predict what the user likes based on
– His/her historical behavior
– The aggregate behavior of people similar to him/her
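A minimal sketch of that use case with Mahout's Taste collaborative-filtering API (the ratings.csv file, the neighborhood size of 10, and the user/item ids are hypothetical):

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderDemo {
  public static void main(String[] args) throws Exception {
    // Each line of ratings.csv: userID,itemID,preference
    DataModel model = new FileDataModel(new File("ratings.csv"));
    // Similarity between users, computed from their historical behavior.
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // "People similar to this user": the 10 nearest neighbors.
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // Top 3 items user 1 is predicted to like.
    List<RecommendedItem> items = recommender.recommend(1, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " (" + item.getValue() + ")");
    }
  }
}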
84. Conclusion
Today, we introduced:
• Why Hadoop is needed
• The basic concepts of HDFS and MapReduce
• What sort of problems can be solved with Hadoop
• What other projects are included in the Hadoop
ecosystem