Hadoop Family and Ecosystem

Hadoop Product Family
and Ecosystem
PC Liao

Agenda

• What is BigData?
• What is the problem?
• Hadoop
– Introduction to Hadoop
– Hadoop components
– What sort of problems can be solved with Hadoop?
• Hadoop ecosystem
• Conclusion

What is BigData?

A set of files A database A single file

The Data-Driven World

• Modern systems have to deal with far more data than
was the case in the past
– Organizations are generating huge amounts of data
– That data has inherent value, and cannot be discarded
• Examples:
– Yahoo – over 170PB of data
– Facebook – over 30PB of data
– eBay – over 5PB of data

• Many organizations are generating data at a rate of
terabytes per day

What is the problem

• Traditionally, computation has been processor-bound
• For decades, the primary push was to increase the
computing power of a single machine
– Faster processor, more RAM
• Distributed systems evolved to allow developers to use
multiple machines for a single job
– At compute time, data is copied to the compute nodes

What is the problem

• Getting the data to the processors
becomes the bottleneck

• Quick calculation
– Typical disk data transfer rate:
• 75MB/sec
– Time taken to transfer 100GB of data
to the processor:
• approx. 22 minutes!

What is the problem

• Failure of a component may cost a lot
• What we need when job fail?
– May result in a graceful degradation of application performance,
but entire system does not completely fail
– Should not result in the loss of any data
– Would not affect the outcome of the job

Big Data Solutions by Industries
The most common problems Hadoop can solve

Threat Analysis/Trade Surveillance

• Challenge:
– Detecting threats in the form of fraudulent activity or attacks
• Large data volumes involved
• Like looking for a needle in a haystack

• Solution with Hadoop:
– Parallel processing over huge datasets
– Pattern recognition to identify anomalies
• – i.e., threats

• Typical Industry:
– Security, Financial Services

Big Data Use Case
Smart Protection Network
• Challenge
– Information accessibility and transparency problems
for threat researcher due to the size and source of
data (volume, variety and velocity)

• Size of Data
– Overall Data
• Data sources: 20+
• Data fields: 1000+
• Daily new records: 23 Billion+
• Daily new data size: 4TB+

SPN Smart Feedback
• Feedback components: 26
• Data fields : 300+
• Daily new file counts: 6 Million+
• Daily new records: 90 Million+
• Daily new data size: 261GB+

Recommendation Engine

• Challenge:
– Using user data to predict which products to recommend
• Solution with Hadoop:
– Batch processing framework
• Allow execution in in parallel over large datasets
– Collaborative filtering
• Collecting „taste‟ information from many users
• Utilizing information to predict what similar users like

• Typical Industry
– ISP, Advertising

Walmart Case

Diapers
Beer

Friday

Revenue ?

– inspired by
• Apache Hadoop project
– inspired by Google's MapReduce and Google File System
papers.
• Open sourced, flexible and available architecture for
large scale computation and data processing on a
network of commodity hardware
• Open Source Software + Hardware Commodity
– IT Costs Reduction

Hadoop Concepts

• Distribute the data as it is initially stored in the system
• Individual nodes can work on data local to those nodes
• Users can focus on developing applications.

Hadoop Components

• Hadoop consists of two core components
– The Hadoop Distributed File System (HDFS)
– MapReduce Software Framework
• There are many other projects based around core
Hadoop
– Often referred to as the „Hadoop Ecosystem‟
– Pig, Hive, HBase, Flume, Oozie, Sqoop, etc

Hue Mahout
(Web Console) (Data Mining)

Oozie
(Job Workflow & Scheduling)

(Coordination)
Zookeeper
Sqoop/Flume Pig/Hive (Analytical
(Data integration) Language)

MapReduce Runtime
(Dist. Programming Framework) Hbase
(Column NoSQL DB)

Hadoop Distributed File System (HDFS)

Hadoop Components: HDFS

• HDFS, the Hadoop Distributed File System, is
responsible for storing data on the cluster
• Two roles in HDFS
– Namenode: Record metadata
– Datanode: Store data

Hue Mahout

Oozie

(Coordination)
Zookeeper

MapReduce Runtime
(Column NoSQL DB)


How Files Are Stored: Example
• NameNode holds metadata for the
data files
• DataNodes hold the actual blocks
• Each block is replicated three
times on the cluster

HDFS: Points To Note
• When a client application
wants to read a file:
• It communicates with
the NameNode to
determine which
blocks make up the
file, and which
DataNodes those
blocks reside on
• It then
communicates
directly with the
DataNodes to read
the data

Features of MapReduce

• Automatic parallelization and distribution
• Automatic re-execution on failure
• Locality optimizations
• MapReduce abstracts all the „housekeeping‟ away from
the developer
– Developer can concentrate simply on writing the Map and
Reduce functions
Hue Mahout

Oozie

(Coordination)
Zookeeper

MapReduce Runtime
(Column NoSQL DB)


Example : word count

• Word count is challenging over massive amounts of
data
– Using a single compute node would be too time-consuming
– Number of unique words can easily exceed the RAM
• MapReduce breaks complex tasks down into smaller
elements which can be executed in parallel
• More nodes, more faster

Word Count Example

Key: offset
Value: line

Key: word Key: word
Value: count Value: sum of count

0:The cat sat on the mat
22:The aardvark sat on the sofa

Growing Hadoop Ecosystem

• The term „Hadoop‟ is taken to be the combination of
HDFS and MapReduce
• There are numerous other projects surrounding Hadoop
– Typically referred to as the „Hadoop Ecosystem‟
• Zookeeper
• Hive and Pig
• HBase
• Flume
• Other Ecosystem Projects
– Sqoop
– Oozie
– Hue
– Mahout

The Ecosystem is the System

• Hadoop has become the kernel of the distributed
operating system for Big Data
• No one uses the kernel alone
• A collection of projects at Apache

Relation Map

Hue Mahout

Oozie
(Coordination)
Zookeeper

Sqoop/Flume
Pig/Hive (Analytical Language)
(Data integration)

MapReduce Runtime
(Column NoSQL DB)


Zookeeper – Coordination Framework

Hue Mahout

Oozie
(Coordination)
Zookeeper

Sqoop/Flume
(Data integration)

MapReduce Runtime
(Column NoSQL DB)


What is ZooKeeper

• A centralized service for maintaining
– Configuration information
– Providing distributed synchronization
• A set of tools to build distributed applications that can
safely handle partial failures
• ZooKeeper was designed to store coordination data
– Status information
– Configuration
– Location information

Why use ZooKeeper?

• Manage configuration across nodes
• Implement reliable messaging
• Implement redundant services
• Synchronize process execution

ZooKeeper Architecture

– All servers store a copy of the data (in memory)
– A leader is elected at startup
– 2 roles – leader and follower
• Followers service clients, all updates go through leader
• Update responses are sent when a majority of servers have persisted the
change
– HA support

Hbase – Column NoSQL DB

Hue Mahout

Oozie
(Coordination)
Zookeeper

Sqoop/Flume
(Data integration)

MapReduce Runtime
(Column NoSQL DB)


I – Inspired by

• Apache open source project
• Inspired from Google Big Table
• Non-relational, distributed database written in Java
• Coordinated by Zookeeper

Hbase – Data Model

• Cells are “versioned”
• Table rows are sorted by row key
• Region – a row range [start-key:end-key]

Architecture

• Master Server (HMaster)
– Assigns regions to regionservers
– Monitors the health of regionservers
• RegionServers
– Contain regions and handle client read/write request

When to use HBase

• Need random, low latency access to the data
• Application has a variable schema where each row is
slightly different
• Add columns
• Most of columns are NULL in each row

Flume / Sqoop – Data Integration Framework

Hue Mahout

Oozie
(Coordination)
Zookeeper

Sqoop/Flume
(Data integration)

MapReduce Runtime
(Column NoSQL DB)


What‟s the problem for data collection

• Data collection is currently a priori and ad hoc
• A priori – decide what you want to collect ahead of time
• Ad hoc – each kind of data source goes through its own
collection path

(and how can it help?)

• A distributed data collection service
• It efficiently collecting, aggregating, and moving large
amounts of data
• Fault tolerant, many failover and recovery mechanism
• One-stop solution for data collection of all formats

Flume: High-Level Overview
• Logical Node
• Source
• Sink

Architecture

• basic diagram
– one master control multiple node

Architecture

• multiple master control multiple node

Sqoop

• Easy, parallel database import/export
• What you want do?
– Insert data from RDBMS to HDFS
– Export data from HDFS back into RDBMS

What is Sqoop

• A suite of tools that connect Hadoop and database
systems
• Import tables from databases into HDFS for deep
analysis
• Export MapReduce results back to a database for
presentation to end-users
• Provides the ability to import from SQL databases
straight into your Hive data warehouse

How Sqoop helps

• The Problem
– Structured data in traditional databases cannot be easily
combined with complex data stored in HDFS
• Sqoop (SQL-to-Hadoop)
– Easy import of data from many databases to HDFS
– Generate code for use in MapReduce applications

Sqoop - export process

• Exports are performed in parallel using MapReduce

Why Sqoop

• JDBC-based implementation
– Works with many popular database vendors
• Auto-generation of tedious user-side code
– Write MapReduce applications to work with your data, faster
• Integration with Hive
– Allows you to stay in a SQL-based environment

Sqoop - JOB
• Job management options

• E.g sqoop job –create myjob –import –connect xxxxxxx
--table mytable

Pig / Hive – Analytical Language

Hue Mahout

Oozie
(Coordination)
Zookeeper

Sqoop/Flume
(Data integration)

MapReduce Runtime
(Column NoSQL DB)


Why Hive and Pig?

• Although MapReduce is very powerful, it can also be
complex to master
• Many organizations have business or data analysts who
are skilled at writing SQL queries, but not at writing Java
code
• Many organizations have programmers who are skilled
at writing code in scripting languages
• Hive and Pig are two projects which evolved separately
to help such people analyze huge amounts of data via
MapReduce
– Hive was initially developed at Facebook, Pig at Yahoo!

Hive – Developed by

• What is Hive?
– An SQL-like interface to Hadoop
• Data Warehouse infrastructure that provides data
summarization and ad hoc querying on top of Hadoop
– MapRuduce for execution
– HDFS for storage
• Hive Query Language
– Basic-SQL : Select, From, Join, Group-By
– Equi-Join, Muti-Table Insert, Multi-Group-By
– Batch query
SELECT * FROM purchases WHERE price > 100 GROUP BY storeid

Pig – Initiated by

• A high-level scripting language (Pig Latin)
• Process data one step at a time
• Simple to write MapReduce program
• Easy understand
• Easy debug A = load ‘a.txt’ as (id, name, age, ...)
B = load ‘b.txt’ as (id, address, ...)
C = JOIN A BY id, B BY id;STORE C into ‘c.txt’

Hive vs. Pig

Hive Pig
Language HiveQL (SQL-like) Pig Latin, a scripting language
Schema Table definitions A schema is optionally defined
that are stored in a at runtime
metastore
Programmait Access JDBC, ODBC PigServer

WordCount Example

• Input
Hello World Bye World
Hello Hadoop Goodbye Hadoop
• For the given sample input the map emits
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>

< Bye, 1>
• the reduce just sums up the values
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>

WordCount Example In MapReduce
public class WordCount {
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}

public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();

Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);

job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.waitForCompletion(true);
}

WordCount Example By Pig

A = LOAD 'wordcount/input' USING PigStorage as (token:chararray);

B = GROUP A BY token;

C = FOREACH B GENERATE group, COUNT(A) as count;

DUMP C;

WordCount Example By Hive

CREATE TABLE wordcount (token STRING);

LOAD DATA LOCAL INPATH ‟wordcount/input'
OVERWRITE INTO TABLE wordcount;

SELECT count(*) FROM wordcount GROUP BY token;

Oozie – Job Workflow & Scheduling

Hue Mahout

Oozie
(Coordination)
Zookeeper

Sqoop/Flume
(Data integration)

MapReduce Runtime
(Column NoSQL DB)


What is ?

• A Java Web Application
• Oozie is a workﬂow scheduler for Hadoop
• Crond for Hadoop
Job 1 Job 2

Job 3

Job 4 Job 5

Why

• Why use Oozie instead of just cascading a jobs one
after another
• Major flexibility
– Start, Stop, Suspend, and re-run jobs
• Oozie allows you to restart from a failure
– You can tell Oozie to restart a job from a speciﬁc node in the
graph or to skip speciﬁc failed nodes

High Level Architecture

• Web Service API
• database store :
– Workflow definitions
– Currently running workflow instances, including instance states
and variables

Oozie

WS Tomcat
Hadoop/Pig/HDFS
API web-app

DB

How it triggered

• Time
– Execute your workflow every 15 minutes

00:15 00:30 00:45 01:00

• Time and Data
– Materialize your workflow every hour, but only run them when
the input data is ready.
Hadoop
Input Data Exists?

01:00 02:00 03:00 04:00

Oozie use criteria

• Need Launch, control, and monitor jobs from your Java
Apps
– Java Client API/Command Line Interface
• Need control jobs from anywhere
– Web Service API
• Have jobs that you need to run every hour, day, week
• Need receive notification when a job done
– Email when a job is complete

Hue – Web Console

Hue Mahout

Oozie
(Coordination)
Zookeeper

Sqoop/Flume
(Data integration)

MapReduce Runtime
(Column NoSQL DB)


Hue – developed by

• Hadoop User Experience
• Apache Open source project
• HUE is a web UI for Hadoop
• Platform for building custom applications with a nice UI
library

Hue

• HUE comes with a suite of applications
– File Browser: Browse HDFS; change permissions and
ownership; upload, download, view and edit files.
– Job Browser: View jobs, tasks, counters, logs, etc.
– Beeswax: Wizards to help create Hive tables, load data, run and
manage Hive queries, and download results in Excel format.

Mahout – Data Mining

Hue Mahout

Oozie
(Coordination)
Zookeeper

Sqoop/Flume
(Data integration)

MapReduce Runtime
(Column NoSQL DB)


What is

• Machine-learning tool
• Distributed and scalable machine learning algorithms on
the Hadoop platform
• Building intelligent applications easier and faster

Why

• Current state of ML libraries
– Lack Community
– Lack Documentation and Examples
– Lack Scalability
– Are Research oriented

Mahout – scale

• Scale to large datasets
– Hadoop MapReduce implementations that scales linearly with
data
• Scalable to support your business case
– Mahout is distributed under a commercially friendly Apache
Software license
• Scalable community
– Vibrant, responsive and diverse

Mahout – four use cases

• Mahout machine learning algorithms
– Recommendation mining : takes users‟ behavior and find items
said specified user might like
– Clustering : takes e.g. text documents and groups them based
on related document topics
– Classification : learns from existing categorized documents what
specific category documents look like and is able to assign
unlabeled documents to appropriate category
– Frequent item set mining : takes a set of item groups (e.g. terms
in query session, shopping cart content) and identifies, which
individual items typically appear together

Use case Example

• Predict what the user likes based on
– His/Her historical behavior
– Aggregate behavior of people similar to him

Conclusion

Today, we introduced:
• Why Hadoop is needed
• The basic concepts of HDFS and MapReduce
• What sort of problems can be solved with Hadoop
• What other projects are included in the Hadoop
ecosystem

Recap – Hadoop Ecosystem

Hue Mahout

Oozie
(Coordination)
Zookeeper

Sqoop/Flume
(Data integration)

MapReduce Runtime
(Column NoSQL DB)


Hadoop Family and Ecosystem

More Related Content

What's hot

Viewers also liked

Similar to Hadoop Family and Ecosystem

More from tcloudcomputing-tw

Recently uploaded

In this document

Hadoop Family and Ecosystem