2. Agenda
• What is Big Data?
• What is the problem?
• Hadoop
– Introduction to Hadoop
– Hadoop components
– What sort of problems can be solved with Hadoop?
• Hadoop ecosystem
• Conclusion
4. The Data-Driven World
• Modern systems have to deal with far more data than
was the case in the past
– Organizations are generating huge amounts of data
– That data has inherent value, and cannot be discarded
• Examples:
– Yahoo – over 170PB of data
– Facebook – over 30PB of data
– eBay – over 5PB of data
• Many organizations are generating data at a rate of
terabytes per day
5. What is the problem?
• Traditionally, computation has been processor-bound
• For decades, the primary push was to increase the
computing power of a single machine
– Faster processor, more RAM
• Distributed systems evolved to allow developers to use
multiple machines for a single job
– At compute time, data is copied to the compute nodes
6. What is the problem?
• Getting the data to the processors becomes the bottleneck
• Quick calculation
– Typical disk data transfer rate: 75MB/sec
– Time taken to transfer 100GB of data to the processor: approx. 22 minutes!
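The arithmetic behind that estimate: 100GB ≈ 102,400MB, and 102,400MB ÷ 75MB/sec ≈ 1,365 seconds ≈ 22.8 minutes.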
7. What is the problem?
• Failure of a component may be very costly
• What do we need when a job fails?
– Failure may result in a graceful degradation of application performance, but the entire system should not completely fail
– It should not result in the loss of any data
– It should not affect the outcome of the job
8. Big Data Solutions by Industries
The most common problems Hadoop can solve
9. Threat Analysis/Trade Surveillance
• Challenge:
– Detecting threats in the form of fraudulent activity or attacks
• Large data volumes involved
• Like looking for a needle in a haystack
• Solution with Hadoop:
– Parallel processing over huge datasets
– Pattern recognition to identify anomalies, i.e., threats
• Typical Industry:
– Security, Financial Services
10. Big Data Use Case: Smart Protection Network
• Challenge
– Information accessibility and transparency problems for threat researchers, due to the size and sources of the data (volume, variety, and velocity)
• Size of Data
– Overall Data
• Data sources: 20+
• Data fields: 1000+
• Daily new records: 23 Billion+
• Daily new data size: 4TB+
– SPN Smart Feedback
• Feedback components: 26
• Data fields: 300+
• Daily new file counts: 6 Million+
• Daily new records: 90 Million+
• Daily new data size: 261GB+
13. Recommendation Engine
• Challenge:
– Using user data to predict which products to recommend
• Solution with Hadoop:
– Batch processing framework
• Allows execution in parallel over large datasets
– Collaborative filtering
• Collecting 'taste' information from many users
• Utilizing that information to predict what similar users will like
• Typical Industry:
– ISP, Advertising
16. Introduction to Hadoop
• Apache Hadoop project
– Inspired by Google's MapReduce and Google File System papers
• An open-source, flexible, and available architecture for large-scale computation and data processing on a network of commodity hardware
• Open-source software + commodity hardware
– Reduces IT costs
17. Hadoop Concepts
• Distribute the data as it is initially stored in the system
• Individual nodes can work on data local to those nodes
• Users can focus on developing applications.
18. Hadoop Components
• Hadoop consists of two core components
– The Hadoop Distributed File System (HDFS)
– MapReduce Software Framework
• There are many other projects based around core
Hadoop
– Often referred to as the 'Hadoop Ecosystem'
– Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.
[Ecosystem stack diagram: Hue (web console), Mahout (data mining), Oozie (job workflow & scheduling), ZooKeeper (coordination), Sqoop/Flume (data integration), Pig/Hive (analytical languages), MapReduce runtime (distributed programming framework), and HBase (column NoSQL DB), all on top of the Hadoop Distributed File System (HDFS)]
19. Hadoop Components: HDFS
• HDFS, the Hadoop Distributed File System, is
responsible for storing data on the cluster
• Two roles in HDFS
– NameNode: records metadata
– DataNode: stores the actual data
20. How Files Are Stored: Example
• NameNode holds metadata for the
data files
• DataNodes hold the actual blocks
• Each block is replicated three
times on the cluster
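The replication factor is configurable per cluster. As a minimal sketch, the relevant hdfs-site.xml property (dfs.replication, which defaults to 3) looks like this:

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>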
21. HDFS: Points To Note
• When a client application wants to read a file:
– It communicates with the NameNode to determine which blocks make up the file, and which DataNodes those blocks reside on
– It then communicates directly with the DataNodes to read the data
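As a rough sketch of that read path, here is how a Java client might read a file through the HDFS API (the path /user/demo/input.txt is a hypothetical example). The FileSystem client consults the NameNode for metadata, while the returned stream pulls block data directly from the DataNodes:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // client handle; asks the NameNode for metadata
    Path file = new Path("/user/demo/input.txt");  // hypothetical file
    // fs.open() returns a stream that reads block data directly from the DataNodes
    BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(file)));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
    fs.close();
  }
}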
22. Hadoop Components: MapReduce
• MapReduce is a method for distributing a task across
multiple nodes
• It works like a Unix pipeline:
– cat input | grep | sort | uniq -c | cat > output
– Input | Map | Shuffle & Sort | Reduce | Output
23. Features of MapReduce
• Automatic parallelization and distribution
• Automatic re-execution on failure
• Locality optimizations
• MapReduce abstracts all the 'housekeeping' away from the developer
– The developer can concentrate simply on writing the Map and Reduce functions
24. Example: Word Count
• Word count is challenging over massive amounts of data
– Using a single compute node would be too time-consuming
– The number of unique words can easily exceed the available RAM
• MapReduce breaks complex tasks down into smaller elements which can be executed in parallel
• More nodes mean faster processing
25. Word Count Example
• Map input – key: byte offset, value: line of text
0: The cat sat on the mat
22: The aardvark sat on the sofa
• Map output – key: word, value: count
• Reduce output – key: word, value: sum of counts
27. Growing Hadoop Ecosystem
• The term 'Hadoop' is taken to be the combination of HDFS and MapReduce
• There are numerous other projects surrounding Hadoop
– Typically referred to as the 'Hadoop Ecosystem'
• Zookeeper
• Hive and Pig
• HBase
• Flume
• Other Ecosystem Projects
– Sqoop
– Oozie
– Hue
– Mahout
28. The Ecosystem is the System
• Hadoop has become the kernel of the distributed
operating system for Big Data
• No one uses the kernel alone
• A collection of projects at Apache
31. What is ZooKeeper?
• A centralized service for
– Maintaining configuration information
– Providing distributed synchronization
• A set of tools to build distributed applications that can
safely handle partial failures
• ZooKeeper was designed to store coordination data
– Status information
– Configuration
– Location information
32. Why use ZooKeeper?
• Manage configuration across nodes
• Implement reliable messaging
• Implement redundant services
• Synchronize process execution
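A minimal sketch of the first of these, configuration management, with the ZooKeeper Java client (the ensemble addresses zk1:2181,... and the /app-config znode are hypothetical examples):

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigDemo {
  public static void main(String[] args) throws Exception {
    final CountDownLatch connected = new CountDownLatch(1);
    // Hypothetical three-node ensemble; 3000 ms session timeout.
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, new Watcher() {
      public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.SyncConnected) {
          connected.countDown();
        }
      }
    });
    connected.await();  // wait until the session is established
    // Store a piece of shared configuration as a znode...
    if (zk.exists("/app-config", false) == null) {
      zk.create("/app-config", "max_workers=8".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }
    // ...and read it back from any client in the cluster.
    byte[] data = zk.getData("/app-config", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}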
33. ZooKeeper Architecture
– All servers store a copy of the data (in memory)
– A leader is elected at startup
– Two roles: leader and follower
• Followers service clients; all updates go through the leader
• Update responses are sent once a majority of servers have persisted the change
– Provides high-availability (HA) support
36. What is HBase?
• An Apache open-source project
• Inspired by Google's BigTable
• A non-relational, distributed database written in Java
• Coordinated by ZooKeeper
38. HBase – Data Model
• Cells are "versioned"
• Table rows are sorted by row key
• Region – a row range [start-key : end-key]
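To make the model concrete, a small sketch using the classic HBase Java client (the table 'users' and column family 'info' are hypothetical, and the table must already exist):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
    HTable table = new HTable(conf, "users");
    // Write one versioned cell: row key "row1", column family "info", qualifier "name".
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
    table.put(put);
    // Random, low-latency read of the same cell by row key.
    Result result = table.get(new Get(Bytes.toBytes("row1")));
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    table.close();
  }
}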
39. HBase Architecture
• Master Server (HMaster)
– Assigns regions to RegionServers
– Monitors the health of RegionServers
• RegionServers
– Contain regions and handle client read/write requests
41. When to use HBase
• You need random, low-latency access to the data
• Your application has a variable schema where each row is slightly different
• Columns can be added on the fly
• Most columns are NULL in each row
43. What's the problem for data collection?
• Data collection is currently a priori and ad hoc
• A priori – decide what you want to collect ahead of time
• Ad hoc – each kind of data source goes through its own
collection path
44. What is Flume? (and how can it help?)
• A distributed data collection service
• It efficiently collects, aggregates, and moves large amounts of data
• Fault tolerant, with many failover and recovery mechanisms
• A one-stop solution for data collection of all formats
50. Sqoop
• Easy, parallel database import/export
• What do you want to do?
– Import data from an RDBMS into HDFS
– Export data from HDFS back into an RDBMS
51. What is Sqoop
• A suite of tools that connect Hadoop and database
systems
• Import tables from databases into HDFS for deep
analysis
• Export MapReduce results back to a database for
presentation to end-users
• Provides the ability to import from SQL databases
straight into your Hive data warehouse
52. How Sqoop helps
• The Problem
– Structured data in traditional databases cannot be easily combined with complex data stored in HDFS
• Sqoop (SQL-to-Hadoop)
– Easy import of data from many databases into HDFS
– Generates code for use in MapReduce applications
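As a hedged sketch of typical invocations (the connection string, table names, and HDFS paths are hypothetical; --connect, --table, --username, -P, --target-dir, and --export-dir are standard Sqoop options):

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --table purchases \
  --username analyst -P \
  --target-dir /user/analyst/purchases

sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --table results \
  --export-dir /user/analyst/output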
54. Sqoop – Export Process
• Exports are performed in parallel using MapReduce
55. Why Sqoop
• JDBC-based implementation
– Works with many popular database vendors
• Auto-generation of tedious user-side code
– Write MapReduce applications to work with your data, faster
• Integration with Hive
– Allows you to stay in a SQL-based environment
58. Why Hive and Pig?
• Although MapReduce is very powerful, it can also be
complex to master
• Many organizations have business or data analysts who
are skilled at writing SQL queries, but not at writing Java
code
• Many organizations have programmers who are skilled
at writing code in scripting languages
• Hive and Pig are two projects which evolved separately
to help such people analyze huge amounts of data via
MapReduce
– Hive was initially developed at Facebook, Pig at Yahoo!
59. Hive – Developed by Facebook
• What is Hive?
– An SQL-like interface to Hadoop
• Data warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop
– MapReduce for execution
– HDFS for storage
• Hive Query Language
– Basic SQL: SELECT, FROM, JOIN, GROUP BY
– Equi-Join, Multi-Table Insert, Multi-Group-By
– Batch query
SELECT storeid, COUNT(*) FROM purchases WHERE price > 100 GROUP BY storeid;
60. Pig – Initiated by Yahoo!
• A high-level scripting language (Pig Latin)
• Processes data one step at a time
• Simple to write MapReduce programs
• Easy to understand
• Easy to debug

A = LOAD 'a.txt' AS (id, name, age, ...);
B = LOAD 'b.txt' AS (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C INTO 'c.txt';
61. Hive vs. Pig
• Language: Hive uses HiveQL (SQL-like); Pig uses Pig Latin, a scripting language
• Schema: Hive table definitions are stored in a metastore; in Pig, a schema is optionally defined at runtime
• Programmatic access: Hive via JDBC and ODBC; Pig via PigServer
62. WordCount Example
• Input
Hello World Bye World
Hello Hadoop Goodbye Hadoop
• For the given sample input, the map emits
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
• The reduce just sums up the values
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
63. WordCount Example In MapReduce
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
  // Mapper: for each input line, emit (word, 1) for every token.
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts for each word.
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
64. WordCount Example By Pig
A = LOAD 'wordcount/input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) AS token;
C = GROUP B BY token;
D = FOREACH C GENERATE group, COUNT(B) AS count;
DUMP D;
65. WordCount Example By Hive
CREATE TABLE wordcount (token STRING);
LOAD DATA LOCAL INPATH 'wordcount/input'
OVERWRITE INTO TABLE wordcount;
SELECT token, COUNT(*) FROM wordcount GROUP BY token;
67. What is Oozie?
• A Java web application
• Oozie is a workflow scheduler for Hadoop
• Like cron for Hadoop
[Diagram: a workflow DAG of jobs, Job 1 through Job 5]
68. Why Oozie?
• Why use Oozie instead of just cascading jobs one after another? (See the sketch below.)
• Major flexibility
– Start, stop, suspend, and re-run jobs
• Oozie allows you to restart from a failure
– You can tell Oozie to restart a job from a specific node in the graph, or to skip specific failed nodes
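As an illustrative sketch (not taken from the deck), an Oozie workflow is an XML DAG of actions; a single map-reduce action with failure handling might look roughly like this, where the workflow and action names and the ${...} parameters are placeholders:

<workflow-app name="wordcount-wf" xmlns="uri:oozie:workflow:0.2">
  <start to="wordcount"/>
  <action name="wordcount">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>WordCount failed</message>
  </kill>
  <end name="end"/>
</workflow-app>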
69. High-Level Architecture
• Web service API
• The database stores:
– Workflow definitions
– Currently running workflow instances, including instance states and variables
[Diagram: clients call the Oozie WS API exposed by a Tomcat web-app, backed by a DB, which submits work to Hadoop/Pig/HDFS]
70. How is it triggered?
• Time
– Execute your workflow every 15 minutes (00:15, 00:30, 00:45, 01:00, ...)
• Time and data
– Materialize your workflow every hour, but only run it when the input data is ready
[Diagram: hourly slots 01:00 through 04:00, each gated on whether the Hadoop input data exists]
72. Oozie use criteria
• Need to launch, control, and monitor jobs from your Java apps
– Java Client API / command-line interface
• Need to control jobs from anywhere
– Web service API
• Have jobs that you need to run every hour, day, or week
• Need to receive notification when a job is done
– E.g., email when a job is complete
74. Hue – Developed by Cloudera
• Hadoop User Experience
• An open-source project (Apache-licensed)
• HUE is a web UI for Hadoop
• A platform for building custom applications, with a nice UI library
75. Hue
• HUE comes with a suite of applications
– File Browser: Browse HDFS; change permissions and
ownership; upload, download, view and edit files.
– Job Browser: View jobs, tasks, counters, logs, etc.
– Beeswax: Wizards to help create Hive tables, load data, run and
manage Hive queries, and download results in Excel format.
79. What is Mahout?
• A machine-learning tool
• Provides distributed and scalable machine-learning algorithms on the Hadoop platform
• Makes building intelligent applications easier and faster
80. Why Mahout?
• Current state of ML libraries
– Lack community
– Lack documentation and examples
– Lack scalability
– Are research-oriented
81. Mahout – Scale
• Scales to large datasets
– Hadoop MapReduce implementations that scale linearly with data
• Scalable to support your business case
– Mahout is distributed under a commercially friendly Apache Software License
• Scalable community
– Vibrant, responsive, and diverse
82. Mahout – Four Use Cases
• Mahout machine-learning algorithms
– Recommendation mining: takes users' behavior and finds items that a given user might like
– Clustering: takes e.g. text documents and groups them based on related document topics
– Classification: learns from existing categorized documents what documents of a specific category look like, and is able to assign unlabeled documents to the appropriate category
– Frequent itemset mining: takes a set of item groups (e.g. terms in a query session, shopping cart contents) and identifies which individual items typically appear together
83. Use Case Example
• Predict what the user likes based on
– His/her historical behavior
– The aggregate behavior of people similar to him/her
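A minimal sketch of that use case with Mahout's Taste collaborative-filtering API (the ratings.csv file, the neighborhood size of 10, and the user/item ids are hypothetical):

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderDemo {
  public static void main(String[] args) throws Exception {
    // Each line of ratings.csv: userID,itemID,preference
    DataModel model = new FileDataModel(new File("ratings.csv"));
    // Similarity between users, computed from their historical behavior.
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // "People similar to this user": the 10 nearest neighbors.
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // Top 3 items user 1 is predicted to like.
    List<RecommendedItem> items = recommender.recommend(1, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " (" + item.getValue() + ")");
    }
  }
}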
84. Conclusion
Today, we introduced:
• Why Hadoop is needed
• The basic concepts of HDFS and MapReduce
• What sort of problems can be solved with Hadoop
• What other projects are included in the Hadoop
ecosystem