Basics of Big Data Analytics & Hadoop
Ambuj Kumar
Ambuj_kumar@aol.com
http://ambuj4bigdata.blogspot.in
http://ambujworld.wordpress.com
Agenda
 Big Data – Concepts overview
 Analytics – Concepts overview
 Hadoop – Concepts overview
 HDFS
 Concepts overview
 Data Flow – Read & Write Operation
 MapReduce
 Concepts overview
 WordCount Program
 Use Cases
 Landscape
 Hadoop Features & Summary
What is Big Data?
Big data is data that is too large, complex, and dynamic for any conventional data tools to capture, store, manage, and analyze.
Challenges of Big Data
• Storage (~ Petabytes)
• Processing (in a timely manner)
• Variety of data (structured, semi-structured, un-structured)
• Cost
Big Data Analytics
Big data analytics is the process of examining large amounts of data of a variety of types (big data) to uncover hidden patterns, unknown correlations, and other useful information.
Big Data Analytics Solutions
There are many different Big Data Analytics solutions on the market:
 Tableau – visualization tools
 SAS – statistical computing
 IBM and Oracle – a range of tools for Big Data analysis
 Revolution – statistical computing
 R – open-source tool for statistical computing
What is Hadoop?
 Open-source data storage and processing API
 Massively scalable, automatically parallelizable
 Based on work from Google
 GFS + MapReduce + BigTable
 Current distributions based on open-source and vendor work
 Apache Hadoop
 Cloudera – CDH4
 Hortonworks
 MapR
 AWS
 Windows Azure HDInsight
Why Use Hadoop?
 Cheaper – scales to petabytes or more
 Faster – parallel data processing
 Better – suited for particular types of Big Data problems
Hadoop History
In 2008, Hadoop became an Apache Top-Level Project.
Comparing: RDBMS vs. Hadoop

                     Traditional RDBMS        Hadoop / MapReduce
Data Size            Gigabytes (Terabytes)    Petabytes (Exabytes)
Access               Interactive and Batch    Batch – NOT Interactive
Updates              Read/Write many times    Write once, Read many times
Structure            Static Schema            Dynamic Schema
Integrity            High (ACID)              Low
Scaling              Nonlinear                Linear
Query Response Time  Can be near immediate    Has latency (due to batch processing)
Where is Hadoop used?

Industry       Use Cases
Technology     Search, People you may know, Movie recommendations
Banks          Fraud detection, Regulatory, Risk management
Media
Retail         Marketing analytics, Customer service, Product recommendations
Manufacturing  Preventive maintenance
Companies Using Hadoop
 Search: Yahoo, Amazon, Zvents
 Log Processing: Facebook, Yahoo, ContextWeb, Joost, Last.fm
 Recommendation Systems: Facebook, LinkedIn
 Data Warehouse: Facebook, AOL
 Video & Image Analysis: New York Times, Eyealike
… and in almost every domain!
Hadoop is a set of Apache frameworks, and more…
 Data storage (HDFS)
 Runs on commodity hardware (usually Linux)
 Horizontally scalable
 Processing (MapReduce)
 Parallelized (scalable) processing
 Fault tolerant
 Other tools / frameworks
 Data access: HBase, Hive, Pig, Mahout
 Tools: Hue, Sqoop
 Monitoring: Greenplum, Cloudera
[Layer diagram: Hadoop Core - HDFS, MapReduce API, Data Access, Tools & Libraries, Monitoring & Alerting]
Core parts of a Hadoop distribution

HDFS Storage
 Redundant (3 copies)
 For large files – large blocks
 64 or 128 MB / block
 Can scale to 1000s of nodes

MapReduce API
 Batch (job) processing
 Distributed and localized to clusters (Map)
 Auto-parallelizable for huge amounts of data
 Fault-tolerant (auto retries)
 Adds high availability and more

Other Libraries
 Pig, Hive, HBase, others
Hadoop Cluster: HDFS (Physical) Storage

One Name Node (plus a Secondary Name Node)
• Contains a web site to view cluster information
• V2 Hadoop uses multiple Name Nodes for HA
Many Data Nodes (e.g., Data Node 1, Data Node 2, Data Node 3)
• 3 copies of each block by default
Work with data in HDFS
• Using common Linux shell commands
• Block size is 64 or 128 MB
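Those shell commands have direct equivalents in the HDFS Java API. Below is a minimal sketch, assuming a reachable cluster; the NameNode address and directory path are placeholders, not values from the slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/user/demo");                // hypothetical directory
        fs.mkdirs(dir);                                   // like: hadoop fs -mkdir
        for (FileStatus status : fs.listStatus(dir)) {    // like: hadoop fs -ls
            System.out.println(status.getPath() + "  " + status.getLen());
        }
        fs.delete(dir, true);                             // like: hadoop fs -rm -r
    }
}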
MapReduce Job – Logical View
Hadoop Ecosystem
Common Hadoop Distributions
 Open Source
 Apache
 Commercial
 Cloudera
 Hortonworks
 MapR
 AWS MapReduce
 Microsoft HDInsight
HDFS: Architecture
 Master: NameNode
 Slaves: a bunch of DataNodes

HDFS Layers
 Namespace (NS) – managed by the NameNode
 Block Storage – block management on the NameNode; physical storage on the DataNodes
HDFS: Basic Features
 Highly fault-tolerant
 High throughput
 Suitable for applications with large data sets
 Streaming access to file system data
 Can be built out of commodity hardware
HDFS Write (1/2)
1. Client contacts the NameNode to write data
2. NameNode tells the client which DataNodes to write to
3. Client sequentially writes blocks (A1, A2, A3, A4) to the DataNodes
HDFS Write (2/2)
4. DataNodes replicate the data blocks, orchestrated by the NameNode, until each block (A1–A4) is stored on three DataNodes
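A client sees this whole pipeline through a single create() call. The sketch below, with a placeholder path, writes a small file and then uses getFileBlockLocations to report which DataNodes ended up holding replicas of each block.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/data.txt");      // hypothetical path

        try (FSDataOutputStream out = fs.create(file)) {  // NameNode picks the DataNodes
            out.writeBytes("some sample data\n");
        }

        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            // One line per block: the hosts holding its replicas
            System.out.println(String.join(",", loc.getHosts()));
        }
    }
}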
HDFS Read
1. Client contacts the NameNode to read data
2. NameNode tells the client where the blocks can be found
3. Client sequentially reads the blocks from the DataNodes
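The matching read side, again as a sketch with a placeholder path: open() consults the NameNode for block locations, and the returned stream pulls the bytes from the DataNodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataInputStream in = fs.open(new Path("/user/demo/data.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {  // stream blocks from DataNodes
                System.out.println(line);
            }
        }
    }
}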
HA (High Availability) for the NameNode
 Active NameNode
 Performs the normal NameNode operations
 Standby NameNode
 Maintains a copy of the NameNode's data
 Ready to become the active NameNode
MapReduce
 A MapReduce job consists of two tasks:
 Map Task
 Reduce Task
 Blocks of data distributed across several machines are processed by map tasks in parallel
 Results are aggregated in the reducer
 Works only on KEY/VALUE pairs
MapReduce: Word Count
Can we do word count in parallel?

Input lines:
 Deer Bear River
 Car Car River
 Deer Car Bear

Map output (one (word, 1) pair per word):
 Deer 1, Bear 1, River 1
 Car 1, Car 1, River 1
 Deer 1, Car 1, Bear 1

Reduced output (sums per word):
 Bear 2
 Car 3
 Deer 2
 River 2
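Before the distributed version on the following slides, here is a conceptual sketch in plain Java (no Hadoop) of the same map-then-reduce shape, using the sample lines above: the flatMap step plays the mapper, emitting one token per word, and groupingBy plays the reducer, summing per key.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountLocal {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Deer Bear River", "Car Car River", "Deer Car Bear");

        // "Map": split every line into words; "Reduce": group by word and count
        Map<String, Long> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}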
MapReduce: Word Count Program
Data Flow in a MapReduce Program in Hadoop
Mapper Class

package ambuj.com.wc;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends
        Mapper<LongWritable, Text, Text, LongWritable> {

    private final static LongWritable one = new LongWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable inputKey, Text inputVal, Context context)
            throws IOException, InterruptedException {
        // inputKey is the line's byte offset; inputVal is the line of text
        String line = inputVal.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            // Emit (word, 1) for every token in the line
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}
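The four type parameters on Mapper declare the input key/value types (the line's byte offset and the line text, as delivered by TextInputFormat) and the output key/value types (a word and its count of 1).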
Reducer Class

package ambuj.com.wc;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends
        Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    public void reduce(Text key, Iterable<LongWritable> listOfValues,
            Context context) throws IOException, InterruptedException {
        // Sum the 1s emitted by the mappers for this word
        long sum = 0;
        for (LongWritable val : listOfValues) {
            sum = sum + val.get();
        }
        context.write(key, new LongWritable(sum));
    }
}
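Because summing is associative, this same class could also be registered as a combiner (job.setCombinerClass(WordCountReducer.class)) to pre-aggregate map output locally and reduce shuffle traffic; the slides do not do this, so treat it as an optional refinement.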
Driver Class

package ambuj.com.wc;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "WordCount");
        job.setJarByClass(WordCountDriver.class);

        // Wire up the mapper and reducer classes
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // args[0] = input directory, args[1] = output directory (must not exist yet)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new WordCountDriver(), args));
    }
}
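To run the job, the three classes are typically packaged into a jar and submitted with something like hadoop jar wordcount.jar ambuj.com.wc.WordCountDriver <input-dir> <output-dir>; the jar name here is illustrative, and the output directory must not already exist.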
A View of Hadoop

Master
 Job Tracker (MapReduce) and Name Node (HDFS)
Slaves
 Data Nodes store the HDFS blocks; each also runs a Task Tracker that executes Map and Reduce tasks on local data
The client submits a Job to the Job Tracker, which schedules Tasks on the Task Trackers
Use Cases
 Utilities want to predict power consumption
 Banks and insurance companies want to understand risk
 Fraud detection
 Marketing departments want to understand customers
 Recommendations
 Location-based ad targeting
 Threat analysis
Big Data Landscape
Hadoop Features & Summary
Distributed framework for processing and storing data, generally on commodity hardware. Completely open source and written in Java.
 Store anything
 Unstructured or semi-structured data
 Storage capacity
 Scales linearly; cost is not exponential
 Data locality – process data your way
 Code moves to data
 In MapReduce you specify the actual steps in processing the data and derive the output
 Streaming access: process data in any language
 Failure and fault tolerance:
 Detects failure and heals itself
 Reliable: data is replicated and failed tasks are rerun; no need to maintain backups of the data
 Cost effective: Hadoop is designed as a scale-out architecture operating on a cluster of commodity PC machines
The Hadoop framework transparently provides applications both reliability and data motion.
Primarily used for batch processing, not real-time / transactional user applications.
References - Hadoop
 Hadoop: The Definitive Guide, Third Edition, by Tom White
 http://hadoop.apache.org
 http://www.cloudera.com
 http://ambuj4bigdata.blogspot.com
 http://ambujworld.wordpress.com
Thank You