BIG DATA ANALYTICS
WITH APACHE HADOOP
“Big Data: A Revolution that Will Transform How We Live, Work, and Think”
-Viktor Mayer-SchĂśnberger and Kenneth Cukier
Team Members
Abhishek Kumar : Y11UC010
Sachin Mittal : Y11UC189
Subodh Rawani : Y11UC230
Suman Saurabh : Y11UC231
Contents
1. What is Big Data?
 Definition
 Turning Big Data into Value: 5 V’s
2. Big Data Analytics
3. Big Data and Hadoop
 History of Hadoop
 About Apache Hadoop
 Key Features of Hadoop
4. Hadoop and MapReduce
 About MapReduce
 MapReduce Architecture
 MapReduce Functionality
 MapReduce Examples
1) What is Big Data?
Definition
“Data is the oil of the 21st century, and analytics is the combustion engine”
-Peter Sondergaard, Senior Vice President, Gartner Research
“Big Data is high-volume, high-velocity and high-variety information assets that require new
forms of processing to enable enhanced decision making, insight discovery and process
optimisation.”
“It is a subjective term; what it involves is the analysis of data from multiple sources, joined and
aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.”
-Tom White in Hadoop: The Definitive Guide
Big Data is fuelled by two things:
• The increasing ‘datafication’ of the world, which generates new data at frightening rates.
• Technological advances that let us harness this large, complex data and perform analysis
using improved techniques.
Big data describes the exponential growth and availability of data, both structured and unstructured. This data
comes from everywhere: climate sensors, social media posts, digital files, buy/sell transaction records, cell-phone
GPS signals and more.
Statistics of Data Generated
Big Data in Today’s Business and Technology
Environment
 235 terabytes of data had been collected by the
U.S. Library of Congress as of April 2011.
 Facebook stores, accesses, and analyzes 30+
petabytes of user-generated data.
 Walmart handles more than 1 million customer
transactions every hour, which are imported into
databases estimated to contain more than 2.5
petabytes of data.
 More than 5 billion people are calling, texting,
tweeting and browsing on mobile phones
worldwide.
 In 2008, Google was processing 20,000 terabytes
of data (20 petabytes) a day.
The Rapid Growth of Unstructured Data
 YouTube users upload 48 hours of new video
every minute of the day.
 Brands and organizations on Facebook receive
34,722 Likes every minute of the day.
 Twitter sees roughly 175 million tweets every day
and has more than 465 million accounts.
 In late 2011, IDC Digital Universe published a
report indicating that some 1.8 zettabytes of data
would be created that year.
 In other words, the amount of data in the world
today is equal to:
 Every person in the world having more than 215m high-
resolution MRI scans a day.
 More than 200bn HD movies – which would take a person
47m years to watch.
Turning Big Data into Value: 5 V’s
The Digital Era gives us unprecedented
amounts of data in terms of Volume,
Velocity, Variety and Veracity; properly
channelled, they lead to Value.
Volume: Refers to the terabytes, petabytes and even
zettabytes of data generated every second.
Velocity: The speed at which new data is generated
every second, e.g. Google, Twitter, Facebook.
Variety: Data in different formats, such as text, images,
audio, video and so on, can be stored and processed
rather than only relational databases.
Veracity: The trustworthiness of the data, e.g. Twitter
data with hashtags, abbreviations, typos and
colloquial speech, as well as the reliability and
accuracy of content. Even data that is not fully
reliable can still be processed.
Value: Having access to big data is no good unless
we can turn it into value.
[Figure: turning big data into value. The ‘datafication’ of our world (activities, conversations,
words, voice, social media, browser logs, photos, videos, sensors, etc.) yields data characterised
by Volume, Veracity, Variety and Velocity; analysing it (text analytics, sentiment analysis, face
recognition, voice analytics, movement analytics, etc.) turns it into Value.
Source: © 2014 Advanced Performance Institute, BWMC Ltd.]
New technologies in distributed systems and cloud computing, together with the latest
software and analysis approaches, allow us to store and process data and turn it into value at massive scale.
Some Big Data Use Cases by Industry
Telecommunications: Network analytics; Location-based services
Retail: Merchandise optimization; Supply-chain management
Banking: Fraud detection; Trade surveillance
Media: Click-fraud prevention; Social-graph analysis
Energy: Smart-meter analytics; Distribution load forecasting
Manufacturing: Customer-care call centers; Customer relationship management
Public sector: Threat detection; Cyber security
Healthcare: Clinical-trial data analysis; Supply-chain management
Insurance: Catastrophe modelling; Claims fraud
Challenges of Big Data
 How to store and protect big data?
 How to organize and catalog the data that you have backed up?
 How to keep costs low while ensuring that all the critical data is
available when you need it?
 Analytical challenges
 Human resources and manpower
 Technical challenges
 Privacy and security
2) Big Data Analytics
Why Big Data Analytics?
• Understand existing data resources.
• Process them and uncover patterns,
correlations and other useful
information that can be used to make
better decisions.
• With big data analytics, data scientists
and others can analyse huge volumes
of data that conventional analytics and
business intelligence solutions can't
touch.
Traditional vs. Big Data Approaches
IT
Structures the
data to answer
that question
IT
Delivers a platform to
enable creative
discovery
Business
Explores what questions
could be asked
Business Users
Determine what
question to ask
Monthly sales reports
Profitability analysis
Customer surveys
Brand sentiment
Product strategy
Maximum asset utilization
Big Data Approach
Iterative & Exploratory Analysis
Traditional Approach
Structured & Repeatable Analysis
Tools Employed for Data Analytics
• NoSQL Databases: MongoDB,
Cassandra, HBase, Hypertable
• Storage: S3, Hadoop
Distributed File System (HDFS)
• Servers: EC2, Google App
Engine, Heroku
• MapReduce: Hadoop, Hive, Pig,
Cascading, S4, MapR
• Processing: R, Yahoo! Pipes,
Solr/Lucene, BigSheets
Practical Examples of Data Analytics
Better understand and target customers
To better understand and target customers, companies expand their traditional
data sets with social media data, browser data, text analytics or sensor data to get a
more complete picture of their customers. The big objective, in many cases, is
to create predictive models. Using big data, telecom companies can now better
predict customer churn, retailers can predict what products will sell, and car
insurance companies understand how well their customers actually drive.
Improving Health
The computing power of big data analytics enables us to find new cures and
better understand and predict disease patterns. We can use all the data from
smart watches and wearable devices to better understand links between
lifestyles and diseases. Big data analytics also allow us to monitor and predict
epidemics and disease outbreaks, simply by listening to what people are saying
(e.g. “Feeling rubbish today - in bed with a cold”) or searching for on the Internet.
Source: © 2014 Advanced Performance Institute, BWMC Ltd.
Practical Examples of Data Analytics
Improving Security and Law Enforcement
Security services use big data analytics to foil terrorist plots and detect cyber
attacks. Police forces use big data tools to catch criminals and even predict
criminal activity, and credit card companies use big data analytics to detect
fraudulent transactions.
Improving and Optimizing Cities and Countries
Big data is used to improve many aspects of our cities and countries. For
example, it allows cities to optimize traffic flows based on real-time traffic
information as well as social media and weather data. A number of cities are
currently using big data analytics with the aim of turning themselves into Smart
Cities, where the transport infrastructure and utility processes are all joined up:
a bus waits for a delayed train, and traffic signals predict traffic volumes and
operate to minimize jams.
Source: © 2014 Advanced Performance Institute, BWMC Ltd.
3) Big Data and Hadoop
Brief History of Hadoop
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used
text search library. Hadoop has its origins in Apache Nutch, an open source web
search engine, itself a part of the Lucene project.
Nutch was started in 2002, and a working crawler and search system quickly emerged.
However, its architecture wouldn’t scale to the billions of pages on the Web. In 2003
Google published a paper on the Google File System (GFS), which was being
used in production at Google. Hence, in 2004 the Nutch developers implemented the Nutch Distributed
Filesystem (NDFS) using the GFS architecture, which solved their storage needs for the
very large files generated as part of the web crawl and indexing process.
In 2004, Google published the paper that introduced MapReduce to the world. NDFS
and the MapReduce implementation in Nutch were applicable beyond the realm of
search, and in February 2006 they moved out of Nutch to form an independent
subproject of Lucene called Hadoop.
Apache Hadoop
 Framework for the distributed
processing of large data sets across
clusters of computers using simple
programming models.
 Designed to scale up from a single
server to thousands of machines, with
a very high degree of fault tolerance.
 Rather than relying on high-end
hardware, the resiliency of these
clusters comes from the software’s
ability to detect and handle failures at
the application layer.
Key Features of Hadoop
1. Flexible
2. Scalable
3. Building more efficient data
economy
4. Cost Effective
5. Fault Tolerant
1) Flexible
1. Hadoop is schema-less, and can absorb any type of data,
structured or not, from any number of sources.
2. Data from multiple sources can be joined and aggregated in arbitrary
ways, enabling deeper analyses than any one system can provide.
3. We can develop MapReduce programs on Linux, Windows and OS X in
many languages, such as Python, R, C++, Perl, Ruby, etc.
2) Scalable
Scalability is one of the primary forces driving popularity and adoption
of the Apache Hadoop project. A typical use case for Hadoop is an
emerging web site starting with a five-node cluster. New nodes can be
added as needed, without needing to change data formats,
how data is loaded, how jobs are written, or the applications on top.
1. Yahoo reportedly ran numerous clusters of 4000+ nodes with
four 1 TB drives per node and 15 PB of total storage capacity.
2. Facebook’s 2000-node warehouse cluster is provisioned for 21 PB of
total storage capacity. Extrapolating the announced growth rate, its
namespace should have close to 200 million objects by now.
3. eBay runs a 700-node cluster. Each node has 24 TB of local disk
storage, 72 GB of RAM, and a 12-core CPU. Total cluster size is 16
PB. It is configured to run 26,000 MapReduce tasks simultaneously.
3) Building a more efficient data economy
Data is the new currency of the modern world. Businesses that
successfully maximize its value will have a decisive impact on their own
value and on their customers’ success.
Apache Hadoop allows businesses to create highly scalable and cost-
efficient data stores. It offers data value at unprecedented scale.
4) Cost Effective
Hadoop brings massively parallel computing to commodity servers. The
result is a sizeable decrease in the cost per terabyte of storage, which
in turn makes it affordable to model all your data.
It's a cost-effective alternative to a conventional extract, transform, and
load (ETL) process that extracts data from different systems, converts it
into a structure suitable for analysis and reporting, and loads it into a
database.
5) Fault Tolerant
When you lose a node, the system redirects work to another location of
the data and continues processing without missing a beat.
When a node becomes non-functional, a nearby node (a “supernode”)
that is near completion of, or has already completed, its own task
reassigns itself to the task of the faulty node, the description of which
is present in shared memory. A faulty node therefore does not have to
wait for the master node to notice its failure, which reduces execution
time whenever a node fails.
Hadoop Ecosystem
HDFS Architecture
HDFS is a filesystem designed for storing
very large files with streaming data access
patterns, running on clusters of commodity
hardware. HDFS clusters consist of a
NameNode that manages the file system
metadata and DataNodes that store the
actual data.
Uses:
• Storage of large imported files from
applications outside of the Hadoop
ecosystem.
• Staging of imported files to be
processed by Hadoop applications.
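As a concrete illustration of how an application talks to HDFS, here is a minimal, hedged sketch
using the standard org.apache.hadoop.fs.FileSystem Java API (the path /user/demo/sample.txt is
just an example, not from the original slides):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class HdfsSketch {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from the cluster configuration (core-site.xml)
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/sample.txt"); // example path
    // Write a small file; the NameNode records the metadata,
    // while the DataNodes store the actual blocks
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello hdfs");
    }
    // Stream the file back
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }
    fs.close();
  }
}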
•HBase:
Inspired by Google’s BigTable, HBase is a NoSQL distributed column-
oriented database that runs on top of HDFS, on which random read/write
can be performed. HBase enables you to store and retrieve random data
in near real-time. It can also be combined with MapReduce to ease bulk
operations such as indexing or analysis. (A short client sketch follows
after this list of ecosystem components.)
• Hive:
Hive bridges the gap between SQL-based RDBMSs and NoSQL-based
Hadoop. Datasets from HDFS and HBase can be mapped onto Hive, against
which queries can be written in an SQL-like language called HiveQL.
Though Hive may not be a panacea for complex operations, it
reduces the difficulty of having to write MapReduce jobs if a
programmer knows SQL.
•Pig: Apache Pig uses the data flow language Pig Latin. Pig supports relational
operations such as join, group and aggregate, and it can be scaled across
multiple servers simultaneously. Time-intensive ETL operations, analytics
on sample data, and running complex tasks that collate multiple data
sources are some of the use cases that can be handled using Pig.
•Flume:
Flume is a distributed system that aggregates streaming data from
different sources and adds it to a centralized datastore for a Hadoop
cluster, such as HDFS. Flume facilitates data aggregation, which involves
importing and processing data for computation in HDFS or storage in
databases.
• Sqoop:
Sqoop is the latest Hadoop framework to be enlisted in the Bossie awards for
open source big data tools. Sqoop enables two-way import/export of
bulk data between HDFS/Hive/HBase and relational or structured
databases. Unlike Flume, Sqoop helps in the data transfer of structured
datasets.
• Mahout: Mahout is a suite of scalable machine learning libraries implemented on
top of MapReduce. Commercial use cases of machine learning include
predictive analysis via collaborative filtering, clustering and classification.
Product/service recommendations, investigative data mining, statistical
analysis are some of its generic use cases.
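As referenced in the HBase description above, here is a minimal, illustrative sketch of random
read/write against HBase using its Java client API. The table name "users" and the column
family/qualifier are assumptions made up for the example, not something from the slides:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
public class HBaseSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {
      // Random write: one row with a single cell
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);
      // Random read of the same cell, in near real-time
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(value));
    }
  }
}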
4) Hadoop and MapReduce
MapReduce
 MapReduce is a programming paradigm for easily writing applications which process
vast amounts of data (multi-terabyte datasets) in parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
 The framework is divided into two parts:
 Map, which parcels out work to the different nodes in the distributed cluster.
 Reduce, which collates the work and resolves the results into a single value.
 The MapReduce framework consists of a single master JobTracker and one
slave TaskTracker per cluster node. The master is responsible for scheduling the jobs'
component tasks on the slaves, monitoring them and re-executing the failed tasks.
 Although the Hadoop framework is implemented in Java, MapReduce applications
can be written in Python, Ruby, R, C++, etc., e.g. via Hadoop Streaming and Hadoop Pipes.
Hadoop MapReduce Architecture
MapReduce Core Functionality (I)
Data flow beyond the two key pieces (map and reduce):
• Input reader – divides input into appropriate-size splits, which get
assigned to a Map function.
• Map function – maps file data to smaller, intermediate <key, value>
pairs.
• Compare function – input for Reduce is pulled from the Map
intermediate output and sorted according to the compare function.
• Reduce function – takes intermediate values and reduces them to a
smaller solution handed back to the framework.
• Output writer – writes the file output.
How MapReduce Works
User to do list:
 Indicate
• input/output files
• M: number of map tasks
• R: number of reduce tasks
• W: number of
machines
 Write map and reduce
functions
 Submit the job
 Input files are split into M pieces
on distributed file system
• Typically ~ 64 MB blocks
 Intermediate files created from
map tasks are written to local disk
 A sorted and shuffled output is sent
to the reduce framework (a combiner is
also used in most cases).
 Output files are written to
distributed file system.
How MapReduce Works (Cont..)
MapReduce Examples
1. WordCount (reads a text file and counts how often words occur).
2. TopN (finds the top-n most used words of a text file).
1. WordCount
Reads text files and counts how often each word occurs.
The input and the output are text files.
Need three classes:
• WordCount.java: Driver class with main function
• WordMapper.java: Mapper class with map method
• SumReducer.java: Reducer class with reduce method
WordCount Example (Contd.)
WordMapper.java
Mapper class with map function.
For the given sample input, assuming two map nodes,
the sample input is distributed to the maps.
The first map emits:
<Hello, 1> <World, 1> <Bye, 1> <World, 1>
The second map emits:
<Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
WordCount Example (Contd.)
SumReducer.java
Reducer class with reduce function.
For the input from the two mappers,
the reduce method just sums up the values,
which are the occurrence counts for each key.
Thus the output of the job is:
<Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
WordCount (Driver)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
  public static void main(String[] args) throws Exception {
    // Check input and output files
    if (args.length != 2) {
      System.out.println("usage: [input] [output]");
      System.exit(-1);
    }
    Job job = Job.getInstance(new Configuration());
    // Set output (key, value) types
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Set Mapper/Reducer classes
    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);
    // Set input/output format classes
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    // Set input/output paths
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Set driver class
    job.setJarByClass(WordCount.class);
    // Submit the job to the master node
    job.submit();
  }
}
The driver walks through the following steps (marked with comments in the listing above):
check the input and output arguments; set the output (key, value) types; set the Mapper and
Reducer classes; set the input/output format classes; set the input/output paths; set the driver
class with setJarByClass; and finally submit the job to the master node.
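Assuming the three classes are compiled against the Hadoop libraries and packaged into a JAR
(wordcount.jar is just a placeholder name, not from the slides), the job would typically be
launched from the command line as
hadoop jar wordcount.jar WordCount <input dir> <output dir>
where the two arguments are the HDFS input and output directories passed to args[0] and args[1].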
WordMapper (Mapper class)
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Extends Mapper with the input (key, value) types and output (key, value) types
public class WordMapper extends Mapper<Object, Text, Text, IntWritable> {
  private Text word = new Text();
  private final static IntWritable one = new IntWritable(1);
  @Override
  public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    // Break the line into words for processing
    StringTokenizer wordList = new StringTokenizer(value.toString());
    while (wordList.hasMoreTokens()) {
      // Count each word: emit (word, 1) to the Context output
      word.set(wordList.nextToken());
      context.write(word, one);
    }
  }
}
The mapper extends the Mapper class with its input (key, value) and output (key, value) types,
reads words from each line of the input file, and counts each word by emitting a (word, 1) pair
to the Context output.
Shuffler/Sorter
Maps emit (key, value) pairs.
The shuffler/sorter of the Hadoop framework
sorts the (key, value) pairs by key,
then appends the values to make (key, list of values) pairs.
For example,
the first and second maps emit:
<Hello, 1> <World, 1> <Bye, 1> <World, 1>
<Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
The shuffler produces the following, which becomes the input of the reducer:
<Bye, 1>, <Goodbye, 1>, <Hadoop, <1,1>>, <Hello, <1,1>>, <World, <1,1>>
SumReducer (Reducer class)
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// Extends Reducer with the input (key, list of values) types and output (key, value) types
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable totalWordCount = new IntWritable();
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    // For each word, sum the number of occurrences
    int wordCount = 0;
    Iterator<IntWritable> it = values.iterator();
    while (it.hasNext()) {
      wordCount += it.next().get();
    }
    // The total count becomes the output value for the word
    totalWordCount.set(wordCount);
    context.write(key, totalWordCount);
  }
}
The reducer extends the Reducer class with its input (key, list of values) and output (key, value)
types. For each word it sums the number of values (the per-word counts emitted by the mappers),
and the total count becomes the value it writes to the Context output.
Reducer
Input: the shuffler output, which becomes the input of the reducer:
<Bye, 1>, <Goodbye, 1>, <Hadoop, <1,1>>, <Hello, <1,1>>, <World, <1,1>>
Output:
<Bye, 1>, <Goodbye, 1>, <Hadoop, 2>, <Hello, 2>, <World, 2>
Map()
The Mapper implementation, via the map method, processes one line at a
time, as provided by the specified TextInputFormat. It then splits the line
into tokens separated by whitespaces, via the StringTokenizer, and emits a
key-value pair of < <word>, 1>.
For a sample input, the first map emits:
< Deer, 1>
< Beer, 1>
< River, 1>
The second map emits:
< Car, 1>
< River, 1>
< Car, 1>
Map() and Reduce()
After the combiner has locally aggregated each map’s output,
the output of the first map is:
< Deer, 1>
< Beer, 1>
< River, 1>
and the output of the second map is (the two < Car, 1> pairs are combined into one):
< Car, 2>
< River, 1>
Map() and Reduce() (Continued)
Reducer()
The Reducer implementation, via the reduce method, just sums up the
values, which are the occurrence counts for each key (i.e. words in this
example).
2. TopN
 We want to find the top-n most used words of a text file: “Flatland” by E. A. Abbott.
 The input and the output are text files.
 Need three classes:
 TopN.java
 Driver class with main function
 TopNMapper.java
 Mapper class with map method
 TopNReducer.java
 Reducer class with reduce method
TopN (Driver)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import java.io.IOException;
import java.util.*;
public class TopN {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: TopN <in> <out>");
      System.exit(2);
    }
    Job job = Job.getInstance(conf);
    job.setJobName("Top N");
    job.setJarByClass(TopN.class);
    job.setMapperClass(TopNMapper.class);
    //job.setCombinerClass(TopNReducer.class);
    job.setReducerClass(TopNReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
TopNMapper
/**
 * The mapper reads one line at a time, splits it into an array of single words and emits every
 * word to the reducers with a value of 1.
 */
public static class TopNMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  // Punctuation and symbols stripped from the input before tokenizing (escaped for a Java regex)
  private String tokens = "[_|$#<>^=\\[\\]\\*/,;,.\\-:()?!\"']";
  @Override
  public void map(Object key, Text value, Context context) throws IOException, InterruptedException
  {
    String cleanLine = value.toString().toLowerCase().replaceAll(tokens, " ");
    StringTokenizer itr = new StringTokenizer(cleanLine);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken().trim());
      context.write(word, one);
    }
  }
}
TopNReducer
/**
* The reducer retrieves every word and puts it into a Map: if the word already exists in the
* map, increments its value, otherwise sets it to 1.
*/
public static class TopNReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private Map<Text, IntWritable> countMap = new HashMap<>();
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException,
InterruptedException {
//computes the number of occurrences of a single word
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
// puts the number of occurrences of this word into the map.
// We need to create another Text object because the Text instance
// we receive is the same for all the words
countMap.put(new Text(key), new IntWritable(sum));
}
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
Map<Text, IntWritable> sortedMap = sortByValues(countMap);
int counter = 0;
for (Text key : sortedMap.keySet()) {
if (counter++ == 20) {
break;
}
context.write(key, sortedMap.get(key));
}
}
}
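The cleanup method above calls a sortByValues helper that the slides do not show, so here is a
minimal sketch of what it presumably looks like (the exact implementation is an assumption). It
returns a LinkedHashMap whose iteration order is by descending count, so the first 20 keys written
in cleanup are the most frequent words:
// Hypothetical helper: sorts the (word, count) map by descending count.
// Returns a LinkedHashMap so that iteration order follows the sorted order.
private static Map<Text, IntWritable> sortByValues(Map<Text, IntWritable> map) {
  List<Map.Entry<Text, IntWritable>> entries = new LinkedList<>(map.entrySet());
  entries.sort((a, b) -> b.getValue().get() - a.getValue().get());
  Map<Text, IntWritable> sorted = new LinkedHashMap<>();
  for (Map.Entry<Text, IntWritable> e : entries) {
    sorted.put(e.getKey(), e.getValue());
  }
  return sorted;
}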
TopN - Results
The 2286
Of 1634
And 1098
That 499
You 429
Not 317
But 279
For 267
By 317
In the shuffle and sort phase, the partitioner sends
every single word (the key) with the value “1” to
the reducers.
All this network traffic can be minimized
if we locally reduce the data that the
mapper emits.
This is achieved with a Combiner.
TopNCombiner
/**
* The combiner retrieves every word and puts it into a Map: if the word already exists in the
* map, increments its value, otherwise sets it to 1.
*/
public static class TopNCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException,
InterruptedException {
// computes the number of occurrences of a single word
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
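To actually run the job with this combiner, it has to be registered in the driver. The driver
listing above has the combiner line commented out; presumably the “with combiner” run shown
below used a call like the following (the class name matches the combiner shown here):
job.setCombinerClass(TopNCombiner.class);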
Hadoop Output: With and Without Combiner
Without Combiner ->
 Map input records = 4239
 Map output records = 37817
 Map output bytes = 359621
 Input split bytes = 118
 Combine input records = 0
 Combine output records = 0
 Reduce input groups = 4987
 Reduce shuffle bytes = 435261
 Reduce input records = 37817
 Reduce output records = 20
With Combiner ->
 Map input records = 4239
 Map output records = 37817
 Map output bytes = 359621
 Input split bytes = 116
 Combine input records = 37817
 Combine output records = 20
 Reduce input groups = 20
 Reduce shuffle bytes = 194
 Reduce input records = 20
 Reduce output records = 20
Advantages and Disadvantages of Using a Combiner
Advantages ->
Network transmissions are minimized.
Disadvantages ->
Hadoop doesn’t guarantee the execution of a combiner: it can be
executed 0, 1 or multiple times on the same input.
Key-value pairs emitted from the mapper are stored in the local file
system, and execution of the combiner can cause extensive I/O
operations.
Sources
 http://wikibon.org/blog/big-data-statistics/
 https://en.wikipedia.org/wiki/Big_data
 http://blog.qburst.com/2014/08/hadoop-big-data-analytics-tools/
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 

Big data analytics with Apache Hadoop

  • 10. The ‘datafication’ of our world (activities, conversations, words, voice, social media, browser logs, photos, videos, sensors, etc.) feeds data of high Volume, Velocity, Variety and Veracity into analysis techniques such as text analytics, sentiment analysis, face recognition, voice analytics and movement analytics, which turn it into Value. (Diagram copied from: © 2014 Advanced Performance Institute, BWMC Ltd.) New technologies in distributed systems and cloud computing, together with the latest software and analysis approaches, allow us to store and process this data into value at massive scale.
  • 11. Some Big Data Use Cases by Industry
    Telecommunications: network analytics, location-based services
    Retail: merchandise optimization, supply-chain management
    Banking: fraud detection, trade surveillance
    Media: click-fraud prevention, social graph analysis
    Energy: smart meter analytics, distribution load forecasting
    Manufacturing: customer care call centers, customer relationship management
    Public sector: threat detection, cyber security
    Healthcare: clinical trials data analysis, supply chain management
    Insurance: catastrophe modelling, claims fraud
  • 13. Challenges of big data  How to store and protect big data?  How to organize and catalog the data that you have backed up?  How to keep costs low while ensuring that all the critical data is available when you need it?  Analytical challenges  Human resources and manpower  Technical challenges  Privacy and security
  • 14. 2) Big Data Analytics
  • 15. Why Big-Data Analytics? • Understand existing data resources. • Process them and uncover patterns, correlations and other useful information that can be used to make better decisions. • With big data analytics, data scientists and others can analyse huge volumes of data that conventional analytics and business intelligence solutions can't touch.
  • 16. Traditional vs. Big Data Approaches
    Traditional approach (structured and repeatable analysis): business users determine what question to ask, and IT structures the data to answer that question. Typical outputs: monthly sales reports, profitability analysis, customer surveys.
    Big data approach (iterative and exploratory analysis): IT delivers a platform to enable creative discovery, and the business explores what questions could be asked. Typical outputs: brand sentiment, product strategy, maximum asset utilization.
  • 17. Tools Employed For Data Analytics • NoSQL databases: MongoDB, Cassandra, HBase, Hypertable • Storage: S3, Hadoop Distributed File System • Servers: EC2, Google App Engine, Heroku • MapReduce: Hadoop, Hive, Pig, Cascading, S4, MapR • Processing: R, Yahoo! Pipes, Solr/Lucene, BigSheets
  • 18. Practical Examples of Data Analytics
    Better understand and target customers: to better understand and target customers, companies expand their traditional data sets with social media data, browser logs, text analytics or sensor data to get a more complete picture of their customers. The big objective, in many cases, is to create predictive models. Using big data, telecom companies can now better predict customer churn, retailers can predict what products will sell, and car insurance companies understand how well their customers actually drive.
    Improving health: the computing power of big data analytics enables us to find new cures and better understand and predict disease patterns. We can use all the data from smart watches and wearable devices to better understand links between lifestyles and diseases. Big data analytics also allows us to monitor and predict epidemics and disease outbreaks, simply by listening to what people are saying (e.g. "Feeling rubbish today - in bed with a cold") or searching for on the Internet.
    (Copied from: © 2014 Advanced Performance Institute, BWMC Ltd.)
  • 19. Practical Examples of Data Analytics
    Improving security and law enforcement: security services use big data analytics to foil terrorist plots and detect cyber attacks, police forces use big data tools to catch criminals and even predict criminal activity, and credit card companies use big data analytics to detect fraudulent transactions.
    Improving and optimizing cities and countries: big data is used to improve many aspects of our cities and countries. For example, it allows cities to optimize traffic flows based on real-time traffic information as well as social media and weather data. A number of cities are currently using big data analytics with the aim of turning themselves into Smart Cities, where the transport infrastructure and utility processes are all joined up, where a bus would wait for a delayed train and where traffic signals predict traffic volumes and operate to minimize jams.
    (Copied from: © 2014 Advanced Performance Institute, BWMC Ltd.)
  • 20. 3) Big Data and Hadoop
  • 21. Brief history of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project. Nutch was started in 2002, and a working crawler and search system quickly emerged. However, its architecture would not scale to the billions of pages on the Web. In 2003 Google published a paper on the Google File System (GFS), which was being used in production at Google. Hence, in 2004 the Nutch developers implemented the Nutch Distributed Filesystem (NDFS) following the GFS architecture, which solved their storage needs for the very large files generated as part of the web crawl and indexing process. Also in 2004, Google published the paper that introduced MapReduce to the world. NDFS and the MapReduce implementation in Nutch were applicable beyond the realm of search, and in February 2006 they moved out of Nutch to form an independent subproject of Lucene called Hadoop.
  • 22. Apache Hadoop  Framework for the distributed processing of large data sets across clusters of computers using simple programming models.  Designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.  Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer.
  • 23. Key Features of Hadoop 1. Flexible 2. Scalable 3. Building more efficient data economy 4. Cost Effective 5. Fault Tolerant
  • 24. 1) Flexible 1. Hadoop is schema-less, and can absorb any type of data, structured or not, from any number of sources. 2. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide. 3. We can develop MapReduce programs on Linux, Windows or OS X in languages such as Python, R, C++, Perl, Ruby, etc.
  • 25. 2) Scalable Scalability is one of the primary forces driving the popularity and adoption of the Apache Hadoop project. A typical use case for Hadoop is an emerging web site starting out with a five-node cluster. New nodes can be added as needed, without changing data formats, how data is loaded, how jobs are written, or the applications on top. 1. Yahoo reportedly ran numerous clusters having 4000+ nodes with four 1 TB drives per node, around 15 PB of total storage capacity. 2. Facebook's 2000-node warehouse cluster is provisioned for 21 PB of total storage capacity. Extrapolating the announced growth rate, its namespace should have close to 200 million objects by now. 3. eBay runs a 700-node cluster. Each node has 24 TB of local disk storage, 72 GB of RAM, and a 12-core CPU. Total cluster size is 16 PB, and it is configured to run 26,000 MapReduce tasks simultaneously.
  • 26. 3) Building more efficient data economy Data is the new currency of the modern world. Businesses that successfully maximize its value will have a decisive impact on their own value and on their customers' success. Apache Hadoop allows businesses to create highly scalable and cost-efficient data stores. It offers data value at unprecedented scale.
  • 27. 4) Cost Effective Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data. It is a cost-effective alternative to a conventional extract, transform, and load (ETL) process that extracts data from different systems, converts it into a structure suitable for analysis and reporting, and loads it into a database.
  • 28. 5) Fault tolerant When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat. When any node becomes non-functional, a nearby node (a "supernode" that is near completion or has already completed its own task) reassigns itself to the task of the faulty node, whose description is kept in shared memory. A faulty node therefore does not have to wait for the master node to notice its failure, which reduces execution time when a node goes down.
  • 30. HDFS Architecture HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. HDFS clusters consist of a NameNode that manages the file system metadata and DataNodes that store the actual data. Uses: • Storage of large imported files from applications outside of the Hadoop ecosystem. • Staging of imported files to be processed by Hadoop applications.
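  To make the storage role concrete, here is a minimal sketch of writing and then reading an HDFS file from Java through Hadoop's FileSystem API; the NameNode address and the path used are hypothetical, not taken from the slides.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.nio.charset.StandardCharsets;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // fs.defaultFS would normally come from core-site.xml; set here only for clarity (hypothetical address)
      conf.set("fs.defaultFS", "hdfs://namenode:8020");
      FileSystem fs = FileSystem.get(conf);

      // Write a small file into HDFS (hypothetical path)
      Path file = new Path("/user/demo/hello.txt");
      try (FSDataOutputStream out = fs.create(file, true)) {
        out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
      }

      // Read it back as a stream
      try (BufferedReader in = new BufferedReader(
          new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
        System.out.println(in.readLine());
      }
    }
  }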
  • 31. HBase: Inspired by Google's BigTable, HBase is a NoSQL distributed column-oriented database that runs on top of HDFS and supports random reads and writes. HBase enables you to store and retrieve data in near real time, and it can be combined with MapReduce to ease bulk operations such as indexing or analysis.
    Hive: Hive bridges the gap between SQL-based RDBMSs and NoSQL-based Hadoop. Datasets from HDFS and HBase can be mapped onto Hive, and queries can be written in an SQL-like language called HiveQL. Though Hive may not be a panacea for complex operations, it removes the need to write MapReduce jobs for a programmer who knows SQL.
    Pig: Apache Pig uses the data flow language Pig Latin. Pig supports relational operations such as join, group and aggregate, and it can be scaled across multiple servers simultaneously. Time-intensive ETL operations, analytics on sample data, and complex tasks that collate multiple data sources are some of the use cases that can be handled using Pig.
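  Purely to illustrate the "random read/write" point about HBase, a minimal sketch using the standard HBase Java client (post-1.0 Connection/Table style); the table name, column family, qualifier and row key below are hypothetical, and the HBase client library is assumed to be on the classpath.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.*;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HBaseExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      try (Connection conn = ConnectionFactory.createConnection(conf);
           Table table = conn.getTable(TableName.valueOf("users"))) {   // hypothetical table

        // Random write: one row, one column
        Put put = new Put(Bytes.toBytes("row-1001"));
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
        table.put(put);

        // Random read of the same cell
        Result result = table.get(new Get(Bytes.toBytes("row-1001")));
        byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
        System.out.println(Bytes.toString(name));
      }
    }
  }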
  • 32. Flume: Flume is a distributed system that aggregates streaming data from different sources and adds it to a centralized datastore for a Hadoop cluster, such as HDFS. Flume facilitates data aggregation, importing and processing data for computation in HDFS or for storage in databases.
    Sqoop: Sqoop is the latest Hadoop framework to be enlisted in the Bossie awards for open source big data tools. Sqoop enables two-way import/export of bulk data between HDFS/Hive/HBase and relational or structured databases. Unlike Flume, Sqoop handles transfer of structured datasets.
    Mahout: Mahout is a suite of scalable machine learning libraries implemented on top of MapReduce. Commercial use cases of machine learning include predictive analysis via collaborative filtering, clustering and classification. Product/service recommendations, investigative data mining and statistical analysis are some of its generic use cases.
  • 33. 4) Hadoop and MapReduce
  • 34. MapReduce  MapReduce is a programming paradigm for easily writing applications which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.  The framework is divided into two parts:  Map, which parcels out work to the different nodes in the distributed cluster.  Reduce, which collates the work and resolves the results into a single value.  The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks.  Although the Hadoop framework is implemented in Java, MapReduce applications can be written in Python, Ruby, R or C++, e.g. via Hadoop Streaming and Hadoop Pipes.
  • 36. Map Reduce core functionality (I) • Data flow beyond the two key pieces (map and reduce): • Input reader – divides input into appropriate size splits which get assigned to a Map function. • Map function – maps file data to smaller, intermediate <key, value> pairs. • Compare function – input for Reduce is pulled from the Map intermediate output and sorted according to the compare function. • Reduce function – takes intermediate values and reduces to a smaller solution handed back to the framework. • Output writer – writes file output
  • 37. How MapReduce Works User to do list:  Indicate • input/output files • M: number of map tasks • R: number of reduce tasks • W: number of machines  Write map and reduce functions  Submit the job  Input files are split into M pieces on distributed file system • Typically ~ 64 MB blocks  Intermediate files created from map tasks are written to local disk  A sorted and shuffled output is sent to reduce framework (combiner is also used in most of the cases).  Output files are written to distributed file system.
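  A hedged sketch of how the items on this to-do list map onto the Java Job API: input/output files, R via setNumReduceTasks, and an indirect handle on M by capping the input split size. W, the number of machines, is a cluster property rather than a per-job setting, and the concrete values below are illustrative only.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class JobSetupSketch {
    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "setup sketch");

      // Input/output files
      FileInputFormat.setInputPaths(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));

      // R: number of reduce tasks (illustrative value)
      job.setNumReduceTasks(4);

      // M is derived from the input splits; capping the split size raises the number of map tasks
      FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);   // ~64 MB, illustrative

      // Mapper/Reducer classes would be set here, as in the WordCount driver below
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }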
  • 39. MapReduce Examples 1. WordCount (reads a text file and counts how often each word occurs). 2. TopN (finds the top-n most used words of a text file).
  • 40. 1. WordCount Reads text files and counts how often each word occurs. The input and the output are text files. We need three classes: • WordCount.java: driver class with the main function • WordMapper.java: Mapper class with the map method • SumReducer.java: Reducer class with the reduce method
  • 42. WordCount Example (Contd.) WordMapper.java: Mapper class with the map function. For the given sample input, assuming two map nodes, the input is distributed to the maps and the first map emits: <Hello, 1> <World, 1> <Bye, 1> <World, 1> The second map emits: <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
  • 43. WordCount Example (Contd.) SumReducer.java: Reducer class with the reduce function. For the input from the two mappers, the reduce method just sums up the values, which are the occurrence counts for each key. Thus the output of the job is: <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
  • 44. WordCount (Driver)

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

  public class WordCount {
    public static void main(String[] args) throws Exception {
      if (args.length != 2) {
        System.out.println("usage: [input] [output]");
        System.exit(-1);
      }

      Job job = Job.getInstance(new Configuration());
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      job.setMapperClass(WordMapper.class);
      job.setReducerClass(SumReducer.class);
      job.setInputFormatClass(TextInputFormat.class);
      job.setOutputFormatClass(TextOutputFormat.class);
      FileInputFormat.setInputPaths(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      job.setJarByClass(WordCount.class);
      job.submit();
    }
  }
  • 45–51. The driver is then walked through step by step: check the input and output arguments; set the output (key, value) types; set the Mapper and Reducer classes; set the input/output format classes; set the input/output paths; set the driver class with setJarByClass; and finally submit the job to the master node.
  • 52. WordMapper (Mapper class)

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class WordMapper extends Mapper<Object, Text, Text, IntWritable> {
    private Text word = new Text();
    private final static IntWritable one = new IntWritable(1);

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Break line into words for processing
      StringTokenizer wordList = new StringTokenizer(value.toString());
      while (wordList.hasMoreTokens()) {
        word.set(wordList.nextToken());
        context.write(word, one);
      }
    }
  }
  • 53–57. The mapper is walked through step by step: it extends the Mapper class with its input/output key and value types; the output (key, value) types are Text and IntWritable; the input (key, value) arrives via the map signature and output goes through the Context object; each line of the input file is broken into words; and each word is emitted with a count of one.
  • 58. Shuffler/Sorter Maps emit (key, value) pairs. The shuffler/sorter of the Hadoop framework sorts the (key, value) pairs by key and then groups the values to make (key, list of values) pairs. For example, the first and second maps emit: <Hello, 1> <World, 1> <Bye, 1> <World, 1> <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1> The shuffler then produces the following, which becomes the input of the reducer: <Bye, 1>, <Goodbye, 1>, <Hadoop, <1,1>>, <Hello, <1, 1>>, <World, <1,1>>
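  Which reducer a key is shuffled to is decided by a partitioner, which by default hashes the key. Purely as an illustration, a sketch of a custom Partitioner for the WordCount (Text, IntWritable) types with a made-up routing policy; this class is not part of the original example.

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Partitioner;

  // Sends words starting with 'a'-'m' to one group of reducers and the rest to another
  // (illustrative policy only; the default HashPartitioner is usually what you want).
  public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
      if (numPartitions == 1) {
        return 0;
      }
      String s = key.toString();
      char first = s.isEmpty() ? 'z' : Character.toLowerCase(s.charAt(0));
      return (first <= 'm' ? 0 : 1) % numPartitions;
    }
  }
  // Enabled in the driver with: job.setPartitionerClass(AlphabetPartitioner.class);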
  • 59. SumReducer (Reducer class)

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable totalWordCount = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int wordCount = 0;
      Iterator<IntWritable> it = values.iterator();
      while (it.hasNext()) {
        wordCount += it.next().get();
      }
      totalWordCount.set(wordCount);
      context.write(key, totalWordCount);
    }
  }
  • 60–64. The reducer is walked through step by step: it extends the Reducer class with its input/output key and value types; the output value type is IntWritable; the input arrives as a (key, list of values) pair and output goes through the Context object; for each word the values are summed; and that total count becomes the emitted value.
  • 65. SumReducer Input (produced by the shuffler): <Bye, 1>, <Goodbye, 1>, <Hadoop, <1,1>>, <Hello, <1, 1>>, <World, <1,1>> Output: <Bye, 1>, <Goodbye, 1>, <Hadoop, 2>, <Hello, 2>, <World, 2>
  • 66. Map() and Reduce() Map(): The Mapper implementation, via the map method, processes one line at a time, as provided by the specified TextInputFormat. It then splits the line into tokens separated by whitespace, via the StringTokenizer, and emits a key-value pair of <<word>, 1>. For a sample input, the first map emits: <Deer, 1> <Beer, 1> <River, 1> and the second map emits: <Car, 1> <River, 1> <Car, 1> After local aggregation of each map's output (as a combiner would do), the output of the first map is: <Deer, 1> <Beer, 1> <River, 1> and the output of the second map is: <Car, 2> <River, 1>
  • 67. Map() and Reduce() (Continued) Reduce(): The Reducer implementation, via the reduce method, just sums up the values, which are the occurrence counts for each key (i.e. words in this example).
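  The per-map aggregation shown on slide 66 (<Car, 2> from the second map) only happens when a combiner runs on the map side. For WordCount the reducer can double as the combiner, because summing counts is associative and commutative. A one-line sketch of the change, reusing the SumReducer class from the earlier slides:

  // In the WordCount driver, right after job.setReducerClass(SumReducer.class):
  job.setCombinerClass(SumReducer.class);   // pre-aggregate counts on each map node before the shuffle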
  • 68. 2. TopN  We want to find the top-n most used words of a text file: "Flatland" by Edwin A. Abbott.  The input and the output are text files.  We need three classes:  TopN.java: driver class with the main function  TopNMapper.java: Mapper class with the map method  TopNReducer.java: Reducer class with the reduce method
  • 70. TopN (Driver)

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  import org.apache.hadoop.util.GenericOptionsParser;
  import java.io.IOException;
  import java.util.*;

  public class TopN {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
      if (otherArgs.length != 2) {
        System.err.println("Usage: TopN <in> <out>");
        System.exit(2);
      }
      Job job = Job.getInstance(conf);
      job.setJobName("Top N");
      job.setJarByClass(TopN.class);
      job.setMapperClass(TopNMapper.class);
      //job.setCombinerClass(TopNReducer.class);
      job.setReducerClass(TopNReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
      FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }
  • 71. TopNMapper

  /**
   * The mapper reads one line at a time, splits it into an array of single words and emits every
   * word to the reducers with the value of 1.
   */
  public static class TopNMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    // Punctuation and special characters to strip out, escaped so the literal is a valid regex character class
    private String tokens = "[_|$#<>\\^=\\[\\]\\*/\\\\,;,.\\-:()?!\"']";

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      String cleanLine = value.toString().toLowerCase().replaceAll(tokens, " ");
      StringTokenizer itr = new StringTokenizer(cleanLine);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken().trim());
        context.write(word, one);
      }
    }
  }
  • 72. TopNReducer

  /**
   * The reducer retrieves every word and puts it into a Map: if the word already exists in the
   * map, it increments its value, otherwise it sets it to 1.
   */
  public static class TopNReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private Map<Text, IntWritable> countMap = new HashMap<>();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // computes the number of occurrences of a single word
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      // puts the number of occurrences of this word into the map.
      // We need to create another Text object because the Text instance
      // we receive is the same for all the words
      countMap.put(new Text(key), new IntWritable(sum));
    }
  • 73.

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
      Map<Text, IntWritable> sortedMap = sortByValues(countMap);
      int counter = 0;
      for (Text key : sortedMap.keySet()) {
        if (counter++ == 20) {
          break;
        }
        context.write(key, sortedMap.get(key));
      }
    }
  }
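  The cleanup method above calls a sortByValues helper that the deck never shows. A minimal sketch of what such a helper could look like, assuming it orders the entries by descending count; this is an assumed implementation, not code from the original slides, and in the original it would presumably be a static method of the same class.

  import java.util.ArrayList;
  import java.util.Comparator;
  import java.util.LinkedHashMap;
  import java.util.List;
  import java.util.Map;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;

  public final class SortUtil {
    // Returns a copy of the map whose iteration order is by descending value (word count)
    public static Map<Text, IntWritable> sortByValues(Map<Text, IntWritable> map) {
      List<Map.Entry<Text, IntWritable>> entries = new ArrayList<>(map.entrySet());
      entries.sort(Comparator
          .comparingInt((Map.Entry<Text, IntWritable> e) -> e.getValue().get())
          .reversed());
      Map<Text, IntWritable> sorted = new LinkedHashMap<>();
      for (Map.Entry<Text, IntWritable> e : entries) {
        sorted.put(e.getKey(), e.getValue());
      }
      return sorted;
    }
  }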
  • 74. TopN Results
    The: 2286
    Of: 1634
    And: 1098
    That: 499
    You: 429
    Not: 317
    But: 279
    For: 267
    By: 317
  In the shuffle and sort phase, the partitioner will send every single word (the key) with the value "1" to the reducers. All these network transmissions can be minimized if we locally reduce the data that the mapper emits. This is achieved with a combiner.
  • 75. TopNCombiner

  /**
   * The combiner retrieves every word and emits the sum of its occurrences in this map task's output.
   */
  public static class TopNCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // computes the local number of occurrences of a single word
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
  • 76. Hadoop Output: With and Without Combiner
    Counter                  Without combiner   With combiner
    Map input records        4239               4239
    Map output records       37817              37817
    Map output bytes         359621             359621
    Input split bytes        118                116
    Combine input records    0                  37817
    Combine output records   0                  20
    Reduce input groups      4987               20
    Reduce shuffle bytes     435261             194
    Reduce input records     37817              20
    Reduce output records    20                 20
  • 77. Advantages and Disadvantages of using a Combiner Advantages -> Network transmissions are minimized. Disadvantages -> Hadoop does not guarantee the execution of a combiner: it can be executed 0, 1 or multiple times on the same input. Key-value pairs emitted from the mapper are stored in the local file system, and running the combiner can cause extensive I/O operations.
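  To reproduce the "with combiner" counters on slide 76, the commented-out line in the TopN driver would point at the combiner class rather than the reducer, whose cleanup-based top-N logic must run exactly once, on the reduce side. A sketch of that change, assuming the TopNCombiner class shown above:

  // In the TopN driver (slide 70), replace the commented-out combiner line with:
  job.setCombinerClass(TopNCombiner.class);
  // Hadoop may run the combiner zero, one or several times per map output, so it must be
  // side-effect free and must emit the same (key, value) types that the mapper emits.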