INTRODUCTION
There have been huge disasters in history, but the magnitude of the Titanic's disaster ranks as high as the depth it sank to. So much so that subsequent disasters have often been described as "titanic in proportion" – implying huge losses.
Anyone who has read about the Titanic knows that a perfect combination of natural events and human errors led to its sinking on its maiden voyage from Southampton to New York: it struck an iceberg late on April 14, 1912, and sank within hours.
Several questions have been put forward to understand the cause(s) of the tragedy – foremost among them: what made it sink? And, even more intriguing, how could a 46,000-ton ship sink to a depth of 13,000 feet in a matter of three hours? A mind-boggling question indeed!
There have been as many inquiries as there have been questions raised, and just as many types of analysis applied to arrive at conclusions. But this blog is not about analyzing why or what made the Titanic sink – it is about analyzing the data about the Titanic that is publicly available. It uses Hadoop MapReduce to arrive at:
•The average age of the people (both male and female) who died in the tragedy.
•How many people died or survived, grouped by gender and age.
The Titanic data set is publicly available and is described below under the heading Data Set Description. Using that data set we will perform some analysis and draw out insights such as the average age of the males and females who died, and a count of people for each combination of survival status, gender and age.
DATA SET DESCRIPTION
Column 1 : PassengerId
Column 2 : Survived (0 = died, 1 = survived)
Column 3 : Pclass (passenger class: 1, 2 or 3)
Column 4 : Name
Column 5 : Sex
Column 6 : Age
Column 7 : SibSp (number of siblings/spouses aboard)
Column 8 : Parch (number of parents/children aboard)
Column 9 : Ticket
Column 10 : Fare
Column 11 : Cabin
Column 12 : Embarked (port of embarkation)
You can download the data set from the link below:
Titanic Data set
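For orientation, the first rows of the commonly distributed Kaggle version of this data set look like the following (illustrative; your copy may differ):

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C

Note that in the Kaggle version the Name field is quoted and contains a comma, which would shift the column indices under the naive split(",") used in the mappers below. The clean gender keys in the outputs further down suggest that the TitanicData.txt used here has been preprocessed so that each row splits into exactly 12 fields.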
Problem statement 1:
In this problem statement we will find the average age of males and females who died in the Titanic
tragedy.
Source code
From the mapper we will emit:
•the gender as the key
•the age as the value
These values will be passed to the shuffle and sort phase and then sent to the reducer phase, where the aggregation of the values is performed.
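Conceptually, with made-up ages purely for illustration, the mapper output and the grouped input the reducer sees look like this:

Mapper output:        After shuffle & sort, reducer input:
(male, 22)            female -> [38, 26]
(female, 38)          male   -> [22, 35]
(male, 35)
(female, 26)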
Mapper code:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Average_age {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Text gender = new Text();
    private IntWritable age = new IntWritable();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      String str[] = line.split(",");
      if (str.length > 6) {
        gender.set(str[4]);                // column 5: Sex
        if (str[1].equals("0")) {          // column 2: Survived == 0, i.e. the person died
          if (str[5].matches("\\d+")) {    // column 6: Age, only if it is a whole number
            int i = Integer.parseInt(str[5]);
            age.set(i);
          }
        }
      }
      // Note: this write sits outside the checks above, so rows that fail them
      // (survivors, the header row, non-numeric ages) are still emitted with
      // whatever values 'gender' and 'age' last held -- this is why the header
      // row appears as "Sex 0" in the output further below.
      context.write(gender, age);
    }
  }
  // (the Reducer class below continues inside Average_age)
The explanation of the above Mapper code, statement by statement:
•A class named Average_age is declared; it will hold the Mapper, the Reducer and the driver.
•The Mapper default class is extended with the arguments KeyIn as LongWritable, ValueIn as Text, KeyOut as Text and ValueOut as IntWritable.
•A private Text variable gender is declared, which will store the gender of the person who died in the Titanic tragedy.
•A private IntWritable variable age is declared, which will store the age of the person. MapReduce deals with key and value pairs; here the key is set as the gender and the value as the age.
•The map method is overridden; it runs once for every line of the input.
•The line is stored in a String variable.
•The line is split using the comma "," delimiter and the values are stored in a String array, so that all the columns in a row are stored in the string array.
•A condition checks whether the string array length is greater than 6, which means the line or row has at least 7 columns; this helps avoid an ArrayIndexOutOfBoundsException in cases where the data inside the data set is incorrect.
•The gender, which is in the 5th column, is stored in gender.
•A condition checks whether the person is alive or not; only if the person is not alive do we proceed.
•The age field is checked for numeric data using a regular expression (Java's matches method); only if it is numeric do we proceed. Note that fractional ages such as 14.5 do not match "\\d+" and are therefore skipped.
•The numeric string is converted to an integer with Integer.parseInt and stored in age.
•Finally, the key and value are written into the context, which is the output of the map method.
Reducer Code:
  // Also requires: import org.apache.hadoop.mapreduce.Reducer;
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      int l = 0;                  // counts how many ages arrive for this key
      for (IntWritable val : values) {
        l += 1;
        sum += val.get();
      }
      sum = sum / l;              // integer division: the average is truncated
      context.write(key, new IntWritable(sum));
    }
  }
Explanation of the Reducer code:
•The class extends the default Reducer class with arguments KeyIn as Text and ValueIn as IntWritable, which are the same as the outputs of the mapper class, and KeyOut as Text and ValueOut as IntWritable, which will be the final outputs of our MapReduce program.
•The reduce method is overridden; it runs once for every key.
•An integer sum is declared, which will store the sum of all the ages of the people for that key.
•Another variable "l" is declared, which is incremented once for every value that arrives for the key.
•A foreach loop runs over the values inside the Iterable, which come from the shuffle and sort phase after the mapper phase, accumulating the count and the sum of the values.
•Finally the average of the obtained sum is computed (integer division, so the result is truncated; a floating-point variant is sketched below) and the key with the averaged value is written to the context.
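Because sum/l is integer division, the averages in the output below are truncated toward zero. If fractional averages are wanted, a minimal variant (our sketch, not from the original post) could emit a DoubleWritable instead; the driver would then declare DoubleWritable as the output value class:

// Hypothetical variant, not part of the original post.
// Also requires: import org.apache.hadoop.io.DoubleWritable;
public static class AvgReduce extends Reducer<Text, IntWritable, Text, DoubleWritable> {
  private final DoubleWritable avg = new DoubleWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    long count = 0;
    for (IntWritable val : values) {
      sum += val.get();
      count++;
    }
    avg.set((double) sum / count);   // floating-point average, no truncation
    context.write(key, avg);
  }
}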
Configuration Code:
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
These two configuration calls are included in the main class (the driver) to declare the output key type and the output value type of the Mapper.
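The post shows only these two calls. For completeness, a minimal driver for the first problem statement might look like the sketch below, based on the standard Hadoop 2.x MapReduce API; the job name and the use of args for the paths are our assumptions, not from the original post. This main method would sit inside the Average_age class:

// Additional imports needed at the top of the file:
// import org.apache.hadoop.conf.Configuration;
// import org.apache.hadoop.fs.Path;
// import org.apache.hadoop.mapreduce.Job;
// import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
// import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public static void main(String[] args) throws Exception {
  Job job = Job.getInstance(new Configuration(), "titanic average age"); // job name is arbitrary
  job.setJarByClass(Average_age.class);
  job.setMapperClass(Map.class);
  job.setReducerClass(Reduce.class);
  job.setMapOutputKeyClass(Text.class);              // the two calls shown above
  job.setMapOutputValueClass(IntWritable.class);
  job.setOutputKeyClass(Text.class);                 // final output types from the reducer
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /TitanicData.txt
  FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /avg_out
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}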
You can download the whole source code discussed above from the link below:
https://github.com/kiran0541/Map-Reduce/blob/master/Average%20age%20of%20male%20and
%20female%20people%20died%20in%20titanic
Way to execute the jar file to get the result of the first problem statement:
hadoop jar average.jar /TitanicData.txt /avg_out
Here 'hadoop' specifies that we are running a Hadoop command, 'jar' specifies the type of application we are running, average.jar is the jar file we created containing the above source code, TitanicData.txt is the path of the input file, and avg_out is the output directory where the result is stored.
Way to view the output:
hadoop dfs -cat /avg_out/part-r-00000
Here 'hadoop' specifies that we are running a Hadoop command, dfs specifies that we are performing an operation on the Hadoop Distributed File System, '-cat' is used to view the contents of a file, and avg_out/part-r-00000 is the file where the output is stored. The part file is created by default by the job's output format (TextOutputFormat). Note that the 'hadoop dfs' script is deprecated in favour of 'hdfs dfs', as the warning in the output below shows.
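Assuming the same output path, the non-deprecated equivalent is:
hdfs dfs -cat /avg_out/part-r-00000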
[acadgild@localhost ~]$ hadoop jar average.jar /TitanicData.txt /avg_out
15/10/23 06:25:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/10/23 06:25:40 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/10/23 06:25:43 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/10/23 06:25:44 INFO input.FileInputFormat: Total input paths to process : 1
15/10/23 06:25:44 INFO mapreduce.JobSubmitter: number of splits:1
15/10/23 06:25:45 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1445558139946_0012
15/10/23 06:25:46 INFO impl.YarnClientImpl: Submitted application application_1445558139946_0012
15/10/23 06:25:46 INFO mapreduce.Job: The url to track the job: http://localhost.localdomain:8088/proxy/application_1445558139946_0012/
15/10/23 06:25:46 INFO mapreduce.Job: Running job: job_1445558139946_0012
15/10/23 06:26:06 INFO mapreduce.Job: Job job_1445558139946_0012 running in uber mode : false
15/10/23 06:26:06 INFO mapreduce.Job: map 0% reduce 0%
15/10/23 06:26:21 INFO mapreduce.Job: map 100% reduce 0%
15/10/23 06:26:39 INFO mapreduce.Job: map 100% reduce 100%
15/10/23 06:26:40 INFO mapreduce.Job: Job job_1445558139946_0012 completed successfully
Output:
[acadgild@localhost ~]$ hadoop dfs -cat /avg_out/part-r-00000
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
15/10/23 06:27:29 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Sex 0
female 28
male 30
(The "Sex 0" line is the CSV header row, which the mapper does not skip; see the note in the Mapper code above.)
Problem statement 2:
In this problem statement we will find the number of people who died or survived, broken down by their gender and age.
Source code:
From the mapper we want to emit a composite key – the combination of the survival flag (column 2), the gender and the age – together with a final int value '1' as the value. These pairs are passed to the shuffle and sort phase and then sent to the reducer phase, where the aggregation of the values is performed.
Mapper code:
// Same imports as in the previous example (IOException, IntWritable,
// LongWritable, Text and Mapper).
public class Female_survived {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Text people = new Text();
    private final static IntWritable one = new IntWritable(1);

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      String str[] = line.split(",");
      if (str.length > 6) {
        // Composite key: survival flag (column 2), gender (column 5) and age
        // (column 6), joined with spaces.
        String survived = str[1] + " " + str[4] + " " + str[5];
        people.set(survived);
        context.write(people, one);   // emit a 1 so the reducer can count occurrences
      }
    }
  }
  // (the Reducer class below continues inside Female_survived)
Explanation of the above Mapper code:
•A class named Female_survived is declared.
•The Mapper default class is extended with the arguments KeyIn as LongWritable, ValueIn as Text, KeyOut as Text and ValueOut as IntWritable.
•A private Text variable 'people' is declared, which will store the composite key.
•A private static final IntWritable variable 'one' is declared, whose value is one and cannot be modified. MapReduce deals with key and value pairs; here the composite key is set as the key and one as the value.
•The map method is overridden; it runs once for every line of the input.
•The line is stored in a string variable 'line'.
•The line is split using the comma "," delimiter and the values are stored in a String array, so that all the columns in a row are stored in the string array.
•A condition checks whether the string array length is greater than 6, which means the line or row has at least 7 columns; this helps avoid an ArrayIndexOutOfBoundsException in cases where the data inside the data set is incorrect.
•A composite key is made by combining the necessary columns: the 2nd column, which specifies whether the person survived, the 5th column, which specifies the person's gender, and the 6th column, which specifies their age.
•Concatenating fields into a delimited string is the easiest way of forming a composite key in MapReduce, and its functionality is very similar to the formal way of making a composite key (a sketch of that approach follows this list).
•The prepared composite key is stored in people.
•Finally, the key and value are written into the context, which is the output of the map method.
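For comparison, here is a minimal sketch of the formal approach alluded to above: a custom WritableComparable composite key. This is our illustration, not part of the original post, and the class and field names are hypothetical:

// Hypothetical custom composite key (not part of the original post).
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class SurvivedSexAgeKey implements WritableComparable<SurvivedSexAgeKey> {
  private String survived = "";
  private String sex = "";
  private String age = "";

  public SurvivedSexAgeKey() {}                    // required no-arg constructor

  public void set(String survived, String sex, String age) {
    this.survived = survived;
    this.sex = sex;
    this.age = age;
  }

  @Override
  public void write(DataOutput out) throws IOException {   // serialization
    out.writeUTF(survived);
    out.writeUTF(sex);
    out.writeUTF(age);
  }

  @Override
  public void readFields(DataInput in) throws IOException { // deserialization, same order
    survived = in.readUTF();
    sex = in.readUTF();
    age = in.readUTF();
  }

  @Override
  public int compareTo(SurvivedSexAgeKey o) {      // defines the sort order of keys
    int c = survived.compareTo(o.survived);
    if (c != 0) return c;
    c = sex.compareTo(o.sex);
    if (c != 0) return c;
    return age.compareTo(o.age);
  }

  @Override
  public int hashCode() {                          // used by the default HashPartitioner
    return (survived + " " + sex + " " + age).hashCode();
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof SurvivedSexAgeKey && compareTo((SurvivedSexAgeKey) o) == 0;
  }

  @Override
  public String toString() {                       // how the key is rendered in the output
    return survived + " " + sex + " " + age;
  }
}

The string-concatenation approach used in this post avoids all of this ceremony, at the cost of losing typed fields and custom sort control.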
Reducer code:
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();         // each value is a 1, so this counts occurrences
      }
      context.write(key, new IntWritable(sum));
    }
  }
Coming to the Reducer code:
•The class extends the default Reducer class with arguments KeyIn as Text and ValueIn as IntWritable, which are the same as the outputs of the mapper class, and KeyOut as Text and ValueOut as IntWritable, which will be the final outputs of our MapReduce program.
•The reduce method is overridden; it runs once for every key.
•An integer variable sum is declared, which will store the sum of all the values for each key.
•A foreach loop runs over the values inside the Iterable, which come from the shuffle and sort phase after the mapper phase.
•The sum of the values is accumulated; since every value is 1, the sum is the count of people sharing that composite key.
•Finally, the key and the obtained sum are written to the context.
In the above example, the use of a composite key made a complicated program simple.
Conf code
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
These two configuration calls are included in the main class (the driver) to declare the output key type and the output value type of the Mapper.
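One optional refinement (our suggestion, not from the original post): because this reducer merely sums counts, an operation that is commutative and associative, the same class can also be registered as a combiner in the driver to pre-aggregate on the map side and shrink the shuffle:

job.setCombinerClass(Reduce.class);   // safe here because summing 1s is associative and commutative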
You can download the whole source code from the link below:
https://github.com/kiran0541/Map-Reduce/blob/master/Number%20of%20people%20died
%20or%20survived%20in%20each%20class%20their%20genders%20and%20their%20ages
%20in%20tatnic
Way to execute the jar file to get the result of the second problem statement:
hadoop jar people_survived.jar /TitanicData.txt /people_out
Here 'hadoop' specifies that we are running a Hadoop command, 'jar' specifies the type of application we are running, people_survived.jar is the jar file we created containing the above source code, TitanicData.txt is the path of the input file, and people_out is the output directory where the result is stored.
Way to view the output
hadoop dfs -cat /people_out/part-r-00000
Here 'hadoop' specifies that we are running a Hadoop command, dfs specifies that we are performing an operation on the Hadoop Distributed File System, '-cat' is used to view the contents of a file, and people_out/part-r-00000 is the file where the output is stored. The part file is created by default by the job's output format (TextOutputFormat).
[acadgild@localhost ~]$ hadoop jar people_survived.jar /TitanicData.txt /people_out
15/10/23 03:11:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/10/23 03:11:26 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/10/23 03:11:28 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/10/23 03:11:28 INFO input.FileInputFormat: Total input paths to process : 1
15/10/23 03:11:28 INFO mapreduce.JobSubmitter: number of splits:1
15/10/23 03:11:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1445558139946_0005
15/10/23 03:11:29 INFO impl.YarnClientImpl: Submitted application application_1445558139946_0005
15/10/23 03:11:30 INFO mapreduce.Job: The url to track the job: http://localhost.localdomain:8088/proxy/application_1445558139946_0005/
15/10/23 03:11:30 INFO mapreduce.Job: Running job: job_1445558139946_0005
Output:
[acadgild@localhost ~]$ hadoop dfs -cat /people_out/part-r-00000
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
15/10/23 03:08:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
0 female 10 1
0 female 11 1
0 female 14 1
0 female 14.5 1
0 female 16 1
0 female 17 1
0 female 18 5
0 female 2 4
0 female 20 2
0 female 21 3
0 female 22 2
0 female 23 1
0 female 24 2
0 female 25 3
0 female 26 2
0 female 27 1
We hope this blog helped you understand MapReduce. Keep visiting our site for more updates on Big Data and other technologies.