SlideShare a Scribd company logo
1 of 28
Download to read offline
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

MapReduce and Hadoop MapReduce.

Hadoop/MapReduce (1)
Eslam Montaser Roushdi
Facultad de Inform´tica
a
Universidad Complutense de Madrid
Grupo G-Tec UCM
www.tecnologiaUCM.es

February, 2014

How to link R and Hadoop?
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

MapReduce and Hadoop MapReduce.

How to link R and Hadoop?

Our aim
Study how to handle the Big Data by using Hadoop/MapReduce, in order
to analysis this data by using RHadoop.
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

Outlines:

1

What is Big Data?

2

Introduction to Hadoop.

3

Hadoop Distributed File System.

4

MapReduce and Hadoop MapReduce.

5

How to link R and Hadoop?

MapReduce and Hadoop MapReduce.

How to link R and Hadoop?
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

MapReduce and Hadoop MapReduce.

How to link R and Hadoop?

1) What is Big Data?
Large datasets
Big Data has to deal with large and complex datasets that can be
structured, semi-structured, or unstructured.
Five exabytes of data are generated every days!! ”Eric Schmidt”.
How to explore, analyze such large datasets?
Facebook
> 800 million active users in Facebook.
interacting with > 1 billion objects.
2.5 petabytes of userdata per day!
What is Big Data?

Introduction to Hadoop.

1) What is Big Data?

Hadoop Distributed File System.

MapReduce and Hadoop MapReduce.

How to link R and Hadoop?
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

MapReduce and Hadoop MapReduce.

How to link R and Hadoop?

2) Introduction to Hadoop.
What is Hadoop?
Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
Apache Hadoop is an open source Java framework for processing and
querying vast amounts of data on large clusters of commodity hardware.
Supports applications under a free license.
Operates on unstructured and structured data.
A large and active ecosystem.
Handles thousands of nodes and petabytes of data.
Hadoop enables scalable, cost-effective, flexible, fault-tolerant solutions.
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

2) Introduction to Hadoop.
Studying Hadoop components
Mahout.
Pig.
Hive.
HBase.
Sqoop.
ZooKeper.
Ambari.

MapReduce and Hadoop MapReduce.

How to link R and Hadoop?
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

MapReduce and Hadoop MapReduce.

How to link R and Hadoop?

2) Introduction to Hadoop.
Exploring Hadoop features
Hadoop Common: common utilities package.
HDFS: Hadoop Distributed File System with high throughput access to
application data.
MapReduce: A software framework for distributed processing of large
data sets on computer clusters.
Hadoop YARN: a resource-management platform responsible for
managing compute resources in clusters and using them for scheduling the
applications of users.
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

MapReduce and Hadoop MapReduce.

How to link R and Hadoop?

3) Hadoop Distributed File System.

Hadoop Distributed File System (HFDS)
Is a distributed, scalable, and portable file-system written in Java for the
Hadoop framework.
The characteristics of HDFS:
Fault tolerant.
Runs with commodity hardware.
Able to handle large datasets.
Master slave paradigm.
Write once file access only.
HDFS components
NameNode.
DataNode.
Secondary NameNode.
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

3) Hadoop Distributed File System.
HDFS architecture

MapReduce and Hadoop MapReduce.

How to link R and Hadoop?
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

MapReduce and Hadoop MapReduce.

How to link R and Hadoop?

4) MapReduce and Hadoop MapReduce.

MapReduce
Is derived from Google MapReduce.
Is a programming model for processing large datasets distributed on a
large cluster.
MapReduce is the heart of Hadoop.
MapReduce paradigm
Map phase: Once divided, datasets are assigned to the task tracker to
perform the Map phase. The data functional operation will be performed
over the data, emitting the mapped key and value pairs as the output of
the Map phase, (i.e data processing).
Reduce phase: The master node then collects the answers to all the
subproblems and combines them in some way to form the output; the
answer to the problem it was originally trying to solve, (i.e data collection
and digesting).
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

MapReduce and Hadoop MapReduce.

How to link R and Hadoop?

4) MapReduce and Hadoop MapReduce.
MapReduce Goals
1

Scalabilityto large data volumes:
Scan 100 TB on 1 node @ 50 MB/s = 24 days.
Scan on 1000-node cluster = 35 minutes.

2

Cost-efficiency:
Commodity nodes (cheap, but unreliable).
Commodity network.
Automatic fault-tolerance (fewer admins).
Easy to use (fewer programmers).

Challenges
1

Cheap nodes fail, especially if you have many:
Mean time between failures for 1 node = 3 years.
MTBF for 1000 nodes = 1 day.

Solution: Build fault-tolerance into system.
2

Programming distributed systems is hard:
Solution: Users write data-parallel ”map” and ”reduce” functions, system
handles work distribution and faults.
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

4) MapReduce and Hadoop MapReduce.
MapReduce components
JobTracker.
TaskTracker.

MapReduce and Hadoop MapReduce.

How to link R and Hadoop?
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

MapReduce and Hadoop MapReduce.

How to link R and Hadoop?

4) MapReduce and Hadoop MapReduce.
The five common steps of parallel computing
1

Preparing the Map() input: This will take the input data row wise and
emit key value pairs per rows, or we can explicitly change as per the
requirement.
Map input: list (k1, v1).

2

Run the user-provided Map() code.
Map output: list (k2, v2).

3

Shuffle the Map output to the Reduce processors. Also, shuffle the similar
keys (grouping them) and input them to the same reducer.

4

Run the user-provided Reduce() code: This phase will run the custom
reducer code designed by developer to run on shuffled data and emit key
and value.
Reduce input: (k2, list(v2)).
Reduce output: (k3, v3).

5

Produce the final output: Finally, the master node collects all reducer
output and combines and writes them in a text file.
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

4) MapReduce and Hadoop MapReduce.

MapReduce and Hadoop MapReduce.

How to link R and Hadoop?
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

MapReduce and Hadoop MapReduce.

How to link R and Hadoop?

4) MapReduce and Hadoop MapReduce.

Hadoop MapReduce
Is a popular Java framework for easily written applications. It processes vast
amounts of data (multiterabyte datasets) in parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable and fault-tolerant
manner.
Listing Hadoop MapReduce entities
Client: This initializes the job.
JobTracker: This monitors the job.
TaskTracker: This executes the job.
HDFS: This stores the input and output data.
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

4) MapReduce and Hadoop MapReduce.
Hadoop MapReduce scenario
The loading of data into HDFS.
The execution of the Map phase.
Shuffling and sorting.
The execution of the Reduce phase.

MapReduce and Hadoop MapReduce.

How to link R and Hadoop?
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

MapReduce and Hadoop MapReduce.

How to link R and Hadoop?

4) MapReduce and Hadoop MapReduce.

Understanding the Hadoop MapReduce fundamentals
MapReduce objects.
i) Mapper. ii) Reducer. iii) Driver.
MapReduce dataflow.
1

Preloading data in HDFS.

2

Running MapReduce by calling Driver.

3

Reading of input data by the Mappers, which results in the splitting of the
data execution of the Mapper custom logic and the generation of
intermediate key-value pairs.

4

Executing Combiner and the shuffle phase to optimize the overall Hadoop
MapReduce process.

5

Sorting and providing of intermediate key-value pairs to the Reduce phase.
The Reduce phase is then executed. Reducers take these partitioned
key-value pairs and aggregate them based on Reducer logic.

6

The final output data is stored at HDFS.
What is Big Data?

Introduction to Hadoop.

5) Hadoop MapReduce.

Example Word Count

Hadoop Distributed File System.

MapReduce and Hadoop MapReduce.

How to link R and Hadoop?
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

MapReduce and Hadoop MapReduce.

4) MapReduce and Hadoop MapReduce.
Example Word Count (1).
Map:
public static class MapClass extends MapReduceBase
implements Mapper {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(WritableComparable key, Writable value,
OutputCollector output, Reporter reporter)
throws IOException {
String line = ((Text)value).toString();
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, one);
}
}
}

How to link R and Hadoop?
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

MapReduce and Hadoop MapReduce.

How to link R and Hadoop?

4) MapReduce and Hadoop MapReduce.

Example Word Count (2)
Reduce:
public static class Reduce extends MapReduceBase implements Reducer {
public void reduce(WritableComparable key, Iterator
values, OutputCollector output, Reporter reporter)
throws IOException {
int sum = 0;
while (values.hasNext()) {
sum + = ((IntWritable) values.next()).get();
}
output.collect(key, new IntWritable(sum));
}
}
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

MapReduce and Hadoop MapReduce.

4) MapReduce and Hadoop MapReduce.

Example Word Count (3)
WordCount:
public static void main(String[ ]args) throws IOException {
//checking goes here
JobConf conf = new JobConf();
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(MapClass.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputPath(new Path(args[0]));
conf.setOutputPath(new Path(args[1]));
JobClient.runJob(conf);
}

How to link R and Hadoop?
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

MapReduce and Hadoop MapReduce.

How to link R and Hadoop?

4) MapReduce and Hadoop MapReduce.

What is MapReduce used for?
1

At Google
Index construction for Google Search (replaced in 2010 by Caffeine).
Article clustering for Google News.
Statistical machine translation.

2

At Yahoo!
”Web map” powering Yahoo! Search.
Spam detection for Yahoo! Mail.

3

At Facebook
Data mining, Web log processing.
SearchBox (with Cassandra).
Facebook Messages (with HBase).
Spam detection and Ad optimization.

4

In research
Bioinformatics (Maryland).
Particle physics (Nebraska).
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

MapReduce and Hadoop MapReduce.

How to link R and Hadoop?

5) How to link R and Hadoop?

Learning RHadoop
RHadoop is a great open source software framework of R for performing data
analytics with the Hadoop platform via R functions.

RHadoop with R packages
These three different R packages have been designed on Hadoop’s two main
features HDFS and MapReduce:
rhdfs: This is an R package for providing all Hadoop HDFS access to R. All
distributed files can be managed with R functions.
rmr: This is an R package for providing Hadoop MapReduce interfaces to R.
With the help of this package, the Mapper and Reducer can easily be
developed.
rhbase: This is an R package for handling data at HBase distributed database
through R.
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

MapReduce and Hadoop MapReduce.

How to link R and Hadoop?

5) How to link R and Hadoop?

Learning RHIPE
R and Hadoop Integrated Programming Environment (RHIPE) is a
free and open source project. RHIPE is widely used for performing Big
Data analysis with D & R analysis.
D & R analysis is used to divide huge data, process it in parallel on a
distributed network to produce intermediate output, and finally recombine
all this intermediate output into a set.
RHIPE is designed to carry out D & R analysis on complex Big Data in R
on the Hadoop platform.
RHadoop
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

MapReduce and Hadoop MapReduce.

How to link R and Hadoop?

5) How to link R and Hadoop?
Learning Hadoop streaming
Hadoop streaming is a utility that comes with the Hadoop distribution.
This utility allows you to create and run MapReduce jobs with any
executable or script as the Mapper and/or Reducer. This is supported by
R, Python, Ruby, Bash, Perl, and so on. We will use the R language with
a bash script.
There is one R package named HadoopStreaming that has been
developed for performing data analysis on Hadoop clusters with the help
of R scripts, which is an interface to Hadoop streaming with R.
Finally
Three ways to link R and Hadoop are as follows:
RHIPE.
RHadoop.
Hadoop streaming.
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

MapReduce and Hadoop MapReduce.

Thanks!!

How to link R and Hadoop?
What is Big Data?

Introduction to Hadoop.

Hadoop Distributed File System.

MapReduce and Hadoop MapReduce.

How to link R and Hadoop?

References

Glen Mules, Warren Pettit, Introduction to MapReduce Programming,
September 2013.
John Maindonald, W. John Braun, Introduction and Hadoop Overview,
Lab Course: Databases & Cloud Computing, University of Freiburg, 2012.
Vignesh Prajapati, Big Data Analytics with R and Hadoop, November
2013.
Tom White, Hadoop: The Definitive Guide, Second Edition, October 2010.

More Related Content

What's hot

Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoopVarun Narang
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop TechnologyManish Borkar
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
YARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformYARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformBikas Saha
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reducePaladion Networks
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
Introduction to Hadoop part 2
Introduction to Hadoop part 2Introduction to Hadoop part 2
Introduction to Hadoop part 2Giovanna Roda
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce FrameworkEdureka!
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoopjeffturner
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud ComputingFarzad Nozarian
 

What's hot (20)

Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 
Hadoop
HadoopHadoop
Hadoop
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Big Data and Hadoop - An Introduction
Big Data and Hadoop - An IntroductionBig Data and Hadoop - An Introduction
Big Data and Hadoop - An Introduction
 
YARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformYARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute Platform
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reduce
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
002 Introduction to hadoop v3
002   Introduction to hadoop v3002   Introduction to hadoop v3
002 Introduction to hadoop v3
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
Introduction to Hadoop part 2
Introduction to Hadoop part 2Introduction to Hadoop part 2
Introduction to Hadoop part 2
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoop
 
SQLBits XI - ETL with Hadoop
SQLBits XI - ETL with HadoopSQLBits XI - ETL with Hadoop
SQLBits XI - ETL with Hadoop
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
 

Viewers also liked

Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Revolution Analytics
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File Systemelliando dias
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHanborq Inc.
 
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)Jeffrey Breen
 
2.introduction to hdfs
2.introduction to hdfs2.introduction to hdfs
2.introduction to hdfsdatabloginfo
 
Hadoop distributed file system rev3
Hadoop distributed file system rev3Hadoop distributed file system rev3
Hadoop distributed file system rev3Sung-jae Park
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemRutvik Bapat
 
Hadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapaHadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapakapa rohit
 
How we (Almost) Forgot Lambda Architecture and used Elasticsearch
How we (Almost) Forgot Lambda Architecture and used ElasticsearchHow we (Almost) Forgot Lambda Architecture and used Elasticsearch
How we (Almost) Forgot Lambda Architecture and used ElasticsearchMichael Stockerl
 
epoll() - The I/O Hero
epoll() - The I/O Heroepoll() - The I/O Hero
epoll() - The I/O HeroMohsin Hijazee
 
HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015Avinash Ramineni
 
Big Data Analysis With RHadoop
Big Data Analysis With RHadoopBig Data Analysis With RHadoop
Big Data Analysis With RHadoopDavid Chiu
 
Big Data Step-by-Step: Infrastructure 2/3: Running R and RStudio on EC2
Big Data Step-by-Step: Infrastructure 2/3: Running R and RStudio on EC2Big Data Step-by-Step: Infrastructure 2/3: Running R and RStudio on EC2
Big Data Step-by-Step: Infrastructure 2/3: Running R and RStudio on EC2Jeffrey Breen
 
NoSQL databases - An introduction
NoSQL databases - An introductionNoSQL databases - An introduction
NoSQL databases - An introductionPooyan Mehrparvar
 
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....Jeffrey Breen
 
Epoll - from the kernel side
Epoll -  from the kernel sideEpoll -  from the kernel side
Epoll - from the kernel sidellj098
 
Why R? A Brief Introduction to the Open Source Statistics Platform
Why R? A Brief Introduction to the Open Source Statistics PlatformWhy R? A Brief Introduction to the Open Source Statistics Platform
Why R? A Brief Introduction to the Open Source Statistics PlatformSyracuse University
 
R server and spark
R server and sparkR server and spark
R server and sparkBAINIDA
 

Viewers also liked (20)

Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed Introduction
 
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
 
Hadoop hdfs
Hadoop hdfsHadoop hdfs
Hadoop hdfs
 
2.introduction to hdfs
2.introduction to hdfs2.introduction to hdfs
2.introduction to hdfs
 
Hadoop distributed file system rev3
Hadoop distributed file system rev3Hadoop distributed file system rev3
Hadoop distributed file system rev3
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Hadoop HDFS Concepts
Hadoop HDFS ConceptsHadoop HDFS Concepts
Hadoop HDFS Concepts
 
Hadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapaHadoop HDFS by rohitkapa
Hadoop HDFS by rohitkapa
 
How we (Almost) Forgot Lambda Architecture and used Elasticsearch
How we (Almost) Forgot Lambda Architecture and used ElasticsearchHow we (Almost) Forgot Lambda Architecture and used Elasticsearch
How we (Almost) Forgot Lambda Architecture and used Elasticsearch
 
epoll() - The I/O Hero
epoll() - The I/O Heroepoll() - The I/O Hero
epoll() - The I/O Hero
 
HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015HBase from the Trenches - Phoenix Data Conference 2015
HBase from the Trenches - Phoenix Data Conference 2015
 
Big Data Analysis With RHadoop
Big Data Analysis With RHadoopBig Data Analysis With RHadoop
Big Data Analysis With RHadoop
 
Big Data Step-by-Step: Infrastructure 2/3: Running R and RStudio on EC2
Big Data Step-by-Step: Infrastructure 2/3: Running R and RStudio on EC2Big Data Step-by-Step: Infrastructure 2/3: Running R and RStudio on EC2
Big Data Step-by-Step: Infrastructure 2/3: Running R and RStudio on EC2
 
NoSQL databases - An introduction
NoSQL databases - An introductionNoSQL databases - An introduction
NoSQL databases - An introduction
 
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
Big Data Step-by-Step: Infrastructure 3/3: Taking it to the cloud... easily.....
 
Epoll - from the kernel side
Epoll -  from the kernel sideEpoll -  from the kernel side
Epoll - from the kernel side
 
Why R? A Brief Introduction to the Open Source Statistics Platform
Why R? A Brief Introduction to the Open Source Statistics PlatformWhy R? A Brief Introduction to the Open Source Statistics Platform
Why R? A Brief Introduction to the Open Source Statistics Platform
 
R server and spark
R server and sparkR server and spark
R server and spark
 

Similar to Hadoop, MapReduce and R = RHadoop

Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopGERARDO BARBERENA
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methodspaperpublications3
 
Intro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and MapreduceIntro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and MapreduceKrishna Sangeeth KS
 
62_Tazeen_Sayed_Hadoop_Ecosystem.pptx
62_Tazeen_Sayed_Hadoop_Ecosystem.pptx62_Tazeen_Sayed_Hadoop_Ecosystem.pptx
62_Tazeen_Sayed_Hadoop_Ecosystem.pptxTazeenSayed3
 
Understanding hadoop
Understanding hadoopUnderstanding hadoop
Understanding hadoopRexRamos9
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415SANTOSH WAYAL
 
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node CombinersHadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node Combinersijcsit
 
writing Hadoop Map Reduce programs
writing Hadoop Map Reduce programswriting Hadoop Map Reduce programs
writing Hadoop Map Reduce programsjani shaik
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map ReduceUrvashi Kataria
 
Leveraging Map Reduce With Hadoop for Weather Data Analytics
Leveraging Map Reduce With Hadoop for Weather Data Analytics Leveraging Map Reduce With Hadoop for Weather Data Analytics
Leveraging Map Reduce With Hadoop for Weather Data Analytics iosrjce
 

Similar to Hadoop, MapReduce and R = RHadoop (20)

Big Data & Hadoop. Simone Leo (CRS4)
Big Data & Hadoop. Simone Leo (CRS4)Big Data & Hadoop. Simone Leo (CRS4)
Big Data & Hadoop. Simone Leo (CRS4)
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methods
 
Intro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and MapreduceIntro to BigData , Hadoop and Mapreduce
Intro to BigData , Hadoop and Mapreduce
 
62_Tazeen_Sayed_Hadoop_Ecosystem.pptx
62_Tazeen_Sayed_Hadoop_Ecosystem.pptx62_Tazeen_Sayed_Hadoop_Ecosystem.pptx
62_Tazeen_Sayed_Hadoop_Ecosystem.pptx
 
Understanding hadoop
Understanding hadoopUnderstanding hadoop
Understanding hadoop
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 
Cppt
CpptCppt
Cppt
 
Cppt
CpptCppt
Cppt
 
hadoop
hadoophadoop
hadoop
 
hadoop
hadoophadoop
hadoop
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415
 
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node CombinersHadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
 
writing Hadoop Map Reduce programs
writing Hadoop Map Reduce programswriting Hadoop Map Reduce programs
writing Hadoop Map Reduce programs
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
Hadoop ppt2
Hadoop ppt2Hadoop ppt2
Hadoop ppt2
 
B017320612
B017320612B017320612
B017320612
 
Leveraging Map Reduce With Hadoop for Weather Data Analytics
Leveraging Map Reduce With Hadoop for Weather Data Analytics Leveraging Map Reduce With Hadoop for Weather Data Analytics
Leveraging Map Reduce With Hadoop for Weather Data Analytics
 

More from Victoria López

Alan turing uva-presentationdec-2019
Alan turing uva-presentationdec-2019Alan turing uva-presentationdec-2019
Alan turing uva-presentationdec-2019Victoria López
 
Seminar UvA 2018- socialbigdata
Seminar UvA  2018- socialbigdataSeminar UvA  2018- socialbigdata
Seminar UvA 2018- socialbigdataVictoria López
 
BIG DATA EN CIENCIAS DE LA SALUD Y CIENCIAS SOCIALES
BIG DATA EN CIENCIAS DE LA SALUD Y CIENCIAS SOCIALESBIG DATA EN CIENCIAS DE LA SALUD Y CIENCIAS SOCIALES
BIG DATA EN CIENCIAS DE LA SALUD Y CIENCIAS SOCIALESVictoria López
 
ICCES'2016 BIG DATA IN HEALTHCARE AND SOCIAL SCIENCES
ICCES'2016  BIG DATA IN HEALTHCARE AND SOCIAL SCIENCESICCES'2016  BIG DATA IN HEALTHCARE AND SOCIAL SCIENCES
ICCES'2016 BIG DATA IN HEALTHCARE AND SOCIAL SCIENCESVictoria López
 
Presentación Gupo G-TeC en Social Big Data
Presentación Gupo G-TeC en Social Big DataPresentación Gupo G-TeC en Social Big Data
Presentación Gupo G-TeC en Social Big DataVictoria López
 
Big data systems and analytics
Big data systems and analyticsBig data systems and analytics
Big data systems and analyticsVictoria López
 
Big Data. Complejidad,algoritmos y su procesamiento
Big Data. Complejidad,algoritmos y su procesamientoBig Data. Complejidad,algoritmos y su procesamiento
Big Data. Complejidad,algoritmos y su procesamientoVictoria López
 
APLICACIÓN DE TÉCNICAS DE OPTIMIZACIÓN Y BIG DATA AL PROBLEMA DE BÚSQUEDA...
APLICACIÓN DE TÉCNICAS DE OPTIMIZACIÓN Y BIG DATA AL PROBLEMA DE BÚSQUEDA...APLICACIÓN DE TÉCNICAS DE OPTIMIZACIÓN Y BIG DATA AL PROBLEMA DE BÚSQUEDA...
APLICACIÓN DE TÉCNICAS DE OPTIMIZACIÓN Y BIG DATA AL PROBLEMA DE BÚSQUEDA...Victoria López
 
G te c sesion1a-bioinformatica y big data
G te c sesion1a-bioinformatica y big dataG te c sesion1a-bioinformatica y big data
G te c sesion1a-bioinformatica y big dataVictoria López
 
G te c sesion1b-casos de uso
G te c sesion1b-casos de usoG te c sesion1b-casos de uso
G te c sesion1b-casos de usoVictoria López
 
G te c sesion2a-data collection
G te c sesion2a-data collectionG te c sesion2a-data collection
G te c sesion2a-data collectionVictoria López
 
G tec sesion2b-host-cloud y cloudcomputing
G tec sesion2b-host-cloud y cloudcomputingG tec sesion2b-host-cloud y cloudcomputing
G tec sesion2b-host-cloud y cloudcomputingVictoria López
 
G te c sesion3a-bases de datos modernas
G te c sesion3a-bases de datos modernasG te c sesion3a-bases de datos modernas
G te c sesion3a-bases de datos modernasVictoria López
 
G te c sesion3b- mapreduce
G te c sesion3b- mapreduceG te c sesion3b- mapreduce
G te c sesion3b- mapreduceVictoria López
 
G te c sesion4a-bigdatasystemsanalytics
G te c sesion4a-bigdatasystemsanalyticsG te c sesion4a-bigdatasystemsanalytics
G te c sesion4a-bigdatasystemsanalyticsVictoria López
 
G te c sesion4b-complejidad y tpa
G te c sesion4b-complejidad y tpaG te c sesion4b-complejidad y tpa
G te c sesion4b-complejidad y tpaVictoria López
 
Open Data para Smartcity-Facultad de Estudios Estadísticos
Open Data para Smartcity-Facultad de Estudios EstadísticosOpen Data para Smartcity-Facultad de Estudios Estadísticos
Open Data para Smartcity-Facultad de Estudios EstadísticosVictoria López
 
Deep Learning + R by Gabriel Valverde
Deep Learning + R by Gabriel ValverdeDeep Learning + R by Gabriel Valverde
Deep Learning + R by Gabriel ValverdeVictoria López
 
Fortune Time Institute: Big Data - Challenges for Smartcity
Fortune Time Institute: Big Data - Challenges for SmartcityFortune Time Institute: Big Data - Challenges for Smartcity
Fortune Time Institute: Big Data - Challenges for SmartcityVictoria López
 

More from Victoria López (20)

Alan turing uva-presentationdec-2019
Alan turing uva-presentationdec-2019Alan turing uva-presentationdec-2019
Alan turing uva-presentationdec-2019
 
Seminar UvA 2018- socialbigdata
Seminar UvA  2018- socialbigdataSeminar UvA  2018- socialbigdata
Seminar UvA 2018- socialbigdata
 
Jornada leiden short
Jornada leiden shortJornada leiden short
Jornada leiden short
 
BIG DATA EN CIENCIAS DE LA SALUD Y CIENCIAS SOCIALES
BIG DATA EN CIENCIAS DE LA SALUD Y CIENCIAS SOCIALESBIG DATA EN CIENCIAS DE LA SALUD Y CIENCIAS SOCIALES
BIG DATA EN CIENCIAS DE LA SALUD Y CIENCIAS SOCIALES
 
ICCES'2016 BIG DATA IN HEALTHCARE AND SOCIAL SCIENCES
ICCES'2016  BIG DATA IN HEALTHCARE AND SOCIAL SCIENCESICCES'2016  BIG DATA IN HEALTHCARE AND SOCIAL SCIENCES
ICCES'2016 BIG DATA IN HEALTHCARE AND SOCIAL SCIENCES
 
Presentación Gupo G-TeC en Social Big Data
Presentación Gupo G-TeC en Social Big DataPresentación Gupo G-TeC en Social Big Data
Presentación Gupo G-TeC en Social Big Data
 
Big data systems and analytics
Big data systems and analyticsBig data systems and analytics
Big data systems and analytics
 
Big Data. Complejidad,algoritmos y su procesamiento
Big Data. Complejidad,algoritmos y su procesamientoBig Data. Complejidad,algoritmos y su procesamiento
Big Data. Complejidad,algoritmos y su procesamiento
 
APLICACIÓN DE TÉCNICAS DE OPTIMIZACIÓN Y BIG DATA AL PROBLEMA DE BÚSQUEDA...
APLICACIÓN DE TÉCNICAS DE OPTIMIZACIÓN Y BIG DATA AL PROBLEMA DE BÚSQUEDA...APLICACIÓN DE TÉCNICAS DE OPTIMIZACIÓN Y BIG DATA AL PROBLEMA DE BÚSQUEDA...
APLICACIÓN DE TÉCNICAS DE OPTIMIZACIÓN Y BIG DATA AL PROBLEMA DE BÚSQUEDA...
 
G te c sesion1a-bioinformatica y big data
G te c sesion1a-bioinformatica y big dataG te c sesion1a-bioinformatica y big data
G te c sesion1a-bioinformatica y big data
 
G te c sesion1b-casos de uso
G te c sesion1b-casos de usoG te c sesion1b-casos de uso
G te c sesion1b-casos de uso
 
G te c sesion2a-data collection
G te c sesion2a-data collectionG te c sesion2a-data collection
G te c sesion2a-data collection
 
G tec sesion2b-host-cloud y cloudcomputing
G tec sesion2b-host-cloud y cloudcomputingG tec sesion2b-host-cloud y cloudcomputing
G tec sesion2b-host-cloud y cloudcomputing
 
G te c sesion3a-bases de datos modernas
G te c sesion3a-bases de datos modernasG te c sesion3a-bases de datos modernas
G te c sesion3a-bases de datos modernas
 
G te c sesion3b- mapreduce
G te c sesion3b- mapreduceG te c sesion3b- mapreduce
G te c sesion3b- mapreduce
 
G te c sesion4a-bigdatasystemsanalytics
G te c sesion4a-bigdatasystemsanalyticsG te c sesion4a-bigdatasystemsanalytics
G te c sesion4a-bigdatasystemsanalytics
 
G te c sesion4b-complejidad y tpa
G te c sesion4b-complejidad y tpaG te c sesion4b-complejidad y tpa
G te c sesion4b-complejidad y tpa
 
Open Data para Smartcity-Facultad de Estudios Estadísticos
Open Data para Smartcity-Facultad de Estudios EstadísticosOpen Data para Smartcity-Facultad de Estudios Estadísticos
Open Data para Smartcity-Facultad de Estudios Estadísticos
 
Deep Learning + R by Gabriel Valverde
Deep Learning + R by Gabriel ValverdeDeep Learning + R by Gabriel Valverde
Deep Learning + R by Gabriel Valverde
 
Fortune Time Institute: Big Data - Challenges for Smartcity
Fortune Time Institute: Big Data - Challenges for SmartcityFortune Time Institute: Big Data - Challenges for Smartcity
Fortune Time Institute: Big Data - Challenges for Smartcity
 

Recently uploaded

How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 

Recently uploaded (20)

How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 

Hadoop, MapReduce and R = RHadoop

  • 1. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. MapReduce and Hadoop MapReduce. Hadoop/MapReduce (1) Eslam Montaser Roushdi Facultad de Inform´tica a Universidad Complutense de Madrid Grupo G-Tec UCM www.tecnologiaUCM.es February, 2014 How to link R and Hadoop?
  • 2. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. MapReduce and Hadoop MapReduce. How to link R and Hadoop? Our aim Study how to handle the Big Data by using Hadoop/MapReduce, in order to analysis this data by using RHadoop.
  • 3. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. Outlines: 1 What is Big Data? 2 Introduction to Hadoop. 3 Hadoop Distributed File System. 4 MapReduce and Hadoop MapReduce. 5 How to link R and Hadoop? MapReduce and Hadoop MapReduce. How to link R and Hadoop?
  • 4. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. MapReduce and Hadoop MapReduce. How to link R and Hadoop? 1) What is Big Data? Large datasets Big Data has to deal with large and complex datasets that can be structured, semi-structured, or unstructured. Five exabytes of data are generated every days!! ”Eric Schmidt”. How to explore, analyze such large datasets? Facebook > 800 million active users in Facebook. interacting with > 1 billion objects. 2.5 petabytes of userdata per day!
  • 5. What is Big Data? Introduction to Hadoop. 1) What is Big Data? Hadoop Distributed File System. MapReduce and Hadoop MapReduce. How to link R and Hadoop?
  • 6. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. MapReduce and Hadoop MapReduce. How to link R and Hadoop? 2) Introduction to Hadoop. What is Hadoop? Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Apache Hadoop is an open source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware. Supports applications under a free license. Operates on unstructured and structured data. A large and active ecosystem. Handles thousands of nodes and petabytes of data. Hadoop enables scalable, cost-effective, flexible, fault-tolerant solutions.
  • 7. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. 2) Introduction to Hadoop. Studying Hadoop components Mahout. Pig. Hive. HBase. Sqoop. ZooKeper. Ambari. MapReduce and Hadoop MapReduce. How to link R and Hadoop?
  • 8. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. MapReduce and Hadoop MapReduce. How to link R and Hadoop? 2) Introduction to Hadoop. Exploring Hadoop features Hadoop Common: common utilities package. HDFS: Hadoop Distributed File System with high throughput access to application data. MapReduce: A software framework for distributed processing of large data sets on computer clusters. Hadoop YARN: a resource-management platform responsible for managing compute resources in clusters and using them for scheduling the applications of users.
  • 9. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. MapReduce and Hadoop MapReduce. How to link R and Hadoop? 3) Hadoop Distributed File System. Hadoop Distributed File System (HFDS) Is a distributed, scalable, and portable file-system written in Java for the Hadoop framework. The characteristics of HDFS: Fault tolerant. Runs with commodity hardware. Able to handle large datasets. Master slave paradigm. Write once file access only. HDFS components NameNode. DataNode. Secondary NameNode.
  • 10. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. 3) Hadoop Distributed File System. HDFS architecture MapReduce and Hadoop MapReduce. How to link R and Hadoop?
  • 11. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. MapReduce and Hadoop MapReduce. How to link R and Hadoop? 4) MapReduce and Hadoop MapReduce. MapReduce Is derived from Google MapReduce. Is a programming model for processing large datasets distributed on a large cluster. MapReduce is the heart of Hadoop. MapReduce paradigm Map phase: Once divided, datasets are assigned to the task tracker to perform the Map phase. The data functional operation will be performed over the data, emitting the mapped key and value pairs as the output of the Map phase, (i.e data processing). Reduce phase: The master node then collects the answers to all the subproblems and combines them in some way to form the output; the answer to the problem it was originally trying to solve, (i.e data collection and digesting).
  • 12. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. MapReduce and Hadoop MapReduce. How to link R and Hadoop? 4) MapReduce and Hadoop MapReduce. MapReduce Goals 1 Scalabilityto large data volumes: Scan 100 TB on 1 node @ 50 MB/s = 24 days. Scan on 1000-node cluster = 35 minutes. 2 Cost-efficiency: Commodity nodes (cheap, but unreliable). Commodity network. Automatic fault-tolerance (fewer admins). Easy to use (fewer programmers). Challenges 1 Cheap nodes fail, especially if you have many: Mean time between failures for 1 node = 3 years. MTBF for 1000 nodes = 1 day. Solution: Build fault-tolerance into system. 2 Programming distributed systems is hard: Solution: Users write data-parallel ”map” and ”reduce” functions, system handles work distribution and faults.
  • 13. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. 4) MapReduce and Hadoop MapReduce. MapReduce components JobTracker. TaskTracker. MapReduce and Hadoop MapReduce. How to link R and Hadoop?
  • 14. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. MapReduce and Hadoop MapReduce. How to link R and Hadoop? 4) MapReduce and Hadoop MapReduce. The five common steps of parallel computing 1 Preparing the Map() input: This will take the input data row wise and emit key value pairs per rows, or we can explicitly change as per the requirement. Map input: list (k1, v1). 2 Run the user-provided Map() code. Map output: list (k2, v2). 3 Shuffle the Map output to the Reduce processors. Also, shuffle the similar keys (grouping them) and input them to the same reducer. 4 Run the user-provided Reduce() code: This phase will run the custom reducer code designed by developer to run on shuffled data and emit key and value. Reduce input: (k2, list(v2)). Reduce output: (k3, v3). 5 Produce the final output: Finally, the master node collects all reducer output and combines and writes them in a text file.
  • 15. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. 4) MapReduce and Hadoop MapReduce. MapReduce and Hadoop MapReduce. How to link R and Hadoop?
  • 16. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. MapReduce and Hadoop MapReduce. How to link R and Hadoop? 4) MapReduce and Hadoop MapReduce. Hadoop MapReduce Is a popular Java framework for easily written applications. It processes vast amounts of data (multiterabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable and fault-tolerant manner. Listing Hadoop MapReduce entities Client: This initializes the job. JobTracker: This monitors the job. TaskTracker: This executes the job. HDFS: This stores the input and output data.
  • 17. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. 4) MapReduce and Hadoop MapReduce. Hadoop MapReduce scenario The loading of data into HDFS. The execution of the Map phase. Shuffling and sorting. The execution of the Reduce phase. MapReduce and Hadoop MapReduce. How to link R and Hadoop?
  • 18. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. MapReduce and Hadoop MapReduce. How to link R and Hadoop? 4) MapReduce and Hadoop MapReduce. Understanding the Hadoop MapReduce fundamentals MapReduce objects. i) Mapper. ii) Reducer. iii) Driver. MapReduce dataflow. 1 Preloading data in HDFS. 2 Running MapReduce by calling Driver. 3 Reading of input data by the Mappers, which results in the splitting of the data execution of the Mapper custom logic and the generation of intermediate key-value pairs. 4 Executing Combiner and the shuffle phase to optimize the overall Hadoop MapReduce process. 5 Sorting and providing of intermediate key-value pairs to the Reduce phase. The Reduce phase is then executed. Reducers take these partitioned key-value pairs and aggregate them based on Reducer logic. 6 The final output data is stored at HDFS.
  • 19. What is Big Data? Introduction to Hadoop. 5) Hadoop MapReduce. Example Word Count Hadoop Distributed File System. MapReduce and Hadoop MapReduce. How to link R and Hadoop?
  • 20. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. MapReduce and Hadoop MapReduce. 4) MapReduce and Hadoop MapReduce. Example Word Count (1). Map: public static class MapClass extends MapReduceBase implements Mapper { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter) throws IOException { String line = ((Text)value).toString(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, one); } } } How to link R and Hadoop?
  • 21. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. MapReduce and Hadoop MapReduce. How to link R and Hadoop? 4) MapReduce and Hadoop MapReduce. Example Word Count (2) Reduce: public static class Reduce extends MapReduceBase implements Reducer { public void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum + = ((IntWritable) values.next()).get(); } output.collect(key, new IntWritable(sum)); } }
  • 22. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. MapReduce and Hadoop MapReduce. 4) MapReduce and Hadoop MapReduce. Example Word Count (3) WordCount: public static void main(String[ ]args) throws IOException { //checking goes here JobConf conf = new JobConf(); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); conf.setMapperClass(MapClass.class); conf.setCombinerClass(Reduce.class); conf.setReducerClass(Reduce.class); conf.setInputPath(new Path(args[0])); conf.setOutputPath(new Path(args[1])); JobClient.runJob(conf); } How to link R and Hadoop?
  • 23. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. MapReduce and Hadoop MapReduce. How to link R and Hadoop? 4) MapReduce and Hadoop MapReduce. What is MapReduce used for? 1 At Google Index construction for Google Search (replaced in 2010 by Caffeine). Article clustering for Google News. Statistical machine translation. 2 At Yahoo! ”Web map” powering Yahoo! Search. Spam detection for Yahoo! Mail. 3 At Facebook Data mining, Web log processing. SearchBox (with Cassandra). Facebook Messages (with HBase). Spam detection and Ad optimization. 4 In research Bioinformatics (Maryland). Particle physics (Nebraska).
  • 24. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. MapReduce and Hadoop MapReduce. How to link R and Hadoop? 5) How to link R and Hadoop? Learning RHadoop RHadoop is a great open source software framework of R for performing data analytics with the Hadoop platform via R functions. RHadoop with R packages These three different R packages have been designed on Hadoop’s two main features HDFS and MapReduce: rhdfs: This is an R package for providing all Hadoop HDFS access to R. All distributed files can be managed with R functions. rmr: This is an R package for providing Hadoop MapReduce interfaces to R. With the help of this package, the Mapper and Reducer can easily be developed. rhbase: This is an R package for handling data at HBase distributed database through R.
  • 25. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. MapReduce and Hadoop MapReduce. How to link R and Hadoop? 5) How to link R and Hadoop? Learning RHIPE R and Hadoop Integrated Programming Environment (RHIPE) is a free and open source project. RHIPE is widely used for performing Big Data analysis with D & R analysis. D & R analysis is used to divide huge data, process it in parallel on a distributed network to produce intermediate output, and finally recombine all this intermediate output into a set. RHIPE is designed to carry out D & R analysis on complex Big Data in R on the Hadoop platform. RHadoop
  • 26. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. MapReduce and Hadoop MapReduce. How to link R and Hadoop? 5) How to link R and Hadoop? Learning Hadoop streaming Hadoop streaming is a utility that comes with the Hadoop distribution. This utility allows you to create and run MapReduce jobs with any executable or script as the Mapper and/or Reducer. This is supported by R, Python, Ruby, Bash, Perl, and so on. We will use the R language with a bash script. There is one R package named HadoopStreaming that has been developed for performing data analysis on Hadoop clusters with the help of R scripts, which is an interface to Hadoop streaming with R. Finally Three ways to link R and Hadoop are as follows: RHIPE. RHadoop. Hadoop streaming.
  • 27. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. MapReduce and Hadoop MapReduce. Thanks!! How to link R and Hadoop?
  • 28. What is Big Data? Introduction to Hadoop. Hadoop Distributed File System. MapReduce and Hadoop MapReduce. How to link R and Hadoop? References Glen Mules, Warren Pettit, Introduction to MapReduce Programming, September 2013. John Maindonald, W. John Braun, Introduction and Hadoop Overview, Lab Course: Databases & Cloud Computing, University of Freiburg, 2012. Vignesh Prajapati, Big Data Analytics with R and Hadoop, November 2013. Tom White, Hadoop: The Definitive Guide, Second Edition, October 2010.