Workshop – Hadoop + R

Carlos Gil Bellosta
Big Data Analytics
R & Hadoop

Carlos J. Gil Bellosta
cgb@datanalytics.com

November 2013

Table of Contents


1 Intro to Hadoop & R
    All about Hadoop
        Hadoop FS
        Hadoop & mapreduce
    All about R
2 Counting (& Graphics)
3 Details of mapreduce
4 Scoring, sampling & simulating
5 Data modelling
6 Final remarks

File system: manages everything about files


• Examples: diskettes, hard disks, RAIDs, ... magnetic tapes!
• Combination of hardware and software to hide boring activities from users:
    • Find space to write the files
    • Read/write files
    • Manage fragmentation
    • Etc.
• How many devices per FS?
    • 1-to-1: diskettes, CD-ROMs, HDDs, ...
    • n-to-1: partitioned HDDs, ...
    • 1-to-n: RAIDs, Hadoop

Hadoop goodies (as a FS)

• Splits (large) files into chunks spread across machines
• Replicates chunks (3 copies by default)
• Balances data
• Robust to hardware failures
• It is rack aware

Obviously, it requires some system to keep track of:
• Which servers/racks are up/down
• Where each chunk is located
• ...
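To make the chunk-and-replicate idea concrete, here is a toy sketch in plain Python (not R, and not Hadoop code). The function names, the round-robin placement and the node names are assumptions made up for illustration; the real placement policy is rack aware and considerably smarter.

```python
def split_into_chunks(data, chunk_size):
    """Cut a file's bytes into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(n_chunks, nodes, replication=3):
    """Assign each chunk to `replication` distinct nodes, round-robin.

    Hypothetical policy for illustration only; it ignores racks and load.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        chunk: [nodes[(chunk + r) % len(nodes)] for r in range(replication)]
        for chunk in range(n_chunks)
    }


def surviving_chunks(placement, dead_nodes):
    """Chunks that still have at least one replica on a live node."""
    return {c for c, where in placement.items() if set(where) - set(dead_nodes)}


data = bytes(range(256)) * 4             # a 1024-byte "file"
chunks = split_into_chunks(data, 300)    # chunk sizes: 300, 300, 300, 124
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas(len(chunks), nodes)

# With 3 replicas per chunk, losing any 2 nodes loses no data.
print(surviving_chunks(placement, ["node1", "node2"]))
```

The `placement` dictionary plays the role of the bookkeeping system mentioned above: it records where each chunk lives.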

How to work with data in Hadoop?

• Provides a shell (ls, cp, etc.)
• You can put/get data from your local FS to Hadoop FS
• That is:
    • You can dump your data to your local machine
    • You can run your programs on your local machine
    • You can put results back into Hadoop
• But what if the file is too large?

Solution

Rather than bringing the data to the code, why not move the code to the data?

One of the ways to move code to data is known as mapreduce.

Mapreduce

• Two-step process:
    • Map: run your code on the chunks, wherever they are
    • Reduce: reshape the output into the desired format
• Hadoop manages issues such as:
    • System failures
    • Threads that do not return
    • And all (?) that made the life of OpenMP, MPI, etc. users miserable
• Slotted approach: mapreduce provides slots where you put the mapper/reducer code
• The code is for you to provide!
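The "slots" idea can be sketched in a few lines of plain Python. This is a single-machine imitation under stated assumptions: the `mapreduce` function, the sample data and the example mapper/reducer are all hypothetical, and none of Hadoop's failure handling is modelled.

```python
from collections import defaultdict


def mapreduce(chunks, mapper, reducer):
    """Run `mapper` on every chunk, group its (key, value) pairs, reduce per key."""
    groups = defaultdict(list)
    for chunk in chunks:                   # on a cluster, chunks run in parallel
        for key, value in mapper(chunk):
            groups[key].append(value)      # the "shuffle": same key, same group
    return {key: reducer(key, values) for key, values in groups.items()}


# User-provided code for the two slots (made-up task: longest word per initial).
def mapper(chunk):
    for word in chunk:
        yield word[0], len(word)


def reducer(key, values):
    return max(values)


chunks = [["apple", "avocado", "banana"], ["blueberry", "cherry"], ["apricot"]]
result = mapreduce(chunks, mapper, reducer)
print(result)
```

Only `mapper` and `reducer` change from one job to the next; the surrounding machinery stays the same, which is the point of the slotted approach.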

What is R?

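• R is a
    • software package?
    • programming language?
    • environment?
  for data analysis and graphics.
• R users are (should be?) used to the mapreduce approach:

    library(plyr)  # provides ddply(), .() and summarize()

    # split dfx by (group, sex), summarize each piece, combine the results
    ddply(dfx, .(group, sex), summarize,
          mean = mean(age),
          sd   = sd(age))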
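For readers who do not have R at hand, the same split-apply-combine pattern can be imitated in plain Python. The rows below are invented sample data, and `stdev` is the sample standard deviation, which is also what R's `sd` computes.

```python
from collections import defaultdict
from statistics import mean, stdev

# Made-up data standing in for the data frame of the ddply call.
rows = [
    {"group": "A", "sex": "F", "age": 30},
    {"group": "A", "sex": "F", "age": 34},
    {"group": "A", "sex": "M", "age": 41},
    {"group": "A", "sex": "M", "age": 45},
    {"group": "B", "sex": "F", "age": 22},
    {"group": "B", "sex": "F", "age": 28},
]

# Split ("map"): emit one (key, value) pair per row, keyed by (group, sex).
pieces = defaultdict(list)
for row in rows:
    pieces[(row["group"], row["sex"])].append(row["age"])

# Apply and combine ("reduce"): summarise each piece independently.
summary = {
    key: {"mean": mean(ages), "sd": stdev(ages)}
    for key, ages in pieces.items()
}

for key, stats in sorted(summary.items()):
    print(key, stats)
```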

2 Counting (& Graphics)
    Graphics & big data
    Let’s count... hexagons

Visualizing a million

Fluctuation plot

Table plot

• Non-trivial counting exercise (no, we are not counting words today!)
• Good visualization features for big datasets
• Fits in the mapreduce framework:
    • Map: assigns points to hexagons
    • Reduce: aggregates counts on hexagons
• The output is small and can be plotted locally
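The map and reduce steps for hexagon counting can be sketched as follows in plain Python. The grid layout (pointy-top hexagons indexed by axial coordinates, rounded via cube coordinates) is one common choice and an assumption here; the workshop's own R code may bin points differently.

```python
import math
from collections import Counter


def hexagon_of(x, y, size=1.0):
    """Axial (q, r) index of the pointy-top hexagon of radius `size` holding (x, y)."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    # Round in cube coordinates (cx + cy + cz = 0), then repair the axis
    # with the largest rounding error so the constraint still holds.
    cx, cz = q, r
    cy = -cx - cz
    rx, ry, rz = round(cx), round(cy), round(cz)
    dx, dy, dz = abs(rx - cx), abs(ry - cy), abs(rz - cz)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return rx, rz


def mapper(points):
    """Map: assign each point to its hexagon and emit (hexagon, 1)."""
    for x, y in points:
        yield hexagon_of(x, y), 1


def count_hexagons(chunks):
    """Reduce: add up the ones per hexagon."""
    counts = Counter()
    for chunk in chunks:
        for hexagon, one in mapper(chunk):
            counts[hexagon] += one
    return counts


chunks = [
    [(0.0, 0.0), (0.1, -0.2), (math.sqrt(3), 0.0)],
    [(math.sqrt(3) + 0.1, 0.1), (0.05, 0.05)],
]
counts = count_hexagons(chunks)
print(counts)
```

However many points go in, the result has at most one entry per occupied hexagon, which is why it is small enough to plot on a laptop.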

3 Details of mapreduce

What you see: input/output, map, reduce

• input:
    • Type: text, csv, R object, ...
    • Options: separator, ...
• output: similar to input
• map & reduce:
    • Functions with a (k,v) argument (k, key; v, value)
    • They return a (k,v) list
    • Thus, mapreduces can be chained together (the output of the first one is the input for the second)
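Because both input and output are lists of (k,v) pairs, two jobs compose naturally. The following plain-Python sketch shows the chaining; the `mapreduce` helper and the purchase data are hypothetical stand-ins, not the interface used in the workshop.

```python
from collections import defaultdict


def mapreduce(pairs, mapper, reducer):
    """Take a list of (key, value) pairs and return a new list of (key, value) pairs."""
    groups = defaultdict(list)
    for k, v in pairs:
        for k2, v2 in mapper(k, v):
            groups[k2].append(v2)
    out = []
    for k2, values in sorted(groups.items()):
        out.extend(reducer(k2, values))
    return out


# Job 1: count purchases per customer.
def map_customer(k, v):
    return [(v["customer"], 1)]


def reduce_sum(k, values):
    return [(k, sum(values))]


# Job 2: how many customers made 1, 2, ... purchases?
def map_by_count(customer, n_purchases):
    return [(n_purchases, 1)]


purchases = [(i, {"customer": c}) for i, c in enumerate("abacabd")]
per_customer = mapreduce(purchases, map_customer, reduce_sum)
histogram = mapreduce(per_customer, map_by_count, reduce_sum)
print(per_customer)
print(histogram)
```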

What you don’t see

$HADOOP jar $HADOOP_STREAMING \
    -D stream.map.input=typedbytes \
    -D stream.map.output=typedbytes \
    -D stream.reduce.input=typedbytes \
    -D stream.reduce.output=typedbytes \
    -D mapred.reduce.tasks=0 \
    -input /tmp/RtmpUUrNMj/file68c0185e60c \
    -output /tmp/RtmpUUrNMj/file68c04c25d5f0 \
    -mapper "Rscript rmr-streaming-map68c018acf680 " \
    -file /tmp/RtmpUUrNMj/rmr-local-env68c0101c8e8a \
    -file /tmp/RtmpUUrNMj/rmr-global-env68c03abb4080 \
    -file /tmp/RtmpUUrNMj/rmr-streaming-map68c018acf680 \
    -inputformat org.apache.hadoop.streaming.AutoInputFormat \
    -outputformat org.apache.hadoop.mapred.SequenceFileOutputForm

4 Scoring, sampling & simulating

Scoring

• External consultants build a model (using R and small data)
• Models in R should have a predict method
• You can then score your huge database (in batch)
• No need to rewrite the model into your systems!
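Batch scoring is a map-only job: every chunk is scored independently and nothing needs reducing. A minimal plain-Python sketch, assuming a made-up logistic model whose coefficients and field names are purely illustrative (in the workshop the model would be an R object with a `predict` method):

```python
import math


class LogisticModel:
    """Stand-in for a model fitted elsewhere; the coefficients are invented."""

    def __init__(self, intercept, coefs):
        self.intercept = intercept
        self.coefs = coefs

    def predict(self, row):
        z = self.intercept + sum(c * row[name] for name, c in self.coefs.items())
        return 1 / (1 + math.exp(-z))


def score_chunk(model, chunk):
    """Mapper: score every record of one chunk; no reducer is needed."""
    return [(row["id"], model.predict(row)) for row in chunk]


model = LogisticModel(intercept=-1.0, coefs={"age": 0.02, "n_products": 0.5})
chunks = [
    [{"id": 1, "age": 50, "n_products": 0}, {"id": 2, "age": 30, "n_products": 2}],
    [{"id": 3, "age": 40, "n_products": 1}],
]
scores = [pair for chunk in chunks for pair in score_chunk(model, chunk)]
print(scores)
```

The model object travels to the data unchanged, so nothing has to be re-implemented in the production system.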

The case for sampling

• Sampling works!
• Sampled datasets can be used to build small data models
• You can use R (& mapreduce) to sample data, but you had better not

Running simulations on Hadoop

• Some (many?) people say it is not the right tool
• Mapreduce needs input data, but simulations often have none
• You want to control the number of mappers (which run your simulations)
• Still, mapreduce is nice for simulations...
• ... so let an old dog try its dirty trick!
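The slides do not spell the trick out. One common workaround, shown here as an assumption rather than as the author's method, is to feed the job a tiny dummy input with one line per simulation (for instance a seed), so that each mapper call runs one simulation. A plain-Python sketch with a made-up Monte Carlo task:

```python
import random


def simulate_pi(seed, n_draws=10_000):
    """One simulation: Monte Carlo estimate of pi from `n_draws` random points."""
    rng = random.Random(seed)
    inside = sum(
        rng.random() ** 2 + rng.random() ** 2 <= 1 for _ in range(n_draws)
    )
    return 4 * inside / n_draws


def mapper(seed_lines):
    """Each input line is just a seed; the real work is the simulation."""
    for line in seed_lines:
        yield "pi", simulate_pi(int(line))


def reducer(key, values):
    values = list(values)
    return key, sum(values) / len(values)


dummy_input = [str(seed) for seed in range(20)]       # one line per simulation
chunks = [dummy_input[i::4] for i in range(4)]        # 4 chunks, so 4 mappers
estimates = [value for chunk in chunks for _, value in mapper(chunk)]
key, pi_hat = reducer("pi", estimates)
print(pi_hat)
```

Splitting the dummy file into a chosen number of chunks is also a crude way to control how many mappers run.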

5 Data modelling
    Linear Regression
    Logistic Regression
    Trees & Random Forests

Linear regression can be parallelized

Simple linear regression: y ∼ α + βx

\[
\hat{\beta}
= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
= \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{j=1}^{n} y_j}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2}
\]

All the sums are computed case by case (one observation at a time)!
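Since the formula only needs a handful of sums, each chunk can compute its own partial sums and a reducer can add them up. A minimal plain-Python sketch with invented data; the intercept line uses the standard identity that the fitted line passes through the point of means, which the slide does not show.

```python
def partial_sums(chunk):
    """Mapper: the sums the formula needs, computed on one chunk of (x, y) pairs."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxy = sum(x * y for x, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    return n, sx, sy, sxy, sxx


def combine(partials):
    """Reducer: add the partial sums component-wise and plug them into the formula."""
    n, sx, sy, sxy, sxx = (sum(column) for column in zip(*partials))
    beta = (sxy - sx * sy / n) / (sxx - sx * sx / n)
    alpha = sy / n - beta * sx / n     # intercept: line goes through the means
    return alpha, beta


data = [(x, 2.0 + 3.0 * x) for x in range(10)]     # exact line y = 2 + 3x
chunks = [data[:3], data[3:7], data[7:]]
alpha, beta = combine(partial_sums(chunk) for chunk in chunks)
print(alpha, beta)
```

Each mapper returns five numbers regardless of how many rows its chunk holds, so very little data has to travel to the reducer.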

Multiple linear regression

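• Based on $X^\top X$ and $X^\top y$:

\[
\hat{\beta} = (X^\top X)^{-1} X^\top y
\]

• If $X$ is split into blocks of rows $X_1, \dots, X_n$, then $X^\top X = \sum_i X_i^\top X_i$ (and likewise $X^\top y = \sum_i X_i^\top y_i$).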
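The block decomposition can be checked numerically. The sketch below uses plain Python lists instead of a matrix library so that it runs anywhere; the data, the row blocks and the small Gauss-Jordan solver are illustrative assumptions.

```python
def xtx_xty(block_X, block_y):
    """Mapper: X_i'X_i (p x p) and X_i'y_i (length p) for one block of rows."""
    p = len(block_X[0])
    xtx = [[sum(row[a] * row[b] for row in block_X) for b in range(p)]
           for a in range(p)]
    xty = [sum(row[a] * y for row, y in zip(block_X, block_y)) for a in range(p)]
    return xtx, xty


def add(partials):
    """Reducer: element-wise sum of the per-block matrices and vectors."""
    partials = list(partials)
    p = len(partials[0][1])
    xtx = [[sum(part[0][a][b] for part in partials) for b in range(p)]
           for a in range(p)]
    xty = [sum(part[1][a] for part in partials) for a in range(p)]
    return xtx, xty


def solve(A, b):
    """Solve A beta = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [u - factor * v for u, v in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


# Made-up data: intercept column plus two predictors, y = 1 + 2*x1 - x2.
X = [[1.0, x1, x2] for x1, x2 in [(0, 1), (1, 0), (2, 3), (3, 1), (4, 4), (5, 2)]]
y = [1.0 + 2.0 * row[1] - 1.0 * row[2] for row in X]
blocks = [(X[:2], y[:2]), (X[2:5], y[2:5]), (X[5:], y[5:])]

xtx, xty = add(xtx_xty(bX, by) for bX, by in blocks)
beta = solve(xtx, xty)
print(beta)
```

The matrices being added are p x p, where p is the number of columns, so their size does not grow with the number of rows.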

Can logistic regression be parallelized? Yes and no.

• Fitting logistic regression models is iterative, and the iterations themselves must run one after another.
• However, the work inside each iteration can be parallelized (it is not unlike fitting linear models as before)
• We will explore two big data alternatives:
    • Parallelize iterations using mapreduce (see http://goo.gl/ftx36r)
    • Split your data meaningfully and do standard logistic regression in the nodes
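The first alternative can be sketched in plain Python: the loop over iterations is sequential, but inside each iteration every chunk computes its share of the gradient and a reducer adds the shares. The sketch uses plain gradient ascent on the log-likelihood, chosen for brevity; it is not the reweighted least squares that R's `glm` uses, and the data, step size and iteration count are made up.

```python
import math


def chunk_gradient(beta, chunk):
    """Mapper: gradient of the log-likelihood contributed by one chunk."""
    grad = [0.0] * len(beta)
    for x, y in chunk:
        p = 1 / (1 + math.exp(-sum(b * xj for b, xj in zip(beta, x))))
        for j, xj in enumerate(x):
            grad[j] += (y - p) * xj
    return grad


def fit(chunks, n_iter=500, step=0.5):
    n = sum(len(chunk) for chunk in chunks)
    beta = [0.0] * len(chunks[0][0][0])
    for _ in range(n_iter):                                     # sequential
        partials = [chunk_gradient(beta, c) for c in chunks]    # parallelizable
        grad = [sum(parts) for parts in zip(*partials)]         # reducer
        beta = [b + step * g / n for b, g in zip(beta, grad)]
    return beta


# Made-up data: (intercept, x) predictors with a noisy increasing response.
xs = [-2, -1, -1, 0, 0, 1, 1, 2]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
data = [((1.0, float(x)), y) for x, y in zip(xs, ys)]
chunks = [data[:3], data[3:6], data[6:]]
beta = fit(chunks)
print(beta)
```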

How many bytes make knowledge?
(aka the fractal nature of big data)

Split logistic regression

Viable alternatives to logistic models

• Trees
    • High interpretability
    • But unstable, and they tend to miss details
• Random forests
    • Black boxes
    • Superb performance
    • These are collections of trees that can be built in parallel
• Both can be parallelized in different ways:
    • Similar to partitioned logistic models above
    • Within training
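Why a collection of trees parallelizes so easily can be seen in a deliberately simplified plain-Python sketch. It builds bagged decision stumps (one-split trees on a single variable), not a full random forest, and the data and seeds are invented; the point is only that each tree depends on nothing but its own bootstrap sample.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Best rule of the form 'predict 1 if x > t' on a list of (x, label) pairs."""
    best_t, best_errors = None, len(sample) + 1
    for t, _ in sample:
        errors = sum((x > t) != bool(label) for x, label in sample)
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t


def build_tree(task):
    """Mapper: one task is one seed; draw a bootstrap sample and fit a stump on it."""
    seed, data = task
    rng = random.Random(seed)
    sample = [rng.choice(data) for _ in data]
    return fit_stump(sample)


def predict(forest, x):
    """Combine the stumps by majority vote."""
    votes = Counter(int(x > t) for t in forest)
    return votes.most_common(1)[0][0]


data = [(x, int(x > 5)) for x in range(11)]
tasks = [(seed, data) for seed in range(25)]
forest = [build_tree(task) for task in tasks]    # each call is independent
print(predict(forest, 0), predict(forest, 10))
```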

6 Final remarks

Forget most of what you learned today, seriously

• People strive to extend small data models to big data (as we did today)...
• ... but is it the way to go?
• Beware: microlocal structure
    • Small data people know microlocal structure as outliers
    • Global models (linear, logistic, ...) cannot (easily?) exploit microlocal structure
    • But the promises of big data lie precisely there
    • (Otherwise, just sample and you will be fine)
• Areas to watch for insights on big data modelling:
    • SNA (social network analysis)
    • Text analysis

Thank you very much and...

... questions?
Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013

More Related Content

What's hot

High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
Revolution Analytics
 
Summary of HDF-EOS5 Files, Data Model and File Format
Summary of HDF-EOS5 Files, Data Model and File FormatSummary of HDF-EOS5 Files, Data Model and File Format
Summary of HDF-EOS5 Files, Data Model and File Format
The HDF-EOS Tools and Information Center
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
Makoto Yui
 
New features in the version 4.5 of the CFD meteodyn WT dedicated to wind reso...
New features in the version 4.5 of the CFD meteodyn WT dedicated to wind reso...New features in the version 4.5 of the CFD meteodyn WT dedicated to wind reso...
New features in the version 4.5 of the CFD meteodyn WT dedicated to wind reso...
Jean-Claude Meteodyn
 
Pilot Project for HDF5 Metadata Structures for SWOT
Pilot Project for HDF5 Metadata Structures for SWOTPilot Project for HDF5 Metadata Structures for SWOT
Pilot Project for HDF5 Metadata Structures for SWOT
The HDF-EOS Tools and Information Center
 
R and Data Science
R and Data ScienceR and Data Science
R and Data Science
Revolution Analytics
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduce
David Gleich
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
Julian Hyde
 
Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith
Sigmoid
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
Chetan Khatri
 
MapReduce for scientific simulation analysis
MapReduce for scientific simulation analysisMapReduce for scientific simulation analysis
MapReduce for scientific simulation analysis
David Gleich
 
Learning Commonalities in RDF
Learning Commonalities in RDFLearning Commonalities in RDF
Learning Commonalities in RDF
Sara EL HASSAD
 
Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.
Anirudh Gangwar
 
ICESat-2 Metadata and Status
ICESat-2 Metadata and StatusICESat-2 Metadata and Status
ICESat-2 Metadata and Status
The HDF-EOS Tools and Information Center
 
Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop
BigDataEverywhere
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
Amund Tveit
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview
Senthil Kumar
 
Spatial Data, KML, and the University Web
Spatial Data, KML, and the University WebSpatial Data, KML, and the University Web
Spatial Data, KML, and the University Web
Glennon Alan
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasWes McKinney
 
Large Scale Geospatial Indexing and Analysis on Apache Spark
Large Scale Geospatial Indexing and Analysis on Apache SparkLarge Scale Geospatial Indexing and Analysis on Apache Spark
Large Scale Geospatial Indexing and Analysis on Apache Spark
Databricks
 

What's hot (20)

High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Summary of HDF-EOS5 Files, Data Model and File Format
Summary of HDF-EOS5 Files, Data Model and File FormatSummary of HDF-EOS5 Files, Data Model and File Format
Summary of HDF-EOS5 Files, Data Model and File Format
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
 
New features in the version 4.5 of the CFD meteodyn WT dedicated to wind reso...
New features in the version 4.5 of the CFD meteodyn WT dedicated to wind reso...New features in the version 4.5 of the CFD meteodyn WT dedicated to wind reso...
New features in the version 4.5 of the CFD meteodyn WT dedicated to wind reso...
 
Pilot Project for HDF5 Metadata Structures for SWOT
Pilot Project for HDF5 Metadata Structures for SWOTPilot Project for HDF5 Metadata Structures for SWOT
Pilot Project for HDF5 Metadata Structures for SWOT
 
R and Data Science
R and Data ScienceR and Data Science
R and Data Science
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduce
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
 
Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
 
MapReduce for scientific simulation analysis
MapReduce for scientific simulation analysisMapReduce for scientific simulation analysis
MapReduce for scientific simulation analysis
 
Learning Commonalities in RDF
Learning Commonalities in RDFLearning Commonalities in RDF
Learning Commonalities in RDF
 
Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.Evolution of spark framework for simplifying data analysis.
Evolution of spark framework for simplifying data analysis.
 
ICESat-2 Metadata and Status
ICESat-2 Metadata and StatusICESat-2 Metadata and Status
ICESat-2 Metadata and Status
 
Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview
 
Spatial Data, KML, and the University Web
Spatial Data, KML, and the University WebSpatial Data, KML, and the University Web
Spatial Data, KML, and the University Web
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
 
Large Scale Geospatial Indexing and Analysis on Apache Spark
Large Scale Geospatial Indexing and Analysis on Apache SparkLarge Scale Geospatial Indexing and Analysis on Apache Spark
Large Scale Geospatial Indexing and Analysis on Apache Spark
 

Viewers also liked

Architecture to Scale. DONN ROCHETTE at Big Data Spain 2012
Architecture to Scale. DONN ROCHETTE at Big Data Spain 2012Architecture to Scale. DONN ROCHETTE at Big Data Spain 2012
Architecture to Scale. DONN ROCHETTE at Big Data Spain 2012
Big Data Spain
 
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...
Big Data Spain
 
Memory efficient applications. FRANCESC ALTED at Big Data Spain 2012
Memory efficient applications. FRANCESC ALTED at Big Data Spain 2012Memory efficient applications. FRANCESC ALTED at Big Data Spain 2012
Memory efficient applications. FRANCESC ALTED at Big Data Spain 2012
Big Data Spain
 
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012
Big Data Spain
 
Big data meets scalable visualizations by JAVIER DE LA TORRE at Big Data Spai...
Big data meets scalable visualizations by JAVIER DE LA TORRE at Big Data Spai...Big data meets scalable visualizations by JAVIER DE LA TORRE at Big Data Spai...
Big data meets scalable visualizations by JAVIER DE LA TORRE at Big Data Spai...
Big Data Spain
 
The Big Business of Big Data. JON BRUNER at Big Data Spain 2012
The Big Business of Big Data. JON BRUNER at Big Data Spain 2012The Big Business of Big Data. JON BRUNER at Big Data Spain 2012
The Big Business of Big Data. JON BRUNER at Big Data Spain 2012
Big Data Spain
 
Intro to the Big Data Spain 2014 conference
Intro to the Big Data Spain 2014 conferenceIntro to the Big Data Spain 2014 conference
Intro to the Big Data Spain 2014 conference
Big Data Spain
 
Value extraction from BBVA credit card transactions. IVAN DE PRADO at Big Dat...
Value extraction from BBVA credit card transactions. IVAN DE PRADO at Big Dat...Value extraction from BBVA credit card transactions. IVAN DE PRADO at Big Dat...
Value extraction from BBVA credit card transactions. IVAN DE PRADO at Big Dat...
Big Data Spain
 
Putting Hadoop on any Cloud. NATI SHALOM at Big Data Spain 2012
Putting Hadoop on any Cloud. NATI SHALOM at Big Data Spain 2012Putting Hadoop on any Cloud. NATI SHALOM at Big Data Spain 2012
Putting Hadoop on any Cloud. NATI SHALOM at Big Data Spain 2012
Big Data Spain
 
Health Insurance Predictive Analysis with Hadoop and Machine Learning. JULIEN...
Health Insurance Predictive Analysis with Hadoop and Machine Learning. JULIEN...Health Insurance Predictive Analysis with Hadoop and Machine Learning. JULIEN...
Health Insurance Predictive Analysis with Hadoop and Machine Learning. JULIEN...
Big Data Spain
 

Viewers also liked (10)

Architecture to Scale. DONN ROCHETTE at Big Data Spain 2012
Architecture to Scale. DONN ROCHETTE at Big Data Spain 2012Architecture to Scale. DONN ROCHETTE at Big Data Spain 2012
Architecture to Scale. DONN ROCHETTE at Big Data Spain 2012
 
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...
 
Memory efficient applications. FRANCESC ALTED at Big Data Spain 2012
Memory efficient applications. FRANCESC ALTED at Big Data Spain 2012Memory efficient applications. FRANCESC ALTED at Big Data Spain 2012
Memory efficient applications. FRANCESC ALTED at Big Data Spain 2012
 
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012
 
Big data meets scalable visualizations by JAVIER DE LA TORRE at Big Data Spai...
Big data meets scalable visualizations by JAVIER DE LA TORRE at Big Data Spai...Big data meets scalable visualizations by JAVIER DE LA TORRE at Big Data Spai...
Big data meets scalable visualizations by JAVIER DE LA TORRE at Big Data Spai...
 
The Big Business of Big Data. JON BRUNER at Big Data Spain 2012
The Big Business of Big Data. JON BRUNER at Big Data Spain 2012The Big Business of Big Data. JON BRUNER at Big Data Spain 2012
The Big Business of Big Data. JON BRUNER at Big Data Spain 2012
 
Intro to the Big Data Spain 2014 conference
Intro to the Big Data Spain 2014 conferenceIntro to the Big Data Spain 2014 conference
Intro to the Big Data Spain 2014 conference
 
Value extraction from BBVA credit card transactions. IVAN DE PRADO at Big Dat...
Value extraction from BBVA credit card transactions. IVAN DE PRADO at Big Dat...Value extraction from BBVA credit card transactions. IVAN DE PRADO at Big Dat...
Value extraction from BBVA credit card transactions. IVAN DE PRADO at Big Dat...
 
Putting Hadoop on any Cloud. NATI SHALOM at Big Data Spain 2012
Putting Hadoop on any Cloud. NATI SHALOM at Big Data Spain 2012Putting Hadoop on any Cloud. NATI SHALOM at Big Data Spain 2012
Putting Hadoop on any Cloud. NATI SHALOM at Big Data Spain 2012
 
Health Insurance Predictive Analysis with Hadoop and Machine Learning. JULIEN...
Health Insurance Predictive Analysis with Hadoop and Machine Learning. JULIEN...Health Insurance Predictive Analysis with Hadoop and Machine Learning. JULIEN...
Health Insurance Predictive Analysis with Hadoop and Machine Learning. JULIEN...
 


Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013

  • 1. Workshop – Hadoop + R Carlos Gil Bellosta
  • 2. Big Data Analytics: R & Hadoop. Carlos J. Gil Bellosta, cgb@datanalytics.com. November 2013.
  • 3. Big Data Analytics: Table of Contents
      1 Intro to Hadoop & R
          • All about Hadoop: Hadoop FS; Hadoop & mapreduce
          • All about R
      2 Counting (& Graphics)
          • Graphics & big data
          • Let’s count... hexagons
      3 Details of mapreduce
      4 Scoring, sampling & simulating
      5 Data modelling
          • Linear Regression
          • Logistic Regression
          • Trees & Random Forests
      6 Final remarks
  • 4. Big Data Analytics: File system: manages all about files
      • Examples: diskettes, hard disks, RAIDs,... magnetic tapes!
      • Combination of hardware and software to hide boring activities from users:
          • Find space to write the files
          • Read/write files
          • Manage fragmentation
          • Etc.
      • How many devices per FS?
          • 1-to-1: diskettes, CD-ROMs, HDDs,...
          • n-to-1: partitioned HDDs,...
          • 1-to-n: RAIDs, Hadoop
  • 5. Big Data Analytics: Hadoop goodies (as a FS)
      • Splits (large) files into chunks spread across machines
      • Replicates chunks (3 copies by default)
      • Balances data
      • Robust to hardware failures
      • It is rack aware
    Obviously, it requires some system to keep track of:
      • Which servers/racks are up/down
      • Where each chunk is located
      • ...
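The chunk-and-replicate idea on this slide can be illustrated with a toy model in R. This is only a sketch under stated assumptions: the function `place_chunks`, the 64 MB chunk size, the five node names and the random placement are all hypothetical, and real Hadoop placement (which is rack aware) is not modelled here.

```r
# Toy model of chunking and replication; NOT how Hadoop decides placement.
# Assumptions: 64 MB chunks, 5 nodes, replicas placed at random on distinct nodes.
place_chunks <- function(file_size_mb, chunk_mb = 64,
                         nodes = paste0("node", 1:5), replication = 3) {
  stopifnot(replication <= length(nodes))
  n_chunks <- ceiling(file_size_mb / chunk_mb)
  # For each chunk, pick `replication` distinct nodes (sampling without replacement).
  placement <- lapply(seq_len(n_chunks), function(i) sample(nodes, replication))
  names(placement) <- paste0("chunk", seq_len(n_chunks))
  placement
}

set.seed(1)
placement <- place_chunks(200)   # a 200 MB file -> 4 chunks of up to 64 MB

# If a single node fails, every chunk still has a replica somewhere else.
failed <- "node2"
survives <- all(sapply(placement, function(p) any(p != failed)))
```

The `placement` list plays the role of the bookkeeping the slide mentions: something has to remember where each chunk lives and which nodes are up.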
  • 6. Big Data Analytics: How to work with data in Hadoop?
      • Provides a shell (ls, cp, etc.)
      • You can put/get data from your local FS to Hadoop FS
      • That is:
          • You can dump your data to your local machine
          • You can run your programs on your local machine
          • You can put results back into Hadoop
      • But what if the file is too large?
    Solution: rather than bringing the data to the code, why not move the code to the data?
    One of the ways to move code to data is known as mapreduce.
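The get / process locally / put back workflow could look roughly like this when driven from R. The wrapper `hadoop_fs` and all paths are hypothetical. The sketch assumes the standard `hadoop fs` shell subcommands `-ls`, `-get` and `-put`, and by default only builds the command line instead of running it.

```r
# Hypothetical wrapper around the `hadoop fs` shell.
# With dry_run = TRUE (the default) nothing is executed; the command is only returned.
hadoop_fs <- function(..., dry_run = TRUE) {
  args <- c("fs", ...)
  if (!dry_run) system2("hadoop", args)   # requires `hadoop` on the PATH
  paste("hadoop", paste(args, collapse = " "))
}

# All paths below are made up for illustration.
cmd_ls  <- hadoop_fs("-ls", "/user/me")                        # list a directory
cmd_get <- hadoop_fs("-get", "/user/me/data.csv", "data.csv")  # dump to the local FS
# ... run your local program on data.csv, write results.csv ...
cmd_put <- hadoop_fs("-put", "results.csv", "/user/me/")       # put results back
```

This only works while `data.csv` fits on the local machine, which is exactly the limitation the slide raises.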
  • 7. Mapreduce
      • Two-step process:
          • Map: run your code on the chunks, wherever they are
          • Reduce: reshape the output into the desired format
      • Hadoop manages issues:
          • System failures
          • Threads that do not return
          • And all (?) that made the life of OpenMP, MPI, etc. users miserable
      • Slotted approach: mapreduce provides slots where you put the mappers/reducers code
      • The code is for you to provide!
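The "slots" idea can be sketched in a few lines of plain Python. This is a single-machine illustration under simplifying assumptions, not Hadoop's implementation: the `mapreduce` driver below is hypothetical, and the user only supplies `mapper` and `reducer`.

```python
from collections import defaultdict


def mapreduce(chunks, mapper, reducer):
    """Run `mapper` on every chunk, group the emitted pairs by key, then reduce."""
    groups = defaultdict(list)
    for chunk in chunks:                 # on a cluster, each chunk is mapped where it lives
        for key, value in mapper(chunk):
            groups[key].append(value)    # the "shuffle": collect values per key
    return {key: reducer(key, values) for key, values in groups.items()}


# The two slots, filled in by the user.
def mapper(chunk):
    for number in chunk:
        yield ("even" if number % 2 == 0 else "odd", number)


def reducer(key, values):
    return sum(values)


result = mapreduce([[1, 2, 3], [4, 5], [6]], mapper, reducer)
```

Everything outside `mapper` and `reducer` (distribution, failures, retries) is what the framework takes care of.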
  • 8. What is R?
      • R is a
          • software package?
          • programming language?
          • environment?
        for data analysis and graphics.
      • R users are (should be?) used to the mapreduce approach:
          ddply(dfx, .(group, sex), summarize, mean = mean(age), sd = sd(age))
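To make the analogy explicit, here is the same split-apply-combine computation written out by hand in Python. It is a sketch with made-up rows; the column names `group`, `sex` and `age` are taken from the ddply call, everything else is assumed.

```python
from collections import defaultdict
from statistics import mean, stdev

rows = [
    {"group": "A", "sex": "F", "age": 30},
    {"group": "A", "sex": "F", "age": 34},
    {"group": "A", "sex": "M", "age": 40},
    {"group": "A", "sex": "M", "age": 44},
]


def summarize(rows):
    groups = defaultdict(list)
    for row in rows:                                   # "map": key each row by (group, sex)
        groups[(row["group"], row["sex"])].append(row["age"])
    return {                                           # "reduce": one summary per key
        key: {"mean": mean(ages), "sd": stdev(ages)}
        for key, ages in groups.items()
    }


summary = summarize(rows)
```

The grouping variables play the role of the key and the summary functions play the role of the reducer.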
  • 9. Table of Contents
      1 Intro to Hadoop & R
      2 Counting (& Graphics)
          Graphics & big data
          Let’s count... hexagons
      3 Details of mapreduce
      4 Scoring, sampling & simulating
      5 Data modelling
      6 Final remarks
  • 10. Visualizing a million
  • 11. Fluctuation plot
  • 12. Table plot
  • 13. Let’s count... hexagons
      • Non-trivial counting exercise (no, we are not counting words today!)
      • Good visualization features for big datasets
      • Fits in mapreduce framework:
          • Map: assigns points to hexagons
          • Reduce: aggregates counts on hexagons
      • The output is small and can be plotted locally
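A minimal sketch of the hexagon count follows. It assumes pointy-top hexagons addressed by axial coordinates and uses the standard cube-rounding rule to find the hexagon containing a point; the deck does not say which binning scheme was used, so this choice and all function names are illustrative.

```python
import math
from collections import Counter


def hexagon_of(x, y, size=1.0):
    """Axial (q, r) coordinates of the pointy-top hexagon containing (x, y)."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    cx, cz = q, r
    cy = -cx - cz                       # cube coordinates satisfy cx + cy + cz = 0
    rx, ry, rz = round(cx), round(cy), round(cz)
    dx, dy, dz = abs(rx - cx), abs(ry - cy), abs(rz - cz)
    if dx > dy and dx > dz:             # recompute the coordinate with the largest rounding error
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return (rx, rz)


def map_chunk(points, size=1.0):
    """Map: assign each point to a hexagon and count within the chunk."""
    return Counter(hexagon_of(x, y, size) for x, y in points)


def reduce_counts(partial_counts):
    """Reduce: add up the per-chunk counts for each hexagon."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


chunks = [[(0.0, 0.0), (0.1, 0.1)], [(math.sqrt(3), 0.0), (0.1, -0.2)]]
counts = reduce_counts(map_chunk(chunk) for chunk in chunks)
```

The result has one entry per non-empty hexagon, so its size depends on the plotting resolution rather than on the number of points.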
  • 14. Table of Contents
      1 Intro to Hadoop & R
      2 Counting (& Graphics)
      3 Details of mapreduce
      4 Scoring, sampling & simulating
      5 Data modelling
      6 Final remarks
  • 15. What you see: input/output, map, reduce
      • input:
          • Type: text, csv, R object,...
          • Options: separator,...
      • output: similar to input
      • map & reduce:
          • Functions with a (k,v) argument (k, key; v, value)
          • They return a (k,v) list
          • Thus, mapreduce jobs can be chained together (the output of the first one is the input for the second)
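Because both functions take and return (k,v) pairs, the output of one job is a valid input for the next. The sketch below shows this with a hypothetical single-machine `mapreduce` driver and made-up visit data; it mirrors the interface described on the slide but is not the rmr API.

```python
from collections import defaultdict


def mapreduce(pairs, map_fn, reduce_fn):
    """`pairs` is a list of (key, value); map_fn and reduce_fn return such lists too."""
    groups = defaultdict(list)
    for k, v in pairs:
        for k2, v2 in map_fn(k, v):
            groups[k2].append(v2)
    out = []
    for k2, values in groups.items():
        out.extend(reduce_fn(k2, values))
    return out


def emit_value_as_key(k, v):
    return [(v, 1)]


def sum_values(k, values):
    return [(k, sum(values))]


visits = [(None, name) for name in ["ann", "bob", "ann", "cat", "bob", "ann", "dan"]]

# Job 1: number of visits per user.
per_user = mapreduce(visits, emit_value_as_key, sum_values)
# Job 2: how many users have each visit count (input = output of job 1).
histogram = mapreduce(per_user, emit_value_as_key, sum_values)
```

The same two functions are reused in both jobs only because the second job happens to be another count; in general each job gets its own map and reduce.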
  • 16. What you don’t see
      $HADOOP jar $HADOOP_STREAMING \
        -D stream.map.input=typedbytes \
        -D stream.map.output=typedbytes \
        -D stream.reduce.input=typedbytes \
        -D stream.reduce.output=typedbytes \
        -D mapred.reduce.tasks=0 \
        -input /tmp/RtmpUUrNMj/file68c0185e60c \
        -output /tmp/RtmpUUrNMj/file68c04c25d5f0 \
        -mapper "Rscript rmr-streaming-map68c018acf680 " \
        -file /tmp/RtmpUUrNMj/rmr-local-env68c0101c8e8a \
        -file /tmp/RtmpUUrNMj/rmr-global-env68c03abb4080 \
        -file /tmp/RtmpUUrNMj/rmr-streaming-map68c018acf680 \
        -inputformat org.apache.hadoop.streaming.AutoInputFormat \
        -outputformat org.apache.hadoop.mapred.SequenceFileOutputForm
  • 17. Table of Contents
      1 Intro to Hadoop & R
      2 Counting (& Graphics)
      3 Details of mapreduce
      4 Scoring, sampling & simulating
      5 Data modelling
      6 Final remarks
  • 18. Scoring
      • External consultants build a model (using R and small data)
      • Models in R should have a predict method
      • You can then score your huge database (in batch)
      • No need to rewrite the model into your systems!
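Batch scoring is a map-only job: the fitted model is shipped to the data and its predict method is applied to each chunk. The sketch below assumes a toy logistic model with made-up coefficients standing in for whatever the consultants delivered; class and function names are hypothetical.

```python
import math


class LogisticModel:
    """Stand-in for a model fitted elsewhere; the coefficients are made up."""

    def __init__(self, intercept, slope):
        self.intercept = intercept
        self.slope = slope

    def predict(self, xs):
        return [1 / (1 + math.exp(-(self.intercept + self.slope * x))) for x in xs]


def score_chunk(model, chunk):
    """Map step: attach a score to every (id, x) record in the chunk."""
    ids = [record_id for record_id, _ in chunk]
    scores = model.predict([x for _, x in chunk])
    return list(zip(ids, scores))


model = LogisticModel(intercept=-1.0, slope=2.0)
chunks = [[(1, 0.0), (2, 0.5)], [(3, 1.0)]]
scored = [pair for chunk in chunks for pair in score_chunk(model, chunk)]
```

No reduce step is needed, since every record is scored independently of the others.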
  • 19. The case for sampling
      • Sampling works!
      • Sampled datasets can be used to build small data models
      • You can use R (& mapreduce) to sample data, but you had better not
  • 20. Running simulations on Hadoop
      • Some (many?) people say it is not the right tool
      • Mapreduce needs input data, but simulations often have none
      • You want to control the number of mappers (which run your simulations)
      • Still, mapreduce is nice for simulations...
      • ... so let an old dog try its dirty trick!
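The slide does not spell out the trick. One common workaround, assumed here purely for illustration, is to hand the job a tiny dummy input with one random seed per simulation, so that the "data" only serves to decide how many simulations run and where. The sketch estimates pi by Monte Carlo; all names are hypothetical.

```python
import random


def simulate(seed, n_draws=10_000):
    """Map step: one Monte Carlo estimate of pi driven by a single seed."""
    rng = random.Random(seed)
    hits = sum(rng.random() ** 2 + rng.random() ** 2 <= 1 for _ in range(n_draws))
    return 4 * hits / n_draws


seeds = range(8)                              # the dummy input: one record per simulation
estimates = [simulate(seed) for seed in seeds]
pi_hat = sum(estimates) / len(estimates)      # reduce step: average the estimates
```

Using an explicit seed per record also makes each simulation reproducible if a task has to be rerun.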
  • 21. Table of Contents
      1 Intro to Hadoop & R
      2 Counting (& Graphics)
      3 Details of mapreduce
      4 Scoring, sampling & simulating
      5 Data modelling
          Linear Regression
          Logistic Regression
          Trees & Random Forests
      6 Final remarks
  • 22. Linear regression can be parallelized
      Simple linear regression: $y \sim \alpha + \beta x$
      $$\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{j=1}^{n} y_j}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2}$$
      Operations are case by case!
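Since the expanded form of the slope only involves sums over cases, each chunk can compute its own partial sums and a reducer can add them up. A minimal sketch, with made-up data lying exactly on y = 1 + 2x; the intercept uses the standard identity (mean of y minus slope times mean of x), which is not shown on the slide.

```python
def map_stats(chunk):
    """Per-chunk sums: n, sum(x), sum(y), sum(x*y), sum(x^2)."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxy = sum(x * y for x, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    return (n, sx, sy, sxy, sxx)


def reduce_stats(stats):
    """Add the per-chunk tuples component by component."""
    return tuple(sum(column) for column in zip(*stats))


def fit(n, sx, sy, sxy, sxx):
    beta = (sxy - sx * sy / n) / (sxx - sx ** 2 / n)
    alpha = sy / n - beta * sx / n
    return alpha, beta


chunks = [[(0, 1), (1, 3)], [(2, 5), (3, 7)]]
alpha, beta = fit(*reduce_stats(map_stats(chunk) for chunk in chunks))
```

Only five numbers per chunk travel over the network, whatever the chunk size.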
  • 23. Multiple linear regression
      • Based on $X^\top X$ and $X^\top y$: $\hat{\beta} = (X^\top X)^{-1} X^\top y$
      • If $X^\top = [X_1^\top | \ldots | X_n^\top]$ (that is, $X$ is split by rows into blocks), then $X^\top X = \sum_i X_i^\top X_i$ and, likewise, $X^\top y = \sum_i X_i^\top y_i$.
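The block identity translates directly into a map step (cross-products per block of rows) and a reduce step (element-wise sum), followed by one small linear solve. The sketch below uses plain lists and a hand-written Gauss-Jordan solver to stay dependency-free; the data are made up and satisfy y = 1 + 2*x1 - x2 exactly.

```python
def xtx_xty(block):
    """Map: cross-products X_i'X_i and X_i'y_i for one block of (row, y) cases."""
    p = len(block[0][0])
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for row, y in block:
        for a in range(p):
            xty[a] += row[a] * y
            for b in range(p):
                xtx[a][b] += row[a] * row[b]
    return xtx, xty


def add(parts):
    """Reduce: element-wise sum of the per-block pieces."""
    parts = list(parts)
    p = len(parts[0][1])
    xtx = [[sum(part[0][a][b] for part in parts) for b in range(p)] for a in range(p)]
    xty = [sum(part[1][a] for part in parts) for a in range(p)]
    return xtx, xty


def solve(a, b):
    """Solve a * beta = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Each case is ((1, x1, x2), y); the leading 1 is the intercept column.
blocks = [
    [((1, 0, 0), 1), ((1, 1, 0), 3)],
    [((1, 0, 1), 0), ((1, 1, 1), 2), ((1, 2, 1), 4)],
]
xtx, xty = add(xtx_xty(block) for block in blocks)
beta = solve(xtx, xty)
```

The matrices exchanged have p x p entries, so the cost of the final solve depends on the number of predictors, not on the number of rows.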
  • 24. Can logistic regression be parallelized? Yes and no.
      • Fitting logistic regression models is iterative, and the iterations themselves cannot run in parallel.
      • However, each iteration can be parallelized (these are not unlike fitting linear models as before)
      • We will explore two big data alternatives:
          • Parallelize each iteration using mapreduce (see http://goo.gl/ftx36r)
          • Split your data meaningfully and do standard logistic regression in the nodes
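The first alternative can be sketched with Newton's method: the loop over iterations is sequential, but within each iteration the gradient and Hessian are sums over cases that chunks can compute independently. This is an assumed illustration with one predictor and made-up data, not necessarily the approach described at the linked page.

```python
import math


def map_chunk(chunk, b0, b1):
    """Per-chunk gradient (g0, g1) and Hessian entries (h00, h01, h11) at (b0, b1)."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in chunk:
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))
        w = p * (1 - p)
        g0 += y - p
        g1 += (y - p) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return (g0, g1, h00, h01, h11)


def newton_step(chunks, b0, b1):
    """One iteration: map over chunks, sum the pieces, solve the 2x2 system."""
    g0, g1, h00, h01, h11 = (
        sum(column) for column in zip(*(map_chunk(c, b0, b1) for c in chunks))
    )
    det = h00 * h11 - h01 * h01
    return (b0 + (h11 * g0 - h01 * g1) / det, b1 + (h00 * g1 - h01 * g0) / det)


def fit(chunks, iterations=25):
    b0 = b1 = 0.0
    for _ in range(iterations):          # sequential across iterations, parallel inside
        b0, b1 = newton_step(chunks, b0, b1)
    return b0, b1


chunks = [
    [(-2, 0), (-1, 0), (0, 1)],
    [(0, 0), (1, 1), (2, 1)],
    [(-1, 1), (1, 0)],
]
b0, b1 = fit(chunks)
```

Each iteration is one mapreduce job, so the total cost is the per-job overhead times the number of iterations; that overhead is what makes the split-data alternative attractive.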
  • 25. How many bytes make knowledge? (aka the fractal nature of big data)
  • 26. Split logistic regression
  • 27. Viable alternatives to logistic models
      • Trees
          • High interpretability
          • But unstable and tend to miss details
      • Random forests
          • Black boxes
          • Superb performance
          • These are collections of trees that can be built in parallel
      • Both can be parallelized in different ways:
          • Similar to partitioned logistic models above
          • Within training
  • 28. Table of Contents
      1 Intro to Hadoop & R
      2 Counting (& Graphics)
      3 Details of mapreduce
      4 Scoring, sampling & simulating
      5 Data modelling
      6 Final remarks
  • 29. Forget most of what you learned today, seriously
      • People strive to extend small data models to big data (as we did today)...
      • ... but is it the way to go?
      • Achtung: microlocal structure
          • Small data people know microlocal structure as outliers
          • Global models (linear, logistic,...) cannot (easily?) exploit microlocal structure
          • But the promises of big data lie precisely there
          • (Otherwise, just sample and you will be fine)
      • Areas to watch for insights on big data modelling:
          • SNA (social network analysis)
          • Text analysis
  • 30. Thank you very much and... questions?