Published in: Technology

- 1. Workshop – Hadoop + R Carlos Gil Bellosta
- 2. Big Data Analytics. Carlos J. Gil Bellosta (cgb@datanalytics.com), November 2013
- 3. Table of Contents
  1. Intro to Hadoop & R
     - All about Hadoop
     - Hadoop FS
     - Hadoop & mapreduce
     - All about R
  2. Counting (& Graphics)
  3. Details of mapreduce
  4. Scoring, sampling & simulating
  5. Data modelling
  6. Final remarks
- 4. File system: manages everything about files
  - Examples: diskettes, hard disks, RAIDs,... magnetic tapes!
  - A combination of hardware and software that hides boring activities from users:
    - Find space to write the files
    - Read/write files
    - Manage fragmentation
    - Etc.
  - How many devices per FS? (ratios read as file systems to devices)
    - 1-to-1: diskettes, CD-ROMs, HDDs,...
    - n-to-1: partitioned HDDs,...
    - 1-to-n: RAIDs, Hadoop
- 5. Hadoop goodies (as a FS)
  - Splits (large) files into chunks spread across machines
  - Replicates chunks (3 copies by default)
  - Balances data
  - Robust to hardware failures
  - It is rack aware
  - Obviously, it requires some system to keep track of:
    - Which servers/racks are up/down
    - Where each chunk is located
    - ...
- 6. How to work with data in Hadoop?
  - Provides a shell (ls, cp, etc.)
  - You can put/get data between your local FS and the Hadoop FS
  - That is:
    - You can dump your data to your local machine
    - You can run your programs on your local machine
    - You can put results back into Hadoop
  - But what if the file is too large?
  - Solution: rather than bringing the data to the code, why not move the code to the data?
  - One of the ways to move code to data is known as mapreduce.
- 7. Mapreduce
  - Two-step process:
    - Map: run your code on the chunks, wherever they are stored
    - Reduce: reshape the output into the desired format
  - Hadoop manages issues such as:
    - System failures
    - Threads that do not return
    - And all (?) that made the life of OpenMP, MPI, etc. users miserable
  - Slotted approach: mapreduce provides slots where you put the mapper/reducer code
  - The code is for you to provide!
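
The two steps on slide 7 can be imitated on a single machine. What follows is a minimal sketch in plain Python, not Hadoop or rmr2 code: the names `run_mapreduce`, `mapper`, and `reducer`, and the sample records, are all assumptions made for illustration. The mapper emits partial sums per chunk and the reducer combines them into a mean per group.

```python
from collections import defaultdict


def run_mapreduce(chunks, mapper, reducer):
    """Toy single-process driver: map each chunk, group by key, reduce each group."""
    groups = defaultdict(list)
    for chunk in chunks:  # on a cluster, each chunk would be processed where it is stored
        for key, value in mapper(chunk):
            groups[key].append(value)  # "shuffle": collect values by key
    return {key: reducer(key, values) for key, values in groups.items()}


def mapper(chunk):
    """Emit (group, (age_sum, count)) partial aggregates for one chunk."""
    partial = defaultdict(lambda: [0.0, 0])
    for group, age in chunk:
        partial[group][0] += age
        partial[group][1] += 1
    return [(group, (total, count)) for group, (total, count) in partial.items()]


def reducer(key, values):
    """Combine the partial aggregates of one group into its mean age."""
    total = sum(s for s, _ in values)
    count = sum(n for _, n in values)
    return total / count


# Hypothetical (group, age) records, already split into two chunks.
chunks = [
    [("a", 30), ("b", 40)],
    [("a", 50), ("b", 20), ("b", 60)],
]
means = run_mapreduce(chunks, mapper, reducer)
```

The driver is the "slot" provider in this picture: only `mapper` and `reducer` would change from one job to the next.
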
- 8. What is R?
  - R is a (software package? programming language? environment?) for data analysis and graphics.
  - R users are (should be?) used to the mapreduce approach: `ddply(dfx, .(group, sex), summarize, mean = mean(age), sd = sd(age))`
- 9. Table of Contents: 1 Intro to Hadoop & R · 2 Counting (& Graphics) (Graphics & big data; Let’s count... hexagons) · 3 Details of mapreduce · 4 Scoring, sampling & simulating · 5 Data modelling · 6 Final remarks
- 10. Visualizing a million
- 11. Fluctuation plot
- 12. Table plot
- 13. Let’s count... hexagons
  - Non-trivial counting exercise (no, we are not counting words today!)
  - Good visualization features for big datasets
  - Fits in the mapreduce framework:
    - Map: assigns points to hexagons
    - Reduce: aggregates counts on hexagons
  - The output is small and can be plotted locally
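
The map and reduce steps for hexagon counting can be sketched as follows. This is an illustrative Python sketch, not the workshop's R code: it assumes pointy-top hexagons addressed by axial coordinates `(q, r)` and uses the standard cube-rounding trick to find the nearest hexagon centre. The function names, the `size` parameter, and the sample points are all hypothetical.

```python
import math
from collections import Counter


def hex_cell(x, y, size=1.0):
    """Return the axial (q, r) index of the pointy-top hexagon containing (x, y)."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    s = -q - r  # third cube coordinate; q + r + s is always 0
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    # Re-derive the coordinate with the largest rounding error so the sum stays 0.
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return (rq, rr)


def map_points(chunk, size=1.0):
    """Map: emit (hexagon, 1) for every point in one chunk."""
    return [(hex_cell(x, y, size), 1) for x, y in chunk]


def reduce_counts(pairs):
    """Reduce: add up the counts emitted for each hexagon."""
    counts = Counter()
    for cell, n in pairs:
        counts[cell] += n
    return counts


# Hypothetical points split into two chunks.
chunks = [
    [(0.0, 0.0), (0.1, 0.1)],
    [(math.sqrt(3), 0.0), (0.9, 1.5)],
]
pairs = [pair for chunk in chunks for pair in map_points(chunk)]
counts = reduce_counts(pairs)
```

The result has one entry per non-empty hexagon, however many points went in, which is why it is small enough to plot locally.
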
- 14. Table of Contents: 1 Intro to Hadoop & R · 2 Counting (& Graphics) · 3 Details of mapreduce · 4 Scoring, sampling & simulating · 5 Data modelling · 6 Final remarks
- 15. What you see: input/output, map, reduce
  - input:
    - Type: text, csv, R object,...
    - Options: separator,...
  - output: similar to input
  - map & reduce:
    - Functions with a (k,v) argument (k, key; v, value)
    - They return a (k,v) list
    - Thus, mapreduces can be chained together (the output of the first one is the input for the second)
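
Because both map and reduce take and return (k,v) pairs, one job's output can feed the next. The sketch below illustrates that chaining in plain Python; the toy `mapreduce` driver, the function names, and the event data are assumptions for this example and do not mirror the rmr2 API. Job 1 counts events per user; job 2 counts how many users share each event count.

```python
from collections import defaultdict


def mapreduce(pairs, map_fn, reduce_fn):
    """Toy in-memory job: takes a list of (k, v) pairs, returns a list of (k, v) pairs."""
    groups = defaultdict(list)
    for k, v in pairs:
        for out_k, out_v in map_fn(k, v):
            groups[out_k].append(out_v)
    result = []
    for k in sorted(groups):  # sorted only to make the output order predictable
        result.extend(reduce_fn(k, groups[k]))
    return result


def map_events(_, user):
    """Job 1 map: one count per event, keyed by user."""
    return [(user, 1)]


def map_invert(user, n_events):
    """Job 2 map: re-key each user by their event count."""
    return [(n_events, 1)]


def reduce_sum(key, values):
    """Shared reduce: sum the values of one key."""
    return [(key, sum(values))]


# Hypothetical event log; the input key is unused, so it is None.
events = [(None, u) for u in ["ann", "bob", "ann", "cid", "ann", "bob", "dan"]]

per_user = mapreduce(events, map_events, reduce_sum)     # first job
histogram = mapreduce(per_user, map_invert, reduce_sum)  # second job consumes the first
```
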
- 16. What you don’t see

      $HADOOP jar $HADOOP_STREAMING \
        -D stream.map.input=typedbytes \
        -D stream.map.output=typedbytes \
        -D stream.reduce.input=typedbytes \
        -D stream.reduce.output=typedbytes \
        -D mapred.reduce.tasks=0 \
        -input /tmp/RtmpUUrNMj/file68c0185e60c \
        -output /tmp/RtmpUUrNMj/file68c04c25d5f0 \
        -mapper "Rscript rmr-streaming-map68c018acf680 " \
        -file /tmp/RtmpUUrNMj/rmr-local-env68c0101c8e8a \
        -file /tmp/RtmpUUrNMj/rmr-global-env68c03abb4080 \
        -file /tmp/RtmpUUrNMj/rmr-streaming-map68c018acf680 \
        -inputformat org.apache.hadoop.streaming.AutoInputFormat \
        -outputformat org.apache.hadoop.mapred.SequenceFileOutputForm
- 17. Table of Contents: 1 Intro to Hadoop & R · 2 Counting (& Graphics) · 3 Details of mapreduce · 4 Scoring, sampling & simulating · 5 Data modelling · 6 Final remarks
- 18. Scoring
  - External consultants build a model (using R and small data)
  - Models in R should have a predict method
  - You can then score your huge database (in batch)
  - No need to rewrite the model into your systems!
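
Batch scoring is a map-only job: each chunk is scored independently and nothing needs to be reduced. The following Python sketch illustrates the idea under stated assumptions: `LogisticModel` is a hypothetical stand-in for a model fitted elsewhere on small data, its coefficients are made up, and `score_chunk` plays the role of the mapper.

```python
import math


class LogisticModel:
    """Stand-in for a model fitted elsewhere; the coefficients are made up."""

    def __init__(self, intercept, slope):
        self.intercept = intercept
        self.slope = slope

    def predict(self, xs):
        """Return the predicted probability for each input value."""
        return [1.0 / (1.0 + math.exp(-(self.intercept + self.slope * x))) for x in xs]


def score_chunk(model, chunk):
    """Map-only step: attach a score to every (record_id, x) pair in one chunk."""
    ids = [record_id for record_id, _ in chunk]
    scores = model.predict([x for _, x in chunk])
    return list(zip(ids, scores))


model = LogisticModel(intercept=0.0, slope=1.0)

# Hypothetical (record_id, x) records split into two chunks.
chunks = [[(1, 0.0), (2, 2.0)], [(3, -2.0)]]
scored = [pair for chunk in chunks for pair in score_chunk(model, chunk)]
```

The only thing shipped to the data is the fitted model object and its `predict` method, which is the point of the slide: the model does not have to be re-implemented in another system.
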
- 19. The case for sampling
  - Sampling works!
  - Sampled datasets can be used to build small data models
  - You can use R (& mapreduce) to sample data, but you had better not
- 20. Running simulations on Hadoop
  - Some (many?) people say it is not the right tool
  - Mapreduce jobs need input data, but simulations often have none
  - You want to control the number of mappers (which run your simulations)
  - Still, mapreduce is nice for simulations...
  - ... so let an old dog try its dirty trick!
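
The slides do not spell out the trick, so the following is only one plausible workaround, assumed here for illustration: feed the job a tiny dummy input with one seed per record, so that the number of records controls how many simulations run. The Python sketch below estimates pi by Monte Carlo; `simulate`, `reduce_hits`, the seeds, and the number of draws are all hypothetical choices.

```python
import random


def simulate(seed, draws=20000):
    """Map: run one independent simulation, seeded by the input record."""
    rng = random.Random(seed)  # a private generator keeps the runs reproducible
    hits = 0
    for _ in range(draws):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # the point falls inside the quarter circle
            hits += 1
    return ("pi", (hits, draws))


def reduce_hits(values):
    """Reduce: pool the hits of all simulations into a single estimate of pi."""
    hits = sum(h for h, _ in values)
    draws = sum(d for _, d in values)
    return 4.0 * hits / draws


seeds = list(range(8))  # the "dummy input": one record per simulation
pairs = [simulate(seed) for seed in seeds]
estimate = reduce_hits([value for _, value in pairs])
```

Distinct seeds matter here: if every mapper started from the same random state, the runs would be copies of each other rather than independent simulations.
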
- 21. Table of Contents: 1 Intro to Hadoop & R · 2 Counting (& Graphics) · 3 Details of mapreduce · 4 Scoring, sampling & simulating · 5 Data modelling (Linear Regression; Logistic Regression; Trees & Random Forests) · 6 Final remarks
- 22. Linear regression can be parallelized
  - Simple linear regression: $y \sim \alpha + \beta x$

    $$\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{j=1}^{n} y_j}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2}$$

  - Operations are case by case! Every sum runs over individual observations, so each chunk can compute its own partial sums.
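
The formula on slide 22 only needs five totals: n, Σx, Σy, Σxy and Σx². A minimal Python sketch of that idea follows; the function names and the data are assumptions for illustration. Each chunk produces its own totals (map), the totals are added (reduce), and the slope is computed once at the end. The intercept uses the standard identity α̂ = ȳ − β̂x̄, which is not on the slide.

```python
from functools import reduce


def chunk_stats(chunk):
    """Map: sufficient statistics (n, sum x, sum y, sum xy, sum x^2) of one chunk."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxy = sum(x * y for x, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    return (n, sx, sy, sxy, sxx)


def combine(a, b):
    """Reduce: sufficient statistics simply add up across chunks."""
    return tuple(u + v for u, v in zip(a, b))


def fit(stats):
    """Apply the closed-form estimates to the pooled statistics."""
    n, sx, sy, sxy, sxx = stats
    beta = (sxy - sx * sy / n) / (sxx - sx * sx / n)
    alpha = sy / n - beta * sx / n
    return alpha, beta


# Hypothetical (x, y) data lying exactly on y = 2 + 3x, split into two chunks.
chunks = [[(0, 2), (1, 5)], [(2, 8), (3, 11), (4, 14)]]
total = reduce(combine, map(chunk_stats, chunks))
alpha, beta = fit(total)
```
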
- 23. Multiple linear regression
  - Based on $X^\top X$ and $X^\top y$: $\hat{\beta} = (X^\top X)^{-1} X^\top y$
  - If $X$ is split into row blocks $X_1, \dots, X_n$ (that is, $X^\top = [X_1^\top | \dots | X_n^\top]$), then $X^\top X = \sum_i X_i^\top X_i$, and likewise $X^\top y = \sum_i X_i^\top y_i$.
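
The block identity on slide 23 translates directly into a map step (per-block products) and a reduce step (element-wise sums). The sketch below uses plain Python lists so that it is self-contained; the function names, the small Gaussian-elimination solver, and the data are assumptions made for this example, and a real job would use a linear algebra library instead.

```python
from functools import reduce


def block_products(block):
    """Map: X_i' X_i and X_i' y_i for one block of (row, y) observations."""
    p = len(block[0][0])
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for row, y in block:
        for a in range(p):
            xty[a] += row[a] * y
            for b in range(p):
                xtx[a][b] += row[a] * row[b]
    return xtx, xty


def add_products(left, right):
    """Reduce: the per-block products add up element-wise."""
    (xtx1, xty1), (xtx2, xty2) = left, right
    xtx = [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(xtx1, xtx2)]
    xty = [u + v for u, v in zip(xty1, xty2)]
    return xtx, xty


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


# Hypothetical data lying exactly on y = 1 + 2*x1 - 0.5*x2, split into two row blocks.
rows = [([1.0, float(i), float(i * i % 7)], 1.0 + 2.0 * i - 0.5 * (i * i % 7))
        for i in range(10)]
blocks = [rows[:4], rows[4:]]
xtx, xty = reduce(add_products, map(block_products, blocks))
beta = solve(xtx, xty)
```

Only the small p × p matrix and p-vector travel to the reducer, regardless of how many rows each block holds.
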
- 24. Can logistic regression be parallelized? Yes and no.
  - Fitting logistic regression models is iterative, and the iterations cannot run in parallel because each one depends on the previous one.
  - However, the work inside each iteration can be parallelized (it is not unlike fitting linear models as before)
  - We will explore two big data alternatives:
    - Parallelize each iteration using mapreduce (see http://goo.gl/ftx36r)
    - Split your data meaningfully and do standard logistic regression in the nodes
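
To illustrate the first alternative, the sketch below runs Newton's method for a logistic model with an intercept and one slope. It is an assumption-laden Python illustration, not the code behind the linked article: the function names and the toy data are hypothetical. Within each iteration, every chunk contributes its share of the gradient and Hessian (map) and the shares are summed (reduce); the iterations themselves stay sequential.

```python
import math


def chunk_grad_hess(chunk, beta):
    """Map: gradient and Hessian contributions of one chunk at the current beta."""
    b0, b1 = beta
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in chunk:
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        w = p * (1.0 - p)
        g0 += y - p
        g1 += (y - p) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return (g0, g1, h00, h01, h11)


def newton_step(chunks, beta):
    """One iteration: map over the chunks, sum the pieces, solve the 2x2 system."""
    parts = [chunk_grad_hess(chunk, beta) for chunk in chunks]  # parallel on a cluster
    g0, g1, h00, h01, h11 = (sum(col) for col in zip(*parts))   # reduce
    det = h00 * h11 - h01 * h01
    d0 = (h11 * g0 - h01 * g1) / det
    d1 = (h00 * g1 - h01 * g0) / det
    return (beta[0] + d0, beta[1] + d1), (g0, g1)


# Hypothetical, non-separable (x, y) data split into two chunks.
chunks = [[(-2, 0), (-1, 0), (0, 1)], [(1, 0), (2, 1), (3, 1)]]

beta = (0.0, 0.0)
for _ in range(25):  # iterations are sequential: each needs the previous beta
    beta, grad = newton_step(chunks, beta)
```

After the loop, `grad` holds the gradient evaluated just before the last update, so a value near zero indicates convergence.
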
- 25. How many bytes make knowledge? (aka the fractal nature of big data)
- 26. Split logistic regression
- 27. Viable alternatives to logistic models
  - Trees
    - High interpretability
    - But unstable, and they tend to miss details
  - Random forests
    - Black boxes
    - Superb performance
    - These are collections of trees that can be built in parallel
  - Both can be parallelized in different ways:
    - Similar to partitioned logistic models above
    - Within training
- 28. Table of Contents: 1 Intro to Hadoop & R · 2 Counting (& Graphics) · 3 Details of mapreduce · 4 Scoring, sampling & simulating · 5 Data modelling · 6 Final remarks
- 29. Forget most of what you learned today, seriously
  - People strive to extend small data models to big data (as we did today)...
  - ... but is it the way to go?
  - Beware of microlocal structure:
    - Small data people know microlocal structure as outliers
    - Global models (linear, logistic,...) cannot (easily?) exploit microlocal structure
    - But the promises of big data lie precisely there
    - (Otherwise, just sample and you will be fine)
  - Areas to watch for insights on big data modelling:
    - SNA (social network analysis)
    - Text analysis
- 30. Thank you very much and... questions?
