This document summarizes a presentation about working with large datasets in R. It discusses how R loads all data into memory by default, which can cause issues for large datasets that exceed available RAM. It then overviews several packages for working with "big data" in R, including bigmemory and ff, which allow accessing data from files without loading entirely into memory. The document provides examples of using bigmemory to create large matrices stored on disk but shared across multiple R sessions, as well as applying it to analyze a large airline on-time performance dataset consisting of 120 million rows.
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Datasets, LA R Users' Group 8/17/10
1. Los Angeles R Users’ Group
Taking R to the Limit, Part II:
Working with Large Datasets
Ryan R. Rosario
August 17, 2010
2. The Brutal Truth
We are here because we love R. Despite our enthusiasm, R has two
major limitations, and some people may have a longer list.
1 Regardless of the number of cores on your CPU, R will only
use 1 on a default build. (Part I)
2 R reads data into memory by default. Other packages (SAS
particularly) and programming languages can read data from
files on demand.
Easy to exhaust RAM by storing unnecessary data.
The OS and system architecture can only access 2^32 bytes = 4GB
of memory on a 32-bit system, but typically R will throw an
exception at 2GB.
Not wise to use more memory than available. System will start
swapping which leads to thrashing and slows your system to a
crawl.
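A quick way to get a feel for this in practice (a small sketch, not from the original slides):
x <- rnorm(1e6)                       # one million doubles
print(object.size(x), units = "MB")   # roughly 8 MB, since a double takes 8 bytes
gc()                                  # report memory use and trigger garbage collection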
3. The Brutal Truth
There are a couple of solutions:
1 Buy more RAM.
2 Use a database.
3 Build from source and use special build options for 64 bit.
Still not a solution!
1 R, even R64, does not have an int64 data type [1], so it is not
possible to index data frames or matrices with a huge number of rows
or columns.
4 Sample, resample, or use some Monte Carlo method.
5 Let clever developers solve these problems for you!
We will discuss number 5 above.
[1] http://permalink.gmane.org/gmane.comp.lang.r.devel/17281
4. The Agenda
This talk will survey several HPC packages in R with some
demonstrations. We will focus on four areas of HPC:
1 Explicit Parallelism: the user controls the parallelization.
(Part I)
2 Implicit Parallelism: the system abstracts it away. (Part I)
3 Large Memory: working with large amounts of data
without choking R.
4 Map/Reduce: basically an abstraction for parallelism that
divides data rather than architecture.
5. Disclaimer
1 There is a ton of material here. I provide a lot of slides for
your reference.
2 We may not be able to go over all of them in the time allowed
so some slides will go by quickly.
3 Demonstrations will be time permitting.
4 All experiments were run on a Ubuntu 10.04 system with an
Intel Core 2 6600 CPU (2.4GHz) with 2GB RAM. I am
planning a sizable upgrade this summer.
6. Big Data
“Big Data” is a catch phrase for any dataset or data application that does not
fit into available RAM on one system.
R has a few packages for big data support. We will talk about the
following:
1 bigmemory
2 ff
We will also discuss some uses of parallelism to accomplish the same goal using
Hadoop and MapReduce:
1 HadoopStreaming
2 Rhipe
7. Some Basic Terminology
I will use the word RAM to refer to physical memory, for
simplicity: the chips that are installed on the system motherboard.
Virtual memory, or swap is disk space that is used to store
objects that do not fit into RAM and are less frequently accessed.
This is SLOW.
A cluster is a group of systems that communicate with each other
to accomplish a computation.
8. Background bigmemory ff
Large Datasets
R reads data into RAM all at once, if using the usual read.table
function. Objects in R live in memory entirely. Keeping
unnecessary data in RAM will cause R to choke eventually.
Specifically,
1 on most systems it is not possible to use more than 2GB of
memory.
2 the range of indexes that can be used is limited due to lack of
a 64 bit integer data type in R and R64.
3 on 32 bit systems, the maximum amount of virtual memory
space is limited to between 2 and 4GB.
4 relying on virtual memory will cause the system to grind to a
halt – “thrashing.”
9. Background bigmemory ff
Large Datasets
There are two major solutions in R:
1 bigmemory: “It is ideal for problems involving the analysis in
R of manageable subsets of the data, or when an analysis is
conducted mostly in C++.” It’s part of the “big” family,
some of which we will discuss.
2 ff: file-based access to datasets that cannot fit in memory.
3 can also use databases which provide fast read/write access
for piecemeal analysis.
10. Background bigmemory ff
The “big” Family
The big family consists of several packages for performing tasks
on large datasets.
1 bigmemory is our focus.
2 biganalytics provides analysis routines on big.matrix
such as GLM and bigkmeans.
3 synchronicity adds Boost mutex functionality to R.
4 bigtabulate adds table and split-like support for R
matrices and big.matrix memory efficiently.
5 bigalgebra provides BLAS and LAPACK linear algebra
routines for native R matrices and big.matrix.
6 bigvideo provides video camera streaming via OpenCV.
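As a hedged sketch of how the family fits together (not from the slides; it assumes bigkmeans accepts a big.matrix as documented in biganalytics):
library(bigmemory)
library(biganalytics)
X <- big.matrix(1000, 3, type = "double", init = 0)
for (j in 1:3) X[, j] <- rnorm(1000)   # fill with toy data, one column at a time
cl <- bigkmeans(X, centers = 3)        # k-means without copying X into an ordinary matrix
cl$size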
11. Background bigmemory ff
bigmemory
bigmemory implements several matrix objects.
1 big.matrix is an R object that simply points to a data
structure in C++. Local to a single R process and is limited
by available RAM.
2 shared.big.matrix is similar, but can be shared among
multiple R processes (think parallelism on data!)
3 filebacked.big.matrix does not point to a data structure;
rather, it points to a file on disk containing the matrix, and
the file can be shared across a cluster!
Pitfall! Remember that matrices contain only one type of data.
Additionally, the data types for elements are dictated by C++ not
R: double, integer, short, char.
12. Background bigmemory ff
bigmemory and Shared Memory
Shared Memory allows us to store data in RAM and share it among multiple
processes. Suppose we want to store some data in shared memory so it can be
read by multiple instances of R. This allows the user the ability to use
multiple instances of R for performing different analytics simultaneously.
[Diagram: a dataset held in shared memory (RAM), with caching, accessed simultaneously by multiple R sessions — two web users and a web administrator doing analysis.]
13. Background bigmemory ff
Demonstration: The Logic
First, construct a big.matrix object. Let us suppose we want to
create a large matrix of 0s and 1s.
We can construct a matrix of type int (4 bytes), short (2 bytes),
double (8 bytes), or char (1 byte). Since all we need is 0 and 1,
we use char. We also zero out the matrix.
> A <- big.matrix(m, n, type="char", init=0, shared=TRUE)
> A
An object of class big.matrix
Slot "address":
<pointer: 0x2d93490>
14. Background bigmemory ff
Demonstration: The Logic
We have now created a pointer to a C++ matrix that is on disk.
But, to share this matrix we need to share this descriptor.
Then, we can open a second R session, load the location of the
matrix from disk, and bind the matrix to an R variable!
15. Background bigmemory ff
Demonstration: The Code
Session 1
library(bigmemory)
library(biganalytics)
options(bigmemory.typecast.warning = FALSE)

A <- big.matrix(5000, 5000, type = "char", init = 0)
# Fill the matrix by randomly picking positions for a 1.
x <- sample(1:5000, size = 5000, replace = TRUE)
y <- sample(1:5000, size = 5000, replace = TRUE)
for (i in 1:5000) {
  A[x[i], y[i]] <- 1
}
# Get the location in RAM of the pointer to A.
desc <- describe(A)
# Write it to disk.
dput(desc, file = "/tmp/A.desc")
sums <- colsum(A, 1:20)
16. Background bigmemory ff
Session 2
library(bigmemory)
library(biganalytics)

# Read the pointer from disk.
desc <- dget("/tmp/A.desc")
# Attach to the pointer in RAM.
A <- attach.big.matrix(desc)
# Check our results.
sums <- colsum(A, 1:20)
17. Background bigmemory ff
Demonstration 1
Demonstration here.
18. Background bigmemory ff
bigmemory: Importance of Data Type
Suppose we are not careful in the C++ data type we use for the
big.matrix. Using char, the matrix requires about 24MB of
RAM.
Data Type RAM
char 24 MB
int 96 MB
double 192 MB
short 48 MB
Of course, this assumes that you use dense matrices and there is
no CPU optimization.
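The table can be reproduced with back-of-the-envelope arithmetic (a quick check, not from the slides):
bytes <- c(char = 1, short = 2, int = 4, double = 8)
5000 * 5000 * bytes / 2^20   # megabytes: roughly 24, 48, 95 and 191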
19. Background bigmemory ff
bigmemory: Other Operations
big.matrix requires its own optimized versions of the R matrix
functions provided in biganalytics:
colmean(x, cols, na.rm)
colmin(x, cols, na.rm)
colmax(x, cols, na.rm)
colvar(x, cols, na.rm)
colsd(x, cols, na.rm)
colsum(x, cols, na.rm)
colprod(x, cols, na.rm)
colna(x, cols)
where x is a big.matrix, cols is a vector of column indices,
na.rm is TRUE if R should remove missing values first.
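For example, on the matrix A built earlier, a usage sketch (assuming A has at least five columns) would be:
colmean(A, cols = 1:5, na.rm = TRUE)   # means of the first five columns
colsd(A, cols = 1:5, na.rm = TRUE)     # standard deviations of the same columns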
20. Background bigmemory ff
The big Example: The Data
Let’s do something useful with some big data. Consider airline
on-time performance from 1987 to 2008. This is the format of the
data:
11GB comma-separated values file.
120 million rows, 29 columns
Factors coded as integers.
We will estimate the age of an aircraft at each departure.
21. Background bigmemory ff
Step 1: Read in the Data
library(bigmemory)
library(biganalytics)
x <- read.big.matrix("airline.csv", type = "integer", header = TRUE,
                     backingfile = "airline.bin",
                     descriptorfile = "airline.desc",
                     extraCols = "Age")
Initially takes about 28 minutes to run, the first time only.
Subsequent accesses are very fast.
22. Background bigmemory ff
Step 1: Read in the Data
read.big.matrix inherits from read.table so your favorite
parameters are available for use.
It also adds a few more:
1 type, the C++ type to use for the matrix.
2 separated, separate the columns into individual files if TRUE.
3 extraCols explicitly adds columns to the matrix that you
may use later.
4 backingfile, backingpath and descriptorfile control
where important data about the matrix is stored on disk.
23. Background bigmemory ff
Step 2a: Some Initial Tasks
Next, estimate the birthmonth of the plane using the first
departure of that plane.
birthmonth <- function(y) {
  minYear <- min(y[, 'Year'], na.rm = TRUE)
  these <- which(y[, 'Year'] == minYear)
  minMonth <- min(y[these, 'Month'], na.rm = TRUE)
  return(12 * minYear + minMonth - 1)
}
24. Background bigmemory ff
Step 2b v1: Calculate Each Plane’s Birthmonth the Dumb
Way
We could use a loop or possibly an apply variant...
aircrafts <- unique(x[, 'TailNum'])
acStart <- rep(0, length(aircrafts))
for (i in aircrafts) {
  acStart[i] <- birthmonth(x[mwhich(x, 'TailNum', i, 'eq'),
                             c('Year', 'Month'), drop = FALSE])
}
...this takes about 9 hours...
25. Background bigmemory ff
Step 2b v2: Calculate Each Plane’s Birthmonth the big
Way
First separate/divide the data by tail number (aircraft ID).
library(bigtabulate)
acindices <- bigsplit(x, 'TailNum')
Each entry i of acindices contains a vector of indices
corresponding to TailNum i.
bigsplit runs about twice as fast as split (6s) and requires
about 2/3 peak RAM.
26. Background bigmemory ff
Step 2c v1: Compute an Estimate of Birthmonth using
sapply
Now that we have used bigsplit to quickly split up the
big.matrix, we can use the native sapply:
acStart <- sapply(acindices, function(i)
  birthmonth(x[i, c('Year', 'Month'), drop = FALSE]))
which took a mere 8 seconds!
27. Background bigmemory ff
Step 2c v2: Compute an Estimate of Birthmonth using
foreach
We can also use this opportunity to revisit foreach:
library(doMC)
registerDoMC(cores = 2)
acStart <- foreach(i = acindices, .combine = c) %dopar% {
  return(birthmonth(x[i, c('Year', 'Month'), drop = FALSE]))
}
Both cores share access to the same instance of the data smoothly.
Year and Month are cached in RAM. This example took 0.663s.
28. Background bigmemory ff
Step 3: Finally, Compute an Estimate of Age
x[, 'Age'] <- x[, 'Year'] * as.integer(12) + x[, 'Month'] -
  as.integer(acStart[x[, 'TailNum']])
Here, arithmetic is conducted using R vectors that are extracted
from the big.matrix. Careful use of as.integer helps reduce
memory overhead.
This data management task requires 8 minutes.
29. Background bigmemory ff
Before we Continue: The Point
The point of this demonstration:
1 show how, by using file descriptors and file-backed matrices,
working with big data is easy.
2 show how bigmemory can be integrated with parallelism
packages relatively easily.
3 show how other big packages assist with the purpose of the
bigmemory package to provide the user the familiar R
interface after performing some initial maintenance.
30. Background bigmemory ff
The big Example 2: Linear Models
Now we will try to predict arrival delay (ArrDelay) using Age and
Year.
\widehat{ArrDelay} = \hat{\beta}_0 + \hat{\beta}_1\,\mathrm{Age} + \hat{\beta}_2\,\mathrm{Year}
biganalytics provides a wrapper to the biglm package.
31. Background bigmemory ff
The big Example 2: Linear Models
Fitting a linear model looks relatively familiar...
library(biganalytics)
blm <- biglm.big.matrix(ArrDelay ~ Age + Year, data = x)
1 Without bigmemory, this process would take about 10GB of
RAM.
2 Expected time to completion would be ∞ since most systems
do not have that much RAM.
3 With bigmemory processing took 4.5 minutes with a few
hundred MB of memory overhead.
32. Background bigmemory ff
The big Example 2: Linear Models
Just like with base lm, we can get information about the fitted
model.
> blm
Large data regression model: biglm(formula = formula, data = data, ...)
Sample size = 6452
> summary(blm)
Large data regression model: biglm(formula = formula, data = data, ...)
Sample size = 6452
Coef (95% CI) SE p
(Intercept) 6580.4311 1916.9282 11243.9340 2331.7515 0.0048
Age 0.2687 0.0804 0.4569 0.0941 0.0043
Year -3.2753 -5.6005 -0.9500 1.1626 0.0048
33. Background bigmemory ff
bigmemory: Other Features
can write a big.matrix to ASCII file using
write.big.matrix.
can bin data for counting, creating histograms or
visualizations using binit.
create a copy of the content using deepcopy.
can create a hash into a big.matrix using hash.mat
search a matrix using mwhich including powerful comparison
operators using C++, not R.
separated columns: columns of a matrix are separated in
RAM, rather than contiguous.
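A hedged sketch of two of these, reusing the airline big.matrix x from the earlier example (column names as before; the output filename is made up):
idx <- mwhich(x, c("Year", "Month"), list(2000, 1), list("eq", "eq"), op = "AND")
length(idx)                                          # flights departing in January 2000
write.big.matrix(x, "airline_copy.csv", sep = ",")   # dump the matrix to an ASCII file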
34. Background bigmemory ff
bigmemory: Important Considerations
passing a big.matrix to a function is call by reference not
call by value!
Nerd alert! bigmemory provides transparent read/write
locking to big.matrix, so race conditions are minimized.
For more information, see http://www.bigmemory.org, or the
documentation.
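A tiny illustration of call by reference (a sketch, not from the slides), using the matrix A from the shared-memory demo:
set_first <- function(m) { m[1, 1] <- 1; invisible(NULL) }
set_first(A)
A[1, 1]   # 1 -- the function changed the underlying matrix, not a private copy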
35. Background bigmemory ff
bigmemory: Conclusion
Advantages
1 can store a matrix in memory, restart R, and gain access to
the matrix without reloading data. Great for big data.
2 can share the matrix among multiple R instances or sessions.
3 access is fast because RAM is fast. C++ also helps!
Disadvantages
1 no communication among instances of R; can use files instead.
2 limited by available RAM, unless using
filebacked.big.matrix.
3 matrix disappears on reboot, unless using
filebacked.big.matrix.
4 filesize limitations on 32 bit systems for filebacked matrices.
36. Background bigmemory ff
ff: “fast access files”
ff is another solution that is based on using files.
provides data structures that are stored on disk.
they act as if they are in memory; only necessary/active parts
of the data from disk are mapped into main memory.
supports R standard atomic types: double, logical, raw,
integer,
as well as non-standard atomic types boolean (1 bit), quad
(2 bit unsigned), nibble (4 bit unsigned), byte (1 byte
signed with NA), ubyte (1 byte unsigned), short and ushort
(2 byte signed w/NA and unsigned resp.), and single (4 byte
float with NA).
and non-atomic types such as factor, ordered, POSIXct,
Date, etc.
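A small sketch of the vmode argument (values as listed above):
library(ff)
flags  <- ff(vmode = "boolean", length = 1e6)   # 1 bit per element
counts <- ff(vmode = "ubyte",   length = 1e6)   # 1 unsigned byte per element
vmode(flags); vmode(counts)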
37. Background bigmemory ff
ff
C support for vectors, matrices and arrays. (FAST)
provides an analog for data.frame called ffdf, also with
import/export functionality.
ff objects can be stored and reopened across R sessions.
ff files can be shared by multiple ff R objects in the same, or
different R sessions.
tons of optimizations provide little noticeable overhead.
Virtual functions allow matrix operations without touching a
single byte on disk.
Disk I/O is SLOW. ff optimizes by using binary files.
closely integrated with package bit which can manipulate
and process, well, bits.
38. Background bigmemory ff
ff: The Logic
In bigmemory R keeps a pointer to a C++ matrix. The matrix is
stored in RAM or on disk. In ff R keeps metadata about the
object, and the object is stored in a flat binary file.
ff is somewhat difficult to jump into because there are so few
examples and only one tutorial. There is a lot of information about
the technical side of the package.
In any case, the goal is to get rid of the following error message:
> x <- rep(0, 2^31 - 1)
Error: cannot allocate vector of length 2147483647
39. Background bigmemory ff
ff
The package ff provides the following functionality to the user:
creating and/or opening flat files using ffopen. Using the
parameters length or dim will create a new file.
I/O operations using the common brackets [ ] notation.
Other functions for ff objects such as the usual dim and
length as well as some other useful functions such as sample.
40. Background bigmemory ff
ff Example: Introduction
Let’s start with an introductory demonstration. To create a
one-dimensional flat file,
library(ff)
# creating the file
my.obj <- ff(vmode = "double", length = 10)
# modifying the file
my.obj[1:10] <- iris$Sepal.Width[1:10]
Let’s take a look...
41. Background bigmemory ff
ff Example: Introduction
We can also create a multi-dimensional flat file.
# creating a multidimensional file
multi.obj <- ff(vmode = "double", dim = c(10, 3))
multi.obj[1:10, 1] <- iris[1:10, 1]
Let’s take a look...
42. Background bigmemory ff
ff Example: Introduction
We can also create an ff data frame made up of ff atomics.
Girth <- ff(trees$Girth)
Height <- ff(trees$Height)
Volume <- ff(trees$Volume)
# Create data frame with some added parameters.
fftrees <- ffdf(Girth = Girth, Height = Height, Volume = Volume)
Let’s take a look...
43. Background bigmemory ff
ff Parameters
ff, ffm and to some degree ffdf support some other options you
may need. R is usually pretty good at picking good defaults for
you. Some of them are below:
Parameter Description
initdata Value to use to initialize the object for construction.
length Optional length of vector. Used for construction.
vmode Virtual storage mode (makes big difference in memory overhead).
filename Give a name for the file created for the object.
overwrite If TRUE, allows overwriting of file objects.
If no filename is given, a new file is created. If a filename is given
and the file exists, the object will be loaded into R.
44. Background bigmemory ff
ff Saving Data
We can use the ffsave function to save a series of objects to file. The specified
objects are saved using save and are given the extension .RData. The ff files related
to the objects are saved and zipped using an external zip application, and given the
extension .ffData. ffsave has some useful parameters. See ?ffsave for more
information:
ffsave(..., list = character(0L), file = stop("'file' must be specified"),
       envir = parent.frame(), rootpath = NULL, add = FALSE,
       move = FALSE, compress = !move, compression_level = 6,
       precheck = TRUE)
... the objects to be saved.
the file parameter specifies the root name (no extension) for the file. It is
best to use absolute paths.
add=TRUE to add these objects to an existing archive.
compress=TRUE to save and compress.
safe=TRUE to write a temporary file first for verification, then move to a
persistent location.
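A hypothetical call, reusing the fftrees object created earlier (the path is an assumption):
ffsave(fftrees, file = "/tmp/trees")   # writes /tmp/trees.RData and /tmp/trees.ffData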
45. Background bigmemory ff
ff: Other File I/O Operations
Before discussing ffload in a few slides, there are some other
operations worth mentioning.
ffsave.image allows the user to save the entire workspace to
an ffarchive.
ffinfo, when passed the path for an ffarchive (without
extension) displays information about the archive.
ffdrop allows the user to delete an ffarchive (no
extension).
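Hypothetical usage of these companions, against the archive saved above:
ffinfo(file = "/tmp/trees")   # inspect what the archive contains
ffload(file = "/tmp/trees")   # restore the saved objects into the workspace
ffdrop(file = "/tmp/trees")   # delete the archive files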
46. Background bigmemory ff
ff Example
As we saw earlier, we can use biglm to fit a linear model to big
data.
library(biganalytics)
model <- biglm(log(Volume) ~ log(Girth) + log(Height), data = fftrees)
Demonstration here.
47. Background bigmemory ff
ff: Conclusion
Advantages
1 allows R to work with multiple HUGE datasets.
2 clean system; does not make a mess with a ton of files.
3 several optimizations show that ff has a bright future.
Disadvantages
1 few examples; somewhat difficult to introduce.
2 performing analysis may require some clever forethought since
not all of the data is in RAM.
3 unzipping files on load is a huge bottleneck.
48. Background bigmemory ff
bigmemory vs. ff
Which one to use is a matter of taste. Performance is about the
same: the first row of numbers is the initial processing and the
second uses caching:
Source: http://www.agrocampus-ouest.fr/math/useR-2009/slides/Emerson+Kane.pdf
49. Map/Reduce mapReduce HadoopStreaming
MapReduce
MapReduce is a way of dividing a large job into many smaller jobs producing
an output, and then combining the individual outputs into one output. It is a
classic divide and conquer approach that is embarrassingly parallel and can
easily be distributed among the cores of a CPU, or among many CPUs and
nodes. Oh, and it is patented by Google, its biggest user.
Two fundamental steps:
1 map step: perform some operation f , in parallel, on data and output a
key/value pair for each record/row.
2 reduce step: group common elements and compute some summary
statistic, one per group.
The notion of map and reduce comes from Lisp and functional programming,
where map performs some operation on every element in a collection, and
reduce collapses the results of that operation into one result for the collection.
50. Map/Reduce mapReduce HadoopStreaming
Example: Word Counts
Suppose we have 5 manuscripts and we want to count the number of times each
word occurs. (This is a common task in data mining and natural language
processing.) The map phase parses the text and produces output like word:
count.
The reduce phase then groups records by word and sums (the reduce
operation) the counts.
the: 200 the: 200 the: 200 the: 200 the: 200
it: 750 it: 750 it: 750 it: 750 it: 750
and: 1000 and: 1000 and: 1000 and: 1000 and: 1000
fire: 1 fire: 1 fire: 1 fire: 1 fire: 1
truck: 1 truck: 1 truck: 1 truck: 1 truck: 1
the: 1000
it: 3750
and: 5000
fire: 5
truck: 5
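The same idea in a few lines of plain R (a toy sketch, no Hadoop involved; the manuscripts are made up):
manuscripts <- c("the fire truck and the dog", "the cat and the dog")
mapped <- lapply(manuscripts, function(txt) table(strsplit(txt, " ")[[1]]))  # map: per-manuscript word counts
all_counts <- unlist(mapped)                            # key/value pairs, keyed by word
reduced <- tapply(all_counts, names(all_counts), sum)   # reduce: sum the counts for each word
reduced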
51. Map/Reduce mapReduce HadoopStreaming
mapReduce
mapReduce is a pure R implementation of MapReduce. As a matter
of fact, the authors of the package state that mapReduce is simply:
apply(map(data), reduce)
By default, mapReduce uses the same parallelization functionality
as sapply, which is not preferable.
52. Map/Reduce mapReduce HadoopStreaming
mapReduce
The form of the mapReduce function is very simple.
mapReduce(map, ..., data, apply = sapply)
map is an expression that yields a vector that partitions data
into groups. Can use a single variable as well. This is a bit
different from Hadoop’s implementation.
... is one or more reduce functions; typically summary
statistics.
data is a data.frame.
apply is the parallelization toolkit to use: papply,
multicore, snow.
53. Map/Reduce mapReduce HadoopStreaming
mapReduce Demonstration
data(iris)
mapReduce(
  map = Species,
  mean.sepal.length = mean(Sepal.Length),
  max.sepal.length = max(Sepal.Length),
  data = iris
)
mean.sepal.length max.sepal.length
setosa 5.006 5.8
versicolor 5.936 7.0
virginica 6.588 7.9
Demonstration here.
54. Map/Reduce mapReduce HadoopStreaming
Hadoop
Hadoop is an open-source implementation of MapReduce that has
gained a lot of traction in the data mining community.
http://hadoop.apache.org
Cloudera and Yahoo provide their own implementations of Hadoop,
which may be easier to use in the cloud:
http://www.cloudera.com/developers/downloads/hadoop-distro/
http://developer.yahoo.com/hadoop/distribution/
55. Map/Reduce mapReduce HadoopStreaming
Hadoop Infrastructure
Hadoop is a separate open source project, so we will not discuss it in
detail.
1 Hadoop allows use of parallelism to solve problems using big
data.
2 Hadoop is most effective with a cluster and Hadoop assigns
certain machines critical tasks.
3 Hadoop uses its own filesystem by default, called HDFS.
4 Map and reduce functions are the same as mentioned in the
Intro to Map-Reduce slide.
5 Map, reduce, job control, logging etc. methods are written in
Java.
We will exploit some workarounds to deal with point #5.
56. Map/Reduce mapReduce HadoopStreaming
Hadoop Installation
Installing Hadoop on most systems is simple.
1 Download the tarball from
http://www.apache.org/dyn/closer.cgi/hadoop/core/
2 Extract using tar xf.
3 Add the Hadoop bin directory to your path.
4 Set the environment variable JAVA_HOME to point to the location
of system Java (Oracle/Sun preferred).
Hadoop is then ready to run locally. To run on a cluster requires
some more configuration beyond the scope of this talk.
57. Map/Reduce mapReduce HadoopStreaming
HadoopStreaming
Typically, Hadoop jobs are written in Java. This was a roadblock
to many developers that would incur time having to port code over
to Java.
Hadoop is distributed with Hadoop Streaming, which allows
map/reduce jobs to be written in any language as long as it can
read and write from stdin and stdout respectively.
This includes R. Cue the HadoopStreaming package!
58. Map/Reduce mapReduce HadoopStreaming
HadoopStreaming
One way to easily use Map/Reduce in R is with the
HadoopStreaming package. The workflow is as follows:
1 Open a connection: to a file, STDIN, or a pipe.
2 Write map and reduce functions either in one file, or in
separate files.
3 Run the job on data either from the command line using R
only, or with Hadoop.
59. Map/Reduce mapReduce HadoopStreaming
Data Mining with R
The data for this demonstration is a large 14GB file containing a
tweet ID and the content of a user’s tweet.
tweetID tweet_text
The typical “hello world” example for map/reduce is word
counting, so let’s construct a list of words and the number of
times they appear over all tweets in the sample.
60. Map/Reduce mapReduce HadoopStreaming
HadoopStreaming: The Map Function
The map function takes a line of input, does something with it,
and outputs a key/value pair that is sorted and passed to the
reducer. For this example, I split the text of the tweet and output
each word as it appears, and the number 1 to denote that the word
was seen once.
The key is the word, and the value is the intermediate count (1 in
this case).
61. Map/Reduce mapReduce HadoopStreaming
HadoopStreaming: The Reduce Function
Before data enters the reduce phase, it is sorted on its key. The
reduce function takes a key/value pair and performs an aggregation
on each value associated with the same key and writes it to disk
(or HDFS with Hadoop) in some format specified by the user.
In the output, the “key” is a word and the “value” is the number
of times that word appeared in all tweets in the file.
62. Map/Reduce mapReduce HadoopStreaming
HadoopStreaming: Anatomy of a HadoopStreaming Job
First, we create a file that will contain the map and reduce
functions.
#!/usr/bin/env Rscript   # allows the script to be executable

library(HadoopStreaming)   # need the package

# The user can create their own command line arguments.
# By default, certain arguments are already parsed for you.
opts <- c()
# Gets arguments from the environment.
op <- hsCmdLineArgs(opts, openConnections = TRUE)
63. Map/Reduce mapReduce HadoopStreaming
HadoopStreaming: Anatomy of a HadoopStreaming Job
When running an R script, we can pass command line arguments.
For example:
./mapReduce.R -m
Name      Short   Argument    Type        Description
mapper    m       None        logical     Runs the mapper.
reducer   r       None        logical     Runs the reducer.
infile    i       Required    character   Input file, otherwise STDIN.
outfile   o       Required    character   Output file, otherwise STDOUT.
Some selected arguments are above.
64. Map/Reduce mapReduce HadoopStreaming
HadoopStreaming: Anatomy of a HadoopStreaming Job
op is populated with a bunch of command line arguments for you. If -m is passed to a
script, op$mapper is TRUE and the mapper is run.
if (op$mapper) {
  mapper <- function(x) {
    # Tokenize each tweet.
    words <- unlist(strsplit(x, "[[:punct:][:space:]]+"))
    words <- words[!(words == '')]
    # Create a data frame with 1 column: the words.
    df <- data.frame(Word = words)
    # Add a column called Count, initialized to 1.
    df[, 'Count'] <- 1
    # Send this out to the console for the reducer.
    hsWriteTable(df[, c('Word', 'Count')], file = op$outcon, sep = ',')
  }
  # Read a line from IN.
  hsLineReader(op$incon, chunkSize = op$chunksize, FUN = mapper)
}
65. Map/Reduce mapReduce HadoopStreaming
HadoopStreaming: Anatomy of a HadoopStreaming Job
op is populated with a bunch of command line arguments for you. If -r is passed to a
script, op$reducer is TRUE and the reducer is run.
else if (op$reducer) {
  # Define the reducer function.
  # It just prints the word and the sum of the counts,
  # separated by a comma.
  reducer <- function(d) {
    cat(d[1, 'Word'], sum(d$Count), '\n', sep = ',')
  }
  # Define the column names and types for output.
  cols <- list(Word = '', Count = 0)
  hsTableReader(op$incon, cols, chunkSize = op$chunkSize, skip = 0,
                sep = ',', keyCol = 'Word', singleKey = T, ignoreKey = F,
                FUN = reducer)
}
66. Map/Reduce mapReduce HadoopStreaming
HadoopStreaming: Anatomy of a HadoopStreaming Job
Finally, we clean after ourselves and close the connections to the in
and out connections.
if (!is.na(op$infile)) {
  close(op$incon)
}

if (!is.na(op$outfile)) {
  close(op$outcon)
}
67. Map/Reduce mapReduce HadoopStreaming
HadoopStreaming: Running a HadoopStreaming Job
HadoopStreaming allows the user to pass data to map/reduce
from the command line using a pipe, or using a file. To input data
using a pipe:
cat twitter.tsv | ./count.R -m | sort | ./count.R -r
To input data using a file:
./count.R -m -i twitter.tsv | sort | ./count.R -r
68. Map/Reduce mapReduce HadoopStreaming
HadoopStreaming: Running a HadoopStreaming Job
Or we can run the script using, you know, Hadoop Streaming!
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar
-input /home/ryan/hdfs/in
-output ~/hdfs/out
-mapper "count.R -m"
-reducer "count.R -r"
-file ./count.R
69. Map/Reduce mapReduce HadoopStreaming
Hadoop produces a lot of status and progress output and provides
a web interface that you can explore when using it.
70. Map/Reduce mapReduce HadoopStreaming
HadoopStreaming: Running a HadoopStreaming Job
Output looks as follows.
a,13
about,2
action,1
acutely,1
adapting,1
affairs,1
after,1
again,2
ah,2
Ah,1
71. Map/Reduce mapReduce HadoopStreaming
HadoopStreaming vs. Rhipe
Rhipe is another R interface for Hadoop that provides a more
“native” feel to it.
1 incorporates an rhlapply function similar to the standard
apply variants.
2 uses Google Protocol Buffers.
3 seems to have great flexibility in modifying Hadoop
parameters.
4 has more use cases and flexibility in how to run jobs and how
to transmit and receive data.
5 but... is more complicated to use and requires a more
in-depth knowledge of Hadoop, which is beyond the scope of
the group’s purpose.
72. Map/Reduce mapReduce HadoopStreaming
The Hadoop Barnyard
Hadoop also has several other projects that run on top of it:
Pig: data-flow (query) language and execution framework for
parallel computing.
ZooKeeper: high-performance coordination service for
distributed applications.
Hive: data warehouse infrastructure providing summarization
and ad-hoc querying.
HBase: a scalable, distributed database supporting structured
data storage for large tables.
Avro: data serialization providing dynamic integration with
scripting languages.
Chukwa: data collection system for managing large
distributed systems.
73. Map/Reduce mapReduce HadoopStreaming
The Hadoop Barnyard
There are other packages that can interface with Hadoop, but are
part of a different family.
Mahout: suite of scalable machine learning libraries.
Nutch: provides web search and crawling application software
on top of Lucene.
Lucene: an efficient indexer for information retrieval.
74. When to Use What?
This is my personal opinion based on experience with both R and
Hadoop.
For datasets with size in the range 10GB, bigmemory and ff
handle themselves well.
For “larger” datasets, use Hadoop (integrated with R?)
Not enough research on data on the scale of TB or PB in R.
Hadoop is superior here.
75. Other Solutions: Hardware
There are other solutions to some of these problems.
Buy more RAM... lots of it... lots of fast RAM
Let your system run out of memory and configure it to use
solid state drives (SSD) for swap?
Use a very expensive SAS drive.
use a GPU.
76. Other Solutions: Software
(use the packages we discussed today)
use databases that can be called from R.
if data is sparse, use sparse matrix packages such as sparseM or slam (see the sketch after this list).
use a different language like C/C++/FORTRAN and
interface with R.
use a different language entirely.
Revolution R is beta testing big dataset support.
use Hadoop or Amazon EC2 or Elastic MapReduce with(out)
R
use SAS
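For the sparse route mentioned above, a small sketch with slam (the values and dimensions are made up):
library(slam)
m <- simple_triplet_matrix(i = c(1, 50000), j = c(1, 50000), v = c(1.5, 2.5),
                           nrow = 50000, ncol = 50000)
object.size(m)   # tiny, because only the two non-zero entries are stored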
77. In Conclusion
1 R provides ways to deal with big data.
2 They are fairly easy to use.
3 Worth learning as R gains popularity and datasets grow huge
in size.
78. Keep in Touch!
My email: ryan@stat.ucla.edu
My blog: http://www.bytemining.com
Follow me on Twitter: @datajunkie
79. The End
Questions?
80. Thank You!
81. References
Most examples and descriptions come from package documentation.
Other references not previously mentioned:
1 “Big Data in R” by G Stat: March 24, 2010.
http://statistics.org.il/wp-content/uploads/2010/04/Big Memory%20V0.pdf
2 “The ff Package: Handling Large Data Sets in R with
Memory Mapped Pages of Binary Flat Files” by Adler,
Nenadic, Zucchini, Glaser: UseR! 2007.
http://user2007.org/program/presentations/adler.pdf
3 “The Bigmemory Project” by Michael Kane and John
Emerson: April 29, 2010.
http://cran.r-project.org/web/packages/bigmemory/vignettes/Overview.pdf
4 “Airline On-time Performance.” Data Expo 2009.
http://stat-computing.org/dataexpo/2009/the-data.html.