Analytics Beyond RAM Capacity
using R
Dr. Alex Palamides
Athens Big Data Meetup
September 19th 2017
Page | 2
• R matters not only because it is a language directly tailored to the needs of predictive
analytics, but because it is used by a huge and growing community
• so R is much more than a language; it is
• a language
• an ecosystem
• a community
• and a vast array of techniques that data scientists can draw from in solving new problems
Athens Big Data – Modelling beyond RAM capacity with R
Introduction to R
Page | 3
Introduction to R
R ranked #5 in the IEEE Spectrum 2016 language ranking, up from 9th two years earlier
Page | 4
Introduction to R
• Open source statistical programming language based upon “S”
• R is one of the most popular data science tools (along with Python)
• The base functionality can be expanded using “packages”
• The usage of R has dramatically increased over recent years:
• Popular with educational and
research communities
• Known to be used at many of the
leading tech firms (Airbnb,
Facebook, Google, Twitter, Uber,
etc.)
• R Consortium support from
Google, IBM, Microsoft, Oracle,
etc.
• Microsoft purchase of Revolutions
Analytics (R Open, R Server, SQL
Server, AzureML)
• RStudio is a popular (IDE) for R / R
Tools for Visual Studio
Page | 5
Introduction to R – Data handling / visualization
• Common file formats are easily read into R
– library(data.table), fread(…) for CSV or text files (as an alternative to
read.csv(…))
– library(readxl) for Excel
– library(haven) for SAS datasets
• Access databases and submit SQL queries via ODBC, or through library(dplyr)'s database back-ends
• Data is usually stored in a data.frame object
• Two main packages are used for processing data in R
– library(dplyr) uses action verbs to act upon data frames
– library(data.table) is faster and more powerful, but its syntax is more challenging to learn
• library(ggplot2) is a very popular
graphics package for R
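As an illustration of the workflow these packages accelerate, here is a minimal base-R sketch (the file name and columns are invented for the example); on larger data, fread() and dplyr verbs would replace read.csv() and aggregate():

```r
# Minimal base-R sketch of the read -> data.frame -> summarise workflow.
# The CSV is generated on the fly so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(day = c("Mon", "Mon", "Tue"), delay = c(5, 12, 3)),
  csv_path, row.names = FALSE
)

flights <- read.csv(csv_path)   # data lands in a data.frame
str(flights)                    # inspect column types

# Base-R counterpart of dplyr's group_by() + summarise()
# (or data.table's DT[, mean(delay), by = day]):
mean_delay <- aggregate(delay ~ day, data = flights, FUN = mean)
print(mean_delay)               # Mon 8.5, Tue 3
```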
Page | 6
Introduction to R – Model Building
• glm(…) to build generalized linear models; commonly used for logistic regression
• step(…) to run stepwise regression
• lm(…) for linear regression
• rpart(…) for CART trees
• randomForest(…) for random forests
• knn(…) for k-nearest neighbours
• nnet(…) for neural networks
• rcorr.cens(…) for the Gini coefficient
• caret::R2(…) for R-squared and model tuning, and so on…
• In general there are multiple ways to create models, thanks to the open-source community
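A short sketch of that shared formula interface, using the built-in mtcars data (the variable choice is arbitrary, purely for illustration):

```r
# Logistic regression with glm() on the built-in mtcars data:
# model transmission type (am) from weight and horsepower.
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(fit)

# Backward stepwise selection by AIC with step():
fit_small <- step(fit, direction = "backward", trace = 0)

# lm() uses the same formula interface for ordinary linear regression;
# rpart(), randomForest(), nnet() etc. follow the same pattern.
lin <- lm(mpg ~ wt, data = mtcars)
coef(lin)   # intercept ~37.29, slope ~ -5.34
```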
Page | 7
R – Limitations and Solutions
• In-Memory Operation
• Lack of Parallelism
• Expensive Data Movement &
Duplication
A couple of scalable R solutions:
• Choose R packages with big data support on single machines
• The “bigmemory” project
• “ff” and related packages
• Scale from single machines to distributed computing
• SparkR
• sparklyr
• RevoScaleR (Microsoft R Server)
and more!
Page | 8
R – Limitations and Solutions
MSR (Microsoft R Server) is a family of R-based products that live both independently and inside a SQL
Server database, as well as on other platforms. They give users a multiplicity of methods to take data from
across the organization, apply predictive analytics to develop learning and insight, and deploy that insight
directly as applications the business can use and act upon
Page | 9
Core Idea
• Microsoft R Server (MSR), on the other hand, utilizes the RevoScaleR package and
follows a different approach: datasets are stored on disk and
computations are performed on chunks of data, so the data is
inherently distributed
• In MSR, most common data operations (manipulation and analysis) are
supported by counterpart functions, in addition to (indirect) support for
open-source R algorithms
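The chunking idea can be mimicked in plain base R. This is only an illustrative sketch (toy file, arbitrary chunk size), not how RevoScaleR is implemented, but it shows why the full dataset never needs to fit in RAM:

```r
# Compute a mean over a CSV one block at a time, keeping only running
# totals in memory; results are updated chunk by chunk.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:1000), csv_path, row.names = FALSE)

chunk_rows <- 250
total <- 0; n <- 0; skip <- 1          # skip = 1 jumps over the header
repeat {
  chunk <- tryCatch(
    read.csv(csv_path, header = FALSE, skip = skip,
             nrows = chunk_rows, col.names = "x"),
    error = function(e) NULL)          # NULL once the file is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  total <- total + sum(chunk$x)        # update running results per chunk
  n     <- n + nrow(chunk)
  skip  <- skip + chunk_rows
}
mean_x <- total / n
mean_x                                 # 500.5, without loading all rows at once
```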
Page | 10
Intro description
• 100% compatible with open source R
Any code/package that works today with R will work in R Server.
• Ability to parallelize any R function
Ideal for parameter sweeps, simulation, scoring.
• Wide range of scalable and distributed "rx"-prefixed functions in the "RevoScaleR" package.
Transformations: rxDataStep()
Statistics: rxSummary(), rxQuantile(), rxChiSquaredTest(), rxCrossTabs()…
Algorithms: rxLinMod(), rxLogit(), rxKmeans(), rxBTrees(), rxDForest()…
Parallelism: rxSetComputeContext()
Page | 11
Microsoft R Open (the Interpreter)
Increased Performance and
scalability through
parallelization and streaming
Page | 12
Microsoft R Server Components
Page | 13
R Open – Traditional Connection to a DB
Page | 14
Scale through Parallelization
In R Server, data can also be pulled from the source in the typical way.
Most of these limitations are overcome by the increased performance of
parallelized computations.
Algorithms are implemented faster because they are written in C++.
As operations take place one block at a time, there is no need to fit all data
in RAM; results are updated chunk by chunk
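A small sketch of the parallel-dispatch idea, using the `parallel` package that ships with base R. RevoScaleR's scheduling is far more sophisticated, but the principle of farming independent pieces of work out to workers is the same:

```r
library(parallel)                      # ships with base R

cl <- makeCluster(2)                   # two local worker processes
# Each worker handles a share of the independent tasks:
squares <- parLapply(cl, 1:8, function(i) i^2)
stopCluster(cl)

unlist(squares)                        # 1 4 9 16 25 36 49 64
```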
Page | 15
Scale through Parallelization
Page | 16
Microsoft R Server
Page | 17
Functions rebuilt to ensure high performance.
Page | 18
State-of-the-Art Machine Learning Algorithms:
rxFastTrees: an implementation of FastRank, an efficient implementation of the MART gradient boosting
algorithm.
rxFastForest: a random forest and quantile regression forest implementation using rxFastTrees.
rxLogisticRegression: logistic regression using L-BFGS.
rxOneClassSvm: one-class support vector machines.
rxNeuralNet: binary, multi-class, and regression neural nets.
rxFastLinear: stochastic dual coordinate ascent optimization for linear binary classification and regression.
rxPredict.mlModel: scores data using a model created by one of the machine learning algorithms.
Helper functions for arguments: loss functions (expLoss, logLoss, etc.), kernel functions (linearKernel,
rbfKernel, etc.). Text tools:
featurizeText: language detection, tokenization, stopword removal, text normalization and feature
generation
categorical transforms with dictionary, feature selection from specified variables, and more
Microsoft ML package
Page | 19
RevoScaleR Performance Benchmarking
When it comes to scaling, performance is
critical. An example of the performance
improvements available to users of
Microsoft R:
• the blue bar depicts the performance
rate and data-size capacity of the
open-source R generalized linear
model, commonly used for logistic
regression
• the red bar shows the effect Microsoft
achieves by creating an equivalent
algorithm that is massively
parallelized, remotely executable and,
most importantly, rewritten in C++ to
maximize the performance of the
algorithm
• as you can see, there is roughly a
forty-to-one difference in the
performance rate
• essentially no limit to scalability:
runtime increases almost linearly
with the data size
Page | 20
Coding Example (Local Execution)
Page | 21
Remote Execution
Page | 22
Remote Execution Architecture
• Faster Computation
• Larger Data Sets
• Fewer Security concerns
Page | 23
Coding Example (Remote Execution)
Page | 24
Getting Started (0)
Microsoft R is a collection of packages, interpreters, and infrastructure for
developing and deploying R-based machine learning and data science solutions
on a range of platforms
• Microsoft R Server is the flagship product and supports very large workloads
in the enterprise.
• Microsoft R Client is a free workstation version. It includes the same R Server
functionality, but for local workloads.
• Microsoft R Open is Microsoft's distribution of open source R, without the
proprietary packages and infrastructure of our other products. This R
distribution is included in both Microsoft R Client and R Server.
• Student / Developer Free Version available
• Supported on several platforms:
• R Server for Hadoop
• R Server for Linux
• R Server for Windows
• R Server for Teradata
• R Server for Azure
• Embedded in SQL Server 2017 as R Services
Page | 25
• Data import and exploration :
• mysource <- file.path(rxGetOption("sampleDataDir"), "AirlineDemoSmall.csv") ## point to the source file
• airXdfData <- rxImport(inData=mysource) # import the file as a data frame - beware of memory limitation issues
• airXdfData <- rxImport(inData=mysource, outFile="c:/Users/Temp/airExample.xdf") # import as XDF, i.e., store the
# file on the hard drive
• An .xdf file is a binary file format native to Microsoft R, used for persisting data on disk. An .xdf file is column-based,
one column per variable, which is optimal for the variable orientation of data used in statistics and predictive analytics.
It includes precomputed metadata that is immediately available with no additional processing.
• Examine object metadata:
• rxGetInfo(airXdfData, getVarInfo = TRUE)
• For example, rxGetInfo(airXdfData, getVarInfo = TRUE) results in:
Variable information: Var 1: ArrDelay 702 factor levels: 6 -8 -2 1 -14 ... 451 430 597 513 432
Var 2: CRSDepTime, Type: numeric, Storage: float32, Low/High: (0.0167, 23.9833)
Var 3: DayOfWeek 7 factor levels: Monday Tuesday Wednesday Thursday Friday Saturday Sunday
• Summarize data :
• rxSummary(~ArrDelay+CRSDepTime+DayOfWeek, data=airXdfData)
• Number of valid observations
• Statistics (mean, StdDev, etc.) for numerical variables
• Counts per level for categorical variables
• rxSummary(formula = ~ArrDelay:DayOfWeek, data=airXdfData) # Statistics by category
Getting Started (1)
Page | 26
• Data transformations - One function to rule them all :
• All transformations are performed via the rxDataStep command
• airXdfData <- rxDataStep(inData = airXdfData, outFile = "c:/Users/Temp/airExample.xdf",
transforms=list(VeryLate = (ArrDelay > 120 | is.na(ArrDelay))), overwrite = TRUE)
• A full RevoScaleR data step consists of the following steps:
• Read in the data a block (200,000 rows) at a time.
• For each block, pass the ArrDelay data to the R interpreter for processing the transformation to
create VeryLate.
• Write the data out to the dataset a block at a time. The argument overwrite=TRUE allows us to
overwrite the data file.
• From XDF to data frame :
• myData <- rxDataStep(inData = airXdfData, rowSelection = ArrDelay > 240 & ArrDelay <= 300, varsToKeep
= c("ArrDelay", "DayOfWeek"))
• Subsetting:
• rxReadXdf(airXdfData, numRows=10, startRow=100000)
• Visualizations :
• rxHistogram(~ArrDelay, data = myData)
Getting Started (2)
Page | 27
• Modelling :
• rxLinMod(formula = ArrDelay ~ DayOfWeek, data = airXdfData) # simple linear regression
• arrDelayLm3 <- rxLinMod(ArrDelay ~ DayOfWeek:F(CRSDepTime), data = airXdfData, cube = TRUE)
# interaction and on the fly factor conversion
• myTab <- rxCrossTabs(ArrDelay~DayOfWeek, data = airLateDS) ## Tabulation by group
• logitObj <- rxLogit(Late~DepHour + Night, data = airExtraDS) # logistic regression
• predictDS <- rxPredict(modelObject = logitObj, data = airExtraDS, outData = airExtraDS) #
Predictions from model objects
• Modelling Using a Compute Cluster :
• myCluster <- RxSparkConnect(nameNode = "my-name-service-server", port = 8020)
• rxSetComputeContext(myCluster)
• With your compute context set to the cluster, all of the RevoScaleR data analysis functions automatically
distribute computations across the nodes of the cluster
• delayCarrierLocDist <- rxLinMod(ArrDelay ~ UniqueCarrier+Origin+Dest, data = dataFile, cube = TRUE,
blocksPerRead = 30) # regression runs on the cluster
• rxSetComputeContext("local") # reset compute context back to the local machine
• And many more, applicable in various operation types: RxXdfData, rxFactors, rxSplit, rxMerge, rxRocCurve,
rxDTree, rxKmeans, rxDForest, rxNaiveBayes, RxHadoopMR, RxSpark, RxInTeradata, RxInSqlServer,
RxLocalParallel, rxExec
Getting Started (3)
Page | 28
Thank you!
Analytics Beyond RAM Capacity using R
Athens Big Data Meetup, September 19th 2017
Dr. Alex Palamides
www.linkedin.com/in/alex-palamides
palamid@gmail.com
Contact Details


Editor's Notes

  • #13 DevelopeR is the IDE. DeployR is essentially a web-services gateway that lets users expose a web service through which BI tools and custom applications can invoke R scripts, run R models and retrieve results without knowing that they are even calling R; so R does not have to be installed on platforms where it is not needed. ConnectR gives access to a variety of data sources, such as SAS or SPSS files, and the ability to save files in a format called XDF, which provides a high degree of compression as well as fast data retrieval when needed. DistributedR also plays a supporting role for ScaleR: it is a normalization layer that provides an abstract interface on top of which the ScaleR algorithms can operate. The core is the algorithms provided by the ScaleR layer: redesigned and written in C++ to provide parallelized computation, loading data one block at a time to combat the memory constraints referred to before, and to provide the ability to execute algorithms on remote systems.
  • #14 Pull data into R, load it into memory, and run the analysis. Let's look at an example of a simple data-pulling and analysis script. Data is extracted from the DB, which brings the typical problems: high data-movement time, RAM constraints in the analysis environment, and duplication of data between the analysed and DB versions.
  • #15 In R Server, data can also be pulled from the source in the typical way. Most of these limitations are overcome by the increased performance of parallelized computations. Algorithms are implemented faster as they are written in C++. As operations take place one block at a time, there is no need to place all data in RAM; results are updated chunk by chunk.
  • #16 Thanks to RevoScaleR, the script remains simple, just a single call. But underneath, the DistributedR component is utilized to identify how many cores and threads are available, and to allocate portions of the work to each of these available resources. Data is analyzed in chunks using the high-performance XDF format (the name stems from "external data format"). It is typically 5 times smaller than a CSV containing the same dataset, and no parsing is required, so retrieval time is reduced significantly.
  • #17 Open-source R, plus a number of enhancements and adaptations that provide the ability to scale up to enterprise-class level: run R at speed on platforms like Hadoop; build scripts on one platform, then run and operationalize them on another; thus write locally, run in the cloud, where R Server comes preloaded and prebuilt. "Operationalize" means setting up something based on R, e.g. a scoring algorithm, and exposing its interfaces via web services to be consumed by all types of BI tools and applications.
  • #22 Instead of pulling data and doing the work in-house, there is the ability to push the work to the data repository, by utilizing the remote execution capabilities.
  • #23 Instead of running the linear regression locally, the parameters are packaged and passed as a request object to the remote system. The remote system starts the master process, and only the results are returned to the script. So: no data movement, and platform-independent work.