SlideShare a Scribd company logo
1 of 4
Download to read offline
R on Netezza's TwinFin Appliance
by Łukasz Bartnik, Netezza
lbartnik@netezza.com


Netezza & TwinFin
TwinFin, called also a “Data Warehouse Appliance” or “Analytic Appliance”, is a massively
parallel processing (MPP) database management system (DBMS). What distinguishes it among
other DBMSes, is the fact that it was designed from scratch, including the special hardware using
field programmable gate arrays (FPGAs), in order to attain the best performance and avoid the
pitfalls of parallel data processing.
Unlike TwinFin, other database platforms were designed initially to work on single computers – at
most multiprocessor, and bringing the parallel processing to them causes many problems, arising
from the different software architecture, etc. This is obviously the case always when a big and
complicated computer system is to be redesigned and adapted to a new hardware platform.
The main concepts behind TwinFin are:
   •   On-Stream AnalyticsTM – which can be expressed as “bring algorithms to data” instead of
       bringing data to algorithms; this means that TwinFin comes with a set of mechanisms and
       interfaces which support designing and writing parallel versions of algorithms utilizing the
       specific capabilities of this platform
   •   no indices supporting SQL queries – all data processing is done on-the-fly and takes a huge
       advantage of the FPGAs to filter out and extract only the desired data
   •   intelligent query optimizer – which prepares an execution plan with minimal effort put to
       move data between the cluster nodes; the level of expertise required to implement such
       optimizer is substantial, and this is one of the biggest advantages of the whole platform
Being a cluster, TwinFin consists of a number of nodes. One, called Host, is the master node user
connects to from his client computer; SQL interpreter & query optimizer work on the Host, also the
SQL stored procedures are executed on this node. Slave nodes, called SPUs (Snippet Processing
Units), are capable of executing SQL queries compiled on Host, UDXs (User Defined Functions
and Aggregates implemented in C++), they also are the actual place where the data is stored.


TwinFin & R
The TwinFin-R approach is quite straightforward. Since the data processing is handled in parallel, it
is obvious that also R has to comply with this paradigm. We have implemented a set of either
entirely new or new versions of previously existing mechanisms:
   •   apply & tapply – which are used to run a user-defined function on each row or each group of
       rows given a grouping column, respectively
   •   nzrun, nzfilter, etc. – which are different, specialized (or generalized) versions of the two
       above (more information below)
These functions are used with nz.data.frame – a Netezza's version of a well-known R data.frame
class. Basically, nz.data.frame is a pointer to a table which resides on the TwinFin. It implements a
number of methods for extracting a subset of its data, gathering meta-info, similar to the data.frame
ones, and working with parallel data-processing algorithms.
Sample code:
> library(nzr)
    > nzconnect(“user”, “password”, “host”, “database”)
    > library(rpart)
    > data(kyphosis)
    # this creates a table out of kyphosis data.frame
    # and sends its data to TwinFin
    > invisible(as.nz.data.frame(kyphosis))
    > nzQuery("SELECT * FROM kyphosis")
       KYPHOSIS AGE NUMBER START
    1    absent 71       3     5
    2    absent 158      3    14
    3   present 128      4     5
   [ cut ]
    # now create a nz.data.frame
    > k <- nz.data.frame(“kyphosis”)
    > as.data.frame(k)
       KYPHOSIS AGE NUMBER START
    1    absent 71       3     5
    2    absent 158      3    14
    3   present 128      4     5
   [ cut ]
    > nzQuery(“SELECT * FROM kyphosis”)
      COUNT
    1    81


R on TwinFin – internals
The “R on TwinFin” package is built on top of RODBC: each of its functions/methods invoked in R
client is translated into SQL query and run directly on TwinFin. If a user-defined function has to be
applied on a TwinFin table, it is serialized and sent to TwinFin as a string literal.
There is a low-level API available on TwinFin, which is hidden by function like nzapply, nzfilter,
etc. When a user-defined code is run in parallel on TwinFin, on each node (SPU) a new R session is
started, which then proceeds with the code execution. Each such sessions has to communicate with
the DBMS process using the aforementioned low-level API to fetch & send data, report error, etc.
Below, are some of its functions:
   •    getNext() – fetches new row from the appliance
   •    getInputColumn(index) – return the value of the column specified by its index
   •    setOutput(index,value) – sets the output column value
   •    outputResult() – sends the current output row to the appliance
   •    columnCount() - returns the number of input columns
   •    userError() – reports an error
These functions might be used directly when nzrun is used to run a user-defined code/function, or
indirectly, through a high-level API, when nzapply or nztapply is used. The two following examples
show the same task approached in both ways.
    >   ndf <- nz.data.frame(“kyphosis”)
    >   fun <- function(x) {
    >     while(getNext()) {
    >       v <- inputColumn(0)
    >       setOutput(0, v ^ 2)
    >       outputResult()
    >     }
    >   }
    >   nzrun(ndf[,2], fun)
  Example 1: low-level API, nzrun
   > ndf <- nz.data.frame(“kyphosis”)
   > nzapply(ndf[,2], NULL, function(x) x[[1]]^2)
  Example 2: high-level API, nzapply
R on TwinFin – a bit more sophisticated
Given is a TwinFin table with transactions data; its columns are shop_id, prod_1, prod_2,
prod_3.The first column determined the shop where the transaction took place, the other three store
the total amount spent on three different product types. The goal is, for each shop, to build a linear
price model of the form:
    prod_1 ~ prod_2 + prod_3
This can be accomplished with nztapply, where the grouping column will be shop_id:
    >   transactions <- nz.data.frame(“transactions”)
    >   prodlm <- function(x) {
    >     return(lm(prod_1 ~ prod_2 + prod_3, data=x)$coefficients)
    >   }
    >   nztapply(transactions, shop_id, prodlm)
This performs the linear-model-building procedure for each group determined by the shop_id
column. Groups are processed independently, and if they reside on different nodes (SPUs) due to
the data distribution specified for the TwinFin table, then the computations can take place in
parallel.
The transactions table can be generated with the function given in Appendix A.


Decision Trees – interfacing TwinFin algorithms in R
Since R implementation of computation-intensive algorithms might not be efficient enough, there is
a number of algorithms whose design and implementation (in SQL and C++) takes into
consideration properties specific to TwinFin.
One of them is the Decision Trees algorithm. It requires three datasets: training, testing, and
pruning. Then, a R wrapper dectree2 is called, which starts the computations on TwinFin. Object of
class dectree2 is returned as the result.
    >   adultTrain <- nz.data.frame("adult_train")
    >   adultTest <- nz.data.frame("adult_test")
    >   adultPrune <- nz.data.frame("adult_prune")
    >   tr <- dectree2(income~., data=adultTrain, valtable=adultPrune,
                       minsplit=1000, id="id", qmeasure="wAcc")
This can be then printed (Example 3) or plotted (see Fig. 1):
    > print(tr)
   node), split, n, deviance, yval, (yprob)
         * denotes terminal node

     16) root 0 0 small ( 0.5 0.5 ) *
               8) education_num < 10 0 0 small ( 0.5 0.5 )
             4) capital_gain < 5013 0 0 small ( 0.5 0.5 )
           1) NANA 0 0 small ( 0.5 0.5 )
        2) marital_status=Married-civ-spouse 0 0 small ( 0.5 0.5 )
              17) education_num > 7 0 0 small ( 0.5 0.5 )
                34) age < 35 0 0 small ( 0.5 0.5 ) *
                   140) hours_per_week < 34 0 0 small ( 0.5 0.5 ) *
                35) age > 35 0 0 small ( 0.5 0.5 )
   [ cut ]
   Example 3: printing a TwinFin decision tree
Figure 1: Example plot of an dectree2 object


Appendix A
Generating data for the nztapply example in Section 4.
   gendata <- function(prodno = 3, shopsno = 4, tno = 300) {
       trans <- data.frame(matrix(NA, shopsno*tno, 4))
       names(trans) <- c("shop_id", "prod_1", "prod_2", "prod_3")
       r <- 1
       for (s in seq(shopsno)) {
           sigma <- matrix(NA, prodno, prodno)
           for (t in seq(prodno)) {
               x <- seq(prodno-t+1)+t-1
               sigma[t,x] <- sigma[x,t] <- runif(length(x),max=0.75)
           }
           diag(sigma) <- 1
           chsigma <- t(chol(sigma))
           for (r in seq(tno)+tno*(s-1))
               trans[r,] <- c(s,abs(chsigma %*% rnorm(prodno)))
       }
       return(trans)
   }


Bibliography
1. http://www.netezza.com/data-warehouse-appliance-products/twinfin.aspx
2. http://www.netezza.com/documents/whitepapers/streaming_analytic_white_paper.pdf
3. http://www.netezza.com/analystReports/2006/bloor_100906.pdf

More Related Content

What's hot

Using Netezza Query Plan to Improve Performace
Using Netezza Query Plan to Improve PerformaceUsing Netezza Query Plan to Improve Performace
Using Netezza Query Plan to Improve PerformaceBiju Nair
 
Parameter substitution in Aginity Workbench
Parameter substitution in Aginity WorkbenchParameter substitution in Aginity Workbench
Parameter substitution in Aginity WorkbenchMary Uguet
 
Backup Options for IBM PureData for Analytics powered by Netezza
Backup Options for IBM PureData for Analytics powered by NetezzaBackup Options for IBM PureData for Analytics powered by Netezza
Backup Options for IBM PureData for Analytics powered by NetezzaTony Pearson
 
Netezza workload management
Netezza workload managementNetezza workload management
Netezza workload managementBiju Nair
 
IBM Pure Data System for Analytics (Netezza)
IBM Pure Data System for Analytics (Netezza)IBM Pure Data System for Analytics (Netezza)
IBM Pure Data System for Analytics (Netezza)Girish Srivastava
 
Netezza Deep Dives
Netezza Deep DivesNetezza Deep Dives
Netezza Deep DivesRush Shah
 
Netezza Online Training by www.etraining.guru in India
Netezza Online Training by www.etraining.guru in IndiaNetezza Online Training by www.etraining.guru in India
Netezza Online Training by www.etraining.guru in IndiaRavikumar Nandigam
 
Netezza vs teradata
Netezza vs teradataNetezza vs teradata
Netezza vs teradataAsis Mohanty
 
Netezza Architecture and Administration
Netezza Architecture and AdministrationNetezza Architecture and Administration
Netezza Architecture and AdministrationBraja Krishna Das
 
The IBM Netezza datawarehouse appliance
The IBM Netezza datawarehouse applianceThe IBM Netezza datawarehouse appliance
The IBM Netezza datawarehouse applianceIBM Danmark
 
Ibm pure data system for analytics n200x
Ibm pure data system for analytics n200xIbm pure data system for analytics n200x
Ibm pure data system for analytics n200xIBM Sverige
 
Datastage parallell jobs vs datastage server jobs
Datastage parallell jobs vs datastage server jobsDatastage parallell jobs vs datastage server jobs
Datastage parallell jobs vs datastage server jobsshanker_uma
 
The IBM Netezza Data Warehouse Appliance
The IBM Netezza Data Warehouse ApplianceThe IBM Netezza Data Warehouse Appliance
The IBM Netezza Data Warehouse ApplianceIBM Sverige
 
Oracle dba online training
Oracle dba online trainingOracle dba online training
Oracle dba online trainingkeylabstraining
 
Data warehousing labs maunal
Data warehousing labs maunalData warehousing labs maunal
Data warehousing labs maunalEducation
 

What's hot (20)

Using Netezza Query Plan to Improve Performace
Using Netezza Query Plan to Improve PerformaceUsing Netezza Query Plan to Improve Performace
Using Netezza Query Plan to Improve Performace
 
Parameter substitution in Aginity Workbench
Parameter substitution in Aginity WorkbenchParameter substitution in Aginity Workbench
Parameter substitution in Aginity Workbench
 
Backup Options for IBM PureData for Analytics powered by Netezza
Backup Options for IBM PureData for Analytics powered by NetezzaBackup Options for IBM PureData for Analytics powered by Netezza
Backup Options for IBM PureData for Analytics powered by Netezza
 
Netezza workload management
Netezza workload managementNetezza workload management
Netezza workload management
 
IBM Pure Data System for Analytics (Netezza)
IBM Pure Data System for Analytics (Netezza)IBM Pure Data System for Analytics (Netezza)
IBM Pure Data System for Analytics (Netezza)
 
IBM Netezza
IBM NetezzaIBM Netezza
IBM Netezza
 
Netezza All labs
Netezza All labsNetezza All labs
Netezza All labs
 
Netezza Deep Dives
Netezza Deep DivesNetezza Deep Dives
Netezza Deep Dives
 
Netezza Online Training by www.etraining.guru in India
Netezza Online Training by www.etraining.guru in IndiaNetezza Online Training by www.etraining.guru in India
Netezza Online Training by www.etraining.guru in India
 
Netezza vs teradata
Netezza vs teradataNetezza vs teradata
Netezza vs teradata
 
Netezza pure data
Netezza pure dataNetezza pure data
Netezza pure data
 
Netezza Architecture and Administration
Netezza Architecture and AdministrationNetezza Architecture and Administration
Netezza Architecture and Administration
 
The IBM Netezza datawarehouse appliance
The IBM Netezza datawarehouse applianceThe IBM Netezza datawarehouse appliance
The IBM Netezza datawarehouse appliance
 
Ibm pure data system for analytics n200x
Ibm pure data system for analytics n200xIbm pure data system for analytics n200x
Ibm pure data system for analytics n200x
 
Datastage parallell jobs vs datastage server jobs
Datastage parallell jobs vs datastage server jobsDatastage parallell jobs vs datastage server jobs
Datastage parallell jobs vs datastage server jobs
 
The IBM Netezza Data Warehouse Appliance
The IBM Netezza Data Warehouse ApplianceThe IBM Netezza Data Warehouse Appliance
The IBM Netezza Data Warehouse Appliance
 
Fast Analytics
Fast Analytics Fast Analytics
Fast Analytics
 
Oracle dba online training
Oracle dba online trainingOracle dba online training
Oracle dba online training
 
Datastage
DatastageDatastage
Datastage
 
Data warehousing labs maunal
Data warehousing labs maunalData warehousing labs maunal
Data warehousing labs maunal
 

Viewers also liked

Statistics for data scientists
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientistsAjay Ohri
 
Big Data, Bigger Campaigns: Using IBM’s Unica and Netezza Platforms to Increa...
Big Data, Bigger Campaigns: Using IBM’s Unica and Netezza Platforms to Increa...Big Data, Bigger Campaigns: Using IBM’s Unica and Netezza Platforms to Increa...
Big Data, Bigger Campaigns: Using IBM’s Unica and Netezza Platforms to Increa...graemeknows
 
A Workshop on R
A Workshop on RA Workshop on R
A Workshop on RAjay Ohri
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FutureWes McKinney
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Wes McKinney
 
Association Rule Mining with R
Association Rule Mining with RAssociation Rule Mining with R
Association Rule Mining with RYanchang Zhao
 

Viewers also liked (7)

Statistics for data scientists
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientists
 
Big Data, Bigger Campaigns: Using IBM’s Unica and Netezza Platforms to Increa...
Big Data, Bigger Campaigns: Using IBM’s Unica and Netezza Platforms to Increa...Big Data, Bigger Campaigns: Using IBM’s Unica and Netezza Platforms to Increa...
Big Data, Bigger Campaigns: Using IBM’s Unica and Netezza Platforms to Increa...
 
A Workshop on R
A Workshop on RA Workshop on R
A Workshop on R
 
Data Analysis With Pandas
Data Analysis With PandasData Analysis With Pandas
Data Analysis With Pandas
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
 
Association Rule Mining with R
Association Rule Mining with RAssociation Rule Mining with R
Association Rule Mining with R
 

Similar to Using R on Netezza

Deep learning with kafka
Deep learning with kafkaDeep learning with kafka
Deep learning with kafkaNitin Kumar
 
Power ai tensorflowworkloadtutorial-20171117
Power ai tensorflowworkloadtutorial-20171117Power ai tensorflowworkloadtutorial-20171117
Power ai tensorflowworkloadtutorial-20171117Ganesan Narayanasamy
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsDatabricks
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Intel® Software
 
Spark Structured APIs
Spark Structured APIsSpark Structured APIs
Spark Structured APIsKnoldus Inc.
 
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Serban Tanasa
 
Eclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science ProjectEclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science ProjectMatthew Gerring
 
Task and Data Parallelism: Real-World Examples
Task and Data Parallelism: Real-World ExamplesTask and Data Parallelism: Real-World Examples
Task and Data Parallelism: Real-World ExamplesSasha Goldshtein
 
John Melesky - Federating Queries Using Postgres FDW @ Postgres Open
John Melesky - Federating Queries Using Postgres FDW @ Postgres OpenJohn Melesky - Federating Queries Using Postgres FDW @ Postgres Open
John Melesky - Federating Queries Using Postgres FDW @ Postgres OpenPostgresOpen
 
A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelA Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelJenny Liu
 
Go Faster With Native Compilation
Go Faster With Native CompilationGo Faster With Native Compilation
Go Faster With Native CompilationPGConf APAC
 
Go faster with_native_compilation Part-2
Go faster with_native_compilation Part-2Go faster with_native_compilation Part-2
Go faster with_native_compilation Part-2Rajeev Rastogi (KRR)
 
Better Network Management Through Network Programmability
Better Network Management Through Network ProgrammabilityBetter Network Management Through Network Programmability
Better Network Management Through Network ProgrammabilityCisco Canada
 
Kapacitor - Real Time Data Processing Engine
Kapacitor - Real Time Data Processing EngineKapacitor - Real Time Data Processing Engine
Kapacitor - Real Time Data Processing EnginePrashant Vats
 
Overview of Chainer and Its Features
Overview of Chainer and Its FeaturesOverview of Chainer and Its Features
Overview of Chainer and Its FeaturesSeiya Tokui
 
Apache Flink & Graph Processing
Apache Flink & Graph ProcessingApache Flink & Graph Processing
Apache Flink & Graph ProcessingVasia Kalavri
 
Free ebooks download ! Edhole
Free ebooks download ! EdholeFree ebooks download ! Edhole
Free ebooks download ! EdholeEdhole.com
 

Similar to Using R on Netezza (20)

Flink internals web
Flink internals web Flink internals web
Flink internals web
 
Deep learning with kafka
Deep learning with kafkaDeep learning with kafka
Deep learning with kafka
 
Power ai tensorflowworkloadtutorial-20171117
Power ai tensorflowworkloadtutorial-20171117Power ai tensorflowworkloadtutorial-20171117
Power ai tensorflowworkloadtutorial-20171117
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*
 
Database programming
Database programmingDatabase programming
Database programming
 
Spark Structured APIs
Spark Structured APIsSpark Structured APIs
Spark Structured APIs
 
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)
 
Eclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science ProjectEclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science Project
 
Deep Learning for Computer Vision: Software Frameworks (UPC 2016)
Deep Learning for Computer Vision: Software Frameworks (UPC 2016)Deep Learning for Computer Vision: Software Frameworks (UPC 2016)
Deep Learning for Computer Vision: Software Frameworks (UPC 2016)
 
Task and Data Parallelism: Real-World Examples
Task and Data Parallelism: Real-World ExamplesTask and Data Parallelism: Real-World Examples
Task and Data Parallelism: Real-World Examples
 
John Melesky - Federating Queries Using Postgres FDW @ Postgres Open
John Melesky - Federating Queries Using Postgres FDW @ Postgres OpenJohn Melesky - Federating Queries Using Postgres FDW @ Postgres Open
John Melesky - Federating Queries Using Postgres FDW @ Postgres Open
 
A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelA Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in Parallel
 
Go Faster With Native Compilation
Go Faster With Native CompilationGo Faster With Native Compilation
Go Faster With Native Compilation
 
Go faster with_native_compilation Part-2
Go faster with_native_compilation Part-2Go faster with_native_compilation Part-2
Go faster with_native_compilation Part-2
 
Better Network Management Through Network Programmability
Better Network Management Through Network ProgrammabilityBetter Network Management Through Network Programmability
Better Network Management Through Network Programmability
 
Kapacitor - Real Time Data Processing Engine
Kapacitor - Real Time Data Processing EngineKapacitor - Real Time Data Processing Engine
Kapacitor - Real Time Data Processing Engine
 
Overview of Chainer and Its Features
Overview of Chainer and Its FeaturesOverview of Chainer and Its Features
Overview of Chainer and Its Features
 
Apache Flink & Graph Processing
Apache Flink & Graph ProcessingApache Flink & Graph Processing
Apache Flink & Graph Processing
 
Free ebooks download ! Edhole
Free ebooks download ! EdholeFree ebooks download ! Edhole
Free ebooks download ! Edhole
 

More from Ajay Ohri

Introduction to R ajay Ohri
Introduction to R ajay OhriIntroduction to R ajay Ohri
Introduction to R ajay OhriAjay Ohri
 
Introduction to R
Introduction to RIntroduction to R
Introduction to RAjay Ohri
 
Social Media and Fake News in the 2016 Election
Social Media and Fake News in the 2016 ElectionSocial Media and Fake News in the 2016 Election
Social Media and Fake News in the 2016 ElectionAjay Ohri
 
Download Python for R Users pdf for free
Download Python for R Users pdf for freeDownload Python for R Users pdf for free
Download Python for R Users pdf for freeAjay Ohri
 
Install spark on_windows10
Install spark on_windows10Install spark on_windows10
Install spark on_windows10Ajay Ohri
 
Ajay ohri Resume
Ajay ohri ResumeAjay ohri Resume
Ajay ohri ResumeAjay Ohri
 
National seminar on emergence of internet of things (io t) trends and challe...
National seminar on emergence of internet of things (io t)  trends and challe...National seminar on emergence of internet of things (io t)  trends and challe...
National seminar on emergence of internet of things (io t) trends and challe...Ajay Ohri
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data scienceAjay Ohri
 
How Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help businessHow Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help businessAjay Ohri
 
Training in Analytics and Data Science
Training in Analytics and Data ScienceTraining in Analytics and Data Science
Training in Analytics and Data ScienceAjay Ohri
 
Software Testing for Data Scientists
Software Testing for Data ScientistsSoftware Testing for Data Scientists
Software Testing for Data ScientistsAjay Ohri
 
A Data Science Tutorial in Python
A Data Science Tutorial in PythonA Data Science Tutorial in Python
A Data Science Tutorial in PythonAjay Ohri
 
How does cryptography work? by Jeroen Ooms
How does cryptography work?  by Jeroen OomsHow does cryptography work?  by Jeroen Ooms
How does cryptography work? by Jeroen OomsAjay Ohri
 
Using R for Social Media and Sports Analytics
Using R for Social Media and Sports AnalyticsUsing R for Social Media and Sports Analytics
Using R for Social Media and Sports AnalyticsAjay Ohri
 
Kush stats alpha
Kush stats alpha Kush stats alpha
Kush stats alpha Ajay Ohri
 
Analyze this
Analyze thisAnalyze this
Analyze thisAjay Ohri
 
Summer school python in spanish
Summer school python in spanishSummer school python in spanish
Summer school python in spanishAjay Ohri
 

More from Ajay Ohri (20)

Introduction to R ajay Ohri
Introduction to R ajay OhriIntroduction to R ajay Ohri
Introduction to R ajay Ohri
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
Social Media and Fake News in the 2016 Election
Social Media and Fake News in the 2016 ElectionSocial Media and Fake News in the 2016 Election
Social Media and Fake News in the 2016 Election
 
Pyspark
PysparkPyspark
Pyspark
 
Download Python for R Users pdf for free
Download Python for R Users pdf for freeDownload Python for R Users pdf for free
Download Python for R Users pdf for free
 
Install spark on_windows10
Install spark on_windows10Install spark on_windows10
Install spark on_windows10
 
Ajay ohri Resume
Ajay ohri ResumeAjay ohri Resume
Ajay ohri Resume
 
National seminar on emergence of internet of things (io t) trends and challe...
National seminar on emergence of internet of things (io t)  trends and challe...National seminar on emergence of internet of things (io t)  trends and challe...
National seminar on emergence of internet of things (io t) trends and challe...
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
How Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help businessHow Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help business
 
Training in Analytics and Data Science
Training in Analytics and Data ScienceTraining in Analytics and Data Science
Training in Analytics and Data Science
 
Tradecraft
Tradecraft   Tradecraft
Tradecraft
 
Software Testing for Data Scientists
Software Testing for Data ScientistsSoftware Testing for Data Scientists
Software Testing for Data Scientists
 
Craps
CrapsCraps
Craps
 
A Data Science Tutorial in Python
A Data Science Tutorial in PythonA Data Science Tutorial in Python
A Data Science Tutorial in Python
 
How does cryptography work? by Jeroen Ooms
How does cryptography work?  by Jeroen OomsHow does cryptography work?  by Jeroen Ooms
How does cryptography work? by Jeroen Ooms
 
Using R for Social Media and Sports Analytics
Using R for Social Media and Sports AnalyticsUsing R for Social Media and Sports Analytics
Using R for Social Media and Sports Analytics
 
Kush stats alpha
Kush stats alpha Kush stats alpha
Kush stats alpha
 
Analyze this
Analyze thisAnalyze this
Analyze this
 
Summer school python in spanish
Summer school python in spanishSummer school python in spanish
Summer school python in spanish
 

Recently uploaded

Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 

Recently uploaded (20)

Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 

Using R on Netezza

  • 1. R on Netezza's TwinFin Appliance by Łukasz Bartnik, Netezza lbartnik@netezza.com Netezza & TwinFin TwinFin, called also a “Data Warehouse Appliance” or “Analytic Appliance”, is a massively parallel processing (MPP) database management system (DBMS). What distinguishes it among other DBMSes, is the fact that it was designed from scratch, including the special hardware using field programmable gate arrays (FPGAs), in order to attain the best performance and avoid the pitfalls of parallel data processing. Unlike TwinFin, other database platforms were designed initially to work on single computers – at most multiprocessor, and bringing the parallel processing to them causes many problems, arising from the different software architecture, etc. This is obviously the case always when a big and complicated computer system is to be redesigned and adapted to a new hardware platform. The main concepts behind TwinFin are: • On-Stream AnalyticsTM – which can be expressed as “bring algorithms to data” instead of bringing data to algorithms; this means that TwinFin comes with a set of mechanisms and interfaces which support designing and writing parallel versions of algorithms utilizing the specific capabilities of this platform • no indices supporting SQL queries – all data processing is done on-the-fly and takes a huge advantage of the FPGAs to filter out and extract only the desired data • intelligent query optimizer – which prepares an execution plan with minimal effort put to move data between the cluster nodes; the level of expertise required to implement such optimizer is substantial, and this is one of the biggest advantages of the whole platform Being a cluster, TwinFin consists of a number of nodes. One, called Host, is the master node user connects to from his client computer; SQL interpreter & query optimizer work on the Host, also the SQL stored procedures are executed on this node. Slave nodes, called SPUs (Snippet Processing Units), are capable of executing SQL queries compiled on Host, UDXs (User Defined Functions and Aggregates implemented in C++), they also are the actual place where the data is stored. TwinFin & R The TwinFin-R approach is quite straightforward. Since the data processing is handled in parallel, it is obvious that also R has to comply with this paradigm. We have implemented a set of either entirely new or new versions of previously existing mechanisms: • apply & tapply – which are used to run a user-defined function on each row or each group of rows given a grouping column, respectively • nzrun, nzfilter, etc. – which are different, specialized (or generalized) versions of the two above (more information below) These functions are used with nz.data.frame – a Netezza's version of a well-known R data.frame class. Basically, nz.data.frame is a pointer to a table which resides on the TwinFin. It implements a number of methods for extracting a subset of its data, gathering meta-info, similar to the data.frame ones, and working with parallel data-processing algorithms. Sample code:
  • 2. > library(nzr) > nzconnect(“user”, “password”, “host”, “database”) > library(rpart) > data(kyphosis) # this creates a table out of kyphosis data.frame # and sends its data to TwinFin > invisible(as.nz.data.frame(kyphosis)) > nzQuery("SELECT * FROM kyphosis") KYPHOSIS AGE NUMBER START 1 absent 71 3 5 2 absent 158 3 14 3 present 128 4 5 [ cut ] # now create a nz.data.frame > k <- nz.data.frame(“kyphosis”) > as.data.frame(k) KYPHOSIS AGE NUMBER START 1 absent 71 3 5 2 absent 158 3 14 3 present 128 4 5 [ cut ] > nzQuery(“SELECT * FROM kyphosis”) COUNT 1 81 R on TwinFin – internals The “R on TwinFin” package is built on top of RODBC: each of its functions/methods invoked in R client is translated into SQL query and run directly on TwinFin. If a user-defined function has to be applied on a TwinFin table, it is serialized and sent to TwinFin as a string literal. There is a low-level API available on TwinFin, which is hidden by function like nzapply, nzfilter, etc. When a user-defined code is run in parallel on TwinFin, on each node (SPU) a new R session is started, which then proceeds with the code execution. Each such sessions has to communicate with the DBMS process using the aforementioned low-level API to fetch & send data, report error, etc. Below, are some of its functions: • getNext() – fetches new row from the appliance • getInputColumn(index) – return the value of the column specified by its index • setOutput(index,value) – sets the output column value • outputResult() – sends the current output row to the appliance • columnCount() - returns the number of input columns • userError() – reports an error These functions might be used directly when nzrun is used to run a user-defined code/function, or indirectly, through a high-level API, when nzapply or nztapply is used. The two following examples show the same task approached in both ways. > ndf <- nz.data.frame(“kyphosis”) > fun <- function(x) { > while(getNext()) { > v <- inputColumn(0) > setOutput(0, v ^ 2) > outputResult() > } > } > nzrun(ndf[,2], fun) Example 1: low-level API, nzrun > ndf <- nz.data.frame(“kyphosis”) > nzapply(ndf[,2], NULL, function(x) x[[1]]^2) Example 2: high-level API, nzapply
  • 3. R on TwinFin – a bit more sophisticated Given is a TwinFin table with transactions data; its columns are shop_id, prod_1, prod_2, prod_3.The first column determined the shop where the transaction took place, the other three store the total amount spent on three different product types. The goal is, for each shop, to build a linear price model of the form: prod_1 ~ prod_2 + prod_3 This can be accomplished with nztapply, where the grouping column will be shop_id: > transactions <- nz.data.frame(“transactions”) > prodlm <- function(x) { > return(lm(prod_1 ~ prod_2 + prod_3, data=x)$coefficients) > } > nztapply(transactions, shop_id, prodlm) This performs the linear-model-building procedure for each group determined by the shop_id column. Groups are processed independently, and if they reside on different nodes (SPUs) due to the data distribution specified for the TwinFin table, then the computations can take place in parallel. The transactions table can be generated with the function given in Appendix A. Decision Trees – interfacing TwinFin algorithms in R Since R implementation of computation-intensive algorithms might not be efficient enough, there is a number of algorithms whose design and implementation (in SQL and C++) takes into consideration properties specific to TwinFin. One of them is the Decision Trees algorithm. It requires three datasets: training, testing, and pruning. Then, a R wrapper dectree2 is called, which starts the computations on TwinFin. Object of class dectree2 is returned as the result. > adultTrain <- nz.data.frame("adult_train") > adultTest <- nz.data.frame("adult_test") > adultPrune <- nz.data.frame("adult_prune") > tr <- dectree2(income~., data=adultTrain, valtable=adultPrune, minsplit=1000, id="id", qmeasure="wAcc") This can be then printed (Example 3) or plotted (see Fig. 1): > print(tr) node), split, n, deviance, yval, (yprob) * denotes terminal node 16) root 0 0 small ( 0.5 0.5 ) * 8) education_num < 10 0 0 small ( 0.5 0.5 ) 4) capital_gain < 5013 0 0 small ( 0.5 0.5 ) 1) NANA 0 0 small ( 0.5 0.5 ) 2) marital_status=Married-civ-spouse 0 0 small ( 0.5 0.5 ) 17) education_num > 7 0 0 small ( 0.5 0.5 ) 34) age < 35 0 0 small ( 0.5 0.5 ) * 140) hours_per_week < 34 0 0 small ( 0.5 0.5 ) * 35) age > 35 0 0 small ( 0.5 0.5 ) [ cut ] Example 3: printing a TwinFin decision tree
  • 4. Figure 1: Example plot of an dectree2 object Appendix A Generating data for the nztapply example in Section 4. gendata <- function(prodno = 3, shopsno = 4, tno = 300) { trans <- data.frame(matrix(NA, shopsno*tno, 4)) names(trans) <- c("shop_id", "prod_1", "prod_2", "prod_3") r <- 1 for (s in seq(shopsno)) { sigma <- matrix(NA, prodno, prodno) for (t in seq(prodno)) { x <- seq(prodno-t+1)+t-1 sigma[t,x] <- sigma[x,t] <- runif(length(x),max=0.75) } diag(sigma) <- 1 chsigma <- t(chol(sigma)) for (r in seq(tno)+tno*(s-1)) trans[r,] <- c(s,abs(chsigma %*% rnorm(prodno))) } return(trans) } Bibliography 1. http://www.netezza.com/data-warehouse-appliance-products/twinfin.aspx 2. http://www.netezza.com/documents/whitepapers/streaming_analytic_white_paper.pdf 3. http://www.netezza.com/analystReports/2006/bloor_100906.pdf