Streaming Data,
Concurrency And R

     Rory Winston

   rory@theresearchkitchen.com
About Me




      Independent Software Consultant
      M.Sc. Applied Computing, 2000
      M.Sc. Finance, 2008
      Apache Committer
      Working in the financial sector for the last 7 years or so
      Interested in practical applications of functional languages and
      machine learning
      Relatively recent convert to R (≈ 2 years)
R - Pros and Cons


   Pro
         Designed by statisticians
         Can be extremely elegant
         Comprehensive extension library
         Open-source
         Huge parallelization effort
         Fantastic reporting capabilities
         Incredibly popular

   Con
         Designed by statisticians
         Can be clunky (S4)
         Bewildering array of overlapping extensions
         Inherently single-threaded
         Incredibly popular
Parallelization vs. Concurrency



        R interpreter is single threaded
        Some historical context for this (BLAS implementations)
        Not necessarily a limitation in the general context
        Multithreading can be complex and problematic
        Instead a focus on parallelization:
             Distributed computation: gridR, nws, snow
             Multicore/multi-cpu scaling: Rmpi, Romp, pnmath/pnmath0
             Interfaces to Pthreads/PBLAS/OpenMP/MPI/Globus/etc.
        Parallelization suits cpu-bound large data processing
        applications
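
   As a rough illustration of the parallelization style described above, the sketch below uses the parallel package that ships with current R releases (it incorporates much of snow). It is not taken from the talk: the worker count, seed, and the toy function estimate_pi are assumptions chosen for the example.

```r
library(parallel)

# CPU-bound toy task (hypothetical): Monte Carlo estimate of pi from n draws.
estimate_pi <- function(n) {
  x <- runif(n)
  y <- runif(n)
  4 * mean(x^2 + y^2 <= 1)
}

cl <- makeCluster(2)                    # two local worker processes, not threads
clusterSetRNGStream(cl, iseed = 42)     # independent random streams per worker
estimates <- parLapply(cl, rep(1e5, 4), estimate_pi)
stopCluster(cl)

# Combine the four independent estimates in the master process.
pi_hat <- mean(unlist(estimates))
```

   Each worker is a separate single-threaded R process, so the interpreter itself never runs two R evaluations at once.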
Other Scalability and Performance Work




        JIT/bytecode compilation (Ra)
        Implicit vectorization a la Matlab (code analysis)
        Large (≥ RAM) dataset handling (bigmemory,ff)
        Many incremental performance improvements (e.g. less
        internal copying)
        Next: GPU/massive multicore...?
What Benefit Concurrency?




       Real-time (streaming to be more precise) data analysis
       Growing interest in using R for streaming data, not just offline
       analysis
       GUI toolkit integration
       Fine-grained control over independent task execution
       "I believe that explicit concurrency management tools (i.e. a
       threads toolkit) are what we really need in R at this point." -
       Luke Tierney, 2001
Will There Be A Multithreaded R?


        Short answer is: probably not
        At least not in its current incarnation
        Internal workings of the interpreter not particularly amenable
        to concurrency:
            Functions can manipulate caller state (<<- vs. <-)
            Lazy evaluation machinery (promises)
            Dynamic State, garbage collection, etc.
            Scoping: global environments
            Management of resources: streams, I/O, connections, sinks
        Implications for current code
        Possibly in the next language evolution (cf. Ihaka?)
        Large amount of work (but potentially do-able)
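
   Two of the points above, non-local assignment and lazily evaluated arguments, can be seen in a few lines of plain R. This is a minimal sketch; the names make_counter, note, lazy, and events are made up for the example.

```r
# `<-` binds in the function's own frame; `<<-` rebinds a variable that
# lives in an enclosing environment, i.e. state outside the function.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1
    count
  }
}

counter <- make_counter()
counter()
counter()
n_calls <- counter()          # third call on the same shared state

# Promises: an argument expression is evaluated only when first used.
events <- character(0)
note <- function(msg) {
  events <<- c(events, msg)   # side effect makes the evaluation visible
  msg
}
lazy <- function(x, use_it) {
  if (use_it) x else "unused"
}

r1 <- lazy(note("evaluated"), use_it = FALSE)  # promise never forced
r2 <- lazy(note("evaluated"), use_it = TRUE)   # promise forced here
```

   With several threads, both the hidden mutation of count and the timing of the side effect inside note() would need synchronization, which is part of why retrofitting threads is hard.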
Example Application




        Based on work I did last year and presented at UseR! 2008
        Wrote a real-time and historical market data service from
        Reuters/R
        The real-time interface used the Reuters C++ API
        R extension in C++ that spawned listening thread and
        handled updates
Simplified Architecture




                                R


                         extension (C++)



                           realtime bus
Example Usage



          rsub <- function(duration, items, callback)


   The call rsub will subscribe to the specified rate(s) for the duration
   of time specified by duration (ms). When a tick arrives, the
   callback function callback is invoked, with a data frame
   containing the fields specified in items.

   Multiple market data items may be subscribed to, and any
   combination of fields may be specified.

   Uses the underlying RFA API, which provides a C++ interface to
   real-time market updates.
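
   The callback contract described above can be tried without a Reuters connection. The function mock_rsub below is a hypothetical stand-in, not the actual extension: it fabricates random values, emits one tick per item every tick_ms of simulated time (an assumed parameter), and does not sleep.

```r
# Hypothetical stand-in for rsub(): no market data feed, ticks are simulated.
mock_rsub <- function(duration, items, callback, tick_ms = 100) {
  n_ticks <- duration %/% tick_ms
  for (i in seq_len(n_ticks)) {
    for (item in items) {
      # Each item is c(service, instrument, field names...)
      fields <- item[-(1:2)]
      values <- as.list(round(runif(length(fields), 1, 2), 4))
      names(values) <- fields
      tick <- as.data.frame(c(list(item = item[2]), values),
                            stringsAsFactors = FALSE)
      callback(tick)          # runs synchronously on the interpreter thread
    }
  }
  invisible(n_ticks)
}

# Collect every tick instead of printing it.
received <- list()
collect <- function(df) received[[length(received) + 1]] <<- df

items <- list(c("IDN_SELECTFEED", "EUR=", "BID", "ASK"),
              c("IDN_SELECTFEED", "GBP=", "BID", "ASK"))
n <- mock_rsub(500, items, collect)
```

   The call returns only after all ticks have been delivered, which mirrors how the real subscription occupies the interpreter for its whole duration.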
Real-Time Example


   # Specify field names to retrieve
   fields <- c("BID","ASK","TIMCOR")

   # Subscribe to EUR/USD and GBP/USD ticks
   items <- list()
   items[[1]] <- c("IDN_SELECTFEED", "EUR=", fields)
   items[[2]] <- c("IDN_SELECTFEED", "GBP=", fields)

   # Simple Callback Function
   callback <- function(df) { print(paste("Received",df)) }

   # Subscribe for 1 hour
   ONE_HOUR <- 1000*(60)^2
   rsub(ONE_HOUR, items, callback)
Issues With This Approach




        As R interpreter is single threaded, cannot spawn thread for
        callbacks
        Thus, interpreter thread is locked for the duration of
        subscription
        Not a great user experience
        Need to find alternative mechanism
Alternative Approach



        If we cannot run subscriber threads in-process, need to
        decouple
        Standard approach: add an extra layer and use some form of
        IPC
        For instance, we could:
            Subscribe in a dedicated R process (A)
            Push incoming data onto a socket
            R process (B) reads from a listening socket
        Sockets could also be another IPC primitive, e.g. pipes
        Also note that R supports asynchronous I/O (?isIncomplete)
        Look at the ibrokers package for examples of this
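
   The two-process layout above might look roughly like the following, using base R socket connections. This is a sketch under assumptions: the port number, the one-line-per-tick "SYMBOL,BID,ASK" wire format, the "END" sentinel, and all function names are invented for illustration.

```r
# Assumed wire format: one tick per line, "SYMBOL,BID,ASK"; "END" ends the stream.
format_tick <- function(symbol, bid, ask) {
  sprintf("%s,%.5f,%.5f", symbol, bid, ask)
}

parse_tick <- function(line) {
  parts <- strsplit(line, ",", fixed = TRUE)[[1]]
  data.frame(symbol = parts[1],
             bid = as.numeric(parts[2]),
             ask = as.numeric(parts[3]),
             stringsAsFactors = FALSE)
}

# Process B: listen on a socket and poll it with non-blocking reads.
run_consumer <- function(port = 6011, on_tick = print) {
  con <- socketConnection(port = port, server = TRUE,
                          blocking = FALSE, open = "r")
  on.exit(close(con))
  repeat {
    lines <- readLines(con)   # complete lines only; a partial line is pushed back
    done <- match("END", lines)
    if (!is.na(done)) lines <- lines[seq_len(done - 1)]
    for (l in lines) on_tick(parse_tick(l))
    if (!is.na(done)) break
    Sys.sleep(0.1)
  }
}

# Process A: connect to B and push already formatted tick lines.
run_producer <- function(ticks, port = 6011) {
  con <- socketConnection("localhost", port = port,
                          blocking = TRUE, open = "w")
  on.exit(close(con))
  writeLines(c(ticks, "END"), con)
}

# Usage: call run_consumer() in session B first, then in session A:
#   run_producer(format_tick("EUR=", 1.1, 1.2))
```

   The polling loop is only for demonstration; an interactive session could instead call readLines(con) whenever it wants the latest ticks and stay responsive in between.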
The bigmemoRy package



       From the description: "Use C++ to create, store,
       access, and manipulate massive matrices"
       Allows creation of large matrices
       These matrices can be mapped to files/shared memory
       It is the shared memory functionality that we will use
       The next version (3.0) will be unveiled at UseR! 2009

   big.matrix(nrow, ncol, type = "integer", ...)
   shared.big.matrix(nrow, ncol, type = "integer", ...)
   filebacked.big.matrix(nrow, ncol, type = "integer", ...)
Sample Usage




   > library(bigmemory) # Note: I'm using pre-release
   > X <- shared.big.matrix(type="double", ncol=1000, nrow=1000)
   > X
   An object of class “big.matrix”
   Slot "address":
   <pointer: 0x7378a0>
Create Shared Memory Descriptor

   > desc <- describe(X)
   > desc
   $sharedType
   [1] "SharedMemory"

   $sharedName
   [1] "53f14925-dca1-42a8-a547-e1bccae999ce"

   $nrow
   [1] 1000

   $ncol
   [1] 1000

   $rowNames
   NULL
Export the Descriptor




    In R session 1:

    > dput(desc, file="~/matrix.desc")

    In R session 2:

    > library(bigmemory)
    > desc <- dget("~/matrix.desc")
    > X <- attach.big.matrix(desc)

    Now R sessions 1 and 2 share the same big.matrix instance
Share Data Between Sessions




   R session 1:

   > X[1,1] <- 1.2345

   R session 2:

   > X[1,1]
   [1] 1.2345

   Thus, streaming data can be continuously fed into session 1
   and concurrently processed in session 2
Summary




      Lack of threads not a barrier to concurrent analysis
      Packages like bigmemory, nws, etc. facilitate decoupling via
      IPC
      nws goes a step further, with a distributed workspace
      Many applications for streaming data:
          Data collection/monitoring
          Development of pricing/risk algorithms
          Low-frequency execution (??)
          ...
References




        http://cran.r-project.org/web/packages/bigmemory/
        http://www.cs.uiowa.edu/~luke/R/thrgui/
        http://www.milbo.users.sonic.net/ra/index.html
        http://www.cs.kent.ac.uk/projects/cxxr/
        http://www.theresearchkitchen.com/blog

More Related Content

Similar to Realtime r

Unleashing the Potential: Navigating the Versatility and Simplicity of Python...
Unleashing the Potential: Navigating the Versatility and Simplicity of Python...Unleashing the Potential: Navigating the Versatility and Simplicity of Python...
Unleashing the Potential: Navigating the Versatility and Simplicity of Python...
Flexsin
 
Decision trees in hadoop
Decision trees in hadoopDecision trees in hadoop
Decision trees in hadoop
Revolution Analytics
 
IIIF: International Image Interoperability Framework @ DLF2012
IIIF: International Image Interoperability Framework @ DLF2012IIIF: International Image Interoperability Framework @ DLF2012
IIIF: International Image Interoperability Framework @ DLF2012
Tom-Cramer
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Herman Wu
 
Workflows in the Virtual Observatory
Workflows in the Virtual ObservatoryWorkflows in the Virtual Observatory
Workflows in the Virtual ObservatoryJose Enrique Ruiz
 
Python for Data Engineering: Why Do Data Engineers Use Python?
Python for Data Engineering: Why Do Data Engineers Use Python?Python for Data Engineering: Why Do Data Engineers Use Python?
Python for Data Engineering: Why Do Data Engineers Use Python?
hemayadav41
 
Cloud Deployments with Apache Hadoop and Apache HBase
Cloud Deployments with Apache Hadoop and Apache HBaseCloud Deployments with Apache Hadoop and Apache HBase
Cloud Deployments with Apache Hadoop and Apache HBaseDATAVERSITY
 
Why I don't use Semantic Web technologies anymore, event if they still influe...
Why I don't use Semantic Web technologies anymore, event if they still influe...Why I don't use Semantic Web technologies anymore, event if they still influe...
Why I don't use Semantic Web technologies anymore, event if they still influe...
Gautier Poupeau
 
Mission to NARs with Apache NiFi
Mission to NARs with Apache NiFiMission to NARs with Apache NiFi
Mission to NARs with Apache NiFi
Hortonworks
 
(Costless) Software Abstractions for Parallel Architectures
(Costless) Software Abstractions for Parallel Architectures(Costless) Software Abstractions for Parallel Architectures
(Costless) Software Abstractions for Parallel Architectures
Joel Falcou
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
Wes McKinney
 
Presentation.pptx read and learn and download
Presentation.pptx read and learn and downloadPresentation.pptx read and learn and download
Presentation.pptx read and learn and download
Shubhamyadav41303
 
End-to-End Big Data AI with Analytics Zoo
End-to-End Big Data AI with Analytics ZooEnd-to-End Big Data AI with Analytics Zoo
End-to-End Big Data AI with Analytics Zoo
Jason Dai
 
Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith
Sigmoid
 
Open Compute and the History of the Open Source Data Center
Open Compute and the History of the Open Source Data CenterOpen Compute and the History of the Open Source Data Center
Open Compute and the History of the Open Source Data Center
Cole Crawford
 
Using Aspects for Language Portability (SCAM 2010)
Using Aspects for Language Portability (SCAM 2010)Using Aspects for Language Portability (SCAM 2010)
Using Aspects for Language Portability (SCAM 2010)
lennartkats
 
Revolution R Enterprise: 100% R and More (14 Mar 2013)
Revolution R Enterprise: 100% R and More (14 Mar 2013)Revolution R Enterprise: 100% R and More (14 Mar 2013)
Revolution R Enterprise: 100% R and More (14 Mar 2013)Revolution Analytics
 
Revolution Analytics Podcast
Revolution Analytics PodcastRevolution Analytics Podcast
Revolution Analytics Podcast
inside-BigData.com
 

Similar to Realtime r (19)

Unleashing the Potential: Navigating the Versatility and Simplicity of Python...
Unleashing the Potential: Navigating the Versatility and Simplicity of Python...Unleashing the Potential: Navigating the Versatility and Simplicity of Python...
Unleashing the Potential: Navigating the Versatility and Simplicity of Python...
 
Decision trees in hadoop
Decision trees in hadoopDecision trees in hadoop
Decision trees in hadoop
 
IIIF: International Image Interoperability Framework @ DLF2012
IIIF: International Image Interoperability Framework @ DLF2012IIIF: International Image Interoperability Framework @ DLF2012
IIIF: International Image Interoperability Framework @ DLF2012
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
Workflows in the Virtual Observatory
Workflows in the Virtual ObservatoryWorkflows in the Virtual Observatory
Workflows in the Virtual Observatory
 
Python for Data Engineering: Why Do Data Engineers Use Python?
Python for Data Engineering: Why Do Data Engineers Use Python?Python for Data Engineering: Why Do Data Engineers Use Python?
Python for Data Engineering: Why Do Data Engineers Use Python?
 
Cloud Deployments with Apache Hadoop and Apache HBase
Cloud Deployments with Apache Hadoop and Apache HBaseCloud Deployments with Apache Hadoop and Apache HBase
Cloud Deployments with Apache Hadoop and Apache HBase
 
Why I don't use Semantic Web technologies anymore, event if they still influe...
Why I don't use Semantic Web technologies anymore, event if they still influe...Why I don't use Semantic Web technologies anymore, event if they still influe...
Realtime r

  • 1. Streaming Data, Concurrency And R Rory Winston rory@theresearchkitchen.com
  • 2. About Me
      Independent Software Consultant
      M.Sc. Applied Computing, 2000
      M.Sc. Finance, 2008
      Apache Committer
      Working in the financial sector for the last 7 years or so
      Interested in practical applications of functional languages and machine learning
      Relatively recent convert to R (≈ 2 years)
  • 3. R - Pros and Cons
      Pro
          Designed by statisticians
          Can be extremely elegant
          Comprehensive extension library
          Open-source
          Huge parallelization effort
          Fantastic reporting capabilities
          Incredibly Popular
      Con
          Designed by statisticians
          Can be clunky (S4)
          Bewildering array of overlapping extensions
          Inherently single-threaded
          Incredibly Popular
  • 18. Parallelization vs. Concurrency
      The R interpreter is single-threaded
      Some historical context for this (BLAS implementations)
      Not necessarily a limitation in the general context
      Multithreading can be complex and problematic
      Instead, the focus is on parallelization:
          Distributed computation: gridR, nws, snow
          Multicore/multi-CPU scaling: Rmpi, Romp, pnmath/pnmath0
          Interfaces to Pthreads/PBLAS/OpenMP/MPI/Globus/etc.
      Parallelization suits CPU-bound, large-data processing applications
  • 19. Other Scalability and Performance Work
      JIT/bytecode compilation (Ra)
      Implicit vectorization à la Matlab (code analysis)
      Large (≥ RAM) dataset handling (bigmemory, ff)
      Many incremental performance improvements (e.g. less internal copying)
      Next: GPU/massive multicore...?
  • 20. What Benefit Concurrency?
      Real-time (streaming, to be more precise) data analysis
      Growing interest in using R for streaming data, not just offline analysis
      GUI toolkit integration
      Fine-grained control over independent task execution
      "I believe that explicit concurrency management tools (i.e. a threads toolkit) are what we really need in R at this point." - Luke Tierney, 2001
  • 21. Will There Be A Multithreaded R?
      Short answer is: probably not
      At least not in its current incarnation
      Internal workings of the interpreter not particularly amenable to concurrency:
          Functions can manipulate caller state (<<- vs. <-)
          Lazy evaluation machinery (promises)
          Dynamic state, garbage collection, etc.
          Scoping: global environments
          Management of resources: streams, I/O, connections, sinks
      Implications for current code
      Possibly in the next language evolution (cf. Ihaka?)
      Large amount of work (but potentially do-able)
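The first two obstacles listed on slide 21 can be seen in a few lines of base R. This is only a minimal sketch: `make_counter`, `lazy`, and the variable names are illustrative and do not come from the talk. Strictly speaking, `<<-` assigns into an enclosing environment rather than the caller's frame, but in both cases a function call changes state outside itself, which is what makes concurrent evaluation hard to reason about.

```r
# 1. `<<-` modifies a variable in an enclosing environment,
#    whereas `<-` would only create a local copy.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # updates `count` owned by make_counter()'s frame
    count
  }
}
next_tick <- make_counter()
next_tick()
next_tick()
n_ticks <- next_tick()

# 2. Arguments are promises: the expression is evaluated only when the
#    function first uses it, and it is evaluated in the caller's environment.
evaluated <- FALSE
lazy <- function(x, use_it) if (use_it) x else NULL

lazy({ evaluated <- TRUE; 1 }, use_it = FALSE)  # promise never forced
evaluated_before <- evaluated

lazy({ evaluated <- TRUE; 1 }, use_it = TRUE)   # promise forced here
evaluated_after <- evaluated
```

Both effects happen at points that are not visible from the call site alone, so two threads sharing one interpreter could not safely interleave such calls without extensive locking.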
  • 33. Example Application
      Based on work I did last year and presented at UseR! 2008
      Wrote a real-time and historical market data service from Reuters/R
      The real-time interface used the Reuters C++ API
      R extension in C++ that spawned a listening thread and handled updates
  • 34. Simplified Architecture
      [Diagram; surviving labels: R extension (C++), realtime bus]
  • 35. Example Usage
      rsub <- function(duration, items, callback)
      The call rsub will subscribe to the specified rate(s) for the period given by duration (ms). When a tick arrives, the callback function callback is invoked with a data frame containing the fields specified in items. Multiple market data items may be subscribed to, and any combination of fields may be specified.
      Uses the underlying RFA API, which provides a C++ interface to real-time market updates.
  • 36. Real-Time Example
      # Specify field names to retrieve
      fields <- c("BID","ASK","TIMCOR")
      # Subscribe to EUR/USD and GBP/USD ticks
      items <- list()
      items[[1]] <- c("IDN_SELECTFEED", "EUR=", fields)
      items[[2]] <- c("IDN_SELECTFEED", "GBP=", fields)
      # Simple Callback Function
      callback <- function(df) {
        print(paste("Received",df))
      }
      # Subscribe for 1 hour
      ONE_HOUR <- 1000*(60)^2
      rsub(ONE_HOUR, items, callback)
  • 37. Issues With This Approach
      As the R interpreter is single-threaded, we cannot spawn a thread for callbacks
      Thus, the interpreter thread is locked for the duration of the subscription
      Not a great user experience
      Need to find an alternative mechanism
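The blocking behaviour described on slide 37 can be imitated without the Reuters API. The sketch below uses a hypothetical stand-in, `rsub_mock`, that has the same argument order as `rsub` but simply generates random numbers for each requested field; the `interval` argument and the added `ITEM` column are assumptions made for illustration. The point is only that the call does not return until `duration` has elapsed, so the session can do nothing else in the meantime.

```r
# Hypothetical stand-in for rsub(): fabricates ticks instead of subscribing.
rsub_mock <- function(duration, items, callback, interval = 0.05) {
  deadline <- Sys.time() + duration / 1000   # duration is in milliseconds
  n <- 0L
  while (Sys.time() < deadline) {            # the interpreter is busy here
    for (item in items) {
      fields <- item[-(1:2)]                 # drop service name and item name
      values <- round(runif(length(fields), 1, 2), 4)
      tick <- as.data.frame(as.list(stats::setNames(values, fields)))
      tick$ITEM <- item[2]
      callback(tick)
      n <- n + 1L
    }
    Sys.sleep(interval)
  }
  invisible(n)                               # number of callbacks made
}

received <- list()
items <- list(c("IDN_SELECTFEED", "EUR=", "BID", "ASK"))
started <- Sys.time()
n <- rsub_mock(200, items, function(df) {
  received[[length(received) + 1L]] <<- df
})
blocked_for <- as.numeric(difftime(Sys.time(), started, units = "secs"))
```

Here `blocked_for` is at least 0.2 seconds: control returns to the prompt only once the subscription window has closed.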
  • 38. Alternative Approach
      If we cannot run subscriber threads in-process, we need to decouple
      Standard approach: add an extra layer and use some form of IPC
      For instance, we could:
          Subscribe in a dedicated R process (A)
          Push incoming data onto a socket
          R process (B) reads from a listening socket
      The socket could be replaced by another IPC primitive, e.g. pipes
      Also note that R supports asynchronous I/O (?isIncomplete)
      Look at the IBrokers package for examples of this
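The two-process arrangement on slide 38 might be sketched with base R socket connections as follows. Everything here is an assumption made for illustration: the function names, the port number 6011, and the comma-separated `item,bid,ask` wire format are not from the talk. Process B listens and process A connects and pushes lines; blocking reads are used for brevity, whereas a non-blocking connection polled with `readLines()` would be closer to the asynchronous I/O the slide mentions.

```r
# Encode and decode one tick as a line of text (hypothetical wire format).
format_tick <- function(item, bid, ask) paste(item, bid, ask, sep = ",")

parse_tick <- function(line) {
  parts <- strsplit(line, ",", fixed = TRUE)[[1]]
  data.frame(item = parts[1],
             BID  = as.numeric(parts[2]),
             ASK  = as.numeric(parts[3]),
             stringsAsFactors = FALSE)
}

# Process B: listen on a port and hand each incoming tick to `handler`.
run_consumer <- function(port = 6011L, n_ticks = 5L, handler = print) {
  con <- socketConnection(host = "localhost", port = port, server = TRUE,
                          blocking = TRUE, open = "r")
  on.exit(close(con))
  for (i in seq_len(n_ticks)) {
    line <- readLines(con, n = 1L)
    if (length(line) == 0L) break          # producer closed the socket
    handler(parse_tick(line))
  }
  invisible(NULL)
}

# Process A: connect to the listening socket and push lines onto it.
run_producer <- function(lines, port = 6011L) {
  con <- socketConnection(host = "localhost", port = port, server = FALSE,
                          blocking = TRUE, open = "w")
  on.exit(close(con))
  for (line in lines) writeLines(line, con)
  invisible(length(lines))
}
```

To try it, call `run_consumer()` in one R session first, then `run_producer(format_tick("EUR=", 1.4012, 1.4015))` in a second session. In a real deployment the producer would be called from inside the subscription callback.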
  • 39. The bigmemoRy package
      From the description: "Use C++ to create, store, access, and manipulate massive matrices"
      Allows creation of large matrices
      These matrices can be mapped to files/shared memory
      It is the shared memory functionality that we will use
      The next version (3.0) will be unveiled at UseR! 2009
      big.matrix(nrow, ncol, type = "integer", ...)
      shared.big.matrix(nrow, ncol, type = "integer", ...)
      filebacked.big.matrix(nrow, ncol, type = "integer", ...)
  • 40. Sample Usage
      > library(bigmemory)  # Note: I'm using pre-release
      > X <- shared.big.matrix(type="double", ncol=1000, nrow=1000)
      > X
      An object of class “big.matrix”
      Slot "address":
      <pointer: 0x7378a0>
  • 41. Create Shared Memory Descriptor
      > desc <- describe(X)
      > desc
      $sharedType
      [1] "SharedMemory"
      $sharedName
      [1] "53f14925-dca1-42a8-a547-e1bccae999ce"
      $nrow
      [1] 1000
      $ncol
      [1] 1000
      $rowNames
      NULL
  • 42. Export the Descriptor
      In R session 1:
      > dput(desc, file="~/matrix.desc")
      In R session 2:
      > library(bigmemory)
      > desc <- dget("~/matrix.desc")
      > X <- attach.big.matrix(desc)
      Now R sessions 1 and 2 share the same big.matrix instance
  • 43. Share Data Between Sessions
      R session 1:
      > X[1,1] <- 1.2345
      R session 2:
      > X[1,1]
      [1] 1.2345
      Thus, streaming data can be continuously fed into session 1
      and concurrently processed in session 2
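One way to turn a shared matrix into a stream, sketched under stated assumptions: treat it as a ring buffer, with row 1, column 1 holding a running count of ticks written and the remaining rows holding the most recent ticks. The helper names (`ring_row`, `write_tick`, `read_new`) and the layout are hypothetical and not part of bigmemory or the talk. The functions only use `nrow()` and `[`/`[<-` indexing, so they are written to work on an ordinary matrix (as in the demonstration) and are expected to work the same way on an attached big.matrix. There is no locking, so a reader that falls more than one buffer length behind loses ticks, and a read that overlaps a write may see a partially written row.

```r
# Map the n-th tick (1-based) onto one of `capacity` ring-buffer slots.
ring_row <- function(n, capacity) ((n - 1) %% capacity) + 1

# Writer session: store one tick, then publish the new counter.
write_tick <- function(X, values) {
  capacity <- nrow(X) - 1            # row 1 is reserved for the counter
  n <- X[1, 1] + 1
  X[ring_row(n, capacity) + 1, ] <- values
  X[1, 1] <- n                       # counter is updated last
  invisible(X)
}

# Reader session: return ticks written since `cursor`, oldest first.
read_new <- function(X, cursor) {
  capacity <- nrow(X) - 1
  n <- X[1, 1]
  if (n <= cursor) return(list(cursor = cursor, ticks = NULL))
  first <- max(cursor + 1, n - capacity + 1)   # skip overwritten ticks
  rows <- ring_row(seq(first, n), capacity) + 1
  list(cursor = n, ticks = X[rows, , drop = FALSE])
}

# Demonstration on a plain matrix: capacity 3, columns BID, ASK, time.
X <- matrix(0, nrow = 4, ncol = 3)
for (i in 1:4) X <- write_tick(X, c(i, i + 0.5, i * 10))
batch <- read_new(X, cursor = 0)
```

After four writes into three slots, `batch$ticks` holds ticks 2, 3 and 4 in order, and `batch$cursor` is 4. With bigmemory, the writer session would create the matrix and export its descriptor as on slides 40 to 42, and the reader session would call `read_new()` in a polling loop on the attached matrix.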
  • 44. Summary
      Lack of threads is not a barrier to concurrent analysis
      Packages like bigmemory, nws, etc. facilitate decoupling via IPC
      nws goes a step further, with a distributed workspace
      Many applications for streaming data:
          Data collection/monitoring
          Development of pricing/risk algorithms
          Low-frequency execution (??)
          ...
  • 45. References
      http://cran.r-project.org/web/packages/bigmemory/
      http://www.cs.uiowa.edu/~luke/R/thrgui/
      http://www.milbo.users.sonic.net/ra/index.html
      http://www.cs.kent.ac.uk/projects/cxxr/
      http://www.theresearchkitchen.com/blog