A quick review and demonstration of how to get started with parallel computing in R. Includes an example of a SNOW cluster setup in the departmental lab.
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA (Robert Metzger)
Flink is a unified stream and batch processing framework that natively supports streaming topologies, long-running batch jobs, machine learning algorithms, and graph processing through a pipelined dataflow execution engine. It provides high-level APIs, automatic optimization, efficient memory management, and fault tolerance to execute all of these workloads without needing to treat the system as a black box. Flink achieves native support through its ability to execute everything as data streams, support iterative and stateful computation through caching and managed state, and optimize jobs through cost-based planning and local execution strategies like sort merge join.
EuroMPI 2016 Keynote: How Can MPI Fit Into Today's Big Computing (Jonathan Dursi)
HTML slides and longer abstract can be found at https://github.com/ljdursi/EuroMPI2016.
For years, the academic science and engineering community was almost alone in pursuing very large-scale numerical computing, and MPI was the lingua franca for such work. But starting in the mid-2000s, we were no longer alone. First internet-scale companies like Google and Yahoo! started performing fairly basic analytics tasks at enormous scale, and since then others have begun tackling increasingly complex and data-heavy machine-learning computations, which involve very familiar scientific computing primitives such as linear algebra, unstructured mesh decomposition, and numerical optimization. These new communities have created programming environments which emphasize what we’ve learned about computer science and programmability since 1994 – with greater levels of abstraction and encapsulation, separating high-level computation from the low-level implementation details.
At about the same time, new academic research communities began using computing at scale to attack their problems - but in many cases, an ideal distributed-memory application for them begins to look more like the new concurrent distributed databases than a large CFD simulation, with data structures like dynamic hash tables and Bloom trees playing more important roles than rectangular arrays or unstructured meshes. These new academic communities are among the first to adopt emerging big-data technologies over traditional HPC options; but as big-data technologies improve their tightly-coupled number-crunching capabilities, they are unlikely to be the last.
In this talk, I sketch out the landscape of distributed technical computing frameworks and environments, and look at where MPI and the MPI community fit into this new ecosystem.
The document discusses recent developments in the R programming environment for data analysis, including packages like magrittr, readr, tidyr, and dplyr that enable data wrangling workflows. It provides an overview of the key functions in these packages that allow users to load, reshape, manipulate, model, visualize, and report on data in a pipeline using the %>% operator.
This document provides an overview of Apache Flink internals. It begins with an introduction and recap of Flink programming concepts. It then discusses how Flink programs are compiled into execution plans and executed in a pipelined fashion, as opposed to being executed eagerly like regular code. The document outlines Flink's architecture including the optimizer, runtime environment, and data storage integrations. It also covers iterative processing and how Flink handles iterations both by unrolling loops and with native iterative datasets.
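The contrast between building an execution plan and running code eagerly can be sketched in a few lines. This is a minimal illustration, not Flink's API: the `Plan` class and its methods are hypothetical names. Transformations are only recorded until `execute()` is called, and records then flow through the chained operators one at a time.

```python
class Plan:
    """Records transformations instead of running them (hypothetical class)."""

    def __init__(self, source, ops=()):
        self.source = source
        self.ops = tuple(ops)

    def map(self, fn):
        # Nothing is computed here; the operator is appended to the plan.
        return Plan(self.source, self.ops + (("map", fn),))

    def filter(self, pred):
        return Plan(self.source, self.ops + (("filter", pred),))

    def explain(self):
        # Human-readable view of the recorded plan.
        return " -> ".join(["source"] + [kind for kind, _ in self.ops])

    def execute(self):
        # Chain lazy iterators so each record passes through all operators
        # before the next record is read (a simple form of pipelining).
        stream = iter(self.source)
        for kind, fn in self.ops:
            stream = map(fn, stream) if kind == "map" else filter(fn, stream)
        return list(stream)


plan = Plan(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(plan.explain())
print(plan.execute())
```

Because the whole plan is known before anything runs, a real system has the chance to reorder or fuse operators first; this sketch skips that step.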
Sebastian Schelter – Distributed Machine Learning with the Samsara DSL (Flink Forward)
The document discusses Samsara, a domain specific language for distributed machine learning. It provides an algebraic expression language for linear algebra operations and optimizes distributed computations. An example of linear regression on a cereals dataset is presented to demonstrate how Samsara can be used to estimate regression coefficients in a distributed fashion. Key steps include loading data as a distributed row matrix, extracting feature and target matrices, computing the normal equations, and solving the linear system to estimate coefficients.
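The normal-equations step can be written out as a worked sketch. This is plain Python rather than Samsara's DSL, the function names are hypothetical, and the data is a made-up toy set, not the cereals dataset. The point is that X^T X and X^T y are sums of per-row contributions, which is why they distribute well, while the final solve happens on a small in-memory system.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, targets):
    """Estimate [intercept, coefficients...] from the normal equations."""
    xs = [[1.0] + list(r) for r in rows]  # prepend a bias column
    k = len(xs[0])
    # Each entry is a sum over rows, so partial sums could be computed
    # per partition and added together.
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Toy data generated from y = 1 + 2x (an assumption for illustration).
beta = fit_ols([[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 5.0, 7.0])
print(beta)
```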
Mikio Braun – Data flow vs. procedural programming (Flink Forward)
The document discusses the differences between procedural and data flow programming styles as used in Flink. Procedural programming uses variables, loops, and functions to operate on ordered data structures. Data flow programming treats data as unordered sets and uses parallel set transformations like maps, filters, and reductions. It cannot nest operations and uses broadcast variables to combine intermediate results. The document provides examples translating algorithms like centering, sums, and linear regression from procedural to data flow styles in Flink.
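The centering example can be sketched in both styles. This is plain Python standing in for Flink operators, with hypothetical function names. In the data flow version the mean is first computed by a map and a reduce, and is then made available to a second map, which is the role a broadcast variable plays.

```python
from functools import reduce


def center_procedural(values):
    """Loop-and-variable style: relies on an ordered, indexable sequence."""
    total = 0.0
    for v in values:
        total += v
    mean = total / len(values)
    out = []
    for v in values:
        out.append(v - mean)
    return out


def center_dataflow(values):
    """Set-transformation style: map to (value, 1), reduce, then map again."""
    total, count = reduce(
        lambda a, b: (a[0] + b[0], a[1] + b[1]),
        map(lambda v: (v, 1), values),
    )
    mean = total / count  # the intermediate result that would be broadcast
    return list(map(lambda v: v - mean, values))


print(center_procedural([1, 2, 3, 6]))
print(center_dataflow([1, 2, 3, 6]))
```

The reduce function is associative and commutative, so the result does not depend on the order in which elements arrive.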
This document provides a summary of MapReduce algorithms. It begins with background on the author's experience blogging about MapReduce algorithms in academic papers. It then provides an overview of MapReduce concepts including the mapper and reducer functions. Several examples of recently published MapReduce algorithms are described for tasks like machine learning, finance, and software engineering. One algorithm is examined in depth for building a low-latency key-value store. Finally, recommendations are provided for designing MapReduce algorithms including patterns, performance, and cost/maintainability considerations. An appendix lists additional MapReduce algorithms from academic papers in areas such as AI, biology, machine learning, and mathematics.
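The mapper and reducer roles can be sketched with a small in-memory driver. This is a single-process illustration of the programming model, not a Hadoop API; `run_mapreduce` and the word-count functions are hypothetical names.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def reducer(key, values):
    """Combine all values that were emitted for one key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Run map, then group by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


print(run_mapreduce(["a b a", "b c"], mapper, reducer))
```

In a real cluster the grouping step moves data between machines, which is usually where most of the cost lies.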
Data Analytics and Simulation in Parallel with MATLAB* (Intel® Software)
This talk covers the current parallel capabilities in MATLAB*. Learn about its parallel language and distributed and tall arrays. Interact with GPUs both on the desktop and in the cluster. Combine this information into an interesting algorithmic framework for data analysis and simulation.
Discretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSP (Tathagata Das)
This is the academic conference talk on Spark Streaming, where I introduce the concept of Discretized Streams and how it achieves large-scale, efficient, fault-tolerant streaming in a different way than traditional stream processing systems.
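The core idea of a discretized stream, as I understand it, is to cut the input into small time intervals and run a deterministic batch computation on each one. The sketch below is plain Python, not the Spark Streaming API; the function names and the sample events are assumptions for illustration.

```python
from collections import defaultdict


def discretize(events, batch_seconds):
    """Group (timestamp, value) events into per-interval batches.

    Intervals with no events are skipped in this sketch.
    """
    batches = defaultdict(list)
    for ts, value in events:
        batches[int(ts // batch_seconds)].append(value)
    return [batches[i] for i in sorted(batches)]


def running_counts(events, batch_seconds):
    """Update a running word count once per batch, snapshotting each result."""
    state = {}
    snapshots = []
    for batch in discretize(events, batch_seconds):
        for word in batch:
            state[word] = state.get(word, 0) + 1
        snapshots.append(dict(state))
    return snapshots


events = [(0.2, "a"), (0.7, "b"), (1.1, "a"), (2.5, "c")]
print(running_counts(events, batch_seconds=1))
```

Because each batch result depends only on the previous state and that batch's input, a lost result can in principle be recomputed from those two pieces rather than restored from a replica.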
Vasia Kalavri – Training: Gelly School (Flink Forward)
- Gelly is a graph processing library built on Apache Flink that provides APIs for Java and Scala to work with graphs and perform graph algorithms
- It allows seamless integration of graph-based and record-based analysis by mixing the Gelly and Flink DataSet APIs
- Common graph algorithms like connected components, PageRank, and similarity recommendations are included in the library
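As an illustration of one of the algorithms named above, connected components can be computed by repeatedly propagating the smallest vertex label along edges until nothing changes. This is a single-machine Python sketch with a hypothetical function name, not Gelly's API or its implementation.

```python
def connected_components(vertices, edges):
    """Label every vertex with the smallest vertex id in its component."""
    labels = {v: v for v in vertices}
    changed = True
    while changed:  # iterate until a full pass changes no label
        changed = False
        for u, v in edges:
            low = min(labels[u], labels[v])
            if labels[u] != low or labels[v] != low:
                labels[u] = labels[v] = low
                changed = True
    return labels


print(connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]))
```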
Machine Learning with Apache Flink at Stockholm Machine Learning Group (Till Rohrmann)
This presentation covers Apache Flink's approach to scalable machine learning: composable machine learning pipelines, consisting of transformers and learners, and distributed linear algebra.
The presentation was held at the Machine Learning Stockholm group on the 23rd of March 2015.
This document introduces Apache Flink, an open-source stream processing framework. It discusses how Flink can be used for both streaming and batch data processing using common APIs. It also summarizes Flink's features like exactly-once stream processing, iterative algorithms, and libraries for machine learning, graphs, and SQL-like queries. The document promotes Flink as a high-performance stream processor that is easy to use and integrates streaming and batch workflows.
This document provides an overview of the basics of R, including why R is used, tutorials and links for learning R, the R interface and workspace, and how to get help in R. It explains that R is a free and open-source programming language for statistical analysis and graphics. It has a steep learning curve because data is analyzed interactively through chained commands rather than single procedures. Help is available through a built-in system and various online tutorials.
Ufuk Celebi – Stream & Batch Processing in one System (Flink Forward)
The document describes the architecture and execution model of Apache Flink. Flink uses a distributed dataflow model where a job is represented as a directed acyclic graph of operators. The client submits this graph to the JobManager, which schedules tasks across TaskManagers. Tasks communicate asynchronously through data channels to process bounded and unbounded data in a pipelined fashion.
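To make the job-graph idea concrete, the sketch below represents a job as a directed acyclic graph of operators and derives an order in which every operator comes after its inputs. It is a generic topological sort in Python, not Flink's scheduler; the function name and the operator names are assumptions.

```python
from collections import deque


def schedule(job_graph):
    """Return operators so that each one appears after all of its inputs.

    job_graph maps an operator name to the list of its downstream operators.
    """
    indegree = {op: 0 for op in job_graph}
    for downstream in job_graph.values():
        for op in downstream:
            indegree[op] = indegree.get(op, 0) + 1
    ready = deque(op for op, d in indegree.items() if d == 0)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for nxt in job_graph.get(op, []):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(indegree):
        raise ValueError("job graph contains a cycle")
    return order


job = {"source": ["map"], "map": ["keyBy"], "keyBy": ["sink"], "sink": []}
print(schedule(job))
```

In a pipelined engine the operators run concurrently once deployed, so an order like this matters for deployment and dependency checks rather than for running one stage after another.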
Apache Flink Training: DataSet API Basics (Flink Forward)
This document provides an overview of the Apache Flink DataSet API. It introduces key concepts such as batch processing, data types including tuples, transformations like map, filter, group, and reduce, joining datasets, data sources and sinks, and an example word count program in Java. The word count example demonstrates reading text data, tokenizing strings, grouping and counting words, and writing the results. The document contains slides with code snippets and explanations of Flink's DataSet API concepts and features.
Presentation given on Monday 10 September at the ROOT Users' Workshop 2018 in Sarajevo. Progress update on the Automated Parallel Computation of Collaborative Statistical Models project, a collaboration between the Netherlands eScience Center and Nikhef.
We present an update on our recent efforts to further parallelize RooFit. We have performed extensive benchmarks and identified at least three bottlenecks that will benefit from parallelization. To tackle these and possible future bottlenecks, we designed a parallelization layer that allows us to parallelize existing classes with minimal effort, but with high performance and retaining as much of the existing class's interface as possible. The high-level parallelization model is a task-stealing approach. The implementation is currently based on the bi-directional memory mapped pipe (BidirMMapPipe), but could in the future be replaced by other modes of communication between processes.
Apache Flink Training: DataStream API Part 1 Basic (Flink Forward)
The document provides an overview of Apache Flink's DataStream API for stream processing. It discusses key concepts like stream execution environments, data types (including tuples), transformations (such as map, filter, grouping), data sources (files, sockets, collections), sinks, and fault tolerance through checkpointing. The document also contains examples of a WordCount application using the DataStream API in Java.
Many experts believe that ageing can be delayed; this is one of the main goals of the Institute of Healthy Ageing at University College London. I will present the results of my lifespan-extension research, in which we integrated publicly available gene databases in order to identify ageing-related genes. I will show what challenges we met and what we have learned about the process of ageing.
Ageing is one of the fundamental mysteries in biology and many scientists are starting to study this fascinating process. I am part of the research group led by Dr Eugene Schuster at UCL Institute of Healthy Ageing. We experiment with Drosophila and Caenorhabditis elegans by modifying their genes in order to create long-lived mutants. The results of our experiments are quantified using high-throughput microarray analysis. Finally, we apply information technology in order to understand how the ageing process works. I will show how we mine microarray data in order to find the connections between thousands of genes and how we identify candidates for ageing genes.
We are interested in building a better understanding of gene functions by harnessing the large quantity of experimental microarray data in the public databases. Our hope is that after understanding the ageing process in simpler organisms we will be able to apply this knowledge to humans.
Cross-referencing expression levels across thousands of genes and hundreds of experiments turned out to be a computationally challenging problem, but Hadoop and the Amazon cloud came to our rescue. In this talk I will present a case study based on our use of R with Amazon Elastic MapReduce and will give background on our bioinformatics challenges.
These slides were presented at ApacheCon Europe 2012:
http://www.apachecon.eu/schedule/presentation/3/
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ... (Tathagata Das)
Spark Streaming is a framework for processing large volumes of streaming data in near-real-time. This is an introductory presentation about how Spark Streaming and Kafka can be used for high volume near-real-time streaming data processing in a cluster. This was a guest lecture in a Stanford course.
More information on the course at http://stanford.edu/~rezab/dao/
Apache Flink: API, runtime, and project roadmap (Kostas Tzoumas)
The document provides an overview of Apache Flink, an open source stream processing framework. It discusses Flink's programming model using DataSets and transformations, real-time stream processing capabilities, windowing functions, iterative processing, and visualization tools. It also provides details on Flink's runtime architecture, including its use of pipelined and staged execution, optimizations for iterative algorithms, and how the Flink optimizer selects execution plans.
During the past few years R has become an important language for data analysis, data representation and visualization. R is a very expressive language which combines functional and dynamic aspects with laziness and object-oriented programming. However, the default R implementation is neither fast nor distributed, both features crucial for "big data" processing.
Here, the FastR-Flink compiler is presented, a compiler based on Oracle's R implementation FastR with support for some operations of Apache Flink, a Java/Scala framework for distributed data processing. Apache Flink constructs such as map, reduce and filter are integrated at the compiler level to allow the execution of distributed stream and batch data processing applications directly from the R programming language.
This document discusses different frameworks for big data processing at ResearchGate, including Hive, MapReduce, and Flink. It provides an example of using Hive to find the top 5 coauthors for each author based on publication data. Code snippets in Hive SQL and Java are included to implement the top k coauthors user defined aggregate function (UDAF) in Hive. The document evaluates different frameworks based on criteria like features, performance, and usability.
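The top-k coauthors task can be sketched independently of any framework. The Python below is not the Hive UDAF from the slides; the function name and the sample publication lists are made up for illustration. It counts how often each pair of authors appears on the same publication and keeps the k most frequent partners per author.

```python
from collections import Counter, defaultdict
from itertools import permutations


def top_coauthors(publications, k=5):
    """Map each author to their k most frequent coauthors.

    publications is an iterable of author-name lists, one per publication.
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = defaultdict(Counter)
    for authors in publications:
        for a, b in permutations(set(authors), 2):
            counts[a][b] += 1
    return {
        a: [name for name, _ in sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for a, c in counts.items()
    }


pubs = [["ann", "bob"], ["ann", "bob", "cat"], ["ann", "cat"], ["ann", "dan"]]
print(top_coauthors(pubs, k=2))
```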
Apache Flink @ Strata & Hadoop World London (Stephan Ewen)
This document summarizes the key capabilities of Apache Flink, an open source platform for distributed stream and batch data processing. It discusses how Flink supports streaming dataflows, batch jobs, machine learning algorithms, and graph analysis through its unified dataflow engine. Flink compiles programs into dataflow graphs that execute all workloads as streaming topologies with checkpointing for fault tolerance. This allows Flink to natively support diverse workloads through flexible state, windows, and iterative processing.
This document provides an overview of Apache Flink and streaming analytics. It discusses key concepts in streaming such as event time vs processing time, watermarks, windows, and fault tolerance using checkpoints and savepoints. It provides examples of time-windowed and session-windowed aggregations as well as pattern detection using state. The document also covers mixing event time and processing time, window triggers, and reprocessing data from savepoints in streaming jobs.
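The interplay of event time, watermarks and windows can be sketched as follows. This is a simplified single-process model in Python, not Flink's DataStream API. The function name, the bounded out-of-orderness rule for advancing the watermark, and the sample events are assumptions. A tumbling window is emitted once the watermark passes its end, and events arriving for an already closed window are treated as late.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size, max_out_of_orderness):
    """Count keys per event-time window, emitting as the watermark advances.

    events is an iterable of (event_time, key) pairs in arrival order.
    Returns (emitted, dropped): emitted holds (window_start, key, count)
    tuples, dropped holds events that arrived after their window closed.
    """
    open_windows = defaultdict(lambda: defaultdict(int))
    emitted, dropped = [], []
    watermark = float("-inf")
    for event_time, key in events:
        start = (event_time // window_size) * window_size
        if start + window_size <= watermark:
            dropped.append((event_time, key))  # late: window already closed
        else:
            open_windows[start][key] += 1
        # The watermark trails the largest event time seen so far.
        watermark = max(watermark, event_time - max_out_of_orderness)
        closable = sorted(s for s in open_windows if s + window_size <= watermark)
        for w in closable:
            for k, c in sorted(open_windows.pop(w).items()):
                emitted.append((w, k, c))
    for w in sorted(open_windows):  # end of input: flush what is still open
        for k, c in sorted(open_windows[w].items()):
            emitted.append((w, k, c))
    return emitted, dropped


events = [(1, "a"), (4, "a"), (3, "b"), (12, "a"), (2, "a"), (25, "b")]
print(tumbling_window_counts(events, window_size=10, max_out_of_orderness=2))
```

A larger `max_out_of_orderness` tolerates more disorder at the price of later results, which is the basic trade-off a watermark strategy encodes.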
Maximilian Michels – Google Cloud Dataflow on Top of Apache Flink (Flink Forward)
This document discusses Google Cloud Dataflow and how it can be executed using Apache Flink. It provides an overview of Dataflow and its API, which is similar to batch and streaming concepts in Flink. It then describes how a Dataflow program is translated to an Abstract Syntax Tree (AST) and how the AST is converted to a Flink execution graph by implementing translators for specific Dataflow transforms like ParDo and Combine. Finally, it mentions the FlinkPipelineRunner that is available on GitHub to execute Dataflow pipelines using Flink.
1) The document discusses generalized linear models (GLM) using H2O. GLM is a well-known statistical method that fits a linear model to predict outcomes.
2) H2O enables distributed, parallel GLM on large datasets with billions of data points. It supports standard GLM features like regularization to prevent overfitting.
3) An example demonstrates predicting flight delays using airline data with 116 million rows. GLM and deep learning models are fit in seconds on H2O using an 8-node cluster.
The document discusses the need for a W3C community group on RDF stream processing. It notes there is currently heterogeneity in RDF stream models, query languages, implementations, and operational semantics. The speaker proposes creating a W3C community group to better understand these differences, requirements, and potentially develop recommendations. The group's mission would be to define common models for producing, transmitting, and continuously querying RDF streams. The presentation provides examples of use cases and outlines a template for describing them to collect more cases to understand requirements.
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz... (Ryan Rosario)
This document summarizes a presentation on parallelizing R code using various packages. It discusses R's limitations: by default it uses only one CPU core and reads all data into memory. It then outlines packages for explicit (Rmpi) and implicit (snowfall, foreach) parallelism as well as map-reduce and large-memory techniques. The presentation provides an overview of these packages and demonstrates Rmpi by parallelizing a Fibonacci function, although this example shows no performance benefit because the overhead of setting up parallelization outweighs the cost of the computation.
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data... (Ryan Rosario)
This document summarizes a presentation about working with large datasets in R. It discusses how R loads all data into memory by default, which can cause issues for large datasets that exceed available RAM. It then gives an overview of several packages for working with "big data" in R, including bigmemory and ff, which allow data to be accessed from files without loading it entirely into memory. The document provides examples of using bigmemory to create large matrices stored on disk but shared across multiple R sessions, as well as applying it to analyze a large airline on-time performance dataset consisting of 120 million rows.
Maximilian Michels – Google Cloud Dataflow on Top of Apache FlinkFlink Forward
This document discusses Google Cloud Dataflow and how it can be executed using Apache Flink. It provides an overview of Dataflow and its API, which is similar to batch and streaming concepts in Flink. It then describes how a Dataflow program is translated to an Abstract Syntax Tree (AST) and how the AST is converted to a Flink execution graph by implementing translators for specific Dataflow transforms like ParDo and Combine. Finally, it mentions the FlinkPipelineRunner that is available on GitHub to execute Dataflow pipelines using Flink.
1) The document discusses generalized linear models (GLM) using H2O. GLM is a well-known statistical method that fits a linear model to predict outcomes.
2) H2O enables distributed, parallel GLM on large datasets with billions of data points. It supports standard GLM features like regularization to prevent overfitting.
3) An example demonstrates predicting flight delays using airline data with 116 million rows. GLM and deep learning models are fit in seconds on H2O using an 8-node cluster.
The document discusses the need for a W3C community group on RDF stream processing. It notes there is currently heterogeneity in RDF stream models, query languages, implementations, and operational semantics. The speaker proposes creating a W3C community group to better understand these differences, requirements, and potentially develop recommendations. The group's mission would be to define common models for producing, transmitting, and continuously querying RDF streams. The presentation provides examples of use cases and outlines a template for describing them to collect more cases to understand requirements.
Taking R to the Limit (High Performance Computing in R), Part 1 -- Paralleliz...Ryan Rosario
This document summarizes a presentation on parallelizing R code using various packages. It discusses R's limitations in using only one CPU core by default and reading all data into memory. It then outlines packages for explicit (Rmpi) and implicit (snowfall, foreach) parallelism as well as map-reduce and large memory techniques. The presentation provides an overview of these packages and demonstrates Rmpi for parallelizing a Fibonacci function, though this example does not see performance benefits due to overhead of setting up parallelization outweighing computation costs.
Taking R to the Limit (High Performance Computing in R), Part 2 -- Large Data...Ryan Rosario
This document summarizes a presentation about working with large datasets in R. It discusses how R loads all data into memory by default, which can cause issues for large datasets that exceed available RAM. It then overviews several packages for working with "big data" in R, including bigmemory and ff, which allow accessing data from files without loading entirely into memory. The document provides examples of using bigmemory to create large matrices stored on disk but shared across multiple R sessions, as well as applying it to analyze a large airline on-time performance dataset consisting of 120 million rows.
Managing large datasets in R – ff examples and conceptsAjay Ohri
The document discusses managing large datasets in R using the packages 'bit' and 'ff'. It provides an introduction to the key concepts of these packages for handling datasets that are too large to fit into RAM. Examples are presented that demonstrate how to create and work with atomic vectors stored on disk using 'ff' to avoid exceeding RAM limits. Processing of very large datasets is enabled through chunked and parallelized operations.
In this talk, we will discuss our approach to bring large scale deep analytics to the masses. R is an extremely popular numerical computer environment, but scientific data processing frequently hits its memory limits. On the other hand, system to execute data intensive tasks like Hadoop or Stratosphere are not popular among R users because writing programs using these paradigms is cumbersome. We present an innovative approach to overcome these limitations using the Stratosphere/Apache Flink big data platform by means of a R package and ready-to-use distributed algorithm.
This solution allows the user, with small modifications in the R code, to easily execute distributed scenarios using popular machine learning techniques. We will cover the implementation details of the proposed solution including the architecture of the system, the functionality implemented and working examples.
In addition, we will cover what are the differences between our approach and other solutions that integrate R with Hadoop or other large-scale analytics systems.
Finally, the results of the performance tests show that this solution is competitive with the already existing R implementations for small amounts of data and able to scale-up to gigabyte level.
This document summarizes Ryan R. Rosario's presentation on accessing R from Python using RPy2 to the Los Angeles R Users' Group. The presentation introduces RPy2 as an interface to call R functions and create R data types from Python. It provides examples of using RPy2 to extract user data from a forum website, create R data frames and matrices, plot histograms and linear models in R, and call custom R functions from Python. The document also discusses some advantages and alternatives to using RPy2.
ffbase, statistical functions for large datasetsEdwin de Jonge
This document introduces ffbase, an R package that adds statistical functions and utilities for working with large datasets stored in ff format. ffbase allows standard R code to be used on ff objects by rewriting expressions to operate chunkwise. It also connects ff data to other packages for large data analysis. The goal is to make working with large out-of-memory data more convenient and productive within the R environment.
This document discusses patterns for parallel computing. It outlines key concepts like Amdahl's law and types of parallelism like data and task parallelism. Examples are provided of how major tech companies like Microsoft, Google, Amazon implement parallelism at different levels of their infrastructure and applications to scale efficiently. Design principles are discussed for converting sequential programs to parallel programs while maintaining performance.
R is more than just a language. Many of the reasons why R has become such a popular tool for data science come from the ecosystem surrounding the R project. R users benefit from the many resources and packages created by the community, while commercial companies (including Microsoft) provide tools to extend and support R, and services to help people use R.
In this talk, I will give an overview of the R Ecosystem and describe how it has been a critical component of R’s success, and include several examples of Microsoft’s contributions to the ecosystem.
(Presented to EARL London, September 2016)
Iterative computations are at the core of the vast majority of data-intensive scientific computations. Recent advancements in data intensive computational fields are fueling a dramatic growth in number as well as usage of such data intensive iterative computations. The utility computing model introduced by cloud computing combined with the rich set of cloud infrastructure services offers a very viable environment for the scientists to perform data intensive computations. However, clouds by nature offer unique reliability and sustained performance challenges to large scale distributed computations necessitating computation frameworks specifically tailored for cloud characteristics to harness the power of clouds easily and effectively. My research focuses on identifying and developing user-friendly distributed parallel computation frameworks to facilitate the optimized efficient execution of iterative as well as non-iterative data-intensive computations in cloud environments, alongside the evaluation of heterogeneous cloud resources offering GPGPU resources in addition to CPU resources, for data-intensive iterative computations.
High Performance Parallel Computing with Clouds and Cloud Technologiesjaliyae
Infrastructure services (Infrastructure-as-a-service), provided by cloud vendors, allow any user to provision a large number of compute instances fairly easily. Whether leased from public clouds or allocated from private clouds, utilizing these virtual resources to perform data/compute intensive analyses requires employing different parallel runtimes to implement such applications. Among many parallelizable problems, most “pleasingly parallel” applications can be performed using MapReduce technologies such as Hadoop, CGL-MapReduce, and Dryad, in a fairly easy manner. However, many scientific applications, which have complex communication patterns, still require low latency communication mechanisms and rich set of communication constructs offered by runtimes such as MPI. In this paper, we first discuss large scale data analysis using different MapReduce implementations and then, we present a performance analysis of high performance parallel applications on virtualized resources.
This document summarizes a seminar on parallel computing. It defines parallel computing as performing multiple calculations simultaneously rather than consecutively. A parallel computer is described as a large collection of processing elements that can communicate and cooperate to solve problems fast. The document then discusses parallel architectures like shared memory, distributed memory, and shared distributed memory. It compares parallel computing to distributed computing and cluster computing. Finally, it discusses challenges in parallel computing like power constraints and programmability and provides examples of parallel applications like GPU processing and remote sensing.
Hadoop is famously scalable. Cloud Computing is famously scalable. R – the thriving and extensible open source Data Science software – not so much. But what if we seamlessly combined Hadoop, Cloud Computing, and R to create a scalable Data Science platform? Imagine exploring, transforming, modeling, and scoring data at any scale from the comfort of your favorite R environment. Now, imagine calling a simple R function to operationalize your predictive model as a scalable, cloud-based Web Service. Learn how to leverage the magic of Hadoop on-premises or in the cloud to run your R code, thousands of open source R extension packages, and distributed implementations of the most popular machine learning algorithms at scale.
Parallel computing involves solving computational problems simultaneously using multiple processors. It can save time and money compared to serial computing and allow larger problems to be solved. Parallel programs break problems into discrete parts that can be solved concurrently on different CPUs. Shared memory parallel computers allow all processors to access a global address space, while distributed memory systems require communication between separate processor memories. Hybrid systems combine shared and distributed memory architectures.
This presentation, by big data guru Bernard Marr, outlines in simple terms what Big Data is and how it is used today. It covers the 5 V's of Big Data as well as a number of high value use cases.
This document provides an overview of big data. It defines big data as large volumes of diverse data that are growing rapidly and require new techniques to capture, store, distribute, manage, and analyze. The key characteristics of big data are volume, velocity, and variety. Common sources of big data include sensors, mobile devices, social media, and business transactions. Tools like Hadoop and MapReduce are used to store and process big data across distributed systems. Applications of big data include smarter healthcare, traffic control, and personalized marketing. The future of big data is promising with the market expected to grow substantially in the coming years.
This document discusses using the doSNOW package in R to perform parallel programming and speed up simulations. It explains how to register clusters, use foreach loops with .combine functions, and load necessary packages within loops. Testing with different numbers of clusters shows speedups over serial execution, with optimal speedups achieved when the number of clusters matches or exceeds the number of cores. Processing jobs in parallel reduces the elapsed time for each job.
Parallel R in snow (english after 2nd slide)Cdiscount
This presentation discusses parallelizing computations in R using the snow package. It demonstrates how to:
1. Create a cluster with multiple R sessions using makeCluster()
2. Split data across the sessions using clusterSplit() and export data to each node
3. Write functions to execute in parallel on each node using clusterEvalQ()
4. Collect the results, such as by summing outputs, to obtain the final parallelized computation. As an example, it shows how to parallelize the likelihood calculation for a probit regression model, reducing the computation time.
This document provides an overview of machine learning concepts and code examples in Python. It discusses the typical 5 steps of machine learning projects: collaboration, data collection, clustering, classification, and conclusion. Code snippets demonstrate each step, including collecting data with Scrapy, clustering with k-means, classification with support vector machines, and evaluating results with a confusion matrix. Dimensionality reduction techniques like principal component analysis are also covered.
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk I'll describe several techniques available in R to speed up workloads like these, by running multiple iterations simultaneously, in parallel.
Many of these techniques require the use of a cluster of machines running R, and I'll provide examples of using cloud-based services to provision clusters for parallel computations. In particular, I will describe how you can use the SparklyR package to distribute data manipulations using the dplyr syntax, on a cluster of servers provisioned in the Azure cloud.
Presented by David Smith at Data Day Texas in Austin, January 27 2018.
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Spark Summit
The document provides an overview of Spark and its machine learning library MLlib. It discusses how Spark uses resilient distributed datasets (RDDs) to perform distributed computing tasks across clusters in a fault-tolerant manner. It summarizes the key capabilities of MLlib, including its support for common machine learning algorithms and how MLlib can be used together with other Spark components like Spark Streaming, GraphX, and SQL. The document also briefly discusses future directions for MLlib, such as tighter integration with DataFrames and new optimization methods.
The document discusses query execution in database management systems. It begins with an example query on a City, Country database and represents it in relational algebra. It then discusses different query execution strategies like table scan, nested loop join, sort merge join, and hash join. The strategies are compared based on their memory and disk I/O requirements. The document emphasizes that query execution plans can be optimized for parallelism and pipelining to improve performance.
The document discusses the benefits of declarative programming using Scala. It provides examples of implementing algorithms and data structures declaratively in Scala. It also discusses the history and future of Scala, as well as how Scala encourages thinking about programs as transformations rather than changes to memory.
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Yao Yao
https://github.com/yaowser/data_mining_group_project
https://www.kaggle.com/c/zillow-prize-1/data
From the Zillow real estate data set of properties in the southern California area, conduct the following data cleaning, data analysis, predictive analysis, and machine learning algorithms:
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regression Model Performance, Optimizing Support Vector Machine Classifier, Accuracy of results and efficiency, Logistic Regression Feature Importance, interpretation of support vectors, Density Graph
.Net 4.0 Threading and Parallel ProgrammingAlex Moore
This document discusses parallel programming in .NET 4.0. It notes that processor speeds have stopped increasing significantly and cores are increasing instead. It introduces parallel programming concepts in .NET 4.0 like PLINQ for declarative data parallelism, Parallel.For for imperative data parallelism, and tasks for splitting computations. It also discusses concurrent collections and provides examples of using PLINQ, Parallel.For, tasks, and continuations for parallel programming.
Shape Safety in Tensor Programming is Easy for a Theorem Prover -SBTB 2021Peng Cheng
This document discusses the Shapesafe project, which uses dependent types in Scala to enable type-safe linear algebra operations. It aims to push type safety to the extreme by exploring symbolic reasoning and weird operands. The author maintains Shapesafe uses the Curry-Howard isomorphism to translate proofs to functional programs. Moving forward, Shapesafe could benefit from Scala 3's improved type inference and implicit resolution, though some Shapeless features may need to be reimplemented. The end goal is to integrate Shapesafe into machine learning libraries to catch errors at compile-time.
This document provides an overview of functional programming concepts. It discusses why functional programming is useful for building concurrent and thread-safe applications. Key concepts explained include immutable data, first class and higher order functions, lazy evaluation, pattern matching, monads, and monoids. Code examples are provided in JavaScript and Haskell to demonstrate functional programming techniques.
An introduction to the OpenMP parallel programming model.
From the Scalable Computing Support Center at Duke University (http://wiki.duke.edu/display/scsc)
This talk was based on my Master's thesis which I had completed earlier that year. It gives an overview on how certain parallel dynamic programming can be computed in parallel efficiently, and what we want that to mean here.
The plots in "Performance Examples" show speedup S on the left and efficiency E on the right, both against input size.
Read more over here: http://reitzig.github.io/publications/Reitzig2012
St Petersburg R user group meetup 2, Parallel RAndrew Bzikadze
This document provides an overview of parallel computing techniques in R using various packages like snow, multicore, and parallel. It begins with motivation for parallelizing R given its limitations of being single-threaded and memory-bound. It then covers the snow package which enables explicit parallelism across computer clusters. The multicore package provides implicit parallelism using forking, but is deprecated. The parallel package acts as a wrapper for snow and multicore. It also discusses load balancing, random number generation, and provides examples of using snow and multicore for parallel k-means clustering and lapply.
Ubix is an integrated platform built on Apache Spark that allows users to ingest data from various sources, perform multiple analytics steps and transformations, and produce powerful interactive visualizations on both historical and streaming data. It contains over 170 functions for data wrangling, machine learning, graph processing, and visualization. While users could build their own Spark workflows, Ubix aims to simplify this process and provide an out-of-the-box platform for advanced analytics.
Peter Lawrey is the CEO of Chronicle Software. He has 7 years experience working as a Java developer for investment banks and trading firms. Chronicle Software helps companies migrate to high performance Java code and was involved in one of the first large Java 8 projects in production in December 2014. The company offers workshops, training, consulting and custom development services. The talk will cover reading and writing lambdas, capturing vs non-capturing lambdas, transforming imperative code to streams, mixing imperative and functional code, and taking Q&A.
The document discusses real-time big data management and Apache Flink. It provides an overview of Apache Flink, including its architecture, components, and APIs for batch and streaming data processing. It also provides examples of word count programs in Java, Scala, and Java 8 that demonstrate how to write Flink programs for batch and streaming data.
1. The document describes the implementation of a K-means clustering algorithm from scratch in Python. It includes data normalization, K-means++ initialization, and evaluation using the Silhouette method.
2. Various techniques are tested to improve the algorithm, including normalization to handle differently scaled features, and K-means++ initialization to avoid poor initial centroid locations.
3. The algorithm outputs the centroid locations, a plot of Silhouette scores against K values, and a 3D plot visualizing the clustered data points and centroids.
What can be done with Java, but should better be done with Erlang (@pavlobaron)Pavlo Baron
Erlang excels at building distributed, fault-tolerant, concurrent applications due to its lightweight process model and built-in support for distribution. However, Java is more full-featured and is generally a better choice for applications that require more traditional object-oriented capabilities or need to interface with existing Java libraries and frameworks. Both languages have their appropriate uses depending on the requirements of the specific application being developed.
Parallel Computing with R
1. Parallel Computing with R
Literature Seminar
Abhirup Mallik
malli066@umn.edu
School of Statistics
University of Minnesota
November 15, 2013
4. Parallel Computing with R
Why Parallel?
R does not take advantage of multiple cores by default.
R does not support passing arguments by reference.
R cannot read files dynamically, etc.
7. Parallel Computing with R
What is Parallel Computing with R?
'Parallel': doing more than one task at the same time.
Use different cores of the same CPU for different tasks.
Use different computers in a cluster for different tasks.
8. Parallel Computing with R
How to go Parallel?
Using Multicore (Implicit Parallelism)
The main process forks child processes, which run in parallel on different cores.
library(parallel)
mclapply(X, FUN, ...)
Or use
library(parallel)
# ... setup stuff ...
for (isplit in 1:nsplit) {
  # placeholder: substitute any R expression involving isplit
  mcparallel(some R expression involving isplit)
}
# in the parallel package the collector is mccollect(); collect() belonged to the older multicore package
out <- mccollect()
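The two forking patterns above can be tried on a toy task. The following is a minimal sketch, assuming a Unix-alike system (forking is not available on Windows) and using a hypothetical function `slow_square` that is not part of the original slides.

```r
library(parallel)

# Hypothetical toy task: square a number after a short pause.
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# Pattern 1: mclapply is a forked drop-in replacement for lapply.
res <- unlist(mclapply(1:4, slow_square, mc.cores = 2))

# Pattern 2: launch each job explicitly with mcparallel,
# then gather the results (in job order) with mccollect.
jobs <- lapply(1:4, function(i) mcparallel(slow_square(i)))
out <- unlist(mccollect(jobs), use.names = FALSE)
```

Both `res` and `out` should hold the squares of 1 to 4. `mclapply` is usually the more convenient choice; `mcparallel` is mainly useful when the jobs are not a simple map over a list.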
9. Parallel Computing with R
How to go Parallel?
Warnings:
All child processes compete for memory.
Closing the terminal or any graphical window kills only the parent.
'CTRL + C' kills the parent, not the children.
Kill the children manually if they are unresponsive.
10. Parallel Computing with R
How to go Parallel?
Using SNOW (Explicit Parallelism)
Make a cluster with any one of these options:
cl <- makeCluster(spec, type, ...)
cl <- makePSOCKcluster(names, ...)
cl <- makeForkCluster(nnodes, ...)
Export essential objects to the cluster (by name, as a character vector):
clusterExport(cl, c("var1", "fun1", ...))
Evaluate on the cluster:
clusterEvalQ(cl, expr)
parLapply(cl = NULL, X, fun, ...)
parSapply(cl = NULL, X, fun, ...)
Stop the cluster:
stopCluster(cl)
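The steps on this slide can be sketched end to end as follows. This is only an illustration on a local two-worker cluster; the objects `scale_by`, `helper` and `worker` are hypothetical and not from the original talk.

```r
library(parallel)

# 1. Make a cluster (makeCluster defaults to a PSOCK cluster).
cl <- makeCluster(2)

# Hypothetical objects that the workers will need.
scale_by <- 3
helper <- function(x) x * scale_by
worker <- function(x) helper(x) + 1

# 2. Export objects by *name*; PSOCK workers start with an empty workspace.
clusterExport(cl, c("scale_by", "helper"))

# 3. Evaluate on the cluster: clusterEvalQ runs an expression on every
#    worker, parSapply maps a function over the inputs.
invisible(clusterEvalQ(cl, library(stats)))
res <- parSapply(cl, 1:5, worker)

# 4. Stop the cluster to release the worker processes.
stopCluster(cl)
```

Without the `clusterExport` call the workers would fail to find `helper`, which is the main practical difference from the forking approach, where children inherit the parent's workspace.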
11. Parallel Computing with R
Demonstration
Using the Swiss fertility data from 1888 (included with base R).
> str(swiss)
'data.frame': 47 obs. of 6 variables:
 $ Fertility       : num 80.2 83.1 92.5 85.8 76.9 76.1 ...
 $ Agriculture     : num 17 45.1 39.7 36.5 43.5 35.3 ...
 $ Examination     : int 15 6 5 12 17 9 16 14 12 16 ...
 $ Education       : int 12 9 5 7 15 7 7 8 7 13 ...
 $ Catholic        : num 9.96 84.84 93.4 33.77 5.16 ...
 $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 ...
12. Parallel Computing with R
Demonstration
10-fold cross-validation
# assign each of the 47 rows to one of 10 folds at random
fold <- sample(seq(1, 10), size = nrow(swiss), replace = TRUE)
Cross-validation for the i-th fold
library(randomForest)  # provides randomForest()
fold.cv <- function(i) {
  # hold out fold i, train on the rest
  train <- swiss[fold != i, ]
  test <- swiss[fold == i, ]
  swiss.rf <- randomForest(sqrt(Fertility) ~ . - Catholic + I(Catholic < 50),
                           data = train)
  predict.test <- predict(swiss.rf, test, type = "response")
  actual.test <- sqrt(test$Fertility)
  err <- predict.test - actual.test
  sum(err * err)  # sum of squared errors on the held-out fold
}
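Before parallelizing, it can help to have a serial baseline that runs with base R alone. The sketch below is an assumption-laden variant, not the slide's method: it swaps `randomForest` for `lm` so no extra package is needed, and it uses a balanced fold assignment (`fold_balanced`, a hypothetical name) so that no fold can end up empty.

```r
set.seed(123)

# Balanced alternative to sampling with replacement:
# each fold label 1..10 appears 4 or 5 times across the 47 rows.
fold_balanced <- sample(rep(1:10, length.out = nrow(swiss)))

# Same structure as fold.cv, with lm standing in for randomForest.
fold_cv_lm <- function(i) {
  train <- swiss[fold_balanced != i, ]
  test <- swiss[fold_balanced == i, ]
  fit <- lm(sqrt(Fertility) ~ . - Catholic + I(Catholic < 50), data = train)
  err <- predict(fit, test) - sqrt(test$Fertility)
  sum(err * err)
}

# Serial run over the 10 folds, then an overall cross-validated RMSE.
sse <- sapply(1:10, fold_cv_lm)
rmse <- sqrt(sum(sse) / nrow(swiss))
```

Replacing `sapply` here with `mclapply` or `parLapply` is then the only change needed to run the folds in parallel.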
13. Parallel Computing with R
Demonstration
How to create a cluster?
Create a local cluster of 4 workers (PSOCK, i.e. parallel socket):
cl <- makePSOCKcluster(4)
Create a local fork cluster on different cores of the CPU (8 cores):
cl <- makeForkCluster(8)
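The worker counts above (4 and 8) are specific to the presenter's machine. A minimal sketch of choosing the count and the cluster type at run time might look like this; capping at two workers and leaving one core free are illustrative assumptions, not recommendations from the slides.

```r
library(parallel)

# detectCores() can return NA on some platforms, so fall back to one core.
n_cores <- detectCores()
if (is.na(n_cores)) n_cores <- 1L

# Leave one core free and cap at two workers for this small illustration.
n_workers <- max(1L, min(2L, n_cores - 1L))

# Fork clusters exist only on Unix-alikes; PSOCK clusters work everywhere.
cl <- if (.Platform$OS.type == "unix") {
  makeForkCluster(n_workers)
} else {
  makePSOCKcluster(n_workers)
}

# Each worker is a separate process with its own PID.
worker_pids <- unlist(clusterEvalQ(cl, Sys.getpid()))
stopCluster(cl)
```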
14. Parallel Computing with R
Demonstration
How to create a cluster in our LAB?
Set up password-less login using ssh-keygen (from the shell):
ssh-keygen -t dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Check which computers are running:
grephosts LAB
# Then ssh once into each computer you want to connect to; it will be remembered for the session.
Now we are ready to make a cluster:
library(parallel)
machines <- c("crab", "sugar", "strike", "hyland", "lovejoy", "driller")
address <- rapply(lapply(machines, nsl), c)
cl <- makePSOCKcluster(address)
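Once the cluster is up, it is worth confirming which machine each worker is running on. The lab hosts are not reachable outside the department, so the sketch below assumes "localhost" twice as a stand-in for the machine names; on the lab network the address vector from above would take its place.

```r
library(parallel)

# Stand-in host list; replace with the lab addresses when on the lab network.
hosts <- c("localhost", "localhost")
cl <- makePSOCKcluster(hosts)

# Ask every worker for the name of the machine it is running on.
nodes <- unlist(clusterCall(cl, function() Sys.info()[["nodename"]]))
stopCluster(cl)
```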
15. Parallel Computing with R
Demonstration
How to create a cluster in our LAB?
If you are connecting to stat.umn.edu from your own computer, create a password-less ssh session as follows:
ssh-keygen -t dsa
# Then use scp to copy id_dsa.pub to ~/.ssh/authorized_keys on the remote machine
16. Parallel Computing with R
Demonstration
Comparison
On cluster:
> system.time({
+   garbage <- clusterEvalQ(cl, data(swiss))
+   garbage <- clusterEvalQ(cl, library(randomForest))
+   clusterExport(cl, c("fold", "fold.cv"))
+   clusterSetRNGStream(cl, 123)
+   res3 <- do.call(c, parLapply(cl, 1:10, fold.cv))
+   stopCluster(cl)
+ })
   user  system elapsed
  0.008   0.000   0.838
On Multicore:
> system.time({
+   res1 <- do.call(c, mclapply(1:10, fold.cv, mc.cores = 8))
+ })
   user  system elapsed
  0.386   0.162   0.120
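The clusterSetRNGStream call in the cluster version is what makes the random parts of the computation reproducible across runs. A minimal sketch of that effect, using a hypothetical draw function rather than the random forest fit:

```r
library(parallel)

# Each task draws one uniform random number on whichever worker runs it.
draw <- function(i) runif(1)

cl <- makePSOCKcluster(2)

# Seeding the workers' RNG streams twice with the same seed
# should give the same draws both times.
clusterSetRNGStream(cl, 123)
first <- unlist(parLapply(cl, 1:4, draw))

clusterSetRNGStream(cl, 123)
second <- unlist(parLapply(cl, 1:4, draw))

stopCluster(cl)
```

Calling set.seed in the parent would not have this effect, because the workers are separate R processes with their own random number generators.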
17. Parallel Computing with R
Demonstration
Using Fork cluster:
> system.time({
+   cl <- makeForkCluster(8)
+   garbage <- clusterEvalQ(cl, data(swiss))
+   garbage <- clusterEvalQ(cl, library(randomForest))
+   clusterExport(cl, c("fold", "fold.cv"))
+   clusterSetRNGStream(cl, 123)
+   res3 <- do.call(c, parLapply(cl, 1:10, fold.cv))
+   stopCluster(cl)
+ })
   user  system elapsed
  0.010   0.054   0.153
Without any parallelization:
> system.time({
+   res2 <- do.call(c, lapply(1:10, fold.cv))
+ })
   user  system elapsed
  0.233   0.000   0.235
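A comparison of this kind can be reproduced on any machine with a toy task. The sketch below assumes a hypothetical slow_square function whose cost is a fixed sleep, and it times only the computation, not the cluster start-up; actual timings will vary by machine.

```r
library(parallel)

# Toy task: the sleep stands in for real per-task work.
slow_square <- function(x) {
  Sys.sleep(0.2)
  x^2
}

cl <- makePSOCKcluster(2)

# Time the same four tasks serially and on the two-worker cluster.
t_serial   <- system.time(res_serial   <- lapply(1:4, slow_square))
t_parallel <- system.time(res_parallel <- parLapply(cl, 1:4, slow_square))

stopCluster(cl)

# Elapsed (wall-clock) time is the number that matters for the comparison.
elapsed <- c(serial = t_serial[["elapsed"]], parallel = t_parallel[["elapsed"]])
print(elapsed)
```

As in the timings above, user and system time measured in the parent say little about a cluster run, since the work happens in other processes.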
21. Parallel Computing with R
When to go Parallel?
When the gain from parallelization is much greater than the cost of data transfer, network delays, etc.
If the problem is embarrassingly parallel: there is no dependency between the parallel tasks.
Cross-validation and bootstrapping are examples where going parallel works well.
For iterative numerical methods like coordinate descent or Newton-Raphson, where each step depends on the result of the previous one, going parallel may not be possible.
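Bootstrapping is a convenient illustration of an embarrassingly parallel problem, because every resample is independent of the others. The following is a minimal sketch on a local two-worker cluster; boot_mean, the number of resamples, and the choice of the Fertility column are illustrative assumptions.

```r
library(parallel)

# One bootstrap replicate: resample the data with replacement, take the mean.
# The first argument is the replicate index and is otherwise unused.
boot_mean <- function(b, values) mean(sample(values, replace = TRUE))

cl <- makePSOCKcluster(2)
clusterSetRNGStream(cl, 123)  # reproducible random streams on the workers

# 200 independent replicates, spread across the workers.
boot_means <- parSapply(cl, 1:200, boot_mean, values = swiss$Fertility)
stopCluster(cl)

# Bootstrap estimate of the standard error of the mean.
boot_se <- sd(boot_means)
```

A Newton-Raphson iteration could not be split this way, since replicate b + 1 would need the output of replicate b before it could start.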
22. Parallel Computing with R
To infinity and beyond
What is beyond the wall?
Parallelization in big-data frameworks: RHadoop
Other related implementations of parallelization: MPI, NWS, etc.
Other useful libraries: foreach, snowfall, etc.
GPUs!
23. Parallel Computing with R
Where to get the code?
All the code in this presentation is available at:
https://github.com/abhirupkgp/parallelseminar/blob/master/cv.R
24. Parallel Computing with R
References
Acknowledgements and References
Sincere thanks to Charles Geyer
Resourceful slides by Ryan Rosario.
Some other and more resourceful slides.
Parallel R Book