Performance and Scale Options for R with Hadoop: A comparison of potential ar... – Revolution Analytics
R and Hadoop go together. In fact, they go together so well, that the number of options available can be confusing to IT and data science teams seeking solutions under varying performance and operational requirements.
Which configuration is faster for big files? Which is faster for sharing data and servers among groups? Which eliminates data movement? Which is easiest to manage? Which works best with iterative and multistep algorithms? What are the hardware requirements of each alternative?
This webinar is intended to help new users of R with Hadoop select their best architecture for integrating Hadoop and R, by explaining the benefits of several popular configurations, their performance potential, workload handling and programming model and administrative characteristics.
Presenters from Revolution Analytics will describe the options for using Revolution R Open and Revolution R Enterprise with Hadoop, including servers, edge nodes, rHadoop and ScaleR. We'll then compare each configuration's performance, as well as its programming model, administration, data movement, ease of scaling, mixed workload handling, and performance for large individual analyses versus mixed workloads.
Presentation given by US Chief Scientist, Mario Inchiosa, at the June 2013 Hadoop Summit in San Jose, CA.
ABSTRACT: Hadoop is rapidly being adopted as a major platform for storing and managing massive amounts of data, and for computing descriptive and query types of analytics on that data. However, it has a reputation for not being a suitable environment for high performance complex iterative algorithms such as logistic regression, generalized linear models, and decision trees. At Revolution Analytics we think that reputation is unjustified, and in this talk I discuss the approach we have taken to porting our suite of High Performance Analytics algorithms to run natively and efficiently in Hadoop. Our algorithms are written in C++ and R, and are based on a platform that automatically and efficiently parallelizes a broad class of algorithms called Parallel External Memory Algorithms (PEMAs). This platform abstracts both the inter-process communication layer and the data source layer, so that the algorithms can work in almost any environment in which messages can be passed among processes and with almost any data source. MPI and RPC are two traditional ways to send messages, but messages can also be passed using files, as in Hadoop. I describe how we use the file-based communication choreographed by MapReduce and how we efficiently access data stored in HDFS.
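The chunked compute-and-combine pattern the abstract describes can be sketched with a toy example. This is a Python illustration only, not Revolution's implementation (their PEMAs are C++ and R): each data chunk yields sufficient statistics, and an associative combine step merges them, so no two chunks ever need to be in memory together and the combine can run under MapReduce's file-based choreography.

```python
# Toy Parallel External Memory Algorithm: mean and variance of a data set,
# computed from per-chunk sufficient statistics that combine associatively.

def map_chunk(chunk):
    """Per-chunk sufficient statistics: (count, sum, sum of squares)."""
    n = len(chunk)
    s = sum(chunk)
    ss = sum(x * x for x in chunk)
    return (n, s, ss)

def reduce_stats(a, b):
    """Combine two partial results; order of combination does not matter."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def finalize(stats):
    """Turn the combined statistics into the final answers."""
    n, s, ss = stats
    mean = s / n
    var = ss / n - mean * mean  # population variance
    return mean, var

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
chunks = [data[0:2], data[2:4], data[4:6]]  # stands in for HDFS splits

partials = [map_chunk(c) for c in chunks]   # the "map" side
combined = partials[0]
for p in partials[1:]:                      # the "reduce" side
    combined = reduce_stats(combined, p)

mean, var = finalize(combined)
```

Because `reduce_stats` is associative and commutative, partial results can be merged in any order and at any level of a tree, which is what makes the pattern fit both MPI-style and file-based (MapReduce) communication.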
High Performance Predictive Analytics in R and Hadoop – DataWorks Summit
27 Aug 2013 Webinar: High Performance Predictive Analytics in Hadoop and R, presented by Mario E. Inchiosa, PhD, US Data Scientist, and Kathleen Rohrecker, Director of Product Marketing.
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation – Revolution Analytics
Slides from Joseph Rickert's presentation at Strata NYC 2013
"Using R and Hadoop for Statistical Computation at Scale"
http://strataconf.com/stratany2013/public/schedule/detail/30632
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit... – Revolution Analytics
Presented by David Smith, Chief Community Officer, Revolution Analytics, at the Gartner Business Intelligence and Analytics Summit, April 2014.
In this presentation, I'll introduce the open source R language — the modern standard for Data Science — and the enhanced performance, scalability and ease-of-use capabilities of Revolution R Enterprise. Customer case studies will illustrate Revolution R Enterprise as a component of the real-time analytics deployment process, via integration with Hadoop, database warehousing systems and Cloud platforms, to implement data-driven end-user applications.
Analysts predict that the Hadoop market will reach $50.2 billion USD by 2020.[1] Applications driving these large expenditures are some of the most important workloads for businesses today, including:
• Analyzing clickstream data, including site-side clicks and web media tags.
• Measuring sentiment by scanning product feedback, blog feeds, social media comments, and Twitter streams.
• Analyzing behavior and risk by capturing vehicle telematics.
• Optimizing product performance and utilization by gathering data from built-in sensors.
• Tracking and analyzing people and material movement with location-aware systems.
• Identifying system performance issues and intrusion attempts by analyzing server and network logs.
• Enabling automatic document and speech categorization.
• Extracting learning from digitized images, voice, video, and other media types.
Predictive analytics on large data sets provides organizations with a key opportunity to improve a broad variety of business outcomes, and many have embraced Apache Hadoop as the platform of choice.
In the last few years, large businesses have adopted Apache Hadoop as a next-generation data platform, one capable of managing large data assets in a way that is flexible, scalable, and relatively low cost. However, to realize predictive benefits of big data, organizations must be able to develop or hire individuals with the requisite statistics skills, then provide them with a platform for analyzing massive data assets collected in Hadoop “data lakes.”
As users adopted Hadoop, many discovered that performance and complexity limited its suitability for broad predictive analytics work. In response, the Hadoop community has focused on the Apache Spark platform to provide Hadoop with significant performance improvements. With Spark atop Hadoop, users can leverage Hadoop's big-data management capabilities while achieving new performance levels by running analytics in Apache Spark.
One challenge remains: conquering the complexity of Hadoop when developing predictive analytics applications.
In this white paper, we’ll describe how Microsoft R Server helps data scientists, actuaries, risk analysts, quantitative analysts, product planners, and other R users to capture the benefits of Apache Spark on Hadoop by providing a straightforward platform that eliminates much of the complexity of using Spark and Hadoop to conduct analyses on large data assets.
In-Database Analytics Deep Dive with Teradata and Revolution – Revolution Analytics
Teradata and Revolution Analytics worked together to develop in-database analytical capabilities for Teradata Database. Teradata v14.10 provides a foundation for in-database analytics in Teradata. Revolution Analytics has ported its Revolution R Enterprise (RRE) Version 7.1 to use the in-database capabilities of version 14.10. With RRE inside Teradata, users can run fully parallelized algorithms in each node of the Teradata appliance to achieve performance and data scale heretofore unavailable. We'll get past the marketecture quickly and dive into a "how it really works" presentation, review implications for system configuration and administration, and then take questions from Teradata users who will be charged with deploying and administering Teradata systems as platforms for big data analytics inside the database engine.
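The "run the algorithm inside each node" idea amounts to swapping the execution backend while the analysis code stays the same. Here is a hypothetical Python sketch of that compute-context pattern (the class and function names are invented for illustration; RRE exposes this through its own compute-context mechanism): the per-partition function is identical in both contexts, and only where it runs differs.

```python
# Hypothetical sketch: the same analysis runs under two compute contexts.

class LocalContext:
    """Pull the data to the compute: everything runs in one local process."""
    def run(self, fn, partitions):
        return [fn(p) for p in partitions]

class InDatabaseContext:
    """Stand-in for pushing the compute to the data: in a real deployment,
    each database node would run fn on its local partition in parallel."""
    def run(self, fn, partitions):
        # Simulated here; no data would actually leave the nodes.
        return [fn(p) for p in partitions]

def row_count(partition):
    """A trivially parallel per-partition analysis step."""
    return len(partition)

partitions = [[1, 2, 3], [4, 5], [6]]  # stands in for per-node data slices

for ctx in (LocalContext(), InDatabaseContext()):
    parts = ctx.run(row_count, partitions)  # per-node partial results
    total = sum(parts)                      # combine into one answer
```

The payoff of this separation is that an analysis written once against the context interface can move from a workstation to the appliance without being rewritten.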
There is one consistent message we hear from customers across industries and around the world: "We would like to reduce our reliance on SAS." In this webinar, we review the top reasons customers cite for moving from SAS to R; the benefits of open source analytics; the challenges of switching; and the tools you will need to build your own roadmap. We review the key differences between SAS and R from the user's perspective, and provide you with the tools to move forward.
Big Data Analytics on Teradata with Revolution R Enterprise – Bill Jacobs
Revolution Analytics brings big data analytics to Teradata database. Presentation from Teradata Partners, October 2013 overviewing Revolution R Enterprise for Teradata by Bill Jacobs, Director, Product Marketing, Revolution Analytics.
Big Data refers to large volumes of data, both structured and unstructured. Managing and analyzing data at this scale calls for technologies like Hadoop and languages like R.
http://www.techsparks.co.in/thesis-in-big-data-with-r/
This session will demonstrate how the all-star line-up featuring R and Storm enables real-time processing on massive data sets; a real home run! The presenters will use actual baseball data and a real-world use case to compose an implementation of the use case as Storm components (spouts, bolts, etc.) and highlight how R can be an effective tool in prototyping a solution. Attendees will leave the session with information that could easily be applied for other use cases such as video game analytics, fraud detection, intrusion detection, and consumer propensity to buy calculations.
The business need for real-time analytics at large scale has focused attention on the use of Apache Storm, but an approach that is sometimes overlooked is the use of Storm and R together. This novel combination of real-time processing with Storm and the practical but powerful statistical analysis offered by R substantially extends the usefulness of Storm as a solution to a variety of business critical problems. By architecting R into the Storm application development process, Storm developers can be much more effective. The aim of this design is not necessarily to deploy faster code but rather to deploy code faster. Just a few lines of R code can be used in place of lengthy Storm code for the purpose of early exploration – you can easily evaluate alternative approaches and quickly make a working prototype.
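The prototyping claim above, that a few lines in a high-level language can stand in for lengthy Storm code during early exploration, can be illustrated with a made-up "bolt". This is a Python sketch (the talk uses R); the generator below plays the role a rolling-mean bolt would play in a topology, consuming a stream of values and emitting the mean of the last k:

```python
# Hypothetical prototype of a Storm bolt's logic: a rolling mean over a stream.
from collections import deque

def rolling_mean_bolt(stream, k=3):
    """Consume values one at a time, emit the mean of the last k seen."""
    window = deque(maxlen=k)  # old values fall off automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

out = list(rolling_mean_bolt([2, 4, 6, 8], k=3))
```

Once the windowing logic is validated on recorded data like this, porting it to an actual Storm bolt is a translation exercise rather than a design exercise, which is exactly the "deploy code faster" point.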
R is free software for data analysis and graphics that is similar to SAS and SPSS. Two million people are part of the R Open Source Community. Its use is growing very rapidly and Revolution Analytics distributes a commercial version of R that adds capabilities that are not available in the Open Source version. This 60-minute webinar is for people who are familiar with SAS or SPSS who want to know how R can strengthen their analytics strategy.
Presented by: Joseph Rickert, Data Scientist Community Manager, Revolution Analytics, Sep 25 2014.
Whenever data scientists are asked what software they use, R always comes up at the top of the list. In one recent survey, only SQL was rated higher than R. In this webinar we will explore what makes R so popular and useful. Starting with the big picture, we describe how R is organized and how to find your way around the R world. Then we will work through some examples highlighting features of R that make it attractive for data science work, including:
Acquiring data
Data manipulation
Exploratory data analysis
Model building
Machine learning
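The steps listed above can be sketched end-to-end on toy data. This is a minimal Python illustration with made-up numbers (the webinar itself works the examples in R): data comes in, is reshaped, summarized, and finally fitted with a simple least-squares model.

```python
# Minimal end-to-end sketch of the workflow: acquire, manipulate,
# explore, and model a tiny monthly series.

# 1. Acquiring data (inline here; in practice read from a file, database, or API)
raw = [("2014-01", 10.0), ("2014-02", 12.0), ("2014-03", 14.5), ("2014-04", 16.0)]

# 2. Data manipulation: extract numeric x/y pairs from the records
xs = list(range(len(raw)))
ys = [y for _, y in raw]

# 3. Exploratory summary
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# 4/5. Model building: ordinary least squares fit of y = a + b*x
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x  # intercept recovered from the means
```

In R the modeling step collapses to a one-liner like `lm(y ~ x)`, which is part of why the language is attractive for this kind of exploratory work.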
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ... – Revolution Analytics
[Presentation by Skylar Lyon at DataWeek 2014, September 17 2014.]
I recently faced the task of scaling out an existing analytics process. The schedule was compressed - it always is in my world. The data was big - 400+ million rows waiting in a database. What did I do? I offered my favorite type of solution - quick and dirty.
At the outset, I wasn't sure how easy it would be. Nor was I certain of realized performance gains. But the concept seemed sound and the exercise fun. Let's move the compute to the data via Revolution R Enterprise for Teradata.
This presentation outlines my approach in leveraging a colleague's R models as I experimented with running R in-database. Would my path lead to significant improvement? Could it be used to productionize the workflow?
In this presentation from Revolution Analytics, Bill Jacobs presents: Are You Ready for Big Data Analytics?
"Revolution Analytics delivers advanced analytics software at half the cost of existing solutions. By building on open source R—the world's most powerful statistics software—with innovations in big data analysis, integration and user experience, Revolution Analytics meets the demands and requirements of modern data-driven businesses."
Learn more: http://www.revolutionanalytics.com
Watch the presentation video: http://wp.me/p3RLEV-12S
Applications in R - Success and Lessons Learned from the Marketplace – Revolution Analytics
Adoption of the R language has grown rapidly in the last few years, and is ranked as the number-one data science language in several surveys. This accelerating R adoption curve has been driven by the Big Data revolution, and the fact that so many data scientists — having learned R at university — are actively unlocking the secrets hidden in these new, vast data troves.
In this webinar David Smith, Chief Community Officer, will take a look at the growth of R and the innovative uses of R in business, government and non-profit sectors. Then Neera Talbert, Vice President, Professional Services will take you into the trenches of recent customer deployments and share best practices and pitfalls to avoid in deploying or expanding your own R applications.
Big Data in Action – Real-World Solution Showcase – Inside Analysis
The Briefing Room with Radiant Advisors and IBM
Live Webcast on February 25, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=53c9b7fa2000f98f5b236747e3602511
The power of Big Data depends heavily upon the context in which it's used, and most organizations are just beginning to figure out where, how and when to leverage it. One key to success is integration with existing information systems, many of which still rely on relational database technologies. Finding ways to blend these two worlds can help companies generate measurable business value in fairly short order.
Register for this episode of The Briefing Room to hear Analysts Lindy Ryan and John O'Brien as they explain how the combination of traditional Business Intelligence with Big Data Analytics can provide game-changing results in today's information economy. They'll be briefed by Eric Poulin and Paul Flach of Stream Integration, who will share best practices for designing and implementing Big Data solutions. They'll discuss the components of IBM BigInsights, and explain how BigSheets can empower non-technical users who need to explore semi-structured data.
Visit InsideAnalysis.com for more information.
Future of Enterprise PaaS (Cloud Foundry Summit 2014) – VMware Tanzu
Keynote delivered by Steve Winkler, Open Cloud Strategy & Dirk Basenach, VP Development at SAP.
There are many approaches to running enterprise applications in the cloud, and SAP has made a strategic choice to leverage the Open Source solution Cloud Foundry for this purpose. This presentation will provide details on SAP’s approach to Open Source in the cloud, focusing on the PaaS layer and showing how Cloud Foundry can be used to extend existing SAP products and to develop entirely new enterprise applications. Moreover, the presentation will show how the SAP HANA in-memory platform and Cloud Foundry will come together to provide an enterprise-grade, real-time open platform in the cloud.
BIG Data & Hadoop Applications in Social Media – Skillspeed
Explore the applications of BIG Data & Hadoop in Social Media via Skillspeed.
BIG Data & Hadoop in Social Media is a key differentiator, especially in terms of generating memorable customer experiences.
Herein, we discuss how leading social networks such as Facebook, Twitter, Pinterest, LinkedIn, Instagram & StumbleUpon utilize Hadoop.
To get more details regarding BIG Data & Hadoop, please visit - www.SkillSpeed.com
Game Changed – How Hadoop is Reinventing Enterprise Thinking – Inside Analysis
The Briefing Room with Dr. Robin Bloor and RedPoint Global
Live Webcast on April 8, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=cfa1bffdd62dc6677fa225bdffe4a0b9
The innovation curve often arcs slowly before picking up speed. Companies that harness a major transformation early in the game can make serious headway before challengers enter the picture. The world of Hadoop features several of these upstarts, each of which uses the open-source foundation as an engine to drive vastly greater performance to a wide range of services, and even create new ones.
Register for this episode of The Briefing Room to hear veteran Analyst Dr. Robin Bloor explain how the Hadoop engine is being used to architect a new generation of enterprise applications. He’ll be briefed by George Corugedo, RedPoint Global CTO and Co-founder, who will showcase how enterprises can cost-effectively take advantage of the scalability, processing power and lower costs that Hadoop 2.0/YARN applications offer by eliminating the long-term expense of hiring MapReduce programmers.
Visit InsideAnalysis.com for more information.
3 Benefits of Multi-Temperature Data Management for Data Analytics – MapR Technologies
SAP® HANA and SAP® IQ are popular platforms for various analytical and transactional use cases. If you’re an SAP customer, you’ve experienced the benefits of deploying these solutions. However, as data volumes grow, you’re likely asking yourself: How do I scale storage to support these applications? How can I have one platform for various applications and use cases?
C-BAG Big Data Meetup, Chennai, Oct. 29, 2014: Hortonworks and Concurrent on Casca... – Hortonworks
Big Data is moving to the next level of maturity and it’s all about the applications. Dhruv Kumar, one of the minds behind Cascading, the most widely used and deployed development framework for building Big Data applications, will discuss how Cascading can enable developers to accelerate the time to market for their data applications, from development to production. In this session, Dhruv will introduce how to easily and reliably develop, test, and scale your data applications and then deploy them on Hadoop and Hortonworks Data Platform. He will show a demo using the Hortonworks Sandbox and Cascading. Recording is at
https://hortonworks.webex.com/hortonworks/lsr.php?RCID=e5582bcbc0516d35fc2dcf0bce86146e
The innovation provided by the Cloud Foundry community aligns very well with innovation occurring inside SAP, and both are gaining significant market momentum. Learn about SAP’s involvement with Cloud Foundry, its PaaS strategy built on SAP HANA Cloud Platform, and its commitment to the open source approach overall, in this 2014 Cloud Foundry Summit presentation by Dirk Basenach and Steve Winkler.
Getting started with Hadoop on the Cloud with BluemixNicolas Morales
Silicon Valley Code Camp -- October 11, 2014.
Session: Getting started with Hadoop on the Cloud.
Hadoop and Cloud is an almost perfect marriage. Hadoop is a distributed computing framework that leverages a cluster built on commodity hardware. The Cloud simplifies provisioning of machines and software. Getting started with Hadoop on the Cloud makes it simple to provision your environment quickly and actually get started using Hadoop. IBM Bluemix has democratized Hadoop for the masses! This session will provide a brief introduction to what Hadoop is, how does cloud work and will then focus on how to get started via a series of demos. We will conclude with a discussion around the tutorials and public datasets - all of the tools needed to get you started quickly.
Learn more about BigInsights for Hadoop: https://developer.ibm.com/hadoop/
Revolution Analytics - Presentation at Hortonworks Booth - Strata 2014Hortonworks
Join Revolution Analytics and Hortonworks during this interactive presentation to discuss how customers are using Hadoop and R in the real world. We’ll show an end-to-end customer churn analytics demonstration (leveraging Revolution Analytics, Hortonworks and Tableau) serving three user personas: a website visitor, a data scientist and a business analyst.
The Briefing Room with William McKnight and Actian
Live Webcast on October 14, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=135528d85baa96a07850bd35961d459d
Integrating Hadoop with existing data sources, workflows and analytics can be a real challenge. While some components, like Hive and Spark, can give SQL access to Hadoop data, there isn’t much that enables Hadoop to be treated as a genuine BI and analytics platform, capable of running multiple jobs that serve multiple users and multiple applications. But what if you could turn Hadoop into a versatile, high performance development platform, forgoing all the pain of figuring out how and where to manage big data?
Register for this episode of The Briefing Room to hear veteran Analyst William McKnight as he discusses the fairly swift evolution of Hadoop’s capabilities. He’ll be briefed by Jim Hare of Actian, who will tout his company’s latest addition to its Analytic Platform: Hadoop SQL Edition. He will show how Actian has leveraged Hadoop and its scale out file system to create a fully functioning platform, providing everything from an analytic database to machine learning.
Visit InsideAnlaysis.com for more information.
Presented to eRum (Budapest), May 2018
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk I'll describe the doAzureParallel package, a backend to the "foreach" package that automates the process of spawning a cluster of virtual machines in the Azure cloud to process iterations in parallel. This will include an example of optimizing hyperparameters for a predictive model using the "caret" package.
By David Smith. Presented at Microsoft Build (Seattle), May 7 2018.
Your data scientists have created predictive models using open-source tools, proprietary software, or some combination of both, and now you are interested in lifting and shifting those models to the cloud. In this talk, I'll describe how data scientists can transition their existing workflows — while using mostly the same tools and processes — to train and deploy machine learning models based on open source frameworks to Azure. I'll provide guidance on keeping connections to data sources up-to-date, evaluating and monitoring models, and deploying applications that make use of those models.
Presentation delivered by David Smith to NY R Conference https://www.rstats.nyc/, April 2018:
Minecraft is an open-world creativity game, and a hit with kids. To get kids interested in learning to program with R, we created the "miner" package. This package is a collection of simple functions that allow you to connect with a Minecraft instance, manipulate the world within by creating blocks and controlling the player, and to detect events within the world and react accordingly.
The miner package is intended mainly for kids, to inspire them to learn R while playing Minecraft. But the development of the package also provides some useful insights into how to build an R package to interface with a persistent API, and how to instruct others on its use. In this talk I'll describe how to set up your own Minecraft server, and how to use and extend the package. I'll also provide a few examples of the package in action in a live Minecraft session.
While Python is a widely-used tool for AI development, in this talk I'll make the case for considering R as a platform for developing models for intelligent applications. Firstly, R provides a first-class experience working deep learning frameworks with its keras integration. Equally importantly, it provides the most comprehensive suite of statistical data analysis tools, which are extremely useful for many intelligent applications such as transfer learning. I'll give a few high-level examples in this talk, and we'll go into further detail in the accompanying interactive code lab.
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk I'll describe several techniques available in R to speed up workloads like these, by running multiple iterations simultaneously, in parallel.
Many of these techniques require the use of a cluster of machines running R, and I'll provide examples of using cloud-based services to provision clusters for parallel computations. In particular, I will describe how you can use the SparklyR package to distribute data manipulations using the dplyr syntax, on a cluster of servers provisioned in the Azure cloud.
Presented by David Smith at Data Day Texas in Austin, January 27 2018.
A look at the changing perceptions of R, from the early days of the R project to today. Microsoft sponsor talk, presented by David Smith to the useR!2017 conference in Brussels, July 5 2017.
Predicting Loan Delinquency at One Million Transactions per SecondRevolution Analytics
Real-time applications of predictive models must be able to generate predictions at the rate that transactions are generated. Previously, such applications of models trained using R needed to be converted to other languages like C++ or Java to achieve the required throughput. In this talk, I’ll describe how to use the in-database R processing capabilities of Microsoft R Server to detect fraud in a SQL Server database of loan records at a rate exceeding one million transactions per second. I will also show the process of training the underlying gradient-boosted tree model on a large training set using the out-of-memory algorithms of Microsoft R.
Presented by David Smith at The Data Science Summit, Chicago, April 20 2017.
The ability to independently reproduce results is a critical issue within the scientific community today, and is equally important for collaboration and compliance in business. In this talk, I'll introduce several features available in R that help you make reproducibility a standard part of your data science workflow. The talk will include tips on working with data and files, combining code and output, and managing R's changing package ecosystem.
Presented by David Smith, R Community Lead (Microsoft), at Monktoberfest October 2016.
The value of open source isn’t just in the software itself. The communities that form around open source software provide just as much value and sometimes even more: in ongoing development, in documentation, in support, in marketing, and as a supply of ready-trained employees. Companies who build on open source tend to focus on the software, but neglect communities at their peril.
In this talk, I share some of my experiences in building community for an open-source software company, Revolution Analytics, and perspectives since the acquisition by Microsoft in 2015.
R is more than just a language. Many of the reasons why R has become such a popular tool for data science come from the ecosystem surrounding the R project. R users benefit from the many resources and packages created by the community, while commercial companies (including Microsoft) provide tools to extend and support R, and services to help people use R.
In this talk, I will give an overview of the R Ecosystem and describe how it has been a critical component of R’s success, and include several examples of Microsoft’s contributions to the ecosystem.
(Presented to EARL London, September 2016)
(Presented by David Smith at useR!2016, June 2016. Recording: https://channel9.msdn.com/Events/useR-international-R-User-conference/useR2016/R-at-Microsoft )
Since the acquisition of Revolution Analytics in April 2015, Microsoft has embarked upon a project to build R technology into many Microsoft products, so that developers and data scientists can use the R language and R packages to analyze data in their data centers and in cloud environments.
In this talk I will give an overview (and a demo or two) of how R has been integrated into various Microsoft products. Microsoft data scientists are also big users of R, and I'll describe a couple of examples of R being used to analyze operational data at Microsoft. I'll also share some of my experiences in working with open source projects at Microsoft, and my thoughts on how Microsoft works with open source communities including the R Project.
Hadoop is famously scalable. Cloud Computing is famously scalable. R – the thriving and extensible open source Data Science software – not so much. But what if we seamlessly combined Hadoop, Cloud Computing, and R to create a scalable Data Science platform? Imagine exploring, transforming, modeling, and scoring data at any scale from the comfort of your favorite R environment. Now, imagine calling a simple R function to operationalize your predictive model as a scalable, cloud-based Web Service. Learn how to leverage the magic of Hadoop on-premises or in the cloud to run your R code, thousands of open source R extension packages, and distributed implementations of the most popular machine learning algorithms at scale.
With rising business challenges in the aftermarket service areas, it becomes imperative for manufacturers to gain actionable intelligence across the warranty management life cycle.
Join Revolution Analytics and Tech Mahindra to hear how to reduce the information visibility gap:
• Identify statistically significant business drivers
• Forecast warranty costs and claims
• Improve Customer Satisfaction
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
2. Revolution Confidential
Vigorous Growth of Big Data…
"The global Big Data market revenue is expected to grow from $1.56 billion in 2012 to $13.95 billion in 2017, at an estimated CAGR of 54.9% from 2012 to 2017."
– MarketsandMarkets.com study, 14 April 2013

"…the market for Big Data technology will reach $16.9 billion by 2015, up from $3.2 billion in 2010. That is a 40 percent-a-year growth rate – about seven times the estimated growth rate for the overall information technology and communications business."
– IDC study, March 2012
3. Big Data = Opportunity + Disruption

Huge New Data Assets
• Internet – Commerce, Communications, Collaboration
• Social Media – Personal, Presence, New Social Networks
• Ubiquitous Telemetry – Machines Everywhere

Rapidly-Evolving Platforms
• “Data Lake” vs. “Warehouse” vs. “Big Data App. Platforms”
• Vast Choices Among Open Source Platforms
• Eliminate Time-Consuming Data Movements

Emerging Business Opportunities
• Data Science Unlocks New Insight
• Big Data Drives Better Decision-Making
• Platforms Evolve Rationally Toward Big Data Vision
4. Hadoop Analytics Platforms: Disruption, Challenge, Growth & Opportunity At Once

Growth: Skill Development
• Java Skill Requirements
• Hadoop’s Innovation Pace
• Analytical
• Write Once, Deploy Anywhere

Disruption: Evolving Ecosystems
• EDW Saturation
• Limited Analytical Capabilities
• Data Science Skill Shortage
• MapReduce Paradigm

Challenge: Big Data Readiness
• Designed for Massive Scale
• Commodity Foundations
• Built for Data Variety
• Open Source Innovation Pace

Opportunity: New, More Capable Analytic Foundation
• Descriptive -> Predictive
• Short Analytical Cycle Time
• Ubiquitous Analytical Decisions
• Low-Latency Analytics
5. What We Need: Convergence

Data Science
With business solutions that fuse statistics, mathematics and software into meaningful applications.

Software Engineering
With tools and frameworks to create agile, scalable analytics-based applications.

IT Operations Management
Deployment platforms that are integrated, cost-effective, secure and ubiquitous.
6. What is the R Statistics Language?

The R Language:
• A straightforward procedural language for stats, math and data science
• Open source

The R Community:
• 2M users with the skill to tackle big data mathematical, statistical and ML needs
• Began on workstations and modest SMP servers

The R Ecosystem:
• 4500+ freely available algorithms in CRAN
• Applicable to Big Data if scaled
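To ground the claim about R as a straightforward procedural language for statistics, here is a minimal base-R sketch using the built-in mtcars data set (the variables chosen are illustrative only): fit a logistic regression with one of CRAN/base R's standard modeling functions, then score new records.

```r
# Fit a logistic regression with base R's glm() -- one of the
# statistical workhorses the slide alludes to.
model <- glm(am ~ mpg + wt, data = mtcars, family = binomial)

# Score new, unseen observations with the fitted model
new_cars <- data.frame(mpg = c(21, 30), wt = c(2.8, 1.9))
predict(model, newdata = new_cars, type = "response")
```

The same two-step pattern – fit a model, then apply predict() to new data – is what the later slides push down into Hadoop at scale.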
7. Why R and Hadoop?

Hadoop dominates Big Data storage and computational platforms.

R dominates Data Science, providing a language, users and thousands of pre-built algorithms.

Bringing them together is our goal today.
8. Mission

Company Confidential – Do not distribute

Revolution R Enterprise is the only commercial big data analytics platform based on the open source R statistical computing language:
• Enterprise-ready
• Multi-platform
• Scalable from desktop to big data
• Delivers high performance analytics
• Easier to build and deploy analytic applications
9. Who We Are

Leading provider of a commercial analytics platform based on the open source R statistical computing language.

Global Industries Served: Financial Services, Digital Media, Government, Health & Life Sciences, High Tech, Manufacturing, Retail, Telco

Customers: 200+ Global 2000
Global Presence: North America / EMEA / APAC

Our Software Delivers
• Power: Distributed, scalable, high performance advanced analytics
• Productivity: Easier to build and deploy analytic applications
• Enterprise Readiness: Multi-platform

Our Services Deliver
• Knowledge: Our experts enable you to be experts
• Time-to-Value: Our Quickstart projects give you a jumpstart
• Guidance: Our customer support team is here to help you

Our Philosophy: Customer-centric innovation; easy to do business with
Our Investors: Intel Capital, North Bridge, Presidio Ventures
10. Big Data Speed and Scale with Revolution R Enterprise

• Fast Math Libraries
• Parallelized Algorithms
• In-Database Execution
• Multi-Threaded Execution
• Multi-Core Execution
• In-Hadoop Execution
• Memory Management
• Parallelized User Code
11. Revolution R Enterprise Propels Enterprises into the Future

[Architecture diagram: Analytic Applications at the Decision layer and Middleware at the Integration layer sit atop the Revolution R Enterprise High Performance Analytics Platform, which draws on Hadoop, a Data Warehouse and Other Data Sources at the Data layer]
12. 200+ Corporate Customers and Growing

[Customer logos across Digital Media & Retail, Finance & Insurance, Healthcare & Life Sciences, Manufacturing & High Tech, and Academic & Gov’t]
14. R MapReduce: Fast, Agile Analytics for Hadoop Today

R MapReduce enables R-based analytics in Hadoop:
• Use R to explore and visualize data to develop insights
• Build models using widely-available techniques
• Score data directly in Hadoop using R models
• Run R as mappers and reducers in Hadoop

Advantages:
• No data movement needed
• Connects R to HDFS, HBase and Hive
• Runs standard MapReduce jobs
• R programmers need not learn Java, nor rewrite R into Java, Pig or SQL to score data
• Accelerates projects by bringing 4500+ open source R algorithms in CRAN¹ to Hadoop

[Diagram: R MapReduce (RMR) runs inside Hadoop alongside other MapReduce jobs, with direct access to HDFS, HBase and Hive; analytics applications also draw on a Data Warehouse and other data sources]

¹ CRAN: Comprehensive R Archive Network – an open source collection of 4500+ R-based statistics, analytics, graphics and data manipulation algorithms for R users.
15. R MapReduce (RMR): Build MapReduce Jobs Entirely in R

Your creativity + your code + 4500+ R packages in CRAN = rich, powerful data analytics that runs in MapReduce.

[Diagram: Revolution R Enterprise and CRAN packages running as Map and Reduce tasks on Hadoop, over HDFS, HBase and Hive]
16. Why Build MapReduce Jobs Using R?

What can you do with it?
• Transform, aggregate, regress, cluster, filter, simulate, model, score…
• Run R programs while leveraging Hadoop’s scalability
  – Big I/O: score data files containing billions of rows
  – Big Math: run compute-intensive algorithms in parallel – Monte Carlo, random trees, etc.
• Deliver results to BI or visualization tools and production applications

When to choose RMR:
• Need to develop analytics in R, on big data in Hadoop
• Stringent latency requirements
• Scarce R and Java developers need to collaborate, not duplicate
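The open-source implementation of RMR is the rmr2 package from the RHadoop project. As a sketch of what "build MapReduce jobs entirely in R" means in practice (it assumes a working Hadoop cluster with rmr2 installed), here is a complete job whose mapper and reducer are ordinary R functions:

```r
library(rmr2)  # open-source RHadoop implementation of R MapReduce

# Push a vector into HDFS as a (key, value) data set
ints <- to.dfs(1:1000)

# Group the values by their remainder mod 10 and sum each group --
# both the map and reduce steps are plain R closures
job <- mapreduce(
  input  = ints,
  map    = function(k, v)  keyval(v %% 10, v),
  reduce = function(k, vv) keyval(k, sum(vv)))

# Pull the ten (remainder, sum) pairs back into the R session
from.dfs(job)
```

Note that no Java, Pig or SQL appears anywhere: rmr2 handles job submission and the movement of keys and values between R and Hadoop.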
17. R MapReduce: Create Mappers and Reducers Using R

How:
• Build R code using Revolution R Enterprise
• Use open source algorithms from the CRAN project
• Leverage HDFS and MapReduce directly
• Deploy R mappers & reducers in Hadoop

[Diagram: R code and R packages deployed through R MapReduce (RMR) on Revolution R Enterprise inside Hadoop, running alongside other MapReduce jobs over HDFS, HBase and Hive]
18. Revolution Confidential
Mappers & Reducers:
100% R. 100% Hadoop.
For Hadoop Users:
Integrates R with Hadoop via Hadoop Streaming
Creates MapReduce Jobs Compatible with the JobTracker
No Need to Recode Models
No Latency to Move Data
For R Programmers:
No Need for Java Programming
Serializes & Deserializes Data Between HDFS and R
Handles Standard HDFS Reads & Writes Transparently
Provides Explicit Access to HDFS, HBase, and Hive via Packages
Access to the CRAN Algorithm Library
[Diagram: an R mapper or reducer running under Revolution R Enterprise inside Hadoop Streaming, with data deserialized from HDFS into R and serialized back, plus high-speed connectors to HBase, Hive, and CRAN packages]
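Hadoop Streaming, which RMR builds on, treats a mapper as any program that reads raw lines on stdin and emits tab-separated key/value lines on stdout; RMR's value is generating exactly this serialization plumbing for you. A hand-rolled sketch of the contract follows; the comma-separated record layout and the function names are assumptions for illustration, and the real stdin loop appears only in a comment.

```r
# Sketch of a Hadoop Streaming mapper written by hand in R.
# Streaming delivers raw lines on stdin; the mapper must emit
# "key<TAB>value" lines on stdout. The input layout is assumed
# to be "customer,amount" per line.

map_line <- function(line) {
  fields <- strsplit(line, ",", fixed = TRUE)[[1]]
  paste(fields[1], fields[2], sep = "\t")   # emit key<TAB>value
}

# In a deployed job the driver loop would look roughly like:
#   con <- file("stdin")
#   for (line in readLines(con)) cat(map_line(line), "\n", sep = "")
# Demonstrated here on an in-memory sample instead:
sample_input <- c("a,10", "b,20", "a,5")
emitted <- vapply(sample_input, map_line, character(1), USE.NAMES = FALSE)
cat(emitted, sep = "\n")
```

This is the "serializes & deserializes data between HDFS and R" step the slide lists: RMR hides the line parsing and tab-delimited emission so the R programmer only writes the map and reduce logic.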
19. Leveraging R with Hadoop
With R “Inside” Hadoop…
In-Place ETL: Data Transformation in R; Enrichment and Correlation Using Other Data in Hadoop
Simulation/Experimentation: Execute Complex Simulations on Massively Parallel Hadoop Clusters
Scoring: Run Scoring Models Directly in Hadoop; No Movement Penalty
How? Write Mappers & Reducers in R and Deploy Using RMapReduce; Augment Hadoop with CRAN1 Packages
1 Use of CRAN algorithms limited to non-graphical, parallelizable algorithms
20. Limitations of R MapReduce
R Programmer Must "Think MapReduce": Dividing Work into Cascades of Map, Reduce, Repeat
Algorithms Must Be Designed for Parallelism, Including Any External Packages Used
Fits:
Hadoop-Literate Teams, or Those with Good Support
Non-Fits:
Analytics Teams Tinkering with Hadoop on Short Timeframes
[Diagram: same architecture as slide 17: R MapReduce (RMR) jobs alongside other MapReduce jobs on Hadoop (HDFS, HBase, Hive), fed by a data warehouse and other data sources]
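"Thinking MapReduce" concretely means decomposing an algorithm into passes whose partial results combine associatively. The local sketch below (plain base R, no Hadoop; the chunking and all names are invented for illustration) shows the canonical decomposition for a mean: each data chunk reduces to a (sum, count) pair, and a final reduce merges the pairs into the global answer.

```r
# Two cascaded map/reduce passes, simulated locally in base R.
# Pass 1 (one reducer per partition): reduce each chunk to (sum, count).
# Pass 2 (single reducer): combine the partials into the global mean.

x <- c(4, 8, 15, 16, 23, 42)
chunks <- split(x, rep(1:3, each = 2))        # pretend these are 3 HDFS blocks

# Pass 1: partial aggregates per chunk
partials <- lapply(chunks, function(chunk)
  c(sum = sum(chunk), count = length(chunk)))

# Pass 2: merge partials; associativity is what makes this parallelizable
total <- Reduce(`+`, partials)
global_mean <- total[["sum"]] / total[["count"]]
print(global_mean)   # 18, identical to mean(x)
```

Algorithms that cannot be decomposed this way (or whose CRAN implementations assume all data in memory) are exactly the "non-fit" cases the slide warns about.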
21. More Ways to Leverage R with Hadoop: "Beside" Architectures
Inside Hadoop (as on slide 19): In-Place ETL, Simulation/Experimentation, and Scoring Directly in Hadoop, via R Mappers & Reducers Deployed with RMapReduce plus CRAN1 Packages
"Beside" Architectures:
Drivers:
Large or Unpredictable R Workloads
Modest Hadoop Cluster
Shared Production Hadoop Cluster
Hadoop Novice
Large Numbers of R Users
Modest Data Sets to Be Scored
Movement Penalty Isn't Prohibitive
Maximized Computational Scale
Access to ScaleR Parallel External Memory Algorithms (PEMAs)
Advantages:
Makes Hadoop Easier to Administer
Stabilizes Hadoop Resource Availability
22. Two Additional "Beside" Architectures
Alternatives:
RRE “Beside” Hadoop
RRE Both “Beside” and “Inside” Hadoop with RMR
“Beside” Usage:
Sample into “Beside” Server or Cluster
Analyze and Model on R Server or Cluster
Score Data on R Server or Cluster
Results to Hadoop for Use.
"Both" Usage: Same as Above, Except:
Move Model to Data on Hadoop
Score Data In-Place on Hadoop
Why multiple options?
Greatest Flexibility
Optimize Skill Sets
Scale Clusters Independently
Control Concurrency and Security
Optimize Utilization
Same R Code Can Run in Both
Balance Ease of Use/Development and Resulting Performance & Scale
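In code, the "both" pattern reduces to fitting where the analysts are and shipping only the fitted object to where the data is. The sketch below runs both halves locally on simulated data (the dataset and all names are invented for illustration; in RRE the scoring half would be deployed as an RMR job against HDFS blocks):

```r
# "Beside": fit a model on a sample extracted from Hadoop (simulated here).
set.seed(42)
sample_df <- data.frame(x = rnorm(200))
sample_df$y <- rbinom(200, 1, plogis(0.5 + 1.2 * sample_df$x))
fit <- glm(y ~ x, data = sample_df, family = binomial)

# Export: only the small fitted-model object moves, never the big data.
model_path <- tempfile(fileext = ".rds")
saveRDS(fit, model_path)

# "Inside": each mapper would load the model and score its own data block.
score_block <- function(block, path) {
  model <- readRDS(path)
  predict(model, newdata = block, type = "response")
}
big_block <- data.frame(x = c(-2, 0, 2))   # stand-in for one HDFS block
scores <- score_block(big_block, model_path)
print(round(scores, 3))                    # three probabilities in [0, 1]
```

Because the same R code scores a block whether it runs on the analytics server or inside a mapper, the "same R code can run in both" claim above is what makes the architecture flexible.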
23. RRE "Beside" Hadoop
Separate Hadoop & R Clusters
Connectors for HDFS, HBase & Hive
Explore & Model Data on R Server(s)
Return Scored Data to HDFS/HBase/Hive
When To Use:
Small, Shared, or Production Hadoop Cluster
Need Parallelized Algorithms
Heavy Random Workloads
Extensive "Sandboxing"
Modest Data Scoring
Data Security Constraints
… while awaiting YARN…
Advantages:
Concurrency by Separation
Security by Separation
Independent Scalability
ScaleR Parallel Algorithms
[Diagram: data warehouse and other data sources feeding a Hadoop cluster (HDFS, HBase, Hive, MapReduce jobs) and, via ConnectR (HBase, HDFS, ODBC & high-speed connectors), a separate analytics server or cluster running Revolution R Enterprise with CRAN packages on Linux, Windows, LSF, or Azure; results flow to analytics apps and BI & visualization]
24. RRE "Beside" and "Inside"
Both "Inside" and "Beside" Platforms
Connect a Compute Cluster to Hadoop to Run R
Move Models to Score Big Data on Hadoop
When To Use:
Production Hadoop Cluster
Need Parallelized Algorithms
Heavy Random Workloads
Extensive "Sandboxing"
Large Data Scoring
Data Security Constraints
… while awaiting YARN…
Advantages:
Concurrency & Security
Independent Scalability
Big Data Scoring Flexibility
Low Latency
[Diagram: as on slide 23, but with R MapReduce (RMR) and RRE with CRAN packages also running inside the Hadoop cluster, so models built on the analytics server can be moved into Hadoop for big-data scoring]
26. 'Beside' and/or 'Inside': Dominant Usage Patterns Observed
Use Case 1: Real-Time Scoring
Example – Fraud Prevention
Use Case 2: Modeling and Scoring
Example – Attribution Analysis
Use Case 3: Production Analytics
Example – Telematics-Assisted Underwriting
27. Example 1: Card Fraud Detection
[Diagram: fraud-detection workflow. Ingest weblog data into Hadoop (HDFS/HBase); land transaction history, personal data (creditworthiness, banking), mortgage data, and demographic data from in-house systems; filter & transform; correlate & rate transaction data with R MapReduce (RMR) jobs alongside other MapReduce jobs; develop risk models on an R workstation running Revolution R Enterprise, connected via ConnectR (HBase, HDFS, ODBC & high-speed connectors); execute models to filter & score transactions for authorization systems; deliver & integrate results to BI & visualization]
28. Example 2: Attribution Analysis "Beside" Hadoop
[Diagram: attribution workflow. Ingest weblog data, marketing service provider feeds (Acxiom, Experian, ExactTarget), monitored responses (CoreMetrics, Dotomi, DoubleClick), call-center data, and in-house systems (EDW, CRM, datamarts) into Hadoop (HDFS/HBase); filter & transform with Java MapReduce jobs; sessionize, aggregate, profile & enrich; load the analysis environment on a Linux server cluster; develop attribution models in Revolution R Enterprise via ConnectR (HBase, HDFS, ODBC & high-speed connectors); score; deliver to users through analytics apps and BI & visualization]
29. Example 3: Telematics-Enhanced Underwriting
[Diagram: underwriting workflow. Ingest policy origination data, vehicle sensor data (speed, time, acceleration, location), creditworthiness data, and insured data (loss history, payment history, credit file, demographics) into Hadoop (HDFS/HBase); correlate sources; filter, aggregate & profile; load the model environment on a Linux server cluster; develop risk models in Revolution R Enterprise via ConnectR (HBase, HDFS, ODBC & high-speed connectors); export models; score large datasets in Hadoop with R MapReduce (RMR) alongside other MapReduce jobs; deliver to underwriting applications & call response systems]
30. Conclusion
Big Data Is Hard.
Hadoop Is Key to Managing It.
R Is Key to Applying It.
Revolution R on Hadoop Brings Data Science to Big Data:
Hadoop Brings Parallel Performance to R
R Brings a Community with Know-How to Hadoop
Revolution Analytics Can Deliver Convergence Today.
… and the Future of R on Hadoop is Even Brighter…