Architecting R into Storm Application Development Process


This document discusses combining R and Storm to perform real-time analytics on streaming data. R is a programming language with advanced statistics libraries, while Storm is a framework for processing data streams on a cluster. The document proposes running R code inside Storm bolts to apply R's statistical capabilities to online change point detection. As a demonstration, it looks for change points in the Oakland A's 2002 season, which includes their 20-game winning streak; change points are identified, but none corresponds to the streak. Integrating further with data modeling teams is suggested as the next step. Running R inside Storm lets modeling and operations teams keep independent development timelines while still enabling real-time statistical analysis of data streams.

© 2014 MapR Technologies 1
Talk Overview
• Agile Real-time Stats
• R + Storm
github.com/allenday/R-Storm
• DEMO
• How to do it?
• Q & A @allenday
[Diagram labels: Agile Methods, Advanced Statistics, Continuous Real-time Delivery]
github.com/allenday/hadoop-summit-r-storm-demo-public

Architecting R into the Storm
Application Development Process

Allen (me) and Sungwook @ MapR
• Allen Day, Principal Data Scientist [ @allenday ]
7yr Hadoop dev, 12yr R dev/author
PhD, Human Genetics, UCLA Medicine
• Sungwook Yoon, Data Scientist
Spark & Security Expert
PhD, Computer Engineering, Purdue
• MapR [ @mapr ]
Distributes open source components for Hadoop
Adds major technology for performance, HA, industry standard APIs

What’s Storm? What’s R?
• What’s Storm?
– Processes a data stream. Akin to UNIX pipe + tee & merge commands
– Runs on a cluster. Fault-tolerant and designed to scale out
– Used for: real-time analytics & machine learning
• What’s R?
– Programming language with advanced statistics libraries
– Does not scale out. Can scale up
– Used for: prototyping, data modeling, visualization
How to combine these?

R outside, Storm inside: not practical. Why?
• Model-building and QA are done on data snapshots
• However, R => Hadoop is realistic. Key difference: referenced data can be static
– Use MapR snapshots for dev and QA
– See also: RHIPE (Purdue) and RHadoop (RevolutionAnalytics)
[Diagram labels: R, Storm, User]

Storm outside, R inside: a good fit
• Enables separation of concerns
– Independently manage modeling, ops timelines, and version control
– Integrate as needed
• Enables role specialization
– R built-ins allow faster iteration and more concise stats-type code
– Do DevOps with specific SW engineering tech, e.g. Java
[Diagram labels: Storm, R, User]

Q: Who really likes statistics?
A: Baseball fans
A: Team Managers = Portfolio Managers

Fresh Local Data Tonight!

Famous Vintage Data
Oakland Athletics, 2002 Season
20 consecutive wins – the current record
Obligatory movie ref… I’m from LA
LET’S GO DODGERS!

Goal: Detect “Moneyball” 2002 Winning Streak

Methods: Change Point Detection
Find natural breakpoints in a time-series set of data points
R packages implement this:
• changepoint: more sensitive, but not streaming
• bcp: streaming, but less sensitive
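To make "natural breakpoint" concrete, here is a minimal sketch of a single change-in-mean search: try every split point and keep the one whose two segments are each closest to their own mean. This is an illustration only, not the algorithm used by changepoint or bcp; the class name, method names, and sample values are made up for this example.

```java
// Illustrative only: a brute-force single change-in-mean search.
// Not the method implemented by the changepoint or bcp R packages.
class MeanShiftSketch {
    // Sum of squared deviations from the segment mean over xs[from, to).
    static double sse(double[] xs, int from, int to) {
        double mean = 0.0;
        for (int i = from; i < to; i++) {
            mean += xs[i];
        }
        mean /= (to - from);
        double total = 0.0;
        for (int i = from; i < to; i++) {
            double d = xs[i] - mean;
            total += d * d;
        }
        return total;
    }

    // Returns the index k (1 <= k < n) at which splitting the series into
    // xs[0, k) and xs[k, n) gives the lowest combined squared error,
    // or -1 when the series has fewer than two points.
    static int bestSplit(double[] xs) {
        int best = -1;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int k = 1; k < xs.length; k++) {
            double cost = sse(xs, 0, k) + sse(xs, k, xs.length);
            if (cost < bestCost) {
                bestCost = cost;
                best = k;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up score differences: four losses, then four wins.
        double[] scoreDiff = {-2, -1, -3, -2, 4, 5, 3, 4};
        System.out.println(bestSplit(scoreDiff)); // prints 4
    }
}
```

Note that this search always returns some split, even for a series with no real change; a usable detector also needs a significance test or penalty before reporting a breakpoint, which is the part the R packages provide.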

Methods: R+Storm Demo Architecture
[Diagram components:]
• Oakland A’s Data (accelerated)
• Storm Bolt: R online change point detector
• Storm Bolt: write to Jetty
• Jetty Webserver
• Browser (D3.js)
• Us
• GIFs to MapR Filesystem
github.com/allenday/hadoop-summit-r-storm-demo-public

Demo
[Figure panels:]
• 50-game sliding window/buffer to detect change points
• Cumulative history with detected break points
• Raw data (score difference between A’s and opponent)
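The sliding window/buffer idea can be sketched as a fixed-size queue that is re-checked each time a new observation arrives. In the demo the detection itself runs in R inside the bolt; the Java below only illustrates the buffering mechanics, and the half-versus-half mean comparison, the class name, and the threshold parameter are assumptions made for this sketch.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: a fixed-size buffer that flags a shift between the
// older and newer halves of the window. Not the demo's R detector.
class SlidingWindowSketch {
    private final int capacity;
    private final double threshold;
    private final Deque<Double> window = new ArrayDeque<>();

    SlidingWindowSketch(int capacity, double threshold) {
        if (capacity < 2) {
            throw new IllegalArgumentException("capacity must be at least 2");
        }
        this.capacity = capacity;
        this.threshold = threshold;
    }

    // Adds one observation, evicting the oldest once the buffer is full.
    // Returns true only when the buffer is full and the mean of the newer
    // half differs from the mean of the older half by more than threshold.
    boolean offer(double x) {
        window.addLast(x);
        if (window.size() > capacity) {
            window.removeFirst();
        }
        if (window.size() < capacity) {
            return false;
        }
        int half = capacity / 2;
        double older = 0.0;
        double newer = 0.0;
        int i = 0;
        for (double v : window) {
            if (i < half) {
                older += v;
            } else {
                newer += v;
            }
            i++;
        }
        older /= half;
        newer /= (capacity - half);
        return Math.abs(newer - older) > threshold;
    }

    int size() {
        return window.size();
    }
}
```

For a 50-game window, construct it as `new SlidingWindowSketch(50, t)` with a threshold `t` of your choosing and call `offer` once per incoming game. The buffer never grows past its capacity, so the per-tuple cost stays constant no matter how long the stream runs.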

Methods Details: How it’s done
• Uses R-Storm binding github.com/allenday/R-Storm
– Storm package on CRAN cran.r-project.org/web/packages/Storm
[Diagram stages: Storm (dev team), R (stats team), Storm (dev team, pure Java); labels: Producer, Consumer]

Methods Details: Easy integration
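R: lambda function
library(Storm)
storm = Storm$new();
# replace the default handler with a function called once per tuple
storm$lambda = function(s) {
  t = s$tuple;
  t$output = vector(length=1);
  t$output[1] = "tada!";
  s$emit(t);  # send the modified tuple downstream
}
storm$run();
Storm: extend ShellBolt
public static class MyRBolt extends ShellBolt implements IRichBolt {
  public MyRBolt() {
    // launch the R script as a subprocess of this bolt
    super("Rscript", "my.R");
  }
  // declareOutputFields() and getComponentConfiguration() omitted for brevity
}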

Results
• Change points are identified, but none for winning streak
– Not using score difference, anyway
• Time to integrate with the modeling team!
– Send @kunpognr or @allenday a pull request on GitHub
• Applicable to many other use cases, e.g.
– Security (fraud detection, intrusion detection)
– Marketing (intent to purchase / social media streams)
– Customer Support (help desk voice calls)
Discussion

Q&A
@allenday allenday@mapr.com
Engage with us!
allenday linkedin.com/in/allenday
