As the Big Data market has evolved, the focus has shifted from data operations (storage, access and processing of data) to data science (understanding, analyzing and forecasting from data). And as new models are developed, organizations need a process for deploying analytics from research into the production environment. In this talk, we'll describe the five stages of real-time analytics deployment:
Data distillation
Model development
Model validation and deployment
Model refresh
Real-time model scoring
We'll review the technologies supporting each stage, and how Revolution Analytics software works with the entire analytics stack to bring Big Data analytics to real-time production environments.
Organisations are adopting microservices to keep pace with business innovation; whilst needing to meet the resilience, scalability and security requirements critical for digital solutions. Enterprise relational DBs are often a barrier to this transformation, but they needn’t be.
This presentation delves into the challenges faced by enterprises during digital transformation and modernization initiatives which are often hamstrung by the inherent monolithic nature of enterprise databases.
Many Oracle data-centric applications consist of an intricate web of hundreds of tables, housing hundreds of thousands of lines of PL/SQL code executed within the database via packaged procedures. These relational databases have enabled us to safely and securely manage structured data for several decades, but over time they grow more complex and harder to maintain, slowing down delivery and seriously degrading application performance; business innovation all but grinds to a halt.
Given the impracticality and cost associated with complete rewrites, many organisations are turning to Microservices Architecture, to extract value from existing assets whilst gradually deconstructing the monolithic architecture to facilitate evolutionary changes.
This presentation outlines a systematic and phased approach, based on experience from multiple client initiatives, highlighting the crucial role of this transformation in enabling the creation of APIs that drive new business initiatives. The concept of domain separation, a pivotal element in the migration process, will be introduced, along with options to move certain data retrieval and processing to more appropriate architectures.
What does a Maturity Curve for Enterprise Adoption of Agile and DevOps look like? Where would an organization like yours rank on the curve? Are there specific areas of improvement you might want to consider?
This presentation introduces Process Mining, a cutting-edge data analytics approach for discovering real processes by analyzing event logs, detecting bottlenecks, and generating recommendations for enhancing business performance.
Data ingestion and distribution with Apache NiFi – Lev Brailovskiy
In this session, we will cover our experience working with Apache NiFi, an easy to use, powerful, and reliable system to process and distribute a large volume of data. The first part of the session will be an introduction to Apache NiFi, in which we will go over NiFi's main components, building blocks, and functionality.
In the second part of the session, we will show our use case for Apache NiFi and how it's being used inside our Data Processing infrastructure.
Business Value Measurements and the Solution Design Framework – Leo Barella
The presentation covers a process and artifacts to establish better communication between business and IT and improve the quality and consistency of solutions. It also includes a tool to measure business value of the solutions that are being proposed and allows the business audience to make educated choices based on overall IT Business impact.
Slides from a webinar that I did recently for TIBCO. Full webinar replay with audio available at http://www.tibco.com/mk/2007/bpm-bpm11-jul-07usarc.jsp
Many significant business initiatives and large IT projects depend upon a successful data migration. Your goal is to minimize as much risk as possible through effective planning and scoping. This paper will provide insight into what issues are unique to data migration projects and offer advice on how to best approach them.
Ever wondered why you need a blueprint to make sense of it all? We break down the needs for a winning architecture technique with a template-based example approach. Feel free to reach out if more details are required or you have an opportunity to explore.
Mario Vivas, CEO, River Horse
ITIL4 is out and everyone is eager to learn about the updates this release introduces. This session will summarize the key changes that ITIL4 presents with a focus on the more operational processes that organizations deliver on a day to day basis (Incident, Problem, Change and so on). The ServiceNow platform has a powerful set of baseline features and optional plugin functions that can help an organization align with the recommendations of ITIL4.
Please join us and Mario to learn about how you can start applying ITIL4 concepts in your ServiceNow implementations!
Slides supporting the book "Process Mining: Discovery, Conformance, and Enhancement of Business Processes" by Wil van der Aalst. See also http://springer.com/978-3-642-19344-6 (ISBN 978-3-642-19344-6) and the website http://www.processmining.org/book/start providing sample logs.
Analyze key aspects to be considered before embarking on your cloud journey. The presentation outlines the strategies, approach, and choices that need to be made, to ensure a smooth transition to the cloud.
Big Data and Data Science: The Technologies Shaping Our Lives – Rukshan Batuwita
Big Data and Data Science have become increasingly important areas in both industry and academia, to the extent that every company wants to hire a Data Scientist and every university wants to start dedicated degree programs and centres of excellence in Data Science. Big Data and Data Science have led to technologies that have already shaped different aspects of our lives, such as learning, working, travelling, purchasing, social relationships, entertainment, physical activity, and medical treatment. This talk will attempt to cover the landscape of some of the important topics in these exponentially growing areas, including state-of-the-art processes, commercial and open-source platforms, data processing and analytics algorithms (especially large-scale Machine Learning), application areas in academia and industry, best industry practices, business challenges, and what it takes to become a Data Scientist.
Unlocking the Value of Usage Data – March 20, 2014
Dan McGaw, Director of Marketing KISSmetrics @danielmcgaw
Puja Ramani, Director of Product Management & Analytics Gainsight @pramani #customersuccess #KISSwebinar
1. The Case for User Analytics
2. Making User Analytics Actionable
3. Realizing ROI
We Have Entered The Age Of The Customer
Customer data is everywhere
Welcome to our world of Customer Analytics.
How it works (it’s simple and powerful)
Your customer is at the heart of KISSmetrics
How effective is my signup process?
“You can’t maximize your revenue and profit unless you are tracking the lifetime value of each of your customers. And that’s what KISSmetrics does better than anyone else.” — Thomas (Zappos)
Which of my marketing channels has the highest ROI?
What do my customers do before they sign up?
Are customers coming back on a regular basis?
Making User Analytics Actionable
We all know a data-driven world is an inevitability
We track everything from our health to our homes to our children
We have more data about our customers than ever before
So what’s stopping us?
38% of companies are not able to communicate and interpret customer analytics results.
54% can’t integrate and manage all their data sources.
The four pillar approach is your roadmap to ROI
People Objective Strategy Technology
So what can you do?
Data Science Alert Rules and Playbooks Confirm Intuition
Blend with other data sources to discover insights
Score customer health using usage data
Have one view of all your customers
Fire off tasks or outreach based on usage
Take action on early warnings and manage each event
Consistently collaborate to keep customer relationships healthy
Who’s getting ROI from usage data?
Reduce Churn
THANK YOU
Dan McGaw, Director of Marketing KISSmetrics @danielmcgaw
Puja Ramani, Director of Product Management & Analytics Gainsight @pramani
This year, global Facility Management (FM) trends relate to the use of new technologies, comfortable workspaces, sustainability, energy efficiency, and big data.
Insights into Action – recent examples leveraging big data in the automotive ... – Capgemini
In 2014 many OEMs started to use the latest big data technologies to leverage customer and vehicle data, together with information available through other channels. Nick Gill, Chair of Capgemini’s global automotive council, will highlight some of the leading examples of work done in the industry, and the results achieved to date.
Presented at EMC World 2015 by Nick Gill.
Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc... – Mia Yuan Cao
Learn how to extract real-time insight from Big Data. JReport and ScaleDB’s combined solution delivers business value by ingesting Big Data at stunning velocity (millions of rows/second), then provides powerful visualizations, filtering and data analysis that enable you to draw quick conclusions to make agile business decisions. JReport's seamless connection to ScaleDB enables technical or non-technical users to build and modify their own reports and dashboards to visualize these vast data stores. Join us to see how.
Enterprise Approach towards Cost Savings and Enterprise Agility – NUS-ISS
Presented by Mr Poon See Hong, Deputy Director (Planning), Police Logistics Department, Singapore Police Force, at our 14th Architecture Community of Practice Forum on 21 Jul 2016.
Online hotel booking has added a new dimension to the traditional hotel booking process in India.
Everyone has unique expectations when it comes to booking a hotel online; online booking now gives users the opportunity to meet those expectations. It can be an awesome experience, as long as the process isn’t too frustrating.
In this project I focus mainly on improving the user experience through this process.
Big Data and the Energy domain (vis-a-vis the respective H2020 Societal Challenge) - Opportunities, Challenges and Requirements. As presented and discussed in the public launch of the BigDataEurope project.
100% R and More: Plus What's New in Revolution R Enterprise 6.0 – Revolution Analytics
R users already know why the R language is the lingua franca of statisticians today: because it's the most powerful statistical language in the world. Revolution Analytics builds on the power of open source R, and adds performance, productivity and integration features to create Revolution R Enterprise. In this webinar, author and blogger David Smith will introduce the additional capabilities of Revolution R Enterprise.
VP of Product Development, Dr. Sue Ranney will also provide an overview of the features introduced in Revolution R Enterprise 6.0 including:
1. Big Data Generalized Linear Model, the new RevoScaleR function that provides a fast, scalable, distributable implementation of generalized linear models, offering impressive speed-ups relative to glm on in-memory data frames
2. Platform LSF Cluster Support, which allows you to create a distributed compute context for the Platform LSF workload manager
3. Azure Burst support added to RxHpcServer
4. Updated R engine (R 2.14.2)
5. Ability to use RevoScaleR analysis functions with non-xdf data sources such as SAS, SPSS or text
6. New methods for RxXdfData data sources including head, tail, names, dim, colnames, length, str, and formula
7. New function rxRoc for generating ROC curves
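A minimal sketch of calling the scalable GLM function described in item 1, assuming RevoScaleR is installed; the data set and formula are illustrative, not from the webinar:

```r
# Sketch only: rxGlm mirrors glm()'s formula interface but streams
# over .xdf files or data frames using distributable algorithms.
library(RevoScaleR)

# "insurance.xdf" and the variables are hypothetical examples
fit <- rxGlm(claims ~ age + type,
             family = poisson(),
             data = "insurance.xdf")
summary(fit)
```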
Big Data Analytics in a Heterogeneous World – Joydeep Das of Sybase (BigDataCloud)
Big Data Analytics is characterized by analysis of data on three vectors: exploding data volume, proliferating data variety (relational, multi-media), and accelerating data velocity. However, other key vectors such as the costs and skill sets needed for Big Data Analytics are often overlooked. In this session, we will consider all five vectors by exploring various techniques where traditional but progressive technologies such as column store DBMSs and Event Stream Processing are combined with open source frameworks such as Hadoop to exploit the full potential of Big Data Analytics.
Agenda:
- Big Data Analytics in the real world
- Commercial and Open Source techniques
- Bringing together Commercial and Open Source techniques
* Architectures
* Programming APIs
(e.g. embedded and federated MapReduce)
- Conclusions
Big Data Paris - A Modern Enterprise Architecture – MongoDB
Since the 1980s, the volume of data produced, and the risk associated with that data, have literally exploded. 90% of the data that exists today was created in the last 2 years, and 80% of it is unstructured. With more users and the need for permanent availability, the risks are much higher.
Which database parameters must a decision-maker take into account to deploy innovative applications?
Big Data Analytics on Teradata with Revolution R Enterprise – Bill Jacobs
Revolution Analytics brings big data analytics to the Teradata database. Presentation from Teradata Partners, October 2013, overviewing Revolution R Enterprise for Teradata, by Bill Jacobs, Director of Product Marketing, Revolution Analytics.
Open source Apache Hadoop is a great framework for distributed processing of large data sets. But there’s a difference between “playing” with big data versus solving real problems. The reality is that Hadoop alone is not enough. In fact, almost every organization that plans to use Hadoop for production use quickly discovers that it lacks the required features for enterprise use. And, fewer still have the Hadoop specialists on hand to navigate through the complexity to build reliable, robust applications. As a result, many Hadoop projects never make it to production as executives say, “we just don’t have the skills.” In this session, we will discuss these enterprise capabilities and why they’re important: analytics, visualization, security, enterprise integration, developer/admin tools, and more. Additionally, we will share several real-world client examples who have found it necessary to use an enterprise-grade Hadoop platform to tackle some of the most interesting and challenging business problems.
Sponsored by Data Transformed, the KNIME Meetup was a big success. Please find the slides for Dan's, Tom's, Anand's and Chhitesh's presentations.
Agenda:
Registration & Networking
Keynote – Dan Cox, CEO of Data Transformed
KNIME & Harvest Analytics – Tom Park
Office of State Revenue Case Study – Anand Antony
Using Spark with KNIME – Chhitesh Shrestha
Networking & Drinks
Similar to Real-time Big Data Analytics: From Deployment to Production
Presented to eRum (Budapest), May 2018
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk I'll describe the doAzureParallel package, a backend to the "foreach" package that automates the process of spawning a cluster of virtual machines in the Azure cloud to process iterations in parallel. This will include an example of optimizing hyperparameters for a predictive model using the "caret" package.
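The foreach pattern described above can be sketched as follows; a local doParallel backend stands in here for doAzureParallel, whose pool configuration and Azure credentials are omitted:

```r
# Sketch: an embarrassingly parallel loop with foreach.
# With doAzureParallel, registerDoAzureParallel(cluster) would
# replace the local backend and iterations would run on Azure VMs.
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

# Each iteration is independent, so foreach can run them in parallel
results <- foreach(i = 1:4, .combine = c) %dopar% {
  mean(rnorm(1000, mean = i))
}

stopCluster(cl)
results  # four means, close to 1, 2, 3 and 4
```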
By David Smith. Presented at Microsoft Build (Seattle), May 7 2018.
Your data scientists have created predictive models using open-source tools, proprietary software, or some combination of both, and now you are interested in lifting and shifting those models to the cloud. In this talk, I'll describe how data scientists can transition their existing workflows — while using mostly the same tools and processes — to train and deploy machine learning models based on open source frameworks to Azure. I'll provide guidance on keeping connections to data sources up-to-date, evaluating and monitoring models, and deploying applications that make use of those models.
Presentation delivered by David Smith to NY R Conference https://www.rstats.nyc/, April 2018:
Minecraft is an open-world creativity game, and a hit with kids. To get kids interested in learning to program with R, we created the "miner" package. This package is a collection of simple functions that allow you to connect with a Minecraft instance, manipulate the world within by creating blocks and controlling the player, and to detect events within the world and react accordingly.
The miner package is intended mainly for kids, to inspire them to learn R while playing Minecraft. But the development of the package also provides some useful insights into how to build an R package to interface with a persistent API, and how to instruct others on its use. In this talk I'll describe how to set up your own Minecraft server, and how to use and extend the package. I'll also provide a few examples of the package in action in a live Minecraft session.
While Python is a widely-used tool for AI development, in this talk I'll make the case for considering R as a platform for developing models for intelligent applications. Firstly, R provides a first-class experience working deep learning frameworks with its keras integration. Equally importantly, it provides the most comprehensive suite of statistical data analysis tools, which are extremely useful for many intelligent applications such as transfer learning. I'll give a few high-level examples in this talk, and we'll go into further detail in the accompanying interactive code lab.
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk I'll describe several techniques available in R to speed up workloads like these, by running multiple iterations simultaneously, in parallel.
Many of these techniques require the use of a cluster of machines running R, and I'll provide examples of using cloud-based services to provision clusters for parallel computations. In particular, I will describe how you can use the SparklyR package to distribute data manipulations using the dplyr syntax, on a cluster of servers provisioned in the Azure cloud.
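A minimal sketch of the sparklyr/dplyr pattern mentioned above, using a local Spark master in place of a cluster provisioned in Azure:

```r
# Sketch: dplyr verbs are translated to Spark SQL by sparklyr
# and executed on the cluster; only collect() brings results back.
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")   # cluster URL in production
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")

mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  collect()

spark_disconnect(sc)
```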
Presented by David Smith at Data Day Texas in Austin, January 27 2018.
A look at the changing perceptions of R, from the early days of the R project to today. Microsoft sponsor talk, presented by David Smith to the useR!2017 conference in Brussels, July 5 2017.
Predicting Loan Delinquency at One Million Transactions per Second – Revolution Analytics
Real-time applications of predictive models must be able to generate predictions at the rate that transactions are generated. Previously, such applications of models trained using R needed to be converted to other languages like C++ or Java to achieve the required throughput. In this talk, I’ll describe how to use the in-database R processing capabilities of Microsoft R Server to detect fraud in a SQL Server database of loan records at a rate exceeding one million transactions per second. I will also show the process of training the underlying gradient-boosted tree model on a large training set using the out-of-memory algorithms of Microsoft R.
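A hedged sketch of the train-and-score workflow the talk describes, assuming Microsoft R Server's RevoScaleR package; the file names and variables are illustrative, and the in-database SQL Server compute context is elided:

```r
# Sketch only: rxBTrees trains a gradient-boosted tree model with
# out-of-memory algorithms; rxPredict scores records with it.
# Scoring inside SQL Server would additionally require
# rxSetComputeContext() with an RxInSqlServer context.
library(RevoScaleR)

model <- rxBTrees(delinquent ~ balance + income + age,
                  data = "loanTrain.xdf",     # hypothetical training set
                  lossFunction = "bernoulli")
scores <- rxPredict(model,
                    data = "loanScore.xdf",   # hypothetical scoring set
                    outData = "loanScores.xdf")
```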
Presented by David Smith at The Data Science Summit, Chicago, April 20 2017.
The ability to independently reproduce results is a critical issue within the scientific community today, and is equally important for collaboration and compliance in business. In this talk, I'll introduce several features available in R that help you make reproducibility a standard part of your data science workflow. The talk will include tips on working with data and files, combining code and output, and managing R's changing package ecosystem.
Presented by David Smith, R Community Lead (Microsoft), at Monktoberfest October 2016.
The value of open source isn’t just in the software itself. The communities that form around open source software provide just as much value and sometimes even more: in ongoing development, in documentation, in support, in marketing, and as a supply of ready-trained employees. Companies who build on open source tend to focus on the software, but neglect communities at their peril.
In this talk, I share some of my experiences in building community for an open-source software company, Revolution Analytics, and perspectives since the acquisition by Microsoft in 2015.
R is more than just a language. Many of the reasons why R has become such a popular tool for data science come from the ecosystem surrounding the R project. R users benefit from the many resources and packages created by the community, while commercial companies (including Microsoft) provide tools to extend and support R, and services to help people use R.
In this talk, I will give an overview of the R Ecosystem and describe how it has been a critical component of R’s success, and include several examples of Microsoft’s contributions to the ecosystem.
(Presented to EARL London, September 2016)
(Presented by David Smith at useR!2016, June 2016. Recording: https://channel9.msdn.com/Events/useR-international-R-User-conference/useR2016/R-at-Microsoft )
Since the acquisition of Revolution Analytics in April 2015, Microsoft has embarked upon a project to build R technology into many Microsoft products, so that developers and data scientists can use the R language and R packages to analyze data in their data centers and in cloud environments.
In this talk I will give an overview (and a demo or two) of how R has been integrated into various Microsoft products. Microsoft data scientists are also big users of R, and I'll describe a couple of examples of R being used to analyze operational data at Microsoft. I'll also share some of my experiences in working with open source projects at Microsoft, and my thoughts on how Microsoft works with open source communities including the R Project.
Hadoop is famously scalable. Cloud Computing is famously scalable. R – the thriving and extensible open source Data Science software – not so much. But what if we seamlessly combined Hadoop, Cloud Computing, and R to create a scalable Data Science platform? Imagine exploring, transforming, modeling, and scoring data at any scale from the comfort of your favorite R environment. Now, imagine calling a simple R function to operationalize your predictive model as a scalable, cloud-based Web Service. Learn how to leverage the magic of Hadoop on-premises or in the cloud to run your R code, thousands of open source R extension packages, and distributed implementations of the most popular machine learning algorithms at scale.
Real-time Big Data Analytics: From Deployment to Production
1. Real-Time Predictive Analytics with Big Data: From Deployment to Production
David M Smith @revodavid
VP Marketing and Community
Revolution Analytics
(Revolution Confidential)
2. Today we’ll discuss:
What is “real-time predictive analytics with big data”?
Case study: UpStream Software
Real-time big-data stack
Five phases of deployment to production
Just what do we mean by “Big Data” and “Real-Time”, anyway?
Recommendations
Q&A
3. Real-Time Predictive Analytics with Big Data
Real Time: Milliseconds? Seconds? Hours? Days? In production, continuously updated.
Big Data: Size, flow rate, diversity. In conflict with “real time”.
Predictive Analytics: Not the rear view (description, aggregation, tabulation), but prediction, inference, statistical modeling.
4. Case Study: UpStream Software
www.upstreamsoftware.com
“Given that our data sets are already in the terabytes and are growing rapidly, we depend on Revolution R Enterprise’s scalability and fast performance — we saw about a 4x performance improvement on 50 million records. It works brilliantly.”
John Wallace, CEO, Upstream Software
From: “How Big Data is Changing Retail Marketing Analytics” http://bit.ly/upstream-webinar
5. UpStream Software
UpStream’s Big Data Analytics engine analyzes and optimizes the marketing mix for UpStream’s retailer clients:
Customer analytics and segmentation
Revenue/action attribution among channels and marketing programs
Customized Next Best Action per individual prospect or customer
Event-Triggered Marketing: responses to consumer actions
6. Big Data
Demographics: consumer, product, market
Actions: web clicks, email clicks, mobile app usage, call center logs, social, search …
Outcomes: impressions, touches, orders (retail, online, mobile)
7. Predictive Analytics
[Diagram: UpStream data (ETL; N marketing channels; behavioral variables; promotional data; overlay data) feeds exploratory data analysis and modeling — time-to-event models and GAM survival models — with scoring for inference and prediction producing custom variables in PMML format, at 5 billion scores per day, per customer.]
8. Real Time
http://www.infoworld.com/d/big-data/williams-sonoma-uses-big-data-zero-in-customers-206641
November 8, 2012
9. Predictive vs Descriptive Analytics
Retail sales are influenced by stores near the home
Newer customers are more sensitive to marketing
Google keywords perform worse than you think
Display advertising performs better than you think
Online sales benefit more from seasonal events than retail stores
10. Real-Time Big Data Predictive Analytics Stack
11. Data Layer
Structured data: RDBMS, NoSQL, HBase, Impala
Unstructured data: Hadoop MapReduce
Streaming data: web, social, sensors, operational systems
Some descriptive analytics done here
12. Analytics Layer
Predictive analytics technology
Development environment: build models
Production environment: deploy real-time scoring; engine for dynamic analytics
Local data mart: static, periodically updated from the data layer; improves performance
13. Integration Layer
Connective tissue between end-user applications and the analytics engine
Engines for flow control and real-time scoring
Rules engine / CEP engine
API for dynamic analytics: brokers communication between app developers and data scientists
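The brokering role of the Integration Layer can be sketched minimally. This is a hypothetical illustration in Python (the deck's own tooling is R and RevoDeployR); `make_broker` and the registered engine name are invented for the example:

```python
# Minimal sketch of the Integration Layer's brokering role: a decision app
# asks for a score by model name; the broker routes the request to whichever
# scoring engine the data scientists have registered. The app never touches
# the analytics engine directly.

def make_broker(engines):
    """engines: dict mapping a deployed model name -> scoring function."""
    def score(model_name, params):
        if model_name not in engines:
            raise KeyError(f"no deployed model named {model_name!r}")
        return engines[model_name](params)
    return score

# Register a trivial "next best action" engine (hypothetical rule).
broker = make_broker({
    "next_best_action":
        lambda p: "email_offer" if p.get("recent_clicks", 0) > 2 else "no_action",
})

print(broker("next_best_action", {"recent_clicks": 5}))  # → email_offer
```

The point of the indirection is that data scientists can swap the engine behind a name without app developers changing any code.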
14. Decision Layer
End-user interface to the analytics system: customers, operations, business analysts, C-suite
Familiar and simple interfaces
Supports a range of end-user technologies
Level of UI complexity dictated by need
15. Real-Time Deployment
Five phases of deploying real-time predictive analytics with big data to production:
1. Data Distillation
2. Model Development
3. Validation and Deployment
4. Real-time Scoring
5. Model Refresh
16. Phase 1: Data Distillation
The data in the Data Layer isn’t yet ready for predictive modeling. We need to:
Extract features from unstructured text (topic modeling, sentiment analysis)
Combine disparate data sources
Filter for populations of interest
Select relevant features and outcomes for modeling
Export to the data mart for predictive modeling
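The distillation steps above can be sketched in miniature. This is an illustrative Python toy (the deck does this work in R and Hadoop); the data and field names are invented:

```python
# Toy sketch of data distillation: combine two disparate sources, filter to a
# population of interest, and keep only the features and outcome needed for
# modeling. Real distillation runs at Hadoop scale; the shape is the same.

clicks = {"c1": 12, "c2": 1}              # web clicks per customer (source A)
orders = {"c1": 3, "c2": 0, "c3": 7}      # orders per customer (source B)

def distill(clicks, orders, min_orders=1):
    rows = []
    for cust in sorted(set(clicks) | set(orders)):   # combine sources
        row = {"id": cust,
               "clicks": clicks.get(cust, 0),        # selected feature
               "orders": orders.get(cust, 0)}        # outcome for modeling
        if row["orders"] >= min_orders:              # filter the population
            rows.append(row)
    return rows                                      # export to the data mart

print(distill(clicks, orders))
```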
17. Example: Data Distillation in Hadoop
(Diagram: log files, event streams, and scraped text enter as unstructured data, are loaded into HDFS, processed with MapReduce via rmr, and emitted as structured data to the analytics data mart)
Revolution R Enterprise for Hadoop: bit.ly/r-hadoop
18. Automating the Extraction Process
This process needs to be repeatable and maintainable. Create a re-usable R script:
ASCII: rxDataStep
SQL: rxImport, ROracle, RMySQL
NoSQL: RCassandra, rmongodb
Hadoop: rmr, rhbase, RHive
Appliances: nza, teradataR, PL/R
Feeds: rjson, XML, RCurl, twitteR, python
The R script runs from the Analytics Layer; processing happens in the Data Layer; output is a structured file for the Data Mart
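The "re-usable script" idea is what matters: a parameterized, re-runnable extraction rather than a one-off. A minimal sketch, in Python for illustration (the deck's own scripts use the R packages listed above; `extract` and its inputs are hypothetical):

```python
# Sketch of a repeatable extraction step: read delimited source data and emit
# only the fields the data mart needs. Because it is a parameterized function,
# the same script can be re-run on every refresh cycle.

import csv
import io

def extract(source_text, keep_fields):
    """Parse delimited text and project it down to the modeling fields."""
    reader = csv.DictReader(io.StringIO(source_text))
    return [{k: row[k] for k in keep_fields} for row in reader]

raw = "id,channel,noise\n1,email,x\n2,search,y\n"
print(extract(raw, ["id", "channel"]))
```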
19. Data Distillation Process
(Diagram: the distillation process feeds the data mart via the XDF file format)
20. Phase 2: Model Development
Goal: create a predictive model that is
Powerful
Robust
Comprehensible
Implementable
Key requirements for Data Scientists:
Flexibility
Productivity
Speed
Reproducibility
21. The Model Development Cycle
(Diagram: the Data Mart feeds sampling and aggregation, then variable transformation, feature selection, and model estimation; model comparison, refinement, and benchmarking loop back until a predictive model emerges)
Download the white paper "R is Hot": bit.ly/r-is-hot
22. Technology Considerations
Moving Big Data is slow, so move it just once, to a data mart near the development and production environments
Static data is needed for the modeling cycle, but have a plan to refresh and update
The data layer is not optimized for predictive analytics, but is good for descriptive work (data distillation)
Revolution R Enterprise optimizations for the analytics layer: R; the XDF file format; Parallel External Memory Algorithms
23. Predictive Modeling Algorithms

Algorithm | Example Applications | Big Data
Data Step | ETL, data distillation, record/variable selection, variable transformation | ✓✓
Descriptive Statistics | Exploratory data analysis, data validation | ✓✓
Tables & Cubes | Reporting, contingency analysis | ✓✓
Correlation / Covariance | Factor analysis, Value at Risk | ✓✓
Linear Regression | Forecasting, net present value estimation | ✓✓
Logistic Regression | Response modeling, offer selection | ✓✓
Generalized Linear Models | Capital reserve estimation, climate modeling | ✓✓
K-means Clustering | Customer segmentation | ✓✓
Decision Trees | Dynamic pricing, classification, variable importance | ✓✓
Model Prediction | Real-time scoring (decisions, offers, actions) | ✓✓
R CRAN Packages | Everything else |
Parallel & Distributed Computing with R | Simulations, by-group analysis, ensemble models, custom applications | ✓
24. High Performance with Distributed Clusters
Data scientists and modelers run Revolution R Enterprise against a grid computing cluster:
Platform LSF
Microsoft HPC Server
Microsoft Azure
Ad-hoc grids (SNOW)
Shared SMP server
Hadoop / HDFS
25. Phase 3: Model Validation and Deployment
Scoring rules map parameters to outputs
Parameters: information known at real time (prospect ID, product, webpage, …)
Scores: values inferred in real time (prices, recommendations, actions, …)
Validation: backtest with historical data
Deployment: code running in real time
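A scoring rule in this sense is just a pure function from known-at-request-time parameters to a score, which makes backtesting mechanical. A hypothetical sketch in Python (the pricing rule and the historical records are invented for illustration):

```python
# A scoring rule maps request-time parameters to a score; a backtest replays
# it over historical records and measures how far its scores were from the
# realized outcomes.

def scoring_rule(params):
    # Hypothetical rule: give newer customers a 10% discount off list price.
    base = params["list_price"]
    return base * (0.9 if params["tenure_days"] < 90 else 1.0)

history = [  # (parameters seen at request time, realized outcome)
    ({"list_price": 100, "tenure_days": 30}, 90.0),
    ({"list_price": 100, "tenure_days": 400}, 100.0),
]

# Backtest: compare the rule's scores against what actually happened.
errors = [abs(scoring_rule(params) - outcome) for params, outcome in history]
print(max(errors))  # → 0.0
```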
26. Validation for Production
(Diagram: scoring rules applied to historical data produce scores, which are compared against real outcomes)
Refresh the data mart
Rebuild the model, withholding a validation set (random sampling)
Create an accuracy measure (ROC, positive rate, false positive rate)
Measure and monitor
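The holdout-and-accuracy-measure step can be sketched end to end. The synthetic data and the 0.5 threshold below are invented for illustration; a real validation would use the withheld rows from the data mart:

```python
# Sketch of validation: score a withheld sample and compute simple accuracy
# measures (true-positive and false-positive rates at one threshold). ROC
# analysis repeats this over many thresholds.

import random

random.seed(0)

def simulate(n=1000):
    """Synthetic scored records: positives tend to receive higher scores."""
    rows = []
    for _ in range(n):
        y = random.random() > 0.5                         # true outcome
        s = (0.6 if y else 0.2) + 0.4 * random.random()   # model score
        rows.append((s, y))
    return rows

holdout = simulate()[:200]   # the withheld validation set
threshold = 0.5

tp = sum(1 for s, y in holdout if s >= threshold and y)
fp = sum(1 for s, y in holdout if s >= threshold and not y)
pos = sum(1 for _, y in holdout if y)
neg = len(holdout) - pos

print("TPR:", tp / pos, "FPR:", fp / neg)
```

Measuring and monitoring then means tracking these rates on each refresh, not just once at deployment.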
27. Deployment with R Code
(Diagram: the same R model object that scored historical data in development scores new data in production)
Scoring rules are captured as “R model objects”
Move R code directly from the development environment to the production server
No recoding: lowest cost, greatest speed and reliability
Most flexibility (not limited in model choice)
Easy to validate with custom accuracy measures
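The key idea is moving the serialized model object itself, not a recoded translation of it. A sketch of that idea, with Python's `pickle` standing in for R's serialized model objects (the `Model` class is invented for illustration):

```python
# "No recoding" deployment: serialize the fitted model object in development,
# deserialize the identical object in production, and score with it directly.

import pickle

class Model:
    """A trivial stand-in for a fitted model object."""
    def __init__(self, coef):
        self.coef = coef
    def score(self, x):
        return sum(c * v for c, v in zip(self.coef, x))

# Development environment: fit the model and serialize it.
blob = pickle.dumps(Model([0.5, 2.0]))

# Production server: load the same object and score, with no reimplementation.
deployed = pickle.loads(blob)
print(deployed.score([4, 1]))  # → 4.0
```

Because the production object is byte-for-byte the development object, there is no translation step to introduce errors, which is the reliability claim above.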
28. Managing and Scaling Deployed R Code
Revolution R Enterprise includes RevoDeployR:
Code management and tracking
Security and resource allocation
Scalability for real-time demands
Web Services API: scoring and dynamic analytics, data visualization, custom reporting, ad-hoc analytics, interactive data apps
29. Code Conversion for Scoring
Options: business rules (e.g. IBM ILOG), created directly with R code from the model object, or recoding in other languages (SQL, C++, etc.)
Tradeoff: low latency, but slow and costly deployment, with potential for errors
30. Scoring via PMML
Some predictive models can be expressed in the PMML standard (www.dmg.org); not all, but the list is growing all the time
(Diagram: an R model object is exported via the pmml package to a PMML document, which a PMML scoring engine applies to new data)
Easily generate PMML with the R “pmml” package
PMML scoring engine: Zementis ADAPA
Databases and appliances may support PMML, but supported standards vary
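PMML's core idea is that the model travels as a declarative document rather than as code, so any compliant engine can score it. A sketch of that idea with JSON standing in for PMML (the document schema and `score` engine here are invented, not the actual PMML format):

```python
# Sketch of the PMML idea: express the model as data, not code, so a separate
# scoring engine that only understands the document format can apply it.

import json

# "Export": a simple linear model as a portable document.
doc = json.dumps({"type": "linear", "intercept": 1.0, "coefs": {"clicks": 0.5}})

# A stand-alone "scoring engine": it never sees the original modeling code.
def score(model_doc, record):
    m = json.loads(model_doc)
    return m["intercept"] + sum(m["coefs"][k] * record[k] for k in m["coefs"])

print(score(doc, {"clicks": 4}))  # → 3.0
```

The limitation noted above follows directly: only models whose structure fits the document schema can be exported this way.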
31. Phase 4: Real-Time Scoring
Scoring is triggered by the Decision Layer and brokered by the Integration Layer:
R code: Revolution R Enterprise Server; in-appliance (e.g. IBM PureData System for Analytics)
SQL: in-database
Rules
Compiled code: bespoke engines
May use hardware from the data layer, but not the actual data
32. Real-Time Scoring: Review
(Diagram: scoring options reviewed by latency, from real time with R code down to near real time with ILOG rules and SQL)
33. Phase 5: Model Refresh
Deployed scoring rules are no longer connected to the Data Layer or Data Mart; this enables real-time performance, but they need to be refreshed using recent data
Model refresh process:
Use the data extract script
Re-run the model script
Statistical review OR automated validation
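The refresh process above is a pipeline: re-run extraction, re-run the model script, then gate deployment on validation. A hypothetical sketch (the callbacks, the 0.7 accuracy threshold, and the model names are all invented for illustration):

```python
# Sketch of a model refresh cycle: re-extract, refit, then deploy the
# candidate only if automated validation clears a minimum accuracy bar;
# otherwise keep the current model and leave it for statistical review.

def refresh(extract, fit, validate, deployed, min_accuracy=0.7):
    data = extract()                    # re-run the data extract script
    candidate = fit(data)               # re-run the model script
    if validate(candidate, data) >= min_accuracy:
        return candidate                # automated validate/deploy
    return deployed                     # hold: flag for statistical review

current = "model_v1"
new = refresh(
    extract=lambda: [1, 2, 3],          # stand-in extraction
    fit=lambda d: "model_v2",           # stand-in refit
    validate=lambda m, d: 0.9,          # stand-in backtest accuracy
    deployed=current,
)
print(new)  # → model_v2
```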
34. Scheduled Batch Updates
Periodic model refresh: weekly, daily, or hourly
(Diagram: a scheduler drives the Revolution R Enterprise production server cluster through an automated validate/deploy process, while Excel and other clients reach the RevoDeployR server through the Web Services API)
35. Re-developing the Model
Even then, the underlying structure of the data may change:
Important variables become non-significant
Non-significant variables become important!
New data sources become available
Model accuracy drift is a warning sign: return to Phase 2 or Phase 1
36. How big is ‘Big Data’?
(Diagram: data sizes from megabytes and gigabytes through terabytes to petabytes and exabytes, arriving at kilobytes to megabytes per second)
37. How fast is ‘Real time’?
(Diagram: response times from weeks and days, through hours and minutes, down to seconds or less and milliseconds, with refresh cycles ranging from daily to hourly)
38. Recommendations
Architect your Predictive Analytics Stack: best of breed vs. a single-vendor stack; diversify data sources and decision apps
R = Predictive Analytics: Revolution R Enterprise provides the Analytics Layer (big data, performance) and the Integration Layer (scalability, reliability)
Get help to get started: consider an SI partner for architecture, best practices, and implementation advice
39. Resources
Revolution R Enterprise (R for Big Data): www.revolutionanalytics.com/products
Revolution Analytics Consulting Services: www.revolutionanalytics.com/services
Contact us: bit.ly/hey-revo
RHadoop (connecting R and Hadoop): bit.ly/r-hadoop
Contact David Smith: david@revolutionanalytics.com, @revodavid, blog.revolutionanalytics.com
40. Thank you.
The leading commercial provider of software and support for the popular open source R statistics language.
www.revolutionanalytics.com | 650.646.9545 | Twitter: @RevolutionR