Big Data matters because we want to use it to make sense of our world. It’s tempting to think there’s some “magic bullet” for analyzing big data, but simple “data distillation” often isn’t enough, and unsupervised machine-learning systems can be dangerous. (Like, bringing-down-the-entire-financial-system dangerous.) Data Science is the key to unlocking insight from Big Data: by combining computer-science skills with statistical analysis and a deep understanding of the data and the problem, we can not only make better predictions, but also fill gaps in our knowledge, and even find answers to questions we hadn’t thought to ask yet.
Big data is a phenomenon brought about by rapid data growth; complex, new, and changing data types; and parallel advances in technology, and it brings huge possibilities. By exploiting these enormous volumes of structured and unstructured data, CSPs are in a unique position to capture these opportunities and create new revenue streams.
This presentation focuses on the progression from “Data” to “Big Data” to “Bigger Data”, and on the challenges, opportunities, and solutions arising from these trends:
What challenges does this massive data bring to the table?
What opportunities does this data provide?
Some solutions for handling this data.
A data monetization framework from Accenture Interactive. Three questions your company should answer to start realizing revenue opportunities from your data.
My keynote talk at the San Diego Superdata conference, looking at the history and current state of Analytics and Data Mining, and examining the effects of Big Data.
W-JAX Keynote - Big Data and Corporate Evolution (jstogdill)
A look at corporate evolution from the industrial revolution to the information age - with a focus on how Big Data will make an impact.
Presented at the W-JAX Java Conference in Munich, Germany, 11-8-11.
Data visualization trends in Business Intelligence: Allison Sapka at Analytic... (Fitzgerald Analytics, Inc.)
Allison Sapka's presentation at the Analytics and Data in Financial Services Meetup in December 2012. Allison discusses trends in data visualization, including why visualization is so powerful when implemented well, and confusing or misleading when done badly.
The enterprise software stack is undergoing a once-in-a-generation refresh, largely driven by virtualization, the data explosion, infrastructure commoditization, socialization, unlimited connectivity, and online services. With an ever-growing security perimeter and expanding attack vectors, enterprises are looking for ways to secure information access without compromising the business agility unleashed by these forces. This presentation focuses on the emerging opportunities in the enterprise space that entrepreneurs can leverage to build the technology giants of tomorrow.
Semantics, Deep Learning, and the Transformation of Business (Steve Omohundro)
Deep learning is likely to have a big impact on business. McKinsey predicts that AI and robotics will create $50 trillion of value over the next 10 years. Over $1 billion of venture investment has gone to 250 deep learning startups over the past year. Deep learning systems have recently broken records in speech recognition, image recognition, image captioning, translation, drug discovery and other tasks. Why is this happening now and how is it likely to play out? We review the development of AI and the pendulum swings between the "neats" and the "scruffies". We describe traditional approaches to semantics through logics and grammars and the new deep learning vector semantics. We relate it to Roger Shepard's cognitive geometry and the structure of biological networks. We also describe limitations of deep learning for safety and regulation. We show how it fits into the rational agent framework and discuss what the next steps may be.
An introductory lecture on how international media are using data visualization to tell stories. Some live demonstrations in the class are not reflected in the slides, and the in-class exercises are not included.
Today, we have data – lots of it. We can process information – in many ways. And with these two tools and a little bit of creativity, we are discovering the vast depths of human behavior and by extension, a way to accurately predict the future -- and our future happiness. In fact, we can quantify human movement, behaviors, desires, and even moods on a scale that wasn’t possible before a series of advances in processing power, developments in psychology and social network science, and most importantly, access to data.
In advertising, industry, and humanity, we have experienced the evolution from Web 1.0 (informational) to Web 2.0 (platform) to Web 3.0 (semantic) to elements of Web 4.0 (anticipatory) – In this anticipatory era, what can we dream of next? Beyond addressability and increasing ad relevance, how can businesses utilize these advances in product development and other market initiatives? Can we make the leap from inductive logic to human-paralleled intuition? Can this make up for our human brain mechanics that make predicting our own happiness so difficult?
In this talk we’ll cover the evolutions in data access, models for information processing, and the science of collaboration to see not only how they have been leveraged in businesses but also how they are used to better understand human behavior, and hopefully in the near future, a little bit of happiness.
Data Science ATL Meetup - Risk I/O Security Data Science (Michael Roytman)
This is a talk about data science operations and the application of Risk I/O's insights to the security industry: how we went about mining insights from our large dataset.
A global revolution is in full swing, and the Sustainable Brands Conference is where sustainability, brand and innovation leaders gather to learn, share and strategize to shape the future. SB'12 was the largest gathering to date, a kinetic convergence of innovators from more than 150 companies from around the world finding new ways to create monumental disruption in traditional models of commerce and consumption.
Using big data and implementing Hadoop is a trend that people jump to all too quickly. Instead, understanding the run-time complexity of one's algorithms, reducing that complexity, and managing the process from start to finish in a lean and agile way can yield massive cost savings, or even save your organization.
An introductory presentation about the possibilities that Big Data opens up to a public-safety company, e.g. taking advantage of smart-city grids and crime and accident databases.
Over the past weeks we have been examining the inference process- big.docx (lmark1)
Over the past weeks we have been examining the inference process, big data (which feeds this process), and now the interdependent nature of digital devices in a digital age. I invite you to follow the links below and explore the constructs of Information Overload and Machine Data. Find others of your own as well.
Do either of these constructs resonate with you? How do they interact with the erosion of boundaries between work and home life that many of you have referenced? How does our dependence on digital devices feed machine data and big data? How does it compromise our personal security?
Try weaving one of these ideas into your thinking about ICT.
The links are:
Big Data Means Information Overload
https://www.youtube.com/watch?v=6MpfVD-c-QI
http://www.forbes.com/sites/laurashin/2014/11/14/10-steps-to-conquering-information-overload/#7bd535d424fe
Machine Data and Operational Intelligence
http://www.splunk.com/en_us/resources/machine-data.html
http://internetofthingsagenda.techtarget.com/definition/machine-data
The REAL Impact of Big Data on Privacy (Claudiu Popa)
The awesome promise of Big Data is tempered by the need to protect personal information. Data scientists must expertly navigate the legislative waters and acquire the skills to protect privacy and security. This talk provides enterprise leaders with answers and suggests questions to ask when the time comes to consider the vast opportunities offered by big data.
Moderator:
Richard Villars, Vice President, Information & Cloud, IDC
Panelists:
Andrew Stokes, Chief Scientist, Deutsche Bank Global Technology
Elad Yoran, Chairman and CEO, Vaultive, Inc.
Gordon Haff, Cloud Evangelist, Red Hat
Greg Brown, VP and CTO, Cloud and Data Center Solutions, McAfee
John Engates, CTO, Rackspace
Reuven Cohen, Senior Vice President, Virtustream
Everybody has heard of Big Data and its promise as the next great frontier for innovation. However, Big Data is neither new nor easily defined. What are the key drivers that make Big Data so critically important today? What is the single idea behind Big Data that promises such game-changing outcomes for capable organizations? And who are the skilled people that deliver Big Data results?
This presentation briefly reviews the opportunities, motivations, and trends that are driving Big Data disruption. Data science is introduced as the enabling engine for Big Data transformation via the creation of new Data Products. The data scientist is defined, and their tools, workflow, and challenges are reviewed. Finally, practical tips are presented for approaching data product development.
Key takeaways include:
- Big Data disruption is driven by four megatrends
- Data is the essential raw material for creating valuable Data Products
- Data scientists are heterogeneous by role & skill set, but share common tools, workflows and challenges
- Data science talent is more important than raw data for Big Data success
These slides are modified from an invited presentation for the Gwinnett Chamber of Commerce on March 18, 2014. An excerpt was presented at the Georgia Pacific Social Media Working Session on March 19, 2014.
Presented to eRum (Budapest), May 2018
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk I'll describe the doAzureParallel package, a backend to the "foreach" package that automates the process of spawning a cluster of virtual machines in the Azure cloud to process iterations in parallel. This will include an example of optimizing hyperparameters for a predictive model using the "caret" package.
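The foreach pattern described above can be sketched locally (a minimal illustration, assuming the foreach and doParallel packages are installed; doAzureParallel would simply register an Azure VM cluster as the backend in place of the local one):

```r
# Minimal sketch of an embarrassingly parallel workload with foreach.
# doParallel stands in here for doAzureParallel, which would register
# a cluster of Azure virtual machines instead of local workers.
library(foreach)
library(doParallel)

cl <- makeCluster(2)          # local stand-in for a cloud cluster
registerDoParallel(cl)

# Each iteration is independent, e.g. one simulation (or one
# cross-validation fold) per iteration.
results <- foreach(i = 1:8, .combine = c) %dopar% {
  mean(rnorm(10000, mean = i))
}
stopCluster(cl)

print(results)   # approximately 1, 2, ..., 8
```

Because each iteration touches no shared state, swapping the backend (local cores, an Azure cluster) requires no change to the loop itself.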
By David Smith. Presented at Microsoft Build (Seattle), May 7 2018.
Your data scientists have created predictive models using open-source tools, proprietary software, or some combination of both, and now you are interested in lifting and shifting those models to the cloud. In this talk, I'll describe how data scientists can transition their existing workflows — while using mostly the same tools and processes — to train and deploy machine learning models based on open source frameworks to Azure. I'll provide guidance on keeping connections to data sources up-to-date, evaluating and monitoring models, and deploying applications that make use of those models.
Presentation delivered by David Smith to NY R Conference https://www.rstats.nyc/, April 2018:
Minecraft is an open-world creativity game, and a hit with kids. To get kids interested in learning to program with R, we created the "miner" package. This package is a collection of simple functions that allow you to connect with a Minecraft instance, manipulate the world within by creating blocks and controlling the player, and to detect events within the world and react accordingly.
The miner package is intended mainly for kids, to inspire them to learn R while playing Minecraft. But the development of the package also provides some useful insights into how to build an R package to interface with a persistent API, and how to instruct others on its use. In this talk I'll describe how to set up your own Minecraft server, and how to use and extend the package. I'll also provide a few examples of the package in action in a live Minecraft session.
While Python is a widely used tool for AI development, in this talk I'll make the case for considering R as a platform for developing models for intelligent applications. Firstly, R provides a first-class experience for working with deep learning frameworks via its keras integration. Equally importantly, it provides the most comprehensive suite of statistical data analysis tools, which are extremely useful for many intelligent applications such as transfer learning. I'll give a few high-level examples in this talk, and we'll go into further detail in the accompanying interactive code lab.
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk I'll describe several techniques available in R to speed up workloads like these, by running multiple iterations simultaneously, in parallel.
Many of these techniques require the use of a cluster of machines running R, and I'll provide examples of using cloud-based services to provision clusters for parallel computations. In particular, I will describe how you can use the SparklyR package to distribute data manipulations using the dplyr syntax, on a cluster of servers provisioned in the Azure cloud.
Presented by David Smith at Data Day Texas in Austin, January 27 2018.
A look at the changing perceptions of R, from the early days of the R project to today. Microsoft sponsor talk, presented by David Smith to the useR!2017 conference in Brussels, July 5 2017.
Predicting Loan Delinquency at One Million Transactions per Second (Revolution Analytics)
Real-time applications of predictive models must be able to generate predictions at the rate that transactions are generated. Previously, such applications of models trained using R needed to be converted to other languages like C++ or Java to achieve the required throughput. In this talk, I’ll describe how to use the in-database R processing capabilities of Microsoft R Server to detect fraud in a SQL Server database of loan records at a rate exceeding one million transactions per second. I will also show the process of training the underlying gradient-boosted tree model on a large training set using the out-of-memory algorithms of Microsoft R.
Presented by David Smith at The Data Science Summit, Chicago, April 20 2017.
The ability to independently reproduce results is a critical issue within the scientific community today, and is equally important for collaboration and compliance in business. In this talk, I'll introduce several features available in R that help you make reproducibility a standard part of your data science workflow. The talk will include tips on working with data and files, combining code and output, and managing R's changing package ecosystem.
Presented by David Smith, R Community Lead (Microsoft), at Monktoberfest October 2016.
The value of open source isn’t just in the software itself. The communities that form around open source software provide just as much value and sometimes even more: in ongoing development, in documentation, in support, in marketing, and as a supply of ready-trained employees. Companies who build on open source tend to focus on the software, but neglect communities at their peril.
In this talk, I share some of my experiences in building community for an open-source software company, Revolution Analytics, and perspectives since the acquisition by Microsoft in 2015.
R is more than just a language. Many of the reasons why R has become such a popular tool for data science come from the ecosystem surrounding the R project. R users benefit from the many resources and packages created by the community, while commercial companies (including Microsoft) provide tools to extend and support R, and services to help people use R.
In this talk, I will give an overview of the R Ecosystem and describe how it has been a critical component of R’s success, and include several examples of Microsoft’s contributions to the ecosystem.
(Presented to EARL London, September 2016)
(Presented by David Smith at useR!2016, June 2016. Recording: https://channel9.msdn.com/Events/useR-international-R-User-conference/useR2016/R-at-Microsoft )
Since the acquisition of Revolution Analytics in April 2015, Microsoft has embarked upon a project to build R technology into many Microsoft products, so that developers and data scientists can use the R language and R packages to analyze data in their data centers and in cloud environments.
In this talk I will give an overview (and a demo or two) of how R has been integrated into various Microsoft products. Microsoft data scientists are also big users of R, and I'll describe a couple of examples of R being used to analyze operational data at Microsoft. I'll also share some of my experiences in working with open source projects at Microsoft, and my thoughts on how Microsoft works with open source communities including the R Project.
Hadoop is famously scalable. Cloud Computing is famously scalable. R – the thriving and extensible open source Data Science software – not so much. But what if we seamlessly combined Hadoop, Cloud Computing, and R to create a scalable Data Science platform? Imagine exploring, transforming, modeling, and scoring data at any scale from the comfort of your favorite R environment. Now, imagine calling a simple R function to operationalize your predictive model as a scalable, cloud-based Web Service. Learn how to leverage the magic of Hadoop on-premises or in the cloud to run your R code, thousands of open source R extension packages, and distributed implementations of the most popular machine learning algorithms at scale.
The Rise of Data Science in the Age of Big Data Analytics: Why data distillation and machine learning aren't enough
1. Revolution Confidential
The Rise of Data Science in the Age of Big Data Analytics
Why Data Distillation and Machine Learning Aren’t Enough
David M Smith
VP Marketing and Community
Revolution Analytics
2. Today, we’ll discuss:
What is Data Science?
Why machine learning isn’t enough
Why Data Science works
The Data Scientist’s Toolkit
The Future of Big Data Analytics
Closing thoughts and resources
4. Where is it safe to fish near San Francisco?
San Francisco Estuary Institute
http://www.sfei.org/tools/wqt
5. Hurricane Sandy
Bob Rudis
http://rud.is/b/2012/10/28/watch-sandy-in-r-including-forecast-cone/
6. Hurricane Sandy
Ed Chen
http://blog.echen.me/hurricane-sandy-outages/
7. When did Michael Jackson have his biggest hits?
New York Times, June 25 2009 (3 hours after Michael Jackson’s death)
http://www.nytimes.com/interactive/2009/06/25/arts/0625-jackson-graphic.html
8. Three Essential Skills of Data Scientists
(Diagram: the three essential skills of data scientists, with labels Models, Predictions, Data Integration, Mashups, Visualization, Uncertainty, Problems, Data Sources, Credibility, and Effective Data Applications.)
Drew Conway
http://www.dataists.com/2010/09/the-data-science-venn-diagram/
10. Machine learning (ML) for predictions
(Diagram: Building the Model: features and responses are used to learn ML rules. Scoring new data: the rules are applied to new data to produce predictions (scores). Validating the Model: the rules score a validation set, and predictions are compared against responses to estimate “accuracy”.)
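The build / score / validate loop on this slide can be sketched with base R's glm (a minimal illustration on simulated data, not the deck's own code):

```r
# Sketch of the slide's build / score / validate loop using base R.
# Data here is simulated purely for illustration.
set.seed(42)
n <- 200
features <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
response <- as.integer(features$x1 + features$x2 + rnorm(n) > 0)
d <- cbind(features, y = response)

train <- d[1:150, ]        # building the model: training features + responses
valid <- d[151:200, ]      # held-out validation set

model <- glm(y ~ x1 + x2, data = train, family = binomial)   # the "ML rules"

scores <- predict(model, newdata = valid, type = "response")  # scoring
preds  <- as.integer(scores > 0.5)                            # predictions

accuracy <- mean(preds == valid$y)   # validate: compare predictions to responses
print(accuracy)
```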
15. Answer Unasked Questions
Revolutions blog: “The Uncanny Valley of Big Data”
http://blog.revolutionanalytics.com/2012/02/the-uncanny-valley-of-big-data.html
16. Fill in knowledge gaps
“Companies that have massive amounts of data without massive amounts of clue are going to be displaced by startups that have less data but more clue.” -- Tim O’Reilly
“More data beats better algorithms, every time” – Google
Google Research, “The Unreasonable Effectiveness of Data”:
http://googleresearch.blogspot.com/2009/03/unreasonable-effectiveness-of-data.html
Tim O’Reilly on Google+: https://plus.google.com/107033731246200681024/posts/4Xa76AtxYwd
TechnoCalifornia: http://technocalifornia.blogspot.com/2012/07/more-data-or-better-models.html
19. 0. Data (Big & Messy)
20. 1. A language for programming with data Revolution Confidential
Download the White Paper
R is Hot
bit.ly/r-is-hot
21. Data import and pre-processing
User-defined functions
Internet API interface
XML parsing
Iterative data processing
Custom graphics
Example: Grant awards to homeless veterans FY09 (Data: Data.gov; Analysis: Drew Conway)
22. 2. Speed. Lots and lots of speed.
(Diagram: a modeling pipeline: data sampling and aggregation, variable transformation, feature selection, model estimation, and predictions, with model comparison / refinement and benchmarking feeding back into the process.)
23. Use all available computing cycles
(Diagram: data flows from disk into shared memory and is processed concurrently by threads 0 through n, one per core of a multicore processor with 4, 8, 16+ cores.)
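The multicore idea on this slide can be sketched with base R's parallel package (a generic illustration; the worker count and the workload are arbitrary choices for the example):

```r
# Sketch of spreading independent work across cores with the base
# "parallel" package (illustrative; not the deck's own code).
library(parallel)

cl <- makeCluster(2)   # ideally one worker per core; 2 for portability

# Each worker computes a chunk of the iterations simultaneously
squares <- parLapply(cl, 1:8, function(i) i^2)
stopCluster(cl)

print(unlist(squares))   # 1 4 9 16 25 36 49 64
```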
24. 3. Algorithms that don’t choke on Big Data
(Diagram: BIG DATA is split into data partitions, each processed by a compute node under the coordination of a master node.)
PEMAs: Parallel External-Memory Algorithms
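The external-memory idea behind a PEMA can be illustrated with a toy chunked mean in base R: process the data one chunk at a time, keep only small partial results in memory, and combine them at the end (a conceptual sketch, not Revolution's implementation):

```r
# Toy PEMA-style computation: a mean computed over chunks, holding
# only running sums in memory. In a real PEMA the chunks would live
# on disk or on separate compute nodes. (Conceptual sketch only.)
chunked_mean <- function(x, chunk_size = 100) {
  total <- 0
  count <- 0
  for (start in seq(1, length(x), by = chunk_size)) {
    chunk <- x[start:min(start + chunk_size - 1, length(x))]
    total <- total + sum(chunk)     # partial result per chunk
    count <- count + length(chunk)
  }
  total / count                     # combine partials at the "master"
}

x <- 1:1000
print(chunked_mean(x) == mean(x))   # TRUE: same answer as in-memory
```

The key property is that the per-chunk work is independent and the partial results are tiny, so the same algorithm parallelizes across nodes without ever loading the full dataset.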
25. Drink less coffee!
(Chart: runtime of single-threaded, non-optimized algorithms compared with optimized, parallelized algorithms.)
26. 4. Move code to data (not vice versa)
Map-Reduce
RHadoop: http://bit.ly/RHadoop
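The map-reduce pattern behind this slide can be sketched in plain R (a local, conceptual illustration of what RHadoop distributes across a Hadoop cluster; the word-count data here is invented for the example):

```r
# Conceptual map-reduce in plain R: the same shape of computation that
# RHadoop would push out to a Hadoop cluster, run locally here.
words <- c("big", "data", "big", "analytics", "data", "big")

# Map: split the input into partitions and count keys within each
# partition ("the code moves to each partition of the data").
partitions <- split(words, rep(1:2, length.out = length(words)))
mapped <- lapply(partitions, table)

# Reduce: combine the per-partition counts by key.
keys <- unique(unlist(lapply(mapped, names)))
counts <- sapply(keys, function(k)
  sum(sapply(mapped, function(m) if (k %in% names(m)) m[[k]] else 0)))

print(counts)   # big: 3, data: 2, analytics: 1
```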
27. Big Data Appliances
More info: http://bit.ly/R-Netezza
28. Play Nice with Others
Presentation Layer
• Business Intelligence Tools
• Web-based data apps
• Reporting / Spreadsheets
Analytics Layer
• R
Data Layer
• Relational datastores
• Unstructured datastores
29. What every data scientist needs
Feature (Open-Source R / Revolution R Enterprise):
Interface with multiple data sources: ✓ / ✓✓
Exploratory data analysis: ✓✓ / ✓✓
Wide range of statistical methods: ✓✓ / ✓✓
High-speed computation: ✘ / ✓✓
Big Data support: ✘ / ✓✓
Data/code locality (Hadoop, etc.): ✘ / ✓✓
Print-quality data visualization: ✓ / ✓
Scheduled batch production: ✓ / ✓✓
Works in a multi-tool ecosystem: ✓✓ / ✓✓
Integration into Data Apps: ✘ / ✓✓
30. Revolution R Enterprise: Big-Data R
Feature (Open-Source R / Revolution R Enterprise):
Interface with multiple data sources: ✓ / ✓✓
Exploratory data analysis: ✓✓ / ✓✓
Wide range of statistical methods: ✓✓ / ✓✓
High-speed computation: ✘ / ✓✓
Big Data support: ✘ / ✓✓
Data/code locality (Hadoop, etc.): ✘ / ✓✓
Print-quality data visualization: ✓✓ / ✓✓
Scheduled batch production: ✓ / ✓✓
Works in a multi-tool ecosystem: ✓✓ / ✓✓
Integration into Data Apps: ✘ / ✓✓
www.revolutionanalytics.com/products
32. And … the future?
Even more data
Cloud computing
Demand for Data Scientists
Diverging paradigms for data analytics
http://www.indeed.com/jobtrends
33. Diverging data paradigms
(Diagram: diverging paradigms: files, data appliances, and clusters versus Hadoop and NoSQL, spanning exploration, modeling, storage, preprocessing, and production; one direction offers more data and better fault tolerance, the other easier programming and better performance.)
34. Data Science in Production
Real-time Big Data Analytics: From Deployment to Production
Thursday, November 29, 2012, 10:00AM - 11:00AM Pacific Time
www.revolutionanalytics.com/news-events/free-webinars/
35. Building Data Science Teams
DJ Patil in O’Reilly Radar: http://oreil.ly/I3H5fI
Statistics and Data Science graduates
Kaggle and Chorus
Revolution Analytics R Training:
http://www.revolutionanalytics.com/services/training/
36. Closing Thoughts
The Data Science process leads to more powerful, and more useful, models
Data Scientists need a technology platform to think about, explore, and model data
Revolution R Enterprise is R for Big Data
37. Resources
Revolution R Enterprise: R for Big Data
www.revolutionanalytics.com/products
RHadoop: Connecting R and Hadoop
bit.ly/r-hadoop
Contact David Smith
david@revolutionanalytics.com
@revodavid
blog.revolutionanalytics.com
38. Thank you.
The leading commercial provider of software and support for the popular open source R statistics language.
www.revolutionanalytics.com 650.646.9545 Twitter: @RevolutionR