1
Big Data
Past, Present & Future
Where are We Headed?
Rob Peglar
CTO Americas
Isilon Storage Division
EMC Corporation
rob.peglar@emc.com
@peglarr
2
• In order to understand what’s coming, we must
understand our past
• We must also understand that
Big Data is fundamentally
different from what we’re used to
• Consider the difference between a still photograph
and a movie – and our human perception of them
– More than a collection of still photographs – why?
Prediction is Very Difficult -
Especially About the Future
- Niels Bohr
3
The Past –
and I Mean the Past
• Consider the census…
• From the Latin “censere”
– meaning “to estimate”
• “In those days a decree went out from Emperor Augustus that all
the world should be registered.” Luke 2:1
• The Domesday Book of 1086 – England
– Comprehensive tally of people, their land, and property
• The US Constitution mandates a decennial census
– The 1880 census took eight years (!) to complete
• This led to Hollerith’s punched card tabulator in 1890
– The beginning of automated data processing
– Reduced the census time to one year
4
Sampling – Good or Bad?
• Sampling precision improves most
with randomness
– Not with a larger sample size
– Jerzy Neyman (Poland, 1934) proved this
• Neyman, J. (1934) "On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection", Journal of the Royal Statistical Society, 97 (4), 557–625
• Good - Sampling was a solution to information overload
• Bad - Systematic bias in sampling gives wrong conclusions
• A seismic shift is occurring – from
– Sampling, keeping datasets small on purpose, using them once…to
– N=all, keeping datasets large on purpose, using them many times
• Why? The outliers are the most interesting!
– Examples – credit card fraud, language translation, insurability
– Don’t just follow the rules, look for the exceptions
(Image: Williams Tube, 1946, 1,024 bits)
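The sampling point above (randomness matters more than sample size, and systematic bias gives wrong conclusions) can be illustrated with a small simulation. This is only a sketch on made-up data: the population, the "easy to reach" group, and the sample sizes are all assumptions chosen for illustration, not figures from the talk or from Neyman's paper.

```python
import random
import statistics


def simulate(seed: int = 0) -> dict:
    """Compare a small random sample with a large but biased sample."""
    rng = random.Random(seed)

    # Hypothetical population: 20% "easy to reach" members with a higher
    # average value, 80% "hard to reach" members with a lower one.
    easy = [rng.gauss(70, 10) for _ in range(20_000)]
    hard = [rng.gauss(50, 10) for _ in range(80_000)]
    population = easy + hard
    true_mean = statistics.fmean(population)

    # Small sample drawn at random from the whole population.
    small_random = rng.sample(population, 500)
    # Much larger sample, drawn only from the easy-to-reach group.
    large_biased = rng.sample(easy, 10_000)

    return {
        "true_mean": true_mean,
        "random_error": abs(statistics.fmean(small_random) - true_mean),
        "biased_error": abs(statistics.fmean(large_biased) - true_mean),
    }


result = simulate()
print(result)
```

In this setup the 500-item random sample lands close to the true mean, while the 10,000-item biased sample stays far from it no matter how large it grows.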
5
The Journey from
Clean to Messy
• 1998 – Linden et al, collaborative
filtering patent, working at a Seattle startup selling books
online
– G. Linden, J. Jacobi, and E. Benson, Collaborative Recommendations Using Item-to-Item Similarity Mappings, US Patent 6,266,649 (to Amazon.com), Patent and Trademark Office, Washington, D.C., 2001
• “If it works perfectly, Amazon should show you just one
book – the next one you will buy.” (Linden)
• Hypothesis-driven approach becomes data-driven
– “Proving” something (causation) → correlation
• McGregor et al – using big data to improve the NICU
– 16 data streams, 1,260 data points/sec
– Valid reduction in adverse outcomes for premature infants
– No “proof” – it helps doctors make better diagnostic decisions
– Carolyn McGregor, "Big Data in Neonatal Intensive Care," Computer, vol. 46, no. 6, pp. 54-59, June 2013, doi:10.1109/MC.2013.157
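The item-to-item idea mentioned at the top of this slide can be sketched in a few lines. This is a toy illustration using cosine similarity over co-purchases, which is one common way to build item-to-item mappings; it is not the method claimed in the patent, and the customers, titles, and function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical purchase history: customer -> set of book titles.
purchases = {
    "ann": {"A", "B", "C"},
    "bob": {"A", "B"},
    "cat": {"B", "C"},
    "dan": {"A", "D"},
}


def item_similarities(purchases):
    """Cosine similarity between items, based on who bought them."""
    buyers = defaultdict(set)
    for customer, items in purchases.items():
        for item in items:
            buyers[item].add(customer)
    sims = {}
    for a in buyers:
        for b in buyers:
            if a != b:
                shared = len(buyers[a] & buyers[b])
                sims[(a, b)] = shared / sqrt(len(buyers[a]) * len(buyers[b]))
    return sims


def recommend(customer, purchases, sims):
    """Return the single unowned item most similar to what the customer owns."""
    owned = purchases[customer]
    scores = defaultdict(float)
    for (a, b), similarity in sims.items():
        if a in owned and b not in owned:
            scores[b] += similarity
    if not scores:
        return None
    # sorted() makes tie-breaking deterministic (alphabetical).
    return max(sorted(scores), key=scores.get)


sims = item_similarities(purchases)
print(recommend("bob", purchases, sims))
```

Note that nothing here models why two books go together; the recommendation comes entirely from correlation in the purchase data.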
6
Manholes and Raw Data - Correlations
• 94,000 miles of underground cable in NYC, 51,000 manholes in
just Manhattan w/service boxes below
• 1 in 20 cables laid before 1930; some Edison-era
• Records kept since 1880s – 38 different terms
– All hand-written, paper, cards, ledgers, etc.
• 2008 - How to prevent fires, exploding manholes?
• Machine-correlate 106 predictors of imminent disaster
– The top 10% ranked by predicted risk accounted for 44% of total failures
• Chris Anderson – “data deluge makes scientific method obsolete”
– http://archive.wired.com/science/discoveries/magazine/16-07/pb_theory
• “Datafication” – everything is data
– Numbers to words to images to locations to relationships to feelings …
– Graph theory & graph analysis change the way we perceive the world
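The manhole result on this slide (the top 10% of predictions covering 44% of failures) is a "capture rate" measure: rank everything by predicted risk, then ask what share of the real failures falls in the top slice. A minimal sketch follows; the scores and outcomes are invented for illustration and the function name is hypothetical, so this shows only how such a figure is computed, not the model used in New York.

```python
def capture_rate(risk_scores, failed, top_fraction=0.10):
    """Share of all failures found in the top `top_fraction` of items by score."""
    ranked = sorted(zip(risk_scores, failed), key=lambda pair: pair[0], reverse=True)
    top_k = max(1, int(len(ranked) * top_fraction))
    total_failures = sum(outcome for _, outcome in ranked)
    if total_failures == 0:
        return 0.0
    return sum(outcome for _, outcome in ranked[:top_k]) / total_failures


# Made-up risk scores for 10 items and whether each one later failed (1) or not (0).
scores = [0.9, 0.8, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.05]
failures = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

rate = capture_rate(scores, failures)
print(rate)
```

Here the single highest-ranked item (the top 10% of 10 items) contains one of the two failures, so the capture rate is 0.5.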
7
© Copyright 2014 EMC Corporation. All rights reserved.
The Present - Architecture
(Architecture diagram; labels as recovered from the slide)
• Business Process, Info Processing, Data Acquisition, Data Creation
• End Users, Analysts / Scientists, Architects / Engineers, Producers
• Systems Integration
– Shared Nothing
– Scale-out Storage + SSD
– MPP + In-Memory Compute
– Hadoop
– Hi-Speed / -Resiliency Networking
– Converged Infrastructure
– Cloud
– Non-relational
– DWH
• Volume, Velocity, Variety
• Objectives
– Stream Processing
– Event Management
– Data Exploration
– Contextualized Data
– Modeling / Scenarios
– Forecasting
• Delivery Models
– On-Demand: Access-Anywhere Analytics Services; Context-Aware Business Applications
– Push: Location-Based Services; Alert and Respond
– Embedded: Workflow and Interaction Automation; Smart devices and systems
• Email and Messaging; Mobile Apps Data; Transaction and Usage Logs; Machine and Sensors; Geolocation; Relationships and Social Influence
• Value: Real-time Events; Deep Insights
8
The Present – Business Value of Data
• Data is valuable – re-use of data even more so
– Not ephemeral value – can be re-consumed ad infinitum
– Economists call this a “non-rivalrous” good
• Cost/benefit of storage ~ 0 – so keep everything
– Ewan Birney, European Bioinformatics Institute, “Hidden Treasures
In Junk DNA” http://www.scientificamerican.com/article.cfm?id=hidden-treasures-in-junk-dna
– Over the last 50 years, cost per byte has roughly halved every 2 years
– Density has increased ~50 million times since 1956
• Consider electric cars:
– Battery level indicates when to “fill up” from the power grid
– Power utility monitors grid usage over time
– Correlate both data sets together
• Determine when/where to build recharge stations on which roads
• Recombinant data
– “Old” data combined into new forms for new insights
– “Noisy” datasets enable feedback loops – e.g. better/faster search/index
9
The Future 1 – Wild, Wild West?
• Can we treat data as a corporate asset?
– A ledger entry, like “brand value” (intangible)
– Or is data a tangible asset to be kept on the books?
– Does data have “cash value”? Asset amortization?
– Can a business be legally “liable” for its data collection?
• Facebook book-valued at $6.3B. IPO value: $104B
– Why the difference? Facebook is essentially data
– Or, every FB user is worth ~ $100 (~1B subscribers)
• We will see much more “data value chain” ahead
– Ingest, analyze, sell results, analyze, sell results …downstreaming
– Licensing of data in its infancy – much more to come
– Think about the data just from your car – 40 µPs (microprocessors)
10
The Future 2 – Data as Policy -
Can Data save Us from Us?
• “In God We Trust – all others bring data”
– Commonly attributed to W. Edwards Deming
• New jobs/titles coming out of the woodwork
– CAO (Chief Analytics Officer), CDO (Data)
– Data Scientist, Data Correlationist, Data Ethicist
• Knowing “what” not “why” is good enough. Is it?
• Remember Bayes’ “inductive probability” (250 yrs!)
– We update our beliefs about something as new data arrives
– Bayes T. (1763) "An Essay towards solving a Problem in the Doctrine of Chances". Phil. Trans., 53, 370–418.
• Data Policy in the immortal words of Yogi Berra:
– “We make too many wrong mistakes”
– “You can observe a lot just by watching.”
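The Bayes bullet above ("we update our beliefs about something as new data arrives") can be made concrete with a few lines of arithmetic. The prior and likelihood values below are arbitrary assumptions chosen for illustration, and the function name is hypothetical.

```python
def bayes_update(prior: float, p_data_if_true: float, p_data_if_false: float) -> float:
    """Posterior probability of a hypothesis after seeing one piece of data."""
    evidence = prior * p_data_if_true + (1 - prior) * p_data_if_false
    return prior * p_data_if_true / evidence


# Start out doubtful (1%), then see the same kind of supporting data three times.
# Each observation is assumed 90% likely if the hypothesis is true, 10% if false.
belief = 0.01
history = [belief]
for _ in range(3):
    belief = bayes_update(belief, 0.9, 0.1)
    history.append(belief)

print([round(b, 4) for b in history])
```

Each posterior becomes the prior for the next observation, so a belief that starts at 1% climbs past 88% after three consistent pieces of evidence.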
11
The Future 3 – N=all?
Keep Everything? Seriously?
• Data Silos or the Data Lake?
– HDFS presents a crisis: i.e. 危機, weiji
• dangerous ‘critical point’ (not crisis; mis-translation)
– Write-once, read-many, modify-never; delete-never?
– Time is not your friend when moving data
• (So, don’t move it between repositories; move it to the CPU)
• One 40GE NIC yields same rate on bus as 28 disks @ 140MB/s
• One million seconds is ~277.8 hours (~11.6 days)
• 1 PB @ 1 GB/sec is … 1 EB @ 1 TB/sec is …
• Non-shared (1 protocol) or shared (N protocols)?
• Time versus Space – the Essential Judgment
• Cost of Having Data vs. Cost of Not Having Data
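The "time is not your friend when moving data" bullets above come down to simple division. The sketch below assumes decimal units (1 GB = 10^9 bytes, and so on) and ignores protocol overhead, so the NIC figure is a raw line rate; all names are hypothetical.

```python
SECONDS_PER_DAY = 86_400
MB, GB, TB, PB, EB = 10**6, 10**9, 10**12, 10**15, 10**18


def transfer_days(size_bytes: int, rate_bytes_per_sec: float) -> float:
    """Days needed to move `size_bytes` at a sustained `rate_bytes_per_sec`."""
    return size_bytes / rate_bytes_per_sec / SECONDS_PER_DAY


pb_days = transfer_days(PB, GB)  # 1 PB at 1 GB/sec
eb_days = transfer_days(EB, TB)  # 1 EB at 1 TB/sec

# Aggregate rate of 28 disks at 140 MB/s each, next to a 40 Gb/s NIC line rate.
disk_rate = 28 * 140 * MB
nic_line_rate = 40 * 10**9 / 8  # bits per second -> bytes per second

print(f"1 PB @ 1 GB/s: {pb_days:.1f} days")
print(f"1 EB @ 1 TB/s: {eb_days:.1f} days")
print(f"28 disks: {disk_rate / GB:.2f} GB/s, 40GE line rate: {nic_line_rate / GB:.2f} GB/s")
```

Both transfers take one million seconds: raising the link speed a thousandfold buys nothing if the data set grows a thousandfold too.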
12
THANK YOU

Big Data Past, Present and Future – Where are we Headed? - StampedeCon 2014
