Big Data Past, Present and Future – Where are we Headed? - StampedeCon 2014


At StampedeCon 2014, Rob Peglar (EMC Isilon) presented "Big Data Past, Present and Future – Where are we Headed?"

Rob Peglar was one of the speakers at the very first StampedeCon. Following up on that talk from two years ago, Rob presents an overview of, and insight into, the technologies and system approaches to computing, transport and storage of big data: where we’ve been, where we are now and where we are headed. There is a major ‘fork in the road’ upcoming in the treatment and business application of big data and the technology that surrounds it, one important enough to change the course of the methodologies and approaches used by large and small businesses alike, especially for the infrastructure required either on premises or in the cloud.

Published in: Technology, Business

  1. Big Data Past, Present & Future: Where are We Headed? Rob Peglar, CTO Americas, Isilon Storage Division, EMC Corporation, @peglarr
  2. “Prediction is Very Difficult - Especially About the Future” - Niels Bohr
     • In order to understand what’s coming, we must understand our past
     • We must also understand that Big Data is fundamentally different from what we’re used to
     • Consider the difference between a still photograph and a movie – and our human perception of them
       – More than a collection of still photographs – why?
  3. The Past – and I Mean the Past
     • Consider the census…
     • From the Latin “censere” – meaning “to estimate”
     • “In those days a decree went out from Emperor Augustus that all the world should be registered.” Luke 2:1
     • The Domesday Book of 1086 – England
       – Comprehensive tally of people, their land, and property
     • The US Constitution mandates a decennial census
       – The 1880 census took eight years (!) to complete
     • This led to Hollerith’s punched card tabulator in 1890
       – The beginning of automated data processing
       – Reduced the census time to one year
  4. Sampling – Good or Bad?
     • Sampling precision improves most with randomness
       – Not with sample size
       – Jerzy Neyman (Poland, 1934) proved this
     • Neyman, J. (1934) "On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection", Journal of the Royal Statistical Society, 97 (4), 557–625
     • Good - Sampling was a solution to information overload
     • Bad - Systematic bias in sampling gives wrong conclusions
     • A seismic shift is occurring – from
       – Sampling, keeping datasets small on purpose, using them once … to
       – N=all, keeping datasets large on purpose, using them many times
     • Why? The outliers are the most interesting!
       – Examples – credit card fraud, language translation, insurability
       – Don’t just follow the rules, look for the exceptions
     Williams Tube, 1946: 1024 bits
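
The contrast on slide 4 between randomness and sheer sample size can be illustrated with a small simulation. This is only a sketch under assumed conditions: the population (100,000 Gaussian values), the seed, and the deliberately extreme biased sample (only the largest half of the population "responds") are hypothetical choices for illustration, not data from the talk.

```python
import random
from statistics import mean


def sampling_errors(seed=42, population_size=100_000, random_n=500):
    """Compare a small random sample with a large but biased one.

    Every number here is a hypothetical illustration value.
    """
    rng = random.Random(seed)
    population = [rng.gauss(50.0, 10.0) for _ in range(population_size)]
    true_mean = mean(population)

    # Small simple random sample: every member is equally likely to be drawn.
    random_sample = rng.sample(population, random_n)

    # Large biased sample: only the top half of the population is observed.
    biased_sample = sorted(population)[population_size // 2:]

    random_error = abs(mean(random_sample) - true_mean)
    biased_error = abs(mean(biased_sample) - true_mean)
    return random_error, biased_error, len(random_sample), len(biased_sample)


if __name__ == "__main__":
    r_err, b_err, r_n, b_n = sampling_errors()
    print(f"random sample (n={r_n}): error in mean = {r_err:.2f}")
    print(f"biased sample (n={b_n}): error in mean = {b_err:.2f}")
```

The biased sample is a hundred times larger than the random one, yet its estimate of the mean is further off, because no amount of extra data corrects a systematic selection effect.
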
  5. The Journey from Clean to Messy
     • 1998 – Linden et al., collaborative filtering patent, working at a Seattle startup selling books online
       – G. Linden, J. Jacobi and E. Benson, Collaborative Recommendations Using Item-to-Item Similarity Mappings, US Patent 6,266,649 (to Amazon.com), Patent and Trademark Office, Washington, D.C., 2001
     • “If it works perfectly, Amazon should show you just one book – the next one you will buy.” (Linden)
     • Hypothesis-driven approach becomes data-driven
       – “Proving” something (causation) → correlation
     • McGregor et al. – using big data to improve the NICU
       – 16 data streams, 1,260 data points/sec
       – Valid improvement in outcomes for premature infants
       – No “proof” – it helps doctors make better diagnostic decisions
       – Carolyn McGregor, "Big Data in Neonatal Intensive Care," Computer, vol. 46, no. 6, pp. 54-59, June 2013, doi:10.1109/MC.2013.157
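
The idea behind the item-to-item recommendations mentioned on slide 5 can be sketched in a few lines. This is a minimal illustration, not the method claimed in the patent or used by Amazon: it assumes cosine similarity over sets of buyers, and the basket data, function names, and scoring rule are all hypothetical.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical purchase history: user -> set of items bought.
BASKETS = {
    "ann": {"A", "B"},
    "bob": {"A", "B", "C"},
    "cat": {"B", "C"},
    "dan": {"A"},
}


def item_buyers(baskets):
    """Invert the baskets: item -> set of users who bought it."""
    buyers = defaultdict(set)
    for user, items in baskets.items():
        for item in items:
            buyers[item].add(user)
    return buyers


def cosine(buyers_a, buyers_b):
    """Cosine similarity between two items, each given as its set of buyers."""
    if not buyers_a or not buyers_b:
        return 0.0
    return len(buyers_a & buyers_b) / sqrt(len(buyers_a) * len(buyers_b))


def recommend(baskets, user, top_n=1):
    """Rank items the user does not own by similarity to items they do own."""
    buyers = item_buyers(baskets)
    owned = baskets[user]
    scores = {
        candidate: sum(cosine(buyers[candidate], buyers[item]) for item in owned)
        for candidate in buyers
        if candidate not in owned
    }
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


if __name__ == "__main__":
    print(recommend(BASKETS, "dan"))
```

Note that the similarities are computed between items rather than between users, so they can be precomputed once and reused for every shopper.
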
  6. Manholes and Raw Data - Correlations
     • 94,000 miles of underground cable in NYC; 51,000 manholes in Manhattan alone, with service boxes below
     • 1 in 20 cables laid before 1930; some Edison-era
     • Records kept since the 1880s – 38 different terms
       – All hand-written: paper, cards, ledgers, etc.
     • 2008 - How to prevent fires, exploding manholes?
     • Machine-correlate 106 predictors of imminent disaster
       – The top 10% of manholes by predicted risk accounted for 44% of total failures
     • Chris Anderson – “data deluge makes scientific method obsolete”
     • “Datafication” – everything is data
       – Numbers to words to images to locations to relationships to feelings …
       – Graph theory & graph analysis change the way we perceive the world
  7. The Present - Architecture
     [Architecture diagram; only its labels survive in this transcript.]
     • Stages: Business Process, Info Processing, Data Acquisition, Data Creation
     • Roles: End Users, Analysts / Scientists, Architects / Engineers, Producers
     • Systems integration: Shared Nothing, Scale-out Storage + SSD, MPP + In-Memory Compute, Hadoop, Hi-Speed / -Resiliency Networking, Converged Infrastructure, Cloud, Non-relational DWH
     • Dimensions: Volume, Velocity, Variety
     • Objectives: Stream Processing, Event Management, Data Exploration, Contextualized Data, Modeling / Scenarios, Forecasting
     • Delivery models: On-Demand (Access-Anywhere Analytics Services, Context-Aware Business Applications), Push (Location-Based Services, Alert and Respond), Embedded (Workflow and Interaction Automation, Smart devices and systems)
     • Data sources: Email and Messaging, Mobile Apps, Data Transaction and Usage Logs, Machine and Sensors, Geolocation, Relationships and Social Influence
     • Value: Real-time Events, Deep Insights
     © Copyright 2014 EMC Corporation. All rights reserved.
  8. The Present – Business Value of Data
     • Data is valuable – re-use of data even more so
       – Not ephemeral value – can be re-consumed ad infinitum
       – Economists call this a “non-rivalrous” good
     • Cost/benefit of storage ~ 0 – so keep everything
       – Ewan Birney, European Bioinformatics Institute, “Hidden Treasures In Junk DNA”
       – Over the last 50 years, cost/byte has roughly halved every 2 years
       – Density has increased ~50 million times since 1956
     • Consider electric cars:
       – Battery level indicates when to “fill up” from the power grid
       – Power utility monitors grid usage over time
       – Correlate both data sets together
         • Determine when/where to build recharge stations on which roads
     • Recombinant data
       – “Old” data combined into new forms for new insights
       – “Noisy” datasets enable feedback loops – e.g. better/faster search/index
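
The two storage figures on slide 8 are both statements of exponential growth and can be compared with a little arithmetic. This is a rough sketch: the function names are made up for illustration, and using 2014 (the year of the talk) as the end point for "since 1956" is an assumption.

```python
from math import log2


def improvement_factor(years, halving_period_years):
    """Total improvement if a quantity halves (or doubles) once per period."""
    return 2 ** (years / halving_period_years)


def implied_doubling_period(years, total_factor):
    """Doubling period that would produce total_factor over the given years."""
    return years / log2(total_factor)


# Cost per byte halving every 2 years for 50 years: 25 halvings.
cost_factor = improvement_factor(50, 2)

# Density up ~50 million times since 1956; 2014 as end year is an assumption.
density_doubling_years = implied_doubling_period(2014 - 1956, 50e6)

if __name__ == "__main__":
    print(f"cost/byte improvement over 50 years: ~{cost_factor:,.0f}x")
    print(f"implied density doubling period: ~{density_doubling_years:.2f} years")
```

Twenty-five halvings give a factor of about 33.5 million, the same order of magnitude as the ~50 million density figure, so the two bullets are roughly consistent with each other.
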
  9. The Future 1 – Wild, Wild West?
     • Can we treat data as a corporate asset?
       – A ledger entry, like “brand value” (intangible)
       – Or is data a tangible asset to be kept on the books?
       – Does data have “cash value”? Asset amortization?
       – Can a business be legally “liable” for its data collection?
     • Facebook book-valued at $6.3B. IPO value: $104B
       – Why the difference? Facebook is essentially data
       – Or, every FB user is worth ~ $100 (~1B subscribers)
     • We will see much more “data value chain” ahead
       – Ingest, analyze, sell results, analyze, sell results … downstreaming
       – Licensing of data is in its infancy – much more to come
       – Think about the data just from your car – 40 microprocessors (µPs)
  10. The Future 2 – Data as Policy - Can Data Save Us from Us?
      • “In God We Trust – all others bring data”
        – Commonly attributed to W. Edwards Deming
      • New jobs/titles coming out of the woodwork
        – CAO (Chief Analytics Officer), CDO (Chief Data Officer)
        – Data Scientist, Data Correlationist, Data Ethicist
      • Knowing “what” not “why” is good enough. Is it?
      • Remember Bayes’ “inductive probability” (250 yrs!)
        – We update our beliefs about something as new data arrives
        – Bayes T. (1763) "An Essay towards solving a Problem in the Doctrine of Chances". Phil. Trans., 53, 370–418.
      • Data Policy in the immortal words of Yogi Berra:
        – “We make too many wrong mistakes”
        – “You can observe a lot just by watching.”
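
The Bayesian idea on slide 10, updating a belief as each new piece of data arrives, can be written out directly from Bayes' rule. This is a minimal sketch: the function names and the example probabilities (a 1% prior and alerts that fire 90% of the time when the hypothesis is true and 5% of the time when it is false) are hypothetical.

```python
def bayes_update(prior, p_data_if_true, p_data_if_false):
    """Posterior P(H | data) from Bayes' rule for a yes/no hypothesis H.

    Assumes the data is possible under at least one outcome, so the
    denominator is non-zero.
    """
    numerator = p_data_if_true * prior
    evidence = numerator + p_data_if_false * (1.0 - prior)
    return numerator / evidence


def update_sequence(prior, observations):
    """Apply bayes_update once per observation; return every belief along the way."""
    belief = prior
    history = [belief]
    for p_true, p_false in observations:
        belief = bayes_update(belief, p_true, p_false)
        history.append(belief)
    return history


if __name__ == "__main__":
    # Three independent alerts in a row, each with the same hypothetical likelihoods.
    for step, belief in enumerate(update_sequence(0.01, [(0.9, 0.05)] * 3)):
        print(f"after {step} alerts: P(H) = {belief:.3f}")
```

Each posterior becomes the prior for the next observation, which is why a single alert leaves the belief low while several consistent alerts push it close to certainty.
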
  11. The Future 3 – N=all? Keep Everything? Seriously?
      • Data Silos or the Data Lake?
        – HDFS presents a crisis: i.e. 危機, weiji
          • dangerous ‘critical point’ (not crisis; mis-translation)
        – Write-once, read-many, modify-never; delete-never?
        – Time is not your friend when moving data
          • (So, don’t move it between repositories; move it to the CPU)
          • One 40GE NIC yields same rate on bus as 28 disks @ 140MB/s
          • One million seconds is about 277.8 hours (~11.6 days)
          • 1 PB @ 1 GB/sec is … 1 EB @ 1 TB/sec is …
      • Non-shared (1 protocol) or shared (N protocols)?
      • Time versus Space – the Essential Judgment
      • Cost of Having Data vs. Cost of Not Having Data
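
The data-movement arithmetic on slide 11 is easy to check. The sketch below assumes decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes), a perfectly sustained transfer rate, and no protocol overhead; the constant and function names are illustrative only.

```python
# Decimal (SI) units, in bytes. Binary units would change the results slightly.
MB = 10**6
GB = 10**9
TB = 10**12
PB = 10**15
EB = 10**18


def transfer_seconds(size_bytes, rate_bytes_per_sec):
    """Time to move size_bytes at a perfectly sustained rate."""
    return size_bytes / rate_bytes_per_sec


def as_hours_and_days(seconds):
    """Convert a duration in seconds to (hours, days)."""
    return seconds / 3600, seconds / 86400


pb_seconds = transfer_seconds(PB, GB)   # 1 PB at 1 GB/sec
eb_seconds = transfer_seconds(EB, TB)   # 1 EB at 1 TB/sec
hours, days = as_hours_and_days(pb_seconds)

# Aggregate streaming rate of 28 disks versus the raw line rate of one 40GE NIC.
disk_rate = 28 * 140 * MB
nic_line_rate = 40 * 10**9 / 8          # 40 Gbit/s expressed in bytes/sec

if __name__ == "__main__":
    print(f"1 PB @ 1 GB/sec: {pb_seconds:,.0f} s = {hours:.1f} h = {days:.1f} days")
    print(f"1 EB @ 1 TB/sec: {eb_seconds:,.0f} s")
    print(f"28 disks: {disk_rate / GB:.2f} GB/s, 40GE line rate: {nic_line_rate / GB:.2f} GB/s")
```

Under these assumptions both transfers take one million seconds, so scaling the pipe by the same factor as the data buys no time at all. The 28 disks deliver about 3.9 GB/s against a raw 40GE line rate of 5 GB/s; the slide's equivalence presumably allows for overhead on the NIC side, though that reading is an assumption.
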
  12. THANK YOU