How will we find the information we need when we need it?
How will we know what information we need to keep, and how will we keep it?
How will we follow the growing number of government and industry rules about retaining records, tracking transactions, and ensuring information privacy?
How will we protect the information we need to protect?
New search and discovery tools
Ways to add structure to unstructured data
New storage and information management techniques
More compliance tools
Source: McKinsey Global Institute, Big data: The next frontier for innovation, competition, and productivity, May 2011
The Evolution of Science
Timeline, from the Era of Natural Philosophy through the Era of Modern Science and the Industrial Revolution:
- Astronomy (Babylon, 1900 BC)
- Platonic Academy (387 BC)
- Mathematics (India, 499 AD)
- Scientific Revolution (1543 AD)
- Newton’s Laws (1687 AD)
- Relativity (1905 AD)
- Quantum Physics (1925 AD)
- Computing (1946 AD)
- DNA (1953 AD)
- Learning Systems (XXI Century)
Today’s Systems – The Calculating Paradigm
- Algorithms and applications: static programming
- Archives: structured data and text
- People hypothesize, determine “what it means”, and run other applications…
Future Systems – The Learning Paradigm
- Training and learning engines to build models and define insight
- Hypothesis engines to understand and plan actions
- Policy engine: business, legal, and ethical rules
- Verification engines (e.g. simulations)
- Active learning (natural interfaces)
- Outcome engine: actuation and validation
- Inputs: society, nature, institutions, archives
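As a rough illustration of how these engines could compose into a closed loop, here is a minimal sketch in Python. Every engine name, function body, and threshold below is an illustrative assumption drawn only from the labels on this slide, not an actual system interface.

```python
"""Hypothetical sketch of the learning-paradigm loop above.
All engine implementations are toy stand-ins, not a real API."""

import random

def train(observations):
    # Training/Learning Engine: the "model" here is just the running mean.
    return sum(observations) / len(observations)

def generate_hypothesis(model):
    # Hypothesis Engine: propose an action near what the model predicts.
    return model + random.uniform(-1.0, 1.0)

def complies(hypothesis, policy):
    # Policy Engine: enforce business/legal/ethical bounds.
    return policy["min"] <= hypothesis <= policy["max"]

def verify(hypothesis):
    # Verification Engine: stand-in for a simulation-based check.
    return abs(hypothesis) < 100.0

def act(hypothesis):
    # Outcome Engine: actuation returns an observed outcome to validate.
    return hypothesis + random.gauss(0.0, 0.1)

observations = [1.0, 2.0, 3.0]
policy = {"min": 0.0, "max": 10.0}
for _ in range(5):
    model = train(observations)
    h = generate_hypothesis(model)
    if complies(h, policy) and verify(h):
        observations.append(act(h))   # outcomes feed back into learning
print("final model:", train(observations))
```

The point of the sketch is the feedback edge: outcomes re-enter training, which is what separates the learning paradigm from the static calculating paradigm above.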
New “Big Data” Brings New Opportunities, Requires New Analytics
[Chart: data scale (kilo- to exabytes) vs. decision frequency (occasional to real-time, years down to milliseconds), spanning data at rest and data in motion. Relative to a traditional data warehouse and business intelligence stack, the new workloads are up to 10,000 times larger and up to 10,000 times faster.]
- Telco Promotions: 100,000 records/sec, 6B/day; 10 ms/decision; 270 TB for deep analytics
- Smart Traffic: 250K GPS probes/sec, 630K segments/sec; 2 ms/decision, 4K vehicles
- Homeland Security: 600,000 records/sec, 50B/day; 1–2 ms/decision; 320 TB for deep analytics
- DeepQA: 100s of GB for deep analytics; 3 sec/decision
Enabling Multiple Solutions & Appliances to Achieve a Smarter Planet
- Peta 2 Analytics Appliance: reactive + deep analytics platform
- Big Analytics Ecosystem: Peta 2 data-centric system + algorithms + Big Data skills
Example solutions:
- DeepFriends – social network monitoring
- DeepResponse – emergency coordination
- DeepEyes – webcam fusion
- DeepCurrent – power delivery
- DeepSafety – police/security
- DeepTraffic – area traffic prediction
- DeepWater – water management
- DeepBasket – food market prediction
- DeepBreath – air quality control
- DeepPulse – political polling
- DeepThunder – local weather prediction
- DeepSoil – farm prediction
Evidence-Based Decision Support System: Watson
- Statistical ensemble of 600 to 800 scoring engines (S1, S2, S3 … SN)
- ~30 machine learning models weigh the scores and produce a confidence (0 < P < 1) for each candidate question
- The hypothetical question with the greatest confidence is chosen
- Example: from the answer “A large country in the Western Hemisphere whose capital has a similar name,” hypothesis generation guesses questions Q1, Q2 … Qi; the winning question is “What is Brazil?”
- Watson today: processes unstructured text, ~200 hypotheses per 3 seconds
- Hardware: 3,000 cores, 100 TFlops, 2 TB memory, ~200 KW; static data corpus

Refresh time by element:
- Data corpus: 2 weeks
- Hypothesis engines: weeks to months
- Scoring engines: weeks to months
- Decision support engine: 4 days
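To make the ensemble idea concrete, here is a minimal Python sketch of scoring-engine combination: several scorers each rate a candidate hypothesis, a learned weighting collapses the scores into a single confidence in (0, 1), and the highest-confidence candidate wins. The scorers, weights, and logistic combination below are illustrative assumptions, not Watson’s actual models.

```python
"""Minimal sketch of an ensemble of scoring engines with a learned
weighting that yields a confidence 0 < P < 1 per candidate hypothesis.
Scorers and weights are toy assumptions, not Watson's real models."""

import math

# Each "scoring engine" maps a candidate hypothesis to a feature score.
scoring_engines = [
    lambda h: 1.0 if "country" in h["evidence"] else 0.0,     # type match
    lambda h: h["retrieval_score"],                           # search rank
    lambda h: 1.0 if h["capital_similarity"] > 0.8 else 0.0,  # name overlap
]

weights = [1.5, 2.0, 2.5]   # stand-in for learned model weights
bias = -2.0

def confidence(hypothesis):
    # Weighted sum of engine scores, squashed to 0 < P < 1 (logistic).
    z = bias + sum(w * s(hypothesis) for w, s in zip(weights, scoring_engines))
    return 1.0 / (1.0 + math.exp(-z))

candidates = [
    {"text": "What is Brazil?", "evidence": "country",
     "retrieval_score": 0.9, "capital_similarity": 0.9},
    {"text": "What is Argentina?", "evidence": "country",
     "retrieval_score": 0.7, "capital_similarity": 0.1},
]

best = max(candidates, key=confidence)
print(best["text"], "confidence=%.2f" % confidence(best))
```

In the real system each of the 600 to 800 engines contributes evidence features, and the ~30 learned models do the weighting; the sketch only preserves that shape.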
Exascale Research and Development
Source: Exascale Research and Development – Request for Information, July 2011
Big Data Systems Require a Data-centric Architecture for Performance
Old compute-centric model:
- Data lives on disk and tape; data is moved to the CPU as needed
- Deep storage hierarchy
New data-centric model:
- Data lives in persistent memory (Flash, Phase Change); many CPUs (manycore, FPGA) surround and use it
- Shallow/flat storage hierarchy; massive parallelism
This is the largest change in system architecture since the System 360, with huge impact on hardware, systems software, and application design.
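A back-of-the-envelope calculation shows why moving data to the CPU stops scaling; the bandwidth and node-count figures below are illustrative assumptions, not measurements from any specific system.

```python
# Rough arithmetic: the cost of moving data to compute vs. leaving it
# in place. All bandwidth/capacity values are illustrative assumptions.

data_bytes = 1e15          # 1 PB working set
link_bw    = 100e9         # 100 GB/s aggregate link to remote storage

transfer_seconds = data_bytes / link_bw
print("moving 1 PB at 100 GB/s: %.0f s (~%.1f hours)"
      % (transfer_seconds, transfer_seconds / 3600))

# In a data-centric design the same petabyte sits in persistent memory,
# partitioned across many nodes, so each CPU touches only its local shard.
nodes = 1000
local_bw = 10e9            # 10 GB/s local bandwidth per node
local_seconds = (data_bytes / nodes) / local_bw
print("scanning local shards in parallel: %.0f s" % local_seconds)
```

Under these assumptions the compute-centric transfer alone takes hours, while parallel in-place access finishes in minutes, which is the performance argument the slide is making.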
Scale-in is the New Systems Battlefield
[Chart: system capacity (capability) vs. system density (1/end-to-end latency), plotted from single devices up to device clusters and out toward physical limits; Peta 2 and Exascale sit in the high-density, high-capacity corner.]
- Scale-down – maximize feature density: atom transistor, atom storage
- Scale-up – maximize device capacity: terabyte HDD, POWER7
- Scale-out – maximize system capacity: scale-out NAS, blade servers, DAS, cloud computing
- Scale-in – maximize system density and minimize end-to-end latency: FLASH SSD, 3D chips, FPGA, manycore, BPRAM/SCM, interconnect, in-memory databases
Storage Class Memory – The Tipping Point for Data-centric Systems
HDD keeps its cost advantage (about 1/10 the cost of SCM), but SCM dominates in performance, with roughly 10,000x lower latency than HDD. Projected for 2015: SCM at $0.05 per GB, i.e. $50K per PB (the chart also marks $0.10/GB and $0.01/GB reference points).

Relative cost and relative latency from the chart:
- DRAM: cost 100, latency 1
- SCM: cost 1, latency 10
- FLASH: cost 15, latency 1000
- HDD: cost 0.1, latency 100000

Source: Chung Lam, IBM
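The slide’s figures can be sanity-checked with simple arithmetic; the short script below only reproduces the numbers already given above.

```python
# Sanity-check the slide's SCM figures.

cost_per_gb = 0.05                 # projected 2015 SCM price, $/GB
gb_per_pb = 1e6                    # 10^6 GB per PB (decimal units)
print("SCM per PB: $%.0fK" % (cost_per_gb * gb_per_pb / 1e3))   # -> $50K

# Relative latencies from the table (DRAM = 1).
latency = {"DRAM": 1, "SCM": 10, "FLASH": 1000, "HDD": 100000}
print("SCM vs HDD speedup: %dx" % (latency["HDD"] / latency["SCM"]))  # -> 10000x

# Relative costs from the table: HDD at 0.1 is 1/10 the cost of SCM at 1.
cost = {"DRAM": 100, "SCM": 1, "FLASH": 15, "HDD": 0.1}
print("HDD cost relative to SCM: %.1f" % (cost["HDD"] / cost["SCM"]))
```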
Background: 3 Styles of Massively Parallel Systems
- Reactive analytics, extreme ingestion – Streaming (Streams): data in motion; high velocity, mixed variety, high volume over time; SPL, C, Java
- Deep analytics, extreme scale-out – Hadoop/MapReduce (BigInsights): data at rest, pre-partitioned; high volume, mixed variety, low velocity; JAQL, Java; mappers read input data on disk and reducers write output data, with each box in the diagram a compute node
- Generative modeling, extreme physics – Simulation (BlueGene): long running, small input, massive output; C/C++, Fortran, MPI, OpenMP
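To ground the MapReduce style, here is a minimal single-process word-count sketch in Python; a real Hadoop/BigInsights job would express the same mapper/reducer roles in Java or JAQL, distributed across compute nodes with the data pre-partitioned on disk.

```python
"""Minimal single-process sketch of the MapReduce pattern (word count).
The framework normally distributes the map, shuffle, and reduce phases
across many compute nodes; here they run in one process for clarity."""

from collections import defaultdict

def mapper(record):
    # Map: emit (key, value) pairs from one input record.
    for word in record.split():
        yield word.lower(), 1

def reducer(key, values):
    # Reduce: combine all values that share a key.
    return key, sum(values)

def map_reduce(records):
    # Shuffle: group intermediate pairs by key (done by the framework).
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())

print(map_reduce(["data at rest", "data in motion", "data at scale"]))
# -> {'data': 3, 'at': 2, 'rest': 1, 'in': 1, 'motion': 1, 'scale': 1}
```

The same mapper/reducer pair scales out because the shuffle step is the only global operation; everything else runs independently per partition.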