Petascale Analytics - The World of Big Data Requires Big Analytics

Speaker Notes
  • Picture 1: Let's start with the big numbers. It would not be an exaggeration to say that we have clearly entered the "Zettabyte Era". A zettabyte is a trillion gigabytes, or a billion terabytes, whichever you prefer. This year (2011) we are forecast to generate and consume 1.8 zettabytes of information as a society, up from an estimated 1.3 zettabytes in 2010, with 35 zettabytes forecast by the end of the decade. The most striking statement comes from the subhead of the press release: the rate of information growth appears to be exceeding Moore's Law, a powerful argument for scale-out architectures if there ever was one. Picture 2: Visualize this: take the total storage capacity you have today and multiply it by 50. Contemplate that for a moment, and remember that this is only an average; information-intensive businesses will likely see far more. Picture 3: Or consider that all this information will be stored in 75x more "containers" (files, objects, etc.) than we deal with today. Picture 4: Or that, by the end of the decade, we will have 10x as many servers, both physical and virtual. It makes sense: more information, and ever more uses for it, means vastly more servers of all types putting that information to work.
  • "The evolution of science" 1. Babylonian astronomy was probably the first real example of science in practice. The next step along this path was the invention of mathematics. 2. The platonic academy was the first codification of scientific principles. 3. Not much happened during the middle ages but there was a rebirth after the renaissance which triggered the scientific revolution. Copernicus introduced a heliocentric world view and human anatomy was born. 4. Newton's laws marked the pinnacle of the era of natural philosophy. 5. The industrial revolution marked the birth of modern science (where science became useful to humanity broadly) and marked transition between the era of natural philosophy and the era of modern science. 6. Modern science progressed at an exponential rate with major advances coming quickly on the back of each other. 7. This entire evolution of science can be viewed as an example of exponential growth.
  • Three changes to emphasize: the dynamic rather than static nature of models; active learning (hard); dynamic engines (training, policy, hypothesis, outcome, verification); and natural interfaces. How is our Learning System different from past machine learning approaches? It will automatically identify key features. Key feature selection is the technique of selecting a subset of relevant features for learning models; for example, key features to diagnose an illness may be a person's temperature, white blood cell count, pH level, and so on. The current state of the art either has (a) humans identifying the key features for different domains, or (b) machine learning programs extracting key features from expert rules (provided by humans) or statistical methods, which may lead to false conclusions in domains that involve semantic ambiguity. The Learning System we are building will use crowdsourcing techniques to automatically identify key features for a domain and will proactively ask humans for disambiguation, instead of waiting for humans to notice that the model is erroneous (for example, models that rated questionable mortgages as AAA, or a program that deduces that Internet cookies are edible). Another key difference is active continuous verification: the trends that provide increasing amounts of digital data (e.g. IBM Smarter Planet sensors) will enable the Learning System to modify itself and prune key features that are no longer relevant. In summary, (1) automatic extraction of key features, (2) continuous active self-verification, and (3) the ability to select the appropriate machine learning technique (statistical, genetic programming, neural networks, etc.) and adapt it to changing conditions have not been integrated into prior machine learning approaches. A hypothesis is necessarily about a problem that is not formalized; if the problem were formalized, no hypothesis would be required, only a formal solution. Without a formal problem, the task of formulating hypotheses becomes one of creating alternative problem representations and selecting among them, in part based on possible solutions to each. Known systems that attempt to do this require a defined problem space, where the range of possible hypotheses is calculated from a range of possible system states. "Real world" problems do not emerge from a range of possible states, however, but instead occur when previously defined ranges (or dimensions) are violated. The only known systems capable of formulating hypotheses about arbitrary states and selecting among them are biological cognitive systems. An explanation of this is necessary before a system that "creates hypotheses" can be introduced, even as a hypothetical.
  • The input to our map phase is the raw NCDC data; the same sample flows through the whole job:
    (1) Raw input records:
    0067011990999991950051507004 ... 9999999N9+00001+99999999999 ...
    0043011990999991950051512004 ... 9999999N9+00221+99999999999 ...
    0043011990999991950051518004 ... 9999999N9-00111+99999999999 ...
    0043012650999991949032412004 ... 0500001N9+01111+99999999999 ...
    0043012650999991949032418004 ... 0500001N9+00781+99999999999 ...
    (2) Input key-value pairs, keyed by byte offset (year and temperature fields highlighted):
    (0, 006701199099999 1950 051507004...9999999N9+ 0000 1+99999999999...)
    (106, 004301199099999 1950 051512004...9999999N9+ 0022 1+99999999999...)
    (212, 004301199099999 1950 051518004...9999999N9- 0011 1+99999999999...)
    (318, 004301265099999 1949 032412004...0500001N9+ 0111 1+99999999999...)
    (424, 004301265099999 1949 032418004...0500001N9+ 0078 1+99999999999...)
    (3) Map output (year, temperature):
    (1950, 0) (1950, 22) (1950, −11) (1949, 111) (1949, 78)
    (4) Grouped and sorted by key for the reducer:
    (1949, [111, 78]) (1950, [0, 22, −11])
    (5) Reduce output, the maximum temperature per year:
    (1949, 111) (1950, 22)

Transcript

  • 1. Petascale Analytics: The World of Big Data Requires Big Analytics. October 2011. H. J. Schick, IBM Germany Research & Development GmbH
  • 2. Source: The Evolution of Live in 60 Seconds
  • 3.  
  • 4. Source: Realtime Apache Hadoop at Facebook
  • 5.
  • 6.
  • 7. Quiz: What comes after zettabyte? 1 yottabyte = 1,000,000,000,000,000,000,000,000 bytes
  • 8. Experiment:
  • 9. Source: IDC Digital Universe Study, sponsored by EMC, May 2011
  • 10. Google's Server Design. Source: cnet News, Google Uncloaks Once-Secret Server, April 2009
  • 11. The Digital Universe is a Perpetual Tsunami
    • How will we find the information we need when we need it?
    • How will we know what information we need to keep, and how will we keep it?
    • How will we follow the growing number of government and industry rules about retaining records, tracking transactions, and ensuring information privacy?
    • How will we protect the information we need to protect?
      • Solution:
        • New search and discovery tools
        • Ways to add structure to unstructured data
        • New storage and information management techniques
        • More compliance tools
        • Better Security
  • 12. Source: McKinsey Global Institute, Big data: The next frontier for innovation, competition and productivity, May 2011
  • 13. The Evolution of Science (timeline): Astronomy (Babylon, 1900 BC), Mathematics (India, 499 BC), Platonic Academy (387 BC), Scientific Revolution (1543 AD), Newton's Laws (1687 AD), Relativity (1905 AD), Quantum Physics (1925 AD), Computing (1946 AD), DNA (1953 AD), Learning Systems (21st century), spanning the Era of Natural Philosophy, the Industrial Revolution, and the Era of Modern Science.
  • 14. Today's Systems – The Calculating Paradigm: static programming of algorithms and applications over archives of structured data and text; people hypothesize, determine "what it means", and run other applications.
  • 15. Future Systems – The Learning Paradigm: training and learning engines to build models and define insight; hypothesis engines to understand and plan actions; a policy engine for business, legal and ethical rules; verification engines (e.g. simulations); an outcome engine for actuation and validation; active learning through natural interfaces; drawing on society, nature, institutions and archives.
  • 16. New "Big Data" Brings New Opportunities, Requires New Analytics. Compared with traditional data warehousing and business intelligence, data scale is up to 10,000 times larger (kilo- to exabytes) and decision frequency up to 10,000 times faster (occasional to real-time), for both data in motion and data at rest. Examples: Telco promotions, 100,000 records/sec, 6B/day, 10 ms/decision, 270 TB for deep analytics; Smart traffic, 250K GPS probes/sec, 630K segments/sec, 2 ms/decision, 4K vehicles; Homeland security, 600,000 records/sec, 50B/day, 1-2 ms/decision, 320 TB for deep analytics; DeepQA, 100s of GB for deep analytics, 3 sec/decision.
  • 17. Enabling Multiple Solutions & Appliances to Achieve a Smarter Planet: a Peta² analytics appliance (reactive + deep analytics platform) built on a Peta² data-centric system, algorithms and big-data skills, supporting a big-analytics ecosystem of solutions such as DeepFriends (social network monitoring), DeepResponse (emergency coordination), DeepEyes (webcam fusion), DeepCurrent (power delivery), DeepSafety (police/security), DeepTraffic (area traffic prediction), DeepWater (water management), DeepBasket (food market prediction), DeepBreath (air quality control), DeepPulse (political polling), DeepThunder (local weather prediction) and DeepSoil (farm prediction).
  • 18. Watson: an evidence-based decision support system. A statistical ensemble of 600 to 800 scoring engines (S1, S2, S3 ... SN) and ~30 machine learning models weigh the scores and produce a confidence (0 < P < 1) for each question; the hypothetical question with the greatest confidence is chosen. Example answer: "A large country in the Western Hemisphere whose capital has a similar name"; hypotheses ("guess questions" Q1, Q2 ... Qi) are generated from the answer; question: "What is Brazil?". Watson today processes unstructured text and about 200 hypotheses every 3 seconds on 3,000 cores, 100 TFlops and 2 TB of memory, drawing ~200 KW, over a static data corpus. Refresh times per element: data corpus, 2 weeks; hypothesis engines, weeks to months; scoring engines, weeks to months; decision support engine, 4 days.
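    The ensemble idea above, where many scoring engines each contribute evidence scores that are weighted into a single confidence per hypothesis and the most confident hypothesis wins, can be made concrete with a small sketch. This is a toy illustration only, not Watson's actual pipeline; the example scores, weights, bias and the logistic squashing are all assumptions made for the sake of the example.

        import java.util.LinkedHashMap;
        import java.util.Map;

        // Toy ensemble scorer: weighted evidence scores -> confidence in (0, 1), best hypothesis wins.
        public class EnsembleScorer {

            // Combine per-engine scores with weights and squash to a confidence 0 < P < 1.
            static double confidence(double[] scores, double[] weights, double bias) {
                double z = bias;
                for (int i = 0; i < scores.length; i++) {
                    z += weights[i] * scores[i];       // weighted sum of evidence scores
                }
                return 1.0 / (1.0 + Math.exp(-z));     // logistic squashing into (0, 1)
            }

            public static void main(String[] args) {
                // Hypothetical candidate "questions" and scores from three of the N scoring engines.
                Map<String, double[]> hypotheses = new LinkedHashMap<>();
                hypotheses.put("What is Brazil?",    new double[]{0.9, 0.7, 0.8});
                hypotheses.put("What is Argentina?", new double[]{0.4, 0.5, 0.3});

                double[] weights = {1.2, 0.8, 1.0};    // assumed weights from the ML models
                double bias = -1.5;

                String best = null;
                double bestP = -1.0;
                for (Map.Entry<String, double[]> e : hypotheses.entrySet()) {
                    double p = confidence(e.getValue(), weights, bias);
                    System.out.printf("%-22s confidence = %.3f%n", e.getKey(), p);
                    if (p > bestP) { bestP = p; best = e.getKey(); }
                }
                System.out.println("Chosen hypothesis: " + best);
            }
        }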
  • 19.
  • 20. Exascale Research and Development. Source: Exascale Research and Development – Request for Information, July 2011
  • 21. Big Data Systems Require a Data-centric Architecture for Performance. Old compute-centric model: data lives on disk and tape, is moved to the CPU as needed, and sits behind a deep storage hierarchy. New data-centric model: data lives in persistent memory (flash, phase change), many CPUs (manycore, FPGA) surround and use it, and the storage hierarchy is shallow or flat, with massive parallelism. This is the largest change in system architecture since the System/360, with a huge impact on hardware, systems software, and application design.
  • 22. Scale-in is the New Systems Battlefield. Four directions, plotted as system capacity (capability) versus system density (1/end-to-end latency), from single devices to device clusters: Scale-down maximizes feature density (atom transistors, atom storage, approaching physical limits); Scale-up maximizes device capacity (terabyte HDDs, POWER7); Scale-out maximizes system capacity (scale-out NAS, blade servers, cloud computing); Scale-in maximizes system density and minimizes end-to-end latency (flash SSD, 3D chips, FPGA, manycore, BPRAM/SCM, fast interconnects, in-memory databases, DAS). Exascale and Peta² systems sit at the extreme of both capacity and density.
  • 23. Storage Class Memory: the Tipping Point for Data-centric Systems. HDD keeps its cost advantage (roughly 1/10 the cost of SCM), but SCM dominates in performance, about 10,000x faster than HDD. Projection for phase-change SCM in 2015: about $0.05 per GB, i.e. $50K per PB (the chart also labels $0.10/GB and $0.01/GB price points). Relative cost / relative latency: DRAM 100 / 1; SCM 1 / 10; FLASH 15 / 1,000; HDD 0.1 / 100,000. Source: Chung Lam, IBM
  • 24. Background: Three Styles of Massively Parallel Systems. (1) Reactive analytics, extreme ingestion: streaming (Streams) over data in motion with high velocity, mixed variety, and high volume over time; programmed in SPL, C, Java. (2) Deep analytics, extreme scale-out: Hadoop/MapReduce (BigInsights) over pre-partitioned data at rest with high volume, mixed variety, and low velocity; programmed in JAQL and Java, with mappers and reducers reading input data on disk and writing output data across compute nodes. (3) Simulation (Blue Gene): generative modeling and extreme physics, long-running with small input and massive output; programmed in C/C++, Fortran, MPI, OpenMP.
  • 25.
  • 26. Fault-tolerant Hadoop Distributed File System (HDFS). Source: Hadoop Overview, http://www.cloudera.com
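    To make the HDFS abstraction concrete, here is a minimal sketch that reads a file stored in HDFS through Hadoop's Java FileSystem API and copies it to standard output. The cluster URI and file path are placeholders, and error handling is kept to a minimum.

        import java.io.InputStream;
        import java.net.URI;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IOUtils;

        // Minimal HDFS read: the client asks the NameNode where the blocks live,
        // then streams them from the DataNodes.
        public class HdfsCat {
            public static void main(String[] args) throws Exception {
                String uri = "hdfs://namenode:8020/user/demo/ncdc/sample.txt";  // placeholder path
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(URI.create(uri), conf);
                InputStream in = null;
                try {
                    in = fs.open(new Path(uri));
                    IOUtils.copyBytes(in, System.out, 4096, false);
                } finally {
                    IOUtils.closeStream(in);
                }
            }
        }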
  • 27. MapReduce Logical Data Flow (step 1: the raw NCDC input records). Source: O'Reilly, Hadoop – The Definitive Guide. 0067011990999991950051507004 ... 9999999N9+00001+99999999999 ... 0043011990999991950051512004 ... 9999999N9+00221+99999999999 ... 0043011990999991950051518004 ... 9999999N9-00111+99999999999 ... 0043012650999991949032412004 ... 0500001N9+01111+99999999999 ... 0043012650999991949032418004 ... 0500001N9+00781+99999999999 ...
  • 28. MapReduce Logical Data Flow (step 2: input key-value pairs, keyed by byte offset). Source: O'Reilly, Hadoop – The Definitive Guide. (0, 006701199099999 1950 051507004...9999999N9+ 0000 1+99999999999...) (106, 004301199099999 1950 051512004...9999999N9+ 0022 1+99999999999...) (212, 004301199099999 1950 051518004...9999999N9- 0011 1+99999999999...) (318, 004301265099999 1949 032412004...0500001N9+ 0111 1+99999999999...) (424, 004301265099999 1949 032418004...0500001N9+ 0078 1+99999999999...)
  • 29. MapReduce Logical Data Flow (step 3: map output, (year, temperature) pairs). Source: O'Reilly, Hadoop – The Definitive Guide. (1950, 0) (1950, 22) (1950, −11) (1949, 111) (1949, 78)
  • 30. MapReduce Logical Data Flow (step 4: map output grouped and sorted by key for the reducer). Source: O'Reilly, Hadoop – The Definitive Guide. (1949, [111, 78]) (1950, [0, 22, −11])
  • 31. MapReduce Logical Data Flow (step 5: reduce output, the maximum temperature per year). Source: O'Reilly, Hadoop – The Definitive Guide. (1949, 111) (1950, 22)
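    Putting the five steps together, a mapper and reducer for this maximum-temperature flow might look roughly like the sketch below, along the lines of the example in the cited Definitive Guide. The fixed-width offsets used to pull the year and temperature out of each NCDC record, and the quality-code check, are assumptions about the record layout; the newer org.apache.hadoop.mapreduce API is used.

        import java.io.IOException;

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        // Steps 2 -> 3: parse one NCDC record into a (year, temperature) pair.
        class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final int MISSING = 9999;

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String line = value.toString();
                String year = line.substring(15, 19);          // assumed fixed-width year field
                int airTemperature;
                if (line.charAt(87) == '+') {                  // assumed signed temperature field
                    airTemperature = Integer.parseInt(line.substring(88, 92));
                } else {
                    airTemperature = Integer.parseInt(line.substring(87, 92));
                }
                String quality = line.substring(92, 93);
                if (airTemperature != MISSING && quality.matches("[01459]")) {
                    context.write(new Text(year), new IntWritable(airTemperature));
                }
            }
        }

        // Steps 4 -> 5: for each year, keep the maximum of the grouped temperatures.
        class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int maxValue = Integer.MIN_VALUE;
                for (IntWritable value : values) {
                    maxValue = Math.max(maxValue, value.get());
                }
                context.write(key, new IntWritable(maxValue));
            }
        }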
  • 32. The Blue Gene/Q ASIC Source: EDN News, Hot Chips: The Puzzle of Many Cores
  • 33. The Blue Gene/Q Packaging Hierarchy. Source: The Register, IBM's Blue Gene/Q Super Chip Grows 18th Core
  • 34. Opportunity: Blue Gene Active Storage. A Blue Gene/Q Active Storage rack adds 512 BG/Q flash cards ... scale it like BG/Q.
    • BG/Q + Flash Memory => Blue Gene Active Storage (BGAS)
    • BGAS Capabilities Per Rack
      • 104 TeraFLOPS: 512 nodes, 8,192 cores (50% of a standard BG/Q system)
      • 512 GB/s Bi-Section Bandwidth - All-to-All Throughput of 2GB/s per Node
      • 768 GB/s I/O bandwidth – 100TB Sort in ~330 sec (vs 10,000 sec today)
      • 100 Million IOPS – Equivalent to order 1 Million Disks
    • Research and Development Challenges:
      • Packaging: integrate Flash today, tomorrow PCM, Memristor, Racetrack, etc.
      • System Software: Persistent Memory Management, k-v Store on BGAS
      • Resilience: Single Path to Storage, BG/Q Network for General Workloads
      • Integration: System Management, Middleware and Frameworks, Applications
    Per flash card: 320 GB capacity, 1.5 GB/s I/O bandwidth, 207,000 IOPS. Per rack (512 nodes): 640 TB storage capacity, 768 GB/s I/O bandwidth, 100 million random IOPS, 104 TF compute, 512 GB/s bi-section bandwidth (per-rack bandwidth and IOPS are roughly the per-card figures times 512: 1.5 GB/s × 512 = 768 GB/s, 207,000 × 512 ≈ 106 million).
  • 35. NAND Flash Challenges
    • Need to erase before writing
    • Data retention errors
    • Limited number of writes
    • Management of initial and runtime bad blocks
    • Data errors caused by read and write disturb
      • Factors that influence reliability, performance, write endurance:
        • Use of Single Level Cell (SLC) and Multi Level Cell (MLC) NAND technology
        • The wear-out mechanism that limits service life, addressed by the wear-leveling algorithm
        • Ensuring data integrity through bad block management techniques
        • Use of error detection and correction algorithms
        • Write amplification
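    Write amplification is simply the ratio of bytes physically written to flash to bytes the host asked to write; combined with the limited number of program/erase cycles per cell, it determines how long a device lasts. The back-of-the-envelope sketch below illustrates the relationship; all of the numbers are made up for illustration.

        // Toy write-amplification and endurance estimate for a NAND flash device.
        public class FlashEndurance {
            public static void main(String[] args) {
                // Illustrative, assumed device and workload parameters.
                double capacityGB         = 320;     // usable capacity
                double hostWritesGBPerDay = 200;     // what the host asks to write
                double writeAmplification = 2.5;     // extra writes from garbage collection, wear leveling, etc.
                double peCycles           = 5000;    // program/erase cycles per cell (MLC order of magnitude)

                // Physical writes the flash actually absorbs per day.
                double flashWritesGBPerDay = hostWritesGBPerDay * writeAmplification;

                // Total data the device can absorb before wear-out, and the implied lifetime.
                double totalEnduranceGB = capacityGB * peCycles;
                double lifetimeDays     = totalEnduranceGB / flashWritesGBPerDay;

                System.out.printf("Flash writes per day: %.0f GB%n", flashWritesGBPerDay);
                System.out.printf("Estimated lifetime:   %.0f days (~%.1f years)%n",
                        lifetimeDays, lifetimeDays / 365.0);
            }
        }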
  • 36. Gartner's Hype Cycle
  • 37. Thank you very much for your attention.