
IBM - Managing Uncertain Data at Scale



EXTENT Trading Technology Trends & Quality Assurance Conference in Obninsk, 2 March, 2013
Managing Uncertain Data at Scale
Nikolay Marin



  1. Managing Uncertain Data at Scale. Nikolay Marin. © 2013 IBM Corporation
  2. Managing Uncertain Data at Scale
     Trend: Most of the world’s analyzed data will be uncertain
       • By 2015, 80% of the world’s data will be uncertain
       • Uncertain data management requires new techniques
       • These techniques are necessary for real-world Big Data Analytics
     Opportunity: Business leadership using Big Data Analytics
       • Robust, business-aware uncertain data management
       • Use analytics over uncertain web, sensor, and human-generated data
       • Enable good business decisions by understanding analysis confidence
     Challenge: Taking Big Data Analytics into an uncertain world
       • Analysis of text is highly nuanced; sensor-based data is imprecise
       • Timely business decisions require efficient large-scale analytics
       • It is more difficult to obtain insight about an individual than a group, especially if the source data is uncertain
  3. The fourth dimension of Big Data: Veracity (handling data in doubt)
       • Volume (Data at Rest): Terabytes to exabytes of existing data to process
       • Velocity (Data in Motion): Streaming data, milliseconds to seconds to respond
       • Variety (Data in Many Forms): Structured, unstructured, text, multimedia
       • Veracity* (Data in Doubt): Uncertainty due to data inconsistency and incompleteness, ambiguities, latency, deception, model approximations
     * Truthfulness, accuracy or precision, correctness
  4. Uncertainty arises from many sources
       • Process Uncertainty: Processes contain “randomness”. Examples: uncertain travel times, semiconductor yield
       • Data Uncertainty: Data input is uncertain. Examples: text entry (intended spelling vs. actual spelling), GPS uncertainty, testimony, ambiguity ({Paris Airport}; {John Smith, Dallas} vs. {John Smith, Kansas}), rumors, contaminated data, conflicting data
       • Model Uncertainty: All modeling is approximate. Examples: fitting a curve to data, forecasting a hurricane
  5. By 2015, 80% of all available data will be uncertain
       • By 2015 the number of networked devices will be double the entire global population. All sensor data has uncertainty.
       • The total number of social media accounts exceeds the entire global population. This data is highly uncertain in both its expression and content.
       • Data quality solutions exist for enterprise data like customer, product, and address data, but this is only a fraction of the total enterprise data.
     [Chart: Global Data Volume in Exabytes (0 to 9000) and Aggregate Uncertainty % (0 to 100), 2005 to 2015. Series: Sensors (Internet of Things), Social Media (video, audio and text), VoIP, Enterprise Data. Multiple sources: IDC, Cisco]
  6. How to reduce uncertainty in processes, models, and data
     Constructing context for better understanding
       • Extract as much information as feasible from each source
       • Combine (condense) data from multiple sources
       • More data from more sources is better: it gathers more evidence for statistical methods
     Using statistical methods scaled for Big Data
       • Stochastic techniques efficiently reason about uncertainty
       • Monte Carlo techniques explore many possible scenarios in order to gain insight
     Requires specific business process and industry context
  7. Statistical techniques reduce uncertainty in analytical models
     Help agent find similar tickets: use stochastic search to find trouble tickets that are similar
     Trouble ticket attributes
       • Some attributes such as server type are precise
       • Other attributes such as words in trouble tickets may be imprecise indicators of the problem
     Model approximation
       • Treat N attributes as N dimensions in space
       • Model similarity as closeness in the N-dimensional space
     Prediction
       • Improve predictability by getting agent feedback
     Summary
       • Improve suggestions for similar problems using corroborating data and better mathematical techniques
       • Analyze all the data; do not subset
       • Use related techniques to automate Level 1 support, finding problem clusters, etc.
  8. Analytics is broadly defined as the use of data and computation to make smart decisions
     [Diagram: Data (historical and simulated; text, video, images, audio) feeds a decision point that leads to possible outcomes (Option 1, Option 2, Option 3)]
       • Data instances
       • Reports and queries on data aggregates
       • Predictive models
       • Answers and confidence
       • Feedback and learning
  9. Future of Analytics
     Explosion of unstructured data
       • Creates new analytics opportunities
       • Addresses new enterprise needs
     Consistent, extensible, and consumable analytics platform
       • Reduces cost-to-value for enterprises
       • Increases analytics solution coverage with limited supply of skills
     Optimizing across the stack to deploy analytics at scale
       • Analytics becomes a dominant IT workload and drives HW design
       • Opportunity to seamlessly scale from terascale to exascale
  10. Analytics toolkits will be expanded to support ingestion and interpretation of unstructured data, and enable adaptation and learning
     Capabilities are listed by category, with their stage in the context of the decision process in brackets.
     New Methods
       • Adaptive Analysis: Responding to context [Learn]
       • Continual Analysis: Responding to local change/feedback [Learn]
       • Optimization under Uncertainty: Quantifying or mitigating risk [Decide and Act]
     Traditional
       • Optimization: Decision complexity, solution speed [Decide and Act]
       • Predictive Modeling: Causality, probabilistic, confidence levels [Understand and Predict]
       • Simulation: High fidelity, games, data farming [Understand and Predict]
       • Forecasting: Larger data sets, nonlinear regression [Understand and Predict]
       • Alerts: Rules/triggers, context sensitive, complex events [Report]
       • Query/Drill Down: In-memory data, fuzzy search, geospatial [Report]
       • Ad hoc Reporting: Query by example, user-defined reports [Report]
       • Standard Reporting: Real time, visualizations, user interaction [Report]
     New Data
       • Entity Resolution: People, roles, locations, things [Collect and Ingest/Interpret]
       • Relationship, Feature Extraction: Rules, semantic inferencing, matching [Collect and Ingest/Interpret]
       • Annotation and Tokenization: Automated, crowd sourced [Collect and Ingest/Interpret]
     Collect and Ingest/Interpret: Decide what to count; enable accurate counting
     Extended from: Competing on Analytics, Davenport and Harris, 2007
  11. Finally... what about a longer-term view, say the next 10-50 years?
     1. Artificial Intelligence
     2. Nano-“everything”
     3. Cognitive Computing
     4. Deep (Exascale) Computing
     5. Atomic & Quantum Computing
     6. Human / Computer Interaction
     7. Machine to Machine Interaction
     8. BioTech / Human Augmentation
     9. Robots & Robotics
     10. Advanced / Predictive Analytics
     11. Security & Privacy
     12. 3-D Printing
     13. Video-enabled Business Processes
     14. Personalized Web/Assistants
     15. Ubiquitous Computing
     16. Gaming
     17. Simulation
     18. Virtual Computing (including virtual worlds, tele-presence, etc.)
     19. Augmented Reality
     IBM Academy of Technology and Global Technology Outlook can help you find some answers
  12. Managing Uncertain Data at Scale
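
Slide 6 says that Monte Carlo techniques explore many possible scenarios in order to gain insight, and slide 4 lists uncertain travel times as an example of process uncertainty. The sketch below combines the two as an illustration only; it is not taken from the presentation. The segment estimates, the 60-minute deadline, the choice of a triangular distribution, and all function names are assumptions made for the example.

```python
import random
import statistics

# Hypothetical route: each segment has (fastest, most likely, slowest) minutes.
SEGMENTS = [(10, 12, 20), (25, 30, 50), (5, 6, 10)]


def simulate_trip_minutes(rng, segments):
    """Draw one possible total travel time (one 'scenario')."""
    # random.triangular takes (low, high, mode), in that order.
    return sum(rng.triangular(low, high, mode) for low, mode, high in segments)


def monte_carlo_summary(segments, deadline, trials=20_000, seed=42):
    """Return (mean total minutes, estimated probability of meeting the deadline)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    totals = [simulate_trip_minutes(rng, segments) for _ in range(trials)]
    on_time = sum(t <= deadline for t in totals) / trials
    return statistics.mean(totals), on_time


if __name__ == "__main__":
    mean_minutes, p_on_time = monte_carlo_summary(SEGMENTS, deadline=60)
    print(f"mean trip: {mean_minutes:.1f} min, P(on time): {p_on_time:.2f}")
```

The point of the exercise is that the answer comes with a confidence: instead of a single travel-time estimate, the simulation reports how often the deadline is met across the sampled scenarios.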
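
Slide 7 describes treating N ticket attributes as N dimensions and modeling similarity as closeness in that space. A minimal sketch of that idea follows, under stated assumptions: the ticket records, the field names `server` and `text`, the one-hot and word-count encoding, and the use of Euclidean distance are all hypothetical choices for illustration. The slide mentions stochastic search; this sketch instead scans every ticket exhaustively, which is simpler but would not scale the same way.

```python
import math
from collections import Counter

# Hypothetical trouble tickets: one precise attribute (server type) and
# one imprecise attribute (free-text words).
TICKETS = [
    {"id": "T1", "server": "db", "text": "disk full on database volume"},
    {"id": "T2", "server": "web", "text": "login page timeout"},
    {"id": "T3", "server": "db", "text": "database volume out of space"},
]


def build_dimensions(tickets):
    """Collect every server type and word seen; each becomes one dimension."""
    dims = set()
    for ticket in tickets:
        dims.add(("server", ticket["server"]))
        dims.update(("word", w) for w in ticket["text"].lower().split())
    return sorted(dims)


def to_vector(ticket, dims):
    """Map a ticket to a point: one-hot server type plus word counts."""
    words = Counter(ticket["text"].lower().split())
    return [
        (1.0 if ticket["server"] == value else 0.0) if kind == "server"
        else float(words[value])
        for kind, value in dims
    ]


def most_similar(query, tickets, k=2):
    """Return the k tickets closest to the query (Euclidean distance)."""
    dims = build_dimensions(tickets + [query])
    q = to_vector(query, dims)
    return sorted(tickets, key=lambda t: math.dist(q, to_vector(t, dims)))[:k]


if __name__ == "__main__":
    new_ticket = {"id": "Q", "server": "db",
                  "text": "database volume out of disk space"}
    print([t["id"] for t in most_similar(new_ticket, TICKETS)])
```

Agent feedback, which the slide names as the way to improve predictability, could be folded in by reweighting dimensions that agents mark as relevant; that step is left out here.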