Vladimir_Suvorov_Big_data

Meeting #1. Game|Changers. Data Mining Track.

Published in: Education

  1. 1. Big Data Concepts & Practice. Vladimir Suvorov, vladimir.suvorov@emc.com, EMC & DataScienceSquad.com. Non-commercial education only. Corresponding information belongs to its respective owners, including EMC, IBM, Microsoft, Oracle, Gartner, etc.
  2. 2. About myself
  3. 3. Why Big Data: How We Got Here. February 16, 2013. © 2012 IBM Corporation
  4. 4. In 2005 there were 1.3 billion RFID tags in circulation… by the end of 2011, this was about 30 billion and growing even faster.
  5. 5. An increasingly sensor-enabled and instrumented business environment generates HUGE volumes of data with MACHINE SPEED characteristics… 1 BILLION lines of code. EACH engine generating 10 TB every 30 minutes!
  6. 6. 350B transactions/year. Meter reads every 15 min. 120M meter reads/month. 3.65B meter reads/day.
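One way to read the three volumes on this slide is as the yearly totals for a single fleet of meters read at three different frequencies (monthly, daily, every 15 minutes). The sketch below checks that reading. The fleet size of 10 million meters is an assumption chosen because it reproduces the slide's figures; the slide itself does not state it.

```python
# Assumption (not on the slide): a utility with roughly 10 million meters.
METERS = 10_000_000
DAYS_PER_YEAR = 365


def yearly_reads(reads_per_meter_per_day: int, meters: int = METERS) -> int:
    """Total meter reads per year for a given daily read frequency."""
    return reads_per_meter_per_day * DAYS_PER_YEAR * meters


monthly_reads = METERS * 12                    # one read per meter per month
daily_reads = yearly_reads(1)                  # one read per meter per day
quarter_hourly_reads = yearly_reads(24 * 4)    # one read every 15 minutes

print(f"monthly reads:   {monthly_reads:,} per year")
print(f"daily reads:     {daily_reads:,} per year")
print(f"15-minute reads: {quarter_hourly_reads:,} per year")
```

Under that assumption, moving from monthly to 15-minute reads multiplies the yearly volume by 2,920 (96 reads a day times 365 days, divided by 12 reads a year).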
  7. 7. In August of 2010, Adam Savage, of “MythBusters,” took a photo of his vehicle using his smartphone. He then posted the photo to his Twitter account including the phrase “Off to work.” Since the photo was taken by his smartphone, the image contained metadata revealing the exact geographical location where the photo was taken. By simply taking and posting a photo, Savage revealed the exact location of his home, the vehicle he drives, and the time he leaves for work.
  8. 8. The Social Layer in an Instrumented Interconnected World
      • 30 billion RFID tags today (1.3B in 2005)
      • 4.6 billion camera phones worldwide
      • 100s of millions of GPS-enabled devices sold annually
      • 2+ billion people on the Web by end 2011
      • 76 million smart meters in 2009… 200M by 2014
      • 12+ TBs of tweet data every day
      • 25+ TBs of log data every day
      • ? TBs of data every day
  9. 9. Twitter Tweets per Second Record Breakers of 2011
  10. 10. Extract Intent, Life Events, Micro Segmentation Attributes. [Figure: sample social posts tagged by attribute (name, birthday, family, jobs, location, relocation) and labeled as Monetizable Intent, Wishful Thinking, Not Relevant - Noise, or SPAMbots.]
  11. 11. Big Data Includes Any of the Following Characteristics. Extracting insight from an immense volume, variety and velocity of data, in context, beyond what was previously possible.
      • Variety: Manage the complexity of data in many different structures, ranging from relational, to logs, to raw text
      • Velocity: Streaming data and large volume data movement
      • Volume: Scale from Terabytes to Petabytes (1K TBs) to Zettabytes (1B TBs)
  12. 12. Bigger and Bigger Volumes of Data
      • Retailers collect click-stream data from Web site interactions and loyalty card data
        – This traditional POS information is used by retailers for shopping basket analysis, inventory replenishment, and more
        – But data is being provided to suppliers for customer buying analysis
      • Healthcare has traditionally been dominated by paper-based systems, but this information is getting digitized
      • Science is increasingly dominated by big science initiatives
        – Large-scale experiments generate over 15 PB of data a year that can’t be stored within the data center; it is sent to laboratories
      • Financial services are seeing larger and larger volumes through smaller trading sizes, increased market volatility, and technological improvements in automated and algorithmic trading
      • Improved instrument and sensory technology
        – Large Synoptic Survey Telescope’s GPixel camera generates 6PB+ of image data per year; or consider the Oil and Gas industry
  13. 13. The Big Data Conundrum
      • The percentage of available data an enterprise can analyze is decreasing in proportion to the data available to it
      • Quite simply, this means as enterprises, we are getting “more naive” about our business over time
      • We don’t know what we could already know…
      Chart labels: Data AVAILABLE to an organization; Data an organization can PROCESS
  14. 14. Why Not All of Big Data Before: Didn’t have the Tools?
  15. 15. Applications for Big Data Analytics: Smarter Healthcare; Multi-channel sales; Finance; Log Analysis; Homeland Security; Traffic Control; Telecom; Search Quality; Manufacturing; Trading Analytics; Fraud and Risk; Retail: Churn, NBO
  16. 16. Most Requested Uses of Big Data
      • Log Analytics & Storage
      • Smart Grid / Smarter Utilities
      • RFID Tracking & Analytics
      • Fraud / Risk Management & Modeling
      • 360° View of the Customer
      • Warehouse Extension
      • Email / Call Center Transcript Analysis
      • Call Detail Record Analysis
  17. 17. What companies & analysts think of Big Data
  18. 18. Gartner & McKinsey
  19. 19. Hype Cycle of Big Data
  20. 20. Priority matrix
  21. 21. Key vision
      • Predictive modeling is gaining momentum with property and casualty (P&C) companies, which are using it to support claims analysis, CRM, risk management, pricing and actuarial workflows, quoting, and underwriting.
      • Social content is the fastest growing category of new content in the enterprise and will eventually attain 20% market penetration.
      • Gartner reports that 45% of sales management teams identify sales analytics as a priority to help them understand sales performance, market conditions and opportunities.
      • Over 80% of Web Analytics solutions are delivered via Software-as-a-Service (SaaS).
  22. 22. Big Data deliverables by McKinsey
  24. 24. Intel
  25. 25. Intel Big Data Cluster
      Example Application | Big Data | Algorithms | Compute Style
      Scientific study (e.g. earthquake study) | Ground model | Earthquake simulation, thermal conduction, … | HPC
      Internet library search | Historic web snapshots | Data mining | MapReduce
      Virtual world analysis | Virtual world database | Data mining | TBD
      Language translation | Text corpuses, audio archives, … | Speech recognition, machine translation, text-to-speech, … | MapReduce & HPC
      Video search | Video data | Object/gesture identification, face recognition, … | MapReduce
      “There has been more video uploaded to YouTube in the last 2 months than if ABC, NBC, and CBS had been airing content 24/7/365 continuously since 1948.” - Gartner
  26. 26. Example Motivating Application: Online Processing of Archival Video
      • Research project: Develop a context recognition system that is 90% accurate over 90% of your day
      • Leverage a combination of low- and high-rate sensing for perception
      • Federate many sensors for improved perception
      • Big Data: Terabytes of archived video from many egocentric cameras
      • Example query 1: “Where did I leave my briefcase?”
        – Sequential search through all video streams [Parallel Camera]
      • Example query 2: “Now that I’ve found my briefcase, track it”
        – Cross-cutting search among related video streams [Parallel Time]
      Figure label: Big Data Cluster
  27. 27. Oracle
  28. 28. Big Data Use Cases
      Industry | Today’s Challenge | New Data | What’s Possible
      Healthcare | Expensive office visits | Remote patient monitoring | Preventive care, reduced hospitalization
      Manufacturing | In-person support | Product sensors | Automated diagnosis, support
      Location-Based Services | Based on home zip code | Real time location data | Geo-advertising, traffic, local search
      Public Sector | Standardized services | Citizen surveys | Tailored services, cost reductions
      Retail | One size fits all marketing | Social media | Sentiment analysis, segmentation
  29. 29. What’s in Big Data for Public Sector
      • Operational efficiency and productivity
      • Fraud detection and prevention
      • Close tax gaps
      • Value for money for citizens
      • Prevent crime waves
      • Customize actions based on population segments
      • Public utilities to reduce consumption
      • Produce safety from farm to fork
  30. 30. Microsoft
  31. 31. New opportunities
      • Massive Volumes: Increases ad revenue by processing 3.5 billion events per day. Processes 464 billion rows per quarter, with average query time under 10 secs.
      • Cloud Connectivity: Measures and ranks online user influence by processing 3 billion signals per day. Connects across 15 social networks via the cloud for data and API access.
      • Real-Time Insight: Improving investigation time by analyzing large volume & variety of data. Cut investigation time from 2 years to 15 days.
  32. 32. Microsoft’s Approach to Big Data
  33. 33. A Holistic Big Data Solution from Microsoft
  34. 34. Data Scientist Job
  35. 35. Sexy Job of Data Scientist. Tom Davenport, who is teaching an executive program in Big Data and analytics at Harvard University, said some data scientists are earning annual salaries as high as $300,000, which is “pretty good for somebody that doesn’t have anyone else working for them.” Davenport also said such workers are motivated by the problems and opportunities data provides.
  36. 36. What EMC Thinks of Data Scientists
  37. 37. Job evolution
  38. 38. What Forbes Thinks of Data Scientists
  39. 39. Data Science Courses
  40. 40. Course Modules and Navigation Icons: Data Science and Big Data Analytics
      1. Introduction to Big Data Analytics
      2. Data Analytics Lifecycle + Lab
      3. Review of Basic Data Analytics Methods Using R + Labs
      4. Advanced Analytics - Theory & Methods + Labs
      5. Advanced Analytics - Technology & Tools + Labs
      6. The Endgame, or Putting it All Together + Final Lab
  41. 41. Topics: Data Science and Big Data Analytics Course
      • Introduction to Big Data Analytics: Big Data Overview; State of the Practice in Analytics; The Data Scientist; Big Data Analytics in Industry Verticals
      • Data Analytics Lifecycle
      • Review of Basic Data Analytic Methods Using R: Using R to Look at Data - Introduction to R; Analyzing and Exploring the Data; Statistics for Model Building and Evaluation
      • Advanced Analytics – Theory and Methods: K-means Clustering; Association Rules; Linear Regression; Logistic Regression; Naive Bayesian Classifier; Decision Trees; Time Series Analysis; Text Analysis
      • Advanced Analytics - Technology and Tools: Analytics for Unstructured Data (MapReduce and Hadoop); The Hadoop Ecosystem; In-database Analytics – SQL Essentials; Advanced SQL and MADlib for In-database Analytics
      • The Endgame, or Putting it All Together + Final Lab on Big Data Analytics: Operationalizing an Analytics Project; Creating the Final Deliverables; Data Visualization Techniques; Final Lab – Application of the Data Analytics Lifecycle to a Big Data Analytics Challenge
  42. 42. Hadoop
  43. 43. Top companies need Hadoop
  44. 44. What is Hadoop and Where did it start?
      • Created by Doug Cutting, formerly of Yahoo!, now at Cloudera
        – HDFS (storage) & MapReduce (compute)
        – Inspired by Google’s MapReduce and Google File System (GFS) papers
      • Much of the initial work on Hadoop was done by Yahoo
      • It is now a top-level Apache project backed by a large open source development community
  45. 45. What is Hadoop? Two Core Components
      • HDFS: Storage in the Hadoop Distributed File System
      • MapReduce: Compute via the MapReduce distributed processing platform
      • Storage & Compute in 1 Framework
      • Open Source Project of the Apache Software Foundation
      • Written in Java
  46. 46. Hadoop cluster architecture
  47. 47. MapReduce example
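The figure for this slide is not preserved in the transcript. As a stand-in, here is a minimal word-count sketch of the MapReduce pattern in plain Python. It runs in a single process and does not use the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative choices, not taken from the deck.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop"]))
```

In a real cluster the map and reduce calls would run in parallel on different nodes and the shuffle would move data over the network; the sketch only shows how the three phases fit together.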
  48. 48. Hadoop versions
  49. 49. Hadoop Wave Report: “EMC Greenplum is the first mover in Hadoop appliances. EMC Greenplum the first EDW vendor to provide a full-featured enterprise-grade Hadoop appliance and roll out an appliance family that integrates its Hadoop, EDW, and data integration in a single rack. It provides its own open source Hadoop distribution software, integrates EMC’s strong storage product portfolio in its appliances, and has an extensive professional services force of EMC technical consultants and data scientists with Hadoop expertise.”
  50. 50. Hadoop Players Today
  51. 51. Get Started With Hadoop Today. Data Scientists & Hadoop Architecture teams deliver customer success
      • Hadoop Architecture Services
        – POC planning and deployment
        – Installation and best practices
        – Educate the team
      • Greenplum Analytics Labs
        – Leverage the expertise of Greenplum’s Data Scientists
        – Packaged solutions that produce business value and actionable results
        – Accelerate Hadoop capabilities on your data with your analysts
      • Establish a strategic vision
        – Roadmap for Hadoop and unified analytics
  52. 52. The Greenplum Unified Analytics Platform
  53. 53. NoSQL
  54. 54. Definition from nosql-databases.org
      • Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable. The original intention has been modern web-scale databases. The movement began early 2009 and is growing rapidly. Often more characteristics apply, such as: schema-free, easy replication support, simple API, eventually consistent / BASE (not ACID), a huge data amount, and more. So the misleading term "nosql" (the community now translates it mostly with "not only sql") should be seen as an alias to something like the definition above.
  55. 55. NoSQL (http://nosql-database.org/)
      • Non relational
      • Scalability
        – Vertically: add more data
        – Horizontally: add more storage
      • Collection of structures
        – Hashtables, maps, dictionaries
      • No pre-defined schema
      • No join operations
      • CAP not ACID
        – CAP: Consistency, Availability and Partition tolerance (but not all three at once!)
        – ACID: Atomicity, Consistency, Isolation and Durability
  56. 56. Advantages of NoSQL
      • Cheap, easy to implement
      • Data are replicated and can be partitioned
      • Easy to distribute
      • Don’t require a schema
      • Can scale up and down
      • Quickly process large amounts of data
      • Relax the data consistency requirement (CAP)
      • Can handle web-scale data, whereas Relational DBs cannot
  57. 57. Disadvantages of NoSQL
      • New and sometimes buggy
      • Data is generally duplicated, potential for inconsistency
      • No standardized schema
      • No standard format for queries
      • No standard language
      • Difficult to impose complicated structures
      • Depend on the application layer to enforce data integrity
      • No guarantee of support
      • Too many options: which one, or ones, to pick?
  58. 58. NoSQL Options: Key-Value Stores
      • This technology you know and love and use all the time
        – Hashmap for example
      • Put(key, value)
      • value = Get(key)
      • Examples
        – Redis (my favorite!!) – in memory store
        – Memcached
        – and 100s more
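The Put/Get interface on this slide can be sketched with a plain hash map. The class below is a toy in-memory illustration only; `KeyValueStore` and its method names are hypothetical and are not the API of Redis, Memcached, or any other product named on the slide.

```python
from typing import Any, Dict, Optional


class KeyValueStore:
    """Toy in-memory key-value store backed by a Python dict (a hash map)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        """Put(key, value): store or overwrite the value for a key."""
        self._data[key] = value

    def get(self, key: str, default: Optional[Any] = None) -> Any:
        """value = Get(key): return the stored value, or default if absent."""
        return self._data.get(key, default)


if __name__ == "__main__":
    store = KeyValueStore()
    store.put("session:42", {"user": "vladimir", "cart": ["book"]})
    print(store.get("session:42"))
    print(store.get("session:missing"))
```

The store treats each value as an opaque blob: it can fetch by key, but it cannot search inside the value.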
  59. 59. Column Stores
      • Not to be confused with the relational-db version of this
        – Sybase-IQ etc.
      • Multi-dimensional map
      • Not all entries are relevant each time
        – Column families
      • Examples
        – Cassandra
        – HBase
        – Amazon SimpleDB
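The "multi-dimensional map" idea can be pictured as nested dictionaries: row key, then column family, then column name, then value. The sketch below is a simplified illustration under that assumption; `ColumnFamilyTable` is a hypothetical name and the code does not reflect how Cassandra or HBase store data internally.

```python
from collections import defaultdict
from typing import Any, Dict


class ColumnFamilyTable:
    """Toy sparse table: row key -> column family -> column -> value."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, Dict[str, Any]]] = defaultdict(
            lambda: defaultdict(dict)
        )

    def put(self, row_key: str, family: str, column: str, value: Any) -> None:
        """Set one cell; rows do not need to share the same columns."""
        self._rows[row_key][family][column] = value

    def get_family(self, row_key: str, family: str) -> Dict[str, Any]:
        """Read only the column family of interest for one row."""
        if row_key not in self._rows:
            return {}
        return dict(self._rows[row_key].get(family, {}))


if __name__ == "__main__":
    table = ColumnFamilyTable()
    table.put("user1", "profile", "name", "Ann")
    table.put("user1", "activity", "last_login", "2013-02-16")
    table.put("user2", "profile", "name", "Bob")
    print(table.get_family("user1", "profile"))
    print(table.get_family("user2", "activity"))
```

Grouping columns into families is what lets a read skip entries that are not relevant to the current query.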
  60. 60. Document Stores
      • Key-document stores
        – However, the document can be seen as a value, so you can consider this a super-set of key-value
      • The big difference is that in document stores one can also query on the document, i.e. the document portion is structured (not just a blob of data)
      • Examples
        – MongoDB
        – CouchDB
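To make the difference from a key-value store concrete, the sketch below keeps documents as dictionaries and adds a query on their fields. It is a toy illustration; `DocumentStore`, `insert` and `find` are hypothetical names here and are not the MongoDB or CouchDB API.

```python
from typing import Any, Dict, List, Optional


class DocumentStore:
    """Toy key-document store whose documents can be queried by field."""

    def __init__(self) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}

    def insert(self, doc_id: str, document: Dict[str, Any]) -> None:
        """Store a structured document under a key."""
        self._docs[doc_id] = document

    def get(self, doc_id: str) -> Optional[Dict[str, Any]]:
        """Key lookup, exactly as in a key-value store."""
        return self._docs.get(doc_id)

    def find(self, **criteria: Any) -> List[Dict[str, Any]]:
        """Return documents whose fields equal all the given criteria."""
        return [
            doc
            for doc in self._docs.values()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]


if __name__ == "__main__":
    store = DocumentStore()
    store.insert("1", {"name": "Ann", "city": "Moscow"})
    store.insert("2", {"name": "Bob", "city": "Kazan", "age": 30})
    print(store.find(city="Moscow"))
```

Note that the two documents have different fields, which is the "no pre-defined schema" property in practice.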
  61. 61. Graph Stores
      • Use a graph structure
        – Labeled, directed, attributed multi-graph
          • Label for each edge
          • Directed edges
          • Multiple attributes per node
          • Multiple edges between nodes
        – Relational DBs can model graphs, but an edge requires a join, which is expensive
      • Example: Neo4j
        – http://www.infoq.com/articles/graph-nosql-neo4j
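The four properties listed on this slide (edge labels, direction, node attributes, multiple edges between the same nodes) can be shown with a very small in-memory structure. This is a sketch for illustration only; `PropertyGraph` and `Edge` are hypothetical names and the code is unrelated to how Neo4j is implemented or queried.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Edge:
    """A directed edge from source to target carrying a label."""
    source: str
    target: str
    label: str


class PropertyGraph:
    """Toy labeled, directed, attributed multi-graph."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Dict[str, Any]] = {}  # node id -> attributes
        self.edges: List[Edge] = []                 # a list allows parallel edges

    def add_node(self, node_id: str, **attributes: Any) -> None:
        """Add a node with any number of attributes."""
        self.nodes[node_id] = attributes

    def add_edge(self, source: str, target: str, label: str) -> None:
        """Add a directed, labeled edge; duplicates between nodes are allowed."""
        self.edges.append(Edge(source, target, label))

    def neighbors(self, node_id: str, label: Optional[str] = None) -> List[str]:
        """Follow outgoing edges from a node, optionally filtered by label."""
        return [
            e.target
            for e in self.edges
            if e.source == node_id and (label is None or e.label == label)
        ]


if __name__ == "__main__":
    g = PropertyGraph()
    g.add_node("ann", age=31, city="Moscow")
    g.add_node("bob", age=28)
    g.add_edge("ann", "bob", "FOLLOWS")
    g.add_edge("ann", "bob", "WORKS_WITH")
    print(g.neighbors("ann"))
    print(g.neighbors("ann", label="FOLLOWS"))
```

Traversal here is a walk over stored edges from the starting node, which is the operation a relational model would express as a join. A real graph store would also index edges by source node instead of scanning the full list.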