Vladimir_Suvorov_Big_data


Meeting #1. Game|Changers. Data Mining Track.


Published in: Education
Transcript

  • 1. Big Data Concepts & Practice. Vladimir Suvorov, vladimir.suvorov@emc.com, EMC & DataScienceSquad.com. Non-commercial education only. Corresponding information belongs to its respective owners. These include EMC, IBM, Microsoft, Oracle, Gartner, etc.
  • 2. About myself
  • 3. Why Big Data: How We Got Here. February 16, 2013. © 2012 IBM Corporation
  • 4. In 2005 there were 1.3 billion RFID tags in circulation… by the end of 2011, this was about 30 billion and growing even faster.
  • 5. An increasingly sensor-enabled and instrumented business environment generates HUGE volumes of data with MACHINE SPEED characteristics… 1 BILLION lines of code. EACH engine generating 10 TB every 30 minutes!
  • 6. 350B transactions/year. Meter reads every 15 min. 120M meter reads/month. 3.65B meter reads/day.
  • 7. In August of 2010, Adam Savage, of “MythBusters,” took a photo of his vehicle using his smartphone. He then posted the photo to his Twitter account, including the phrase “Off to work.” Since the photo was taken by his smartphone, the image contained metadata revealing the exact geographical location where the photo was taken. By simply taking and posting a photo, Savage revealed the exact location of his home, the vehicle he drives, and the time he leaves for work.
  • 8. The Social Layer in an Instrumented Interconnected World
      • 30 billion RFID tags today (1.3B in 2005)
      • 4.6 billion camera phones worldwide
      • 12+ TBs of tweet data every day
      • 100s of millions of GPS-enabled devices sold annually
      • ? TBs of data every day
      • 25+ TBs of log data every day
      • 2+ billion people on the Web by end 2011
      • 76 million smart meters in 2009… 200M by 2014
  • 9. Twitter Tweets per Second Record Breakers of 2011
  • 10. Extract Intent, Life Events, Micro Segmentation Attributes. [Diagram: sample social media posts (author names partly truncated in the transcript) tagged with the labels: Name, Birthday, Family; Jobs; Location; Relocation; Wishful Thinking; Monetizable Intent; Not Relevant - Noise; SPAMbots]
  • 11. Big Data Includes Any of the Following Characteristics. Extracting insight from an immense volume, variety and velocity of data, in context, beyond what was previously possible.
      • Variety: Manage the complexity of data in many different structures, ranging from relational, to logs, to raw text
      • Velocity: Streaming data and large volume data movement
      • Volume: Scale from Terabytes to Petabytes (1K TBs) to Zettabytes (1B TBs)
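The volume scale quoted on this slide can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming decimal (SI) units; the helper name `in_terabytes` is illustrative and not from the slides.

```python
# Decimal (SI) storage units, expressed in bytes.
# Assumption: the slide uses decimal prefixes, not binary ones (TiB, PiB).
TB = 10**12  # terabyte
PB = 10**15  # petabyte
ZB = 10**21  # zettabyte


def in_terabytes(n_bytes: int) -> int:
    """Convert a byte count to whole terabytes."""
    return n_bytes // TB


print(in_terabytes(PB))  # 1000, the slide's "1K TBs"
print(in_terabytes(ZB))  # 1000000000, the slide's "1B TBs"
```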
  • 12. Bigger and Bigger Volumes of Data
      • Retailers collect click-stream data from Web site interactions and loyalty card data
          – This traditional POS information is used by retailers for shopping basket analysis, inventory replenishment, +++
          – But data is being provided to suppliers for customer buying analysis
      • Healthcare has traditionally been dominated by paper-based systems, but this information is getting digitized
      • Science is increasingly dominated by big science initiatives
          – Large-scale experiments generate over 15 PB of data a year and can’t be stored within the data center; sent to laboratories
      • Financial services are seeing larger and larger volumes through smaller trading sizes, increased market volatility, and technological improvements in automated and algorithmic trading
      • Improved instrument and sensory technology
          – Large Synoptic Survey Telescope’s GPixel camera generates 6PB+ of image data per year, or consider the Oil and Gas industry
  • 13. The Big Data Conundrum
      • The percentage of available data an enterprise can analyze is decreasing proportionately to the data available to it
      • Quite simply, this means as enterprises, we are getting “more naive” about our business over time
      • We don’t know what we could already know…
      [Chart: Data AVAILABLE to an organization vs. data an organization can PROCESS]
  • 14. Why Not All of Big Data Before: Didn’t have the Tools?
  • 15. Applications for Big Data Analytics: Smarter Healthcare; Multi-channel sales; Finance; Log Analysis; Homeland Security; Traffic Control; Telecom; Search Quality; Manufacturing; Trading Analytics; Fraud and Risk; Retail: Churn, NBO
  • 16. Most Requested Uses of Big Data
      • Log Analytics & Storage
      • Smart Grid / Smarter Utilities
      • RFID Tracking & Analytics
      • Fraud / Risk Management & Modeling
      • 360° View of the Customer
      • Warehouse Extension
      • Email / Call Center Transcript Analysis
      • Call Detail Record Analysis
  • 17. What companies & analysts think of Big Data
  • 18. Gartner & McKinsey
  • 19. Hype Cycle of Big Data
  • 20. Priority matrix
  • 21. Key vision
      • Predictive modeling is gaining momentum with property and casualty (P&C) companies, which are using it to support claims analysis, CRM, risk management, pricing and actuarial workflows, quoting, and underwriting.
      • Social content is the fastest growing category of new content in the enterprise and will eventually attain 20% market penetration.
      • Gartner reports that 45% of sales management teams identify sales analytics as a priority to help them understand sales performance, market conditions and opportunities.
      • Over 80% of Web Analytics solutions are delivered via Software-as-a-Service (SaaS).
  • 22. Big Data deliverables by McKinsey
  • 23.
  • 24. Intel
  • 25. Intel Big Data Cluster. Example applications (Application: Big Data; Algorithms; Compute Style):
      – Scientific study (e.g. earthquake study): Ground model; Earthquake simulation, thermal conduction, …; HPC
      – Internet library search: Historic web snapshots; Data mining; MapReduce
      – Virtual world analysis: Virtual world database; Data mining; TBD
      – Language translation: Text corpuses, audio archives, …; Speech recognition, machine translation, text-to-speech, …; MapReduce & HPC
      – Video search: Video data; Object/gesture identification, face recognition, …; MapReduce
      There has been more video uploaded to YouTube in the last 2 months than if ABC, NBC, and CBS had been airing content 24/7/365 continuously since 1948. - Gartner
  • 26. Example Motivating Application: Online Processing of Archival Video
      • Research project: Develop a context recognition system that is 90% accurate over 90% of your day
      • Leverage a combination of low- and high-rate sensing for perception
      • Federate many sensors for improved perception
      • Big Data: Terabytes of archived video from many egocentric cameras
      • Example query 1: “Where did I leave my briefcase?”
          – Sequential search through all video streams [Parallel Camera]
      • Example query 2: “Now that I’ve found my briefcase, track it”
          – Cross-cutting search among related video streams [Parallel Time]
      [Diagram: Big Data Cluster]
  • 27. Oracle
  • 28. Big Data Use Cases (Sector: Today’s Challenge; New Data; What’s Possible)
      – Healthcare: Expensive office visits; Remote patient monitoring; Preventive care, reduced hospitalization
      – Manufacturing: In-person support; Product sensors; Automated diagnosis, support
      – Location-Based Services: Based on home zip code; Real time location data; Geo-advertising, traffic, local search
      – Public Sector: Standardized services; Citizen surveys; Tailored services, cost reductions
      – Retail: One size fits all marketing; Social media; Sentiment analysis, segmentation
  • 29. What’s in Big Data for Public Sector
      • Operational efficiency and productivity
      • Fraud detection and prevention
      • Close tax gaps
      • Value for money for citizens
      • Prevent crime waves
      • Customize actions based on population segments
      • Public utilities to reduce consumption
      • Produce safety from farm to fork
  • 30. Microsoft
  • 31. New opportunities
      • Massive Volumes: Increases ad revenue by processing 3.5 billion events per day. Processes 464 billion rows per quarter, with average query time under 10 secs.
      • Cloud Connectivity: Measures and ranks online user influence by processing 3 billion signals per day. Connects across 15 social networks via the cloud for data and API access.
      • Real-Time Insight: Improving investigation time by analyzing large volume & variety of data. Cut investigation time from 2 years to 15 days.
  • 32. Microsoft’s Approach to Big Data
  • 33. A Holistic Big Data Solution from Microsoft
  • 34. Data Scientist Job
  • 35. Sexy Job of Data Scientist. Tom Davenport, who is teaching an executive program in Big Data and analytics at Harvard University, said some data scientists are earning annual salaries as high as $300,000, which is “pretty good for somebody that doesn’t have anyone else working for them.” Davenport also said such workers are motivated by the problems and opportunities data provides.
  • 36. What EMC Thinks of Data Scientists
  • 37. Job evolution
  • 38. What Forbes Thinks of Data Scientists
  • 39. Data Science Courses
  • 40. Course Modules and Navigation Icons. Data Science and Big Data Analytics:
      1. Introduction to Big Data Analytics
      2. Data Analytics Lifecycle + Lab
      3. Review of Basic Data Analytics Methods Using R + Labs
      4. Advanced Analytics - Theory & Methods + Labs
      5. Advanced Analytics - Technology & Tools + Labs
      6. The Endgame, or Putting it All Together + Final Lab
  • 41. Topics: Data Science and Big Data Analytics Course
      • Introduction to Big Data Analytics
          – Big Data Overview
          – State of the Practice in Analytics
          – The Data Scientist
          – Big Data Analytics in Industry Verticals
      • Data Analytics Lifecycle
      • Review of Basic Data Analytic Methods Using R
          – Using R to Look at Data - Introduction to R
          – Analyzing and Exploring the Data
          – Statistics for Model Building and Evaluation
      • Advanced Analytics – Theory and Methods
          – K-means Clustering
          – Association Rules
          – Linear Regression
          – Logistic Regression
          – Naive Bayesian Classifier
          – Decision Trees
          – Time Series Analysis
          – Text Analysis
      • Advanced Analytics - Technology and Tools
          – Analytics for Unstructured Data (MapReduce and Hadoop)
          – The Hadoop Ecosystem
          – In-database Analytics – SQL Essentials
          – Advanced SQL and MADlib for In-database Analytics
      • The Endgame, or Putting it All Together + Final Lab on Big Data Analytics
          – Operationalizing an Analytics Project
          – Creating the Final Deliverables
          – Data Visualization Techniques
          – + Final Lab – Application of the Data Analytics Lifecycle to a Big Data Analytics Challenge
  • 42. Hadoop
  • 43. Top companies need Hadoop
  • 44. What is Hadoop and Where did it start?
      • Created by Doug Cutting, formerly of Yahoo!, now Cloudera
          – HDFS (storage) & MapReduce (compute)
          – Inspired by Google’s MapReduce and Google File System (GFS) papers
      • Much of the initial work on Hadoop was done by Yahoo
      • It is now a top-level Apache project backed by a large open source development community
  • 45. What is Hadoop? Two Core Components
      – HDFS: Storage in the Hadoop Distributed File System
      – MapReduce: Compute via the MapReduce distributed processing platform
      • Storage & Compute in 1 Framework
      • Open Source Project of the Apache Software Foundation
      • Written in Java
  • 46. Hadoop cluster architecture
  • 47. MapReduce example
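The figure for this slide did not survive in the transcript. As a stand-in, here is a minimal single-process sketch of the MapReduce pattern, using word count as the example. It is plain Python and does not use Hadoop's API; the names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop"]))
```

In a real cluster the map and reduce calls would run in parallel on different nodes and the shuffle would move data over the network; here all three steps run in one process to keep the data flow visible.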
  • 48. Hadoop versions
  • 49. Hadoop Wave Report. “EMC Greenplum is the first mover in Hadoop appliances. EMC Greenplum is the first EDW vendor to provide a full-featured enterprise-grade Hadoop appliance and roll out an appliance family that integrates its Hadoop, EDW, and data integration in a single rack. It provides its own open source Hadoop distribution software, integrates EMC’s strong storage product portfolio in its appliances, and has an extensive professional services force of EMC technical consultants and data scientists with Hadoop expertise.”
  • 50. Hadoop Players Today
  • 51. Get Started With Hadoop Today. Data Scientists & Hadoop Architecture teams deliver customer success
      • Hadoop Architecture Services
          – POC planning and deployment
          – Installation and best practices
          – Educate the team
      • Greenplum Analytics Labs
          – Leverage the expertise of Greenplum’s Data Scientists
          – Packaged solutions that produce business value and actionable results
          – Accelerate Hadoop capabilities on your data with your analysts
      • Establish a strategic vision
          – Roadmap for Hadoop and unified analytics
  • 52. The Greenplum Unified Analytics Platform
  • 53. NoSQL
  • 54. Definition from nosql-database.org
      • Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable. The original intention has been modern web-scale databases. The movement began early 2009 and is growing rapidly. Often more characteristics apply, such as: schema-free, easy replication support, simple API, eventually consistent / BASE (not ACID), a huge amount of data, and more. So the misleading term "nosql" (the community now translates it mostly with "not only sql") should be seen as an alias to something like the definition above.
  • 55. NoSQL (http://nosql-database.org/)
      • Non relational
      • Scalability
          – Vertically: add more data
          – Horizontally: add more storage
      • Collection of structures
          – Hashtables, maps, dictionaries
      • No pre-defined schema
      • No join operations
      • CAP not ACID
          – CAP: Consistency, Availability and Partition tolerance (but not all three at once!)
          – ACID: Atomicity, Consistency, Isolation and Durability
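Horizontal scaling is commonly achieved by partitioning keys across several nodes. The following is a minimal sketch, assuming simple hash-modulo placement with each "node" modeled as an in-memory dictionary; the class name `ShardedStore` is hypothetical. Production systems typically use consistent hashing or a similar scheme so that adding a node relocates fewer keys.

```python
import hashlib
from typing import Any, Dict, List, Optional


class ShardedStore:
    """Toy key-value store that spreads keys over several shards."""

    def __init__(self, num_shards: int) -> None:
        # Each dict stands in for one storage node (an assumption of this sketch).
        self.shards: List[Dict[str, Any]] = [{} for _ in range(num_shards)]

    def _shard_for(self, key: str) -> int:
        # A stable hash, so the same key always lands on the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key: str, value: Any) -> None:
        self.shards[self._shard_for(key)][key] = value

    def get(self, key: str) -> Optional[Any]:
        return self.shards[self._shard_for(key)].get(key)


if __name__ == "__main__":
    store = ShardedStore(num_shards=4)
    for i in range(100):
        store.put(f"user:{i}", i)
    # Show how many keys each shard ended up holding.
    print([len(shard) for shard in store.shards])
```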
  • 56. Advantages of NoSQL
      • Cheap, easy to implement
      • Data are replicated and can be partitioned
      • Easy to distribute
      • Don’t require a schema
      • Can scale up and down
      • Quickly process large amounts of data
      • Relax the data consistency requirement (CAP)
      • Can handle web-scale data, whereas Relational DBs cannot
  • 57. Disadvantages of NoSQL
      • New and sometimes buggy
      • Data is generally duplicated, with potential for inconsistency
      • No standardized schema
      • No standard format for queries
      • No standard language
      • Difficult to impose complicated structures
      • Depend on the application layer to enforce data integrity
      • No guarantee of support
      • Too many options: which one, or ones, to pick?
  • 58. NoSQL Options: Key-Value Stores
      • This technology you know and love and use all the time
          – Hashmap for example
      • Put(key,value)
      • value = Get(key)
      • Examples
          – Redis (my favorite!!) – in memory store
          – Memcached
          – and 100s more
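The Put/Get interface on this slide maps directly onto a hash map. The following is a minimal in-memory sketch; the class name `KeyValueStore` is hypothetical, and the code illustrates only the interface, not how Redis or Memcached are implemented.

```python
from typing import Any, Dict, Optional


class KeyValueStore:
    """Toy in-memory key-value store backed by a Python dict."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        """Store value under key, overwriting any previous value."""
        self._data[key] = value

    def get(self, key: str, default: Optional[Any] = None) -> Any:
        """Return the value for key, or default if the key is absent."""
        return self._data.get(key, default)


if __name__ == "__main__":
    store = KeyValueStore()
    store.put("session:42", {"user": "tina", "cart": ["book"]})
    print(store.get("session:42"))
    print(store.get("session:99"))  # absent key, prints None
```

The store treats the value as an opaque blob: it can be fetched by key but not searched by its contents.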
  • 59. Column Stores
      • Not to be confused with the relational-db version of this
          – Sybase-IQ etc.
      • Multi-dimensional map
      • Not all entries are relevant each time
          – Column families
      • Examples
          – Cassandra
          – HBase
          – Amazon SimpleDB
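The "multi-dimensional map" idea can be sketched as nested dictionaries: row key, then column family, then column. This is a minimal illustration under that assumption; the class name `ColumnFamilyStore` is hypothetical and the code does not reflect the actual storage layout of Cassandra or HBase.

```python
from collections import defaultdict
from typing import Any, Dict, Optional


class ColumnFamilyStore:
    """Toy sparse map: row key -> column family -> column -> value."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, Dict[str, Any]]] = defaultdict(
            lambda: defaultdict(dict)
        )

    def put(self, row_key: str, family: str, column: str, value: Any) -> None:
        self._rows[row_key][family][column] = value

    def get(self, row_key: str, family: str, column: str,
            default: Optional[Any] = None) -> Any:
        # .get() is used at every level so that reads never create empty entries.
        return self._rows.get(row_key, {}).get(family, {}).get(column, default)

    def get_family(self, row_key: str, family: str) -> Dict[str, Any]:
        """Return only the columns of one family for a row."""
        return dict(self._rows.get(row_key, {}).get(family, {}))


if __name__ == "__main__":
    store = ColumnFamilyStore()
    store.put("row1", "profile", "name", "Tina")
    store.put("row1", "activity", "last_login", "2013-02-16")
    # Reading one family touches only the entries relevant to this query.
    print(store.get_family("row1", "profile"))
```

Rows need not share the same columns, which matches the slide's point that not all entries are relevant each time.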
  • 60. Document Stores
      • Key-document stores
          – However, the document can be seen as a value, so you can consider this a superset of key-value
      • The big difference is that in document stores one can also query on the document, i.e. the document portion is structured (not just a blob of data)
      • Examples
          – MongoDB
          – CouchDB
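The difference from a plain key-value store is that the stored value has structure that queries can inspect. The following is a minimal sketch, assuming documents are flat Python dictionaries and queries are exact field matches; the class name `DocumentStore` and its `find` method are hypothetical and are not the MongoDB or CouchDB API.

```python
from typing import Any, Dict, List, Optional


class DocumentStore:
    """Toy key-document store that can also query on document fields."""

    def __init__(self) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}

    def put(self, doc_id: str, document: Dict[str, Any]) -> None:
        # Copy the document so later changes by the caller do not leak in.
        self._docs[doc_id] = dict(document)

    def get(self, doc_id: str) -> Optional[Dict[str, Any]]:
        """Key lookup, exactly as in a key-value store."""
        return self._docs.get(doc_id)

    def find(self, **criteria: Any) -> List[Dict[str, Any]]:
        """Return every document whose fields equal all given criteria."""
        return [
            doc for doc in self._docs.values()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]


if __name__ == "__main__":
    store = DocumentStore()
    store.put("u1", {"name": "Pauline", "city": "Moscow"})
    store.put("u2", {"name": "Tom", "city": "Kyiv"})
    print(store.find(city="Moscow"))
```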
  • 61. Graph Stores
      • Use a graph structure
          – Labeled, directed, attributed multi-graph
              • Label for each edge
              • Directed edges
              • Multiple attributes per node
              • Multiple edges between nodes
          – Relational DBs can model graphs, but an edge requires a join, which is expensive
      • Example: Neo4j
          – http://www.infoq.com/articles/graph-nosql-neo4j
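The labeled, directed, attributed multi-graph described on this slide can be sketched with adjacency lists. This is a minimal in-memory illustration; the class name `PropertyGraph` and its methods are hypothetical and unrelated to Neo4j's actual API.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple


class PropertyGraph:
    """Toy labeled, directed, attributed multi-graph."""

    def __init__(self) -> None:
        # node id -> attribute dict (multiple attributes per node)
        self.nodes: Dict[str, Dict[str, Any]] = {}
        # source id -> list of (label, target id); a list allows several
        # edges between the same pair of nodes
        self.edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add_node(self, node_id: str, **attrs: Any) -> None:
        self.nodes[node_id] = attrs

    def add_edge(self, source: str, label: str, target: str) -> None:
        """Add a directed, labeled edge from source to target."""
        self.edges[source].append((label, target))

    def neighbors(self, source: str, label: Optional[str] = None) -> List[str]:
        """Follow outgoing edges, optionally restricted to one label."""
        return [
            target for edge_label, target in self.edges.get(source, [])
            if label is None or edge_label == label
        ]


if __name__ == "__main__":
    g = PropertyGraph()
    g.add_node("tina", city="Moscow", age=30)
    g.add_node("tom", city="Kyiv")
    g.add_edge("tina", "FOLLOWS", "tom")
    g.add_edge("tina", "WORKS_WITH", "tom")  # second edge between the same nodes
    print(g.neighbors("tina", label="FOLLOWS"))
```

Traversal here is a direct lookup in the source node's adjacency list, which is the contrast the slide draws with a relational model where each hop needs a join.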
  • 62.
