
Introduction to Big Data

Introduces a sample of open-source technologies available today for solving big data problems.

Published in: Data & Analytics

Introduction to Big Data

  1. Introduction to Big Data. Mohammed Guller, Oct 02, 2016. © COPYRIGHT 2016 GLASSBEAM INC. CONFIDENTIAL. DO NOT DISTRIBUTE
  2. Agenda • Big Data • Big Data Technologies • Kafka • Hadoop • Spark
  3. About Me • Engineering Manager / Principal Architect at Glassbeam • Founded two startups • Passionate about building products, big data analytics, and machine learning • www.linkedin.com/in/mohammedguller • @MohammedGuller
  4. Big Data Analytics with Spark (available on Amazon) • Hands-on guide with lots of examples • Covers both fundamental and advanced topics such as machine learning • Includes a primer on functional programming and Scala • Introduces other important Big Data technologies such as HDFS, Parquet, Kafka, HBase, Cassandra, Mesos, and YARN
  5. About Glassbeam • Glassbeam brings structure and meaning to data from any connected machine or device while providing actionable intelligence • Cloud-based analytics platform that helps organizations turn raw machine data into insights • Making sense of multi-structured machine data: data center devices, medical devices, sensors, ATMs, automobiles, data from any machine • Providing a comprehensive set of apps and tools for machine data analysis: 50,000+ systems being tracked today, 1,500+ different software rev codes, 1.2 billion sensor readings per day, 1+ trillion sensor readings tracked
  7. Data Growing At a Faster Pace Than Ever
  8. Internet of Things (IoT) • Network of objects embedded with software for collecting and sending data over the Internet • 5x more connected things than people by 2020
  9. Industrial IoT • Manufacturing • Automotive • Medical • Data Center • EVC • Smart Meter • Glassbeam's target market is focused on driving operational and business analytics value for connected-product companies in the Industrial IoT market: IT & Networks, Medical & Health Care, Transportation, EV Chargers & Smart Grid, Industrial & Mfg
  10. Key Attributes of Big Data • Volume: scale of data • Variety: diversity of data • Velocity: speed of data
  11. Big Data Comes with Big Challenges • Storage • Processing • Value
  12. Storage Challenges • Legacy SAN / NAS storage devices are expensive • Traditional RDBMSs were not designed for Big Data • They cannot handle the volume, velocity, and variety of Big Data
  13. Processing Challenges • Diverse processing: organizations want to do more than just BI / traditional analytics and go beyond SQL queries • Timeliness: process data in a reasonable amount of time, because the value of data decreases over time
  14. How Much Data Can a Standard Server Process?
  16. Scale-up with Powerful High-end Server • Large number of CPUs / cores • Faster cores • Large amount of memory • Faster memory bus • High-performance architecture
  17. Disadvantages of Scale-up Architecture • Proprietary • Expensive • Limited scalability
  18. Scale-out Architecture • Cluster of servers • Commodity machines • Pool together resources: CPU, memory, disk
  19. Benefits of Scale-out Architecture • Relatively inexpensive • Economical to scale • No huge upfront investment • Start small and expand the cluster as the workload increases
  20. Challenges With Scale-out Architecture • Writing distributed applications is very hard: split a job into chunks that can be distributed across a cluster, schedule compute resources among different jobs, manage inter-node communication, handle network and node failures • Hardware failures are more common at a cluster level: the probability of a single node failing is low, but the probability of at least one node in a large cluster failing is high
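The last point on slide 20 can be checked with a little arithmetic. The sketch below assumes node failures are independent and uses a made-up per-node failure probability of 0.1% over some period; both are illustrative assumptions, not figures from the slides. Under those assumptions the chance that at least one of n nodes fails is 1 - (1 - p)^n.

```python
def prob_any_failure(p_node: float, n_nodes: int) -> float:
    """Probability that at least one of n independent nodes fails."""
    return 1.0 - (1.0 - p_node) ** n_nodes


# Hypothetical per-node failure probability for one period (an assumption).
P_NODE = 0.001

for n in (1, 10, 100, 1000, 10000):
    print(f"{n:>6} nodes -> P(at least one failure) = {prob_any_failure(P_NODE, n):.4f}")
```

With one node the result equals the per-node probability; as the cluster grows it approaches 1, which is why a distributed application has to treat failure as a normal event.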
  21. Getting Value Out of Big Data • Traditional analytics / BI • Custom processing • Machine learning: predictive analytics, automate complex tasks • Stream processing: analyze in real time / near real time, react in real time
  22. Traditional Analytics / BI • What: customer growth for the last month/quarter/year; segmentation of customers by demographics; average time spent by mobile app users • Why: sales growth slowed (regional issue, supply issue); profit dropped (revenue dropped, expenses increased)
  23. Custom Processing • Index web pages (Google, Bing) • Process genome data: identify mutations linked to cancer, Alzheimer's, and other diseases • Click analysis • Log analysis • 360-degree real-time view of a customer
  24. Predictive Analytics • Advertisements that a visitor will most likely click • Movies / songs / news that a customer will like • Products that a customer will buy • Whether a patient will have a heart attack
  25. Automate Complex Tasks • Virtual assistant (Siri, Google Now) • Autonomous machine (self-driving car, robots) • Tag images (Facebook, Flickr) • Expert system (medical diagnosis, personalized medicine) • Security (fraud detection, network security) • Music recognition (Shazam, SoundHound)
  29. File Formats • Text: CSV, JSON, XML • Binary: Sequence File, Avro, Parquet, Optimized Row Columnar (ORC)
  30. Distributed SQL Query Engine • Hive • Spark SQL • Impala • Presto • Drill • Phoenix • HAWQ • Tajo • Diagram labels: Data Warehouse, Distributed Storage, Distributed Query Engine
  34. Publish-Subscribe / Messaging Systems • Kafka • RabbitMQ • ActiveMQ • ZeroMQ
  35. Big Data Computing Frameworks • Batch: Hadoop MapReduce, HPCC • Stream: Kafka Streams, Heron, Storm, Samza • Batch and stream: Spark, Flink, Beam, Apex, Ignite
  37. Kafka • Distributed publish-subscribe messaging system • Partitioned and replicated commit log service for building distributed datastores
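To make "partitioned commit log" concrete, here is a toy in-memory model. It is not the Kafka client API: the class `ToyLog`, its methods, and the CRC32-based key hashing are all hypothetical names and choices made for illustration, and replication is left out entirely. It only shows the two ideas the slide names: records are appended to one of several ordered partitions, and reading a partition from an offset does not consume the records.

```python
import zlib


class ToyLog:
    """Toy stand-in for a topic: a fixed number of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # The same key always hashes to the same partition, so records
        # for one key keep their relative order.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def read(self, partition, offset):
        # Reading never deletes anything; each subscriber remembers
        # its own offset and can re-read from any earlier position.
        return list(self.partitions[partition][offset:])


log = ToyLog(num_partitions=3)
p1, o1 = log.publish("sensor-1", "temp=20")
p2, o2 = log.publish("sensor-1", "temp=21")

first_reader = log.read(p1, 0)   # one subscriber starts at offset 0
second_reader = log.read(p1, 0)  # another subscriber sees the same records
```

Because readers only hold an offset, any number of subscribers can read the same partition independently, which is what separates a log from a queue that hands each message to a single consumer.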
  42. Hadoop
  48. Hadoop is Not a Single Product
  49. Hadoop Core Components
  53. Adoption of Spark is Growing Rapidly
  54. Spark • Fast, easy-to-use, general-purpose cluster computing framework for processing large datasets using a simpler programming model
  55. Benefits • Scale • Fault-tolerance • Abstracts distributed computing: hides the messy details of writing distributed applications, allows developers to focus on the data processing logic, same code works on a laptop or a cluster of servers • Ease-of-use • Speed • Flexibility
  56. Easy To Use • Library with an expressive API: Scala, Java, Python, R; RDD API with 80+ operators (MapReduce has only two); Dataset/DataFrame API • Interactive development: spark-shell, notebooks
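The remark that MapReduce has only two operators can be illustrated with a word count expressed purely as a map step and a reduce step. This is a plain-Python sketch of the programming model, not the Hadoop API; the function names `map_phase`, `reduce_phase`, and `word_count` are hypothetical, and the grouping dictionary stands in for the shuffle that a real framework performs across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_phase(word, counts):
    """Reduce: combine all values emitted for one key."""
    return word, sum(counts)


def word_count(lines):
    groups = defaultdict(list)
    for line in lines:
        for word, one in map_phase(line):
            groups[word].append(one)  # stand-in for the shuffle step
    return dict(reduce_phase(word, counts) for word, counts in groups.items())


counts = word_count(["big data", "Big Spark"])
```

Anything more involved than this, such as a join or a multi-step pipeline, has to be rewritten as a chain of map and reduce jobs, which is the gap a richer operator set is meant to close.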
  57. Integrated Libraries For a Variety of Data Processing Tasks • Batch processing • Interactive analytics • Stream analysis • Machine learning • Graph analytics • Components: Spark Core, Spark SQL, GraphX, Spark Streaming, MLlib
  58. Benefits of a Unified Platform • Solve a variety of problems with a single toolkit • No need to learn different tools for each use case • Avoid code and data duplication • Achieve operational simplicity
  59. Why is Spark Fast? • Advanced job execution engine • Allows applications to cache data in memory
  60. Advanced Job Execution Engine • Directed Acyclic Graph (DAG) of stages: a simple job can contain just one stage, a complex job can contain many stages; eliminates expensive operations between multiple jobs (synchronization, serialization/deserialization, disk I/O) • Lazy operator evaluation • Pipelined operations
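Lazy evaluation and pipelining are easier to see in a small example. The class below is a hypothetical pure-Python model, not Spark's API or its actual engine: `map` and `filter` only record what should be done, and `collect` then pushes each element through every recorded operator in a single pass, with no intermediate collection between steps.

```python
class LazyPipeline:
    """Records operators and runs them only when collect() is called."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    def map(self, fn):
        # Nothing is computed here; the operator is only remembered.
        return LazyPipeline(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyPipeline(self._source, self._ops + (("filter", pred),))

    def collect(self):
        out = []
        for item in self._source:
            keep = True
            # Pipelining: one element flows through all operators
            # before the next element is read.
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []


def traced_square(x):
    calls.append(x)  # record every invocation to observe laziness
    return x * x


pipeline = LazyPipeline(range(5)).map(traced_square).filter(lambda v: v % 2 == 0)
calls_before_collect = list(calls)  # still empty: no work has been done yet
result = pipeline.collect()
```

Deferring the work until `collect` is what lets an engine see the whole chain of operators at once and decide how to group them into stages.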
  61. Allows Applications to Cache Data in Memory • Minimize disk I/O • Reading data from memory is orders of magnitude faster than reading from disk • In-memory data sharing across DAGs: different jobs can work with the same cached data
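The idea of several jobs sharing one cached dataset can be sketched in a few lines. This is a single-process illustration with hypothetical names (`load_dataset`, `cached`, `stats`), not how Spark implements caching; the list comprehension stands in for a slow read from disk.

```python
stats = {"loads": 0}
_cache = {}


def load_dataset(name):
    """Stand-in for an expensive read from disk."""
    stats["loads"] += 1
    return [i for i in range(1, 1001)]


def cached(name):
    """Return the in-memory copy, loading it only on first use."""
    if name not in _cache:
        _cache[name] = load_dataset(name)
    return _cache[name]


# Two different "jobs" over the same data trigger only one load.
total = sum(cached("readings"))
maximum = max(cached("readings"))
```

The second job pays no load cost, which matters most for iterative algorithms that pass over the same data many times.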
  62. Why Caching Makes Applications Run Faster • Figure labels: 100 MB/s, 500 MB/s, 10 GB/s
  63. Read Latency Comparison • Chart: time in minutes (axis 0 to 200) to read 1 TB of data from HDD, SSD, and RAM
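The chart on slide 63 can be approximated from the three throughput figures on slide 62. The sketch assumes that 100 MB/s, 500 MB/s, and 10 GB/s correspond to HDD, SSD, and RAM in that order (the transcript does not say so explicitly) and that units are decimal, so 1 TB is 10^12 bytes.

```python
# Assumed mapping of slide 62's figures to the media named on slide 63.
THROUGHPUT_BYTES_PER_S = {
    "HDD": 100e6,  # 100 MB/s
    "SSD": 500e6,  # 500 MB/s
    "RAM": 10e9,   # 10 GB/s
}
DATA_BYTES = 1e12  # 1 TB, decimal units


def read_minutes(size_bytes, bytes_per_s):
    """Time to read size_bytes sequentially at a constant rate, in minutes."""
    return size_bytes / bytes_per_s / 60


minutes = {
    medium: read_minutes(DATA_BYTES, rate)
    for medium, rate in THROUGHPUT_BYTES_PER_S.items()
}

for medium, value in minutes.items():
    print(f"{medium}: {value:.1f} min to read 1 TB")
```

Under these assumptions the results are roughly 167 minutes for HDD, 33 minutes for SSD, and under 2 minutes for RAM, which fits the chart's 0 to 200 minute axis.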
  64. Spark Does Not Provide Storage • Works with a variety of data sources • No need to import data into Spark • Scale compute and storage clusters independently
  65. Process Data From a Variety Of Data Sources (and many more)
  66. Spark Does Not Replace Hadoop
  67. Hadoop is Optional
  68. Ideal Applications • Complex data processing: multi-step pipeline • Iterative algorithm: machine learning, graph analytics • Ad hoc analysis: interactive